Say we have two partitions, A and B. A is not receiving any events,
while B is seeing events and processing normally. Because A has no
events and there is no watermarking strategy, A holds the watermark
back, as described in the docs. Does this mean that lag on B will
build up, since the watermark cannot move forward until A starts
seeing input again? This is what I can gather to the best of my
understanding.

How does this explain the situations where:
1. A single partition (out of 30) had lag increasing.
2. Nine partitions (out of 30) had lag increasing.

It would be great if anyone could help clear this up for me.

On Mon, Nov 7, 2022 at 2:18 PM Salva Alcántara <salcantara...@gmail.com> wrote:
>
> Hi Samuel,
>
> I'm glad to hear that! Let us know how the problem is finally solved. 
> Personally I'd upgrade to 1.15.3.
>
> Salva
>
> On Mon, Nov 7, 2022 at 9:42 AM Samuel Chase <samebch...@gmail.com> wrote:
>>
>> Hi Salva,
>>
>> Thanks for the pointers. They were helpful in gaining a better
>> understanding of what happened.
>>
>> In both situations, the outages occurred at the time of lowest
>> traffic in the day. For business-logic reasons, we are using a
>> partition key which may not distribute events evenly across all
>> partitions. It seems conceivable to me that during periods of low
>> traffic, some partitions may not see any events for a while.
>>
>> Now, with no watermarking strategy, I believe we are running into the
>> problem described in
>> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/datastream/event-time/generating_watermarks/#dealing-with-idle-sources
>> where the watermark cannot move forward. The docs recommend a
>> watermark strategy with `withIdleness`, which marks inputs as idle so
>> that they do not hold the watermark back.
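>>
>> For reference, this is roughly what I have in mind (a sketch based on
>> the docs' example; `MyEvent` and the durations are placeholders, and
>> `kafkaSource` / `env` refer to our existing source and environment):
>>
>> import java.time.Duration;
>> import org.apache.flink.api.common.eventtime.WatermarkStrategy;
>>
>> // Use a real watermark generator instead of noWatermarks(), since
>> // idleness only matters when watermarks are actually generated.
>> WatermarkStrategy<MyEvent> strategy =
>>     WatermarkStrategy
>>         .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(20))
>>         .withIdleness(Duration.ofMinutes(1)); // mark a split idle after 1 min
>>
>> env.fromSource(kafkaSource, strategy, "Kafka Source");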
>>
>> However, https://issues.apache.org/jira/browse/FLINK-28975 suggests
>> there is an issue with `withIdleness`. Am I right in understanding
>> that we cannot currently use `withIdleness` safely in the watermark
>> strategy? Does this affect me? I am not doing any joins on the
>> streams; I just need messages to be sent to the sink without lag
>> building up.
>>
>> Plan of action:
>>
>> A. Use a watermark strategy with `withIdleness` on 1.15.2 (which is
>> affected by FLINK-28975).
>>
>> OR
>>
>> B. Upgrade to 1.15.3 (where FLINK-28975 is fixed) and use a watermark
>> strategy with `withIdleness`.
>>
>> On Mon, Nov 7, 2022 at 1:05 PM Salva Alcántara <salcantara...@gmail.com> 
>> wrote:
>> >
>> > Hi Samuel,
>> >
>> > Maybe related to this https://issues.apache.org/jira/browse/FLINK-28975? 
>> > See also:
>> > - 
>> > https://stackoverflow.com/questions/72654182/flink-interval-join-datastream-with-kafkasource-drops-all-records
>> >
>> > I left a similar comment in your SO post.
>> >
>> > Regards,
>> >
>> > Salva
>> >
>> > On Mon, Nov 7, 2022 at 7:27 AM Samuel Chase <samebch...@gmail.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> At work we are using Flink to store timers and notify us when they
>> >> are triggered. It has been working great across several versions over
>> >> the years: Flink 1.5 -> Flink 1.9 -> Flink 1.15.2.
>> >>
>> >> A few months ago we upgraded from Flink 1.9 to Flink 1.15.2. In the
>> >> process we had to upgrade all the Flink API code in our job to use the
>> >> new APIs.
>> >>
>> >> Our job code has a Kafka Source and a Kafka Sink. For our Source, we
>> >> are currently using `WatermarkStrategy.noWatermarks()`. It has been
>> >> running fine ever since we upgraded, but in the last few weeks we have
>> >> faced two outages.
>> >>
>> >> Configuration:
>> >>
>> >> 2 JobManager nodes
>> >> 5 TaskManager nodes (4 slots each)
>> >> Parallelism: 16
>> >> Source topic: 30 partitions
>> >> Using `setStartingOffsets(OffsetsInitializer.latest())` while
>> >> initializing the source, as sketched below.
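>> >>
>> >> Roughly, the source is wired up like this (simplified sketch;
>> >> broker, topic, and group names are placeholders):
>> >>
>> >> import org.apache.flink.api.common.eventtime.WatermarkStrategy;
>> >> import org.apache.flink.api.common.serialization.SimpleStringSchema;
>> >> import org.apache.flink.connector.kafka.source.KafkaSource;
>> >> import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
>> >>
>> >> KafkaSource<String> source =
>> >>     KafkaSource.<String>builder()
>> >>         .setBootstrapServers("broker:9092")  // placeholder
>> >>         .setTopics("timer-events")           // placeholder; 30 partitions
>> >>         .setGroupId("timer-job")             // placeholder
>> >>         .setStartingOffsets(OffsetsInitializer.latest())
>> >>         .setValueOnlyDeserializer(new SimpleStringSchema())
>> >>         .build();
>> >>
>> >> // Currently no watermarks; this is the strategy in question.
>> >> env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");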
>> >>
>> >> Outage #1
>> >>
>> >> Our monitoring system alerted us that lag was building up on one
>> >> partition (out of 30). We did not know of anything we could do to
>> >> jumpstart consumption on that partition other than forcing a
>> >> reassignment. When we restarted the TaskManager service on the node
>> >> to which the partition was assigned, the lag reduced soon after.
>> >>
>> >> Outage #2
>> >>
>> >> Something similar happened again, but this time lag was building up
>> >> on 9 (out of 30) partitions. We restarted the TaskManager services on
>> >> all the nodes, and consumption resumed.
>> >>
>> >> We asked a question on SO,
>> >> https://stackoverflow.com/q/74272277/2165719, and were directed to
>> >> ask on the mailing list as well.
>> >>
>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-182%3A+Support+watermark+alignment+of+FLIP-27+Sources
>> >>
>> >> In another post, https://stackoverflow.com/a/70101290/2165719, there
>> >> is a suggestion to use `WatermarkStrategy.withIdleness(...)`. Could
>> >> this help us?
>> >>
>> >> Any help/guidance here would be much appreciated.
>> >>
>> >> Thanks,
