Hi,

We are using Flink 1.6.1 at the moment and we have a streaming job
configured to create a checkpoint every 10 seconds. Looking at the
checkpointing times in the UI, we can see that one subtask is much slower
at creating the checkpoint, at least in its "End to End Duration", and this
seems to be caused by a longer "Checkpoint Duration (Async)".

For instance, in the attached screenshot, while most of the subtasks take
half a second, one of them (and it is always just one) takes 2 seconds.

But we have worse problems. We have seen cases where the checkpoint times
out for one task: while most tasks take one second, the outlier takes more
than 5 minutes, which is the maximum time we allow for a checkpoint. This
can happen when there is back pressure. We also only allow one checkpoint
at a time.
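
For reference, our checkpoint settings are roughly equivalent to this
sketch (simplified, not the actual job code):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(10_000L);                          // every 10 seconds
    env.getCheckpointConfig().setCheckpointTimeout(300_000L);  // 5 minute timeout
    env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);  // one at a time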

Why could one subtask take more time? This job reads from Kafka partitions
and hashes by key, and we don't see any major data skew between the
partitions. Does one partition do more work?
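
The shape of the job is roughly like this (same env as above; topic,
brokers and key extraction are placeholders, not the real values):

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder brokers
    props.setProperty("group.id", "my-group");              // placeholder group id

    env.addSource(new FlinkKafkaConsumer011<>(
            "my-topic", new SimpleStringSchema(), props))   // placeholder topic
       .keyBy(new KeySelector<String, String>() {
           @Override
           public String getKey(String value) {
               return value;  // placeholder: the real job keys by a field of the record
           }
       })
       .print();  // placeholder for the real stateful operators after the keyBy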

We have a cluster of 20 machines on EMR, with TaskManagers that have
multiple slots (running in legacy mode).

Is this something that could have been fixed in a more recent version?

Thanks for any insight!

Bruno
