Hi, We are using Flink 1.6.1 at the moment and we have a streaming job configured to create a checkpoint every 10 seconds. Looking at the checkpointing times in the UI, we can see that one subtask is much slower creating the endpoint, at least in its "End to End Duration", and seems caused by a longer "Checkpoint Duration (Async)".
For instance, in the attach screenshot, while most of the subtasks take half a second, one (and it is always one) takes 2 seconds. But we have worse problems. We have seen cases where the checkpoint times out for one tasks, while most take one second, the outlier takes more than 5 minutes (which is the max time we allow for a checkpoint). This can happen if there is back pressure. We only allow one checkpoint at a time as well. Why could one subtask take more time? This jobs read from kafka partitions and hash by key, and we don't see any major data skew between the partitions. Does one partition do more work? We do have a cluster of 20 machines, in EMR, with TMs that have multiple slots (in legacy mode). Is this something that could have been fixed in a more recent version? Thanks for any insight! Bruno