Hi Bruno,

There are multiple reasons why one of the subtasks can take longer to
checkpoint. It does not look like data skew, since the state sizes are
roughly equal. The individual tasks also all start checkpointing at the
same time, which indicates that there cannot be much back-pressure in the
DAG (or that all tasks are equally back-pressured). This narrows the cause
down to the asynchronous write operation. One potential problem could be
that the external system to which you write your checkpoint data enforces
some kind of I/O limit/quota, and the sum of the concurrent write accesses
exhausts it. You could check whether running the job with a lower
parallelism makes the problem go away.
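
If you want to test that hypothesis, lowering the parallelism reduces the
number of concurrent asynchronous checkpoint writers. A minimal sketch
with the DataStream API (the parallelism value is a placeholder, the
checkpointing values are the ones you described):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismProbe {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Fewer parallel subtasks -> fewer concurrent asynchronous
        // checkpoint writes hitting the external store at the same time.
        env.setParallelism(10);

        // Checkpointing as you described it: every 10 s, at most one
        // checkpoint at a time, timing out after 5 minutes.
        env.enableCheckpointing(10_000);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        env.getCheckpointConfig().setCheckpointTimeout(300_000);

        // ... build the topology here and call env.execute(...)
    }
}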

For further debugging it would be helpful to get the JobManager and
TaskManager logs on DEBUG level. It would also help to know which state
backend you are using.
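
For the DEBUG logs, setting log4j.rootLogger to DEBUG in
conf/log4j.properties on the JobManager and TaskManager machines and
restarting the processes should be sufficient. If the state backend is not
configured in flink-conf.yaml, you can also make it explicit in the job
itself; a minimal sketch (the checkpoint URI is a placeholder, not taken
from your setup):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExplicitBackend {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // With FsStateBackend the asynchronous part of a checkpoint is
        // essentially the write of the state snapshot to this URI, which
        // is the phase that is slow for your outlier subtask.
        env.setStateBackend(new FsStateBackend("s3://my-bucket/checkpoints"));
    }
}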

Cheers,
Till

On Tue, Jan 8, 2019 at 12:52 PM Bruno Aranda <bara...@apache.org> wrote:

> Hi,
>
> We are using Flink 1.6.1 at the moment and we have a streaming job
> configured to create a checkpoint every 10 seconds. Looking at the
> checkpointing times in the UI, we can see that one subtask is much slower
> creating the checkpoint, at least in its "End to End Duration", and this
> seems to be caused by a longer "Checkpoint Duration (Async)".
>
> For instance, in the attached screenshot, while most of the subtasks take
> half a second, one (and it is always just one) takes 2 seconds.
>
> But we have worse problems. We have seen cases where the checkpoint times
> out for one task: while most take one second, the outlier takes more than
> 5 minutes (which is the max time we allow for a checkpoint). This can
> happen if there is back pressure. We also only allow one checkpoint at a
> time.
>
> Why could one subtask take more time? This job reads from Kafka
> partitions and hashes by key, and we don't see any major data skew
> between the partitions. Does one partition do more work?
>
> We have a cluster of 20 machines in EMR, with TMs that have multiple
> slots (in legacy mode).
>
> Is this something that could have been fixed in a more recent version?
>
> Thanks for any insight!
>
> Bruno
>
