Hey Shai!

Thanks for reporting this.

It's hard to tell what's causing this from your email, but could you
check the checkpoint monitoring interface
(https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/checkpoint_monitoring.html)
and report how much progress the checkpoints make before timing out?

The "Checkpoint Coordinator is suspending" message indicates that the
job failed and the checkpoint coordinator was shut down as a result.
Can you check the TaskManager and JobManager logs to see whether other
errors are reported? Feel free to share them, and I can help go over
them.
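If it helps, a grep along these lines can surface the first warning or
error before the checkpoints started expiring (a minimal sketch; the
log path and sample lines below are placeholders, not taken from your
deployment):

```shell
# Sketch: scan a Flink JobManager log for warnings/errors around the time
# checkpoints started timing out. "flink-jobmanager.log" and the sample
# lines below are stand-ins for a real log from the deployment.
LOG=flink-jobmanager.log
printf '%s\n' \
  '2017-02-21 14:00:01 INFO  Completed checkpoint 41 (in 4213 ms)' \
  '2017-02-21 14:30:12 WARN  Checkpoint 42 expired before completing.' \
  '2017-02-21 14:30:13 INFO  Checkpoint Coordinator is suspending.' > "$LOG"
# -n prints line numbers so the matches can be correlated in time.
grep -nE 'WARN|ERROR|suspending' "$LOG"
```

Run the same grep over the TaskManager logs as well; whichever error
appears first is usually the most interesting one.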

– Ufuk


On Tue, Feb 21, 2017 at 2:47 PM, Shai Kaplan <shai.kap...@microsoft.com> wrote:
> Hi.
>
> I'm running a Flink 1.2 job with a 10-second checkpoint interval. After
> some running time (minutes to hours), Flink fails to save checkpoints and
> stops processing records (I'm not sure if the checkpointing failure is the
> cause of the problem or just a symptom).
>
> After several checkpoints that take a few seconds each, they start failing
> due to the 30-minute timeout.
>
> When I restart one of the Task Manager services (just to get the job
> restarted), the job is recovered from the last successful checkpoint (the
> state size continues to grow, so it's probably not the reason for the
> failure), advances somewhat, saves some more checkpoints, and then enters
> the failing state again.
>
> One of the times it happened, the first failed checkpoint failed due to
> "Checkpoint Coordinator is suspending.", so that might be an indicator of
> the cause of the problem, but looking into Flink's code I can't see how a
> running job could get into this state.
>
> I am using RocksDB for state, and the state is saved to Azure Blob Store,
> using the NativeAzureFileSystem HDFS connector over the wasbs protocol.
>
> Any ideas? Possibly a bug in Flink or RocksDB?
