Hi.
I'm running a Flink 1.2 job with a 10-second checkpoint interval. After running
for a while (minutes to hours), Flink fails to save checkpoints and stops
processing records (I'm not sure whether the checkpointing failure is the cause
of the problem or just a symptom).
The first several checkpoints take a few seconds each; after that, checkpoints
start failing because they hit the 30-minute timeout.
When I restart one of the Task Manager services (just to get the job
restarted), the job recovers from the last successful checkpoint, advances
somewhat, saves a few more checkpoints, and then enters the failing state
again. The state size keeps growing across these restarts, so the state size
itself is probably not the reason for the failure.
On one occasion, the first failed checkpoint failed with
"Checkpoint Coordinator is suspending.", which might point to the cause of the
problem, but looking at Flink's code I can't see how a running job could get
into this state.
I am using RocksDB for state, and the state is saved to Azure Blob Storage,
using the NativeAzureFileSystem HDFS connector over the wasbs protocol.
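
For reference, the setup described above corresponds roughly to the sketch
below. It is not my actual job code: the class name, the container and account
names in the wasbs URI, the checkpoint path, and the placeholder pipeline are
all made up for illustration. It also assumes that the flink-statebackend-rocksdb
and hadoop-azure dependencies are on the classpath and that the storage account
key is set in the Hadoop configuration.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 10 seconds.
        env.enableCheckpointing(10_000L);

        // Expire any checkpoint that has not completed within 30 minutes.
        env.getCheckpointConfig().setCheckpointTimeout(30L * 60 * 1000);

        // RocksDB holds the working state on local disk; checkpoint data is
        // written to this URI. Container, account, and path are hypothetical.
        env.setStateBackend(new RocksDBStateBackend(
                "wasbs://mycontainer@myaccount.blob.core.windows.net/flink/checkpoints"));

        // Placeholder pipeline standing in for the real job.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-setup-sketch");
    }
}
```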
Any ideas? Possibly a bug in Flink or RocksDB?