Hi.
I'm running a Flink 1.2 job with a 10-second checkpoint interval. After some 
running time (minutes to hours) Flink fails to save checkpoints and stops 
processing records (I'm not sure whether the checkpointing failure is the 
cause of the problem or just a symptom).
After several checkpoints that take a few seconds each, subsequent checkpoints 
start failing due to the 30-minute timeout.
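For reference, the relevant checkpoint settings look roughly like this (a 
simplified sketch, not the exact code from the job; the 30-minute timeout is 
set explicitly on the CheckpointConfig):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// checkpoint every 10 seconds
env.enableCheckpointing(10 * 1000);
// a checkpoint is considered failed if it doesn't complete within 30 minutes
env.getCheckpointConfig().setCheckpointTimeout(30 * 60 * 1000);
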
When I restart one of the TaskManager services (just to get the job 
restarted), the job recovers from the last successful checkpoint (the state 
size continues to grow, so it's probably not the reason for the failure), 
makes some progress, saves a few more checkpoints, and then enters the failing 
state again.
On one of these occurrences, the first failed checkpoint failed with 
"Checkpoint Coordinator is suspending.", which might point to the cause of the 
problem, but looking through Flink's code I can't see how a running job could 
get into this state.
I'm using the RocksDB state backend, and the state is checkpointed to Azure 
Blob Storage using the NativeAzureFileSystem HDFS connector over the wasbs 
protocol.
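The backend setup is roughly the following (again just a sketch; the container 
and account names are placeholders, and the call sits inside a method that 
declares throws Exception):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

// RocksDB state backend, checkpointing to Azure Blob Storage over wasbs
env.setStateBackend(new RocksDBStateBackend(
        "wasbs://<container>@<account>.blob.core.windows.net/flink-checkpoints"));
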
Any ideas? Possibly a bug in Flink or RocksDB?
