Hi Shai,

I was facing a similar issue; however, the stream no longer gets stuck midway.

You can refer to this thread for the configuration changes I made:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-td11752.html

What configuration are you running the job with?
Which RocksDB predefined option are you using?



Regards,
Vinay Patil

On Tue, Feb 21, 2017 at 7:13 PM, Shai Kaplan [via Apache Flink User Mailing
List archive.] <ml-node+s2336050n1177...@n4.nabble.com> wrote:

> Hi.
>
> I'm running a Flink 1.2 job with a 10-second checkpoint interval. After
> some running time (minutes-hours) Flink fails to save checkpoints, and
> stops processing records (I'm not sure if the checkpointing failure is the
> cause of the problem or just a symptom).
>
> After several checkpoints that take a few seconds each, they start failing
> due to the 30-minute timeout.
>
> When I restart one of the Task Manager services (just to get the job
> restarted), the job is recovered from the last successful checkpoint (the
> state size continues to grow, so it's probably not the reason for the
> failure), advances somewhat, saves some more checkpoints, and then enters
> the failing state again.
>
> One of the times it happened, the first failed checkpoint failed due to
> "Checkpoint Coordinator is suspending.", so it might be an indicator of
> the cause of the problem, but looking into Flink's code I can't see how a
> running job could get to this state.
>
> I am using RocksDB for state, and the state is saved to Azure Blob Store,
> using the NativeAzureFileSystem HDFS connector over the wasbs protocol.
>
> Any ideas? Possibly a bug in Flink or RocksDB?
>