Hi Vinay. I couldn't tell from the thread which configuration solved your problem.
I'm using the default predefined option. It may not be the best configuration for my setup (I'm using Azure DS5_v2 machines), and I honestly haven't given that detail much thought, but I think it should only affect performance, not make the job get completely stuck.

Thanks.

From: vinay patil [mailto:vinay18.pa...@gmail.com]
Sent: Tuesday, February 21, 2017 3:58 PM
To: user@flink.apache.org
Subject: Re: Flink checkpointing gets stuck

Hi Shai,

I was facing a similar issue; however, the stream no longer gets stuck midway. You can refer to this thread for the configuration changes I made:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-td11752.html

What is the configuration on which you are running the job? Which RocksDB predefined option are you using?

Regards,
Vinay Patil

On Tue, Feb 21, 2017 at 7:13 PM, Shai Kaplan [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:

Hi.

I'm running a Flink 1.2 job with a 10-second checkpoint interval. After running for some time (minutes to hours), Flink fails to save checkpoints and stops processing records (I'm not sure whether the checkpointing failure is the cause of the problem or just a symptom). After several checkpoints that take a few seconds each, they start failing due to a 30-minute timeout.
When I restart one of the Task Manager services (just to get the job restarted), the job recovers from the last successful checkpoint (the state size continues to grow, so it's probably not the reason for the failure), advances somewhat, saves some more checkpoints, and then enters the failing state again.

On one occasion, the first failed checkpoint failed with "Checkpoint Coordinator is suspending.", which might point to the cause of the problem, but looking at Flink's code I can't see how a running job could get into this state.

I am using RocksDB for state, and the state is saved to Azure Blob Store using the NativeAzureFileSystem HDFS connector over the wasbs protocol.

Any ideas? Possibly a bug in Flink or RocksDB?
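
For readers following the thread, the setup described above (10-second checkpoint interval, 30-minute checkpoint timeout, RocksDB state backend with the default predefined option, checkpoints written to a wasbs location) could be sketched roughly as follows. This is only an illustration, not the poster's actual job: the class name is hypothetical, the checkpoint URI is taken from the command line because the real storage location is not given in the thread, and it assumes the Flink 1.2-era `flink-streaming-java` and `flink-statebackend-rocksdb` dependencies are on the classpath.

```java
import java.io.IOException;

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name; a configuration sketch only, no job graph is defined.
public class CheckpointSetupSketch {

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            throw new IllegalArgumentException(
                "Usage: CheckpointSetupSketch <checkpoint-uri, e.g. a wasbs:// location>");
        }
        // Assumption: the checkpoint location is supplied by the caller.
        String checkpointUri = args[0];

        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 10 seconds (interval in milliseconds).
        env.enableCheckpointing(10_000L);

        // Declare a checkpoint failed if it takes longer than 30 minutes.
        env.getCheckpointConfig().setCheckpointTimeout(30L * 60L * 1000L);

        // RocksDB keeps working state locally and writes checkpoints to the given URI.
        RocksDBStateBackend backend = new RocksDBStateBackend(checkpointUri);

        // DEFAULT is the option mentioned in the reply; other values such as
        // FLASH_SSD_OPTIMIZED or SPINNING_DISK_OPTIMIZED exist for tuning.
        backend.setPredefinedOptions(PredefinedOptions.DEFAULT);

        env.setStateBackend(backend);

        // Sources, operators, sinks, and env.execute(...) are omitted here.
    }
}
```

Switching the predefined option is a one-line change in this sketch, which makes it a cheap thing to try when comparing against the configuration in the linked thread, although nothing in this thread confirms it is the cause of the stuck checkpoints.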