Hi Shai,

I checked online and Azure DS5_v2 machines use SSD storage, so why don't you try the FLASH_SSD_OPTIMIZED option?
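For reference, here is a minimal sketch of how that option can be set. It assumes the Flink 1.2 RocksDBStateBackend API from the flink-statebackend-rocksdb module; the class name, the checkpoint URI and the toy pipeline are placeholders, not your actual setup:

```java
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbSsdSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical checkpoint location; replace it with your own URI.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("file:///tmp/flink-checkpoints");

        // Switch from the default predefined option to the SSD-tuned profile.
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
        env.setStateBackend(backend);

        // Placeholder pipeline so the sketch has something to execute.
        env.fromElements(1, 2, 3).print();
        env.execute("rocksdb-ssd-sketch");
    }
}
```

The only line that matters is the setPredefinedOptions call; the rest of the job stays as it is.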
In my case the stream was also getting stuck for a few minutes; my checkpoint duration is 6 seconds and minimumPauseIntervalBetweenCheckpoints is 5 seconds:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/large_state_tuning.html

I think that if writes to RocksDB are blocked, the stream can block for a certain interval:
https://github.com/facebook/rocksdb/wiki/Write-Stalls

First try the FLASH_SSD_OPTIMIZED option, and don't give the TM an unnecessarily large heap, since RocksDB also uses physical memory outside the JVM heap.

Regards,
Vinay Patil

On Tue, Feb 21, 2017 at 8:03 PM, Shai Kaplan [via Apache Flink User Mailing List archive.] wrote:

> Hi Vinay.
>
> I couldn't understand from the thread, what configuration solved your problem?
>
> I'm using the default predefined option. Perhaps it's not the best configuration for my setting (I'm using Azure DS5_v2 machines), I honestly haven't given much thought to that particular detail, but I think it should only affect the performance, not make the job totally stuck.
>
> Thanks.
>
> *From:* vinay patil [hidden email]
> *Sent:* Tuesday, February 21, 2017 3:58 PM
> *To:* [hidden email]
> *Subject:* Re: Flink checkpointing gets stuck
>
> Hi Shai,
>
> I was facing a similar issue; however, the stream no longer gets stuck midway.
>
> You can refer to this thread for the configuration I used:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-td11752.html
>
> What configuration are you running the job on?
> Which RocksDB predefined option are you using?
>
> Regards,
> Vinay Patil
>
> On Tue, Feb 21, 2017 at 7:13 PM, Shai Kaplan [via Apache Flink User Mailing List archive.] wrote:
>
> Hi.
>
> I'm running a Flink 1.2 job with a 10-second checkpoint interval. After some running time (minutes to hours) Flink fails to save checkpoints and stops processing records (I'm not sure if the checkpointing failure is the cause of the problem or just a symptom).
>
> After several checkpoints that take a few seconds each, they start failing due to the 30-minute timeout.
>
> When I restart one of the Task Manager services (just to get the job restarted), the job is recovered from the last successful checkpoint (the state size continues to grow, so it's probably not the reason for the failure), advances somewhat, saves some more checkpoints, and then enters the failing state again.
>
> One of the times it happened, the first failed checkpoint failed due to "Checkpoint Coordinator is suspending.", so it might be an indicator of the cause of the problem, but looking into Flink's code I can't see how a running job could get to this state.
>
> I am using RocksDB for state, and the state is saved to Azure Blob Store, using the NativeAzureFileSystem HDFS connector over the wasbs protocol.
>
> Any ideas? Possibly a bug in Flink or RocksDB?

--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-checkpointing-gets-stuck-tp11776p11783.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.