Hi Shai,

I checked online and Azure DS5_v2 machines use SSD storage, so why don't you
try the FLASH_SSD_OPTIMIZED predefined option?

In my case the stream was also getting stuck for a few minutes. My
checkpoint duration is 6 seconds and minimumPauseIntervalBetweenCheckpoints
is 5 seconds.

https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/large_state_tuning.html
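
To illustrate how the two checkpoint settings mentioned above interact, here
is a small model in plain Java. It is only a sketch under a simplifying
assumption: the next checkpoint starts no earlier than one interval after the
previous start and no earlier than the minimum pause after the previous
checkpoint finished. It also assumes the 6 seconds is the checkpoint interval.
The class and method names are hypothetical, and this is not Flink's actual
scheduler code.

```java
public class CheckpointPauseModel {

    /**
     * Earliest start time (ms) of the next checkpoint under the simplified
     * model: the later of "previous start + interval" and
     * "previous end + minimum pause".
     */
    public static long nextStart(long prevStartMs, long prevDurationMs,
                                 long intervalMs, long minPauseMs) {
        long byInterval = prevStartMs + intervalMs;
        long byPause = prevStartMs + prevDurationMs + minPauseMs;
        return Math.max(byInterval, byPause);
    }

    public static void main(String[] args) {
        long intervalMs = 6_000L; // assumed to be the checkpoint interval
        long minPauseMs = 5_000L; // minimum pause between checkpoints

        // Hypothetical checkpoint durations, chosen only for illustration.
        long[] durationsMs = {500L, 1_000L, 4_000L, 30_000L};
        for (long d : durationsMs) {
            System.out.println("checkpoint took " + d + " ms, next start at "
                    + nextStart(0L, d, intervalMs, minPauseMs) + " ms");
        }
    }
}
```

Under this model the minimum pause only matters once a checkpoint takes
longer than one second; beyond that, slow checkpoints push the next one back
instead of piling up.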

I think that if writes to RocksDB are blocked, the stream can block for a
certain interval:
https://github.com/facebook/rocksdb/wiki/Write-Stalls

First try the FLASH_SSD_OPTIMIZED option, and don't give unnecessarily high
heap memory to the TM, as RocksDB also uses physical memory outside the heap.
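
For reference, a minimal sketch of how these settings could be applied in a
Flink 1.2 job. It assumes the flink-statebackend-rocksdb dependency is on the
classpath and that the 6 seconds above is the checkpoint interval. The class
name, method name, and the checkpointUri parameter are placeholders, so
substitute your own checkpoint location.

```java
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbCheckpointSetup {

    /**
     * Applies the RocksDB and checkpoint settings discussed in this thread.
     * checkpointUri is a placeholder for wherever checkpoints are stored.
     */
    public static void configure(StreamExecutionEnvironment env,
                                 String checkpointUri) throws Exception {
        // RocksDB state backend tuned for SSD-backed local disks.
        RocksDBStateBackend backend = new RocksDBStateBackend(checkpointUri);
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
        env.setStateBackend(backend);

        // Checkpoint every 6 s (assumed interval), with at least 5 s
        // between the end of one checkpoint and the start of the next.
        env.enableCheckpointing(6_000L);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000L);
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        configure(env, args[0]);
        // Define sources, operators, and sinks here, then call env.execute().
    }
}
```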




Regards,
Vinay Patil

On Tue, Feb 21, 2017 at 8:03 PM, Shai Kaplan [via Apache Flink User Mailing
List archive.] <ml-node+s2336050n11780...@n4.nabble.com> wrote:

> Hi Vinay.
>
>
>
> I couldn't understand from the thread, what configuration solved your
> problem?
>
>
>
> I'm using the default predefined option. Perhaps it's not the best
> configuration for my setting (I'm using Azure DS5_v2 machines), I honestly
> haven't given much thought to that particular detail, but I think it should
> only affect the performance, not make the job totally stuck.
>
>
>
> Thanks.
>
>
>
> *From:* vinay patil [mailto:[hidden email]]
> *Sent:* Tuesday, February 21, 2017 3:58 PM
> *To:* [hidden email]
> *Subject:* Re: Flink checkpointing gets stuck
>
>
>
> Hi Shai,
>
> I was facing a similar issue; however, the stream no longer gets stuck
> midway.
>
> You can refer to this thread for the configuration I used:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-td11752.html
>
> What configuration are you running the job on?
> Which RocksDB predefined option are you using?
>
>
> Regards,
>
> Vinay Patil
>
>
>
> On Tue, Feb 21, 2017 at 7:13 PM, Shai Kaplan [via Apache Flink User
> Mailing List archive.] <[hidden email]> wrote:
>
> Hi.
>
> I'm running a Flink 1.2 job with a 10-second checkpoint interval. After
> some running time (minutes to hours) Flink fails to save checkpoints and
> stops processing records (I'm not sure if the checkpointing failure is the
> cause of the problem or just a symptom).
>
> After several checkpoints that take a few seconds each, they start failing
> due to the 30-minute timeout.
>
> When I restart one of the Task Manager services (just to get the job
> restarted), the job is recovered from the last successful checkpoint (the
> state size continues to grow, so it's probably not the reason for the
> failure), advances somewhat, saves some more checkpoints, and then enters
> the failing state again.
>
> One of the times it happened, the first failed checkpoint failed due to
> "Checkpoint Coordinator is suspending.", so it might be an indicator for
> the cause of the problem, but looking into Flink's code I can't see how a
> running job could get to this state.
>
> I am using RocksDB for state, and the state is saved to Azure Blob Store,
> using the NativeAzureFileSystem HDFS connector over the wasbs protocol.
>
> Any ideas? Possibly a bug in Flink or RocksDB?
>




--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-checkpointing-gets-stuck-tp11776p11783.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.
