Hi Vinay.

I couldn't tell from the thread which configuration solved your problem.

I'm using the default predefined option. It may not be the best 
configuration for my setup (I'm using Azure DS5_v2 machines), and I honestly 
haven't given that detail much thought, but I think it should only affect 
performance, not make the job get completely stuck.
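
For reference, the predefined option is chosen on the RocksDB backend object. The following is only a minimal sketch, assuming the Flink 1.2 Java API with the flink-statebackend-rocksdb dependency on the classpath; the class name, the helper name and the URI passed to it are hypothetical and not taken from the actual job.

```java
import java.io.IOException;

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

public class RocksDbOptionsSketch {

    // Hypothetical helper: creates a RocksDB state backend that writes
    // checkpoints to the given URI and applies one predefined option profile.
    static RocksDBStateBackend backendWith(String checkpointUri,
                                           PredefinedOptions options) throws IOException {
        RocksDBStateBackend backend = new RocksDBStateBackend(checkpointUri);
        // PredefinedOptions.DEFAULT is what the backend uses when nothing is set.
        // Other profiles include SPINNING_DISK_OPTIMIZED and FLASH_SSD_OPTIMIZED.
        backend.setPredefinedOptions(options);
        return backend;
    }
}
```

Switching profiles is then a one-argument change, for example `backendWith(uri, PredefinedOptions.FLASH_SSD_OPTIMIZED)` if the RocksDB local directory sits on SSD. Whether a different profile would change the behaviour described in this thread is untested.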

Thanks.

From: vinay patil [mailto:vinay18.pa...@gmail.com]
Sent: Tuesday, February 21, 2017 3:58 PM
To: user@flink.apache.org
Subject: Re: Flink checkpointing gets stuck

Hi Shai,

I was facing a similar issue, but the stream no longer gets stuck midway.
You can refer to this thread for the configuration changes I made:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-td11752.html

What configuration are you running the job on?
Which RocksDB predefined option are you using?


Regards,
Vinay Patil

On Tue, Feb 21, 2017 at 7:13 PM, Shai Kaplan [via Apache Flink User Mailing 
List archive] <[hidden email]> wrote:
Hi.
I'm running a Flink 1.2 job with a 10-second checkpoint interval. After running 
for a while (minutes to hours), Flink fails to save checkpoints and stops 
processing records (I'm not sure whether the checkpointing failure is the cause 
of the problem or just a symptom).
At first, several checkpoints complete in a few seconds each; then they start 
failing due to the 30-minute timeout.
When I restart one of the Task Manager services (just to get the job 
restarted), the job is recovered from the last successful checkpoint (the state 
size continues to grow, so it's probably not the reason for the failure), 
advances somewhat, saves some more checkpoints, and then enters the failing 
state again.
On one occasion, the first failed checkpoint failed with 
"Checkpoint Coordinator is suspending.", so that might be an indicator of the 
cause of the problem, but looking into Flink's code I can't see how a running 
job could get into this state.
I am using RocksDB for state, and the state is saved to Azure Blob Storage using 
the NativeAzureFileSystem HDFS connector over the wasbs protocol.
Any ideas? Possibly a bug in Flink or RocksDB?
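
The setup described above could be expressed roughly as follows. This is a sketch under stated assumptions, not the actual job code: it assumes the Flink 1.2 Java API with flink-streaming-java and flink-statebackend-rocksdb on the classpath, plus the Hadoop Azure connector for wasbs. The class name, the container and account in the URI, and the checkpoint path are hypothetical placeholders.

```java
import java.io.IOException;

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {

    // Hypothetical checkpoint location: container, account and path are
    // placeholders following the wasbs://container@account.blob.core.windows.net/path form.
    private static final String CHECKPOINT_URI =
            "wasbs://mycontainer@myaccount.blob.core.windows.net/flink/checkpoints";

    static StreamExecutionEnvironment configure() throws IOException {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 10 seconds (interval is in milliseconds).
        env.enableCheckpointing(10_000L);

        // Declare a checkpoint failed if it has not completed within 30 minutes.
        env.getCheckpointConfig().setCheckpointTimeout(30 * 60 * 1000L);

        // Keep working state in RocksDB and write checkpoint data to Azure Blob Storage.
        env.setStateBackend(new RocksDBStateBackend(CHECKPOINT_URI));

        return env;
    }
}
```

The job's sources, operators and the `env.execute(...)` call are omitted, since the thread does not describe them.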
