Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-07 Thread Ufuk Celebi
On 4 November 2016 at 17:09:25, Josh (jof...@gmail.com) wrote:
> Thanks, I didn't know about the -z flag!
>
> I haven't been able to get it to work though (using yarn-cluster, with a
> zookeeper root configured to /flink in my flink-conf.yaml)
>
> I can see my job directory in ZK under
> /fl

Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-04 Thread Josh
Thanks, I didn't know about the -z flag! I haven't been able to get it to work, though (using yarn-cluster, with a ZooKeeper root configured to /flink in my flink-conf.yaml). I can see my job directory in ZK under /flink/application_1477475694024_0015 and I've tried a few ways to restore the job:

Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-04 Thread Ufuk Celebi
If the configured ZooKeeper paths are still the same, the job should be recovered automatically. On each submission, a unique ZK namespace based on the app ID is used, so in ZK you have: /flink/app_id/... To resume an old application, you would have to set that namespace manually. You can do this via the -z flag
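
The namespace layout described in this message can be illustrated with a minimal Python sketch. It only joins a ZooKeeper root and a YARN application ID into the path pattern /flink/app_id mentioned above. The helper name `zk_namespace_path` is hypothetical and not part of Flink; the example values are the root and application ID reported elsewhere in this thread.

```python
def zk_namespace_path(zk_root: str, app_id: str) -> str:
    """Join a ZooKeeper root and a YARN application ID into one path.

    Hypothetical helper for illustration only; Flink builds this path
    internally and exposes no such function.
    """
    # Drop surrounding slashes so "/flink" and "/flink/" behave the same.
    parts = [p.strip("/") for p in (zk_root, app_id)]
    return "/" + "/".join(p for p in parts if p)


# Root and application ID as reported in this thread.
print(zk_namespace_path("/flink", "application_1477475694024_0015"))
```

A path of this shape is where one would look in ZooKeeper to check whether the state of an earlier application is still present before trying to resume it.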

Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-04 Thread Josh
Hi Ufuk, I see, but in my case the failure caused the YARN application to move into a finished/failed state, so the application itself is no longer running. How can I restart the application (or start a new YARN application) and ensure that it uses the checkpoint pointer stored in ZooKeeper? Thanks, J

Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-04 Thread Ufuk Celebi
No, you don't need to manually trigger a savepoint. With HA, checkpoints are persisted externally and a pointer to them is stored in ZooKeeper so that they can be recovered after a JobManager failure. On Fri, Nov 4, 2016 at 2:27 PM, Josh wrote: > I have a follow up question to this - if I'm running a job in 'yarn-cluster' >

Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-04 Thread Josh
I have a follow-up question to this: if I'm running a job in 'yarn-cluster' mode with HA and then at some point the YARN application fails due to some hardware failure (i.e. the YARN application moves to the "FINISHED"/"FAILED" state), how can I restore the job from the most recent checkpoint? I can

Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-04 Thread Maximilian Michels
Hi Anchit, The documentation mentions that you need ZooKeeper in addition to setting the application attempts. ZooKeeper is needed to retrieve the current leader for the client and to filter out old leaders in case multiple exist (old processes could even stay alive in YARN). Moreover, it is neede

Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-03 Thread Anchit Jatana
Hi Maximilian, Thanks for your response. Since I'm running the application on a YARN cluster in 'yarn-cluster' mode, i.e. using the 'flink run -m yarn-cluster ..' command, is there anything more that I need to configure apart from setting up the 'yarn.application-attempts: 10' property inside conf/flink-c

Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-03 Thread Maximilian Michels
Hi Anchit, The application can crash for many different reasons, e.g. errors in user code or hardware/network failures. Have you configured high availability for YARN as described in the documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.1/setup/jobmanager_hig

Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

2016-11-02 Thread Anchit Jatana
Hi All, I started my Flink application on YARN using flink run -m yarn-cluster; after running smoothly for 20 hrs, it failed. Ideally the application should have recovered on losing the Job Manager (which runs in the same container as the application master), given the fault-tolerant nature of