Thanks, I didn't know about the -z flag!

I haven't been able to get it to work though (using yarn-cluster, with a
ZooKeeper root configured to /flink in my flink-conf.yaml).

I can see my job directory in ZK under
/flink/application_1477475694024_0015 and I've tried a few ways to restore
the job:

./bin/flink run -m yarn-cluster -yz /application_1477475694024_0015 ....
./bin/flink run -m yarn-cluster -yz application_1477475694024_0015 ....
./bin/flink run -m yarn-cluster -yz /flink/application_1477475694024_0015/ ....
./bin/flink run -m yarn-cluster -yz /flink/application_1477475694024_0015 ....

The job starts from scratch each time, without restored state.

Am I doing something wrong? I've also tried with -z instead of -yz but I'm
using yarn-cluster to run a single job, so I think it should be -yz.
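
A minimal sketch of how the configured root and the per-application namespace appear to combine into the path seen in ZK, assuming the layout is simply root + "/" + app ID as described in the quoted reply below. `zk_job_path` is a hypothetical helper for illustration, not a Flink command, and it does not settle which of the forms above `-yz` actually expects.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flink): join a ZooKeeper root and a
# namespace into one absolute path, tolerating stray slashes.
zk_job_path() {
    root=$1
    namespace=$2
    # Strip trailing slashes from the root.
    while [ "${root%/}" != "$root" ]; do root=${root%/}; done
    # Strip leading and trailing slashes from the namespace.
    while [ "${namespace#/}" != "$namespace" ]; do namespace=${namespace#/}; done
    while [ "${namespace%/}" != "$namespace" ]; do namespace=${namespace%/}; done
    printf '%s/%s\n' "$root" "$namespace"
}

# Both calls print /flink/application_1477475694024_0015
zk_job_path /flink application_1477475694024_0015
zk_job_path /flink/ /application_1477475694024_0015/
```

If the namespace is resolved relative to the root, the bare app ID would be the value to pass; if it is treated as an absolute path, the full /flink/... form would be. The sketch only makes the two pieces explicit.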



On Fri, Nov 4, 2016 at 2:33 PM, Ufuk Celebi <u...@apache.org> wrote:

> If the configured ZooKeeper paths are still the same, the job should
> be recovered automatically. On each submission a unique ZK namespace
> is used based on the app ID.
>
> So you have in ZK:
> /flink/app_id/...
>
> You would have to set that manually to resume an old application. You
> can do this via the -z flag
> (https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cli.html).
>
> Does this work?
>
> On Fri, Nov 4, 2016 at 3:28 PM, Josh <jof...@gmail.com> wrote:
> > Hi Ufuk,
> >
> > I see, but in my case the failure caused the YARN application to move into a
> > finished/failed state, so the application itself is no longer running. How
> > can I restart the application (or start a new YARN application) and ensure
> > that it uses the checkpoint pointer stored in ZooKeeper?
> >
> > Thanks,
> > Josh
> >
> > On Fri, Nov 4, 2016 at 1:52 PM, Ufuk Celebi <u...@apache.org> wrote:
> >>
> >> No, you don't need to manually trigger a savepoint. With HA, checkpoints
> >> are persisted externally and a pointer to them is stored in ZooKeeper so
> >> that they can be recovered after a JobManager failure.
> >>
> >> On Fri, Nov 4, 2016 at 2:27 PM, Josh <jof...@gmail.com> wrote:
> >> > I have a follow-up question to this: if I'm running a job in
> >> > 'yarn-cluster' mode with HA and at some point the YARN application
> >> > fails due to some hardware failure (i.e. the YARN application moves to
> >> > the "FINISHED"/"FAILED" state), how can I restore the job from the most
> >> > recent checkpoint?
> >> >
> >> > I can use `flink run -m yarn-cluster -s s3://my-savepoints/id .....` to
> >> > restore from a savepoint, but what if I haven't manually taken a
> >> > savepoint recently?
> >> >
> >> > Thanks,
> >> > Josh
> >> >
> >> > On Fri, Nov 4, 2016 at 10:06 AM, Maximilian Michels <m...@apache.org>
> >> > wrote:
> >> >>
> >> >> Hi Anchit,
> >> >>
> >> >> The documentation mentions that you need ZooKeeper in addition to
> >> >> setting the application attempts. ZooKeeper is needed to retrieve the
> >> >> current leader for the client and to filter out old leaders in case
> >> >> multiple exist (old processes could even stay alive in YARN). Moreover,
> >> >> it is needed to persist the state of the application.
> >> >>
> >> >>
> >> >> -Max
> >> >>
> >> >>
> >> >> On Thu, Nov 3, 2016 at 7:43 PM, Anchit Jatana
> >> >> <development.anc...@gmail.com> wrote:
> >> >> > Hi Maximilian,
> >> >> >
> >> >> > Thanks for your response. I'm running the application on a YARN
> >> >> > cluster in 'yarn-cluster' mode, i.e. with the
> >> >> > 'flink run -m yarn-cluster ..' command. Is there anything more that
> >> >> > I need to configure apart from setting the
> >> >> > 'yarn.application-attempts: 10' property inside conf/flink-conf.yaml?
> >> >> >
> >> >> > I just wanted to confirm whether there is anything more that I need
> >> >> > to configure to set up HA in 'yarn-cluster' mode.
> >> >> >
> >> >> > Thank you
> >> >> >
> >> >> > Regards,
> >> >> > Anchit
> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>
