On 4 November 2016 at 17:09:25, Josh (jof...@gmail.com) wrote:
Thanks, I didn't know about the -z flag!
I haven't been able to get it to work though (using yarn-cluster, with a
zookeeper root configured to /flink in my flink-conf.yaml)
I can see my job directory in ZK under
/flink/application_1477475694024_0015 and I've tried a few ways to restore
the job:
If the configured ZooKeeper paths are still the same, the job should
be recovered automatically. On each submission a unique ZK namespace
is used based on the app ID.
So you have in ZK:
/flink/app_id/...
You would have to set that namespace manually to resume an old application.
You can do this via the -z flag.
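For example, a rough sketch of resuming against the old namespace (the jar
path is a placeholder and the usual yarn-cluster options such as container
count are omitted; -z with the old application ID from your ZooKeeper root
is the relevant part):

  ./bin/flink run -m yarn-cluster \
      -z application_1477475694024_0015 \
      /path/to/your-job.jar

With the namespace pointing at the old application's entry under /flink, the
new JobManager should be able to pick up the checkpoint pointer stored there.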
Hi Ufuk,
I see, but in my case the failure caused the YARN application to move into a
finished/failed state - so the application itself is no longer running. How
can I restart the application (or start a new YARN application) and ensure
that it uses the checkpoint pointer stored in Zookeeper?
Thanks,
J
No, you don't need to manually trigger a savepoint. With HA, checkpoints
are persisted externally and a pointer to them is stored in ZooKeeper so
they can be recovered after a JobManager failure.
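As a quick sanity check (a sketch only; zk-host:2181 stands in for your
quorum address, and the exact sub-paths depend on the Flink version and the
recovery.zookeeper.path.* settings, so treat the node names as assumptions),
you can inspect that pointer with the ZooKeeper CLI:

  # connect to the quorum and list the application's namespace
  ./bin/zkCli.sh -server zk-host:2181
  ls /flink/application_1477475694024_0015
  ls /flink/application_1477475694024_0015/checkpoints

If entries show up under the checkpoints node, the metadata needed for
recovery is still there.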
On Fri, Nov 4, 2016 at 2:27 PM, Josh wrote:
I have a follow-up question to this - if I'm running a job in
'yarn-cluster' mode with HA and then at some point the YARN application
fails due to some hardware failure (i.e. the YARN application moves to
"FINISHED"/"FAILED" state), how can I restore the job from the most recent
checkpoint?
Hi Anchit,
The documentation mentions that you need ZooKeeper in addition to
setting the application attempts. ZooKeeper is needed to retrieve the
current leader for the client and to filter out old leaders in case
multiple exist (old processes could even stay alive in Yarn). Moreover, it
is needed to persist the job's checkpoint metadata so that a restarted
JobManager can recover it.
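For reference, a minimal sketch of the relevant entries in
conf/flink-conf.yaml, using the 1.1 key names from the HA docs (the quorum
address and HDFS path are placeholders to adapt, and the keys are worth
double-checking against your Flink version):

  recovery.mode: zookeeper
  recovery.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181,zk-host-3:2181
  recovery.zookeeper.path.root: /flink
  recovery.zookeeper.storageDir: hdfs:///flink/recovery
  yarn.application-attempts: 10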
Hi Maximilian,
Thanks for your response. Since I'm running the application on a YARN cluster
in 'yarn-cluster' mode, i.e. using the 'flink run -m yarn-cluster ..' command,
is there anything more that I need to configure apart from setting the
'yarn.application-attempts: 10' property inside conf/flink-conf.yaml?
Hi Anchit,
It is possible that the application crashes for many different
reasons, e.g. an error in user code or hardware/network failures. Have you
configured high availability for Yarn as described in the documentation?
https://ci.apache.org/projects/flink/flink-docs-release-1.1/setup/jobmanager_high_availability.html
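One related point from that page (paraphrased, so please verify it there):
YARN itself limits how often an ApplicationMaster can be restarted via
yarn.resourcemanager.am.max-attempts in yarn-site.xml (the YARN default is
quite low), and yarn.application-attempts in flink-conf.yaml only takes
effect up to that limit. A sketch of the YARN side:

  <!-- yarn-site.xml: raise the AM restart limit so Flink's setting can take effect -->
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>10</value>
  </property>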
Hi All,
I started my Flink application on YARN using flink run -m yarn-cluster;
after running smoothly for 20 hrs it failed. Ideally the application should
have recovered on losing the JobManager (which runs in the same container as
the application master), given the fault-tolerant nature of Flink.