Re: High Availability on Yarn

Aljoscha Krettek Wed, 03 May 2017 02:06:07 -0700

Hi,
As a first comment, the work mentioned in the FLIP-6 doc you linked is still 
work-in-progress. You cannot use these abstractions yet without going into the 
code and setting up a cluster “by hand”.


The documentation for one-step deployment of a Job to YARN is available here: 
https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/yarn_setup.html#run-a-single-flink-job-on-yarn
 
<https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/yarn_setup.html#run-a-single-flink-job-on-yarn>

Regarding your third question, ZooKeeper is mostly used for discovery and 
leader election. That is, JobManagers use it to decide who is the main JM and 
who are standby JMs. TaskManagers use it to discover the leading JobManager 
that they should connect to.

I’m also cc’ing Till, who should know this stuff better and can maybe explain 
it in a bit more detail.

Best,
Aljoscha
> On 1. May 2017, at 18:59, Jain, Ankit <ankit.j...@here.com> wrote:
> 
> Hi fellow users,
> We are trying to straighten out high availability story for flink.
>  
> Our setup includes a long running EMR cluster, job submission is a two-step 
> process – 1) Flink cluster is first created using flink yarn client on the 
> EMR cluster already running 2) Flink job is submitted.
>  
> I also saw references that with 1.2, these two steps have been combined into 
> 1 – is that change in FlinkYarnSessionCli.java? Can somebody point to 
> documentation please?
>  
> W/o worrying about Yarn RM (not Flink Yarn RM that seems to be newly 
> introduced) failure for now, I want to understand first how task manager & 
> job manager failures are handled.
>  
> My questions-
> 1)       
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077 
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077> 
> suggests a new RM has been added and now there is one JobManager for each 
> job. Since Yarn RM will now talk to Flink RM( instead of JobManager 
> previously), will Yarn automatically restart failing Flink RM?
> 2)       Is there any documentation on behavior of new Flink RM that will 
> come up? How will previously running JobManagers & TaskManagers find out 
> about new RM?
> 3)       
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html#configuration
>  
> <https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html#configuration>
>  requires configuring Zookeeper even for Yarn – Is this needed for handling 
> Task Manager failures or JM or both? Will Yarn not take care of JM failures?
>  
> It may sound like I am little confused between role of Yarn and Flink 
> components– who has the most burden of HA? Documentation in current state is 
> lacking clarity – I know it is still evolving.
>  
> Please let me know if somebody can help clear the confusion.
>  
> Thanks
> Ankit
>  
>

Re: High Availability on Yarn

Reply via email to