Hey Ethan,

YARN's HA support is marginal right now, and we're still investigating
this stuff. Some useful things to read are:

* https://issues.apache.org/jira/browse/YARN-128
* https://issues.apache.org/jira/browse/YARN-149
* https://issues.apache.org/jira/browse/YARN-353
* https://issues.apache.org/jira/browse/YARN-556


Also, CDH seems to be packaging some of the ZK-based HA stuff already:

  
https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-High-Availability-Guide/cdh5hag_cfg_RM_HA.html


At LI, we're still experimenting with the best setup, so my guidance might
not be state of the art. We currently configure the YARN RM's store
(yarn.resourcemanager.store.class) to use the file system store
(org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore).
The failover is a manual operation where we copy the RM state to a new
machine, and then start the RM on that machine. You then need to front the
RM with a VIP or DNS entry, which you can update to point to the new RM
machine when a failover occurs. The NMs need to be configured to point to
this VIP/DNS entry, so that when a failover occurs, the NMs don't need to
update their yarn-site.xml files.
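
For reference, here's a rough sketch of what that looks like in
yarn-site.xml. This is not our exact config. The store class is the one
mentioned above; the hostname (rm-vip.example.com) and the state directory
are placeholders I made up, and you should double-check the recovery and
state-store property names against the yarn-default.xml that ships with
your Hadoop version, since they have changed between 2.x releases.

```xml
<configuration>
  <!-- Assumption: RM recovery has to be switched on for the state store
       to be used. Verify the property name for your Hadoop release. -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>

  <!-- File system based RM state store, as described above. -->
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
  </property>

  <!-- Assumption: where the RM writes its state. Both the property name
       and the path are illustrative; this is the directory you would
       copy to the new machine during a manual failover. -->
  <property>
    <name>yarn.resourcemanager.fs.state-store.uri</name>
    <value>file:///placeholder/path/to/rm-state</value>
  </property>

  <!-- Hypothetical VIP/DNS name that fronts the RM. The NMs use the same
       value, so only the VIP/DNS mapping changes on failover. -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-vip.example.com</value>
  </property>
</configuration>
```

The point of the last property is that every NM (and client) resolves the
RM through the VIP/DNS name rather than a physical host, so the same
yarn-site.xml stays valid before and after a failover.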


It sounds like in the future you won't need to use VIPs/DNS entries. You
should probably also email the YARN mailing list, just in case we're
misinformed or unaware of some new updates.

Cheers,
Chris

On 2/21/14 2:27 PM, "Ethan Setnik" <[email protected]> wrote:

>I'm looking to deploy Samza on AWS infrastructure in a HA configuration.  I
>have a clear picture of how to configure all the components such that they
>do not contain any single point of failure.
>
>I'm stuck, however, when it comes to the YARN architecture.  It seems that
>YARN relies on the single-master / multi-slave pattern as described in the
>YARN documentation.  This introduces a single point of failure at the
>ResourceManager level such that a failed ResourceManager will fail the
>entire YARN cluster.  How does LinkedIn architect a HA configuration for
>Samza on YARN such that a complete instance failure of ResourceManager
>provides failover for the YARN cluster?
>
>Thanks for your help.
>
>Best,
>Ethan
>
>
>-- 
>Ethan Setnik
>MobileAware
>
>m: +1 617 513 2052
>e: [email protected]
