Hi Chris,

I have made Samza run on HA YARN, leveraging the highly-available configuration. I'll just put my coarse approach here in case someone faces a similar problem.
The HA YARN is from the CDH5 beta 2 release, which is ZK-based HA YARN. It does not seem to work by just replacing the jar files. So the way I made it work is a little hacky: I changed samza-yarn slightly, having the client check the current active RM in ZooKeeper every time it submits the AM (because HA YARN keeps the active RM's name in ZK). With that, Samza works well. It automatically gets restarted when the RM changes (that is, when the standby RM becomes active after the active RM fails). Hope someone has a better idea for doing this.

Thank you.

Cheers,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Mon, Mar 10, 2014 at 4:35 PM, Yan Fang <[email protected]> wrote:
> Hi Chris,
>
> Thank you! You are correct, I am actually working with a CDH5 beta
> version. I will definitely try what you recommended and do some
> experiments to see how Samza performs.
>
> Cheers,
>
> Fang, Yan
> [email protected]
> +1 (206) 849-4108
>
>
> On Mon, Mar 10, 2014 at 3:54 PM, Chris Riccomini
> <[email protected]> wrote:
>
>> Hey Yan,
>>
>> I'm not aware of anyone successfully running Samza with CDH5's HA YARN.
>> As far as I understand, those patches are not fully merged into Apache
>> yet (I could be wrong, though).
>>
>> At a minimum, you'll probably need to replace Samza's 2.2 YARN jars with
>> the CDH5 jars, so that Samza properly interprets the different configs
>> (e.g. the new RM style of config, which you've mentioned).
>>
>> I'm not sure how Samza's YARN AM will behave when the RM is failed over.
>> You'll have to experiment with this and see. If you find anything out,
>> it'd be very useful if you could share it with the rest of us. Samza
>> and HA RMs is something that we're investigating as well.
>>
>> Cheers,
>> Chris
>>
>> On 3/10/14 12:11 PM, "Yan Fang" <[email protected]> wrote:
>>
>> >Hi All,
>> >
>> >Happy daylight saving! I am wondering if anyone on this mailing list
>> >has successfully run Samza in an HA YARN cluster?
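(For reference, the active-RM lookup described in this thread could be sketched roughly as follows. This is not the actual samza-yarn change; the names are hypothetical, and the probe is stubbed out so the selection logic is self-contained. A real implementation would read the active RM's identity from ZooKeeper, where ZK-based HA YARN stores it, before submitting the AM.)

```python
# Sketch: pick the currently active RM out of the configured candidates
# before submitting the AM. The probe is injected so the logic can run
# without a live cluster; a real version would query ZooKeeper instead.

def pick_active_rm(rm_addresses, probe):
    """Return the first RM address whose probe reports 'active'.

    rm_addresses: list of 'host:port' strings, one per configured RM.
    probe: callable taking an address and returning 'active' or 'standby'.
    """
    for addr in rm_addresses:
        try:
            if probe(addr) == "active":
                return addr
        except OSError:
            # A dead RM is treated the same as a standby here.
            continue
    raise RuntimeError("no active ResourceManager found")


if __name__ == "__main__":
    # Stubbed probe: pretend rm15 is active and rm16 is standby.
    states = {"rm15.example.com:8032": "active",
              "rm16.example.com:8032": "standby"}
    active = pick_active_rm(list(states), states.get)
    print(active)  # the client would submit the AM to this address
```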
>> >
>> >We are trying to run Samza on CDH5, which has HA YARN configurations.
>> >I am able to run Samza only by updating yarn-default.xml (changing
>> >yarn.resourcemanager.address), the same approach Nirmal Kumar
>> >mentioned in "Running Samza on multi node". Otherwise, it always
>> >connects to the 0.0.0.0 address in yarn-default.xml. (I am sure I set
>> >the conf file and YARN_HOME correctly.)
>> >
>> >So my questions are:
>> >1. Can't Samza interpret the HA YARN configuration file correctly?
>> >(Is that because the HA YARN configuration uses, say,
>> >yarn.resourcemanager.address.rm15 instead of
>> >yarn.resourcemanager.address?)
>> >
>> >2. Is it possible to switch to a new RM automatically when one is
>> >down? We have two RMs, one active and one standby, but I can only put
>> >one RM address in yarn-default.xml. I am wondering if it is possible
>> >to detect the active RM automatically in Samza (or by some other
>> >method)?
>> >
>> >3. Has anyone had luck leveraging HA YARN?
>> >
>> >Thank you.
>> >
>> >Cheers,
>> >
>> >Fang, Yan
>> >[email protected]
>> >+1 (206) 849-4108
>> >
>> >
>> >On Fri, Feb 21, 2014 at 3:23 PM, Chris Riccomini
>> ><[email protected]> wrote:
>> >
>> >> Hey Ethan,
>> >>
>> >> YARN's HA support is marginal right now, and we're still
>> >> investigating this stuff. Some useful things to read are:
>> >>
>> >> * https://issues.apache.org/jira/browse/YARN-128
>> >> * https://issues.apache.org/jira/browse/YARN-149
>> >> * https://issues.apache.org/jira/browse/YARN-353
>> >> * https://issues.apache.org/jira/browse/YARN-556
>> >>
>> >> Also, CDH seems to be packaging some of the ZK-based HA stuff
>> >> already:
>> >>
>> >> https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-High-Availability-Guide/cdh5hag_cfg_RM_HA.html
>> >>
>> >> At LI, we're still experimenting with the best setup, so my guidance
>> >> might not be state of the art.
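(A note on the per-RM property style mentioned in the questions: under RM HA, each ResourceManager gets an id suffix on its address properties, and an older plain client that only reads yarn.resourcemanager.address falls back to its 0.0.0.0 default, which matches the behavior Yan describes. A hedged yarn-site.xml sketch, with made-up host names and the rm ids from the example above; exact property names vary slightly between CDH5 betas and Apache Hadoop, so check the Cloudera RM HA guide:)

```xml
<!-- Hypothetical yarn-site.xml fragment for ZK-based RM HA (CDH5-style).
     Host names, rm ids, and ZK quorum are examples only. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm15,rm16</value>
</property>
<property>
  <name>yarn.resourcemanager.address.rm15</name>
  <value>rm15.example.com:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.address.rm16</name>
  <value>rm16.example.com:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181</value>
</property>
```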
>> >> We currently configure the YARN RM's store
>> >> (yarn.resourcemanager.store.class) to use the file system store
>> >> (org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore).
>> >> Failover is a manual operation where we copy the RM state to a new
>> >> machine, and then start the RM on that machine. You then need to
>> >> front the RM with a VIP or DNS entry, which you can update to point
>> >> to the new RM machine when a failover occurs. The NMs need to be
>> >> configured to point to this VIP/DNS entry, so that when a failover
>> >> occurs, the NMs don't need to update their yarn-site.xml files.
>> >>
>> >> It sounds like in the future you won't need to use VIPs/DNS entries.
>> >> You should probably also email the YARN mailing list, just in case
>> >> we're misinformed or unaware of some new updates.
>> >>
>> >> Cheers,
>> >> Chris
>> >>
>> >> On 2/21/14 2:27 PM, "Ethan Setnik" <[email protected]> wrote:
>> >>
>> >> >I'm looking to deploy Samza on AWS infrastructure in an HA
>> >> >configuration. I have a clear picture of how to configure all the
>> >> >components such that they do not contain any single point of
>> >> >failure.
>> >> >
>> >> >I'm stuck, however, when it comes to the YARN architecture. It
>> >> >seems that YARN relies on the single-master / multi-slave pattern
>> >> >described in the YARN documentation. This introduces a single point
>> >> >of failure at the ResourceManager level, such that a failed
>> >> >ResourceManager will fail the entire YARN cluster. How does
>> >> >LinkedIn architect an HA configuration for Samza on YARN such that
>> >> >a complete instance failure of the ResourceManager provides
>> >> >failover for the YARN cluster?
>> >> >
>> >> >Thanks for your help.
>> >> >
>> >> >Best,
>> >> >Ethan
>> >> >
>> >> >--
>> >> >Ethan Setnik
>> >> >MobileAware
>> >> >
>> >> >m: +1 617 513 2052
>> >> >e: [email protected]
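(The manual-failover recovery setup Chris describes boils down to a couple of properties in the RM's yarn-site.xml. A hedged sketch; the HDFS URI is a made-up example, and recovery must be switched on for the store class to matter:)

```xml
<!-- Hypothetical fragment: persist RM state to a shared file system so a
     manually started replacement RM can recover it. URI is an example. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs://namenode.example.com:8020/yarn/rmstore</value>
</property>
```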
