Hi Su Yi,

I think there may be a misunderstanding. For failure detection: if the containers die (because of an NM failure or any other reason), the AM will bring up new containers on the same NM or a different NM, depending on resource availability. It does not take as long as 10 minutes to recover. One way to test this is to run a Samza job and manually kill the NM or the container process to see how quickly it recovers. As for how yarn.nm.liveness-monitor.expiry-interval-ms plays a role here, I am not very sure. Hopefully a YARN expert in the community can explain it a little.
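For reference, here is a sketch of the relevant yarn-site.xml entries with what I believe are their stock defaults (from yarn-default.xml; please double-check against your Hadoop version). The 600000 ms expiry is where the "10 minutes to confirm NM failure" figure comes from:

  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <!-- 600000 ms = 10 min: how long the RM waits before declaring an NM dead -->
    <value>600000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
    <!-- how often the RM checks NM liveness -->
    <value>1000</value>
  </property>

Note that this NM-level expiry is separate from container-level failure detection (a live NM reporting a dead container), which is much faster.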
The goal of the standby containers in SAMZA-406 is to recover quickly when a task has a lot of local state, so that reading the changelog does not take a long time; it is not meant to reduce the time spent *allocating* the container, which, I believe, is taken care of by YARN. Hope this helps a little. Thanks.

Cheers,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Thu, Jan 1, 2015 at 4:20 AM, Su Yi <[email protected]> wrote:
> Hi Timothy,
>
> There are 4 nodes in total: a, b, c, d
> Resource manager: a
> Node managers: a, b, c, d
> Kafka and ZooKeeper running on: a
>
> The YARN configuration is:
>
> <property>
>   <description>How long to wait until a node manager is considered
>   dead.</description>
>   <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
>   <value>1000</value>
> </property>
>
> <property>
>   <description>How often to check that node managers are still
>   alive.</description>
>   <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
>   <value>100</value>
> </property>
>
> From the Samza web UI, I found that node 'a' appeared and disappeared again
> and again in the node list.
>
> Su Yi
>
> On 2015-01-01 02:54:48, "Timothy Chen" <[email protected]> wrote:
>
> >Hi Su Yi,
> >
> >Can you elaborate a bit more on what you mean by an unstable cluster when
> >you configured the heartbeat interval to be 1s?
> >
> >Tim
> >
> >On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <[email protected]> wrote:
> >> Hello,
> >>
> >> Here are some thoughts about HA of Samza.
> >>
> >> 1. Failure detection
> >>
> >> The problem is that container failure detection in Samza depends completely on
> >> YARN. YARN counts on the Node Manager to report container failures; however,
> >> the Node Manager could fail too (for example, if the machine fails, the NM
> >> fails with it). Node Manager failures can be detected by the Resource Manager
> >> through heartbeats, but by default it takes 10 minutes to confirm a Node
> >> Manager failure. I think that's OK for batch processing, but not for stream
> >> processing.
> >>
> >> Configuring the YARN failure-confirmation interval to 1s results in an
> >> unstable YARN cluster (4 nodes in total). With 2s, everything works fine,
> >> but it takes 10s~20s to get a lost container (machine shut down) back.
> >> Considering that the test stream task is very simple (stateless), the
> >> recovery time is relatively long.
> >>
> >> I am not an expert on YARN, and I don't know why it takes such a long time
> >> by default to confirm node failure. To my understanding, YARN tries to be
> >> general-purpose, and that is not sufficient for a stream processing
> >> framework. Extra effort beyond YARN is needed for failure detection in
> >> stream processing.
> >>
> >> 2. Task redeployment
> >>
> >> After the Resource Manager informs Samza of a container failure, Samza
> >> must request resources from YARN to redeploy the failed tasks, which
> >> consumes time during recovery. And recovery time is critical for HA in
> >> stream processing. I think maintaining a few standby containers may
> >> eliminate this overhead in recovery time: Samza could deploy failed tasks
> >> on the standby containers rather than requesting new ones from YARN.
> >>
> >> Hot standby containers, described in SAMZA-406 (
> >> https://issues.apache.org/jira/browse/SAMZA-406), may help save recovery
> >> time; however, they are costly (they double the resources needed).
> >>
> >> I'm wondering what these ideas mean to you, and how feasible they are.
> >> By the way, I'm using Samza 0.7.
> >>
> >> Thank you for reading.
> >>
> >> Happy New Year! ;-)
> >>
> >> Su Yi
