Re: Samza job killed by left orphaned on YARN

2016-05-19 Thread Yi Pan
Hi, David and all, The "ultimate" solution is probably to implement SAMZA-871, which allows the Samza JobCoordinator to directly identify whether a container is alive or not w/o depending on the cluster management systems. This is also considered toget

Re: Samza job killed by left orphaned on YARN

2016-05-19 Thread David Yu
Just stumbled upon this post and it seems to be the same issue: https://issues.apache.org/jira/browse/SAMZA-498 We followed the fix to create a wrapper kill script and everything works. Do we have a plan to fix this in the next version of Samza? Thanks, David On Wed, May 18, 2016 at 11:53 AM, Jac

Re: Samza job killed by left orphaned on YARN

2016-05-18 Thread Jacob Maes
Hmm, could there be something in your job holding up the container shutdown process? Perhaps something ignoring SIGTERM/Thread.interrupt, by chance? Also, I think there's a YARN property specifying the amount of time the NM waits between sending a SIGTERM and a SIGKILL, though I can't find it at t
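To illustrate the point about code ignoring SIGTERM/Thread.interrupt: below is a minimal Java sketch, not taken from Samza or from David's job, of a worker thread that honors interruption so a JVM shutdown hook (which the JVM runs on SIGTERM) can stop it promptly. The class and method names (InterruptAwareWorker, startWorker, stopWorker) are hypothetical and exist only for this example. A loop that instead catches InterruptedException and keeps going would hold the process open until a SIGKILL arrives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical example class; not part of Samza or YARN.
class InterruptAwareWorker {

    /** Starts a worker that loops until interrupted, then counts down the latch. */
    public static Thread startWorker(CountDownLatch stopped) {
        Thread worker = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Thread.sleep(50); // stand-in for real processing work
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag and exit instead of swallowing it.
                Thread.currentThread().interrupt();
            } finally {
                stopped.countDown(); // signal that the worker has exited
            }
        }, "worker");
        worker.start();
        return worker;
    }

    /** Interrupts the worker and waits up to timeoutMs for it to exit. */
    public static boolean stopWorker(Thread worker, CountDownLatch stopped, long timeoutMs)
            throws InterruptedException {
        worker.interrupt();
        return stopped.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch stopped = new CountDownLatch(1);
        Thread worker = startWorker(stopped);

        // The JVM runs shutdown hooks when it receives SIGTERM.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                stopWorker(worker, stopped, 5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        worker.join(); // blocks until the worker is interrupted and exits
    }
}
```

Running main and sending the process a SIGTERM (for example with kill and the PID) should let it exit on its own; if a job's threads do not behave this way, that is one place to look for a stalled container shutdown.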

Re: Samza job killed by left orphaned on YARN

2016-05-18 Thread David Yu
From the NM log, I'm seeing: 2016-05-18 06:29:06,248 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e01_1463512986427_0007_01_02 2016-05-18 06:29:06,265 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ap

Re: Samza job killed by left orphaned on YARN

2016-05-18 Thread David Yu
Jacob, I have checked and made sure that NM is running on the node: $ ps aux | grep java ... yarn 25623 0.5 0.8 2366536 275488 ? Sl May17 7:04 /usr/java/jdk1.8.0_51/bin/java -Dproc_nodemanager ... org.apache.hadoop.yarn.server.nodemanager.NodeManager Thanks, David On Wed, May

Re: Samza job killed by left orphaned on YARN

2016-05-18 Thread Jacob Maes
Hey David, The only time I've seen orphaned containers is when the NM dies. If the NM isn't running, the RM has no means to kill the containers on a node. Can you verify that the NM was healthy at the time of the shutdown? If it wasn't healthy and/or it was restarted, one option that may help is

Samza job killed by left orphaned on YARN

2016-05-17 Thread David Yu
Samza version = 0.10.0 YARN version = Hadoop 2.6.0-cdh5.4.9 We are experiencing issues when killing a Samza job: $ yarn application -kill application_1463512986427_0007 Killing application application_1463512986427_0007 16/05/18 06:29:05 INFO impl.YarnClientImpl: Killed application application_14