Hey Zhao,

Yes, I believe you're correct. I've never explicitly tested this failure case, but what you describe is what should happen.

Cheers,
Chris

On 1/16/15 11:27 PM, "Zhao Weinan" <[email protected]> wrote:

>Hi Chris,
>
>Thanks for replying, your explanation is pretty clear and useful. Next
>time we'll use SIGTERM and check for orphaned child processes.
>
>I did some digging into the source code and found that
>org.apache.samza.job.yarn.SamzaAppMaster will exit on its heartbeat error
>with the RM. So if we SIGKILL the RM and all the NMs, the SamzaAppMaster
>will exit by itself, leaving the Samza task containers unaware of what
>happened. Am I right?
>
>2015-01-17 3:49 GMT+08:00 Chris Riccomini <[email protected]>:
>
>> Hey Zhao,
>>
>> Yes, this is expected behavior. SIGKILL'ing NMs will result in all
>> processes being leaked. This is really an issue with the way Linux
>> handles orphaned child processes. It currently just changes the PPID
>> to 1, and allows the process to continue executing. I did some brief
>> exploration of this here:
>>
>> http://riccomini.name/posts/linux/2012-09-25-kill-subprocesses-linux-bash/
>>
>> At LinkedIn, we do several things:
>>
>> 1. Soft kill (SIGTERM) the NMs, to allow the NMs to properly shut down
>> all containers.
>> 2. Before deploying an NM, we verify that there are no existing
>> processes with "container_*" running with a PPID of 1.
>>
>> You could also verify that *all* container_* processes are dead after
>> SIGKILL'ing the NM, if you want to make extra sure that you haven't
>> leaked containers (which could lead to double-writing messages in
>> Samza).
>>
>> In practice, once we implemented (1), above, we haven't seen any leaked
>> containers. In a case where an NM dies unexpectedly (e.g. the JVM
>> segfaults, or something) you have to go and clean up the leaked
>> processes.
>>
>> Cheers,
>> Chris
>>
>> On 1/16/15 5:20 AM, "Zhao Weinan" <[email protected]> wrote:
>>
>> >Hi,
>> >
>> >We are running some Samza tasks on Hadoop YARN 2.4.1. For some reason,
>> >we restarted the whole cluster by SIGKILL'ing the RMs and NMs, with the
>> >Samza tasks left running. Then we found that the Samza tasks survived
>> >the SIGKILL and restart, which made it hard for us to locate the task
>> >processes across the cluster. Is that expected?
>> >
>> >Thanks!
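
For anyone who wants to automate the pre-deploy check mentioned in the thread (no "container_*" processes running with a PPID of 1), here is a rough sketch in Python. It is not the script used at LinkedIn, which the thread does not show. The function names (`find_orphaned_containers`, `list_processes`) are made up for illustration, and it assumes a Linux host where `ps -eo pid,ppid,args` is available and where leaked containers have "container_" somewhere in their command line.

```python
import subprocess


def find_orphaned_containers(ps_output, marker="container_"):
    """Return (pid, command) pairs for processes whose parent is PID 1
    and whose command line contains `marker`.

    `ps_output` is expected to be the text printed by
    `ps -eo pid,ppid,args` (header line included or not).
    """
    orphans = []
    for line in ps_output.splitlines():
        # Split into at most three fields: pid, ppid, full command line.
        fields = line.strip().split(None, 2)
        if len(fields) < 3:
            continue
        pid, ppid, args = fields
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # skips the "PID PPID COMMAND" header
        if int(ppid) == 1 and marker in args:
            orphans.append((int(pid), args))
    return orphans


def list_processes():
    """Hypothetical helper: capture the process table from ps."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for pid, args in find_orphaned_containers(list_processes()):
        print(f"possible leaked container: pid={pid} cmd={args}")
```

Running this on each node before starting an NM, and refusing to deploy if it prints anything, would approximate step (2). Dropping the PPID condition would approximate the stricter "all container_* processes are dead" check.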
