Hi Chris,

Thanks for replying; your explanation is clear and useful. Next
time we'll use SIGTERM and check for orphaned child processes.
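
For the orphan check, something along these lines is what we have in mind.
It is only a minimal sketch in Python, assuming a Linux `ps` that accepts
`-eo pid,ppid,args` and that leaked containers carry "container_" somewhere
in their command line. The function names are made up for illustration and
are not part of Samza or YARN.

```python
import subprocess


def find_orphaned_containers(ps_output, marker="container_"):
    """Return (pid, command) pairs for processes whose PPID is 1 and whose
    command line contains `marker`.

    `ps_output` is expected to be the text printed by `ps -eo pid,ppid,args`.
    """
    orphans = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, ppid, rest of the command line
        if len(parts) < 3:
            continue
        pid, ppid, args = parts
        # Skips the header row and any line that does not start with two numbers.
        if not (pid.isdigit() and ppid.isdigit()):
            continue
        if int(ppid) == 1 and marker in args:
            orphans.append((int(pid), args))
    return orphans


def list_orphaned_containers():
    """Run `ps` on the local host and filter its output (hypothetical helper)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_orphaned_containers(result.stdout)
```

Running list_orphaned_containers() on each node before bringing an NM back
up should give a list of PIDs to inspect; whether to kill them
automatically is a separate decision.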

I did some digging into the source code and found that
org.apache.samza.job.yarn.SamzaAppMaster exits when its heartbeat with the
RM fails. So if we SIGKILL the RM and all the NMs, the SamzaAppMaster will
exit on its own, leaving the Samza task containers running with no idea
what happened. Am I right?



2015-01-17 3:49 GMT+08:00 Chris Riccomini <[email protected]>:

> Hey Zhao,
>
> Yes, this is expected behavior. SIGKILL'ing NMs will result in all
> processes being leaked. This is really an issue with the way Linux handles
> orphaned child processes. It currently just changes the PPID to 1, and
> allows the process to continue executing. I did some brief exploration of
> this here:
>
>
> http://riccomini.name/posts/linux/2012-09-25-kill-subprocesses-linux-bash/
>
> At LinkedIn, we do several things:
>
> 1. Soft kill (SIGTERM) the NMs, to allow the NMs to properly shutdown all
> containers.
> 2. Before deploying an NM, we verify that there are no existing processes
> with "container_*" running with a PPID of 1.
>
> You could also verify that *all* container_* processes are dead after
> SIGKILL'ing the NM, if you want to make extra sure that you haven't leaked
> containers (which could lead to double-writing of messages in Samza).
>
> In practice, once we implemented (1), above, we haven't seen any leaked
> containers. In a case where an NM dies unexpectedly (e.g. the JVM
> segfaults, or something) you have to go and clean the leaked processes.
>
> Cheers,
> Chris
>
> On 1/16/15 5:20 AM, "Zhao Weinan" <[email protected]> wrote:
>
> >Hi,
> >
> >We are running some Samza tasks on Hadoop YARN 2.4.1. For some reason, we
> >restarted the whole cluster by SIGKILLing the RMs and NMs while the Samza
> >tasks were still running. We then found that the Samza tasks survived the
> >SIGKILL and the restart, which made it hard for us to locate the task
> >processes across the cluster. Is that expected?
> >
> >Thanks!
>
>
