Hey Zhao,

Yes, I believe you're correct. I've never explicitly tested this failure
case, but what you describe is what should happen.

Cheers,
Chris

On 1/16/15 11:27 PM, "Zhao Weinan" <[email protected]> wrote:

>Hi Chris,
>
>Thanks for replying, your explanation is pretty clear and useful. Next
>time we'll use SIGTERM and check for orphaned child processes.
>
>I did some digging into the source code and found that
>org.apache.samza.job.yarn.SamzaAppMaster will exit on its heartbeat error
>with the RM. So if we SIGKILL the RM and all the NMs, the SamzaAppMaster
>will exit by itself, leaving the Samza task containers unaware of what
>happened. Am I right?
>
>
>
>2015-01-17 3:49 GMT+08:00 Chris Riccomini
><[email protected]>:
>
>> Hey Zhao,
>>
>> Yes, this is expected behavior. SIGKILL'ing NMs will result in all
>> processes being leaked. This is really an issue with the way Linux
>> handles orphaned child processes. It currently just changes the PPID
>> to 1, and allows the process to continue executing. I did some brief
>> exploration of this here:
>>
>> http://riccomini.name/posts/linux/2012-09-25-kill-subprocesses-linux-bash/
>>
>> At LinkedIn, we do several things:
>>
>> 1. Soft kill (SIGTERM) the NMs, to allow the NMs to properly shut down
>> all containers.
>> 2. Before deploying an NM, we verify that there are no existing
>> processes with "container_*" running with a PPID of 1.
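
A rough sketch of the check in (2), for anyone who wants to script it. This
is not the tooling used at LinkedIn; it assumes Linux with /proc mounted and
that leaked containers carry a YARN container ID ("container_" followed by
digits) in their command line. The names read_processes and
find_orphaned_containers are made up for this example.

```python
import os
import re

# Assumption: the YARN container ID appears somewhere in the process
# command line, e.g. "container_1421400000000_0001_01_000002".
CONTAINER_RE = re.compile(r"container_\d+")


def read_processes(proc_root="/proc"):
    """Yield (pid, ppid, cmdline) for each process under proc_root (Linux only)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"), "rb") as f:
                stat = f.read().decode(errors="replace")
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        # The command name in "stat" is wrapped in parentheses and may itself
        # contain spaces, so split after the last ")". The remaining fields
        # start with: state, ppid, ...
        fields = stat.rsplit(")", 1)[1].split()
        yield int(entry), int(fields[1]), cmdline


def find_orphaned_containers(processes):
    """Return (pid, cmdline) for container processes reparented to PID 1."""
    return [
        (pid, cmdline)
        for pid, ppid, cmdline in processes
        if ppid == 1 and CONTAINER_RE.search(cmdline)
    ]
```

Calling find_orphaned_containers(read_processes()) on a node before
deploying an NM returns an empty list when nothing has leaked. On hosts where
a child subreaper is in use, orphans may be reparented to that process
instead of PID 1, so the ppid == 1 test would need adjusting there.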
>>
>> You could also verify that *all* container_* processes are dead after
>> SIGKILL'ing the NM, if you want to make extra sure that you haven't
>> leaked containers (which could lead to double-writing of messages in
>> Samza).
>>
>> In practice, since we implemented (1) above, we haven't seen any leaked
>> containers. In a case where an NM dies unexpectedly (e.g. the JVM
>> segfaults, or something) you have to go and clean up the leaked
>> processes.
>>
>> Cheers,
>> Chris
>>
>> On 1/16/15 5:20 AM, "Zhao Weinan" <[email protected]> wrote:
>>
>> >Hi,
>> >
>> >We are running some Samza tasks on Hadoop YARN 2.4.1. For some reason,
>> >we restarted the whole cluster by SIGKILL'ing the RMs and NMs while
>> >Samza tasks were still running. We then found that the Samza tasks
>> >survived the SIGKILL and the restart, which made it hard for us to
>> >locate the task processes across the cluster. Is that expected?
>> >
>> >Thanks!
>>
>>
