Gour/Steve,

The issue was due to improper kill command construction by DefaultContainerExecutor: the SIGTERM was never actually issued to the SliderAgent, so all the agents as well as the components continued to run.
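To illustrate what goes wrong (the actual change is described below), with 12345 standing in for a container's process-group id - the pid is negative because the whole process group gets signalled:

  kill -15 -12345       # the negative pgid can be mis-parsed as an option, so the signal is never sent
  kill -15 -- -12345    # '--' ends option parsing, so '-12345' is read as the process-group operand

The pgid value here is made up; only the shape of the command matters.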
I made one change in Shell.java (hadoop-common) to construct the kill command with the two hyphens, and now slider stop works properly :)

It was *kill -signalNo -<process_id>*, changed to *kill -signalNo -- -<process_id>*.

I have updated the same in the JIRA as well: https://issues.apache.org/jira/browse/YARN-3561

Thanks,
Chackra

On Thu, May 14, 2015 at 1:03 PM, Chackravarthy Esakkimuthu <chaku.mi...@gmail.com> wrote:

> Sure Gour, I would like to take up this task and contribute. Thanks for the pointers to proceed with; I will get in touch with you in case I need any more help.
>
> And w.r.t. kill -s TERM on the main.py processes (tried on both the parent and the child process independently), please find the results below. In none of the cases was the application killed.
>
> *1) Slider app created, and it is running (not stopped)*
>
> *1.1) kill the 'bash main.py' process*
>
> - it killed both the 'bash main.py' process and its child 'main.py' process
> - but the application process (nimbus) is still running
>
> SliderAgent.log:
>
> *INFO 2015-05-14 12:17:06,591 Controller.py:497 - Component states (result): Expected: 4 and Actual: 5*
> *ERROR 2015-05-14 12:17:06,596 Controller.py:289 - Got terminateAgent command*
> *INFO 2015-05-14 12:17:16,597 Controller.py:217 - Terminate agent command received from AM, stopping the agent ...*
> *INFO 2015-05-14 12:17:16,597 ProcessHelper.py:39 - Removing pid file*
> *WARNING 2015-05-14 12:17:16,598 ProcessHelper.py:44 - Unable to remove pid file: [Errno 2] No such file or directory: '/grid/4/hadoop/yarn/local/usercache/yarn/appcache/application_1431424102217_0003/container_1431424102217_0003_01_000002/infra/run/agent.pid'*
> *INFO 2015-05-14 12:17:16,598 ProcessHelper.py:46 - Removing temp files*
>
> *1.2) kill the 'child main.py' process*
>
> - it also killed both the 'bash main.py' process and its child 'main.py' process
> - but the application process (nimbus) is still running
>
> SliderAgent.log:
>
> *INFO 2015-05-14 12:25:37,990 main.py:56 - signal received, exiting.*
> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:39 - Removing pid file*
> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:46 - Removing temp files*
>
> *2) Slider app created, and it is stopped*
>
> *2.1) kill the 'bash main.py' process*
>
> - it killed only the 'bash main.py' process and not the child 'main.py' process
> - and the application process (nimbus) is still running
> - there were *no logs in SliderAgent*
> - and the container logs are completely cleared by the time this action is done
>
> *2.2) kill the 'child main.py' process*
>
> - it killed both the 'bash main.py' process and its child 'main.py' process
> - and the application process (nimbus) is still running
> - and the container logs are completely cleared by the time this action is done
>
> SliderAgent.log:
>
> *INFO 2015-05-14 12:48:25,589 main.py:56 - signal received, exiting.*
> *INFO 2015-05-14 12:48:25,589 ProcessHelper.py:39 - Removing pid file*
> *INFO 2015-05-14 12:48:25,590 ProcessHelper.py:46 - Removing temp files*
>
> On Thu, May 14, 2015 at 1:38 AM, Gour Saha <gs...@hortonworks.com> wrote:
>
>> Hi Chackra,
>>
>> You are absolutely right. The workaround that I was planning to work on should be implemented as a neat backup solution for when YARN fails to shut down containers (in this and certain other possible scenarios).
>>
>> In fact, we had filed a bug a long time back along the same lines, predicting this issue (for another scenario) - https://issues.apache.org/jira/browse/SLIDER-479
>>
>> As you had expressed interest in contributing to Slider, I was wondering if you would have some cycles and be willing to take this up. You can work on the develop branch and use SLIDER-479. The Slider develop branch is compatible with HDP 2.2, so we can easily test the fix in your cluster.
>>
>> Let me know, and I can help all along the way.
>>
>> In case you have some cycles, here are some pointers that might help you approach this problem -
>>
>> 1. Slider has a notion of sending a terminate command to the agent, which the agent obeys and gracefully brings itself down.
>> 2. In this scenario, since the Slider AM goes down, the agents can look for a node in Zookeeper (when they lose the connection with the AM) and shut themselves down if the node is missing (using the terminate code path or something more elegant).
>> 3. Of course, this Zookeeper node needs to be created by the Slider AM at the beginning of create cluster and then deleted just before the AM shuts down as part of the stop command (we might have to look into the YARN pre-emption scenario, but we can ignore that for now). We do not want to delete it in the AM failure/restart scenario.
>> 4. Any other better ideas or a more elegant solution you can think of.
>>
>> On a side note, can you test this on debian 7 -
>> Go to one of the nodes where any of the agents is running (say NIMBUS or any other component) and issue a SIGTERM to the main.py process (kill -s TERM <pid>). What do you see in slider-agent.log after that? What happens to all the processes in this container? Are they still running?
>>
>> The <pid> is that of the bash main.py process (not of the python main.py child process).
>>
>> So if the process is something like this -
>> yarn 6007 6003 0 19:43 ? 00:00:00 /bin/bash -c python ./infra/agent/slider-agent/agent/main.py --label container_1431413628146_0003_01_000002___NIMBUS --zk-quorum c6408.ambari.apache.org:2181 --zk-reg-path /registry/users/yarn/services/org-apache-slider/storm_1 > /hadoop/yarn/log/application_1431413628146_0003/container_1431413628146_0003_01_000002/slider-agent.out 2>&1
>>
>> you need to issue -
>> kill -s TERM 6007
>>
>> -Gour
>>
>> On 5/13/15, 1:38 AM, "Chackravarthy Esakkimuthu" <chaku.mi...@gmail.com> wrote:
>>
>> Thanks for your response Steve,
>>
>> I was thinking that the SliderAgent would receive a 'stop' command from the SliderAM to kill the components spawned by those agents. And yeah, this might be specific to the debian installation, as others in the group are not facing this issue.
>>
>> On Tue, May 12, 2015 at 1:50 PM, Steve Loughran <ste...@hortonworks.com> wrote:
>>
>> > On 12 May 2015, at 08:42, Chackravarthy Esakkimuthu <chaku.mi...@gmail.com> wrote:
>> >
>> > Starting a new thread; a JIRA has already been filed for this by Gour:
>> > https://issues.apache.org/jira/browse/YARN-3561
>> >
>> > Slider stop does not stop the components started by Slider; instead it stops only the SliderAM, and the SliderAgents did not even receive a 'stop' command. (It happens with debian 7.) Tested with 0.70.1 as well as the 'develop' branch code.
>> >
>> > Today I just came across the following mail archive:
>> > http://mail-archives.apache.org/mod_mbox/incubator-slider-dev/201503.mbox/%3c1426350060949.97...@hortonworks.com%3E
>> >
>> > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>> > *What is not implemented is an explicit call to the "stop function in the python scripts".
>> >
>> > What I was referring to is that an attempt is made by the Agent to call stop in the python script, but it is not guaranteed. The reason it is not guaranteed is that the call to stop() and the kill of the containers by YARN are not co-ordinated.
>> >
>> > In summary, the ability to call the stop() functions in the python script is not implemented. It's in the plan though.*
>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >
>> > Does this still exist?
>>
>> The idea of the stop() command is to actually offer a best-effort clean shutdown for containers. Currently the AM just directly tells YARN to destroy a container. The agent doesn't get told, nor does the application (that's implicit from the agent).
>>
>> YARN is expected to "kill" and then, if there is no response, "kill -9" the agent process. Which it does for the hosts we test on: Linux, OS X and Windows.
>>
>> If something is up with your YARN+debian installation, we believe that it is related to whether those container kill events are coming out from the node manager.
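>>
>> Illustratively, with 12345 standing in for a container's process-group id (this is not the literal command the node manager builds), the intended sequence is roughly:
>>
>>   kill -TERM -- -12345   # best-effort shutdown of the whole container process group
>>   # wait a short, configurable delay (yarn.nodemanager.sleep-delay-before-sigkill.ms)
>>   kill -KILL -- -12345   # forced cleanup of anything still running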