[ 
https://issues.apache.org/jira/browse/AMBARI-9893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345046#comment-14345046
 ] 

Daniel Horak commented on AMBARI-9893:
--------------------------------------

This seems to be the root cause of the problems with (re)starting ambari-server
via the Ansible 'command' or 'shell' module (AMBARI-9215).


> Ambari services should be properly daemonized
> ---------------------------------------------
>
>                 Key: AMBARI-9893
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9893
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent, ambari-server
>    Affects Versions: 1.6.1
>         Environment: HDP 2.1 on RHEL 6
> ambari-server-1.6.1-98.noarch
> ambari-agent-1.6.1-98.x86_64
>            Reporter: Daniel Horak
>            Priority: Critical
>
> Ambari services (_ambari-server_ and _ambari-agent_) are not properly 
> daemonized.
> When a service starts as a daemon, it should _become a process group leader_ 
> ([among other 
> requirements|https://en.wikipedia.org/wiki/Daemon_%28computing%29]).
> h3. How to reproduce
> 1) Prepare a simple test shell script:
> {noformat}
> # cat test-ambari-server.sh 
>   #!/bin/bash -x
>   ambari-server restart
>   sleep 10
>   ambari-server restart
>   sleep 10
>   date 
> # chmod +x test-ambari-server.sh
> {noformat}
> This script should restart ambari-server twice (with a delay after each
> restart) and then print the date.
> 2) Run the test script.
> The script doesn't behave as expected: the second _ambari-server restart_ 
> kills the whole script! See:
> {noformat}
> # ./test-ambari-server.sh 
>   + ambari-server restart
>   Using python  /usr/bin/python2.6
>   Restarting ambari-server
>   Using python  /usr/bin/python2.6
>   Stopping ambari-server
>   Ambari Server stopped
>   Using python  /usr/bin/python2.6
>   Starting ambari-server
>   Ambari Server running with 'root' privileges.
>   Organizing resource files at /var/lib/ambari-server/resources...
>   Waiting for server start...
>   Server PID at: /var/run/ambari-server/ambari-server.pid
>   Server out at: /var/log/ambari-server/ambari-server.out
>   Server log at: /var/log/ambari-server/ambari-server.log
>   Ambari Server 'start' completed successfully.
>   + sleep 10
>   + ambari-server restart
>   Using python  /usr/bin/python2.6
>   Restarting ambari-server
>   Using python  /usr/bin/python2.6
>   Stopping ambari-server
>   Killed
> # echo $?
>   137
> {noformat}
> h3. Explanation
> After the first {{ambari-server restart}}, the _process group ID_ (_PGID_) of
> ambari-server is the same as the _PGID_ of the test shell script. In other 
> words, ambari-server belongs to the same process group as the test script
> because ambari-server has not become a _process group leader_.
> The second {{ambari-server restart}} then calls the {{stop()}} function from
> {{/usr/sbin/ambari-server.py}}, and this function kills all processes in the 
> same process group as ambari-server (code {{os.killpg(os.getpgid(pid), 
> signal.SIGKILL)}}, where {{pid}} is the PID of the running ambari-server
> process). That group includes the test script itself, which is why it exits
> with status 137 (128 + 9, i.e. terminated by SIGKILL).
> There is nothing wrong with this, provided the Ambari service daemon process
> creates a new process group for itself, which is not the case (and is the
> root cause of the bug).
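The process-group behaviour described above can be reproduced without Ambari.
What follows is only a minimal sketch in Python 3, not Ambari's actual startup
code. The helper name child_pgid is hypothetical. start_new_session is a
subprocess.Popen argument available since Python 3.2; on the Python 2.6 shown
in the log, passing preexec_fn=os.setsid would be the usual equivalent.

```python
import os
import subprocess
import sys


def child_pgid(new_session):
    """Start a short-lived child and return (child_pid, child_pgid).

    Hypothetical helper for illustration only.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(5)"],
        # True makes the child call setsid() before exec, so it becomes
        # the leader of a new session and a new process group.
        start_new_session=new_session,
    )
    try:
        return child.pid, os.getpgid(child.pid)
    finally:
        # Clean up the helper process; only this one child is signalled.
        child.kill()
        child.wait()


if __name__ == "__main__":
    print("launcher pgid:", os.getpgrp())
    print("inherited group (pid, pgid):", child_pgid(False))
    print("own group (pid, pgid):", child_pgid(True))
```

A child started with new_session=False shares the launcher's PGID, so
os.killpg(os.getpgid(pid), signal.SIGKILL) on that child would also hit the
launcher, which is the situation in the report. With new_session=True the
child's PGID equals its own PID.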
> h3. Deeper debugging
> You can check the PGIDs via the ps command: {{ps -e --forest -o pgrp,args}}.
> You can also add the following lines to the {{test-ambari-server.sh}} script
> after the first {{ambari-server restart}} command:
> {noformat}
> echo "shell pid: $$"
> ps -o pid,ppid,pgrp -p $(cat /var/run/ambari-server/ambari-server.pid)
> {noformat}
> When you run the {{test-ambari-server.sh}} script again, you will see that
> the ambari-server process belongs to the process group of the shell (the PGRP,
> aka PGID, of the shell is the same as its PID in this case):
> {noformat}
> + echo 'shell pid: 9368'
> shell pid: 9368
> ++ cat /var/run/ambari-server/ambari-server.pid
> + ps -o pid,ppid,pgrp -p 9415
>   PID  PPID  PGRP
>  9415     1  9368
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
