[ https://issues.apache.org/jira/browse/AMBARI-9893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345046#comment-14345046 ]
Daniel Horak commented on AMBARI-9893:
--------------------------------------

This seems to be the root cause of the problems with (re)starting ambari-server via the Ansible 'command' or 'shell' module (AMBARI-9215).

> Ambari services should be properly daemonized
> ---------------------------------------------
>
>                 Key: AMBARI-9893
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9893
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent, ambari-server
>    Affects Versions: 1.6.1
>         Environment: HDP 2.1 on RHEL 6
>                      ambari-server-1.6.1-98.noarch
>                      ambari-agent-1.6.1-98.x86_64
>            Reporter: Daniel Horak
>            Priority: Critical
>
> Ambari services (_ambari-server_ and _ambari-agent_) are not properly
> daemonized.
> When a service starts as a daemon, it should _become a process group leader_
> ([apart from other requirements|https://en.wikipedia.org/wiki/Daemon_%28computing%29]).
>
> h3. How to reproduce
> 1) Prepare a simple test shell script:
> {noformat}
> # cat test-ambari-server.sh
> #!/bin/bash -x
> ambari-server restart
> sleep 10
> ambari-server restart
> sleep 10
> date
> # chmod +x test-ambari-server.sh
> {noformat}
> This script should restart ambari-server twice (with some delay) and then
> print the date.
>
> 2) Run the test script.
> The script doesn't behave as expected: the second _ambari-server restart_
> kills the whole script! See:
> {noformat}
> # ./test-ambari-server.sh
> + ambari-server restart
> Using python /usr/bin/python2.6
> Restarting ambari-server
> Using python /usr/bin/python2.6
> Stopping ambari-server
> Ambari Server stopped
> Using python /usr/bin/python2.6
> Starting ambari-server
> Ambari Server running with 'root' privileges.
> Organizing resource files at /var/lib/ambari-server/resources...
> Waiting for server start...
> Server PID at: /var/run/ambari-server/ambari-server.pid
> Server out at: /var/log/ambari-server/ambari-server.out
> Server log at: /var/log/ambari-server/ambari-server.log
> Ambari Server 'start' completed successfully.
> + sleep 10
> + ambari-server restart
> Using python /usr/bin/python2.6
> Restarting ambari-server
> Using python /usr/bin/python2.6
> Stopping ambari-server
> Killed
> # echo $?
> 137
> {noformat}
>
> h3. Explanation
> After the first {{ambari-server restart}}, the _process group ID_ (_PGID_) of
> ambari-server is the same as the _PGID_ of the test shell script. In other
> words, ambari-server belongs to the same process group as the test script,
> because ambari-server has not become a _process group leader_.
> The second {{ambari-server restart}} then calls the {{stop()}} function from
> {{/usr/sbin/ambari-server.py}}, and this function kills all processes in the
> same process group as ambari-server (code
> {{os.killpg(os.getpgid(pid), signal.SIGKILL)}}, where {{pid}} is the PID of
> the running ambari-server process).
> There is nothing wrong with this, provided the ambari service daemon process
> creates a new process group for itself, which is not the case (and is the
> root cause of the bug).
>
> h3. Deeper debugging
> You can check the PGIDs via the ps command: {{ps -e --forest -o pgrp,args}}.
> You can also add the following lines to the {{test-ambari-server.sh}} script
> after the first {{ambari-server restart}} command:
> {noformat}
> echo "shell pid: $$"
> ps -o pid,ppid,pgrp -p $(cat /var/run/ambari-server/ambari-server.pid)
> {noformat}
> When you run the {{test-ambari-server.sh}} script again, you will see that
> the ambari-server process belongs to the process group of the shell (the
> PGRP, aka PGID, of the shell is the same as its PID in this case):
> {noformat}
> + echo 'shell pid: 9368'
> shell pid: 9368
> ++ cat /var/run/ambari-server/ambari-server.pid
> + ps -o pid,ppid,pgrp -p 9415
>   PID  PPID  PGRP
>  9415     1  9368
> {noformat}
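
To illustrate the process-group behaviour described in the Explanation, here is a minimal Python sketch. It is not Ambari code, and `child_pgid_info` is a hypothetical helper made up for this example. It forks a child twice: once as-is (the child inherits the caller's PGID, as ambari-server does in the report) and once with `os.setsid()`, which is one common way for a daemon to become the leader of a new process group. Whether the actual Ambari fix uses `os.setsid()` or something else is not stated in this issue.

```python
import os


def child_pgid_info(new_session):
    """Fork a child and return (child_pid, child_pgid, parent_pgid).

    Hypothetical helper for illustration only. If new_session is True, the
    child calls os.setsid() first, which makes it a session leader and the
    leader of a new process group (PGID == its own PID).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: optionally detach, report its own PGID, then exit.
        try:
            os.close(read_fd)
            if new_session:
                os.setsid()
            os.write(write_fd, str(os.getpgid(0)).encode())
            os.close(write_fd)
        finally:
            os._exit(0)
    # Parent: read the PGID reported by the child, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_pgid = int(pipe.read())
    os.waitpid(pid, 0)
    return pid, child_pgid, os.getpgid(0)


if __name__ == "__main__":
    for new_session in (False, True):
        pid, child_pgid, parent_pgid = child_pgid_info(new_session)
        print("setsid=%s child pid=%d child pgid=%d parent pgid=%d"
              % (new_session, pid, child_pgid, parent_pgid))
```

Without `os.setsid()`, the child's PGID equals the parent's, so `os.killpg(os.getpgid(pid), signal.SIGKILL)` on that child would also hit the parent, which is the failure shown in the reproduction. With `os.setsid()`, the child's PGID equals its own PID, and the same `killpg` call would only affect the child's own group.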