Daniel Horak created AMBARI-9893:
------------------------------------

             Summary: Ambari services should be properly daemonized
                 Key: AMBARI-9893
                 URL: https://issues.apache.org/jira/browse/AMBARI-9893
             Project: Ambari
          Issue Type: Bug
          Components: ambari-agent, ambari-server
    Affects Versions: 1.6.1
         Environment: HDP 2.1 on RHEL 6
                      ambari-server-1.6.1-98.noarch
                      ambari-agent-1.6.1-98.x86_64
            Reporter: Daniel Horak
            Priority: Critical

Ambari services (_ambari-server_ and _ambari-agent_) are not properly daemonized. When a service starts as a daemon, it should _become a process group leader_ ([apart from other requirements|https://en.wikipedia.org/wiki/Daemon_%28computing%29]).

h3. How to reproduce

1) Prepare a simple test shell script:
{noformat}
# cat test-ambari-server.sh
#!/bin/bash -x
ambari-server restart
sleep 10
ambari-server restart
sleep 10
date
# chmod +x test-ambari-server.sh
{noformat}
This script should restart ambari-server twice (with some delay) and then print the date.

2) Run the test script. The script doesn't behave as expected: the second _ambari-server restart_ kills the whole script! See:
{noformat}
# ./test-ambari-server.sh
+ ambari-server restart
Using python /usr/bin/python2.6
Restarting ambari-server
Using python /usr/bin/python2.6
Stopping ambari-server
Ambari Server stopped
Using python /usr/bin/python2.6
Starting ambari-server
Ambari Server running with 'root' privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Waiting for server start...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Ambari Server 'start' completed successfully.
+ sleep 10
+ ambari-server restart
Using python /usr/bin/python2.6
Restarting ambari-server
Using python /usr/bin/python2.6
Stopping ambari-server
Killed
# echo $?
137
{noformat}

h3. Explanation

After the first {{ambari-server restart}}, the _process group ID_ (_PGID_) of ambari-server is the same as the _PGID_ of the test shell script. In other words, ambari-server belongs to the same process group as the test script, because ambari-server has not become the _process group leader_.

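The inheritance described above can be observed outside Ambari with a small sketch. It is only an illustration, not Ambari code: {{child_pgid}} is a hypothetical helper, and it assumes Python 3 on a POSIX system ({{start_new_session}} is not available in the Python 2.6 shown in the log).

```python
import os
import subprocess
import sys


def child_pgid(new_session):
    """Spawn a child Python process and return the PGID it reports.

    Hypothetical helper for illustration; not part of Ambari.
    """
    out = subprocess.check_output(
        [sys.executable, "-c", "import os; print(os.getpgid(0))"],
        # When True, the child calls setsid() before exec and so
        # becomes the leader of a brand-new process group.
        start_new_session=new_session,
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    print("parent pgid:      ", os.getpgid(0))
    print("plain child pgid: ", child_pgid(False))  # inherited from parent
    print("setsid child pgid:", child_pgid(True))   # its own group
```

A plain child reports the parent's PGID, which is the situation ambari-server ends up in relative to the test script; a child that calls {{setsid()}} reports a different one.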
The second {{ambari-server restart}} then calls the {{stop()}} function from {{/usr/sbin/ambari-server.py}}, and this function kills all processes in the same process group as ambari-server (code {{os.killpg(os.getpgid(pid), signal.SIGKILL)}}, where {{pid}} is the PID of the running ambari-server process). There is nothing wrong with this as long as the Ambari daemon process creates a new process group for itself. It does not, and that is the root cause of the bug.

h3. Deeper debugging

You can check the PGIDs via the ps command: {{ps -e --forest -o pgrp,args}}.

You can also add the following lines to the {{test-ambari-server.sh}} script after the first {{ambari-server restart}} command:
{noformat}
echo "shell pid: $$"
ps -o pid,ppid,pgrp -p $(cat /var/run/ambari-server/ambari-server.pid)
{noformat}
When you run the {{test-ambari-server.sh}} script again, you will see that the ambari-server process belongs to the process group of the shell (the PGRP, aka PGID, of the shell is the same as its PID in this case):
{noformat}
+ echo 'shell pid: 9368'
shell pid: 9368
++ cat /var/run/ambari-server/ambari-server.pid
+ ps -o pid,ppid,pgrp -p 9415
  PID  PPID  PGRP
 9415     1  9368
{noformat}
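As a rough sketch of why a dedicated process group makes the quoted {{os.killpg}} call safe, the following starts a stand-in process in its own session and then kills its group. The names {{start_detached}} and {{stop_group}} are hypothetical, the sleeping child merely stands in for the server, and Python 3 on POSIX is assumed; on Python 2.6, passing {{preexec_fn=os.setsid}} to {{subprocess.Popen}} would be the approximate equivalent. This is one possible approach, not necessarily the fix Ambari should adopt.

```python
import os
import signal
import subprocess
import sys


def start_detached(argv):
    """Start argv in a new session so it leads its own process group."""
    return subprocess.Popen(argv, start_new_session=True)


def stop_group(pid):
    """Same call as quoted from stop(): SIGKILL the whole group of pid."""
    os.killpg(os.getpgid(pid), signal.SIGKILL)


if __name__ == "__main__":
    # Stand-in for the daemon (hypothetical; not the real ambari-server).
    proc = start_detached([sys.executable, "-c", "import time; time.sleep(60)"])
    print("caller pgid:", os.getpgid(0), "child pgid:", os.getpgid(proc.pid))
    stop_group(proc.pid)
    proc.wait()
    # Reaching this line means the caller was not in the killed group.
    print("caller survived; child return code:", proc.returncode)
```

Because the child is the leader of its own group, the group-wide SIGKILL reaches only the child (and anything it spawned), and the calling script keeps running.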