[ https://issues.apache.org/jira/browse/AMBARI-13007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741747#comment-14741747 ]
Hudson commented on AMBARI-13007:
---------------------------------

SUCCESS: Integrated in Ambari-trunk-Commit #3430 (See [https://builds.apache.org/job/Ambari-trunk-Commit/3430/])
AMBARI-13007: Stopping ambari-server may kill ambari-agent running on the same machine in some cases (Nahappan Somasundaram via jluniya) (jluniya: http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=02fcea6c3c38f2fff4e73bdc3d235cf87477d4e2)
* ambari-server/src/main/python/ambari_server_main.py


> Stopping ambari-server may kill ambari-agent running on the same machine in some cases
> --------------------------------------------------------------------------------------
>
>                 Key: AMBARI-13007
>                 URL: https://issues.apache.org/jira/browse/AMBARI-13007
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>   Affects Versions: 2.2.0
>            Reporter: Nahappan Somasundaram
>            Assignee: Nahappan Somasundaram
>             Fix For: 2.2.0
>
>
> Launch multinode Ambari clusters using a simple Python script. It logs in to every node via ssh and runs a shell script:
> {code}
> #!/usr/bin/env bash
> while [[ $# > 0 ]]
> do
>   key="$1"
>   case ${key} in
>     --server)
>       ASERVER="$2" # Server hostname
>       shift # past argument
>       ;;
>     --noserver)
>       NOSERVER="NOSERVER" # Don't install/start server
>       ;;
>     *)
>       echo unknown option
>       exit 1
>       ;;
>   esac
>   shift # past argument or value
> done
> yum clean all
> curl http://s3.amazonaws.com/dev.hortonworks.com/ambari/centos6/2.x/latest/trunk/ambaribn.repo > /etc/yum.repos.d/ambari.repo
> # server
> if [ "${ASERVER}" = $(hostname -f) ] && [ -z "${NOSERVER}" ] ; then
>   yum install sudo postgresql-server wget -y
>   rpm -i /tmp/rpms/ambari-server*.rpm
>   # Disable iptables
>   iptables -F
>   ambari-server setup -s
>   # Enable remote debug
>   sed -rie 's/-server -XX:NewRatio/-server -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -XX:NewRatio/g' /usr/sbin/ambari_server_main.py
>   ## Sleep until debugger connects
>   # sed -rie 's/dt_socket,server=y,suspend=.,address=5005/dt_socket,server=y,suspend=y,address=5005/g' /usr/sbin/ambari-server.py
>   # Fix an issue with UI client version
>   gunzip /usr/lib/ambari-server/web/javascripts/app.js.gz
>   amb=$(ambari-server --version); sed -i "s/App\.version = '';/App.version = '$amb';/" /usr/lib/ambari-server/web/javascripts/app.js
>   gzip /usr/lib/ambari-server/web/javascripts/app.js
>   # Increase task timeout
>   sed -ri 's/agent.package.install.task.timeout=1800/agent.package.install.task.timeout=3600/g' /etc/ambari-server/conf/ambari.properties
>   find /var/lib/ambari-server/resources/ -name metainfo.xml | xargs -L 1 sed -ri 's/<timeout>[[:digit:]]+[[:digit:]]*<\//<timeout>1800<\//g'
>   # Start the server
>   ambari-server start -v || exit 1
> fi
> # Agent
> iptables -F
> yum clean all
> yum install -y wget
> rpm -i /tmp/rpms/ambari-agent*.rpm
> # Replace server hostname
> sed -rie "s/hostname=localhost/hostname=$ASERVER/g" /etc/ambari-agent/conf/ambari-agent.ini
> # Enable debug mode at agent
> # sed -rie 's/=INFO/=DEBUG/g' /etc/ambari-agent/conf/ambari-agent.ini
> ambari-agent start || exit 1
> {code}
> When I restart ambari-server, the agent running on the same node is killed every time. This happens because the agent is launched in the same process group as ambari-server, and ambari-server kills everything that belongs to its process group. I assume this situation is common when ambari-server and ambari-agent are launched from the same shell script via ssh, and perhaps also via configuration management tools such as Puppet or Chef (I did not check this assumption).
> *More info:*
> {code}
> [root@dlysnichenko-ru3-1 ~]# ps -ejH
>   PID  PGID   SID TTY          TIME CMD
>  1584  1584  1584 ?        00:00:00 sshd
>  2659  2659  2659 ?        00:00:00 sshd
>  2662  2662  2662 pts/0    00:00:00 bash
>  3268  3268  2662 pts/0    00:00:00 ps
>  2056  2041  2041 ?        00:00:00 postmaster
>  2058  2058  2058 ?        00:00:00 postmaster
>  2060  2060  2060 ?        00:00:00 postmaster
>  2061  2061  2061 ?        00:00:00 postmaster
>  2062  2062  2062 ?        00:00:00 postmaster
>  2063  2063  2063 ?        00:00:00 postmaster
>  2380  2380  2380 ?        00:00:00 postmaster
>  2397  2397  2397 ?        00:00:00 postmaster
>  2649  2649  2649 ?        00:00:01 postmaster
>  2654  2654  2654 ?        00:00:00 postmaster
>  2655  2655  2655 ?        00:00:00 postmaster
>  2656  2656  2656 ?        00:00:00 postmaster
>  2360  1644  1644 ?        00:00:59 java
>  2507  1644  1644 ?        00:00:00 python2.6
>  2515  1644  1644 ?        00:00:01 python2.6
>  3230  3230  3230 ?        00:00:00 anacron
> [root@dlysnichenko-ru3-1 ~]# ambari-agent status
> Found ambari-agent PID: 2515
> ambari-agent running.
> Agent PID at: /var/run/ambari-agent/ambari-agent.pid
> Agent out at: /var/log/ambari-agent/ambari-agent.out
> Agent log at: /var/log/ambari-agent/ambari-agent.log
> [root@dlysnichenko-ru3-1 ~]# ambari-server stop
> Using python /usr/bin/python2.6
> Stopping ambari-server
> Ambari Server stopped
> [root@dlysnichenko-ru3-1 ~]# ambari-agent status
> Found ambari-agent PID: 2515
> ambari-agent not running. Stale PID File at: /var/run/ambari-agent/ambari-agent.pid
> [root@dlysnichenko-ru3-1 ~]#
> {code}
> Note: the agent (PID 2515) and the server (java, PID 2360) share process group 1644. We should either not kill the whole process group when stopping ambari-server, or create a dedicated process group when launching it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
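
The second option in the note above (a dedicated process group at launch) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the change that was committed to ambari_server_main.py: the helper names `launch_in_own_group` and `stop_group` are hypothetical, two sleeping Python interpreters stand in for the server and the agent, and Python 3 on a POSIX system is assumed.

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group.

    Hypothetical helper: start_new_session=True makes the child call
    setsid() before exec, so its PGID equals its own PID rather than
    the PGID of the shell or ssh session that launched it.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def stop_group(proc, timeout=5):
    """Send SIGTERM to the child's own process group, then reap the child."""
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    return proc.wait(timeout=timeout)


def demo():
    # Stand-ins for ambari-server and ambari-agent (assumption: any
    # long-running child behaves the same for this purpose).
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    server = launch_in_own_group(sleeper)
    agent = launch_in_own_group(sleeper)

    print("launcher PGID:", os.getpgid(0))
    print("server PID/PGID:", server.pid, os.getpgid(server.pid))
    print("agent  PID/PGID:", agent.pid, os.getpgid(agent.pid))

    # Killing the server's group no longer reaches the agent.
    stop_group(server)
    print("agent still running:", agent.poll() is None)

    stop_group(agent)  # clean up the second stand-in


if __name__ == "__main__":
    demo()
```

With each process in its own group, a group-wide kill issued for the server cannot reach the agent, even when both were started from the same ssh session. The other option from the note, signalling only the server PID instead of its whole group, would avoid the problem without changing how the server is launched.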