[ https://issues.apache.org/jira/browse/AMBARI-19435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yusaku Sako updated AMBARI-19435:
---------------------------------
    Reporter: Vivek Sharma  (was: Jonathan Hurley)

> NodeManager restart fails during HOU if it is on same host as RM
> ----------------------------------------------------------------
>
>                 Key: AMBARI-19435
>                 URL: https://issues.apache.org/jira/browse/AMBARI-19435
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.5.0
>            Reporter: Vivek Sharma
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.5.0
>
>         Attachments: AMBARI-19435.patch
>
>
> *Steps*
> # Deploy an HDP-2.5.0.0 cluster with Ambari-2.5.0.0: a 4-node cluster with
> NodeManager installed on all hosts, NN HA enabled, and RM HA not enabled
> # Register version 2.5.3.0 and install the bits
> # Start HOU using the API and accept the manual prompts to sys-prep the hosts.
> Observe the wizard at the restart task of the host that runs RM and NM together
> *Result:*
> At the task to restart NodeManager on the RM host, the following failure was
> observed:
> {code}
> 2016-12-20 18:32:39,446 -
> File['/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'action':
> ['delete'], 'not_if': 'ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'}
> 2016-12-20 18:32:39,459 - Execute['ulimit -c unlimited; export
> HADOOP_LIBEXEC_DIR=/usr/hdp/2.5.3.0-37/hadoop/libexec &&
> /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config
> /usr/hdp/2.5.3.0-37/hadoop/conf start nodemanager'] {'not_if':
> 'ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'user': 'yarn'}
> 2016-12-20 18:32:40,558 - Execute['ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'not_if':
> 'ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'tries': 5,
> 'try_sleep': 1}
> 2016-12-20 18:32:40,576 - Skipping Execute['ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] due to not_if
> 2016-12-20 18:32:40,576 - Executing NodeManager Stack Upgrade post-restart
> 2016-12-20 18:32:40,578 - NodeManager executing "yarn node -list
> -states=RUNNING" to verify the node has rejoined the cluster...
> 2016-12-20 18:32:40,578 - checked_call['yarn node -list -states=RUNNING']
> {'user': 'yarn'}
> Command failed after 1 tries
> {code}
> A retry of the failed task is successful.
> The issue appears to be that RM is still down when we try to start NM on the
> host. After starting NM, we run the command below to verify that NM has
> come up:
> {code}
> yarn node -list -states=RUNNING
> {code}
> The command fails because it tries to connect to RM, which results in a
> timeout.
> As a possible fix, we may need to adjust the order in the HOU upgrade pack so
> that RM is started before NM in such cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
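
The report notes that the post-restart check ran once ("Command failed after 1 tries") and that a manual retry of the task succeeded. As an illustration only, the following Python sketch shows how such a check could be retried a few times before the task is failed. It is not the change in AMBARI-19435.patch, and it is a different mitigation from the reordering proposed in the issue. The helper names `wait_for_check` and `nodemanager_is_running` are hypothetical, and the defaults `tries=5` and `try_sleep=1.0` are assumptions that simply mirror the `'tries': 5, 'try_sleep': 1` values visible in the log above. The `yarn` arguments are copied verbatim from the report.

```python
import subprocess
import time


def wait_for_check(check, tries=5, try_sleep=1.0, sleep=time.sleep):
    """Call check() up to `tries` times, pausing between failed attempts.

    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError if every attempt fails. (Hypothetical helper.)
    """
    last_error = None
    for attempt in range(1, tries + 1):
        try:
            if check():
                return attempt
        except (OSError, subprocess.SubprocessError) as exc:
            # Covers a missing `yarn` binary and a command timeout.
            last_error = exc
        if attempt < tries:
            sleep(try_sleep)
    raise RuntimeError(f"check failed after {tries} tries") from last_error


def nodemanager_is_running(hostname, run=subprocess.run):
    """Return True if `hostname` appears in the RUNNING node list.

    Runs the same command as the report; it needs a reachable
    ResourceManager, which is exactly what was missing in the failure.
    The 60-second timeout is an assumption. (Hypothetical helper.)
    """
    result = run(
        ["yarn", "node", "-list", "-states=RUNNING"],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return result.returncode == 0 and hostname in result.stdout
```

A caller might combine the two as `wait_for_check(lambda: nodemanager_is_running("nm-host.example.com"))`, where the host name is a placeholder. Retrying only helps if RM comes up within the retry window; if the upgrade pack starts RM after NM on the same host, reordering as suggested in the issue is still needed.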