[jira] [Updated] (AMBARI-19435) NodeManager restart fails during HOU if it is on same host as RM

Yusaku Sako (JIRA) Mon, 20 Mar 2017 15:08:37 -0700

     [ https://issues.apache.org/jira/browse/AMBARI-19435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yusaku Sako updated AMBARI-19435:
---------------------------------
    Reporter: Vivek Sharma  (was: Jonathan Hurley)

> NodeManager restart fails during HOU if it is on same host as RM
> ----------------------------------------------------------------
>
>                 Key: AMBARI-19435
>                 URL: https://issues.apache.org/jira/browse/AMBARI-19435
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.5.0
>            Reporter: Vivek Sharma
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.5.0
>
>         Attachments: AMBARI-19435.patch
>
>
> *Steps*
> # Deploy an HDP-2.5.0.0 cluster with Ambari-2.5.0.0: a 4-node cluster with 
> NodeManager installed on all hosts, NN HA enabled, and RM HA not enabled
> # Register version 2.5.3.0 and install the bits
> # Start HOU using the API and accept the manual prompts to sys-prep the hosts. 
> Observe the wizard at the restart task for the host that runs RM and NM together
> *Result:*
> At the task to restart NodeManager on the RM host, the following failure was observed:
> {code}
> 2016-12-20 18:32:39,446 - 
> File['/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'action': 
> ['delete'], 'not_if': 'ambari-sudo.sh  -H -E test -f 
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E 
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'}
> 2016-12-20 18:32:39,459 - Execute['ulimit -c unlimited; export 
> HADOOP_LIBEXEC_DIR=/usr/hdp/2.5.3.0-37/hadoop/libexec && 
> /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config 
> /usr/hdp/2.5.3.0-37/hadoop/conf start nodemanager'] {'not_if': 
> 'ambari-sudo.sh  -H -E test -f 
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E 
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'user': 'yarn'}
> 2016-12-20 18:32:40,558 - Execute['ambari-sudo.sh  -H -E test -f 
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E 
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'not_if': 
> 'ambari-sudo.sh  -H -E test -f 
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E 
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'tries': 5, 
> 'try_sleep': 1}
> 2016-12-20 18:32:40,576 - Skipping Execute['ambari-sudo.sh  -H -E test -f 
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E 
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] due to not_if
> 2016-12-20 18:32:40,576 - Executing NodeManager Stack Upgrade post-restart
> 2016-12-20 18:32:40,578 - NodeManager executing "yarn node -list 
> -states=RUNNING" to verify the node has rejoined the cluster...
> 2016-12-20 18:32:40,578 - checked_call['yarn node -list -states=RUNNING'] 
> {'user': 'yarn'}
> Command failed after 1 tries
> {code}
> A retry of the failed task is successful. 
> The issue appears to be that RM is still down when we try to start NM on 
> the host. While starting NM, we run the command below to verify that NM has 
> come up:
> {code}
> yarn node -list -states=RUNNING
> {code}
> The command fails because it tries to connect to RM, which results in a timeout.
> As a possible fix, we may need to adjust the order in the HOU upgrade pack so 
> that RM is started before NM in such cases.
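
Since a manual retry of the failed task succeeds, one way to picture a workaround is a post-restart check that retries instead of giving up after a single try. The sketch below is only an illustration and is not the change in the attached AMBARI-19435.patch, and it does not reorder the upgrade pack as the report suggests. The function name wait_for_running_nodes and the injectable run and sleep hooks are hypothetical. The tries and try_sleep parameters borrow their names from the log above, and their default values are arbitrary. The sketch also runs the command as the current user, whereas the log shows it running as user yarn.

```python
import subprocess
import time


def wait_for_running_nodes(run=None, tries=5, try_sleep=1.0, sleep=time.sleep):
    """Retry `yarn node -list -states=RUNNING` until it exits with 0.

    `run` is a hypothetical hook returning (exit_code, output); it exists
    so the retry logic can be exercised without a real cluster.
    Returns (attempt_number, output) on success and raises RuntimeError
    once all tries are used up.
    """
    if run is None:
        def run():
            # Same command as in the report; it needs a reachable RM.
            proc = subprocess.run(
                ["yarn", "node", "-list", "-states=RUNNING"],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
            )
            return proc.returncode, proc.stdout

    last_output = ""
    for attempt in range(1, tries + 1):
        code, last_output = run()
        if code == 0:
            return attempt, last_output
        if attempt < tries:
            # RM may still be starting; wait before asking again.
            sleep(try_sleep)
    raise RuntimeError(
        "yarn node -list failed after %d tries: %s" % (tries, last_output)
    )
```

Retrying only hides the ordering problem: if RM is never started on that host before the check, every attempt fails and the error is still raised, so starting RM before NM, as proposed above, remains the more direct fix.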



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
