[ https://issues.apache.org/jira/browse/AMBARI-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko updated AMBARI-18786:
----------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed
To https://git-wip-us.apache.org/repos/asf/ambari.git
   819dbff..7d2b6bb  branch-2.5 -> branch-2.5
   43a181a..ba5cbf4  trunk -> trunk

> HDP Upgrade fails when the cluster size is large
> ------------------------------------------------
>
>                 Key: AMBARI-18786
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18786
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 2.5.0
>
>         Attachments: AMBARI-18786.patch
>
>
> Starting from Ambari 2.4, when the cluster is large, the HDP upgrade fails
> during the NameNode restart.
> This is because the restart command waits for the NameNode to leave safemode.
> On a large cluster the NameNode takes longer to leave safemode, and Ambari
> marks the action as failed because the NameNode did not leave safemode within
> the timeout configured in the Ambari scripts.
> {code}
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 42, in get_value_from_jmx
>     return data_dict["beans"][0][property]
> IndexError: list index out of range
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
>     NameNode().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
>     method(env)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 720, in restart
>     self.start(env, upgrade_type=upgrade_type)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
>     upgrade_suspended=params.upgrade_suspended, env=env)
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 184, in namenode
>     if is_this_namenode_active() is False:
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
>     return function(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 554, in is_this_namenode_active
>     raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
> resource_management.core.exceptions.Fail: The NameNode nn1 is not listed as Active or Standby, waiting...
> {code}
> To resolve this, we increased the timeout in Ambari:
> 1. Increased the retry settings in
> /var/lib/ambari-server/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py
> from this:
> @retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail)
> to this:
> @retry(times=25, sleep_time=25, backoff_factor=2, err_class=Fail)
> 2. Restarted the Ambari server.
> After this, the upgrade went through fine.
> I think it's better to increase the timeout permanently so that we don't have
> to deal with this issue again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
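
To make the effect of the `@retry` parameters in the workaround easier to reason about, here is a minimal Python 3 sketch of a retry decorator with exponential backoff. It is an illustration only, not Ambari's `decorator.py`: the `Fail` class is a stand-in for `resource_management.core.exceptions.Fail`, the `sleep` parameter and the `total_sleep` helper are hypothetical additions for testing and arithmetic, and the sketch assumes no upper bound on the delay, which the real decorator may well impose.

```python
import time
from functools import wraps


class Fail(Exception):
    """Stand-in for resource_management.core.exceptions.Fail (assumption)."""


def retry(times=3, sleep_time=1, backoff_factor=1, err_class=Exception,
          sleep=time.sleep):
    """Call the wrapped function up to `times` times.

    After each failed attempt except the last, sleep for the current delay
    and multiply the delay by `backoff_factor`. `sleep` is injectable so the
    behaviour can be checked without actually waiting (hypothetical parameter).
    """
    def decorator(function):
        @wraps(function)
        def wrapper(*args, **kwargs):
            delay = sleep_time
            for _ in range(times - 1):
                try:
                    return function(*args, **kwargs)
                except err_class:
                    sleep(delay)
                    delay *= backoff_factor
            # Final attempt: any exception propagates to the caller.
            return function(*args, **kwargs)
        return wrapper
    return decorator


def total_sleep(times, sleep_time, backoff_factor):
    """Worst-case seconds spent sleeping when every attempt fails.

    Assumes the delay is never capped (hypothetical helper).
    """
    return sum(sleep_time * backoff_factor ** i for i in range(times - 1))
```

Under these assumptions, `total_sleep(5, 5, 2)` gives 5 + 10 + 20 + 40 = 75 seconds of waiting across five attempts, in addition to the time each check itself takes. That is the budget the NameNode has to leave safemode before the final attempt raises `Fail`, which is why raising `times` and `sleep_time` let the upgrade proceed on a large cluster.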