[ 
https://issues.apache.org/jira/browse/AMBARI-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lysnichenko updated AMBARI-18786:
----------------------------------------
    Component/s: ambari-server

> HDP Upgrade fails when the cluster size is large
> ------------------------------------------------
>
>                 Key: AMBARI-18786
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18786
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>         Attachments: AMBARI-18786.patch
>
>
> Starting with Ambari 2.4, HDP upgrades on large clusters fail during the 
> NameNode restart.
> The restart command waits for the NameNode to leave safemode. On a large 
> cluster the NameNode takes longer to leave safemode than the timeout 
> configured in the Ambari scripts allows, so Ambari marks the action as failed.
> {code}
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 42, in get_value_from_jmx
>     return data_dict["beans"][0][property]
> IndexError: list index out of range
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
>     NameNode().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
>     method(env)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 720, in restart
>     self.start(env, upgrade_type=upgrade_type)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
>     upgrade_suspended=params.upgrade_suspended, env=env)
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 184, in namenode
>     if is_this_namenode_active() is False:
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
>     return function(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 554, in is_this_namenode_active
>     raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
> resource_management.core.exceptions.Fail: The NameNode nn1 is not listed as Active or Standby, waiting...
> {code}
> To work around this, we increased the retry timeout in Ambari:
> 1. Changed the retry decorator in 
> /var/lib/ambari-server/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py
>  from:
> @retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail)
> to:
> @retry(times=25, sleep_time=25, backoff_factor=2, err_class=Fail)
> 2. Restarted the Ambari server.
> After this, the upgrade completed successfully.
> I think it's better to increase the timeout permanently so that we don't have 
> to deal with this issue again.
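
To make the effect of the decorator change concrete, here is a minimal sketch of a retry decorator with exponential backoff. It is not Ambari's actual implementation: the `retry` body, the stand-in `Fail` class, the injectable `sleep` argument, and the `worst_case_wait` helper are all assumptions made for illustration, and the sketch assumes that `times` counts total attempts and that the delay is never capped.

```python
import functools
import time


class Fail(Exception):
    """Stand-in for resource_management.core.exceptions.Fail (assumption)."""


def retry(times, sleep_time, backoff_factor, err_class, sleep=time.sleep):
    """Hypothetical retry decorator; NOT Ambari's actual implementation.

    Calls the wrapped function up to `times` times. After each failed
    attempt except the last it sleeps, then multiplies the delay by
    `backoff_factor`. `sleep` is injectable so the logic can be tested
    without real waiting.
    """
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            delay = sleep_time
            for attempt in range(1, times + 1):
                try:
                    return function(*args, **kwargs)
                except err_class:
                    if attempt == times:
                        raise  # retry budget exhausted, surface the error
                    sleep(delay)
                    delay *= backoff_factor
        return wrapper
    return decorator


def worst_case_wait(times, sleep_time, backoff_factor):
    """Total seconds slept if every attempt fails (uncapped delay assumed)."""
    return sum(sleep_time * backoff_factor ** i for i in range(times - 1))
```

Under these assumptions, `worst_case_wait(5, 5, 2)` is 75 seconds (5 + 10 + 20 + 40), which a NameNode on a large cluster can easily exceed while still in safemode. With `times=25, sleep_time=25` the uncapped figure grows without practical bound, so if the real decorator limits the per-attempt sleep, the effective wait would be much smaller than this sketch suggests.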



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
