Victor Galgo created AMBARI-17182:
-------------------------------------

             Summary: App timeline Server start fails on enabling HA because namenode is in safemode
                 Key: AMBARI-17182
                 URL: https://issues.apache.org/jira/browse/AMBARI-17182
             Project: Ambari
          Issue Type: Bug
    Affects Versions: 2.4.0
            Reporter: Victor Galgo
            Priority: Critical
             Fix For: 2.4.0


On the last step of enabling HA ("Start all"), the following failure occurs:
{code}
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 147, in <module>
    ApplicationTimelineServer().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 43, in start
    self.configure(env) # FOR SECURITY
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 54, in configure
    yarn(name='apptimelineserver')
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py", line 276, in yarn
    mode=0755
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 463, in action_create_on_execute
    self.action_delayed("create")
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 460, in action_delayed
    self.get_hdfs_resource_executor().action_delayed(action_name, self)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 259, in action_delayed
    self._set_mode(self.target_status)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 366, in _set_mode
    self.util.run_command(self.main_resource.resource.target, 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 195, in run_command
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT 'http://os-s11-3-iavzl-nat-s-ru242to25susesecha-12.openstacklocal:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755'' returned status_code=403.
{
  "RemoteException": {
    "exception": "RetriableException",
    "javaClassName": "org.apache.hadoop.ipc.RetriableException",
    "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total blocks 697.\nThe number of live datanodes 20 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached."
  }
}
{code}

This happens because the NameNode (NN) has not yet left safe mode at the moment 
ATS starts, since the DataNodes (DNs) have only just started.
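
To illustrate how far the NN was from leaving safe mode, here is a minimal sketch that pulls the block counts out of the SafeModeException message quoted in the traceback. The names MESSAGE, PATTERN and parse_safemode_progress are hypothetical and exist only for this example; the regular expression assumes the message wording shown in this report and may not match other Hadoop versions.

```python
import re

# Message text copied from the RemoteException in this report.
MESSAGE = (
    "Cannot set permission for /ats/done. Name node is in safe mode.\n"
    "The reported blocks 675 needs additional 16 blocks to reach the "
    "threshold 0.9900 of total blocks 697.\n"
    "The number of live datanodes 20 has reached the minimum number 0."
)

# Assumption: the wording matches the message above exactly.
PATTERN = re.compile(
    r"reported blocks (\d+) needs additional (\d+) blocks to reach "
    r"the threshold ([\d.]+) of total blocks (\d+)"
)


def parse_safemode_progress(message):
    """Return the block counts from a safe mode message, or None if absent."""
    match = PATTERN.search(message)
    if match is None:
        return None
    reported, needed, threshold, total = match.groups()
    return {
        "reported": int(reported),
        "needed": int(needed),
        "threshold": float(threshold),
        "total": int(total),
        # Fraction of blocks reported so far by the DataNodes.
        "fraction": float(reported) / float(total),
    }


progress = parse_safemode_progress(MESSAGE)
```

For the message in this report, the reported fraction is 675/697, roughly 0.968, which is below the 0.99 threshold, so the SETPERMISSION call is rejected.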

To fix this, "stop namenodes" has to be triggered before "Start all".

If this is done, "Start all" ensures that the DataNodes start before the 
NameNodes, and that the NameNodes are out of safe mode before ATS starts.
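
The ordering requirement above can also be pictured as an explicit check before ATS start. The following is only a sketch of that idea, not the fix proposed in this issue: the helper names query_safemode and wait_for_safemode_off, the retry count and the delay are all assumptions. It relies on `hdfs dfsadmin -safemode get` printing lines containing "Safe mode is ON" or "Safe mode is OFF" (one per NameNode in an HA setup), which should be verified against the Hadoop version in use.

```python
import subprocess
import time


def query_safemode():
    """Return the output of `hdfs dfsadmin -safemode get` as text."""
    output = subprocess.check_output(["hdfs", "dfsadmin", "-safemode", "get"])
    return output.decode("utf-8", "replace")


def wait_for_safemode_off(get_status=query_safemode, retries=30, delay=10,
                          sleep=time.sleep):
    """Poll until no NameNode reports safe mode ON; return True on success.

    get_status and sleep are injectable so the loop can be exercised
    without a running cluster.
    """
    for attempt in range(retries):
        status = get_status()
        # With HA, the output may contain one line per NameNode, so require
        # at least one OFF line and no ON line.
        if "Safe mode is OFF" in status and "Safe mode is ON" not in status:
            return True
        if attempt < retries - 1:
            sleep(delay)
    return False
```

A caller would start ATS only when wait_for_safemode_off() returns True, and fail with a clear message otherwise.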



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
