> On June 17, 2016, 6:44 p.m., Di Li wrote:
> > Hello Victor,
> > 
> > So I ran some tests and observed the following. I have a 3-node cluster: c1.apache.org, c2.apache.org, and c3.apache.org.
> > 
> > 1. Right after finishing the manual steps listed in the "Initialize Metadata" step, I noticed c1.apache.org has a NameNode process running, but it's the standby. c2.apache.org (the newly added NN) has its NN stopped.
> > 
> > 2. The state of the two NNs in #1 seems to have caused the NN's check_is_active_namenode function call to return False, thus setting ensure_safemode_off to False as well, skipping the safemode check altogether.
> > 
> > 3. If I just run the safemode check via the hadoop command line, here are the results. Notice that safemode is reported as ON on the standby node, and the other one is a connection-refused error:
> > 
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns binn -getServiceState nn1
> > standby
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns binn -getServiceState nn2
> > 16/06/17 11:26:42 INFO ipc.Client: Retrying connect to server: c1.apache.org:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> > Operation failed: Call From c1.apache.org to c2.apache.org:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> > [hdfs@c1 ~]$ hdfs dfsadmin -safemode get
> > Safe mode is ON in c1.apache.org:8020
> > safemode: Call From c1.apache.org to c2.apache.org:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> > 
> > So in my opinion, the fix should be at the NameNode Python script level to always check safemode against the two NNs, and make sure safemode is off on the active NameNode. As a safeguard against an offline active NN, the check should eventually time out to unblock the rest of the start sequence.
> 
> Victor Galgo wrote:
>     "So in my opinion, the fix should be at the NameNode Python script level to always check safemode against the two NNs."
>     We cannot do that, because at that point all DataNodes are stopped, which means the NN will never leave safemode.
> 
> Alejandro Fernandez wrote:
>     Please include Jonathan Hurley in the code review, since he recently modified the function that waits to leave safemode.
>     This is not the first time that we've needed a step to "leave safe mode". So either we put it into the Python code (and do a lot of testing on it, since it also impacts EU and RU), or we make a custom command for HDFS that is only available if HA is present and waits for the NameNode to leave safemode.
> 
> Jonathan Hurley wrote:
>     Yes, I recently added something for the case during an EU where we know that the NameNode probably won't leave safemode. Essentially, don't try to create any directories if the NN didn't wait for safemode to exit. That was only for NN, though.
>     
>     But this problem is a more generic case - it affects other services. Since the NN wasn't restarted, it might be in safemode. In this case, I think we need to handle the retriable exception, back off, and wait.
>     
>     However, you could also argue that since we know we're doing a restart operation, we should be shutting down the NNs completely. If there's no issue with shutting them down during the HA process, then this patch seems fine for now, but we should open another one for catching the RetriableException.
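For illustration, a minimal sketch of the back-off-and-wait handling Jonathan describes for the follow-up patch. The exception class and helper below are hypothetical stand-ins, not the actual resource_management API:

    import time

    class RetriableHdfsError(Exception):
        # Hypothetical marker for a WebHDFS call that came back with
        # org.apache.hadoop.ipc.RetriableException (e.g. NN still in safemode).
        pass

    def run_with_backoff(operation, max_tries=10, initial_sleep=5):
        # Retry 'operation' (a zero-argument callable) while the NameNode
        # reports a retriable condition, sleeping with exponential backoff
        # so a slow safemode exit does not immediately fail the start sequence.
        sleep = initial_sleep
        for attempt in range(1, max_tries + 1):
            try:
                return operation()
            except RetriableHdfsError:
                if attempt == max_tries:
                    raise  # give up and surface the failure to the caller
                time.sleep(sleep)
                sleep = min(sleep * 2, 60)  # cap the backoff at one minute

A caller would wrap each delayed HDFS resource action, e.g. run_with_backoff(lambda: util.run_command(target, 'SETPERMISSION', method='PUT')), assuming the executor translates the 403 RetriableException response into the exception above.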
Thanks, Jonathan! I absolutely agree with your points. Could you please "Ship it"?

- Victor


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138280
-----------------------------------------------------------


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> -----------------------------------------------------------
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
>     https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> On the last step of enabling HA, "Start all", the following happens:
> 
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 147, in <module>
>     ApplicationTimelineServer().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
>     method(env)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 43, in start
>     self.configure(env) # FOR SECURITY
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 54, in configure
>     yarn(name='apptimelineserver')
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py", line 276, in yarn
>     mode=0755
>   File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
>     self.env.run()
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
>     self.run_action(resource, action)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
>     provider_action()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 463, in action_create_on_execute
>     self.action_delayed("create")
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 460, in action_delayed
>     self.get_hdfs_resource_executor().action_delayed(action_name, self)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 259, in action_delayed
>     self._set_mode(self.target_status)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 366, in _set_mode
>     self.util.run_command(self.main_resource.resource.target, 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 195, in run_command
>     raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755'' returned status_code=403.
> {
>   "RemoteException": {
>     "exception": "RetriableException",
>     "javaClassName": "org.apache.hadoop.ipc.RetriableException",
>     "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total blocks 697.\nThe number of live datanodes 20 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached."
>   }
> }
> 
> This happens because the NN is not yet out of safemode at the moment ATS starts, since the DNs have only just started.
> 
> To fix this, "Stop NameNodes" has to be triggered before "Start all".
> 
> If this is done, "Start all" ensures that the DataNodes start before the NNs, and that the NNs are out of safemode before ATS starts.
> 
> 
> Diffs
> -----
> 
>   ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js 24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: https://reviews.apache.org/r/48734/diff/
> 
> 
> Testing
> -------
> 
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> 
>   28668 tests complete (34 seconds)
>   154 tests pending
> 
> [INFO] 
> [INFO] --- apache-rat-plugin:0.11:check (default) @ ambari-web ---
> [INFO] 51 implicit excludes (use -debug for more details).
> [INFO] Exclude: .idea/**
> [INFO] Exclude: package.json
> [INFO] Exclude: public/**
> [INFO] Exclude: public-static/**
> [INFO] Exclude: app/assets/**
> [INFO] Exclude: vendor/**
> [INFO] Exclude: node_modules/**
> [INFO] Exclude: node/**
> [INFO] Exclude: npm-debug.log
> [INFO] 1425 resources included (use -debug for more details)
> Warning: org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser: Property 'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not recognized.
> Compiler warnings:
>   WARNING: 'org.apache.xerces.jaxp.SAXParserImpl: Property 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.'
> Warning: org.apache.xerces.parsers.SAXParser: Feature 'http://javax.xml.XMLConstants/feature/secure-processing' is not recognized.
> Warning: org.apache.xerces.parsers.SAXParser: Property 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.
> Warning: org.apache.xerces.parsers.SAXParser: Property 'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not recognized.
> [INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 approved: 1425 licence.
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 1:31.015s
> [INFO] Finished at: Sun Jun 12 14:37:47 EEST 2016
> [INFO] Final Memory: 13M/407M
> [INFO] ------------------------------------------------------------------------
> 
> Also, to test this I installed a 3-node cluster and enabled NameNode HA on it.
> 
> 
> Thanks,
> 
> Victor Galgo
> 
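For reference, a minimal sketch of the safemode probe Di Li describes at the top of this thread: find the active NN among the two, require safemode OFF on it, and time out rather than block the start sequence indefinitely. The helper names, timeout, and polling interval are illustrative; the hdfs CLI invocations are the same ones shown in the test output above:

    import subprocess
    import time

    def _run(args):
        # Run an hdfs CLI command, returning (exit_code, combined_output).
        proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
        out, _ = proc.communicate()
        return proc.returncode, out

    def get_service_state(nameservice, nn_id):
        # Returns 'active', 'standby', or None when the NN is unreachable
        # (the "Connection refused" case from the test output above).
        code, out = _run(["hdfs", "haadmin", "-ns", nameservice,
                          "-getServiceState", nn_id])
        return out.strip() if code == 0 else None

    def wait_for_active_safemode_off(nameservice, nn_ids, timeout=600, poll=15):
        # Poll until the active NN reports safemode OFF; give up after
        # 'timeout' seconds so the rest of the start sequence is unblocked.
        deadline = time.time() + timeout
        while time.time() < deadline:
            for nn_id in nn_ids:
                if get_service_state(nameservice, nn_id) == "active":
                    code, out = _run(["hdfs", "dfsadmin", "-safemode", "get"])
                    if code == 0 and "Safe mode is OFF" in out:
                        return True
            time.sleep(poll)
        return False  # timed out; the caller decides whether to proceed

As in the test output, safemode only needs to be OFF on the active NN; an unreachable NN simply skips a poll round instead of failing the check outright, e.g. wait_for_active_safemode_off("binn", ["nn1", "nn2"]).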