> On June 17, 2016, 6:44 p.m., Di Li wrote:
> > Hello Victor,
> > 
> > So I ran some tests and observed the following. I have a 3-node cluster: c1.apache.org, c2.apache.org, and c3.apache.org.
> > 
> > 1. Right after finishing the manual steps listed in the "Initialize Metadata" step, I noticed c1.apache.org has a NameNode process running, but it's the standby. c2.apache.org (the newly added NN) has its NN stopped.
> > 
> > 2. The state of the two NNs in #1 seems to have caused the NN's check_is_active_namenode function call to return False, thus setting ensure_safemode_off to False as well, skipping the safemode check altogether.
> > 
> > 3. If I just run the safemode check via the hadoop command line, here are the results. Notice that safemode is reported as ON on the standby node, and the other one is a connection-refused error:
> > 
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns binn -getServiceState nn1
> > standby
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns binn -getServiceState nn2
> > 16/06/17 11:26:42 INFO ipc.Client: Retrying connect to server: c1.apache.org:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> > Operation failed: Call From c1.apache.org to c2.apache.org:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> > [hdfs@c1 ~]$ hdfs dfsadmin -safemode get
> > Safe mode is ON in c1.apache.org:8020
> > safemode: Call From c1.apache.org to c2.apache.org:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> > 
> > So in my opinion, the fix should be at the NameNode Python script level to always check safemode against the two NNs, and make sure safemode is off on the active NameNode. As a safeguard against an offline active NN, the check should eventually time out to unblock the rest of the start sequence.
> 
> Victor Galgo wrote:
>     "So in my opinion, the fix should be at the NameNode Python script level to always check safemode against the two NNs."
>     We cannot do that, because at that point all DataNodes are stopped, which means the NN will never leave safemode.
> 
> Alejandro Fernandez wrote:
>     Please include Jonathan Hurley in the code review, since he recently modified the function that waits to leave safemode.
>     This is not the first time that we've needed a step to "leave safe mode". So either we put it into the Python code (and do a lot of testing on it, since it also impacts EU and RU), or we make a custom command for HDFS that is only available if HA is present and waits for the NameNode to leave safemode.
> 
> Jonathan Hurley wrote:
>     Yes, I recently added something for the case during an EU where we know that the NameNode probably won't leave safemode. Essentially, don't try to create any directories if the NN didn't wait for safemode to exit. That was only for NN, though.
>     
>     But this problem is a more generic case - it affects other services. Since the NN wasn't restarted, it might be in safemode. In this case, I think we need to handle the retriable exception, back off, and wait.
>     
>     However, you could also argue that since we know we're doing a restart operation, we should be shutting down the NNs completely. If there's no issue with shutting them down during the HA process, then this patch seems fine for now, but we should open another one for catching the RetriableException.
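For illustration, a minimal sketch of the back-off-and-wait handling Jonathan describes for the follow-up patch. The exception class and helper below are hypothetical stand-ins, not the actual resource_management API:

    import time

    class RetriableHdfsError(Exception):
        # Hypothetical marker for a WebHDFS call that came back with
        # org.apache.hadoop.ipc.RetriableException (e.g. NN still in safemode).
        pass

    def run_with_backoff(operation, max_tries=10, initial_sleep=5):
        # Retry 'operation' (a zero-argument callable) while the NameNode
        # reports a retriable condition, sleeping with exponential backoff
        # so a slow safemode exit does not immediately fail the start sequence.
        sleep = initial_sleep
        for attempt in range(1, max_tries + 1):
            try:
                return operation()
            except RetriableHdfsError:
                if attempt == max_tries:
                    raise  # give up and surface the failure to the caller
                time.sleep(sleep)
                sleep = min(sleep * 2, 60)  # cap the backoff at one minute

A caller would wrap each delayed HDFS resource action, e.g. run_with_backoff(lambda: util.run_command(target, 'SETPERMISSION', method='PUT')), assuming the executor translates the 403 RetriableException response into the exception above.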
Thanks, Jonathan! I absolutely agree with your points. Could you please "Ship it"?

- Victor


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138280
-----------------------------------------------------------


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> -----------------------------------------------------------
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
>     https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> On the last step of enabling HA, "Start all", the following happens:
> 
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 147, in <module>
>     ApplicationTimelineServer().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
>     method(env)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 43, in start
>     self.configure(env) # FOR SECURITY
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 54, in configure
>     yarn(name='apptimelineserver')
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py", line 276, in yarn
>     mode=0755
>   File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
>     self.env.run()
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
>     self.run_action(resource, action)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
>     provider_action()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 463, in action_create_on_execute
>     self.action_delayed("create")
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 460, in action_delayed
>     self.get_hdfs_resource_executor().action_delayed(action_name, self)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 259, in action_delayed
>     self._set_mode(self.target_status)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 366, in _set_mode
>     self.util.run_command(self.main_resource.resource.target, 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 195, in run_command
>     raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755'' returned status_code=403.
> {
>   "RemoteException": {
>     "exception": "RetriableException",
>     "javaClassName": "org.apache.hadoop.ipc.RetriableException",
>     "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total blocks 697.\nThe number of live datanodes 20 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached."
>   }
> }
> 
> This happens because the NN is not yet out of safemode at the moment ATS starts, since the DNs have only just started.
> 
> To fix this, "Stop NameNodes" has to be triggered before "Start all".
> 
> If this is done, "Start all" ensures that the DataNodes start before the NNs, and that the NNs are out of safemode before ATS starts.
> 
> 
> Diffs
> -----
> 
>   ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js 24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: https://reviews.apache.org/r/48734/diff/
> 
> 
> Testing
> -------
> 
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> 
>   28668 tests complete (34 seconds)
>   154 tests pending
> 
> [INFO] 
> [INFO] --- apache-rat-plugin:0.11:check (default) @ ambari-web ---
> [INFO] 51 implicit excludes (use -debug for more details).
> [INFO] Exclude: .idea/**
> [INFO] Exclude: package.json
> [INFO] Exclude: public/**
> [INFO] Exclude: public-static/**
> [INFO] Exclude: app/assets/**
> [INFO] Exclude: vendor/**
> [INFO] Exclude: node_modules/**
> [INFO] Exclude: node/**
> [INFO] Exclude: npm-debug.log
> [INFO] 1425 resources included (use -debug for more details)
> Warning: org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser: Property 'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not recognized.
> Compiler warnings:
>   WARNING: 'org.apache.xerces.jaxp.SAXParserImpl: Property 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.'
> Warning: org.apache.xerces.parsers.SAXParser: Feature 'http://javax.xml.XMLConstants/feature/secure-processing' is not recognized.
> Warning: org.apache.xerces.parsers.SAXParser: Property 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.
> Warning: org.apache.xerces.parsers.SAXParser: Property 'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not recognized.
> [INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 approved: 1425 licence.
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 1:31.015s
> [INFO] Finished at: Sun Jun 12 14:37:47 EEST 2016
> [INFO] Final Memory: 13M/407M
> [INFO] ------------------------------------------------------------------------
> 
> Also, to test this I installed a 3-node cluster and enabled NameNode HA on it.
> 
> 
> Thanks,
> 
> Victor Galgo
> 
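For reference, a minimal sketch of the safemode probe Di Li describes at the top of this thread: find the active NN among the two, require safemode OFF on it, and time out rather than block the start sequence indefinitely. The helper names, timeout, and polling interval are illustrative; the hdfs CLI invocations are the same ones shown in the test output above:

    import subprocess
    import time

    def _run(args):
        # Run an hdfs CLI command, returning (exit_code, combined_output).
        proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
        out, _ = proc.communicate()
        return proc.returncode, out

    def get_service_state(nameservice, nn_id):
        # Returns 'active', 'standby', or None when the NN is unreachable
        # (the "Connection refused" case from the test output above).
        code, out = _run(["hdfs", "haadmin", "-ns", nameservice,
                          "-getServiceState", nn_id])
        return out.strip() if code == 0 else None

    def wait_for_active_safemode_off(nameservice, nn_ids, timeout=600, poll=15):
        # Poll until the active NN reports safemode OFF; give up after
        # 'timeout' seconds so the rest of the start sequence is unblocked.
        deadline = time.time() + timeout
        while time.time() < deadline:
            for nn_id in nn_ids:
                if get_service_state(nameservice, nn_id) == "active":
                    code, out = _run(["hdfs", "dfsadmin", "-safemode", "get"])
                    if code == 0 and "Safe mode is OFF" in out:
                        return True
            time.sleep(poll)
        return False  # timed out; the caller decides whether to proceed

As in the test output, safemode only needs to be OFF on the active NN; an unreachable NN simply skips a poll round instead of failing the check outright, e.g. wait_for_active_safemode_off("binn", ["nn1", "nn2"]).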