[ https://issues.apache.org/jira/browse/AMBARI-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307825#comment-14307825 ]
Hudson commented on AMBARI-9467:
--------------------------------

SUCCESS: Integrated in Ambari-trunk-Commit #1697 (See [https://builds.apache.org/job/Ambari-trunk-Commit/1697/])
AMBARI-9467. RU Hacks and Technical Debt - Namenode order of active/standby in code is flipped (alejandro) (afernandez: http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=4aa80b9a7b48dfa446b2aa8195f87542b328d36a)
* ambari-server/src/main/resources/common-services/ZOOKEEPER/3.4.5.2.0/package/scripts/service_check.py
* ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterTest.java
* ambari-server/src/main/resources/common-services/ZOOKEEPER/3.4.5.2.0/package/scripts/params.py
* ambari-server/src/main/resources/common-services/ZOOKEEPER/3.4.5.2.0/package/scripts/zookeeper_server.py
* ambari-server/src/main/java/org/apache/ambari/server/serveraction/upgrades/FinalizeUpgradeAction.java
* ambari-server/src/test/java/org/apache/ambari/server/state/UpgradeHelperTest.java
* ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
* ambari-server/src/main/java/org/apache/ambari/server/state/UpgradeHelper.java

> RU Hacks and Technical Debt - Namenode order of active/standby in code is flipped
> ---------------------------------------------------------------------------------
>
>                 Key: AMBARI-9467
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9467
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.0.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 2.0.0
>
>         Attachments: AMBARI-9467.patch
>
>
> Search for comments like # TODO, // TODO, # HACK, or // HACK
> UpgradeHelper somehow calls the active Namenode first, but this ends up being the standby namenode by the time it gets called; investigate why.
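The HA roles captured when the upgrade is scheduled can be stale by the time a step actually runs: as the logs below show, a ZKFC failover between scheduling and execution flips which host is active. A minimal sketch of re-resolving the roles at call time, by polling each NameNode's JMX `FSNamesystem` bean (its `tag.HAState` attribute reports "active" or "standby"); this is not Ambari's actual code, and the hostnames and helper names are illustrative:

```python
# Sketch: determine active/standby at execution time rather than trusting
# an ordering computed earlier. Assumes unsecured HTTP JMX on port 50070.
import json
import urllib.request


def namenode_ha_state(host, port=50070, timeout=10):
    """Return 'active' or 'standby' as reported by the NameNode's JMX
    FSNamesystem bean (tag.HAState attribute)."""
    url = ("http://%s:%d/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
           % (host, port))
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        beans = json.load(resp)["beans"]
    return beans[0]["tag.HAState"]


def order_namenodes(hosts, get_state=namenode_ha_state):
    """Partition NameNode hosts into (active, standby) lists using a fresh
    state query per host; get_state is injectable for testing."""
    states = {h: get_state(h) for h in hosts}
    active = [h for h, s in states.items() if s == "active"]
    standby = [h for h, s in states.items() if s == "standby"]
    return active, standby
```

Re-querying immediately before each NameNode-ordered step would have reported nn2 as active at 01:40:21 but nn1 as active after the 01:40:30 failover, instead of replaying the stale order.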
> Journal node restart:
> {code}
> 2015-02-04 01:40:21,682 - Executing Rolling Upgrade post-restart
> 2015-02-04 01:40:21,684 - Ensuring Journalnode quorum is established
> 2015-02-04 01:40:21,692 - Namenode HA State: {
>   IDs: nn1, nn2
>   Addresses: {u'nn2': (u'c6402.ambari.apache.org:50070', u'c6402.ambari.apache.org:50470'), u'nn1': (u'c6401.ambari.apache.org:50070', u'c6401.ambari.apache.org:50470')}
>   States: {u'standby': set([u'c6401.ambari.apache.org']), u'active': set([u'c6402.ambari.apache.org'])}
>   Encrypted: False
>   Healthy: True
> }
> 2015-02-04 01:40:26,697 - u"Execute['hdfs dfsadmin -rollEdits']" {'tries': 1, 'user': 'hdfs'}
> 2015-02-04 01:40:41,490 - Could not get namenode state for nn2
> 2015-02-04 01:40:41,492 - Namenode HA State: {
>   IDs: nn1, nn2
>   Addresses: {u'nn2': (u'c6402.ambari.apache.org:50070', u'c6402.ambari.apache.org:50470'), u'nn1': (u'c6401.ambari.apache.org:50070', u'c6401.ambari.apache.org:50470')}
>   States: {u'active': set([u'c6401.ambari.apache.org'])}
>   Encrypted: False
>   Healthy: False
> }
> {code}
> At around 2015-02-04 01:40:26, the Namenode logs on nn2 show,
> {code}
> 2015-02-04 01:40:13,217 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2015-02-04 01:40:29,454 INFO namenode.FSNamesystem (FSNamesystem.java:rollEditLog(6345)) - Roll Edit Log from 192.168.64.103
> 2015-02-04 01:40:29,454 INFO namenode.FSEditLog (FSEditLog.java:rollEditLog(1157)) - Rolling edit logs
> 2015-02-04 01:40:29,454 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1214)) - Ending log segment 34
> 2015-02-04 01:40:29,454 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(691)) - Number of transactions: 2 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 8 4
> 2015-02-04 01:40:29,465 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 192.168.64.101:8485 failed to write txns 35-35. Will try to write to this JN again after the next log roll.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 3 is not the current writer epoch 2
>     at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:445)
>     at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
>     at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
>     at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
>     at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>     at com.sun.proxy.$Proxy12.journal(Unknown Source)
>     at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
>     at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
>     at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> So the namenodes flipped order.
> Also, nn1 shows,
> {code}
> 2015-02-04 01:39:06,500 INFO blockmanagement.BlockManager (BlockManager.java:processReport(1815)) - BLOCK* processReport: Received first block report from DatanodeStorage[DS-9188ccbc-f03f-48f3-9314-ca6475d846f8,DISK,NORMAL] after starting up or becoming active. Its block contents are no longer considered stale
> 2015-02-04 01:39:06,500 INFO BlockStateChange (BlockManager.java:processReport(1831)) - BLOCK* processReport: from storage DS-9188ccbc-f03f-48f3-9314-ca6475d846f8 node DatanodeRegistration(192.168.64.103, datanodeUuid=54053f93-435e-4fce-950f-00d4deff1506, infoPort=50075, ipcPort=8010, storageInfo=lv=-56;cid=CID-373d38e6-6e88-486e-a88e-db76492ae751;nsid=1321340415;c=0), blocks: 1, hasStaleStorages: false, processing time: 16 msecs
> 2015-02-04 01:39:20,978 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:20,978 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:49,635 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:49,655 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:51,476 INFO blockmanagement.BlockManager (BlockManager.java:processReport(1815)) - BLOCK* processReport: Received first block report from DatanodeStorage[DS-ca53d96a-7630-475f-a502-47b59aba2ee0,DISK,NORMAL] after starting up or becoming active. Its block contents are no longer considered stale
> 2015-02-04 01:39:51,477 INFO BlockStateChange (BlockManager.java:processReport(1831)) - BLOCK* processReport: from storage DS-ca53d96a-7630-475f-a502-47b59aba2ee0 node DatanodeRegistration(192.168.64.101, datanodeUuid=faff4a42-fbe3-48d1-9592-4ef1f4fe077d, infoPort=50075, ipcPort=8010, storageInfo=lv=-56;cid=CID-373d38e6-6e88-486e-a88e-db76492ae751;nsid=1321340415;c=0), blocks: 1, hasStaleStorages: false, processing time: 1 msecs
> 2015-02-04 01:40:18,478 INFO blockmanagement.BlockManager (BlockManager.java:processReport(1815)) - BLOCK* processReport: Received first block report from DatanodeStorage[DS-86763f94-9f8c-41ac-be47-4f4b84b32077,DISK,NORMAL] after starting up or becoming active. Its block contents are no longer considered stale
> 2015-02-04 01:40:18,480 INFO BlockStateChange (BlockManager.java:processReport(1831)) - BLOCK* processReport: from storage DS-86763f94-9f8c-41ac-be47-4f4b84b32077 node DatanodeRegistration(192.168.64.102, datanodeUuid=dda574c3-f296-4b95-ab56-5df231179356, infoPort=50075, ipcPort=8010, storageInfo=lv=-56;cid=CID-373d38e6-6e88-486e-a88e-db76492ae751;nsid=1321340415;c=0), blocks: 1, hasStaleStorages: false, processing time: 0 msecs
> 2015-02-04 01:40:29,389 INFO ipc.Server (Server.java:run(2053)) - IPC Server handler 88 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.rollEdits from 192.168.64.103:45722 Call#0 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category JOURNAL is not supported in state standby
> 2015-02-04 01:40:30,019 INFO ipc.Server (Server.java:run(2053)) - IPC Server handler 78 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.rollEdits from 192.168.64.103:45722 Call#0 Retry#2: org.apache.hadoop.ipc.StandbyException: Operation category JOURNAL is not supported in state standby
> 2015-02-04 01:40:31,382 INFO namenode.FSNamesystem (FSNamesystem.java:stopStandbyServices(1336)) - Stopping services started for standby state
> 2015-02-04 01:40:31,382 WARN ha.EditLogTailer (EditLogTailer.java:doWork(339)) - Edit log tailer interrupted
> java.lang.InterruptedException: sleep interrupted
>     at java.lang.Thread.sleep(Native Method)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
>     at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
> 2015-02-04 01:40:31,387 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1142)) - Starting services required for active state
> {code}
> The ZKFC logs show how the unexpected failover happened,
> On nn1, which transitioned from standby to active:
> {code}
> 2015-02-04 01:39:45,173 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(482)) - ZK Election indicated that NameNode at c6401.ambari.apache.org/192.168.64.101:8020 should become standby
> 2015-02-04 01:39:45,175 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(487)) - Successfully transitioned NameNode at c6401.ambari.apache.org/192.168.64.101:8020 to standby state
> 2015-02-04 01:40:30,304 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(877)) - Checking for any old active which needs to be fenced...
> 2015-02-04 01:40:30,308 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(898)) - Old node exists: 0a02686112036e6e321a1763363430322e616d626172692e6170616368652e6f726720d43e28d33e
> 2015-02-04 01:40:30,310 INFO ha.ZKFailoverController (ZKFailoverController.java:doFence(511)) - Should fence: NameNode at c6402.ambari.apache.org/192.168.64.102:8020
> 2015-02-04 01:40:31,312 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: c6402.ambari.apache.org/192.168.64.102:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> 2015-02-04 01:40:31,314 WARN ha.FailoverController (FailoverController.java:tryGracefulFence(178)) - Unable to gracefully make NameNode at c6402.ambari.apache.org/192.168.64.102:8020 standby (unable to connect)
> java.net.ConnectException: Call From c6401.ambari.apache.org/192.168.64.101 to c6402.ambari.apache.org:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>     at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>     at com.sun.proxy.$Proxy12.transitionToStandby(Unknown Source)
>     at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>     at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>     at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:512)
>     at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:503)
>     at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>     at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:890)
>     at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:902)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:801)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>     at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>     ... 14 more
> 2015-02-04 01:40:31,319 INFO ha.NodeFencer (NodeFencer.java:fence(91)) - ====== Beginning Service Fencing Process... ======
> 2015-02-04 01:40:31,319 INFO ha.NodeFencer (NodeFencer.java:fence(94)) - Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
> 2015-02-04 01:40:31,357 INFO ha.ShellCommandFencer (ShellCommandFencer.java:tryFence(99)) - Launched fencing command '/bin/true' with pid 14045
> 2015-02-04 01:40:31,364 INFO ha.NodeFencer (NodeFencer.java:fence(98)) - ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
> 2015-02-04 01:40:31,364 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:writeBreadCrumbNode(824)) - Writing znode /hadoop-ha/ha/ActiveBreadCrumb to indicate that the local node is the most recent active...
> 2015-02-04 01:40:31,369 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(378)) - Trying to make NameNode at c6401.ambari.apache.org/192.168.64.101:8020 active...
> 2015-02-04 01:40:32,821 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(385)) - Successfully transitioned NameNode at c6401.ambari.apache.org/192.168.64.101:8020 to active state
> {code}
> On nn2, which transitioned from active to standby:
> {code}
> 2015-02-04 01:39:46,270 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected.
> 2015-02-04 01:39:46,279 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(877)) - Checking for any old active which needs to be fenced...
> 2015-02-04 01:39:46,282 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(898)) - Old node exists: 0a02686112036e6e321a1763363430322e616d626172692e6170616368652e6f726720d43e28d33e
> 2015-02-04 01:39:46,282 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(900)) - But old node has our own data, so don't need to fence it.
> 2015-02-04 01:39:46,282 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:writeBreadCrumbNode(824)) - Writing znode /hadoop-ha/ha/ActiveBreadCrumb to indicate that the local node is the most recent active...
> 2015-02-04 01:39:46,289 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(378)) - Trying to make NameNode at c6402.ambari.apache.org/192.168.64.102:8020 active...
> 2015-02-04 01:39:46,301 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(385)) - Successfully transitioned NameNode at c6402.ambari.apache.org/192.168.64.102:8020 to active state
> 2015-02-04 01:40:30,302 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at c6402.ambari.apache.org/192.168.64.102:8020: End of File Exception between local host is: "c6402.ambari.apache.org/192.168.64.102"; destination host is: "c6402.ambari.apache.org":8020; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
> 2015-02-04 01:40:30,302 INFO ha.HealthMonitor (HealthMonitor.java:enterState(238)) - Entering state SERVICE_NOT_RESPONDING
> 2015-02-04 01:40:30,302 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(850)) - Local service NameNode at c6402.ambari.apache.org/192.168.64.102:8020 entered state: SERVICE_NOT_RESPONDING
> 2015-02-04 01:40:30,302 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(766)) - Quitting master election for NameNode at c6402.ambari.apache.org/192.168.64.102:8020 and marking that fencing is necessary
> 2015-02-04 01:40:30,303 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(354)) - Yielding from election
> 2015-02-04 01:40:30,310 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x24b523c16a70000 closed
> 2015-02-04 01:40:30,310 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1006)) - Ignoring stale result from old client with sessionId 0x24b523c16a70000
> 2015-02-04 01:40:30,310 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> 2015-02-04 01:40:32,314 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: c6402.ambari.apache.org/192.168.64.102:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)