[ https://issues.apache.org/jira/browse/HDFS-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429684#comment-13429684 ]
liaowenrui commented on HDFS-3769:
----------------------------------

Active NN editlog:
158-1-132-18:/opt/namenodeHa/hadoop-2.0.1/hadoop-root/dfs/name/current # ll edits_000000000000000235
edits_0000000000000002354-0000000000000002354
edits_0000000000000002355-0000000000000002356
edits_0000000000000002357-0000000000000002358
edits_0000000000000002359-0000000000000002360

Active NN fsimage file:
-rw-r--r-- 1 root root 37545 Aug 6 07:44 fsimage_0000000000000002351
-rw-r--r-- 1 root root 62 Aug 6 07:46 fsimage_0000000000000002351.md5
-rw-r--r-- 1 root root 37545 Aug 6 09:33 fsimage_0000000000000002353
-rw-r--r-- 1 root root 62 Aug 6 09:33 fsimage_0000000000000002353.md5

Standby NN editlog:
158-1-132-19:/opt/namenodeHa/hadoop-2.0.1/hadoop-root/dfs/name/current # ll edits_000000000000000235
edits_0000000000000002350-0000000000000002351
edits_0000000000000002352-0000000000000002353

Standby NN fsimage file:
-rw-r--r-- 1 root root 37545 Aug 6 11:51 fsimage_0000000000000002351
-rw-r--r-- 1 root root 62 Aug 6 11:51 fsimage_0000000000000002351.md5
-rw-r--r-- 1 root root 37545 Aug 6 13:38 fsimage_0000000000000002353
-rw-r--r-- 1 root root 62 Aug 6 13:38 fsimage_0000000000000002353.md5
-rw-r--r-- 1 root root 5 Aug 6 11:47 seen_txid

Shared storage editlog:
[zk: localhost:2181(CONNECTED) 3] ls /hdfsEdit/ledgers/edits_00000000000000235
edits_000000000000002352_000000000000002353
edits_000000000000002357_000000000000002358
edits_000000000000002355_000000000000002356
edits_000000000000002350_000000000000002351
edits_000000000000002359_000000000000002360

[zk: localhost:2181(CONNECTED) 2] get /hdfsEdit/maxtxid
2360
cZxid = 0x30000002d
ctime = Mon Jul 30 05:25:32 EDT 2012
mZxid = 0xb00000860
mtime = Mon Aug 06 15:09:36 EDT 2012
pZxid = 0x30000002d
cversion = 0
dataVersion = 681
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 4
numChildren = 0

The file edits_0000000000000002354-0000000000000002354 exists only on the active NN.
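The comparison of the listings above can be sketched as a small, self-contained Java program. This is only an illustration under stated assumptions: the class SegmentCheck and its methods are hypothetical names, not part of HDFS or the BookKeeper journal manager, and the canStartSegment guard is an assumed reading of the "We've already seen 2354" message in the quoted log, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: hypothetical helper, not an HDFS or BookKeeper API. */
public class SegmentCheck {

    /** First txid of a name like edits_<first>-<last> (local) or edits_<first>_<last> (shared). */
    static long firstTxId(String name) {
        String[] parts = name.substring("edits_".length()).split("[-_]");
        return Long.parseLong(parts[0]); // leading zeros are accepted by parseLong
    }

    /** Local segments whose first txid does not start any segment in shared storage. */
    static List<String> onlyLocal(List<String> local, List<String> shared) {
        Set<Long> sharedStarts = new HashSet<>();
        for (String s : shared) {
            sharedStarts.add(firstTxId(s));
        }
        List<String> result = new ArrayList<>();
        for (String l : local) {
            if (!sharedStarts.contains(firstTxId(l))) {
                result.add(l);
            }
        }
        return result;
    }

    /** Assumed guard: a new segment may only start above the recorded maxtxid. */
    static boolean canStartSegment(long txId, long maxTxId) {
        return txId > maxTxId;
    }

    public static void main(String[] args) {
        // Names copied from the active NN and shared storage listings above.
        List<String> active = Arrays.asList(
            "edits_0000000000000002354-0000000000000002354",
            "edits_0000000000000002355-0000000000000002356",
            "edits_0000000000000002357-0000000000000002358",
            "edits_0000000000000002359-0000000000000002360");
        List<String> shared = Arrays.asList(
            "edits_000000000000002352_000000000000002353",
            "edits_000000000000002357_000000000000002358",
            "edits_000000000000002355_000000000000002356",
            "edits_000000000000002350_000000000000002351",
            "edits_000000000000002359_000000000000002360");

        // prints [edits_0000000000000002354-0000000000000002354]
        System.out.println(onlyLocal(active, shared));
        // prints false: 2354 is not above maxtxid 2360
        System.out.println(canStartSegment(2354L, 2360L));
    }
}
```

Segments are matched on their first txid because the local and shared names use different zero padding and separators.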
When the standby NN becomes active it loads edit log 2354, but 2354 < 2360 (maxtxid), so the standby NN throws an exception and shuts down.

> standby namenode become ative fail ,because starting log segment fail on share strage
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-3769
>                 URL: https://issues.apache.org/jira/browse/HDFS-3769
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.0.0-alpha
>         Environment: 3 datanode:158.1.132.18,158.1.132.19,160.161.0.143
> 2 namenode:158.1.131.18,158.1.132.19
> 3 zk:158.1.132.18,158.1.132.19,160.161.0.143
> 3 bookkeeper:158.1.132.18,158.1.132.19,160.161.0.143
> ensemble-size:2,quorum-size:2
>            Reporter: liaowenrui
>            Priority: Critical
>             Fix For: 2.1.0-alpha, 2.0.1-alpha
>
>
> 2012-08-06 15:09:46,264 ERROR org.apache.hadoop.contrib.bkjournal.utils.RetryableZookeeper: Node /ledgers/available already exists and this is not a retry
> 2012-08-06 15:09:46,264 INFO org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager: Successfully created bookie available path : /ledgers/available
> 2012-08-06 15:09:46,273 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering unfinalized segments in /opt/namenodeHa/hadoop-2.0.1/hadoop-root/dfs/name/current
> 2012-08-06 15:09:46,277 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest edits from old active before taking over writer role in edits logs.
> 2012-08-06 15:09:46,363 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication and invalidation queues...
> 2012-08-06 15:09:46,363 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all datandoes as stale
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of blocks = 239
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid blocks = 0
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of under-replicated blocks = 0
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of over-replicated blocks = 0
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks being written = 0
> 2012-08-06 15:09:46,383 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing edit logs at txnid 2354
> 2012-08-06 15:09:46,471 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 2354
> 2012-08-06 15:09:46,472 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 2354 failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@4eda1515, stream=null))
> java.io.IOException: We've already seen 2354. A new stream cannot be created with it
>     at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.startLogSegment(BookKeeperJournalManager.java:297)
>     at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:86)
>     at org.apache.hadoop.hdfs.server.namenode.JournalSet$2.apply(JournalSet.java:182)
>     at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:319)
>     at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:179)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:894)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:268)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:618)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1322)
>     at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
>     at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1230)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:990)
>     at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
>     at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
>

--
This message is automatically generated by JIRA.