[ https://issues.apache.org/jira/browse/HDFS-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429687#comment-13429687 ]

liaowenrui commented on HDFS-3769:
----------------------------------

When edit log transaction 2354 is written only to the local disk, and the active 
NN is restarted and then becomes active again, it writes transaction 2355 to the 
shared storage. Since the write of 2354 to the shared storage failed, the 
corresponding FS operation failed as well, so 2354 is not a valid transaction. 
But this scenario causes the standby NN to fail when it tries to become active.

My modification: add a fallback call streams = editLog.selectInputStreams(lastTxnId + 2, 0, 
null, false); in the doTailEdits function:


 private void doTailEdits() throws IOException, InterruptedException {
    // Write lock needs to be interruptible here because the 
    // transitionToActive RPC takes the write lock before calling
    // tailer.stop() -- so if we're not interruptible, it will
    // deadlock.
    namesystem.writeLockInterruptibly();
    try {
      FSImage image = namesystem.getFSImage();

      long lastTxnId = image.getLastAppliedTxId();
      
      if (LOG.isDebugEnabled()) {
        LOG.debug("lastTxnId: " + lastTxnId);
      }
      Collection<EditLogInputStream> streams;
      try {
        streams = editLog.selectInputStreams(lastTxnId + 1, 0, null, false);
      } catch (IOException ioe) {
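        // Fall back to lastTxnId + 2: skip the transaction that never made it
        // to the shared storage and tail from the next one instead.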
        try {
          streams = editLog.selectInputStreams(lastTxnId + 2, 0, null, false);
        } catch (IOException ioe1) {
          // This is acceptable. If we try to tail edits in the middle of an
          // edits log roll, i.e. the last one has been finalized but the new
          // in-progress edits file hasn't been started yet.
          LOG.warn("Edits tailer failed to find any streams. Will try again " +
              "later.", ioe);
          return;
        }
      }
      if (LOG.isDebugEnabled()) {
        LOG.debug("edit streams to load from: " + streams.size());
      }
      
      // Once we have streams to load, errors encountered are legitimate cause
      // for concern, so we don't catch them here. Simple errors reading from
      // disk are ignored.
      long editsLoaded = 0;
      try {
        editsLoaded = image.loadEdits(streams, namesystem, null);
      } catch (EditLogInputException elie) {
        editsLoaded = elie.getNumEditsLoaded();
        throw elie;
      } finally {
        if (editsLoaded > 0 || LOG.isDebugEnabled()) {
          LOG.info(String.format("Loaded %d edits starting from txid %d ",
              editsLoaded, lastTxnId));
        }
      }

      if (editsLoaded > 0) {
        lastLoadTimestamp = now();
      }
      lastLoadedTxnId = image.getLastAppliedTxId();
    } finally {
      namesystem.writeUnlock();
    }
  }
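
To show the idea of the fallback in isolation, here is a small self-contained 
sketch (not HDFS code: StreamSelector and selectWithSkip are hypothetical 
stand-ins for FSEditLog#selectInputStreams and the retry above, and the txid 
values just mirror the scenario described earlier). It tries the requested 
transaction first and skips exactly one unavailable transaction id before 
giving up:

import java.io.IOException;
import java.util.Collection;
import java.util.Collections;

public class SkipRetrySketch {

  /** Hypothetical stand-in for FSEditLog#selectInputStreams. */
  interface StreamSelector {
    Collection<String> select(long fromTxId) throws IOException;
  }

  /**
   * Try fromTxId first; if no stream starts there, try fromTxId + 1,
   * i.e. skip the single transaction that never reached the shared storage.
   */
  static Collection<String> selectWithSkip(StreamSelector selector,
      long fromTxId) throws IOException {
    try {
      return selector.select(fromTxId);
    } catch (IOException first) {
      try {
        return selector.select(fromTxId + 1);
      } catch (IOException second) {
        // Neither attempt found streams; the caller would log a warning and
        // retry later, just like doTailEdits() does.
        return Collections.emptyList();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulated shared storage where a segment starts at 2355 but not 2354.
    StreamSelector journal = fromTxId -> {
      if (fromTxId == 2354) {
        throw new IOException("no stream starting at txid 2354");
      }
      return Collections.singletonList("stream@" + fromTxId);
    };
    System.out.println(selectWithSkip(journal, 2354)); // prints [stream@2355]
  }
}

The real change above keeps the same shape but stays inside the existing 
try/catch of doTailEdits(), so a failure of both attempts is still treated as 
"try again later" rather than as an error.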

                
> standby namenode fails to become active because starting a log segment fails 
> on shared storage
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-3769
>                 URL: https://issues.apache.org/jira/browse/HDFS-3769
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.0.0-alpha
>         Environment: 3 datanode:158.1.132.18,158.1.132.19,160.161.0.143
> 2 namenode:158.1.131.18,158.1.132.19
> 3 zk:158.1.132.18,158.1.132.19,160.161.0.143
> 3 bookkeeper:158.1.132.18,158.1.132.19,160.161.0.143
> ensemble-size:2,quorum-size:2
>            Reporter: liaowenrui
>            Priority: Critical
>             Fix For: 2.1.0-alpha, 2.0.1-alpha
>
>
> 2012-08-06 15:09:46,264 ERROR 
> org.apache.hadoop.contrib.bkjournal.utils.RetryableZookeeper: Node 
> /ledgers/available already exists and this is not a retry
> 2012-08-06 15:09:46,264 INFO 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager: Successfully 
> created bookie available path : /ledgers/available
> 2012-08-06 15:09:46,273 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in 
> /opt/namenodeHa/hadoop-2.0.1/hadoop-root/dfs/name/current
> 2012-08-06 15:09:46,277 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest 
> edits from old active before taking over writer role in edits logs.
> 2012-08-06 15:09:46,363 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication 
> and invalidation queues...
> 2012-08-06 15:09:46,363 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all 
> datandoes as stale
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of 
> blocks            = 239
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid 
> blocks          = 0
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of 
> under-replicated blocks = 0
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of  
> over-replicated blocks = 0
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks 
> being written    = 0
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing 
> edit logs at txnid 2354
> 2012-08-06 15:09:46,471 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 2354
> 2012-08-06 15:09:46,472 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 
> 2354 failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@4eda1515,
>  stream=null))
> java.io.IOException: We've already seen 2354. A new stream cannot be created 
> with it
>         at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.startLogSegment(BookKeeperJournalManager.java:297)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:86)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$2.apply(JournalSet.java:182)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:319)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:179)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:894)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:268)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:618)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1322)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1230)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:990)
>         at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
>         at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
>                                                                               
>                                         


        
