[jira] [Created] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary failures

Hari Sekhon (JIRA) Thu, 30 Apr 2015 05:54:31 -0700

Hari Sekhon created HDFS-8298:
---------------------------------

             Summary: HA: NameNode should not shut down completely without quorum, doesn't recover from temporary failures
                 Key: HDFS-8298
                 URL: https://issues.apache.org/jira/browse/HDFS-8298
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: ha, HDFS, namenode, qjm
    Affects Versions: 2.6.0
         Environment: HDP 2.2
            Reporter: Hari Sekhon



In an HDFS HA setup, if there is a temporary problem contacting the 
JournalNodes (e.g. a network interruption), the NameNode shuts down entirely. 
It should instead go into a standby mode so that it can stay online and retry 
to achieve quorum later.

If both NameNodes shut themselves down like this, then even after the temporary 
network outage is resolved the entire cluster remains offline indefinitely 
until an operator intervenes, whereas it could have self-repaired after 
re-contacting the JournalNodes and re-achieving quorum.
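
The requested behaviour could be sketched roughly as follows. This is only an illustration under stated assumptions, not Hadoop code: the class `QuorumRetrySketch`, its methods, and the `quorumReachable` check are all hypothetical names, and the real change would have to live in the NameNode's edit log error handling. The idea is that losing quorum moves the node to standby instead of terminating the process, and a retry loop notices when the JournalNodes are reachable again.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not Hadoop code: every name here is illustrative.
final class QuorumRetrySketch {
    enum State { ACTIVE, STANDBY }

    private State state = State.ACTIVE;

    State state() {
        return state;
    }

    // Would be called where the NameNode currently exits after a failed flush:
    // stop acting as active, but keep the process alive.
    void onQuorumLost() {
        state = State.STANDBY;
    }

    // Polls the quorum check up to maxAttempts times, sleeping between tries.
    // Returns true once quorum is reachable again. The node stays in STANDBY;
    // becoming active again is left to the normal failover mechanism.
    boolean awaitQuorum(BooleanSupplier quorumReachable, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (quorumReachable.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        QuorumRetrySketch nn = new QuorumRetrySketch();
        nn.onQuorumLost();
        int[] checks = {0};
        // Simulated outage: the quorum check succeeds on the third attempt.
        boolean recovered = nn.awaitQuorum(() -> ++checks[0] >= 3, 5, 10);
        System.out.println(nn.state() + " recovered=" + recovered + " after " + checks[0] + " checks");
    }
}
```

In this sketch the node deliberately remains in standby after quorum returns, so that only one NameNode is made active again through the usual failover path.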

{code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485], stream=QuorumOutputStream starting at txid 54270281))
java.io.IOException: Interrupted waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
        at java.lang.Thread.run(Thread.java:745)
2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 54270281
2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at <custom_scrubbed>/<ip>
************************************************************/{code}
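
For context on the "quorum of nodes" in the log above: with three JournalNodes, a write needs acknowledgements from a majority, i.e. two of them, within the timeout. The following is a simplified model of such a wait, assuming a plain countdown of acknowledgements. The class `WriteQuorumSketch` and its methods are hypothetical and do not reflect how `AsyncLoggerSet` is actually implemented.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified model of a majority-quorum wait; hypothetical names, not Hadoop code.
final class WriteQuorumSketch {
    private final CountDownLatch pendingAcks;

    WriteQuorumSketch(int journalNodes) {
        // Wait for a majority of the configured journal nodes.
        this.pendingAcks = new CountDownLatch(majority(journalNodes));
    }

    // A majority of n nodes is n/2 + 1 with integer division: 2 of 3, 3 of 5.
    static int majority(int journalNodes) {
        return journalNodes / 2 + 1;
    }

    // Records one successful acknowledgement from a journal node.
    void ack() {
        pendingAcks.countDown();
    }

    // Returns true if a majority acknowledged within the timeout, false otherwise.
    boolean awaitQuorum(long timeoutMillis) {
        try {
            return pendingAcks.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        WriteQuorumSketch flush = new WriteQuorumSketch(3);
        flush.ack(); // only one of three nodes reachable during the outage
        System.out.println("quorum reached: " + flush.awaitQuorum(100));
    }
}
```

When fewer than a majority respond in time, the wait fails. The point of this issue is that such a failure should not have to end in process exit.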

Hari Sekhon
http://www.linkedin.com/in/harisekhon



