[ https://issues.apache.org/jira/browse/HDFS-15468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211561#comment-17211561 ]

Ayush Saxena edited comment on HDFS-15468 at 10/10/20, 5:50 AM:
----------------------------------------------------------------

Thanx [~kpalanisamy] for the report. I am not sure it is related to safemode; I could repro this without the namenode being in safemode.
 Got some similar exception traces:
{noformat}
127.0.0.1:59233: Can't write, no segment open ; journal id: myjournal
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:544)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:405)
{noformat}
and
{noformat}
2020-10-10 10:30:30,058 [FSEditLogAsync] ERROR namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(406)) - Error: flush failed for (journal JournalAndStream(mgr=QJM to [127.0.0.1:59233, 127.0.0.1:59235, 127.0.0.1:59237], stream=QuorumOutputStream starting at txid 1))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 1 successful responses:
127.0.0.1:59237: null [success]
2 exceptions thrown:
{noformat}
Please check whether, on your side, it specifically happens only with safemode.

[~Amithsha] Let me know if you want to try to reproduce this; I wrote a UT for it that you can try.
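
A rough sketch of how such a repro test could be wired up is below. This is only an illustration, not the exact UT mentioned above; it assumes the MiniQJMHACluster/MiniJournalCluster test utilities behave as in the other QJM tests.
{code:java}
// Rough sketch only, not the exact UT referenced above. Assumes the
// MiniQJMHACluster / MiniJournalCluster test utilities behave as in the
// other QJM tests; adjust to the actual branch as needed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.qjournal.MiniQJMHACluster;
import org.junit.Test;

public class TestWriteAfterJournalNodeRestart {

  @Test
  public void testFlushFailsWhenTwoJNsRestart() throws Exception {
    Configuration conf = new Configuration();
    MiniQJMHACluster qjCluster = new MiniQJMHACluster.Builder(conf).build();
    try {
      MiniDFSCluster dfs = qjCluster.getDfsCluster();
      dfs.transitionToActive(0);
      DistributedFileSystem fs = dfs.getFileSystem(0);

      // Write something so the current in-progress segment is in use.
      fs.mkdirs(new Path("/before-restart"));

      // Restart two of the three JournalNodes. With no failover and no edit
      // roll in between, their in-progress segment is neither finalized nor
      // recovered, so they answer "Can't write, no segment open".
      qjCluster.getJournalCluster().restartJournalNode(0);
      qjCluster.getJournalCluster().restartJournalNode(1);

      // The next edit flush needs a 2/3 quorum but gets only one success,
      // so the QuorumOutputStream aborts and the Namenode terminates
      // (in a UT this typically surfaces as an ExitException / failed flush).
      fs.mkdirs(new Path("/after-restart"));
    } finally {
      qjCluster.shutdown();
    }
  }
}
{code}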

Regarding the safemode: it is just preventing you from making a write call to the NN; otherwise the NN would have crashed earlier, when the JN went down.

I am not sure there is a fix for this. You can't (and shouldn't) make the JNs recover the last segment, for many reasons. Persisting the Namenode state and making it call startLogSegment again would be too far off the intended design.

Secondly, I think this is expected as well: if you lose the quorum, the Namenode is expected to crash. Traditionally the namenode isn't expected to lose the quorum in any case, and if it does, it is considered an alarming situation.
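
Just to spell out the 2/3 in the logs above: QJM needs a strict majority of the JNs to acknowledge every edit flush, so losing two of three is fatal. A toy illustration of that arithmetic (this is not Hadoop's actual QuorumCall code):
{code:java}
// Toy illustration of the majority rule behind "quorum size 2/3"; this is
// not Hadoop's QuorumCall implementation, only the arithmetic it enforces.
public final class QuorumMath {

  /** Majority required out of n JournalNodes, e.g. 2 of 3, 3 of 5. */
  static int requiredQuorum(int n) {
    return n / 2 + 1;
  }

  /** An edit flush succeeds only if the successes reach that majority. */
  static boolean flushSucceeds(int journalNodes, int successfulResponses) {
    return successfulResponses >= requiredQuorum(journalNodes);
  }

  public static void main(String[] args) {
    // Two of three JNs restarted and refusing writes ("no segment open")
    // leaves a single success: below the required 2, so the flush fails
    // and the active Namenode aborts.
    System.out.println(flushSucceeds(3, 1)); // false
    System.out.println(flushSucceeds(3, 2)); // true
  }
}
{code}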

The Admin should plan maintenance in a way that the Namenode doesn't lose the quorum. I don't think this would happen in a production environment, except maybe if someone is trying out an upgrade without being careful. This is documented as well:

{noformat}
JNs is relatively stable and does not require upgrade when upgrading HDFS in most of the cases......Upgrading JNs and ZKNs may incur cluster downtime.
{noformat}

> Active namenode crashed when no edit recover
> --------------------------------------------
>
>                 Key: HDFS-15468
>                 URL: https://issues.apache.org/jira/browse/HDFS-15468
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, journal-node, namenode
>    Affects Versions: 3.0.0
>            Reporter: Karthik Palanisamy
>            Priority: Critical
>
> If the namenode is under safe mode and two journal nodes are restarted for a maintenance activity, the journal nodes will not finalize the last edit segment, which is still in progress.
>  This last edit segment is finalized or recovered either by an edit-rolling operation or when the epoch changes due to a namenode failover.
>  But in the current scenario there is no failover; the namenode is just under safe mode. If we leave safe mode, the active namenode crashes.
>  I.e. the current open segment is edits_inprogress_0000000010356376710, but it is not recovered or finalized after the JN2 restart. I think we need to recover the edits after a JN restart.
> {code:java}
> Journal node 
> 2020-06-20 16:11:53,458 INFO  server.Journal (Journal.java:scanStorageForLatestEdits(193)) - Latest log is EditLogFile(file=/hadoop/hdfs/journal/xxx/current/edits_inprogress_0000000010356376710,first=0000000010356376710,last=0000000010356376710,inProgress=true,hasCorruptHeader=false)
> 2020-06-20 16:19:06,397 INFO  ipc.Server (Server.java:logException(2435)) - IPC Server handler 3 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from 10.x.x.x:28444 Call#49083225 Retry#0
> org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: Can't write, no segment open
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:484)
> {code}
> {code:java}
> Namenode log:
> org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 1 successful responses:
> 10.x.x.x:8485: null [success]
> 2 exceptions thrown:
> 10.y.y.y:8485: Can't write, no segment open
> {code}


