[jira] [Updated] (HDFS-15060) namenode doesn't retry JN when other JN goes down

Andrew Timonin (Jira) Fri, 13 Dec 2019 03:37:09 -0800


     [ 
https://issues.apache.org/jira/browse/HDFS-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andrew Timonin updated HDFS-15060:
----------------------------------
    Description: 
While upgrading Hadoop to a new version (following, for example, 
[https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade]
), I ran into the following situation:

I'm upgrading the JNs one by one.
 # Upgrade and restart JN1
 # NN sees JN1 as offline:
 WARN client.QuorumJournalManager: Remote journal 10.73.67.132:8485 failed to 
write txns 1205396-1205399. Will try to write to this JN again after the next 
log roll.
 # No log roll happens for some time (at least 1 min)
 # Upgrade and restart JN2
 # NN sees JN2 as offline too:
 WARN client.QuorumJournalManager: Remote journal 10.73.67.68:8485 failed to 
write txns 1205799-1205800. Will try to write to this JN again after the next 
log roll.
 # But at this point there is no JN quorum: 
FATAL namenode.FSEditLog: Error: flush failed for required journal 
(JournalAndStream(mgr=QJM to [10.73.67.212:8485, 10.73.67.132:8485, 
10.73.67.68:8485], stream=QuorumOutputStream starting at txid 1205246))
 10.73.67.212:8485: null [success]
 2 exceptions thrown:
 10.73.67.132:8485: Journal disabled until next roll
 10.73.67.68:8485: End of File Exception between local host is: 
"srv05.lt01.gismt.crpt.tech/10.73.67.132"; destination host is: 
"srv07.lt01.gismt.crpt.tech":8485; : java.io.EOFException; For more details 
see:  http://wiki.apache.org/hadoop/EOFException
even though JN1 is already back online.

It looks like the NN should retry JNs marked as offline before giving up.

  was:
While upgrading Hadoop to a new version (following, for example, 
[https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade]
), I ran into the following situation:

I'm upgrading the JNs one by one.
 # Upgrade and restart JN1
 # NN sees JN1 as offline:
 WARN client.QuorumJournalManager: Remote journal 10.73.67.132:8485 failed to 
write txns 1205396-1205399. Will try to write to this JN again after the next 
log roll.
 # No log roll happens for some time (at least 1 min)
 # Upgrade and restart JN2
 # NN sees JN2 as offline too:
 WARN client.QuorumJournalManager: Remote journal 10.73.67.68:8485 failed to 
write txns 1205799-1205800. Will try to write to this JN again after the next 
log roll.
 # But at this point there is no JN quorum: 
FATAL namenode.FSEditLog: Error: flush failed for required journal 
(JournalAndStream(mgr=QJM to [10.73.67.212:8485, 10.73.67.132:8485, 
10.73.67.68:8485], stream=QuorumOutputStream starting at txid 1205246)) 
even though JN1 is already back online.

It looks like the NN should retry JNs marked as offline before giving up.


> namenode doesn't retry JN when other JN goes down
> -------------------------------------------------
>
>                 Key: HDFS-15060
>                 URL: https://issues.apache.org/jira/browse/HDFS-15060
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.1
>            Reporter: Andrew Timonin
>            Priority: Minor
>
> While upgrading Hadoop to a new version (following, for example, 
> [https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade]
> ), I ran into the following situation:
> I'm upgrading the JNs one by one.
>  # Upgrade and restart JN1
>  # NN sees JN1 as offline:
>  WARN client.QuorumJournalManager: Remote journal 10.73.67.132:8485 failed to 
> write txns 1205396-1205399. Will try to write to this JN again after the next 
> log roll.
>  # No log roll happens for some time (at least 1 min)
>  # Upgrade and restart JN2
>  # NN sees JN2 as offline too:
>  WARN client.QuorumJournalManager: Remote journal 10.73.67.68:8485 failed to 
> write txns 1205799-1205800. Will try to write to this JN again after the next 
> log roll.
>  # But at this point there is no JN quorum: 
> FATAL namenode.FSEditLog: Error: flush failed for required journal 
> (JournalAndStream(mgr=QJM to [10.73.67.212:8485, 10.73.67.132:8485, 
> 10.73.67.68:8485], stream=QuorumOutputStream starting at txid 1205246))
>  10.73.67.212:8485: null [success]
>  2 exceptions thrown:
>  10.73.67.132:8485: Journal disabled until next roll
>  10.73.67.68:8485: End of File Exception between local host is: 
> "srv05.lt01.gismt.crpt.tech/10.73.67.132"; destination host is: 
> "srv07.lt01.gismt.crpt.tech":8485; : java.io.EOFException; For more details 
> see:  http://wiki.apache.org/hadoop/EOFException
> even though JN1 is already back online.
> It looks like the NN should retry JNs marked as offline before giving up.
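
The sequence of events in the report can be replayed with a small toy model. The sketch below is not Hadoop code: the names Journal, write_txn, log_roll and replay, and the retry_disabled switch, are all hypothetical and exist only for illustration. It assumes three journals, a simple majority quorum, and that a journal which fails a write is skipped until the next log roll, as the WARN messages describe. It also ignores how a journal would catch up on transactions it missed while it was down, which a real fix would have to handle.

```python
from dataclasses import dataclass


@dataclass
class Journal:
    """Toy stand-in for one JournalNode as seen by the NameNode."""
    name: str
    online: bool = True      # whether the JN process is actually reachable
    disabled: bool = False   # NN-side flag: skip this JN until the next log roll


def majority(n: int) -> int:
    """Smallest number of acknowledgements that forms a quorum of n journals."""
    return n // 2 + 1


def write_txn(journals, retry_disabled=False):
    """Try to write one batch of edits and return the names of JNs that acked.

    Raises RuntimeError when fewer than a majority acknowledge the write.
    """
    acked = []
    for jn in journals:
        if jn.disabled:
            continue              # skipped until the next log roll
        if jn.online:
            acked.append(jn.name)
        else:
            jn.disabled = True    # "Will try to write to this JN again after the next log roll"

    if len(acked) < majority(len(journals)) and retry_disabled:
        # Hypothetical behaviour suggested in the report: before giving up,
        # probe journals that were disabled earlier and may be back online.
        for jn in journals:
            if jn.disabled and jn.online:
                jn.disabled = False
                acked.append(jn.name)

    if len(acked) < majority(len(journals)):
        raise RuntimeError(
            f"flush failed: only {len(acked)} of {len(journals)} journals acked")
    return acked


def log_roll(journals):
    """A log roll gives every disabled journal another chance."""
    for jn in journals:
        jn.disabled = False


def replay(retry_disabled):
    """Replay the steps from the report and return the final set of acks."""
    jn1, jn2, jn3 = Journal("JN1"), Journal("JN2"), Journal("JN3")
    journals = [jn1, jn2, jn3]
    jn1.online = False                    # JN1 restarts for the upgrade
    write_txn(journals, retry_disabled)   # JN2 + JN3 ack, JN1 gets disabled
    jn1.online = True                     # JN1 is back, but no log roll yet
    jn2.online = False                    # JN2 restarts for the upgrade
    return write_txn(journals, retry_disabled)
```

In this model, replay(retry_disabled=False) raises RuntimeError because only JN3 acknowledges the last write, which mirrors the FATAL error above. replay(retry_disabled=True) succeeds with acknowledgements from JN3 and JN1. Calling log_roll between the two restarts would also avoid the failure, which matches the observation that the problem only appears when no log roll happens in between.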



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
