[ https://issues.apache.org/jira/browse/HDFS-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Timonin updated HDFS-15060:
----------------------------------
    Description:
When upgrading Hadoop to a new version (following, for example, [https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade]), I ran into the following situation while upgrading the JNs one by one:
 # Upgrade and restart JN1.
 # The NN sees the JN as offline:
 WARN client.QuorumJournalManager: Remote journal 10.73.67.132:8485 failed to write txns 1205396-1205399. Will try to write to this JN again after the next log roll.
 # No log roll happens for some time (at least 1 min).
 # Upgrade and restart JN2.
 # The NN sees the same failure again:
 WARN client.QuorumJournalManager: Remote journal 10.73.67.68:8485 failed to write txns 1205799-1205800. Will try to write to this JN again after the next log roll.
 # BUT! At this point there is no JN quorum:
 FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.73.67.212:8485, 10.73.67.132:8485, 10.73.67.68:8485], stream=QuorumOutputStream starting at txid 1205246))
 10.73.67.212:8485: null [success]
 2 exceptions thrown:
 10.73.67.132:8485: Journal disabled until next roll
 10.73.67.68:8485: End of File Exception between local host is: "srv05.lt01.gismt.crpt.tech/10.73.67.132"; destination host is: "srv07.lt01.gismt.crpt.tech":8485; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException

although JN1 is already back online.

It looks like the NN should retry JNs marked as offline before giving up.

  was:
When I upgrade hadoop to new version (using for ex. [https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade] as instruction) I've got a situation:
I'm upgrading JN's one by one.
 # Upgrade and restart JN1
 # NN see JN offline:
 WARN client.QuorumJournalManager: Remote journal 10.73.67.132:8485 failed to write txns 1205396-1205399. Will try to write to this JN again after the next log roll.
 # No log roll for some time (at least 1min)
 # Upgrade and restart JN2
 # NN see it again:
 WARN client.QuorumJournalManager: Remote journal 10.73.67.68:8485 failed to write txns 1205799-1205800. Will try to write to this JN again after the next log roll.
 # BUT! At this time we have no JN quorum:
 FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.73.67.212:8485, 10.73.67.132:8485, 10.73.67.68:8485], stream=QuorumOutputStream starting at txid 1205246))

although JN1 is online already

It looks like NN should retry JN's marked as offline before giving up.

> namenode doesn't retry JN when other JN goes down
> -------------------------------------------------
>
>                 Key: HDFS-15060
>                 URL: https://issues.apache.org/jira/browse/HDFS-15060
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.1
>            Reporter: Andrew Timonin
>            Priority: Minor
>
> When upgrading Hadoop to a new version (following, for example,
> [https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade]),
> I ran into the following situation while upgrading the JNs one by one:
>  # Upgrade and restart JN1.
>  # The NN sees the JN as offline:
>  WARN client.QuorumJournalManager: Remote journal 10.73.67.132:8485 failed to
> write txns 1205396-1205399. Will try to write to this JN again after the next
> log roll.
>  # No log roll happens for some time (at least 1 min).
>  # Upgrade and restart JN2.
>  # The NN sees the same failure again:
>  WARN client.QuorumJournalManager: Remote journal 10.73.67.68:8485 failed to
> write txns 1205799-1205800. Will try to write to this JN again after the next
> log roll.
>  # BUT! At this point there is no JN quorum:
>  FATAL namenode.FSEditLog: Error: flush failed for required journal
> (JournalAndStream(mgr=QJM to [10.73.67.212:8485, 10.73.67.132:8485,
> 10.73.67.68:8485], stream=QuorumOutputStream starting at txid 1205246))
>  10.73.67.212:8485: null [success]
>  2 exceptions thrown:
>  10.73.67.132:8485: Journal disabled until next roll
>  10.73.67.68:8485: End of File Exception between local host is:
> "srv05.lt01.gismt.crpt.tech/10.73.67.132"; destination host is:
> "srv07.lt01.gismt.crpt.tech":8485; : java.io.EOFException; For more details
> see: http://wiki.apache.org/hadoop/EOFException
> although JN1 is already back online.
> It looks like the NN should retry JNs marked as offline before giving up.
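
The sequence reported above can be replayed with a small toy model. This is only a sketch under stated assumptions, not Hadoop's QuorumJournalManager code: the names `JournalNode`, `QuorumWriter`, `retry_disabled` and `replay` are hypothetical, three JNs with a simple majority are assumed, and the model ignores that a real JN which missed transactions would presumably have to catch up before it could accept further writes in the same segment.

```python
from dataclasses import dataclass


@dataclass
class JournalNode:
    """Hypothetical stand-in for one JournalNode as seen by the NameNode."""
    name: str
    online: bool = True      # is the JN process actually reachable?
    disabled: bool = False   # has the writer given up on it until the next roll?


@dataclass
class QuorumWriter:
    """Toy model of a quorum edit-log writer (not Hadoop's real implementation)."""
    nodes: list
    retry_disabled: bool = False  # assumption: the behaviour the report asks for

    def majority(self) -> int:
        return len(self.nodes) // 2 + 1

    def write(self) -> bool:
        """Attempt one flush; return True if a majority of JNs acknowledged it."""
        acks = 0
        for jn in self.nodes:
            if jn.disabled and not self.retry_disabled:
                continue            # "Journal disabled until next roll"
            if jn.online:
                jn.disabled = False
                acks += 1
            else:
                jn.disabled = True  # failed write: skip this JN until the next roll
        return acks >= self.majority()

    def roll(self) -> None:
        """A log roll re-enables every journal for the next segment."""
        for jn in self.nodes:
            jn.disabled = False


def replay(retry_disabled: bool, roll_between: bool = False) -> list:
    """Replay the reported upgrade sequence; return the result of each flush."""
    jn1, jn2, jn3 = (JournalNode(n) for n in ("JN1", "JN2", "JN3"))
    writer = QuorumWriter([jn1, jn2, jn3], retry_disabled=retry_disabled)
    results = []
    jn1.online = False              # JN1 restarts for the upgrade
    results.append(writer.write())  # JN2 and JN3 ack; JN1 is marked disabled
    jn1.online = True               # JN1 is back, but no log roll has happened
    if roll_between:
        writer.roll()
    jn2.online = False              # JN2 restarts for the upgrade
    results.append(writer.write())
    return results
```

In this model `replay(retry_disabled=False)` loses the quorum on the second flush even though JN1 is reachable again, while `replay(retry_disabled=True)` keeps it. `replay(retry_disabled=False, roll_between=True)` also keeps it, which suggests that forcing a log roll between JN restarts (for example with `hdfs dfsadmin -rollEdits`) may avoid the failure during a rolling upgrade; that is an inference from the log messages, not a confirmed fix.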