[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820302#comment-16820302
 ] 

Anu Engineer edited comment on HDFS-13596 at 4/17/19 5:03 PM:
--------------------------------------------------------------

Not to add more noise, but this might be a good opportunity for us to learn a 
trick or two from our brethren in the HBase land.  HBase has this nice ability 
to upgrade to 3.0, but will not enable 3.0 features unless another command or 
setting is applied. We actually have a very similar situation here, we might 
have changes in Edit logs, but let us not allow that feature to be used until 
after the main step is completely done and we have some way of verifying that 
nothing is broken. Then you can enable the full 3.0 features once the full 
upgrade is done. (Thanks to [~ccondit] for educating me on how good HBase in 
doing this, and letting me know that Ozone should probably learn from that 
experience).

 

if we do this, Rolling upgrade would be two steps, upgrade, then enable 3.0 
features like EC.  Till enable call is done, HDFS will not allow 3.0 features 
like EC.

 


was (Author: anu):
Not to add more noise, but this might be a good opportunity for us to learn a 
trick or two from our brethren in the HBase land.  HBase has this nice ability 
to upgrade to 3.0, but will not enable 3.0 features unless another command or 
setting is applied. We actually have a very similar situation here, we might 
have changes in Edit logs, but let us not allow that feature to be used until 
after the main step is completely done and we have some way of verifying that 
nothing is broken. Then you can enable the full 3.0 features once the full 
upgrade is done. (Thanks to [~ccondit] for educating me on how good HBase does 
this, and letting me know that Ozone should probably learn from that 
experience).

 

if we do this, Rolling upgrade would be two steps, upgrade, then enable 3.0 
features like EC.  Till enable call is done, HDFS will not allow 3.0 features 
like EC.

 

> NN restart fails after RollingUpgrade from 2.x to 3.x
> -----------------------------------------------------
>
>                 Key: HDFS-13596
>                 URL: https://issues.apache.org/jira/browse/HDFS-13596
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>            Reporter: Hanisha Koneru
>            Assignee: Fei Hui
>            Priority: Critical
>         Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, 
> HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, 
> HDFS-13596.006.patch, HDFS-13596.007.patch
>
>
> After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails 
> while replaying edit logs.
>  * After NN is started with rollingUpgrade, the layoutVersion written to 
> editLogs (before finalizing the upgrade) is the pre-upgrade layout version 
> (so as to support downgrade).
>  * When writing transactions to log, NN writes as per the current layout 
> version. In 3.x, erasureCoding bits are added to the editLog transactions.
>  * So any edit log written after the upgrade and before finalizing the 
> upgrade will have the old layout version but the new format of transactions.
>  * When NN is restarted and the edit logs are replayed, the NN reads the old 
> layout version from the editLog file. When parsing the transactions, it 
> assumes that the transactions are also from the previous layout and hence 
> skips parsing the erasureCoding bits.
>  * This cascades into reading the wrong set of bits for other fields and 
> leads to NN shutting down.
> Sample error output:
> {code:java}
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected 
> length 16
>  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
>  at 
> org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
>  at 
> org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> 2018-05-17 19:10:06,522 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
> loading fsimage
> java.io.IOException: java.lang.IllegalStateException: Cannot skip to less 
> than the current value (=16389), where newValue=16388
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> Caused by: java.lang.IllegalStateException: Cannot skip to less than the 
> current value (=16389), where newValue=16388
>  at org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1943)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to