[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154414#comment-17154414 ] fengwu edited comment on HDFS-13596 at 7/10/20, 7:09 AM: - [~_ph] , At first I had the same view as you, when I saw this comment :https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16911102=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16911102 I tested downgrade from hdfs 3.2.1 to 2.8.2 success,but 2.7 ~ 2.7.2 failed. So, This can be understood as roll downgrading the current 3.x can only be downgraded to the 2.8+ , agree ? [~hanishakoneru] [~ferhui] was (Author: fengwu99): [~_ph] , At first I had the same view as you, when I saw this comment :https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16911102=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16911102 I tested downgrade from hdfs 3.2.1 to 2.8.2 success,but 2.7 failed. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Blocker > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, > HDFS-13596.009.patch, HDFS-13596.010.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154414#comment-17154414 ] fengwu edited comment on HDFS-13596 at 7/10/20, 7:00 AM: - [~_ph] , At first I had the same view as you, when I saw this comment :https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16911102=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16911102 I tested downgrade from hdfs 3.2.1 to 2.8.2 success,but 2.7 failed. was (Author: fengwu99): [~_ph] , At first I had the same view as you, when I saw this comment :[https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16911102=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16911102] > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Blocker > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, > HDFS-13596.009.patch, HDFS-13596.010.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154336#comment-17154336 ] fengwu edited comment on HDFS-13596 at 7/9/20, 9:12 AM: [~_ph] Thanks. In your case, used rollback not downgrade . Rollback will lost data, I want a way to roll downgrade do you think it works? was (Author: fengwu99): [~_ph] Thanks. In your case, used rollback not downgrade . Rollback will lost data, I want a way to roll downgrade . > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Blocker > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, > HDFS-13596.009.patch, HDFS-13596.010.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154318#comment-17154318 ] Yuriy Malygin edited comment on HDFS-13596 at 7/9/20, 8:21 AM: --- [~fengwu99] yes, in my case (https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16826162=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826162 + https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16826939=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826939) downgrade to 2.7.3 was successfully from 3.1.2 and 3.3.0-SNAPSHOT versions. was (Author: _ph): [~fengwu99] yes, in my case (https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16826162=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826162) downgrade to 2.7.3 was successfully from 3.1.2 and 3.3.0-SNAPSHOT versions. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Blocker > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, > HDFS-13596.009.patch, HDFS-13596.010.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154238#comment-17154238 ] fengwu edited comment on HDFS-13596 at 7/9/20, 7:02 AM: Hi, [~_ph] ! Yes, downgrade before namenodes and not finalize upgrade. Have you tested downgrade to 2.x from 3.x ? Datanode downgrade successfully ? was (Author: fengwu99): Hi, [~_ph] ! Yes, downgrade before namenodes and not finalize upgrade. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Blocker > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, > HDFS-13596.009.patch, HDFS-13596.010.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153543#comment-17153543 ] fengwu edited comment on HDFS-13596 at 7/8/20, 12:23 PM: - Hi, Found in my test downgrade from 3.1.3 to 2.7.2, namenode successful but datanode failed. because different DN layout versions. {code:java} // code placeholder 2020-07-06 14:45:01,313 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/hadoop/dfs. Reported: -57. Expecting = -56. 2020-07-06 14:45:01,315 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data/hadoop/dfs/in_use.lock acquired by nodename 21258@test-v03 2020-07-06 14:45:01,315 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/hadoop/dfs. Reported: -57. Expecting = -56. 2020-07-06 14:45:01,315 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to test-v01/10.110.228.21:8020. Exiting. java.io.IOException: All specified directories are failed to load. at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478) at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358) at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802) at java.lang.Thread.run(Thread.java:748) {code} Does anyone know the reason? Or, is there only one way to rollback datanode ? was (Author: fengwu99): Hi, Found in my test downgrade from 3.1.3 to 2.7.2, namenode successful but datanode failed. because different DN layout versions. {code:java} // code placeholder 2020-07-06 14:45:01,313 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/hadoop/dfs. Reported: -57. Expecting = -56. 2020-07-06 14:45:01,315 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data/hadoop/dfs/in_use.lock acquired by nodename 21258@test-v03 2020-07-06 14:45:01,315 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/hadoop/dfs. Reported: -57. Expecting = -56. 2020-07-06 14:45:01,315 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to test-v01/10.110.228.21:8020. Exiting. java.io.IOException: All specified directories are failed to load. at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478) at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358) at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802) at java.lang.Thread.run(Thread.java:748) {code} Does anyone know the reason? > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Blocker > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, > HDFS-13596.009.patch, HDFS-13596.010.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910968#comment-16910968 ] Fei Hui edited comment on HDFS-13596 at 8/20/19 3:47 AM: - [~jojochuang] I find the commit "Fix potential FSImage corruption." in 2.8.6-snapshot ,2.9.2,2.9.3-snapshot, 3.0.4-snapshot, 3.1.1-snapshot, 3.2.0, 3.2.1-snapshot, 3.3.0-snapshot. With this fix downgrade from 3.0.0~3.0.3,3.1.0~3.1.1 to 2.8.0~2.8.5,2.9.0~2.9.1 will success. By the way i think this issue resolve edit log incompatible problem and "Fix potential FSImage corruption." is string table incimpatible. we can file another jira to resolve it ? was (Author: ferhui): [~jojochuang] I find the commit "Fix potential FSImage corruption." in 2.8.6-snapshot ,2.9.2-snapshot, 3.0.4-snapshot, 3.1.1-snapshot, 3.2.0, 3.2.1-snapshot, 3.3.0-snapshot. With this fix downgrade from 3.0.0~3.0.3,3.1.0~3.1.1 to 2.8.0~2.8.5,2.9.0~2.9.1 will success. By the way i think this issue resolve edit log incompatible problem and "Fix potential FSImage corruption." is string table incimpatible. we can file another jira to resolve it ? > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Blocker > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, > HDFS-13596.009.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843989#comment-16843989 ] Yuriy Malygin edited comment on HDFS-13596 at 5/20/19 2:08 PM: --- [~brahmareddy] yes, test was with _dfs.block.access.token.enable=true_ and with full access to HDFS (read/write). After rollback (its possible only with downtime) all new data was lost and filesystem return to pre-update state. was (Author: _ph): [~brahmareddy] yes, test was with _dfs.block.access.token.enable=true_ and with full access to HDFS (read/write). After rollback all new data was lost and filesystem return to pre-update state. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843989#comment-16843989 ] Yuriy Malygin edited comment on HDFS-13596 at 5/20/19 2:07 PM: --- [~brahmareddy] yes, test was with _dfs.block.access.token.enable=true_ and with full access to HDFS (read/write). After rollback all new data was lost and filesystem return to pre-update state. was (Author: _ph): [~brahmareddy] yes, test was with _dfs.block.access.token.enable=true_ and with full access to HDFS (read/write). > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843989#comment-16843989 ] Yuriy Malygin edited comment on HDFS-13596 at 5/20/19 2:05 PM: --- [~brahmareddy] yes, test was with _dfs.block.access.token.enable=true_ and with full access to HDFS (read/write). was (Author: _ph): [~brahmareddy] yes, test was with dfs.block.access.token.enable=true and with full access to HDFS (read/write). > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826728#comment-16826728 ] Fei Hui edited comment on HDFS-13596 at 4/26/19 7:37 AM: - [~_ph] Thanks for your testing. There are two problems. One is due to EC. Please see https://issues.apache.org/jira/browse/HDFS-14396 Another is due to StringTable. It occurs downgrade from 3.3 to 2.7 and I guess the root cause is the commit as follow. {code} commit 8a41edb089fbdedc5e7d9a2aeec63d126afea49f Author: Vinayakumar B Date: Mon Oct 15 15:48:26 2018 +0530 Fix potential FSImage corruption. Contributed by Daryn Sharp. {code} StringTable change is incompatible. I revert this commit and It works. I could not find JIRA ID because it was not within commit message was (Author: ferhui): [~_ph] Thanks for your testing. There are two problems. One is due to EC. Please see https://issues.apache.org/jira/browse/HDFS-14396 Another is due to StringTable. It occurs downgrade from 3.3 to 2.7 and I guess the root cause is the commit as follow. {code} commit 8a41edb089fbdedc5e7d9a2aeec63d126afea49f Author: Vinayakumar B Date: Mon Oct 15 15:48:26 2018 +0530 Fix potential FSImage corruption. Contributed by Daryn Sharp. {code} StringTable change is incompatible. I revert this commit and It works. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826162#comment-16826162 ] Yuriy Malygin edited comment on HDFS-13596 at 4/25/19 3:31 PM: --- I'm testing last patch on a cluster with Kerberos: # download sources from [https://github.com/apache/hadoop] # apply patch _HDFS-13596.007.patch_ # build patched version of _hadoop-3.3.0-SNAPSHOT_ # prepare running cluster (_hadoop-2.7.3_) to _rollingUpgrade_ # run _hadoop-3.3.0-SNAPSHOT_ in _rollingUpgrade_ mode # upload test data to HDFS # restart NN and all is fine # stop cluster for rollback to hadoop-2.7.3 # hadoop-2.7.3 NN start fails with _ArrayIndexOutOfBoundsException_: {code:java} INFO | jvm 1| 2019/04/25 18:01:24 | STARTUP_MSG: build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff; compiled by 'vinodkv' on 2016-08-18T01:01Z INFO | jvm 1| 2019/04/25 18:01:24 | STARTUP_MSG: java = 1.8.0_102 INFO | jvm 1| 2019/04/25 18:01:24 | / INFO | jvm 1| 2019/04/25 18:01:24 | 2019-04-25 18:01:24,767 INFO [Thread-3] NameNode - registered UNIX signal handlers for [TERM, HUP, INT] INFO | jvm 1| 2019/04/25 18:01:24 | 2019-04-25 18:01:24,769 INFO [Thread-3] NameNode - createNameNode [-rollingUpgrade, rollback] {code} {code:java} INFO | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342 DEBUG [Thread-3] FSImage - Planning to load edit log stream: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@53468702 INFO | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342 DEBUG [Thread-3] FSImage - Planning to load edit log stream: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@5e350e83 INFO | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342 DEBUG [Thread-3] FSImage - Planning to load edit log stream: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@cf86362 INFO | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342 DEBUG [Thread-3] FSImage - Planning to load edit log stream: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@840d683 INFO | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342 INFO [Thread-3] FSImage - Planning to load image: FSImageFile(file=/one/hadoop-data/dfs/current/fsimage_rollback_677, cpktTxId=677) INFO | jvm 1| 2019/04/25 18:01:30 | Total time for which application threads were stopped: 0.0007557 seconds, Stopping threads took: 0.0001181 seconds INFO | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,377 ERROR [Thread-3] FSImage - Failed to load image from FSImageFile(file=/one/hadoop-data/dfs/current/fsimage_rollback_677, cpktTxId=677) INFO | jvm 1| 2019/04/25 18:01:30 | java.lang.ArrayIndexOutOfBoundsException: 536870913 INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadStringTableSection(FSImageFormatProtobuf.java:318) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:251) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:182) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:963) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:947) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:746) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:677) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:294) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:976) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645) INFO | jvm 1| 2019/04/25 18:01:30 | at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:812)
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822981#comment-16822981 ] Yuriy Malygin edited comment on HDFS-13596 at 4/22/19 10:06 AM: The same problem - HDFS-13236. I can repeat this problem with NN on all 3.x versions (with Kerberos too). I have special dedicated cluster for test and I can help with tests. In my quest rollingUgrade (2.7 -> 3.x) was successfully only in case when there were no write operations to the cluster until finalize. was (Author: _ph): The same problem - HDFS-13236. I can repeat this problem with NN on all 3.x versions (with Kerberos too). I have special dedicated cluster for test and I can help with tests. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822981#comment-16822981 ] Yuriy Malygin edited comment on HDFS-13596 at 4/22/19 10:06 AM: The same problem - HDFS-13236. I can repeat this problem with NN on all 3.x versions (with Kerberos too). I have special dedicated cluster for test and I can help with tests. In my quest rollingUgrade (2.7 -> 3.x) was successfully only in case when there were no write operations to the cluster until finalize. was (Author: _ph): The same problem - HDFS-13236. I can repeat this problem with NN on all 3.x versions (with Kerberos too). I have special dedicated cluster for test and I can help with tests. In my quest rollingUgrade (2.7 -> 3.x) was successfully only in case when there were no write operations to the cluster until finalize. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822850#comment-16822850 ] Brahma Reddy Battula edited comment on HDFS-13596 at 4/22/19 3:37 AM: -- {quote}After rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure {quote} [~zgw] and [~maxigang] thanks for reporting token error. Looks HDFS-9807 and HDFS-6708 introduced two fields in BlockTokenIdenfier.AFAIK,Here namenode created token with all this new fileds and DataNode doesn't have these fields hence token verification is failing. [~ferhui] and [~starphin], did you tried with security enabled ( mainly blockToken enabled)? was (Author: brahmareddy): {quote}After rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure {quote} [~zgw] and [~maxigang] thanks for reporting token error. Looks HDFS-9807 and HDFS-6708 introduced two fields in BlockTokenIdenfier hence token verification is failing. AFAIK,Here namenode created token with all this new fileds and DataNode doesn't have these fields hence token verification is failing. [~ferhui] and [~starphin], did you tried with security enabled ( mainly blockToken enabled)? > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820302#comment-16820302 ] Anu Engineer edited comment on HDFS-13596 at 4/17/19 5:03 PM: -- Not to add more noise, but this might be a good opportunity for us to learn a trick or two from our brethren in the HBase land. HBase has this nice ability to upgrade to 3.0, but will not enable 3.0 features unless another command or setting is applied. We actually have a very similar situation here, we might have changes in Edit logs, but let us not allow that feature to be used until after the main step is completely done and we have some way of verifying that nothing is broken. Then you can enable the full 3.0 features once the full upgrade is done. (Thanks to [~ccondit] for educating me on how good HBase in doing this, and letting me know that Ozone should probably learn from that experience). if we do this, Rolling upgrade would be two steps, upgrade, then enable 3.0 features like EC. Till enable call is done, HDFS will not allow 3.0 features like EC. was (Author: anu): Not to add more noise, but this might be a good opportunity for us to learn a trick or two from our brethren in the HBase land. HBase has this nice ability to upgrade to 3.0, but will not enable 3.0 features unless another command or setting is applied. We actually have a very similar situation here, we might have changes in Edit logs, but let us not allow that feature to be used until after the main step is completely done and we have some way of verifying that nothing is broken. Then you can enable the full 3.0 features once the full upgrade is done. (Thanks to [~ccondit] for educating me on how good HBase does this, and letting me know that Ozone should probably learn from that experience). if we do this, Rolling upgrade would be two steps, upgrade, then enable 3.0 features like EC. Till enable call is done, HDFS will not allow 3.0 features like EC. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820302#comment-16820302 ] Anu Engineer edited comment on HDFS-13596 at 4/17/19 5:03 PM: -- Not to add more noise, but this might be a good opportunity for us to learn a trick or two from our brethren in the HBase land. HBase has this nice ability to upgrade to 3.0, but will not enable 3.0 features unless another command or setting is applied. We actually have a very similar situation here, we might have changes in Edit logs, but let us not allow that feature to be used until after the main step is completely done and we have some way of verifying that nothing is broken. Then you can enable the full 3.0 features once the full upgrade is done. (Thanks to [~ccondit] for educating me on how good HBase is in doing this, and letting me know that Ozone should probably learn from that experience). if we do this, Rolling upgrade would be two steps, upgrade, then enable 3.0 features like EC. Till enable call is done, HDFS will not allow 3.0 features like EC. was (Author: anu): Not to add more noise, but this might be a good opportunity for us to learn a trick or two from our brethren in the HBase land. HBase has this nice ability to upgrade to 3.0, but will not enable 3.0 features unless another command or setting is applied. We actually have a very similar situation here, we might have changes in Edit logs, but let us not allow that feature to be used until after the main step is completely done and we have some way of verifying that nothing is broken. Then you can enable the full 3.0 features once the full upgrade is done. (Thanks to [~ccondit] for educating me on how good HBase in doing this, and letting me know that Ozone should probably learn from that experience). if we do this, Rolling upgrade would be two steps, upgrade, then enable 3.0 features like EC. Till enable call is done, HDFS will not allow 3.0 features like EC. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820302#comment-16820302 ] Anu Engineer edited comment on HDFS-13596 at 4/17/19 5:02 PM: -- Not to add more noise, but this might be a good opportunity for us to learn a trick or two from our brethren in the HBase land. HBase has this nice ability to upgrade to 3.0, but will not enable 3.0 features unless another command or setting is applied. We actually have a very similar situation here, we might have changes in Edit logs, but let us not allow that feature to be used until after the main step is completely done and we have some way of verifying that nothing is broken. Then you can enable the full 3.0 features once the full upgrade is done. (Thanks to [~ccondit] for educating me on how good HBase does this, and letting me know that Ozone should probably learn from that experience). if we do this, Rolling upgrade would be two steps, upgrade, then enable 3.0 features like EC. Till enable call is done, HDFS will not allow 3.0 features like EC. was (Author: anu): Not to add more noise, but this might be a good opportunity for us to learn a trick or two from our brethren in the HBase land. HBase has this nice ability to upgrade to 3.0, but will not enable 3.0 features unless another command or setting is applied. We actually have a very similar situation here, we might have changes in Edit logs, but let us not allow that feature to be used until after the main step is complete done and we have some way of verifying that nothing is broken. Then you can enable the full 3.0 features once the full upgrade is done. (Thanks to [~ccondit] for educating me on how good HBase does this, and letting me know that Ozone should probably learn from that experience). if we do this, Rolling upgrade would be two steps, upgrade, then enable 3.0 features like EC. Till enable call is done, HDFS will not allow 3.0 features like EC. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820034#comment-16820034 ] yao edited comment on HDFS-13596 at 4/17/19 12:37 PM: -- after rollingUpgrade NN and DN nodes to 3.2.0 , at this point, use 3.2.0 client run example pi job will failure write failure sample: {quote}19/04/17 11:45:00 INFO client.AHSProxy: Connecting to Application History server at slave01.jd.com/192.168.1.101:10200 19/04/17 11:45:00 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /yarn/staging/hdfs/.staging/job_1555472525840_0001 19/04/17 11:45:00 INFO mapreduce.JobSubmitter: Cleaning up the staging area /yarn/staging/hdfs/.staging/job_1555472525840_0001 org.apache.hadoop.security.AccessControlException: setErasureCodingPolicy not supported. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkErasureCodingSupported(FSNamesystem.java:7797) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setErasureCodingPolicy(FSNamesystem.java:7530) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setErasureCodingPolicy(NameNodeRpcServer.java:2171) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolServerSideTranslatorPB.java:1598) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88) at org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2748) at org.apache.hadoop.hdfs.DistributedFileSystem$65.doCall(DistributedFileSystem.java:2852) at org.apache.hadoop.hdfs.DistributedFileSystem$65.doCall(DistributedFileSystem.java:2849) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2867) at org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:885) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:176) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:133) at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:99) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:194) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:307) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:360) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:368) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820034#comment-16820034 ] yao edited comment on HDFS-13596 at 4/17/19 12:34 PM: -- after rollingUpgrade NN and DN nodes to 3.2.0 , at this point, use 3.2.0 client run example pi job will failure write failure sample: {quote}19/04/17 11:45:00 INFO client.AHSProxy: Connecting to Application History server at slave01.jd.com/192.168.1.101:10200 19/04/17 11:45:00 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /yarn/staging/hdfs/.staging/job_1555472525840_0001 19/04/17 11:45:00 INFO mapreduce.JobSubmitter: Cleaning up the staging area /yarn/staging/hdfs/.staging/job_1555472525840_0001 org.apache.hadoop.security.AccessControlException: setErasureCodingPolicy not supported. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkErasureCodingSupported(FSNamesystem.java:7797) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setErasureCodingPolicy(FSNamesystem.java:7530) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setErasureCodingPolicy(NameNodeRpcServer.java:2171) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolServerSideTranslatorPB.java:1598) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88) at org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2748) at org.apache.hadoop.hdfs.DistributedFileSystem$65.doCall(DistributedFileSystem.java:2852) at org.apache.hadoop.hdfs.DistributedFileSystem$65.doCall(DistributedFileSystem.java:2849) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2867) at org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:885) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:176) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:133) at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:99) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:194) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:307) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:360) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:368) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818740#comment-16818740 ] zhouguangwei edited comment on HDFS-13596 at 4/17/19 2:44 AM: -- after rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure write failure sample: {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} {color:#d04437}xxx org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error, status message , ack with firstBadLink as x.x.x.x:x{color} {color:#d04437} at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color} {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-x-1552442036118:blk_1073742246_1422{color} {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[x.x.x.x:x,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color} {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} was (Author: zgw): after rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure write failure sample: {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} {color:#d04437}xxx org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error, status message , ack with firstBadLink as x.x.x.x:x{color} {color:#d04437} at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color} {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-x-1552442036118:blk_1073742246_1422{color} {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[x:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color} {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818740#comment-16818740 ] zhouguangwei edited comment on HDFS-13596 at 4/17/19 2:41 AM: -- after rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure write failure sample: {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} {color:#d04437}xxx org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error, status message , ack with firstBadLink as x.x.x.x:x{color} {color:#d04437} at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color} {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-x-1552442036118:blk_1073742246_1422{color} {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[x:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color} {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} was (Author: zgw): after rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure write failure sample: {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} {color:#d04437}org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error, status message , ack with firstBadLink as x.x.x.x:25009{color} {color:#d04437} at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color} {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-x-1552442036118:blk_1073742246_1422{color} {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[x:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color} {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818740#comment-16818740 ] zhouguangwei edited comment on HDFS-13596 at 4/17/19 2:40 AM: -- after rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure write failure sample: {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} {color:#d04437}org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error, status message , ack with firstBadLink as x.x.x.x:25009{color} {color:#d04437} at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color} {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-x-1552442036118:blk_1073742246_1422{color} {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[x:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color} {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} was (Author: zgw): after rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure write failure sample: {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} {color:#d04437}org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error, status message , ack with firstBadLink as x.x.x.x:25009{color} {color:#d04437} at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color} {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-x-1552442036118:blk_1073742246_1422{color} {color:#d04437}19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[187.4.65.81:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color} {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818740#comment-16818740 ] zhouguangwei edited comment on HDFS-13596 at 4/17/19 2:39 AM: -- after rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure write failure sample: {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} {color:#d04437}org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error, status message , ack with firstBadLink as x.x.x.x:25009{color} {color:#d04437} at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color} {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-x-1552442036118:blk_1073742246_1422{color} {color:#d04437}19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[187.4.65.81:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color} {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} was (Author: zgw): after rollingUpgrade NN nodes to 3.x and keep DN 2.x , at this point, use 2.x client to read or write data to hdfs will failure write failure sample: {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} {color:#d04437}org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error, status message , ack with firstBadLink as 187.4.65.81:25009{color} {color:#d04437} at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color} {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color} {color:#d04437}19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-187.4.64.197-1552442036118:blk_1073742246_1422{color} {color:#d04437}19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[187.4.65.81:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color} {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color} > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819640#comment-16819640 ] zhouguangwei edited comment on HDFS-13596 at 4/17/19 1:18 AM: -- [~jojochuang] Yes, I have merged the latest patch that Fei Hui provided and rebuild HDFS project, then replace hadoop-hdfs-3.1.1.jar was (Author: zgw): [~jojochuang] Yes, I have merge the latest patch that Fei Hui provided and rebuild HDFS project, then replace hadoop-hdfs-3.1.1.jar > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, > HDFS-13596.006.patch, HDFS-13596.007.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815008#comment-16815008 ] Fei Hui edited comment on HDFS-13596 at 4/11/19 1:56 AM: - [~daryn] Thanks for you comments.Upload v005 patch. Could you please take a look? {quote} The check for EC support should be in FSNamesystem methods, not NameNodeRpcServer, since there can be multiple entry points to the namesystem like webhdfs. {quote} move check to FSNamesystem {quote} DFSUtil.isSupportedErasureCoding probably doesn't belong in DFSUtil since it's not something that should called outside of the NN. {quote} delete it from DFSUtil {quote} In FSEditLogOp, please call the former method instead of duplicating the logic. {quote} do not see duplicate logic {quote} Super trivial, might rename new layoutVersion parameter in the write methods to logVersion to be consistent with the signatures for the read methods. {quote} change layoutVertion to logVersion {quote} How about a FSNamesystem.checkErasureCodingSupported(String op) to avoid all the redundant check/throw code in the methods? {quote} add checkErasureCodingSupported in FSNamesystem {quote} A test case is needed to prove the edits are correctly read/written. {quote} add a test case, write op in lower version and read it in lower version. was (Author: ferhui): [~daryn] Thanks for you comments.Upload v005 patch {quote} The check for EC support should be in FSNamesystem methods, not NameNodeRpcServer, since there can be multiple entry points to the namesystem like webhdfs. {quote} move check to FSNamesystem {quote} DFSUtil.isSupportedErasureCoding probably doesn't belong in DFSUtil since it's not something that should called outside of the NN. {quote} delete it from DFSUtil {quote} In FSEditLogOp, please call the former method instead of duplicating the logic. {quote} do not see duplicate logic {quote} Super trivial, might rename new layoutVersion parameter in the write methods to logVersion to be consistent with the signatures for the read methods. {quote} change layoutVertion to logVersion {quote} How about a FSNamesystem.checkErasureCodingSupported(String op) to avoid all the redundant check/throw code in the methods? {quote} add checkErasureCodingSupported in FSNamesystem {quote} A test case is needed to prove the edits are correctly read/written. {quote} add a test case, write op in lower version and read it in lower version. > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813269#comment-16813269 ] Fei Hui edited comment on HDFS-13596 at 4/9/19 10:58 AM: - [~starphin] Thanks for your comments. If writing new properties, downgrading maybe fail while upgrading has problems [~jojochuang] Upload v004 patch. Could you please take a look? Thanks Later I will test 2.7 & 2.8, and then add release notes or document was (Author: ferhui): [~starphin] If writing new properties, Downgrading maybe fail while upgrading has problems [~jojochuang] upload v004 patch Later I will test 2.7 & 2.8, and then add release notes or document > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch, HDFS-13596.004.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811584#comment-16811584 ] star edited comment on HDFS-13596 at 4/6/19 2:35 PM: - Maybe another option:we can write new properties after the older ones, so that when restarted they can be ignored. EC requests will still be accepted and persisted in editlog before finalizing rollingupgrade. In this case , it would be like this. from {quote}FSImageSerialization.writeByte(storagePolicyId, out); FSImageSerialization.writeByte(erasureCodingPolicyId, out); writeRpcIds(rpcClientId, rpcCallId, out); {quote} to {quote}writeRpcIds(rpcClientId, rpcCallId, out); FSImageSerialization.writeByte(storagePolicyId, out); FSImageSerialization.writeByte(erasureCodingPolicyId, out); {quote} was (Author: starphin): Maybe another option:we can write new properties after the older ones, so that when restarted they can be ignored. EC requests will still be accepted and persisted in editlog. In this case , it would be like this. from {quote}FSImageSerialization.writeByte(storagePolicyId, out); FSImageSerialization.writeByte(erasureCodingPolicyId, out); writeRpcIds(rpcClientId, rpcCallId, out); {quote} to {quote}writeRpcIds(rpcClientId, rpcCallId, out); FSImageSerialization.writeByte(storagePolicyId, out); FSImageSerialization.writeByte(erasureCodingPolicyId, out); {quote} > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage >
[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811584#comment-16811584 ] star edited comment on HDFS-13596 at 4/6/19 2:34 PM: - Maybe another option:we can write new properties after the older ones, so that when restarted they can be ignored. EC requests will still be accepted and persisted in editlog. In this case , it would be like this. from {quote}FSImageSerialization.writeByte(storagePolicyId, out); FSImageSerialization.writeByte(erasureCodingPolicyId, out); writeRpcIds(rpcClientId, rpcCallId, out); {quote} to {quote}writeRpcIds(rpcClientId, rpcCallId, out); FSImageSerialization.writeByte(storagePolicyId, out); FSImageSerialization.writeByte(erasureCodingPolicyId, out); {quote} was (Author: starphin): Maybe another option:we can write new properties after the older ones, so that when restarted they can be ignored. EC requests will still be accepted. In this case , it would be like this. from {quote}FSImageSerialization.writeByte(storagePolicyId, out); FSImageSerialization.writeByte(erasureCodingPolicyId, out); writeRpcIds(rpcClientId, rpcCallId, out); {quote} to {quote}writeRpcIds(rpcClientId, rpcCallId, out); FSImageSerialization.writeByte(storagePolicyId, out); FSImageSerialization.writeByte(erasureCodingPolicyId, out); {quote} > NN restart fails after RollingUpgrade from 2.x to 3.x > - > > Key: HDFS-13596 > URL: https://issues.apache.org/jira/browse/HDFS-13596 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Fei Hui >Priority: Critical > Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, > HDFS-13596.003.patch > > > After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails > while replaying edit logs. > * After NN is started with rollingUpgrade, the layoutVersion written to > editLogs (before finalizing the upgrade) is the pre-upgrade layout version > (so as to support downgrade). > * When writing transactions to log, NN writes as per the current layout > version. In 3.x, erasureCoding bits are added to the editLog transactions. > * So any edit log written after the upgrade and before finalizing the > upgrade will have the old layout version but the new format of transactions. > * When NN is restarted and the edit logs are replayed, the NN reads the old > layout version from the editLog file. When parsing the transactions, it > assumes that the transactions are also from the previous layout and hence > skips parsing the erasureCoding bits. > * This cascades into reading the wrong set of bits for other fields and > leads to NN shutting down. > Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > java.io.IOException: java.lang.IllegalStateException: Cannot skip