[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2020-07-10 Thread fengwu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154414#comment-17154414
 ] 

fengwu edited comment on HDFS-13596 at 7/10/20, 7:09 AM:
-

[~_ph], at first I had the same view as you, when I saw this comment:
https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16911102=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16911102

I tested downgrading from HDFS 3.2.1: to 2.8.2 it succeeded, but to 2.7 ~ 2.7.2 it failed.

So this can be understood as: the current 3.x can only be rolling-downgraded to 2.8+. Agree?

[~hanishakoneru]  [~ferhui]


was (Author: fengwu99):
[~_ph], at first I had the same view as you, when I saw this comment:
https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16911102=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16911102

I tested downgrading from HDFS 3.2.1: to 2.8.2 it succeeded, but to 2.7 it failed.

 

> NN restart fails after RollingUpgrade from 2.x to 3.x
> -
>
> Key: HDFS-13596
> URL: https://issues.apache.org/jira/browse/HDFS-13596
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Hanisha Koneru
>Assignee: Fei Hui
>Priority: Blocker
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, 
> HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, 
> HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, 
> HDFS-13596.009.patch, HDFS-13596.010.patch
>
>
> After rollingUpgrade NN from 2.x to 3.x, if the NN is restarted, it fails 
> while replaying edit logs.
>  * After NN is started with rollingUpgrade, the layoutVersion written to 
> editLogs (before finalizing the upgrade) is the pre-upgrade layout version 
> (so as to support downgrade).
>  * When writing transactions to log, NN writes as per the current layout 
> version. In 3.x, erasureCoding bits are added to the editLog transactions.
>  * So any edit log written after the upgrade and before finalizing the 
> upgrade will have the old layout version but the new format of transactions.
>  * When NN is restarted and the edit logs are replayed, the NN reads the old 
> layout version from the editLog file. When parsing the transactions, it 
> assumes that the transactions are also from the previous layout and hence 
> skips parsing the erasureCoding bits.
>  * This cascades into reading the wrong set of bits for other fields and 
> leads to NN shutting down.
> Sample error output:
> {code:java}
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected 
> length 16
>  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
>  at 
> org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
>  at 
> org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> 2018-05-17 19:10:06,522 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
> loading fsimage
> java.io.IOException: java.lang.IllegalStateException: Cannot skip to less 
> than the current value (=16389), where newValue=16388
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
> {code}
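To make the failure mode described above concrete, here is a minimal, self-contained sketch (illustrative class names and layout-version numbers only, not the actual Hadoop source): the writer always emits the new erasure-coding field, but the reader gates parsing of that field on the layout version stamped in the log header, so a log stamped with the pre-upgrade version that nevertheless contains new-format records leaves every later field read mis-aligned.

{code:java}
import java.io.*;

// Hypothetical demo of the layout-version mismatch, not Hadoop code.
public class LayoutMismatchDemo {
  static final int OLD_LAYOUT = -63; // pre-upgrade version (made up)
  static final int EC_LAYOUT  = -64; // version that added the EC field (made up)

  // Writer acts like a 3.x NN: it always writes the EC byte, but during a
  // rolling upgrade it stamps the header with the pre-upgrade version.
  static byte[] writeOp(int headerLayout) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeInt(headerLayout); // layout version recorded in the log
    out.writeByte(1);           // erasure-coding field (new in 3.x)
    out.writeInt(16);           // clientId length
    return bos.toByteArray();
  }

  // Reader acts like edit-log replay: it trusts the header's version.
  static void readOp(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    int layout = in.readInt();
    if (layout <= EC_LAYOUT) {  // layout versions grow more negative
      in.readByte();            // consume the EC field only for new layouts
    }
    int clientIdLen = in.readInt(); // mis-aligned if the EC byte was skipped
    if (clientIdLen != 16) {
      throw new IOException("Invalid clientId - length is " + clientIdLen);
    }
  }

  public static void main(String[] args) throws IOException {
    readOp(writeOp(EC_LAYOUT));  // finalized upgrade: parses fine
    readOp(writeOp(OLD_LAYOUT)); // upgrade window: throws, like the NN above
  }
}
{code}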

[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2020-07-10 Thread fengwu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154414#comment-17154414
 ] 

fengwu edited comment on HDFS-13596 at 7/10/20, 7:00 AM:
-

[~_ph], at first I had the same view as you, when I saw this comment:
https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16911102=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16911102

I tested downgrading from HDFS 3.2.1: to 2.8.2 it succeeded, but to 2.7 it failed.

 


was (Author: fengwu99):
[~_ph], at first I had the same view as you, when I saw this comment:
[https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16911102=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16911102]


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2020-07-09 Thread fengwu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154336#comment-17154336
 ] 

fengwu edited comment on HDFS-13596 at 7/9/20, 9:12 AM:


[~_ph] Thanks.

In your case you used rollback, not downgrade. Rollback loses data, and I want 
a way to do a rolling downgrade. Do you think it would work?


was (Author: fengwu99):
[~_ph] Thanks.

In your case you used rollback, not downgrade. Rollback loses data, and I want 
a way to do a rolling downgrade.


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2020-07-09 Thread Yuriy Malygin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154318#comment-17154318
 ] 

Yuriy Malygin edited comment on HDFS-13596 at 7/9/20, 8:21 AM:
---

[~fengwu99] yes, in my case 
(https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16826162=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826162
 + 
https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16826939=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826939)
 the downgrade to 2.7.3 was successful from the 3.1.2 and 3.3.0-SNAPSHOT versions.


was (Author: _ph):
[~fengwu99] yes, in my case 
(https://issues.apache.org/jira/browse/HDFS-13596?focusedCommentId=16826162=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826162)
 the downgrade to 2.7.3 was successful from the 3.1.2 and 3.3.0-SNAPSHOT versions.


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2020-07-09 Thread fengwu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154238#comment-17154238
 ] 

fengwu edited comment on HDFS-13596 at 7/9/20, 7:02 AM:


Hi, [~_ph]! Yes, downgrade the namenodes first, and do not finalize the upgrade.

Have you tested a downgrade from 3.x to 2.x? Did the datanodes downgrade successfully?


was (Author: fengwu99):
Hi, [~_ph]! Yes, downgrade the namenodes first, and do not finalize the upgrade.


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2020-07-08 Thread fengwu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153543#comment-17153543
 ] 

fengwu edited comment on HDFS-13596 at 7/8/20, 12:23 PM:
-

Hi, in my test of a downgrade from 3.1.3 to 2.7.2, the namenode succeeded but the 
datanode failed, because of different DN layout versions.

 
{code:java}
2020-07-06 14:45:01,313 WARN org.apache.hadoop.hdfs.server.common.Storage: 
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected 
version of storage directory /data/hadoop/dfs. Reported: -57. Expecting = -56.
2020-07-06 14:45:01,315 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock 
on /data/hadoop/dfs/in_use.lock acquired by nodename 21258@test-v03
2020-07-06 14:45:01,315 WARN org.apache.hadoop.hdfs.server.common.Storage: 
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected 
version of storage directory /data/hadoop/dfs. Reported: -57. Expecting = -56.
2020-07-06 14:45:01,315 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: 
Initialization failed for Block pool  (Datanode Uuid unassigned) 
service to test-v01/10.110.228.21:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:748)
{code}
 

Does anyone know the reason?

Or is rolling back the datanode the only way?
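The IncorrectVersionException above comes from a storage-directory layout check. As a rough sketch of that gate (illustrative names, not the actual DataStorage code), note that DN layout versions are negative and decrease as features are added, so a directory already written as -57 (3.1.3) looks "from the future" to a -56 (2.7.2) binary:

{code:java}
import java.io.IOException;

// Hypothetical version gate like the one behind the log above.
class StorageLayoutCheck {
  static void check(int reported, int expected) throws IOException {
    if (reported != expected) {
      // The on-disk layout (-57) is newer than this software's (-56), so a
      // plain downgrade cannot read it; only a rollback that restores the
      // pre-upgrade directory state gets the DataNode past this check.
      throw new IOException("Unexpected version of storage directory."
          + " Reported: " + reported + ". Expecting = " + expected + ".");
    }
  }
}
{code}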


was (Author: fengwu99):
Hi, in my test of a downgrade from 3.1.3 to 2.7.2, the namenode succeeded but the 
datanode failed, because of different DN layout versions.

 
{code:java}
2020-07-06 14:45:01,313 WARN org.apache.hadoop.hdfs.server.common.Storage: 
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected 
version of storage directory /data/hadoop/dfs. Reported: -57. Expecting = -56.
2020-07-06 14:45:01,315 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock 
on /data/hadoop/dfs/in_use.lock acquired by nodename 21258@test-v03
2020-07-06 14:45:01,315 WARN org.apache.hadoop.hdfs.server.common.Storage: 
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected 
version of storage directory /data/hadoop/dfs. Reported: -57. Expecting = -56.
2020-07-06 14:45:01,315 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: 
Initialization failed for Block pool  (Datanode Uuid unassigned) 
service to test-v01/10.110.228.21:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:748)
{code}
 

Does anyone know the reason?


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-08-19 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910968#comment-16910968
 ] 

Fei Hui edited comment on HDFS-13596 at 8/20/19 3:47 AM:
-

[~jojochuang] I found the commit "Fix potential FSImage corruption." in 
2.8.6-snapshot, 2.9.2, 2.9.3-snapshot, 3.0.4-snapshot, 3.1.1-snapshot, 3.2.0, 
3.2.1-snapshot, 3.3.0-snapshot. 
With this fix, a downgrade from 3.0.0~3.0.3 or 3.1.0~3.1.1 to 
2.8.0~2.8.5 or 2.9.0~2.9.1 will succeed.
By the way, I think this issue resolves the edit-log incompatibility problem, 
while "Fix potential FSImage corruption." is a string-table incompatibility. 
Shall we file another JIRA to resolve it?


was (Author: ferhui):
[~jojochuang] I found the commit "Fix potential FSImage corruption." in 
2.8.6-snapshot, 2.9.2-snapshot, 3.0.4-snapshot, 3.1.1-snapshot, 3.2.0, 
3.2.1-snapshot, 3.3.0-snapshot. 
With this fix, a downgrade from 3.0.0~3.0.3 or 3.1.0~3.1.1 to 
2.8.0~2.8.5 or 2.9.0~2.9.1 will succeed.
By the way, I think this issue resolves the edit-log incompatibility problem, 
while "Fix potential FSImage corruption." is a string-table incompatibility. 
Shall we file another JIRA to resolve it?


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-05-20 Thread Yuriy Malygin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843989#comment-16843989
 ] 

Yuriy Malygin edited comment on HDFS-13596 at 5/20/19 2:08 PM:
---

[~brahmareddy] yes, the test was with _dfs.block.access.token.enable=true_ and with 
full access to HDFS (read/write). After the rollback (it is possible only with 
downtime) all new data was lost and the filesystem returned to its pre-update state.


was (Author: _ph):
[~brahmareddy] yes, the test was with _dfs.block.access.token.enable=true_ and with 
full access to HDFS (read/write). After the rollback all new data was lost and the 
filesystem returned to its pre-update state.


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-05-20 Thread Yuriy Malygin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843989#comment-16843989
 ] 

Yuriy Malygin edited comment on HDFS-13596 at 5/20/19 2:07 PM:
---

[~brahmareddy] yes, the test was with _dfs.block.access.token.enable=true_ and with 
full access to HDFS (read/write). After the rollback all new data was lost and the 
filesystem returned to its pre-update state.


was (Author: _ph):
[~brahmareddy] yes, the test was with _dfs.block.access.token.enable=true_ and with 
full access to HDFS (read/write).


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-05-20 Thread Yuriy Malygin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843989#comment-16843989
 ] 

Yuriy Malygin edited comment on HDFS-13596 at 5/20/19 2:05 PM:
---

[~brahmareddy] yes, the test was with _dfs.block.access.token.enable=true_ and with 
full access to HDFS (read/write).


was (Author: _ph):
[~brahmareddy] yes, the test was with dfs.block.access.token.enable=true and with 
full access to HDFS (read/write).


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-26 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826728#comment-16826728
 ] 

Fei Hui edited comment on HDFS-13596 at 4/26/19 7:37 AM:
-

[~_ph] Thanks for your testing. There are two problems.

One is due to EC. Please see https://issues.apache.org/jira/browse/HDFS-14396 

The other is due to the StringTable. It occurs on downgrade from 3.3 to 2.7, and I guess 
the root cause is the following commit:
{code}
commit 8a41edb089fbdedc5e7d9a2aeec63d126afea49f
Author: Vinayakumar B 
Date:   Mon Oct 15 15:48:26 2018 +0530

    Fix potential FSImage corruption. Contributed by Daryn Sharp.
{code}

The StringTable change is incompatible. I reverted this commit and it works.
I could not find the JIRA ID because it was not in the commit message.
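To make the string-table incompatibility concrete, here is a hedged illustration (assumed mechanics and mask width, not the actual FSImageFormatProtobuf code): if the fixed 3.x writer packs flag bits into the high bits of each string id, a mask-aware loader recovers the array index, while a legacy 2.x loader that indexes with the raw id overflows, much like the ArrayIndexOutOfBoundsException: 536870913 in the rollback log further down this thread.

{code:java}
// Hypothetical demo of incompatible string-table ids, not Hadoop code.
public class StringTableIdDemo {
  static final int ID_MASK = (1 << 29) - 1; // assume low 29 bits hold the index

  public static void main(String[] args) {
    String[] table = new String[8];  // sized from the fsimage section header
    table[1] = "user:hadoop";
    int rawId = 536870913;           // 0x20000001: a flag bit plus index 1

    System.out.println(table[rawId & ID_MASK]); // mask-aware reader: "user:hadoop"
    try {
      System.out.println(table[rawId]);         // legacy reader: raw id as index
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println(e);                    // AIOOBE for 536870913, as in the log
    }
  }
}
{code}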


was (Author: ferhui):
[~_ph] Thanks for your testing. There are two problems.

One is due to EC. Please see https://issues.apache.org/jira/browse/HDFS-14396 

The other is due to the StringTable. It occurs on downgrade from 3.3 to 2.7, and I guess 
the root cause is the following commit:
{code}
commit 8a41edb089fbdedc5e7d9a2aeec63d126afea49f
Author: Vinayakumar B 
Date:   Mon Oct 15 15:48:26 2018 +0530

    Fix potential FSImage corruption. Contributed by Daryn Sharp.
{code}

The StringTable change is incompatible. I reverted this commit and it works.


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-25 Thread Yuriy Malygin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826162#comment-16826162
 ] 

Yuriy Malygin edited comment on HDFS-13596 at 4/25/19 3:31 PM:
---

I'm testing the last patch on a cluster with Kerberos:
 # download the sources from [https://github.com/apache/hadoop]
 # apply the patch _HDFS-13596.007.patch_
 # build a patched version of _hadoop-3.3.0-SNAPSHOT_
 # prepare the running cluster (_hadoop-2.7.3_) for _rollingUpgrade_
 # run _hadoop-3.3.0-SNAPSHOT_ in _rollingUpgrade_ mode
 # upload test data to HDFS
 # restart the NN; all is fine
 # stop the cluster to roll back to hadoop-2.7.3 
 # the hadoop-2.7.3 NN start fails with _ArrayIndexOutOfBoundsException_:
{code:java}
INFO   | jvm 1| 2019/04/25 18:01:24 | STARTUP_MSG:   build = 
https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
baa91f7c6bc9cb92be5982de4719c1c8af91ccff; compiled by 'vinodkv' on 
2016-08-18T01:01Z
INFO   | jvm 1| 2019/04/25 18:01:24 | STARTUP_MSG:   java = 1.8.0_102
INFO   | jvm 1| 2019/04/25 18:01:24 | 
/
INFO   | jvm 1| 2019/04/25 18:01:24 | 2019-04-25 18:01:24,767  INFO 
[Thread-3] NameNode - registered UNIX signal handlers for [TERM, HUP, INT]
INFO   | jvm 1| 2019/04/25 18:01:24 | 2019-04-25 18:01:24,769  INFO 
[Thread-3] NameNode - createNameNode [-rollingUpgrade, rollback]
{code}
{code:java}
INFO   | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342 DEBUG 
[Thread-3] FSImage - Planning to load edit log stream: 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@53468702
INFO   | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342 DEBUG 
[Thread-3] FSImage - Planning to load edit log stream: 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@5e350e83
INFO   | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342 DEBUG 
[Thread-3] FSImage - Planning to load edit log stream: 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@cf86362
INFO   | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342 DEBUG 
[Thread-3] FSImage - Planning to load edit log stream: 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@840d683
INFO   | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,342  INFO 
[Thread-3] FSImage - Planning to load image: 
FSImageFile(file=/one/hadoop-data/dfs/current/fsimage_rollback_677,
 cpktTxId=677)
INFO   | jvm 1| 2019/04/25 18:01:30 | Total time for which application 
threads were stopped: 0.0007557 seconds, Stopping threads took: 0.0001181 
seconds
INFO   | jvm 1| 2019/04/25 18:01:30 | 2019-04-25 18:01:30,377 ERROR 
[Thread-3] FSImage - Failed to load image from 
FSImageFile(file=/one/hadoop-data/dfs/current/fsimage_rollback_677,
 cpktTxId=677)
INFO   | jvm 1| 2019/04/25 18:01:30 | 
java.lang.ArrayIndexOutOfBoundsException: 536870913
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadStringTableSection(FSImageFormatProtobuf.java:318)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:251)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:182)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:963)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:947)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:746)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:677)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:294)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:976)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645)
INFO   | jvm 1| 2019/04/25 18:01:30 |   at 
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:812)
{code}
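
For context on the rollback failure: in 3.x the fsimage string table entry ids 
appear to carry extra mask bits in their high bits (note the failing index 
536870913 = 0x20000001), while a 2.x NameNode sizes its in-memory string table 
by the entry count and indexes it directly by id. A minimal sketch of that 
2.x-style loading pattern, with hypothetical names rather than the actual 
Hadoop source, shows where the _ArrayIndexOutOfBoundsException_ comes from:
{code:java}
// Hypothetical sketch of 2.x-style string table loading. The array is sized
// by the entry count and indexed directly by each entry's id, so an id
// written by a 3.x NameNode with a high mask bit set (e.g. 0x20000001)
// lands far past the end of the array.
static String[] loadStringTable(int numEntry, int[] ids, String[] strings) {
  String[] table = new String[numEntry + 1];
  for (int i = 0; i < numEntry; i++) {
    table[ids[i]] = strings[i];  // ArrayIndexOutOfBoundsException for 3.x-encoded ids
  }
  return table;
}
{code}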

[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-22 Thread Yuriy Malygin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822981#comment-16822981
 ] 

Yuriy Malygin edited comment on HDFS-13596 at 4/22/19 10:06 AM:


This is the same problem as HDFS-13236.

I can reproduce this problem with the NN on all 3.x versions (with Kerberos 
too). I have a dedicated test cluster and can help with testing.

In my tests, rollingUpgrade (2.7 -> 3.x) succeeded only when there were no 
write operations on the cluster before finalize.


was (Author: _ph):
This is the same problem as HDFS-13236.

I can reproduce this problem with the NN on all 3.x versions (with Kerberos 
too). I have a dedicated test cluster and can help with testing.


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-22 Thread Yuriy Malygin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822981#comment-16822981
 ] 

Yuriy Malygin edited comment on HDFS-13596 at 4/22/19 10:06 AM:


This is the same problem as HDFS-13236.

I can reproduce this problem with the NN on all 3.x versions (with Kerberos 
too). I have a dedicated test cluster and can help with testing.

In my tests, rollingUpgrade (2.7 -> 3.x) succeeded only when there were no 
write operations on the cluster before finalize.


was (Author: _ph):
This is the same problem as HDFS-13236.

I can reproduce this problem with the NN on all 3.x versions (with Kerberos 
too). I have a dedicated test cluster and can help with testing.

In my tests, rollingUpgrade (2.7 -> 3.x) succeeded only when there were no 
write operations on the cluster before finalize.


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-21 Thread Brahma Reddy Battula (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822850#comment-16822850
 ] 

Brahma Reddy Battula edited comment on HDFS-13596 at 4/22/19 3:37 AM:
--

{quote}After rollingUpgrade of the NN nodes to 3.x while keeping the DNs on 
2.x, reading or writing data to HDFS with a 2.x client fails
{quote}
[~zgw] and [~maxigang], thanks for reporting the token error. It looks like 
HDFS-9807 and HDFS-6708 introduced two new fields in BlockTokenIdentifier. 
AFAIK, the NameNode here creates the token with these new fields, while the 
DataNode does not know about them, hence token verification fails. (A sketch 
of the mismatch follows below.)

[~ferhui] and [~starphin], did you try with security enabled (mainly 
blockToken enabled)?
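
A minimal, self-contained sketch of that failure mode, using plain HMAC rather 
than the actual BlockTokenSecretManager classes (the identifier strings below 
are made up for illustration): the NN signs the serialized identifier 
including the new fields, while an old DN re-serializes the identifier from 
its shorter field list before recomputing the MAC, so the two byte streams, 
and hence the MACs, never match.
{code:java}
import java.util.Arrays;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative only: models a 3.x NN signing a block token whose serialized
// identifier contains fields a 2.x DN does not know about.
public class TokenMismatchSketch {
  static byte[] hmac(byte[] key, byte[] data) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(new SecretKeySpec(key, "HmacSHA1"));
    return mac.doFinal(data);
  }

  public static void main(String[] args) throws Exception {
    byte[] sharedKey = "shared-block-key".getBytes("UTF-8");
    // What the 3.x NN serializes and signs (includes the new fields):
    byte[] newIdentifier = "blockId=1422,user=hdfs,storageTypes=[DISK]".getBytes("UTF-8");
    // What a 2.x DN reconstructs from its own, older field list:
    byte[] oldIdentifier = "blockId=1422,user=hdfs".getBytes("UTF-8");

    byte[] issuedPassword = hmac(sharedKey, newIdentifier);  // token password from the NN
    byte[] recomputed = hmac(sharedKey, oldIdentifier);      // DN-side verification
    System.out.println("token verifies: " + Arrays.equals(issuedPassword, recomputed)); // false
  }
}
{code}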

 

 


was (Author: brahmareddy):
{quote}After rollingUpgrade of the NN nodes to 3.x while keeping the DNs on 
2.x, reading or writing data to HDFS with a 2.x client fails
{quote}
[~zgw] and [~maxigang], thanks for reporting the token error. It looks like 
HDFS-9807 and HDFS-6708 introduced two new fields in BlockTokenIdentifier, 
hence token verification is failing. AFAIK, the NameNode here creates the 
token with these new fields, while the DataNode does not know about them, so 
verification fails.

[~ferhui] and [~starphin], did you try with security enabled (mainly 
blockToken enabled)?

 

 


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-17 Thread Anu Engineer (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820302#comment-16820302
 ] 

Anu Engineer edited comment on HDFS-13596 at 4/17/19 5:03 PM:
--

Not to add more noise, but this might be a good opportunity for us to learn a 
trick or two from our brethren in HBase land. HBase has a nice ability to 
upgrade to 3.0 without enabling 3.0 features until another command or setting 
is applied. We have a very similar situation here: we might have changes in 
the edit logs, but let us not allow that feature to be used until the main 
step is completely done and we have some way of verifying that nothing is 
broken. Then you can enable the full 3.0 features once the full upgrade is 
done. (Thanks to [~ccondit] for educating me on how well HBase does this, and 
for letting me know that Ozone should probably learn from that experience.)

If we do this, a rolling upgrade would be two steps: upgrade, then enable 3.0 
features like EC. Until the enable call is done, HDFS will not allow 3.0 
features like EC. A minimal sketch of such a gate follows below.
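
A minimal sketch of that kind of gate, with hypothetical names rather than an 
actual Hadoop API: 3.0-only operations are rejected until an explicit enable 
call, which is roughly the behavior reported later in this thread as 
"setErasureCodingPolicy not supported."
{code:java}
// Hypothetical feature gate: 3.0-only operations stay disabled until the
// operator explicitly enables them after the upgrade has been verified.
public class FeatureGate {
  private volatile boolean erasureCodingEnabled = false;

  // An explicit admin "enable" step, run once the upgrade is finalized.
  public void enableErasureCoding() {
    erasureCodingEnabled = true;
  }

  // Called at the start of every EC operation, e.g. setErasureCodingPolicy.
  public void checkErasureCodingSupported(String operationName) {
    if (!erasureCodingEnabled) {
      throw new UnsupportedOperationException(operationName + " not supported.");
    }
  }
}
{code}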

 


was (Author: anu):
Not to add more noise, but this might be a good opportunity for us to learn a 
trick or two from our brethren in HBase land. HBase has a nice ability to 
upgrade to 3.0 without enabling 3.0 features until another command or setting 
is applied. We have a very similar situation here: we might have changes in 
the edit logs, but let us not allow that feature to be used until the main 
step is completely done and we have some way of verifying that nothing is 
broken. Then you can enable the full 3.0 features once the full upgrade is 
done. (Thanks to [~ccondit] for educating me on how well HBase does this, and 
for letting me know that Ozone should probably learn from that experience.)

If we do this, a rolling upgrade would be two steps: upgrade, then enable 3.0 
features like EC. Until the enable call is done, HDFS will not allow 3.0 
features like EC.

 


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-17 Thread Anu Engineer (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820302#comment-16820302
 ] 

Anu Engineer edited comment on HDFS-13596 at 4/17/19 5:03 PM:
--

Not to add more noise, but this might be a good opportunity for us to learn a 
trick or two from our brethren in HBase land. HBase has a nice ability to 
upgrade to 3.0 without enabling 3.0 features until another command or setting 
is applied. We have a very similar situation here: we might have changes in 
the edit logs, but let us not allow that feature to be used until the main 
step is completely done and we have some way of verifying that nothing is 
broken. Then you can enable the full 3.0 features once the full upgrade is 
done. (Thanks to [~ccondit] for educating me on how well HBase does this, and 
for letting me know that Ozone should probably learn from that experience.)

If we do this, a rolling upgrade would be two steps: upgrade, then enable 3.0 
features like EC. Until the enable call is done, HDFS will not allow 3.0 
features like EC.

 


was (Author: anu):
Not to add more noise, but this might be a good opportunity for us to learn a 
trick or two from our brethren in HBase land. HBase has a nice ability to 
upgrade to 3.0 without enabling 3.0 features until another command or setting 
is applied. We have a very similar situation here: we might have changes in 
the edit logs, but let us not allow that feature to be used until the main 
step is completely done and we have some way of verifying that nothing is 
broken. Then you can enable the full 3.0 features once the full upgrade is 
done. (Thanks to [~ccondit] for educating me on how well HBase does this, and 
for letting me know that Ozone should probably learn from that experience.)

If we do this, a rolling upgrade would be two steps: upgrade, then enable 3.0 
features like EC. Until the enable call is done, HDFS will not allow 3.0 
features like EC.

 


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-17 Thread Anu Engineer (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820302#comment-16820302
 ] 

Anu Engineer edited comment on HDFS-13596 at 4/17/19 5:02 PM:
--

Not to add more noise, but this might be a good opportunity for us to learn a 
trick or two from our brethren in HBase land. HBase has a nice ability to 
upgrade to 3.0 without enabling 3.0 features until another command or setting 
is applied. We have a very similar situation here: we might have changes in 
the edit logs, but let us not allow that feature to be used until the main 
step is completely done and we have some way of verifying that nothing is 
broken. Then you can enable the full 3.0 features once the full upgrade is 
done. (Thanks to [~ccondit] for educating me on how well HBase does this, and 
for letting me know that Ozone should probably learn from that experience.)

If we do this, a rolling upgrade would be two steps: upgrade, then enable 3.0 
features like EC. Until the enable call is done, HDFS will not allow 3.0 
features like EC.

 


was (Author: anu):
Not to add more noise, but this might be a good opportunity for us to learn a 
trick or two from our brethren in HBase land. HBase has a nice ability to 
upgrade to 3.0 without enabling 3.0 features until another command or setting 
is applied. We have a very similar situation here: we might have changes in 
the edit logs, but let us not allow that feature to be used until the main 
step is completely done and we have some way of verifying that nothing is 
broken. Then you can enable the full 3.0 features once the full upgrade is 
done. (Thanks to [~ccondit] for educating me on how well HBase does this, and 
for letting me know that Ozone should probably learn from that experience.)

If we do this, a rolling upgrade would be two steps: upgrade, then enable 3.0 
features like EC. Until the enable call is done, HDFS will not allow 3.0 
features like EC.

 


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-17 Thread yao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820034#comment-16820034
 ] 

yao edited comment on HDFS-13596 at 4/17/19 12:37 PM:
--

After rollingUpgrade of the NN and DN nodes to 3.2.0, running the example pi 
job with a 3.2.0 client fails.

Write failure sample (a possible client-side workaround is sketched after the 
trace):
{quote}19/04/17 11:45:00 INFO client.AHSProxy: Connecting to Application 
History server at slave01.jd.com/192.168.1.101:10200
 19/04/17 11:45:00 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding 
for path: /yarn/staging/hdfs/.staging/job_1555472525840_0001
 19/04/17 11:45:00 INFO mapreduce.JobSubmitter: Cleaning up the staging area 
/yarn/staging/hdfs/.staging/job_1555472525840_0001
 org.apache.hadoop.security.AccessControlException: setErasureCodingPolicy not 
supported.
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkErasureCodingSupported(FSNamesystem.java:7797)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setErasureCodingPolicy(FSNamesystem.java:7530)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setErasureCodingPolicy(NameNodeRpcServer.java:2171)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolServerSideTranslatorPB.java:1598)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
 at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
 at org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2748)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$65.doCall(DistributedFileSystem.java:2852)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$65.doCall(DistributedFileSystem.java:2849)
 at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2867)
 at 
org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:885)
 at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:176)
 at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:133)
 at 
org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:99)
 at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:194)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
 at 
org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:307)
 at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:360)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
 at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:368)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
 at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
 at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
{quote}
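
The trace shows the MapReduce client failing hard because JobResourceUploader 
tries to switch the staging directory to replication via 
setErasureCodingPolicy, which a NameNode running in this mode rejects. One 
possible client-side workaround is to treat that call as best-effort; a hedged 
sketch, assuming DistributedFileSystem#setErasureCodingPolicy(Path, String) 
and the "REPLICATION" policy name are available, rather than the actual 
MapReduce code:
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.security.AccessControlException;

// Hypothetical best-effort variant of disabling EC on the staging directory:
// if the NameNode rejects the call, log and fall back to the cluster default
// instead of failing job submission.
public class StagingDirSetup {
  static void disableErasureCodingBestEffort(DistributedFileSystem dfs, Path stagingDir)
      throws IOException {
    try {
      dfs.setErasureCodingPolicy(stagingDir, "REPLICATION");
    } catch (AccessControlException e) {
      System.err.println("Could not disable erasure coding for " + stagingDir
          + ", continuing with the cluster default: " + e.getMessage());
    }
  }
}
{code}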

[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-17 Thread yao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820034#comment-16820034
 ] 

yao edited comment on HDFS-13596 at 4/17/19 12:34 PM:
--

After rollingUpgrade of the NN and DN nodes to 3.2.0, running the example pi 
job with a 3.2.0 client fails.

Write failure sample:
{quote}19/04/17 11:45:00 INFO client.AHSProxy: Connecting to Application 
History server at slave01.jd.com/192.168.1.101:10200
 19/04/17 11:45:00 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding 
for path: /yarn/staging/hdfs/.staging/job_1555472525840_0001
 19/04/17 11:45:00 INFO mapreduce.JobSubmitter: Cleaning up the staging area 
/yarn/staging/hdfs/.staging/job_1555472525840_0001
 org.apache.hadoop.security.AccessControlException: setErasureCodingPolicy not 
supported.
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkErasureCodingSupported(FSNamesystem.java:7797)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setErasureCodingPolicy(FSNamesystem.java:7530)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setErasureCodingPolicy(NameNodeRpcServer.java:2171)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolServerSideTranslatorPB.java:1598)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
 at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
 at org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2748)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$65.doCall(DistributedFileSystem.java:2852)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$65.doCall(DistributedFileSystem.java:2849)
 at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2867)
 at 
org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:885)
 at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:176)
 at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:133)
 at 
org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:99)
 at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:194)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
 at 
org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:307)
 at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:360)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
 at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:368)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
 at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
 at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
{quote}

[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-16 Thread zhouguangwei (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818740#comment-16818740
 ] 

zhouguangwei edited comment on HDFS-13596 at 4/17/19 2:44 AM:
--

After rollingUpgrade of the NN nodes to 3.x while keeping the DNs on 2.x, 
reading or writing data to HDFS with a 2.x client fails.

Write failure sample:

{color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}
 {color:#d04437}xxx 
org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got 
access token error, status message , ack with firstBadLink as x.x.x.x:x{color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color}
 {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning 
BP-1321128176-x-1552442036118:blk_1073742246_1422{color}
 {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding 
datanode 
DatanodeInfoWithStorage[x.x.x.x:x,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color}
 {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}


was (Author: zgw):
After rollingUpgrade of the NN nodes to 3.x while keeping the DNs on 2.x, 
reading or writing data to HDFS with a 2.x client fails.

Write failure sample:

{color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}
 {color:#d04437}xxx 
org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got 
access token error, status message , ack with firstBadLink as x.x.x.x:x{color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color}
 {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning 
BP-1321128176-x-1552442036118:blk_1073742246_1422{color}
 {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding 
datanode 
DatanodeInfoWithStorage[x:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color}
 {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-16 Thread zhouguangwei (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818740#comment-16818740
 ] 

zhouguangwei edited comment on HDFS-13596 at 4/17/19 2:41 AM:
--

After rollingUpgrade of the NN nodes to 3.x while keeping the DNs on 2.x, 
reading or writing data to HDFS with a 2.x client fails.

Write failure sample:

{color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}
 {color:#d04437}xxx 
org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got 
access token error, status message , ack with firstBadLink as x.x.x.x:x{color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color}
 {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning 
BP-1321128176-x-1552442036118:blk_1073742246_1422{color}
 {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding 
datanode 
DatanodeInfoWithStorage[x:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color}
 {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}


was (Author: zgw):
After rollingUpgrade of the NN nodes to 3.x while keeping the DNs on 2.x, 
reading or writing data to HDFS with a 2.x client fails.

Write failure sample:

{color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}
 
{color:#d04437}org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException:
 Got access token error, status message , ack with firstBadLink as 
x.x.x.x:25009{color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color}
 {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning 
BP-1321128176-x-1552442036118:blk_1073742246_1422{color}
 {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding 
datanode 
DatanodeInfoWithStorage[x:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color}
 {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-16 Thread zhouguangwei (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818740#comment-16818740
 ] 

zhouguangwei edited comment on HDFS-13596 at 4/17/19 2:40 AM:
--

After rollingUpgrade of the NN nodes to 3.x while keeping the DNs on 2.x, 
reading or writing data to HDFS with a 2.x client fails.

Write failure sample:

{color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}
 
{color:#d04437}org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException:
 Got access token error, status message , ack with firstBadLink as 
x.x.x.x:25009{color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color}
 {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning 
BP-1321128176-x-1552442036118:blk_1073742246_1422{color}
 {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding 
datanode 
DatanodeInfoWithStorage[x:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color}
 {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}


was (Author: zgw):
After rollingUpgrade of the NN nodes to 3.x while keeping the DNs on 2.x, 
reading or writing data to HDFS with a 2.x client fails.

Write failure sample:

{color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}
 
{color:#d04437}org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException:
 Got access token error, status message , ack with firstBadLink as 
x.x.x.x:25009{color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color}
 {color:#d04437} at 
org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color}
{color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning 
BP-1321128176-x-1552442036118:blk_1073742246_1422{color}
 {color:#d04437}19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode 
DatanodeInfoWithStorage[187.4.65.81:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color}
 {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in 
createBlockOutputStream{color}


[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-16 Thread zhouguangwei (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818740#comment-16818740
 ] 

zhouguangwei edited comment on HDFS-13596 at 4/17/19 2:39 AM:
--

After rollingUpgrade of the NN nodes to 3.x while keeping the DNs on 2.x, 
reading or writing data to HDFS with a 2.x client fails.

Write failure sample:

{code:java}
19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream
org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error, status message , ack with firstBadLink as x.x.x.x:25009
 at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134)
 at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823)
 at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724)
 at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713)
19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-x-1552442036118:blk_1073742246_1422
19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[187.4.65.81:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]
19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream
{code}



[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-16 Thread zhouguangwei (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819640#comment-16819640
 ] 

zhouguangwei edited comment on HDFS-13596 at 4/17/19 1:18 AM:
--

[~jojochuang] Yes, I have merged the latest patch that Fei Hui provided, rebuilt 
the HDFS project, and then replaced hadoop-hdfs-3.1.1.jar.




[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-10 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815008#comment-16815008
 ] 

Fei Hui edited comment on HDFS-13596 at 4/11/19 1:56 AM:
-

[~daryn] Thanks for your comments. Uploaded the v005 patch. Could you please take 
a look?
{quote}
The check for EC support should be in FSNamesystem methods, not 
NameNodeRpcServer, since there can be multiple entry points to the namesystem 
like webhdfs.
{quote}
Moved the check to FSNamesystem.
{quote}
DFSUtil.isSupportedErasureCoding probably doesn't belong in DFSUtil since it's 
not something that should be called outside of the NN.
{quote}
Deleted it from DFSUtil.
{quote}
In FSEditLogOp, please call the former method instead of duplicating the logic.
{quote}
I do not see duplicated logic there.
{quote}
Super trivial, might rename the new layoutVersion parameter in the write methods 
to logVersion to be consistent with the signatures for the read methods.
{quote}
Renamed layoutVersion to logVersion.
{quote}
How about a FSNamesystem.checkErasureCodingSupported(String op) to avoid all 
the redundant check/throw code in the methods?
{quote}
Added checkErasureCodingSupported to FSNamesystem; a sketch follows below.
{quote}
A test case is needed to prove the edits are correctly read/written.
{quote}
Added a test case that writes the op with the lower layout version and reads it 
back with the same version.
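
For reference, a minimal sketch of that guard (an assumption based on this 
thread, not necessarily the committed patch):
{code:java}
// Minimal sketch of the FSNamesystem guard discussed above -- an assumption
// based on this thread, not necessarily the committed patch. It rejects
// erasure-coding operations while the NN still persists edits in a pre-EC
// layout version, e.g. during a not-yet-finalized rolling upgrade from 2.x.
void checkErasureCodingSupported(String operationName)
    throws UnsupportedActionException {
  if (!NameNodeLayoutVersion.supports(
      NameNodeLayoutVersion.Feature.ERASURE_CODING,
      getEffectiveLayoutVersion())) {
    throw new UnsupportedActionException(operationName + " not supported.");
  }
}
{code}
Each EC entry point in FSNamesystem can then make this one call instead of 
repeating the check/throw block.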






[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-09 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813269#comment-16813269
 ] 

Fei Hui edited comment on HDFS-13596 at 4/9/19 10:58 AM:
-

[~starphin] Thanks for your comments. If we still write the new properties, a 
downgrade may fail when the upgrade runs into problems.

[~jojochuang] Uploaded the v004 patch. Could you please take a look? Thanks.

Later I will test 2.7 and 2.8, and then add release notes or documentation.




[jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2019-04-06 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811584#comment-16811584
 ] 

star edited comment on HDFS-13596 at 4/6/19 2:35 PM:
-

Maybe another option: we can write the new properties after the older ones, so 
that they can be ignored when the NN is restarted. EC requests will still be 
accepted and persisted in the edit log before the rolling upgrade is finalized.

In this case, it would look like this (see the toy model after the snippets).

from 
{quote}FSImageSerialization.writeByte(storagePolicyId, out);
 FSImageSerialization.writeByte(erasureCodingPolicyId, out);
 writeRpcIds(rpcClientId, rpcCallId, out);
{quote}
to 
{quote}writeRpcIds(rpcClientId, rpcCallId, out);

FSImageSerialization.writeByte(storagePolicyId, out);
 FSImageSerialization.writeByte(erasureCodingPolicyId, out);
{quote}
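
To make the reasoning concrete, below is a self-contained toy model (plain 
Java, not the real FSEditLogOp wire format or framing) of why the trailing 
placement is downgrade-friendly: an old-format reader consumes only the fields 
it knows and skips whatever trails inside the length-delimited record, rather 
than misparsing the new EC byte as the start of the 16-byte rpcClientId, which 
is exactly the "Invalid clientId - length is 0" cascade in the description.
{code:java}
import java.io.*;

// Toy model, NOT HDFS code: a "3.x" writer appends its new field last inside a
// length-delimited record, and a "2.x" reader that knows nothing about the new
// field still decodes the old fields correctly and skips the rest.
public class TrailingFieldDemo {
  // Writer: old fields first, the new erasureCodingPolicyId byte appended last.
  static byte[] writeRecord(long inodeId, byte storagePolicyId, byte ecPolicyId)
      throws IOException {
    ByteArrayOutputStream body = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(body);
    out.writeLong(inodeId);
    out.writeByte(storagePolicyId);
    out.writeByte(ecPolicyId);                 // new trailing field
    ByteArrayOutputStream framed = new ByteArrayOutputStream();
    DataOutputStream f = new DataOutputStream(framed);
    f.writeInt(body.size());                   // length frame around the record
    body.writeTo(f);
    return framed.toByteArray();
  }

  // Old reader: reads the fields it knows, then skips the unknown tail.
  static void readOldFormat(byte[] record) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
    int len = in.readInt();
    long inodeId = in.readLong();
    byte storagePolicyId = in.readByte();
    in.skipBytes(len - (Long.BYTES + Byte.BYTES)); // ignore trailing new fields
    System.out.println("inodeId=" + inodeId + " storagePolicyId=" + storagePolicyId);
  }

  public static void main(String[] args) throws IOException {
    readOldFormat(writeRecord(16388L, (byte) 7, (byte) 1));
    // prints: inodeId=16388 storagePolicyId=7
  }
}
{code}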




