[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815008#comment-16815008 ]
Fei Hui edited comment on HDFS-13596 at 4/11/19 1:56 AM:
---------------------------------------------------------

[~daryn] Thanks for your comments. I have uploaded the v005 patch. Could you please take a look?

{quote}
The check for EC support should be in FSNamesystem methods, not NameNodeRpcServer, since there can be multiple entry points to the namesystem like webhdfs.
{quote}
Moved the check to FSNamesystem.

{quote}
DFSUtil.isSupportedErasureCoding probably doesn't belong in DFSUtil since it's not something that should called outside of the NN.
{quote}
Deleted it from DFSUtil.

{quote}
In FSEditLogOp, please call the former method instead of duplicating the logic.
{quote}
I do not see any duplicated logic there.

{quote}
Super trivial, might rename new layoutVersion parameter in the write methods to logVersion to be consistent with the signatures for the read methods.
{quote}
Renamed layoutVersion to logVersion.

{quote}
How about a FSNamesystem.checkErasureCodingSupported(String op) to avoid all the redundant check/throw code in the methods?
{quote}
Added checkErasureCodingSupported to FSNamesystem.

{quote}
A test case is needed to prove the edits are correctly read/written.
{quote}
Added a test case that writes an op in the lower version and reads it back in the lower version.

> NN restart fails after RollingUpgrade from 2.x to 3.x
> -----------------------------------------------------
>
>                 Key: HDFS-13596
>                 URL: https://issues.apache.org/jira/browse/HDFS-13596
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>            Reporter: Hanisha Koneru
>            Assignee: Fei Hui
>            Priority: Critical
>         Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch
>
>
> After a rollingUpgrade of the NN from 2.x to 3.x, if the NN is restarted, it fails while replaying edit logs.
> * After the NN is started with rollingUpgrade, the layoutVersion written to editLogs (before finalizing the upgrade) is the pre-upgrade layout version (so as to support downgrade).
> * When writing transactions to the log, the NN writes as per the current layout version. In 3.x, erasureCoding bits are added to the editLog transactions.
> * So any edit log written after the upgrade and before finalizing the upgrade will have the old layout version but the new format of transactions.
> * When the NN is restarted and the edit logs are replayed, the NN reads the old layout version from the editLog file. When parsing the transactions, it assumes that the transactions are also from the previous layout and hence skips parsing the erasureCoding bits.
> * This cascades into reading the wrong set of bits for other fields and leads to the NN shutting down.
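
The mismatch described in the bullets above can be reproduced in miniature. The following is only a sketch under stated assumptions, not the real FSEditLogOp encoding: the class EditLogVersionSketch, the record layout (inode id, optional one-byte EC policy id, client id string), and the layout version numbers 1 and 2 are all hypothetical and chosen for illustration. It shows how a record written in the new format but read under the old version yields a shifted, garbage field, and how passing the log's own version to the writer keeps the two sides consistent.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class EditLogVersionSketch {
    // Hypothetical version numbers; they do not follow HDFS's real layout version scheme.
    static final int OLD_LAYOUT = 1; // no erasure coding field
    static final int NEW_LAYOUT = 2; // adds a one-byte erasure coding policy id

    static boolean supportsErasureCoding(int version) {
        return version >= NEW_LAYOUT;
    }

    // Serializes a toy transaction. The EC byte is emitted only if the
    // version passed in supports it.
    static byte[] writeOp(int version, long inodeId, byte ecPolicyId, String clientId)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeLong(inodeId);
            if (supportsErasureCoding(version)) {
                out.writeByte(ecPolicyId);
            }
            out.writeUTF(clientId);
        }
        return buf.toByteArray();
    }

    // Parses the toy transaction according to the version recorded in the
    // log header and returns the client id field.
    static String readClientId(int logVersion, byte[] record) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            in.readLong(); // inode id
            if (supportsErasureCoding(logVersion)) {
                in.readByte(); // EC policy id
            }
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        String clientId = "0123456789abcdef";

        // Failure mode: the writer uses the software's current layout while
        // the log header (and therefore the reader) says OLD_LAYOUT.
        byte[] mismatched = writeOp(NEW_LAYOUT, 16388L, (byte) 0, clientId);
        String misread = readClientId(OLD_LAYOUT, mismatched);
        System.out.println("mismatched read length: " + misread.length());

        // Consistent behaviour: the writer is handed the log's version, so
        // the record matches what the reader will expect on replay.
        byte[] consistent = writeOp(OLD_LAYOUT, 16388L, (byte) 0, clientId);
        System.out.println("consistent read: " + readClientId(OLD_LAYOUT, consistent));
    }
}
```

In the mismatched case the reader treats the EC byte and the first length byte as the string length, so the client id comes back empty. This is analogous in spirit to the "Invalid clientId - length is 0" error in the sample output, although the real edit log format differs from this toy one.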
> Sample error output:
> {code:java}
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected length 16
>  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
>  at org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> 2018-05-17 19:10:06,522 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
> java.io.IOException: java.lang.IllegalStateException: Cannot skip to less than the current value (=16389), where newValue=16388
>  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> Caused by: java.lang.IllegalStateException: Cannot skip to less than the current value (=16389), where newValue=16388
>  at org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
>  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1943)
> {code}