[ https://issues.apache.org/jira/browse/HDFS-15105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tao Yang updated HDFS-15105:
----------------------------
    Description: 
We found an issue where the Standby NN exited and failed to restart until we resolved the edit log corruption.
Error logs:
{noformat}
java.io.IOException: Mismatched block IDs or generation stamps, attempting to replace block blk_74288647857_73526148211 with blk_74288647857_73526377369 as block # 15/17 of /maindump/mainv10/dump_online/lasttable/20200105015500/part-319
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1019)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:431)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:234)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:885)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:866)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:234)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:342)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:295)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:312)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:455)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:308)
{noformat}
Related edit log transactions of the same file:
{noformat}
1. TXID=444341628498 time=1578251449632
   OP_UPDATE_BLOCKS
   blocks: ... blk_74288647857_73526148211 blk_74454090866_73526215536
2. TXID=444342382774 time=1578251520740
   OP_REASSIGN_LEASE
3. TXID=444342401216 time=1578251522779
   OP_CLOSE
   blocks: ... blk_74288647857_73526377369 blk_74454090866_73526374095
4. TXID=444342401394
   OP_SET_GENSTAMP_V2
   generate stamp: 73526377369   !!!! this stamp is generated here but was already used in the previous transaction
5. TXID=444342401395 time=1578251522835 (03:12:02,835)
   OP_TRUNCATE
6. TXID=444342402176 time=1578251523246 (03:12:03,246)
   OP_CLOSE
   blocks: ... blk_74288647857_73526377369
{noformat}
According to the edit logs, it's weird to see that the generation stamp (73526377369) was generated in transaction 4 but already used in transaction 3. Moreover, transaction 3 should have changed only the last block, but in fact the last two blocks were both changed.
This problem might be produced in a complex scenario in which a truncate operation immediately follows a recover-lease operation on the same file. A suspicious point is that, between the creation of transaction 3 and its being written out, the generation stamp of the second-to-last block was updated while committing the block synchronization caused by the truncate operation.
The related call stack is as follows:
{noformat}
NameNodeRpcServer#commitBlockSynchronization
  FSNamesystem#commitBlockSynchronization
    // update last block
    if (!copyTruncate) {
      storedBlock.setGenerationStamp(newgenerationstamp); // updated now!!!
      storedBlock.setNumBytes(newlength);
    }
{noformat}
Any comments are welcome. Thanks.

> Standby NN exits and fails to restart due to edit log corruption
> ----------------------------------------------------------------
>
>                 Key: HDFS-15105
>                 URL: https://issues.apache.org/jira/browse/HDFS-15105
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Priority: Critical
>
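The suspected interleaving described above can be sketched with a minimal, self-contained simulation. This is an illustrative model only, not actual HDFS code: the `Block` and `CloseOp` classes below are hypothetical stand-ins. It shows how an op that holds a reference to a live, mutable block object can serialize a generation stamp that was assigned after the op was created, matching the out-of-order stamps seen in transactions 3 and 4.

```java
// Stand-in for an HDFS block with a mutable generation stamp (hypothetical class).
class Block {
    final long id;
    long genStamp;

    Block(long id, long genStamp) {
        this.id = id;
        this.genStamp = genStamp;
    }
}

// Stand-in for an OP_CLOSE edit-log op (hypothetical class). It keeps a
// *reference* to the live block, so any mutation made between op creation
// and serialization is visible in the serialized transaction.
class CloseOp {
    final Block block;

    CloseOp(Block block) {
        this.block = block;
    }

    String serialize() {
        return "blk_" + block.id + "_" + block.genStamp;
    }
}

public class GenStampRaceSketch {
    public static void main(String[] args) {
        Block stored = new Block(74288647857L, 73526148211L);

        // 1. OP_CLOSE is created while the block still carries the old stamp.
        CloseOp closeOp = new CloseOp(stored);

        // 2. Before the op is written out, commitBlockSynchronization
        //    (triggered by the truncate) bumps the stamp in place.
        stored.genStamp = 73526377369L;

        // 3. The op now serializes the *new* stamp, even though the
        //    OP_SET_GENSTAMP_V2 transaction is only logged afterwards.
        System.out.println(closeOp.serialize()); // blk_74288647857_73526377369
    }
}
```

If the op instead captured an immutable snapshot of the block at creation time, the logged stamp would match the state at transaction 3 and the Standby NN replay would not see a stamp from the future.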
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org