[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-15175:
------------------------------
    Description: 
{panel:title=Crash exception}
2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993]
java.io.IOException: File is not under construction: ......
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
{panel}

{panel:title=Editlog}
<RECORD>
  <OPCODE>OP_REASSIGN_LEASE</OPCODE>
  <DATA>
    <TXID>32625021150</TXID>
    <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
    <PATH>......</PATH>
    <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
  </DATA>
</RECORD>
......
<RECORD>
  <OPCODE>OP_CLOSE</OPCODE>
  <DATA>
    <TXID>32625023743</TXID>
    <LENGTH>0</LENGTH>
    <INODEID>0</INODEID>
    <PATH>......</PATH>
    <REPLICATION>3</REPLICATION>
    <MTIME>1581816135883</MTIME>
    <ATIME>1581814760398</ATIME>
    <BLOCKSIZE>536870912</BLOCKSIZE>
    <CLIENT_NAME></CLIENT_NAME>
    <CLIENT_MACHINE></CLIENT_MACHINE>
    <OVERWRITE>false</OVERWRITE>
    <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818644</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
    </BLOCK>
    <PERMISSION_STATUS>
      <USERNAME>da_music</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>416</MODE>
    </PERMISSION_STATUS>
  </DATA>
</RECORD>
......
<RECORD>
  <OPCODE>OP_TRUNCATE</OPCODE>
  <DATA>
    <TXID>32625024049</TXID>
    <SRC>......</SRC>
    <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
    <CLIENTMACHINE>......</CLIENTMACHINE>
    <NEWLENGTH>185818644</NEWLENGTH>
    <TIMESTAMP>1581816136336</TIMESTAMP>
    <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818648</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
    </BLOCK>
  </DATA>
</RECORD>
......
<RECORD>
  <OPCODE>OP_CLOSE</OPCODE>
  <DATA>
    <TXID>32625024993</TXID>
    <LENGTH>0</LENGTH>
    <INODEID>0</INODEID>
    <PATH>......</PATH>
    <REPLICATION>3</REPLICATION>
    <MTIME>1581816138774</MTIME>
    <ATIME>1581814760398</ATIME>
    <BLOCKSIZE>536870912</BLOCKSIZE>
    <CLIENT_NAME></CLIENT_NAME>
    <CLIENT_MACHINE></CLIENT_MACHINE>
    <OVERWRITE>false</OVERWRITE>
    <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818644</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
    </BLOCK>
    <PERMISSION_STATUS>
      <USERNAME>da_music</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>416</MODE>
    </PERMISSION_STATUS>
  </DATA>
</RECORD>
{panel}

The block size in the first CloseOp should be 185818648, but after the truncate the recorded size becomes 185818644. The CloseOp/TruncateOp/CloseOp sequence is synchronized to the JournalNode in the same batch.
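A minimal sketch of the aliasing involved (the class and method names here are illustrative, not the real HDFS ones): if an op in a pending batch holds a reference to a mutable block object rather than a copy, a later mutation of that object changes what the earlier op writes out.

```java
// Illustrative model of the shared-block-instance problem.
// These are simplified stand-ins, NOT the actual HDFS classes.
class Block {
    long numBytes;
    Block(long numBytes) { this.numBytes = numBytes; }
}

class CloseOp {
    final Block block; // holds a reference to the live block, not a snapshot
    CloseOp(Block block) { this.block = block; }
    // The size is only read when the batch is serialized to the JournalNode.
    long serializedNumBytes() { return block.numBytes; }
}

public class SharedBlockSketch {
    public static void main(String[] args) {
        Block block = new Block(185818648L);     // size at the time of the first close
        CloseOp firstClose = new CloseOp(block); // should persist 185818648

        block.numBytes = 185818644L;             // truncate mutates the SAME instance
        CloseOp secondClose = new CloseOp(block);

        // Both ops are serialized in the same batch, after the truncate,
        // so the first CloseOp also records the post-truncate size.
        System.out.println(firstClose.serializedNumBytes());  // 185818644, not 185818648
        System.out.println(secondClose.serializedNumBytes()); // 185818644
    }
}
```

Recording a defensive copy of the block at op-construction time (rather than a shared reference) would avoid this class of problem.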
The two CloseOps reference the same Block instance, which causes the first CloseOp to be written with the wrong block size. When the SNN tails the edit log, TruncateOp does not put the file back into the UnderConstruction state; so when the second CloseOp is applied, the file is not under construction and the SNN crashes.


> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15175
>                 URL: https://issues.apache.org/jira/browse/HDFS-15175
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Yicong Cai
>            Assignee: Yicong Cai
>            Priority: Critical
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)