[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316997#comment-17316997 ]
Max Xie edited comment on HDFS-15175 at 4/8/21, 9:01 AM:
----------------------------------------------------------
We encountered this bug on hdfs 3.2.1. Is there any progress now? ping [~hexiaoqiao] [~wanchang] .

was (Author: max2049):
ping [~hexiaoqiao] [~wanchang] .

> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15175
>                 URL: https://issues.apache.org/jira/browse/HDFS-15175
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Yicong Cai
>            Assignee: Wan Chang
>            Priority: Critical
>              Labels: NameNode
>         Attachments: HDFS-15175-trunk.1.patch
>
>
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993]
> java.io.IOException: File is not under construction: ......
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:360)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>     at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>
> {panel:title=Editlog}
> <RECORD>
>   <OPCODE>OP_REASSIGN_LEASE</OPCODE>
>   <DATA>
>     <TXID>32625021150</TXID>
>     <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
>     <PATH>......</PATH>
>     <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
>   </DATA>
> </RECORD>
> ......
> <RECORD>
>   <OPCODE>OP_CLOSE</OPCODE>
>   <DATA>
>     <TXID>32625023743</TXID>
>     <LENGTH>0</LENGTH>
>     <INODEID>0</INODEID>
>     <PATH>......</PATH>
>     <REPLICATION>3</REPLICATION>
>     <MTIME>1581816135883</MTIME>
>     <ATIME>1581814760398</ATIME>
>     <BLOCKSIZE>536870912</BLOCKSIZE>
>     <CLIENT_NAME></CLIENT_NAME>
>     <CLIENT_MACHINE></CLIENT_MACHINE>
>     <OVERWRITE>false</OVERWRITE>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818644</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>     <PERMISSION_STATUS>
>       <USERNAME>da_music</USERNAME>
>       <GROUPNAME>hdfs</GROUPNAME>
>       <MODE>416</MODE>
>     </PERMISSION_STATUS>
>   </DATA>
> </RECORD>
> ......
> <RECORD>
>   <OPCODE>OP_TRUNCATE</OPCODE>
>   <DATA>
>     <TXID>32625024049</TXID>
>     <SRC>......</SRC>
>     <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
>     <CLIENTMACHINE>......</CLIENTMACHINE>
>     <NEWLENGTH>185818644</NEWLENGTH>
>     <TIMESTAMP>1581816136336</TIMESTAMP>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818648</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>   </DATA>
> </RECORD>
> ......
> <RECORD>
>   <OPCODE>OP_CLOSE</OPCODE>
>   <DATA>
>     <TXID>32625024993</TXID>
>     <LENGTH>0</LENGTH>
>     <INODEID>0</INODEID>
>     <PATH>......</PATH>
>     <REPLICATION>3</REPLICATION>
>     <MTIME>1581816138774</MTIME>
>     <ATIME>1581814760398</ATIME>
>     <BLOCKSIZE>536870912</BLOCKSIZE>
>     <CLIENT_NAME></CLIENT_NAME>
>     <CLIENT_MACHINE></CLIENT_MACHINE>
>     <OVERWRITE>false</OVERWRITE>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818644</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>     <PERMISSION_STATUS>
>       <USERNAME>da_music</USERNAME>
>       <GROUPNAME>hdfs</GROUPNAME>
>       <MODE>416</MODE>
>     </PERMISSION_STATUS>
>   </DATA>
> </RECORD>
> {panel}
>
>
> The block size should be 185818648 in the first CloseOp. After the truncate, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp sequence is synchronized to the JournalNode in the same batch.
The block referenced by both CloseOps is the same instance, which causes the first CloseOp to record the wrong block size. When the SNN is rolling the edit log, TruncateOp does not put the file into the UnderConstruction state. Then, when the second CloseOp is applied, the file is not in the UnderConstruction state, and the SNN crashes.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
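The aliasing described in the report, where two CloseOps hold a reference to the same mutable block object so the truncate's in-place update retroactively changes what the first op serializes, can be sketched with a minimal, hypothetical model. These classes are illustrative stand-ins, not Hadoop's actual Block or FSEditLogOp implementations:

```java
import java.util.ArrayList;
import java.util.List;

public class SharedBlockSketch {

    // Hypothetical stand-in for a mutable block record.
    static class Block {
        long blockId;
        long numBytes;
        long genStamp;

        Block(long blockId, long numBytes, long genStamp) {
            this.blockId = blockId;
            this.numBytes = numBytes;
            this.genStamp = genStamp;
        }
    }

    // Hypothetical stand-in for an edit log close op. It keeps a shared
    // reference to the live Block rather than a snapshot of its fields,
    // which is the aliasing bug described in the report.
    static class CloseOp {
        final Block block;

        CloseOp(Block block) {
            this.block = block;
        }

        // Serialization happens only when the batch is flushed to the
        // JournalNode, i.e. after any intermediate mutation has occurred.
        String serialize() {
            return "OP_CLOSE numBytes=" + block.numBytes;
        }
    }

    public static void main(String[] args) {
        // Values taken from the edit log excerpt above.
        Block block = new Block(5568434562L, 185818648L, 4495417845L);
        List<CloseOp> batch = new ArrayList<>();

        // First close: the block length is 185818648 at this moment.
        batch.add(new CloseOp(block));

        // Truncate mutates the same Block instance in place.
        block.numBytes = 185818644L;

        // Second close after the truncate, still the same instance.
        batch.add(new CloseOp(block));

        // The whole batch is serialized together; the first op now reports
        // the post-truncate length instead of the length at close time.
        System.out.println(batch.get(0).serialize()); // OP_CLOSE numBytes=185818644
        System.out.println(batch.get(1).serialize()); // OP_CLOSE numBytes=185818644
    }
}
```

One natural remedy for this kind of aliasing, not necessarily what HDFS-15175-trunk.1.patch does, is a defensive copy: have each op snapshot the block's fields at the time the op is created, so later mutations cannot reach back into already-logged operations.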