[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-15175:
------------------------------
    Description: 
 
{panel:title=Crash exception}
2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
[length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, 
clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
txid=32625024993]
 java.io.IOException: File is not under construction: ......
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:360)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
 at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
{panel}
 
{panel:title=Editlog}
<RECORD>
 <OPCODE>OP_REASSIGN_LEASE</OPCODE>
 <DATA>
 <TXID>32625021150</TXID>
 <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
 <PATH>......</PATH>
 <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
 </DATA>
 </RECORD>

......

<RECORD>
 <OPCODE>OP_CLOSE</OPCODE>
 <DATA>
 <TXID>32625023743</TXID>
 <LENGTH>0</LENGTH>
 <INODEID>0</INODEID>
 <PATH>......</PATH>
 <REPLICATION>3</REPLICATION>
 <MTIME>1581816135883</MTIME>
 <ATIME>1581814760398</ATIME>
 <BLOCKSIZE>536870912</BLOCKSIZE>
 <CLIENT_NAME></CLIENT_NAME>
 <CLIENT_MACHINE></CLIENT_MACHINE>
 <OVERWRITE>false</OVERWRITE>
 <BLOCK>
 <BLOCK_ID>5568434562</BLOCK_ID>
 <NUM_BYTES>185818644</NUM_BYTES>
 <GENSTAMP>4495417845</GENSTAMP>
 </BLOCK>
 <PERMISSION_STATUS>
 <USERNAME>da_music</USERNAME>
 <GROUPNAME>hdfs</GROUPNAME>
 <MODE>416</MODE>
 </PERMISSION_STATUS>
 </DATA>
 </RECORD>

......

<RECORD>
 <OPCODE>OP_TRUNCATE</OPCODE>
 <DATA>
 <TXID>32625024049</TXID>
 <SRC>......</SRC>
 <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
 <CLIENTMACHINE>......</CLIENTMACHINE>
 <NEWLENGTH>185818644</NEWLENGTH>
 <TIMESTAMP>1581816136336</TIMESTAMP>
 <BLOCK>
 <BLOCK_ID>5568434562</BLOCK_ID>
 <NUM_BYTES>185818648</NUM_BYTES>
 <GENSTAMP>4495417845</GENSTAMP>
 </BLOCK>
 </DATA>
 </RECORD>

......

<RECORD>
 <OPCODE>OP_CLOSE</OPCODE>
 <DATA>
 <TXID>32625024993</TXID>
 <LENGTH>0</LENGTH>
 <INODEID>0</INODEID>
 <PATH>......</PATH>
 <REPLICATION>3</REPLICATION>
 <MTIME>1581816138774</MTIME>
 <ATIME>1581814760398</ATIME>
 <BLOCKSIZE>536870912</BLOCKSIZE>
 <CLIENT_NAME></CLIENT_NAME>
 <CLIENT_MACHINE></CLIENT_MACHINE>
 <OVERWRITE>false</OVERWRITE>
 <BLOCK>
 <BLOCK_ID>5568434562</BLOCK_ID>
 <NUM_BYTES>185818644</NUM_BYTES>
 <GENSTAMP>4495417845</GENSTAMP>
 </BLOCK>
 <PERMISSION_STATUS>
 <USERNAME>da_music</USERNAME>
 <GROUPNAME>hdfs</GROUPNAME>
 <MODE>416</MODE>
 </PERMISSION_STATUS>
 </DATA>
 </RECORD>
{panel}
 

 

The block size should be 185818648 in the first CloseOp. When truncate is used, 
the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp is 
synchronized to the JournalNode in the same batch. The block used by CloseOp 
twice is the same instance, which causes the first CloseOp has wrong block 
size. When SNN rolling Editlog, TruncateOp does not make the file to the 
UnderConstruction state. Then, when the second CloseOp is executed, the file is 
not in the UnderConstruction state, and SNN crashes.

  was:
 
{panel:title=Crash exception}
2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
[length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, 
clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
txid=32625024993]
java.io.IOException: File is not under construction: ......
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
{panel}
 


> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15175
>                 URL: https://issues.apache.org/jira/browse/HDFS-15175
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Yicong Cai
>            Assignee: Yicong Cai
>            Priority: Critical
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ......
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> <RECORD>
>  <OPCODE>OP_REASSIGN_LEASE</OPCODE>
>  <DATA>
>  <TXID>32625021150</TXID>
>  <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
>  <PATH>......</PATH>
>  <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
>  </DATA>
>  </RECORD>
> ......
> <RECORD>
>  <OPCODE>OP_CLOSE</OPCODE>
>  <DATA>
>  <TXID>32625023743</TXID>
>  <LENGTH>0</LENGTH>
>  <INODEID>0</INODEID>
>  <PATH>......</PATH>
>  <REPLICATION>3</REPLICATION>
>  <MTIME>1581816135883</MTIME>
>  <ATIME>1581814760398</ATIME>
>  <BLOCKSIZE>536870912</BLOCKSIZE>
>  <CLIENT_NAME></CLIENT_NAME>
>  <CLIENT_MACHINE></CLIENT_MACHINE>
>  <OVERWRITE>false</OVERWRITE>
>  <BLOCK>
>  <BLOCK_ID>5568434562</BLOCK_ID>
>  <NUM_BYTES>185818644</NUM_BYTES>
>  <GENSTAMP>4495417845</GENSTAMP>
>  </BLOCK>
>  <PERMISSION_STATUS>
>  <USERNAME>da_music</USERNAME>
>  <GROUPNAME>hdfs</GROUPNAME>
>  <MODE>416</MODE>
>  </PERMISSION_STATUS>
>  </DATA>
>  </RECORD>
> ......
> <RECORD>
>  <OPCODE>OP_TRUNCATE</OPCODE>
>  <DATA>
>  <TXID>32625024049</TXID>
>  <SRC>......</SRC>
>  <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
>  <CLIENTMACHINE>......</CLIENTMACHINE>
>  <NEWLENGTH>185818644</NEWLENGTH>
>  <TIMESTAMP>1581816136336</TIMESTAMP>
>  <BLOCK>
>  <BLOCK_ID>5568434562</BLOCK_ID>
>  <NUM_BYTES>185818648</NUM_BYTES>
>  <GENSTAMP>4495417845</GENSTAMP>
>  </BLOCK>
>  </DATA>
>  </RECORD>
> ......
> <RECORD>
>  <OPCODE>OP_CLOSE</OPCODE>
>  <DATA>
>  <TXID>32625024993</TXID>
>  <LENGTH>0</LENGTH>
>  <INODEID>0</INODEID>
>  <PATH>......</PATH>
>  <REPLICATION>3</REPLICATION>
>  <MTIME>1581816138774</MTIME>
>  <ATIME>1581814760398</ATIME>
>  <BLOCKSIZE>536870912</BLOCKSIZE>
>  <CLIENT_NAME></CLIENT_NAME>
>  <CLIENT_MACHINE></CLIENT_MACHINE>
>  <OVERWRITE>false</OVERWRITE>
>  <BLOCK>
>  <BLOCK_ID>5568434562</BLOCK_ID>
>  <NUM_BYTES>185818644</NUM_BYTES>
>  <GENSTAMP>4495417845</GENSTAMP>
>  </BLOCK>
>  <PERMISSION_STATUS>
>  <USERNAME>da_music</USERNAME>
>  <GROUPNAME>hdfs</GROUPNAME>
>  <MODE>416</MODE>
>  </PERMISSION_STATUS>
>  </DATA>
>  </RECORD>
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp is 
> synchronized to the JournalNode in the same batch. The block used by CloseOp 
> twice is the same instance, which causes the first CloseOp has wrong block 
> size. When SNN rolling Editlog, TruncateOp does not make the file to the 
> UnderConstruction state. Then, when the second CloseOp is executed, the file 
> is not in the UnderConstruction state, and SNN crashes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to