[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133245#comment-17133245
 ] 

huhaiyang edited comment on HDFS-15391 at 6/11/20, 1:28 PM:
------------------------------------------------------------

Hi [~ayushtkn], could you please take a look at this issue?

{quote}
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=xxxxpath, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://xxxxpath
{quote}
Related edit log transactions:

{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492(2020-06-04 18:27:45)

NEWLENGTH=868460715
blocks: ... 
<BLOCK_ID>11382080753</BLOCK_ID><NUM_BYTES>103364934</NUM_BYTES><GENSTAMP>10354049310</GENSTAMP>

2. TXID=126060182170 OP_REASSIGN_LEASE

3. TXID=126060308267 OP_CLOSE
<MTIME>1591266492080</MTIME> 2020-06-04 18:28:12 <ATIME>1591264800229</ATIME> 
2020-06-04 18:00:00
blocks: 
...<BLOCK_ID>11382080753</BLOCK_ID><NUM_BYTES>63154347</NUM_BYTES><GENSTAMP>10354049316</GENSTAMP>

4. TXID=126060311503 OP_APPEND

5. TXID=126060311717 OP_SET_GENSTAMP_V2

<GENSTAMPV2>10354071495</GENSTAMPV2>

6. TXID=126060313001 OP_UPDATE_BLOCKS
blocks: 
...<BLOCK_ID>11382080753</BLOCK_ID><NUM_BYTES>63154347</NUM_BYTES><GENSTAMP>10354071495</GENSTAMP>

7. TXID=126060921401 OP_REASSIGN_LEASE

8. TXID=126060942290 OP_CLOSE
<MTIME>1591266619003</MTIME> 2020-06-04 18:30:19 <ATIME>1591264800229</ATIME> 
2020-06-04 18:00:00
blocks: 
...<BLOCK_ID>11382080753</BLOCK_ID><NUM_BYTES>63154347</NUM_BYTES><GENSTAMP>10354157480</GENSTAMP>

9. TXID=126060942548 OP_SET_GENSTAMP_V2

<GENSTAMPV2>10354157480</GENSTAMPV2>

10. TXID=126060942549 OP_TRUNCATE
<NEWLENGTH>868460715</NEWLENGTH>
<TIMESTAMP>1591266619207</TIMESTAMP> 2020-06-04 18:30:19
blocks: 
...<BLOCK_ID>11382080753</BLOCK_ID><NUM_BYTES>108764672</NUM_BYTES><GENSTAMP>10354157480</GENSTAMP>

11. TXID=126060943585 OP_CLOSE
<MTIME>1591266620287</MTIME>2020-06-04 18:30:20 
<ATIME>1591264800229</ATIME>2020-06-04 18:00:00
blocks: 
...<BLOCK_ID>11382080753</BLOCK_ID><NUM_BYTES>63154347</NUM_BYTES><GENSTAMP>10354157480</GENSTAMP>
{noformat}

The block size should be 108764672 in the first CloseOp (TXID=126060942290).
When truncate is used, the block size becomes 63154347.
The two CloseOps reference the same block instance, which causes the first
CloseOp to be recorded with the wrong block size (see the sketch below).
When the second CloseOp (TXID=126060943585) is replayed, the file is no longer in the
UnderConstruction state, and the Standby NameNode (SNN) goes down.
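
To illustrate the aliasing described above, here is a minimal, self-contained sketch (the Block/CloseOp classes below are simplified stand-ins, not the real BlockInfo/FSEditLogOp classes): when two queued ops hold a reference to the same mutable block object, a later mutation of that block also changes what the earlier op writes out when it is eventually serialized.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins; the real HDFS classes differ.
class Block {
    long numBytes;
    long genStamp;
    Block(long numBytes, long genStamp) { this.numBytes = numBytes; this.genStamp = genStamp; }
}

class CloseOp {
    final long txid;
    final Block block;   // shared reference, not a copy taken at queue time
    CloseOp(long txid, Block block) { this.txid = txid; this.block = block; }
    String serialize() { return "OP_CLOSE txid=" + txid + " numBytes=" + block.numBytes; }
}

public class SharedBlockDemo {
    public static void main(String[] args) {
        // Block state at the time the first CloseOp should be logged.
        Block blk = new Block(108764672L, 10354157480L);
        List<CloseOp> pending = new ArrayList<>();
        pending.add(new CloseOp(126060942290L, blk));   // first CloseOp queued

        // A later truncate/recovery mutates the SAME instance...
        blk.numBytes = 63154347L;
        pending.add(new CloseOp(126060943585L, blk));   // second CloseOp queued

        // ...so when the ops are finally serialized, the first one no longer
        // carries 108764672; both print numBytes=63154347.
        for (CloseOp op : pending) {
            System.out.println(op.serialize());
        }
    }
}
{code}

Run as-is, both ops print numBytes=63154347 even though the first one was queued while the block still had 108764672 bytes, which mirrors the wrong size recorded for the first CloseOp in the trace.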



> Standby NameNode loads a corrupted edit log, the service exits and cannot 
> be restarted
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15391
>                 URL: https://issues.apache.org/jira/browse/HDFS-15391
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.2.0
>            Reporter: huhaiyang
>            Priority: Critical
>
> In our version 3.2.0 production cluster,
>  we found that, due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in an abnormal exit of the service and 
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and 
> when an exception occurs while writing the file, the following operations are 
> performed (see the sketch after this block):
> 1. close file
> 2. open file
> 3. truncate file
> 4. append file
> {noformat}
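
For reference, here is a minimal client-side sketch of the close/truncate/append sequence above, assuming a default Configuration and a hypothetical file path (the real workload is driven by Flink; a production client would also wait properly for block recovery to finish instead of sleeping):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TruncateAppendSequence {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hdfs-15391-demo");    // hypothetical path

        // 1. write the file and close it
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write(new byte[1024 * 1024]);
        }

        // 2./3. open + truncate: shorten the file; if truncate() returns
        // false, the last block is still being recovered and the client
        // must wait before it can append again.
        long newLength = 512 * 1024;
        boolean finished = fs.truncate(file, newLength);
        if (!finished) {
            Thread.sleep(5000);   // crude wait, for the sketch only
        }

        // 4. append to the truncated file
        try (FSDataOutputStream out = fs.append(file)) {
            out.write("more data".getBytes("UTF-8"));
        }
        fs.close();
    }
}
{code}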


