[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

Zephyr Guo (JIRA) Wed, 07 Mar 2018 18:16:17 -0800

     [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zephyr Guo updated HDFS-13243:
------------------------------
    Description: 
HDFS File might get broken because of corrupt block(s) that could be produced 
by calling close and sync in the same time.

When calling close was not successful, UCBlock status would change to 
COMMITTED, and if a sync request gets popped from queue and processed, sync 
operation would change the last block length.

After that, DataNode would report all received block to NameNode, and will 
check Block length of all COMMITTED Blocks. But the block length was already 
different between recorded in NameNode memory and reported by DataNode, and 
consequently, the last block is marked as corruptted because of inconsistent 
length.

 
{panel:title=Log in my hdfs}
2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
truncateBlock=null, primaryNodeIndex=-1, 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
 
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
 
ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
 for 
/hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: 
/hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
 for DFSClient_NONMAPREDUCE_1077513762_1
2018-03-05 04:05:39,761 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
primaryNodeIndex=-1, 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
 
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
 
ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
 is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in file 
/hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: 10.0.0.220:50010 is added to 
blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
primaryNodeIndex=-1, 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
 
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
 
ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
 size 2054413
2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
10.0.0.219:50010 by hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 
because block is COMMITTED and reported length 2054413 does not match length in 
block map 141232
2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
10.0.0.218:50010 by hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 
because block is COMMITTED and reported length 2054413 does not match length in 
block map 141232
2018-03-05 04:05:40,162 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
primaryNodeIndex=-1, 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
 
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
 
ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
 is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in file 
/hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
{panel}

  was:
HDFS File might get broken because of corrupt block(s) that could be produced 
by calling close and sync in the same time.

When calling close was not successful, UCBlock status would change to 
COMMITTED, and if a sync request gets popped from queue and processed, sync 
operation would change the last block length.

After that, DataNode would report all received block to NameNode, and will 
check Block length of all COMMITTED Blocks. But the block length was already 
different between recorded in NameNode memory and reported by DataNode, and 
consequently, the last block is marked as corruptted because of inconsistent 
length.


> Get CorruptBlock because of calling close and sync in same time
> ---------------------------------------------------------------
>
>                 Key: HDFS-13243
>                 URL: https://issues.apache.org/jira/browse/HDFS-13243
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.2, 3.2.0
>            Reporter: Zephyr Guo
>            Priority: Critical
>             Fix For: 3.2.0
>
>         Attachments: HDFS-13243-v1.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> {panel}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

Reply via email to