[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802449#comment-17802449 ]

Shilun Fan commented on HDFS-13243:
---

Bulk update: moved all 3.4.0 non-blocker issues; please move back if it is a blocker.

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.7.2, 3.2.0
> Reporter: Zephyr Guo
> Assignee: daimin
> Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, HDFS-13243-v6.patch
>
> An HDFS file can end up with corrupt block(s) produced by calling close and sync at the same time.
> When the close call is not successful, the UC block's state changes to COMMITTED; if a sync request is then popped from the queue and processed, the sync operation changes the last block's length.
> After that, the DataNodes report all received blocks to the NameNode, which checks the block length of all COMMITTED blocks. The block length recorded in NameNode memory now differs from the length reported by the DataNodes, so the last block is marked as corrupt because of the inconsistent length.
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} for /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.0.0.220:50010 is added to blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.219:50010 by hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.218:50010 by hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
> 2018-03-05 04:05:40,162 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in file
> {panel}
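The failure sequence in the description can be sketched as a minimal, self-contained model. BlockRecord and its methods are hypothetical names for illustration only, not HDFS classes or RPCs; the lengths are taken from the log above. close() commits the block with the full length, a late sync overwrites the recorded length with an older, shorter value, and the DataNodes' report of the real on-disk size then mismatches the block map.

```java
// Minimal model of the close/sync race (hypothetical BlockRecord class; not HDFS code).
class BlockRecord {
    private boolean committed = false;
    private long recordedLength = -1;   // length in the NameNode's block map

    // thread1: close() moves the UC block to COMMITTED with the full length.
    synchronized void commit(long length) {
        recordedLength = length;
        committed = true;
    }

    // thread2: a queued sync is processed after the commit and overwrites
    // the recorded last-block length with an older, shorter value.
    synchronized void fsync(long length) {
        recordedLength = length;
    }

    // DataNode block report: a COMMITTED block whose reported length differs
    // from the recorded length is marked corrupt.
    synchronized boolean isCorrupt(long reportedLength) {
        return committed && reportedLength != recordedLength;
    }
}

public class CloseSyncRace {
    public static void main(String[] args) {
        BlockRecord blk = new BlockRecord();
        blk.commit(2054413);            // close() commits the full length
        blk.fsync(141232);              // late sync shrinks the recorded length
        // DataNodes later report the real on-disk size of 2054413:
        System.out.println(blk.isCorrupt(2054413) ? "marked corrupt" : "ok");
    }
}
```

Note that each method here is synchronized; the model still corrupts the record because the *ordering* of commit and fsync is wrong, which is exactly why per-call locking in flushOrSync() is not enough.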
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532770#comment-17532770 ]

daimin commented on HDFS-13243:
---

[~gzh1992n] [~weichiu] We encountered the same problem in our cluster and fixed it some months ago. Since this Jira is still unresolved, I would like to continue the fix. Please let me know if you have any concerns about this, thanks.
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065533#comment-17065533 ]

Hadoop QA commented on HDFS-13243:
--

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 8s | HDFS-13243 does not apply to trunk. Rebase required? Wrong branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-13243 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12918525/HDFS-13243-v6.patch |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/29010/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617254#comment-16617254 ]

Sunil Govindan commented on HDFS-13243:
---

As the code freeze for 3.2 has passed, moving this Jira to 3.3. Please feel free to revert if anyone has concerns. Thank you.
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608737#comment-16608737 ]

Hadoop QA commented on HDFS-13243:
--

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 6s | HDFS-13243 does not apply to trunk. Rebase required? Wrong branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-13243 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12918525/HDFS-13243-v6.patch |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/25016/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608710#comment-16608710 ]

Sunil Govindan commented on HDFS-13243:
---

Ping again: [~gzh1992n]
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595927#comment-16595927 ]

Sunil Govindan commented on HDFS-13243:
---

[~gzh1992n] As this Jira is marked as critical for 3.2, could you please help take this forward, or move it out if it is not feasible to finish in the coming weeks? The 3.2 code freeze date is about a week away. Kindly help to check the same.
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.0.0.220:50010 is added to > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > size 2054413 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.219:50010 by > hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.218:50010 by > hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:40,162 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], >
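The corrupt-marking decision visible in the addToCorruptReplicasMap lines above can be sketched as a small toy model. The class and method names here (ToyBlockManager, ToyCorruptCheck) are invented for illustration and are not the real BlockManager code; the idea is only that a replica reported for a COMMITTED block is marked corrupt when its reported length differs from the length recorded in the block map.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the check in the log above: a replica reported for a
// COMMITTED block whose length differs from the length in the block map
// is added to the corrupt-replicas map. Names are invented for
// illustration; this is not the real BlockManager code.
class ToyBlockManager {
    final long lengthInBlockMap;                 // e.g. 141232 after the stale fsync
    final Set<String> corruptReplicas = new HashSet<>();

    ToyBlockManager(long lengthInBlockMap) {
        this.lengthInBlockMap = lengthInBlockMap;
    }

    // Called when a DataNode reports a received replica of a COMMITTED block.
    void addStoredBlock(String datanode, long reportedLength) {
        if (reportedLength != lengthInBlockMap) {
            // "block is COMMITTED and reported length ... does not match
            // length in block map ..."
            corruptReplicas.add(datanode);
        }
    }
}

public class ToyCorruptCheck {
    public static void main(String[] args) {
        ToyBlockManager bm = new ToyBlockManager(141232);
        bm.addStoredBlock("10.0.0.219:50010", 2054413);  // marked corrupt
        bm.addStoredBlock("10.0.0.218:50010", 2054413);  // marked corrupt
        System.out.println("corrupt replicas: " + bm.corruptReplicas.size());
    }
}
```

With the numbers from the log (block map length 141232, reported length 2054413), both reporting DataNodes end up in the corrupt-replicas map, which matches the two addToCorruptReplicasMap lines.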
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443630#comment-16443630 ] Zephyr Guo commented on HDFS-13243: --- Thank you for reviewing, [~daryn]. There are some mistakes in your summaries.
{quote}
thread1 is writing and closes the stream
thread2 is syncing the stream
thread1 commits the block with size -141232- 2054413
thread2 fsyncs with size -2054413- 141232
DNs report block with size 2054413, marked corrupt
{quote}
{quote}
This sounds like a serious client-side issue.
{quote}
Yes, as I said in the comments above. The client calls sync() with a correct length, but the sync request could be sent after close(). The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the following simplified code:
{code}
synchronized (this) {
  ... // code before sending the request
}
// **We send the request here, but it is not inside the synchronized block**
if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
  try {
    dfsClient.namenode.fsync(src, fileId, dfsClient.clientName, lastBlockLength);
  } catch (IOException ioe) {
    // Deal with ioe
  }
}
synchronized (this) {
  ... // code after sending the request
}
{code}
I am not sure how to fix the client side. Could we put the RPC into the synchronized code block directly? Keeping the RPC outside the synchronized block may give more performance benefit.
{quote}
We cannot simply return success in some invalid cases: i.e. fsync when the file has no blocks, size is negative, size is less than the last synced/committed size. That just masks bugs.
{quote}
TestHFlush.hSyncUpdateLength_00 makes a sync call with no blocks, so I don't think this is a bug case.
{quote}
Also, we shouldn't need all the new factories. The tests must verify that namesystem calls, in various specific orders, with specific arguments that are good/bad, either succeed or fail.
{quote}
I can add new test cases to verify namesystem calls in various specific orders.
I have to mock DFSOutputStream to reproduce the corrupt-block bug, so I need all the new factories. If mocking DFSOutputStream is not necessary in your opinion, I will remove it.
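The race described in the comment above can be made deterministic in a small simulation. All names here (ToyFsyncRace, ToyNameNode) are invented for illustration and are not the real DFSOutputStream code; the latches stand in for the scheduling that lets close() commit the final length while a queued sync still carries the older length:

```java
import java.util.concurrent.CountDownLatch;

// Deterministic toy model of the close()/sync race: the syncing thread
// snapshots the length under the lock, releases the lock before the RPC,
// and its stale length lands at the NameNode after close() committed.
public class ToyFsyncRace {
    static class ToyNameNode {
        volatile long lastBlockLength = -1;
        void commitBlock(long len) { lastBlockLength = len; }
        void fsync(long len)       { lastBlockLength = len; }
    }

    public static long simulate() {
        ToyNameNode nn = new ToyNameNode();
        CountDownLatch lockReleased = new CountDownLatch(1);
        CountDownLatch closed = new CountDownLatch(1);

        Thread syncer = new Thread(() -> {
            long lastBlockLength;
            synchronized (ToyFsyncRace.class) {
                lastBlockLength = 141232;   // length acked when sync was queued
            }
            lockReleased.countDown();       // lock released: race window opens
            try { closed.await(); } catch (InterruptedException ignored) { }
            nn.fsync(lastBlockLength);      // stale length overwrites the commit
        });

        Thread closer = new Thread(() -> {
            try { lockReleased.await(); } catch (InterruptedException ignored) { }
            synchronized (ToyFsyncRace.class) {
                nn.commitBlock(2054413);    // close() commits the full length
            }
            closed.countDown();
        });

        syncer.start();
        closer.start();
        try {
            syncer.join();
            closer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return nn.lastBlockLength;          // 141232, not 2054413
    }

    public static void main(String[] args) {
        System.out.println("length in block map: " + simulate());
    }
}
```

The NameNode ends up recording 141232 for a block whose replicas are 2054413 bytes long, which is exactly the mismatch the DataNode block reports then flag as corrupt.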
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441312#comment-16441312 ] Daryn Sharp commented on HDFS-13243: Maybe I overlooked some details, but please summarize the problem. As best I can tell:
* thread1 is writing and closes the stream
* thread2 is syncing the stream
* thread1 commits the block with size 141232
* thread2 fsyncs with size 2054413
* DNs report block with size 2054413, marked corrupt
Am I wrong? If not, how on earth can the block be committed with a size _less_ than a racing fsync? This sounds like a serious client-side issue. Agree the server-side logic needs to be improved. Note that {{FileUnderConstructionFeature#updateLengthOfLastBlock}} guards against some of the invalid/malicious cases, but unfortunately via asserts. They don't need to be quasi-duplicated in this patch. We cannot simply return success in some invalid cases: i.e. fsync when the file has no blocks, size is negative, or size is less than the last synced/committed size. That just masks bugs. Also, we shouldn't need all the new factories. The tests must verify that namesystem calls, in various specific orders, with specific arguments that are good/bad, either succeed or fail. We can't rely on the behavior of writing to a stream to prove the correctness of the namesystem guarding against invalid input from the client.
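The validation Daryn asks for (fail fast on clearly invalid fsync lengths instead of masking them with success or relying on asserts) could be sketched as explicit checks. The names below are invented, not the real FSNamesystem or FileUnderConstructionFeature code; note the sketch still accepts a length-0 sync on a file with no blocks, the TestHFlush.hSyncUpdateLength_00 case raised in the reply above:

```java
import java.io.IOException;

// Hypothetical server-side guard: reject invalid last-block lengths from
// a client fsync instead of silently succeeding or asserting.
public class ToyFsyncGuard {
    // hasBlocks: whether the file has any blocks at all.
    // committedLength: the last synced/committed length the NameNode recorded.
    // reportedLength: the length the client's fsync carries.
    static void validateFsyncLength(boolean hasBlocks, long committedLength,
                                    long reportedLength) throws IOException {
        if (reportedLength < 0) {
            throw new IOException("fsync with negative length " + reportedLength);
        }
        if (!hasBlocks && reportedLength > 0) {
            throw new IOException("fsync with nonzero length on a file with no blocks");
        }
        if (hasBlocks && reportedLength < committedLength) {
            throw new IOException("fsync length " + reportedLength
                + " is less than committed length " + committedLength);
        }
    }

    public static void main(String[] args) {
        try {
            // The stale fsync from the log: 141232 after 2054413 was committed.
            validateFsyncLength(true, 2054413, 141232);
            System.out.println("accepted");
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this shape the stale fsync from the log is rejected at the RPC boundary, while a legitimate re-sync of the same length and the no-blocks, length-0 case both pass.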
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433756#comment-16433756 ] genericqa commented on HDFS-13243: -- *-1 overall.* Prechecks (no @author tags; 2 new or modified test files), all trunk compile checks, and all patch compile checks passed, except:
|| Vote || Subsystem || Runtime || Comment ||
| -0 | checkstyle | 1m 3s | hadoop-hdfs-project: The patch generated 104 new + 853 unchanged - 2 fixed = 957 total (was 855) |
| -1 | unit | 106m 37s | hadoop-hdfs in the patch failed. |
hadoop-hdfs-client unit tests passed (1m 24s); the patch generates no ASF License warnings; total runtime 199m 33s.
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |
| | hadoop.tools.TestHdfsConfigFields |
| | hadoop.hdfs.TestEncryptionZonesWithKMS |
JIRA Patch URL: https://issues.apache.org/jira/secure/attachment/12918525/HDFS-13243-v6.patch | git revision: trunk / b0aff8a | Docker image: yetus/hadoop:8620d2b
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433543#comment-16433543 ] Zephyr Guo commented on HDFS-13243: --- I agree with you, [~jojochuang]. Patch v6 does not fix the client side, and I fixed the failure of the test case.
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432621#comment-16432621 ] Wei-Chiu Chuang commented on HDFS-13243: Hi Zephyr, thanks for updating the patch. Unfortunately it looks like TestHFlush.hSyncUpdateLength_00 failed, and it is reproducible on my local machine.
{noformat}
java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.fsync(FSNamesystem.java:3279)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.fsync(NameNodeRpcServer.java:1414)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.fsync(ClientNamenodeProtocolServerSideTranslatorPB.java:1026)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
{noformat}
Also, I am still not sure about the client-side fix. It basically extends the synchronized block to make sure it holds the lock while waiting for the ack and calling fsync at the NameNode.
That aside, please refrain from using wildcard imports (DFSOutputStream.java).
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431991#comment-16431991 ] genericqa commented on HDFS-13243: -- *-1 overall.* Prechecks (no @author tags; 2 new or modified test files), all trunk compile checks, and all patch compile checks passed, except:
|| Vote || Subsystem || Runtime || Comment ||
| -0 | checkstyle | 1m 7s | hadoop-hdfs-project: The patch generated 104 new + 853 unchanged - 2 fixed = 957 total (was 855) |
| -1 | unit | 103m 38s | hadoop-hdfs in the patch failed. |
hadoop-hdfs-client unit tests passed (1m 28s); the patch generates no ASF License warnings; total runtime 197m 35s.
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |
| | hadoop.hdfs.server.namenode.TestNameNodeMXBean |
| | hadoop.tools.TestHdfsConfigFields |
| | hadoop.hdfs.TestHFlush |
| | hadoop.hdfs.TestDFSStripedOutputStreamWithFailureWithRandomECPolicy |
JIRA Patch URL: https://issues.apache.org/jira/secure/attachment/12918321/HDFS-13243-v5.patch | Docker image: yetus/hadoop:8620d2b
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431686#comment-16431686 ] genericqa commented on HDFS-13243: -- *-1 overall.* Prechecks (no @author tags; 2 new or modified test files) and all trunk compile checks passed, but the patch failed to build:
|| Vote || Subsystem || Runtime || Comment ||
| -1 | mvninstall | 0m 30s | hadoop-hdfs in the patch failed. |
| -1 | compile | 0m 53s | hadoop-hdfs-project in the patch failed. |
| -1 | javac | 0m 53s | hadoop-hdfs-project in the patch failed. |
| -0 | checkstyle | 1m 11s | hadoop-hdfs-project: The patch generated 114 new + 853 unchanged - 2 fixed = 967 total (was 855) |
| -1 | mvnsite | 0m 31s | hadoop-hdfs in the patch failed. |
| -1 | shadedclient | 3m 27s | patch has errors when building and testing our client artifacts. |
| -1 | findbugs | 0m 32s | hadoop-hdfs in the patch failed. |
| -1 | javadoc | 0m 42s | hadoop-hdfs-project_hadoop-hdfs generated 9 new + 1 unchanged - 0 fixed = 10 total (was 1) |
| -1 | unit | 0m 31s | hadoop-hdfs in the patch failed. |
hadoop-hdfs-client unit tests passed (1m 25s); no whitespace issues; the patch generates no ASF License warnings; total runtime 74m 37s.
JIRA Patch URL: https://issues.apache.org/jira/secure/attachment/12918302/HDFS-13243-v5.patch | git revision: trunk / 0006346 | Docker image: yetus/hadoop:8620d2b
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431686#comment-16431686 ] Zephyr Guo commented on HDFS-13243: --- I have rebased. [~jojochuang] > Get CorruptBlock because of calling close and sync in same time > --- > > Key: HDFS-13243 > URL: https://issues.apache.org/jira/browse/HDFS-13243 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2, 3.2.0 >Reporter: Zephyr Guo >Assignee: Zephyr Guo >Priority: Critical > Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, > HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch > > > An HDFS file might get broken because of corrupt blocks produced by calling > close and sync at the same time. > When a close call is not successful, the UCBlock status changes to > COMMITTED, and if a sync request is then popped from the queue and processed, > the sync operation changes the last block's length. > After that, the DataNode reports all received blocks to the NameNode, which > checks the block length of every COMMITTED block. But the block length > recorded in NameNode memory now differs from the length reported by the > DataNode, so the last block is marked as corrupt because of the inconsistent > length. 
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.0.0.220:50010 is added to > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > size 2054413 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.219:50010 by > hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.218:50010 by > hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:40,162 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in > file >
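The log lines above show the mismatch concretely: the NameNode's block map still records 141232 bytes while the DataNodes report 2054413. As a rough illustration (plain Java, not actual Hadoop code; the class and method names here are invented for the sketch), the COMMITTED-length check that triggers addToCorruptReplicasMap behaves like this:

```java
// Toy model of the race described above: close() commits the block, a queued
// sync then rewrites the recorded length, and the later block report no
// longer matches. All names are illustrative, not HDFS internals.
enum UCState { UNDER_CONSTRUCTION, COMMITTED }

class BlockRecord {
    UCState state = UCState.UNDER_CONSTRUCTION;
    long numBytes;                      // length recorded in NameNode memory

    // close(): commit the block with the length the closing client saw
    synchronized void commit(long clientLength) {
        numBytes = clientLength;
        state = UCState.COMMITTED;
    }

    // fsync(): a queued sync request updates the last block's length,
    // even though the block has already been committed
    synchronized void updateLength(long syncLength) {
        numBytes = syncLength;
    }

    // block report: a COMMITTED block must match the reported length exactly
    synchronized boolean isCorrupt(long reportedLength) {
        return state == UCState.COMMITTED && reportedLength != numBytes;
    }
}

public class CommitSyncRace {
    public static void main(String[] args) {
        BlockRecord blk = new BlockRecord();
        blk.commit(2054413);        // close() commits the full length
        blk.updateLength(141232);   // stale queued sync clobbers it
        // DataNodes report 2054413, block map says 141232 -> marked corrupt
        System.out.println(blk.isCorrupt(2054413)); // true
    }
}
```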
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425716#comment-16425716 ] Wei-Chiu Chuang commented on HDFS-13243: Code doesn't compile. Please rebase your patch.
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425384#comment-16425384 ] genericqa commented on HDFS-13243: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 29s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 26s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 14s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 8s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 28s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 46s{color} | {color:red} hadoop-hdfs-project in the patch failed. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 46s{color} | {color:red} hadoop-hdfs-project in the patch failed. {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 2s{color} | {color:orange} hadoop-hdfs-project: The patch generated 114 new + 853 unchanged - 2 fixed = 967 total (was 855) {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 28s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 2m 58s{color} | {color:red} patch has errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 30s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 42s{color} | {color:red} hadoop-hdfs-project_hadoop-hdfs generated 9 new + 1 unchanged - 0 fixed = 10 total (was 1) {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 29s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 28s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 68m 59s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8620d2b | | JIRA Issue | HDFS-13243 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12916598/HDFS-13243-v4.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 08f7bbe924cb 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / b779f4f | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_162 | | findbugs |
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425326#comment-16425326 ] Wei-Chiu Chuang commented on HDFS-13243: The patch did not compile due to other patches. Triggered the precommit build again.
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417288#comment-16417288 ] genericqa commented on HDFS-13243: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 7s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 19s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 34s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 1m 4s{color} | {color:red} hadoop-hdfs-project in the patch failed. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 4s{color} | {color:red} hadoop-hdfs-project in the patch failed. {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 7s{color} | {color:orange} hadoop-hdfs-project: The patch generated 114 new + 853 unchanged - 2 fixed = 967 total (was 855) {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 36s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 3m 16s{color} | {color:red} patch has errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 35s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 45s{color} | {color:red} hadoop-hdfs-project_hadoop-hdfs generated 9 new + 1 unchanged - 0 fixed = 10 total (was 1) {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 34s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 37s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 79m 23s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8620d2b | | JIRA Issue | HDFS-13243 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12916598/HDFS-13243-v4.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux b890fdc614d5 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 411993f | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_151 | | findbugs |
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417198#comment-16417198 ] Zephyr Guo commented on HDFS-13243: --- Rebased patch-v3 and attached v4.
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417167#comment-16417167 ] Zephyr Guo commented on HDFS-13243: --- Hi [~jojochuang], I attached patch v3. I moved the RPC call into the synchronized code block, and I tried to make the mock code as clear as possible.
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.0.0.220:50010 is added to > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > size 2054413 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.219:50010 by > hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.218:50010 by > hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:40,162 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in > file >
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417165#comment-16417165 ] genericqa commented on HDFS-13243: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 5s{color} | {color:red} HDFS-13243 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | HDFS-13243 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12916595/HDFS-13243-v3.patch | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/23701/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated.
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409969#comment-16409969 ] Wei-Chiu Chuang commented on HDFS-13243: Reviewing the patch again: at first I thought we could mock the class, but it is called from a static method, so that does not seem easy. Instead, I suggest looking into FsDatasetFactory, FsDatasetImpl and FsDatasetSpi<>#newInstance(), which is cleaner than the approach in patch v2. {quote}BTW, why don't we include dfsClient.namenode.fsync() in the synchronized code block? Is this for a performance benefit? If so, is it necessary to fix the client? {quote} Making an RPC call inside a synchronized block seems like an anti-pattern.
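The factory-based test seam suggested above (FsDatasetFactory / FsDatasetSpi#newInstance) can be sketched in miniature. The classes below are hypothetical stand-ins, not Hadoop's actual types; they only illustrate how a test can swap an implementation through a factory instead of reflection:

```java
// Hypothetical sketch of a factory-override test seam. None of these names
// are real Hadoop classes; they only illustrate the pattern being suggested.
import java.util.function.Supplier;

public class FactorySeamSketch {
    public interface Dataset {
        long blockLength(long blockId);
    }

    public static class DiskDataset implements Dataset {
        public long blockLength(long blockId) { return 141232L; } // pretend on-disk length
    }

    // Production code asks the factory for an instance; a test swaps the
    // supplier instead of reaching into private fields via reflection.
    public static Supplier<Dataset> factory = DiskDataset::new;

    public static long reportBlockLength(long blockId) {
        return factory.get().blockLength(blockId);
    }

    public static void main(String[] args) {
        // A test overrides the factory with a fake dataset reporting a longer block.
        factory = () -> blockId -> 2054413L;
        System.out.println(reportBlockLength(1085498930L)); // 2054413
    }
}
```

Because the seam is an ordinary field, no new configuration keys are needed; the production default stays in place unless a test replaces it.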
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392889#comment-16392889 ] Zephyr Guo commented on HDFS-13243: --- [~jojochuang] {quote} I suspect this race condition happens because of this unusual setting. (or makes it more prone to this bug) {quote} The minimal replication is 1 in my test case. I agree that this unusual setting makes a cluster more prone to this bug. {quote} If the problem is client side race condition, I would recommend fixing it at client side. {quote} We have to fix the server side as well; we have no way to make every user update their client code, right? I will write a new patch in several days, thanks for your advice. BTW, why don't we include dfsClient.namenode.fsync() in the synchronized code block? Is this for a performance benefit? If so, is it necessary to fix the client?
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392812#comment-16392812 ] Wei-Chiu Chuang commented on HDFS-13243: [~gzh1992n] thanks for the comments. That makes a lot more sense now. {quote}Client call sync() with a *correct* length, but sync request could be sent after close(). The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. {quote} If the problem is a client-side race condition, I would recommend fixing it at the client side. Suppose the client calls fsync() but the server ignores it because of the client race condition; doesn't that mean the block is not sync'ed? The client expects that if fsync() does not throw an IOException, the call succeeded. {quote}I'm not sure that data reliability would be affected if minimal replication set to 1. Do you have some experience about this? {quote} This is a typical misunderstanding: minimal replication is not the replication factor. And setting minimal replication to 2 is not a well-tested configuration, at least in CDH. I work for Cloudera and we have thousands of customers; almost none of them set minimal replication to 2. There is just one that I am aware of, and that customer has reported quite a few tricky issues because of it: files cannot be closed in a large cluster because DataNode block reports are not processed in time (you need two DNs to report an updated block for it to become COMPLETE and get closed), decommissioning does not complete at all due to a corner case, etc. Long story short, I suspect this race condition happens because of this unusual setting (or the setting makes the cluster more prone to this bug). {quote}I don't understand this. There is no API that set impl of FSNamesystem in MiniCluster. Could you give me a sample in another test case? I will rewrite this patch. {quote} There's none. But you can make one :) {quote}In server-side, we could log warn for wrong length and throw exception for invalid state. Is this better than current version? {quote} That sounds like a plan. Thank you.
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392581#comment-16392581 ] neef6 commented on HDFS-13243: --
{code:java}
/**
 * The block is committed.
 * The client reported that all bytes are written to data-nodes
 * with the given generation stamp and block length, but no
 * {@link ReplicaState#FINALIZED}
 * replicas has yet been reported by data-nodes themselves.
 */
COMMITTED;

else if (storedBlock.getNumBytes() != reported.getNumBytes()) {
  return new BlockToMarkCorrupt(storedBlock,
      "block is " + ucState + " and reported length " +
      reported.getNumBytes() + " does not match " +
      "length in block map " + storedBlock.getNumBytes(),
      Reason.SIZE_MISMATCH);
{code}
The length reported by client.close() is smaller than the length reported by the DataNode. Looking at the log, the error occurs between client.close() and dn.report(); where does sync() come into it? I also have a question: between allocateBlock and the DataNode report there should be a change in the blockMap, but I couldn't find the code.
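The SIZE_MISMATCH branch quoted above can be modeled in isolation. The following is a loose, self-contained sketch of that check, not the real BlockManager code:

```java
// Simplified model of the NameNode check quoted above: a COMMITTED block whose
// reported length differs from the length in the block map is marked corrupt.
// This mirrors the real BlockManager logic only loosely.
public class CommittedLengthCheck {
    public enum UCState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    /** Returns a "mark corrupt" reason string, or null if the report is consistent. */
    public static String checkReplica(UCState state, long storedBytes, long reportedBytes) {
        if (state == UCState.COMMITTED && storedBytes != reportedBytes) {
            return "block is " + state + " and reported length " + reportedBytes
                + " does not match length in block map " + storedBytes; // SIZE_MISMATCH
        }
        return null; // consistent, or not yet committed
    }

    public static void main(String[] args) {
        // The lengths from the log: NameNode recorded 141232, DataNodes reported 2054413.
        System.out.println(checkReplica(UCState.COMMITTED, 141232L, 2054413L));
        System.out.println(checkReplica(UCState.COMMITTED, 2054413L, 2054413L)); // null
    }
}
```

This also answers part of neef6's question: nothing in the block map has to change between allocateBlock and the report; it is the stale fsync length in the map that makes the (correct) DataNode report look corrupt.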
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392309#comment-16392309 ] Zephyr Guo commented on HDFS-13243: --- [~jojochuang], thanks for reviewing.
{quote}1. It seems to me the root of problem is that client would call fsync() with an incorrect length (shorter than what it is supposed to sync). If that's the case you should fix the client (DFSOutputStream), rather than the NameNode.{quote}
The client calls sync() with a *correct* length, but the sync request can be sent after close(). The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the following simplified code:
{code:java}
synchronized (this) {
  ... // code before sending the request
}
// **The request is sent here, but it is NOT inside the synchronized block**
if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
  try {
    dfsClient.namenode.fsync(src, fileId, dfsClient.clientName, lastBlockLength);
  } catch (IOException ioe) {
    // Deal with ioe
  }
}
synchronized (this) {
  ... // code after sending the request
}
{code}
{quote}2. Looking at the log, your minimal replication number is 2, rather than 1. That's very unusual. In my past experience a lot of weird behavior like this could arise when you have that kind of configuration.{quote}
I'm not sure that data reliability would be affected if minimal replication is set to 1. Do you have any experience with this?
{quote}3. And why is close() in the picture? IMHO you don't even need to close(). Suppose you block DataNode heartbeat, and let client keep the file open and then call sync(), the last block's state remains in COMMITTED. Would that cause the same behavior?{quote}
close() must be called after sync (see the root cause above). If you don't call close(), the last block's state can't change to COMMITTED, right?
{quote}4. Looking at the patch, I would like to ask you to stay away from using reflection. You could refactor FSNamesystem and DFSOutputStream to return a new FSNamesystem/DFSOutputStream object and override them in the test code. That way, you don't need to introduce new configurations too. And it'll be much cleaner.{quote}
I don't understand this. There is no API that sets the impl of FSNamesystem in MiniCluster. Could you give me a sample in another test case? I will rewrite this patch.
{quote}5. I don't understand the following code.{quote}
My fixed code is not the final version, because I don't know whether the DFSOutputStream#flushOrSync() impl is OK. We send the sync RPC without the lock in the client, maybe for a performance benefit? If it is for a benefit, we just fix the server side; if not, we need to fix both the server side and the client. On the server side, we could log a warning for a wrong length and throw an exception for an invalid state. Is this better than the current version?
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391178#comment-16391178 ] Wei-Chiu Chuang commented on HDFS-13243:
---
Hi [~gzh1992n], thanks very much for reporting the issue. The NN log is useful too. I looked at the patch and the log and tried to understand where the problem is. I don't think I understand the problem fully, but here are some thoughts I'd like to share with you.
# It seems to me the root of the problem is that the client would call fsync() with an incorrect length (shorter than what it is supposed to sync). If that's the case, you should fix the client (DFSOutputStream) rather than the NameNode.
# Looking at the patch, I would like to ask you to stay away from using reflection. You could refactor FSNamesystem and DFSOutputStream to return a new FSNamesystem/DFSOutputStream object and override them in the test code. That way, you don't need to introduce new configurations either, and it'll be much cleaner.
# I don't understand the following code.
## If lastBlockLength <= 0 || lastBlockLength <= b.getNumBytes() is unexpected, you should not just log a debug message and ignore it. It's got to be a WARN-level message, and you should also log the size of b.getNumBytes(). There's also a grammatical error in the log message.
## If your fix is correct, you should update the assertion in FileUnderConstructionFeature#updateLengthOfLastBlock() so it expects neither COMMITTED nor COMPLETE.
## What should it do when the block state is unexpected? I don't think you should just ignore it.
{code:java}
BlockInfo b = pendingFile.getLastBlock();
if (lastBlockLength <= 0 || lastBlockLength <= b.getNumBytes()) {
  LOG.debug("lastBlockLength(" + lastBlockLength
      + ") seems wrong, maybe have a bug here?");
  return;
}
if (b.getBlockUCState() != BlockUCState.COMMITTED
    && b.getBlockUCState() != BlockUCState.COMPLETE) {
{code}
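For concreteness, here is one hedged sketch of the shape the review asks for: a WARN that includes both lengths, and an explicit error for an unexpected block state instead of silently ignoring it. The class, method, and message wording are illustrative only, not the actual patch:

```java
import java.io.IOException;

// Illustrative only: a standalone rendering of the review's suggestions.
// In Hadoop the real check lives in FSNamesystem /
// FileUnderConstructionFeature, not in a class like this.
public class LastBlockLengthCheck {
    enum BlockUCState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    static void validate(long lastBlockLength, long recordedBytes,
                         BlockUCState state) throws IOException {
        if (lastBlockLength <= 0 || lastBlockLength <= recordedBytes) {
            // WARN-level, with both lengths, so the mismatch is diagnosable.
            System.err.println("WARN: fsync reported lastBlockLength="
                + lastBlockLength + " but the block map already records "
                + recordedBytes + " bytes; ignoring stale sync");
            return;
        }
        if (state == BlockUCState.COMMITTED || state == BlockUCState.COMPLETE) {
            // Do not silently ignore an unexpected state: fail loudly.
            throw new IOException(
                "Cannot update length of last block in state " + state);
        }
        // ...otherwise it is safe to update the under-construction length.
    }

    public static void main(String[] args) throws IOException {
        validate(100, 200, BlockUCState.UNDER_CONSTRUCTION); // stale: warns and returns
        validate(300, 200, BlockUCState.UNDER_CONSTRUCTION); // genuine update: passes
    }
}
```

The key difference from the code quoted above is that the invalid-state branch throws rather than falling through, matching the review's point that an unexpected state should not be ignored.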
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390962#comment-16390962 ] genericqa commented on HDFS-13243:
---
(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 33s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 0m 18s | Maven dependency ordering for branch |
| +1 | mvninstall | 15m 49s | trunk passed |
| +1 | compile | 13m 11s | trunk passed |
| +1 | checkstyle | 2m 53s | trunk passed |
| +1 | mvnsite | 2m 57s | trunk passed |
| +1 | shadedclient | 15m 25s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 4m 51s | trunk passed |
| +1 | javadoc | 2m 18s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 17s | Maven dependency ordering for patch |
| +1 | mvninstall | 2m 12s | the patch passed |
| +1 | compile | 13m 5s | the patch passed |
| +1 | javac | 13m 5s | the patch passed |
| -0 | checkstyle | 2m 38s | root: The patch generated 73 new + 670 unchanged - 2 fixed = 743 total (was 672) |
| +1 | mvnsite | 2m 58s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 9m 16s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 5m 21s | the patch passed |
| +1 | javadoc | 2m 24s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 8m 30s | hadoop-common in the patch passed. |
| +1 | unit | 1m 41s | hadoop-hdfs-client in the patch passed. |
| -1 | unit | 139m 12s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 1m 11s | The patch does not generate ASF License warnings. |
| | | 243m 47s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
| | hadoop.hdfs.web.TestWebHdfsTimeouts |
| | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
| | hadoop.tools.TestHdfsConfigFields |
| | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HDFS-13243 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12913521/HDFS-13243-v2.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390772#comment-16390772 ] Zephyr Guo commented on HDFS-13243:
---
Attached the v2 patch to fix the NoSuchMethodException.
{panel:title=Log in my hdfs}
2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.219:50010 by hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.218:50010 by hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
2018-03-05 04:05:40,162 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, ...} is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in file ...
{panel}
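The corrupt-marking rule those addToCorruptReplicasMap lines exercise reduces to a small predicate. This is a hypothetical condensation for illustration; the real logic lives in the NameNode's block manager and involves more state than shown here:

```java
// Sketch of the rule visible in the log: once the last block is COMMITTED
// with a recorded length, any DataNode-reported replica whose length differs
// is marked corrupt. Names are illustrative, not Hadoop's.
public class CommittedLengthRule {
    static boolean markCorrupt(String ucState, long lengthInBlockMap, long reportedLength) {
        return "COMMITTED".equals(ucState) && reportedLength != lengthInBlockMap;
    }

    public static void main(String[] args) {
        // The mismatch from the log: block map records 141232, DataNode reports 2054413.
        System.out.println(markCorrupt("COMMITTED", 141232L, 2054413L)); // prints "true"
    }
}
```

This is why the race matters: the stale fsync leaves 141232 in the block map while every replica actually holds 2054413 bytes, so all replicas of a perfectly good block get flagged corrupt.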