[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392309#comment-16392309 ]
Zephyr Guo commented on HDFS-13243:
-----------------------------------

[~jojochuang], thanks for reviewing.

{quote}1. It seems to me the root of the problem is that the client would call fsync() with an incorrect length (shorter than what it is supposed to sync). If that's the case you should fix the client (DFSOutputStream), rather than the NameNode.{quote}

The client calls sync() with a *correct* length, but the sync request can be sent after close(). The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the following simplified code:

{code:java}
synchronized (this) {
  ... // code before sending the request
}
// The request is sent here, but it is NOT inside a synchronized block,
// so a concurrent close() can complete the file before this fsync RPC goes out.
if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
  try {
    dfsClient.namenode.fsync(src, fileId, dfsClient.clientName, lastBlockLength);
  } catch (IOException ioe) {
    // Deal with ioe
  }
}
synchronized (this) {
  ... // code after sending the request
}
{code}

{quote}2. Looking at the log, your minimal replication number is 2, rather than 1. That's very unusual. In my past experience a lot of weird behavior like this could arise when you have that kind of configuration.{quote}

I'm not sure that data reliability would be affected if minimal replication were set to 1. Do you have any experience with this?

{quote}3. And why is close() in the picture? IMHO you don't even need to close(). Suppose you block DataNode heartbeat, let the client keep the file open and then call sync(); the last block's state remains in COMMITTED. Would that cause the same behavior?{quote}

close() must be called after sync (see the root cause above). If you don't call close(), the last block's state can't change to COMMITTED, right?

{quote}4. Looking at the patch, I would like to ask you to stay away from using reflection.
You could refactor FSNamesystem and DFSOutputStream to return a new FSNamesystem/DFSOutputStream object and override them in the test code. That way, you don't need to introduce new configurations too. And it'll be much cleaner.{quote}

I don't understand this. There is no API for setting the FSNamesystem implementation in MiniDFSCluster. Could you give me a sample in another test case? I will rewrite this patch.

{quote}5. I don't understand the following code.{quote}

My fixed code is not the final version, because I don't know whether the current DFSOutputStream#flushOrSync() implementation is intentional. We send the fsync RPC without holding the lock on the client side, maybe for a performance benefit? If it is for a benefit, we should fix only the server side; if not, we need to fix both the server side and the client. On the server side, we could log a warning for a wrong length and throw an exception for an invalid state. Is this better than the current version?

> Get CorruptBlock because of calling close and sync in same time
> ---------------------------------------------------------------
>
>                 Key: HDFS-13243
>                 URL: https://issues.apache.org/jira/browse/HDFS-13243
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.2, 3.2.0
>            Reporter: Zephyr Guo
>            Assignee: Zephyr Guo
>            Priority: Critical
>             Fix For: 3.2.0
>
>         Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> An HDFS file might get broken because of corrupt block(s) produced by calling close and sync at the same time.
> When the close call is not successful, the UC block's state changes to COMMITTED, and if a sync request is then popped from the queue and processed, the sync operation changes the last block's length.
> After that, the DataNodes report all received blocks to the NameNode, which checks the block length of every COMMITTED block. The block length recorded in NameNode memory now differs from the one reported by the DataNode, and consequently the last block is marked as corrupt because of the inconsistent length.
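As an aside, the close/sync race can be modeled without any HDFS machinery. The sketch below is a toy model, not HDFS code (ToyStream and all its members are illustrative names): it shows why taking the closed-check and the length snapshot under the same lock keeps a late sync from rewinding a length that close() already committed.

```java
// Toy model of the flushOrSync race. "sync" must not publish a length
// after "close" has committed the final one. Illustrative names only,
// not actual HDFS classes.
class ToyStream {
    private boolean closed = false;
    private long length = 0;          // bytes written so far
    private long committedLength = -1; // what the "NameNode" has recorded

    synchronized void write(int n) {
        if (closed) throw new IllegalStateException("stream closed");
        length += n;
    }

    // close() commits the final length, like completeFile() moving the
    // last block to COMMITTED.
    synchronized void close() {
        closed = true;
        committedLength = length;
    }

    // Safe sync: the closed-check and the length snapshot happen under the
    // same lock as close(), so a sync racing with close() either runs fully
    // before it or is rejected -- it can never overwrite the committed
    // length with a stale, shorter value.
    synchronized long sync() {
        if (closed) throw new IllegalStateException("sync after close");
        committedLength = length;
        return committedLength;
    }

    synchronized long committed() { return committedLength; }
}
```

A sync that loses the race with close() now fails fast instead of silently shrinking the recorded block length, which mirrors the server-side suggestion above of throwing an exception for an invalid state rather than applying the stale fsync.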
>
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} for /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.0.0.220:50010 is added to blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.219:50010 by hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.218:50010 by hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
> 2018-03-05 04:05:40,162 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> {panel}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org