Much appreciated, Suresh. Let me know if I can provide any more information or if you'd like me to open a JIRA.
Dave

On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:

> Dave,
>
> Thanks for the detailed email. Sorry I did not completely read the
> details you sent earlier (I was on my phone). As you said, this is not
> the data-loss issue related to the HBase log and hsync. I think you are
> right; the rename operation itself might not have hit the disk. I think
> we should either ensure the metadata operation is synced on the datanode
> or handle the block being reported as blockBeingWritten. Let me spend
> some time debugging this issue.
>
> One surprising thing is that all the replicas were reported as
> blockBeingWritten.
>
> Regards,
> Suresh
>
>
> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <lat...@davelink.net> wrote:
>
>> (Removing the hbase list and adding the hdfs-dev list, as this is
>> pretty internal stuff.)
>>
>> Reading through the code a bit:
>>
>> FSDataOutputStream.close calls
>> DFSOutputStream.close calls
>> DFSOutputStream.closeInternal
>> - sets currentPacket.lastPacketInBlock = true
>> - then calls
>> DFSOutputStream.flushInternal
>> - enqueues the current packet
>> - waits for the ack
>>
>> BlockReceiver.run
>> - if (lastPacketInBlock && !receiver.finalized) calls
>> FSDataset.finalizeBlock calls
>> FSDataset.finalizeBlockInternal calls
>> FSVolume.addBlock calls
>> FSDir.addBlock calls
>> FSDir.addBlock
>> - renames the block from the "blocksBeingWritten" tmp dir to the
>> "current" dest dir
>>
>> This looks like what I would expect: a synchronous chain from a DFS
>> client closing a file to the block files being moved from
>> blocksBeingWritten to the current dir, so that once the file is closed
>> the block files are in the proper directory - even if the contents of
>> the file are still in the OS buffer rather than synced to disk. It's
>> only after this moving of blocks that NameNode.completeFile is called.
>> There are several conditions and loops in there, so without a greater
>> understanding of the code I'm not certain this chain is fully reliable
>> in all cases.
>>
>> Could it be the case that the rename operation itself is not synced and
>> that ext3 lost the fact that the block files were moved?
>> Or is there a bug in the close-file logic such that for some reason the
>> block files are not always moved into place when a file is closed?
>>
>> Thanks for your patience,
>> Dave
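>>
>> P.S. To make the first question concrete: my understanding is that on
>> ext3 a rename only survives a power failure once the affected directory
>> entries are forced to disk; a bare rename(2) can sit in the page cache
>> and journal and be lost. A crash-safe rename would look roughly like
>> the sketch below (a modern-Java illustration of the concept, not HDFS
>> code; the directory-sync trick is Linux-specific):
>>
>> import java.io.IOException;
>> import java.nio.channels.FileChannel;
>> import java.nio.file.Files;
>> import java.nio.file.Path;
>> import java.nio.file.StandardCopyOption;
>> import java.nio.file.StandardOpenOption;
>>
>> public class DurableRename {
>>     /** Rename src to dst, then force the directory entries to disk. */
>>     static void durableRename(Path src, Path dst) throws IOException {
>>         Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
>>         // Without these syncs, a crash can "rewind" the rename and the
>>         // file reappears under its old name after restart.
>>         fsyncDir(dst.getParent());
>>         fsyncDir(src.getParent());
>>     }
>>
>>     static void fsyncDir(Path dir) throws IOException {
>>         // Opening a directory read-only and calling force() syncs its
>>         // entries on Linux; it is not portable to every platform.
>>         try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
>>             ch.force(true);
>>         }
>>     }
>> }
>>
>> If FSDir.addBlock does only the rename with no directory sync, that
>> would explain the blocks "moving back" to blocksBeingWritten after the
>> crash.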
>>
>>
>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <lat...@davelink.net> wrote:
>>
>>> Thanks for the response, Suresh.
>>>
>>> I'm not sure that I understand the details properly. From my reading
>>> of HDFS-744, the hsync API would allow a client to make sure that at
>>> any point in time its writes so far have hit the disk. For example,
>>> HBase could apply an hsync after adding some edits to its WAL to
>>> ensure those edits are fully durable for a file which is still open.
>>>
>>> However, in this case the dfs file was closed and even renamed. Is it
>>> the case that even after a dfs file is closed and renamed, the data
>>> blocks would still not be synced and would still be stored by the
>>> datanode in "blocksBeingWritten" rather than in "current"? If that is
>>> the case, would it be better for the NameNode not to reject replicas
>>> that are in blocksBeingWritten, especially if it doesn't have any
>>> other replicas available?
>>>
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:
>>>
>>>> Yes, this is a known issue.
>>>>
>>>> The HDFS part of this was addressed in
>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is
>>>> not available in the 1.x releases. I think HBase does not use this
>>>> API yet.
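>>>>
>>>> (For reference, with that API a WAL write path would presumably look
>>>> something like the sketch below; the path and edit bytes are
>>>> illustrative, not actual HBase code:)
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>>
>>>> public class WalSyncSketch {
>>>>     public static void main(String[] args) throws Exception {
>>>>         FileSystem fs = FileSystem.get(new Configuration());
>>>>         FSDataOutputStream wal = fs.create(new Path("/hbase/.logs/wal.0001"));
>>>>         byte[] edits = new byte[]{ /* serialized WAL edits */ };
>>>>         wal.write(edits);
>>>>         wal.hflush(); // edits visible to readers, but possibly only in OS buffers
>>>>         wal.hsync();  // HDFS-744 (2.0.2-alpha+): force the edits to disk on the DNs
>>>>         wal.close();
>>>>     }
>>>> }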
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <lat...@davelink.net> wrote:
>>>>
>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes. On
>>>> > Saturday the data center we were in had a total power failure and
>>>> > the cluster went down hard. When we brought it back up, HDFS
>>>> > reported 4 files as CORRUPT. We recovered the data in question from
>>>> > our secondary datacenter, but I'm trying to understand what
>>>> > happened and whether this is a bug in HDFS that should be fixed.
>>>> >
>>>> > From what I can tell, the file was created and closed by the dfs
>>>> > client (hbase). Then HBase renamed it into a new directory and
>>>> > deleted some other files containing the same data. Then the cluster
>>>> > lost power. After the cluster was restarted, the datanodes reported
>>>> > in to the namenode, but the blocks for this file appeared as
>>>> > "blocks being written" - the namenode rejected them and the
>>>> > datanodes deleted the blocks. At that point there were no replicas
>>>> > for the blocks and the files were marked CORRUPT. The underlying
>>>> > file systems are ext3. Some questions that I would love to get
>>>> > answers for, if anyone with a deeper understanding of HDFS can
>>>> > chime in:
>>>> >
>>>> > - Is this a known scenario where data loss is expected? (I found
>>>> > HDFS-1539, but that seems different.)
>>>> > - When are blocks moved from blocksBeingWritten to current? Does
>>>> > that happen before a file close operation is acknowledged to an
>>>> > HDFS client?
>>>> > - Could it be that the DataNodes actually moved the blocks to
>>>> > current, but after the restart ext3 rewound state somehow (forgive
>>>> > my ignorance of underlying file system behavior)?
>>>> > - Is there any other explanation for how this can happen?
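>>>> >
>>>> > For clarity, the client-side sequence as I understand it boils down
>>>> > to the sketch below (paths taken from the logs that follow; the
>>>> > data bytes are a stand-in):
>>>> >
>>>> > import org.apache.hadoop.conf.Configuration;
>>>> > import org.apache.hadoop.fs.FSDataOutputStream;
>>>> > import org.apache.hadoop.fs.FileSystem;
>>>> > import org.apache.hadoop.fs.Path;
>>>> >
>>>> > public class CompactionMoveSketch {
>>>> >     public static void main(String[] args) throws Exception {
>>>> >         FileSystem fs = FileSystem.get(new Configuration()); // default fs assumed to be hdfs://hm3:9000
>>>> >         Path tmp = new Path("/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c");
>>>> >         Path dest = new Path("/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c");
>>>> >         FSDataOutputStream out = fs.create(tmp);
>>>> >         out.write(new byte[25418340]); // stand-in for the ~24.2 MB compacted store file
>>>> >         out.close();                   // NN logs completeFile; DNs finalize the block
>>>> >         fs.rename(tmp, dest);          // HBase moves it into place; power failed after this
>>>> >     }
>>>> > }
>>>> >
>>>> > Everything above returned successfully, which is why I'd expect the
>>>> > block files to already be in "current" rather than
>>>> > "blocksBeingWritten".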
>>>> >
>>>> > Here is a sequence of selected relevant log lines from the RS
>>>> > (HBase Region Server), NN (NameNode), and DN (DataNode - one
>>>> > example of the 3 in question). It includes everything that mentions
>>>> > the block in question in the NameNode log and in one DataNode log.
>>>> > Please let me know if there is more information that would be
>>>> > helpful.
>>>> >
>>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c with permission=rwxrwxrwx
>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. blk_1395839728632046111_357084589
>>>> > DN 2013-06-29 11:16:06,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /10.0.5.237:50010
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to blk_1395839728632046111_357084589 size 25418340
>>>> > DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
>>>> > DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_1395839728632046111_357084589 terminating
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c to hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7 file(s) in n of users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m
>>>> >
>>>> > ------- CRASH, RESTART ---------
>>>> >
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was rejected: Reported as block being written but is a block of closed file.
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.1:50010
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but was rejected: Reported as block being written but is a block of closed file.
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.5.237:50010
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but was rejected: Reported as block being written but is a block of closed file.
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.24:50010
>>>> >
>>>> > (Note: the clock on the server running the DN is wrong after the
>>>> > restart; I believe its timestamps are off by 6 hours.)
>>>> >
>>>> > DN 2013-06-29 06:07:22,877 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_1395839728632046111_357084589 file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for deletion
>>>> > DN 2013-06-29 06:07:24,952 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_1395839728632046111_357084589 at file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>> >
>>>> > Thanks,
>>>> > Dave
>>>>
>>>>
>>>> --
>>>> http://hortonworks.com/download/
>>>
>>>
>>
>
>
> --
> http://hortonworks.com/download/