Hi Uma,

Thanks for the pointer. Your case sounds very similar. The main differences I see are that in my case it happened on all 3 replicas, and that the power failure occurred merely seconds after the blocks were finalized. So I guess the question is whether HDFS can do anything to better recover from such situations. I'm also curious whether ext4 would be less susceptible than ext3.
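To make that concrete: my understanding is that on ext3 (and ext4) a rename is purely a directory-metadata update, so it only becomes durable once that metadata reaches the journal on disk. If power is lost a few seconds after the close, the rename itself can be rolled back even though the block data was written. Syncing the destination directory after the move is the kind of thing that would close that window. Below is a rough sketch of what I mean, in plain Java NIO with made-up paths and names - just an illustration, not the actual FSDir/FSDataset code:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class DurableRenameSketch {
  /**
   * Illustration only: move a finalized block file from blocksBeingWritten to
   * current, then fsync the destination directory so the rename itself
   * survives a power loss. Paths and names here are hypothetical.
   */
  static void finalizeBlockDurably(Path bbwBlockFile, Path currentDir) throws IOException {
    Path dest = currentDir.resolve(bbwBlockFile.getFileName());
    Files.move(bbwBlockFile, dest, StandardCopyOption.ATOMIC_MOVE);

    // The rename only touches directory metadata; force it to disk by
    // fsync'ing the directory. Opening a directory read-only and calling
    // force() works on Linux, though not on every platform.
    FileChannel dir = FileChannel.open(currentDir, StandardOpenOption.READ);
    try {
      dir.force(true);
    } finally {
      dir.close();
    }
  }
}

Without that directory sync, the block contents can be safely on disk while, after a crash and journal replay, the directory still says the file lives in blocksBeingWritten - which is exactly the state we saw.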
I will definitely look at enabling dfs.datanode.synconclose once we upgrade to a version of hdfs that has it. I would love to see some performance numbers if anyone has run them. Also appears that HBase is considering enabling it by default (cf. comments on HBase-5954). Dave On Tue, Jul 2, 2013 at 12:31 AM, Uma Maheswara Rao G <mahesw...@huawei.com>wrote: > Hi Dave, > > Looks like your analysis is correct. I have faced similar issue some time > back. > See the discussion link: > http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results > On sudden restarts, it can lost the OS filesystem edits. Similar thing > happened in our case, i.e, after restart blocks were moved back to > BeingWritten directory even though they were finalized. > After restart they were marked as corrupt. You could set > dfs.datanode.synconclose to true to avoid this sort of things, but that > will degrade performance. > > Regards, > Uma > > -----Original Message----- > From: ddlat...@gmail.com [mailto:ddlat...@gmail.com] On Behalf Of Dave > Latham > Sent: 01 July 2013 16:08 > To: hdfs-u...@hadoop.apache.org > Cc: hdfs-dev@hadoop.apache.org > Subject: Re: data loss after cluster wide power loss > > Much appreciated, Suresh. Let me know if I can provide any more > information or if you'd like me to open a JIRA. > > Dave > > > On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <sur...@hortonworks.com > >wrote: > > > Dave, > > > > Thanks for the detailed email. Sorry I did not read all the details > > you had sent earlier completely (on my phone). As you said, this is > > not related to data loss related to HBase log and hsync. I think you > > are right; the rename operation itself might not have hit the disk. I > > think we should either ensure metadata operation is synced on the > > datanode or handle it being reported as blockBeingWritten. Let me > > spend sometime to debug this issue. > > > > One surprising thing is, all the replicas were reported as > > blockBeingWritten. > > > > Regards, > > Suresh > > > > > > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <lat...@davelink.net> wrote: > > > >> (Removing hbase list and adding hdfs-dev list as this is pretty > >> internal stuff). > >> > >> Reading through the code a bit: > >> > >> FSDataOutputStream.close calls > >> DFSOutputStream.close calls > >> DFSOutputStream.closeInternal > >> - sets currentPacket.lastPacketInBlock = true > >> - then calls > >> DFSOutputStream.flushInternal > >> - enqueues current packet > >> - waits for ack > >> > >> BlockReceiver.run > >> - if (lastPacketInBlock && !receiver.finalized) calls > >> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls > >> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock > >> - renames block from "blocksBeingWritten" tmp dir to "current" dest > >> dir > >> > >> This looks to me as I would expect a synchronous chain from a DFS > >> client to moving the file from blocksBeingWritten to the current dir > >> so that once the file is closed that it the block files would be in > >> the proper directory > >> - even if the contents of the file are still in the OS buffer rather > >> than synced to disk. It's only after this moving of blocks that > >> NameNode.complete file is called. There are several conditions and > >> loops in there that I'm not certain this chain is fully reliable in > >> all cases without a greater understanding of the code. 
> >> > >> Could it be the case that the rename operation itself is not synced > >> and that ext3 lost the fact that the block files were moved? > >> Or is there a bug in the close file logic that for some reason the > >> block files are not always moved into place when a file is closed? > >> > >> Thanks for your patience, > >> Dave > >> > >> > >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <lat...@davelink.net> > wrote: > >> > >>> Thanks for the response, Suresh. > >>> > >>> I'm not sure that I understand the details properly. From my > >>> reading of > >>> HDFS-744 the hsync API would allow a client to make sure that at any > >>> point in time it's writes so far hit the disk. For example, for > >>> HBase it could apply a fsync after adding some edits to its WAL to > >>> ensure those edits are fully durable for a file which is still open. > >>> > >>> However, in this case the dfs file was closed and even renamed. Is > >>> it the case that even after a dfs file is closed and renamed that > >>> the data blocks would still not be synced and would still be stored > >>> by the datanode in "blocksBeingWritten" rather than in "current"? > >>> If that is case, would it be better for the NameNode not to reject > >>> replicas that are in blocksBeingWritten, especially if it doesn't > >>> have any other replicas available? > >>> > >>> Dave > >>> > >>> > >>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas < > sur...@hortonworks.com>wrote: > >>> > >>>> Yes this is a known issue. > >>>> > >>>> The HDFS part of this was addressed in > >>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and > >>>> is not available in 1.x release. I think HBase does not use this > >>>> API yet. > >>>> > >>>> > >>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <lat...@davelink.net> > >>>> wrote: > >>>> > >>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes. On > >>>> > Saturday > >>>> the > >>>> > data center we were in had a total power failure and the cluster > >>>> > went > >>>> down > >>>> > hard. When we brought it back up, HDFS reported 4 files as CORRUPT. > >>>> We > >>>> > recovered the data in question from our secondary datacenter, but > >>>> > I'm trying to understand what happened and whether this is a bug > >>>> > in HDFS > >>>> that > >>>> > should be fixed. > >>>> > > >>>> > From what I can tell the file was created and closed by the dfs > >>>> > client (hbase). Then HBase renamed it into a new directory and > >>>> > deleted some > >>>> other > >>>> > files containing the same data. Then the cluster lost power. > >>>> > After > >>>> the > >>>> > cluster was restarted, the datanodes reported into the namenode > >>>> > but > >>>> the > >>>> > blocks for this file appeared as "blocks being written" - the > >>>> > namenode rejected them and the datanodes deleted the blocks. At > >>>> > this point > >>>> there > >>>> > were no replicas for the blocks and the files were marked CORRUPT. > >>>> The > >>>> > underlying file systems are ext3. Some questions that I would > >>>> > love > >>>> get > >>>> > answers for if anyone with deeper understanding of HDFS can chime > in: > >>>> > > >>>> > - Is this a known scenario where data loss is expected? (I > >>>> > found > >>>> > HDFS-1539 but that seems different) > >>>> > - When are blocks moved from blocksBeingWritten to current? > >>>> > Does > >>>> that > >>>> > happen before a file close operation is acknowledged to a hdfs > client? 
> >>>> > - Could it be that the DataNodes actually moved the blocks to > >>>> current but > >>>> > after the restart ext3 rewound state somehow (forgive my > >>>> > ignorance of underlying file system behavior)? > >>>> > - Is there any other explanation for how this can happen? > >>>> > > >>>> > Here is a sequence of selected relevant log lines from the RS > >>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example > >>>> > of 3 in question). It includes everything that mentions the > >>>> > block in > >>>> question in > >>>> > the NameNode and one DataNode log. Please let me know if this > >>>> > more information that would be helpful. > >>>> > > >>>> > RS 2013-06-29 11:16:06,812 DEBUG > org.apache.hadoop.hbase.util.FSUtils: > >>>> > Creating > >>>> > > >>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e > >>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > >>>> > with permission=rwxrwxrwx > >>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.allocateBlock: > >>>> > > >>>> > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > >>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO > >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > >>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: / > >>>> > 10.0.5.237:50010 > >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is > >>>> > added > >>>> to > >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 > >>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is > >>>> added to > >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 > >>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is > >>>> added to > >>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29 > >>>> > 11:16:11,385 INFO > >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > >>>> > blk_1395839728632046111_357084589 of size 25418340 from / > >>>> 10.0.5.237:14327 > >>>> > >>>> > DN 2013-06-29 11:16:11,385 INFO > >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder > >>>> > 2 for block blk_1395839728632046111_357084589 terminating NN > >>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: > >>>> > Removing lease on file > >>>> > > >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e > >>>> 56ba5a539fdf159c4c > >>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932 > >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: > >>>> DIR* > >>>> > NameSystem.completeFile: file > >>>> > > >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e > >>>> 56ba5a539fdf159c4c > >>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > >>>> > RS 2013-06-29 11:16:11,393 INFO > >>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted > >>>> > file at > >>>> > > >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp > >>>> /6e0cc30af6e64e56ba5a539fdf159c4c > >>>> > to > >>>> > > >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e > >>>> 0cc30af6e64e56ba5a539fdf159c4c > >>>> > RS 2013-06-29 11:16:11,505 INFO > >>>> > 
org.apache.hadoop.hbase.regionserver.Store: Completed major > >>>> compaction of 7 > >>>> > file(s) in n of > >>>> > > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. > >>>> into > >>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for > >>>> > store is > >>>> 24.2m > >>>> > > >>>> > ------- CRASH, RESTART --------- > >>>> > > >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.addStoredBlock: addStoredBlock request received for > >>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 > >>>> but was > >>>> > rejected: Reported as block being written but is a block of > >>>> > closed > >>>> file. > >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to > >>>> invalidSet > >>>> > of 10.0.6.1:50010 > >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.addStoredBlock: addStoredBlock request received for > >>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size > >>>> > 16971264 > >>>> but > >>>> > was rejected: Reported as block being written but is a block of > >>>> closed file. > >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to > >>>> invalidSet > >>>> > of 10.0.5.237:50010 > >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.addStoredBlock: addStoredBlock request received for > >>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size > >>>> > 21913088 > >>>> but > >>>> > was rejected: Reported as block being written but is a block of > >>>> closed file. > >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: > >>>> BLOCK* > >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to > >>>> invalidSet > >>>> > of 10.0.6.24:50010 > >>>> > (note the clock on the server running DN is wrong after restart. > >>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29 > >>>> > 06:07:22,877 INFO > >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block > >>>> > blk_1395839728632046111_357084589 file > >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 > >>>> > for deletion DN 2013-06-29 06:07:24,952 INFO > >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block > >>>> > blk_1395839728632046111_357084589 at file > >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 > >>>> > > >>>> > > >>>> > Thanks, > >>>> > Dave > >>>> > > >>>> > >>>> > >>>> > >>>> -- > >>>> http://hortonworks.com/download/ > >>>> > >>> > >>> > >> > > > > > > -- > > http://hortonworks.com/download/ > > >
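P.S. In case it's useful to anyone else hitting this: once we're on a version that has it, enabling the setting Uma mentioned should just be a DataNode-side hdfs-site.xml entry like the sketch below (plus a DataNode restart), with the write-latency cost he warned about. Treat it as a sketch of intent rather than a tested recipe:

<!-- Hypothetical hdfs-site.xml fragment for the DataNodes: sync block data
     to disk when a block is closed/finalized, at some cost in write latency. -->
<property>
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
</property>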