Much appreciated, Suresh. Let me know if I can provide any more information or if you'd like me to open a JIRA.
Dave

On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:

> Dave,
>
> Thanks for the detailed email. Sorry I did not completely read the
> details you sent earlier (I was on my phone). As you said, this is not
> the data-loss issue related to the HBase log and hsync. I think you are
> right; the rename operation itself might not have hit the disk. I think
> we should either ensure the metadata operation is synced on the datanode
> or handle the block being reported as blockBeingWritten. Let me spend
> some time debugging this issue.
>
> One surprising thing is that all the replicas were reported as
> blockBeingWritten.
>
> Regards,
> Suresh
>
>
> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <lat...@davelink.net> wrote:
>
>> (Removing the hbase list and adding the hdfs-dev list, as this is
>> pretty internal stuff.)
>>
>> Reading through the code a bit:
>>
>> FSDataOutputStream.close calls
>> DFSOutputStream.close calls
>> DFSOutputStream.closeInternal
>> - sets currentPacket.lastPacketInBlock = true
>> - then calls
>> DFSOutputStream.flushInternal
>> - enqueues the current packet
>> - waits for the ack
>>
>> BlockReceiver.run
>> - if (lastPacketInBlock && !receiver.finalized) calls
>> FSDataset.finalizeBlock calls
>> FSDataset.finalizeBlockInternal calls
>> FSVolume.addBlock calls
>> FSDir.addBlock calls
>> FSDir.addBlock
>> - renames the block from the "blocksBeingWritten" tmp dir to the
>> "current" dest dir
>>
>> This looks like what I would expect: a synchronous chain from a DFS
>> client closing a file to the block files being moved from
>> blocksBeingWritten to the current dir, so that once the file is closed
>> the block files are in the proper directory - even if the contents of
>> the file are still in the OS buffer rather than synced to disk. It's
>> only after this moving of blocks that NameNode.completeFile is called.
>> There are several conditions and loops in there, so without a greater
>> understanding of the code I'm not certain this chain is fully reliable
>> in all cases.
>>
>> Could it be the case that the rename operation itself is not synced and
>> that ext3 lost the fact that the block files were moved?
>> Or is there a bug in the close-file logic such that for some reason the
>> block files are not always moved into place when a file is closed?
>>
>> Thanks for your patience,
>> Dave
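>>
>> P.S. To make the first question concrete: my understanding is that on
>> ext3 a rename only survives a power failure once the affected directory
>> entries are forced to disk; a bare rename(2) can sit in the page cache
>> and journal and be lost. A crash-safe rename would look roughly like
>> the sketch below (a modern-Java illustration of the concept, not HDFS
>> code; the directory-sync trick is Linux-specific):
>>
>> import java.io.IOException;
>> import java.nio.channels.FileChannel;
>> import java.nio.file.Files;
>> import java.nio.file.Path;
>> import java.nio.file.StandardCopyOption;
>> import java.nio.file.StandardOpenOption;
>>
>> public class DurableRename {
>>     /** Rename src to dst, then force the directory entries to disk. */
>>     static void durableRename(Path src, Path dst) throws IOException {
>>         Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
>>         // Without these syncs, a crash can "rewind" the rename and the
>>         // file reappears under its old name after restart.
>>         fsyncDir(dst.getParent());
>>         fsyncDir(src.getParent());
>>     }
>>
>>     static void fsyncDir(Path dir) throws IOException {
>>         // Opening a directory read-only and calling force() syncs its
>>         // entries on Linux; it is not portable to every platform.
>>         try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
>>             ch.force(true);
>>         }
>>     }
>> }
>>
>> If FSDir.addBlock does only the rename with no directory sync, that
>> would explain the blocks "moving back" to blocksBeingWritten after the
>> crash.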
>>
>>
>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <lat...@davelink.net> wrote:
>>
>>> Thanks for the response, Suresh.
>>>
>>> I'm not sure that I understand the details properly. From my reading
>>> of HDFS-744, the hsync API would allow a client to make sure that at
>>> any point in time its writes so far have hit the disk. For example,
>>> HBase could apply an hsync after adding some edits to its WAL to
>>> ensure those edits are fully durable for a file which is still open.
>>>
>>> However, in this case the dfs file was closed and even renamed. Is it
>>> the case that even after a dfs file is closed and renamed, the data
>>> blocks would still not be synced and would still be stored by the
>>> datanode in "blocksBeingWritten" rather than in "current"? If that is
>>> the case, would it be better for the NameNode not to reject replicas
>>> that are in blocksBeingWritten, especially if it doesn't have any
>>> other replicas available?
>>>
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:
>>>
>>>> Yes, this is a known issue.
>>>>
>>>> The HDFS part of this was addressed in
>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is
>>>> not available in the 1.x releases. I think HBase does not use this
>>>> API yet.
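>>>>
>>>> (For reference, with that API a WAL write path would presumably look
>>>> something like the sketch below; the path and edit bytes are
>>>> illustrative, not actual HBase code:)
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>>
>>>> public class WalSyncSketch {
>>>>     public static void main(String[] args) throws Exception {
>>>>         FileSystem fs = FileSystem.get(new Configuration());
>>>>         FSDataOutputStream wal = fs.create(new Path("/hbase/.logs/wal.0001"));
>>>>         byte[] edits = new byte[]{ /* serialized WAL edits */ };
>>>>         wal.write(edits);
>>>>         wal.hflush(); // edits visible to readers, but possibly only in OS buffers
>>>>         wal.hsync();  // HDFS-744 (2.0.2-alpha+): force the edits to disk on the DNs
>>>>         wal.close();
>>>>     }
>>>> }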
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <lat...@davelink.net> wrote:
>>>>
>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes. On
>>>> > Saturday the data center we were in had a total power failure and
>>>> > the cluster went down hard. When we brought it back up, HDFS
>>>> > reported 4 files as CORRUPT. We recovered the data in question from
>>>> > our secondary datacenter, but I'm trying to understand what
>>>> > happened and whether this is a bug in HDFS that should be fixed.
>>>> >
>>>> > From what I can tell, the file was created and closed by the dfs
>>>> > client (hbase). Then HBase renamed it into a new directory and
>>>> > deleted some other files containing the same data. Then the cluster
>>>> > lost power. After the cluster was restarted, the datanodes reported
>>>> > in to the namenode, but the blocks for this file appeared as
>>>> > "blocks being written" - the namenode rejected them and the
>>>> > datanodes deleted the blocks. At that point there were no replicas
>>>> > for the blocks and the files were marked CORRUPT. The underlying
>>>> > file systems are ext3. Some questions that I would love to get
>>>> > answers for, if anyone with a deeper understanding of HDFS can
>>>> > chime in:
>>>> >
>>>> > - Is this a known scenario where data loss is expected? (I found
>>>> > HDFS-1539, but that seems different.)
>>>> > - When are blocks moved from blocksBeingWritten to current? Does
>>>> > that happen before a file close operation is acknowledged to an
>>>> > HDFS client?
>>>> > - Could it be that the DataNodes actually moved the blocks to
>>>> > current, but after the restart ext3 rewound state somehow (forgive
>>>> > my ignorance of underlying file system behavior)?
>>>> > - Is there any other explanation for how this can happen?
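>>>> >
>>>> > For clarity, the client-side sequence as I understand it boils down
>>>> > to the sketch below (paths taken from the logs that follow; the
>>>> > data bytes are a stand-in):
>>>> >
>>>> > import org.apache.hadoop.conf.Configuration;
>>>> > import org.apache.hadoop.fs.FSDataOutputStream;
>>>> > import org.apache.hadoop.fs.FileSystem;
>>>> > import org.apache.hadoop.fs.Path;
>>>> >
>>>> > public class CompactionMoveSketch {
>>>> >     public static void main(String[] args) throws Exception {
>>>> >         FileSystem fs = FileSystem.get(new Configuration()); // default fs assumed to be hdfs://hm3:9000
>>>> >         Path tmp = new Path("/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c");
>>>> >         Path dest = new Path("/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c");
>>>> >         FSDataOutputStream out = fs.create(tmp);
>>>> >         out.write(new byte[25418340]); // stand-in for the ~24.2 MB compacted store file
>>>> >         out.close();                   // NN logs completeFile; DNs finalize the block
>>>> >         fs.rename(tmp, dest);          // HBase moves it into place; power failed after this
>>>> >     }
>>>> > }
>>>> >
>>>> > Everything above returned successfully, which is why I'd expect the
>>>> > block files to already be in "current" rather than
>>>> > "blocksBeingWritten".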
>>>> >
>>>> > Here is a sequence of selected relevant log lines from the RS
>>>> > (HBase Region Server), NN (NameNode), and DN (DataNode - one
>>>> > example of the 3 in question). It includes everything that mentions
>>>> > the block in question in the NameNode log and in one DataNode log.
>>>> > Please let me know if there is more information that would be
>>>> > helpful.
>>>> >
>>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c with permission=rwxrwxrwx
>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. blk_1395839728632046111_357084589
>>>> > DN 2013-06-29 11:16:06,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /10.0.5.237:50010
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to blk_1395839728632046111_357084589 size 25418340
>>>> > DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
>>>> > DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_1395839728632046111_357084589 terminating
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c to hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7 file(s) in n of users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m
>>>> >
>>>> > ------- CRASH, RESTART ---------
>>>> >
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was rejected: Reported as block being written but is a block of closed file.
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.1:50010
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but was rejected: Reported as block being written but is a block of closed file.
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.5.237:50010
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but was rejected: Reported as block being written but is a block of closed file.
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.24:50010
>>>> >
>>>> > (Note: the clock on the server running the DN is wrong after the
>>>> > restart; I believe its timestamps are off by 6 hours.)
>>>> >
>>>> > DN 2013-06-29 06:07:22,877 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_1395839728632046111_357084589 file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for deletion
>>>> > DN 2013-06-29 06:07:24,952 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_1395839728632046111_357084589 at file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>> >
>>>> > Thanks,
>>>> > Dave
>>>>
>>>>
>>>> --
>>>> http://hortonworks.com/download/
>>>
>>>
>>
>
>
> --
> http://hortonworks.com/download/