Hi Uma,

Thanks for the pointer.  Your case sounds very similar.  The main
differences that I see are that in my case it happened on all 3 replicas
and the power failure occurred mere seconds after the blocks were
finalized.  So I guess the question is whether HDFS can do anything to
better recover from such situations.  I'm also curious whether ext4 would
be less susceptible than ext3.

I will definitely look at enabling dfs.datanode.synconclose once we upgrade
to a version of HDFS that has it.  I would love to see some performance
numbers if anyone has run them.  It also appears that HBase is considering
enabling it by default (cf. the comments on HBASE-5954).
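
In case it is useful, here is my (possibly wrong) mental model of what the
flag buys us - a minimal, untested sketch using plain NIO with a made-up
block path, not the actual DataNode code:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class SyncOnCloseSketch {
      // With dfs.datanode.synconclose enabled, the DataNode would force a
      // finalized block file (and, as I understand it, its .meta file) to
      // stable storage before acknowledging the close, so a power loss
      // moments later cannot throw the contents away with the OS page cache.
      static void syncBlockFile(Path blockFile) throws IOException {
        try (FileChannel ch = FileChannel.open(blockFile,
            StandardOpenOption.WRITE)) {
          ch.force(true);  // fsync: flush data and file metadata to disk
        }
      }

      public static void main(String[] args) throws IOException {
        // Hypothetical block path, for illustration only.
        syncBlockFile(Paths.get("/data/hadoop/dfs/data/current/blk_example"));
      }
    }

The performance question is exactly how much that extra fsync per finalized
block costs under a compaction-heavy write load.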

Dave


On Tue, Jul 2, 2013 at 12:31 AM, Uma Maheswara Rao G <mahesw...@huawei.com> wrote:

> Hi Dave,
>
>  Looks like your analysis is correct.  I have faced a similar issue some
> time back.
> See the discussion link:
> http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, it can lose the OS filesystem edits.  A similar thing
> happened in our case, i.e., after the restart the blocks were moved back to
> the blocksBeingWritten directory even though they had been finalized.
> After the restart they were marked as corrupt.  You could set
> dfs.datanode.synconclose to true to avoid this sort of thing, but that
> will degrade performance.
>
> Regards,
> Uma
>
> -----Original Message-----
> From: ddlat...@gmail.com [mailto:ddlat...@gmail.com] On Behalf Of Dave Latham
> Sent: 01 July 2013 16:08
> To: hdfs-u...@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: data loss after cluster wide power loss
>
> Much appreciated, Suresh.  Let me know if I can provide any more
> information or if you'd like me to open a JIRA.
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:
>
> > Dave,
> >
> > Thanks for the detailed email. Sorry I did not completely read all the
> > details you had sent earlier (on my phone). As you said, this is not
> > related to the HBase log / hsync data loss issue. I think you are
> > right; the rename operation itself might not have hit the disk. I
> > think we should either ensure the metadata operation is synced on the
> > datanode or handle it being reported as blockBeingWritten. Let me
> > spend some time debugging this issue.
> >
> > One surprising thing is that all the replicas were reported as
> > blockBeingWritten.
> >
> > Regards,
> > Suresh
> >
> >
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <lat...@davelink.net> wrote:
> >
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >>
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls DFSOutputStream.flushInternal
> >>     - enqueues the current packet
> >>     - waits for the ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >>    FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls
> >>    FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
> >>     - renames the block from the "blocksBeingWritten" tmp dir to the
> >>       "current" dest dir
> >>
> >> This looks to me like the synchronous chain I would expect, from the
> >> DFS client close all the way to moving the block files from
> >> blocksBeingWritten to the current dir, so that once the file is closed
> >> the block files should be in the proper directory - even if the
> >> contents of the files are still in the OS buffer rather than synced to
> >> disk.  It's only after this moving of blocks that the NameNode
> >> completeFile call happens.  There are enough conditions and loops in
> >> there that, without a greater understanding of the code, I'm not
> >> certain this chain is fully reliable in all cases.
> >>
> >> Could it be that the rename operation itself was not synced and that
> >> ext3 lost the fact that the block files had been moved?
> >> Or is there a bug in the file close logic such that the block files
> >> are not always moved into place when a file is closed?
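> >>
> >> If the rename itself turns out to be the weak link, one possible (and
> >> untested) hardening would be to fsync the destination directory after
> >> the move - a rough sketch in plain NIO with made-up names, not the
> >> actual FSDir code:
> >>
> >>     import java.io.IOException;
> >>     import java.nio.channels.FileChannel;
> >>     import java.nio.file.Files;
> >>     import java.nio.file.Path;
> >>     import java.nio.file.StandardCopyOption;
> >>     import java.nio.file.StandardOpenOption;
> >>
> >>     public class DurableRename {
> >>       // Move a finalized block file into place, then force the parent
> >>       // directory entry to disk so a power failure cannot roll the
> >>       // rename back to the old location.
> >>       static void renameAndSyncDir(Path src, Path dst) throws IOException {
> >>         Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
> >>         // On Linux, opening the directory read-only and calling force()
> >>         // issues an fsync on the directory itself.
> >>         try (FileChannel dir = FileChannel.open(dst.getParent(),
> >>             StandardOpenOption.READ)) {
> >>           dir.force(true);
> >>         }
> >>       }
> >>     }
> >>
> >> That would add a directory fsync per finalized block, so it would need
> >> the same kind of performance scrutiny as dfs.datanode.synconclose.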
> >>
> >> Thanks for your patience,
> >> Dave
> >>
> >>
> >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <lat...@davelink.net> wrote:
> >>
> >>> Thanks for the response, Suresh.
> >>>
> >>> I'm not sure that I understand the details properly.  From my reading
> >>> of HDFS-744, the hsync API would allow a client to make sure that, at
> >>> any point in time, its writes so far have hit the disk.  For example,
> >>> HBase could apply an fsync after adding some edits to its WAL to
> >>> ensure those edits are fully durable for a file which is still open.
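> >>>
> >>> For reference, here is a sketch of how I imagine a WAL writer would
> >>> use that API on a 2.x client - untested, with made-up paths and
> >>> payload, just to illustrate hsync():
> >>>
> >>>     import org.apache.hadoop.conf.Configuration;
> >>>     import org.apache.hadoop.fs.FSDataOutputStream;
> >>>     import org.apache.hadoop.fs.FileSystem;
> >>>     import org.apache.hadoop.fs.Path;
> >>>
> >>>     public class WalHsyncSketch {
> >>>       public static void main(String[] args) throws Exception {
> >>>         FileSystem fs = FileSystem.get(new Configuration());
> >>>         FSDataOutputStream out = fs.create(new Path("/tmp/wal-example"));
> >>>         out.write("edit-1\n".getBytes("UTF-8"));
> >>>         // hsync() asks the DataNodes in the pipeline to push the bytes
> >>>         // to disk, so the edits should survive a power failure even
> >>>         // though the file is still open.  hflush() only guarantees
> >>>         // visibility to new readers, not durability.
> >>>         out.hsync();
> >>>         out.close();
> >>>       }
> >>>     }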
> >>>
> >>> However, in this case the DFS file was closed and even renamed.  Is
> >>> it the case that, even after a DFS file is closed and renamed, the
> >>> data blocks would still not be synced and would still be stored by
> >>> the datanode in "blocksBeingWritten" rather than in "current"?
> >>> If that is the case, would it be better for the NameNode not to
> >>> reject replicas that are in blocksBeingWritten, especially if it
> >>> doesn't have any other replicas available?
> >>>
> >>> Dave
> >>>
> >>>
> >>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:
> >>>
> >>>> Yes, this is a known issue.
> >>>>
> >>>> The HDFS part of this was addressed in
> >>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> >>>> is not available in the 1.x releases.  I think HBase does not use
> >>>> this API yet.
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <lat...@davelink.net> wrote:
> >>>>
> >>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>> > Saturday the data center we were in had a total power failure and
> >>>> > the cluster went down hard.  When we brought it back up, HDFS
> >>>> > reported 4 files as CORRUPT.  We recovered the data in question
> >>>> > from our secondary datacenter, but I'm trying to understand what
> >>>> > happened and whether this is a bug in HDFS that should be fixed.
> >>>> >
> >>>> > From what I can tell, the file was created and closed by the DFS
> >>>> > client (HBase).  Then HBase renamed it into a new directory and
> >>>> > deleted some other files containing the same data.  Then the
> >>>> > cluster lost power.  After the cluster was restarted, the
> >>>> > datanodes reported in to the namenode, but the blocks for this
> >>>> > file appeared as "blocks being written" - the namenode rejected
> >>>> > them and the datanodes deleted the blocks.  At that point there
> >>>> > were no replicas left for the blocks and the files were marked
> >>>> > CORRUPT.  The underlying file systems are ext3.  Some questions
> >>>> > that I would love to get answers to, if anyone with a deeper
> >>>> > understanding of HDFS can chime in:
> >>>> >
> >>>> >  - Is this a known scenario where data loss is expected?  (I found
> >>>> >    HDFS-1539, but that seems different.)
> >>>> >  - When are blocks moved from blocksBeingWritten to current?  Does
> >>>> >    that happen before a file close operation is acknowledged to an
> >>>> >    HDFS client?
> >>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>> >    current, but after the restart ext3 rewound that state somehow
> >>>> >    (forgive my ignorance of underlying file system behavior)?
> >>>> >  - Is there any other explanation for how this can happen?
> >>>> >
> >>>> > Here is a sequence of selected relevant log lines from the RS
> >>>> > (HBase Region Server), NN (NameNode), and DN (DataNode - 1 example
> >>>> > of the 3 in question).  It includes everything that mentions the
> >>>> > block in question in the NameNode and one DataNode log.  Please
> >>>> > let me know if there is more information that would be helpful.
> >>>> >
> >>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c with permission=rwxrwxrwx
> >>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. blk_1395839728632046111_357084589
> >>>> > DN 2013-06-29 11:16:06,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /10.0.5.237:50010
> >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to blk_1395839728632046111_357084589 size 25418340
> >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to blk_1395839728632046111_357084589 size 25418340
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to blk_1395839728632046111_357084589 size 25418340
> >>>> > DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
> >>>> > DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_1395839728632046111_357084589 terminating
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on  file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c to hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7 file(s) in n of users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m
> >>>> >
> >>>> > -------  CRASH, RESTART ---------
> >>>> >
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was rejected: Reported as block being written but is a block of closed file.
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.1:50010
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but was rejected: Reported as block being written but is a block of closed file.
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.5.237:50010
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but was rejected: Reported as block being written but is a block of closed file.
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.24:50010
> >>>> >
> >>>> > (note the clock on the server running DN is wrong after restart.  I believe timestamps are off by 6 hours:)
> >>>> >
> >>>> > DN 2013-06-29 06:07:22,877 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_1395839728632046111_357084589 file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for deletion
> >>>> > DN 2013-06-29 06:07:24,952 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_1395839728632046111_357084589 at file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > Dave
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://hortonworks.com/download/
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > http://hortonworks.com/download/
> >
>
