The problem you describe occurs with NFS also.

Basically, single-site semantics are very hard to achieve on a networked
file system.


On Mon, Feb 14, 2011 at 8:21 PM, Gokulakannan M <gok...@huawei.com> wrote:

> I agree that HDFS doesn't strictly follow POSIX semantics, but it would
> have been better if this issue were fixed.
>
>
>
>  _____
>
> From: Ted Dunning [mailto:tdunn...@maprtech.com]
> Sent: Monday, February 14, 2011 10:18 PM
> To: gok...@huawei.com
> Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
> dhr...@gmail.com
> Subject: Re: hadoop 0.20 append - some clarifications
>
>
>
> HDFS definitely doesn't follow anything like POSIX file semantics.
>
>
>
> They may be a vague inspiration for what HDFS does, but generally the
> behavior of HDFS is not tightly specified.  Even the unit tests have some
> really surprising behavior.
>
> On Mon, Feb 14, 2011 at 7:21 AM, Gokulakannan M <gok...@huawei.com> wrote:
>
>
>
> >> I think that in general, the behavior of any program reading data from an
> HDFS file before hsync or close is called is pretty much undefined.
>
>
>
> In Unix, users can read a file in parallel while another user is writing
> it, and I suppose the sync feature design is based on that.
>
> So at any point in time during the file write, parallel users should be
> able to read the file.
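>
> As an illustration (not from the original mail), here is a minimal
> local-filesystem sketch in plain Java; the path and content are placeholders,
> and the only point is that a reader sees bytes as soon as the writer's
> write() returns, with no explicit flush or fsync:
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
>
> public class PosixParallelRead {
>     public static void main(String[] args) throws Exception {
>         File f = new File("/tmp/parallel-read-demo.txt");   // hypothetical path
>         FileOutputStream out = new FileOutputStream(f);
>         out.write("partial data".getBytes("UTF-8"));        // no flush or fsync yet
>
>         // A reader (another process behaves the same way on a local POSIX
>         // filesystem) already sees the bytes written so far.
>         FileInputStream in = new FileInputStream(f);
>         byte[] buf = new byte[64];
>         int n = in.read(buf);
>         System.out.println("reader sees " + n + " bytes");
>
>         in.close();
>         out.close();
>     }
> }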
>
>
>
> https://issues.apache.org/jira/browse/HDFS-142?focusedCommentId=12663958&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12663958
>
>  _____
>
> From: Ted Dunning [mailto:tdunn...@maprtech.com]
> Sent: Friday, February 11, 2011 2:14 PM
> To: common-user@hadoop.apache.org; gok...@huawei.com
> Cc: hdfs-u...@hadoop.apache.org; dhr...@gmail.com
> Subject: Re: hadoop 0.20 append - some clarifications
>
>
>
> I think that in general, the behavior of any program reading data from an
> HDFS file before hsync or close is called is pretty much undefined.
>
>
>
> If you don't wait until some point where part of the file is defined, you
> can't expect any particular behavior.
>
> On Fri, Feb 11, 2011 at 12:31 AM, Gokulakannan M <gok...@huawei.com>
> wrote:
>
> I am not concerned about the sync behavior.
>
> My concern is the reader being able to read non-flushed (non-synced) data
> from HDFS, as you explained in your previous post (in the hadoop 0.20 append
> branch).
>
> I identified one specific scenario where the above statement does not hold
> true.
>
> Here is how you can reproduce the problem:
>
> 1. Add a breakpoint at the createBlockOutputStream() method in DFSClient and
> run your HDFS write client in debug mode.
>
> 2. Allow the client to write 1 block to HDFS.
>
> 3. For the 2nd block, execution will reach the breakpoint from step 1 (do not
> let createBlockOutputStream() execute). Hold here.
>
> 4. In parallel, try to read the file from another client (a sketch of such a
> reader follows below).
>
> Now you will get an error saying that the file cannot be read.
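>
> For reference, a rough sketch of the reader used in step 4; the cluster URI
> and file path are placeholders, and only the failure mode it illustrates
> comes from the steps above:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class ParallelReader {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         conf.set("fs.default.name", "hdfs://namenode:9000");   // placeholder URI
>         FileSystem fs = FileSystem.get(conf);
>
>         // While the writer is held at the createBlockOutputStream() breakpoint
>         // (steps 1-3), this read fails: the NameNode has already allocated the
>         // 2nd block, but no DataNode knows about it yet.
>         FSDataInputStream in = fs.open(new Path("/user/test/file-under-write"));
>         byte[] buf = new byte[4096];
>         int n;
>         while ((n = in.read(buf)) != -1) {
>             System.out.println("read " + n + " bytes");
>         }
>         in.close();
>         fs.close();
>     }
> }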
>
>
>
>  _____
>
> From: Ted Dunning [mailto:tdunn...@maprtech.com]
> Sent: Friday, February 11, 2011 11:04 AM
> To: gok...@huawei.com
> Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
> c...@boudnik.org
> Subject: Re: hadoop 0.20 append - some clarifications
>
>
>
> It is a bit confusing.
>
>
>
> SequenceFile.Writer#sync isn't really sync.
>
>
>
> There is SequenceFile.Writer#syncFs, which is closer to what you might
> expect sync to be.
>
>
>
> Then there is HADOOP-6313 which specifies hflush and hsync.  Generally, if
> you want portable code, you have to reflect a bit to figure out what can be
> done.
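>
> A rough sketch of that reflection approach, for what it's worth: probe for
> hsync/hflush (HADOOP-6313), then syncFs, and call the first method the given
> writer object actually has. The method names are the ones discussed in this
> thread; the helper itself is only an illustration, not an official API:
>
> import java.lang.reflect.Method;
>
> public final class PortableSync {
>     public static void syncBestEffort(Object writer) throws Exception {
>         String[] candidates = {"hsync", "hflush", "syncFs"};
>         for (String name : candidates) {
>             try {
>                 Method m = writer.getClass().getMethod(name);
>                 m.invoke(writer);
>                 return;                 // first available durability call wins
>             } catch (NoSuchMethodException e) {
>                 // not present in this Hadoop version; try the next candidate
>             }
>         }
>         throw new UnsupportedOperationException(
>                 "no hsync/hflush/syncFs on " + writer.getClass().getName());
>     }
> }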
>
> On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M <gok...@huawei.com> wrote:
>
> Thanks Ted for clarifying.
>
> So sync just flushes the current buffers to the datanode and persists the
> block info in the namenode once per block, doesn't it?
>
>
>
> Regarding the reader being able to see unflushed data, I faced an issue in
> the following scenario:
>
> 1. A writer is writing a 10 MB file (block size 2 MB).
>
> 2. The writer has written up to 4 MB (2 finalized blocks in the current
> directory and nothing in the blocksBeingWritten directory on the DN), so 2
> blocks are written.
>
> 3. The client calls addBlock for the 3rd block on the namenode but has not
> yet created the output stream to the DN (or written anything to the DN). At
> this point, the namenode knows about the 3rd block but the datanode doesn't.
>
> 4. At point 3, a reader trying to read the file gets an exception and cannot
> read the file, since the datanode's getBlockInfo returns null to the client
> (of course, the DN doesn't know about the 3rd block yet).
>
> In this situation the reader cannot see the file, but once the block write is
> in progress, the read succeeds (a writer sketch that sets up this window
> follows below).
>
> Is this a bug that needs to be handled in the append branch?
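>
> For reference, a rough sketch of a writer that sets up steps 1-3 above
> (2 MB blocks, ~4 MB written); the file path is a placeholder and the
> replication count is arbitrary:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class BlockBoundaryWriter {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(conf);
>
>         long blockSize = 2L * 1024 * 1024;                 // 2 MB blocks (step 1)
>         Path file = new Path("/user/test/append-demo");    // placeholder path
>         FSDataOutputStream out = fs.create(file, true,
>                 conf.getInt("io.file.buffer.size", 4096),
>                 (short) 3, blockSize);
>
>         byte[] chunk = new byte[64 * 1024];
>         for (long written = 0; written < 4L * 1024 * 1024; written += chunk.length) {
>             out.write(chunk);                              // two full blocks (step 2)
>         }
>         // The next write past 4 MB makes the client call addBlock() for the
>         // 3rd block (step 3); a reader opening the file in that window can hit
>         // the failure described above.
>         out.close();
>         fs.close();
>     }
> }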
>
>
>
> >> -----Original Message-----
> >> From: Konstantin Boudnik [mailto:c...@boudnik.org]
> >> Sent: Friday, February 11, 2011 4:09 AM
> >>To: common-user@hadoop.apache.org
> >> Subject: Re: hadoop 0.20 append - some clarifications
>
> >> You might also want to check the append design doc published at HDFS-265
>
>
>
> I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's
> design doc won't apply to it.
>
>
>
>  _____
>
> From: Ted Dunning [mailto:tdunn...@maprtech.com]
> Sent: Thursday, February 10, 2011 9:29 PM
> To: common-user@hadoop.apache.org; gok...@huawei.com
> Cc: hdfs-u...@hadoop.apache.org
> Subject: Re: hadoop 0.20 append - some clarifications
>
>
>
> Correct is a strong word here.
>
>
>
> There is actually an HDFS unit test that checks to see if partially written
> and unflushed data is visible.  The basic rule of thumb is that you need to
> synchronize readers and writers outside of HDFS.  There is no guarantee that
> data is visible or invisible after writing, but there is a guarantee that it
> will become visible after sync or close.
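>
> A minimal sketch of that rule of thumb, assuming the 0.20-append sync() API;
> the separate length file is purely an illustrative out-of-band channel, not
> anything HDFS itself prescribes:
>
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class CoordinatedWriter {
>     // Write a payload, force it to be visible, then publish the readable
>     // length through a separate file so readers never have to guess.
>     public static void writeAndPublish(FileSystem fs, Path data, Path lengthFile,
>                                        byte[] payload) throws Exception {
>         FSDataOutputStream out = fs.create(data);
>         out.write(payload);
>         out.sync();                 // 0.20-append; later releases use hflush()
>         out.close();
>
>         // Out-of-band signal: readers read at most this many bytes of `data`.
>         FSDataOutputStream len = fs.create(lengthFile);
>         len.writeLong(payload.length);
>         len.close();
>     }
> }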
>
> On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M <gok...@huawei.com> wrote:
>
> Is this the correct behavior, or is my understanding wrong?
>
