Hi Aaron,

Thank you so much for your reply!

After reading your notes, I realize it may not be necessary to enforce file
lock synchronization at the file system level (HDFS here). In the
traditional file system abstraction, file access synchronization is the
programmer's job: the program must set up its critical sections correctly so
that concurrent programs do not write to the same file at the same time
(one example of access synchronization).
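
To make the idea of an application-level critical section concrete, here is
a minimal sketch in Python. It is only an illustration under stated
assumptions: it uses threads inside one process and a local file, not HDFS,
and the names append_record, run_writers and count_lines are hypothetical
helpers invented for this example. Writers on different machines would need
an external coordination service instead of an in-process lock.

```python
import os
import tempfile
import threading

# Hypothetical application-level lock. The file system itself provides
# no locking here; the program is responsible for mutual exclusion.
_file_lock = threading.Lock()


def append_record(path, record):
    """Append one line to path inside a critical section."""
    with _file_lock:  # only one writer thread runs this block at a time
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(record + "\n")


def run_writers(path, writer_count=4, records_per_writer=50):
    """Start several writer threads and wait for all of them to finish."""
    def work(writer_id):
        for i in range(records_per_writer):
            append_record(path, f"writer-{writer_id} record-{i}")

    threads = [
        threading.Thread(target=work, args=(n,)) for n in range(writer_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def count_lines(path):
    """Return the number of lines currently stored in path."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__":
    shared_path = os.path.join(tempfile.mkdtemp(), "shared.log")
    run_writers(shared_path)
    print(count_lines(shared_path), "records written")
```

The lock lives entirely in the application, which is the point: the file
system is never asked to arbitrate between writers.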

A file lock mechanism for access synchronization or data consistency becomes
necessary only when we move to a database abstraction on top of the file
system.

That is my understanding of the traditional designs and abstractions of file
systems and databases. Please correct me if I am wrong.

Based on the above, we will use HDFS with the traditional file system
abstraction (without file locks). If we really need file access
synchronization, we may go for HBase.

Thanks again for your reply! It is really inspiring. Please share your
comments if I am wrong on any point. :)

Best regards,
Zhang Bingjun (Eddy)

E-mail: eddym...@gmail.com, bing...@nus.edu.sg, bing...@comp.nus.edu.sg
Tel No: +65-96188110 (M)


On Thu, Aug 6, 2009 at 1:22 AM, Aaron Kimball <aa...@cloudera.com> wrote:

> On Wed, Aug 5, 2009 at 6:09 AM, Zhang Bingjun (Eddy) <eddym...@gmail.com>
> wrote:
>
> > Hi All,
> >
> > I am quite new to Hadoop. May I ask a simple question about HDFS file
> > access synchronization?
> >
> > For some very typical scenarios below, how does HDFS respond? Is there a
> > way to synchronize file access in HDFS?
> >
> > A tries to read a file currently being written by B.
>
>
> There is no sync() call in HDFS. A will read whatever portion of B's data
> has already been committed to disk by the datanode. It is unspecified how
> much data this will contain. It may be variable depending on which replica
> of the file A is reading. After B close()'s the file, all the data will be
> available to A.
>
>
> >
> > A tries to write a file currently being written by B.
>
>
> This will fail. HDFS does not allow multiple writers to a file. The
> FileSystem.create() call used by A to open the file for write access will
> throw IOException.
>
>
> >
> > A tries to write a file currently being read by B.
>
>
> This will fail. HDFS does not allow file updates, so if the file already
> exists and B is reading it, the FileSystem.create() call used by A will
> fail with IOException.
>
>
> >
> >
> > We plan to put some shared data in HDFS so that multiple applications can
> > share the data between them. The ideal case is that the underlying
> > distributed file system (HDFS here) will provide file access
> > synchronization so that applications know when they can or cannot operate
> > on a certain file. Is this way of thinking correct? What is the typical
> > design for this kind of application scenario?
>
>
> You'll have to think carefully. You can't update files. There is also no
> equivalent of flock(), so you can't use files as locks for exclusive access
> to some part of a work flow. If that's what you need, you may want to look
> at the ZooKeeper project and see if you can't integrate ZK into your
> system. ZK is specifically designed to handle locking, mutual exclusion,
> and other distributed synchronization problems.
>
>
>
> >
> >
> > I am quite confused. Definitely need to read more about HDFS and other
> > distributed file systems. But before that, I would appreciate very much
> > the input from experts in the mailing list.
>
>
> http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html and
> http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html are good
> places to start.
>
>
> >
> >
> > Thanks a lot!
> >
> > Best regards,
> > Zhang Bingjun (Eddy)
> >
> > E-mail: eddym...@gmail.com, bing...@nus.edu.sg, bing...@comp.nus.edu.sg
> > Tel No: +65-96188110 (M)
> >
>
> Cheers,
> - Aaron
>
