Thanks for that pointer. I read the sections on staging and replication pipelining, and I'm still not clear on the synchronization points. If I'm writing out two blocks' worth of data, will the thread performing the write block after the first chunk is staged but before replication pipelining, or will it continue staging the second chunk while the first chunk is being pipelined? Or have I not sufficiently scrutinized the documentation and should I continue to RTFM?
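To make the scenario concrete, here's roughly the client code I have in mind (a minimal sketch against the stock FileSystem API; the class name, the /tmp/twoblocks.dat path, and the 4 KB buffer are just made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TwoBlockWrite {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = fs.getDefaultBlockSize();  // 64 MB by default
        byte[] buf = new byte[4096];
        FSDataOutputStream out = fs.create(new Path("/tmp/twoblocks.dat"));
        long written = 0;
        while (written < 2 * blockSize) {
          out.write(buf);  // does this thread stall here once block 1 is staged?
          written += buf.length;
        }
        out.close();  // presumably this waits for the pipeline to drain
      }
    }

The question is whether the write() calls in the second block's iterations overlap with the pipelining of the first block, or wait for it.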
For reference:

Staging

A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.

The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client-side buffering, the network speed and the congestion in the network impact throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.

Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository, and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. In this way, the data is pipelined from one DataNode to the next. (I've sketched this relay pattern in code at the end of this mail to check my reading.)

On Thu, Mar 17, 2011 at 12:34 PM, Harsh J <qwertyman...@gmail.com> wrote:
> Have a read of the replication feature design:
> http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html#Data+Replication
> :)
>
> On Thu, Mar 17, 2011 at 9:59 PM, Andrew Rothstein
> <andrew.rothst...@gmail.com> wrote:
>> If I'm using a replication factor of 3 and I write a block of data
>> will my write operation block until the data is present on 3 nodes?
>
> No.
>
>> will it block until the data is present on 1 node and asynchronously
>> replicate from there to 2 other nodes?
>
> Yes!
>
> --
> Harsh J
> http://harshj.com
>
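PS: here's the toy sketch of the relay pattern as I read the Replication Pipelining section above. It's purely illustrative, not the actual DataNode code; the PipelineRelay class, the relay() helper, and its stream arguments are my own invention.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    class PipelineRelay {
      // Receive a block in 4 KB portions, persist each portion locally,
      // and forward it downstream before the whole block has arrived.
      static void relay(InputStream upstream, OutputStream localRepo,
                        OutputStream downstream) throws IOException {
        byte[] portion = new byte[4 * 1024];  // the 4 KB portions from the doc
        int n;
        while ((n = upstream.read(portion)) != -1) {
          localRepo.write(portion, 0, n);     // write to local repository
          if (downstream != null) {
            downstream.write(portion, 0, n);  // next DataNode in the list
          }
        }
        localRepo.flush();
        if (downstream != null) downstream.flush();
      }
    }

If that reading is right, each node overlaps receiving and forwarding, which is what makes me wonder whether the client side overlaps staging and pipelining the same way.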