Hi, I have the same doubt.
From a scan of the code, it looks like whenever the client writes data, one packet (of size 64 KB) is buffered and then sent directly to the corresponding datanodes. When the end of a block is reached and a packet of the new block is ready, the namenode is contacted to create a new block entry and to assign datanodes to it; the new packets are then sent to one of these newly allocated datanodes.

So it seems that the client does not cache an entire block locally before contacting the namenode, as stated in the design doc. Can somebody please clarify this?

On Mon, Feb 23, 2009 at 11:05 AM, Sangmin Lee <[email protected]> wrote:
> Hi folks,
>
> I have a question regarding HDFS' client side buffering.
> From the documents
>
> http://hadoop.apache.org/core/docs/r0.19.0/hdfs_design.html#Staging
>
> It states that a HDFS client caches one blocks size before it contacts a
> namenode for a new block.
> Is this true?
> I can't find a part of source code for this operation.

The source code for the behaviour I described above can be found in DFSClient.DFSOutputStream. A short explanation is given below:

1. Whenever the user writes data by calling FSDataOutputStream.write(...), DFSClient.DFSOutputStream.writeChunk(...) gets called internally. It creates a 'Packet' in its buffer and enqueues it in the 'DataQueue' maintained by the DFSOutputStream object. (Packet size is 64 KB.)
2. There is a continuously running thread, 'DataStreamer' (DFSClient.DFSOutputStream.DataStreamer), which is started when the DFSOutputStream object is created.
3. This DataStreamer continuously watches the 'DataQueue'. As soon as it finds a packet added to the queue, it dequeues that packet and sends it on the stream connected to the datanode. If the end of a block is found, it contacts the namenode (namenode.addblock) and gets a new datanode address.

> Can anyone shed some light on this for me?
>
> I appreciate your help.
>
> -sangmin
>

thanks,
- ajit.
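
P.S. To make steps 1 to 3 above concrete, here is a rough sketch of the same producer/consumer flow. It is written from scratch for illustration and is not the actual DFSClient code: the class name PacketPipelineSketch, the simulate(...) method, and the tiny BLOCK_SIZE of four packets are all assumptions made up for the demo, and the namenode and datanode calls are replaced by counters. Only the 64 KB packet size and the queue-plus-streamer-thread structure come from the description above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PacketPipelineSketch {
    static final int PACKET_SIZE = 64 * 1024;       // 64 KB, as described above
    static final int BLOCK_SIZE = 4 * PACKET_SIZE;  // hypothetical, tiny block size for the demo

    /** One queued packet; a zero-length packet marks the end of the stream. */
    static final class Packet {
        final int length;
        final boolean lastInBlock;
        Packet(int length, boolean lastInBlock) {
            this.length = length;
            this.lastInBlock = lastInBlock;
        }
    }

    /** Returns {packetsSent, blocksAllocated} for a write of totalBytes. */
    static int[] simulate(long totalBytes) throws InterruptedException {
        BlockingQueue<Packet> dataQueue = new LinkedBlockingQueue<>();
        int[] stats = new int[2];

        // Stand-in for the DataStreamer thread: drains the queue packet by packet.
        Thread streamer = new Thread(() -> {
            boolean needNewBlock = true;
            try {
                while (true) {
                    Packet p = dataQueue.take();
                    if (p.length == 0) {
                        return;                 // end-of-stream marker
                    }
                    if (needNewBlock) {
                        stats[1]++;             // stands in for the namenode addBlock call
                        needNewBlock = false;
                    }
                    stats[0]++;                 // stands in for sending the packet to a datanode
                    if (p.lastInBlock) {
                        needNewBlock = true;    // next packet belongs to a new block
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        streamer.start();

        // Stand-in for the writer side: cut the data into packets, never crossing a block boundary.
        long remaining = totalBytes;
        int bytesInBlock = 0;
        while (remaining > 0) {
            int len = (int) Math.min(remaining, Math.min(PACKET_SIZE, BLOCK_SIZE - bytesInBlock));
            bytesInBlock += len;
            boolean last = (bytesInBlock == BLOCK_SIZE);
            dataQueue.put(new Packet(len, last));
            if (last) {
                bytesInBlock = 0;
            }
            remaining -= len;
        }
        dataQueue.put(new Packet(0, false));    // tell the streamer we are done
        streamer.join();                        // join makes the streamer's counts visible here
        return stats;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] stats = simulate(10L * PACKET_SIZE);
        System.out.println("packets sent: " + stats[0] + ", blocks allocated: " + stats[1]);
    }
}
```

The point of the sketch is that the writer never holds more than one packet before handing it to the queue, and a new block is requested only when the streamer picks up the first packet after a block boundary, which matches what I see in the code rather than the whole-block staging described in the design doc.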
