On Oct 7, 2010, at 2:35 AM, elton sky wrote:

> Hello experts,
> 
> I was benchmarking sequential write throughput of HDFS.
> 
> To test the effect of bytesPerChecksum (bpc) size on write performance, I am
> using different bpc sizes: 2M, 256K, 32K, 4K, 512B.
> 
> My cluster has 1 name node and 5 data nodes. They are Xen VMs, and each of
> them is configured with a 56MB/s duplex Ethernet connection.
> 
> I create a 10G file with each bpc value. When bpc is 2M, the
> throughput drops dramatically compared with the others:
> 
> time(ms): 333008  bpc: 2M
> 
> time(ms): 234180  bpc: 256K
> 
> time(ms): 223737  bpc: 32K
> 
> time(ms): 228842  bpc: 4K
> 
> time(ms): 228238  bpc: 512
> 
> After digging into the source, I found that the problem happens on the data nodes.
> In org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket():
> 
> private int readNextPacket() throws IOException {
> ...
> 
> while (buf.remaining() < SIZE_OF_INTEGER) {
> 
>     if (buf.position() > 0) {
>        shiftBufData();
>      }
> 
>      readToBuf(-1); // this line takes 30ms or more per packet before it returns
>    }
> ...
> 
> while (toRead > 0) { //this loop also takes around 30 ms
>        toRead -= readToBuf(toRead);
>      }
> ...
> }
> 
> private long readToBufTime(int toRead) throws IOException {
> ...
> 
> int nRead = in.read(buf.array(), buf.limit(), toRead); // this is the line that actually causes the delay
> ...
> 
> }
> 
> The *in.read()* call waits around 30ms for data before it returns, and
> when it returns it has read only a few KB. The while loop that follows takes
> a similar time to finish, and it reads the rest (2MB minus the few KB already read).
> 
> I couldn't understand the reason for the pause in *in.read()*. Why does the
> data node need to wait? Why is the data not available by then?

It is probably waiting on disk or network.
> Why does this happen when using
> a big bpc?
> 

Linux tends to asynchronously 'read-ahead' from disks if sequential access is
detected in a file.  The default is to read-ahead in chunks of up to 128K.  You
can change this per device with "blockdev --setra" (google it).
Since Hadoop fetches data in a synchronous loop, it loses the benefit of the OS's
asynchronous read-ahead past 128K unless you change that setting.

I recommend a read-ahead value of ~2MB for today's SATA drives if you need top
sequential access performance from Linux.  For 2MB, this would look something
like this:

# blockdev --setra 4096 /dev/sda
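
In case the 4096 looks arbitrary: "blockdev --setra" takes its value in 512-byte sectors, not bytes. Here is a rough sketch of the arithmetic, plus the commands to check and apply the setting. /dev/sda is just the example device from above and the variable names are made up for illustration; adjust both for your own data disks.

```shell
# Convert a desired read-ahead size into the sector count for --setra.
readahead_bytes=$((2 * 1024 * 1024))   # 2MB target
sectors=$((readahead_bytes / 512))     # --setra counts 512-byte sectors
echo "$sectors"

# These need root and a real block device, so they are left commented out:
# blockdev --getra /dev/sda              # show the current read-ahead (in sectors)
# blockdev --setra "$sectors" /dev/sda   # apply the new value
```

Note that the setting does not survive a reboot on its own, so you would normally reapply it from a boot-time script.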


> Any ideas will be appreciated!
