On Oct 7, 2010, at 2:35 AM, elton sky wrote:

> Hello experts,
>
> I was benchmarking sequential write throughput of HDFS.
>
> To test the effect of bytesPerChecksum (bpc) size on write performance, I am
> using different bpc sizes: 2M, 256K, 32K, 4K, 512B.
>
> My cluster has 1 name node and 5 data nodes. They are Xen VMs, and each of
> them is configured with a 56MB/s duplex Ethernet connection.
>
> I try to create a 10G file with each bpc value. When bpc is 2M, the
> throughput drops dramatically compared with the others:
>
> time(ms): 333008  bpc: 2M
> time(ms): 234180  bpc: 256K
> time(ms): 223737  bpc: 32K
> time(ms): 228842  bpc: 4K
> time(ms): 228238  bpc: 512
>
> After digging into the source, I found the problem happens on the data nodes,
> in org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket():
>
> private int readNextPacket() throws IOException {
>   ...
>   while (buf.remaining() < SIZE_OF_INTEGER) {
>     if (buf.position() > 0) {
>       shiftBufData();
>     }
>     readToBuf(-1); // this line takes 30ms or more for each packet before it returns
>   }
>   ...
>   while (toRead > 0) { // this loop also takes around 30ms
>     toRead -= readToBuf(toRead);
>   }
>   ...
> }
>
> private long readToBuf(int toRead) throws IOException {
>   ...
>   int nRead = in.read(buf.array(), buf.limit(), toRead); // this is the line that actually causes the delay
>   ...
> }
>
> The in.read() takes around 30ms waiting for data before it returns, and when
> it returns it has read only a few KB of data. The while loop that comes later
> takes a similar time to finish, reading the rest (2MB minus the few KB read
> before).
>
> I couldn't understand the reason for the pause in in.read(). Why does the
> data node need to wait? Why is the data not available at that point?
It is probably waiting on disk or network.

> Why does this happen when using a big bpc?

Linux tends to asynchronously 'read-ahead' from disks when it detects sequential access to a file. The default is to read ahead in chunks of up to 128K. You can change this on a per-device level with "blockdev --setra" (google it). Since Hadoop fetches data in a synchronous loop, it loses the benefit of the OS asynchronous read-ahead past 128K unless you change that setting. I recommend a readahead value of ~2MB for today's SATA drives if you need top sequential-access performance from Linux. For 2MB it would look something like this:

# blockdev --setra 4096 /dev/sda

> Any idea will be appreciated!
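A quick sketch of the arithmetic behind that number, since blockdev's readahead unit is 512-byte sectors rather than bytes (/dev/sda here is just an example device name, and the blockdev calls need root, so they are left commented out):

```shell
# A 2MB readahead window expressed in 512-byte sectors:
# 2 * 1024 * 1024 / 512 = 4096 sectors.
SECTORS=$((2 * 1024 * 1024 / 512))
echo "$SECTORS"   # prints 4096

# Check the current setting, then apply the new one (root required):
# blockdev --getra /dev/sda
# blockdev --setra "$SECTORS" /dev/sda
```

The setting is per device and does not persist across reboots, so it usually goes in a boot-time script on each data node.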