I strongly suggest benchmarking a modern version of Hadoop rather than
Hadoop 1.x. The native CRC stuff from HDFS-3528 greatly reduces CPU
consumption on the read path. I wrote about some other read path
optimizations in Hadoop 2.x here:
http://www.club.cc.cmu.edu/~cmccabe/d/2014.04_ApacheCon_HDF
Daemeon - Indeed, I neglected to mention that I am clearing the caches
across my cluster before running the read benchmark. Ideally, I expected
the results to be proportional to disk I/O, given that with a replication
factor of 2 each write performs twice the disk I/O of a read. I've
verified the
Reads can be faster than writes for smaller bursts of I/O, partly due to
disk and memory caching on the read path (if you turn on write-back
caching, which is not recommended, the numbers above are likely to get
closer together). As your volume of I/O increases, you tend to reach a
point where you are bound (more or less)
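For reference, the cache clearing mentioned earlier in the thread is typically done along these lines; the host names and root ssh access here are assumptions, not something stated in the thread:

```shell
# Drop the Linux page/dentry/inode caches on each datanode before the
# read pass, so that reads hit disk rather than memory. Requires root.
# "node01 node02 node03" is a placeholder list for your cluster's hosts.
for host in node01 node02 node03; do
  ssh root@"$host" 'sync; echo 3 > /proc/sys/vm/drop_caches'
done
```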
I would advise against using TestDFSIO; try TeraGen and TeraValidate
instead. IIRC, TestDFSIO doesn't actually schedule for task locality, so
it's not a good fit when your cluster is larger than your replication
factor. You might be network bound as you try to read more files.
Best,
Andrew
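A sketch of the TeraGen/TeraValidate runs suggested above (jar name, paths, and sizes are illustrative; on Hadoop 1.x the tera* programs ship in the bundled examples jar):

```shell
# Write benchmark: TeraGen emits 100-byte rows, so 50M rows is ~5 GB.
hadoop jar hadoop-examples.jar teragen 50000000 /bench/tera-in

# Read benchmark: TeraValidate scans every row of its input. Note that
# it is normally run against TeraSort output to verify ordering, so on
# raw TeraGen output it will report ordering errors while still reading
# all the data.
hadoop jar hadoop-examples.jar teravalidate /bench/tera-in /bench/tera-report
```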
I am benchmarking my cluster of 16 nodes (all in one rack) with TestDFSIO on
Hadoop 1.0.4. For simplicity, I turned off speculative task execution and set
the max map and reduce tasks to 1.
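The runs described above would look roughly like the following; the test jar name is illustrative (on Hadoop 1.x TestDFSIO lives in the bundled test jar), and `-fileSize` is in MB:

```shell
# Write one 5 GB file, then read it back with the same settings.
hadoop jar hadoop-test.jar TestDFSIO -write -nrFiles 1 -fileSize 5120
hadoop jar hadoop-test.jar TestDFSIO -read  -nrFiles 1 -fileSize 5120
```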
With a replication factor of 2, writing 1 file of 5GB takes twice as long as
reading 1 file. This result se