Re: Why do reads take as long as replicated writes?

2014-11-10 Thread Colin McCabe
I strongly suggest benchmarking a modern version of Hadoop rather than Hadoop 1.x. The native CRC stuff from HDFS-3528 greatly reduces CPU consumption on the read path. I wrote about some other read path optimizations in Hadoop 2.x here: http://www.club.cc.cmu.edu/~cmccabe/d/2014.04_ApacheCon_HDF

Re: Why do reads take as long as replicated writes?

2014-11-05 Thread Eitan Rosenfeld
Daemeon - Indeed, I neglected to mention that I am clearing the caches throughout my cluster before running the read benchmark. Ideally, I expected results proportional to disk I/O, given that replicated writes perform twice the disk I/O of reads. I've verified the
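
The disk I/O expectation in this message can be put into numbers with a rough sketch. It assumes the 5 GB file and replication factor of 2 from the original post, counts only bytes that reach disk, and ignores network transfer, checksum CPU cost, and caching. The function name is made up for illustration and is not part of Hadoop.

```python
def disk_io_bytes(file_bytes, replication):
    """Return (write_io, read_io) in bytes, summed over the whole cluster.

    Assumption: a write stores each replica on disk exactly once, and a
    read fetches every byte from exactly one replica.
    """
    write_io = file_bytes * replication
    read_io = file_bytes
    return write_io, read_io


GB = 1024 ** 3
write_io, read_io = disk_io_bytes(5 * GB, replication=2)
ratio = write_io / read_io
print(f"write: {write_io / GB:.0f} GB, read: {read_io / GB:.0f} GB, ratio: {ratio:.1f}")
```

Under these assumptions the write moves twice as many bytes to disk as the read pulls from it, which is the ratio the benchmark was expected to reflect if disk were the only bottleneck.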

Re: Why do reads take as long as replicated writes?

2014-11-04 Thread daemeon reiydelle
Reads can be faster than writes for smaller bursts of I/O, in part because of disk and memory caching of reads. (If you turn on write-back, which is not recommended, your numbers above are likely to get closer together.) As your volume of I/O increases, you tend to reach a point where you are bound (more or less)

Re: Why do reads take as long as replicated writes?

2014-11-04 Thread Andrew Wang
I would advise against using TestDFSIO; try TeraGen and TeraValidate instead. IIRC TestDFSIO doesn't actually schedule for task locality, so it's not very good if you have a cluster bigger than your replication factor. You might be network bound as you try to read more files. Best, Andrew
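
The locality point can be illustrated with a small estimate. This is only a sketch under stated assumptions: the 16-node cluster and replication factor of 2 come from the original post, replicas are assumed to sit on distinct nodes chosen uniformly, and the reading task is assumed to land on a uniformly random node (that is, no locality-aware scheduling at all). The function name is hypothetical.

```python
from fractions import Fraction


def local_read_probability(nodes, replication):
    """Chance that a randomly placed task runs on a node holding a replica.

    Assumption: replicas of a block are on distinct nodes, and the task's
    node is picked uniformly at random, independent of replica placement.
    """
    if replication >= nodes:
        # Every node holds a replica, so every read is local.
        return Fraction(1)
    return Fraction(replication, nodes)


p = local_read_probability(nodes=16, replication=2)
print(f"chance a task lands on a node holding its block: {float(p):.1%}")
```

With these numbers only about one read in eight would be served from a local disk; the rest would cross the network, which is consistent with the suggestion that the reads may be network bound.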

Why do reads take as long as replicated writes?

2014-11-04 Thread Eitan Rosenfeld
I am benchmarking my cluster of 16 nodes (all in one rack) with TestDFSIO on Hadoop 1.0.4. For simplicity, I turned off speculative task execution and set the max map and reduce tasks to 1. With a replication factor of 2, writing 1 file of 5GB takes twice as long as reading 1 file. This result se