I am benchmarking my 16-node cluster (all nodes in one rack) with TestDFSIO on Hadoop 1.0.4. For simplicity, I turned off speculative task execution and set the maximum number of map and reduce tasks per node to 1.
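For completeness, the settings I changed were along these lines in mapred-site.xml (these are the standard Hadoop 1.x property names; the values shown are the ones described above):

```xml
<!-- mapred-site.xml (sketch): disable speculative execution and limit
     each TaskTracker to one map slot and one reduce slot -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
```

The benchmark itself was run with TestDFSIO's -write and -read modes, varying -nrFiles while keeping -fileSize fixed at 5 GB per file.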
With a replication factor of 2, writing one 5 GB file takes twice as long as reading one. That result seems to make sense: with two replicas, a write pushes twice as many bytes through the cluster as a read does. However, as I scale the number of 5 GB files from 1 up to 64, the gap steadily closes, and at 64 files reading takes just as long as writing. What could cause read performance to degrade faster than write performance as the number of files increases?

The full results are below:

  number of 5 GB files    write time / read time
           1                      2.02
           2                      1.87
           4                      1.73
           8                      1.54
          16                      1.37
          32                      1.29
          64                      1.01

Thank you,
Eitan