Daemeon - Indeed, I neglected to mention that I am clearing the caches across the cluster before running the read benchmark. I expected the results to be roughly proportional to disk I/O, given that replicated writes perform twice the disk I/O of reads. I've verified the I/O with iostat. However, as I mentioned earlier, read and write performance converge as the number of files in the workload increases, even though the ratio of write I/O to read I/O stays constant. (A sketch of how I clear the caches is below.)
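For reference, this is roughly what I run on each node before the read pass so that reads have to hit disk. It's a minimal sketch: the hostnames are placeholders, and it assumes passwordless ssh and sudo on every node.

```python
#!/usr/bin/env python3
"""Drop the OS page cache on every node before a TestDFSIO read run."""
import subprocess

# Hypothetical datanode hostnames; substitute your cluster's nodes.
NODES = ["node1", "node2", "node3", "node4"]

def drop_caches(host: str) -> None:
    # 'sync' flushes dirty pages first; writing 3 to drop_caches evicts the
    # page cache, dentries, and inodes, so subsequent reads go to disk.
    cmd = "sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null"
    subprocess.run(["ssh", host, cmd], check=True)

if __name__ == "__main__":
    for node in NODES:
        drop_caches(node)
        print(f"cleared page cache on {node}")
```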
Andrew - I've verified that the network is not the bottleneck (all of the links are 10Gb). As you'll see below, I suspect the slowdown comes from the lack of data locality, because a given node can be responsible for serving multiple remote block reads at once.

I'd like to confirm my understanding of writes and reads. Write pipelining allows a node to write its own data, replicate it, and receive replicated data in parallel. If node A is writing its own data while receiving replicated data from node B, node B does not wait for node A to finish flushing B's replica to disk; rather, B can begin writing its next local block immediately. This pipelining is why replicated writes perform well.

In contrast, suppose node A is currently reading a block when it receives an additional read request from node B. A takes longer to serve the block to B because of its pre-existing read, and because B waits longer for that block, B is delayed before it can request the next block in the file. Multiple read requests landing on the same node are a consequence of TestDFSIO having no built-in data locality, so as the number of concurrent tasks across the cluster increases, the wait time for reads grows.

Is my understanding of these read and write mechanisms correct?

Thank you,
Eitan
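P.S. To make the queueing argument above concrete, here is a back-of-the-envelope toy model I've been using to think about it. The numbers are made up and it is not HDFS code; it only illustrates the reasoning that pipelined writes overlap replication while uncoordinated remote reads queue at a single node.

```python
# Toy model of the argument above: pipelined writes vs. contended remote reads.
# All numbers are hypothetical illustrations, not measurements.

BLOCK_TIME = 1.0      # assumed time to write or read one block from disk
N_BLOCKS = 10         # blocks per file
READERS_PER_NODE = 3  # assumed concurrent remote readers hitting one node

# Writes: the writer starts its next local block immediately, so replica
# flushes on other nodes overlap with local writes in the ideal case.
write_time = N_BLOCKS * BLOCK_TIME

# Reads without locality: each block request waits behind the other readers
# being served by the same node, so per-block latency scales with the
# number of concurrent readers.
read_time = N_BLOCKS * BLOCK_TIME * READERS_PER_NODE

print(f"pipelined write of one file : {write_time:.1f}")
print(f"contended read of one file  : {read_time:.1f}")
```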