Michael Segel wrote:
Uhm...

That's not really true. It gets a bit more complicated than that.

If you're talking about M/R jobs, you don't want to spawn threads in your map() routine. While it's possible, it's going to be really hard to justify the extra parallelism, given that you'd have to wait for all of the threads to complete before the map() method can return.

If you're talking about a way to copy files from one cluster to another, in Hadoop you can find out the list of blocks that make up a file. As long as the file is static, meaning no one is writing to, splitting, or compacting it, you could copy it, and here being multi-threaded could work: you'd have one thread per block that reads from one machine and writes directly to the other. Of course, you'd still need to figure out where to write each block, or rather tie into HDFS to do it.
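A minimal sketch of that one-thread-per-block pattern, with plain local files standing in for HDFS blocks (in a real copy the block list would come from the NameNode and each task would read from a DataNode; the class and block size here are made up for illustration):

```java
import java.io.*;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical sketch: one task per "block" copies its byte range of a
// static source file into the destination at the same offset.
public class BlockCopy {
    static final long BLOCK_SIZE = 4 * 1024; // toy block size; HDFS blocks are far larger

    public static void copy(Path src, Path dst) throws Exception {
        long len = Files.size(src);
        // Pre-size the destination so each thread can write at its own offset.
        try (RandomAccessFile out = new RandomAccessFile(dst.toFile(), "rw")) {
            out.setLength(len);
        }
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> tasks = new ArrayList<>();
        for (long off = 0; off < len; off += BLOCK_SIZE) {
            final long start = off;
            final int size = (int) Math.min(BLOCK_SIZE, len - off);
            tasks.add(pool.submit(() -> {
                try (RandomAccessFile in = new RandomAccessFile(src.toFile(), "r");
                     RandomAccessFile out = new RandomAccessFile(dst.toFile(), "rw")) {
                    byte[] buf = new byte[size];
                    in.seek(start);
                    in.readFully(buf);
                    out.seek(start);
                    out.write(buf);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return null;
            }));
        }
        // The copy isn't done until every block thread has finished.
        for (Future<?> t : tasks) t.get();
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        Path src = Files.createTempFile("src", ".bin");
        byte[] data = new byte[10_000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(src, data);
        Path dst = Files.createTempFile("dst", ".bin");
        copy(src, dst);
        System.out.println(java.util.Arrays.equals(data, Files.readAllBytes(dst)));
    }
}
```

The wait-for-all loop at the end is exactly the cost mentioned above for doing this inside map(): nothing can return until the slowest block finishes.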

There's a paper by Russ Perry on using HDFS as a filestore for raster processing, in which he modified DfsClient to expose all the block locations of a file and let the caller decide which machine to read each block from.

http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html

The advantage of this is that the caller can do the striping across machines itself, keeping every server busy by asking each of them for blocks. Of course, this ignores the trend toward many-HDD servers: DfsClient currently can't see which physical disk a block is on, which you'd need if the client wanted to keep every disk on every server busy during a big read.
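The striping idea can be sketched as a scheduling problem: given each server's list of blocks, interleave the requests round-robin so every server has work in flight instead of being drained one at a time. A small self-contained illustration (the class, host names, and block ids are all invented for the example):

```java
import java.util.*;

// Hypothetical sketch of client-side striping: interleave block requests
// round-robin across servers so no server sits idle while another is drained.
public class Stripe {
    public static List<String> interleave(Map<String, List<String>> blocksByHost) {
        List<String> order = new ArrayList<>();
        List<Iterator<String>> its = new ArrayList<>();
        for (List<String> blocks : blocksByHost.values()) its.add(blocks.iterator());
        boolean progress = true;
        while (progress) {
            progress = false;
            // Take at most one block from each host per pass.
            for (Iterator<String> it : its) {
                if (it.hasNext()) { order.add(it.next()); progress = true; }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> m = new LinkedHashMap<>();
        m.put("nodeA", List.of("a1", "a2"));
        m.put("nodeB", List.of("b1", "b2"));
        System.out.println(interleave(m)); // [a1, b1, a2, b2]
    }
}
```

Doing the same per-disk would need exactly the visibility the client currently lacks: the map would have to be keyed by (host, disk) rather than by host.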
