Michael Segel wrote:
Uhm...

That's not really true. It gets a bit more complicated than that.

If you're talking about M/R jobs, you don't want to spawn threads in your map() routine. While it's possible, it's going to be really hard to justify the extra parallelism, given that you'd have to wait for all of the threads to complete before the map() method can return.

If you're talking about a way to copy files from one cluster to another, in Hadoop you can find out the list of blocks that make up a file. As long as the file is static, meaning no one is writing to, splitting, or compacting it, you could copy it, and here being multi-threaded could work: you'd have one thread per block that reads from one machine and writes directly to the other. Of course, you'd still need to figure out where to write each block, or rather tie into HDFS to do it.
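A minimal sketch of that one-thread-per-block pattern, with plain local files standing in for HDFS blocks (in a real copy the block list would come from the NameNode and each task would read from a DataNode; the class and block size here are made up for illustration):

```java
import java.io.*;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical sketch: one task per "block" copies its byte range of a
// static source file into the destination at the same offset.
public class BlockCopy {
    static final long BLOCK_SIZE = 4 * 1024; // toy block size; HDFS blocks are far larger

    public static void copy(Path src, Path dst) throws Exception {
        long len = Files.size(src);
        // Pre-size the destination so each thread can write at its own offset.
        try (RandomAccessFile out = new RandomAccessFile(dst.toFile(), "rw")) {
            out.setLength(len);
        }
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> tasks = new ArrayList<>();
        for (long off = 0; off < len; off += BLOCK_SIZE) {
            final long start = off;
            final int size = (int) Math.min(BLOCK_SIZE, len - off);
            tasks.add(pool.submit(() -> {
                try (RandomAccessFile in = new RandomAccessFile(src.toFile(), "r");
                     RandomAccessFile out = new RandomAccessFile(dst.toFile(), "rw")) {
                    byte[] buf = new byte[size];
                    in.seek(start);
                    in.readFully(buf);
                    out.seek(start);
                    out.write(buf);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return null;
            }));
        }
        // The copy isn't done until every block thread has finished.
        for (Future<?> t : tasks) t.get();
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        Path src = Files.createTempFile("src", ".bin");
        byte[] data = new byte[10_000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(src, data);
        Path dst = Files.createTempFile("dst", ".bin");
        copy(src, dst);
        System.out.println(java.util.Arrays.equals(data, Files.readAllBytes(dst)));
    }
}
```

The wait-for-all loop at the end is exactly the cost mentioned above for doing this inside map(): nothing can return until the slowest block finishes.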

There's a paper by Russ Perry on using HDFS as a filestore for raster processing, in which he modified DfsClient to expose all the block locations of a file and let the caller decide which machine to read each block from.

http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html

The advantage of this is that the caller can do the striping across machines itself, keeping every server busy by asking each of them for blocks. Of course, this ignores the trend toward many-HDD servers: DfsClient currently can't see which physical disk a block is on, which you'd need if the client wanted to keep every disk on every server busy during a big read.
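The striping idea can be sketched as a scheduling problem: given each server's list of blocks, interleave the requests round-robin so every server has work in flight instead of being drained one at a time. A small self-contained illustration (the class, host names, and block ids are all invented for the example):

```java
import java.util.*;

// Hypothetical sketch of client-side striping: interleave block requests
// round-robin across servers so no server sits idle while another is drained.
public class Stripe {
    public static List<String> interleave(Map<String, List<String>> blocksByHost) {
        List<String> order = new ArrayList<>();
        List<Iterator<String>> its = new ArrayList<>();
        for (List<String> blocks : blocksByHost.values()) its.add(blocks.iterator());
        boolean progress = true;
        while (progress) {
            progress = false;
            // Take at most one block from each host per pass.
            for (Iterator<String> it : its) {
                if (it.hasNext()) { order.add(it.next()); progress = true; }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> m = new LinkedHashMap<>();
        m.put("nodeA", List.of("a1", "a2"));
        m.put("nodeB", List.of("b1", "b2"));
        System.out.println(interleave(m)); // [a1, b1, a2, b2]
    }
}
```

Doing the same per-disk would need exactly the visibility the client currently lacks: the map would have to be keyed by (host, disk) rather than by host.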
