I hope y'all can offer me some insight. I've only started looking hard at hadoop in the last couple of days, but I've yet to see a good example that I can hack for a style of problem I tend to encounter a lot.

Two large (1-10 Gb) files (A & B) already partitioned (semantically) into n and m chunks of binary data, respectively.

An existing binary with otherwise fixed parameters needs to be run on all n * m chunk-pairs, most of which take 10's of minutes to run.

I'd like to use the streaming-module, and in my current (hadoop) cluster implementation, there is a shared NFS file-system and HDFS is implemented on local scratch disk on each machine. However, I can't guarantee that other hadoop clusters I co-opt in the future will have a shared file-system.

I can enumerate pairs of chunk identifiers, have each one (lines in a file or separate files) fire-off a (python) mapper script which reads the two input chunks Aj and Bk over NFS or using fs -get to the local directory. I could even try to use the -cacheFile (can this point to a _directory_ in HDFS? as they are specified once per job, they can't point to individual chunks, so must point to all). Perhaps I could use a -cacheArchive of all the chunks, and just extract the right one from the jar file on demand?

However, this forgoes any opportunity for the map-reduce scheduler to take advantage of chunk data-locality in HDFS in scheduling the mapping tasks, which will essentially require the full replication off all chunks to all task nodes (not horrible, but not great) given random task assignment.

Maybe all I need is a special inputformat or inputrecord and supply two directories of chunks on the command-line as arguments to -input ?

Dunno.

My current solution (720 * 720) involves shell-scripts, lockfiles, and condor, over a few hundred CPUs, and sometimes has trouble with using the NFS shared-filesystem (particularly NFS writes) and with "random" jobs failures, which are hard to detect and recompute.

Thanks for any insight...

- n

--
Dr. Nathan Edwards                      [email protected]
Department of Biochemistry and Molecular & Cellular Biology
           Georgetown University Medical Center
Room 1215, Harris Building          Room 347, Basic Science
3300 Whitehaven St, NW              3900 Reservoir Road, NW
Washington DC 20007                     Washington DC 20007
Phone: 202-687-7042                     Phone: 202-687-1618
Fax: 202-687-0057                         Fax: 202-687-7186

Reply via email to