Streaming analysis of n * m binary files...

Nathan Edwards Fri, 04 Dec 2009 11:57:58 -0800

I hope y'all can offer me some insight. I've only started looking hard at hadoop in the last couple of days, but I've yet to see a good example that I can hack for a style of problem I tend to encounter a lot.

Two large (1-10 Gb) files (A & B) already partitioned (semantically) into n and m chunks of binary data, respectively.

An existing binary with otherwise fixed parameters needs to be run on all n * m chunk-pairs, most of which take tens of minutes to run.

I'd like to use the streaming-module, and in my current (hadoop) cluster implementation, there is a shared NFS file-system and HDFS is implemented on local scratch disk on each machine. However, I can't guarantee that other hadoop clusters I co-opt in the future will have a shared file-system.

I can enumerate pairs of chunk identifiers, have each one (lines in a file or separate files) fire off a (python) mapper script which reads the two input chunks Aj and Bk over NFS, or uses fs -get to copy them to the local directory. I could even try to use -cacheFile (can this point to a _directory_ in HDFS? As cache files are specified once per job, they can't point to individual chunks, so must point to all). Perhaps I could use a -cacheArchive of all the chunks, and just extract the right one from the jar file on demand?
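To make the pair-enumeration idea concrete, here is a rough sketch of what I have in mind. It assumes chunk identifiers are just the integers 0..n-1 and 0..m-1, that each pair is one tab-separated line of the streaming input, and that chunks live under an HDFS directory with names like A_0, B_3. The function names, the directory layout, and the naming scheme are all made up for illustration; only the fs -get form is the standard hadoop command.

```python
import itertools


def enumerate_pairs(n, m):
    """Yield one 'j<TAB>k' line for every chunk pair (Aj, Bk)."""
    for j, k in itertools.product(range(n), range(m)):
        yield "%d\t%d" % (j, k)


def parse_pair(line):
    """Parse a 'j<TAB>k' line, as a streaming mapper would read it from stdin."""
    j, k = line.strip().split("\t")
    return int(j), int(k)


def fetch_command(hdfs_dir, prefix, index, local_dir="."):
    """Build (but do not run) the 'hadoop fs -get' command for one chunk.

    The '<prefix>_<index>' naming under hdfs_dir is a hypothetical layout.
    """
    name = "%s_%d" % (prefix, index)
    return ["hadoop", "fs", "-get", "%s/%s" % (hdfs_dir, name),
            "%s/%s" % (local_dir, name)]
```

For the 720 * 720 case this gives 518400 input lines. A mapper would call parse_pair on each line, fetch the two chunks with the commands built by fetch_command (e.g. via subprocess), and then run the existing binary on them.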

However, this forgoes any opportunity for the map-reduce scheduler to take advantage of chunk data-locality in HDFS in scheduling the mapping tasks, which will essentially require the full replication of all chunks to all task nodes (not horrible, but not great) given random task assignment.

Maybe all I need is a special inputformat or inputrecord and supply two directories of chunks on the command-line as arguments to -input ?


Dunno.

My current solution (720 * 720) involves shell-scripts, lockfiles, and condor, over a few hundred CPUs, and sometimes has trouble with using the NFS shared-filesystem (particularly NFS writes) and with "random" job failures, which are hard to detect and recompute.


Thanks for any insight...

- n

--
Dr. Nathan Edwards                      [email protected]
Department of Biochemistry and Molecular & Cellular Biology
           Georgetown University Medical Center
Room 1215, Harris Building          Room 347, Basic Science
3300 Whitehaven St, NW              3900 Reservoir Road, NW
Washington DC 20007                     Washington DC 20007
Phone: 202-687-7042                     Phone: 202-687-1618
Fax: 202-687-0057                         Fax: 202-687-7186
