Hi Dan,

You could do one of a few things to get around this.
1. In a subsequent step, merge all of your MapFile outputs into a single file. 
This works if the combined MapFile output is small.
2. Otherwise, you can apply the same partition function that Hadoop used when 
writing the output to compute the partition ID for a key. The partition ID 
tells you which of the 150 output files your key is in. 
E.g., if the partition ID is 23, the file to look in is part-00023 in the 
generated output. 
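To make option 2 concrete, here is a minimal, self-contained sketch (no Hadoop dependency) of the idea. It mirrors what Hadoop's default HashPartitioner does - (key.hashCode() & Integer.MAX_VALUE) % numPartitions - and builds the part-NNNNN file name from the result. The class name, key value, and reducer count below are illustrative assumptions; if your job used a custom Partitioner, substitute its getPartition() logic here.

```java
// Sketch: given a key, compute which part-NNNNN file it was written to,
// assuming the job used Hadoop's default HashPartitioner. Self-contained
// for illustration; names and values here are hypothetical.
public class PartitionLookup {

    // Mirrors HashPartitioner.getPartition(): mask off the sign bit
    // so the result is non-negative, then take it modulo numPartitions.
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Reducer output files are named part-00000, part-00001, ...
    static String partFileName(Object key, int numPartitions) {
        return String.format("part-%05d", getPartition(key, numPartitions));
    }

    public static void main(String[] args) {
        int numReducers = 150;   // e.g. the 150 reducers in Dan's setup
        String key = "someKey";  // hypothetical key to look up
        // Prints the name of the MapFile directory to open for this key.
        System.out.println(partFileName(key, numReducers));
    }
}
```

The important constraint is that both jobs must use the identical partition function and the same number of reducers, otherwise the computed file name will not match where the key was actually written.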

You can use your own Partitioner class (make sure you use it for your first 
job as well as the second) or reuse the one already provided by Hadoop. 
http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/mapred/Partitioner.html
 has details.

I think this 
http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/examples/SleepJob.html
 has a usage example (look for SleepJob.java).

-Lohit




----- Original Message ----
From: Dan Benjamin <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, November 18, 2008 10:53:47 AM
Subject: Performing a Lookup in Multiple MapFiles?


I've got a Hadoop process that creates as its output a MapFile.  Using one
reducer this is very slow (as the map is large), but with 150 (on a cluster
of 80 nodes) it runs quickly.  The problem is that it produces 150 output
files as well.  In a subsequent process I need to perform lookups on this
map - how is it recommended that I do this, given that I may not know the
number of existing MapFiles or their names?  Is there a cleaner solution
than listing the contents of the directory containing all of the MapFiles
and then just querying each in sequence?
