On Mar 26, 2008, at 11:05 AM, Arun C Murthy wrote:
On Mar 26, 2008, at 9:39 AM, Aayush Garg wrote:
Hi,
I am developing a simple inverted index program in Hadoop. My map function has the output:
<word, doc>
and the reducer has:
<word, list(docs)>
Now I want to use one more MapReduce job to remove stop and scrub words from this output. Also, in the next stage I would like to have a short summary associated with every word. How should I design my program from this stage? I mean, how would I apply multiple MapReduce jobs to this? What would be the better way to perform this?
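The dataflow described above can be sketched in plain Java (no Hadoop dependencies; class and method names are illustrative stand-ins, not the Hadoop API) — the "map" output is a stream of (word, doc) pairs, and the "reduce" step groups them into word -> list(docs):

```java
import java.util.*;

// Illustrative stand-in for the inverted-index dataflow in the question:
// map emits (word, doc) pairs; reduce groups them by word.
public class InvertedIndexSketch {

    // Simulates the shuffle + reduce: group (word, doc) pairs by word.
    static Map<String, List<String>> reduce(List<String[]> pairs) {
        Map<String, List<String>> index = new TreeMap<String, List<String>>();
        for (String[] p : pairs) {
            List<String> docs = index.get(p[0]);
            if (docs == null) {
                docs = new ArrayList<String>();
                index.put(p[0], docs);
            }
            docs.add(p[1]);
        }
        return index;
    }

    public static void main(String[] args) {
        // Pretend map output from two documents.
        List<String[]> mapOutput = Arrays.asList(
            new String[]{"hadoop", "doc1"},
            new String[]{"index", "doc1"},
            new String[]{"hadoop", "doc2"});
        System.out.println(reduce(mapOutput));
        // {hadoop=[doc1, doc2], index=[doc1]}
    }
}
```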
In general you are better off with fewer Map-Reduce jobs ... less i/o works better.
I forgot to add that you can use the APIs in JobClient and JobControl to chain jobs together ...
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#JobControl
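The chaining idea behind JobControl can be sketched as follows. This mimics only the dependency ordering (declare jobs, declare which jobs they depend on, launch each when its prerequisites finish); the class, the job names, and the method names here are hypothetical, not the Hadoop API:

```java
import java.util.*;

// Stand-in for the job-chaining idea behind JobControl: each job lists the
// jobs it depends on, and a job is "launched" only after all of its
// dependencies have completed. Illustration only, not the Hadoop API.
public class JobChainSketch {
    // Job name -> names of jobs it depends on, in declaration order.
    private final Map<String, List<String>> deps =
        new LinkedHashMap<String, List<String>>();

    void addJob(String name, String... dependsOn) {
        deps.put(name, Arrays.asList(dependsOn));
    }

    // Returns the order in which jobs would be launched
    // (assumes the dependency graph has no cycles).
    List<String> run() {
        List<String> order = new ArrayList<String>();
        while (order.size() < deps.size()) {
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                if (!order.contains(e.getKey())
                        && order.containsAll(e.getValue())) {
                    order.add(e.getKey()); // dependencies done; launch it
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        JobChainSketch jc = new JobChainSketch();
        jc.addJob("buildIndex");
        jc.addJob("removeStopWords", "buildIndex");
        jc.addJob("addSummaries", "removeStopWords");
        System.out.println(jc.run());
        // [buildIndex, removeStopWords, addSummaries]
    }
}
```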
Arun
Use the DistributedCache if you can and fix your first Map to not emit the stop words at all. Use the combiner to crunch down the amount of intermediate map-outputs, etc.
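A minimal sketch of that map-side filtering, as plain Java: in a real Hadoop job the stop-word set would be read once per task from files shipped via the DistributedCache; here it is just passed in, and all names are illustrative:

```java
import java.util.*;

// Sketch of the suggested map-side filtering: load the stop-word set once
// per task (in Hadoop it would come from DistributedCache files), then
// never emit (word, doc) pairs for stop words at all.
public class StopWordFilterSketch {
    private final Set<String> stopWords;

    StopWordFilterSketch(Collection<String> stopWords) {
        this.stopWords = new HashSet<String>(stopWords);
    }

    // Stand-in for the map() body: emit (word, doc) only for non-stop words.
    List<String[]> map(String doc, String text) {
        List<String[]> out = new ArrayList<String[]>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.length() > 0 && !stopWords.contains(word)) {
                out.add(new String[]{word, doc});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        StopWordFilterSketch m =
            new StopWordFilterSketch(Arrays.asList("the", "a", "of"));
        for (String[] p : m.map("doc1", "The anatomy of a Hadoop index")) {
            System.out.println(p[0] + " -> " + p[1]);
        }
        // anatomy -> doc1
        // hadoop -> doc1
        // index -> doc1
    }
}
```

Filtering in the mapper (plus a combiner) shrinks the intermediate data before the shuffle, which is exactly why it beats a second clean-up job.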
Something useful to look at:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
Arun
Thanks,
Regards,
--
Aayush Garg
Phone: +41 76 482 240