On Mar 26, 2008, at 11:05 AM, Arun C Murthy wrote:


On Mar 26, 2008, at 9:39 AM, Aayush Garg wrote:

Hi,
I am developing a simple inverted index program with Hadoop. My map
function has the output:
<word, doc>
and the reducer has:
<word, list(docs)>

Now I want to use one more MapReduce pass to remove stop words and scrub words from
this output. Also, in the next stage I would like to have a short summary
associated with every word. How should I design my program from this stage? I mean, how would I apply multiple MapReduce jobs to this? What would be the
better way to perform this?
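[Editor's sketch: the two phases described above, stripped of the Hadoop API so they run standalone. Class and method names here are illustrative, not Hadoop's; a real job would express the same logic in a Mapper and Reducer.]

```java
import java.util.*;

// A minimal, framework-free sketch of the inverted-index pipeline:
// map emits <word, doc>, reduce groups into <word, list(docs)>.
public class InvertedIndexSketch {

    // Map phase: for each (doc, text) pair, emit one <word, doc> entry per token.
    static List<Map.Entry<String, String>> map(String doc, String text) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, doc));
            }
        }
        return out;
    }

    // Reduce phase: group the map outputs by word into <word, list(docs)>.
    static Map<String, List<String>> reduce(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs) {
            index.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.addAll(map("doc1", "hadoop map reduce"));
        pairs.addAll(map("doc2", "hadoop index"));
        Map<String, List<String>> index = reduce(pairs);
        System.out.println(index.get("hadoop")); // prints [doc1, doc2]
    }
}
```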


In general you are better off with a smaller number of Map-Reduce jobs ... less i/o works better.


I forgot to add that you can use the APIs in JobClient and JobControl to chain jobs together ... http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control
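[Editor's sketch: wiring two jobs together with JobControl from the old org.apache.hadoop.mapred API, as the tutorial link above describes. This is untested framework wiring, not a complete program: `indexConf` and `scrubConf` are placeholders for JobConfs the caller has already configured with input/output paths and mapper/reducer classes.]

```java
import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainedIndexJobs {
    // indexConf builds <word, list(docs)>; scrubConf consumes that output
    // and removes the stop/scrub words. Both are assumed to be fully
    // configured by the caller (paths, mapper/reducer classes, etc.).
    public static void runChain(JobConf indexConf, JobConf scrubConf)
            throws IOException {
        Job indexJob = new Job(indexConf);
        Job scrubJob = new Job(scrubConf);
        scrubJob.addDependingJob(indexJob); // scrub runs only after index succeeds

        JobControl control = new JobControl("inverted-index-chain");
        control.addJob(indexJob);
        control.addJob(scrubJob);

        // JobControl is a Runnable that submits jobs as their dependencies complete.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            try { Thread.sleep(1000); } catch (InterruptedException ignored) {}
        }
        control.stop();
    }
}
```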

Arun

Use the DistributedCache if you can, and fix your first map to not emit the stop words at all. Use the combiner to crunch down the amount of intermediate map-output, etc.
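[Editor's sketch: the map-side filtering suggested above, as a standalone class. In a real job the stop-word file would be shipped to each node via DistributedCache and loaded into the set in the Mapper's configure() method; here the set is passed in directly so the logic runs on its own. The class name is illustrative.]

```java
import java.util.*;

// Drops stop words before the mapper emits anything, so they never
// reach the intermediate map-output at all.
public class StopWordFilter {
    private final Set<String> stopWords;

    // Stop words are expected in lower case (as a DistributedCache file
    // would typically provide them).
    public StopWordFilter(Collection<String> stopWords) {
        this.stopWords = new HashSet<>(stopWords);
    }

    // Return only the tokens the mapper should emit.
    public List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t.toLowerCase())) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        StopWordFilter f = new StopWordFilter(List.of("the", "a", "of"));
        System.out.println(f.filter(List.of("the", "inverted", "index"))); // prints [inverted, index]
    }
}
```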

Something useful to look at:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0

Arun

Thanks,

Regards,
-
Aayush Garg,
Phone: +41 76 482 240

