Let me explain this more technically :)

An MR job takes <k1, v1> pairs. Each map(k1, v1) call may emit <k2, v2>* pairs, so at the end of the map stage the output is a set of <k2, v2> pairs. The reduce takes <k2, v2*> pairs and emits <k3, v3>* pairs, where k1, k2, k3, v1, v2, v3 are all types.
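As a toy illustration (plain Python, not the Hadoop API — the function names here are made up), the type flow above can be simulated like this:

```python
from itertools import groupby
from operator import itemgetter

# Toy simulation of the MapReduce type flow:
#   map:    (k1, v1) -> [(k2, v2), ...]
#   reduce: (k2, [v2, ...]) -> [(k3, v3), ...]

def word_count_map(k1, v1):
    # k1: line number, v1: line of text -> emits (word, 1) pairs
    return [(word, 1) for word in v1.split()]

def word_count_reduce(k2, v2_list):
    # k2: word, v2_list: all counts for that word -> (word, total)
    return [(k2, sum(v2_list))]

def run_job(records, map_fn, reduce_fn):
    # Map stage: each input pair may emit zero or more <k2, v2> pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/sort stage: group all intermediate values by key.
    intermediate.sort(key=itemgetter(0))
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k2, [v for _, v in group]))
    return output

result = run_job([(0, "a b a"), (1, "b c")], word_count_map, word_count_reduce)
# result: [('a', 2), ('b', 2), ('c', 1)]
```

The real framework does the same grouping, only distributed and spilled to disk.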

I am not sure I understand what you meant by

"if a MapReduce job could output multiple files each holds different <key,value>
pairs"

The resulting segment directories after a crawl contain subdirectories (like crawl_generate, content, etc.), but these are generated one by one by several jobs running sequentially (and sometimes by the same job, see ParseOutputFormat in Nutch). For specific needs you can look further at the OutputFormat and RecordWriter interfaces.
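To make the idea concrete, here is a rough plain-Python sketch (class and file names are hypothetical, not Hadoop or Nutch code) of what a custom RecordWriter-style component does when one job writes several outputs into their own subdirectories, the way ParseOutputFormat does:

```python
import os
import tempfile

class MultiDirWriter:
    """Hypothetical sketch: route <key, value> pairs to one file per
    logical output, each in its own subdirectory."""

    def __init__(self, base_dir, subdirs):
        self.files = {}
        for name in subdirs:
            path = os.path.join(base_dir, name)
            os.makedirs(path, exist_ok=True)
            # One part file per subdirectory, as a reduce task would write.
            self.files[name] = open(os.path.join(path, "part-00000"), "w")

    def write(self, subdir, key, value):
        # Each record carries which logical output it belongs to.
        self.files[subdir].write(f"{key}\t{value}\n")

    def close(self):
        for f in self.files.values():
            f.close()

base = tempfile.mkdtemp()
writer = MultiDirWriter(base, ["crawl_generate", "content"])
writer.write("crawl_generate", "http://example.com/", "CrawlDatum(...)")
writer.write("content", "http://example.com/", "raw page bytes")
writer.close()
```

In real Hadoop you would implement this routing inside a custom OutputFormat/RecordWriter pair rather than a standalone class.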

For each partition in the reduce phase a different output file will be generated, but all the records in a given file have the same type. However, in some cases, using GenericWritable or ObjectWritable, you can wrap different types of keys and values.
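A rough plain-Python analogue of the wrapping idea (this is not the GenericWritable API, just the concept): serialize a type tag next to each value so one stream can carry several value types and the reader can reconstruct the right one.

```python
# Fixed list of allowed wrapped types, analogous to the type array a
# GenericWritable subclass declares; the tag is the index into this list.
TYPES = [int, str]

def wrap(value):
    # Store (type tag, serialized value) so mixed types share one stream.
    return (TYPES.index(type(value)), str(value))

def unwrap(record):
    # Use the tag to pick the right type and deserialize the value.
    tag, data = record
    return TYPES[tag](data)

mixed = [wrap(42), wrap("hello")]
values = [unwrap(r) for r in mixed]
# values == [42, 'hello'], with the original types restored
```

GenericWritable works the same way, writing a byte tag before the wrapped Writable.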

Hope it helps,
Enis

Naama Kraus wrote:
Well, I was not actually planning to use Nutch.
To be concrete, I was wondering whether a MapReduce job could output multiple
files, each holding different <key,value> pairs. I got the impression this is
done in Nutch from slide 15 of
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
but maybe I misunderstood.
Is it Nutch-specific, or achievable using the Hadoop API? Would multiple
different reducers do the trick?

Thanks for offering to help. I might have more concrete details of what I am
trying to implement later on; for now I am basically learning.

Naama

On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

Hi,

Currently Nutch is a fairly complex application that *uses* Hadoop as a
base for distributed computing and storage. In this regard there is no
part of Nutch that "extends" Hadoop. The core of MapReduce does indeed
work with <key,value> pairs, and Nutch uses specific <key,value>
pairs such as <url, CrawlDatum>, etc.

So, long story short, it depends on what you want to build. If you are
working on something that is not related to Nutch, you do not need it.
You can give further info about your project if you want more specific help.

Best wishes,
Enis

Naama Kraus wrote:
Hi,

I've seen in

http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
(slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
these are part of the Hadoop API or inside Nutch only.

More specifically, I saw in

http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
(slide 15) that MapReduce outputs two files, each holding different <key,value>
pairs. I'd be curious to know if I can achieve that using the standard
API.
Thanks, Naama




