Naama Kraus wrote:
OK. Let me try an example:

Say my map maps a person's name to a child's name: <p, c>. If a person "Dan"
has more than one child, a bunch of <Dan, c>* pairs will be produced, right?
Now say I have two different information needs:
1. Get a list of all children names for each person.
2. Get the number of children of each person.

I could run two different MapReduce jobs, with the same map but different
reducers:
1. emits <p, lc>* pairs where p is the person, lc is a concatenation of his
children's names.
2. emits <p, n>* pairs where p is the person, n is the number of children.
No, you cannot have more than one type of reducer in one job. But yes, you can write more than one file as the result of the reduce phase, which is what I wanted to explain by pointing to ParseOutputFormat, which writes ParseText and ParseData to different MapFiles at the end of the reduce step. This is done by implementing OutputFormat + RecordWriter (given a resulting record from the reduce, write separate parts of it to different files).
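
Something like the following rough sketch, against the old
org.apache.hadoop.mapred API (PersonOutputFormat and the "-names"/"-counts"
file suffixes are made up for illustration; I assume the reduce value is a
Text holding the comma-separated children names):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

/** Writes each reduce record to two files, like Nutch's ParseOutputFormat. */
public class PersonOutputFormat implements OutputFormat<Text, Text> {

  public RecordWriter<Text, Text> getRecordWriter(
      FileSystem fs, JobConf job, String name, Progressable progress)
      throws IOException {
    Path dir = FileOutputFormat.getOutputPath(job);
    // one writer per target file
    final SequenceFile.Writer names = SequenceFile.createWriter(
        fs, job, new Path(dir, name + "-names"), Text.class, Text.class);
    final SequenceFile.Writer counts = SequenceFile.createWriter(
        fs, job, new Path(dir, name + "-counts"), Text.class, IntWritable.class);

    return new RecordWriter<Text, Text>() {
      public void write(Text person, Text childList) throws IOException {
        // write separate parts of the record to different files
        names.append(person, childList);
        String s = childList.toString();
        int n = s.length() == 0 ? 0 : s.split(",").length;
        counts.append(person, new IntWritable(n));
      }
      public void close(Reporter reporter) throws IOException {
        names.close();
        counts.close();
      }
    };
  }

  public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
    // no checks in this sketch
  }
}
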
Does that make any sense by now?

Now, my question is whether I can merge the two jobs into a single one
that emits both types of pairs, <p, lc>* and <p, n>*, probably into separate
files. This way I gain one pass over the input files instead of two
(or more, if I had more output types ...).
Actually, for this scenario you do not even need two different files with <p, lc>* and <p, n>*. You can just compute <p, <c1, c2, ...>>, which also contains the number of children (the value is a list, for example an ArrayWritable, containing the children's names).
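
For example (a rough sketch against the old mapred API; ChildrenReducer and
TextArrayWritable are made-up names, and I assume the map emits
<person, child> as <Text, Text>):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** Emits one <person, [c1, c2, ...]> pair; the list length is the count. */
public class ChildrenReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, ChildrenReducer.TextArrayWritable> {

  /** ArrayWritable needs a concrete subclass fixing the element type. */
  public static class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() { super(Text.class); }
  }

  public void reduce(Text person, Iterator<Text> children,
                     OutputCollector<Text, TextArrayWritable> output,
                     Reporter reporter) throws IOException {
    List<Text> names = new ArrayList<Text>();
    while (children.hasNext()) {
      names.add(new Text(children.next())); // copy, Hadoop reuses the object
    }
    TextArrayWritable value = new TextArrayWritable();
    value.set(names.toArray(new Text[names.size()]));
    output.collect(person, value); // names.size() is the number of children
  }
}
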

If not, that's also fine, I was just curious :-)

Naama



On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

Let me explain this more technically :)

An MR job takes <k1, v1> pairs. Each map(k1, v1) may result in
<k2, v2>* pairs. So at the end of the map stage, the output will be of
the form <k2, v2> pairs. The reduce takes <k2, v2*> pairs and emits <k3,
v3>* pairs, where k1, k2, k3, v1, v2, v3 are all types.
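
For example, with word count (a rough sketch against the old mapred API),
<k1, v1> = <LongWritable, Text> (byte offset, line of input),
<k2, v2> = <Text, IntWritable> (word, 1), and
<k3, v3> = <Text, IntWritable> (word, total count):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      for (String w : line.toString().split("\\s+")) {
        if (w.length() == 0) continue;
        word.set(w);
        out.collect(word, ONE); // one <k2, v2> pair per word occurrence
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) sum += counts.next().get();
      out.collect(word, new IntWritable(sum)); // one <k3, v3> pair per key
    }
  }
}
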

I cannot understand what you meant by

"if a MapReduce job could output multiple files, each holding different
<key, value> pairs"

The resulting segment directories after a crawl contain
subdirectories (like crawl_generate, content, etc.), but these are
generated one by one in several jobs running sequentially (and sometimes
by the same job; see ParseOutputFormat in Nutch). You can refer further
to the OutputFormat and RecordWriter interfaces for specific needs.

For each partition in the reduce phase a different output file will be
generated, but all the records in the files have the same type. However,
in some cases, using GenericWritable or ObjectWritable, you can wrap
different types of keys and values.
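
For example, with GenericWritable (a rough sketch; NameOrCountWritable is a
made-up name):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/** Wraps either a Text (child list) or an IntWritable (count) as a value. */
public class NameOrCountWritable extends GenericWritable {
  @SuppressWarnings("unchecked")
  private static final Class<? extends Writable>[] TYPES =
      new Class[] { Text.class, IntWritable.class };

  protected Class<? extends Writable>[] getTypes() {
    return TYPES;
  }
}

// usage in a reducer:
//   NameOrCountWritable v = new NameOrCountWritable();
//   v.set(new Text("c1,c2"));        // or v.set(new IntWritable(2))
//   output.collect(person, v);
// readers call v.get() and check the runtime type of the wrapped Writable.
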

Hope it helps,
Enis

Naama Kraus wrote:
Well, I was not actually thinking of using Nutch.
To be concrete, I was interested in whether a MapReduce job could output
multiple files, each holding different <key, value> pairs. I got the
impression this is done in Nutch from slide 15 of
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
but maybe I was misunderstanding.
Is it Nutch specific or achievable using the Hadoop API? Would multiple
different reducers do the trick?

Thanks for offering to help, I might have more concrete details of what I am
trying to implement later on; for now I am basically learning.

Naama

On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:


Hi,

Currently Nutch is a fairly complex application that *uses* Hadoop as a
base for distributed computing and storage. In this regard there is no
part in Nutch that "extends" Hadoop. The core of MapReduce indeed
does work with <key, value> pairs, and Nutch uses specific <key, value>
pairs such as <url, CrawlDatum>, etc.

So, long story short, it depends on what you want to build. If you are
working on something that is not related to Nutch, you do not need it.
You can give further info about your project if you want extended help.

best wishes.
Enis

Naama Kraus wrote:

Hi,

I've seen in


http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
(slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
these are part of the Hadoop API or inside Nutch only.

More specifically, I saw in


http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
(slide 15) that MapReduce outputs two files, each holding different
<key, value> pairs. I'd be curious to know if I can achieve that using the
standard API.

Thanks, Naama







