OK. Let me try an example:

Say my map maps a person's name to a child's name: <p, c>. If a person "Dan"
has more than one child, a bunch of <Dan, c>* pairs will be produced, right?
Now say I have two different information needs:
1. Get a list of all children names for each person.
2. Get the number of children of each person.

I could run two different MapReduce jobs, with the same map but different
reducers:
1. One emits <p, lc>* pairs, where p is the person and lc is a concatenation
of his children's names.
2. The other emits <p, n>* pairs, where p is the person and n is the number
of children.
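To make the two jobs concrete, here is a minimal plain-Python simulation of
them (the function names shuffle, reduce_concat and reduce_count are mine for
illustration, not Hadoop API; the grouping step stands in for the framework's
shuffle):

```python
from collections import defaultdict

# Map output: <person, child> pairs, as in the example above.
pairs = [("Dan", "Alice"), ("Dan", "Bob"), ("Eve", "Carol")]

def shuffle(pairs):
    """Group map output by key, as the MapReduce framework does
    between the map and reduce stages."""
    groups = defaultdict(list)
    for person, child in pairs:
        groups[person].append(child)
    return groups

def reduce_concat(person, children):
    """Job 1's reducer: emit <p, lc>, lc concatenating the children."""
    return person, ",".join(children)

def reduce_count(person, children):
    """Job 2's reducer: emit <p, n>, n being the number of children."""
    return person, len(children)

groups = shuffle(pairs)
job1 = dict(reduce_concat(p, cs) for p, cs in groups.items())
job2 = dict(reduce_count(p, cs) for p, cs in groups.items())
print(job1)  # {'Dan': 'Alice,Bob', 'Eve': 'Carol'}
print(job2)  # {'Dan': 2, 'Eve': 1}
```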

Does that make sense so far?

Now, my question is whether I can merge the two jobs into a single one that
emits both types of pairs, <p, lc>* and <p, n>*, probably into separate
files. This way I make one pass over the input files instead of two
(or more, if I had more output types ...).
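The single-pass idea can be sketched the same way: one combined reducer emits
both record types, tagging each record with the output "file" it belongs to.
(In Hadoop terms this would correspond to something like a custom OutputFormat
or the MultipleOutputs helper; the Python below, with my invented reduce_both
name, only simulates the semantics.)

```python
from collections import defaultdict

pairs = [("Dan", "Alice"), ("Dan", "Bob"), ("Eve", "Carol")]

def reduce_both(person, children):
    """One reducer, two kinds of output records, each tagged with the
    destination file so the framework could split them apart."""
    yield "children_list", (person, ",".join(children))
    yield "children_count", (person, len(children))

# Group by key (the shuffle), then run the combined reducer once per key.
groups = defaultdict(list)
for person, child in pairs:
    groups[person].append(child)

outputs = defaultdict(dict)  # one dict per simulated output file
for person, children in groups.items():
    for filename, (key, value) in reduce_both(person, children):
        outputs[filename][key] = value

print(dict(outputs["children_list"]))   # {'Dan': 'Alice,Bob', 'Eve': 'Carol'}
print(dict(outputs["children_count"]))  # {'Dan': 2, 'Eve': 1}
```

The input is read and grouped only once; the cost is that the two record
types now share one reduce pass instead of having a job each.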

If not, that's also fine, I was just curious :-)

Naama



On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

> Let me explain this more technically :)
>
> An MR job takes <k1, v1> pairs. Each map(k1, v1) call may produce
> <k2, v2>* pairs. So at the end of the map stage, the output will be of
> the form <k2, v2> pairs. The reduce takes <k2, v2*> pairs and emits
> <k3, v3>* pairs, where k1, k2, k3, v1, v2, v3 are all types.
>
> I could not understand what you meant by:
>
> "if a MapReduce job could output multiple files each holding different
> <key,value> pairs"
>
> The resulting segment directories after a crawl contain
> subdirectories (like crawl_generate, content, etc.), but these are
> generated one by one by several jobs running sequentially (and sometimes
> by the same job; see ParseOutputFormat in Nutch). You can refer further
> to the OutputFormat and RecordWriter interfaces for specific needs.
>
> For each split in the reduce phase a different output file will be
> generated, but all the records in the files have the same type. However,
> in some cases, using GenericWritable or ObjectWritable, you can wrap
> different types of keys and values.
>
> Hope it helps,
> Enis
>
> Naama Kraus wrote:
> > Well, I was not actually thinking to use Nutch.
> > To be concrete, I was interested in whether a MapReduce job could output
> > multiple files, each holding different <key,value> pairs. I got the
> > impression this is done in Nutch from slide 15 of
> > http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> > but maybe I was misunderstanding.
> > Is it Nutch specific, or achievable using the Hadoop API? Would multiple
> > different reducers do the trick?
> >
> > Thanks for offering to help. I might have more concrete details of what
> > I am trying to implement later on; for now I am basically learning.
> >
> > Naama
> >
> > On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <[EMAIL PROTECTED]>
> > wrote:
> >
> >
> >> Hi,
> >>
> >> Currently nutch is a fairly complex application that *uses* hadoop as a
> >> base for distributed computing and storage. In this regard there is no
> >> part in nutch that "extends" hadoop. The core of the mapreduce indeed
> >> does work with <key,value> pairs, and nutch uses specific <key,value>
> >> pairs such as <url, CrawlDatum>, etc.
> >>
> >> So long story short, it depends on what you want to build. If you are
> >> working on something that is not related to nutch, you do not need it.
> >> You can give further info about your project if you want extended help.
> >>
> >> best wishes.
> >> Enis
> >>
> >> Naama Kraus wrote:
> >>
> >>> Hi,
> >>>
> >>> I've seen in
> >>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
> >>> (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> >>> these are part of the Hadoop API or inside Nutch only.
> >>>
> >>> More specifically, I saw in
> >>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> >>> (slide 15) that MapReduce outputs two files, each holding different
> >>> <key,value> pairs. I'd be curious to know if I can achieve that using the
> >>> standard API.
> >>
> >>> Thanks, Naama
> >>>
> >>>
> >>>
> >
> >
> >
> >
>



-- 
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)