OK. Let me try an example: say my map maps a person's name to a child's name, <p, c>. If a person "Dan" has more than one child, a bunch of <Dan, c>* pairs will be produced, right? Now say I have two different information needs: 1. Get a list of all children's names for each person. 2. Get the number of children of each person.
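That flow can be sketched in plain Python as an in-process simulation of map, shuffle, and the two candidate reducers (the data, names, and helper functions below are made up for illustration; this is not the Hadoop API):

```python
# In-process sketch of the map/shuffle/reduce flow described above.
# Illustrative only -- plain Python, not Hadoop.
from collections import defaultdict

records = [("Dan", "Avi"), ("Dan", "Rina"), ("Yael", "Noa")]

def map_person_child(person, child):
    # map(k1, v1) -> <k2, v2>* : emit one <person, child> pair
    yield (person, child)

# shuffle: group values by key, as the framework would
grouped = defaultdict(list)
for p, c in records:
    for k2, v2 in map_person_child(p, c):
        grouped[k2].append(v2)

def reduce_child_list(person, children):
    # reducer 1: <p, lc>, where lc concatenates the children's names
    return (person, ",".join(children))

def reduce_child_count(person, children):
    # reducer 2: <p, n>, where n is the number of children
    return (person, len(children))

# two separate "jobs" sharing the same map output
job1 = dict(reduce_child_list(p, cs) for p, cs in grouped.items())
job2 = dict(reduce_child_count(p, cs) for p, cs in grouped.items())
print(job1)  # {'Dan': 'Avi,Rina', 'Yael': 'Noa'}
print(job2)  # {'Dan': 2, 'Yael': 1}
```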
I could run two different MapReduce jobs with the same map but different reducers: 1. emits <p, lc>* pairs, where p is the person and lc is a concatenation of his children's names. 2. emits <p, n>* pairs, where p is the person and n is the number of children. Does that make sense by now? Now, my question is whether I can save one of the two jobs and have only a single one which emits both types of pairs - <p, lc>* and <p, n>* - in separate files, probably. This way I gain one pass over the input files instead of two (or more, if I had more output types ...). If not, that's also fine, I was just curious :-)

Naama

On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <[EMAIL PROTECTED]> wrote:
> Let me explain this more technically :)
>
> An MR job takes <k1, v1> pairs. Each map(k1, v1) may result in <k2, v2>*
> pairs. So at the end of the map stage, the output will be of the form
> <k2, v2> pairs. The reduce takes <k2, v2*> pairs and emits <k3, v3>*
> pairs, where k1, k2, k3, v1, v2, v3 are all types.
>
> I cannot understand what you meant by
>
> "if a MapReduce job could output multiple files each holding different
> <key,value> pairs"
>
> The resulting segment directories after a crawl contain subdirectories
> (like crawl_generate, content, etc.), but these are generated one by one
> in several jobs running sequentially (and sometimes by the same job; see
> ParseOutputFormat in Nutch). You can refer further to the OutputFormat
> and RecordWriter interfaces for specific needs.
>
> For each split in the reduce phase a different output file will be
> generated, but all the records in the files have the same type. However,
> in some cases, using GenericWritable or ObjectWritable, you can wrap
> different types of keys and values.
>
> Hope it helps,
> Enis
>
> Naama Kraus wrote:
> > Well, I was not actually thinking to use Nutch.
> > To be concrete, I was interested in whether a MapReduce job could
> > output multiple files, each holding different <key,value> pairs.
> > I got the impression this is done in Nutch from slide 15 of
> > http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> > but maybe I was misunderstanding.
> > Is it Nutch-specific, or achievable using the Hadoop API? Would
> > multiple different reducers do the trick?
> >
> > Thanks for offering to help. I might have more concrete details of
> > what I am trying to implement later on; for now I am basically
> > learning.
> >
> > Naama
> >
> > On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <[EMAIL PROTECTED]>
> > wrote:
> >
> >> Hi,
> >>
> >> Currently Nutch is a fairly complex application that *uses* Hadoop
> >> as a base for distributed computing and storage. In this regard
> >> there is no part of Nutch that "extends" Hadoop. The core of
> >> MapReduce indeed works with <key,value> pairs, and Nutch uses
> >> specific <key,value> pairs such as <url, CrawlDatum>, etc.
> >>
> >> So, long story short, it depends on what you want to build. If you
> >> are working on something that is not related to Nutch, you do not
> >> need it. You can give further info about your project if you want
> >> extended help.
> >>
> >> Best wishes,
> >> Enis
> >>
> >> Naama Kraus wrote:
> >>> Hi,
> >>>
> >>> I've seen in
> >>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
> >>> (slide 12) that Nutch has extensions to MapReduce. I wanted to ask
> >>> whether these are part of the Hadoop API or inside Nutch only.
> >>>
> >>> More specifically, I saw in
> >>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> >>> (slide 15) that MapReduce outputs two files, each holding different
> >>> <key,value> pairs. I'd be curious to know if I can achieve that
> >>> using the standard API.
> >>>
> >>> Thanks, Naama

--
"If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales." (Albert Einstein)
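The single-pass idea asked about in the thread can be sketched as one reduce call writing to two named output sinks. In real Hadoop, helper classes along the lines of MultipleOutputs / MultipleOutputFormat were later added for exactly this kind of multi-file output; the Python below is only an in-process illustration with made-up names, not the Hadoop API:

```python
# One pass over the input, one reduce, two named "output files".
# In-process illustration of the single-job idea; not the Hadoop API.
from collections import defaultdict

records = [("Dan", "Avi"), ("Dan", "Rina"), ("Yael", "Noa")]

# map + shuffle: group children by person
grouped = defaultdict(list)
for person, child in records:
    grouped[person].append(child)

# two named sinks standing in for two separate output files
outputs = {"child_lists": {}, "child_counts": {}}

def reduce_both(person, children, out):
    # a single reducer emitting both pair types, each to its own sink:
    # <p, lc> to "child_lists" and <p, n> to "child_counts"
    out["child_lists"][person] = ",".join(children)
    out["child_counts"][person] = len(children)

for p, cs in grouped.items():
    reduce_both(p, cs, outputs)

print(outputs["child_lists"])   # {'Dan': 'Avi,Rina', 'Yael': 'Noa'}
print(outputs["child_counts"])  # {'Dan': 2, 'Yael': 1}
```

The input is read once, and both information needs are answered, which is the one-pass saving described above.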