Re: Nutch Extensions to MapReduce

2008-03-08 Thread Naama Kraus
I found the relevant details in the wiki MapReduce tutorial, in particular in
the section "Task Side-Effect Files".

Thanks all for the various inputs, Naama



-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)


Re: Nutch Extensions to MapReduce

2008-03-08 Thread Ted Dunning

Yes.

Look on the wiki or in the discussion archives for details of how to get to
the output directory name.




Re: Nutch Extensions to MapReduce

2008-03-08 Thread Naama Kraus
So the configure() method is called when the reduce task starts, before the
actual reduce takes place? Is that so?
Same for map?

Thanks, Naama





Re: Nutch Extensions to MapReduce

2008-03-06 Thread Ted Dunning


This is not difficult to do.  Simply open an extra file in the reducer's
configure() method and close it in the close() method.  Make sure you make it
relative to the MapReduce output directory so that you can take advantage
of all of the machinery that handles lost jobs and such.

Search the mailing list archives for more details.
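The pattern Ted describes (a "task side-effect file") can be sketched as a
small, self-contained simulation. This is not the Hadoop API itself:
`configure`, `reduce`, and `close` below are plain-Python stand-ins for the
old `Reducer` lifecycle hooks, and `work_dir` stands in for the task output
directory that the framework would promote on success.

```python
import os
import tempfile

class SideEffectReducer:
    """Plain-Python sketch of a reducer that writes an extra
    file alongside its regular output (a 'side-effect file')."""

    def configure(self, work_dir):
        # In Hadoop this path would be relative to the task's output
        # directory, so the framework's job-recovery machinery can
        # discard the file if the task is lost and re-run.
        self.side = open(os.path.join(work_dir, "side-effect.txt"), "w")

    def reduce(self, key, values, output):
        # Regular reduce output: one record per key.
        output.append((key, sum(values)))
        # Extra, side-effect output.
        self.side.write("%s had %d values\n" % (key, len(values)))

    def close(self):
        self.side.close()

work_dir = tempfile.mkdtemp()
out = []
r = SideEffectReducer()
r.configure(work_dir)          # open the extra file once, up front
r.reduce("a", [1, 2, 3], out)
r.reduce("b", [4], out)
r.close()                      # close it once all keys are reduced
```

The point of anchoring the file under the task's own output directory is that
a failed or speculatively re-run task leaves no stray output behind.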





Re: Nutch Extensions to MapReduce

2008-03-06 Thread Naama Kraus
OK, so here is what I've learned:

One: there is only one reducer type per job.
Two: it sounds like ParseOutputFormat is the reference I was looking for; I'll
go have a look.

And yes, I admit my example was a naive one; it was for demonstration
purposes only.

Thanks a lot for the input,
Naama


Re: Nutch Extensions to MapReduce

2008-03-06 Thread Enis Soztutar

Naama Kraus wrote:

> OK. Let me try an example:
>
> Say my map maps a person's name to his child's name: <p, c>. If a person
> "Dan" has more than one child, a bunch of <Dan, c>* pairs will be produced,
> right?
> Now say I have two different information needs:
> 1. Get a list of all children's names for each person.
> 2. Get the number of children of each person.
>
> I could run two different MapReduce jobs, with the same map but different
> reducers:
> 1. emits <p, lc>* pairs where p is the person, lc is a concatenation of his
> children's names.
> 2. emits <p, n>* pairs where p is the person, n is the number of children.
No, you cannot have more than one type of reducer in one job. But yes, you
can write more than one file as the result of the reduce phase, which is what
I wanted to explain by pointing to ParseOutputFormat, which writes ParseText
and ParseData to different MapFiles at the end of the reduce step.  So this is
done by implementing OutputFormat + RecordWriter (given a resulting record
from the reduce, write separate parts of it to different files).
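That OutputFormat + RecordWriter idea (split each record the reduce emits and
write its parts to different files) can be sketched as follows. This is a
loose plain-Python analogy, not the Hadoop interfaces; the two-part
(text, data) record and the file names mimic what ParseOutputFormat does with
ParseText and ParseData, but are otherwise illustrative assumptions.

```python
import os
import tempfile

class SplittingRecordWriter:
    """Analogue of a RecordWriter that fans each reduce record
    out into several files, one per part of the composite value."""

    def __init__(self, out_dir):
        # One output file per part of the value, like Nutch writing
        # ParseText and ParseData to different MapFiles.
        self.text_out = open(os.path.join(out_dir, "parse_text"), "w")
        self.data_out = open(os.path.join(out_dir, "parse_data"), "w")

    def write(self, key, value):
        text, data = value          # split the composite record
        self.text_out.write("%s\t%s\n" % (key, text))
        self.data_out.write("%s\t%s\n" % (key, data))

    def close(self):
        self.text_out.close()
        self.data_out.close()

out_dir = tempfile.mkdtemp()
writer = SplittingRecordWriter(out_dir)
writer.write("http://example.com/", ("page text", "fetch metadata"))
writer.close()
```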

> Does that make any sense by now?
>
> Now, my question is whether I can save the two jobs and have a single one
> only which emits both types of pairs - <p, lc>* and <p, n>*. In separate
> files probably. This way I gain one pass over the input files instead of two
> (or more, if I had more output types ...).
Actually, for this scenario you do not even need two different files with
<p, lc> and <p, n>.  You can just compute <p, lc>, which also contains the
number of the children (the value is a List, for example an ArrayWritable,
containing the children's names, so the count is just its length).
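That suggestion can be sketched with a tiny simulated job: one reduce emits
<p, lc> where lc is the list of children, and the count n falls out as the
list's length, so a single pass serves both information needs. (Pure Python,
no Hadoop; all names here are illustrative.)

```python
from collections import defaultdict

# map: one <person, child> pair per input record
def map_fn(person, child):
    yield person, child

# reduce: emit <p, lc>, the full list of children for each person
def reduce_fn(person, children):
    return person, list(children)

records = [("Dan", "Amit"), ("Dan", "Noa"), ("Rina", "Gil")]

# Simulated shuffle: group the map output by key.
grouped = defaultdict(list)
for person, child in records:
    for k, v in map_fn(person, child):
        grouped[k].append(v)

results = dict(reduce_fn(p, cs) for p, cs in grouped.items())

children_of_dan = results["Dan"]           # need 1: the list of children
number_of_children = len(results["Dan"])   # need 2: the count, for free
```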



> If not, that's also fine, I was just curious :-)
>
> Naama





Re: Nutch Extensions to MapReduce

2008-03-06 Thread Naama Kraus
OK. Let me try an example:

Say my map maps a person's name to his child's name: <p, c>. If a person
"Dan" has more than one child, a bunch of <Dan, c>* pairs will be produced,
right?
Now say I have two different information needs:
1. Get a list of all children's names for each person.
2. Get the number of children of each person.

I could run two different MapReduce jobs, with the same map but different
reducers:
1. emits <p, lc>* pairs where p is the person, lc is a concatenation of his
children's names.
2. emits <p, n>* pairs where p is the person, n is the number of children.

Does that make any sense by now?

Now, my question is whether I can save the two jobs and have a single one
only which emits both types of pairs - <p, lc>* and <p, n>*. In separate
files probably. This way I gain one pass over the input files instead of two
(or more, if I had more output types ...).

If not, that's also fine, I was just curious :-)

Naama








Re: Nutch Extensions to MapReduce

2008-03-06 Thread Enis Soztutar

Let me explain this more technically :)

An MR job takes <k1, v1> pairs. Each map(k1, v1) may emit <k2, v2>* pairs. So
at the end of the map stage, the output will be of the form <k2, v2>* pairs.
The reduce takes <k2, list<v2>> pairs and emits <k3, v3>* pairs, where k1, k2,
k3, v1, v2, v3 are all types.
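Those type signatures can be made concrete with a toy in-memory run (pure
Python, no Hadoop; word counting is just an arbitrary choice of k1..v3):

```python
from collections import defaultdict

# map: (k1, v1) -> <k2, v2>*  (here k1 is a line offset, v1 a line of text)
def map_fn(k1, v1):
    for word in v1.split():
        yield word, 1

# reduce: (k2, list<v2>) -> <k3, v3>*
def reduce_fn(k2, values):
    yield k2, sum(values)

inputs = [(0, "a b a"), (1, "b c")]

# End of the map stage: a bag of <k2, v2>* pairs, grouped
# by k2 for the reduce (the shuffle).
shuffled = defaultdict(list)
for k1, v1 in inputs:
    for k2, v2 in map_fn(k1, v1):
        shuffled[k2].append(v2)

# Reduce stage: <k2, list<v2>> in, <k3, v3>* out.
output = {k3: v3
          for k2, vals in shuffled.items()
          for k3, v3 in reduce_fn(k2, vals)}
```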


I cannot understand what you meant by

"if a MapReduce job could output multiple files each holding different
<key, value> pairs".

The resulting segment directories after a crawl contain subdirectories (like
crawl_generate, content, etc.), but these are generated one-by-one in several
jobs running sequentially (and sometimes by the same job, see
ParseOutputFormat in nutch). You can refer further to the OutputFormat and
RecordWriter interfaces for specific needs.


For each split in the reduce phase a different output file will be generated,
but all the records in the files have the same type. However, in some cases,
using GenericWritable or ObjectWritable, you can wrap different types of keys
and values.


Hope it helps,
Enis



Re: Nutch Extensions to MapReduce

2008-03-06 Thread Naama Kraus
Well, I was not actually thinking to use Nutch.
To be concrete, I was interested in whether a MapReduce job could output
multiple files, each holding different <key, value> pairs. I got the
impression this is done in Nutch from slide 15 of
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
but maybe I was misunderstanding.
Is it Nutch specific or achievable using the Hadoop API? Would multiple
different reducers do the trick?

Thanks for offering to help, I might have more concrete details of what I am
trying to implement later on, now I am basically learning.

Naama






Re: Nutch Extensions to MapReduce

2008-03-06 Thread Enis Soztutar

Hi,

Currently nutch is a fairly complex application that *uses* hadoop as a base
for distributed computing and storage. In this regard there is no part in
nutch that "extends" hadoop. The core of the mapreduce indeed does work with
<key, value> pairs, and nutch uses specific <key, value> pairs such as
<url, CrawlDatum>, etc.


So, long story short, it depends on what you want to build. If you are
working on something that is not related to nutch, you do not need it.
You can give further info about your project if you want extended help.


best wishes.
Enis



Nutch Extensions to MapReduce

2008-03-06 Thread Naama Kraus
Hi,

I've seen in
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
(slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
these are part of the Hadoop API or inside Nutch only.

More specifically, I saw in
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
(slide 15) that MapReduce outputs two files, each holding different
<key, value> pairs. I'd be curious to know if I can achieve that using the
standard API.

Thanks, Naama
