Thank you Lewis!  I'm still not sure I understand why only 2 reducers were
created for the fetch phase given that there were over 7 million URLs to
process.  Currently the partitioning scheme is set to byHost, and I've
lowered the generate.max.count to 250.  If I understand correctly, that
will limit the fetcher to only 250 URLs per host, making it more polite?
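For reference, this is roughly what I have in nutch-site.xml for those two
settings (a sketch from memory; property names as I understand them):

  <property>
    <name>partition.url.mode</name>
    <value>byHost</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>250</value>
  </property>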

Are you suggesting creating many job IDs (batches) with the generate phase and
running many fetch MapReduce jobs in parallel?
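In other words, something roughly like this, repeated with smaller batches (a
sketch only; I'm assuming the standard 2.x command-line tools and that the
batch id printed by generate gets passed to the later phases):

  bin/nutch generate -topN 50000
  bin/nutch fetch <batchId>
  bin/nutch parse <batchId>
  bin/nutch updatedb <batchId>
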
Thanks again for your response on this!

-Joe

On Tue, May 3, 2016 at 4:32 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Joseph,
>
> On Tue, May 3, 2016 at 7:53 AM, <[email protected]> wrote:
>
> >
> > From: Joseph Obernberger <[email protected]>
> > To: [email protected]
> > Cc:
> > Date: Tue, 3 May 2016 09:04:09 -0400
> > Subject: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers
> > Hello - I'm working with nutch 2.3.1 with HBase for the webpage table.  I
> > have all the phases (inject, generate, fetch, parse, and updatedb)
> > working
> > fine.  Nutch is a crawling beast!
> >
>
> Glad to hear.
>
>
> >
> > On our cluster, the generate phase uses around 60 mappers and 128
> > reducers,
> > but the fetch phase always uses just 2 reducers.  In a recent test, the
> > fetch phase used 60 mappers and 2 reducers.
> >
>
> In Nutch 2.X you will have noticed that the actual 'Fetching' is executed
> within the FetcherReducer [0]. More specifically, it is achieved within the
> FetcherReducer.FetcherThread [1], which picks items from FetchItemQueues and
> fetches the pages.
> The crux of the issue here is politeness. It has to do with the URL
> partitioning scheme [2] you use, which partitions URLs by host, domain name,
> or IP depending on the value of the parameter 'partition.url.mode', which
> can be 'byHost', 'byDomain' or 'byIP'.
> The issue was described a few weeks ago by Karanjeet and Sebastian:
> http://www.mail-archive.com/user%40nutch.apache.org/msg14496.html
>
>
> [0]
>
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
> [1]
>
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java#L430
> [2]
>
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/URLPartitioner.java
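>
> Conceptually (a rough sketch only, not the actual Nutch source) the
> partitioner does something like the following, so that all URLs for a given
> host end up in the same reducer:
>
>   // sketch: key on the host (or domain/IP, per 'partition.url.mode') and
>   // map it onto the available reduce tasks; hostOf() is a hypothetical helper
>   int getPartition(String url, int numReduceTasks) {
>     String key = hostOf(url);
>     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>   }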
>
>
> Please note that there are quite significant differences between the
> following counters:
>
>
>
> >                 Map input records=22514605
> >                 Map output records=21459377
> >
> >
>
> Above Generator Map-phase delta of 1,055,228, and
>
>
> >                 Reduce input records=21459377
> >                 Reduce output records=7506045
> >
> >
>
> Above Fetch Map-phase delta of 13,953,332
>
>
> >                 Reduce input records=7503906
> >                 Reduce output records=609920
> >
> >
>
> Above Fetch Reducer-phase delta of 6,893,986
>
>
> >         FetcherStatus
> >                 ACCESS_DENIED=131
> >                 EXCEPTION=36676
> >                 GONE=295
> >                 HitByTimeLimit-QueueFeeder=6883654
> >                 HitByTimeLimit-Queues=10291
> >                 MOVED=37141
> >                 NOTFOUND=10490
> >                 NOTMODIFIED=732
> >                 SUCCESS=485083
> >                 TEMP_MOVED=14589
> >
> >
>
> Very interesting FetcherStatus stats. HitByTimeLimit-QueueFeeder=6883654 is
> of particular interest.
> If I were you I would create many more, smaller batches of URLs to fetch, as
> opposed to these large batches which are simply... not being fetched. You
> only fetched around 485K URLs going by the above stats.
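>
> The HitByTimeLimit counters come from the fetcher time limit (the
> 'fetcher.timelimit.mins' property, if I remember the name correctly); once
> that expires, the QueueFeeder drops whatever is still queued. A sketch of
> the setting, should you want to raise it alongside the smaller batches:
>
>   <property>
>     <name>fetcher.timelimit.mins</name>
>     <value>180</value>
>   </property>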
>
>
> >
> >
> > Any idea on what I need to adjust to use more nodes for the reduce phase?
>
>
> Hopefully the above has given you a decent amount to consider. Please let
> us know if you have any more questions.
> Thanks
> Lewis
>
