Thank you Nguyen.  I tried modifying the YARN settings to no avail.  Other
jobs, like the generate phase, create 128 reducers without a problem.  It's
only the fetch phase that always makes 2 reducers.  Running multiple jobs
does work, and does increase throughput, but I'm still curious as to why I'm
only getting 2 reducers in the fetch phase.
Thanks!

-Joe

On Tue, May 10, 2016 at 3:57 AM, Nguyen Manh Tien <[email protected]> wrote:

> Hi Joseph,
>
> I guess you are using Hadoop 2.x.
> The default value of mapreduce.tasktracker.reduce.tasks.maximum is 2, so it
> only creates at most 2 reducers.
> You should change this parameter in Hadoop's mapred-site.xml.
> The number of reducers created also depends on other YARN configs
> (mapreduce.reduce.memory.mb, yarn.nodemanager.resource.memory-mb,
> mapreduce.reduce.cpu.vcores, yarn.nodemanager.resource.cpu-vcores).
>
> http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
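>
> For example, a minimal sketch of what those overrides might look like
> (placeholder values only; tune them for your cluster, and note that the
> yarn.nodemanager.* properties live in yarn-site.xml rather than
> mapred-site.xml):
>
>   <!-- mapred-site.xml -->
>   <property>
>     <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
>     <value>8</value>
>   </property>
>   <property>
>     <name>mapreduce.reduce.memory.mb</name>
>     <value>2048</value>
>   </property>
>   <property>
>     <name>mapreduce.reduce.cpu.vcores</name>
>     <value>1</value>
>   </property>
>
>   <!-- yarn-site.xml -->
>   <property>
>     <name>yarn.nodemanager.resource.memory-mb</name>
>     <value>16384</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.resource.cpu-vcores</name>
>     <value>8</value>
>   </property>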
>
> Tien
>
> On Wed, May 4, 2016 at 4:00 AM, Joseph Obernberger <[email protected]> wrote:
>
> > Thank you Lewis!  I'm still not sure I understand why only 2 reducers were
> > created for the fetch phase given that there were over 7 million URLs to
> > process.  Currently the partitioning scheme is set to byHost, and I've
> > lowered the generate.max.count to 250.  If I understand correctly, that
> > will limit the fetcher to only 250 URLs per host, making it more polite?
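> >
> > For reference, the relevant bits of my nutch-site.xml currently look
> > roughly like this:
> >
> >   <property>
> >     <name>partition.url.mode</name>
> >     <value>byHost</value>
> >   </property>
> >   <property>
> >     <name>generate.max.count</name>
> >     <value>250</value>
> >   </property>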
> >
> > Are you suggesting making many jobIDs with the generate phase and running
> > many fetch map-reduce jobs in parallel?
> > Thanks again for your response on this!
> >
> > -Joe
> >
> > On Tue, May 3, 2016 at 4:32 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> > > Hi Joseph,
> > >
> > > On Tue, May 3, 2016 at 7:53 AM, <[email protected]> wrote:
> > >
> > > >
> > > > From: Joseph Obernberger <[email protected]>
> > > > To: [email protected]
> > > > Cc:
> > > > Date: Tue, 3 May 2016 09:04:09 -0400
> > > > Subject: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers
> > > > Hello - I'm working with Nutch 2.3.1 with HBase for the webpage table.
> > > > I have all the phases (inject, generate, fetch, parse, and updatedb)
> > > > working fine.  Nutch is a crawling beast!
> > > >
> > >
> > > Glad to hear.
> > >
> > >
> > > >
> > > > On our cluster, the generate phase uses around 60 mappers and 128
> > > > reducers, but the fetch phase always uses just 2 reducers.  In a recent
> > > > test, the fetch phase used 60 mappers and 2 reducers.
> > > >
> > >
> > > In Nutch 2.X you will have noticed that the actual 'Fetching' is executed
> > > within the FetcherReducer [0]. More specifically, it happens inside
> > > FetcherReducer.FetcherThread [1], which picks items from FetchItemQueues
> > > and fetches the pages.
> > > The crux of this issue is politeness. It comes down to the URL
> > > partitioning scheme [2] you use, which partitions URLs by host, domain
> > > name or IP depending on the value of the parameter 'partition.url.mode'
> > > ('byHost', 'byDomain' or 'byIP').
> > > The issue was described a few weeks ago by Karanjeet and Sebastian:
> > > http://www.mail-archive.com/user%40nutch.apache.org/msg14496.html
> > >
> > >
> > > [0] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
> > > [1] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java#L430
> > > [2] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/URLPartitioner.java
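> > >
> > > As a side note, and only as an illustration (I'm assuming the standard
> > > fetcher queue properties from nutch-default.xml here, so please verify
> > > the names in your copy), the per-host politeness inside the
> > > FetcherReducer is typically tuned with entries along these lines in
> > > nutch-site.xml:
> > >
> > >   <property>
> > >     <name>fetcher.queue.mode</name>
> > >     <value>byHost</value>
> > >   </property>
> > >   <property>
> > >     <name>fetcher.threads.per.queue</name>
> > >     <value>1</value>
> > >   </property>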
> > >
> > >
> > > Please note that you have quite significant differences between the
> > > following counters:
> > >
> > >
> > >
> > > >                 Map input records=22514605
> > > >                 Map output records=21459377
> > > >
> > > >
> > >
> > > Above, a Generator Map-phase delta of 1,055,228 (22,514,605 - 21,459,377), and
> > >
> > >
> > > >                 Reduce input records=21459377
> > > >                 Reduce output records=7506045
> > > >
> > > >
> > >
> > > Above, a Fetch Map-phase delta of 13,953,332 (21,459,377 - 7,506,045)
> > >
> > >
> > > >                 Reduce input records=7503906
> > > >                 Reduce output records=609920
> > > >
> > > >
> > >
> > > Above, a Fetch Reducer-phase delta of 6,893,986 (7,503,906 - 609,920)
> > >
> > >
> > > >         FetcherStatus
> > > >                 ACCESS_DENIED=131
> > > >                 EXCEPTION=36676
> > > >                 GONE=295
> > > >                 HitByTimeLimit-QueueFeeder=6883654
> > > >                 HitByTimeLimit-Queues=10291
> > > >                 MOVED=37141
> > > >                 NOTFOUND=10490
> > > >                 NOTMODIFIED=732
> > > >                 SUCCESS=485083
> > > >                 TEMP_MOVED=14589
> > > >
> > > >
> > >
> > > Very interesting FetcherStatus stats. HitByTimeLimit-QueueFeeder=6883654
> > > is of particular interest.
> > > If I were you I would create many more, smaller batches of URLs to fetch
> > > as opposed to these large batches which are simply... not being fetched.
> > > You only fetched around 485K URLs going by the above stats.
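> > >
> > > Those HitByTimeLimit counters suggest the fetch is being cut off by a
> > > configured time limit. If (an assumption on my part, please check your
> > > nutch-default.xml) that limit comes from fetcher.timelimit.mins, the
> > > corresponding entry would look something like:
> > >
> > >   <property>
> > >     <name>fetcher.timelimit.mins</name>
> > >     <value>180</value>
> > >   </property>
> > >
> > > Smaller batches at generate time, e.g. via a lower -topN, would then have
> > > a better chance of finishing inside that window.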
> > >
> > >
> > > >
> > > >
> > > > Any idea what I need to adjust to use more nodes for the reduce phase?
> > >
> > >
> > > Hopefully the above has given you a decent amount to consider. Please let
> > > us know if you have some more questions.
> > > Thanks
> > > Lewis
> > >
> >
>
