Hi Joseph,

I guess you are using Hadoop 2.x.
The default value of mapreduce.tasktracker.reduce.tasks.maximum is 2, so at
most 2 reducers are created.
You should change this parameter in Hadoop's mapred-site.xml.
The number of reducers created also depends on other YARN settings
(mapreduce.reduce.memory.mb,
yarn.nodemanager.resource.memory-mb, mapreduce.reduce.cpu.vcores,
yarn.nodemanager.resource.cpu-vcores)
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
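
As a rough sketch of how those YARN settings interact (my own simplification, not how the scheduler is actually implemented): the number of reduce containers a single node can run at once is bounded by both memory and vcores. The function name and the sample values below are hypothetical, and depending on the scheduler's resource calculator the vcore limit may not be enforced at all.

```python
def max_reducers_per_node(node_memory_mb, node_vcores,
                          reduce_memory_mb, reduce_vcores):
    """Upper bound on reduce containers one NodeManager can host at once.

    Arguments correspond to:
      node_memory_mb   -> yarn.nodemanager.resource.memory-mb
      node_vcores      -> yarn.nodemanager.resource.cpu-vcores
      reduce_memory_mb -> mapreduce.reduce.memory.mb
      reduce_vcores    -> mapreduce.reduce.cpu.vcores
    """
    by_memory = node_memory_mb // reduce_memory_mb  # containers that fit in RAM
    by_vcores = node_vcores // reduce_vcores        # containers that fit in CPU
    return min(by_memory, by_vcores)


# Hypothetical node: 8 GB and 8 vcores for YARN, 4 GB and 1 vcore per reducer.
print(max_reducers_per_node(8192, 8, 4096, 1))
```

With these made-up numbers memory is the limiting factor, so lowering mapreduce.reduce.memory.mb or raising yarn.nodemanager.resource.memory-mb would let more reducers run per node. Map tasks and the application master also take containers, so treat the result as an upper bound only.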

Tien

On Wed, May 4, 2016 at 4:00 AM, Joseph Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thank you Lewis!  I'm still not sure I understand why only 2 reducers were
> created for the fetch phase given that there were over 7 million URLs to
> process.  Currently the partitioning scheme is set to byHost, and I've
> lowered the generate.max.count to 250.  If I understand correctly, that
> will limit the fetcher to only 250 URLs per host making it more polite?
>
> Are you suggesting making many jobIDs with the generate phase and run many
> fetch map reduce jobs in parallel?
> Thanks again for your response on this!
>
> -Joe
>
> On Tue, May 3, 2016 at 4:32 PM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
> > Hi Joseph,
> >
> > On Tue, May 3, 2016 at 7:53 AM, <user-digest-h...@nutch.apache.org>
> wrote:
> >
> > >
> > > From: Joseph Obernberger <joseph.obernber...@gmail.com>
> > > To: user@nutch.apache.org
> > > Cc:
> > > Date: Tue, 3 May 2016 09:04:09 -0400
> > > Subject: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers
> > > Hello - I'm working with nutch 2.3.1 with HBase for the webpage
> table.  I
> > > have all the phases (inject, generate, fetch, parse, and updatedb)
> > working
> > > fine.  Nutch is a crawling beast!
> > >
> >
> > Glad to hear.
> >
> >
> > >
> > > On our cluster, the generate phase uses around 60 mappers and 128
> > reducers,
> > > but the fetch phase always uses just 2 reducers.  In a recent test, the
> > > fetch phase used 60 mappers and 2 reducers.
> > >
> >
> > In Nutch 2.X you will have noticed that the actual 'Fetching' is executed
> > within the FetcherReducer [0]. More specifically, it is achieved within
> the
> > FetcherReducer.FetcherThread [1] which picks items from FetchItemQueues
> and
> > fetches the pages.
> > The crux of this issue here is a politeness issue. It has to do with the
> > URL Partitioning scheme [2] you use which partitions urls by host, domain
> > name or IP depending on the value of the parameter 'partition.url.mode'
> > which can be 'byHost', 'byDomain' or 'byIP'.
> > The issue was described a few weeks ago by Karanjeet and Sebastian
> > http://www.mail-archive.com/user%40nutch.apache.org/msg14496.html
> >
> >
> > [0]
> >
> >
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
> > [1]
> >
> >
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java#L430
> > [2]
> >
> >
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/URLPartitioner.java
> >
> >
> > Please note that you have quite significant differences between the
> > following
> >
> >
> >
> > >                 Map input records=22514605
> > >                 Map output records=21459377
> > >
> > >
> >
> > Above Generator Map-phase delta of 1,055,228, and
> >
> >
> > >                 Reduce input records=21459377
> > >                 Reduce output records=7506045
> > >
> > >
> >
> > Above Fetch Map-phase delta of 13,953,332
> >
> >
> > >                 Reduce input records=7503906
> > >                 Reduce output records=609920
> > >
> > >
> >
> > Above Fetch Reducer-phase delta of 6,893,986
> >
> >
> > >         FetcherStatus
> > >                 ACCESS_DENIED=131
> > >                 EXCEPTION=36676
> > >                 GONE=295
> > >                 HitByTimeLimit-QueueFeeder=6883654
> > >                 HitByTimeLimit-Queues=10291
> > >                 MOVED=37141
> > >                 NOTFOUND=10490
> > >                 NOTMODIFIED=732
> > >                 SUCCESS=485083
> > >                 TEMP_MOVED=14589
> > >
> > >
> >
> > Very interesting FetcherStatus stats. HitByTimeLimit-QueueFeeder=6883654
> is
> > of particular interest.
> > If I were you I would create many more, smaller batches of URLs to fetch
> as
> > opposed to these large batches which are simply... not being fetched. You
> > only fetched around 485K URLs going by the above stats.
> >
> >
> > >
> > >
> > > Any idea on what I need to adjust to use more nodes for the reduce
> phase?
> >
> >
> > Hopefully the above has given you a decent amount to consider. Please let
> > us know if you have some more questions.
> > Thanks
> > Lewis
> >
>
