Hi Joseph, I guess you are using Hadoop 2.x. The default value of mapreduce.tasktracker.reduce.tasks.maximum is 2, so at most 2 reducers are created. You should change this parameter in Hadoop's mapred-site.xml. The number of reducers created also depends on other YARN configs (mapreduce.reduce.memory.mb, yarn.nodemanager.resource.memory-mb, mapreduce.reduce.cpu.vcores, yarn.nodemanager.resource.cpu-vcores): http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
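As a rough illustration of how the four YARN settings listed above interact, here is a minimal Python sketch that estimates how many reduce containers a single NodeManager could run at the same time. It is a simplified model, not how the scheduler is implemented: the function name and the example numbers are hypothetical, it ignores the ApplicationMaster, any map containers running alongside, and scheduler rounding of container sizes, and whether vcores are taken into account at all depends on how the scheduler is configured (hence the enforce_vcores flag, which is also an assumption of this sketch). It bounds concurrency only; it does not say how many reduce tasks a job requests.

```python
def max_concurrent_reducers_per_node(node_memory_mb, node_vcores,
                                     reduce_memory_mb, reduce_vcores,
                                     enforce_vcores=True):
    """Rough upper bound on reduce containers one NodeManager can run at once.

    node_memory_mb   ~ yarn.nodemanager.resource.memory-mb
    node_vcores      ~ yarn.nodemanager.resource.cpu-vcores
    reduce_memory_mb ~ mapreduce.reduce.memory.mb
    reduce_vcores    ~ mapreduce.reduce.cpu.vcores
    """
    if node_memory_mb < 0 or node_vcores < 0:
        raise ValueError("node resources must be non-negative")
    if reduce_memory_mb <= 0 or reduce_vcores <= 0:
        raise ValueError("per-reducer requests must be positive")

    # How many reducers fit if memory is the only constraint.
    by_memory = node_memory_mb // reduce_memory_mb
    if not enforce_vcores:
        return by_memory

    # How many fit if CPU is the only constraint; the tighter limit wins.
    by_vcores = node_vcores // reduce_vcores
    return min(by_memory, by_vcores)


# Hypothetical node: 16 GB and 8 vcores offered to YARN,
# each reducer asking for 3 GB and 2 vcores.
print(max_concurrent_reducers_per_node(16384, 8, 3072, 2))
```

Multiplying the result by the number of worker nodes gives a ceiling on reducers running in parallel across the cluster under these assumptions.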
Tien

On Wed, May 4, 2016 at 4:00 AM, Joseph Obernberger <joseph.obernber...@gmail.com> wrote:

> Thank you Lewis! I'm still not sure I understand why only 2 reducers were
> created for the fetch phase given that there were over 7 million URLs to
> process. Currently the partitioning scheme is set to byHost, and I've
> lowered the generate.max.count to 250. If I understand correctly, that
> will limit the fetcher to only 250 URLs per host making it more polite?
>
> Are you suggesting making many jobIDs with the generate phase and run many
> fetch map reduce jobs in parallel?
> Thanks again for your response on this!
>
> -Joe
>
> On Tue, May 3, 2016 at 4:32 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>
> > Hi Joseph,
> >
> > On Tue, May 3, 2016 at 7:53 AM, <user-digest-h...@nutch.apache.org> wrote:
> >
> > > From: Joseph Obernberger <joseph.obernber...@gmail.com>
> > > To: user@nutch.apache.org
> > > Cc:
> > > Date: Tue, 3 May 2016 09:04:09 -0400
> > > Subject: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers
> > >
> > > Hello - I'm working with nutch 2.3.1 with HBase for the webpage table. I
> > > have all the phases (inject, generate, fetch, parse, and updatedb) working
> > > fine. Nutch is a crawling beast!
> >
> > Glad to hear.
> >
> > > On our cluster, the generate phase uses around 60 mappers and 128 reducers,
> > > but the fetch phase always uses just 2 reducers. In a recent test, the
> > > fetch phase used 60 mappers and 2 reducers.
> >
> > In Nutch 2.X you will have noticed that the actual 'Fetching' is executed
> > within the FetcherReducer [0]. More specifically, it is achieved within the
> > FetcherReducer.FetcherThread [1] which picks items from FetchItemQueues and
> > fetches the pages.
> > The crux of this issue here is a politeness issue. It has to do with the
> > URL Partitioning scheme [2] you use which partitions urls by host, domain
> > name or IP depending on the value of the parameter 'partition.url.mode'
> > which can be 'byHost', 'byDomain' or 'byIP'.
> > The issue was described a few weeks ago by Karanjeet and Sebastian
> > http://www.mail-archive.com/user%40nutch.apache.org/msg14496.html
> >
> > [0] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
> > [1] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java#L430
> > [2] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/URLPartitioner.java
> >
> > Please note that you have quite significant differences between the following
> >
> > > Map input records=22514605
> > > Map output records=21459377
> >
> > Above Generator Map-phase delta of 1,055,228, and
> >
> > > Reduce input records=21459377
> > > Reduce output records=7506045
> >
> > Above Fetch Map-phase delta of 13,953,332
> >
> > > Reduce input records=7503906
> > > Reduce output records=609920
> >
> > Above Fetch Reducer-phase delta of 6,893,986
> >
> > > FetcherStatus
> > > ACCESS_DENIED=131
> > > EXCEPTION=36676
> > > GONE=295
> > > HitByTimeLimit-QueueFeeder=6883654
> > > HitByTimeLimit-Queues=10291
> > > MOVED=37141
> > > NOTFOUND=10490
> > > NOTMODIFIED=732
> > > SUCCESS=485083
> > > TEMP_MOVED=14589
> >
> > Very interesting FetcherStatus stats. HitByTimeLimit-QueueFeeder=6883654 is
> > of particular interest.
> > If I were you I would create many more, smaller batches of URLs to fetch as
> > opposed to these large batches which are simply... not being fetched. You
> > only fetched around 485K URLs going by the above stats.
> >
> > > Any idea on what I need to adjust to use more nodes for the reduce phase?
> >
> > Hopefully the above has given you a decent amount to consider. Please let
> > us know if you have some more questions.
> > Thanks
> > Lewis
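
The counter deltas and the FetcherStatus breakdown quoted above can be double-checked with a few lines of Python. This is only a sketch over the numbers already posted in the thread: the phase labels are the ones Lewis used, the variable names are made up for the example, and nothing here reads real job output.

```python
# (input records, output records) per phase, copied from the quoted job counters.
phase_counters = {
    "Generator Map-phase": (22514605, 21459377),
    "Fetch Map-phase": (21459377, 7506045),
    "Fetch Reducer-phase": (7503906, 609920),
}

# Delta = records that went in but did not come out of that phase.
deltas = {name: inp - out for name, (inp, out) in phase_counters.items()}

# FetcherStatus counters, copied from the quoted job output.
fetcher_status = {
    "ACCESS_DENIED": 131,
    "EXCEPTION": 36676,
    "GONE": 295,
    "HitByTimeLimit-QueueFeeder": 6883654,
    "HitByTimeLimit-Queues": 10291,
    "MOVED": 37141,
    "NOTFOUND": 10490,
    "NOTMODIFIED": 732,
    "SUCCESS": 485083,
    "TEMP_MOVED": 14589,
}

status_total = sum(fetcher_status.values())
# Everything that was skipped because the fetcher time limit was reached.
time_limited = sum(count for name, count in fetcher_status.items()
                   if name.startswith("HitByTimeLimit"))
time_limited_share = time_limited / status_total
success_share = fetcher_status["SUCCESS"] / status_total

for name, delta in deltas.items():
    print(f"{name}: delta of {delta:,}")
print(f"HitByTimeLimit: {time_limited:,} of {status_total:,} "
      f"({time_limited_share:.1%})")
print(f"SUCCESS: {fetcher_status['SUCCESS']:,} ({success_share:.1%})")
```

The three deltas match the figures Lewis quotes, and the two HitByTimeLimit counters together account for the large majority of the FetcherStatus entries, which is why smaller batches were suggested.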