Thank you Nguyen. I tried modifying the yarn settings to no avail. Other jobs, like the generate phase, create 128 reducers with no problem. It's only the fetch phase that always gets 2 reducers. Running multiple jobs in parallel does work and does increase throughput, but I'm still curious why the fetch phase alone is limited to 2 reducers. Thanks!
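For anyone finding this thread in the archive later: the Hadoop-side settings Nguyen mentions below live in mapred-site.xml and yarn-site.xml. An illustrative snippet (the values here are examples only, not recommendations for any particular cluster):

```xml
<!-- mapred-site.xml (illustrative values only) -->
<property>
  <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>1</value>
</property>

<!-- yarn-site.xml (resources advertised per NodeManager) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
```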
-Joe

On Tue, May 10, 2016 at 3:57 AM, Nguyen Manh Tien <[email protected]> wrote:

> Hi Joseph,
>
> I guess you use hadoop 2.x.
> The default value of mapreduce.tasktracker.reduce.tasks.maximum is 2, so it
> creates at most 2 reducers.
> You should change this parameter in hadoop mapred-site.xml.
> The number of reducers created also depends on other yarn configs
> (mapreduce.reduce.memory.mb, yarn.nodemanager.resource.memory-mb,
> mapreduce.reduce.cpu.vcores, yarn.nodemanager.resource.cpu-vcores).
>
> http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
>
> Tien
>
> On Wed, May 4, 2016 at 4:00 AM, Joseph Obernberger <[email protected]> wrote:
>
> > Thank you Lewis! I'm still not sure I understand why only 2 reducers were
> > created for the fetch phase, given that there were over 7 million URLs to
> > process. Currently the partitioning scheme is set to byHost, and I've
> > lowered generate.max.count to 250. If I understand correctly, that will
> > limit the fetcher to only 250 URLs per host, making it more polite?
> >
> > Are you suggesting making many jobIDs with the generate phase and running
> > many fetch map-reduce jobs in parallel?
> > Thanks again for your response on this!
> >
> > -Joe
> >
> > On Tue, May 3, 2016 at 4:32 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >
> > > Hi Joseph,
> > >
> > > On Tue, May 3, 2016 at 7:53 AM, <[email protected]> wrote:
> > >
> > > > From: Joseph Obernberger <[email protected]>
> > > > To: [email protected]
> > > > Date: Tue, 3 May 2016 09:04:09 -0400
> > > > Subject: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers
> > > > Hello - I'm working with nutch 2.3.1 with HBase for the webpage
> > > > table. I have all the phases (inject, generate, fetch, parse, and
> > > > updatedb) working fine. Nutch is a crawling beast!
> > >
> > > Glad to hear.
> > > > On our cluster, the generate phase uses around 60 mappers and 128
> > > > reducers, but the fetch phase always uses just 2 reducers. In a
> > > > recent test, the fetch phase used 60 mappers and 2 reducers.
> > >
> > > In Nutch 2.X you will have noticed that the actual 'Fetching' is
> > > executed within the FetcherReducer [0]. More specifically, it is
> > > achieved within the FetcherReducer.FetcherThread [1], which picks items
> > > from FetchItemQueues and fetches the pages.
> > > The crux of the issue here is politeness. It has to do with the URL
> > > partitioning scheme [2] you use, which partitions urls by host, domain
> > > name or IP depending on the value of the parameter
> > > 'partition.url.mode', which can be 'byHost', 'byDomain' or 'byIP'.
> > > The issue was described a few weeks ago by Karanjeet and Sebastian:
> > > http://www.mail-archive.com/user%40nutch.apache.org/msg14496.html
> > >
> > > [0] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
> > > [1] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java#L430
> > > [2] https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/URLPartitioner.java
> > >
> > > Please note that you have quite significant differences between the
> > > following:
> > >
> > > > Map input records=22514605
> > > > Map output records=21459377
> > >
> > > Above, a Generator Map-phase delta of 1,055,228, and
> > >
> > > > Reduce input records=21459377
> > > > Reduce output records=7506045
> > >
> > > Above, a Fetch Map-phase delta of 13,953,332.
> > >
> > > > Reduce input records=7503906
> > > > Reduce output records=609920
> > >
> > > Above, a Fetch Reducer-phase delta of 6,893,986.
> > >
> > > > FetcherStatus
> > > > ACCESS_DENIED=131
> > > > EXCEPTION=36676
> > > > GONE=295
> > > > HitByTimeLimit-QueueFeeder=6883654
> > > > HitByTimeLimit-Queues=10291
> > > > MOVED=37141
> > > > NOTFOUND=10490
> > > > NOTMODIFIED=732
> > > > SUCCESS=485083
> > > > TEMP_MOVED=14589
> > >
> > > Very interesting FetcherStatus stats.
> > > HitByTimeLimit-QueueFeeder=6883654 is of particular interest.
> > > If I were you, I would create many more, smaller batches of URLs to
> > > fetch as opposed to these large batches, which are simply... not being
> > > fetched. You only fetched around 485K URLs going by the above stats.
> > >
> > > > Any idea on what I need to adjust to use more nodes for the reduce
> > > > phase?
> > >
> > > Hopefully the above has given you a decent amount to consider. Please
> > > let us know if you have some more questions.
> > > Thanks
> > > Lewis
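The Nutch-side knobs discussed in the thread (partition.url.mode, generate.max.count) live in nutch-site.xml; roughly like this, using the values mentioned above (descriptions paraphrased, so check nutch-default.xml for the authoritative wording):

```xml
<!-- nutch-site.xml: the knobs discussed in this thread -->
<property>
  <name>partition.url.mode</name>
  <value>byHost</value>
  <description>byHost, byDomain or byIP</description>
</property>
<property>
  <name>generate.max.count</name>
  <value>250</value>
  <description>Cap on URLs selected per host/domain/IP per batch.</description>
</property>
```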
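To make the partitioning point from the thread concrete: here is a toy sketch in plain Python (not Nutch code) of the byHost idea behind URLPartitioner. Every URL from the same host hashes to the same partition, which is what lets a single reducer enforce per-host politeness delays, and which is also why the reducer count bounds fetch parallelism.

```python
import zlib
from urllib.parse import urlparse

def partition_by_host(url: str, num_reducers: int) -> int:
    """Toy byHost partitioner: all URLs of one host land in one partition."""
    host = urlparse(url).hostname or ""
    # Stable hash of the host name, mod the number of reduce tasks.
    return zlib.crc32(host.encode("utf-8")) % num_reducers

# With num_reducers=2, every host's fetch queue is crammed into one of
# just two reduce tasks, no matter how many URLs were generated.
same = {partition_by_host(u, 2) for u in
        ("http://example.com/a", "http://example.com/b")}
print(len(same))  # 1 -- both URLs map to the same partition
```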
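Finally, the yarn configs Nguyen lists interact roughly like this back-of-the-envelope calculation (a simplification of YARN's actual scheduling, shown only to illustrate the idea): a node can run only as many reduce containers concurrently as both its memory and its vcores allow.

```python
def max_concurrent_reducers(node_mem_mb: int, node_vcores: int,
                            reducer_mem_mb: int, reducer_vcores: int) -> int:
    """Rough upper bound on reduce containers one NodeManager can host."""
    # A container must fit both its memory and vcore request on the node.
    return min(node_mem_mb // reducer_mem_mb, node_vcores // reducer_vcores)

# Example: a 24 GB / 8-vcore NodeManager running 2 GB / 1-vcore reducers:
print(max_concurrent_reducers(24576, 8, 2048, 1))  # 8
```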

