Hello - I'm working with Nutch 2.3.1 with HBase as the backend for the webpage table. I have all the phases (inject, generate, fetch, parse, and updatedb) working fine. Nutch is a crawling beast!
On our cluster, the generate phase uses around 60 mappers and 128 reducers, but the fetch phase always uses just 2 reducers. In a recent test, for example, the fetch phase ran with 60 mappers and only 2 reducers.

The configuration I have uses:
generate.max.count=250
fetcher.threads.fetch=256
fetcher.server.min.delay=1
fetcher.threads.per.queue=5

Output from the generate phase:
---------------
16/05/02 18:10:57 INFO mapreduce.Job: Job job_1461352180552_0008 completed successfully
16/05/02 18:10:57 INFO mapreduce.Job: Counters: 52
        File System Counters
                FILE: Number of bytes read=534466703
                FILE: Number of bytes written=1093638467
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=45663
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=60
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=60
                Launched reduce tasks=128
                Data-local map tasks=47
                Rack-local map tasks=13
                Total time spent by all maps in occupied slots (ms)=41550640
                Total time spent by all reduces in occupied slots (ms)=482838568
                Total time spent by all map tasks (ms)=10387660
                Total time spent by all reduce tasks (ms)=60354821
                Total vcore-seconds taken by all map tasks=10387660
                Total vcore-seconds taken by all reduce tasks=60354821
                Total megabyte-seconds taken by all map tasks=42547855360
                Total megabyte-seconds taken by all reduce tasks=494426693632
        Map-Reduce Framework
                Map input records=22514605
                Map output records=21459377
                Map output bytes=2302304271
                Map output materialized bytes=532738342
                Input split bytes=45663
                Combine input records=0
                Combine output records=0
                Reduce input groups=21458913
                Reduce shuffle bytes=532738342
                Reduce input records=21459377
                Reduce output records=7506045
                Spilled Records=42918754
                Shuffled Maps =7680
                Failed Shuffles=0
                Merged Map outputs=7680
                GC time elapsed (ms)=100632
                CPU time spent (ms)=16005360
                Physical memory (bytes) snapshot=205304303616
                Virtual memory (bytes) snapshot=1838431825920
                Total committed heap usage (bytes)=365396230144
        Generator
                GENERATE_MARK=7506045
                MALFORMED_URL=1
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
16/05/02 18:10:57 INFO crawl.GeneratorJob: GeneratorJob: finished at 2016-05-02 18:10:57, time elapsed: 00:28:19
16/05/02 18:10:57 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1462225358-1352746578 containing 7506045 URLs
---------------

Output from the fetch phase:
---------------
16/05/02 19:18:09 INFO mapreduce.Job: Job job_1461352180552_0009 completed successfully
16/05/02 19:18:09 INFO mapreduce.Job: Counters: 60
        File System Counters
                FILE: Number of bytes read=483484507
                FILE: Number of bytes written=942430295
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=51243
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=60
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=60
                Launched reduce tasks=2
                Data-local map tasks=47
                Rack-local map tasks=13
                Total time spent by all maps in occupied slots (ms)=11392544
                Total time spent by all reduces in occupied slots (ms)=61953544
                Total time spent by all map tasks (ms)=2848136
                Total time spent by all reduce tasks (ms)=7744193
                Total vcore-seconds taken by all map tasks=2848136
                Total vcore-seconds taken by all reduce tasks=7744193
                Total megabyte-seconds taken by all map tasks=11665965056
                Total megabyte-seconds taken by all reduce tasks=63440429056
        Map-Reduce Framework
                Map input records=7503906
                Map output records=7503906
                Map output bytes=1081616122
                Map output materialized bytes=450300347
                Input split bytes=51243
                Combine input records=0
                Combine output records=0
                Reduce input groups=131072
                Reduce shuffle bytes=450300347
                Reduce input records=7503906
                Reduce output records=609920
                Spilled Records=15007812
                Shuffled Maps =120
                Failed Shuffles=0
                Merged Map outputs=120
                GC time elapsed (ms)=132204
                CPU time spent (ms)=19741790
                Physical memory (bytes) snapshot=107981033472
                Virtual memory (bytes) snapshot=336697593856
                Total committed heap usage (bytes)=158064443392
        FetcherStatus
                ACCESS_DENIED=131
                EXCEPTION=36676
                GONE=295
                HitByTimeLimit-QueueFeeder=6883654
                HitByTimeLimit-Queues=10291
                MOVED=37141
                NOTFOUND=10490
                NOTMODIFIED=732
                SUCCESS=485083
                TEMP_MOVED=14589
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
16/05/02 19:18:09 INFO fetcher.FetcherJob: FetcherJob: finished at 2016-05-02 19:18:09, time elapsed: 01:06:23
---------------

Any idea what I need to adjust so the fetch phase uses more reducers (and therefore more nodes)? Are there any other issues in the output above that I should be aware of? I'm very new to Nutch. Thank you!

-Joe Obernberger
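P.S. One thing I was considering trying: the FetcherJob usage string appears to list a -numTasks option, so I was going to attempt something along these lines (the batch id below is a placeholder for whatever the generate phase prints, and 60 is just a guess at a reasonable reducer count - I haven't actually run this yet):

```shell
# Hypothetical fetch invocation, asking for 60 reduce tasks instead of the
# default 2. <batchId> is the batch id reported by GeneratorJob.
bin/nutch fetch <batchId> -threads 256 -numTasks 60
```

Is that the intended knob for this, or should I instead be setting mapred.reduce.tasks (or mapreduce.job.reduces) in nutch-site.xml?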