Hello - I'm running Nutch 2.3.1 with HBase as the store for the webpage
table.  All the phases (inject, generate, fetch, parse, and updatedb) are
working fine.  Nutch is a crawling beast!

On our cluster, the generate phase uses around 60 mappers and 128 reducers,
but the fetch phase always uses just 2 reducers.  In a recent test, the
fetch phase used 60 mappers and 2 reducers.

The configuration I have uses:

generate.max.count=250
fetcher.threads.fetch=256
fetcher.server.min.delay=1
fetcher.threads.per.queue=5

Output from the generate phase:
---------------
16/05/02 18:10:57 INFO mapreduce.Job: Job job_1461352180552_0008 completed successfully
16/05/02 18:10:57 INFO mapreduce.Job: Counters: 52
        File System Counters
                FILE: Number of bytes read=534466703
                FILE: Number of bytes written=1093638467
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=45663
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=60
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=60
                Launched reduce tasks=128
                Data-local map tasks=47
                Rack-local map tasks=13
                Total time spent by all maps in occupied slots (ms)=41550640
                Total time spent by all reduces in occupied slots (ms)=482838568
                Total time spent by all map tasks (ms)=10387660
                Total time spent by all reduce tasks (ms)=60354821
                Total vcore-seconds taken by all map tasks=10387660
                Total vcore-seconds taken by all reduce tasks=60354821
                Total megabyte-seconds taken by all map tasks=42547855360
                Total megabyte-seconds taken by all reduce tasks=494426693632
        Map-Reduce Framework
                Map input records=22514605
                Map output records=21459377
                Map output bytes=2302304271
                Map output materialized bytes=532738342
                Input split bytes=45663
                Combine input records=0
                Combine output records=0
                Reduce input groups=21458913
                Reduce shuffle bytes=532738342
                Reduce input records=21459377
                Reduce output records=7506045
                Spilled Records=42918754
                Shuffled Maps =7680
                Failed Shuffles=0
                Merged Map outputs=7680
                GC time elapsed (ms)=100632
                CPU time spent (ms)=16005360
                Physical memory (bytes) snapshot=205304303616
                Virtual memory (bytes) snapshot=1838431825920
                Total committed heap usage (bytes)=365396230144
        Generator
                GENERATE_MARK=7506045
                MALFORMED_URL=1
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
16/05/02 18:10:57 INFO crawl.GeneratorJob: GeneratorJob: finished at 2016-05-02 18:10:57, time elapsed: 00:28:19
16/05/02 18:10:57 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1462225358-1352746578 containing 7506045 URLs
---------------
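
As a rough cross-check of the generate counters above, here is a small
Python sketch (not anything from Nutch itself; the values are copied from
the log and the variable names are made up for illustration).  Each reducer
pulls one output from every mapper, so "Shuffled Maps" should equal map
tasks times reduce tasks, and the share of candidate URLs that made it into
the batch follows from the record counts.

```python
# Values copied from the generate job counters above; names are illustrative only.
map_tasks = 60
reduce_tasks = 128
shuffled_maps = 7680
reduce_input_records = 21459377
generated_urls = 7506045  # GENERATE_MARK, same as Reduce output records

# Each reducer fetches one map output per mapper.
expected_shuffles = map_tasks * reduce_tasks

# Fraction of candidate URLs that ended up in the generated batch.
selected_share = generated_urls / reduce_input_records

print(f"expected shuffles: {expected_shuffles}, reported: {shuffled_maps}")
print(f"share of candidates selected: {selected_share:.1%}")
```

The shuffle count matches the 128 reducers reported for this job.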

Output from the fetch phase:

---------------
16/05/02 19:18:09 INFO mapreduce.Job: Job job_1461352180552_0009 completed successfully
16/05/02 19:18:09 INFO mapreduce.Job: Counters: 60
        File System Counters
                FILE: Number of bytes read=483484507
                FILE: Number of bytes written=942430295
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=51243
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=60
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=60
                Launched reduce tasks=2
                Data-local map tasks=47
                Rack-local map tasks=13
                Total time spent by all maps in occupied slots (ms)=11392544
                Total time spent by all reduces in occupied slots (ms)=61953544
                Total time spent by all map tasks (ms)=2848136
                Total time spent by all reduce tasks (ms)=7744193
                Total vcore-seconds taken by all map tasks=2848136
                Total vcore-seconds taken by all reduce tasks=7744193
                Total megabyte-seconds taken by all map tasks=11665965056
                Total megabyte-seconds taken by all reduce tasks=63440429056
        Map-Reduce Framework
                Map input records=7503906
                Map output records=7503906
                Map output bytes=1081616122
                Map output materialized bytes=450300347
                Input split bytes=51243
                Combine input records=0
                Combine output records=0
                Reduce input groups=131072
                Reduce shuffle bytes=450300347
                Reduce input records=7503906
                Reduce output records=609920
                Spilled Records=15007812
                Shuffled Maps =120
                Failed Shuffles=0
                Merged Map outputs=120
                GC time elapsed (ms)=132204
                CPU time spent (ms)=19741790
                Physical memory (bytes) snapshot=107981033472
                Virtual memory (bytes) snapshot=336697593856
                Total committed heap usage (bytes)=158064443392
        FetcherStatus
                ACCESS_DENIED=131
                EXCEPTION=36676
                GONE=295
                HitByTimeLimit-QueueFeeder=6883654
                HitByTimeLimit-Queues=10291
                MOVED=37141
                NOTFOUND=10490
                NOTMODIFIED=732
                SUCCESS=485083
                TEMP_MOVED=14589
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
16/05/02 19:18:09 INFO fetcher.FetcherJob: FetcherJob: finished at 2016-05-02 19:18:09, time elapsed: 01:06:23
---------------
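
To put the fetch counters above in perspective, here is a rough Python
sketch (again not part of Nutch; the values are copied from the log and the
names are made up for illustration).  It assumes the two HitByTimeLimit
counters count URLs that were dropped without being fetched, and it uses the
whole job's elapsed time, so the rate is only a ballpark figure.

```python
# Values copied from the fetch job counters above; names are illustrative only.
reduce_tasks = 2
reduce_input_records = 7503906
elapsed_seconds = 1 * 3600 + 6 * 60 + 23  # "time elapsed: 01:06:23"

# URLs dropped because the time limit was hit (assumed meaning of these counters).
hit_by_time_limit = 6883654 + 10291  # QueueFeeder + Queues

# URLs that received an actual fetch status.
fetch_status = {
    "ACCESS_DENIED": 131,
    "EXCEPTION": 36676,
    "GONE": 295,
    "MOVED": 37141,
    "NOTFOUND": 10490,
    "NOTMODIFIED": 732,
    "SUCCESS": 485083,
    "TEMP_MOVED": 14589,
}
attempted = sum(fetch_status.values())

time_limit_share = hit_by_time_limit / reduce_input_records
urls_per_second = attempted / elapsed_seconds
records_per_reducer = reduce_input_records / reduce_tasks

print(f"URLs with a fetch status: {attempted}")
print(f"share dropped by time limit: {time_limit_share:.1%}")
print(f"approximate fetch rate: {urls_per_second:.0f} URLs/s")
print(f"records per reducer: {records_per_reducer:.0f}")
```

By this reading, most of the generated batch never gets fetched, and
everything that does is handled by only two reduce tasks.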


Any idea what I need to adjust so that the fetch phase runs more than 2
reducers and spreads across more nodes?  Is there anything else in the
output above that I should be aware of?  I'm very new to Nutch.
Thank you!

-Joe Obernberger
