[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741082#action_12741082
 ] 

Julien Nioche commented on NUTCH-721:
-------------------------------------

I had another look at this issue after applying the patch from NUTCH-719. I can 
easily reproduce the situation from the original post by setting 
fetcher.threads.per.host.by.ip to true. The nutch-site file sent by Roger does 
not specify it, so the default value (true) applies. Once it is set to 
false, all threads are active and the fetching is much faster. 
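
For reference, the override in nutch-site.xml would look roughly like the 
fragment below. This is a minimal sketch: the property name is the one discussed 
in this comment, and the XML comment is my own wording rather than text taken 
from nutch-default.xml.

```xml
<configuration>
  <property>
    <!-- Assumed intent: queue URLs by host name rather than by resolved IP,
         so the QueueFeeder does not need a DNS lookup per URL. -->
    <name>fetcher.threads.per.host.by.ip</name>
    <value>false</value>
  </property>
</configuration>
```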

I used the first 5K URLs from the fetchlist sent by Roger and compared 
the performance of both fetchers with by.ip set to false:

OldFetcher :  
real    32m26.003s
user    1m11.768s
sys     0m10.337s

OldFetcher :  
real    30m52.965s
user    1m10.696s
sys     0m10.425s

Fetcher :  
real    31m21.924s
user    1m12.725s
sys     0m10.797s

Fetcher :
real    30m3.017s
user    1m15.509s
sys     0m10.909s

I ran each fetcher twice and, as we can see, the results are comparable.

This explanation is also consistent with Steven's observation that we get 5-7 
times the rate on a re-fetch, since subsequent lookups for URLs from non-unique 
sites hit the DNS cache. The IP resolution is done by the QueueFeeder, which 
explains why it reduces the rate at which URLs become available for fetching.

I don't think that the OldFetcher supports grouping URLs by IP for politeness, 
in which case why not make fetcher.threads.per.host.by.ip default to false in 
the new fetcher?


> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Running with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a 
> re-fetch, so possible issue relating to NS lookups, however all things being 
> equal Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
