[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730239#action_12730239
 ] 

Steven Denny commented on NUTCH-719:
------------------------------------

I'm not sure. As far as I can tell, the feeder has always finished feeding the 
URLs; it's just that a proportion are "lost".

However, there are two things I've noted regarding performance (if you just 
look at URLs crawled per second):

1) When this situation arises, the fetcher will time out and "Abort with N 
hung threads". The timeout occurs after "mapred.task.timeout"/2 (default 5 
mins), so any timing on a crawl that aborted will be extended by 5 mins. On a 
small crawl this could skew the figures.

2) DNS lookups can take a while. I know this has been noted before, but on my 
test system (admittedly only a VM on our network, with nothing special in terms 
of DNS), some of the lookups were taking 5-6 seconds. This is possibly the 
wrong place to discuss it given NUTCH-721, but I put some debug logging around 
the feeder thread and got:

2009-07-10 04:01:35,296 INFO  fetcher.Fetcher - Fed 500 urls in 186 secs = 2.7url/s
2009-07-10 04:04:18,343 INFO  fetcher.Fetcher - Fed 499 urls in 163 secs = 3.1url/s
2009-07-10 04:06:57,109 INFO  fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
2009-07-10 04:10:38,282 INFO  fetcher.Fetcher - Fed 499 urls in 221 secs = 2.3url/s
2009-07-10 04:12:58,371 INFO  fetcher.Fetcher - Fed 498 urls in 140 secs = 3.6url/s
2009-07-10 04:16:12,275 INFO  fetcher.Fetcher - Fed 499 urls in 193 secs = 2.6url/s
2009-07-10 04:19:20,162 INFO  fetcher.Fetcher - Fed 499 urls in 187 secs = 2.7url/s
2009-07-10 04:21:25,846 INFO  fetcher.Fetcher - Fed 499 urls in 125 secs = 4.0url/s
2009-07-10 04:24:16,049 INFO  fetcher.Fetcher - Fed 495 urls in 170 secs = 2.9url/s
2009-07-10 04:27:01,944 INFO  fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s
2009-07-10 04:29:26,247 INFO  fetcher.Fetcher - Fed 499 urls in 144 secs = 3.5url/s
2009-07-10 04:32:02,590 INFO  fetcher.Fetcher - Fed 499 urls in 156 secs = 3.2url/s
2009-07-10 04:34:49,985 INFO  fetcher.Fetcher - Fed 498 urls in 167 secs = 3.0url/s
2009-07-10 04:37:28,367 INFO  fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
2009-07-10 04:40:09,865 INFO  fetcher.Fetcher - Fed 499 urls in 161 secs = 3.1url/s
2009-07-10 04:42:55,203 INFO  fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s

Obviously, when I'm only feeding 3-4 URLs/sec, I'll only ever be able to fetch 
at that rate. That test was on a crawldb just initialised with 11,000 URLs 
(unique sites).
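
For anyone wanting to check the overall feed rate from debug output like the above, here is a rough sketch in Python. It assumes the "Fed N urls in S secs" wording of my own debug statement (this is not a standard Nutch log message), and the names FED_LINE, SAMPLE_LOG and feed_rate are just made up for the example. The sample uses the first three lines of the log above.

```python
import re

# Matches the ad-hoc debug line format shown in the log excerpt.
# Only the "Fed N urls in S secs" part is needed, so this works whether or
# not the trailing "= X url/s" has been wrapped onto the next line.
FED_LINE = re.compile(r"Fed (\d+) urls in (\d+) secs")

SAMPLE_LOG = """\
2009-07-10 04:01:35,296 INFO  fetcher.Fetcher - Fed 500 urls in 186 secs = 2.7url/s
2009-07-10 04:04:18,343 INFO  fetcher.Fetcher - Fed 499 urls in 163 secs = 3.1url/s
2009-07-10 04:06:57,109 INFO  fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
"""


def feed_rate(log_text):
    """Return the aggregate URLs fed per second across all matching lines."""
    total_urls = 0
    total_secs = 0
    for urls, secs in FED_LINE.findall(log_text):
        total_urls += int(urls)
        total_secs += int(secs)
    # Avoid dividing by zero when no lines match.
    return total_urls / total_secs if total_secs else 0.0


if __name__ == "__main__":
    print(f"{feed_rate(SAMPLE_LOG):.1f} url/s")
```

Summing URLs and seconds before dividing weights each batch by its duration, which is a fairer overall figure than averaging the per-line rates.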

However, on the next iteration where I'm feeding urls from non-unique sites, I 
see 5-7 times that rate.
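
Going back to point 1, the effect of the abort delay on a small crawl can be sketched as below. This assumes the stock Hadoop default of 600000 ms for "mapred.task.timeout"; the function names and the 1000-URL crawl are hypothetical figures for illustration only, not measurements.

```python
# Assumption: the stock Hadoop default for mapred.task.timeout (milliseconds).
MAPRED_TASK_TIMEOUT_MS = 600_000


def abort_delay_secs(task_timeout_ms=MAPRED_TASK_TIMEOUT_MS):
    """Seconds the fetcher waits before aborting: half the task timeout."""
    return task_timeout_ms / 2 / 1000


def urls_per_sec(urls, fetch_secs, aborted=False):
    """Apparent crawl rate, with the abort delay added if the crawl hung."""
    total_secs = fetch_secs + (abort_delay_secs() if aborted else 0)
    return urls / total_secs


if __name__ == "__main__":
    # Hypothetical small crawl: 1000 URLs fetched in 300 seconds of real work.
    print(f"clean run:   {urls_per_sec(1000, 300):.2f} url/s")
    print(f"aborted run: {urls_per_sec(1000, 300, aborted=True):.2f} url/s")
```

With these made-up numbers the aborted run reports half the rate of the clean one, even though the fetching itself took the same time.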


> fetchQueues.totalSize incorrect in Fetcher2
> -------------------------------------------
>
>                 Key: NUTCH-719
>                 URL: https://issues.apache.org/jira/browse/NUTCH-719
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>            Reporter: Julien Nioche
>
> I had a look at the logs generated by Fetcher2 and found cases where there 
> were no active fetchQueues but fetchQueues.totalSize was != 0
> fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, 
> fetchQueues.totalSize=1, fetchQueues=0
> since the code relies on fetchQueues.totalSize to determine whether the work 
> is finished or not the task is blocked until the abortion mechanism kicks in
> 2009-03-12 09:27:38,977 WARN  fetcher.Fetcher2 - Aborting with 200 hung 
> threads.
> could that be a synchronisation issue? any ideas?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
