Dennis,

Interesting info. I don't use the standard OPIC scorer but a slightly
modified version which boosts pages whose content matches what I'm looking
for... so it could be that my pages are generally on slow servers.
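
To illustrate the kind of tweak I mean (a hypothetical helper, not my
actual plugin code), the idea is just a multiplier on the OPIC score when
the parsed text contains the keywords I care about:

  import java.util.Set;

  // Illustrative sketch only -- a content-based boost of the kind
  // described above, not the actual modified OPIC scorer.
  public class ContentBoost {
    public static float boost(float opicScore, String pageText,
                              Set<String> keywords) {
      int hits = 0;
      for (String kw : keywords) {
        if (pageText.contains(kw)) hits++; // count keywords present in the page
      }
      return opicScore * (1.0f + 0.5f * hits); // more matches => higher score
    }
  }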

Now, heads-up: I just started a new run with 450k URLs and it looks like
I'm back to the previous behaviour:
+ 4 Mbit/s for a few minutes
+ then a steady 1.9 Mbit/s ... probably for ages, since that really means
only around 10-15 fetches/s

Why did the previous run go so fast? I'm still wondering.

2009/11/26 Dennis Kubes <[email protected]>

> One interesting thing we were seeing a while back on large crawls, where
> we were fetching the best-scoring pages first, then the next best, and so
> on, is that lower-scoring pages typically had worse response times and
> higher timeout rates.
>
> So while the best-scoring pages would respond very quickly and would have
> a < 1% timeout rate, the worst-scoring pages would take several times as
> long (I don't remember the exact ratio, but it was multiples) and could
> have as high as a 50% timeout rate.  Just something to think about.
>
> Dennis Kubes
>
>
> Andrzej Bialecki wrote:
>
>> MilleBii wrote:
>>
>>> I have to say that I'm still puzzled. Here is the latest: I just
>>> restarted a run, and then guess what:
>>>
>>> I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could
>>> only get 3 Mbit/s max before (note: bits, not bytes, as I said before).
>>> A few samples show that I was running at 50 fetches/s ... not bad. But
>>> why this high speed on this run, I haven't the faintest idea.
>>>
>>>
>>> Then it drops and I get this kind of log:
>>>
>>> 2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=516
>>> 2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=120
>>> 2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=516
>>> 2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=120
>>> 2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=516
>>>
>>> I don't fully understand why it is oscillating between two queue sizes,
>>> never mind... but it is likely the end of the run, since Hadoop shows
>>> 99.99% complete for the 2 map tasks it generated.
>>>
>>> Would that be explained by a better URL mix?
>>>
>>
>> I suspect that you have a bunch of hosts that slowly trickle the content,
>> i.e. requests don't time out and the crawl-delay is low, but the download
>> speed is very low due to limits at their end (either physical or
>> artificial).
>>
>> The solution in that case would be to track a minimum avg. speed per
>> FetchQueue, and lock out the queue if this number falls below the
>> threshold (similar to what we do when we discover a crawl-delay that is
>> too high).
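>>
>> A minimal sketch of that idea (hypothetical names, not the actual Nutch
>> FetchItemQueue code): track bytes and elapsed time per host queue, and
>> lock the queue out once its observed average speed falls below a
>> configurable floor.
>>
>>   public class QueueSpeedGuard {
>>     private final long minBytesPerSec;  // e.g. 2048 = 2 KB/s floor
>>     private final long minSampleBytes;  // don't judge a queue on tiny samples
>>     private long bytesFetched = 0;
>>     private long fetchTimeMs = 0;
>>     private boolean lockedOut = false;
>>
>>     public QueueSpeedGuard(long minBytesPerSec, long minSampleBytes) {
>>       this.minBytesPerSec = minBytesPerSec;
>>       this.minSampleBytes = minSampleBytes;
>>     }
>>
>>     // call after each completed fetch from this queue
>>     public synchronized void record(long bytes, long elapsedMs) {
>>       bytesFetched += bytes;
>>       fetchTimeMs += elapsedMs;
>>       if (bytesFetched >= minSampleBytes && fetchTimeMs > 0) {
>>         long avgBytesPerSec = bytesFetched * 1000 / fetchTimeMs;
>>         if (avgBytesPerSec < minBytesPerSec) {
>>           lockedOut = true;  // drop the queue, as with an excessive crawl-delay
>>         }
>>       }
>>     }
>>
>>     public synchronized boolean isLockedOut() { return lockedOut; }
>>   }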
>>
>> In the meantime, you could add the number of FetchQueues to that
>> diagnostic output, to see how many unique hosts are in the current
>> working set.
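>>
>> Concretely, that could be a small change in the Fetcher status loop,
>> assuming the per-host queues live in a Map inside FetchItemQueues (names
>> approximate, treat this as a sketch):
>>
>>   // in Fetcher's reporting loop:
>>   LOG.info("-activeThreads=" + activeThreads
>>       + ", spinWaiting=" + spinWaiting.get()
>>       + ", fetchQueues.totalSize=" + fetchQueues.getTotalSize()
>>       + ", fetchQueues.queueCount=" + fetchQueues.getQueueCount());
>>
>>   // in FetchItemQueues, where queues is the per-host Map:
>>   public synchronized int getQueueCount() {
>>     return queues.size();
>>   }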
>>
>>


-- 
-MilleBii-
