One interesting thing we were seeing a while back on large crawls, where we fetched the best scoring pages first, then the next best, and so on, is that lower scoring pages typically had worse response times and worse timeout rates. So while the best scoring pages would respond very quickly and would have a < 1% timeout rate, the worst scoring pages would take several times as long (I don't remember the exact ratio, but it was multiples) and could have as high as a 50% timeout rate. Just something to think about.
Dennis Kubes
Andrzej Bialecki wrote:
MilleBii wrote:
I have to say that I'm still puzzled. Here is the latest. I just restarted a run and then guess what: I got ultra-high speed, 8 Mbit/s sustained for 1 hour, where I could only get 3 Mbit/s max before (note: bits, not bytes, as I said before). A few samples show that I was running at 50 fetches/sec ... not bad. But why this high speed on this run I haven't got the faintest idea.
Then it drops and I get this kind of log:
2009-11-25 23:28:28,584 INFO fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:29,227 INFO fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:29,584 INFO fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:30,227 INFO fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:30,585 INFO fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
I don't fully understand why it is oscillating between two queue sizes (perhaps each of the two map tasks is reporting its own queue size and the log lines are interleaved), but never mind... it is likely the end of the run, since Hadoop shows 99.99% complete for the 2 maps it generated.
Would that be explained by a better URL mix?
I suspect that you have a bunch of hosts that slowly trickle the content, i.e. requests don't time out and the crawl-delay is low, but the download speed is very, very low due to limits at their end (either physical or artificial).
The solution in that case would be to track a minimum average speed per FetchQueue, and lock out the queue if its speed drops below that threshold (similarly to what we do when we discover a crawl-delay that is too high).
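Something along these lines would do it (just a sketch of the idea, not existing Nutch code; the class name, the thresholds, and the method names are all made up):

public class QueueSpeedTracker {
  // Assumed thresholds; real values would come from the configuration.
  private static final long MIN_BYTES_PER_SEC = 1024;
  private static final long MIN_SAMPLE_MS = 30000; // don't judge a queue too early

  private long bytesFetched = 0;
  private long fetchTimeMs = 0;

  /** Record one completed fetch from this queue. */
  public synchronized void record(long bytes, long elapsedMs) {
    bytesFetched += bytes;
    fetchTimeMs += elapsedMs;
  }

  /** True once the queue's observed average speed has dropped below the floor. */
  public synchronized boolean belowMinSpeed() {
    if (fetchTimeMs < MIN_SAMPLE_MS) {
      return false; // not enough data to judge yet
    }
    long bytesPerSec = bytesFetched * 1000 / fetchTimeMs;
    return bytesPerSec < MIN_BYTES_PER_SEC;
  }
}

A fetcher thread would call record() after each completed request and drop the queue once belowMinSpeed() returns true, the same way we drop queues whose crawl-delay is excessive.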
In the meantime, you could add the number of FetchQueue-s to that
diagnostic output, to see how many unique hosts are in the current
working set.
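E.g. a one-line change to the status report in the reporter loop, assuming a getQueueCount() accessor on FetchItemQueues (the actual field and method names in your version may differ):

LOG.info("-activeThreads=" + activeThreads
    + ", spinWaiting=" + spinWaiting.get()
    + ", fetchQueues.totalSize=" + fetchQueues.getTotalSize()
    + ", fetchQueues.queueCount=" + fetchQueues.getQueueCount());

That would show how many unique hosts are left in the working set at each tick, so you can tell whether the tail of the run is dominated by a handful of slow hosts.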