One interesting thing we were seeing a while back on large crawls, where we were fetching the best-scoring pages first, then the next best, and so on, is that lower-scoring pages typically had worse response times and higher timeout rates.

So while the best-scoring pages would respond very quickly and would have a < 1% timeout rate, the worst-scoring pages would take several times as long (I don't remember the exact ratio, but it was multiples) and could have as high as a 50% timeout rate. Just something to think about.

Dennis Kubes

Andrzej Bialecki wrote:
MilleBii wrote:
I have to say that I'm still puzzled. Here is the latest. I just restarted a
run, and then guess what:

I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could only get
3 Mbit/s max before (note: bits, not bytes, as I said before).
A few samples show that I was running at 50 fetches/sec ... not bad. But why
I got this high speed on this run, I haven't got the faintest idea.


Then it drops, and I get logs like this:

2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516

I don't fully understand why it is oscillating between two queue sizes, never
mind... but it is likely the end of the run, since Hadoop shows 99.99%
complete for the 2 map tasks it generated.

Would that be explained by a better URL mix?

I suspect that you have a bunch of hosts that slowly trickle the content, i.e. requests don't time out, crawl-delay is low, but the download speed is very very low due to the limits at their end (either physical or artificial).

The solution in that case would be to track a minimum average speed per FetchQueue, and lock out the queue if its average speed drops below that threshold (similar to what we do when we discover a crawl-delay that is too high). A rough sketch of the idea is below.
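
The following is only a minimal sketch of that per-queue speed tracking, not actual Fetcher code: the class name SlowQueueDetector, its methods, the smoothing factor, and the threshold are made up for illustration. It keeps an exponential moving average of bytes/second per queue (host) and flags queues that stay below a minimum speed.

// Hypothetical helper, not part of Nutch: tracks average download speed
// per queue (host) and flags queues that fall below a minimum bytes/sec.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SlowQueueDetector {
  private final double minBytesPerSec;   // lock-out threshold, e.g. 1024
  private final double alpha = 0.3;      // smoothing factor for the moving average
  private final Map<String, Double> avgSpeed = new ConcurrentHashMap<>();

  public SlowQueueDetector(double minBytesPerSec) {
    this.minBytesPerSec = minBytesPerSec;
  }

  /** Record one completed fetch for the given queue (host). */
  public void recordFetch(String queueId, long bytes, long millis) {
    if (millis <= 0) return;
    double speed = bytes * 1000.0 / millis;   // bytes per second
    avgSpeed.merge(queueId, speed,
        (oldAvg, current) -> alpha * current + (1 - alpha) * oldAvg);
  }

  /** True if the queue's average speed has dropped below the threshold. */
  public boolean shouldLockOut(String queueId) {
    Double avg = avgSpeed.get(queueId);
    return avg != null && avg < minBytesPerSec;
  }
}

The fetcher would call recordFetch() after each completed request and check shouldLockOut() before polling the queue again, the same way a too-high crawl-delay currently disables a queue.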

In the meantime, you could add the number of FetchQueue-s to that diagnostic output, to see how many unique hosts are in the current working set.
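
As a quick illustration of that (again hypothetical, not the actual Fetcher.java code; a plain map stands in for FetchItemQueues here), the status line would just carry one extra field:

// Hypothetical sketch: extend the fetcher's diagnostic line with the number
// of per-host queues in the current working set.
import java.util.HashMap;
import java.util.Map;

public class FetcherStatusLine {
  public static String status(int activeThreads, int spinWaiting,
                              Map<String, Integer> queues) {
    int totalSize = queues.values().stream().mapToInt(Integer::intValue).sum();
    return "-activeThreads=" + activeThreads
        + ", spinWaiting=" + spinWaiting
        + ", fetchQueues.totalSize=" + totalSize
        + ", fetchQueues.queueCount=" + queues.size();  // the new field
  }

  public static void main(String[] args) {
    Map<String, Integer> queues = new HashMap<>();
    queues.put("host-a.example", 400);
    queues.put("host-b.example", 116);
    System.out.println(status(100, 100, queues));
    // prints: -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516, fetchQueues.queueCount=2
  }
}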
