One interesting thing we were seeing a while back on large crawls, where we were fetching the best-scoring pages first, then the next best, and so on, is that lower-scoring pages typically had worse response times and higher timeout rates.

So while the best-scoring pages would respond very quickly and would have a < 1% timeout rate, the worst-scoring pages would take several times as long (I don't remember the exact ratio, but it was multiples) and could have as high as a 50% timeout rate. Just something to think about.

Dennis Kubes

Andrzej Bialecki wrote:
MilleBii wrote:
I have to say that I'm still puzzled. Here is the latest. I just restarted a
run, and then guess what:

I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could only get
3 Mbit/s max before (note: bits, not bytes, as I said before).
A few samples show that I was running at 50 fetches/sec ... not bad. But why
I got this high speed on this run, I haven't got the faintest idea.


Then it drops, and I get logs like this:

2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516

I don't fully understand why it is oscillating between two queue sizes, never
mind... but it is likely the end of the run, since Hadoop shows 99.99%
complete for the 2 map tasks it generated.

Would that be explained by a better URL mix?

I suspect that you have a bunch of hosts that slowly trickle the content, i.e. requests don't time out, crawl-delay is low, but the download speed is very very low due to the limits at their end (either physical or artificial).

The solution in that case would be to track a minimum average speed per FetchQueue, and lock out the queue if its average speed drops below that threshold (similar to what we do when we discover a crawl-delay that is too high). A rough sketch of the idea is below.
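
The following is only a minimal sketch of that per-queue speed tracking, not actual Fetcher code: the class name SlowQueueDetector, its methods, the smoothing factor, and the threshold are made up for illustration. It keeps an exponential moving average of bytes/second per queue (host) and flags queues that stay below a minimum speed.

// Hypothetical helper, not part of Nutch: tracks average download speed
// per queue (host) and flags queues that fall below a minimum bytes/sec.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SlowQueueDetector {
  private final double minBytesPerSec;   // lock-out threshold, e.g. 1024
  private final double alpha = 0.3;      // smoothing factor for the moving average
  private final Map<String, Double> avgSpeed = new ConcurrentHashMap<>();

  public SlowQueueDetector(double minBytesPerSec) {
    this.minBytesPerSec = minBytesPerSec;
  }

  /** Record one completed fetch for the given queue (host). */
  public void recordFetch(String queueId, long bytes, long millis) {
    if (millis <= 0) return;
    double speed = bytes * 1000.0 / millis;   // bytes per second
    avgSpeed.merge(queueId, speed,
        (oldAvg, current) -> alpha * current + (1 - alpha) * oldAvg);
  }

  /** True if the queue's average speed has dropped below the threshold. */
  public boolean shouldLockOut(String queueId) {
    Double avg = avgSpeed.get(queueId);
    return avg != null && avg < minBytesPerSec;
  }
}

The fetcher would call recordFetch() after each completed request and check shouldLockOut() before polling the queue again, the same way a too-high crawl-delay currently disables a queue.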

In the meantime, you could add the number of FetchQueue-s to that diagnostic output, to see how many unique hosts are in the current working set.
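
As a quick illustration of that (again hypothetical, not the actual Fetcher.java code; a plain map stands in for FetchItemQueues here), the status line would just carry one extra field:

// Hypothetical sketch: extend the fetcher's diagnostic line with the number
// of per-host queues in the current working set.
import java.util.HashMap;
import java.util.Map;

public class FetcherStatusLine {
  public static String status(int activeThreads, int spinWaiting,
                              Map<String, Integer> queues) {
    int totalSize = queues.values().stream().mapToInt(Integer::intValue).sum();
    return "-activeThreads=" + activeThreads
        + ", spinWaiting=" + spinWaiting
        + ", fetchQueues.totalSize=" + totalSize
        + ", fetchQueues.queueCount=" + queues.size();  // the new field
  }

  public static void main(String[] args) {
    Map<String, Integer> queues = new HashMap<>();
    queues.put("host-a.example", 400);
    queues.put("host-b.example", 116);
    System.out.println(status(100, 100, queues));
    // prints: -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516, fetchQueues.queueCount=2
  }
}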
