Dennis,

Interesting info. I don't use the standard OPIC scorer but a slightly modified version which boosts pages with content that I'm looking for... so it could be that my pages are generally on slow servers.
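Roughly, the boost works like this (a minimal sketch only; the class name, term list and boost factor below are illustrative, not the actual patch):

import java.util.List;

// Illustrative sketch of a content-based score boost: the OPIC score
// is multiplied when the parsed page text contains a term of interest.
// All names and the factor are assumptions, not the real modification.
public class ContentBoostedScorer {

    private final List<String> targetTerms;  // terms I'm crawling for
    private final float boostFactor;         // e.g. 2.0f

    public ContentBoostedScorer(List<String> targetTerms, float boostFactor) {
        this.targetTerms = targetTerms;
        this.boostFactor = boostFactor;
    }

    /** Returns the OPIC score, boosted if the page text matches any term. */
    public float score(float opicScore, String pageText) {
        String text = pageText.toLowerCase();
        for (String term : targetTerms) {
            if (text.contains(term.toLowerCase())) {
                return opicScore * boostFactor;
            }
        }
        return opicScore;
    }
}

The net effect is that the generator sorts pages I care about to the front of the fetch list, regardless of how well connected the hosts serving them are.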
Now heads-up: I just started a new run with 450k URLs and it looks like I'm
back to the previous behaviour:
+ 4 Mb/s for a few minutes
+ steady 1.9 Mb/s ... probably for ages, since it really means only around
10-15 fetches/s
Why did the previous run go so fast? I'm still wondering.

2009/11/26 Dennis Kubes <[email protected]>

> One interesting thing we were seeing a while back on large crawls, where
> we were fetching the best-scoring pages first, then the next best, and so
> on, is that lower-scoring pages typically had worse response times and
> worse timeout rates.
>
> So while the best-scoring pages would respond very quickly and would have
> a < 1% timeout rate, the worst-scoring pages would take x times as long
> (I don't remember the exact ratio but it was multiples) and could have as
> high as a 50% timeout rate. Just something to think about.
>
> Dennis Kubes
>
>
> Andrzej Bialecki wrote:
>
>> MilleBii wrote:
>>
>>> I have to say that I'm still puzzled. Here is the latest. I just
>>> restarted a run and then guess what:
>>>
>>> I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could
>>> only get 3 Mbit/s max before (note: bits, not bytes, as I said before).
>>> A few samples show that I was running at 50 fetches/sec ... not bad.
>>> But why this high speed on this run, I haven't got the faintest idea.
>>>
>>> Then it drops and I get this kind of log:
>>>
>>> 2009-11-25 23:28:28,584 INFO fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=516
>>> 2009-11-25 23:28:29,227 INFO fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=120
>>> 2009-11-25 23:28:29,584 INFO fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=516
>>> 2009-11-25 23:28:30,227 INFO fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=120
>>> 2009-11-25 23:28:30,585 INFO fetcher.Fetcher - -activeThreads=100,
>>> spinWaiting=100, fetchQueues.totalSize=516
>>>
>>> I don't fully understand why it is oscillating between two queue sizes
>>> (never mind), but it is likely the end of the run, since Hadoop shows
>>> 99.99% complete for the 2 maps it generated.
>>>
>>> Would that be explained by a better URL mix?
>>
>> I suspect that you have a bunch of hosts that slowly trickle the
>> content, i.e. requests don't time out, crawl-delay is low, but the
>> download speed is very, very low due to limits at their end (either
>> physical or artificial).
>>
>> The solution in that case would be to track a minimum avg. speed per
>> FetchQueue, and lock out the queue if this number falls below the
>> threshold (similarly to what we do when we discover a crawl-delay that
>> is too high).
>>
>> In the meantime, you could add the number of FetchQueue-s to that
>> diagnostic output, to see how many unique hosts are in the current
>> working set.

--
-MilleBii-
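P.S. For the archives, a rough sketch of the per-queue lock-out Andrzej
describes above. The class and field names are illustrative, not the real
Nutch FetchQueue API, and the threshold and sample window are assumed
values:

// Sketch: track average download speed per host queue and lock the
// queue out once it proves to be a slow trickler. Names and numbers
// are assumptions for illustration, not Nutch internals.
public class QueueThroughputTracker {

    // Assumed threshold: queues averaging under ~2 KB/s get locked out.
    private static final float MIN_AVG_BYTES_PER_SEC = 2048f;
    // Don't judge a queue before it has been active for at least 30 s.
    private static final long MIN_SAMPLE_MILLIS = 30000L;

    private long bytesFetched = 0;
    private long activeMillis = 0;
    private boolean lockedOut = false;

    /** Record one completed fetch from this host's queue. */
    public void recordFetch(long bytes, long elapsedMillis) {
        bytesFetched += bytes;
        activeMillis += elapsedMillis;
        if (activeMillis >= MIN_SAMPLE_MILLIS
                && avgBytesPerSec() < MIN_AVG_BYTES_PER_SEC) {
            // Same idea as dropping a host whose crawl-delay is too high.
            lockedOut = true;
        }
    }

    /** Average download speed observed for this queue so far. */
    public float avgBytesPerSec() {
        return activeMillis == 0 ? Float.MAX_VALUE
                                 : (bytesFetched * 1000f) / activeMillis;
    }

    /** The fetcher would skip queues for which this returns true. */
    public boolean isLockedOut() {
        return lockedOut;
    }
}

Locking out only after a minimum active time avoids penalising a host for
a single slow page; the fetcher would simply skip locked-out queues, and
logging the total queue count alongside fetchQueues.totalSize would show
how many unique hosts remain in the working set, as suggested.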
