Did not think of that one... interesting. How and where do you control the number of FetchQueues? I only use the defaults, so I assume there is only one.
How should I go about analyzing the content of a generated fetchlist?
Is it possible to increase the number of fetchers in a single-node configuration? If not, then I may turn to a configuration with two low-spec servers instead of one mid-range server... I will get more out of my bucks ;-)

2009/11/26 Andrzej Bialecki <[email protected]>

> MilleBii wrote:
>
>> I have to say that I'm still puzzled. Here is the latest. I just restarted a
>> run and then guess what:
>>
>> got ultra-high speed: 8Mbits/s sustained for 1 hour where I could only get
>> 3Mbit/s max before (note: bits and not bytes as I said before).
>> A few samples show that I was running at 50 Fetches/sec ... not bad. But why
>> this high speed on this run I haven't got the faintest idea.
>>
>> Then it drops and I get this kind of log:
>>
>> 2009-11-25 23:28:28,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
>> 2009-11-25 23:28:29,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
>> 2009-11-25 23:28:29,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
>> 2009-11-25 23:28:30,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
>> 2009-11-25 23:28:30,585 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
>>
>> Don't fully understand why it is oscillating between two queue sizes, never
>> mind.... but it is likely the end of the run since hadoop shows 99.99%
>> complete for the 2 maps it generated.
>>
>> Would that be explained by a better URL mix?
>>
>
> I suspect that you have a bunch of hosts that slowly trickle the content,
> i.e. requests don't time out, crawl-delay is low, but the download speed is
> very very low due to the limits at their end (either physical or
> artificial).
>
> The solution in that case would be to track a minimum avg. speed per
> FetchQueue, and lock-out the queue if this number crosses the threshold
> (similarly to what we do when we discover a crawl-delay that is too high).
>
> In the meantime, you could add the number of FetchQueue-s to that
> diagnostic output, to see how many unique hosts are in the current working
> set.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>

--
-MilleBii-
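
To make the idea quoted above more concrete (a minimum average speed per FetchQueue, with a lock-out when a queue falls below it), here is a rough standalone sketch in Python. It is not Nutch code and does not use any Nutch API: the names `QueueStats`, `SpeedTracker`, `min_bytes_per_sec` and `min_samples` are all made up for illustration, and it assumes one queue per host name. It also reports the number of queues, which is the figure suggested as an extra diagnostic.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class QueueStats:
    """Running totals for one per-host queue (hypothetical structure)."""
    bytes_fetched: int = 0
    seconds_spent: float = 0.0
    fetches: int = 0
    locked_out: bool = False

    def avg_speed(self) -> float:
        """Average download speed in bytes/sec over all recorded fetches."""
        if self.seconds_spent <= 0:
            return 0.0
        return self.bytes_fetched / self.seconds_spent


class SpeedTracker:
    """Tracks average speed per host and locks out hosts that are too slow."""

    def __init__(self, min_bytes_per_sec: float, min_samples: int = 3):
        # Both thresholds are assumptions; sensible values depend on the crawl.
        self.min_bytes_per_sec = min_bytes_per_sec
        self.min_samples = min_samples
        self.queues: dict[str, QueueStats] = {}

    def record(self, url: str, nbytes: int, seconds: float) -> bool:
        """Record one finished fetch; return True if the queue stays usable."""
        host = urlparse(url).hostname or ""
        stats = self.queues.setdefault(host, QueueStats())
        stats.bytes_fetched += nbytes
        stats.seconds_spent += seconds
        stats.fetches += 1
        # Only judge a queue once it has enough samples to be meaningful.
        if (stats.fetches >= self.min_samples
                and stats.avg_speed() < self.min_bytes_per_sec):
            stats.locked_out = True
        return not stats.locked_out

    def queue_count(self) -> int:
        """Number of unique hosts seen so far."""
        return len(self.queues)

    def active_queues(self) -> int:
        """Number of hosts that have not been locked out."""
        return sum(1 for s in self.queues.values() if not s.locked_out)

    def status_line(self) -> str:
        """Illustrative diagnostic string (not the real Nutch log format)."""
        locked = self.queue_count() - self.active_queues()
        return f"fetchQueues={self.queue_count()}, lockedOut={locked}"
```

A fetcher loop would call `record()` after every completed download and skip any further URLs for a host once it returns False. Waiting for `min_samples` fetches before judging a host is a design choice to avoid locking out a host because of a single slow page.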
