I had not thought of that one... interesting.

How and where do you control the number of FetchQueues? I only use the
default settings, so I assume there is only one.

What should I do if I want to analyze the content of a generated fetchlist?

Is it possible to increase the number of fetchers on a single-node
configuration? If not, then I may turn to a configuration with two low-spec
servers instead of one mid-range one... I will get more bang for my buck ;-)

2009/11/26 Andrzej Bialecki <[email protected]>

> MilleBii wrote:
>
>> I have to say that I'm still puzzled. Here is the latest. I just restarted a
>> run and then guess what:
>>
>> I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could only get
>> 3 Mbit/s max before (note: bits, not bytes as I said before).
>> A few samples show that I was running at 50 fetches/sec... not bad. But why
>> this run was so fast, I haven't got the faintest idea.
>>
>>
>> Then it drops and I get this kind of log:
>>
>> 2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
>> 2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
>> 2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
>> 2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
>> 2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
>>
>> I don't fully understand why it is oscillating between two queue sizes, but
>> never mind... It is likely the end of the run, since Hadoop shows 99.99%
>> complete for the 2 maps it generated.
>>
>> Could that be explained by a better URL mix?
>>
>
> I suspect that you have a bunch of hosts that slowly trickle the content,
> i.e. requests don't time out, crawl-delay is low, but the download speed is
> very very low due to the limits at their end (either physical or
> artificial).
>
> The solution in that case would be to track a minimum avg. speed per
> FetchQueue, and lock-out the queue if this number crosses the threshold
> (similarly to what we do when we discover a crawl-delay that is too high).
>
> In the meantime, you could add the number of FetchQueue-s to that
> diagnostic output, to see how many unique hosts are in the current working
> set.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
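
The per-queue minimum speed check suggested in the quoted reply could look
roughly like the sketch below. It is only an illustration under stated
assumptions: QueueSpeedTracker, its methods, and the threshold values are
hypothetical names made up here and are not part of the Nutch Fetcher. The
queue id is assumed to be the host name, and the caller is assumed to report
the bytes downloaded and the elapsed time after each fetch.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not a Nutch class): tracks the average download
 * speed per fetch queue and flags queues that fall below a minimum.
 */
class QueueSpeedTracker {
    /** Running totals for one queue. */
    private static final class Stats {
        long bytes;   // total bytes downloaded from this queue
        long millis;  // total time spent downloading, in milliseconds
        int samples;  // number of fetches recorded
    }

    private final Map<String, Stats> statsByQueue = new HashMap<String, Stats>();
    private final double minBytesPerSec; // assumed threshold, tune to taste
    private final int minSamples;        // fetches required before judging a queue

    QueueSpeedTracker(double minBytesPerSec, int minSamples) {
        this.minBytesPerSec = minBytesPerSec;
        this.minSamples = minSamples;
    }

    /** Record one completed fetch for the given queue. */
    synchronized void record(String queueId, long bytes, long elapsedMillis) {
        Stats s = statsByQueue.get(queueId);
        if (s == null) {
            s = new Stats();
            statsByQueue.put(queueId, s);
        }
        s.bytes += bytes;
        s.millis += Math.max(1, elapsedMillis); // avoid division by zero later
        s.samples++;
    }

    /** True if the queue has enough samples and its average speed is too low. */
    synchronized boolean shouldLockOut(String queueId) {
        Stats s = statsByQueue.get(queueId);
        if (s == null || s.samples < minSamples) {
            return false; // not enough evidence yet
        }
        double bytesPerSec = s.bytes * 1000.0 / s.millis;
        return bytesPerSec < minBytesPerSec;
    }

    /** Number of distinct queues seen so far. */
    synchronized int queueCount() {
        return statsByQueue.size();
    }
}
```

A fetcher thread would call record() after each download and skip or drop a
queue once shouldLockOut() returns true. queueCount() gives the number of
distinct queues seen, which is the kind of figure the quoted reply suggests
adding to the diagnostic output.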


-- 
-MilleBii-
