I looked at the bandwidth profile of my last two runs and they have
the same shape:

 it starts at 5 MBytes/s and decreases to below 500 kBytes/s, following
a 1/x kind of curve shape.

Fairly even distribution of URLs, and local DNS is running...

 I can't find a good explanation for this behavior... It looks to me
as if fetching becomes much less effective once the fetch queues fill up.
I use 100 threads.
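A minimal sketch of one plausible cause (my own simulation, not Nutch code): if each host has its own fetch queue and a fixed politeness delay, the aggregate rate equals the number of hosts that still have pages left. As the small hosts run dry, only a few large hosts remain, and throughput collapses toward a long-tail curve. The host counts and delay below are illustrative assumptions.

```python
# Hypothetical model: 100 hosts with a skewed (Pareto) page distribution,
# and a politeness delay of one fetch per host per second.
import random

random.seed(42)

DELAY = 1.0  # seconds between fetches to the same host (assumed)
pages = [max(1, int(random.paretovariate(1.0) * 50)) for _ in range(100)]

rates = []
while any(pages):
    # In one politeness interval, each non-empty host queue yields one fetch.
    active = sum(1 for p in pages if p > 0)
    rates.append(active / DELAY)  # aggregate fetches/s during this interval
    pages = [p - 1 if p > 0 else 0 for p in pages]

print(f"start: {rates[0]:.0f} fetches/s, "
      f"after {len(rates)} intervals: {rates[-1]:.0f} fetches/s")
```

The rate starts at (number of hosts)/delay and decays as queues empty, which matches the 1/x-like bandwidth profile described above.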

2009/11/24, MilleBii <[email protected]>:
> So I do indeed have a local DNS server running; it does not seem to help much.
>
> Just finished a run of 80K URLs. At the beginning the speed is around 15
> fetches/s, and at the end, due to the long-tail effect, I get below 1 fetch/s...
> Average over a 12h34 run: 1.7 fetches/s, pretty slow really.
>
> I limit URLs per site to 1000 to reduce the long-tail effect... However, I
> have half a dozen sites which I need to index, and they each have around 60k
> URLs, so it means I will need 60 runs to get them indexed. So I would like to
> increase the limit, and with it the long-tail effect. Interesting dilemma.
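The trade-off above can be made explicit with a bit of arithmetic (illustrative only, not Nutch code): the number of runs needed to exhaust one site is the site's URL count divided by the per-site cap, rounded up.

```python
# Illustrative arithmetic for the per-site-limit vs. number-of-runs trade-off.
import math

def runs_needed(site_urls: int, per_site_limit: int) -> int:
    """Crawl runs required to exhaust one site at a given per-site cap."""
    return math.ceil(site_urls / per_site_limit)

print(runs_needed(60_000, 1_000))   # 60 runs at the current limit of 1000
print(runs_needed(60_000, 10_000))  # 6 runs if the limit were raised to 10k
```

Raising the cap cuts the number of runs linearly, but keeps those large hosts active in every segment for longer, which is exactly the long-tail effect being traded against.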
>
> 2009/11/24 MilleBii <[email protected]>
>
>> Why would local DNS caching help? It only works if you are going to
>> crawl the same sites often, in which case you are hit by the
>> politeness delay instead.
>>
>> If your segments contain only (or mainly) different sites, it is not
>> really going to help.
>>
>> So far I have not seen my quad-core + 100 Mb/s + pseudo-distributed
>> Hadoop setup go faster than 10 fetches/s... Let me check the DNS and I
>> will tell you.
>>
>> I vote for 100 fetches/s; I'm not sure how to get there, though.
>>
>>
>>
>> 2009/11/24, Dennis Kubes <[email protected]>:
>> > Hi Mark,
>> >
>> > I just put this up on the wiki.  Hope it helps:
>> >
>> > http://wiki.apache.org/nutch/OptimizingCrawls
>> >
>> > Dennis
>> >
>> >
>> > Mark Kerzner wrote:
>> >> Hi, guys,
>> >>
>> >> my goal is to do my crawls at 100 fetches per second, observing, of
>> >> course, polite crawling. But when the URLs are all from different
>> >> domains, what theoretically would stop some software from downloading
>> >> from 100 domains at once, achieving the desired speed?
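The theoretical ceiling implied here is easy to state (my arithmetic, not from the thread): with a politeness delay of d seconds per host and H distinct hosts being fetched in parallel, the aggregate rate is at most H/d fetches per second. The delay value below is an assumed example.

```python
# Back-of-the-envelope throughput ceiling under per-host politeness.
def max_fetch_rate(hosts_in_parallel: int, delay_s: float) -> float:
    """At most one fetch per host per politeness interval."""
    return hosts_in_parallel / delay_s

# 100 distinct domains with an assumed 1 s politeness delay:
print(max_fetch_rate(100, 1.0))  # 100.0 fetches/s in theory
```

So 100 fetches/s is reachable in principle, but only while the generator keeps 100 distinct hosts in flight; once the segment narrows to a few hosts, the same formula gives the observed collapse.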
>> >>
>> >> But whatever I do, I can't make Nutch crawl at that speed. Even if it
>> >> starts at a few dozen URLs/second, it slows down at the end (as
>> >> discussed by many, including Krugler).
>> >>
>> >> Should I write something of my own, or are there fast crawlers?
>> >>
>> >> Thanks!
>> >>
>> >> Mark
>> >>
>> >
>>
>> --
>> Sent from my mobile
>>
>> -MilleBii-
>>
>
>
>
> --
> -MilleBii-
>

-- 
Sent from my mobile

-MilleBii-
