So I do indeed have a local DNS server running; it does not seem to help much.

Just finished a run of 80K URLs. At the beginning the speed is around 15
fetches/s, and at the end, due to long-tail effects, it drops below 1 fetch/s...
Average over a 12h34m run: 1.7 fetches/s. Pretty slow, really.

I limit URLs per site to 1000 to contain the long-tail effect... However, I
have half a dozen sites which I need to index, and they have around 60k URLs
each, so it means I will need 60 runs to get them indexed. So I would like to
increase the limit, but that also increases the long-tail effect. Interesting
dilemma.
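
For reference, that cap is set in conf/nutch-site.xml. A minimal sketch,
assuming the Nutch 1.x property name generate.max.per.host (check
nutch-default.xml in your version for the exact name and default):

<!-- conf/nutch-site.xml: cap on URLs generated per host per segment.
     Property name assumed from Nutch 1.x; verify in nutch-default.xml. -->
<property>
  <name>generate.max.per.host</name>
  <value>1000</value>
  <!-- Raising this means fewer runs for the big 60k-URL sites, at the
       cost of a longer per-segment tail on those same hosts. -->
</property>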

2009/11/24 MilleBii <[email protected]>

> Why would local DNS caching work? It only helps if you are going to
> crawl the same site often... in which case you are hit by the
> politeness delay (relevant settings are sketched below the quoted thread).
>
> If you have segments with only/mainly different sites, it is not really
> going to help.
>
> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
> Hadoop setup going faster than 10 fetches/s... Let me check the DNS and I
> will tell you.
>
> I vote for 100 fetches/s; not sure how to get there, though.
>
>
>
> 2009/11/24, Dennis Kubes <[email protected]>:
> > Hi Mark,
> >
> > I just put this up on the wiki.  Hope it helps:
> >
> > http://wiki.apache.org/nutch/OptimizingCrawls
> >
> > Dennis
> >
> >
> > Mark Kerzner wrote:
> >> Hi, guys,
> >>
> >> my goal is to do my crawls at 100 fetches per second, observing, of
> >> course, polite crawling. But when the URLs are all on different domains,
> >> what, theoretically, would stop some software from downloading from
> >> 100 domains at once, achieving the desired speed?
> >>
> >> But whatever I do, I can't make Nutch crawl at that speed. Even if it
> >> starts at a few dozen URLs/second, it slows down at the end (as discussed
> >> by many, and by Krugler).
> >>
> >> Should I write something of my own, or are there fast crawlers?
> >>
> >> Thanks!
> >>
> >> Mark
> >>
> >
>
> --
> Sent from my mobile
>
> -MilleBii-
>
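
On the politeness point in the thread above: with a per-host delay of d
seconds, one host can never exceed 1/d fetches/s no matter how many threads
you run, so 100 fetches/s is only reachable while the fetch list still spans
enough distinct hosts (e.g. 100 threads over 100+ domains at roughly 1 s per
fetch). A minimal sketch of the relevant knobs, assuming the Nutch 1.x
property names (verify in nutch-default.xml):

<!-- conf/nutch-site.xml: politeness and parallelism settings.
     Property names assumed from Nutch 1.x; check nutch-default.xml
     in your version for exact names and defaults. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <!-- Total fetcher threads; only pays off while the fetch list
       still contains enough distinct hosts. -->
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <!-- Keep at 1 for polite crawling. -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <!-- Seconds between successive requests to the same host; a single
       host is capped at 1/delay fetches per second. -->
</property>

With the default 5.0 s delay, half a dozen big hosts alone can never
contribute more than about 1.2 fetches/s combined, which matches the
sub-1 fetch/s tail described at the top of this message.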



-- 
-MilleBii-
