Hi Otis

> In my case the crawl would be wide, which is good for URL distribution,
> but bad for DNS. What's recommended for DNS caching? I do see
> http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean setting
> up a local DNS server (e.g. bind) or something like pdnsd or something
> else?

I used bind for local DNS caching when running a 400-node cluster on EC2
for Similarpages; I am sure there are other tools which work just as well
[...]

> > The time spent in generate and update is proportional to the size of
> > the crawldb. It might take half the time at one point but will take
> > more than that. The best option would probably be to generate multiple
> > segments in one go (see the options for the Generator), fetch all the
> > segments one by one, then merge them with the crawldb in a single call
> > to update.
>
> Right. But with time (or, more precisely, as the crawldb grows) this
> generation will start taking more and more time, and there is no way
> around that, right?

Nope. Nutch 2.0 will be faster for the updates compared to 1.x, but the
generation will still be proportional to the size of the crawldb.

> > You will also inevitably hit slow servers which will have an impact on
> > the fetch rate - although not as bad as before the introduction of the
> > timeout on fetching.
>
> Right, I remember this problem. So now one can specify how long each
> fetch should last and fetching will stop when that time is reached?

Exactly - you give it, say, 60 mins and it will stop fetching after that.
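Putting the two suggestions above together - generating several segments in
one go and capping each fetch with the time limit - a crawl round could look
roughly like the sketch below. This is only an illustration in local mode:
the paths are examples, and the exact option and property names vary between
Nutch releases, so check the usage output of the generate and updatedb
commands for your version.

  # example paths, adjust to your layout
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments

  # generate several segments in a single pass instead of one per round
  # (see 'bin/nutch generate' usage for the exact options in your release)
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 100000 -maxNumSegments 5

  # fetch and parse each segment one by one; the per-fetch time limit is a
  # config property (fetcher.timelimit.mins in recent 1.x releases, if I
  # remember the name correctly) set in nutch-site.xml
  for segment in $SEGMENTS/*; do
    bin/nutch fetch $segment
    bin/nutch parse $segment
  done

  # merge all the fetched segments back into the crawldb in a single update
  bin/nutch updatedb $CRAWLDB -dir $SEGMENTS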
> How does one guess what time limit to pick, especially since fetch runs
> can vary in terms of how fast they are depending on what hosts are in
> them?

Empirically :-) Take a largish value, observe the fetch and the point at
which it starts to slow down, then reduce accordingly. Sounds a bit like a
recipe, doesn't it?

> Wouldn't it be better to express this in requests/second instead of time,
> so that you can say "when fetching goes below N requests per second and
> stays like that for M minutes, abort the fetch"?

This would be a nice feature indeed. The timeout is an efficient but
somewhat crude mechanism; it proved useful though, as fetches could hang on
a single host for a loooooong time, which on a large cluster means big
money.

> What if you have a really fast fetch run going on, but the time limit is
> still reached and the fetch aborted? What do you do? Restart the fetch
> with the same list of generated URLs as before? Somehow restart with only
> the unfetched URLs? Generate a whole new fetchlist (which ends up being
> slow)?

You won't need to restart the fetch with the same list. The unfetched URLs
should end up in the next round of generation.

> > As a result your crawl will just be churning URLs generated
> > automatically from adult sites, and despite the fact that your crawldb
> > will contain loads of URLs there will be very few useful ones.
>
> One man's trash is another man's...

Even if adult sites are what you really want to crawl, there is still a
need for filtering / normalisation strategies.

> > Anyway, it's not just a matter of pages / second. Doing large, open
> > crawls brings up a lot of interesting challenges :-)
>
> Yup. Thanks Julien!

You are welcome.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

