Chirag Chaman wrote:
We found that the bottlenecks to a faster crawl and index are the
following:
1. WebDB Size
2. Recrawling Blocked URLs (not remembering domain status across crawls)

Point 1 should be intuitive -- the larger the DB, the more time it takes to
sort. The second point relates to the fact that the fetcher does not
remember the status of a domain across crawls -- if you are blocked from a
particular domain, future fetch lists should not even contain URLs from that
domain/directory. Another issue is when a domain is down -- this should also
be stored for a period of time (say 12 hours).
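A minimal sketch of the idea -- remembering per-domain status with a TTL so
later fetch lists can skip blocked or down domains. This is not actual Nutch
code; the class and method names (DomainStatusCache, shouldSkip) are made up
for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-domain status cache with expiry. A real version would
// need to be persisted across crawl cycles (e.g. alongside the WebDB).
public class DomainStatusCache {
    public enum Status { OK, BLOCKED, DOWN }

    private static class Entry {
        final Status status;
        final long expiresAt; // epoch millis after which the entry is stale
        Entry(Status status, long expiresAt) {
            this.status = status;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // Remember a domain's status for ttlMillis, e.g. 12 hours for a down host.
    public void put(String domain, Status status, long ttlMillis) {
        cache.put(domain, new Entry(status, System.currentTimeMillis() + ttlMillis));
    }

    // True if URLs from this domain should be left out of the next fetch list.
    public boolean shouldSkip(String domain) {
        Entry e = cache.get(domain);
        if (e == null) return false;
        if (System.currentTimeMillis() > e.expiresAt) {
            cache.remove(domain); // entry expired: give the domain another chance
            return false;
        }
        return e.status != Status.OK;
    }
}
```

The fetch-list generator would consult shouldSkip() per URL's host before
emitting it, so a blocked or down domain costs one lookup instead of a
doomed fetch.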

Another approach for these might just be a caching proxy, like Squid. You could configure this to cache only robots.txt and dead hosts.
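A rough sketch of what that Squid configuration could look like. The
directives are real Squid options, but the values here are assumptions to
match the 12-hour window mentioned above -- check your Squid version's
documentation before using them:

```
# Keep robots.txt cached for up to 12 hours (720 minutes),
# regardless of what the origin server's headers suggest.
refresh_pattern -i /robots\.txt$ 720 100% 720

# Remember failed DNS lookups (dead hosts) instead of retrying
# on every fetch attempt.
negative_dns_ttl 12 hours

# Briefly cache negative responses (e.g. 404s) as well.
negative_ttl 5 minutes
```

With the crawler pointed at the proxy, repeated robots.txt fetches and
retries against dead hosts are absorbed by Squid rather than hitting the
network each time.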


Doug


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers