Chirag Chaman wrote:
We found that the bottlenecks to a faster crawl and index are the
following:
1. WebDB Size
2. Recrawling Blocked URLs (not remembering domain status across crawls)

Point 1 should be intuitive -- the larger the DB, the more time it takes to
sort. The second point relates to the fact that the fetcher does not
remember the status of a domain across crawls -- if you are blocked from a
particular domain, future fetch lists should not even contain URLs from that
domain/directory. Another issue is when a domain is down -- this should also
be stored for a period of time (say 12 hours).
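A minimal sketch of the idea -- remembering per-domain status with a TTL so
later fetch lists can skip blocked or down domains. This is not actual Nutch
code; the class and method names (DomainStatusCache, shouldSkip) are made up
for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-domain status cache with expiry. A real version would
// need to be persisted across crawl cycles (e.g. alongside the WebDB).
public class DomainStatusCache {
    public enum Status { OK, BLOCKED, DOWN }

    private static class Entry {
        final Status status;
        final long expiresAt; // epoch millis after which the entry is stale
        Entry(Status status, long expiresAt) {
            this.status = status;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // Remember a domain's status for ttlMillis, e.g. 12 hours for a down host.
    public void put(String domain, Status status, long ttlMillis) {
        cache.put(domain, new Entry(status, System.currentTimeMillis() + ttlMillis));
    }

    // True if URLs from this domain should be left out of the next fetch list.
    public boolean shouldSkip(String domain) {
        Entry e = cache.get(domain);
        if (e == null) return false;
        if (System.currentTimeMillis() > e.expiresAt) {
            cache.remove(domain); // entry expired: give the domain another chance
            return false;
        }
        return e.status != Status.OK;
    }
}
```

The fetch-list generator would consult shouldSkip() per URL's host before
emitting it, so a blocked or down domain costs one lookup instead of a
doomed fetch.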

Another approach for these might just be a caching proxy, like Squid. You could configure this to cache only robots.txt and dead hosts.
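A rough sketch of what that Squid configuration could look like. The
directives are real Squid options, but the values here are assumptions to
match the 12-hour window mentioned above -- check your Squid version's
documentation before using them:

```
# Keep robots.txt cached for up to 12 hours (720 minutes),
# regardless of what the origin server's headers suggest.
refresh_pattern -i /robots\.txt$ 720 100% 720

# Remember failed DNS lookups (dead hosts) instead of retrying
# on every fetch attempt.
negative_dns_ttl 12 hours

# Briefly cache negative responses (e.g. 404s) as well.
negative_ttl 5 minutes
```

With the crawler pointed at the proxy, repeated robots.txt fetches and
retries against dead hosts are absorbed by Squid rather than hitting the
network each time.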


Doug


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers