Optimization in crawling and indexing

Rupesh Mankar Mon, 14 Dec 2009 03:05:01 -0800

I want to know whether bandwidth use can be optimized when crawling with Nutch.



a)    Crawling: After the initial crawl, can Nutch fetch ONLY updated 
documents? Running the re-crawl command every 6 hours crawls and fetches all 
documents again ['db.fetch.interval.default' is set to 6 hours]. It should 
bring back only the documents that have changed.



Does Nutch internally use a HEAD request to check whether a document (HTML, 
PDF or Doc) has changed?
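
To make the question concrete, here is a rough sketch of the kind of check I 
have in mind. It is plain Java using java.net.HttpURLConnection, not Nutch 
code: the class name ChangeCheck and both method names are made up for 
illustration, and I am not assuming Nutch does anything like this internally. 
It also assumes the server sends a usable Last-Modified header or honours 
If-Modified-Since, which not every server does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class ChangeCheck {

    /**
     * Decide from the response status and Last-Modified value whether the
     * document should be fetched again. Times are milliseconds since the
     * epoch; lastModified is 0 when the server sent no Last-Modified header.
     */
    public static boolean needsRefetch(int status, long lastModified, long lastFetchTime) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return false; // 304: server reports no change since lastFetchTime
        }
        if (lastModified == 0L) {
            return true; // no header to compare against, so refetch to be safe
        }
        return lastModified > lastFetchTime;
    }

    /** Send a HEAD request with If-Modified-Since and apply the rule above. */
    public static boolean hasChanged(String url, long lastFetchTime) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");           // headers only, no body
            conn.setIfModifiedSince(lastFetchTime);  // adds If-Modified-Since
            return needsRefetch(conn.getResponseCode(),
                                conn.getLastModified(),
                                lastFetchTime);
        } finally {
            conn.disconnect();
        }
    }
}
```

With something like this, only the documents for which hasChanged(...) 
returns true would need a full GET, which is where the bandwidth saving would 
come from.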



b)    Indexing: Can I find out, based on a timestamp, how many documents have 
changed since the last re-crawl?


Thanks,
Rupesh

