I want to see if there is any possible bandwidth optimization while using Nutch.
a) Crawling: After the initial crawl, can Nutch fetch ONLY the documents that have been updated? With 'db.fetch.interval.default' set to 6 hours, running the re-crawl command every 6 hours crawls and fetches all documents again. It should bring back only the updated documents. Does Nutch internally use a HEAD request to check whether a document (HTML, PDF, or Doc) has changed?

b) Indexing: Can I find out, based on a timestamp, how many documents have changed since the last re-crawl?

Thanks,
Rupesh