I am using Nutch and want to know how to run daily crawls with it.
Specifically, I would like to know:

1. How to run a crawl that keeps running all the time and keeps updating the crawldb.
2. Whether it can avoid re-crawling pages that have been crawled
recently. Basically, I don't want it to waste bandwidth by re-crawling
pages that were crawled just a few days earlier. It should crawl
only those pages that might be out of date in the crawldb.
3. Whether it is possible to schedule the crawl so that fetching runs
at one set time and indexing at another set time every
day.

Please share your knowledge or point me to URLs, FAQs, tutorials, etc.
where I can learn more about this.
