I am using Nutch and want to know how to run daily crawls with it.
Specifically, I would like to know:

1. How to run a crawl that keeps running all the time and keeps updating
the crawldb.
2. Whether it can avoid re-crawling pages that have been crawled
recently. Basically, I don't want it to waste bandwidth by re-crawling
pages that were crawled just a few days before. It should crawl
only those pages that might be out of date in the crawldb.
3. Whether it is possible to schedule the crawl, so that it fetches at
one fixed time and indexes at another fixed time every day.

Please share your knowledge or point me to URLs, FAQs, tutorials, etc.
where I can learn more about this.

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
