I am using Nutch and would like to know how to do daily crawls with it. Specifically, I want to know about the following:
1. Running a crawl continuously so that it keeps updating the crawldb.
2. Avoiding re-crawls of pages that have been crawled recently. Basically, I don't want it to waste bandwidth by recrawling pages that were crawled just a few days before. It should crawl only those pages which might be out of date in the crawldb.
3. Scheduling the crawl, for example fetching at one set time and indexing at another set time every day.

Please share your knowledge or point me to URLs, FAQs, tutorials, etc. where I can learn more about this.