Hi, I'm trying to solve a problem, but I can't find any feature in Nutch that addresses it.
Let's say my intranet has 1000 sites. Sites 1 to 100 have pages that are never going to change, i.e. they are static, so I don't need to crawl them again and again; however, new pages may be added to these sites. Sites 101 to 500 have fairly dynamic content, where I expect the content to change significantly about every 7 days. Sites 501 to 1000 are very dynamic: almost any page can change every day.

How can I set up recrawls so that:

1) the existing pages of the first group (sites 1-100) are not re-crawled, but any new pages that appear on those sites are crawled;
2) all pages of the second group are re-crawled at an interval of 7 days;
3) all pages of the third group are re-crawled every day;
4) any new URLs injected into the crawl db during a recrawl are crawled as well?

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
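To make the three intervals concrete, here is a sketch of the kind of per-group configuration I've been hoping to find. It assumes the Injector honors a per-URL `nutch.fetchInterval` metadata key in the seed list (a feature I've read about for newer Nutch versions but haven't confirmed in mine); the hostnames are made up, and intervals are in seconds:

```
# seed-static.txt -- sites 1-100: effectively never re-fetch existing pages
# (one year; new links discovered on these sites would still be fetched
#  once, since they enter the crawl db as unfetched URLs)
http://site001.intranet/	nutch.fetchInterval=31536000

# seed-weekly.txt -- sites 101-500: re-fetch every 7 days
http://site101.intranet/	nutch.fetchInterval=604800

# seed-daily.txt -- sites 501-1000: re-fetch every day
http://site501.intranet/	nutch.fetchInterval=86400
```

URLs injected later without their own metadata (requirement 4) would presumably fall back to the global `db.fetch.interval.default` property in nutch-site.xml, so that would need to be set to a sensible value as well. If anyone can confirm whether this metadata (or an equivalent mechanism such as a configurable fetch schedule) exists, I'd appreciate it.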