Hi,

I am trying to solve a scheduling problem, but I cannot find any
feature in Nutch that addresses it.

Let's say in my intranet there are 1000 sites.

Sites 1 to 100 have pages that never change, i.e. they are static, so
I don't need to re-crawl them over and over. However, new pages may be
added to these sites over time.

Sites 101 to 500 are fairly dynamic; I expect their content to change
significantly about every 7 days.

Sites 501 to 1000 are very dynamic; any page may change almost every
day.
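
For reference, the closest knob I have found is the global re-fetch
interval in nutch-default.xml, which, if I read it correctly, applies
one interval to every URL in the crawldb and so cannot distinguish
between the three groups:

  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
    <description>The default number of days between re-fetches of a
    page.</description>
  </property>

Overriding this in nutch-site.xml changes the interval for everything
at once, which is exactly what I want to avoid.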

So, how can I set up recrawls so that:

1) existing pages of the first group (sites 1-100) are not re-crawled,
but any new pages that appear on those sites are fetched;

2) all pages of the second group are re-crawled every 7 days;

3) all pages of the third group are re-crawled every day;

4) any new URLs injected into the crawldb are picked up on the next
recrawl (see the sketch below).
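
To make the requirements concrete, the only workaround I can think of
(a rough sketch, untested) is to split the seed lists into three
separate crawldbs, one per group, each with its own conf directory
whose db.default.fetch.interval matches that group, and to drive them
from cron at different frequencies. The recrawl.sh name and the
per-group directory layout below are my own invention, but the
bin/nutch commands are the standard generate/fetch/updatedb cycle:

  # recrawl.sh <group_dir> -- assumes <group_dir>/crawldb was injected
  # from that group's seed list and <group_dir>/conf sets the interval
  export NUTCH_CONF_DIR=$1/conf
  bin/nutch generate $1/crawldb $1/segments   # select URLs whose interval has elapsed
  segment=`ls -d $1/segments/* | tail -1`     # newest segment
  bin/nutch fetch $segment                    # fetch the due pages
  bin/nutch updatedb $1/crawldb $segment      # merge results and newly found links

Cron would run this daily for the third group and weekly for the
second, and anything injected with bin/nutch inject would be picked up
by the next generate, which covers point 4. But for the first group I
do not see how new pages would even be discovered without re-fetching
at least the pages that link to them, and maintaining three crawldbs
and merging three indexes feels heavy, which is why I am asking
whether there is a per-URL or per-host interval feature I have missed.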
