Hi, we have a somewhat similar case and perform the following:
1. Put all URLs you want to recrawl into regex-urlfilter.txt (as exclusion rules)
2. Perform a bin/nutch mergedb with the -filter param to strip those URLs from the crawldb *
3. Put the URLs from step 1 into a seed file
4. Remove the URLs from step 1 from regex-urlfilter.txt
5. Start the crawl with the seed file from step 3

* This is a merge of the crawldb onto itself, for example:
bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter

I don't know whether this is the best way to do it, but since we automated it, it works very well (a rough sketch of the automation is appended below the quoted mail).

Regards
Hannes

On Mon, Apr 2, 2012 at 11:07 AM, Jan Riewe <jan.ri...@comspace.de> wrote:
> Hi there,
>
> Until now I have not found a way to crawl a specific page manually.
> Is there a possibility to manually set the recrawl interval or the crawl
> date, or any other explicit way to make Nutch invalidate a page?
>
> We have got 70k+ pages in the index and a full recrawl would take too
> long.
>
> Thanks
> Jan
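
A minimal sketch of what the automation could look like. The paths ($CRAWLFOLDER, the seed directory, the URL list file) and the crawl parameters are placeholders, not our exact setup, and it assumes the recrawl URLs go into regex-urlfilter.txt as exclusion rules so that -filter drops them:

#!/bin/bash
# Rough automation of steps 1-5; adjust paths and crawl options to your setup.
CRAWLFOLDER=/data/nutch/crawl
FILTER=conf/regex-urlfilter.txt
SEEDDIR=/data/nutch/recrawl_seed
RECRAWL_URLS=urls_to_recrawl.txt    # one URL per line

# 1) add exclusion rules for the URLs so the URL filter rejects them
cp $FILTER $FILTER.bak
while read url; do
  # escape regex metacharacters first if your URLs contain any
  echo "-^$url" >> $FILTER
done < $RECRAWL_URLS

# 2) merge the crawldb onto itself with -filter to strip those URLs,
#    then swap the merged db in place of the old one
bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter
rm -r $CRAWLFOLDER/crawldb
mv $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb

# 3) use the same URLs as the seed list
mkdir -p $SEEDDIR
cp $RECRAWL_URLS $SEEDDIR/seed.txt

# 4) restore the original filter so the URLs may be fetched again
mv $FILTER.bak $FILTER

# 5) crawl with the seed list from step 3
bin/nutch crawl $SEEDDIR -dir $CRAWLFOLDER -depth 1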