The FreeGenerator tool is the easiest approach.
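For example, a minimal sketch of that route, assuming a 1.x-style crawl directory under $CRAWL and an illustrative seed directory name (adjust paths for your setup):

# one URL per line in a plain text file inside a directory
mkdir -p recrawl_urls
echo "http://www.example.com/page-to-refresh.html" > recrawl_urls/urls.txt

# generate a segment directly from that URL list, bypassing the crawldb schedule
bin/nutch freegen recrawl_urls $CRAWL/segments

# fetch, parse and update the crawldb for the new segment as usual
SEGMENT=$(ls -d $CRAWL/segments/* | tail -1)
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWL/crawldb $SEGMENT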

On Mon, 2 Apr 2012 11:29:02 +0200, Hannes Carl Meyer <hannesc...@googlemail.com> wrote:
Hi,

we have kind of a similar case and we perform the following:

1. Put all URLs you want to recrawl in regex-urlfilter.txt (as exclusion patterns, so the filter rejects them)
2. Run bin/nutch mergedb with the -filter param to strip those URLs from the crawldb *
3. Put the URLs from 1 into a seed file
4. Remove the URLs from 1 from regex-urlfilter.txt
5. Start the crawl with the seed file from 3

* This is a merge of the crawldb onto itself, for example: bin/nutch mergedb
$CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter
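For reference, a rough end-to-end sketch of the steps above (the NEWMergeDB and recrawl_seeds names are just placeholders, and it assumes the URLs to recrawl are already excluded in regex-urlfilter.txt):

# step 2: filter the excluded URLs out of the crawldb (merge onto itself)
bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter
rm -r $CRAWLFOLDER/crawldb
mv $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb

# steps 3-5: after removing the exclusions from regex-urlfilter.txt again,
# re-inject the URLs from a seed directory and crawl as usual
bin/nutch inject $CRAWLFOLDER/crawldb recrawl_seeds
bin/nutch generate $CRAWLFOLDER/crawldb $CRAWLFOLDER/segments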

I don't know whether this is the best way to do it, but since we automated it, it
works very well.

Regards

Hannes

On Mon, Apr 2, 2012 at 11:07 AM, Jan Riewe <jan.ri...@comspace.de> wrote:

Hi there,

so far I have not found a way to crawl a specific page manually.
Is there a way to manually set the recrawl interval or the crawl
date, or any other explicit way to make Nutch invalidate a page?

We have 70k+ pages in the index and a full recrawl would take too
long.

Thanks
Jan


--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
