Hi, we have a somewhat similar case and perform the following:
1. Put all URLs you want to recrawl into regex-urlfilter.txt (as exclusion rules)
2. Perform a bin/nutch mergedb with the -filter param to strip those URLs from the crawldb *
3. Put the URLs from step 1 into a seed file
4. Remove the URLs from step 1 from regex-urlfilter.txt
5. Start the crawl with the seed file from step 3

* This is a merge of the crawldb onto itself, for example:
bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter

I don't know whether this is the best way to do it, but since we automated it, it works very well (a rough sketch of the automation is appended below the quoted mail).

Regards
Hannes

On Mon, Apr 2, 2012 at 11:07 AM, Jan Riewe <jan.ri...@comspace.de> wrote:
> Hi there,
>
> Until now I have not found a way to crawl a specific page manually.
> Is there a possibility to manually set the recrawl interval or the crawl
> date, or any other explicit way to make Nutch invalidate a page?
>
> We have got 70k+ pages in the index and a full recrawl would take too
> long.
>
> Thanks
> Jan
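
A minimal sketch of what the automation could look like. The paths ($CRAWLFOLDER, the seed directory, the URL list file) and the crawl parameters are placeholders, not our exact setup, and it assumes the recrawl URLs go into regex-urlfilter.txt as exclusion rules so that -filter drops them:

#!/bin/bash
# Rough automation of steps 1-5; adjust paths and crawl options to your setup.
CRAWLFOLDER=/data/nutch/crawl
FILTER=conf/regex-urlfilter.txt
SEEDDIR=/data/nutch/recrawl_seed
RECRAWL_URLS=urls_to_recrawl.txt    # one URL per line

# 1) add exclusion rules for the URLs so the URL filter rejects them
cp $FILTER $FILTER.bak
while read url; do
  # escape regex metacharacters first if your URLs contain any
  echo "-^$url" >> $FILTER
done < $RECRAWL_URLS

# 2) merge the crawldb onto itself with -filter to strip those URLs,
#    then swap the merged db in place of the old one
bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter
rm -r $CRAWLFOLDER/crawldb
mv $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb

# 3) use the same URLs as the seed list
mkdir -p $SEEDDIR
cp $RECRAWL_URLS $SEEDDIR/seed.txt

# 4) restore the original filter so the URLs may be fetched again
mv $FILTER.bak $FILTER

# 5) crawl with the seed list from step 3
bin/nutch crawl $SEEDDIR -dir $CRAWLFOLDER -depth 1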