Incremental crawling with nutch

Ali Nazemian Sun, 01 Jun 2014 07:47:29 -0700

Hi everybody,
I am going to use Nutch to crawl some news websites. These websites
are updated regularly, so I need to re-crawl them at least every 2
hours. The problem is that I want the re-crawl to be incremental: Nutch
should fetch only the URLs that are new and have not been fetched before
(except the main page of each site, which must be re-fetched to extract
new URLs). In each re-crawl, only the new URLs should be fetched and sent
to Solr for indexing. Could somebody guide me through this scenario with
Nutch 1.8?
Best regards.


-- 
A.Nazemian
