Hi - this should not happen. The only thing I can imagine is that the updatedb 
step doesn't succeed, but in that case nothing would get indexed either. 
You can inspect a URL with the readdb tool; check it before and after the crawl.
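For example, assuming your crawl directory is crawl/ (adjust the path and URL to 
your setup), you can dump the CrawlDb entry for a single page and compare its 
fetch time and status before and after the second run:

  bin/nutch readdb crawl/crawldb -url http://www.example.com/somepage.html

If the fetch time is still in the past after updatedb has run, the fetch interval 
(db.fetch.interval.default, roughly 30 days by default) was never applied and the 
page will be selected for fetching again on the next crawl.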


-----Original message-----
> From: vetus <ve...@isac.cat>
> Sent: Thu 15-Nov-2012 15:41
> To: user@nutch.apache.org
> Subject: re-Crawl re-fetch all pages each time
> 
> Hello,
> 
> I have a problem...
> 
> I'm trying to index a small domain, and I'm using
> org.apache.nutch.crawl.Crawler to do it. The problem is that after the
> crawler has indexed all the pages of the domain, I execute the crawler
> again... and it fetches all the pages again although the fetch interval has
> not expired...
> This is a problem because it generates a lot of unnecessary connections...
> 
> I'm using the default config and this is the command that I execute:
> 
> org.apache.nutch.crawl.Crawler  -depth 1 -threads 1 -topN 5
> 
> Can you help me? please
> 
> Thanks
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/re-Crawl-re-fetch-all-pages-each-time-tp4020464.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
