re-Crawl re-fetch all pages each time

2012-11-15 Thread vetus
Hello, I have a problem... I'm trying to index a small domain, and I'm using org.apache.nutch.crawl.Crawler to do it. The problem is that after the crawler has indexed all the pages of the domain, I execute the crawler again... and it fetches all the pages again, although the fetch interval has not

Re: only fetch home page

2012-11-15 Thread vetus
I have the same problem, but in my case, when I do the re-crawl, Nutch fetches all pages again. Can somebody help me please? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/only-fetch-home-page-tp617901p4020466.html Sent from the Nutch - User mailing list archive at

Re: How to re-fetch all the modified page?

2012-11-15 Thread vetus
I have the same problem, but in my case, when I do the re-crawl, Nutch fetches all pages again. Can somebody help me please? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-re-fetch-all-the-modified-page-tp2980301p4020467.html Sent from the Nutch - User

Re: site-specific crawling policies

2012-11-15 Thread Sourajit Basak
You probably need to customize the parse-metatags plugin. I think you can go ahead and include all possible metatags, and take care of missing metatags in Solr. On Thu, Nov 15, 2012 at 12:22 AM, Joe Zhang smartag...@gmail.com wrote: I understand conf/regex-urlfilter.txt; I can put domain names into

RE: re-Crawl re-fetch all pages each time

2012-11-15 Thread Markus Jelsma
Hi - this should not happen. The only thing I can imagine is that the update step doesn't succeed, but that would mean nothing is going to be indexed either. You can inspect a URL using the readdb tool; check before and after. -Original message- From: vetus ve...@isac.cat Sent: Thu
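
The reasoning in the reply above can be illustrated with a small Python sketch. It is a simplified model, not Nutch's actual code: `PageRecord`, `is_due`, and `apply_update` are hypothetical names, and the dates and the 30-day interval are made-up values. It only shows why a lost update step would make every page look due again on the next crawl.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical stand-in for one crawl database entry (not Nutch's own type)."""
    url: str
    last_fetch: datetime
    fetch_interval: timedelta


def is_due(record: PageRecord, now: datetime) -> bool:
    """A page is due again only once its fetch interval has elapsed."""
    return now >= record.last_fetch + record.fetch_interval


def apply_update(record: PageRecord, fetched_at: datetime) -> PageRecord:
    """Model a successful update step: store the new fetch time."""
    return replace(record, last_fetch=fetched_at)


# Assumed example values: an old first crawl, a re-crawl, and a crawl one day later.
first_crawl = datetime(2012, 10, 1)
second_crawl = datetime(2012, 11, 15)
third_crawl = second_crawl + timedelta(days=1)

before = PageRecord("http://example.com/", first_crawl, timedelta(days=30))
after = apply_update(before, second_crawl)

# Update succeeded: the page is not due one day after the re-crawl.
print(is_due(after, third_crawl))   # False
# Update lost: the stale record is still due, so the page is fetched again.
print(is_due(before, third_crawl))  # True
```

Comparing the stored entry for one URL before and after a crawl cycle, as suggested, is the real-world counterpart of comparing `before` and `after` here.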

Re: site-specific crawling policies

2012-11-15 Thread Joe Zhang
Well, these are all details. The bigger question is, how do we separate the crawling policy of site A from that of site B? On Thu, Nov 15, 2012 at 7:41 AM, Sourajit Basak sourajit.ba...@gmail.com wrote: You probably need to customize parse-metatags plugin. I think you go ahead and include all