Hello,
I have a problem. I'm trying to index a small domain, and I'm using
org.apache.nutch.crawl.Crawler to do it. The problem is that after the
crawler has indexed all the pages of the domain, I execute the crawler
again... and it fetches all the pages again, although the fetch interval has not
elapsed.
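(For context on the behaviour described above: re-fetch timing in Nutch is driven by the fetch interval stored per URL in the crawldb, whose default comes from the `db.fetch.interval.default` property. A minimal sketch of overriding it in `conf/nutch-site.xml`; the 30-day value here is only illustrative:)

```xml
<!-- conf/nutch-site.xml (fragment) -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- seconds a page waits before being re-fetched; 2592000 = 30 days -->
  <value>2592000</value>
</property>
```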
I have the same problem, but in my case, when I do the re-crawl, Nutch fetches
all pages again.
Can somebody help me please?
Thanks
--
View this message in context:
http://lucene.472066.n3.nabble.com/only-fetch-home-page-tp617901p4020466.html
Sent from the Nutch - User mailing list archive at Nabble.com.
You probably need to customize the parse-metatags plugin.
I think you should go ahead and include all possible metatags, and take care of
missing metatags in Solr.
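(A sketch of what "include all possible metatags" can look like. The property names below come from the parse-metatags and index-metadata plugins; which specific tags to index is an assumption for illustration:)

```xml
<!-- conf/nutch-site.xml (fragment) -->
<property>
  <name>metatags.names</name>
  <!-- '*' tells parse-metatags to extract every metatag it encounters -->
  <value>*</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- parse-metadata keys (prefixed by parse-metatags) to copy into the index -->
  <value>metatag.description,metatag.keywords</value>
</property>
```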
On Thu, Nov 15, 2012 at 12:22 AM, Joe Zhang smartag...@gmail.com wrote:
I understand conf/regex-urlfilter.txt; I can put domain names into
Hi - this should not happen. The only thing I can imagine is that the update
step doesn't succeed, but that would mean nothing is going to be indexed either.
You can inspect a URL using the readdb tool; check before and after.
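(A sketch of the readdb check suggested above. The crawldb path and URL are assumptions; `-url` and `-stats` are standard options of the readdb tool:)

```shell
# Dump the CrawlDatum for one URL before and after the updatedb step;
# compare the "Fetch time", "Retry interval" and "Status" fields between runs.
bin/nutch readdb crawl/crawldb -url http://www.example.com/

# Or print summary statistics for the whole crawldb:
bin/nutch readdb crawl/crawldb -stats
```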
-Original message-
From:vetus ve...@isac.cat
Sent: Thu
Well, these are all details. The bigger question is: how to separate the
crawling policy of site A from that of site B?
On Thu, Nov 15, 2012 at 7:41 AM, Sourajit Basak sourajit.ba...@gmail.comwrote:
You probably need to customize parse-metatags plugin.
I think you go ahead and include all
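(One common way to keep site A's policy separate from site B's, sketched here as an assumption rather than something from the thread: run two independent crawls, each with its own seed list, crawldb, and configuration directory. `bin/nutch` honours the `NUTCH_CONF_DIR` environment variable, so each crawl can carry its own filters and fetch intervals; the directory names below are hypothetical:)

```shell
# Separate seeds and crawldbs, so settings for A never touch B.
NUTCH_CONF_DIR=confA bin/nutch inject crawlA/crawldb urls/siteA
NUTCH_CONF_DIR=confB bin/nutch inject crawlB/crawldb urls/siteB

# Then run the generate/fetch/parse/updatedb cycle per crawldb, e.g.:
NUTCH_CONF_DIR=confA bin/nutch generate crawlA/crawldb crawlA/segments
```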