Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change
notification.
The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/FetchCycleOverlap
The comment on the change is:
This won't work 100% correctly - removing it so I don
You don't have to update CrawlDb after every fetch cycle, so keeping the
generated CrawlDatums from one generate run might be useful.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: wuqi <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
I might reconsider how important it is for us that we always get the
best URLs.
Perhaps your situation also applies to us.
The output of the segment (the number of docs generated plus the number of
outlinks) determines how much the CrawlDb changes after an updatedb. For
us, this is far less than t
- Original Message -
From: "Mathijs Homminga" <[EMAIL PROTECTED]>
To:
Sent: Wednesday, May 07, 2008 5:21 PM
Subject: Re: Internet crawl: CrawlDb getting big!
wuqi wrote:
I am also trying to improve the Generator's efficiency. In the current
Generator, all the URLs in the CrawlDb are dumped out and sorted during the
map process, and the reduce process tries to find the top N pages (or
maxPerHost pages) for you. If the number of pages in the CrawlDB is much
bigger t
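The selection wuqi describes (highest-scoring URLs first, capped at maxPerHost per host, stopping at top N) can be sketched outside MapReduce roughly as below. This is a minimal illustration, not Nutch's actual Generator code; the function name and the `(url, score)` pair format are assumptions for the example.

```python
from urllib.parse import urlparse

def select_top_n(scored_urls, top_n, max_per_host):
    """Pick the highest-scoring URLs, keeping at most max_per_host per host.

    scored_urls: iterable of (url, score) pairs; a higher score means the
    page should be fetched sooner. Loosely mirrors what the Generator's
    reduce phase does after the map phase has sorted entries by score.
    """
    per_host = {}
    selected = []
    # Sort by descending score, as the map phase's ordering would provide.
    for url, score in sorted(scored_urls, key=lambda pair: -pair[1]):
        host = urlparse(url).netloc
        if per_host.get(host, 0) >= max_per_host:
            continue  # this host's quota is exhausted, skip the URL
        per_host[host] = per_host.get(host, 0) + 1
        selected.append(url)
        if len(selected) == top_n:
            break  # top N reached, stop early
    return selected
```

The cost of the full sort over every CrawlDb entry is exactly the inefficiency being discussed: when the CrawlDb is far larger than N, most of the sorting work is thrown away.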