[Nutch Wiki] Update of "FetchCycleOverlap" by OtisGospodnetic

2008-05-07 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The following page has been changed by OtisGospodnetic: http://wiki.apache.org/nutch/FetchCycleOverlap The comment on the change is: This won't work 100% correctly - removing it so I don

Re: Internet crawl: CrawlDb getting big!

2008-05-07 Thread ogjunk-nutch
You don't have to update CrawlDb after every fetch cycle, so keeping the generated CrawlDatums from one generate run might be useful.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
> From: wuqi <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org

Re: Internet crawl: CrawlDb getting big!

2008-05-07 Thread Mathijs Homminga
I might reconsider how important it is for us that we always get the best URLs. Perhaps your situation also applies to us. The output of the segment (the number of docs generated plus the number of outlinks) determines how much the crawldb changes after an updatedb. For us, this is far less than t

Re: Internet crawl: CrawlDb getting big!

2008-05-07 Thread wuqi
----- Original Message -----
From: "Mathijs Homminga" <[EMAIL PROTECTED]>
To:
Sent: Wednesday, May 07, 2008 5:21 PM
Subject: Re: Internet crawl: CrawlDb getting big!

> wuqi wrote:
>> I am also trying to improve the Generator efficiency. With the current Generator, all the URLs in crawlDB are dum

Re: Internet crawl: CrawlDb getting big!

2008-05-07 Thread Mathijs Homminga
wuqi wrote: I am also trying to improve the Generator efficiency. With the current Generator, all the URLs in crawlDB are dumped out and ordered during the map process, and the reduce process then tries to find the top N pages or maxPerhost pages for you. If the number of pages in the CrawlDB is much bigger t
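
The selection step wuqi describes can be illustrated with a small standalone sketch. This is not Nutch's Generator code; it is a plain-Python approximation that assumes each CrawlDb entry is reduced to a (url, score) pair, and the names `select_top_n`, `top_n`, and `max_per_host` are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_top_n(entries, top_n, max_per_host):
    """Pick up to top_n URLs by descending score, capping each host.

    entries: iterable of (url, score) pairs (a simplified stand-in
    for CrawlDb records; an assumption of this sketch).
    """
    # "Map" side of the description: every entry is dumped and ordered.
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)

    per_host = defaultdict(int)
    selected = []
    # "Reduce" side: walk the ordered list and keep the best URLs,
    # skipping hosts that already reached their cap.
    for url, _score in ordered:
        if len(selected) >= top_n:
            break
        host = urlparse(url).hostname or ""
        if per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        selected.append(url)
    return selected


if __name__ == "__main__":
    sample = [
        ("http://a.example/1", 0.9),
        ("http://a.example/2", 0.8),
        ("http://a.example/3", 0.7),
        ("http://b.example/1", 0.6),
    ]
    print(select_top_n(sample, top_n=3, max_per_host=2))
```

Note that the sort touches every entry regardless of how small `top_n` is, which is the cost being discussed when the CrawlDb is large.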