You don't have to update CrawlDb after every fetch cycle, so keeping the generated CrawlDatums from one generate run might be useful.

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
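[Editorial note: a rough sketch of the deferred-updatedb cycle described above, in Java. This is not actual Nutch code; the class and the generate/fetchAndParse/updateDb hooks are hypothetical wrappers around the corresponding Nutch jobs, and the round count is arbitrary.]

    // Sketch of the "defer updatedb" cycle: several generate/fetch rounds
    // against the same CrawlDb, then one updatedb over all segments at once.
    // Hypothetical wrappers, not the Nutch API.
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.Path;

    public abstract class DeferredUpdateCycle {

      public void run(Path crawlDb, Path segmentsDir, long topN) throws Exception {
        List<Path> segments = new ArrayList<Path>();
        // Several generate/fetch rounds without rewriting the CrawlDb in between.
        // Each generate is assumed to keep track of the CrawlDatums it already
        // handed out, so later rounds don't select the same urls again.
        for (int round = 0; round < 3; round++) {
          Path segment = generate(crawlDb, segmentsDir, topN);
          fetchAndParse(segment);
          segments.add(segment);
        }
        // A single pass over the CrawlDb merges in all fetched segments at once,
        // instead of one full rewrite per segment.
        updateDb(crawlDb, segments);
      }

      // Hypothetical hooks; a real implementation would invoke the Nutch
      // generate, fetch, parse and updatedb tools here.
      protected abstract Path generate(Path crawlDb, Path segmentsDir, long topN) throws Exception;
      protected abstract void fetchAndParse(Path segment) throws Exception;
      protected abstract void updateDb(Path crawlDb, List<Path> segments) throws Exception;
    }

The catch is the comment in the loop: without remembering which CrawlDatums were already generated, every round would select the same top-scoring urls.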
----- Original Message ----
> From: wuqi <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
> Sent: Wednesday, May 7, 2008 5:36:39 AM
> Subject: Re: Internet crawl: CrawlDb getting big!
>
> ----- Original Message -----
> From: "Mathijs Homminga"
> To:
> Sent: Wednesday, May 07, 2008 5:21 PM
> Subject: Re: Internet crawl: CrawlDb getting big!
>
> > wuqi wrote:
> >> I am also trying to improve the Generator efficiency. In the current
> >> Generator, all the URLs in the crawlDB are dumped out and sorted during
> >> the map process, and the reduce process then picks the top N pages (or
> >> the max per host) for you. If the number of pages in the crawlDB is much
> >> bigger than N, do all the pages need to be dumped out during the map
> >> process? We may just need to emit (2~3)*N pages during the map process,
> >> and then let the reduce select N pages from those (2~3)*N pages. This
> >> might improve the Generator efficiency.
> >
> > Yes, the generate process will be faster. But of course less accurate.
> > And if you're working with generate.max.per.host, then it is likely that
> > your segment will be less than topN in size.
> >
> >> I think maybe the crawlDB can be stored in two layers: the first layer
> >> is the host, the second layer is the page URL. This can improve
> >> efficiency when using max pages per host to generate the fetch list.
> >
> > My first thought is that such an approach makes it hard to select the
> > best scoring urls.
>
> In my understanding, the best scoring url might not be that important. For
> example, if you want 10 URLs and I select 50 URLs from which to choose the
> top 10, that is good enough for me.
>
> > Perhaps we could design the process in such a way that some intermediate
> > results, like the part of the crawldb which is sorted during generation
> > (this contains all urls eligible for fetching), are saved and reused. Why
> > sort everything again each time when you know only a fraction of the
> > urls have been updated?
>
> The crawlDB might change dramatically after you update it from a fetched
> segment, so a pre-sorted crawlDB might not be useful for the next generate.
>
> > Mathijs
>
> >> HBase can greatly improve the updatedb efficiency, because there is no
> >> need to dump all URLs in the crawldb; it just needs to append a new
> >> column with DB_FETCHED for each URL that was fetched. The other benefit
> >> brought by HBase is that we can easily change the schema of the crawlDB,
> >> for example to add an IP address for each URL... I am not familiar with
> >> how HBase behaves under the interface, so selecting urls out might be a
> >> problem...
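[Editorial note: a sketch of what that point update could look like, using the current HBase client API rather than anything that existed in 2008. The table name "crawldb", column family "f" and the qualifiers below are invented for illustration.]

    // Sketch only: marks one url as fetched in an HBase-backed crawl table.
    // A single put touches only this row, instead of rewriting the whole CrawlDb.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MarkFetched {
      public static void markFetched(String url, long fetchTime) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("crawldb"))) {
          // Row key is the url; only the touched columns are written.
          Put put = new Put(Bytes.toBytes(url));
          put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("status"), Bytes.toBytes("DB_FETCHED"));
          put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("fetchTime"), Bytes.toBytes(fetchTime));
          table.put(put);
        }
      }
    }

A row key of the form reversed-host + url would also give the two-layer host/URL layout suggested above more or less for free, since HBase keeps rows sorted by key and a per-host selection becomes a prefix scan.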
> >> ----- Original Message -----
> >> From: "Mathijs Homminga"
> >> To:
> >> Sent: Wednesday, May 07, 2008 6:28 AM
> >> Subject: Internet crawl: CrawlDb getting big!
> >>
> >>> Hi all,
> >>>
> >>> The time needed to do a generate and an updatedb depends linearly on
> >>> the size of the CrawlDb. Our CrawlDb currently contains about 1.5
> >>> billion urls (some fetched, but most of them unfetched). We are using
> >>> Nutch 0.9 on a 15-node cluster. These are the times needed for these
> >>> jobs:
> >>>
> >>> generate: 8-10 hours
> >>> updatedb: 8-10 hours
> >>>
> >>> Our fetch job takes about 30 hours, in which we fetch and parse about
> >>> 8 million docs (limited by our current bandwidth). So we spend about
> >>> 40% of our time on CrawlDb administration.
> >>>
> >>> The first problem for us was that we didn't make the best use of our
> >>> bandwidth (40% of the time no fetching). We solved this by designing a
> >>> system which looks a bit like the FetchCycleOverlap
> >>> (http://wiki.apache.org/nutch/FetchCycleOverlap) recently suggested by
> >>> Otis.
> >>>
> >>> Another problem is that as the CrawlDb grows, the admin time increases.
> >>> One way to solve this is by increasing the topN each time so that the
> >>> ratio between the admin jobs and the fetch job remains constant.
> >>> However, we will end up with extremely long cycles and large segments.
> >>> Some of this we solved by generating multiple segments in one generate
> >>> job and only performing an updatedb when (almost) all of these segments
> >>> are fetched.
> >>>
> >>> But still: the number of urls we select (generate), and the number of
> >>> urls we update (updatedb), is very small compared to the size of the
> >>> CrawlDb. We were wondering if there is a way such that we don't need to
> >>> read in the whole CrawlDb each time. How about putting the CrawlDb in
> >>> HBase? Sorting (generate) might become a problem then... Is this issue
> >>> addressed in the Nutch2Architecture?
> >>>
> >>> I'm willing to spend some more time on this, so all ideas are welcome.
> >>>
> >>> Thanks,
> >>> Mathijs Homminga
> >>>
> >>> --
> >>> Knowlogy
> >>> Helperpark 290 C
> >>> 9723 ZA Groningen
> >>> The Netherlands
> >>> +31 (0)50 2103567
> >>> http://www.knowlogy.nl
> >>>
> >>> [EMAIL PROTECTED]
> >>> +31 (0)6 15312977
>
> > --
> > Knowlogy
> > Helperpark 290 C
> > 9723 ZA Groningen
> > +31 (0)50 2103567
> > http://www.knowlogy.nl
> >
> > [EMAIL PROTECTED]
> > +31 (0)6 15312977
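[Editorial note: coming back to wuqi's "(2~3)*N candidates" suggestion earlier in the thread, the map side of the generate job could look roughly like the sketch below. It is written against the old org.apache.hadoop.mapred API that Nutch's MapReduce jobs use, but it is not the actual Generator code: the class name, the generate.candidates.per.map property and the per-task cap are all invented here.]

    // Sketch only: a map-side cap on generate candidates, so the sort and
    // reduce phases see roughly (2~3)*topN records instead of the whole CrawlDb.
    import java.io.IOException;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.nutch.crawl.CrawlDatum;

    public class CappedSelectorMapper extends MapReduceBase
        implements Mapper<Text, CrawlDatum, FloatWritable, Text> {

      private long cap;          // e.g. (2~3)*topN divided by the number of map tasks
      private long emitted = 0;

      public void configure(JobConf job) {
        // Hypothetical property; it does not exist in Nutch.
        cap = job.getLong("generate.candidates.per.map", Long.MAX_VALUE);
      }

      public void map(Text url, CrawlDatum datum,
                      OutputCollector<FloatWritable, Text> output, Reporter reporter)
          throws IOException {
        if (emitted >= cap) {
          return;                                 // ignore the rest of this split
        }
        if (datum.getStatus() != CrawlDatum.STATUS_DB_UNFETCHED) {
          return;                                 // only unfetched pages are candidates
        }
        // Key on the negated score so the shuffle sorts best candidates first.
        output.collect(new FloatWritable(-datum.getScore()), url);
        emitted++;
      }
    }

The accuracy trade-off Mathijs mentions shows up here: within each split the cap is applied in record order, not score order, so high-scoring urls beyond the cap are simply never seen by the reducer.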