----- Original Message -----
From: "Mathijs Homminga" <[EMAIL PROTECTED]>
To: <nutch-dev@lucene.apache.org>
Sent: Wednesday, May 07, 2008 5:21 PM
Subject: Re: Internet crawl: CrawlDb getting big!
> wuqi wrote:
>> I am also trying to improve the Generator efficiency. In the current
>> Generator, all the URLs in the CrawlDb are dumped out and sorted during
>> the map process, and the reduce process then tries to find the top N
>> pages (or max-per-host pages) for you. If the number of pages in the
>> CrawlDb is much bigger than N, do all the pages need to be dumped out
>> during the map process? We may just need to emit (2~3)*N pages during
>> the map process, and then the reduce can select N pages from those
>> (2~3)*N pages. This might improve the Generator efficiency.
>
> Yes, the generate process will be faster. But of course less accurate.
> And if you're working with generate.max.per.host, then it is likely that
> your segment will be less than topN in size.
>
>> I think maybe the CrawlDb can be stored in two layers: the first layer
>> is the host, the second layer is the page URL. This can improve
>> efficiency when using max pages per host to generate the fetch list.
>
> My first thought is that such an approach makes it hard to select the
> best scoring urls.

In my understanding, the best scoring URLs might not be so important. For
example, if you want 10 URLs and I select 50 URLs from which to choose the
top 10, that is enough for me.

> Perhaps we could design the process in such a way that some intermediate
> results, like the part of the crawldb which is sorted during generation
> (this contains all urls eligible for fetching), are saved and reused. Why
> sort everything again each time when you know only a fraction of the
> urls have been updated?

The CrawlDb might change dramatically after you update it from a fetched
segment, so a pre-sorted CrawlDb might not be useful for the next generate.
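To make the (2~3)*N idea concrete, here is a rough sketch of a map-side
cut-off: each map task keeps only its locally best-scored entries in a
bounded min-heap and emits them in close(), so the sort and the reduce see
a few multiples of N records instead of the whole CrawlDb. This is not
Nutch's actual Selector code; it is written against the Hadoop 0.12-era
mapred API that Nutch 0.9 bundles, the class name and the
generate.preselect.limit knob are made up, and a plain Text/FloatWritable
pair stands in for Nutch's real url/CrawlDatum records.

    import java.io.IOException;
    import java.util.PriorityQueue;

    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TopKSelectMapper extends MapReduceBase implements Mapper {

      // (score, url) pair; natural order puts the lowest score first, so
      // the PriorityQueue below is a min-heap and poll() evicts the worst.
      private static class Entry implements Comparable<Entry> {
        final float score;
        final String url;
        Entry(float score, String url) { this.score = score; this.url = url; }
        public int compareTo(Entry other) {
          return Float.compare(score, other.score);
        }
      }

      private int limit;  // local cut-off, roughly (2~3)*N / numMapTasks
      private final PriorityQueue<Entry> heap = new PriorityQueue<Entry>();
      private OutputCollector collector;  // saved so close() can emit

      public void configure(JobConf job) {
        // made-up knob for this sketch
        limit = job.getInt("generate.preselect.limit", 3000);
      }

      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter)
          throws IOException {
        collector = output;
        // assumes key = url (Text), value = score (FloatWritable)
        heap.add(new Entry(((FloatWritable) value).get(), key.toString()));
        if (heap.size() > limit) {
          heap.poll();  // drop the currently lowest-scored entry
        }
      }

      public void close() throws IOException {
        if (collector == null) return;  // empty input split
        for (Entry e : heap) {  // emit only the local top-K, in any order;
          collector.collect(new FloatWritable(e.score), new Text(e.url));
        }                       // the shuffle sort orders them by score
      }
    }

The local limit trades accuracy for speed exactly as you note above: with M
map tasks the reduce sees at most M*limit candidates, so the true global
top N is only approximated.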
>
> Mathijs
>
>> HBase can greatly improve the updatedb efficiency, because there is no
>> need to dump all URLs in the CrawlDb: it just needs to append a new
>> column with DB_Fetched for each URL that was fetched. The other benefit
>> brought by HBase is that we can easily change the schema of the CrawlDb,
>> for example to add an IP address for each URL... I am not familiar with
>> how HBase behaves under the interface, so selecting URLs out (for
>> generation) might be a problem...
>>
>> ----- Original Message -----
>> From: "Mathijs Homminga" <[EMAIL PROTECTED]>
>> To: <nutch-dev@lucene.apache.org>
>> Sent: Wednesday, May 07, 2008 6:28 AM
>> Subject: Internet crawl: CrawlDb getting big!
>>
>>> Hi all,
>>>
>>> The time needed to do a generate and an updatedb depends linearly on
>>> the size of the CrawlDb. Our CrawlDb currently contains about 1.5
>>> billion urls (some fetched, but most of them unfetched). We are using
>>> Nutch 0.9 on a 15-node cluster. These are the times needed for these
>>> jobs:
>>>
>>> generate: 8-10 hours
>>> updatedb: 8-10 hours
>>>
>>> Our fetch job takes about 30 hours, in which we fetch and parse about
>>> 8 million docs (limited by our current bandwidth). So we spend about
>>> 40% of our time on CrawlDb administration.
>>>
>>> The first problem for us was that we didn't make the best use of our
>>> bandwidth (40% of the time no fetching). We solved this by designing a
>>> system which looks a bit like the FetchCycleOverlap
>>> (http://wiki.apache.org/nutch/FetchCycleOverlap) recently suggested by
>>> Otis.
>>>
>>> Another problem is that as the CrawlDb grows, the admin time increases.
>>> One way to solve this is by increasing the topN each time, so that the
>>> ratio between admin jobs and the fetch job remains constant. However,
>>> we will end up with extremely long cycles and large segments. Some of
>>> this we solved by generating multiple segments in one generate job and
>>> only performing an updatedb when (almost) all of these segments are
>>> fetched.
>>>
>>> But still: the number of urls we select (generate) and the number of
>>> urls we update (updatedb) is very small compared to the size of the
>>> CrawlDb. We were wondering if there is a way such that we don't need
>>> to read in the whole CrawlDb each time. How about putting the CrawlDb
>>> in HBase? Sorting (generate) might become a problem then... Is this
>>> issue addressed in the Nutch2Architecture?
>>>
>>> I'm happily willing to spend some more time on this, so all ideas are
>>> welcome.
>>>
>>> Thanks,
>>> Mathijs Homminga
>>>
>>> --
>>> Knowlogy
>>> Helperpark 290 C
>>> 9723 ZA Groningen
>>> The Netherlands
>>> +31 (0)50 2103567
>>> http://www.knowlogy.nl
>>>
>>> [EMAIL PROTECTED]
>>> +31 (0)6 15312977
>
> --
> Knowlogy
> Helperpark 290 C
> 9723 ZA Groningen
> +31 (0)50 2103567
> http://www.knowlogy.nl
>
> [EMAIL PROTECTED]
> +31 (0)6 15312977
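Returning to the HBase idea above (the DB_Fetched column and the "How about
putting the CrawlDb in HBase?" question): a rough sketch of what updatedb
could shrink to if the CrawlDb were an HBase table keyed by URL. Writes
then touch only the fetched rows instead of rewriting the whole CrawlDb.
This sketch uses today's HBase Java client rather than the 2008-era API,
and the "crawldb" table name and "cf" column family are invented for
illustration.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseUpdateDbSketch {

      // Mark a batch of fetched URLs: one mutation per fetched row,
      // instead of a map-reduce pass over all 1.5 billion entries.
      public static void markFetched(Table crawldb, List<String> fetchedUrls)
          throws IOException {
        List<Put> puts = new ArrayList<Put>();
        for (String url : fetchedUrls) {
          Put put = new Put(Bytes.toBytes(url));  // row key = URL
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("status"),
              Bytes.toBytes("DB_Fetched"));       // flip status in place
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("fetch_time"),
              Bytes.toBytes(System.currentTimeMillis()));
          puts.add(put);
        }
        crawldb.put(puts);  // touches only these rows, not the whole db
      }

      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("crawldb"))) {
          markFetched(table, Arrays.asList("http://example.com/"));
        }
      }
    }

The schema flexibility follows for free: adding an IP address per URL is
just another column in the family, no CrawlDb rewrite needed. The open
question from the thread remains open, though: generate still needs the
top-N by score, and a plain row scan over a URL-keyed table gives no score
ordering.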