----- Original Message ----- 
From: "Mathijs Homminga" <[EMAIL PROTECTED]>
To: <nutch-dev@lucene.apache.org>
Sent: Wednesday, May 07, 2008 5:21 PM
Subject: Re: Internet crawl: CrawlDb getting big!


> wuqi wrote:
>> I am also trying to improve the Generator efficiency. In the current Generator, 
>> all the URLs in the CrawlDb are dumped out and sorted during the map process, 
>> and the reduce process then selects the top N pages (or the max-per-host pages) 
>> for you. If the number of pages in the CrawlDb is much bigger than N, do all the 
>> pages really need to be dumped out during the map process? We may only need to 
>> emit (2~3)*N pages during the map process, and then the reduce selects N pages 
>> from those (2~3)*N pages. This might improve the Generator efficiency.
> Yes, the generate process will be faster. But of course less accurate. 
> And if you're working with generate.max.per.host, then it is likely that 
> your segment will be less than topN in size.
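To make the idea a bit more concrete, here is a rough sketch of the map side. It 
assumes the old Hadoop mapred API that Nutch 0.9 uses; the class name, the 
<url, score> record types and the candidate-limit property are made up for 
illustration, the real Generator works with CrawlDatum/SelectorEntry objects.

import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative sketch only: each map task keeps its best `limit` candidates
// (roughly (2~3)*N divided by the number of map tasks) instead of emitting
// every URL in the CrawlDb.
public class CandidateLimitMapper extends MapReduceBase
    implements Mapper<Text, FloatWritable, FloatWritable, Text> {

  private static class Candidate {
    final float score;
    final String url;
    Candidate(float score, String url) { this.score = score; this.url = url; }
  }

  private int limit;
  private OutputCollector<FloatWritable, Text> out;
  // lowest score at the head, so the worst surviving candidate is evicted first
  private final PriorityQueue<Candidate> best =
      new PriorityQueue<Candidate>(1024, (a, b) -> Float.compare(a.score, b.score));

  public void configure(JobConf job) {
    // hypothetical property name, just for the example
    limit = job.getInt("generate.candidates.per.map", 30000);
  }

  public void map(Text url, FloatWritable score,
      OutputCollector<FloatWritable, Text> output, Reporter reporter)
      throws IOException {
    out = output;                         // remember the collector for close()
    best.add(new Candidate(score.get(), url.toString()));
    if (best.size() > limit) {
      best.poll();                        // drop the lowest-scoring candidate
    }
  }

  public void close() throws IOException {
    // emit only the surviving candidates; the reduce still picks the global top N
    for (Candidate c : best) {
      out.collect(new FloatWritable(c.score), new Text(c.url));
    }
  }
}

The reduce still sorts and picks the global top N, but it only sees a few times N 
candidates per map task instead of the whole CrawlDb.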
>> I think maybe the CrawlDb could be stored in two layers: the first layer is the 
>> host, and the second layer is the page URL. This could improve efficiency when 
>> using max pages per host to generate the fetch list.
>>   
> My first thought is that such an approach makes it hard to select the 
> best scoring urls.
In my understanding, the best scoring URLs might not be so important. For example, 
if you want 10 URLs, selecting 50 candidate URLs and choosing the top 10 from those 
is enough for me.
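
And a rough sketch of how the host layer could look on the reduce side, again with 
the old mapred API; the class name and the <host, url> key/value layout are only 
for illustration:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative sketch only: the map side would emit <host, url>, so each reduce
// call sees all candidate URLs of one host (the "second layer") and can stop at
// generate.max.per.host without ranking the whole CrawlDb first.
public class PerHostReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private int maxPerHost;

  public void configure(JobConf job) {
    maxPerHost = job.getInt("generate.max.per.host", 100);
  }

  public void reduce(Text host, Iterator<Text> urls,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    int emitted = 0;
    while (urls.hasNext() && emitted < maxPerHost) {
      output.collect(urls.next(), host);  // url as key, host kept for reference
      emitted++;
    }
    // remaining URLs of this host are simply skipped for this segment
  }
}

This gives up strict best-score ordering within a host, which is exactly the 
trade-off above: good-enough candidates instead of the exact top N.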

> Perhaps we could design the process in such a way that some intermediate 
> results, like the part of the crawldb which is sorted during generation 
> (this contains all urls eligible for fetching), are saved and reused. Why 
> sort everything again each time when you know only a fraction of the 
> urls have been updated?
The CrawlDb might change dramatically after you update it from a fetched segment, 
so a pre-sorted CrawlDb might not be useful for the next generate.

> 
> Mathijs
> 
>> HBase can greatly improve the updatedb efficiency, because there is no need to 
>> dump all the URLs in the CrawlDb; it just needs to write a new column with 
>> DB_fetched for each URL that was fetched. The other benefit brought by HBase is 
>> that we can easily change the schema of the CrawlDb, for example to add an IP 
>> address for each URL... I am not familiar with how HBase behaves under the 
>> interface, so selecting URLs out of it (for generate) might be a problem...
>>   
>>
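To illustrate what I mean by appending a column instead of rewriting the whole 
CrawlDb, here is a rough sketch against the HBase Connection/Table/Put client API; 
the table name "crawldb", the column family "f" and the column qualifiers are made 
up for the example:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative sketch only: instead of rewriting the whole CrawlDb, updatedb
// would just write a status/fetch-time cell for each URL that was actually fetched.
public class CrawlDbHBaseUpdate {

  public static void markFetched(Connection conn, String url, long fetchTime)
      throws IOException {
    Table table = conn.getTable(TableName.valueOf("crawldb"));
    try {
      Put put = new Put(Bytes.toBytes(url));                    // row key = url
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("status"),
          Bytes.toBytes("db_fetched"));
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("fetchTime"),
          Bytes.toBytes(fetchTime));
      table.put(put);                                           // touches one row only
    } finally {
      table.close();
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      markFetched(conn, "http://example.com/", System.currentTimeMillis());
    }
  }
}

Generate would still need a scan or some score index over this table, which is the 
"selecting out" part I am not sure about.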
>> ----- Original Message ----- 
>> From: "Mathijs Homminga" <[EMAIL PROTECTED]>
>> To: <nutch-dev@lucene.apache.org>
>> Sent: Wednesday, May 07, 2008 6:28 AM
>> Subject: Internet crawl: CrawlDb getting big!
>>
>>
>>   
>>> Hi all,
>>>
>>> The time needed to do a generate and an updatedb depends linearly on the 
>>> size of the CrawlDb.
>>> Our CrawlDb currently contains about 1.5 billion urls (some fetched, but 
>>> most of them unfetched).
>>> We are using Nutch 0.9 on a 15-node cluster. These are the times needed 
>>> for these jobs:
>>>
>>> generate:    8-10 hours
>>> updatedb:   8-10 hours
>>>
>>> Our fetch job takes about 30 hours, in which we fetch and parse about 8 
>>> million docs (limited by our current bandwidth).
>>> So we spend about 40% of our time on CrawlDb administration.
>>>
>>> The first problem for us was that we didn't make the best use of our 
>>> bandwidth (40% of the time no fetching). We solved this by designing a 
>>> system which looks a bit like the FetchCycleOverlap 
>>> (http://wiki.apache.org/nutch/FetchCycleOverlap) recently suggested by Otis.
>>>
>>> Another problem is that as the CrawlDb grows, the admin time increases. 
>>> One way to solve this is by increasing the topN each time so the ratio 
>>> between admin jobs and the fetch job remains constant. However, we will 
>>> end up with extremely long cycles and large segments. Some of this we 
>>> solved by generating multiple segments in one generate job and only 
>>> performing an updatedb when (almost) all of these segments are fetched.
>>>
>>> But still: the number of urls we select (generate) and the number of 
>>> urls we update (updatedb) are very small compared to the size of the 
>>> CrawlDb. We were wondering if there is a way such that we don't need to 
>>> read in the whole CrawlDb each time.
>>> How about putting the CrawlDb in HBase? Sorting (generate) might become 
>>> a problem then...
>>> Is this issue addressed in the Nutch2Architecture?
>>>
>>> I'm happily willing to spend some more time on this, so all ideas are 
>>> welcome.
>>>
>>> Thanks,
>>> Mathijs Homminga
>>>
>>> -- 
>>> Knowlogy
>>> Helperpark 290 C
>>> 9723 ZA Groningen
>>> The Netherlands
>>> +31 (0)50 2103567
>>> http://www.knowlogy.nl
>>>
>>> [EMAIL PROTECTED]
>>> +31 (0)6 15312977
>>>
>>>     
>> >
> 
> -- 
> Knowlogy
> Helperpark 290 C
> 9723 ZA Groningen
> +31 (0)50 2103567
> http://www.knowlogy.nl
> 
> [EMAIL PROTECTED]
> +31 (0)6 15312977
> 
>
