Hi all,

The time needed for a generate and an updatedb grows linearly with the size of the CrawlDb. Our CrawlDb currently contains about 1.5 billion URLs (some fetched, but most of them unfetched). We are using Nutch 0.9 on a 15-node cluster. These are the times we see for these jobs:

generate:    8-10 hours
updatedb:   8-10 hours

Our fetch job takes about 30 hours, in which we fetch and parse about 8 million docs (limited by our current bandwidth).
So we spend about 40% of our time on CrawlDb administration: 16-20 hours of generate plus updatedb for every ~30 hours of fetching.

The first problem for us was that we didn't make the best use of our bandwidth (no fetching during 40% of the time). We solved this by designing a system that looks a bit like the FetchCycleOverlap (http://wiki.apache.org/nutch/FetchCycleOverlap) recently suggested by Otis: the admin jobs for one cycle run concurrently with the fetch of another, as sketched below.
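
To make this concrete, here is a minimal sketch of the overlap in Java. The runGenerate/runFetch/runUpdateDb methods are hypothetical stand-ins for the corresponding Nutch jobs (Generator, Fetcher, CrawlDb), not actual Nutch 0.9 API calls:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: fetch segment N in the background while the CrawlDb admin
// jobs (updatedb for segment N-1, generate for segment N+1) run in the
// foreground, so the bandwidth is never idle waiting for admin work.
public class OverlappedCrawl {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);

        String previous = null;
        String next = runGenerate();
        while (true) {
            final String current = next;
            // Start fetching the current segment in the background...
            Future<?> fetch = pool.submit(() -> runFetch(current));
            // ...and meanwhile run the admin jobs for the other cycles.
            if (previous != null) {
                runUpdateDb(previous);
            }
            // Caveat: this generate must not re-select URLs that sit in
            // a segment which is still being fetched.
            next = runGenerate();
            fetch.get(); // wait for the fetch before starting the next cycle
            previous = current;
        }
    }

    // Hypothetical wrappers; in reality these would launch the Nutch
    // generate/fetch/updatedb MapReduce jobs and return the segment path.
    static String runGenerate() { return "segment-" + System.currentTimeMillis(); }
    static void runFetch(String segment) { }
    static void runUpdateDb(String segment) { }
}

The caveat in the comment is the hard part: without some bookkeeping, a generate that runs before the previous segment's updatedb will select the same URLs again.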

Another problem is that as the CrawlDb grows, the admin time increases. One way to solve this is to increase topN each time, so that the ratio between the admin jobs and the fetch job remains constant. However, we would end up with extremely long cycles and very large segments. We solved some of this by generating multiple segments in one generate job and only performing an updatedb when (almost) all of these segments have been fetched; a sketch of that follows below.
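
In outline, the batching looks something like this (same hypothetical wrapper methods as above; SEGMENTS_PER_BATCH is a made-up knob):

// Sketch: amortize one generate pass and one updatedb pass over a whole
// batch of fetched segments, instead of paying for both on every cycle.
public class BatchedCrawl {

    static final int SEGMENTS_PER_BATCH = 5;

    public static void main(String[] args) {
        while (true) {
            // One pass over the CrawlDb yields several segments at once.
            String[] segments = runGenerate(SEGMENTS_PER_BATCH);
            for (String segment : segments) {
                runFetch(segment);
            }
            // A single updatedb pass then absorbs the whole batch.
            runUpdateDb(segments);
        }
    }

    static String[] runGenerate(int batchSize) { return new String[batchSize]; }
    static void runFetch(String segment) { }
    static void runUpdateDb(String[] segments) { }
}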

But still: the number of URLs we select (generate) and the number of URLs we update (updatedb) are very small compared to the size of the CrawlDb, yet both jobs read the whole thing. We are wondering whether there is a way to avoid reading the entire CrawlDb each time. How about putting the CrawlDb in HBase? Sorting (generate) might become a problem then (a rough sketch of the updatedb side follows below)...
Is this issue addressed in the Nutch2Architecture?
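
For the updatedb side, the hope is that a table store turns a full rewrite of the CrawlDb into a few point writes per fetched URL. A minimal sketch, written against the current HBase client API purely for illustration (the table name, column family and qualifiers are all made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: with the CrawlDb kept in an HBase table keyed by URL, updating
// the status of one fetched page is a point write, so updatedb cost is
// proportional to the number of fetched pages, not to the size of the db.
public class HBaseCrawlDbUpdate {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table crawlDb = conn.getTable(TableName.valueOf("crawldb"))) {

            Put put = new Put(Bytes.toBytes("http://www.example.com/"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("status"),
                          Bytes.toBytes("fetched"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("fetchTime"),
                          Bytes.toBytes(System.currentTimeMillis()));
            crawlDb.put(put);

            // Generate remains the open question: selecting the topN URLs
            // by score needs a sort, and a table keyed by URL does not give
            // us that ordering for free (secondary index? periodic scan?).
        }
    }
}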

I'm happy to spend some more time on this, so all ideas are welcome.

Thanks,
Mathijs Homminga

--
Knowlogy
Helperpark 290 C
9723 ZA Groningen
The Netherlands
+31 (0)50 2103567
http://www.knowlogy.nl

[EMAIL PROTECTED]
+31 (0)6 15312977

