scott cotton wrote:
1) As you mention, keep the append/create architecture and break it up into
buckets, using a hash (or consistent hash for a little more flexibility). One notable advantage is that this could be done as a front end on current
code, or just about any other implementation.

This is what the DistributedWebDBWriter does already, no?

A disadvantage is that it still
isn't really optimal for updates/inserts with the current storage of the webdb.

I'd argue that it is optimal. More on that below.

2) use something like Berkeley DB, which will increase space usage
by, I'd guess, about 100-150%, but will allow for fast inserts/updates/deletes. Sounds better to me than the current approach, but for large installations
we may run into hardware limits without compressing the data. I've heard
of Berkeley DB being used to store 100GB databases. I guess a large Nutch
installation may push or break that size.

We started out using Berkeley DB and it became very slow when the database was large. The problem is that B-trees get fragmented as they grow. Each update eventually requires a random access, a disk seek, which takes around 10 milliseconds.


Consider this: if each B-tree page holds, say, 100 page or link entries, and we're updating at least 1% of all entries in the B-tree, then in the course of a db update we'll visit every page in the B-tree, each via a random access. It is much faster to pre-sort the updates and then merge them with the database: all disk operations are sequential and hence operate at the transfer rate, typically around 10MB/second, nearly 100 times faster than random seeks.
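The sort-and-merge idea can be sketched roughly as follows (illustrative Python, not Nutch's actual Java code; the function and variable names are made up). Both the database and the batch of updates are sorted by key, so a single sequential pass produces the new database, with every read and write running at transfer rate:

```python
def merge_update(db_entries, updates):
    """Merge pre-sorted updates into a pre-sorted database in one pass.

    db_entries and updates are iterables of (key, value) pairs, each
    sorted by key. On a matching key, the update replaces the old entry.
    Because both inputs are consumed strictly in order, on disk this is
    all sequential I/O: no random seeks, regardless of database size.
    """
    db_it, up_it = iter(db_entries), iter(updates)
    db = next(db_it, None)
    up = next(up_it, None)
    while db is not None or up is not None:
        if up is None or (db is not None and db[0] < up[0]):
            yield db                      # old entry, untouched
            db = next(db_it, None)
        elif db is None or up[0] < db[0]:
            yield up                      # brand-new entry
            up = next(up_it, None)
        else:                             # same key: update wins
            yield up
            db = next(db_it, None)
            up = next(up_it, None)

# Even if only 1% of entries change, the pass is still sequential.
old = [(1, 'a'), (2, 'b'), (3, 'c'), (5, 'e')]
new = [(2, 'B'), (4, 'd')]
print(list(merge_update(old, new)))
# → [(1, 'a'), (2, 'B'), (3, 'c'), (4, 'd'), (5, 'e')]
```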

The last time I benchmarked the db sorting and merging code on large collections it was disk i/o bound. Is this no longer the case? When performing an update on a large (>10M page) db, what is the CPU and disk utilizations?

In short, maintaining a link graph is a very data-intensive operation. An RDBMS will always use a B-tree, and will always degenerate to a random access per link update when the database is large. Fetching 100 pages per second with an average of 10 links per page requires 1000 link updates per second for the database to keep up with fetching. A typical hard drive can only perform around 100 seeks per second. So any approach that requires a random access per link will fail to keep up, unless 10 hard drives are allocated per fetcher!
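That arithmetic, spelled out (the rates are the round numbers assumed above, not measurements):

```python
# Back-of-envelope: link-update rate needed vs. seeks one disk can do.
fetch_rate = 100        # pages fetched per second
links_per_page = 10     # average outlinks per page
seeks_per_sec = 100     # one random seek ~= 10 ms

updates_needed = fetch_rate * links_per_page        # link updates/second
# If each update costs one random seek, drives needed to keep up:
drives_needed = updates_needed / seeks_per_sec

print(updates_needed, drives_needed)  # → 1000 10.0
```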

With 100 bytes per link and 10 links per page, a 100M page database requires 100GB. At a 10MB/second transfer rate this takes on the order of three hours to read and six hours to read and re-write, even with tens of millions of updates. With two 10ms seeks required per update, only around 1M links could be updated in six hours.
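Checking those figures with the same round numbers (10MB/s taken as an even 10,000,000 bytes/second for simplicity):

```python
# Back-of-envelope for a 100M-page link database.
pages = 100_000_000
links_per_page = 10
bytes_per_link = 100
transfer_rate = 10_000_000           # bytes/second, sequential I/O

db_bytes = pages * links_per_page * bytes_per_link   # 100 GB
read_hours = db_bytes / transfer_rate / 3600         # ~2.8 h, one pass
rewrite_hours = 2 * read_hours                       # ~5.6 h, read + write

# Seek-bound alternative: two 10 ms seeks = 20 ms per link update.
seconds_available = rewrite_hours * 3600
updates_possible = seconds_available / 0.020         # ~1M updates

print(db_bytes, round(read_hours, 1), round(rewrite_hours, 1),
      int(updates_possible))
# → 100000000000 2.8 5.6 1000000
```

So in the time a seek-per-update scheme manages about a million link updates, the sequential merge rewrites the entire 100GB database, however many tens of millions of entries changed.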

So, yes, the implementation Nutch uses does use a lot of space, but it is very scalable.

Doug



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers