On Thu, Nov 11, 2004 at 01:09:36PM +0100, Doug Cutting wrote:
> scott cotton wrote:
> > 1) As you mention, keep the append/create architecture and break it up
> > into buckets, using a hash (or consistent hash for a little more
> > flexibility). One notable advantage is that this could be done as a
> > front end on current code, or just about any other implementation.
>
> This is what the DistributedWebDBWriter does already, no?
Maybe I'm misunderstanding something, or didn't explain myself well, or am
referring to an old version of Nutch (0.5), but it looks like
DistributedWebDBWriter breaks things up into buckets in two ways: by the
different files in the WebDB (linksbyURL, etc.) and by using the NutchFS.
If this is correct, then a change to one bucket will entail changes to all
other buckets falling later in the sort order, so the buckets have to be
processed sequentially. Using hashing/consistent hashing would restrict the
effect of updates to a single bucket, so a minor difference would be that
the buckets could be processed in parallel rather than sequentially.
Apologies if I'm misunderstanding something about DistributedWebDBWriter.

> > A disadvantage is that it still isn't really optimal for
> > updates/inserts with current storage of the webdb.
>
> I'd argue that it is optimal. More on that below.

Some hopefully helpful thoughts on the optimality of the webdb below.

> > 2) use something like Berkeley DB which will increase space usage by
> > I'd guess about 100-150%, but will allow for fast
> > inserts/updates/deletes. Sounds better to me than the current approach,
> > but for large installations we may run into hardware limits without
> > compressing the data. I've heard of Berkeley DB being used to store
> > 100-gig databases. I guess a large Nutch installation may push or break
> > that size.
>
> We started out using Berkeley DB and it became very slow when the
> database was large. The problem is that B-trees get fragmented as they
> grow. Each update eventually requires a random access, a disk seek,
> which takes around 10 milliseconds.
>
> Consider this: if each B-tree page holds, say, 100 pages or links, and
> we're updating at least 1% of all entries in the B-tree, then, in the
> course of a db update, we'll visit every page in the B-tree, but as a
> random access. It is much faster to pre-sort the updates and then merge
> them with the database.
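For concreteness, the pre-sort-and-merge update Doug describes could be
sketched roughly like this (a hypothetical illustration, not the actual
WebDB code; the class and method names are made up). Both the existing db
and the batch of updates are already sorted by key, so one sequential pass
produces the new db with no random seeks:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a sort-then-merge batch update: one sequential
// pass over the sorted db and the sorted batch, no seeks required.
class SortedMerge {
    /** Merge sorted updates into a sorted db; an update replaces an old entry with the same key. */
    static List<String> merge(List<String> db, List<String> updates) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < db.size() && j < updates.size()) {
            int cmp = db.get(i).compareTo(updates.get(j));
            if (cmp < 0) {
                out.add(db.get(i++));        // old entry, unchanged
            } else if (cmp > 0) {
                out.add(updates.get(j++));   // brand-new entry
            } else {
                out.add(updates.get(j++));   // update wins on equal keys
                i++;
            }
        }
        while (i < db.size()) out.add(db.get(i++));
        while (j < updates.size()) out.add(updates.get(j++));
        return out;
    }
}
```

The point is that the per-entry cost is a compare and a sequential write,
which is why it runs at transfer rate rather than seek rate.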
> All disk operations are sequential and hence operate at the transfer
> rate, typically around 10MB/second, nearly 100 times faster than random
> seeks.

Minor question: if you run the update in a transaction, then do the B-trees
get updated in batch? There have got to be some batch-updateable RDBMSs out
there.

> The last time I benchmarked the db sorting and merging code on large
> collections it was disk-i/o bound. Is this no longer the case? When
> performing an update on a large (>10M page) db, what are the CPU and
> disk utilizations?
>
> In short, maintaining a link graph is a very data-intensive operation.
> An RDBMS will always use a B-tree, and will always degenerate to random
> accesses per link update when the database is large. Fetching at 100
> pages per second with an average of 10 links per page requires 1000 link
> updates per second in order for the database to keep up with fetching.
> A typical hard drive can only perform 100 seeks per second. So any
> approach which requires a random access per link will fail to keep up,
> unless 10 hard drives are allocated per fetcher!
>
> With 100 bytes per link and 10 links per page, a 100M page database
> requires 100GB. At a 10MB/second transfer rate this takes on the order
> of three hours to read and six hours to re-write, even with tens of
> millions of updates. With two 10ms seeks required per update, only
> around 1M links could be updated in six hours.

If distributed over N machines or N disks which work independently rather
than sequentially, updates would take 1/Nth the time. Again, I could be
wrong about the DistributedWebDBWriter, but my understanding is that its
buckets don't act independently when there are updates.

> So, yes, the implementation Nutch uses does use a lot of space, but it
> is very scalable.

Indeed, as you describe, link updates are problematic with B-trees, and if
your batch updates are big enough, then the one-pass update is much more
efficient than B-trees.
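To illustrate the independent-buckets idea from (1), here is a rough
consistent-hash sketch (hypothetical code, nothing from Nutch; the class
name and bucket scheme are mine). Each URL maps to exactly one bucket, so
an update touches only that bucket and different buckets can be processed
in parallel:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: a consistent-hash ring mapping URLs to buckets.
// Updates to different buckets are independent, and adding or removing a
// bucket only moves the keys adjacent to its points on the ring.
class BucketRing {
    private final TreeMap<Integer, Integer> ring = new TreeMap<>();

    BucketRing(int numBuckets, int pointsPerBucket) {
        for (int b = 0; b < numBuckets; b++) {
            for (int p = 0; p < pointsPerBucket; p++) {
                // Several points per bucket smooth out the load.
                ring.put(("bucket-" + b + "#" + p).hashCode(), b);
            }
        }
    }

    /** The bucket responsible for a given URL. */
    int bucketFor(String url) {
        // First ring point at or after the URL's hash, wrapping around.
        SortedMap<Integer, Integer> tail = ring.tailMap(url.hashCode());
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}
```

With plain modulo hashing the same independence holds; the consistent hash
just avoids reshuffling every key when the bucket count changes.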
Also, as the collection grows, increasing the size of batch updates becomes
more and more important, no?

I can think of something that should do better than the one-pass,
sequential (i.e. non-block-structured) update: if the data is stored in
blocks, each of which has adequate space reserved for growth, then one-pass
batch updates can skip the re-writing of many blocks. OK, that uses even
more space, but there are some saving graces:

1) There is no compression in the webdb now (correct?); adding a little
   prefix compression may more than accommodate the difference in space
   usage.
2) If you're going to make a big installation, it doesn't seem like a big
   deal to supply the webdb with a desired size, and much of this extra
   space would be used eventually anyway.
3) Updates are faster because most of the writes from a sequential
   one-pass update are skipped.

One could improve how well such block-based storage accommodates growth by
starting with a reasonably sized linkdb, sorting links by the rank of the
target URL or by the number of incoming links, and keeping a separate index
mapping target URL to block. The index then needs to be blocked as well,
but the index growth (in target URLs) can have a growth rate distinct from
that of the total number of links, making the whole system accommodate
growth more accurately (with updates triggering many block writes less
often). Again, the whole bucketing of things into independent buckets could
be used in this scheme as well.

All this is just a few thoughts; please feel free to disregard it if it
doesn't seem useful.

cheers,
scott

> Doug
>
> -------------------------------------------------------
> This SF.Net email is sponsored by:
> Sybase ASE Linux Express Edition - download now for FREE
> LinuxWorld Reader's Choice Award Winner for best database on Linux.
> http://ads.osdn.com/?ad_id=5588&alloc_id=12065&op=click
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
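To make the blocks-with-reserved-space idea above concrete, here is a rough
sketch (purely hypothetical code, not part of Nutch; the class, the
capacity constant, and the write counter are all made up for illustration).
A batch insert re-writes only the blocks it touches, and a block splits
when its slack runs out, instead of forcing a rewrite of everything after
it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of block-structured storage with slack space: keys
// live in fixed-capacity blocks indexed by their smallest key; an insert
// touches exactly one block, and a full block splits in two.
class BlockedStore {
    static final int CAPACITY = 4;                  // slots per block, including slack
    final TreeMap<String, List<String>> blocks = new TreeMap<>();
    int blockWrites = 0;                            // stand-in for actual disk writes

    void insert(String key) {
        Map.Entry<String, List<String>> e = blocks.floorEntry(key);
        if (e == null) {                            // key precedes every existing block
            List<String> first = new ArrayList<>();
            first.add(key);
            blocks.put(key, first);
            blockWrites++;
            return;
        }
        List<String> block = e.getValue();
        block.add(key);
        blockWrites++;                              // only this block is re-written
        if (block.size() > CAPACITY) {              // slack exhausted: split
            block.sort(null);
            List<String> upper = new ArrayList<>(block.subList(CAPACITY / 2, block.size()));
            block.subList(CAPACITY / 2, block.size()).clear();
            blocks.put(upper.get(0), upper);
            blockWrites++;                          // the new upper block is written too
        }
    }
}
```

The separate target-URL-to-block index mentioned above would play the role
of the TreeMap here, and would itself be blocked the same way.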
