Doug, two questions: First, is there any chance to preprocess (e.g. presort) data already during fetching, while I/O isn't 100% utilized, to speed up later processing?
Second, what speaks against making all related code a plugin as well, so that people with a small installation can use this implementation and people with a larger Nutch installation can use another one?

Stefan

Quoting Doug Cutting <[EMAIL PROTECTED]>:

> scott cotton wrote:
> > 1) As you mention, keep the append/create architecture and break it up
> > into buckets, using a hash (or consistent hash for a little more
> > flexibility). One notable advantage is that this could be done as a
> > front end on current code, or just about any other implementation.
>
> This is what the DistributedWebDBWriter does already, no?
>
> > A disadvantage is that it still isn't really optimal for
> > updates/inserts with the current storage of the webdb.
>
> I'd argue that it is optimal. More on that below.
>
> > 2) Use something like Berkeley DB, which will increase space usage by
> > I'd guess about 100-150%, but will allow for fast
> > inserts/updates/deletes. Sounds better to me than the current
> > approach, but for large installations we may run into hardware limits
> > without compressing the data. I've heard of Berkeley DB being used to
> > store 100GB databases. I guess a large Nutch installation may push or
> > break that size.
>
> We started out using Berkeley DB and it became very slow when the
> database was large. The problem is that B-trees get fragmented as they
> grow. Each update eventually requires a random access, a disk seek,
> which takes around 10 milliseconds.
>
> Consider this: if each B-tree page holds, say, 100 pages or links, and
> we're updating at least 1% of all entries in the B-tree, then, in the
> course of a db update we'll visit every page in the B-tree, each with a
> random access. It is much faster to pre-sort the updates and then merge
> them with the database. All disk operations are then sequential and
> hence operate at the transfer rate, typically around 10MB/second,
> nearly 100 times faster than random seeks.
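Doug's pre-sort-and-merge point can be illustrated with a toy sketch (plain Java; the class and method names here are illustrative only, not Nutch's actual WebDB API). A sorted batch of updates is folded into an already-sorted database in a single sequential pass, so no per-update random lookup is needed:

```java
import java.util.Arrays;

// Toy illustration of the merge-update strategy: instead of applying each
// update with a random lookup (one disk seek per update in a B-tree), sort
// the batch of updates and merge it with the already-sorted database in one
// sequential pass. Keys stand in for page/link records.
public class MergeUpdate {
    // Merge a sorted database with a sorted batch of updates; an update
    // with a matching key replaces the old record, otherwise it is inserted.
    static long[] merge(long[] db, long[] updates) {
        long[] out = new long[db.length + updates.length];
        int i = 0, j = 0, k = 0;
        while (i < db.length && j < updates.length) {
            if (db[i] < updates[j])      out[k++] = db[i++];
            else if (db[i] > updates[j]) out[k++] = updates[j++];
            else { out[k++] = updates[j++]; i++; }  // update wins on equal keys
        }
        while (i < db.length)      out[k++] = db[i++];      // drain remainder
        while (j < updates.length) out[k++] = updates[j++];
        return Arrays.copyOf(out, k);
    }

    public static void main(String[] args) {
        long[] db = {10, 20, 30, 40};
        long[] updates = {20, 25};   // batch, pre-sorted before the merge
        System.out.println(Arrays.toString(merge(db, updates)));
        // prints [10, 20, 25, 30, 40]
    }
}
```

On disk the same idea reads both sorted files front to back, so every operation runs at the drive's transfer rate rather than its seek rate.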
> The last time I benchmarked the db sorting and merging code on large
> collections it was disk i/o bound. Is this no longer the case? When
> performing an update on a large (>10M page) db, what are the CPU and
> disk utilizations?
>
> In short, maintaining a link graph is a very data-intensive operation.
> An RDBMS will always use a B-tree, and will always degenerate to random
> accesses per link update when the database is large. Fetching at 100
> pages per second with an average of 10 links per page requires 1000
> link updates per second in order for the database to keep up with
> fetching. A typical hard drive can only perform 100 seeks per second.
> So any approach which requires a random access per link will fail to
> keep up, unless 10 hard drives are allocated per fetcher!
>
> With 100 bytes per link and 10 links per page, a 100M page database
> requires 100GB. At a 10MB/second transfer rate this takes on the order
> of three hours to read and six hours to re-write, even with tens of
> millions of updates. With two 10ms seeks required per update, only
> around 1M links could be updated in six hours.
>
> So, yes, the implementation Nutch uses does use a lot of space, but it
> is very scalable.
>
> Doug
>
> -------------------------------------------------------
> This SF.Net email is sponsored by:
> Sybase ASE Linux Express Edition - download now for FREE
> LinuxWorld Reader's Choice Award Winner for best database on Linux.
> http://ads.osdn.com/?ad_id=5588&alloc_id=12065&op=click
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
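The arithmetic Doug works through above can be checked directly; every constant below is one of his stated assumptions (100 pages/s fetch rate, 10 links/page, ~100 seeks/s and 10 ms/seek per drive, 100 bytes/link, 10 MB/s transfer rate):

```java
// Back-of-the-envelope check of the throughput argument. All numbers are
// Doug's stated assumptions, not measurements.
public class WebDbMath {
    public static void main(String[] args) {
        double fetchRate = 100;      // pages fetched per second
        double linksPerPage = 10;
        double updatesPerSec = fetchRate * linksPerPage;  // 1000 link updates/s
        double seeksPerSec = 100;    // random seeks one drive sustains

        // Drives needed if every link update costs one random seek.
        System.out.println("drives per fetcher: "
                + (updatesPerSec / seeksPerSec));         // 10.0

        double pages = 100e6, bytesPerLink = 100, xferRate = 10e6; // bytes/s
        double dbBytes = pages * linksPerPage * bytesPerLink;      // 100 GB
        System.out.println("hours to read sequentially: "
                + dbBytes / xferRate / 3600.0);           // ~2.8 hours

        // With two 10ms seeks per update, six hours of pure seeking allows:
        double updatesInSixHours = 6 * 3600 / (2 * 0.010);
        System.out.println("random updates in 6h: "
                + updatesInSixHours);                     // ~1.08 million
    }
}
```

The sequential numbers (hours to scan 100GB) versus the seek-bound number (~1M updates in six hours) are exactly the gap that makes the sort-and-merge design win at this scale.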
