Doug, two questions: First, is there any chance to preprocess (e.g. presort) data already during fetching, while I/O isn't 100% utilized, to speed up later processing?
Second, what speaks against making all related code a plugin as well, so that people with a small installation can use this implementation and people with a larger Nutch installation can use another one?

Stefan

Quoting Doug Cutting <[EMAIL PROTECTED]>:

> scott cotton wrote:
> > 1) As you mention, keep the append/create architecture and break it up
> > into buckets, using a hash (or consistent hash for a little more
> > flexibility). One notable advantage is that this could be done as a
> > front end on current code, or just about any other implementation.
>
> This is what the DistributedWebDBWriter does already, no?
>
> > A disadvantage is that it still isn't really optimal for
> > updates/inserts with the current storage of the webdb.
>
> I'd argue that it is optimal. More on that below.
>
> > 2) Use something like Berkeley DB, which will increase space usage by
> > I'd guess about 100-150%, but will allow for fast
> > inserts/updates/deletes. Sounds better to me than the current
> > approach, but for large installations we may run into hardware limits
> > without compressing the data. I've heard of Berkeley DB being used to
> > store 100GB databases. I guess a large Nutch installation may push or
> > break that size.
>
> We started out using Berkeley DB and it became very slow when the
> database was large. The problem is that B-trees get fragmented as they
> grow. Each update eventually requires a random access, a disk seek,
> which takes around 10 milliseconds.
>
> Consider this: if each B-tree page holds, say, 100 pages or links, and
> we're updating at least 1% of all entries in the B-tree, then, in the
> course of a db update we'll visit every page in the B-tree, each with a
> random access. It is much faster to pre-sort the updates and then merge
> them with the database. All disk operations are then sequential and
> hence operate at the transfer rate, typically around 10MB/second,
> nearly 100 times faster than random seeks.
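Doug's pre-sort-and-merge point can be illustrated with a toy sketch (plain Java; the class and method names here are illustrative only, not Nutch's actual WebDB API). A sorted batch of updates is folded into an already-sorted database in a single sequential pass, so no per-update random lookup is needed:

```java
import java.util.Arrays;

// Toy illustration of the merge-update strategy: instead of applying each
// update with a random lookup (one disk seek per update in a B-tree), sort
// the batch of updates and merge it with the already-sorted database in one
// sequential pass. Keys stand in for page/link records.
public class MergeUpdate {
    // Merge a sorted database with a sorted batch of updates; an update
    // with a matching key replaces the old record, otherwise it is inserted.
    static long[] merge(long[] db, long[] updates) {
        long[] out = new long[db.length + updates.length];
        int i = 0, j = 0, k = 0;
        while (i < db.length && j < updates.length) {
            if (db[i] < updates[j])      out[k++] = db[i++];
            else if (db[i] > updates[j]) out[k++] = updates[j++];
            else { out[k++] = updates[j++]; i++; }  // update wins on equal keys
        }
        while (i < db.length)      out[k++] = db[i++];      // drain remainder
        while (j < updates.length) out[k++] = updates[j++];
        return Arrays.copyOf(out, k);
    }

    public static void main(String[] args) {
        long[] db = {10, 20, 30, 40};
        long[] updates = {20, 25};   // batch, pre-sorted before the merge
        System.out.println(Arrays.toString(merge(db, updates)));
        // prints [10, 20, 25, 30, 40]
    }
}
```

On disk the same idea reads both sorted files front to back, so every operation runs at the drive's transfer rate rather than its seek rate.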
> The last time I benchmarked the db sorting and merging code on large
> collections it was disk i/o bound. Is this no longer the case? When
> performing an update on a large (>10M page) db, what are the CPU and
> disk utilizations?
>
> In short, maintaining a link graph is a very data-intensive operation.
> An RDBMS will always use a B-tree, and will always degenerate to random
> accesses per link update when the database is large. Fetching at 100
> pages per second with an average of 10 links per page requires 1000
> link updates per second in order for the database to keep up with
> fetching. A typical hard drive can only perform 100 seeks per second.
> So any approach which requires a random access per link will fail to
> keep up, unless 10 hard drives are allocated per fetcher!
>
> With 100 bytes per link and 10 links per page, a 100M page database
> requires 100GB. At a 10MB/second transfer rate this takes on the order
> of three hours to read and six hours to re-write, even with tens of
> millions of updates. With two 10ms seeks required per update, only
> around 1M links could be updated in six hours.
>
> So, yes, the implementation Nutch uses does use a lot of space, but it
> is very scalable.
>
> Doug
>
> -------------------------------------------------------
> This SF.Net email is sponsored by:
> Sybase ASE Linux Express Edition - download now for FREE
> LinuxWorld Reader's Choice Award Winner for best database on Linux.
> http://ads.osdn.com/?ad_id=5588&alloc_id=12065&op=click
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
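The arithmetic Doug works through above can be checked directly; every constant below is one of his stated assumptions (100 pages/s fetch rate, 10 links/page, ~100 seeks/s and 10 ms/seek per drive, 100 bytes/link, 10 MB/s transfer rate):

```java
// Back-of-the-envelope check of the throughput argument. All numbers are
// Doug's stated assumptions, not measurements.
public class WebDbMath {
    public static void main(String[] args) {
        double fetchRate = 100;      // pages fetched per second
        double linksPerPage = 10;
        double updatesPerSec = fetchRate * linksPerPage;  // 1000 link updates/s
        double seeksPerSec = 100;    // random seeks one drive sustains

        // Drives needed if every link update costs one random seek.
        System.out.println("drives per fetcher: "
                + (updatesPerSec / seeksPerSec));         // 10.0

        double pages = 100e6, bytesPerLink = 100, xferRate = 10e6; // bytes/s
        double dbBytes = pages * linksPerPage * bytesPerLink;      // 100 GB
        System.out.println("hours to read sequentially: "
                + dbBytes / xferRate / 3600.0);           // ~2.8 hours

        // With two 10ms seeks per update, six hours of pure seeking allows:
        double updatesInSixHours = 6 * 3600 / (2 * 0.010);
        System.out.println("random updates in 6h: "
                + updatesInSixHours);                     // ~1.08 million
    }
}
```

The sequential numbers (hours to scan 100GB) versus the seek-bound number (~1M updates in six hours) are exactly the gap that makes the sort-and-merge design win at this scale.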
