On Thu, Nov 11, 2004 at 01:09:36PM +0100, Doug Cutting wrote:
> scott cotton wrote:
> > 1) As you mention, keep the append/create architecture and break it up
> > into buckets, using a hash (or consistent hash for a little more
> > flexibility). One notable advantage is that this could be done as a
> > front end on current code, or just about any other implementation.
>
> This is what the DistributedWebDBWriter does already, no?
Maybe I'm misunderstanding something, or didn't explain myself well, or am
referring to an old version of Nutch (0.5), but it looks like
DistributedWebDBWriter breaks things up into buckets in two ways: by the
different files in the WebDB (linksbyURL, etc.) and by using the NutchFS.
If this is correct, then a change to one bucket will entail changes to all
other buckets falling later in the sort order, so the buckets have to be
processed sequentially. Using hashing/consistent hashing would restrict the
effect of updates to a single bucket, so a minor difference would be that
the buckets could be processed in parallel rather than sequentially.
Apologies if I'm misunderstanding something about DistributedWebDBWriter.

> > A disadvantage is that it still isn't really optimal for
> > updates/inserts with current storage of the webdb.
>
> I'd argue that it is optimal. More on that below.

Some hopefully helpful thoughts on the optimality of the webdb below.

> > 2) use something like Berkeley DB which will increase space usage by
> > I'd guess about 100-150%, but will allow for fast
> > inserts/updates/deletes. Sounds better to me than the current approach,
> > but for large installations we may run into hardware limits without
> > compressing the data. I've heard of Berkeley DB being used to store
> > 100-gig databases. I guess a large Nutch installation may push or break
> > that size.
>
> We started out using Berkeley DB and it became very slow when the
> database was large. The problem is that B-trees get fragmented as they
> grow. Each update eventually requires a random access, a disk seek,
> which takes around 10 milliseconds.
>
> Consider this: if each B-tree page holds, say, 100 pages or links, and
> we're updating at least 1% of all entries in the B-tree, then, in the
> course of a db update, we'll visit every page in the B-tree, but as a
> random access. It is much faster to pre-sort the updates and then merge
> them with the database.
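For concreteness, the pre-sort-and-merge update Doug describes could be
sketched roughly like this (a hypothetical illustration, not the actual
WebDB code; the class and method names are made up). Both the existing db
and the batch of updates are already sorted by key, so one sequential pass
produces the new db with no random seeks:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a sort-then-merge batch update: one sequential
// pass over the sorted db and the sorted batch, no seeks required.
class SortedMerge {
    /** Merge sorted updates into a sorted db; an update replaces an old entry with the same key. */
    static List<String> merge(List<String> db, List<String> updates) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < db.size() && j < updates.size()) {
            int cmp = db.get(i).compareTo(updates.get(j));
            if (cmp < 0) {
                out.add(db.get(i++));        // old entry, unchanged
            } else if (cmp > 0) {
                out.add(updates.get(j++));   // brand-new entry
            } else {
                out.add(updates.get(j++));   // update wins on equal keys
                i++;
            }
        }
        while (i < db.size()) out.add(db.get(i++));
        while (j < updates.size()) out.add(updates.get(j++));
        return out;
    }
}
```

The point is that the per-entry cost is a compare and a sequential write,
which is why it runs at transfer rate rather than seek rate.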
> All disk operations are sequential and hence operate at the transfer
> rate, typically around 10MB/second, nearly 100 times faster than random
> seeks.

Minor question: if you run the update in a transaction, then do the B-trees
get updated in batch? There have got to be some batch-updateable RDBMSs out
there.

> The last time I benchmarked the db sorting and merging code on large
> collections it was disk-i/o bound. Is this no longer the case? When
> performing an update on a large (>10M page) db, what are the CPU and
> disk utilizations?
>
> In short, maintaining a link graph is a very data-intensive operation.
> An RDBMS will always use a B-tree, and will always degenerate to random
> accesses per link update when the database is large. Fetching at 100
> pages per second with an average of 10 links per page requires 1000 link
> updates per second in order for the database to keep up with fetching.
> A typical hard drive can only perform 100 seeks per second. So any
> approach which requires a random access per link will fail to keep up,
> unless 10 hard drives are allocated per fetcher!
>
> With 100 bytes per link and 10 links per page, a 100M page database
> requires 100GB. At a 10MB/second transfer rate this takes on the order
> of three hours to read and six hours to re-write, even with tens of
> millions of updates. With two 10ms seeks required per update, only
> around 1M links could be updated in six hours.

If distributed over N machines or N disks which work independently rather
than sequentially, updates would take 1/Nth the time. Again, I could be
wrong about the DistributedWebDBWriter, but my understanding is that its
buckets don't act independently when there are updates.

> So, yes, the implementation Nutch uses does use a lot of space, but it
> is very scalable.

Indeed, as you describe, link updates are problematic with B-trees, and if
your batch updates are big enough, then the one-pass update is much more
efficient than B-trees.
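To illustrate the independent-buckets idea from (1), here is a rough
consistent-hash sketch (hypothetical code, nothing from Nutch; the class
name and bucket scheme are mine). Each URL maps to exactly one bucket, so
an update touches only that bucket and different buckets can be processed
in parallel:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: a consistent-hash ring mapping URLs to buckets.
// Updates to different buckets are independent, and adding or removing a
// bucket only moves the keys adjacent to its points on the ring.
class BucketRing {
    private final TreeMap<Integer, Integer> ring = new TreeMap<>();

    BucketRing(int numBuckets, int pointsPerBucket) {
        for (int b = 0; b < numBuckets; b++) {
            for (int p = 0; p < pointsPerBucket; p++) {
                // Several points per bucket smooth out the load.
                ring.put(("bucket-" + b + "#" + p).hashCode(), b);
            }
        }
    }

    /** The bucket responsible for a given URL. */
    int bucketFor(String url) {
        // First ring point at or after the URL's hash, wrapping around.
        SortedMap<Integer, Integer> tail = ring.tailMap(url.hashCode());
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}
```

With plain modulo hashing the same independence holds; the consistent hash
just avoids reshuffling every key when the bucket count changes.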
Also, as the collection grows, increasing the size of batch updates becomes
more and more important, no?

I can think of something that should do better than the one-pass,
sequential (i.e. non-block-structured) update: if the data is stored in
blocks, each of which has adequate space reserved for growth, then one-pass
batch updates can skip the re-writing of many blocks. OK, that uses even
more space, but there are some saving graces:

1) There is no compression in the webdb now (correct?); adding a little
   prefix compression may more than accommodate the difference in space
   usage.
2) If you're going to make a big installation, it doesn't seem like a big
   deal to supply the webdb with a desired size, and much of this extra
   space would be used eventually anyway.
3) Updates are faster because most of the writes from a sequential
   one-pass update are skipped.

One could improve how well such block-based storage accommodates growth by
starting with a reasonably sized linkdb, sorting links by the rank of the
target URL or by the number of incoming links, and keeping a separate index
mapping target URL to block. The index then needs to be blocked as well,
but the index growth (in target URLs) can have a growth rate distinct from
that of the total number of links, making the whole system accommodate
growth more accurately (with updates triggering many block writes less
often). Again, the whole bucketing of things into independent buckets could
be used in this scheme as well.

All this is just a few thoughts; please feel free to disregard it if it
doesn't seem useful.

cheers,
scott

> Doug
>
> -------------------------------------------------------
> This SF.Net email is sponsored by:
> Sybase ASE Linux Express Edition - download now for FREE
> LinuxWorld Reader's Choice Award Winner for best database on Linux.
> http://ads.osdn.com/?ad_id=5588&alloc_id=12065&op=click
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
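To make the blocks-with-reserved-space idea above concrete, here is a rough
sketch (purely hypothetical code, not part of Nutch; the class, the
capacity constant, and the write counter are all made up for illustration).
A batch insert re-writes only the blocks it touches, and a block splits
when its slack runs out, instead of forcing a rewrite of everything after
it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of block-structured storage with slack space: keys
// live in fixed-capacity blocks indexed by their smallest key; an insert
// touches exactly one block, and a full block splits in two.
class BlockedStore {
    static final int CAPACITY = 4;                  // slots per block, including slack
    final TreeMap<String, List<String>> blocks = new TreeMap<>();
    int blockWrites = 0;                            // stand-in for actual disk writes

    void insert(String key) {
        Map.Entry<String, List<String>> e = blocks.floorEntry(key);
        if (e == null) {                            // key precedes every existing block
            List<String> first = new ArrayList<>();
            first.add(key);
            blocks.put(key, first);
            blockWrites++;
            return;
        }
        List<String> block = e.getValue();
        block.add(key);
        blockWrites++;                              // only this block is re-written
        if (block.size() > CAPACITY) {              // slack exhausted: split
            block.sort(null);
            List<String> upper = new ArrayList<>(block.subList(CAPACITY / 2, block.size()));
            block.subList(CAPACITY / 2, block.size()).clear();
            blocks.put(upper.get(0), upper);
            blockWrites++;                          // the new upper block is written too
        }
    }
}
```

The separate target-URL-to-block index mentioned above would play the role
of the TreeMap here, and would itself be blocked the same way.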
