Steve Howell <showel...@yahoo.com> writes:
> Doesn't cdb do at least one disk seek as well? In the diagram on this
> page, it seems you would need to do a seek based on the value of the
> initial pointer (from the 256 possible values):
Yes, of course it has to seek if there is too much data to fit in memory. All I'm saying is that if you're spending milliseconds on the seek, that may dominate the lookup time even if you scan the 90k.

Actually, looking at the spec more closely, there are 256 hash tables in the file, but each table can be of any size, so there can be far more than 2**16 hash slots. Uncharacteristically for Bernstein, the document is pretty unclear, so maybe you have to look at the code to be sure of the layout. (There's a rough sketch of the lookup at the end of this message.)

Note also that empty hash buckets have just 2 words of overhead. So with 3M keys and a 75% load factor, you get 4M buckets and relatively few collisions. The extra 1M buckets in a 64-bit implementation come to just 16MB in the file, which isn't much at all, even allowing that you want the tables to stay resident in memory to avoid some seeks, assuming you're on a PC and not some smaller device like a phone. (A phone will have solid-state storage that eliminates most of the seek time, so you're OK in that situation too.)

> Yup, I don't think I want to incur the extra overhead. Do you have
> any first hand experience pushing dbm to the scale of 6Gb or so? My
> take on dbm is that its niche is more in the 10,000-record range.

There are a bunch of different variants. I'm trying to remember what I've personally done with it; I'm sure I've used much more than 10k records, but maybe not millions. Unix dbm was originally designed to handle millions of records back when that was a lot. I'd expect gdbm, bsddb, and so forth to handle it easily. The naive Python dbm module might not be able to. The amount of data you're talking about (a few million keys, a few GB of data) is fairly modest by today's standards, so I would think fancy methods aren't really needed.
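By way of illustration, this is the sort of thing I'd try first (a rough, untested sketch; the filename and key format are made up, gdbm is assumed to be installed, and on Python 3 the module lives at dbm.gnu rather than gdbm):

    import gdbm          # on Python 3: from dbm import gnu as gdbm

    # 'c' = create the file if needed, 'f' = "fast" mode
    # (don't sync to disk after every single write)
    db = gdbm.open('big.db', 'cf')
    for i in range(3000000):
        db['key-%d' % i] = 'x' * 1000   # roughly 3GB of payload total
    db.sync()
    db.close()

    # Later, lookups are just dictionary-style accesses:
    db = gdbm.open('big.db', 'r')
    print(db['key-1234567'])
    db.close()

That's the whole API, more or less; the backend takes care of the hashing and the on-disk layout for you.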
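And going back to cdb, for what it's worth here is roughly what a lookup has to do per my reading of the spec (untested sketch; assumes f is the cdb file opened in binary mode and key is a byte string). The low 8 bits of the hash pick one of the 256 tables, and the rest of the hash picks the starting slot within that table, so a hit normally costs a couple of seeks plus the final read of the record:

    import struct

    def cdb_hash(key):
        # Bernstein's hash: start at 5381, then h = ((h << 5) + h) ^ c
        h = 5381
        for c in bytearray(key):
            h = (((h << 5) + h) ^ c) & 0xffffffff
        return h

    def cdb_lookup(f, key):
        h = cdb_hash(key)
        # Header: 256 (table_position, num_slots) pairs of 32-bit ints.
        f.seek((h & 0xff) * 8)
        table_pos, num_slots = struct.unpack('<LL', f.read(8))
        if num_slots == 0:
            return None                  # that table is empty
        start = (h >> 8) % num_slots
        for i in range(num_slots):
            # Each slot is (hash_value, record_position); empty slots
            # have record_position == 0.
            f.seek(table_pos + ((start + i) % num_slots) * 8)
            slot_hash, rec_pos = struct.unpack('<LL', f.read(8))
            if rec_pos == 0:
                return None              # hit an empty slot: no such key
            if slot_hash == h:
                f.seek(rec_pos)
                klen, dlen = struct.unpack('<LL', f.read(8))
                if f.read(klen) == key:
                    return f.read(dlen)
        return None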