Re: Questions about bsddb
Thanks for the suggestion. I do remember reading that, but I don't think it helped much. Experimenting with the different settings, I found that the cache size was where the problem was. I've got it set to 1.5 GB and it's pretty happy at the moment; the build time is now a fraction of what it used to be. Thanks again for all the suggestions.

Regards,
JM
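(For reference, a minimal sketch of the cache setting described above, assuming the legacy bsddb module; 'big.db' is a placeholder filename.)

    import bsddb

    # cachesize is given in bytes; 1536 MB is the 1.5 GB mentioned above
    db = bsddb.btopen('big.db', 'c', cachesize=1536 * 1024 * 1024)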
Questions about bsddb
Hello,

I need to build a large database that has roughly 500,000 keys and a variable amount of data for each key. The data for each key could range from 100 bytes to megabytes, and the data under each key will grow over time as the database is being built. Are there some flags I should be setting when opening the database to handle large amounts of data per key? Is hash or binary tree (btree) recommended for this type of job? I'll be building the database from scratch, so there will be lots of lookups and appending of data. Testing is showing btree to be faster, so I'm leaning towards that. The estimated build time is around 10~12 hours on my machine, so I want to make sure that something won't get messed up in the 10th hour.

TIA,
JM
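(For reference, a minimal sketch of the lookup-and-append pattern described above, assuming the legacy bsddb module; the filename, cache size and sample data are placeholders.)

    import bsddb

    db = bsddb.btopen('work.db', 'c', cachesize=256 * 1024 * 1024)

    def append_data(db, key, chunk):
        # read-modify-write: grow whatever is already stored under the key
        if db.has_key(key):
            db[key] = db[key] + chunk
        else:
            db[key] = chunk

    append_data(db, 'some-key', '100 bytes or more of new data')
    db.sync()    # flush to disk periodically during a long build
    db.close()

The same code with bsddb.hashopen('work.db', 'c') would exercise the hash access method instead.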
Re: Questions about bsddb
On May 9, 8:23 am, [EMAIL PROTECTED] wrote:
[original question snipped]

JM,

How will you access your data? If you access the keys often in a sequential manner, then btree is better. In general, the rule is:

1) For small data sets, either one works.
2) For larger data sets, use btree. Btree is also good for sequential key access.
3) For really huge data sets, where even the metadata of the btree cannot fit in the cache, hash will be better. The reasoning is that once the metadata is larger than the cache there will be at least one I/O operation either way, but with a btree there might be multiple I/Os just to find the key, because the tree is not all in memory and has multiple levels.

Also consider this: I had a somewhat similar problem and ended up using MySQL as a backend. In my application the data was actually composed of a number of fields, and I wanted to select based on some of those fields as well (i.e. select based on part of the value, not just the keys), so I needed indices for those fields. The result was that my disk I/O was saturated (i.e. the application was running as fast as the hard drive would let it), so it was good enough for me.

Hope this helps,
-Nick Vatamaniuc
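(Nick's backend was MySQL; for reference, here is a minimal sketch of the same idea, separate value fields with their own indices, using the stdlib sqlite3 module instead. The table and field names are made up.)

    import sqlite3

    conn = sqlite3.connect('data.db')
    conn.execute('CREATE TABLE IF NOT EXISTS records'
                 ' (key TEXT PRIMARY KEY, field1 TEXT, payload BLOB)')
    # index a value field so selects on it need not scan the whole table
    conn.execute('CREATE INDEX IF NOT EXISTS idx_field1 ON records (field1)')
    conn.execute('INSERT OR REPLACE INTO records VALUES (?, ?, ?)',
                 ('key1', 'Hello world', 'the payload bytes'))
    conn.commit()
    # select on part of the value, not just on the key
    for row in conn.execute("SELECT key, field1 FROM records"
                            " WHERE field1 LIKE 'Hello%'"):
        print row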
Re: Questions about bsddb
Thanks for the info, Nick. I plan on accessing the data in pretty much random order, and once the database is built, it will be read-only. At this point I'm not too concerned about access times, just getting something to work. I've been messing around with both btree and hash with limited success, which led me to think that maybe I was going beyond some internal limit on the data size. It works great on a limited set of data, but once I turn it loose on the full set, usually several hours later it either causes a hard reset of my machine or the HD grinds on endlessly with no apparent progress. Is there a limit to the size of data you can place per key?

Thanks for the MySQL suggestion, I'll take a look.

-JM
Re: Questions about bsddb
On May 9, 4:01 pm, [EMAIL PROTECTED] wrote:
[previous message snipped]

JM,

If you want, take a look at my PyDBTable on www.psipy.com. The description and examples section is still being finished, but the source API documentation will help you. It is a fast Python wrapper around MySQL, PostgreSQL or SQLite that buffers queries and insertions. You just set up the database and then pass the connection parameters to the initializer method. After that you can use the pydb object as a dictionary of { primary_key : list_of_values }. You can even create indices on individual fields and run queries like:

    pydb.query( ['id','data_field1'], ('id','<',10), ('data_field1','LIKE','Hello%') )

which translates into an SQL query like:

    SELECT id, data_field1 FROM ... WHERE id<10 AND data_field1 LIKE 'Hello%'

and returns an __iterator__. An iterator as the result is excellent because you can iterate over results much larger than your virtual memory; in the background PyDBTable retrieves rows from the database in large batches and caches them so as to optimise I/O. Anyway, on my machine PyDBTable saturates the disk I/O (it runs as fast as a pure MySQL query).

Take care,
-Nick Vatamaniuc
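(A hypothetical usage sketch pieced together only from the description above; the actual PyDBTable import path, class name and connection parameters on www.psipy.com may differ.)

    from pydbtable import PyDBTable    # assumed import path and class name

    # assumed connection parameters; check the real API documentation
    pydb = PyDBTable(host='localhost', user='jm',
                     password='secret', database='mydata')

    pydb['key1'] = ['first field', 'second field']    # dict-style insertion

    # query returns an iterator, so the result set can exceed available RAM
    for row in pydb.query(['id', 'data_field1'],
                          ('id', '<', 10),
                          ('data_field1', 'LIKE', 'Hello%')):
        print row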