On 5/3/2012 10:42, Steve Howell wrote:
On May 2, 11:48 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
Paul Rubin <no.em...@nospam.invalid> writes:
looking at the spec more closely, there are 256 hash tables.. ...

You know, there is a much simpler way to do this, if you can afford to
use a few hundred MB of memory and you don't mind some load time when
the program first starts.  Just dump all the data sequentially into a
file.  Then scan through the file, building up a Python dictionary
mapping data keys to byte offsets in the file (this is a few hundred MB
if you have 3M keys).  Then dump the dictionary as a Python pickle and
read it back in when you start the program.
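
For what it's worth, a rough sketch of that offset-index idea could look
like the following, assuming each record is dumped as a single
key<TAB>value line; the file names and the build_index()/load_index()/
lookup() helpers are only illustrative, not anybody's actual code:

import pickle

DATA_FILE = "records.dat"      # sequential dump of all the records
INDEX_FILE = "offsets.pickle"  # pickled {key: byte offset} dict

def build_index():
    index = {}
    with open(DATA_FILE, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0].decode("utf-8")
            index[key] = offset
    with open(INDEX_FILE, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    return index

def load_index():
    with open(INDEX_FILE, "rb") as f:
        return pickle.load(f)

def lookup(data_file, index, key):
    # seek straight to the record's byte offset and read one line back
    data_file.seek(index[key])
    return data_file.readline().split(b"\t", 1)[1].rstrip(b"\n")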

You may want to turn off the cyclic garbage collector when building or
loading the dictionary, as it can badly slow down the construction of
big lists and maybe dicts (I'm not sure about the latter).
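
Concretely, that just means wrapping the build (or unpickling) step in
gc.disable()/gc.enable() from the standard gc module, e.g. with the
hypothetical build_index() from the sketch above:

import gc

gc.disable()            # skip cyclic GC while the big dict is populated
try:
    index = build_index()   # or load_index() to unpickle the saved dict
finally:
    gc.enable()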

I'm starting to lean toward the file-offset/seek approach.  I am
writing some benchmarks on it, comparing it to a more file-system
based approach like I mentioned in my original post.  I'll report back
when I get results, but it's already way past my bedtime for tonight.

Thanks for all your help and suggestions.

You should really cache the accesses to that file, in the hope that the accesses are not as random as you think. If that's the case, you should notice a *huge* improvement.
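
A simple way to do that is to memoize the seek-and-read lookups in an
in-memory cache, for example with functools.lru_cache if you are on
Python 3.2+ (the file names below are the hypothetical ones from the
earlier sketch, and the cache size is only a guess to be tuned):

import functools
import pickle

_data = open("records.dat", "rb")
with open("offsets.pickle", "rb") as f:
    _index = pickle.load(f)

@functools.lru_cache(maxsize=500000)    # tune to your memory budget
def cached_lookup(key):
    _data.seek(_index[key])
    return _data.readline().split(b"\t", 1)[1].rstrip(b"\n")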

Kiuhnm