Hi Roger,

By very large dictionary, I mean about 25M items per dictionary. Each item is a simple integer whose value will never exceed 2^15.

I populate these dictionaries by parsing very large ASCII text files containing detailed manufacturing events. From each line in a log file I construct one or more keys and increment the numeric values associated with those keys, using timing info also extracted from the line.
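Schematically, the population step looks something like this (the whitespace split, key layout and timing column below are simplified placeholders, not our real log format):

    import collections

    def build_counts(log_path):
        # One counting dictionary per log file: key -> accumulated value.
        counts = collections.defaultdict(int)
        with open(log_path) as log:
            for line in log:
                fields = line.split()
                # Placeholder parsing: the real key construction and field
                # positions are more involved than this.
                key = (fields[0], fields[1])
                timing = int(fields[2])
                counts[key] += timing
        return counts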
Some of our log files are generated by separate monitoring equipment measuring the same process. In theory these log files should be identical, but of course they are not. I'm looking for a way to determine the differences between the two dictionaries I will create from so-called matching sets of log files (a rough sketch of one idea is in the P.S. below).

At this point I don't have concerns about memory, as I'm running my scripts on a dedicated 64-bit server with 32 GB of RAM (and budget approval to raise our total RAM to 64 GB if necessary). My main concern is whether I'm applying a reasonably Pythonic approach to my problem, e.g. whether I'm using appropriate Python techniques and data structures. I'm also interested in reasonable techniques that will give me the fastest execution time.

Thank you for sharing your thoughts with me.

Regards,
Malcolm

----- Original message -----
From: "Roger Binns" <rog...@rogerbinns.com>
To: python-list@python.org
Date: Tue, 23 Dec 2008 23:26:49 -0800
Subject: Re: Strategy for determining difference between 2 very large dictionaries

pyt...@bdurham.com wrote:
> Feedback on my proposed strategies (or better strategies) would be
> greatly appreciated.

Both strategies will work, but I'd recommend the second approach since it uses already-tested code written by other people - the chances of it being wrong are far lower than with new code.

You also neglected to mention what your concerns are, or even what "very large" is. Example concerns are memory consumption, CPU consumption, testability, and utility of output (e.g. as a generator getting each result on demand, or a single list with complete results). Some people will think a few hundred entries is large. My idea of large is a working set larger than my workstation's 6GB of memory :-)

In general the Pythonic approach is:

1 - Get the correct result
2 - Simple code (developer time is precious)
3 - Optimise for your data and environment

Step 3 is usually not needed.

Roger
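P.S. On your point about utility of output: the kind of comparison I have in mind is a generator that yields each differing key on demand rather than building a complete result list. A rough, untested sketch (names are placeholders):

    def diff_counts(counts_a, counts_b):
        # Yield one difference at a time: (key, value_in_a, value_in_b),
        # where a key missing from one dictionary shows up as None.
        for key in set(counts_a) | set(counts_b):
            a = counts_a.get(key)
            b = counts_b.get(key)
            if a != b:
                yield key, a, b

    # Hypothetical usage with the build_counts() sketch above:
    # for key, a, b in diff_counts(build_counts("rig_a.log"), build_counts("rig_b.log")):
    #     ...report or accumulate the mismatch...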