Re: handling very large dictionaries
You could also try using a key-value store. I am using pytc, a Python API for Tokyo Cabinet. It seems to figure out quite nicely when to go to disk, and when to use memory. But I have not done extensive tests. Here is some example code for using pytc: http://github.com/turian/pytc-example/tree/master

Joseph

On Jun 28, 7:13 pm, mclovin wrote:
> Hello all,
>
> I need to have a dictionary of about 8 gigs (well the data it is
> processing is around 4gb), so naturally I am running into memory
> errors.
>
> So I looked around and found bsddb, which acts like a dictionary object
> but offloads the data from RAM to the HDD; however, it only
> supports strings.
>
> My dictionaries hold my own class objects.
>
> Is there something like it that is more flexible?
-- http://mail.python.org/mailman/listinfo/python-list
Re: handling very large dictionaries
On Jun 29, 9:13 am, mclovin wrote:
> Is there something like it that is more flexible?

Have you seen the stdlib module 'shelve'? http://docs.python.org/library/shelve.html It creates a persistent file-based dictionary, which can hold any type of object as long as it can be pickled.

I really like the 3rd party module 'shove': http://pypi.python.org/pypi/shove It's similar to shelve but provides many more backend options, including dbs, svn and Amazon S3. Very handy.
-- http://mail.python.org/mailman/listinfo/python-list
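For reference, a minimal sketch of the shelve approach suggested above (the `Record` class, key names, and temp-file path are invented for illustration; in Python 3 the module works the same way and `shelve.open` supports the `with` statement):

```python
import os
import shelve
import tempfile

# Any picklable object can be a value; keys must be strings.
class Record:
    def __init__(self, a, b):
        self.a, self.b = a, b

path = os.path.join(tempfile.mkdtemp(), "store")

# Write phase: behaves like a dict, but backed by a file on disk.
with shelve.open(path) as db:
    db["rec1"] = Record(1, "hello")
    db["rec2"] = Record(2, "world")

# Read phase: values are unpickled on access, not all loaded into RAM.
with shelve.open(path) as db:
    print(db["rec1"].b)  # -> hello
```

Since values are unpickled lazily per key, the whole 8 GB structure never has to fit in memory at once, which is the point of the original question.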
Re: handling very large dictionaries
In article <9efff087-bd6e-49fb-ad30-a955a64b8...@j32g2000yqh.googlegroups.com>, mclovin wrote:
>
> I need to have a dictionary of about 8 gigs (well the data it is
> processing is around 4gb). so naturally i am running into memory
> errors.
>
> So i looked around and found bsddb which acts like a dictionary object
> only offloads the data from the RAM to the HDD, however that only
> supports strings.

Look at the pickle module. Personally, I'd use SQLite instead of bsddb. You might also look into a full-blown ORM such as SQLAlchemy.
--
Aahz (a...@pythoncraft.com) <*> http://www.pythoncraft.com/
"as long as we like the same operating system, things are cool." --piranha
-- http://mail.python.org/mailman/listinfo/python-list
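A hedged sketch of the SQLite-plus-pickle route suggested above (the table layout and the `kv_put`/`kv_get` helpers are invented for illustration, not part of any library):

```python
import pickle
import sqlite3

# Disk-backed key-value store: SQLite handles the on-disk structure,
# pickle serializes arbitrary Python objects into a BLOB column.
conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value BLOB)")

def kv_put(conn, key, obj):
    blob = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    conn.execute("REPLACE INTO kv VALUES (?, ?)", (key, blob))

def kv_get(conn, key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return pickle.loads(row[0]) if row else None

kv_put(conn, "k1", {"a": 1, "b": [1, 2, 3]})
print(kv_get(conn, "k1"))  # -> {'a': 1, 'b': [1, 2, 3]}
```

Unlike bsddb, this is not limited to string values, and SQLite only pages the parts of the file it needs into memory.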
handling very large dictionaries
Hello all,

I need to have a dictionary of about 8 gigs (well, the data it is processing is around 4gb), so naturally I am running into memory errors.

So I looked around and found bsddb, which acts like a dictionary object but offloads the data from RAM to the HDD; however, it only supports strings.

My dictionaries hold my own class objects.

Is there something like it that is more flexible?
-- http://mail.python.org/mailman/listinfo/python-list
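The kind of thing being asked for can be sketched as a dict-like wrapper that pickles values into a dbm file (the `DiskDict` class and its details are hypothetical, written for illustration; the stdlib `shelve` module, mentioned elsewhere in this thread, does essentially this already):

```python
import dbm
import os
import pickle
import tempfile

# A dict-like object backed by a dbm file. dbm itself only stores bytes,
# so values are pickled on the way in and unpickled on the way out --
# which lifts bsddb's strings-only restriction.
class DiskDict:
    def __init__(self, path):
        self.db = dbm.open(path, "c")  # "c": create if missing

    def __setitem__(self, key, value):
        self.db[key.encode()] = pickle.dumps(value, pickle.HIGHEST_PROTOCOL)

    def __getitem__(self, key):
        return pickle.loads(self.db[key.encode()])

    def __contains__(self, key):
        return key.encode() in self.db

d = DiskDict(os.path.join(tempfile.mkdtemp(), "data"))
d["x"] = [1, 2, {"nested": True}]
print(d["x"])  # -> [1, 2, {'nested': True}]
```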
Re: writing large dictionaries to file using cPickle
Hello,

I'm having the same problem. I'm creating two dictionaries with about 700,000 to 1,600,000 keys each, each key pointing to a smaller structure of embedded dictionaries. On an 8 x 3.0GHz Xeon Mac Pro with 17GB of RAM, cPickle is excruciatingly slow. Moreover, it crashes with memory error (12).

I've been looking for alternatives, even thinking of writing up my own file format and read/write functions. As a last resort I tried dumping the two lists with the marshal module and it worked great: a 200MB and a 300MB file are created in a couple of seconds without a crash.

You should be warned that marshal is not meant as a general data-persistence module. But as long as you have a rather simple data structure and you are using it with a specific Python version, it seems to run circles around cPickle.

Hope it helps.
-- http://mail.python.org/mailman/listinfo/python-list
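The marshal approach described above can be sketched like this (a small synthetic dictionary stands in for the real data; note the caveats from the post -- marshal handles only core built-in types, and its file format is tied to the Python version that wrote it):

```python
import marshal
import os
import tempfile

# Nested structure of plain dicts/lists/strings/ints: exactly the kind
# of "rather simple data structure" marshal can handle.
data = {"key%d" % i: {"a": i, "b": [i, i + 1]} for i in range(1000)}

path = os.path.join(tempfile.mkdtemp(), "data.marshal")
with open(path, "wb") as f:
    marshal.dump(data, f)

with open(path, "rb") as f:
    loaded = marshal.load(f)

print(loaded["key5"])  # -> {'a': 5, 'b': [5, 6]}
```

Because marshal skips pickle's memoization and class machinery entirely, it can be much faster for flat built-in data -- but it will simply refuse custom class instances.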
Re: writing large dictionaries to file using cPickle
On Jan 30, 2:44 pm, perfr...@gmail.com wrote: > On Jan 28, 6:08 pm, Aaron Brady wrote: > > > > > On Jan 28, 4:43 pm, perfr...@gmail.com wrote: > > > > On Jan 28, 5:14 pm, John Machin wrote: > > > > > On Jan 29, 3:13 am, perfr...@gmail.com wrote: > > > > > > hello all, > > > > > > i have a large dictionary which contains about 10 keys, each key has a > > > > > value which is a list containing about 1 to 5 million (small) > > > > > dictionaries. for example, > > > > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f': > > > > > 'world'}, ...], > > > > > key2: [...]} > > > > > > in total there are about 10 to 15 million lists if we concatenate > > > > > together all the values of every key in 'mydict'. mydict is a > > > > > structure that represents data in a very large file (about 800 > > > > > megabytes). > > > snip > > > > in reply to the other poster: i thought 'shelve' simply calls pickle. > > > if thats the case, it wouldnt be any faster, right ? > > > Yes, but not all at once. It's a clear winner if you need to update > > any of them later, but if it's just write-once, read-many, it's about > > the same. > > > You said you have a million dictionaries. Even if each took only one > > byte, you would still have a million bytes. Do you expect a faster I/ > > O time than the time it takes to write a million bytes? > > > I want to agree with John's worry about RAM, unless you have several+ > > GB, as you say. You are not dealing with small numbers. > > in my case, i just write the pickle file once and then read it in > later. in that case, cPickle and shelve would be identical, if i > understand correctly? No not identical. 'shelve' is not a dictionary, it's a database object that implements the mapping protocol. 'isinstance( shelve, dict )' is False, for example. > the file i'm reading in is ~800 MB file, and the pickle file is around > 300 MB. even if it were 800 MB, it doesn't make sense to me that > python's i/o would be that slow... 
> it takes roughly 5 seconds to write
> one megabyte of a binary file (the pickled object in this case), which
> just seems wrong. does anyone know anything about this? about how i/o
> can be sped up?

You can try copying a 1-MB file. Or something like:

f = open('temp.temp', 'w')
for x in range(100):
    f.write('0' * 10000)
f.close()

You know how long it takes OSes to boot, right?

> the dictionary might have a million keys, but each key's value is very
> small. i tried the same example where the keys are short strings (and
> there are about 10-15 million of them) and each value is an integer,
> and it is still very slow. does anyone know how to test whether i/o is
> the bottle neck, or whether it's something specific about pickle?
>
> thanks.

You could fall back to storing a parallel list by hand, if you're just using string and numeric primitives.
-- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
On Jan 28, 6:08 pm, Aaron Brady wrote: > On Jan 28, 4:43 pm, perfr...@gmail.com wrote: > > > On Jan 28, 5:14 pm, John Machin wrote: > > > > On Jan 29, 3:13 am, perfr...@gmail.com wrote: > > > > > hello all, > > > > > i have a large dictionary which contains about 10 keys, each key has a > > > > value which is a list containing about 1 to 5 million (small) > > > > dictionaries. for example, > > > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f': > > > > 'world'}, ...], > > > > key2: [...]} > > > > > in total there are about 10 to 15 million lists if we concatenate > > > > together all the values of every key in 'mydict'. mydict is a > > > > structure that represents data in a very large file (about 800 > > > > megabytes). > > snip > > > in reply to the other poster: i thought 'shelve' simply calls pickle. > > if thats the case, it wouldnt be any faster, right ? > > Yes, but not all at once. It's a clear winner if you need to update > any of them later, but if it's just write-once, read-many, it's about > the same. > > You said you have a million dictionaries. Even if each took only one > byte, you would still have a million bytes. Do you expect a faster I/ > O time than the time it takes to write a million bytes? > > I want to agree with John's worry about RAM, unless you have several+ > GB, as you say. You are not dealing with small numbers. in my case, i just write the pickle file once and then read it in later. in that case, cPickle and shelve would be identical, if i understand correctly? the file i'm reading in is ~800 MB file, and the pickle file is around 300 MB. even if it were 800 MB, it doesn't make sense to me that python's i/o would be that slow... it takes roughly 5 seconds to write one megabyte of a binary file (the pickled object in this case), which just seems wrong. does anyone know anything about this? about how i/o can be sped up for example? the dictionary might have a million keys, but each key's value is very small. 
i tried the same example where the keys are short strings (and there are about 10-15 million of them) and each value is an integer, and it is still very slow. does anyone know how to test whether i/o is the bottle neck, or whether it's something specific about pickle? thanks. -- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
On Wed, 28 Jan 2009 14:13:10 -0200, perfr...@gmail.com wrote:

> i have a large dictionary which contains about 10 keys, each key has a
> value which is a list containing about 1 to 5 million (small)
> dictionaries. for example,
>
> mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> 'world'}, ...],
>           key2: [...]}
>
> [pickle] creates extremely large files (~ 300 MB) though it does so
> *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
> it gets slower and slower. it takes almost an hour if not more to
> write this pickle object to file.

There is an undocumented Pickler attribute, "fast". Usually, when the same object is referenced more than once, only the first appearance is stored in the pickled stream; later references just point to the original. This requires the Pickler instance to remember every object pickled so far -- setting the "fast" attribute to a true value bypasses this check. Before using this, you must be positively sure that your objects don't contain circular references -- else pickling will never finish.

py> from cPickle import Pickler
py> from cStringIO import StringIO
py> s = StringIO()
py> p = Pickler(s, -1)
py> p.fast = 1
py> x = [1,2,3]
py> y = [x, x, x]
py> y
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y[0] is y[1]
True
py> p.dump(y)
py> s.getvalue()
'\x80\x02](](K\x01K\x02K\x03e](K\x01K\x02K\x03e](K\x01K\x02K\x03ee.'

Note that, when unpickling, shared references are broken:

py> s.seek(0,0)
py> from cPickle import load
py> y2 = load(s)
py> y2
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y2[0] is y2[1]
False

--
Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
On Jan 29, 9:43 am, perfr...@gmail.com wrote: > On Jan 28, 5:14 pm, John Machin wrote: > > > > > On Jan 29, 3:13 am, perfr...@gmail.com wrote: > > > > hello all, > > > > i have a large dictionary which contains about 10 keys, each key has a > > > value which is a list containing about 1 to 5 million (small) > > > dictionaries. for example, > > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f': > > > 'world'}, ...], > > > key2: [...]} > > > > in total there are about 10 to 15 million lists if we concatenate > > > together all the values of every key in 'mydict'. mydict is a > > > structure that represents data in a very large file (about 800 > > > megabytes). > > > > what is the fastest way to pickle 'mydict' into a file? right now i am > > > experiencing a lot of difficulties with cPickle when using it like > > > this: > > > > from cPickle import pickle > > > pfile = open(my_file, 'w') > > > pickle.dump(mydict, pfile) > > > pfile.close() > > > > this creates extremely large files (~ 300 MB) though it does so > > > *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and > > > it gets slower and slower. it takes almost an hour if not more to > > > write this pickle object to file. > > > > is there any way to speed this up? i dont mind the large file... after > > > all the text file with the data used to make the dictionary was larger > > > (~ 800 MB) than the file it eventually creates, which is 300 MB. but > > > i do care about speed... > > > > i have tried optimizing this by using this: > > > > s = pickle.dumps(mydict, 2) > > > pfile.write(s) > > > > but this takes just as long... any ideas ? is there a different module > > > i could use that's more suitable for large dictionaries ? > > > thank you very much. > > > Pardon me if I'm asking the "bleedin' obvious", but have you checked > > how much virtual memory this is taking up compared to how much real > > memory you have? 
If the slowness is due to pagefile I/O, consider > > doing "about 10" separate pickles (one for each key in your top-level > > dictionary). > > the slowness is due to CPU when i profile my program using the unix > program 'top'... i think all the work is in the file I/O. the machine > i am using several GB of ram and ram memory is not heavily taxed at > all. do you know how file I/O can be sped up? More quick silly questions: (1) How long does it take to load that 300MB pickle back into memory using: (a) cpickle.load(f) (b) f.read() ? What else is happening on the machine while you are creating the pickle? (2) How does -- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
On Jan 28, 4:43 pm, perfr...@gmail.com wrote: > On Jan 28, 5:14 pm, John Machin wrote: > > > > > On Jan 29, 3:13 am, perfr...@gmail.com wrote: > > > > hello all, > > > > i have a large dictionary which contains about 10 keys, each key has a > > > value which is a list containing about 1 to 5 million (small) > > > dictionaries. for example, > > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f': > > > 'world'}, ...], > > > key2: [...]} > > > > in total there are about 10 to 15 million lists if we concatenate > > > together all the values of every key in 'mydict'. mydict is a > > > structure that represents data in a very large file (about 800 > > > megabytes). snip > in reply to the other poster: i thought 'shelve' simply calls pickle. > if thats the case, it wouldnt be any faster, right ? Yes, but not all at once. It's a clear winner if you need to update any of them later, but if it's just write-once, read-many, it's about the same. You said you have a million dictionaries. Even if each took only one byte, you would still have a million bytes. Do you expect a faster I/ O time than the time it takes to write a million bytes? I want to agree with John's worry about RAM, unless you have several+ GB, as you say. You are not dealing with small numbers. -- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
On Jan 28, 5:14 pm, John Machin wrote: > On Jan 29, 3:13 am, perfr...@gmail.com wrote: > > > > > hello all, > > > i have a large dictionary which contains about 10 keys, each key has a > > value which is a list containing about 1 to 5 million (small) > > dictionaries. for example, > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f': > > 'world'}, ...], > > key2: [...]} > > > in total there are about 10 to 15 million lists if we concatenate > > together all the values of every key in 'mydict'. mydict is a > > structure that represents data in a very large file (about 800 > > megabytes). > > > what is the fastest way to pickle 'mydict' into a file? right now i am > > experiencing a lot of difficulties with cPickle when using it like > > this: > > > from cPickle import pickle > > pfile = open(my_file, 'w') > > pickle.dump(mydict, pfile) > > pfile.close() > > > this creates extremely large files (~ 300 MB) though it does so > > *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and > > it gets slower and slower. it takes almost an hour if not more to > > write this pickle object to file. > > > is there any way to speed this up? i dont mind the large file... after > > all the text file with the data used to make the dictionary was larger > > (~ 800 MB) than the file it eventually creates, which is 300 MB. but > > i do care about speed... > > > i have tried optimizing this by using this: > > > s = pickle.dumps(mydict, 2) > > pfile.write(s) > > > but this takes just as long... any ideas ? is there a different module > > i could use that's more suitable for large dictionaries ? > > thank you very much. > > Pardon me if I'm asking the "bleedin' obvious", but have you checked > how much virtual memory this is taking up compared to how much real > memory you have? If the slowness is due to pagefile I/O, consider > doing "about 10" separate pickles (one for each key in your top-level > dictionary). 
the slowness is due to CPU when i profile my program using the unix program 'top'... i think all the work is in the file I/O. the machine i am using has several GB of ram and ram memory is not heavily taxed at all. do you know how file I/O can be sped up?

in reply to the other poster: i thought 'shelve' simply calls pickle. if that's the case, it wouldn't be any faster, right?
-- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
On Jan 29, 3:13 am, perfr...@gmail.com wrote: > hello all, > > i have a large dictionary which contains about 10 keys, each key has a > value which is a list containing about 1 to 5 million (small) > dictionaries. for example, > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f': > 'world'}, ...], > key2: [...]} > > in total there are about 10 to 15 million lists if we concatenate > together all the values of every key in 'mydict'. mydict is a > structure that represents data in a very large file (about 800 > megabytes). > > what is the fastest way to pickle 'mydict' into a file? right now i am > experiencing a lot of difficulties with cPickle when using it like > this: > > from cPickle import pickle > pfile = open(my_file, 'w') > pickle.dump(mydict, pfile) > pfile.close() > > this creates extremely large files (~ 300 MB) though it does so > *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and > it gets slower and slower. it takes almost an hour if not more to > write this pickle object to file. > > is there any way to speed this up? i dont mind the large file... after > all the text file with the data used to make the dictionary was larger > (~ 800 MB) than the file it eventually creates, which is 300 MB. but > i do care about speed... > > i have tried optimizing this by using this: > > s = pickle.dumps(mydict, 2) > pfile.write(s) > > but this takes just as long... any ideas ? is there a different module > i could use that's more suitable for large dictionaries ? > thank you very much. Pardon me if I'm asking the "bleedin' obvious", but have you checked how much virtual memory this is taking up compared to how much real memory you have? If the slowness is due to pagefile I/O, consider doing "about 10" separate pickles (one for each key in your top-level dictionary). -- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
On Jan 28, 10:13 am, perfr...@gmail.com wrote:
> hello all,
>
> i have a large dictionary which contains about 10 keys, each key has a
> value which is a list containing about 1 to 5 million (small)
> dictionaries. for example,
snip
> but this takes just as long... any ideas ? is there a different module
> i could use that's more suitable for large dictionaries ?
> thank you very much.

There is the 'shelve' module. You could create a shelf that tells you the filename of the 5 other ones. A million keys should be no problem, I guess. (It's standard library.) All your keys have to be strings, though, and all your values have to be pickleable. If that's a problem, yes you will need ZODB or Django (I understand), or another relational DB.
-- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
perfr...@gmail.com wrote:
> but this takes just as long... any ideas ? is there a different module
> i could use that's more suitable for large dictionaries ?
> thank you very much.

Have a look at ZODB.
-- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
On Jan 28, 11:32 am, pyt...@bdurham.com wrote:
> Hi,
>
> Change:
>
> pickle.dump(mydict, pfile)
>
> to:
>
> pickle.dump(mydict, pfile, -1)
>
> I think you will see a big difference in performance and also a much
> smaller file on disk.
>
> BTW: What type of application are you developing that creates so many
> dictionaries? Sounds interesting.
>
> Malcolm

hi! thank you for your reply. unfortunately i tried this but it doesn't change the speed. it's still writing the file extremely slowly. i'm not sure why? thank you.
-- http://mail.python.org/mailman/listinfo/python-list
Re: writing large dictionaries to file using cPickle
Hi,

Change:

pickle.dump(mydict, pfile)

to:

pickle.dump(mydict, pfile, -1)

I think you will see a big difference in performance and also a much smaller file on disk.

BTW: What type of application are you developing that creates so many dictionaries? Sounds interesting.

Malcolm
-- http://mail.python.org/mailman/listinfo/python-list
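Malcolm's suggestion, sketched with a small synthetic dictionary (in Python 3, cPickle was folded into the stdlib `pickle` module, but the protocol argument works the same way: protocol 0 is the old text format, -1 selects the highest binary protocol available):

```python
import pickle

# A stand-in for the real 10-keys-of-many-small-dicts structure.
mydict = {"key%d" % i: {"a": i, "b": "hello"} for i in range(1000)}

text = pickle.dumps(mydict, 0)     # old text protocol (the py2 default)
binary = pickle.dumps(mydict, -1)  # highest binary protocol

# The binary pickle is typically much smaller and faster to write.
print(len(text), len(binary))
```

Whether this also fixes the slowness depends on where the time goes; as the follow-up post notes, it did not help in the OP's case.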
writing large dictionaries to file using cPickle
hello all,

i have a large dictionary which contains about 10 keys, each key has a value which is a list containing about 1 to 5 million (small) dictionaries. for example,

mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}

in total there are about 10 to 15 million lists if we concatenate together all the values of every key in 'mydict'. mydict is a structure that represents data in a very large file (about 800 megabytes).

what is the fastest way to pickle 'mydict' into a file? right now i am experiencing a lot of difficulties with cPickle when using it like this:

import cPickle as pickle
pfile = open(my_file, 'wb')
pickle.dump(mydict, pfile)
pfile.close()

this creates extremely large files (~ 300 MB) though it does so *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and it gets slower and slower. it takes almost an hour if not more to write this pickle object to file.

is there any way to speed this up? i dont mind the large file... after all the text file with the data used to make the dictionary was larger (~ 800 MB) than the file it eventually creates, which is 300 MB. but i do care about speed...

i have tried optimizing this by using this:

s = pickle.dumps(mydict, 2)
pfile.write(s)

but this takes just as long... any ideas ? is there a different module i could use that's more suitable for large dictionaries ?

thank you very much.
-- http://mail.python.org/mailman/listinfo/python-list
Re: optimizing large dictionaries
On Jan 15, 6:39 pm, Per Freem wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order of tens of millions elements...
>
> i create a dictionary that maps elements of the file that i want to
> count to their number of occurs. so i iterate through the file and for
> each line extract the elements (simple text operation) and see if it
> has an entry in the dict:
>
> for line in file:
>     try:
>         elt = MyClass(line) # extract elt from line...
>         my_dict[elt] += 1
>     except KeyError:
>         my_dict[elt] = 1
snip
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.

How about this?:

for line in file:
    elt = MyClass(line)
    my_dict[elt] = my_dict.get(elt, 0) + 1

...
-- http://mail.python.org/mailman/listinfo/python-list
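A related alternative to the `dict.get` idiom above (assuming Python 2.7+, which postdates this thread): `collections.Counter` implements exactly this counting loop in optimized form.

```python
from collections import Counter

# Stand-in for the lines of the real file; in practice you would pass
# the file object (or a generator of extracted elements) to Counter.
lines = ["1-2-3", "4-5-6", "1-2-3", "1-2-3"]

counts = Counter(lines)
print(counts["1-2-3"])  # -> 3
print(counts.most_common(1))  # -> [('1-2-3', 3)]
```

Counter also works with hashable class instances as elements, so the `MyClass` objects from the post could be counted the same way.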
Re: optimizing large dictionaries
Per Freem writes:
> the only 'twist' is that my elt is an instance of a class (MyClass)
> with 3 fields, all numeric. the class is hashable, and so
> my_dict[elt] works well. the __repr__ and __hash__ methods of my
> class simply return str() representation of self,

which just calls __str__(). I guess you are aware of that but you could call self.__str__() directly. Maybe that saves something when you do that 10 million times.

> while __str__ just makes every numeric field into a
> concatenated string:
>
> class MyClass:
>
>     def __str__(self):
>         return "%s-%s-%s" % (self.field1, self.field2, self.field3)
>
>     def __repr__(self):
>         return str(self)
>
>     def __hash__(self):
>         return hash(str(self))

Maybe it would be faster to numerically combine the three fields instead of hashing the string representation.

Matthias
-- http://mail.python.org/mailman/listinfo/python-list
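Matthias's suggestion can be sketched like so. Only the field names come from the quoted class; the rest is illustrative. Note one extra point the thread glosses over: a class used as a dict key also needs an `__eq__` consistent with `__hash__`, otherwise two instances with identical fields are still distinct keys.

```python
# Hash the fields directly as a tuple instead of building an
# intermediate "%s-%s-%s" string for every lookup.
class MyClass:
    __slots__ = ("field1", "field2", "field3")

    def __init__(self, f1, f2, f3):
        self.field1, self.field2, self.field3 = f1, f2, f3

    def _key(self):
        return (self.field1, self.field2, self.field3)

    def __hash__(self):
        return hash(self._key())  # no string formatting per call

    def __eq__(self, other):
        return self._key() == other._key()

d = {}
d[MyClass(1, 2, 3)] = "x"
print(MyClass(1, 2, 3) in d)  # -> True: equal fields, same key
```

Tuple hashing avoids allocating and formatting a string on every dict operation, which adds up over tens of millions of lookups.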
Re: optimizing large dictionaries
On Jan 15, 4:39 pm, Per Freem wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order of tens of millions elements...
snip
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.

I am willing to bet a beer that the slow performance has nothing to do with the dict but is either because of MyClass or the read from disk.
-- http://mail.python.org/mailman/listinfo/python-list
Re: optimizing large dictionaries
Paul Rubin wrote:
> Per Freem writes:
>> 2] is there an easy way to have nested defaultdicts? ie i want to say
>> that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact
>> that my_dict is a dictionary, whose values are dictionaries that map
>> to ints. but that syntax is not valid.
>
> my_dict = defaultdict(lambda: defaultdict(int))

or

my_dict = defaultdict(functools.partial(defaultdict, int))

--Scott David Daniels
scott.dani...@acm.org
-- http://mail.python.org/mailman/listinfo/python-list
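Both spellings above, as a runnable sketch (the key names are arbitrary):

```python
import functools
from collections import defaultdict

# Why defaultdict(defaultdict(int)) is invalid: the argument must be a
# zero-argument *callable* that builds a fresh default value, not an
# already-built value. Both of these wrap the inner factory correctly.
d1 = defaultdict(lambda: defaultdict(int))
d1["a"]["b"] += 1  # missing layers spring into existence

d2 = defaultdict(functools.partial(defaultdict, int))
d2["a"]["b"] += 1

print(d1["a"]["b"], d2["a"]["b"])  # -> 1 1
```

The `functools.partial` form avoids creating a closure and can be marginally faster, but the two are interchangeable.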
Re: optimizing large dictionaries
On Jan 15, 5:31 pm, Per Freem wrote:
> ...the aKeys are very small (less than 100) where
> as the bKeys are the ones that are in the millions. so in that case,
> doing a Try-Except on aKey should be very efficient, since often it
> will not fail, ...

Do you know the aKeys in advance? If so, then just initialize the first layer, and it will never fail.
--
Paul
-- http://mail.python.org/mailman/listinfo/python-list
Re: optimizing large dictionaries
Per Freem wrote:
> 1] is Try-Except really slower? my dict actually has two layers, so
> my_dict[aKey][bKeys]. the aKeys are very small (less than 100) where
> as the bKeys are the ones that are in the millions. so in that case,
> doing a Try-Except on aKey should be very efficient, since often it
> will not fail, where as if I do: "if aKey in my_dict", that statement
> will get executed for each aKey. can someone definitely say whether
> Try-Except is faster or not? My benchmarks aren't conclusive and i
> hear it both ways from several people (though majority thinks
> TryExcept is faster).

A defaultdict is faster than a try/except if the key is missing. A defaultdict is the canonical way to solve your problem, too.

> 3] more importantly, is there likely to be a big improvement for
> splitting up one big dictionary into several smaller ones? if so, is
> there a straight forward elegant way to implement this? the way i am
> thinking is to just fix a number of dicts and populate them with
> elements. then during retrieval, try the first dict, if that fails,
> try the second, if not the third, etc... but i can imagine how that's
> more likely to lead to bugs / debugging given the way my code is setup
> so i am wondering whether it is really worth it.
> if it can lead to a factor of 2 difference, i will definitely
> implement it -- does anyone have experience with this?

Nested dicts are more likely to be slower than one large dict. You have to keep in mind that Python's dicts have been optimized to utilize the CPU cache as much as possible. Any layer you put on top of a single dict can increase the number of cache misses and the number of function calls. This may lead to a general slowdown. I assume that you won't notice a slow down. However my instincts tell me that you'll hardly get a speedup, either. You have to roll your own benchmarks to study which algorithm is faster in reality.

Christian
-- http://mail.python.org/mailman/listinfo/python-list
Re: optimizing large dictionaries
On Thu, 15 Jan 2009 14:49:29 -0800, bearophileHUGS wrote:

> Matimus, your suggestions are all good.
>
> Try-except is slower than:
> if x in adict: ... else: ...

Not according to my tests.

>>> def tryexcept(D, key):
...     try:
...         return D[key]
...     except KeyError:
...         return None
...
>>> def ifinelse(D, key):
...     if key in D:
...         return D[key]
...     else:
...         return None
...
>>> from timeit import Timer
>>> setup = "from __main__ import tryexcept, ifinelse; D = {2: 'c'}"
>>>
>>> t1 = Timer('tryexcept(D, 2)', setup)
>>> t2 = Timer('ifinelse(D, 2)', setup)
>>>
>>> min(t1.repeat())  # try...except
0.61699414253234863
>>> min(t2.repeat())  # if...else
0.71856999397277832

Perhaps what you meant to say is that try...except is slower *if the key is not found*. And in that case, there's a truly massive difference:

>>> t3 = Timer('tryexcept(D, 5)', setup)
>>> t4 = Timer('ifinelse(D, 5)', setup)
>>>
>>> min(t3.repeat())  # try...except
5.3846139907836914
>>> min(t4.repeat())  # if...else
0.5983281135559082

Which solution is faster on average will depend on the ratio of keys found versus keys not found. On my PC, based on the above results, if...else becomes faster when just 2% of keys are not found. But if you can arrange matters so that nearly all keys are found, then try...except can be faster.

There's another solution nobody has mentioned (as far as I can see): the get method. In my test I call get() from inside a function, rather than directly, so that the timing results include the function call overhead for all three strategies.

>>> def get(D, key):
...     return D.get(key)
...
>>> setup = "from __main__ import get; D = {2: 'c'}"
>>> t5 = Timer('get(D, 2)', setup)
>>> t6 = Timer('get(D, 5)', setup)
>>>
>>> min(t5.repeat())  # found key
0.88362312316894531
>>> min(t6.repeat())  # missing key
0.87782382965087891

--
Steven
-- http://mail.python.org/mailman/listinfo/python-list
Re: optimizing large dictionaries
Per Freem writes: > 2] is there an easy way to have nested defaultdicts? ie i want to say > that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact > that my_dict is a dictionary, whose values are dictionary that map to > ints. but that syntax is not valid. my_dict = defaultdict(lambda: defaultdict(int)) -- http://mail.python.org/mailman/listinfo/python-list
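A quick demonstration of that one-liner; the inner defaultdict(int) is created automatically on first access of a new outer key, so both levels can be incremented without any pre-initialization:

```python
from collections import defaultdict

# Outer dict whose missing values default to an inner int-counting dict.
my_dict = defaultdict(lambda: defaultdict(int))

# No KeyError on first touch: both levels spring into existence.
my_dict["aKey"]["bKey"] += 1
my_dict["aKey"]["bKey"] += 1

print(my_dict["aKey"]["bKey"])  # 2
```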
Re: optimizing large dictionaries
thanks to everyone for the excellent suggestions. a few follow up q's:

1] is Try-Except really slower? my dict actually has two layers, so
my_dict[aKey][bKeys]. the aKeys are very small (less than 100) whereas
the bKeys are the ones that are in the millions. so in that case, doing
a Try-Except on aKey should be very efficient, since often it will not
fail, whereas if I do: "if aKey in my_dict", that statement will get
executed for each aKey. can someone definitely say whether Try-Except
is faster or not? My benchmarks aren't conclusive and i hear it both
ways from several people (though the majority thinks Try-Except is
faster).

2] is there an easy way to have nested defaultdicts? ie i want to say
that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact
that my_dict is a dictionary whose values are dictionaries that map to
ints. but that syntax is not valid.

3] more importantly, is there likely to be a big improvement from
splitting up one big dictionary into several smaller ones? if so, is
there a straightforward, elegant way to implement this? the way i am
thinking is to just fix a number of dicts and populate them with
elements. then during retrieval, try the first dict, if that fails, try
the second, if not the third, etc... but i can imagine how that's more
likely to lead to bugs / debugging given the way my code is set up, so
i am wondering whether it is really worth it. if it can lead to a
factor of 2 difference, i will definitely implement it -- does anyone
have experience with this?

On Jan 15, 5:58 pm, Steven D'Aprano wrote:
> On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote:
>
>>> is there anything that can be done to speed up this simply code? right
>>> now it is taking well over 15 minutes to process, on a 3 Ghz machine
>>> with lots of RAM (though this is all taking CPU power, not RAM at this
>>> point.)
>>
>> class MyClass(object):
>>     # a new style class with slots saves some memory
>>     __slots__ = ("field1", "field2", "field3")
>
> I was curious whether using slots would speed up attribute access.
>
> >>> class Parrot(object):
> ...     def __init__(self, a, b, c):
> ...         self.a = a
> ...         self.b = b
> ...         self.c = c
> ...
> >>> class SlottedParrot(object):
> ...     __slots__ = 'a', 'b', 'c'
> ...     def __init__(self, a, b, c):
> ...         self.a = a
> ...         self.b = b
> ...         self.c = c
> ...
> >>> p = Parrot(23, "something", [1, 2, 3])
> >>> sp = SlottedParrot(23, "something", [1, 2, 3])
> >>> from timeit import Timer
> >>> setup = "from __main__ import p, sp"
> >>> t1 = Timer('p.a, p.b, p.c', setup)
> >>> t2 = Timer('sp.a, sp.b, sp.c', setup)
> >>> min(t1.repeat())
> 0.83308887481689453
> >>> min(t2.repeat())
> 0.62758088111877441
>
> That's not a bad improvement. I knew that __slots__ was designed to
> reduce memory consumption, but I didn't realise they were faster as
> well.
>
> --
> Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: optimizing large dictionaries
On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote:

>> is there anything that can be done to speed up this simply code? right
>> now it is taking well over 15 minutes to process, on a 3 Ghz machine
>> with lots of RAM (though this is all taking CPU power, not RAM at this
>> point.)
>
> class MyClass(object):
>     # a new style class with slots saves some memory
>     __slots__ = ("field1", "field2", "field3")

I was curious whether using slots would speed up attribute access.

>>> class Parrot(object):
...     def __init__(self, a, b, c):
...         self.a = a
...         self.b = b
...         self.c = c
...
>>> class SlottedParrot(object):
...     __slots__ = 'a', 'b', 'c'
...     def __init__(self, a, b, c):
...         self.a = a
...         self.b = b
...         self.c = c
...
>>> p = Parrot(23, "something", [1, 2, 3])
>>> sp = SlottedParrot(23, "something", [1, 2, 3])
>>> from timeit import Timer
>>> setup = "from __main__ import p, sp"
>>> t1 = Timer('p.a, p.b, p.c', setup)
>>> t2 = Timer('sp.a, sp.b, sp.c', setup)
>>> min(t1.repeat())
0.83308887481689453
>>> min(t2.repeat())
0.62758088111877441

That's not a bad improvement. I knew that __slots__ was designed to
reduce memory consumption, but I didn't realise they were faster as
well.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: optimizing large dictionaries
Matimus, your suggestions are all good.

Try-except is slower than:

if x in adict:
    ...
else:
    ...

A defaultdict is generally faster (there are some conditions under
which it's not faster, but they aren't very common; I think it's when
the ratio of duplicates is really low), creating just a tuple instead
of a class helps a lot, and when the CPU/OS allow it, Psyco too may
help some here.

If the resulting speed isn't enough yet, consider that Python dicts are
quite fast, so you may need a lot of care to write D/C/C++/Clisp code
that's faster for this problem.

I also suspect that when they become very large, Python dicts lose some
of their efficiency. If this is true, you may switch to a new
dictionary every chunk of file, and then merge the dicts at the end. I
don't actually know if this may speed up your Python code even more (if
you experiment with this, I'd like to know if it turns out false).

Bye,
bearophile
-- 
http://mail.python.org/mailman/listinfo/python-list
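The chunk-and-merge idea described above can be sketched as follows; the per-chunk dicts and the merge_counts helper are hypothetical, and the chunked reading of the file itself is left out:

```python
def merge_counts(dicts):
    # Fold several per-chunk count dicts into one, summing shared keys.
    merged = {}
    for d in dicts:
        for key, count in d.items():
            merged[key] = merged.get(key, 0) + count
    return merged

chunk1 = {"a-1-2": 3, "b-4-5": 1}
chunk2 = {"a-1-2": 2, "c-7-8": 5}
print(merge_counts([chunk1, chunk2]))  # {'a-1-2': 5, 'b-4-5': 1, 'c-7-8': 5}
```

Whether this beats one big dict is exactly the kind of question bearophile says needs a real benchmark.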
Re: optimizing large dictionaries
> class MyClass
>
>     def __str__(self):
>         return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>     def __repr__(self):
>         return str(self)
>
>     def __hash__(self):
>         return hash(str(self))
>
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)

class MyClass(object):
    # a new style class with slots saves some memory
    __slots__ = ("field1", "field2", "field3")

    def __hash__(self):
        # In general this is faster than your approach because
        # it requires fewer function calls. However, all fields must
        # be hashable.
        return hash((self.field1, self.field2, self.field3))

    def __eq__(self, other):
        # you MUST provide a rich compare __eq__ function
        # when you provide your own hash function. It's called
        # when hash(a) == hash(b).
        if not isinstance(other, MyClass):
            return NotImplemented
        return (self.field1 == other.field1
                and self.field2 == other.field2
                and self.field3 == other.field3)

    def __str__(self):
        return "%s-%s-%s" % (self.field1, self.field2, self.field3)

    # faster than your approach because it doesn't require more function
    # lookups. You can omit this alias, too. It's not required.
    __repr__ = __str__

Instead of the try/except clause I recommend collections.defaultdict.
defaultdict takes a callable as factory function. Since int() returns 0
it has the same effect as your code.

from collections import defaultdict
elt = defaultdict(int)

Christian
-- 
http://mail.python.org/mailman/listinfo/python-list
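Filled out with an __init__ (which the post leaves out), the pattern above behaves like this: two instances with equal fields hash alike and compare equal, so they land on the same dict key.

```python
from collections import defaultdict

class MyClass(object):
    # __slots__ saves per-instance memory; __hash__/__eq__ make equal
    # field values collapse onto one dictionary key.
    __slots__ = ("field1", "field2", "field3")

    def __init__(self, field1, field2, field3):
        self.field1 = field1
        self.field2 = field2
        self.field3 = field3

    def __hash__(self):
        return hash((self.field1, self.field2, self.field3))

    def __eq__(self, other):
        if not isinstance(other, MyClass):
            return NotImplemented
        return (self.field1 == other.field1
                and self.field2 == other.field2
                and self.field3 == other.field3)

elt_counts = defaultdict(int)
elt_counts[MyClass(1, 2, 3)] += 1
elt_counts[MyClass(1, 2, 3)] += 1   # equal fields -> same key

print(len(elt_counts))              # 1
print(elt_counts[MyClass(1, 2, 3)]) # 2
```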
Re: optimizing large dictionaries
On Fri, Jan 16, 2009 at 8:39 AM, Per Freem wrote:

> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order of tens of millions elements...
>
> for line in file:
>     try:
>         elt = MyClass(line) # extract elt from line...
>         my_dict[elt] += 1
>     except KeyError:
>         my_dict[elt] = 1
>
> class MyClass
>
>     def __str__(self):
>         return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>     def __repr__(self):
>         return str(self)
>
>     def __hash__(self):
>         return hash(str(self))
>
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.
> --
> http://mail.python.org/mailman/listinfo/python-list

Hello,

You can get a large speedup by removing the need to instantiate a new
MyClass instance on each iteration of your loop. Instead define one
MyClass with an 'interpret' method that would be called instead of
MyClass(). interpret would return the string
'%s-%s-%s' % (self.field1, etc..), i.e.

myclass = MyClass()
interpret = myclass.interpret

for line in file:
    elt = interpret(line) # extract elt from line...
    try:
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1

The speedup is on the order of 10 on my machine.

Cheers,
-- 
http://mail.python.org/mailman/listinfo/python-list
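Part of the win here comes from hoisting the bound-method lookup out of the loop and from never allocating an instance per line. A self-contained sketch of the same pattern (the LineParser class and the comma-separated line format are invented for illustration):

```python
class LineParser(object):
    # Hypothetical stand-in for MyClass: parse a line once and return a
    # cheap string key, instead of constructing an object per line.
    def interpret(self, line):
        field1, field2, field3 = line.split(",")
        return "%s-%s-%s" % (field1, field2, field3)

parser = LineParser()
interpret = parser.interpret      # bind the method lookup once

my_dict = {}
for line in ["1,2,3", "1,2,3", "4,5,6"]:
    elt = interpret(line)
    try:
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1

print(my_dict["1-2-3"])  # 2
```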
Re: optimizing large dictionaries
On Jan 15, 1:39 pm, Per Freem wrote:

> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order of tens of millions elements...
>
> i create a dictionary that maps elements of the file that i want to
> count to their number of occurs. so i iterate through the file and for
> each line extract the elements (simple text operation) and see if it
> has an entry in the dict:
>
> for line in file:
>     try:
>         elt = MyClass(line) # extract elt from line...
>         my_dict[elt] += 1
>     except KeyError:
>         my_dict[elt] = 1
>
> i am using try/except since it is supposedly faster (though i am not
> sure about this? is this really true in Python 2.5?).
>
> the only 'twist' is that my elt is an instance of a class (MyClass)
> with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
> works well. the __repr__ and __hash__ methods of my class simply
> return str() representation of self, while __str__ just makes
> everything numeric field into a concatenated string:
>
> class MyClass
>
>     def __str__(self):
>         return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>     def __repr__(self):
>         return str(self)
>
>     def __hash__(self):
>         return hash(str(self))
>
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.

You can use a tuple instead of a string, which should be a little
quicker (note that __hash__ must return an integer, so hash the tuple
rather than returning it directly):

def __hash__(self):
    return hash((self.field1, self.field2, self.field3))

You could speed it up even more if you just saved a single attribute
"fields" as a tuple to begin with.

Also, you can use defaultdict and get rid of the try/except. I don't
think try/except is slow, but avoiding it will give you a speed up.

from collections import defaultdict

my_dict = defaultdict(int)
for line in file:
    elt = MyClass(line) # extract elt from line...
    my_dict[elt] += 1

You might even consider turning "MyClass" into just a function that
extracts the values from the line and returns a tuple, which should
give you even more of a boost since a tuple is completely implemented
in C.

Matt
-- 
http://mail.python.org/mailman/listinfo/python-list
optimizing large dictionaries
hello

i have an optimization question about python. i am iterating through a
file and counting the number of repeated elements. the file has on the
order of tens of millions elements...

i create a dictionary that maps elements of the file that i want to
count to their number of occurrences. so i iterate through the file and
for each line extract the elements (simple text operation) and see if
it has an entry in the dict:

for line in file:
    try:
        elt = MyClass(line) # extract elt from line...
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1

i am using try/except since it is supposedly faster (though i am not
sure about this? is this really true in Python 2.5?).

the only 'twist' is that my elt is an instance of a class (MyClass)
with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
works well. the __repr__ and __hash__ methods of my class simply return
the str() representation of self, while __str__ just makes every
numeric field into a concatenated string:

class MyClass(object):

    def __str__(self):
        return "%s-%s-%s" % (self.field1, self.field2, self.field3)

    def __repr__(self):
        return str(self)

    def __hash__(self):
        return hash(str(self))

is there anything that can be done to speed up this simple code? right
now it is taking well over 15 minutes to process, on a 3 GHz machine
with lots of RAM (though this is all taking CPU power, not RAM at this
point.)

any general advice on how to optimize large dicts would be great too

thanks for your help.
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
On Dec 24, 7:04 pm, "Malcolm Greene" wrote: > Hi Roger, > > By very large dictionary, I mean about 25M items per dictionary. Each > item is a simple integer whose value will never exceed 2^15. In Python-speak about dictionaries, an "item" is a tuple (key, value). I presume from the gross difference between 25M and 2**15 -- more pedantry: 2^15 is 13 :-) -- that you mean that the values satisfy 0 <= value <= 2**15. > > I populate these dictionaries by parsing very large ASCII text files > containing detailed manufacturing events. From each line in my log file > I construct one or more keys and increment the numeric values associated > with these keys using timing info also extracted from each line. > > Some of our log files are generated by separate monitoring equipment > measuring the same process. In theory, these log files should be > identical, but of course they are not. I'm looking for a way to > determine the differences between the 2 dictionaries I will create from > so-called matching sets of log files. You seem to refer to the dictionaries as part of your requirements, not as part of the solution. Do you *need* the two dictionaries after you have obtained the differences? Let's start from the beginning: You have two log files, each of which can be abstracted as containing lots of (k, v) data, where each k describes an event and each v is a compressed 15-bit timestamp. You want to know for each datum whether it is (a) in both files (b) only in file1 (c) only in file2. Is that a fair summary? If so, there's an alternative that doesn't need to use dictionaries. Each file can be represented by a *list* of 32768 sets. Each set contains the ks that happened at the corresponding time... sets[fileno] [v].add(k). Once you've built that, you trundle through the pairs of sets doing set differences etc. Bear in mind that Python dicts and sets grow as required by doubling (IIRC) in size and rehashing from old to new. 
Consequently you get periodical disturbances which are not usually noticed but may be noticeable with large amounts of data. HTH, John -- http://mail.python.org/mailman/listinfo/python-list
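The list-of-sets idea above can be sketched like this (the event names and timestamps are invented; build_slots is a hypothetical helper):

```python
NUM_SLOTS = 2 ** 15   # one set per possible 15-bit timestamp

def build_slots(pairs):
    # pairs: iterable of (event_key, timestamp), 0 <= timestamp < 2**15.
    slots = [set() for _ in range(NUM_SLOTS)]
    for key, ts in pairs:
        slots[ts].add(key)
    return slots

file1 = build_slots([("evtA", 5), ("evtB", 5), ("evtC", 9)])
file2 = build_slots([("evtA", 5), ("evtC", 10)])

# Per-timestamp set differences: events present only in file1.
only_in_1 = {t: file1[t] - file2[t]
             for t in range(NUM_SLOTS) if file1[t] - file2[t]}
print(only_in_1)  # {5: {'evtB'}, 9: {'evtC'}}
```

Each slot stays small, so no single set ever goes through the repeated grow-and-rehash cycles the post warns about.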
Re: Strategy for determing difference between 2 very large dictionaries
Marc 'BlackJack' Rintsch wrote: On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: collection, I don't see the advantage of using an iterator or a list. I'm sure I'm missing a subtle point here :) `keys()` creates a list in memory, `iterkeys()` does not. With ``set(dict.keys())`` there is a point in time where the dictionary, the list, and the set co-exist in memory. With ``set(dict.iterkeys())`` only the set and the dictionary exist in memory. If you can, consider using 3.0 in which d.keys() is a set-like view of the keys of d. Same for d.values and d.items. The time and space to create such is O(1), I believe. since they are just alternate read-only interfaces to the internal dict storage. There is not even a separate set until you do something like intersect d1.keys() with d2.keys() or d1.keys() - d2.keys(). I think this is an under-appreciated feature of 3.0 which should be really useful with large dicts. tjr -- http://mail.python.org/mailman/listinfo/python-list
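In Python 3 the view behaviour described above looks like this; the set operators apply directly to the keys() views, with no intermediate list or separate set being built:

```python
d1 = {"a": 1, "b": 2, "c": 3}
d2 = {"b": 2, "c": 30, "d": 4}

# keys() returns a set-like view in Python 3: set operators just work.
print(d1.keys() - d2.keys())            # {'a'}
print(d2.keys() - d1.keys())            # {'d'}
print(sorted(d1.keys() & d2.keys()))    # ['b', 'c']
```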
Re: Strategy for determing difference between 2 very large dictionaries
Gabriel Genellina wrote: En Wed, 24 Dec 2008 06:23:00 -0200, escribió: ... k1 = set(dict1.iterkeys()) You've got an excelent explanation from Marc Rintsch. (Note that in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above code is supposed to be written in Python 2.x) And, in fact, a dictionary iterates its keys, so: k1 = set(dict1) works in 2.4, 2.5, 2.6, and 3.0 --Scott David Daniels scott.dani...@acm.org -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Peter Otten: > >>> a = dict(a=1, b=2, c=3) > >>> b = dict(b=2, c=30, d=4) > >>> dict(set(a.iteritems()) ^ set(b.iteritems())) For larger sets this may be better, may avoid the creation of the second set: dict(set(a.iteritems()).symmetric_difference(b.iteritems())) Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list
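In Python 3 spelling, where items() views are already set-like (provided the values are hashable), the same symmetric-difference trick needs no explicit set() calls. Note that for a key whose value differs between the two dicts, which value survives in the result depends on set iteration order:

```python
a = dict(a=1, b=2, c=3)
b = dict(b=2, c=30, d=4)

# Symmetric difference of the item views: every (key, value) pair that
# is not present in both dicts.
changed = dict(a.items() ^ b.items())

print(sorted(changed))  # ['a', 'c', 'd']
```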
Re: Most efficient way to build very large dictionaries
Martin wrote:

> I'd think he's talking about in memory SQLite Databases, this way you
> should be quite fast _and_ could dump all that to a persistent
> storage...

I was just talking about regular on disk SQLite databases. In terms of
priming the pump, you can just open/read/close the whole database file,
which will cause the database to be in the operating system cache
buffers, or just let SQLite do its thing.

For anyone who is curious about what Martin is referring to, SQLite
does support the database being in memory (although it isn't a file at
that point). It has a fake filename of :memory:. As an example you can
copy the table 'example' to memory using:

ATTACH ':memory:' AS memdb;
CREATE TABLE memdb.memexample AS SELECT * FROM example;

As with Python, all this stuff is premature optimization. Get it
working right first and then try tweaks to improve performance.

Roger
-- 
http://mail.python.org/mailman/listinfo/python-list
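The ATTACH example above, driven from Python's standard sqlite3 module; the table name and sample data are invented, and the "on-disk" side is also :memory: here just to keep the sketch self-contained:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE example (k TEXT, v INTEGER)")
con.executemany("INSERT INTO example VALUES (?, ?)", [("a", 1), ("b", 2)])

# Attach a second, in-memory database and copy the table into it.
con.execute("ATTACH ':memory:' AS memdb")
con.execute("CREATE TABLE memdb.memexample AS SELECT * FROM example")

rows = con.execute(
    "SELECT k, v FROM memdb.memexample ORDER BY k").fetchall()
print(rows)  # [('a', 1), ('b', 2)]
con.close()
```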
Re: Most efficient way to build very large dictionaries
Hi, 2008/12/24 : > Hi Roger, > >> you may want to consider using SQLite > > Thank you for your suggestion about looking at SQLite. I haven't > compared the performance of SQLite to Python dictionaries, but I'm > skeptical that SQLite would be faster than in-memory Python dictionaries > for the type of analysis I'm doing. I'd think he's talking about in memory SQLite Databases, this way you should be quite fast _and_ could dump all that to a persistent storage... regards martin -- http://soup.alt.delete.co.at http://www.xing.com/profile/Martin_Marcher http://www.linkedin.com/in/martinmarcher You are not free to read this message, by doing so, you have violated my licence and are required to urinate publicly. Thank you. Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Hi Gabriel, > in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above > code is supposed to be written in Python 2.x) I understand. Thank you. > note that dict comprehensions require Python 3.0 I'm relieved to know that I didn't miss that feature in my reading of Python's 2.5/2.6 documentation :) > You might use instead: > > dict((key,(dict1[key],dict2[key])) for key in ...) Excellent. Thank you. Regards, Malcolm - Original message - From: "Gabriel Genellina" To: python-list@python.org Date: Wed, 24 Dec 2008 07:10:16 -0200 Subject: Re: Strategy for determing difference between 2 very large dictionaries En Wed, 24 Dec 2008 06:23:00 -0200, escribió: > Hi Gabriel, > > Thank you very much for your feedback! > >> k1 = set(dict1.iterkeys()) > > I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage > to using an iterator vs. a list as the basis for creating a set? I You've got an excelent explanation from Marc Rintsch. (Note that in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above code is supposed to be written in Python 2.x) >>> can this last step be done via a simple list comprehension? > >> Yes; but isn't a dict comprehension more adequate? >> >> [key: (dict1[key], dict2[key]) for key in common_keys if >> dict1[key]!=dict2[key]} > > Cool!! I'm relatively new to Python and totally missed the ability to > work with dictionary comprehensions. Yes, your dictionary comprehension > technique is much better than the list comprehension approach I was > struggling with. Your dictionary comprehension statement describes > exactly what I wanted to write. This time, note that dict comprehensions require Python 3.0 -- so the code above won't work in Python 2.x. (It's not a good idea to mix both versions in the same post, sorry!) You might use instead: dict((key,(dict1[key],dict2[key])) for key in ...) but it's not as readable. 
-- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Hi Peter, If you are not interested in the intermediate results and the dictionary values are hashable you can get the difference by >>> a = dict(a=1, b=2, c=3) >>> b = dict(b=2, c=30, d=4) >>> dict(set(a.iteritems()) ^ set(b.iteritems())) {'a': 1, 'c': 3, 'd': 4} That's very cool! Thanks for sharing that technique. Regards, Malcolm - Original message - From: "Peter Otten" <__pete...@web.de> To: python-list@python.org Date: Wed, 24 Dec 2008 09:46:51 +0100 Subject: Re: Strategy for determing difference between 2 very large dictionaries Gabriel Genellina wrote: > En Wed, 24 Dec 2008 05:16:36 -0200, escribió: [I didn't see the original post] >> I'm looking for suggestions on the best ('Pythonic') way to >> determine the difference between 2 very large dictionaries >> containing simple key/value pairs. >> By difference, I mean a list of keys that are present in the >> first dictionary, but not the second. And vice versa. And a list >> of keys in common between the 2 dictionaries whose values are >> different. >> The 2 strategies I'm considering are: >> 1. Brute force: Iterate through first dictionary's keys and >> determine which keys it has that are missing from the second >> dictionary. If keys match, then verify that the 2 dictionaries >> have identical values for the same key. Repeat this process for >> the second dictionary. >> 2. Use sets: Create sets from each dictionary's list of keys and >> use Python's set methods to generate a list of keys present in >> one dictionary but not the other (for both dictionaries) as well >> as a set of keys the 2 dictionaries have in common. > > I cannot think of any advantage of the first approach - so I'd use sets. 
> > k1 = set(dict1.iterkeys()) > k2 = set(dict2.iterkeys()) > k1 - k2 # keys in dict1 not in dict2 > k2 - k1 # keys in dict2 not in dict1 > k1 & k2 # keys in both > >> Using the set >> of keys in common, compare values across dictionaries to >> determine which keys have different values (can this last step be >> done via a simple list comprehension?) If you are not interested in the intermediate results and the dictionary values are hashable you can get the difference by >>> a = dict(a=1, b=2, c=3) >>> b = dict(b=2, c=30, d=4) >>> dict(set(a.iteritems()) ^ set(b.iteritems())) {'a': 1, 'c': 3, 'd': 4} Peter -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Most efficient way to build very large dictionaries
Hi Roger, > The performance improvements you are seeing with Python over Oracle are > exactly the same range people see with SQLite over Oracle. One common usage > reported on the SQLite mailing list is people copying data out of Oracle and > running their analysis in SQLite because of the performance advantages. I wasn't aware that SQLite was so fast compared to Oracle. That's great news. I will definitely take a look at SQLite for my current data analysis project. ... hey, you're the author of APSW! :) For those following this thread, see APSW: http://code.google.com/p/apsw/ > The pragmas tune things like cache sizes. The SQLite default is 2MB, relying > on the operating system for caching beyond that. Bumping up that kind of size > was my suggestion :-) I missed that nuance - a side effect of emailing at 4am :) Thanks again for your help Roger! Regards, Malcolm - Original message - From: "Roger Binns" To: python-list@python.org Date: Wed, 24 Dec 2008 00:50:49 -0800 Subject: Re: Most efficient way to build very large dictionaries -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 pyt...@bdurham.com wrote: > Thank you for your suggestion about looking at SQLite. I haven't > compared the performance of SQLite to Python dictionaries, but I'm > skeptical that SQLite would be faster than in-memory Python dictionaries > for the type of analysis I'm doing. I'd recommend at least trying a test just to see. As an example SQLite uses indices which will be faster than Python dicts for some set operations. (And if you aren't careful, your various Python based optimizations will end up duplicating what SQLite does internally anyway :-) > Prior to my use of Python, my > customer used a very expensive Oracle system to analyze their log files. > My simple Python scripts are 4-20x faster than the Oracle PL/SQL they > are replacing - and run on much cheaper hardware. SQLite is not like Oracle or any similar database system. It does not operate over the network or similar connection. 
It is a library in your process that has an optimized disk storage format (single file) and a SQL parser that generates bytecode for a special purpose virtual machine in pretty much the same way CPython operates. The performance improvements you are seeing with Python over Oracle are exactly the same range people see with SQLite over Oracle. One common usage reported on the SQLite mailing list is people copying data out of Oracle and running their analysis in SQLite because of the performance advantages. > Note: Memory is currently not a concern for me so I don't need SQLite's > ability to work with data sets larger than my physical memory. The pragmas tune things like cache sizes. The SQLite default is 2MB, relying on the operating system for caching beyond that. Bumping up that kind of size was my suggestion :-) Roger -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
En Wed, 24 Dec 2008 06:23:00 -0200, escribió:

> Hi Gabriel,
>
> Thank you very much for your feedback!
>
>> k1 = set(dict1.iterkeys())
>
> I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
> to using an iterator vs. a list as the basis for creating a set? I

You've got an excellent explanation from Marc Rintsch. (Note that in
Python 3.0 keys() behaves as iterkeys() in previous versions, so the
above code is supposed to be written in Python 2.x)

>>> can this last step be done via a simple list comprehension?
>
>> Yes; but isn't a dict comprehension more adequate?
>>
>> {key: (dict1[key], dict2[key]) for key in common_keys if
>> dict1[key] != dict2[key]}
>
> Cool!! I'm relatively new to Python and totally missed the ability to
> work with dictionary comprehensions. Yes, your dictionary comprehension
> technique is much better than the list comprehension approach I was
> struggling with. Your dictionary comprehension statement describes
> exactly what I wanted to write.

This time, note that dict comprehensions require Python 3.0 -- so the
code above won't work in Python 2.x. (It's not a good idea to mix both
versions in the same post, sorry!) You might use instead:

dict((key, (dict1[key], dict2[key])) for key in ...)

but it's not as readable.

-- 
Gabriel Genellina
-- 
http://mail.python.org/mailman/listinfo/python-list
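The 2.x-friendly generator-expression spelling above, completed with the common-keys computation (the sample dicts are invented):

```python
dict1 = {"a": 1, "b": 2, "c": 3}
dict2 = {"a": 1, "b": 20, "d": 4}

common_keys = set(dict1) & set(dict2)

# Keys present in both dicts whose values differ, mapped to both values.
diffs = dict((key, (dict1[key], dict2[key]))
             for key in common_keys if dict1[key] != dict2[key])

print(diffs)  # {'b': (2, 20)}
```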
Re: Strategy for determing difference between 2 very large dictionaries
Hi James, > For the purpose of perpetuating the annoying pedantry that has made > usenet great: > > http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists Great tip! Thank you! Malcolm -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determing difference between 2 very large dictionaries
Marc 'BlackJack' Rintsch wrote: On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: Hi Gabriel, Thank you very much for your feedback! k1 = set(dict1.iterkeys()) I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage to using an iterator vs. a list as the basis for creating a set? I understand that an iterator makes sense if you're working with a large set of items one at a time, but if you're creating a non-filtered collection, I don't see the advantage of using an iterator or a list. I'm sure I'm missing a subtle point here :) `keys()` creates a list in memory, `iterkeys()` does not. With ``set(dict.keys())`` there is a point in time where the dictionary, the list, and the set co-exist in memory. With ``set(dict.iterkeys())`` only the set and the dictionary exist in memory. For the purpose of perpetuating the annoying pedantry that has made usenet great: http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists James -- James Stroud UCLA-DOE Institute for Genomics and Proteomics Box 951570 Los Angeles, CA 90095 http://www.jamesstroud.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Most efficient way to build very large dictionaries
pyt...@bdurham.com wrote: > Thank you for your suggestion about looking at SQLite. I haven't > compared the performance of SQLite to Python dictionaries, but I'm > skeptical that SQLite would be faster than in-memory Python dictionaries > for the type of analysis I'm doing. I'd recommend at least trying a test just to see. As an example SQLite uses indices which will be faster than Python dicts for some set operations. (And if you aren't careful, your various Python based optimizations will end up duplicating what SQLite does internally anyway :-) > Prior to my use of Python, my > customer used a very expensive Oracle system to analyze their log files. > My simple Python scripts are 4-20x faster than the Oracle PL/SQL they > are replacing - and run on much cheaper hardware. SQLite is not like Oracle or any similar database system. It does not operate over the network or similar connection. It is a library in your process that has an optimized disk storage format (single file) and a SQL parser that generates bytecode for a special purpose virtual machine in pretty much the same way CPython operates. The performance improvements you are seeing with Python over Oracle are exactly the same range people see with SQLite over Oracle. One common usage reported on the SQLite mailing list is people copying data out of Oracle and running their analysis in SQLite because of the performance advantages. > Note: Memory is currently not a concern for me so I don't need SQLite's > ability to work with data sets larger than my physical memory. The pragmas tune things like cache sizes. The SQLite default is 2MB, relying on the operating system for caching beyond that. 
Bumping up that kind of size was my suggestion :-) Roger -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determining difference between 2 very large dictionaries
Gabriel Genellina wrote: > En Wed, 24 Dec 2008 05:16:36 -0200, escribió: [I didn't see the original post] >> I'm looking for suggestions on the best ('Pythonic') way to >> determine the difference between 2 very large dictionaries >> containing simple key/value pairs. >> By difference, I mean a list of keys that are present in the >> first dictionary, but not the second. And vice versa. And a list >> of keys in common between the 2 dictionaries whose values are >> different. >> The 2 strategies I'm considering are: >> 1. Brute force: Iterate through first dictionary's keys and >> determine which keys it has that are missing from the second >> dictionary. If keys match, then verify that the 2 dictionaries >> have identical values for the same key. Repeat this process for >> the second dictionary. >> 2. Use sets: Create sets from each dictionary's list of keys and >> use Python's set methods to generate a list of keys present in >> one dictionary but not the other (for both dictionaries) as well >> as a set of keys the 2 dictionaries have in common. > > I cannot think of any advantage of the first approach - so I'd use sets. > > k1 = set(dict1.iterkeys()) > k2 = set(dict2.iterkeys()) > k1 - k2 # keys in dict1 not in dict2 > k2 - k1 # keys in dict2 not in dict1 > k1 & k2 # keys in both > >> Using the set >> of keys in common, compare values across dictionaries to >> determine which keys have different values (can this last step be >> done via a simple list comprehension?) If you are not interested in the intermediate results and the dictionary values are hashable you can get the difference by >>> a = dict(a=1, b=2, c=3) >>> b = dict(b=2, c=30, d=4) >>> dict(set(a.iteritems()) ^ set(b.iteritems())) {'a': 1, 'c': 3, 'd': 4} Peter -- http://mail.python.org/mailman/listinfo/python-list
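As a hedged aside (Python 3 postdates this thread): there, dict.items() returns a set-like view, so Peter's symmetric-difference trick needs no explicit set() calls, again provided the values are hashable:

```python
a = dict(a=1, b=2, c=3)
b = dict(b=2, c=30, d=4)

# In Python 3, items() views support set operations directly when all
# values are hashable; ^ is symmetric difference.
diff = dict(a.items() ^ b.items())

# 'c' appears in both inputs with different values, so which value survives
# the dict() call depends on set iteration order - only the keys are stable.
print(sorted(diff))  # ['a', 'c', 'd']
```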
Re: Strategy for determining difference between 2 very large dictionaries
Hi Marc, > `keys()` creates a list in memory, `iterkeys()` does not. With > ``set(dict.keys())`` there is a point in time where the dictionary, the > list, and the set co-exist in memory. With ``set(dict.iterkeys())`` > only the set and the dictionary exist in memory. Perfect explanation. Thank you! Malcolm - Original message - From: "Marc 'BlackJack' Rintsch" To: python-list@python.org Date: 24 Dec 2008 08:30:41 GMT Subject: Re: Strategy for determing difference between 2 very large dictionaries On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: > Hi Gabriel, > > Thank you very much for your feedback! > >> k1 = set(dict1.iterkeys()) > > I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage > to using an iterator vs. a list as the basis for creating a set? I > understand that an iterator makes sense if you're working with a large > set of items one at a time, but if you're creating a non-filtered > collection, I don't see the advantage of using an iterator or a list. > I'm sure I'm missing a subtle point here :) `keys()` creates a list in memory, `iterkeys()` does not. With ``set(dict.keys())`` there is a point in time where the dictionary, the list, and the set co-exist in memory. With ``set(dict.iterkeys())`` only the set and the dictionary exist in memory. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Most efficient way to build very large dictionaries
Hi Roger, > you may want to consider using SQLite Thank you for your suggestion about looking at SQLite. I haven't compared the performance of SQLite to Python dictionaries, but I'm skeptical that SQLite would be faster than in-memory Python dictionaries for the type of analysis I'm doing. Prior to my use of Python, my customer used a very expensive Oracle system to analyze their log files. My simple Python scripts are 4-20x faster than the Oracle PL/SQL they are replacing - and run on much cheaper hardware. Note: Memory is currently not a concern for me so I don't need SQLite's ability to work with data sets larger than my physical memory. Regards, Malcolm - Original message - From: "Roger Binns" To: python-list@python.org Date: Wed, 24 Dec 2008 00:19:56 -0800 Subject: Re: Most efficient way to build very large dictionaries -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 pyt...@bdurham.com wrote: > I would appreciate your thoughts on whether there are advantages to > working with a pre-built dictionary and if so, what are the best ways to > create a pre-loaded dictionary. Based on this and your other thread, you may want to consider using SQLite (standard Python module is available for it). SQL queries are very similar to set operations and indices take care of performance (by using more storage). You also get transactions for free. If you look into SQLite pragmas you can get it to use more RAM to improve performance. And by using a SQL layer you can later switch to another database should you need really hard core storage. It also makes the data available to other programs in a uniform way. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAklR8KgACgkQmOOfHg372QSrHQCfVJzueXVKme8QZcxoLf70BL4K RL8AoM9QOFykOLrr5QXtpmZ5f7CFHm6e =zAPG -END PGP SIGNATURE- -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determining difference between 2 very large dictionaries
On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: > Hi Gabriel, > > Thank you very much for your feedback! > >> k1 = set(dict1.iterkeys()) > > I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage > to using an iterator vs. a list as the basis for creating a set? I > understand that an iterator makes sense if you're working with a large > set of items one at a time, but if you're creating a non-filtered > collection, I don't see the advantage of using an iterator or a list. > I'm sure I'm missing a subtle point here :) `keys()` creates a list in memory, `iterkeys()` does not. With ``set(dict.keys())`` there is a point in time where the dictionary, the list, and the set co-exist in memory. With ``set(dict.iterkeys())`` only the set and the dictionary exist in memory. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
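For what it's worth, in Python 3 (later than this thread) the distinction goes away: keys() returns a lightweight set-like view, so no intermediate list or explicit set is built at all. A small sketch with invented data:

```python
dict1 = {'a': 1, 'b': 2, 'c': 3}
dict2 = {'b': 2, 'c': 30, 'd': 4}

# Python 3 keys() views are set-like; the set operations run without
# materializing a list of keys first.
only_1 = dict1.keys() - dict2.keys()   # keys in dict1 not in dict2
only_2 = dict2.keys() - dict1.keys()   # keys in dict2 not in dict1
common = dict1.keys() & dict2.keys()   # keys in both

print(sorted(only_1), sorted(only_2), sorted(common))
```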
Re: Most efficient way to build very large dictionaries
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 pyt...@bdurham.com wrote: > Can I take advantage of this knowledge to optimize You do the optimization last :-) The first thing you need to do is make sure you have a way of validating you got the correct results. With 25M entries it would be very easy for an optimization to get the wrong results (maybe only one result wrong). Once you can verify the results correctness, write the simplest most readable code possible. Then once your whole program is approaching completion time how long things take to get an idea of where to optimize. For example it is pointless optimizing the loading phase if it only takes 2 minutes out of a 3 hour runtime. Finally you can optimize and always have a way of verifying that the optimizations are correct. You have the original code to document what is supposed to be happening as optimized code tends to get rather obfuscated and unreadable. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAklR8mkACgkQmOOfHg372QQvLQCgu6NYNUuhgR06KQunPmIrZ64B +rsAnAgQOKzMdmonF+zIhsX2r/Xg/72Y =LFfW -END PGP SIGNATURE- -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determining difference between 2 very large dictionaries
Hi Gabriel, Thank you very much for your feedback! > k1 = set(dict1.iterkeys()) I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage to using an iterator vs. a list as the basis for creating a set? I understand that an iterator makes sense if you're working with a large set of items one at a time, but if you're creating a non-filtered collection, I don't see the advantage of using an iterator or a list. I'm sure I'm missing a subtle point here :) >> can this last step be done via a simple list comprehension? > Yes; but isn't a dict comprehension more adequate? > > [key: (dict1[key], dict2[key]) for key in common_keys if > dict1[key]!=dict2[key]} Cool!! I'm relatively new to Python and totally missed the ability to work with dictionary comprehensions. Yes, your dictionary comprehension technique is much better than the list comprehension approach I was struggling with. Your dictionary comprehension statement describes exactly what I wanted to write. Regards, Malcolm - Original message - From: "Gabriel Genellina" To: python-list@python.org Date: Wed, 24 Dec 2008 05:46:04 -0200 Subject: Re: Strategy for determing difference between 2 very large dictionaries En Wed, 24 Dec 2008 05:16:36 -0200, escribió: > I'm looking for suggestions on the best ('Pythonic') way to > determine the difference between 2 very large dictionaries > containing simple key/value pairs. > By difference, I mean a list of keys that are present in the > first dictionary, but not the second. And vice versa. And a list > of keys in common between the 2 dictionaries whose values are > different. > The 2 strategies I'm considering are: > 1. Brute force: Iterate through first dictionary's keys and > determine which keys it has that are missing from the second > dictionary. If keys match, then verify that the 2 dictionaries > have identical values for the same key. Repeat this process for > the second dictionary. > 2. 
Use sets: Create sets from each dictionary's list of keys and > use Python's set methods to generate a list of keys present in > one dictionary but not the other (for both dictionaries) as well > as a set of keys the 2 dictionaries have in common. I cannot think of any advantage of the first approach - so I'd use sets. k1 = set(dict1.iterkeys()) k2 = set(dict2.iterkeys()) k1 - k2 # keys in dict1 not in dict2 k2 - k1 # keys in dict2 not in dict1 k1 & k2 # keys in both > Using the set > of keys in common, compare values across dictionaries to > determine which keys have different values (can this last step be > done via a simple list comprehension?) Yes; but isn't a dict comprehension more adequate? [key: (dict1[key], dict2[key]) for key in common_keys if dict1[key]!=dict2[key]} (where common_keys=k1&k2 as above) -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Most efficient way to build very large dictionaries
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 pyt...@bdurham.com wrote: > I would appreciate your thoughts on whether there are advantages to > working with a pre-built dictionary and if so, what are the best ways to > create a pre-loaded dictionary. Based on this and your other thread, you may want to consider using SQLite (standard Python module is available for it). SQL queries are very similar to set operations and indices take care of performance (by using more storage). You also get transactions for free. If you look into SQLite pragmas you can get it to use more RAM to improve performance. And by using a SQL layer you can later switch to another database should you need really hard core storage. It also makes the data available to other programs in a uniform way. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAklR8KgACgkQmOOfHg372QSrHQCfVJzueXVKme8QZcxoLf70BL4K RL8AoM9QOFykOLrr5QXtpmZ5f7CFHm6e =zAPG -END PGP SIGNATURE- -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determining difference between 2 very large dictionaries
Hi Roger, By very large dictionary, I mean about 25M items per dictionary. Each item is a simple integer whose value will never exceed 2^15. I populate these dictionaries by parsing very large ASCII text files containing detailed manufacturing events. From each line in my log file I construct one or more keys and increment the numeric values associated with these keys using timing info also extracted from each line. Some of our log files are generated by separate monitoring equipment measuring the same process. In theory, these log files should be identical, but of course they are not. I'm looking for a way to determine the differences between the 2 dictionaries I will create from so-called matching sets of log files. At this point in time, I don't have concerns about memory as I'm running my scripts on a dedicated 64-bit server with 32Gb of RAM (but with budget approval to raise our total RAM to 64Gb if necessary). My main concern is am I applying a reasonably pythonic approach to my problem, eg. am I using appropriate python techniques and data structures? I am also interested in using reasonable techniques that will provide me with the fastest execution time. Thank you for sharing your thoughts with me. Regards, Malcolm - Original message - From: "Roger Binns" To: python-list@python.org Date: Tue, 23 Dec 2008 23:26:49 -0800 Subject: Re: Strategy for determing difference between 2 very large dictionaries -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 pyt...@bdurham.com wrote: > Feedback on my proposed strategies (or better strategies) would be > greatly appreciated. Both strategies will work but I'd recommend the second approach since it uses already tested code written by other people - the chances of it being wrong are far lower than new code. You also neglected to mention what your concerns are or even what "very large" is. 
Example concerns are memory consumption, cpu consumption, testability, utility of output (eg as a generator getting each result on demand or a single list with complete results). Some people will think a few hundred entries is large. My idea of large is a working set larger than my workstation's 6GB of memory :-) In general the Pythonic approach is: 1 - Get the correct result 2 - Simple code (developer time is precious) 3 - Optimise for your data and environment Step 3 is usually not needed. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAklR5DUACgkQmOOfHg372QSuWACgp0xrdpW+NSB6qqCM3oBY2e/I LIEAn080VgNvmEYj47Mm7BtV69J1GwXN =MKLl -END PGP SIGNATURE- -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Strategy for determining difference between 2 very large dictionaries
On Tue, Dec 23, 2008 at 11:46 PM, Gabriel Genellina wrote: > > Yes; but isn't a dict comprehension more adequate? > > [key: (dict1[key], dict2[key]) for key in common_keys if > dict1[key]!=dict2[key]} That initial [ should be a { of course. Cheers, Chris -- Follow the path of the Iguana... http://rebertia.com -- http://mail.python.org/mailman/listinfo/python-list
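With Chris's fix applied, the whole recipe runs as written. A small end-to-end sketch with invented sample dictionaries:

```python
dict1 = {'a': 1, 'b': 2, 'c': 3}
dict2 = {'b': 2, 'c': 30, 'd': 4}

common_keys = set(dict1) & set(dict2)

# Dict comprehension with the leading { : keys present in both dicts
# whose values differ, mapped to the pair of differing values.
changed = {key: (dict1[key], dict2[key])
           for key in common_keys
           if dict1[key] != dict2[key]}

print(changed)  # {'c': (3, 30)}
```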
Re: Strategy for determining difference between 2 very large dictionaries
En Wed, 24 Dec 2008 05:16:36 -0200, escribió: I'm looking for suggestions on the best ('Pythonic') way to determine the difference between 2 very large dictionaries containing simple key/value pairs. By difference, I mean a list of keys that are present in the first dictionary, but not the second. And vice versa. And a list of keys in common between the 2 dictionaries whose values are different. The 2 strategies I'm considering are: 1. Brute force: Iterate through first dictionary's keys and determine which keys it has that are missing from the second dictionary. If keys match, then verify that the 2 dictionaries have identical values for the same key. Repeat this process for the second dictionary. 2. Use sets: Create sets from each dictionary's list of keys and use Python's set methods to generate a list of keys present in one dictionary but not the other (for both dictionaries) as well as a set of keys the 2 dictionaries have in common. I cannot think of any advantage of the first approach - so I'd use sets. k1 = set(dict1.iterkeys()) k2 = set(dict2.iterkeys()) k1 - k2 # keys in dict1 not in dict2 k2 - k1 # keys in dict2 not in dict1 k1 & k2 # keys in both Using the set of keys in common, compare values across dictionaries to determine which keys have different values (can this last step be done via a simple list comprehension?) Yes; but isn't a dict comprehension more adequate? [key: (dict1[key], dict2[key]) for key in common_keys if dict1[key]!=dict2[key]} (where common_keys=k1&k2 as above) -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
Most efficient way to build very large dictionaries
I'm working with some very large dictionaries (about 25M items per dictionary) that get filled based on data parsed from text based log files. I'm using Python dictionaries to track the frequency of specific combinations of events, eg, I build a key and then add various timing info from the current record to that key's current value. If the key doesn't exist, I create it and initialize it to its first value. Most keys will have anywhere from 2 to 10,000 value increments.

Over time, I've noticed that 90%+ of the keys in my dictionaries stay constant across daily runs of my program. Can I take advantage of this knowledge to optimize my dictionary performance by pre-building my dictionary based on a given list of keys whose values are all set to 0? Specifically, is there a way to bulk load a dictionary with keys/initial values that is faster than individually adding most dictionary entries one at a time?

Here are 2 high level strategies I've been thinking about for improving the performance of my dictionaries.

1. Create a list of keys from a current dictionary, pickle this list of keys and then on my next program run, load the pickled list of keys and somehow convert this list to a dictionary with all key values set to 0.

2. After I've performed my analysis on a current dictionary, iterate through this dictionary, setting all values to 0 (is there a fast way to do this?), and then on my next program run, load this pickled dictionary of keys and zero values.

I would appreciate your thoughts on whether there are advantages to working with a pre-built dictionary and if so, what are the best ways to create a pre-loaded dictionary. Thank you, Malcolm -- http://mail.python.org/mailman/listinfo/python-list
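For the bulk-load question above, dict.fromkeys builds a dictionary from an iterable of keys with every value set to the same initial object in a single call, which fits strategy 1 directly. A hedged sketch (the key names are invented stand-ins for keys pickled from a prior run):

```python
import pickle

keys = ['evt-%d' % i for i in range(5)]  # stand-in for a pickled key list

# Bulk-create a dict with every value initialised to 0 in one call,
# instead of inserting entries one at a time.
counts = dict.fromkeys(keys, 0)
print(len(counts), counts['evt-0'])

# Strategy 2's "zero an existing dict in place" step:
for k in counts:
    counts[k] = 0

# Round-trip the key list through pickle, as strategy 1 suggests.
blob = pickle.dumps(keys)
rebuilt = dict.fromkeys(pickle.loads(blob), 0)
assert rebuilt == counts
```

Note that fromkeys stores the *same* value object under every key, which is harmless for an immutable 0 but would alias a mutable default such as [].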
Re: Strategy for determining difference between 2 very large dictionaries
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 pyt...@bdurham.com wrote: > Feedback on my proposed strategies (or better strategies) would be > greatly appreciated. Both strategies will work but I'd recommend the second approach since it uses already tested code written by other people - the chances of it being wrong are far lower than new code. You also neglected to mention what your concerns are or even what "very large" is. Example concerns are memory consumption, cpu consumption, testability, utility of output (eg as a generator getting each result on demand or a single list with complete results). Some people will think a few hundred entries is large. My idea of large is a working set larger than my workstation's 6GB of memory :-) In general the Pythonic approach is: 1 - Get the correct result 2 - Simple code (developer time is precious) 3 - Optimise for your data and environment Step 3 is usually not needed. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAklR5DUACgkQmOOfHg372QSuWACgp0xrdpW+NSB6qqCM3oBY2e/I LIEAn080VgNvmEYj47Mm7BtV69J1GwXN =MKLl -END PGP SIGNATURE- -- http://mail.python.org/mailman/listinfo/python-list
Strategy for determining difference between 2 very large dictionaries
I'm looking for suggestions on the best ('Pythonic') way to determine the difference between 2 very large dictionaries containing simple key/value pairs. By difference, I mean a list of keys that are present in the first dictionary, but not the second. And vice versa. And a list of keys in common between the 2 dictionaries whose values are different. The 2 strategies I'm considering are: 1. Brute force: Iterate through first dictionary's keys and determine which keys it has that are missing from the second dictionary. If keys match, then verify that the 2 dictionaries have identical values for the same key. Repeat this process for the second dictionary. 2. Use sets: Create sets from each dictionary's list of keys and use Python's set methods to generate a list of keys present in one dictionary but not the other (for both dictionaries) as well as a set of keys the 2 dictionaries have in common. Using the set of keys in common, compare values across dictionaries to determine which keys have different values (can this last step be done via a simple list comprehension?) Feedback on my proposed strategies (or better strategies) would be greatly appreciated. Thank you, Malcolm -- http://mail.python.org/mailman/listinfo/python-list
Re: Optimizing size of very large dictionaries
Raymond Hettinger wrote:

Background: I'm trying to identify duplicate records in very large text based transaction logs. I'm detecting duplicate records by creating a SHA1 checksum of each record and using this checksum as a dictionary key. This works great except for several files whose size is such that their associated checksum dictionaries are too big for my workstation's 2G of RAM.

Tons of memory can be saved by not storing the contents of the record. Just make an initial pass to identify the digest values of possible duplicates. Then use a set to identify actual dups, but only store the records for those whose digest is a possible duplicate:

bag = collections.defaultdict(int)
for record in logs:
    bag[sha1(record).digest()] += 1

possible_dups = set()
while bag:
    hashval, cnt = bag.popitem()
    if cnt > 1:
        possible_dups.add(hashval)

Since actual counts above 1 are not needed, I believe a bit more memory could be saved by computing possible_dups incrementally. The logic is a bit trickier, and seeing the above helps.

once, dups = set(), set()
for record in logs:
    d = sha1(record).digest()
    if d in once:
        once.remove(d)
        dups.add(d)
    elif d not in dups:
        once.add(d)

seen = set()
for record in logs:
    if record in seen:
        print 'Duplicate:', record
    elif sha1(record).digest() in possible_dups:
        seen.add(record)

Raymond

P.S. If the log entries are one liners, maybe it would be better to use the operating system's sort/uniq filters. -- http://mail.python.org/mailman/listinfo/python-list
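The two-pass idea above, assembled into one runnable sketch (the sample records are invented; the first pass uses the incremental once/dups sets rather than the defaultdict counter):

```python
from hashlib import sha1

logs = [b"alpha", b"beta", b"alpha", b"gamma", b"beta", b"alpha"]

# Pass 1: find digests that occur more than once, tracking only two sets
# (seen exactly once / seen more than once) instead of full counts.
once, dups = set(), set()
for record in logs:
    d = sha1(record).digest()
    if d in once:
        once.remove(d)
        dups.add(d)
    elif d not in dups:
        once.add(d)

# Pass 2: report duplicates, keeping full records only for candidate digests.
seen, duplicates = set(), []
for record in logs:
    if record in seen:
        duplicates.append(record)
    elif sha1(record).digest() in dups:
        seen.add(record)

print(duplicates)
```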
Re: Optimizing size of very large dictionaries
On 2008-07-31 02:29, [EMAIL PROTECTED] wrote: Are there any techniques I can use to strip a dictionary data structure down to the smallest memory overhead possible? I'm working on a project where my available RAM is limited to 2G and I would like to use very large dictionaries vs. a traditional database. Background: I'm trying to identify duplicate records in very large text based transaction logs. I'm detecting duplicate records by creating a SHA1 checksum of each record and using this checksum as a dictionary key. This works great except for several files whose size is such that their associated checksum dictionaries are too big for my workstation's 2G of RAM. If you don't have a problem with taking a small performance hit, then I'd suggest to have a look at mxBeeBase, which is an on-disk dictionary implementation: http://www.egenix.com/products/python/mxBase/mxBeeBase/ Of course, you could also use a database table for this. Together with a proper index that should work as well (but it's likely slower than mxBeeBase). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 31 2008) >>> Python/Zope Consulting and Support ...http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 -- http://mail.python.org/mailman/listinfo/python-list
Re: Optimizing size of very large dictionaries
> > Are there any techniques I can use to strip a dictionary data
> > structure down to the smallest memory overhead possible?

Sure. You can build your own version of a dict using UserDict.DictMixin. The underlying structure can be as space efficient as you want. FWIW, dictionaries automatically become more space efficient at larger sizes (50,000+ records). The size quadrupling strategy falls back to just doubling whenever a dict gets two thirds full.

> > Background: I'm trying to identify duplicate records in very
> > large text based transaction logs. I'm detecting duplicate
> > records by creating a SHA1 checksum of each record and using this
> > checksum as a dictionary key. This works great except for several
> > files whose size is such that their associated checksum
> > dictionaries are too big for my workstation's 2G of RAM.

Tons of memory can be saved by not storing the contents of the record. Just make an initial pass to identify the digest values of possible duplicates. Then use a set to identify actual dups, but only store the records for those whose digest is a possible duplicate:

bag = collections.defaultdict(int)
for record in logs:
    bag[sha1(record).digest()] += 1

possible_dups = set()
while bag:
    hashval, cnt = bag.popitem()
    if cnt > 1:
        possible_dups.add(hashval)

seen = set()
for record in logs:
    if record in seen:
        print 'Duplicate:', record
    elif sha1(record).digest() in possible_dups:
        seen.add(record)

Raymond

P.S. If the log entries are one liners, maybe it would be better to use the operating system's sort/uniq filters. -- http://mail.python.org/mailman/listinfo/python-list
Re: Optimizing size of very large dictionaries
On Wed, Jul 30, 2008 at 8:29 PM, <[EMAIL PROTECTED]> wrote: > Background: I'm trying to identify duplicate records in very large text > based transaction logs. I'm detecting duplicate records by creating a SHA1 > checksum of each record and using this checksum as a dictionary key. This > works great except for several files whose size is such that their > associated checksum dictionaries are too big for my workstation's 2G of RAM. What are the values of this dictionary? You can save memory by representing the checksums as long integers, if you're currently using strings. -Miles -- http://mail.python.org/mailman/listinfo/python-list
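Miles's suggestion sketched out. Whether the conversion saves memory depends on the representation: in the Python 2 of this thread a long could beat a 40-character hexdigest string, while in modern Python 3 the raw 20-byte digest is already compact, so it is worth measuring before converting:

```python
from hashlib import sha1

record = b"some log line"
digest = sha1(record).digest()   # 20 raw bytes (vs. 40 for hexdigest())

# Represent the digest as a single integer dictionary key.
as_int = int.from_bytes(digest, "big")

# The round trip is lossless, so collisions behave identically.
assert as_int.to_bytes(20, "big") == digest
print(as_int.bit_length() <= 160)
```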
Re: Optimizing size of very large dictionaries
En Wed, 30 Jul 2008 21:29:39 -0300, <[EMAIL PROTECTED]> escribió: Are there any techniques I can use to strip a dictionary data structure down to the smallest memory overhead possible? I'm working on a project where my available RAM is limited to 2G and I would like to use very large dictionaries vs. a traditional database. Background: I'm trying to identify duplicate records in very large text based transaction logs. I'm detecting duplicate records by creating a SHA1 checksum of each record and using this checksum as a dictionary key. This works great except for several files whose size is such that their associated checksum dictionaries are too big for my workstation's 2G of RAM. You could use a different hash algorithm yielding a smaller value (crc32, for example, fits on an integer). At the expense of having more collisions, and more processing time to check those possible duplicates. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
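Gabriel's trade-off, sketched with zlib.crc32: the checksum fits in 32 bits, but collisions must then be resolved against the records themselves. In this illustration full records are kept in the buckets purely to make the collision check explicit; a memory-conscious version would re-read them from disk:

```python
from zlib import crc32

records = [b"record one", b"record two", b"record one"]

seen = {}        # crc -> records with that checksum; buckets absorb collisions
duplicates = []
for rec in records:
    c = crc32(rec)
    bucket = seen.setdefault(c, [])
    if rec in bucket:            # a true duplicate, not just a CRC collision
        duplicates.append(rec)
    else:
        bucket.append(rec)

print(duplicates)
```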
Optimizing size of very large dictionaries
Are there any techniques I can use to strip a dictionary data structure down to the smallest memory overhead possible? I'm working on a project where my available RAM is limited to 2G and I would like to use very large dictionaries vs. a traditional database. Background: I'm trying to identify duplicate records in very large text based transaction logs. I'm detecting duplicate records by creating a SHA1 checksum of each record and using this checksum as a dictionary key. This works great except for several files whose size is such that their associated checksum dictionaries are too big for my workstation's 2G of RAM. Thank you, Malcolm -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Thomas Ganss wrote: > Klaas schrieb: > > > 4. Insert your keys in sorted order. > This advice is questionable - > it depends on the at least on the db vendor and probably > sometimes on the sort method, if inserting pre-sorted > values is better. The article I wrote that you quoted named a specific vendor (berkeley db) for which my advice is unquestionable and well-known. -Mike -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Marcin Ciura wrote: > Duncan Booth wrote: >> Marcin Ciura wrote: >>>See Figure 8 in >>>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf >> That isn't what the reference says. It only covers N up to a few >> thousand. Practical values of N need to at least go up into the >> millions. > > Please look at the performance graph of Tokuda's increment sequence. > You can see that it scales pretty well at least up to 100 million. Ah sorry, I misread it. It was sequences with several thousand elements, which corresponds to N of 100 million. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Duncan Booth wrote: > Marcin Ciura wrote: >>See Figure 8 in >>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf > That isn't what the reference says. It only covers N up to a few thousand. > Practical values of N need to at least go up into the millions. Please look at the performance graph of Tokuda's increment sequence. You can see that it scales pretty well at least up to 100 million. Marcin -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Marcin Ciura wrote: > Tim Peters wrote: >> shellsort is much more a refinement of insertion-sort than of >> bubblesort, but is not an O(N log N) algorithm regardlesss. > > With a judiciously chosen increment sequence, > the number of comparisons made by shellsort > *is* of the order N log N for practical values of N. > See Figure 8 in > http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf That isn't what the reference says. It only covers N up to a few thousand. Practical values of N need to at least go up into the millions. Read the summary: > Using sequential analysis, the search for optimal increment sequences > for Shellsort was accelerated enough to find them for arrays up to > several thousand elements. ... > However, the sequences obtained so far are too short to draw a > reliable conclusion whether an appropriate sequence of increments can > make Shellsort a O(N logN) sorting method on the average. Some > hypotheses may be possible when the sequences are prolonged to sort > arrays of about 10^5 elements. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Tim Peters wrote: > shellsort is much more a refinement of insertion-sort than of > bubblesort, but is not an O(N log N) algorithm regardlesss. With a judiciously chosen increment sequence, the number of comparisons made by shellsort *is* of the order N log N for practical values of N. See Figure 8 in http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf Marcin -- http://mail.python.org/mailman/listinfo/python-list
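A hedged sketch of what Marcin describes: shellsort driven by Tokuda's increment sequence. The gap formula is taken from the literature; this is an illustration, not code from the cited paper:

```python
import math

def tokuda_gaps(n):
    """Tokuda's increment sequence h_k = ceil((9*(9/4)**k - 4) / 5),
    truncated below n, largest gap first."""
    gaps, k = [], 0
    while True:
        h = math.ceil((9 * (9 / 4) ** k - 4) / 5)
        if h >= n:
            break
        gaps.append(h)
        k += 1
    return gaps[::-1] or [1]

def shellsort(a):
    """In-place shellsort; each pass is a gapped insertion sort."""
    for gap in tokuda_gaps(len(a)):
        for i in range(gap, len(a)):
            v, j = a[i], i
            while j >= gap and a[j - gap] > v:
                a[j] = a[j - gap]
                j -= gap
            a[j] = v
    return a

data = [5, 2, 9, 1, 5, 6, 0, 3]
print(shellsort(data))  # [0, 1, 2, 3, 5, 5, 6, 9]
```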
Re: Large Dictionaries
[Tim Peters] >> ... >> O(N log N) sorting algorithms helped by pre-existing order are >> uncommon, unless they do extra work to detect and exploit >> pre-existing order. [Lawrence D'Oliveiro] > Shellsort works well with nearly-sorted data. It's basically a smarter > version of bubblesort with much improved efficiency. It's also very > compact to implement. shellsort is much more a refinement of insertion-sort than of bubblesort, but is not an O(N log N) algorithm regardlesss. A nice, succinct summary of what was known about shellsort's behavior as of Sedgewick's 1996 "Analysis of Shellsort and Related Algorithms" can be found here: http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/shell/shellen.htm -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article <[EMAIL PROTECTED]>, "Iain King" <[EMAIL PROTECTED]> wrote: >Lawrence D'Oliveiro wrote: >> In article <[EMAIL PROTECTED]>, >> Scott David Daniels <[EMAIL PROTECTED]> wrote: >> >> >For example, time timsort (Python's internal sort) on pre-sorted >> >data; you'll find it is handled faster than random data. >> >> But isn't that how a reasonable sorting algorithm should behave? Less >> work to do if the data is already sorted? > >An already sorted list can be pathological for Quicksort, depending on >how you code it. I did say "reasonable". :) -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article <[EMAIL PROTECTED]>, "Tim Peters" <[EMAIL PROTECTED]> wrote: >For example, the O(N log N) heapsort is unreasonable compared to the >O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad >case for heapsort, but a good case for bubblesort)? O(N log N) >sorting algorithms helped by pre-existing order are uncommon, unless >they do extra work to detect and exploit pre-existing order. Shellsort works well with nearly-sorted data. It's basically a smarter version of bubblesort with much improved efficiency. It's also very compact to implement. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article <[EMAIL PROTECTED]>, Tim Peters <[EMAIL PROTECTED]> wrote:
>[Jim Segrave]
>> Actually, presorted lists are not a bad case for heapsort - it's quite
>> immune to any existing order or lack thereof,
>
>Write a heapsort and time it.  It's not a difference in O() behavior,
>but more memory movement is required for a sorted list because
>transforming the list into a max-heap at the start requires shuffling
>the largest elements from the end of the list to its start.  Initially
>reverse-sorted is a "good case" for heapsort in the same sense
>(because the list is already a max-heap then; initially sorted is as
>far from being a max-heap as is possible).  "good" and "bad" are both
>O(N log N) here, though.

I did implement a crude heapsort before posting this and measured the
number of comparisons and the number of swaps.

==
#!/usr/local/bin/python

import random

class HeapSorter(object):

    def __init__(self, seq):
        self.compares = 0
        self.swaps = 0
        self.length = len(seq)
        self.seq = seq

    def __sift(self, i):
        while True:
            j = 2 * i + 1
            if j + 1 < self.length:
                self.compares += 1
                if self.seq[j + 1] > self.seq[j]:
                    j = j + 1
            if j >= self.length:
                return
            self.compares += 1
            if self.seq[i] > self.seq[j]:
                return
            self.swaps += 1
            self.seq[i], self.seq[j] = (self.seq[j], self.seq[i])
            i = j

    def __makeheap(self):
        for i in range(self.length / 2, -1, -1):
            self.__sift(i)

    def sort(self):
        self.__makeheap()
        for i in range(self.length - 1, -1, -1):
            self.swaps += 1
            self.seq[i], self.seq[0] = (self.seq[0], self.seq[i])
            self.length -= 1
            self.__sift(0)

if __name__ == '__main__':
    s = list(range(100000))
    random.shuffle(s)

    heap = HeapSorter(s)
    heap.sort()
    print "Random list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)
    heap = HeapSorter(s)
    heap.sort()
    print "Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)
    s.reverse()
    heap = HeapSorter(s)
    heap.sort()
    print "Reverse Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)
==

Which gave the following results:

/usr/home/jes% python ./heapsort
Random list: compares 3019969, swaps 1575263
Sorted list: compares 3112517, swaps 1650855
Reverse Sorted list: compares 2926640, swaps 1497435

Assuming compares and swaps dominate other bookkeeping, then heapsort
is quite stable in its performance, although the constant in the
O(N log N) is much larger than for other sorts.

-- Jim Segrave ([EMAIL PROTECTED])
-- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
[Jim Segrave] > Actually, presorted lists are not a bad case for heapsort - it's quite > immune to any existing order or lack thereof, Write a heapsort and time it. It's not a difference in O() behavior, but more memory movement is required for a sorted list because transforming the list into a max-heap at the start requires shuffling the largest elements from the end of the list to its start. Initially reverse-sorted is a "good case" for heapsort in the same sense (because the list is already a max-heap then; initially sorted is as far from being a max-heap as is possible). "good" and "bad" are both O(N log N) here, though. BTW, most people have never heard of Smoothsort, which was Dijkstra's excruciating attempt to make heapsort exploit pre-existing order: http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796.PDF That's one of a hundred algorithms I rejected for Python ;-) > whereas some other sorts, quicksort being a prime example, require > great care to avoid pathological cases. Indeed, Python gave up on quicksort for that reason. While the samplesort hybrid that came between quicksort and the current sort also had theoretical O(N**2) worst-case behavior, it was exceedingly difficult to create such an input, and (unlike as for every quicksort variant Python tried) no real-life case of undue sloth was ever reported against it. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article <[EMAIL PROTECTED]>, Tim Peters <[EMAIL PROTECTED]> wrote: >[Scott David Daniels] >>> For example, time timsort (Python's internal sort) on pre-sorted >>> data; you'll find it is handled faster than random data. > >O(N) vs O(N log N), in fact. > >[Lawrence D'Oliveiro] >> But isn't that how a reasonable sorting algorithm should behave? Less >> work to do if the data is already sorted? > >For example, the O(N log N) heapsort is unreasonable compared to the >O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad >case for heapsort, but a good case for bubblesort)? O(N log N) >sorting algorithms helped by pre-existing order are uncommon, unless >they do extra work to detect and exploit pre-existing order. Actually, presorted lists are not a bad case for heapsort - it's quite immune to any existing order or lack thereof, whereas some other sorts, quicksort being a prime example, require great care to avoid pathological cases. -- Jim Segrave ([EMAIL PROTECTED]) -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
[Scott David Daniels] >> For example, time timsort (Python's internal sort) on pre-sorted >> data; you'll find it is handled faster than random data. O(N) vs O(N log N), in fact. [Lawrence D'Oliveiro] > But isn't that how a reasonable sorting algorithm should behave? Less > work to do if the data is already sorted? For example, the O(N log N) heapsort is unreasonable compared to the O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad case for heapsort, but a good case for bubblesort)? O(N log N) sorting algorithms helped by pre-existing order are uncommon, unless they do extra work to detect and exploit pre-existing order. Python's current sorting algorithm exploits pre-existing order of many kinds. For example, take a sorted list, "cut it in half, and riffle shuffle the two halves together"; e.g., [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15] That turns out to be a supernaturally good case too. -- http://mail.python.org/mailman/listinfo/python-list
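Tim's riffle-shuffle example can be checked by counting comparisons with a `functools.cmp_to_key` wrapper (a sketch; the exact counts are CPython/timsort implementation details, though the n-1 count for presorted input is a well-known timsort property):

```python
import functools
import random

def count_compares(data):
    """Sort data with sorted(), counting element comparisons via a
    cmp_to_key wrapper (each __lt__ on the key objects is one compare)."""
    counter = [0]
    def cmp(x, y):
        counter[0] += 1
        return (x > y) - (x < y)
    sorted(data, key=functools.cmp_to_key(cmp))
    return counter[0]

n = 2 ** 14
ordered = list(range(n))
# "cut it in half, and riffle shuffle the two halves together"
riffled = [x for pair in zip(ordered[:n // 2], ordered[n // 2:])
           for x in pair]
random.seed(0)
shuffled = random.sample(ordered, n)

c_sorted = count_compares(ordered)
c_riffled = count_compares(riffled)
c_random = count_compares(shuffled)
print(c_sorted, c_riffled, c_random)
```

The presorted list costs exactly n-1 comparisons, and the riffled list lands well below random data, which is the "supernaturally good case" being described.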
Re: Large Dictionaries
Lawrence D'Oliveiro wrote: > In article <[EMAIL PROTECTED]>, > Scott David Daniels <[EMAIL PROTECTED]> wrote: > > >For example, time timsort (Python's internal sort) on pre-sorted > >data; you'll find it is handled faster than random data. > > But isn't that how a reasonable sorting algorithm should behave? Less > work to do if the data is already sorted? An already sorted list can be pathological for Quicksort, depending on how you code it. Iain -- http://mail.python.org/mailman/listinfo/python-list
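For instance (an illustrative sketch, not from the thread): a textbook quicksort that always takes the first element as pivot performs Theta(N**2) comparisons on already-sorted input. The version below is iterative, so it demonstrates the blow-up without hitting Python's recursion limit:

```python
import random

def naive_quicksort(seq):
    """Quicksort with a first-element pivot (Lomuto partition,
    iterative); returns (sorted copy, number of comparisons)."""
    data = list(seq)
    compares = 0
    stack = [(0, len(data) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = data[lo]
        i = lo
        for j in range(lo + 1, hi + 1):
            compares += 1
            if data[j] < pivot:
                i += 1
                data[i], data[j] = data[j], data[i]
        data[lo], data[i] = data[i], data[lo]   # pivot into place
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return data, compares

n = 2000
random.seed(0)
_, c_random = naive_quicksort(random.sample(range(n), n))
sorted_out, c_sorted = naive_quicksort(list(range(n)))
print(c_random, c_sorted)
```

On sorted input every partition is maximally lopsided, so the count is exactly n*(n-1)/2, roughly two orders of magnitude worse than the random case at this size.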
Re: Large Dictionaries
In article <[EMAIL PROTECTED]>, Lawrence D'Oliveiro <[EMAIL PROTECTED]> wrote: >In article <[EMAIL PROTECTED]>, > Scott David Daniels <[EMAIL PROTECTED]> wrote: >> >>For example, time timsort (Python's internal sort) on pre-sorted >>data; you'll find it is handled faster than random data. > >But isn't that how a reasonable sorting algorithm should behave? Less >work to do if the data is already sorted? Read some of the old discussions in the python-dev archives. -- Aahz ([EMAIL PROTECTED]) <*> http://www.pythoncraft.com/ "I saw `cout' being shifted "Hello world" times to the left and stopped right there." --Steve Gonedes -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Lawrence D'Oliveiro wrote: > In article <[EMAIL PROTECTED]>, > Scott David Daniels <[EMAIL PROTECTED]> wrote: > > >>For example, time timsort (Python's internal sort) on pre-sorted >>data; you'll find it is handled faster than random data. > > > But isn't that how a reasonable sorting algorithm should behave? Less > work to do if the data is already sorted? Isn't that just your definition of "reasonable"? regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC/Ltd http://www.holdenweb.com Love me, love my blog http://holdenweb.blogspot.com Recent Ramblings http://del.icio.us/steve.holden -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article <[EMAIL PROTECTED]>, Scott David Daniels <[EMAIL PROTECTED]> wrote: >For example, time timsort (Python's internal sort) on pre-sorted >data; you'll find it is handled faster than random data. But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data is already sorted? -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Roy Smith wrote:
> My guess would be that each resize grows the table by a constant
> factor, which IIRC, works out to amortized O(n).

Right. For <50000 entries, the size is multiplied by 4 each time, and
doubled each time for larger dictionaries.

> It's not as good as creating the dict the right size in the first
> place, but it's really not that bad. Somewhat more troubling is that
> it can lead to memory fragmentation problems.

Depends on your operating system, of course. You get over a page size
fairly quickly, and then many systems use anonymous mappings, where a
range of pages gets mapped from swap on allocation, and unmapped on
deallocation. So while this can cause address space fragmentation, it
doesn't cause memory fragmentation (i.e. memory that is allocated by
the process but can't be used).

> I don't understand why Python doesn't have a way to give a size hint
> when creating a dict. It seems so obvious to be able to say "d = dict
> (size=10*1000*1000)" if you know beforehand that's how many keys
> you're going to add.

There is no need for it. If dicts don't work well in some cases, these
cases should be investigated and fixed, instead of working around the
problem by pushing it onto the developer.

As you said, it *should* have linear time, with the number of
insertions. If it doesn't (and it appears not to), that may be due to
hash collisions, or due to the time for computing hashes not being
constant.

Regards,
Martin
-- http://mail.python.org/mailman/listinfo/python-list
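The resize steps are easy to observe from Python itself (a sketch; `sys.getsizeof` and the exact growth factors and thresholds are CPython implementation details that vary between versions):

```python
import sys

# Watch the dict's allocated size step upward as keys are inserted:
# the size stays flat between resizes, then jumps by the growth factor.
sizes = []
d = {}
for i in range(10000):
    d[i] = None
    sizes.append(sys.getsizeof(d))

growth_steps = sorted(set(sizes))
print(growth_steps)
```

Only a handful of distinct sizes appear for 10000 insertions, which is the amortized-O(n) behavior being discussed: each element is moved O(1) times on average.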
Re: Large Dictionaries
Thomas Ganss wrote: > Klaas schrieb: >> 4. Insert your keys in sorted order. > This advice is questionable - > > My gut feeling on this matter is: > IF the insert times of pre-sorted values is far better > than the times of unsorted values, there is a chance > that the resulting tree is unbalanced: only 1 compare > operation after insert and no re-balancing of the tree. > > re-indexing will probably give you far better access times > in this case. Another option is to drop non RI indexes used only > for query optimization and recreate them after batch insert. Don't use your gut for such issues. Pre-sorted data is such a common special case (in fact, the only easily describable special case) that many systems handle this specially. For example, time timsort (Python's internal sort) on pre-sorted data; you'll find it is handled faster than random data. If you are using your gut without testing, go ahead and presort. In any case, reading documents and testing beats gut feels every time. --Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Klaas schrieb:
> 4. Insert your keys in sorted order.

This advice is questionable - whether inserting pre-sorted values is
better depends at least on the db vendor and probably sometimes on the
sort method.

My gut feeling on this matter is:
IF the insert times of pre-sorted values is far better
than the times of unsorted values, there is a chance
that the resulting tree is unbalanced: only 1 compare
operation after insert and no re-balancing of the tree.

re-indexing will probably give you far better access times
in this case. Another option is to drop non RI indexes used only
for query optimization and recreate them after batch insert.

my 0.02 EUR

thomas
-- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris:
> class StorageBerkeleyDB(StorageTest):
>    def runtest(self, number_hash):
>        db = bsddb.hashopen(None, flag='c', cachesize=8192)
>        for (num, wildcard_digits) in number_hash.keys():
>            key = '%d:%d' % (num, wildcard_digits)
>            db[key] = None
>        db.close()

BDBs can accomplish what you're looking to do, but they need to be
tuned carefully.  I won't get into too many details here, but you have
a few fatal flaws in that code.

1. 8Kb of cache is _pathetic_.  Give it a few hundred megs.  This is by
far your biggest problem.
2. Use BTREE unless you have a good reason to use DBHASH.
3. Use proper bdb env creation instead of the hash_open apis.
4. Insert your keys in sorted order.

-Mike
-- http://mail.python.org/mailman/listinfo/python-list
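One subtlety with point 4: for a BTREE, "sorted order" means the byte order of the encoded keys, which matches numeric order only if the numbers are zero-padded to a fixed width. A small sketch (the encoder names are hypothetical, modeled on the '%d:%d' keys above):

```python
def raw_key(num, wildcard_digits):
    # naive encoding: bytestring order != numeric order ('10:0' < '9:0')
    return '%d:%d' % (num, wildcard_digits)

def padded_key(num, wildcard_digits):
    # fixed-width, zero-padded encoding: bytestring order == numeric order
    return '%020d:%02d' % (num, wildcard_digits)

keys = [(9, 0), (10, 0), (100, 1), (2, 3)]
print(sorted(raw_key(n, w) for (n, w) in keys))
print(sorted(padded_key(n, w) for (n, w) in keys))
```

So "insert in sorted order" means sorting the padded bytestrings, not the original tuples, unless the two orders happen to coincide.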
Re: Large Dictionaries
Chris: > Berkeley DB is great for accessing data by key for things already > stored on disk (i.e. read access), but write performance for each > key-value pair is slow due to it being careful about flushing > writes to disk by default. This is absolutely false. -Mike -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris Foote wrote: > Richie Hindle wrote: >> [Chris] >>> Has anyone written a fast hash module which is more optimal for >>> large datasets ? >> >> PyJudy might be what you're looking for, though I've never used it: >> >> http://www.dalkescientific.com/Python/PyJudy.html >> >> "Judy's key benefits are scalability, high performance, and memory >> efficiency. A Judy array is extensible and can scale up to a very large >> number of elements, bounded only by machine memory." ... "PyJudy arrays >> are similar to Python dictionaries and sets." > > Thanks for the suggestion Richie. PyJudy works brilliantly :-) > > Cheers, > Chris It seems that there is no Microsoft Windows version of Judy available, so PyJudy won't work on Windows - am I right? Claudio -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris Foote wrote:
> Claudio Grondi wrote:
>> Chris Foote wrote:
>>> Klaas wrote:
>>>>> 22.2s  20m25s[3]
>>>>
>>>> 20m to insert 1m keys? You are doing something wrong.
>>>
>>> I've put together some simplified test code, but the bsddb
>>> module gives 11m for 1M keys:
>>>
>> I have run your code for the bsddb on my P4 2.8 GHz and have got:
>> Number generator test for 100 number ranges
>> with a maximum of 3 wildcard digits.
>> Wed May 17 16:34:06 2006 dictionary population started
>> Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
>> Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
>> Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped,
>> duration 104.3s
>
>> Surprising here, that the dictionary population gives the same time,
>> but the BerkeleyDB inserts the records 6 times faster on my computer
>> than on yours. I am running Python 2.4.2 on Windows XP SP2, and you?
>
> Fedora core 5 with ext3 filesystem.  The difference will be due to
> the way that Windows buffers writes for the filesystem you're using
> (it sounds like you're using a FAT-based file system).

Ok, according to the Windows task manager the Python process
reads/writes around 7 GByte(!) of data to the file system during the
run of the BerkeleyDB test, and the hard drive is continuously busy,
while the size of the file I found in the Temp directory stays below
20 MByte.  The hard drive access is probably the main reason for
losing time - so here is a question for the BerkeleyDB experts:

Can BerkeleyDB, via the Python bsddb3 interface, be tuned to use only
RAM, or, given that BerkeleyDB is meant to scale to larger amounts of
data, does it not make much sense to tweak it into RAM?

Chris, would a RAM disk perhaps be the right way to go here, to save
the time lost accessing the file stored in the file system on the hard
drive?  The RAM requirements, according to the Windows XP task
manager, are below 100 MByte.

I am using the NTFS file system (yes, I know that FAT is in some
configurations faster than NTFS) and XP Professional SP2 without any
tuning of file system caching.  The CPU is 100% busy.

What CPU and RAM (SIMM, DDR, DDR2) do you have?  I have 2 GByte of
fast DDR PC400/3200 dual line RAM.  It seems that you are still not
getting results within the range others experience running your code,
so I suppose it has something to do with the hardware you are using.

>>> Number generator test for 100 number ranges
>>> with a maximum of 3 wildcard digits.
>>> Wed May 17 22:18:17 2006 dictionary population started
>>> Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
>>> Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
>>> Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped,
>>> duration 665.6s
>>> Wed May 17 22:29:33 2006 StorageSQLite population started
>>> Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration
>>> 65.5s
>> As I don't have SQLite installed, it is interesting to see if the
>> factor 10 in the speed difference between BerkeleyDB and SQLite can be
>> confirmed by someone else.
>> Why is SQLite faster here? I suppose, that SQLite first adds all the
>> records and builds the index afterwards with all the records there
>> (with db.commit()).
>
> SQLite is way faster because BerkeleyDB always uses a disk file,
> and SQLite is in RAM only.

One of the reasons I keep an eye on BerkeleyDB is that it claims to
scale to huge amounts (terabytes) of data, doesn't need as much RAM as
a Python dictionary, and doesn't require saving/loading a pickled
version of the data (i.e. here the dictionary) to/from RAM in order to
work with it.  I guess that in your case BerkeleyDB is, for the
reasons named, probably the right way to go, unless your data stays
small and the Python dictionary holding it will always fit into RAM.

Now I am curious to know which path you have decided to go and why?

Claudio

> Cheers,
> Chris
-- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Claudio Grondi wrote: > Chris Foote wrote: >> Klaas wrote: >> 22.2s 20m25s[3] >>> >>> 20m to insert 1m keys? You are doing something wrong. >> >> I've put together some simplified test code, but the bsddb >> module gives 11m for 1M keys: >> > I have run your code for the bsddb on my P4 2.8 GHz and have got: > Number generator test for 100 number ranges > with a maximum of 3 wildcard digits. > Wed May 17 16:34:06 2006 dictionary population started > Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s > Wed May 17 16:34:14 2006 StorageBerkeleyDB population started > Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration > 104.3s > > Surprising here, that the dictionary population gives the same time, but > the BerkeleyDB inserts the records 6 times faster on my computer than on > yours. I am running Python 2.4.2 on Windows XP SP2, and you? Fedora core 5 with ext3 filesystem. The difference will be due to the way that Windows buffers writes for the filesystem you're using (it sounds like you're using a FAT-based file system). >> Number generator test for 100 number ranges >> with a maximum of 3 wildcard digits. >> Wed May 17 22:18:17 2006 dictionary population started >> Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s >> Wed May 17 22:18:27 2006 StorageBerkeleyDB population started >> Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, >> duration 665.6s >> Wed May 17 22:29:33 2006 StorageSQLite population started >> Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s > As I don't have SQLite installed, it is interesting to see if the factor > 10 in the speed difference between BerkeleyDB and SQLite can be > confirmed by someone else. > Why is SQLite faster here? I suppose, that SQLite first adds all the > records and builds the index afterwards with all the records there (with > db.commit()). 
SQLite is way faster because BerkeleyDB always uses a disk file, and SQLite is in RAM only. Cheers, Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Richie Hindle wrote: > [Chris] >> Has anyone written a fast hash module which is more optimal for >> large datasets ? > > PyJudy might be what you're looking for, though I've never used it: > > http://www.dalkescientific.com/Python/PyJudy.html > > "Judy's key benefits are scalability, high performance, and memory > efficiency. A Judy array is extensible and can scale up to a very large > number of elements, bounded only by machine memory." ... "PyJudy arrays > are similar to Python dictionaries and sets." Thanks for the suggestion Richie. PyJudy works brilliantly :-) Cheers, Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris Foote wrote:
> Klaas wrote:
>
>>> 22.2s  20m25s[3]
>>
>> 20m to insert 1m keys? You are doing something wrong.
>
> Hi Mike.
>
> I've put together some simplified test code, but the bsddb
> module gives 11m for 1M keys:
>
I have run your code for the bsddb on my P4 2.8 GHz and have got:

Number generator test for 100 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 104.3s

Surprising here, that the dictionary population gives the same time, but
the BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2, and you?

> Number generator test for 100 number ranges
> with a maximum of 3 wildcard digits.
> Wed May 17 22:18:17 2006 dictionary population started
> Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
> Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
> Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
> Wed May 17 22:29:33 2006 StorageSQLite population started
> Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

As I don't have SQLite installed, it is interesting to see if the factor
10 in the speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.

Why is SQLite faster here? I suppose that SQLite first adds all the
records and builds the index afterwards with all the records there (with
db.commit()). Can the same be done in BerkeleyDB, or does BerkeleyDB not
support inserting records without updating the index on each single
insert command? If yes, how?

Claudio

> test code is attached.
[rest of quoted message, including the attached test script, snipped -
the full listing is in Chris's original post]
-- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Klaas wrote:

>> 22.2s  20m25s[3]
>
> 20m to insert 1m keys? You are doing something wrong.

Hi Mike.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

Number generator test for 100 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

test code is attached.

> With bdb's it is crucial to insert keys in bytestring-sorted order.

For the bsddb test, I'm using a plain string.  (The module docs list a
string being the only datatype supported for both keys & values).

> Also, be sure to give it a decent amount of cache.

The bsddb.hashopen() factory seems to have a bug in this regard; if you
supply a cachesize argument, then it barfs:

...
  File "bsddb-test.py", line 67, in runtest
    db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
    if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument --
DB->set_cachesize: method not permitted when environment specified')

I'll file a bug report on this if it isn't already fixed.

Cheers,
Chris


import bsddb
import random
import time
import sqlite

import psyco
psyco.full()

class WallClockTimer(object):
    '''Simple timer for measuring tests.'''
    def __init__(self, msg):
        self.start = time.time()
        self.msg = msg
        print time.ctime(self.start), self.msg, 'started'

    def stop(self):
        end = time.time()
        duration = end - self.start
        print time.ctime(end), self.msg, 'stopped, duration %.1fs' % duration

class NumberGen(object):
    '''Phone number generator for testing different methods of storage.'''
    def __init__(self, range_start, range_end, n, max_wildcards):

        print '''Number generator test for %d number ranges
with a maximum of %d wildcard digits.''' % (n, max_wildcards)

        wildcards = range(0, max_wildcards + 1)
        # generate n unique numbers and store as keys in number_hash
        timer = WallClockTimer('dictionary population')
        self.number_hash = {}

        for i in xrange(0, n):
            unique = False
            while not unique:
                wildcard_digits = self.gen_wildcard_digits(wildcards)
                num = self.gen_number(range_start, range_end)
                if wildcard_digits > 0:
                    num /= (10 ** wildcard_digits)
                key = (num, wildcard_digits)
                if self.number_hash.has_key(key):
                    unique = False
                else:
                    unique = True
            self.number_hash[key] = None

        timer.stop()

    def gen_wildcard_digits(self, wildcards):
        return random.choice(wildcards)

    def gen_number(self, start, end):
        return int(random.uniform(start, end))

    def storage_test(self, StorageTestClass):
        test = StorageTestClass(self.number_hash)

class StorageTest(object):
    '''base class for testing storage.  Derive a test
    class and provide your own runtest() method.'''
    def __init__(self, number_hash):
        timer = WallClockTimer('%s population' % type(self).__name__)
        self.runtest(number_hash)
        timer.stop()

class StorageBerkeleyDB(StorageTest):
    def runtest(self, number_hash):
        db = bsddb.hashopen(None, flag='c', cachesize=8192)
        for (num, wildcard_digits) in number_hash.keys():
            key = '%d:%d' % (num, wildcard_digits)
            db[key] = None
        db.close()

class StorageSQLite(StorageTest):
    def runtest(self, number_hash):
        db = sqlite.connect(':memory:')
        cursor = db.cursor()
        cursor.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')
        q = """insert into numbers (num, wd) values (%d, %d)"""
        for (num, wildcard_digits) in number_hash.keys():
            cursor.execute(q, num, wildcard_digits)
        db.commit()
        db.close()

if __name__ == '__main__':
    test_vals = {'range_start'   : 61880 * 10 ** 7,
                 'range_end'     : 61891 * 10 ** 7,
                 'n'             : 1 * 10 ** 4,
                 'max_wildcards' : 3}
    ngen = NumberGen(test_vals['range_start'], test_vals['range_end'],
                     test_vals['n'], test_vals['max_wildcards'])
    ngen.storage_test(StorageBerkeleyDB)
    ngen.storage_test(StorageSQLite)
-- http://mail.python.org/mailman/listinfo/python-list
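(Aside: the StorageSQLite class above uses the old pysqlite API with '%d' placeholders. On a modern Python 3 install the same population step can be written against the stdlib sqlite3 module with '?' placeholders and executemany; the helper name and demo data below are illustrative, not from the thread.)

```python
import sqlite3

def sqlite_populate(number_hash):
    """Insert the (num, wildcard_digits) keys into an in-memory SQLite
    table and return the resulting row count."""
    db = sqlite3.connect(':memory:')
    cursor = db.cursor()
    cursor.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')
    # executemany with '?' placeholders replaces the per-row execute loop
    cursor.executemany('INSERT INTO numbers (num, wd) VALUES (?, ?)',
                       number_hash.keys())
    db.commit()
    count = cursor.execute('SELECT COUNT(*) FROM numbers').fetchone()[0]
    db.close()
    return count

demo = {(618801234567, 0): None, (61880123456, 1): None,
        (6188012345, 2): None}
n_rows = sqlite_populate(demo)
print(n_rows)
```

Besides being the supported API, executemany batches the inserts inside one transaction, which is a large part of why bulk loads into SQLite are fast.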
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
bob> If you have the same number of entries as buckets, and you have a bob> good hash function, then if you have n buckets your longest chain bob> should have length around ln(n). The average length of a nonempty bob> bucket would be somewhere around 1 1/2. Yes, and it achieves that nice short chain length by consuming gobs of memory. A dictionary with 10**7 keys is going to chew up lots of memory. There's nothing particularly magical about dictionaries in this respect. They are good examples of a classic time-space tradeoff. Skip -- http://mail.python.org/mailman/listinfo/python-list
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
Roy> If you're getting long hash chains, you're either using a bad hash
Roy> function, or your table isn't big enough.

Sure.  The original poster said something about 10 million keys I
think.  Unless the key distribution is amazingly fortuitous and the
number of unique values is small, the dictionary is going to have a
huge memory footprint.  On my Mac laptop this code:

>>> d = {}
>>> for n in xrange(10**7):
...   d[n] = hex(n)
...

yields a process with 647MB of VM.  If I trim that to 1000 unique
values:

>>> d = {}
>>> for n in xrange(10**7):
...   d[n] = hex(n % 1000)
...

I still wind up with 647MB VM.  The dictionary and its keys are what
consume all the memory, and those keys are as simple as you can get.

Skip
-- http://mail.python.org/mailman/listinfo/python-list
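On a modern interpreter the same effect can be measured in-process with tracemalloc (a sketch, scaled down to 10**5 keys so it runs quickly; absolute numbers vary by Python version and platform):

```python
import tracemalloc

tracemalloc.start()
# even with only 1000 distinct value strings, the int keys and the
# dict's own table dominate the footprint
d = {n: hex(n % 1000) for n in range(10**5)}
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print('%.1f MB total, ~%.0f bytes per entry' % (current / 2**20,
                                                current / len(d)))
```

Scaling the per-entry cost up to 10**7 keys lands in the same hundreds-of-megabytes territory Skip observed.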
Re: Large Dictionaries
>22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. With bdb's it is crucial to insert keys in bytestring-sorted order. Also, be sure to give it a decent amount of cache. -Mike -- http://mail.python.org/mailman/listinfo/python-list
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
If you have the same number of entries as buckets, and you have a good
hash function, then if you have n buckets your longest chain should have
length around ln(n).  The average length of a nonempty bucket would be
somewhere around 1 1/2.
-- 
http://mail.python.org/mailman/listinfo/python-list
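These ballpark figures are easy to sanity-check with a toy simulation: throw n keys into n buckets with a uniform random "hash" and measure the chains (this simulates separate chaining; CPython's open addressing behaves differently in detail):

```python
import random
from collections import Counter

def chain_stats(n, seed=12345):
    """Throw n keys into n buckets uniformly at random; return the
    longest chain and the average length of a nonempty bucket."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n) for _ in range(n))
    longest = max(counts.values())
    avg_nonempty = n / len(counts)  # theory: about 1/(1 - 1/e) ~ 1.58
    return longest, avg_nonempty
```

The average nonempty bucket does come out near 1 1/2; the longest chain grows very slowly with n, staying in the single digits even for large tables.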
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
In article <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]> wrote:
>
>Graham> Looking up a key in a dictionary is done in constant-time,
>Graham> i.e. it doesn't matter how large the dictionary is.
>
>Doesn't that depend on how many keys hash to the same value?  For small
>dictionaries keeping the max keys that hash to the same value small
>isn't a huge problem.  For large dictionaries (millions of keys) might
>you have some long chains?  Or in an effort to reduce the chain length,
>wind up using so much virtual memory that you wind up wrecking
>performance by swapping?

If you're getting long hash chains, you're either using a bad hash
function, or your table isn't big enough.
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
On 5/16/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Graham> Looking up a key in a dictionary is done in constant-time,
> Graham> i.e. it doesn't matter how large the dictionary is.
>
> Doesn't that depend on how many keys hash to the same value?  For
> small dictionaries keeping the max keys that hash to the same value
> small isn't a huge problem.

Hi Skip.  On reflection, I guess I ought to have asked the OP what he
meant by "large". :-)  Probably asked for details on his use-case as
well.

Sure, collisions are a reality.  The introductory comments in
dictobject.c in the Python source [1] give good coverage of how these
issues are handled in CPython.  And, they make for great bedtime
reading! :-)

Regardless, I can't imagine that using multiple dictionaries would
provide a sufficient speed-up for the vast majority of large dictionary
use-cases.  There are still swapping issues, plus the user-code overhead
of managing the multiple dicts, plus the multiple lookups.  If the only
improvement is in collision reduction (which is questionable in the
general case, since an unfortunate set of keys could result in numerous
collisions even in a smaller dictionary), then is the extra overhead
really worth it?

Still, it's just my opinion.  If someone wants to post code and prove me
wrong, I'll eat my hat and take my lickings. ;-)

Best,
Graham

[1] http://svn.python.org/view/python/trunk/Objects/dictobject.c
-- 
http://mail.python.org/mailman/listinfo/python-list
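To make the "multiple lookups" overhead concrete, here is a toy sharded-dict wrapper (my own sketch, not code from this thread): every operation pays an extra step to pick the sub-dict before the real lookup happens.

```python
class ShardedDict:
    """Toy 'multiple dictionaries' wrapper, for illustration only:
    every access does two lookups where a plain dict does one."""

    def __init__(self, nshards=16):
        self._shards = [{} for _ in range(nshards)]
        self._n = nshards

    def _shard(self, key):
        return self._shards[hash(key) % self._n]  # step 1: which dict?

    def __setitem__(self, key, value):
        self._shard(key)[key] = value             # step 2: the real lookup

    def __getitem__(self, key):
        return self._shard(key)[key]

    def __contains__(self, key):
        return key in self._shard(key)
```

Since the shard selection here reuses hash(key), any pathological key set that collides in one big dict collides just as badly inside a shard, which is Graham's point about collision reduction being questionable.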
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
Graham> Looking up a key in a dictionary is done in constant-time,
Graham> i.e. it doesn't matter how large the dictionary is.

Doesn't that depend on how many keys hash to the same value?  For small
dictionaries keeping the max keys that hash to the same value small
isn't a huge problem.  For large dictionaries (millions of keys) might
you have some long chains?  Or in an effort to reduce the chain length,
wind up using so much virtual memory that you wind up wrecking
performance by swapping?

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
Casey Hawthorne wrote:
> For Large Dictionaries Could One Use Separate Dictionaries Where Each
> Dictionary Covers an Interval of the Input Range?

One Could, But Why? :-)

You wouldn't see any performance improvements.  Looking up a key in a
dictionary is done in constant-time, i.e. it doesn't matter how large
the dictionary is.

Graham
-- 
http://mail.python.org/mailman/listinfo/python-list
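The constant-time claim is easy to check empirically with timeit (a rough sketch; absolute numbers vary by machine, but the two averages come out in the same ballpark despite the 1000x difference in dictionary size):

```python
import timeit

def lookup_time(n, trials=10 ** 5):
    """Average wall-clock time of one membership probe in an n-key dict."""
    setup = 'd = dict.fromkeys(range(%d)); k = %d' % (n, n // 2)
    return timeit.timeit('k in d', setup=setup, number=trials)

small = lookup_time(10 ** 3)
large = lookup_time(10 ** 6)
print(small, large)
```

What this does not show, of course, is the memory side of the story: the 10**6-key dict is constant-time to probe but enormous to hold, which is the tradeoff Skip raises elsewhere in the thread.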
For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
For Large Dictionaries Could One Use Separate Dictionaries Where Each
Dictionary Covers an Interval of the Input Range?

You would need to know something of the input range and its uniformity!

Self-balancing between dictionaries for separate dictionaries?
--
Regards, Casey
-- 
http://mail.python.org/mailman/listinfo/python-list
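For what it's worth, the interval idea is easy to prototype (a sketch with hand-picked boundaries; the "self-balancing" part, re-splitting a sub-dict when it grows too large, is left out, and the class name is mine):

```python
import bisect

class IntervalDict:
    """Separate dicts, each covering one interval of a numeric key range.
    Boundaries must be chosen up front from knowledge of the input."""

    def __init__(self, boundaries):
        self._bounds = sorted(boundaries)  # e.g. [100, 200] -> 3 intervals
        self._dicts = [{} for _ in range(len(self._bounds) + 1)]

    def _pick(self, key):
        # binary search for the interval that covers this numeric key
        return self._dicts[bisect.bisect_right(self._bounds, key)]

    def __setitem__(self, key, value):
        self._pick(key)[key] = value

    def __getitem__(self, key):
        return self._pick(key)[key]

    def __len__(self):
        return sum(len(d) for d in self._dicts)
```

As Casey notes, this only balances well if the keys are roughly uniform over the range; skewed input piles everything into one sub-dict unless the boundaries adapt.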