Re: Large Dictionaries
Thomas Ganss wrote: Klaas schrieb: 4. Insert your keys in sorted order. This advice is questionable - it depends at least on the db vendor, and probably sometimes on the sort method, whether inserting pre-sorted values is better. The article I wrote that you quoted named a specific vendor (berkeley db) for which my advice is unquestionable and well-known. -Mike -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Tim Peters wrote: shellsort is much more a refinement of insertion-sort than of bubblesort, but is not an O(N log N) algorithm regardlesss. With a judiciously chosen increment sequence, the number of comparisons made by shellsort *is* of the order N log N for practical values of N. See Figure 8 in http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf Marcin -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Marcin Ciura wrote: Tim Peters wrote: shellsort is much more a refinement of insertion-sort than of bubblesort, but is not an O(N log N) algorithm regardlesss. With a judiciously chosen increment sequence, the number of comparisons made by shellsort *is* of the order N log N for practical values of N. See Figure 8 in http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf That isn't what the reference says. It only covers N up to a few thousand. Practical values of N need to at least go up into the millions. Read the summary: Using sequential analysis, the search for optimal increment sequences for Shellsort was accelerated enough to find them for arrays up to several thousand elements. ... However, the sequences obtained so far are too short to draw a reliable conclusion whether an appropriate sequence of increments can make Shellsort a O(N logN) sorting method on the average. Some hypotheses may be possible when the sequences are prolonged to sort arrays of about 10^5 elements. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Duncan Booth wrote: Marcin Ciura wrote: See Figure 8 in http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf That isn't what the reference says. It only covers N up to a few thousand. Practical values of N need to at least go up into the millions. Please look at the performance graph of Tokuda's increment sequence. You can see that it scales pretty well at least up to 100 million. Marcin -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Marcin Ciura wrote: Duncan Booth wrote: Marcin Ciura wrote: See Figure 8 in http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf That isn't what the reference says. It only covers N up to a few thousand. Practical values of N need to at least go up into the millions. Please look at the performance graph of Tokuda's increment sequence. You can see that it scales pretty well at least up to 100 million. Ah sorry, I misread it. It was sequences with several thousand elements, which corresponds to N of 100 million. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article [EMAIL PROTECTED], Tim Peters [EMAIL PROTECTED] wrote: For example, the O(N log N) heapsort is unreasonable compared to the O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad case for heapsort, but a good case for bubblesort)? O(N log N) sorting algorithms helped by pre-existing order are uncommon, unless they do extra work to detect and exploit pre-existing order. Shellsort works well with nearly-sorted data. It's basically a smarter version of bubblesort with much improved efficiency. It's also very compact to implement. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article [EMAIL PROTECTED], Iain King [EMAIL PROTECTED] wrote: Lawrence D'Oliveiro wrote: In article [EMAIL PROTECTED], Scott David Daniels [EMAIL PROTECTED] wrote: For example, time timsort (Python's internal sort) on pre-sorted data; you'll find it is handled faster than random data. But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data is already sorted? An already sorted list can be pathological for Quicksort, depending on how you code it. I did say reasonable. :) -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
[Tim Peters] ... O(N log N) sorting algorithms helped by pre-existing order are uncommon, unless they do extra work to detect and exploit pre-existing order. [Lawrence D'Oliveiro] Shellsort works well with nearly-sorted data. It's basically a smarter version of bubblesort with much improved efficiency. It's also very compact to implement. shellsort is much more a refinement of insertion-sort than of bubblesort, but is not an O(N log N) algorithm regardless. A nice, succinct summary of what was known about shellsort's behavior as of Sedgewick's 1996 "Analysis of Shellsort and Related Algorithms" can be found here: http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/shell/shellen.htm -- http://mail.python.org/mailman/listinfo/python-list
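For illustration, here is a minimal shellsort along the lines discussed above, written in Python 2 syntax to match the rest of the thread. The gap values are Ciura's published sequence; the function name and driver code are just a sketch, not something taken from these posts.

    # Gapped insertion sort over a shrinking gap sequence -- which is why
    # shellsort is described above as a refinement of insertion sort.
    CIURA_GAPS = [701, 301, 132, 57, 23, 10, 4, 1]

    def shellsort(seq, gaps=CIURA_GAPS):
        n = len(seq)
        for gap in gaps:
            if gap >= n:
                continue
            for i in range(gap, n):
                item = seq[i]
                j = i
                while j >= gap and seq[j - gap] > item:
                    seq[j] = seq[j - gap]
                    j -= gap
                seq[j] = item
        return seq

    if __name__ == '__main__':
        import random
        data = list(range(1000))
        random.shuffle(data)
        assert shellsort(data) == sorted(data)

For larger inputs the gap sequence has to be extended upward from 701 (a common rule of thumb is a factor of roughly 2.25 per step), and how such extensions behave asymptotically is exactly the open question the paper above is about.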
Re: Large Dictionaries
In article [EMAIL PROTECTED], Scott David Daniels [EMAIL PROTECTED] wrote: For example, time timsort (Python's internal sort) on pre-sorted data; you'll find it is handled faster than random data. But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data is already sorted? -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Lawrence D'Oliveiro wrote: In article [EMAIL PROTECTED], Scott David Daniels [EMAIL PROTECTED] wrote: For example, time timsort (Python's internal sort) on pre-sorted data; you'll find it is handled faster than random data. But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data is already sorted? Isn't that just your definition of reasonable? regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC/Ltd http://www.holdenweb.com Love me, love my blog http://holdenweb.blogspot.com Recent Ramblings http://del.icio.us/steve.holden -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article [EMAIL PROTECTED], Lawrence D'Oliveiro [EMAIL PROTECTED] wrote: In article [EMAIL PROTECTED], Scott David Daniels [EMAIL PROTECTED] wrote: For example, time timsort (Python's internal sort) on pre-sorted data; you'll find it is handled faster than random data. But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data is already sorted? Read some of the old discussions in the python-dev archives. -- Aahz ([EMAIL PROTECTED]) * http://www.pythoncraft.com/ I saw `cout' being shifted Hello world times to the left and stopped right there. --Steve Gonedes -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Lawrence D'Oliveiro wrote: In article [EMAIL PROTECTED], Scott David Daniels [EMAIL PROTECTED] wrote: For example, time timsort (Python's internal sort) on pre-sorted data; you'll find it is handled faster than random data. But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data is already sorted? An already sorted list can be pathological for Quicksort, depending on how you code it. Iain -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
[Scott David Daniels] For example, time timsort (Python's internal sort) on pre-sorted data; you'll find it is handled faster than random data. O(N) vs O(N log N), in fact. [Lawrence D'Oliveiro] But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data is already sorted? For example, the O(N log N) heapsort is unreasonable compared to the O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad case for heapsort, but a good case for bubblesort)? O(N log N) sorting algorithms helped by pre-existing order are uncommon, unless they do extra work to detect and exploit pre-existing order. Python's current sorting algorithm exploits pre-existing order of many kinds. For example, take a sorted list, cut it in half, and riffle shuffle the two halves together; e.g., [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15] That turns out to be a supernaturally good case too. -- http://mail.python.org/mailman/listinfo/python-list
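A rough timing sketch of the point above, in Python 2 syntax to match the thread; the sizes and labels are arbitrary choices, and the absolute numbers will of course vary by machine:

    import random, time

    def timed_sort(data, label):
        work = list(data)
        start = time.time()
        work.sort()                      # timsort
        print '%-16s %.3fs' % (label, time.time() - start)

    n = 1000000
    sorted_data = range(n)
    random_data = list(sorted_data)
    random.shuffle(random_data)
    # Riffle shuffle: interleave the two halves of a sorted list,
    # giving the [0, 8, 1, 9, 2, 10, ...] pattern described above.
    half = n // 2
    riffled = [x for pair in zip(sorted_data[:half], sorted_data[half:])
                 for x in pair]

    timed_sort(random_data, 'random')
    timed_sort(sorted_data, 'pre-sorted')
    timed_sort(riffled, 'riffle-shuffled')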
Re: Large Dictionaries
In article [EMAIL PROTECTED], Tim Peters [EMAIL PROTECTED] wrote: [Scott David Daniels] For example, time timsort (Python's internal sort) on pre-sorted data; you'll find it is handled faster than random data. O(N) vs O(N log N), in fact. [Lawrence D'Oliveiro] But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data is already sorted? For example, the O(N log N) heapsort is unreasonable compared to the O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad case for heapsort, but a good case for bubblesort)? O(N log N) sorting algorithms helped by pre-existing order are uncommon, unless they do extra work to detect and exploit pre-existing order. Actually, presorted lists are not a bad case for heapsort - it's quite immune to any existing order or lack thereof, whereas some other sorts, quicksort being a prime example, require great care to avoid pathological cases. -- Jim Segrave ([EMAIL PROTECTED]) -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
[Jim Segrave] Actually, presorted lists are not a bad case for heapsort - it's quite immune to any existing order or lack thereof, Write a heapsort and time it. It's not a difference in O() behavior, but more memory movement is required for a sorted list because transforming the list into a max-heap at the start requires shuffling the largest elements from the end of the list to its start. Initially reverse-sorted is a good case for heapsort in the same sense (because the list is already a max-heap then; initially sorted is as far from being a max-heap as is possible). good and bad are both O(N log N) here, though. BTW, most people have never heard of Smoothsort, which was Dijkstra's excruciating attempt to make heapsort exploit pre-existing order: http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796.PDF That's one of a hundred algorithms I rejected for Python ;-) whereas some other sorts, quicksort being a prime example, require great care to avoid pathological cases. Indeed, Python gave up on quicksort for that reason. While the samplesort hybrid that came between quicksort and the current sort also had theoretical O(N**2) worst-case behavior, it was exceedingly difficult to create such an input, and (unlike as for every quicksort variant Python tried) no real-life case of undue sloth was ever reported against it. -- http://mail.python.org/mailman/listinfo/python-list
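The max-heap observation is easy to check directly. A small sketch (Python 2 syntax, illustrative only): a reverse-sorted list already satisfies the max-heap property at every node, while a sorted list violates it at essentially every node, which is why building the heap has to move the largest elements from the end of the list to its start.

    def max_heap_violations(seq):
        # Count parent/child pairs that break the max-heap property
        # (parent must be >= both children in the implicit binary tree).
        n = len(seq)
        violations = 0
        for i in range(n):
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and seq[child] > seq[i]:
                    violations += 1
        return violations

    n = 1023
    print max_heap_violations(range(n - 1, -1, -1))   # reverse-sorted: 0
    print max_heap_violations(range(n))               # sorted: n - 1 violations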
Re: Large Dictionaries
In article [EMAIL PROTECTED], Tim Peters [EMAIL PROTECTED] wrote: [Jim Segrave] Actually, presorted lists are not a bad case for heapsort - it's quite immune to any existing order or lack thereof, Write a heapsort and time it. It's not a difference in O() behavior, but more memory movement is required for a sorted list because transforming the list into a max-heap at the start requires shuffling the largest elements from the end of the list to its start. Initially reverse-sorted is a good case for heapsort in the same sense (because the list is already a max-heap then; initially sorted is as far from being a max-heap as is possible). "Good" and "bad" are both O(N log N) here, though.

I did implement a crude heapsort before posting this and measured the number of comparisons and the number of swaps.

==
#!/usr/local/bin/python

import random

class HeapSorter(object):

    def __init__(self, seq):
        self.compares = 0
        self.swaps = 0
        self.length = len(seq)
        self.seq = seq

    def __sift(self, i):
        while True:
            j = 2 * i + 1
            if j + 1 < self.length:
                self.compares += 1
                if self.seq[j + 1] > self.seq[j]:
                    j = j + 1
            if j >= self.length:
                return
            self.compares += 1
            if self.seq[i] > self.seq[j]:
                return
            self.swaps += 1
            self.seq[i], self.seq[j] = (self.seq[j], self.seq[i])
            i = j

    def __makeheap(self):
        for i in range(self.length / 2, -1, -1):
            self.__sift(i)

    def sort(self):
        self.__makeheap()
        for i in range(self.length - 1, -1, -1):
            self.swaps += 1
            self.seq[i], self.seq[0] = (self.seq[0], self.seq[i])
            self.length -= 1
            self.__sift(0)

if __name__ == '__main__':

    s = list(range(100000))
    random.shuffle(s)

    heap = HeapSorter(s)
    heap.sort()
    print "Random list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)

    heap = HeapSorter(s)
    heap.sort()
    print "Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)

    s.reverse()
    heap = HeapSorter(s)
    heap.sort()
    print "Reverse Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)
==

Which gave the following results:

/usr/home/jes% python ./heapsort
Random list: compares 3019969, swaps 1575263
Sorted list: compares 3112517, swaps 1650855
Reverse Sorted list: compares 2926640, swaps 1497435

Assuming compares and swaps dominate other bookkeeping, then heapsort is quite stable in its performance, although the constant in the O(N log N) is much larger than for other sorts. -- Jim Segrave ([EMAIL PROTECTED]) -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Roy Smith wrote: My guess would be that each resize grows the table by a constant factor, which IIRC, works out to amortized O(n). Right. For dictionaries below 50000 entries, the size is multiplied by 4 each time, and doubled each time for large dictionaries. It's not as good as creating the dict the right size in the first place, but it's really not that bad. Somewhat more troubling is that it can lead to memory fragmentation problems. Depends on your operating system, of course. You get over a page size fairly quickly, and then many systems use anonymous mappings, where a range of pages gets mapped from swap on allocation, and unmapped on deallocation. So while this can cause address space fragmentation, it doesn't cause memory fragmentation (i.e. memory that is allocated by the process but can't be used). I don't understand why Python doesn't have a way to give a size hint when creating a dict. It seems so obvious to be able to say d = dict(size=10*1000*1000) if you know beforehand that's how many keys you're going to add. There is no need for it. If dicts don't work well in some cases, these cases should be investigated and fixed, instead of working around the problem by loading the problem onto the developer. As you said, it *should* have linear time, with the number of insertions. If it doesn't (and it appears not to), that may be due to hash collisions, or due to the time for computing hashes not being constant. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
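A small sketch of the "it's really not that bad" point, in Python 2 syntax to match the thread; the key count is an arbitrary choice. It compares inserting keys one at a time, with all the intermediate resizes, against building the dict in one shot with fromkeys():

    import time

    n = 5000000
    keys = range(n)

    start = time.time()
    d1 = {}
    for k in keys:
        d1[k] = None              # dict grows and resizes as it goes
    print 'one key at a time: %.1fs' % (time.time() - start)

    start = time.time()
    d2 = dict.fromkeys(keys)      # single bulk construction
    print 'dict.fromkeys     : %.1fs' % (time.time() - start)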
Re: Large Dictionaries
Klaas schrieb: 4. Insert your keys in sorted order. This advice is questionable - it depends at least on the db vendor, and probably sometimes on the sort method, whether inserting pre-sorted values is better. My gut feeling on this matter is: IF the insert times of pre-sorted values are far better than the times of unsorted values, there is a chance that the resulting tree is unbalanced: only 1 compare operation after insert and no re-balancing of the tree. Re-indexing will probably give you far better access times in this case. Another option is to drop non-RI indexes used only for query optimization and recreate them after the batch insert. my 0.02 EUR thomas -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Thomas Ganss wrote: Klaas schrieb: 4. Insert your keys in sorted order. This advice is questionable - My gut feeling on this matter is: IF the insert times of pre-sorted values is far better than the times of unsorted values, there is a chance that the resulting tree is unbalanced: only 1 compare operation after insert and no re-balancing of the tree. re-indexing will probably give you far better access times in this case. Another option is to drop non RI indexes used only for query optimization and recreate them after batch insert. Don't use your gut for such issues. Pre-sorted data is such a common special case (in fact, the only easily describable special case) that many systems handle this specially. For example, time timsort (Python's internal sort) on pre-sorted data; you'll find it is handled faster than random data. If you are using your gut without testing, go ahead and presort. In any case, reading documents and testing beats gut feels every time. --Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris: Berkeley DB is great for accessing data by key for things already stored on disk (i.e. read access), but write performance for each key-value pair is slow due to it being careful about flushing writes to disk by default. This is absolutely false. -Mike -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris:

    class StorageBerkeleyDB(StorageTest):
        def runtest(self, number_hash):
            db = bsddb.hashopen(None, flag='c', cachesize=8192)
            for (num, wildcard_digits) in number_hash.keys():
                key = '%d:%d' % (num, wildcard_digits)
                db[key] = None
            db.close()

BDBs can accomplish what you're looking to do, but they need to be tuned carefully. I won't get into too many details here, but you have a few fatal flaws in that code.

1. 8Kb of cache is _pathetic_. Give it a few hundred megs. This is by far your biggest problem.
2. Use BTREE unless you have a good reason to use DBHASH.
3. Use proper bdb env creation instead of the hash_open apis.
4. Insert your keys in sorted order.

-Mike -- http://mail.python.org/mailman/listinfo/python-list
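A hedged sketch of what those four points might look like with the lower-level bsddb.db (pybsddb) API that ships with Python 2.4. The directory name, cache size and key format are illustrative assumptions, not values from the original posts:

    import os
    from bsddb import db

    env_dir = '/tmp/bdb-env'                  # assumed scratch directory
    if not os.path.isdir(env_dir):
        os.makedirs(env_dir)

    env = db.DBEnv()
    env.set_cachesize(0, 256 * 1024 * 1024)   # 1. a few hundred megs of cache
    env.open(env_dir, db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_PRIVATE)

    d = db.DB(env)                            # 3. explicit env, not hashopen()
    d.open('numbers.db', None, db.DB_BTREE, db.DB_CREATE)   # 2. BTREE, not DBHASH

    def store(number_hash):
        # 4. insert keys in bytestring-sorted order
        keys = sorted('%d:%d' % (num, wd) for (num, wd) in number_hash.keys())
        for key in keys:
            d.put(key, '')

    # ... after the test:
    # d.close(); env.close()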
Re: Large Dictionaries
Claudio Grondi wrote: Chris Foote wrote: Klaas wrote: 22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. I've put together some simplified test code, but the bsddb module gives 11m for 1M keys: I have run your code for the bsddb on my P4 2.8 GHz and have got: Number generator test for 100 number ranges with a maximum of 3 wildcard digits. Wed May 17 16:34:06 2006 dictionary population started Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s Wed May 17 16:34:14 2006 StorageBerkeleyDB population started Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 104.3s Surprising here, that the dictionary population gives the same time, but the BerkeleyDB inserts the records 6 times faster on my computer than on yours. I am running Python 2.4.2 on Windows XP SP2, and you? Fedora core 5 with ext3 filesystem. The difference will be due to the way that Windows buffers writes for the filesystem you're using (it sounds like you're using a FAT-based file system). Number generator test for 100 number ranges with a maximum of 3 wildcard digits. Wed May 17 22:18:17 2006 dictionary population started Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s Wed May 17 22:18:27 2006 StorageBerkeleyDB population started Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s Wed May 17 22:29:33 2006 StorageSQLite population started Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s As I don't have SQLite installed, it is interesting to see if the factor 10 in the speed difference between BerkeleyDB and SQLite can be confirmed by someone else. Why is SQLite faster here? I suppose, that SQLite first adds all the records and builds the index afterwards with all the records there (with db.commit()). SQLite is way faster because BerkeleyDB always uses a disk file, and SQLite is in RAM only. Cheers, Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris Foote wrote: Claudio Grondi wrote: Chris Foote wrote: Klaas wrote: 22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. I've put together some simplified test code, but the bsddb module gives 11m for 1M keys: I have run your code for the bsddb on my P4 2.8 GHz and have got: Number generator test for 100 number ranges with a maximum of 3 wildcard digits. Wed May 17 16:34:06 2006 dictionary population started Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s Wed May 17 16:34:14 2006 StorageBerkeleyDB population started Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 104.3s Surprising here, that the dictionary population gives the same time, but the BerkeleyDB inserts the records 6 times faster on my computer than on yours. I am running Python 2.4.2 on Windows XP SP2, and you? Fedora core 5 with ext3 filesystem. The difference will be due to the way that Windows buffers writes for the filesystem you're using (it sounds like you're using a FAT-based file system). Ok, according to the Windows task manager the Python process reads/writes to the file system during the run of BerkeleyDB test around 7 GByte(!) of data and the hard drive is continuously busy, where the size of file I found in the Temp directory is always below 20 MByte. The hard drive access is probably the main reason for loosing time - here a question to BerkeleyDB experts: Can the BerkeleyDB via Python bsddb3 interface be tuned to use only RAM or as BerkeleyDB can scale to larger data amount it makes not much sense to tweak it into RAM? Chris, is maybe a RAM-disk the right way to go here to save time lost for accessing the file stored in the file system on the hard drive? The RAM requirements, according to Windows XP task manager, are below 100 MByte. I am using the NTFS file system (yes, I know, that FAT is in some configurations faster than NTFS) and XP Professional SP2 without any tuning of file system caching. The CPU is 100% busy. What CPU and RAM (SIMM, DDR, DDR2) do you have? I have 2GByte fast DDR PC400/3200 dual line RAM. It seems, that you are still not getting results within the range others experience running your code, so I suppose, it has something to do with the hardware you are using. Number generator test for 100 number ranges with a maximum of 3 wildcard digits. Wed May 17 22:18:17 2006 dictionary population started Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s Wed May 17 22:18:27 2006 StorageBerkeleyDB population started Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s Wed May 17 22:29:33 2006 StorageSQLite population started Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s As I don't have SQLite installed, it is interesting to see if the factor 10 in the speed difference between BerkeleyDB and SQLite can be confirmed by someone else. Why is SQLite faster here? I suppose, that SQLite first adds all the records and builds the index afterwards with all the records there (with db.commit()). SQLite is way faster because BerkeleyDB always uses a disk file, and SQLite is in RAM only. One of the reasons I put an eye on BerkeleyDB is that it pretends to scale to a huge amount (Terrabyte) of data and don't need as much RAM as Python dictionary and it is not necessary to save/load pickled version of the data (i.e. here the dictionary) from/to RAM in order to work with it. 
I guess that in your case BerkeleyDB is, for the named reasons, probably the right way to go, unless your data will stay small and the Python dictionary with it will always fit into RAM. Now I am curious to know which path you have decided to go and why? Claudio Cheers, Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris Foote wrote: Richie Hindle wrote: [Chris] Has anyone written a fast hash module which is more optimal for large datasets ? PyJudy might be what you're looking for, though I've never used it: http://www.dalkescientific.com/Python/PyJudy.html Judy's key benefits are scalability, high performance, and memory efficiency. A Judy array is extensible and can scale up to a very large number of elements, bounded only by machine memory. ... PyJudy arrays are similar to Python dictionaries and sets. Thanks for the suggestion Richie. PyJudy works brilliantly :-) Cheers, Chris It seems, that there is no Microsoft Windows version of Judy available, so for this reason PyJudy won't work on Windows - am I right? Claudio -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Klaas wrote: 22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. Hi Mike. I've put together some simplified test code, but the bsddb module gives 11m for 1M keys:

Number generator test for 1000000 number ranges with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

test code is attached. With bdb's it is crucial to insert keys in bytestring-sorted order. For the bsddb test, I'm using a plain string. (The module docs list a string as being the only datatype supported for both keys and values). Also, be sure to give it a decent amount of cache. The bsddb.hashopen() factory seems to have a bug in this regard; if you supply a cachesize argument, then it barfs:

  ...
  File "bsddb-test.py", line 67, in runtest
    db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
    if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- DB->set_cachesize: method not permitted when environment specified')

I'll file a bug report on this if it isn't already fixed.

Cheers, Chris

import bsddb
import random
import time
import sqlite
import psyco
psyco.full()

class WallClockTimer(object):
    '''Simple timer for measuring tests.'''
    def __init__(self, msg):
        self.start = time.time()
        self.msg = msg
        print time.ctime(self.start), self.msg, 'started'

    def stop(self):
        end = time.time()
        duration = end - self.start
        print time.ctime(end), self.msg, 'stopped, duration %.1fs' % duration

class NumberGen(object):
    '''Phone number generator for testing different methods of storage.'''
    def __init__(self, range_start, range_end, n, max_wildcards):
        print '''Number generator test for %d number ranges
with a maximum of %d wildcard digits.''' % (n, max_wildcards)
        wildcards = range(0, max_wildcards + 1)
        # generate n unique numbers and store as keys in number_hash
        timer = WallClockTimer('dictionary population')
        self.number_hash = {}
        for i in xrange(0, n):
            unique = False
            while not unique:
                wildcard_digits = self.gen_wildcard_digits(wildcards)
                num = self.gen_number(range_start, range_end)
                if wildcard_digits > 0:
                    num /= (10 ** wildcard_digits)
                key = (num, wildcard_digits)
                if self.number_hash.has_key(key):
                    unique = False
                else:
                    unique = True
                    self.number_hash[key] = None
        timer.stop()

    def gen_wildcard_digits(self, wildcards):
        return random.choice(wildcards)

    def gen_number(self, start, end):
        return int(random.uniform(start, end))

    def storage_test(self, StorageTestClass):
        test = StorageTestClass(self.number_hash)

class StorageTest(object):
    '''base class for testing storage.  Derive a test class and provide
    your own runtest() method.'''
    def __init__(self, number_hash):
        timer = WallClockTimer('%s population' % type(self).__name__)
        self.runtest(number_hash)
        timer.stop()

class StorageBerkeleyDB(StorageTest):
    def runtest(self, number_hash):
        db = bsddb.hashopen(None, flag='c', cachesize=8192)
        for (num, wildcard_digits) in number_hash.keys():
            key = '%d:%d' % (num, wildcard_digits)
            db[key] = None
        db.close()

class StorageSQLite(StorageTest):
    def runtest(self, number_hash):
        db = sqlite.connect(':memory:')
        cursor = db.cursor()
        cursor.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')
        q = "insert into numbers (num, wd) values (%d, %d)"
        for (num, wildcard_digits) in number_hash.keys():
            cursor.execute(q, num, wildcard_digits)
        db.commit()
        db.close()

if __name__ == '__main__':
    test_vals = {'range_start'   : 61880 * 10 ** 7,
                 'range_end'     : 61891 * 10 ** 7,
                 'n'             : 1 * 10 ** 4,
                 'max_wildcards' : 3}
    ngen = NumberGen(test_vals['range_start'], test_vals['range_end'],
                     test_vals['n'], test_vals['max_wildcards'])
    ngen.storage_test(StorageBerkeleyDB)
    ngen.storage_test(StorageSQLite)

-- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris Foote wrote: Klaas wrote: 22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. Hi Mike. I've put together some simplified test code, but the bsddb module gives 11m for 1M keys: I have run your code for the bsddb on my P4 2.8 GHz and have got: Number generator test for 100 number ranges with a maximum of 3 wildcard digits. Wed May 17 16:34:06 2006 dictionary population started Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s Wed May 17 16:34:14 2006 StorageBerkeleyDB population started Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 104.3s Surprising here, that the dictionary population gives the same time, but the BerkeleyDB inserts the records 6 times faster on my computer than on yours. I am running Python 2.4.2 on Windows XP SP2, and you? Number generator test for 100 number ranges with a maximum of 3 wildcard digits. Wed May 17 22:18:17 2006 dictionary population started Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s Wed May 17 22:18:27 2006 StorageBerkeleyDB population started Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s Wed May 17 22:29:33 2006 StorageSQLite population started Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s As I don't have SQLite installed, it is interesting to see if the factor 10 in the speed difference between BerkeleyDB and SQLite can be confirmed by someone else. Why is SQLite faster here? I suppose, that SQLite first adds all the records and builds the index afterwards with all the records there (with db.commit()). Can the same be done in BerkeleyDB, or does BerkeleyDB not support inserting of records without building an index each single insert command? If yes, how? Claudio test code is attached. With bdb's it is crucial to insert keys in bytestring-sorted order. For the bsddb test, I'm using a plain string. (The module docs list a string being the only datatype supported for both keys values). Also, be sure to give it a decent amount of cache. The bsddb.hashopen() factory seems to have a bug in this regard; if you supply a cachesize argument, then it barfs: File bsddb-test.py, line 67, in runtest db = bsddb.hashopen(None, flag='c', cachesize=8192) File /usr/lib/python2.4/bsddb/__init__.py, line 288, in hashopen if cachesize is not None: d.set_cachesize(0, cachesize) bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- DB-set_cachesize: method not permitted when environment specified') I'll file a bug report on this if it isn't already fixed. 
Cheers, Chris
Re: Large Dictionaries
Richie Hindle wrote: [Chris] Has anyone written a fast hash module which is more optimal for large datasets ? PyJudy might be what you're looking for, though I've never used it: http://www.dalkescientific.com/Python/PyJudy.html Judy's key benefits are scalability, high performance, and memory efficiency. A Judy array is extensible and can scale up to a very large number of elements, bounded only by machine memory. ... PyJudy arrays are similar to Python dictionaries and sets. Thanks for the suggestion Richie. PyJudy works brilliantly :-) Cheers, Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
On Monday 15 May 2006 21:07, Diez B. Roggisch wrote: d={}.fromkeys(xrange(5*10**6)) ? That is a totally different beast. You don't want to insert arbitrary keys, you want the internal hash-array to be of the right size. But you can replace the xrange part by any generator function you want. def get_mykeys(..) ... yield key I just wonder if the time-consuming part is the memory allocation of the hash table (key by key) or the hash operation itself. I don't see a use case where a python programmer should need a dictionary that will probably be big but can't predict what keys will be in it. -- _ Maric Michaud _ Aristote - www.aristote.info 3 place des tapis 69004 Lyon Tel: +33 426 880 097 -- http://mail.python.org/mailman/listinfo/python-list
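Maric's question (allocation versus hashing) can be split apart with a quick measurement. A sketch in Python 2 syntax, using the same kind of (long, int) keys discussed in this thread; the sizes are arbitrary:

    import time

    scale = 2 ** 32
    keys = [(i * scale, i) for i in range(5000000)]

    start = time.time()
    for k in keys:
        hash(k)                      # hashing cost only
    print 'hashing the keys : %.1fs' % (time.time() - start)

    start = time.time()
    d = dict.fromkeys(keys)          # hashing plus table allocation/resizing
    print 'building the dict: %.1fs' % (time.time() - start)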
Re: Large Dictionaries
Chris Foote wrote: Don't forget that adding keys requires resizing the dict, which is a moderately expensive operation. Yep, that's why I probably need a dictionary where I can pre-specify an approximate size at the time of its creation. Try splitting the creation of the keys from the creation of the dictionary and you'll see where the bottleneck really is. On my system:

    r = range(5000000)
    d = dict.fromkeys(r)

takes less than 1 second to create the dictionary. So creating a dictionary with 5 million elements is not in itself a terribly expensive operation (and yes, even with fromkeys there is no attempt to work out in advance what size the dictionary should be). Trying to simulate your data:

    scale = 2**32
    data = [ (i*scale,i) for i in range(5000000) ]
    d = dict.fromkeys(data)

the assignment to data took 42 seconds. Creating the dictionary took 6 seconds. Trying the variation someone else suggested:

    scale = 2**32
    data = [ i*scale+i for i in range(5000000) ]
    d = dict.fromkeys(data)

creating data took under 6 seconds and about 1 second for d. So it looks like the issue is not creating a large dictionary, nor lots of integers, but creating lots of tuples is expensive. Oh, and if anyone tells you that the intermediate list I created in the tests above is going to make it slow and you should use iterators instead:

    r = range(5000000)
    scale = 2**32
    s = [ i*scale for i in r ]
    from itertools import izip
    d = dict.fromkeys(izip(s,r))

The first few lines took a couple of seconds, the last one 1 minute 50 seconds. The alternative version:

    scale = 2**32
    r = range(5000000)
    d = dict.fromkeys((i*scale,i) for i in r)

takes a similar time. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Claudio Grondi wrote: Chris Foote wrote: p.s. Disk-based DBs are out of the question because most key lookups will result in a miss, and lookup time is critical for this application. Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the standard Python 2.4 distribution. However, please note that the Python bsddb module doesn't support in-memory based databases - note the library documentation's[1] wording: Files never intended to be preserved on disk may be created by passing None as the filename. which closely mirrors the Sleepycat documentation[2]: In-memory databases never intended to be preserved on disk may be created by setting the file parameter to NULL. It does actually use a temporary file (in /var/tmp), for which performance for my purposes is unsatisfactory:

# keys   dictionary   metakit   bsddb       (all using psyco)
------   ----------   -------   -----
1M       8.8s         22.2s     20m25s[3]
2M       24.0s        43.7s     N/A
5M       115.3s       105.4s    N/A

Cheers, Chris

[1] bsddb docs: http://www.python.org/doc/current/lib/module-bsddb.html
[2] Sleepycat BerkeleyDB C API: http://www.sleepycat.com/docs/api_c/db_open.html
[3] Wall clock time. Storing the (long_integer, integer) key in string form "long_integer:integer" since bsddb doesn't support keys that aren't integers or strings.

-- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Paul McGuire wrote: Claudio Grondi [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Chris Foote wrote: Hi all. I have the need to store a large (10M) number of keys in a hash table, based on a tuple of (long_integer, integer). The standard python dictionary works well for small numbers of keys, but starts to perform badly for me inserting roughly 5M keys: # keys dictionary metakit (both using psyco) -- -- --- 1M8.8s 22.2s 2M 24.0s 43.7s 5M 115.3s105.4s Has anyone written a fast hash module which is more optimal for large datasets ? p.s. Disk-based DBs are out of the question because most key lookups will result in a miss, and lookup time is critical for this application. Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the standard Python 2.4 distribution. Berkeley DB was 20 times faster than other databases. It has the operational speed of a main memory database, the startup and shut down speed of a disk-resident database, and does not have the overhead of a client-server inter-process communication. Ray Van Tassle, Senior Staff Engineer, Motorola Please let me/us know if it is what you are looking for. sqlite also supports an in-memory database - use pysqlite (http://initd.org/tracker/pysqlite/wiki) to access this from Python. Hi Paul. I tried that, but the overhead of parsing SQL queries is too high: dictionary metakit sqlite[1] -- --- - 1M numbers 8.8s 22.2s 89.6s 2M numbers 24.0s 43.7s190.0s 5M numbers 115.3s 105.4s N/A Thanks for the suggestion, but no go. Cheers, Chris [1] pysqlite V1 sqlite V3. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
lcaamano wrote: Sounds like PyTables could be useful. http://www.pytables.org In browsing their excellent documentation, it seems that all concepts are built around storing and reading HDF5 format files. Not suitable for this project unfortunately. Cheers, Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
(my dictionary values are all None). So in fact all you need is a set. Have you experimented with the Python 2.5 alpha? -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris Foote wrote: lcaamano wrote: Sounds like PyTables could be useful. http://www.pytables.org In browsing their excellent documentation, it seems that all concepts are built around storing and reading HDF5 format files. Not suitable for this project unfortunately. Cheers, Chris How many values accept long and int in pairs ? It is possible to try to use the dictionary of dictionaries: d = {} for long_val, int_val in list_vals: d.setdefault(long_val, {})[int_val] = None -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
John Machin wrote: (my dictionary values are all None). So in fact all you need is a set. Have you experimented with the Python 2.5 alpha? I don't think that will help him: my timings work out about the same on 2.4.2 or 2.5a2. As I pointed out, the bottleneck is creating the tuples. Creating either a dictionary or a set from 5 million tuples takes a similar amount of time (about 6 seconds on my system). BTW, a string key is quite a bit faster than the tuples, even though you end up with as many temporary tuples during the process:

    scale = 2**32
    data = [ "%d:%d" % (i*scale,i) for i in range(5000000) ]
    d = set(data)

takes about 18 seconds to create the data list and 5 for the set. -- http://mail.python.org/mailman/listinfo/python-list
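Since the second field is a small integer, another option along the lines of the earlier "i*scale + i" variant is to pack each (long_integer, integer) pair into a single integer key, avoiding per-key tuple (and string) objects entirely. A sketch only, in Python 2 syntax; the 32-bit assumption about the second field is an illustration, not something stated in the thread:

    import time

    scale = 2 ** 32                  # assumes the int part always fits in 32 bits

    def pack(long_part, int_part):
        return long_part * scale + int_part

    start = time.time()
    data = [pack(i, i) for i in range(5000000)]
    keys = set(data)                 # or dict.fromkeys(data)
    print 'packed integer keys: %.1fs' % (time.time() - start)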
Re: Large Dictionaries
Chris Foote wrote: Claudio Grondi wrote: Chris Foote wrote: p.s. Disk-based DBs are out of the question because most key lookups will result in a miss, and lookup time is critical for this application. Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the standard Python 2.4 distribution. However, please note that the Python bsddb module doesn't support in-memory based databases - note the library documentation's[1] wording: Files never intended to be preserved on disk may be created by passing None as the filename. which closely mirrors the Sleepycat documentation[2]: In-memory databases never intended to be preserved on disk may be created by setting the file parameter to NULL. It does actually use a temporary file (in /var/tmp), for which performance for my purposes is unsatisfactory: # keys dictionary metakit bsddb (all using psyco) -- -- --- - 1M8.8s 22.2s 20m25s[3] 2M 24.0s 43.7s N/A 5M 115.3s105.4s N/A Cheers, Chris [1] bsddb docs: http://www.python.org/doc/current/lib/module-bsddb.html [2] Sleepycat BerkeleyDB C API: http://www.sleepycat.com/docs/api_c/db_open.html [3] Wall clock time. Storing the (long_integer, integer) key in string form long_integer:integer since bsddb doesn't support keys that aren't integers or strings. I have to admit, that I haven't wrote any own code to actually test this, but if 20m25s for storing of a single MByte of strings in a database table index column is really what you are getting, I can't get rid of the feeling, that there is something elementary wrong with your way doing it. Posting the code for your test cases appears to me to be the only option to see what is the reason for the mystery you are getting here (this will clarify also the other mysterious things considered by the posters to this thread up to now). Claudio -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
BTW why are python dicts implemented as hash tables and not judy arrays? -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
[EMAIL PROTECTED] wrote: BTW why are python dicts implemented as hash tables and not judy arrays? Probably because: Python predates judy arrays. Nobody has proposed replacing the current dict implementation. Nobody has demonstrated that it would actually be an advantage to replace the current implementation. Or maybe Python dicts are just more flexible? You choose. A quick scan of the Judy IV Shop Manual reveals: Judy is a dessert topping and a floor wax, but its not intended for use as a house paint. ... Judy is a programming library that provides a relatively simple interface (API) for array-like storage of word or string indexes with optional mapping of indexes to single-word values. It doesn't sound to me as though they match the flexibility of Python dictionaries. Without reading further through the documentation, the implication is that keys are restricted to simple word or string values. Remember, all that Python requires of a dict key is that it has a hash value and can be compared for equal/not equal with other keys. It doesn't require the key to be expressible as a sequence of bits which is what the Judy manual seems to imply: You can think of digital trees as peeling (decoding) leading bits off a key until only one bit is left, but in the case of an unbounded variable-size key there is no definite bottom (that is, a definite last bit or maximum length for every key). However, there are always unpopulated subexpanses, except with a fixed-size key where every possible key value is stored in the data structure. When decoding keys top-down in this way, each (sub)expanse is defined by the bits already decoded and the number of bits remaining (if finite). -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Duncan Booth wrote: BTW why are python dicts implemented as hash tables and not judy arrays? Probably because: Python predates judy arrays. Nobody has proposed replacing the current dict implementation. Nobody has demonstrated that it would actually be an advantage to replace the current implementation. Or maybe Python dicts are just more flexible? Also, see http://www.dalkescientific.com/Python/PyJudy.html with the conclusions drawn on performance: The first conclusion is that Python dictionaries are wicked fast. There are few cases where PyJudy is faster, though perhaps there might be a few more if I knew more about the Python extension API. Bear in mind though that the Judy arrays are in sorted order and the JudyL* classes have ways to access elements by order. So for a specialised use Judy arrays could be a good idea, but not as a general replacement for Python dicts. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Maric Michaud [EMAIL PROTECTED] wrote: I don't see a use case where a python programmer should need a dictionnary that will be probably big but can't predict what keys will be in. I've recently been working on an app with precisely this characteristic, although there were enough other bottlenecks that the cost of growing the dict as needed was irrelevant. (Read ~2m items from csv file, compare them with items in a database and populate assorted dictionaries based on the result of that comparison for subsequent processing. You can make a reasonable guess, and have a definite upper bound, on the sizes of the dictionaries cheaply right at the start, but you don't know what the keys in each are going to be until you discover them through the comparison of file and DB.) -- \S -- [EMAIL PROTECTED] -- http://www.chaos.org.uk/~sion/ ___ | Frankly I have no feelings towards penguins one way or the other \X/ |-- Arthur C. Clarke her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Claudio Grondi wrote: Chris Foote wrote: However, please note that the Python bsddb module doesn't support in-memory based databases - note the library documentation's[1] wording: Files never intended to be preserved on disk may be created by passing None as the filename. which closely mirrors the Sleepycat documentation[2]: In-memory databases never intended to be preserved on disk may be created by setting the file parameter to NULL. It does actually use a temporary file (in /var/tmp), for which performance for my purposes is unsatisfactory: # keys dictionary metakit bsddb (all using psyco) -- -- --- - 1M8.8s 22.2s 20m25s[3] 2M 24.0s 43.7s N/A 5M 115.3s105.4s N/A Cheers, Chris [1] bsddb docs: http://www.python.org/doc/current/lib/module-bsddb.html [2] Sleepycat BerkeleyDB C API: http://www.sleepycat.com/docs/api_c/db_open.html [3] Wall clock time. Storing the (long_integer, integer) key in string form long_integer:integer since bsddb doesn't support keys that aren't integers or strings. I have to admit, that I haven't wrote any own code to actually test this, but if 20m25s for storing of a single MByte of strings in a database table index column is really what you are getting, I can't get rid of the feeling, that there is something elementary wrong with your way doing it. Hi Claudio. 1M is one million, referring to the number of insertions of keys; not a Megabyte. I'm sorry that you took it that way :-( Berkeley DB is great for accessing data by key for things already stored on disk (i.e. read access), but write performance for each key-value pair is slow due to it being careful about flushing writes to disk by default. Posting the code for your test cases appears to me to be the only option to see what is the reason for the mystery you are getting here (this will clarify also the other mysterious things considered by the posters to this thread up to now). I agree that posting some test code would have proved useful, but the code is big and has too many interdependencies on external things (e.g. databases, threads pyro RPC calls) to allow me to separate out examples easily. But if you go back to my original posting, I think my question was quite clear. Best regards, Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris Foote wrote: Claudio Grondi wrote: Chris Foote wrote: However, please note that the Python bsddb module doesn't support in-memory based databases - note the library documentation's[1] wording: Files never intended to be preserved on disk may be created by passing None as the filename. which closely mirrors the Sleepycat documentation[2]: In-memory databases never intended to be preserved on disk may be created by setting the file parameter to NULL. It does actually use a temporary file (in /var/tmp), for which performance for my purposes is unsatisfactory:

# keys   dictionary   metakit   bsddb       (all using psyco)
------   ----------   -------   -----
1M       8.8s         22.2s     20m25s[3]
2M       24.0s        43.7s     N/A
5M       115.3s       105.4s    N/A

Cheers, Chris [1] bsddb docs: http://www.python.org/doc/current/lib/module-bsddb.html [2] Sleepycat BerkeleyDB C API: http://www.sleepycat.com/docs/api_c/db_open.html [3] Wall clock time. Storing the (long_integer, integer) key in string form "long_integer:integer" since bsddb doesn't support keys that aren't integers or strings.

I have to admit, that I haven't wrote any own code to actually test this, but if 20m25s for storing of a single MByte of strings in a database table index column is really what you are getting, I can't get rid of the feeling, that there is something elementary wrong with your way doing it.

Hi Claudio. 1M is one million, referring to the number of insertions of keys; not a Megabyte. I'm sorry that you took it that way :-( Berkeley DB is great for accessing data by key for things already stored on disk (i.e. read access), but write performance for each key-value pair is slow due to it being careful about flushing writes to disk by default. Posting the code for your test cases appears to me to be the only option to see what is the reason for the mystery you are getting here (this will clarify also the other mysterious things considered by the posters to this thread up to now). I agree that posting some test code would have proved useful, but the code is big and has too many interdependencies on external things (e.g. databases, threads & pyro RPC calls) to allow me to separate out examples easily. But if you go back to my original posting, I think my question was quite clear. Best regards, Chris

I have some code demonstrating the mystery from my point of view (including timings which differ from what you experience in the order of a magnitude on my 2.8 GHz P4):

dctA = {}
import time
strt = time.clock()
lst = range(123456789010000000L, 123456789010000000L + 5000000)
import random
endt = time.clock()
print "lst = range(123456789010000000L, 123456789010000000L + 5000000) [s] : ", endt - strt

strt = time.clock()
random.shuffle(lst)
endt = time.clock()
print "random.shuffle(lst) [s] : ", endt - strt
random.shuffle(lst)

counter = 0
for item in lst:
    if (counter == 0):
        print "Listing of some of lst items:"
    if (counter > 4):
        print ":END of listing of lst items"
        break
    else:
        print "item no. %i " % (counter,), repr(item)
    counter += 1
#:for
# raw_input('continue with ENTER) #')

strt = time.clock()
dctA.fromkeys(lst, None)
endt = time.clock()
print "dctA.fromkeys(lst, None) [s] : ", endt - strt
# raw_input('continue with ENTER) #')

strt = time.clock()
strLst = []
for item in lst:
    strLst.append(str(item))
dctA = {}
endt = time.clock()
print "for item in lst: strLst.append(str(item)) [s] : ", endt - strt
# raw_input('continue with ENTER) #')

strt = time.clock()
dctA = dctA.fromkeys(strLst, None)
endt = time.clock()
print "dctA.fromkeys(strLst, None) [s] : ", endt - strt
# raw_input('continue with ENTER) #')

print "len(dctA) : %i" % (len(dctA),)
counter = 0
for key in dctA.keys():
    if (counter == 0):
        print "Listing of some of dctA items:"
    if (counter > 4):
        print ":END of listing of dctA items"
        break
    else:
        print "key no. %i " % (counter,), repr(key)
    counter += 1
#:for
raw_input('exit with ENTER) #')

# Gives as output :
#
# lst = range(123456789010000000L, 123456789010000000L + 5000000) [s] :  0.81327347470
# random.shuffle(lst) [s] :  8.06178829991
# Listing of some of lst items:
# item no. 0  123456789011010733L
# item no. 1  123456789013585761L
# item no. 2  123456789013610266L
# item no. 3  123456789011311029L
# item no. 4  123456789010968853L
# :END of listing of lst items
# dctA.fromkeys(lst, None) [s] :  3.11773256098
# for item in lst: strLst.append(str(item)) [s] :  11.5454232312
# dctA.fromkeys(strLst, None) [s] :  3.98027849908
# len(dctA) : 5000000
# Listing of some of dctA items:
# key no. 0  '123456789014515319'
# key no. 1  '123456789014116699'
# key no. 2  '123456789014116698'
# key no. 3  '123456789010800915'
# key no. 4
Re: Large Dictionaries
22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. With bdb's it is crucial to insert keys in bytestring-sorted order. Also, be sure to give it a decent amount of cache. -Mike -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
[Chris] Has anyone written a fast hash module which is more optimal for large datasets ? PyJudy might be what you're looking for, though I've never used it: http://www.dalkescientific.com/Python/PyJudy.html Judy's key benefits are scalability, high performance, and memory efficiency. A Judy array is extensible and can scale up to a very large number of elements, bounded only by machine memory. ... PyJudy arrays are similar to Python dictionaries and sets. -- Richie Hindle [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article [EMAIL PROTECTED], Chris Foote [EMAIL PROTECTED] wrote: Hi all. I have the need to store a large (10M) number of keys in a hash table, based on a tuple of (long_integer, integer). The standard python dictionary works well for small numbers of keys, but starts to perform badly for me inserting roughly 5M keys: # keys dictionary metakit (both using psyco) -- -- --- 1M8.8s 22.2s 2M 24.0s 43.7s 5M 115.3s105.4s Are those clock times or CPU times? How much memory is your process using and how much is available on your machine? I'm guessing a integer takes 4 bytes and a long integer takes roughly one byte per two decimal digits. Plus a few more bytes to bundle them up into a tuple. You've probably got something like 20 bytes per key, so 5M of them is 100 meg just for the keys. To get reasonable hashing performance, the hash table needs to be maybe half full, and figure a hash key is 4 bytes, so that's another 40 meg for the hash table itself. Plus whatever the values stored in the dictionary take up. Even if you're not storing any values (i.e., there's probably 4 bytes for a null pointer (or reference to None), so that's another 40 meg. These are all vague guesses, based on what I think are probably conservative estimates of what various objects must take up in memory, but you see where this is going. We're already up to 180 meg. I wonder if the whole problem is that you're just paging yourself to death. A few minutes watching your system memory performance with ps or top while your program is running might give you some idea if this is the case. -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Chris Foote wrote: Hi all. I have the need to store a large (10M) number of keys in a hash table, based on a tuple of (long_integer, integer). The standard python dictionary works well for small numbers of keys, but starts to perform badly for me inserting roughly 5M keys: # keys dictionary metakit (both using psyco) -- -- --- 1M8.8s 22.2s 2M 24.0s 43.7s 5M 115.3s105.4s Has anyone written a fast hash module which is more optimal for large datasets ? p.s. Disk-based DBs are out of the question because most key lookups will result in a miss, and lookup time is critical for this application. Cheers, Chris Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the standard Python 2.4 distribution. Berkeley DB was 20 times faster than other databases. It has the operational speed of a main memory database, the startup and shut down speed of a disk-resident database, and does not have the overhead of a client-server inter-process communication. Ray Van Tassle, Senior Staff Engineer, Motorola Please let me/us know if it is what you are looking for. Claudio -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
In article [EMAIL PROTECTED], Roy Smith [EMAIL PROTECTED] wrote: In article [EMAIL PROTECTED], Chris Foote [EMAIL PROTECTED] wrote: I have the need to store a large (10M) number of keys in a hash table, based on a tuple of (long_integer, integer). The standard python dictionary works well for small numbers of keys, but starts to perform badly for me inserting roughly 5M keys: # keys dictionary metakit (both using psyco) -- -- --- 1M 8.8s 22.2s 2M 24.0s 43.7s 5M 115.3s 105.4s Are those clock times or CPU times? And what are these times measuring? Don't forget that adding keys requires resizing the dict, which is a moderately expensive operation. Once the dict is constructed, lookup times should be quite good. -- Aahz ([EMAIL PROTECTED]) * http://www.pythoncraft.com/ I saw `cout' being shifted Hello world times to the left and stopped right there. --Steve Gonedes -- http://mail.python.org/mailman/listinfo/python-list
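To put a number on "quite good", here is a rough timing sketch (the dict contents and the probe keys are invented for illustration):

    import timeit

    setup = "d = dict.fromkeys((n, n % 1000) for n in xrange(1000000))"
    hit  = timeit.Timer("d[(567890, 890)]", setup)
    miss = timeit.Timer("(999999999, 1) in d", setup)
    print min(hit.repeat(3, 100000)), "s per 100k hits"
    print min(miss.repeat(3, 100000)), "s per 100k misses"

This gives a concrete per-lookup figure to set against the insertion times quoted above.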
Re: Large Dictionaries
Aahz [EMAIL PROTECTED] wrote: Don't forget that adding keys requires resizing the dict, which is a moderately expensive operation. My guess would be that each resize grows the table by a constant factor, which, IIRC, works out to amortized O(n). It's not as good as creating the dict the right size in the first place, but it's really not that bad. Somewhat more troubling is that it can lead to memory fragmentation problems. I don't understand why Python doesn't have a way to give a size hint when creating a dict. It seems so obvious to be able to say d = dict (size=10*1000*1000) if you know beforehand that's how many keys you're going to add. Looking at the docs, I'm astounded to discover that you can indeed do exactly that, but you get {'size': 10000000}. I am befuddled as to why people thought creating a dict by keyword would be a useful thing to do, but left out (and, indeed, eliminated the chance to add the syntax later) the obviously useful ability to hint at the size. I suppose we could still add: d = {} d.reserve (10*1000*1000) -- http://mail.python.org/mailman/listinfo/python-list
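For anyone who hasn't tried it, this is all the keyword form does; there is no size hint hiding in there, and the wished-for reserve method does not exist:

    >>> dict(size=10*1000*1000)
    {'size': 10000000}
    >>> d = {}
    >>> d.reserve(10*1000*1000)
    Traceback (most recent call last):
      ...
    AttributeError: 'dict' object has no attribute 'reserve'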
Re: Large Dictionaries
On Monday 15 May 2006 19:24, Roy Smith wrote: d = {} d.reserve (10*1000*1000) d={}.fromkeys(xrange(5*10**6)) ? -- _ Maric Michaud _ Aristote - www.aristote.info 3 place des tapis 69004 Lyon Tel: +33 426 880 097 -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Claudio Grondi [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Chris Foote wrote: Hi all. I have the need to store a large (10M) number of keys in a hash table, based on a tuple of (long_integer, integer). The standard python dictionary works well for small numbers of keys, but starts to perform badly for me inserting roughly 5M keys: # keys dictionary metakit (both using psyco) -- -- --- 1M 8.8s 22.2s 2M 24.0s 43.7s 5M 115.3s 105.4s Has anyone written a fast hash module which is more optimal for large datasets ? p.s. Disk-based DBs are out of the question because most key lookups will result in a miss, and lookup time is critical for this application. Cheers, Chris Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the standard Python 2.4 distribution. Berkeley DB was 20 times faster than other databases. It has the operational speed of a main memory database, the startup and shut down speed of a disk-resident database, and does not have the overhead of a client-server inter-process communication. Ray Van Tassle, Senior Staff Engineer, Motorola Please let me/us know if it is what you are looking for. Claudio sqlite also supports an in-memory database - use pysqlite (http://initd.org/tracker/pysqlite/wiki) to access this from Python. -- Paul -- http://mail.python.org/mailman/listinfo/python-list
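A minimal sketch of the in-memory route Paul mentions (the import is the pysqlite 2.x module path; the table and column names are invented for the example):

    from pysqlite2 import dbapi2 as sqlite

    con = sqlite.connect(':memory:')      # whole database lives in RAM
    con.execute('CREATE TABLE keys (hi INTEGER, lo INTEGER, PRIMARY KEY (hi, lo))')
    con.executemany('INSERT INTO keys VALUES (?, ?)',
                    [(n // 1000, n % 1000) for n in xrange(100000)])
    con.commit()

    cur = con.execute('SELECT 1 FROM keys WHERE hi=? AND lo=?', (42, 7))
    print cur.fetchone()                  # (1,) on a hit, None on a miss

Note that SQLite stores integers up to 64 bits, so a truly huge long_integer would still have to be split or stored as text.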
Re: Large Dictionaries
Maric Michaud wrote: On Monday 15 May 2006 19:24, Roy Smith wrote: d = {} d.reserve (10*1000*1000) d={}.fromkeys(xrange(5*10**6)) ? That is a totally different beast. You don't want to insert arbitrary keys, you want the internal hash-array to be of the right size. Diez -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Roy Smith: I am befuddled as to why people thought creating a dict by keyword would be a useful thing to do, but left out (and, indeed, eliminated the chance to add the syntax later) the obviously useful ability to hint at the size. The keyword syntax doesn't cost a Python programmer much to remember, because it's the same syntax used elsewhere, while adding yet another method to dicts enlarges their API. Python is usually designed to keep APIs small, to help programmers remember most things. I agree that a reserve method on Python dicts/sets could be a little useful, but Python programmers are usually more concerned with getting things done in little programmer time and little code space than with telling the computer how to do them in the fastest way (C++ is more speed-oriented, so the reserve of the STL hashes is more useful there). Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
1. Why is two minutes to insert 5M keys bad for you? What would be good? What would be good/bad look-up times? Have you measured the typical look-up time? How often is the dict creation required to be done? How often does the data change? Is multi-user access required for (a) look-up (b) updating? Have you considered loading the dict from a pickle?
2. Assuming your code that is creating the dict looks in essence like this:
       adict = {}
       for k, v in some_iterable:
           adict[k] = v
   then any non-linear behaviour can only be in the actual CPython insertion code. Psyco can't help you there. Psyco *may* help with the linear part, *if* you have enough memory. What are the corresponding times without Psyco? In any case, if your code isn't (conceptually) that simple, then try cutting away the cruft and measuring again.
3. Which version of Python? What OS? OK, psyco - Intel x86, but what chip exactly? How much free memory?
4. Consider printing time-so-far results, say every 100K keys. Multiple step-ups might indicate dict resizings. A dog-leg probably means running out of memory. Why roughly 5M keys???
5. How large are your long_integers?
6. What is the nature of the value associated with each key?
7. Have you experimented with key = a * 2 ** 32 + b instead of key = (a, b)? (A sketch of 4 and 7 follows below.)
HTH, John -- http://mail.python.org/mailman/listinfo/python-list
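A rough sketch of John's items 4 and 7 combined (the key arithmetic assumes the integer part is non-negative and below 2**32; the sample data is invented):

    import time

    def load(pairs):
        adict = {}
        start = time.time()
        count = 0
        for a, b in pairs:
            adict[a * 2**32 + b] = None     # one long key instead of a tuple (item 7)
            count += 1
            if count % 100000 == 0:         # progress print every 100K keys (item 4)
                print count, round(time.time() - start, 1)
        return adict

    d = load((n * 7, n % 1000) for n in xrange(1000000))

A smooth, roughly linear series of timings suggests everything is fine; sudden jumps point at dict resizes or at the machine starting to swap.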
Re: Large Dictionaries
Sounds like PyTables could be useful. http://www.pytables.org -- lpc -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Roy Smith wrote: In article [EMAIL PROTECTED], Chris Foote [EMAIL PROTECTED] wrote: I have the need to store a large (10M) number of keys in a hash table, based on a tuple of (long_integer, integer). The standard python dictionary works well for small numbers of keys, but starts to perform badly for me inserting roughly 5M keys: # keys dictionary metakit (both using psyco) -- -- --- 1M 8.8s 22.2s 2M 24.0s 43.7s 5M 115.3s 105.4s Are those clock times or CPU times? User + system CPU time. How much memory is your process using and how much is available on your machine? A very large amount of space is consumed, which is why I didn't give times for the 10M version which would have eaten my 1G of RAM and gone into swap :-) I'm guessing an integer takes 4 bytes and a long integer takes roughly one byte per two decimal digits. Plus a few more bytes to bundle them up into a tuple. You've probably got something like 20 bytes per key, so 5M of them is 100 meg just for the keys. To get reasonable hashing performance, the hash table needs to be maybe half full, and figure a hash key is 4 bytes, so that's another 40 meg for the hash table itself. Plus whatever the values stored in the dictionary take up. Even if you're not storing any values, there's probably 4 bytes for a null pointer (or reference to None), so that's another 40 meg. These are all vague guesses, based on what I think are probably conservative estimates of what various objects must take up in memory, but you see where this is going. We're already up to 180 meg. Yep, that size sounds about right (my dictionary values are all None). I wonder if the whole problem is that you're just paging yourself to death. A few minutes watching your system memory performance with ps or top while your program is running might give you some idea if this is the case. Definitely not. Thanks anyway, Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: Large Dictionaries
Aahz wrote: In article [EMAIL PROTECTED], Roy Smith [EMAIL PROTECTED] wrote: In article [EMAIL PROTECTED], Chris Foote [EMAIL PROTECTED] wrote: I have the need to store a large (10M) number of keys in a hash table, based on a tuple of (long_integer, integer). The standard python dictionary works well for small numbers of keys, but starts to perform badly for me inserting roughly 5M keys: # keys dictionary metakit (both using psyco) -- -- --- 1M 8.8s 22.2s 2M 24.0s 43.7s 5M 115.3s 105.4s Are those clock times or CPU times? And what are these times measuring? The loading of a file into a dictionary. i.e. no lookup operations. Don't forget that adding keys requires resizing the dict, which is a moderately expensive operation. Yep, that's why I probably need a dictionary where I can pre-specify an approximate size at the time of its creation. Once the dict is constructed, lookup times should be quite good. Very good indeed with Psyco! Cheers, Chris -- http://mail.python.org/mailman/listinfo/python-list