Re: Large Dictionaries

2006-06-12 Thread Klaas
Thomas Ganss wrote: > Klaas wrote: > > > 4. Insert your keys in sorted order. > This advice is questionable - > it depends at least on the db vendor and probably > sometimes on the sort method whether inserting pre-sorted > values is better. The article I wrote that you quoted named a specifi

Re: Large Dictionaries

2006-06-09 Thread Duncan Booth
Marcin Ciura wrote: > Duncan Booth wrote: >> Marcin Ciura wrote: >>>See Figure 8 in >>>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf >> That isn't what the reference says. It only covers N up to a few >> thousand. Practical values of N need to at least go up into the >> millions. > > Pl

Re: Large Dictionaries

2006-06-09 Thread Marcin Ciura
Duncan Booth wrote: > Marcin Ciura wrote: >>See Figure 8 in >>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf > That isn't what the reference says. It only covers N up to a few thousand. > Practical values of N need to at least go up into the millions. Please look at the performance graph

Re: Large Dictionaries

2006-06-09 Thread Duncan Booth
Marcin Ciura wrote: > Tim Peters wrote: >> shellsort is much more a refinement of insertion-sort than of >> bubblesort, but is not an O(N log N) algorithm regardless. > > With a judiciously chosen increment sequence, > the number of comparisons made by shellsort > *is* of the order N log N for p

Re: Large Dictionaries

2006-06-09 Thread Marcin Ciura
Tim Peters wrote: > shellsort is much more a refinement of insertion-sort than of > bubblesort, but is not an O(N log N) algorithm regardless. With a judiciously chosen increment sequence, the number of comparisons made by shellsort *is* of the order N log N for practical values of N. See Figure
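
The increment-sequence claim above can be sketched directly: a minimal shellsort using the gap sequence from Ciura's paper cited in this subthread. Extending the published gaps by a factor of roughly 2.25 for larger inputs is my assumption here, not something stated in the thread.

```python
def shellsort(a):
    """Sort list `a` in place and return it (Ciura 2001 gap sequence)."""
    gaps = [1, 4, 10, 23, 57, 132, 301, 701]
    # Extrapolate beyond the published gaps (assumed factor ~2.25).
    while gaps[-1] < len(a) // 2:
        gaps.append(int(gaps[-1] * 2.25))
    for gap in reversed(gaps):
        # Gapped insertion sort: each stride-`gap` slice becomes sorted.
        for i in range(gap, len(a)):
            tmp = a[i]
            j = i
            while j >= gap and a[j - gap] > tmp:
                a[j] = a[j - gap]
                j -= gap
            a[j] = tmp
    return a
```

Counting the comparisons in the inner `while` loop for growing N is one way to check the "order N log N for practical N" claim empirically.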

Re: Large Dictionaries

2006-06-08 Thread Tim Peters
[Tim Peters] >> ... >> O(N log N) sorting algorithms helped by pre-existing order are >> uncommon, unless they do extra work to detect and exploit >> pre-existing order. [Lawrence D'Oliveiro] > Shellsort works well with nearly-sorted data. It's basically a smarter > version of bubblesort with much

Re: Large Dictionaries

2006-06-08 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>, "Iain King" <[EMAIL PROTECTED]> wrote: >Lawrence D'Oliveiro wrote: >> In article <[EMAIL PROTECTED]>, >> Scott David Daniels <[EMAIL PROTECTED]> wrote: >> >> >For example, time timsort (Python's internal sort) on pre-sorted >> >data; you'll find it is handled fast

Re: Large Dictionaries

2006-06-08 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>, "Tim Peters" <[EMAIL PROTECTED]> wrote: >For example, the O(N log N) heapsort is unreasonable compared to the >O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad >case for heapsort, but a good case for bubblesort)? O(N log N) >sorting algorithms he

Re: Large Dictionaries

2006-06-05 Thread Jim Segrave
In article <[EMAIL PROTECTED]>, Tim Peters <[EMAIL PROTECTED]> wrote: >[Jim Segrave] >> Actually, presorted lists are not a bad case for heapsort - it's quite >> immune to any existing order or lack thereof, > >Write a heapsort and time it. It's not a difference in O() behavior, >but more memory m

Re: Large Dictionaries

2006-06-05 Thread Tim Peters
[Jim Segrave] > Actually, presorted lists are not a bad case for heapsort - it's quite > immune to any existing order or lack thereof, Write a heapsort and time it. It's not a difference in O() behavior, but more memory movement is required for a sorted list because transforming the list into a m
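
Tim's "write a heapsort and time it" can be sketched with the stdlib heapq module (a min-heap version, not necessarily the in-place textbook variant he timed):

```python
import heapq

def heapsort(a):
    """Heapsort via the stdlib heap: O(N log N) for any input order."""
    heap = list(a)
    heapq.heapify(heap)  # O(N) heap construction
    # N pops, O(log N) each; this is where a pre-sorted input still
    # pays for all the element movement Tim mentions.
    return [heapq.heappop(heap) for _ in range(len(heap))]
```

Note that for a min-heap an ascending-sorted list is already a valid heap, so `heapify` does no moves on sorted input; the memory traffic happens during the successive pops.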

Re: Large Dictionaries

2006-06-05 Thread Jim Segrave
In article <[EMAIL PROTECTED]>, Tim Peters <[EMAIL PROTECTED]> wrote: >[Scott David Daniels] >>> For example, time timsort (Python's internal sort) on pre-sorted >>> data; you'll find it is handled faster than random data. > >O(N) vs O(N log N), in fact. > >[Lawrence D'Oliveiro] >> But isn't that h

Re: Large Dictionaries

2006-06-05 Thread Tim Peters
[Scott David Daniels] >> For example, time timsort (Python's internal sort) on pre-sorted >> data; you'll find it is handled faster than random data. O(N) vs O(N log N), in fact. [Lawrence D'Oliveiro] > But isn't that how a reasonable sorting algorithm should behave? Less > work to do if the data
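
The O(N) vs O(N log N) claim is easy to observe by timing list sorting on pre-sorted versus shuffled data; the size and repeat count below are arbitrary choices:

```python
import random
import timeit

def best_sort_time(data, repeats=5):
    """Best-of-N wall time to sort a copy of `data` with sorted() (timsort)."""
    return min(timeit.timeit(lambda: sorted(data), number=1)
               for _ in range(repeats))

n = 200_000
pre_sorted = list(range(n))
shuffled = pre_sorted[:]
random.shuffle(shuffled)

t_sorted = best_sort_time(pre_sorted)
t_random = best_sort_time(shuffled)
# Timsort does ~N comparisons on already-ordered input vs ~N log N on
# random input, so the pre-sorted run should be noticeably faster.
```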

Re: Large Dictionaries

2006-06-05 Thread Iain King
Lawrence D'Oliveiro wrote: > In article <[EMAIL PROTECTED]>, > Scott David Daniels <[EMAIL PROTECTED]> wrote: > > >For example, time timsort (Python's internal sort) on pre-sorted > >data; you'll find it is handled faster than random data. > > But isn't that how a reasonable sorting algorithm sho

Re: Large Dictionaries

2006-06-05 Thread Aahz
In article <[EMAIL PROTECTED]>, Lawrence D'Oliveiro <[EMAIL PROTECTED]> wrote: >In article <[EMAIL PROTECTED]>, > Scott David Daniels <[EMAIL PROTECTED]> wrote: >> >>For example, time timsort (Python's internal sort) on pre-sorted >>data; you'll find it is handled faster than random data. > >But i

Re: Large Dictionaries

2006-06-05 Thread Steve Holden
Lawrence D'Oliveiro wrote: > In article <[EMAIL PROTECTED]>, > Scott David Daniels <[EMAIL PROTECTED]> wrote: > > >>For example, time timsort (Python's internal sort) on pre-sorted >>data; you'll find it is handled faster than random data. > > > But isn't that how a reasonable sorting algorith

Re: Large Dictionaries

2006-06-05 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>, Scott David Daniels <[EMAIL PROTECTED]> wrote: >For example, time timsort (Python's internal sort) on pre-sorted >data; you'll find it is handled faster than random data. But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data

Re: Large Dictionaries

2006-05-29 Thread Martin v. Löwis
Roy Smith wrote: > My guess would be that each resize grows the table by a constant > factor, which IIRC, works out to amortized O(n). Right. For <50,000 entries, the size is multiplied by 4 each time, and doubled each time for large dictionaries. > It's not as good as > creating the dict the rig
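
The geometric growth Martin describes can be observed indirectly in a modern CPython by watching sys.getsizeof jump as keys are inserted. Two caveats: getsizeof only appeared in Python 2.6 (after this thread), and the exact growth factors have changed between versions, so this shows the shape of the policy rather than the 2006 constants.

```python
import sys

def resize_points(n):
    """Insert n keys one at a time; record the key counts at which the
    dict's allocated size (per sys.getsizeof) jumps."""
    d = {}
    last = sys.getsizeof(d)
    jumps = []
    for i in range(n):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last:
            jumps.append(i + 1)
            last = size
    return jumps

points = resize_points(100_000)
# Geometric growth means only O(log n) resizes for n inserts.
```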

Re: Large Dictionaries

2006-05-29 Thread Scott David Daniels
Thomas Ganss wrote: > Klaas wrote: >> 4. Insert your keys in sorted order. > This advice is questionable - > > My gut feeling on this matter is: > IF the insert times of pre-sorted values is far better > than the times of unsorted values, there is a chance > that the resulting tree is unbala

Re: Large Dictionaries

2006-05-28 Thread Thomas Ganss
Klaas wrote: > 4. Insert your keys in sorted order. This advice is questionable - it depends at least on the db vendor and probably sometimes on the sort method whether inserting pre-sorted values is better. My gut feeling on this matter is: IF the insert times of pre-sorted values is far be

Re: Large Dictionaries

2006-05-24 Thread Klaas
Chris: > class StorageBerkeleyDB(StorageTest): >def runtest(self, number_hash): >db = bsddb.hashopen(None, flag='c', cachesize=8192) >for (num, wildcard_digits) in number_hash.keys(): >key = '%d:%d' % (num, wildcard_digits) >db[key] = None >db.clo

Re: Large Dictionaries

2006-05-24 Thread Klaas
Chris: > Berkeley DB is great for accessing data by key for things already > stored on disk (i.e. read access), but write performance for each > key-value pair is slow due to it being careful about flushing > writes to disk by default. This is absolutely false. -Mike -- http://mail.python.org/

Re: Large Dictionaries

2006-05-18 Thread Claudio Grondi
Chris Foote wrote: > Richie Hindle wrote: >> [Chris] >>> Has anyone written a fast hash module which is more optimal for >>> large datasets ? >> >> PyJudy might be what you're looking for, though I've never used it: >> >> http://www.dalkescientific.com/Python/PyJudy.html >> >> "Judy's key benefit

Re: Large Dictionaries

2006-05-18 Thread Claudio Grondi
Chris Foote wrote: > Claudio Grondi wrote: >> Chris Foote wrote: >>> Klaas wrote: >>> > 22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. >>> >>> I've put together some simplified test code, but the bsddb >>> module gives 11m for 1M keys: >>> >> I have run your c

Re: Large Dictionaries

2006-05-17 Thread Chris Foote
Claudio Grondi wrote: > Chris Foote wrote: >> Klaas wrote: >> 22.2s 20m25s[3] >>> >>> 20m to insert 1m keys? You are doing something wrong. >> >> I've put together some simplified test code, but the bsddb >> module gives 11m for 1M keys: >> > I have run your code for the bsddb on my P4 2.8 G

Re: Large Dictionaries

2006-05-17 Thread Chris Foote
Richie Hindle wrote: > [Chris] >> Has anyone written a fast hash module which is more optimal for >> large datasets ? > > PyJudy might be what you're looking for, though I've never used it: > > http://www.dalkescientific.com/Python/PyJudy.html > > "Judy's key benefits are scalability, high per

Re: Large Dictionaries

2006-05-17 Thread Claudio Grondi
Chris Foote wrote: > Klaas wrote: > >>> 22.2s 20m25s[3] >> >> >> 20m to insert 1m keys? You are doing something wrong. > > > Hi Mike. > > I've put together some simplified test code, but the bsddb > module gives 11m for 1M keys: > I have run your code for the bsddb on my P4 2.8 GHz and have

Re: Large Dictionaries

2006-05-17 Thread Chris Foote
Klaas wrote: 22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. Hi Mike. I've put together some simplified test code, but the bsddb module gives 11m for 1M keys: Number generator test for 100 number ranges with a maximum of 3 wildcard digits. Wed May 17 22:1

Re: Large Dictionaries

2006-05-16 Thread Klaas
>22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. With bdb's it is crucial to insert keys in bytestring-sorted order. Also, be sure to give it a decent amount of cache. -Mike -- http://mail.python.org/mailman/listinfo/python-list
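
The bsddb module left the standard library long ago (Python 3 dropped it), so the sorted-insertion advice can only be sketched here against the stdlib dbm.dumb backend to show the pattern; the actual speedup Klaas describes comes from Berkeley DB's B-tree page locality, which dbm.dumb does not share.

```python
import dbm.dumb
import os
import tempfile

def store_sorted(pairs):
    """Write key/value pairs to an on-disk table, inserting keys in
    bytestring-sorted order as the thread recommends for bdb B-trees."""
    path = os.path.join(tempfile.mkdtemp(), "table")
    db = dbm.dumb.open(path, "c")
    try:
        for key in sorted(pairs):  # bytes sort lexicographically
            db[key] = pairs[key]
    finally:
        db.close()
    return path

# Fixed-width keys so bytestring order matches numeric order.
pairs = {b"%09d" % i: b"x" for i in range(100)}
path = store_sorted(pairs)
```

With a real Berkeley DB binding (the third-party `berkeleydb`/`bsddb3` packages today), the cache-size setting Klaas mentions would also matter; dbm.dumb has no equivalent knob.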

Re: Large Dictionaries

2006-05-16 Thread Claudio Grondi
Chris Foote wrote: > Claudio Grondi wrote: > >> Chris Foote wrote: >> >>> However, please note that the Python bsddb module doesn't support >>> in-memory based databases - note the library documentation's[1] wording: >>> >>> "Files never intended to be preserved on disk may be created >>> by

Re: Large Dictionaries

2006-05-16 Thread Chris Foote
Claudio Grondi wrote: > Chris Foote wrote: > >> However, please note that the Python bsddb module doesn't support >> in-memory based databases - note the library documentation's[1] wording: >> >> "Files never intended to be preserved on disk may be created >> by passing None as the filena

Re: Large Dictionaries

2006-05-16 Thread Sion Arrowsmith
Maric Michaud <[EMAIL PROTECTED]> wrote: >I don't see a use case where a python programmer should need a >dictionary "that will probably be big" but can't predict what keys will be >in it. I've recently been working on an app with precisely this characteristic, although there were enough other bott

Re: Large Dictionaries

2006-05-16 Thread Duncan Booth
Duncan Booth wrote: >> BTW why are python dicts implemented as hash tables and not judy >> arrays? >> > Probably because: > > Python predates judy arrays. > Nobody has proposed replacing the current dict implementation. > Nobody has demonstrated that it would actually be an advantage to > repla

Re: Large Dictionaries

2006-05-16 Thread Duncan Booth
[EMAIL PROTECTED] wrote: > BTW why are python dicts implemented as hash tables and not judy > arrays? > Probably because: Python predates judy arrays. Nobody has proposed replacing the current dict implementation. Nobody has demonstrated that it would actually be an advantage to replace the cur

Re: Large Dictionaries

2006-05-16 Thread [EMAIL PROTECTED]
BTW why are python dicts implemented as hash tables and not judy arrays? -- http://mail.python.org/mailman/listinfo/python-list

Re: Large Dictionaries

2006-05-16 Thread Claudio Grondi
Chris Foote wrote: > Claudio Grondi wrote: > >> Chris Foote wrote: >> >>> p.s. Disk-based DBs are out of the question because most >>> key lookups will result in a miss, and lookup time is >>> critical for this application. >>> >> Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for

Re: Large Dictionaries

2006-05-16 Thread Duncan Booth
John Machin wrote: >> (my dictionary values are all None). > > So in fact all you need is a set. Have you experimented with the Python > 2.5 alpha? > I don't think that will help him: my timings work out about the same on 2.4.2 or 2.5a2. As I pointed out, the bottleneck is creating the tuples.

Re: Large Dictionaries

2006-05-16 Thread Vladimir 'Yu' Stepanov
Chris Foote wrote: > lcaamano wrote: > >> Sounds like PyTables could be useful. >> >> http://www.pytables.org >> > > In browsing their excellent documentation, it seems that all concepts > are built around storing and reading HDF5 format files. > > Not suitable for this project unfortunate

Re: Large Dictionaries

2006-05-16 Thread John Machin
> (my dictionary values are all None). So in fact all you need is a set. Have you experimented with the Python 2.5 alpha? -- http://mail.python.org/mailman/listinfo/python-list
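
John's point: when every value is None, a set carries the same information as the dict. A small sketch, using tuple keys shaped like the ones in this thread:

```python
# If every value is None, the dict is only being used for membership
# tests, and a set expresses that directly.
keys = [(10**12 + i, i % 100) for i in range(1000)]  # (long_integer, integer)

as_dict = dict.fromkeys(keys)  # values all None
as_set = set(keys)

def hit_dict(k):
    return k in as_dict

def hit_set(k):
    return k in as_set
```

Both give O(1) average-case membership; the set simply never stores the value slots.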

Re: Large Dictionaries

2006-05-16 Thread Chris Foote
lcaamano wrote: > Sounds like PyTables could be useful. > > http://www.pytables.org In browsing their excellent documentation, it seems that all concepts are built around storing and reading HDF5 format files. Not suitable for this project unfortunately. Cheers, Chris -- http://mail.python.o

Re: Large Dictionaries

2006-05-16 Thread Chris Foote
Paul McGuire wrote: > "Claudio Grondi" <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] >> Chris Foote wrote: >>> Hi all. >>> >>> I have the need to store a large (10M) number of keys in a hash table, >>> based on a tuple of (long_integer, integer). The standard python >>> dictionary

Re: Large Dictionaries

2006-05-16 Thread Chris Foote
Claudio Grondi wrote: > Chris Foote wrote: >> p.s. Disk-based DBs are out of the question because most >> key lookups will result in a miss, and lookup time is >> critical for this application. >> > Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for > BerkeleyDB (\Python24\DLLs\_bsd

Re: Large Dictionaries

2006-05-16 Thread Duncan Booth
Chris Foote wrote: > > Don't forget that adding keys >> requires resizing the dict, which is a moderately expensive operation. > > Yep, that's why I probably need a dictionary where I can pre-specify > an approximate size at the time of its creation. Try splitting the creation of the keys from t

Re: Large Dictionaries

2006-05-15 Thread Maric Michaud
On Monday 15 May 2006 at 21:07, Diez B. Roggisch wrote: > > d={}.fromkeys(xrange(5*10**6)) ? > > That is a totally different beast. You don't want to insert arbitrary > keys, you want the internal hash-array to be of the right size. But you can replace the xrange part by any generator function you

Re: Large Dictionaries

2006-05-15 Thread Chris Foote
Aahz wrote: > In article <[EMAIL PROTECTED]>, > Roy Smith <[EMAIL PROTECTED]> wrote: >> In article <[EMAIL PROTECTED]>, Chris Foote <[EMAIL PROTECTED]> >> wrote: >>> I have the need to store a large (10M) number of keys in a hash table, >>> based on a tuple of (long_integer, integer). The standa

Re: Large Dictionaries

2006-05-15 Thread Chris Foote
Roy Smith wrote: > In article <[EMAIL PROTECTED]>, Chris Foote <[EMAIL PROTECTED]> > wrote: > >> I have the need to store a large (10M) number of keys in a hash table, >> based on a tuple of (long_integer, integer). The standard python >> dictionary works well for small numbers of keys, but star

Re: Large Dictionaries

2006-05-15 Thread lcaamano
Sounds like PyTables could be useful. http://www.pytables.org -- lpc -- http://mail.python.org/mailman/listinfo/python-list

Re: Large Dictionaries

2006-05-15 Thread John Machin
1. Why is two minutes to insert 5M keys "bad" for you? What would be "good"? What would be good/bad look-up times? Have you measured the typical look-up time? How often is the dict creation required to be done? How often does the data change? Is multi-user access required for (a) look-up (b) updat

Re: Large Dictionaries

2006-05-15 Thread bearophileHUGS
Roy Smith: > I am befuddled as to why > people thought creating a dict by keyword would be a useful thing to > do, but left out (and, indeed, eliminated the chance to add the syntax > later) the obviously useful ability to hint at the size. Adding the keyword syntax doesn't require much memory to

Re: Large Dictionaries

2006-05-15 Thread Diez B. Roggisch
Maric Michaud wrote: > On Monday 15 May 2006 at 19:24, Roy Smith wrote: >> d = {} >> d.reserve (10*1000*1000) > > d={}.fromkeys(xrange(5*10**6)) ? That is a totally different beast. You don't want to insert arbitrary keys, you want the internal hash-array to be of the right size. Diez -- http
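
A sketch of the distinction Diez draws: dict.fromkeys builds the table in one C-level pass (and at close to its final size), whereas a Python-level insert loop pays per-iteration interpreter overhead on top of the incremental resizes. The sizes below are arbitrary.

```python
import timeit

N = 100_000
keys = list(range(N))

def one_by_one():
    d = {}
    for k in keys:
        d[k] = None  # each store may trigger an incremental resize
    return d

def via_fromkeys():
    # Single C-level pass over the keys; no Python-level loop overhead.
    return dict.fromkeys(keys)

t_loop = min(timeit.repeat(one_by_one, number=1, repeat=5))
t_bulk = min(timeit.repeat(via_fromkeys, number=1, repeat=5))
```

Of course, as Diez notes, this only helps if you can enumerate the real keys up front; filling with placeholder keys does not presize the dict for different keys later.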

Re: Large Dictionaries

2006-05-15 Thread Paul McGuire
"Claudio Grondi" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Chris Foote wrote: > > Hi all. > > > > I have the need to store a large (10M) number of keys in a hash table, > > based on a tuple of (long_integer, integer). The standard python > > dictionary works well for small numb

Re: Large Dictionaries

2006-05-15 Thread Maric Michaud
On Monday 15 May 2006 at 19:24, Roy Smith wrote: > d = {} > d.reserve (10*1000*1000) d={}.fromkeys(xrange(5*10**6)) ? -- _ Maric Michaud _ Aristote - www.aristote.info 3 place des tapis 69004 Lyon Tel: +33 426 880 097 -- http://mail.python.org/mailman/listinfo/python-li

Re: Large Dictionaries

2006-05-15 Thread Roy Smith
Aahz <[EMAIL PROTECTED]> wrote: > Don't forget that adding keys requires resizing the dict, which is a > moderately expensive operation. My guess would be that each resize grows the table by a constant factor, which IIRC, works out to amortized O(n). It's not as good as creating the dict the righ
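
Roy's amortized-O(n) argument can be checked with a toy model: grow a table geometrically and count every entry moved during rehashes. The growth factor and load limit here are illustrative assumptions, not CPython's exact rule.

```python
def total_rehash_work(n, factor=4, max_load=2/3):
    """Simulate inserting n keys into a table that grows by `factor`
    whenever load exceeds `max_load`; return total entries moved."""
    capacity, used, moved = 8, 0, 0
    for _ in range(n):
        used += 1
        if used > capacity * max_load:
            moved += used  # a resize rehashes every live entry
            capacity *= factor
    return moved

n = 1_000_000
work = total_rehash_work(n)
# The per-resize moves form a geometric series, so total work is O(n):
# a constant number of moves per insert, amortized.
```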

Re: Large Dictionaries

2006-05-15 Thread Aahz
In article <[EMAIL PROTECTED]>, Roy Smith <[EMAIL PROTECTED]> wrote: >In article <[EMAIL PROTECTED]>, Chris Foote <[EMAIL PROTECTED]> >wrote: >> >> I have the need to store a large (10M) number of keys in a hash table, >> based on a tuple of (long_integer, integer). The standard python >> dicti

Re: Large Dictionaries

2006-05-15 Thread Claudio Grondi
Chris Foote wrote: > Hi all. > > I have the need to store a large (10M) number of keys in a hash table, > based on a tuple of (long_integer, integer). The standard python > dictionary works well for small numbers of keys, but starts to > perform badly for me inserting roughly 5M keys: > > # keys

Re: Large Dictionaries

2006-05-15 Thread Roy Smith
In article <[EMAIL PROTECTED]>, Chris Foote <[EMAIL PROTECTED]> wrote: > Hi all. > > I have the need to store a large (10M) number of keys in a hash table, > based on a tuple of (long_integer, integer). The standard python > dictionary works well for small numbers of keys, but starts to > perfo
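
One memory-saving variant for Chris's (long_integer, integer) keys (my suggestion, not something proposed in the thread) is to pack each pair into a single int, so no tuple object is stored per key. It assumes the second field is non-negative and fits in a known bit width, here 32 bits:

```python
# Assumption: the second element always fits in 32 unsigned bits.
SHIFT = 32
MASK = (1 << SHIFT) - 1

def pack(a, b):
    """Fold a (long_integer, integer) key into one int."""
    return (a << SHIFT) | (b & MASK)

def unpack(key):
    return key >> SHIFT, key & MASK

table = {}
for a, b in [(10**15, 7), (42, 0), (10**15, 8)]:
    table[pack(a, b)] = None
```

This trades a little packing arithmetic per lookup for one fewer heap object per key, which adds up at the 10M-key scale described above.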

Re: Large Dictionaries

2006-05-15 Thread Richie Hindle
[Chris] > Has anyone written a fast hash module which is more optimal for > large datasets ? PyJudy might be what you're looking for, though I've never used it: http://www.dalkescientific.com/Python/PyJudy.html "Judy's key benefits are scalability, high performance, and memory efficiency. A J