Re: Large Dictionaries

2006-06-12 Thread Klaas
Thomas Ganss wrote:
 Klaas schrieb:

  4. Insert your keys in sorted order.
 This advice is questionable -
 it depends at least on the db vendor and probably
 sometimes on the sort method, whether inserting pre-sorted
 values is better.

The article I wrote that you quoted named a specific vendor (berkeley
db) for which my advice is unquestionable and well-known.

-Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-09 Thread Marcin Ciura
Tim Peters wrote:
 shellsort is much more a refinement of insertion-sort than of
 bubblesort, but is not an O(N log N) algorithm regardless.

With a judiciously chosen increment sequence,
the number of comparisons made by shellsort
*is* of the order N log N for practical values of N.
See Figure 8 in
http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf
   Marcin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-09 Thread Duncan Booth
Marcin Ciura wrote:

 Tim Peters wrote:
 shellsort is much more a refinement of insertion-sort than of
 bubblesort, but is not an O(N log N) algorithm regardless.
 
 With a judiciously chosen increment sequence,
 the number of comparisons made by shellsort
 *is* of the order N log N for practical values of N.
 See Figure 8 in
 http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf

That isn't what the reference says. It only covers N up to a few thousand. 
Practical values of N need to at least go up into the millions.

Read the summary:
 Using sequential analysis, the search for optimal increment sequences
 for Shellsort was accelerated enough to find them for arrays up to
 several thousand elements.
...
 However, the sequences obtained so far are too short to draw a
 reliable conclusion whether an appropriate sequence of increments can
 make Shellsort a O(N logN) sorting method on the average. Some
 hypotheses may be possible when the sequences are prolonged to sort
 arrays of about 10^5 elements. 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-09 Thread Marcin Ciura
Duncan Booth wrote:
 Marcin Ciura wrote:
See Figure 8 in
http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf
 That isn't what the reference says. It only covers N up to a few thousand. 
 Practical values of N need to at least go up into the millions.

Please look at the performance graph of Tokuda's increment sequence.
You can see that it scales pretty well at least up to 100 million.
   Marcin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-09 Thread Duncan Booth
Marcin Ciura wrote:

 Duncan Booth wrote:
 Marcin Ciura wrote:
See Figure 8 in
http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf
 That isn't what the reference says. It only covers N up to a few
 thousand. Practical values of N need to at least go up into the
 millions. 
 
 Please look at the performance graph of Tokuda's increment sequence.
 You can see that it scales pretty well at least up to 100 million.

Ah sorry, I misread it. It was sequences with several thousand elements, 
which corresponds to N of 100 million.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-08 Thread Lawrence D'Oliveiro
In article [EMAIL PROTECTED],
 Tim Peters [EMAIL PROTECTED] wrote:

For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)?  O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.

Shellsort works well with nearly-sorted data. It's basically a smarter 
version of bubblesort with much improved efficiency. It's also very 
compact to implement.
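
For illustration, a compact shellsort sketch (not from the original post),
using the Ciura gap sequence from the paper cited elsewhere in this thread;
larger inputs would want an extended sequence:

def shellsort(a, gaps=(701, 301, 132, 57, 23, 10, 4, 1)):
    # gapped insertion sort: for each gap, insertion-sort the
    # interleaved subsequences a[0::gap], a[1::gap], ...
    for gap in gaps:
        for i in range(gap, len(a)):
            item = a[i]
            j = i
            while j >= gap and a[j - gap] > item:
                a[j] = a[j - gap]
                j -= gap
            a[j] = item
    return a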
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-08 Thread Lawrence D'Oliveiro
In article [EMAIL PROTECTED],
 Iain King [EMAIL PROTECTED] wrote:

Lawrence D'Oliveiro wrote:
 In article [EMAIL PROTECTED],
  Scott David Daniels [EMAIL PROTECTED] wrote:

 For example, time timsort (Python's internal sort) on pre-sorted
 data; you'll find it is handled faster than random data.

 But isn't that how a reasonable sorting algorithm should behave? Less
 work to do if the data is already sorted?

An already sorted list can be pathological for Quicksort, depending on
how you code it.

I did say "reasonable". :)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-08 Thread Tim Peters
[Tim Peters]
 ...
 O(N log N) sorting algorithms helped by pre-existing order are
 uncommon, unless they do extra work to detect and exploit
 pre-existing order.

[Lawrence D'Oliveiro]
 Shellsort works well with nearly-sorted data. It's basically a smarter
 version of bubblesort with much improved efficiency. It's also very
 compact to implement.

shellsort is much more a refinement of insertion-sort than of
bubblesort, but is not an O(N log N) algorithm regardless.  A nice,
succinct summary of what was known about shellsort's behavior as of
Sedgewick's 1996 "Analysis of Shellsort and Related Algorithms" can be
found here:

http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/shell/shellen.htm
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Lawrence D'Oliveiro
In article [EMAIL PROTECTED],
 Scott David Daniels [EMAIL PROTECTED] wrote:

For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

But isn't that how a reasonable sorting algorithm should behave? Less 
work to do if the data is already sorted?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Steve Holden
Lawrence D'Oliveiro wrote:
 In article [EMAIL PROTECTED],
  Scott David Daniels [EMAIL PROTECTED] wrote:
 
 
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.
 
 
 But isn't that how a reasonable sorting algorithm should behave? Less 
 work to do if the data is already sorted?

Isn't that just your definition of reasonable?

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd  http://www.holdenweb.com
Love me, love my blog  http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Aahz
In article [EMAIL PROTECTED],
Lawrence D'Oliveiro  [EMAIL PROTECTED] wrote:
In article [EMAIL PROTECTED],
 Scott David Daniels [EMAIL PROTECTED] wrote:

For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

But isn't that how a reasonable sorting algorithm should behave? Less 
work to do if the data is already sorted?

Read some of the old discussions in the python-dev archives.
-- 
Aahz ([EMAIL PROTECTED])   * http://www.pythoncraft.com/

I saw `cout' being shifted "Hello world" times to the left and stopped
right there.  --Steve Gonedes
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Iain King

Lawrence D'Oliveiro wrote:
 In article [EMAIL PROTECTED],
  Scott David Daniels [EMAIL PROTECTED] wrote:

 For example, time timsort (Python's internal sort) on pre-sorted
 data; you'll find it is handled faster than random data.

 But isn't that how a reasonable sorting algorithm should behave? Less
 work to do if the data is already sorted?

An already sorted list can be pathological for Quicksort, depending on
how you code it.

Iain

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Tim Peters
[Scott David Daniels]
 For example, time timsort (Python's internal sort) on pre-sorted
 data; you'll find it is handled faster than random data.

O(N) vs O(N log N), in fact.

[Lawrence D'Oliveiro]
 But isn't that how a reasonable sorting algorithm should behave? Less
 work to do if the data is already sorted?

For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)?  O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.

Python's current sorting algorithm exploits pre-existing order of many
kinds.  For example, take a sorted list, cut it in half, and riffle
shuffle the two halves together; e.g.,

[0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

That turns out to be a supernaturally good case too.
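
A quick way to see the effect (a rough timing sketch, not from the original
post; the riffle helper just builds the interleaved list described above):

import random, time

def riffle(n):
    # interleave the two sorted halves of range(n), as in the example above
    half = n // 2
    out = []
    for x, y in zip(range(half), range(half, n)):
        out.extend([x, y])
    return out

def time_sort(label, data):
    data = list(data)
    start = time.time()
    data.sort()
    print "%-8s %.3fs" % (label, time.time() - start)

n = 1000000
time_sort("sorted", range(n))
time_sort("riffled", riffle(n))
shuffled = range(n)
random.shuffle(shuffled)
time_sort("random", shuffled)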
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Jim Segrave
In article [EMAIL PROTECTED],
Tim Peters [EMAIL PROTECTED] wrote:
[Scott David Daniels]
 For example, time timsort (Python's internal sort) on pre-sorted
 data; you'll find it is handled faster than random data.

O(N) vs O(N log N), in fact.

[Lawrence D'Oliveiro]
 But isn't that how a reasonable sorting algorithm should behave? Less
 work to do if the data is already sorted?

For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)?  O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.

Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof, whereas some other
sorts, quicksort being a prime example, require great care to
avoid pathological cases.



-- 
Jim Segrave   ([EMAIL PROTECTED])

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Tim Peters
[Jim Segrave]
 Actually, presorted lists are not a bad case for heapsort - it's quite
 immune to any existing order or lack thereof,

Write a heapsort and time it.  It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a max-heap at the start requires shuffling
the largest elements from the end of the list to its start.  Initially
reverse-sorted is a "good case" for heapsort in the same sense
(because the list is already a max-heap then; initially sorted is as
far from being a max-heap as is possible).  "good" and "bad" are both
O(N log N) here, though.
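
A tiny check of that claim (a sketch, not from the original post):

def is_max_heap(a):
    # max-heap invariant: every element is >= both of its children
    for i in xrange(len(a)):
        for c in (2 * i + 1, 2 * i + 2):
            if c < len(a) and a[i] < a[c]:
                return False
    return True

print is_max_heap(range(15, -1, -1))   # reverse-sorted list: True
print is_max_heap(range(16))           # sorted list: False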

BTW, most people have never heard of Smoothsort, which was Dijkstra's
excruciating attempt to make heapsort exploit pre-existing order:

http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796.PDF

That's one of a hundred algorithms I rejected for Python ;-)

 whereas some other sorts, quicksort being a prime example, require
 great care to avoid pathological cases.

Indeed, Python gave up on quicksort for that reason.  While the
samplesort hybrid that came between quicksort and the current sort
also had theoretical O(N**2) worst-case behavior, it was exceedingly
difficult to create such an input, and (unlike as for every quicksort
variant Python tried) no real-life case of undue sloth was ever
reported against it.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Jim Segrave
In article [EMAIL PROTECTED],
Tim Peters [EMAIL PROTECTED] wrote:
[Jim Segrave]
 Actually, presorted lists are not a bad case for heapsort - it's quite
 immune to any existing order or lack thereof,

Write a heapsort and time it.  It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a max-heap at the start requires shuffling
the largest elements from the end of the list to its start.  Initially
reverse-sorted is a good case for heapsort in the same sense
(because the list is already a max-heap then; initially sorted is as
far from being a max-heap as is possible).  good and bad are both
O(N log N) here, though.

I did implement a crude heapsort before posting this and measured the
number of comparisions and the number of swaps.

==

#!/usr/local/bin/python

import random

class HeapSorter(object):
    def __init__(self, seq):
        self.compares = 0
        self.swaps = 0
        self.length = len(seq)
        self.seq = seq

    def __sift(self, i):
        while True:
            j = 2 * i + 1
            if j + 1 < self.length:
                self.compares += 1
                if self.seq[j + 1] > self.seq[j]:
                    j = j + 1
            if j >= self.length:
                return

            self.compares += 1
            if self.seq[i] >= self.seq[j]:
                return

            self.swaps += 1
            self.seq[i], self.seq[j] = (self.seq[j], self.seq[i])
            i = j

    def __makeheap(self):
        for i in range(self.length / 2, -1, -1):
            self.__sift(i)

    def sort(self):
        self.__makeheap()
        for i in range(self.length - 1, -1, -1):
            self.swaps += 1
            self.seq[i], self.seq[0] = (self.seq[0], self.seq[i])
            self.length -= 1
            self.__sift(0)

if __name__ == '__main__':

    s = list(range(100000))
    random.shuffle(s)

    heap = HeapSorter(s)
    heap.sort()
    print "Random list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)
    heap = HeapSorter(s)
    heap.sort()
    print "Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)

    s.reverse()
    heap = HeapSorter(s)
    heap.sort()
    print "Reverse Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)

==

Which gave the following results:

/usr/home/jes% python ./heapsort
Random list: compares 3019969, swaps 1575263
Sorted list: compares 3112517, swaps 1650855
Reverse Sorted list: compares 2926640, swaps 1497435

Assuming compares and swaps dominate other bookkeeping, then heapsort
is quite stable in its performance, although the constant in the O(N log N)
is much larger than for other sorts. 



-- 
Jim Segrave   ([EMAIL PROTECTED])

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-30 Thread Martin v. Löwis
Roy Smith wrote:
 My guess would be that each resize grows the table by a constant
 factor, which IIRC, works out to amortized O(n). 

Right. For <50000 entries, the size is multiplied by 4 each time,
and doubled each time for large dictionaries.

 It's not as good as
 creating the dict the right size in the first place, but it's really
 not that bad.  Somewhat more troubling is that it can lead to memory
 fragmentation problems.

Depends on your operating system, of course. You get over a page size
fairly quickly, and then many systems use anonymous mappings, where
a range of pages gets mapped from swap on allocation, and unmapped
on deallocation. So while this can cause address space fragmentation,
it doesn't cause memory fragmentation (i.e. memory that is allocated
by the process but can't be used).

 I don't understand why Python doesn't have a way to give a size hint
 when creating a dict.  It seems so obvious to be able to say d = dict
 (size=10*1000*1000) if you know beforehand that's how many keys
 you're going to add.

There is no need for it. If dicts don't work well in some cases, these
cases should be investigated and fixed, instead of working around the
problem by loading the problem onto the developer.

As you said, it *should* have linear time, with the number of
insertions. If it doesn't (and it appears not to), that may be due to
hash collisions, or due to the time for computing hashes not being
constant.
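
One way to check that (a rough sketch, not from the original post) is to
time dict population at a few sizes and see whether the cost per key stays
roughly constant:

import time

def time_dict(n):
    # populate a dict with n (long, int) tuple keys and time it
    scale = 2 ** 32
    start = time.time()
    d = {}
    for i in xrange(n):
        d[(i * scale, i)] = None
    return time.time() - start

for n in (10 ** 5, 10 ** 6, 2 * 10 ** 6):
    t = time_dict(n)
    print "%8d keys: %6.2fs (%.2f us/key)" % (n, t, t * 1e6 / n)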

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-29 Thread Thomas Ganss
Klaas schrieb:

 4. Insert your keys in sorted order.
This advice is questionable -
it depends at least on the db vendor and probably
sometimes on the sort method, whether inserting pre-sorted
values is better.

My gut feeling on this matter is:
IF the insert times of pre-sorted values are far better
than the times of unsorted values, there is a chance
that the resulting tree is unbalanced: only 1 compare
operation after insert and no re-balancing of the tree.

re-indexing will probably give you far better access times
in this case. Another option is to drop non-RI indexes used only
for query optimization and recreate them after the batch insert.

my 0.02 EUR

thomas
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-29 Thread Scott David Daniels
Thomas Ganss wrote:
 Klaas schrieb:
 4. Insert your keys in sorted order.
 This advice is questionable -
 
 My gut feeling on this matter is:
 IF the insert times of pre-sorted values is far better
 than the times of unsorted values, there is a chance
 that the resulting tree is unbalanced: only 1 compare
 operation after insert and no re-balancing of the tree.
 
 re-indexing will probably give you far better access times
 in this case. Another option is to drop non RI indexes used only
 for query optimization and recreate them after batch insert.

Don't use your gut for such issues.  Pre-sorted data is such
a common special case (in fact, the only easily describable
special case) that many systems handle this specially.  For
example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

If you are using your gut without testing, go ahead and
presort.  In any case, reading documents and testing beats
gut feels every time.

--Scott David Daniels
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-24 Thread Klaas
Chris:
 Berkeley DB is great for accessing data by key for things already
 stored on disk (i.e. read access), but write performance for each
 key-value pair is slow due to it being careful about flushing
 writes to disk by default. 

This is absolutely false.

-Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-24 Thread Klaas
Chris:
 class StorageBerkeleyDB(StorageTest):
     def runtest(self, number_hash):
         db = bsddb.hashopen(None, flag='c', cachesize=8192)
         for (num, wildcard_digits) in number_hash.keys():
             key = '%d:%d' % (num, wildcard_digits)
             db[key] = None
         db.close()

BDBs can accomplish what you're looking to do, but they need to be
tuned carefully.  I won't get into too many details here, but you have
a few fatal flaws in that code.

1. 8KB of cache is _pathetic_.  Give it a few hundred megs.  This is by
far your biggest problem.
2. Use BTREE unless you have a good reason to use DBHASH
3. Use proper bdb env creation instead of the hash_open apis.
4. Insert your keys in sorted order.
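
A rough sketch of points 1-4 using the bsddb.db (pybsddb) API - untested
here, with placeholder paths and sizes, and assuming the number_hash dict
from the test script earlier in this thread:

import os
from bsddb import db

env_dir = "/tmp/bdb_env"                    # placeholder environment dir
if not os.path.isdir(env_dir):
    os.makedirs(env_dir)

env = db.DBEnv()
env.set_cachesize(0, 256 * 1024 * 1024)     # a few hundred megs, not 8KB
env.open(env_dir, db.DB_CREATE | db.DB_INIT_MPOOL)

d = db.DB(env)                              # BTREE, via a proper env
d.open("numbers.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

keys = ['%d:%d' % (num, wd) for (num, wd) in number_hash.keys()]
keys.sort()                                 # insert in bytestring-sorted order
for key in keys:
    d[key] = ''

d.close()
env.close()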

-Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-18 Thread Chris Foote
Claudio Grondi wrote:
 Chris Foote wrote:
 Klaas wrote:

 22.2s  20m25s[3]

 20m to insert 1m keys?  You are doing something wrong.

 I've put together some simplified test code, but the bsddb
 module gives 11m for 1M keys:

 I have run your code for the bsddb on my P4 2.8 GHz and have got:
 Number generator test for 1000000 number ranges
 with a maximum of 3 wildcard digits.
 Wed May 17 16:34:06 2006 dictionary population started
 Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
 Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
 Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 
 104.3s
 
 Surprising here, that the dictionary population gives the same time, but 
 the BerkeleyDB inserts the records 6 times faster on my computer than on 
 yours. I am running Python 2.4.2 on Windows XP SP2, and you?

Fedora core 5 with ext3 filesystem.  The difference will be due to
the way that Windows buffers writes for the filesystem you're using
(it sounds like you're using a FAT-based file system).

 Number generator test for 1000000 number ranges
 with a maximum of 3 wildcard digits.
 Wed May 17 22:18:17 2006 dictionary population started
 Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
 Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
 Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, 
 duration 665.6s
 Wed May 17 22:29:33 2006 StorageSQLite population started
 Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s
 As I don't have SQLite installed, it is interesting to see if the factor 
 10 in the speed difference between BerkeleyDB and SQLite can be 
 confirmed by someone else.
 Why is SQLite faster here? I suppose, that SQLite first adds all the 
 records and builds the index afterwards with all the records there (with 
 db.commit()).

SQLite is way faster because BerkeleyDB always uses a disk file,
and SQLite is in RAM only.

Cheers,
Chris
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-18 Thread Claudio Grondi
Chris Foote wrote:
 Claudio Grondi wrote:
 Chris Foote wrote:
 Klaas wrote:

 22.2s  20m25s[3]

 20m to insert 1m keys?  You are doing something wrong.

 I've put together some simplified test code, but the bsddb
 module gives 11m for 1M keys:

 I have run your code for the bsddb on my P4 2.8 GHz and have got:
 Number generator test for 1000000 number ranges
 with a maximum of 3 wildcard digits.
 Wed May 17 16:34:06 2006 dictionary population started
 Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
 Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
 Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, 
 duration 104.3s
  
 Surprising here, that the dictionary population gives the same time, 
 but the BerkeleyDB inserts the records 6 times faster on my computer 
 than on yours. I am running Python 2.4.2 on Windows XP SP2, and you?
 
 Fedora core 5 with ext3 filesystem.  The difference will be due to
 the way that Windows buffers writes for the filesystem you're using
 (it sounds like you're using a FAT-based file system).
Ok, according to the Windows task manager the Python process
reads/writes around 7 GByte(!) of data to the file system during the
BerkeleyDB test run, and the hard drive is continuously busy, while the
size of the file I found in the Temp directory always stays below 20
MByte. The hard drive access is probably the main reason for losing
time - here a question to the BerkeleyDB experts:

Can BerkeleyDB, via the Python bsddb3 interface, be tuned to use only RAM,
or, since BerkeleyDB can scale to much larger amounts of data, does it
make little sense to tweak it into RAM?

Chris, would a RAM disk maybe be the right way to go here, to save the
time lost accessing the file stored in the file system on the hard drive?

The RAM requirements, according to the Windows XP task manager, are below
100 MByte. I am using the NTFS file system (yes, I know that FAT is faster
than NTFS in some configurations) and XP Professional SP2 without any
tuning of file system caching. The CPU is 100% busy.

What CPU and RAM (SIMM, DDR, DDR2) do you have?  I have 2 GByte of fast
DDR PC400/3200 dual line RAM. It seems that you are still not getting
results within the range others experience running your code, so I
suppose it has something to do with the hardware you are using.

 
 Number generator test for 1000000 number ranges
 with a maximum of 3 wildcard digits.
 Wed May 17 22:18:17 2006 dictionary population started
 Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
 Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
 Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, 
 duration 665.6s
 Wed May 17 22:29:33 2006 StorageSQLite population started
 Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 
 65.5s
 As I don't have SQLite installed, it is interesting to see if the 
 factor 10 in the speed difference between BerkeleyDB and SQLite can be 
 confirmed by someone else.
 Why is SQLite faster here? I suppose, that SQLite first adds all the 
 records and builds the index afterwards with all the records there 
 (with db.commit()).
 
 SQLite is way faster because BerkeleyDB always uses a disk file,
 and SQLite is in RAM only.
One of the reasons I keep an eye on BerkeleyDB is that it claims to
scale to a huge amount (Terabytes) of data, doesn't need as much RAM as a
Python dictionary, and doesn't require saving/loading a pickled version
of the data (i.e. here the dictionary) to/from RAM in order to work with
it.
I guess that in your case BerkeleyDB is, for the reasons named above,
probably the right way to go, unless your data will stay small and the
Python dictionary holding it will always fit into RAM.

Now I am curious to know which path you have decided to go and why?

Claudio
 
 Cheers,
 Chris

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-18 Thread Claudio Grondi
Chris Foote wrote:
 Richie Hindle wrote:
 [Chris]
 Has anyone written a fast hash module which is more optimal for
 large datasets ?

 PyJudy might be what you're looking for, though I've never used it:

   http://www.dalkescientific.com/Python/PyJudy.html

 Judy's key benefits are scalability, high performance, and memory
 efficiency. A Judy array is extensible and can scale up to a very large
 number of elements, bounded only by machine memory. ... PyJudy arrays
 are similar to Python dictionaries and sets.
 
 Thanks for the suggestion Richie.  PyJudy works brilliantly :-)
 
 Cheers,
 Chris
It seems that there is no Microsoft Windows version of Judy available,
so for this reason PyJudy won't work on Windows - am I right?

Claudio
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-17 Thread Chris Foote

Klaas wrote:

22.2s  20m25s[3]


20m to insert 1m keys?  You are doing something wrong.


Hi Mike.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

test code is attached.


With bdb's it is crucial to insert keys in bytestring-sorted order.


For the bsddb test, I'm using a plain string.  (The module docs list a
string as being the only datatype supported for both keys & values.)


Also, be sure to give it a decent amount of cache.


The bsddb.hashopen() factory seems to have a bug in this regard; if you
supply a cachesize argument, then it barfs:

...
  File "bsddb-test.py", line 67, in runtest
    db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
    if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- DB->set_cachesize:
method not permitted when environment specified')



I'll file a bug report on this if it isn't already fixed.

Cheers,
Chris
import bsddb
import random
import time
import sqlite

import psyco
psyco.full()

class WallClockTimer(object):
    '''Simple timer for measuring tests.'''
    def __init__(self, msg):
        self.start = time.time()
        self.msg = msg
        print time.ctime(self.start), self.msg, 'started'

    def stop(self):
        end = time.time()
        duration = end - self.start
        print time.ctime(end), self.msg, 'stopped, duration %.1fs' % duration

class NumberGen(object):
    '''Phone number generator for testing different methods of storage.'''
    def __init__(self, range_start, range_end, n, max_wildcards):

        print '''Number generator test for %d number ranges
with a maximum of %d wildcard digits.''' % (n, max_wildcards)

        wildcards = range(0, max_wildcards + 1)
        # generate n unique numbers and store as keys in number_hash
        timer = WallClockTimer('dictionary population')
        self.number_hash = {}

        for i in xrange(0, n):
            unique = False
            while not unique:
                wildcard_digits = self.gen_wildcard_digits(wildcards)
                num = self.gen_number(range_start, range_end)
                if wildcard_digits > 0:
                    num /= (10 ** wildcard_digits)
                key = (num, wildcard_digits)
                if self.number_hash.has_key(key):
                    unique = False
                else:
                    unique = True
                    self.number_hash[key] = None
        timer.stop()

    def gen_wildcard_digits(self, wildcards):
        return random.choice(wildcards)

    def gen_number(self, start, end):
        return int(random.uniform(start, end))

    def storage_test(self, StorageTestClass):
        test = StorageTestClass(self.number_hash)

class StorageTest(object):
    '''base class for testing storage.  Derive a test
    class and provide your own runtest() method.'''
    def __init__(self, number_hash):
        timer = WallClockTimer('%s population' % type(self).__name__)
        self.runtest(number_hash)
        timer.stop()

class StorageBerkeleyDB(StorageTest):
    def runtest(self, number_hash):
        db = bsddb.hashopen(None, flag='c', cachesize=8192)
        for (num, wildcard_digits) in number_hash.keys():
            key = '%d:%d' % (num, wildcard_digits)
            db[key] = None
        db.close()

class StorageSQLite(StorageTest):
    def runtest(self, number_hash):
        db = sqlite.connect(':memory:')
        cursor = db.cursor()
        cursor.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')
        q = "insert into numbers (num, wd) values (%d, %d)"
        for (num, wildcard_digits) in number_hash.keys():
            cursor.execute(q, num, wildcard_digits)
        db.commit()
        db.close()


if __name__ == '__main__':

    test_vals = {'range_start'   : 61880 * 10 ** 7,
                 'range_end'     : 61891 * 10 ** 7,
                 'n'             : 1 * 10 ** 4,
                 'max_wildcards' : 3}

    ngen = NumberGen(test_vals['range_start'], test_vals['range_end'],
                     test_vals['n'], test_vals['max_wildcards'])

    ngen.storage_test(StorageBerkeleyDB)
    ngen.storage_test(StorageSQLite)
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Large Dictionaries

2006-05-17 Thread Claudio Grondi
Chris Foote wrote:
 Klaas wrote:
 
 22.2s  20m25s[3]


 20m to insert 1m keys?  You are doing something wrong.
 
 
 Hi Mike.
 
 I've put together some simplified test code, but the bsddb
 module gives 11m for 1M keys:
 
I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
 with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 
104.3s
It is surprising that the dictionary population gives the same time, but
BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2, and you?

 Number generator test for 1000000 number ranges
 with a maximum of 3 wildcard digits.
 Wed May 17 22:18:17 2006 dictionary population started
 Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
 Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
 Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 
 665.6s
 Wed May 17 22:29:33 2006 StorageSQLite population started
 Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s
As I don't have SQLite installed, it is interesting to see if the factor
of 10 in the speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.
Why is SQLite faster here? I suppose that SQLite first adds all the
records and only builds the index afterwards, with all the records there
(with db.commit()).
Can the same be done in BerkeleyDB, or does BerkeleyDB not support
inserting records without updating the index on each single insert
command? If it does, how?

Claudio
 
 test code is attached.
 
 With bdb's it is crucial to insert keys in bytestring-sorted order.
 
 
 For the bsddb test, I'm using a plain string.  (The module docs list a
 string as being the only datatype supported for both keys & values.)
 
 Also, be sure to give it a decent amount of cache.
 
 
 The bsddb.hashopen() factory seems to have a bug in this regard; if you
 supply a cachesize argument, then it barfs:
 
 
   File bsddb-test.py, line 67, in runtest
 db = bsddb.hashopen(None, flag='c', cachesize=8192)
   File /usr/lib/python2.4/bsddb/__init__.py, line 288, in hashopen
 if cachesize is not None: d.set_cachesize(0, cachesize)
 bsddb._db.DBInvalidArgError: (22, 'Invalid argument --
 DB->set_cachesize: method not permitted when environment specified')
 
 
 I'll file a bug report on this if it isn't already fixed.
 
 Cheers,
 Chris
 
 
 
 

Re: Large Dictionaries

2006-05-17 Thread Chris Foote
Richie Hindle wrote:
 [Chris]
 Has anyone written a fast hash module which is more optimal for
 large datasets ?
 
 PyJudy might be what you're looking for, though I've never used it:
 
   http://www.dalkescientific.com/Python/PyJudy.html
 
 Judy's key benefits are scalability, high performance, and memory
 efficiency. A Judy array is extensible and can scale up to a very large
 number of elements, bounded only by machine memory. ... PyJudy arrays
 are similar to Python dictionaries and sets.

Thanks for the suggestion Richie.  PyJudy works brilliantly :-)

Cheers,
Chris
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Maric Michaud
On Monday 15 May 2006 at 21:07, Diez B. Roggisch wrote:
  d={}.fromkeys(xrange(5*10**6)) ?

 That is a totally different beast. You don't want to insert arbitrary
 keys, you want the internal hash-array to be of the right size.

But you can replace the xrange part by any generator function you want.

def get_mykeys(...):
    ...
    yield key

I just wonder if the time-consuming part is the memory allocation of the
hash table (key by key) or the hash operation itself.

I don't see a use case where a python programmer should need a
dictionary that will probably be big but can't predict what keys will be
in it.

-- 
_

Maric Michaud
_

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Duncan Booth
Chris Foote wrote:

  Don't forget that adding keys
 requires resizing the dict, which is a moderately expensive operation.
 
 Yep, that's why I probably need a dictionary where I can pre-specify
 an approximate size at the time of its creation.

Try splitting the creation of the keys from the creation of the dictionary 
and you'll see where the bottleneck really is.

On my system:

>>> r = range(5000000)
>>> d = dict.fromkeys(r)

takes less than 1 second to create the dictionary. So creating a dictionary 
with 5 million elements is not in itself a terribly expensive operation 
(and yes, even with fromkeys there is no attempt to work out in advance 
what size the dictionary should be).

Trying to simulate your data:

>>> scale = 2**32
>>> data = [ (i*scale,i) for i in range(5000000) ]
>>> d = dict.fromkeys(data)

the assignment to data took 42 seconds. Creating the dictionary took 6 
seconds.

Trying the variation someone else suggested:

>>> scale = 2**32
>>> data = [ i*scale+i for i in range(5000000) ]
>>> d = dict.fromkeys(data)

creating data took under 6 seconds and about 1 second for d.

so it looks like the issue is not creating a large dictionary, nor lots of 
integers, but creating lots of tuples is expensive.

Oh, and if anyone tells you that the intermediate list I created in the 
tests above is going to make it slow and you should use iterators instead:

>>> r = range(5000000)
>>> scale = 2**32
>>> s = [ i*scale for i in r ]
>>> from itertools import izip
>>> d = dict.fromkeys(izip(s,r))

The first few lines took a couple of seconds, the last one 1 minute 50 
seconds. The alternative version:

>>> scale = 2**32
>>> r = range(5000000)
>>> d = dict.fromkeys((i*scale,i) for i in r)

takes a similar time.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Chris Foote
Claudio Grondi wrote:
 Chris Foote wrote:
 p.s. Disk-based DBs are out of the question because most
 key lookups will result in a miss, and lookup time is
 critical for this application.

 Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for 
 BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the 
 standard Python 2.4 distribution.

However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created by
 passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk may be
 created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), for which 
performance for my purposes is unsatisfactory:

# keys   dictionary  metakit  bsddb  (all using psyco)
------   ----------  -------  -----
1M             8.8s    22.2s  20m25s[3]
2M            24.0s    43.7s  N/A
5M           115.3s   105.4s  N/A

Cheers,
Chris

[1] bsddb docs:
 http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
 http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time.  Storing the (long_integer, integer) key in string
form "long_integer:integer" since bsddb doesn't support keys that aren't
integers or strings.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Chris Foote
Paul McGuire wrote:
 Claudio Grondi [EMAIL PROTECTED] wrote in message
 news:[EMAIL PROTECTED]
 Chris Foote wrote:
 Hi all.

 I have the need to store a large (10M) number of keys in a hash table,
 based on a tuple of (long_integer, integer).  The standard python
 dictionary works well for small numbers of keys, but starts to
 perform badly for me inserting roughly 5M keys:

 # keys   dictionary  metakit   (both using psyco)
 --   --  ---
 1M8.8s 22.2s
 2M   24.0s 43.7s
 5M  115.3s105.4s

 Has anyone written a fast hash module which is more optimal for
 large datasets ?

 p.s. Disk-based DBs are out of the question because most
 key lookups will result in a miss, and lookup time is
 critical for this application.

 Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
 BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
 standard Python 2.4 distribution.

 Berkeley DB was  20 times faster  than other databases.  It has the
 operational speed of  a main memory database, the startup and  shut down
 speed of a  disk-resident database, and does not have the  overhead  of
 a client-server inter-process communication.
 Ray  Van Tassle,  Senior  Staff Engineer, Motorola

 Please let me/us know if it is what you are looking for.

 sqlite also supports an in-memory database - use pysqlite
 (http://initd.org/tracker/pysqlite/wiki) to access this from Python.

Hi Paul.

I tried that, but the overhead of parsing SQL queries is too high:

             dictionary   metakit   sqlite[1]
             ----------   -------   ---------
1M numbers         8.8s     22.2s       89.6s
2M numbers        24.0s     43.7s      190.0s
5M numbers       115.3s    105.4s         N/A

Thanks for the suggestion, but no go.

Cheers,
Chris

[1] pysqlite V1  sqlite V3.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Chris Foote
lcaamano wrote:
 Sounds like PyTables could be useful.
 
 http://www.pytables.org

In browsing their excellent documentation, it seems that all concepts 
are built around storing and reading HDF5 format files.

Not suitable for this project unfortunately.

Cheers,
Chris

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread John Machin
 (my dictionary values are all None).

So in fact all you need is a set. Have you experimented with the Python
2.5 alpha?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Vladimir 'Yu' Stepanov
Chris Foote wrote:
 lcaamano wrote:
   
 Sounds like PyTables could be useful.

 http://www.pytables.org
 

 In browsing their excellent documentation, it seems that all concepts 
 are built around storing and reading HDF5 format files.

 Not suitable for this project unfortunately.

 Cheers,
 Chris

   
How many distinct long and int values occur in the pairs?  It may be
possible to use a dictionary of dictionaries:

d = {}
for long_val, int_val in list_vals:
    d.setdefault(long_val, {})[int_val] = None
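
Lookups against that nested layout then become a two-step check, e.g.
(a sketch, not from the original post):

def contains(d, long_val, int_val):
    # membership test for the dictionary-of-dictionaries layout above
    inner = d.get(long_val)
    return inner is not None and int_val in inner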


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Duncan Booth
John Machin wrote:

 (my dictionary values are all None).
 
 So in fact all you need is a set. Have you experimented with the Python
 2.5 alpha?
 
I don't think that will help him: my timings work out about the same on 
2.4.2 or 2.5a2. As I pointed out, the bottleneck is creating the tuples. 
Creating either a dictionary or a set from 5 million tuples takes a similar 
amount of time (about 6 seconds on my system).

BTW, a string key is quite a bit faster than the tuples, even though you 
end up with as many temporary tuples during the process:

>>> scale = 2**32
>>> data = [ "%d:%d" % (i*scale,i) for i in range(5000000) ]
>>> d = set(data)

takes about 18 seconds to create the data list and 5 for the set.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Claudio Grondi
Chris Foote wrote:
 Claudio Grondi wrote:
 
 Chris Foote wrote:

 p.s. Disk-based DBs are out of the question because most
 key lookups will result in a miss, and lookup time is
 critical for this application.

 Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for 
 BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in 
 the standard Python 2.4 distribution.
 
 
 However, please note that the Python bsddb module doesn't support
 in-memory based databases - note the library documentation's[1] wording:
 
 Files never intended to be preserved on disk may be created by  
 passing None as the filename.
 
 which closely mirrors the Sleepycat documentation[2]:
 
 In-memory databases never intended to be preserved on disk 
 may be created by setting the file parameter to NULL.
 
 It does actually use a temporary file (in /var/tmp), for which 
 performance for my purposes is unsatisfactory:
 
 # keys   dictionary  metakit  bsddb  (all using psyco)
 --   --  ---  -
 1M8.8s 22.2s  20m25s[3]
 2M   24.0s 43.7s  N/A
 5M  115.3s105.4s  N/A
 
 Cheers,
 Chris
 
 [1] bsddb docs:
 http://www.python.org/doc/current/lib/module-bsddb.html
 
 [2] Sleepycat BerkeleyDB C API:
 http://www.sleepycat.com/docs/api_c/db_open.html
 
 [3] Wall clock time.  Storing the (long_integer, integer) key in string 
 form long_integer:integer since bsddb doesn't support keys that aren't 
 integers or strings.
I have to admit that I haven't written any code of my own to actually
test this, but if 20m25s for storing a single MByte of strings in a
database table index column is really what you are getting, I can't get
rid of the feeling that something elementary is wrong with the way you
are doing it.

Posting the code for your test cases appears to me to be the only way
to see the reason for the mystery you are getting here (this will also
clarify the other mysterious things considered by the posters to this
thread so far).

Claudio
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread [EMAIL PROTECTED]
BTW why are python dicts implemented as hash tables and not judy arrays?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Duncan Booth
[EMAIL PROTECTED] wrote:

 BTW why are python dicts implemented as hash tables and not judy
 arrays? 
 
Probably because:

Python predates judy arrays.
Nobody has proposed replacing the current dict implementation.
Nobody has demonstrated that it would actually be an advantage to
replace the current implementation. 
Or maybe Python dicts are just more flexible?

You choose.

A quick scan of the Judy IV Shop Manual reveals:

 Judy is a dessert topping and a floor wax, but it’s not intended for
 use as a house paint. 
...
 Judy is a programming library that provides a relatively simple
 interface (API) for array-like storage of word or string indexes with
 optional mapping of indexes to single-word values.

It doesn't sound to me as though they match the flexibility of Python
dictionaries. Without reading further through the documentation, the
implication is that keys are restricted to simple word or string values.
Remember, all that Python requires of a dict key is that it has a hash
value and can be compared for equal/not equal with other keys. It
doesn't require the key to be expressible as a sequence of bits which is
what the Judy manual seems to imply: 

 You can think of digital trees as peeling (decoding) leading bits off
 a key until only one bit is left, but in the case of an unbounded
 variable-size key there is no definite “bottom” (that is, a definite
 last bit or maximum length for every key). However, there are always
 unpopulated subexpanses, except with a fixed-size key where every
 possible key value is stored in the data structure. When decoding keys
 top-down in this way, each (sub)expanse is defined by the bits already
 decoded and the number of bits remaining (if finite). 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Duncan Booth
Duncan Booth wrote:

 BTW why are python dicts implemented as hash tables and not judy
 arrays? 
 
 Probably because:
 
 Python predates judy arrays.
 Nobody has proposed replacing the current dict implementation.
 Nobody has demonstrated that it would actually be an advantage to
 replace the current implementation. 
 Or maybe Python dicts are just more flexible?

Also, see http://www.dalkescientific.com/Python/PyJudy.html with the
conclusions drawn on performance:

 The first conclusion is that Python dictionaries are wicked fast.
 There are few cases where PyJudy is faster, though perhaps there might
 be a few more if I knew more about the Python extension API. Bear in
 mind though that the Judy arrays are in sorted order and the JudyL*
 classes have ways to access elements by order.

So for a specialised use Judy arrays could be a good idea, but not as a 
general replacement for Python dicts.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Sion Arrowsmith
Maric Michaud  [EMAIL PROTECTED] wrote:
I don't see a use case where a python programmer should need a
dictionnary that will be probably big but can't predict what keys will be
in.

I've recently been working on an app with precisely this
characteristic, although there were enough other bottlenecks that the
cost of growing the dict as needed was irrelevant. (Read ~2m items
from csv file, compare them with items in a database and populate
assorted dictionaries based on the result of that comparison for
subsequent processing. You can make a reasonable guess, and have a
definite upper bound, on the sizes of the dictionaries cheaply right
at the start, but you don't know what the keys in each are going to be
until you discover them through the comparison of file and DB.)

-- 
\S -- [EMAIL PROTECTED] -- http://www.chaos.org.uk/~sion/
  ___  |  Frankly I have no feelings towards penguins one way or the other
  \X/  |-- Arthur C. Clarke
   her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Large Dictionaries

2006-05-16 Thread Chris Foote
Claudio Grondi wrote:
 Chris Foote wrote:

 However, please note that the Python bsddb module doesn't support
 in-memory based databases - note the library documentation's[1] wording:

 Files never intended to be preserved on disk may be created 
 by  passing None as the filename.

 which closely mirrors the Sleepycat documentation[2]:

 In-memory databases never intended to be preserved on disk 
 may be created by setting the file parameter to NULL.

 It does actually use a temporary file (in /var/tmp), for which 
 performance for my purposes is unsatisfactory:

 # keys   dictionary  metakit  bsddb  (all using psyco)
 --   --  ---  -
 1M8.8s 22.2s  20m25s[3]
 2M   24.0s 43.7s  N/A
 5M  115.3s105.4s  N/A

 Cheers,
 Chris

 [1] bsddb docs:
 http://www.python.org/doc/current/lib/module-bsddb.html

 [2] Sleepycat BerkeleyDB C API:
 http://www.sleepycat.com/docs/api_c/db_open.html

 [3] Wall clock time.  Storing the (long_integer, integer) key in 
 string form long_integer:integer since bsddb doesn't support keys 
 that aren't integers or strings.
 
 I have to admit, that I haven't wrote any own code to actually test 
 this, but if 20m25s for storing of a single MByte of strings in a 
 database table index column is really what you are getting, I can't get 
 rid of the feeling, that there is something elementary wrong with your 
 way doing it.

Hi Claudio.

1M is one million, referring to the number of insertions of keys;
not a Megabyte.  I'm sorry that you took it that way :-(

Berkeley DB is great for accessing data by key for things already
stored on disk (i.e. read access), but write performance for each
key-value pair is slow due to it being careful about flushing
writes to disk by default.

 Posting the code for your test cases appears to me to be the only option 
 to see what is the reason for the mystery you are getting here (this 
 will clarify also the other mysterious things considered by the posters 
 to this thread up to now).

I agree that posting some test code would have proved useful, but
the code is big and has too many interdependencies on external
things (e.g. databases, threads  pyro RPC calls) to allow me
to separate out examples easily.  But if you go back to my
original posting, I think my question was quite clear.

Best regards,
Chris

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Claudio Grondi
Chris Foote wrote:
 Claudio Grondi wrote:
 
 Chris Foote wrote:

 However, please note that the Python bsddb module doesn't support
 in-memory based databases - note the library documentation's[1] wording:

 Files never intended to be preserved on disk may be created 
 by  passing None as the filename.

 which closely mirrors the Sleepycat documentation[2]:

 In-memory databases never intended to be preserved on disk 
 may be created by setting the file parameter to NULL.

 It does actually use a temporary file (in /var/tmp), for which 
 performance for my purposes is unsatisfactory:

 # keys   dictionary  metakit  bsddb  (all using psyco)
 --   --  ---  -
 1M8.8s 22.2s  20m25s[3]
 2M   24.0s 43.7s  N/A
 5M  115.3s105.4s  N/A

 Cheers,
 Chris

 [1] bsddb docs:
 http://www.python.org/doc/current/lib/module-bsddb.html

 [2] Sleepycat BerkeleyDB C API:
 http://www.sleepycat.com/docs/api_c/db_open.html

 [3] Wall clock time.  Storing the (long_integer, integer) key in 
 string form long_integer:integer since bsddb doesn't support keys 
 that aren't integers or strings.
 
  
 
 I have to admit, that I haven't wrote any own code to actually test 
 this, but if 20m25s for storing of a single MByte of strings in a 
 database table index column is really what you are getting, I can't 
 get rid of the feeling, that there is something elementary wrong with 
 your way doing it.
 
 
 Hi Claudio.
 
 1M is one million, referring to the number of insertions of keys;
 not a Megabyte.  I'm sorry that you took it that way :-(
 
 Berkeley DB is great for accessing data by key for things already
 stored on disk (i.e. read access), but write performance for each
 key-value pair is slow due to it being careful about flushing
 writes to disk by default.
 
 Posting the code for your test cases appears to me to be the only 
 option to see what is the reason for the mystery you are getting here 
 (this will clarify also the other mysterious things considered by the 
 posters to this thread up to now).
 
 
 I agree that posting some test code would have proved useful, but
 the code is big and has too many interdependencies on external
 things (e.g. databases, threads  pyro RPC calls) to allow me
 to separate out examples easily.  But if you go back to my
 original posting, I think my question was quite clear.
 
 Best regards,
 Chris
 
I have some code demonstrating the mystery from my point of view
(including timings which differ from what you experience by an order of
magnitude on my 2.8 GHz P4):

dctA = {}
import time
strt = time.clock()
lst = range(12345678901000L, 12345678901000L + 5000000)
import random
endt = time.clock()
print "lst = range(12345678901L, 12345678901L + 5000000) [s] : ", endt - strt
strt = time.clock()
random.shuffle(lst)
endt = time.clock()
print "random.shuffle(lst)   [s] : ", endt - strt
random.shuffle(lst)
counter = 0
for item in lst:
    if (counter == 0):
        print "Listing of some of lst items:"
    if (counter > 4):
        print ":END of listing of lst items"
        break
    else:
        print "item no. %i " % (counter,), repr(item)
    counter += 1
#:for
# raw_input('continue with ENTER) #')
strt = time.clock()
dctA.fromkeys(lst, None)
endt = time.clock()
print "dctA.fromkeys(lst, None)  [s] : ", endt - strt
# raw_input('continue with ENTER) #')
strt = time.clock()
strLst = []
for item in lst: strLst.append(str(item))
dctA = {}
endt = time.clock()
print "for item in lst: strLst.append(str(item)) [s] : ", endt - strt
# raw_input('continue with ENTER) #')
strt = time.clock()
dctA = dctA.fromkeys(strLst, None)
endt = time.clock()
print "dctA.fromkeys(strLst, None)   [s] : ", endt - strt
# raw_input('continue with ENTER) #')
print "len(dctA) : %i " % (len(dctA),)
counter = 0
for key in dctA.keys():
    if (counter == 0):
        print "Listing of some of dctA items:"
    if (counter > 4):
        print ":END of listing of dctA items"
        break
    else:
        print "key  no. %i " % (counter,), repr(key)
    counter += 1
#:for
raw_input('exit with ENTER) #')

# Gives as output :
#
# lst = range(12345678901L, 12345678901L + 5000000) [s] :  0.81327347470
# random.shuffle(lst)   [s] :  8.06178829991
# Listing of some of lst items:
# item no. 0  123456789011010733L
# item no. 1  123456789013585761L
# item no. 2  123456789013610266L
# item no. 3  123456789011311029L
# item no. 4  123456789010968853L
# :END of listing of lst items
# dctA.fromkeys(lst, None)  [s] :  3.11773256098
# for item in lst: strLst.append(str(item)) [s] :  11.5454232312
# dctA.fromkeys(strLst, None)   [s] :  3.98027849908
# len(dctA) : 5000000
# Listing of some of dctA items:
# key  no. 0  '123456789014515319'
# key  no. 1  '123456789014116699'
# key  no. 2  '123456789014116698'
# key  no. 3  '123456789010800915'
# key  no. 4  

Re: Large Dictionaries

2006-05-16 Thread Klaas
22.2s  20m25s[3]

20m to insert 1m keys?  You are doing something wrong.

With bdb's it is crucial to insert keys in bytestring-sorted order.
Also, be sure to give it a decent amount of cache.

-Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Richie Hindle

[Chris]
 Has anyone written a fast hash module which is more optimal for
 large datasets ?

PyJudy might be what you're looking for, though I've never used it:

  http://www.dalkescientific.com/Python/PyJudy.html

Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory. ... PyJudy arrays
are similar to Python dictionaries and sets.

-- 
Richie Hindle
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Roy Smith
In article [EMAIL PROTECTED], Chris Foote [EMAIL PROTECTED] 
wrote:

 Hi all.
 
 I have the need to store a large (10M) number of keys in a hash table,
 based on a tuple of (long_integer, integer).  The standard python
 dictionary works well for small numbers of keys, but starts to
 perform badly for me inserting roughly 5M keys:
 
 # keys   dictionary  metakit   (both using psyco)
 --   --  ---
 1M8.8s 22.2s
 2M   24.0s 43.7s
 5M  115.3s105.4s

Are those clock times or CPU times?

How much memory is your process using and how much is available on your 
machine?

I'm guessing an integer takes 4 bytes and a long integer takes roughly one 
byte per two decimal digits.  Plus a few more bytes to bundle them up into 
a tuple.  You've probably got something like 20 bytes per key, so 5M of 
them is 100 meg just for the keys.

To get reasonable hashing performance, the hash table needs to be maybe 
half full, and figure a hash key is 4 bytes, so that's another 40 meg for 
the hash table itself.

Plus whatever the values stored in the dictionary take up.  Even if you're 
not storing any values, there's probably 4 bytes for a null pointer (or a 
reference to None), so that's another 40 meg.

These are all vague guesses, based on what I think are probably 
conservative estimates of what various objects must take up in memory, but 
you see where this is going.  We're already up to 180 meg.  I wonder if the 
whole problem is that you're just paging yourself to death.  A few minutes 
watching your system memory performance with ps or top while your program 
is running might give you some idea if this is the case.
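
Spelling the guesses out (all the per-item sizes below are the same rough
assumptions as above, not measured values):

# 5M keys, each a (long, int) tuple, all values None, table ~half full.
keys = 5 * 1000 * 1000
slots = 2 * keys                 # half-full hash table

key_bytes   = keys  * 20         # ~20 bytes per tuple key
table_bytes = slots * 4          # 4-byte hash/key entry per slot
value_bytes = slots * 4          # 4-byte value pointer per slot

total = key_bytes + table_bytes + value_bytes
print "roughly %d meg" % (total / (1000 * 1000))   # -> roughly 180 meg
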
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Claudio Grondi
Chris Foote wrote:
 Hi all.
 
 I have the need to store a large (10M) number of keys in a hash table,
 based on a tuple of (long_integer, integer).  The standard python
 dictionary works well for small numbers of keys, but starts to
 perform badly for me inserting roughly 5M keys:
 
 # keys   dictionary  metakit   (both using psyco)
 --   --  ---
 1M8.8s 22.2s
 2M   24.0s 43.7s
 5M  115.3s105.4s
 
 Has anyone written a fast hash module which is more optimal for
 large datasets ?
 
 p.s. Disk-based DBs are out of the question because most
 key lookups will result in a miss, and lookup time is
 critical for this application.
 
 Cheers,
 Chris
Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for 
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the 
standard Python 2.4 distribution.

"Berkeley DB was 20 times faster than other databases.  It has the 
operational speed of a main memory database, the startup and shut down 
speed of a disk-resident database, and does not have the overhead of 
a client-server inter-process communication."
-- Ray Van Tassle, Senior Staff Engineer, Motorola

Please let me/us know if it is what you are looking for.

Claudio
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Aahz
In article [EMAIL PROTECTED],
Roy Smith  [EMAIL PROTECTED] wrote:
In article [EMAIL PROTECTED], Chris Foote [EMAIL PROTECTED] 
wrote:
 
 I have the need to store a large (10M) number of keys in a hash table,
 based on a tuple of (long_integer, integer).  The standard python
 dictionary works well for small numbers of keys, but starts to
 perform badly for me inserting roughly 5M keys:
 
 # keys   dictionary  metakit   (both using psyco)
 --   --  ---
 1M8.8s 22.2s
 2M   24.0s 43.7s
 5M  115.3s105.4s

Are those clock times or CPU times?

And what are these times measuring?  Don't forget that adding keys
requires resizing the dict, which is a moderately expensive operation.
Once the dict is constructed, lookup times should be quite good.
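
A quick way to check both effects on your own box (a hypothetical timing
harness; the key shapes and counts only mimic the original posting):

import time

d = {}
N = 5 * 1000 * 1000
start = time.clock()
for i in xrange(N):
    d[(12345678901000L + i, i)] = None
    if (i + 1) % 1000000 == 0:
        # Per-million timings; big jumps here would point at resizes or swapping.
        print "%8d keys inserted: %6.2f s so far" % (i + 1, time.clock() - start)

# Lookups on the finished dict; these all miss, as most do in Chris's workload.
start = time.clock()
hits = 0
for i in xrange(1000000):
    if (999999999999L + i, i) in d:
        hits += 1
print "1M lookups: %.2f s (%d hits)" % (time.clock() - start, hits)
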
-- 
Aahz ([EMAIL PROTECTED])   * http://www.pythoncraft.com/

I saw `cout' being shifted Hello world times to the left and stopped
right there.  --Steve Gonedes
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Roy Smith
Aahz [EMAIL PROTECTED] wrote:
 Don't forget that adding keys requires resizing the dict, which is a
 moderately expensive operation.

My guess would be that each resize grows the table by a constant
factor, which, IIRC, works out to amortized O(n).  It's not as good as
creating the dict the right size in the first place, but it's really
not that bad.  Somewhat more troubling is that it can lead to memory
fragmentation problems.

I don't understand why Python doesn't have a way to give a size hint
when creating a dict.  It seems so obvious to be able to say d = dict
(size=10*1000*1000) if you know beforehand that's how many keys
you're going to add.

Looking at the docs, I'm astounded to discover that you can indeed do
exactly that, but you get {'size': 10000000}.  I am befuddled as to why
people thought creating a dict by keyword would be a useful thing to
do, but left out (and, indeed, eliminated the chance to add the syntax
later) the obviously useful ability to hint at the size.  I suppose we
could still add:

d = {}
d.reserve (10*1000*1000)
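
For comparison, the keyword form as it exists today just builds a
one-entry dict; it is the ordinary keyword constructor, nothing like the
hypothetical reserve() above:

>>> d = dict(size=10*1000*1000)
>>> d
{'size': 10000000}
>>> len(d)
1
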
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Maric Michaud
On Monday, 15 May 2006 19:24, Roy Smith wrote:
 d = {}
 d.reserve (10*1000*1000)

d={}.fromkeys(xrange(5*10**6)) ?

-- 
_

Maric Michaud
_

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Paul McGuire
Claudio Grondi [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 Chris Foote wrote:
  Hi all.
 
  I have the need to store a large (10M) number of keys in a hash table,
  based on a tuple of (long_integer, integer).  The standard python
  dictionary works well for small numbers of keys, but starts to
  perform badly for me inserting roughly 5M keys:
 
  # keys   dictionary  metakit   (both using psyco)
  --   --  ---
  1M8.8s 22.2s
  2M   24.0s 43.7s
  5M  115.3s105.4s
 
  Has anyone written a fast hash module which is more optimal for
  large datasets ?
 
  p.s. Disk-based DBs are out of the question because most
  key lookups will result in a miss, and lookup time is
  critical for this application.
 
  Cheers,
  Chris
 Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
 BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
 standard Python 2.4 distribution.

 Berkeley DB was  20 times faster  than other databases.  It has the
 operational speed of  a main memory database, the startup and  shut down
 speed of a  disk-resident database, and does not have the  overhead  of
 a client-server inter-process communication.
 Ray  Van Tassle,  Senior  Staff Engineer, Motorola

 Please let me/us know if it is what you are looking for.

 Claudio

sqlite also supports an in-memory database - use pysqlite
(http://initd.org/tracker/pysqlite/wiki) to access this from Python.
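
A minimal sketch of that approach, assuming the pysqlite2 package from the
URL above (the table layout and data here are only illustrative):

from pysqlite2 import dbapi2 as sqlite

con = sqlite.connect(':memory:')        # purely in-memory database
cur = con.cursor()
cur.execute("create table pairs (a integer, b integer, primary key (a, b))")

# Stand-in data shaped like the original (long_integer, integer) keys.
data = [(12345678901000L + i, i) for i in xrange(100000)]
cur.executemany("insert into pairs values (?, ?)", data)
con.commit()

# Look up one key pair; fetchone() returns None on a miss.
cur.execute("select 1 from pairs where a = ? and b = ?", (12345678901042L, 42))
if cur.fetchone():
    print "hit"
else:
    print "miss"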

-- Paul


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Diez B. Roggisch
Maric Michaud schrieb:
 On Monday, 15 May 2006 19:24, Roy Smith wrote:
 d = {}
 d.reserve (10*1000*1000)
 
 d={}.fromkeys(xrange(5*10**6)) ?

That is a totally different beast. You don't want to insert arbitrary 
keys, you want the internal hash-array to be of the right size.
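
A tiny illustration of the difference (toy size, of course):

d = {}.fromkeys(xrange(5))
print d        # {0: None, 1: None, 2: None, 3: None, 4: None}
# These are real keys that would later have to be overwritten or deleted;
# a reserve()-style hint would only pre-size the empty table.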

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread bearophileHUGS
Roy Smith:
 I am befuddled as to why
 people thought creating a dict by keyword would be a useful thing to
 do, but left out (and, indeed, eliminated the chance to add the syntax
 later) the obviously useful ability to hint at the size.

Adding the keyword syntax doesn't require a Python programmer to
remember much, because it's the same syntax used elsewhere, while
adding yet another method to dicts enlarges their API. Python is
usually designed to have smaller APIs, to help programmers remember
most things.

I agree that a reserve method on Python dicts/sets could be a little
useful, but Python programmers are usually more concerned with getting
things done in little programmer time and little code than with telling
the computer the fastest way to do them (C++ is more speed-oriented, so
the reserve() of the STL hash containers is more useful there).

Bye,
bearophile

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread John Machin
1. Why is two minutes to insert 5M keys bad for you? What would be
good? What would be good/bad look-up times? Have you measured the
typical look-up time? How often is the dict creation  required to be
done? How often does the data change? Is multi-user access required for
(a) look-up (b) updating? Have you considered loading the dict from a
pickle?
2. Assuming your code that is creating the dict looks in essence like
this:
adict = {}
for k, v in some_iterable:
adict[k] = v
then any non-linear behaviour can only be in the actual CPython
insertion code. Psyco can't help you there. Psyco *may* help with the
linear part, *if* you have enough memory. What are the corresponding
times without Psyco? In any case, if your code isn't (conceptually)
that simple, then try cutting away the cruft and measuring again.
3. Which version of Python? What OS? OK, psyco implies Intel x86, but what
chip exactly? How much free memory?
4. Consider printing time-so-far results, say every 100K keys. Multiple
step-ups might indicate dict resizings. A dog-leg probably means
running out of memory. Why roughly 5M keys???
5. How large are your long_integers?
6. What is the nature of the value associated with each key?
7. Have you experimented with key = a * 2 ** 32 + b instead of key =
(a, b)?
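
For point 7, a minimal sketch of the packed-key idea (the shift assumes
the second element is a non-negative 32-bit value; adjust if it isn't):

def pack(a, b):
    # Combine (long_integer, integer) into one long: key = a * 2**32 + b.
    return (a << 32) | (b & 0xFFFFFFFFL)

def unpack(key):
    return key >> 32, int(key & 0xFFFFFFFFL)

adict = {}
for a, b in [(12345678901000L, 7), (12345678901001L, 8)]:   # stand-in data
    adict[pack(a, b)] = None

print pack(12345678901000L, 7) in adict     # True
print unpack(pack(12345678901000L, 7))      # (12345678901000L, 7)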

HTH,
John

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread lcaamano
Sounds like PyTables could be useful.

http://www.pytables.org

--
lpc

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Chris Foote
Roy Smith wrote:
 In article [EMAIL PROTECTED], Chris Foote [EMAIL PROTECTED] 
 wrote:
 
 I have the need to store a large (10M) number of keys in a hash table,
 based on a tuple of (long_integer, integer).  The standard python
 dictionary works well for small numbers of keys, but starts to
 perform badly for me inserting roughly 5M keys:

 # keys   dictionary  metakit   (both using psyco)
 --   --  ---
 1M8.8s 22.2s
 2M   24.0s 43.7s
 5M  115.3s105.4s
 
 Are those clock times or CPU times?

User + system CPU time.

 How much memory is your process using and how much is available on your 
 machine?

A very large amount of space is consumed, which is why I didn't give 
times for the 10M version, which would have eaten my 1G of RAM and
gone into swap :-)

 I'm guessing an integer takes 4 bytes and a long integer takes roughly one 
 byte per two decimal digits.  Plus a few more bytes to bundle them up into 
 a tuple.  You've probably got something like 20 bytes per key, so 5M of 
 them is 100 meg just for the keys.
 
 To get reasonable hashing performance, the hash table needs to be maybe 
 half full, and figure a hash key is 4 bytes, so that's another 40 meg for 
 the hash table itself.
 
 Plus whatever the values stored in the dictionary take up.  Even if you're 
 not storing any values, there's probably 4 bytes for a null pointer (or a 
 reference to None), so that's another 40 meg.
 
 These are all vague guesses, based on what I think are probably 
 conservative estimates of what various objects must take up in memory, but 
 you see where this is going.  We're already up to 180 meg.

Yep, that size sounds about right (my dictionary values are all None).

 I wonder if the 
 whole problem is that you're just paging yourself to death.  A few minutes 
 watching your system memory performance with ps or top while your program 
 is running might give you some idea if this is the case.

Definitely not.

Thanks anyway,

Chris

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-15 Thread Chris Foote
Aahz wrote:
 In article [EMAIL PROTECTED],
 Roy Smith  [EMAIL PROTECTED] wrote:
 In article [EMAIL PROTECTED], Chris Foote [EMAIL PROTECTED] 
 wrote:
 I have the need to store a large (10M) number of keys in a hash table,
 based on a tuple of (long_integer, integer).  The standard python
 dictionary works well for small numbers of keys, but starts to
 perform badly for me inserting roughly 5M keys:

 # keys   dictionary  metakit   (both using psyco)
 --   --  ---
 1M8.8s 22.2s
 2M   24.0s 43.7s
 5M  115.3s105.4s
 Are those clock times or CPU times?
 
 And what are these times measuring?

The loading of a file into a dictionary.  i.e. no lookup operations.

  Don't forget that adding keys
 requires resizing the dict, which is a moderately expensive operation.

Yep, that's why I probably need a dictionary where I can pre-specify
an approximate size at the time of its creation.

 Once the dict is constructed, lookup times should be quite good.

Very good indeed with Psyco!

Cheers,
Chris
-- 
http://mail.python.org/mailman/listinfo/python-list