Re: handling very large dictionaries

2009-06-28 Thread Joseph Turian
You could also try using a key-value store.
I am using pytc, a Python API for Tokyo Cabinet. It seems to figure
out quite nicely when to go to disk, and when to use memory. But I
have not done extensive tests.

Here is some example code for using pytc:
  http://github.com/turian/pytc-example/tree/master

 Joseph

On Jun 28, 7:13 pm, mclovin  wrote:
> Hello all,
>
> I need to have a dictionary of about 8 gigs (well the data it is
> processing is around 4gb). so naturally i am running into memory
> errors.
>
> So i looked around and found bsddb which acts like a dictionary object
> only offloads the data from the RAM to the HDD, however that only
> supports strings.
>
> my dictionaries hold my own class objects
>
> Is there something like it that is more flexible?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: handling very large dictionaries

2009-06-28 Thread alex23
On Jun 29, 9:13 am, mclovin  wrote:
> Is there something like it that is more flexible?

Have you seen the stdlib module 'shelve'? 
http://docs.python.org/library/shelve.html

It creates a persistent file-based dictionary, which can hold any type
of object as long as it can be pickled.
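
A minimal sketch of what that might look like for the original poster's case
(keys must be strings; the class name and filename here are made-up stand-ins):

import shelve

class LogRecord(object):            # made-up stand-in for the OP's own class
    def __init__(self, field1, field2):
        self.field1 = field1
        self.field2 = field2

db = shelve.open('records.db')      # values can be anything picklable
db['event-0001'] = LogRecord(42, 'spam')
db['event-0002'] = LogRecord(43, 'eggs')
db.close()

db = shelve.open('records.db')
print db['event-0001'].field2       # -> 'spam'
db.close()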

I really like the 3rd party module 'shove': http://pypi.python.org/pypi/shove

It's similar to shelve but provides many more backend options,
including dbs, svn and Amazon S3. Very handy.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: handling very large dictionaries

2009-06-28 Thread Aahz
In article <9efff087-bd6e-49fb-ad30-a955a64b8...@j32g2000yqh.googlegroups.com>,
mclovin   wrote:
>
>I need to have a dictionary of about 8 gigs (well the data it is
>processing is around 4gb). so naturally i am running into memory
>errors.
>
>So i looked around and found bsddb which acts like a dictionary object
>only offloads the data from the RAM to the HDD, however that only
>supports strings.

Look at the pickle module.  Personally, I'd use SQLite instead of bsddb.
You might also look into a full-blown ORM such as SQLAlchemy.
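
A rough, untested sketch of combining those two suggestions -- pickle each
object and store the pickled string in SQLite keyed by a string (the table
and file names are made up):

import sqlite3
import cPickle as pickle

conn = sqlite3.connect('objects.db')     # made-up filename
conn.execute('CREATE TABLE IF NOT EXISTS objs (key TEXT PRIMARY KEY, data BLOB)')

def put(key, obj):
    blob = sqlite3.Binary(pickle.dumps(obj, 2))   # object -> pickled bytes
    conn.execute('INSERT OR REPLACE INTO objs VALUES (?, ?)', (key, blob))

def get(key):
    row = conn.execute('SELECT data FROM objs WHERE key = ?', (key,)).fetchone()
    return pickle.loads(str(row[0])) if row else None

put('event-0001', {'field1': 1, 'field2': 'spam'})
print get('event-0001')
conn.commit()
conn.close()
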
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

"as long as we like the same operating system, things are cool." --piranha
-- 
http://mail.python.org/mailman/listinfo/python-list


handling very large dictionaries

2009-06-28 Thread mclovin
Hello all,

I need to have a dictionary of about 8 gigs (well the data it is
processing is around 4gb). so naturally i am running into memory
errors.

So i looked around and found bsddb, which acts like a dictionary object
but offloads the data from RAM to the HDD; however, it only supports
strings.

my dictionaries hold my own class objects

Is there something like it that is more flexible?
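
One way around the strings-only limitation, following the pickle suggestion
above, is to serialize the objects to strings on the way in and rebuild them
on the way out. A rough, untested sketch (the class and filename are made up):

import bsddb
import cPickle as pickle

class MyRecord(object):                    # made-up stand-in for the OP's class
    def __init__(self, count, label):
        self.count = count
        self.label = label

db = bsddb.hashopen('records.bsddb', 'c')  # a string-to-string store on disk
db['key1'] = pickle.dumps(MyRecord(3, 'spam'), 2)   # object -> string going in

obj = pickle.loads(db['key1'])             # string -> object coming out
print obj.count, obj.label
db.close()
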
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-04-26 Thread repi

Hello,

I'm having the same problem. I'm creating two dictionaries with about
700,000 to 1,600,000 keys each, each key pointing to a smaller structure of
embedded dictionaries. On a 8 x 3.0GHz Xeon Mac Pro with 17GB of RAM,
cPickle is excruciatingly slow. Moreover, it crashes with memory error (12).
I've been looking for alternatives, even thinking of writing up my own file
format and read/write functions. As a last resort I tried dumping the two
lists with the marshal module and it worked great. A 200MB and a 300MB file
are created in a couple of seconds without a crash. 

You should be warned that marshal is not meant as a general data-persistence
module. But as long as you have a rather simple data structure and you are
using it with a specific python version, it seems to run circles around
cPickle.
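
A minimal sketch of the marshal round trip described above (untested here;
the filename and data are made up, and the version-specific-format caveat
applies):

import marshal

big = {'key1': [{'a': 1, 'b': 2, 'c': 'hello'}] * 1000}   # builtin types only

f = open('big.marshal', 'wb')
marshal.dump(big, f)          # fast, but the format is Python-version specific
f.close()

f = open('big.marshal', 'rb')
restored = marshal.load(f)
f.close()

print len(restored['key1'])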

Hope it helps.


--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-30 Thread Aaron Brady
On Jan 30, 2:44 pm, perfr...@gmail.com wrote:
> On Jan 28, 6:08 pm, Aaron Brady  wrote:
>
>
>
> > On Jan 28, 4:43 pm, perfr...@gmail.com wrote:
>
> > > On Jan 28, 5:14 pm, John Machin  wrote:
>
> > > > On Jan 29, 3:13 am, perfr...@gmail.com wrote:
>
> > > > > hello all,
>
> > > > > i have a large dictionary which contains about 10 keys, each key has a
> > > > > value which is a list containing about 1 to 5 million (small)
> > > > > dictionaries. for example,
>
> > > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f':
> > > > > 'world'}, ...],
> > > > >                 key2: [...]}
>
> > > > > in total there are about 10 to 15 million lists if we concatenate
> > > > > together all the values of every key in 'mydict'. mydict is a
> > > > > structure that represents data in a very large file (about 800
> > > > > megabytes).
>
> > snip
>
> > > in reply to the other poster: i thought 'shelve' simply calls pickle.
> > > if thats the case, it wouldnt be any faster, right ?
>
> > Yes, but not all at once.  It's a clear winner if you need to update
> > any of them later, but if it's just write-once, read-many, it's about
> > the same.
>
> > You said you have a million dictionaries.  Even if each took only one
> > byte, you would still have a million bytes.  Do you expect a faster I/
> > O time than the time it takes to write a million bytes?
>
> > I want to agree with John's worry about RAM, unless you have several+
> > GB, as you say.  You are not dealing with small numbers.
>
> in my case, i just write the pickle file once and then read it in
> later. in that case, cPickle and shelve would be identical, if i
> understand correctly?

No, not identical.  'shelve' is not a dictionary, it's a database
object that implements the mapping protocol.  'isinstance( shelve,
dict )' is False, for example.

> the file i'm reading in is ~800 MB file, and the pickle file is around
> 300 MB. even if it were 800 MB, it doesn't make sense to me that
> python's i/o would be that slow... it takes roughly 5 seconds to write
> one megabyte of a binary file (the pickled object in this case), which
> just seems wrong. does anyone know anything about this? about how i/o
> can be sped up for example?

You can try copying a 1-MB file.  Or something like:

f = open( 'temp.temp', 'w' )
for x in range( 10 ):
    f.write( '0' * 10 )

You know how long it takes OSes to boot, right?

> the dictionary might have a million keys, but each key's value is very
> small. i tried the same example where the keys are short strings (and
> there are about 10-15 million of them) and each value is an integer,
> and it is still very slow. does anyone know how to test whether i/o is
> the bottle neck, or whether it's something specific about pickle?
>
> thanks.

You could fall back to storing a parallel list by hand, if you're just
using string and numeric primitives.
--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-30 Thread perfreem
On Jan 28, 6:08 pm, Aaron Brady  wrote:
> On Jan 28, 4:43 pm, perfr...@gmail.com wrote:
>
> > On Jan 28, 5:14 pm, John Machin  wrote:
>
> > > On Jan 29, 3:13 am, perfr...@gmail.com wrote:
>
> > > > hello all,
>
> > > > i have a large dictionary which contains about 10 keys, each key has a
> > > > value which is a list containing about 1 to 5 million (small)
> > > > dictionaries. for example,
>
> > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f':
> > > > 'world'}, ...],
> > > >                 key2: [...]}
>
> > > > in total there are about 10 to 15 million lists if we concatenate
> > > > together all the values of every key in 'mydict'. mydict is a
> > > > structure that represents data in a very large file (about 800
> > > > megabytes).
>
> snip
>
> > in reply to the other poster: i thought 'shelve' simply calls pickle.
> > if thats the case, it wouldnt be any faster, right ?
>
> Yes, but not all at once.  It's a clear winner if you need to update
> any of them later, but if it's just write-once, read-many, it's about
> the same.
>
> You said you have a million dictionaries.  Even if each took only one
> byte, you would still have a million bytes.  Do you expect a faster I/
> O time than the time it takes to write a million bytes?
>
> I want to agree with John's worry about RAM, unless you have several+
> GB, as you say.  You are not dealing with small numbers.

in my case, i just write the pickle file once and then read it in
later. in that case, cPickle and shelve would be identical, if i
understand correctly?

the file i'm reading in is an ~800 MB file, and the pickle file is around
300 MB. even if it were 800 MB, it doesn't make sense to me that
python's i/o would be that slow... it takes roughly 5 seconds to write
one megabyte of a binary file (the pickled object in this case), which
just seems wrong. does anyone know anything about this? about how i/o
can be sped up for example?

the dictionary might have a million keys, but each key's value is very
small. i tried the same example where the keys are short strings (and
there are about 10-15 million of them) and each value is an integer,
and it is still very slow. does anyone know how to test whether i/o is
the bottleneck, or whether it's something specific about pickle?

thanks.
--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread Gabriel Genellina

En Wed, 28 Jan 2009 14:13:10 -0200,  escribió:


i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,

mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f':
'world'}, ...],
key2: [...]}

[pickle] creates extremely large files (~ 300 MB) though it does so
*extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
it gets slower and slower. it takes almost an hour if not more to
write this pickle object to file.


There is an undocumented Pickler attribute, "fast". Usually, when the same  
object is referenced more than once, only the first appearance is stored  
in the pickled stream; later references just point to the original. This  
requires the Pickler instance to remember every object pickled so far --  
setting the "fast" attribute to a true value bypasses this check. Before  
using this, you must be positively sure that your objects don't contain  
circular references -- else pickling will never finish.


py> from cPickle import Pickler
py> from cStringIO import StringIO
py> s = StringIO()
py> p = Pickler(s, -1)
py> p.fast = 1
py> x = [1,2,3]
py> y = [x, x, x]
py> y
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y[0] is y[1]
True
py> p.dump(y)

py> s.getvalue()
'\x80\x02](](K\x01K\x02K\x03e](K\x01K\x02K\x03e](K\x01K\x02K\x03ee.'

Note that, when unpickling, shared references are broken:

py> s.seek(0,0)
py> from cPickle import load
py> y2 = load(s)
py> y2
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y2[0] is y2[1]
False

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread John Machin
On Jan 29, 9:43 am, perfr...@gmail.com wrote:
> On Jan 28, 5:14 pm, John Machin  wrote:
>
>
>
> > On Jan 29, 3:13 am, perfr...@gmail.com wrote:
>
> > > hello all,
>
> > > i have a large dictionary which contains about 10 keys, each key has a
> > > value which is a list containing about 1 to 5 million (small)
> > > dictionaries. for example,
>
> > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f':
> > > 'world'}, ...],
> > >                 key2: [...]}
>
> > > in total there are about 10 to 15 million lists if we concatenate
> > > together all the values of every key in 'mydict'. mydict is a
> > > structure that represents data in a very large file (about 800
> > > megabytes).
>
> > > what is the fastest way to pickle 'mydict' into a file? right now i am
> > > experiencing a lot of difficulties with cPickle when using it like
> > > this:
>
> > > from cPickle import pickle
> > > pfile = open(my_file, 'w')
> > > pickle.dump(mydict, pfile)
> > > pfile.close()
>
> > > this creates extremely large files (~ 300 MB) though it does so
> > > *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
> > > it gets slower and slower. it takes almost an hour if not more to
> > > write this pickle object to file.
>
> > > is there any way to speed this up? i dont mind the large file... after
> > > all the text file with the data used to make the dictionary was larger
> > > (~ 800 MB) than the file it eventually creates, which is 300 MB.  but
> > > i do care about speed...
>
> > > i have tried optimizing this by using this:
>
> > > s = pickle.dumps(mydict, 2)
> > > pfile.write(s)
>
> > > but this takes just as long... any ideas ? is there a different module
> > > i could use that's more suitable for large dictionaries ?
> > > thank you very much.
>
> > Pardon me if I'm asking the "bleedin' obvious", but have you checked
> > how much virtual memory this is taking up compared to how much real
> > memory you have? If the slowness is due to pagefile I/O, consider
> > doing "about 10" separate pickles (one for each key in your top-level
> > dictionary).
>
> the slowness is due to CPU when i profile my program using the unix
> program 'top'... i think all the work is in the file I/O. the machine
> i am using several GB of ram and ram memory is not heavily taxed at
> all. do you know how file I/O can be sped up?

More quick silly questions:

(1) How long does it take to load that 300MB pickle back into memory
using:
(a) cpickle.load(f)
(b) f.read()
?

What else is happening on the machine while you are creating the
pickle?

(2) How does

--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread Aaron Brady
On Jan 28, 4:43 pm, perfr...@gmail.com wrote:
> On Jan 28, 5:14 pm, John Machin  wrote:
>
>
>
> > On Jan 29, 3:13 am, perfr...@gmail.com wrote:
>
> > > hello all,
>
> > > i have a large dictionary which contains about 10 keys, each key has a
> > > value which is a list containing about 1 to 5 million (small)
> > > dictionaries. for example,
>
> > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f':
> > > 'world'}, ...],
> > >                 key2: [...]}
>
> > > in total there are about 10 to 15 million lists if we concatenate
> > > together all the values of every key in 'mydict'. mydict is a
> > > structure that represents data in a very large file (about 800
> > > megabytes).
snip

> in reply to the other poster: i thought 'shelve' simply calls pickle.
> if thats the case, it wouldnt be any faster, right ?

Yes, but not all at once.  It's a clear winner if you need to update
any of them later, but if it's just write-once, read-many, it's about
the same.

You said you have a million dictionaries.  Even if each took only one
byte, you would still have a million bytes.  Do you expect a faster I/
O time than the time it takes to write a million bytes?

I want to agree with John's worry about RAM, unless you have several+
GB, as you say.  You are not dealing with small numbers.
--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread perfreem
On Jan 28, 5:14 pm, John Machin  wrote:
> On Jan 29, 3:13 am, perfr...@gmail.com wrote:
>
>
>
> > hello all,
>
> > i have a large dictionary which contains about 10 keys, each key has a
> > value which is a list containing about 1 to 5 million (small)
> > dictionaries. for example,
>
> > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f':
> > 'world'}, ...],
> >                 key2: [...]}
>
> > in total there are about 10 to 15 million lists if we concatenate
> > together all the values of every key in 'mydict'. mydict is a
> > structure that represents data in a very large file (about 800
> > megabytes).
>
> > what is the fastest way to pickle 'mydict' into a file? right now i am
> > experiencing a lot of difficulties with cPickle when using it like
> > this:
>
> > from cPickle import pickle
> > pfile = open(my_file, 'w')
> > pickle.dump(mydict, pfile)
> > pfile.close()
>
> > this creates extremely large files (~ 300 MB) though it does so
> > *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
> > it gets slower and slower. it takes almost an hour if not more to
> > write this pickle object to file.
>
> > is there any way to speed this up? i dont mind the large file... after
> > all the text file with the data used to make the dictionary was larger
> > (~ 800 MB) than the file it eventually creates, which is 300 MB.  but
> > i do care about speed...
>
> > i have tried optimizing this by using this:
>
> > s = pickle.dumps(mydict, 2)
> > pfile.write(s)
>
> > but this takes just as long... any ideas ? is there a different module
> > i could use that's more suitable for large dictionaries ?
> > thank you very much.
>
> Pardon me if I'm asking the "bleedin' obvious", but have you checked
> how much virtual memory this is taking up compared to how much real
> memory you have? If the slowness is due to pagefile I/O, consider
> doing "about 10" separate pickles (one for each key in your top-level
> dictionary).

the slowness is due to CPU when i profile my program using the unix
program 'top'... i think all the work is in the file I/O. the machine
i am using has several GB of ram and ram is not heavily taxed at
all. do you know how file I/O can be sped up?

in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?
--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread John Machin
On Jan 29, 3:13 am, perfr...@gmail.com wrote:
> hello all,
>
> i have a large dictionary which contains about 10 keys, each key has a
> value which is a list containing about 1 to 5 million (small)
> dictionaries. for example,
>
> mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f':
> 'world'}, ...],
>                 key2: [...]}
>
> in total there are about 10 to 15 million lists if we concatenate
> together all the values of every key in 'mydict'. mydict is a
> structure that represents data in a very large file (about 800
> megabytes).
>
> what is the fastest way to pickle 'mydict' into a file? right now i am
> experiencing a lot of difficulties with cPickle when using it like
> this:
>
> from cPickle import pickle
> pfile = open(my_file, 'w')
> pickle.dump(mydict, pfile)
> pfile.close()
>
> this creates extremely large files (~ 300 MB) though it does so
> *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
> it gets slower and slower. it takes almost an hour if not more to
> write this pickle object to file.
>
> is there any way to speed this up? i dont mind the large file... after
> all the text file with the data used to make the dictionary was larger
> (~ 800 MB) than the file it eventually creates, which is 300 MB.  but
> i do care about speed...
>
> i have tried optimizing this by using this:
>
> s = pickle.dumps(mydict, 2)
> pfile.write(s)
>
> but this takes just as long... any ideas ? is there a different module
> i could use that's more suitable for large dictionaries ?
> thank you very much.

Pardon me if I'm asking the "bleedin' obvious", but have you checked
how much virtual memory this is taking up compared to how much real
memory you have? If the slowness is due to pagefile I/O, consider
doing "about 10" separate pickles (one for each key in your top-level
dictionary).
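
A rough, untested sketch of that per-key approach (it assumes the top-level
keys are plain strings usable in filenames):

import cPickle as pickle

def dump_per_key(mydict, prefix='mydict'):
    # one pickle per top-level key, e.g. mydict.key1.pkl, instead of one huge file
    for key, value in mydict.iteritems():
        f = open('%s.%s.pkl' % (prefix, key), 'wb')
        pickle.dump(value, f, 2)          # protocol 2 = binary, smaller and faster
        f.close()

def load_one_key(key, prefix='mydict'):
    f = open('%s.%s.pkl' % (prefix, key), 'rb')
    value = pickle.load(f)
    f.close()
    return value
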
--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread Aaron Brady
On Jan 28, 10:13 am, perfr...@gmail.com wrote:
> hello all,
>
> i have a large dictionary which contains about 10 keys, each key has a
> value which is a list containing about 1 to 5 million (small)
> dictionaries. for example,
snip
> but this takes just as long... any ideas ? is there a different module
> i could use that's more suitable for large dictionaries ?
> thank you very much.

There is the 'shelve' module.  You could create a shelf that tells you
the filename of the 5 other ones.  A million keys should be no
problem, I guess.  (It's standard library.)  All your keys have to be
strings, though, and all your values have to be pickleable.  If that's
a problem, yes you will need ZODB or Django (I understand), or another
relational DB.
--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread Christian Heimes
perfr...@gmail.com schrieb:
> but this takes just as long... any ideas ? is there a different module
> i could use that's more suitable for large dictionaries ?
> thank you very much.

Have a look at ZODB.

--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread perfreem
On Jan 28, 11:32 am, pyt...@bdurham.com wrote:
> Hi,
>
> Change:
>
> pickle.dump(mydict, pfile)
>
> to:
>
> pickle.dump(mydict, pfile, -1 )
>
> I think you will see a big difference in performance and also a much
> smaller file on disk.
>
> BTW: What type of application are you developing that creates so many
> dictionaries? Sounds interesting.
>
> Malcolm

hi!

thank you for your reply. unfortunately i tried this but it doesn't
change the speed. it's still writing the file extremely slowly.  i'm
not sure why?

thank you.
--
http://mail.python.org/mailman/listinfo/python-list


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread python
Hi,

Change:

pickle.dump(mydict, pfile)

to:

pickle.dump(mydict, pfile, -1 )

I think you will see a big difference in performance and also a much
smaller file on disk.
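
One more detail worth checking: with a binary protocol the file should also
be opened in binary mode ('wb'), at least on Windows. A minimal sketch with
made-up data and filename:

import cPickle as pickle

mydict = {'key1': [{'a': 1, 'b': 2}] * 1000}   # stand-in for the real data

pfile = open('mydict.pkl', 'wb')   # binary mode matters once the protocol is not 0
pickle.dump(mydict, pfile, -1)     # -1 selects the highest (binary) protocol
pfile.close()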

BTW: What type of application are you developing that creates so many
dictionaries? Sounds interesting.

Malcolm
--
http://mail.python.org/mailman/listinfo/python-list


writing large dictionaries to file using cPickle

2009-01-28 Thread perfreem
hello all,

i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,

mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}

in total there are about 10 to 15 million lists if we concatenate
together all the values of every key in 'mydict'. mydict is a
structure that represents data in a very large file (about 800
megabytes).

what is the fastest way to pickle 'mydict' into a file? right now i am
experiencing a lot of difficulties with cPickle when using it like
this:

import cPickle as pickle
pfile = open(my_file, 'w')
pickle.dump(mydict, pfile)
pfile.close()

this creates extremely large files (~ 300 MB) though it does so
*extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
it gets slower and slower. it takes almost an hour if not more to
write this pickle object to file.

is there any way to speed this up? i don't mind the large file... after
all the text file with the data used to make the dictionary was larger
(~ 800 MB) than the file it eventually creates, which is 300 MB.  but
i do care about speed...

i have tried optimizing this by using this:

s = pickle.dumps(mydict, 2)
pfile.write(s)

but this takes just as long... any ideas ? is there a different module
i could use that's more suitable for large dictionaries ?
thank you very much.
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-16 Thread Luis M . González
On Jan 15, 6:39 pm, Per Freem  wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order
> of tens of millions elements...
>
> i create a dictionary that maps elements of the file that i want to
> count
> to their number of occurs. so i iterate through the file and for each
> line
> extract the elements (simple text operation) and see if it has an
> entry in the dict:
>
> for line in file:
>   try:
>     elt = MyClass(line)# extract elt from line...
>     my_dict[elt] += 1
>   except KeyError:
>     my_dict[elt] = 1
>
> i am using try/except since it is supposedly faster (though i am not
> sure
> about this? is this really true in Python 2.5?).
>
> the only 'twist' is that my elt is an instance of a class (MyClass)
> with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
> works well.
> the __repr__ and __hash__ methods of my class simply return str()
> representation
> of self, while __str__ just makes everything numeric field into a
> concatenated string:
>
> class MyClass
>
>   def __str__(self):
>     return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>   def __repr__(self):
>     return str(self)
>
>   def __hash__(self):
>     return hash(str(self))
>
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.

How about this?:

for line in file:
    elt = MyClass(line)
    my_dict[elt] = my_dict.get(elt, 0) + 1
    ...

--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-16 Thread Matthias Julius
Per Freem  writes:

> the only 'twist' is that my elt is an instance of a class (MyClass)
> with 3 fields, all numeric. the class is hashable, and so
> my_dict[elt] works well.  the __repr__ and __hash__ methods of my
> class simply return str() representation of self, 

which just calls __str__().  I guess you are aware of that but you
could call self.__str__() directly.  Maybe that saves something when
you do that 10 million times.

> while __str__ just makes everything numeric field into a
> concatenated string:
>
> class MyClass
>
>   def __str__(self):
> return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>   def __repr__(self):
> return str(self)
>
>   def __hash__(self):
> return hash(str(self))

Maybe it would be faster to numerically combine the three fields
instead of hashing the string representation.

Matthias
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-16 Thread pruebauno
On Jan 15, 4:39 pm, Per Freem  wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order
> of tens of millions elements...
>
> i create a dictionary that maps elements of the file that i want to
> count
> to their number of occurs. so i iterate through the file and for each
> line
> extract the elements (simple text operation) and see if it has an
> entry in the dict:
>
> for line in file:
>   try:
>     elt = MyClass(line)# extract elt from line...
>     my_dict[elt] += 1
>   except KeyError:
>     my_dict[elt] = 1
>
> i am using try/except since it is supposedly faster (though i am not
> sure
> about this? is this really true in Python 2.5?).
>
> the only 'twist' is that my elt is an instance of a class (MyClass)
> with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
> works well.
> the __repr__ and __hash__ methods of my class simply return str()
> representation
> of self, while __str__ just makes everything numeric field into a
> concatenated string:
>
> class MyClass
>
>   def __str__(self):
>     return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>   def __repr__(self):
>     return str(self)
>
>   def __hash__(self):
>     return hash(str(self))
>
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.

I am willing to bet a beer that the slow performance has nothing to do
with the dict but is either because of MyClass or the read from disk.
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Scott David Daniels

Paul Rubin wrote:

Per Freem  writes:

2] is there an easy way to have nested defaultdicts? ie i want to say
that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact
that my_dict is a dictionary, whose values are dictionary that map to
ints. but that syntax is not valid.


my_dict = defaultdict(lambda: defaultdict(int))

or
my_dict = defaultdict(functools.partial(defaultdict, int))

--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Paul McGuire
On Jan 15, 5:31 pm, Per Freem  wrote:
> ...the aKeys are very small (less than 100) where
> as the bKeys are the ones that are in the millions.  so in that case,
> doing a Try-Except on aKey should be very efficient, since often it
> will not fail, ...

Do you know the aKeys in advance?  If so, then just initialize the
first layer, and it will never fail.

-- Paul
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Christian Heimes
Per Freem schrieb:
> 1] is Try-Except really slower? my dict actually has two layers, so
> my_dict[aKey][bKeys]. the aKeys are very small (less than 100) where
> as the bKeys are the ones that are in the millions.  so in that case,
> doing a Try-Except on aKey should be very efficient, since often it
> will not fail, where as if I do: "if aKey in my_dict", that statement
> will get executed for each aKey. can someone definitely say whether
> Try-Except is faster or not? My benchmarks aren't conclusive and i
> hear it both ways from several people (though majority thinks
> TryExcept is faster).

A defaultdict is faster than a try/except if the key is missing. A
defaultdict is the canonical way to solve your problem, too.

> 3] more importantly, is there likely to be a big improvement for
> splitting up one big dictionary into several smaller ones? if so, is
> there a straight forward elegant way to implement this? the way i am
> thinking is to just fix a number of dicts and populate them with
> elements. then during retrieval, try the first dict, if that fails,
> try the second, if not the third, etc... but i can imagine how that's
> more likely to lead to bugs / debugging give the way my code is setup
> so i am wondering whether it is really worth it.
> if it can lead to a factor of 2 difference, i will definitely
> implement it -- does anyone have experience with this?

Nested dicts are most likely slower than one large dict. You have to keep in
mind that Python's dicts have been optimized to utilize the CPU cache as
much as possible. Any layer you put on top of a single dict can increase
the number of cache misses and increase the number of function calls.
This may lead to a general slowdown.

I assume that you won't notice a slow down. However my instincts tell me
that you'll hardly get a speedup, either. You have to roll your own
benchmarks to study which algorithm is faster in reality.
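
A rough sketch of such a benchmark (untested; the sizes and keys are made
up), comparing a two-layer dict against a single dict keyed by tuples:

from timeit import Timer

setup = """
nested = {'a': dict((i, i) for i in xrange(100000))}
flat = dict((('a', i), i) for i in xrange(100000))
"""

print min(Timer("nested['a'][12345]", setup).repeat())   # two lookups
print min(Timer("flat[('a', 12345)]", setup).repeat())   # one lookup, tuple key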

Christian

--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Steven D'Aprano
On Thu, 15 Jan 2009 14:49:29 -0800, bearophileHUGS wrote:

> Matimus, your suggestions are all good.
> 
> Try-except is slower than:
> if x in adict: ... else: ...



Not according to my tests.


>>> def tryexcept(D, key):
... try:
... return D[key]
... except KeyError:
... return None
...
>>> def ifinelse(D, key):
... if key in D:
... return D[key]
... else:
... return None
...
>>> from timeit import Timer
>>> setup = "from __main__ import tryexcept, ifinelse; D = {2: 'c'}"
>>>
>>> t1 = Timer('tryexcept(D, 2)', setup)
>>> t2 = Timer('ifinelse(D, 2)', setup)
>>>
>>> min(t1.repeat())  # try...except
0.61699414253234863
>>> min(t2.repeat())  # if...else
0.71856999397277832


Perhaps what you meant to say is that try...except is slower *if the key 
is not found*. And in that case, there's a truly massive difference:


>>> t3 = Timer('tryexcept(D, 5)', setup)
>>> t4 = Timer('ifinelse(D, 5)', setup)
>>>
>>> min(t3.repeat())  # try...except
5.3846139907836914
>>> min(t4.repeat())  # if...else
0.5983281135559082

Which solution is faster on average will depend on the ratio of keys 
found versus keys not found. On my PC, based on the above results, 
if...else becomes faster when just 2% of keys are not found.

But if you can arrange matters so that nearly all keys are found, then 
try...except can be faster.


There's another solution nobody has mentioned (as far as I can see): the 
get method. In my test I call get() from inside a function, rather than 
directly, so that the timing results include the function call overhead 
for all three strategies.


>>> def get(D, key):
... return D.get(key)
...
>>>
>>> setup = "from __main__ import get; D = {2: 'c'}"
>>> t5 = Timer('get(D, 2)', setup)
>>> t6 = Timer('get(D, 5)', setup)
>>>
>>> min(t5.repeat())  # found key
0.88362312316894531
>>> min(t6.repeat())  # missing key
0.87782382965087891




-- 
Steven
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Paul Rubin
Per Freem  writes:
> 2] is there an easy way to have nested defaultdicts? ie i want to say
> that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact
> that my_dict is a dictionary, whose values are dictionary that map to
> ints. but that syntax is not valid.

my_dict = defaultdict(lambda: defaultdict(int))
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Per Freem
thanks to everyone for the excellent suggestions. a few follow up q's:

1] is Try-Except really slower? my dict actually has two layers, so
my_dict[aKey][bKeys]. the aKeys are very small (less than 100) where
as the bKeys are the ones that are in the millions.  so in that case,
doing a Try-Except on aKey should be very efficient, since often it
will not fail, where as if I do: "if aKey in my_dict", that statement
will get executed for each aKey. can someone definitely say whether
Try-Except is faster or not? My benchmarks aren't conclusive and i
hear it both ways from several people (though majority thinks
TryExcept is faster).

2] is there an easy way to have nested defaultdicts? ie i want to say
that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact
that my_dict is a dictionary, whose values are dictionary that map to
ints. but that syntax is not valid.

3] more importantly, is there likely to be a big improvement for
splitting up one big dictionary into several smaller ones? if so, is
there a straight forward elegant way to implement this? the way i am
thinking is to just fix a number of dicts and populate them with
elements. then during retrieval, try the first dict, if that fails,
try the second, if not the third, etc... but i can imagine how that's
more likely to lead to bugs / debugging given the way my code is set up,
so i am wondering whether it is really worth it.
if it can lead to a factor of 2 difference, i will definitely
implement it -- does anyone have experience with this?

On Jan 15, 5:58 pm, Steven D'Aprano  wrote:
> On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote:
> >> is there anything that can be done to speed up this simply code? right
> >> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> >> with lots of RAM (though this is all taking CPU power, not RAM at this
> >> point.)
>
> > class MyClass(object):
> >     # a new style class with slots saves some memory
> >     __slots__ = ("field1", "field2", "field2")
>
> I was curious whether using slots would speed up attribute access.
>
> >>> class Parrot(object):
>
> ...     def __init__(self, a, b, c):
> ...             self.a = a
> ...             self.b = b
> ...             self.c = c
> ...
> >>> class SlottedParrot(object):
>
> ...     __slots__ = 'a', 'b', 'c'
> ...     def __init__(self, a, b, c):
> ...             self.a = a
> ...             self.b = b
> ...             self.c = c
> ...
>
> >>> p = Parrot(23, "something", [1, 2, 3])
> >>> sp = SlottedParrot(23, "something", [1, 2, 3])
>
> >>> from timeit import Timer
> >>> setup = "from __main__ import p, sp"
> >>> t1 = Timer('p.a, p.b, p.c', setup)
> >>> t2 = Timer('sp.a, sp.b, sp.c', setup)
> >>> min(t1.repeat())
> 0.83308887481689453
> >>> min(t2.repeat())
>
> 0.62758088111877441
>
> That's not a bad improvement. I knew that __slots__ was designed to
> reduce memory consumption, but I didn't realise they were faster as well.
>
> --
> Steven

--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Steven D'Aprano
On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote:

>> is there anything that can be done to speed up this simply code? right
>> now it is taking well over 15 minutes to process, on a 3 Ghz machine
>> with lots of RAM (though this is all taking CPU power, not RAM at this
>> point.)
> 
> class MyClass(object):
> # a new style class with slots saves some memory 
> __slots__ = ("field1", "field2", "field2")

I was curious whether using slots would speed up attribute access.

>>> class Parrot(object):
... def __init__(self, a, b, c):
... self.a = a
... self.b = b
... self.c = c
...
>>> class SlottedParrot(object):
... __slots__ = 'a', 'b', 'c'
... def __init__(self, a, b, c):
... self.a = a
... self.b = b
... self.c = c
...
>>>
>>> p = Parrot(23, "something", [1, 2, 3])
>>> sp = SlottedParrot(23, "something", [1, 2, 3])
>>>
>>> from timeit import Timer
>>> setup = "from __main__ import p, sp"
>>> t1 = Timer('p.a, p.b, p.c', setup)
>>> t2 = Timer('sp.a, sp.b, sp.c', setup)
>>> min(t1.repeat())
0.83308887481689453
>>> min(t2.repeat())
0.62758088111877441


That's not a bad improvement. I knew that __slots__ was designed to 
reduce memory consumption, but I didn't realise they were faster as well.



-- 
Steven
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread bearophileHUGS
Matimus, your suggestions are all good.

Try-except is slower than:
if x in adict: ... else: ...
A defaultdict is generally faster (there are some conditions when it's
not faster, but they aren't very common. I think it's when the ratio
of duplicates is really low), creating just a tuple instead of a class
helps a lot, and when the CPU/OS allow it, Psyco too may help some
here.

If the resulting speed isn't enough yet, consider that Python dicts
are quite fast, so you may need a lot of care to write D/C/C++/Clisp
code that's faster for this problem.

I also suspect that when they become very large, Python dicts lose
some of their efficiency. If this is true, you may switch to a new
dictionary every chunk of file, and then merge the dicts at the end. I
don't actually know if this may speed up your Python code even more
(if you experiment with this, I'd like to know if it's false).
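
An untested sketch of that chunk-and-merge idea (the chunk size and the
parse function are made up):

from collections import defaultdict
from itertools import islice

def count_in_chunks(lines, parse, chunk_size=1000000):
    # count each chunk into a fresh dict, then merge the partial counts at the end
    partials = []
    while True:
        chunk = list(islice(lines, chunk_size))
        if not chunk:
            break
        counts = defaultdict(int)
        for line in chunk:
            counts[parse(line)] += 1
        partials.append(counts)

    total = defaultdict(int)
    for counts in partials:
        for key, n in counts.iteritems():
            total[key] += n
    return total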

Bye,
bearophile
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Christian Heimes
> class MyClass
> 
>   def __str__(self):
> return "%s-%s-%s" %(self.field1, self.field2, self.field3)
> 
>   def __repr__(self):
> return str(self)
> 
>   def __hash__(self):
> return hash(str(self))
> 
> 
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)

class MyClass(object):
    # a new-style class with __slots__ saves some memory
    __slots__ = ("field1", "field2", "field3")

    def __hash__(self):
        # In general this is faster than your approach because
        # it requires fewer function calls. However, all fields must
        # be hashable.
        return hash((self.field1, self.field2, self.field3))

    def __eq__(self, other):
        # You MUST provide a rich-compare __eq__ method
        # when you provide your own hash function. It's called
        # when hash(a) == hash(b).
        if not isinstance(other, MyClass):
            return NotImplemented
        return (self.field1 == other.field1
                and self.field2 == other.field2
                and self.field3 == other.field3)

    def __str__(self):
        return "%s-%s-%s" % (self.field1, self.field2, self.field3)

    # faster than your approach because it doesn't require more function
    # lookups. You can omit this alias, too. It's not required.
    __repr__ = __str__


Instead of the try/except clause I recommend collections.defaultdict.
Defaultdict takes one callable as a factory function. Since int() returns
0 it has the same effect as your code.

from collections import defaultdict
my_dict = defaultdict(int)

Christian

--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Jervis Whitley
On Fri, Jan 16, 2009 at 8:39 AM, Per Freem  wrote:

> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order
> of tens of millions elements...
>
>
> for line in file:
>  try:
>elt = MyClass(line)# extract elt from line...
>my_dict[elt] += 1
>  except KeyError:
>my_dict[elt] = 1
>
>
> class MyClass
>
>  def __str__(self):
>return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>  def __repr__(self):
>return str(self)
>
>  def __hash__(self):
>return hash(str(self))
>
>
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.
> --
> http://mail.python.org/mailman/listinfo/python-list
>
Hello,
You can get a large speedup by removing the need to instantiate a new
MyClass instance on each iteration of your loop. Instead, define one
MyClass with an 'interpret' method that is called in place of MyClass();
interpret would return the string '%s-%s-%s' % (self.field1, etc.).

i.e

myclass = MyClass()
interpret = myclass.interpret

for line in file:
    elt = interpret(line)   # extract elt from line...
    try:
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1

The speed up is on the order of 10 on my machine.


Cheers,
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Matimus
On Jan 15, 1:39 pm, Per Freem  wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order
> of tens of millions elements...
>
> i create a dictionary that maps elements of the file that i want to
> count
> to their number of occurs. so i iterate through the file and for each
> line
> extract the elements (simple text operation) and see if it has an
> entry in the dict:
>
> for line in file:
>   try:
>     elt = MyClass(line)# extract elt from line...
>     my_dict[elt] += 1
>   except KeyError:
>     my_dict[elt] = 1
>
> i am using try/except since it is supposedly faster (though i am not
> sure
> about this? is this really true in Python 2.5?).
>
> the only 'twist' is that my elt is an instance of a class (MyClass)
> with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
> works well.
> the __repr__ and __hash__ methods of my class simply return str()
> representation
> of self, while __str__ just makes everything numeric field into a
> concatenated string:
>
> class MyClass
>
>   def __str__(self):
>     return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>   def __repr__(self):
>     return str(self)
>
>   def __hash__(self):
>     return hash(str(self))
>
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.

You can use a tuple instead of a string, which should be a little
quicker:

def __hash__(self):
    return hash((self.field1, self.field2, self.field3))

You could speed it up even more if you just saved a single attribute
"fields" as a tuple to begin with.

Also, you can use defaultdict and get rid of the try/except. I don't
think try/except is slow, but avoiding it will give you a speed up.

from collections import defaultdict
my_dict = defaultdict(int)

for line in file:
  elt = MyClass(line)# extract elt from line...
  my_dict[elt] += 1


You might even consider turning "MyClass" into just a function that
extracts the values from the line and returns a tuple, which should
give you even more of a boost since a tuple is completely implemented
in C.
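
An untested sketch of that last idea (the line format and filename are
made up):

from collections import defaultdict

def parse_line(line):
    # hypothetical line format: three whitespace-separated numeric fields
    f1, f2, f3 = line.split()[:3]
    return int(f1), int(f2), int(f3)    # a plain tuple: hashable, no class overhead

my_dict = defaultdict(int)
for line in open('events.log'):         # made-up filename
    my_dict[parse_line(line)] += 1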


Matt


--
http://mail.python.org/mailman/listinfo/python-list


optimizing large dictionaries

2009-01-15 Thread Per Freem
hello

i have an optimization question about python. i am iterating through
a file and counting the number of repeated elements. the file has on
the order of tens of millions of elements...

i create a dictionary that maps elements of the file that i want to
count to their number of occurrences. so i iterate through the file
and for each line extract the elements (a simple text operation) and
see if it has an entry in the dict:

for line in file:
    try:
        elt = MyClass(line)   # extract elt from line...
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1

i am using try/except since it is supposedly faster (though i am not
sure about this? is this really true in Python 2.5?).

the only 'twist' is that my elt is an instance of a class (MyClass)
with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
works well. the __repr__ and __hash__ methods of my class simply
return the str() representation of self, while __str__ just turns
every numeric field into a concatenated string:

class MyClass:

    def __str__(self):
        return "%s-%s-%s" % (self.field1, self.field2, self.field3)

    def __repr__(self):
        return str(self)

    def __hash__(self):
        return hash(str(self))


is there anything that can be done to speed up this simple code? right
now it is taking well over 15 minutes to process, on a 3 GHz machine
with lots of RAM (though this is all taking CPU power, not RAM, at this
point.)

any general advice on how to optimize large dicts would be great too

thanks for your help.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread John Machin
On Dec 24, 7:04 pm, "Malcolm Greene"  wrote:
> Hi Roger,
>
> By very large dictionary, I mean about 25M items per dictionary. Each
> item is a simple integer whose value will never exceed 2^15.

In Python-speak about dictionaries, an "item" is a tuple (key, value).
I presume from the gross difference between 25M and 2**15
-- more pedantry: 2^15 is 13 :-) -- that you mean that the values
satisfy 0 <= value <= 2**15.

>
> I populate these dictionaries by parsing very large ASCII text files
> containing detailed manufacturing events. From each line in my log file
> I construct one or more keys and increment the numeric values associated
> with these keys using timing info also extracted from each line.
>
> Some of our log files are generated by separate monitoring equipment
> measuring the same process. In theory, these log files should be
> identical, but of course they are not. I'm looking for a way to
> determine the differences between the 2 dictionaries I will create from
> so-called matching sets of log files.

You seem to refer to the dictionaries as part of your requirements,
not as part of the solution. Do you *need* the two dictionaries after
you have obtained the differences?

Let's start from the beginning: You have two log files, each of which
can be abstracted as containing lots of (k, v) data, where each k
describes an event and each v is a compressed 15-bit timestamp. You
want to know for each datum whether it is (a) in both files (b) only
in file1 (c) only in file2. Is that a fair summary?

If so, there's an alternative that doesn't need to use dictionaries.
Each file can be represented by a *list* of 32768 sets. Each set
contains the ks that happened at the corresponding time...
sets[fileno][v].add(k). Once you've built that, you trundle through
the pairs of sets doing set differences etc.
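
An untested sketch of that layout (the line format and filenames are
made up):

def parse(line):
    # hypothetical line format: "<event-id> <timestamp>" with 0 <= timestamp < 2**15
    k, v = line.split()
    return k, int(v)

def build_sets(lines):
    buckets = [set() for _ in xrange(2 ** 15)]   # one set of keys per timestamp value
    for line in lines:
        k, v = parse(line)
        buckets[v].add(k)
    return buckets

sets1 = build_sets(open('log1.txt'))   # made-up filenames
sets2 = build_sets(open('log2.txt'))

for v in xrange(2 ** 15):
    only_in_1 = sets1[v] - sets2[v]
    only_in_2 = sets2[v] - sets1[v]
    in_both = sets1[v] & sets2[v]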

Bear in mind that Python dicts and sets grow as required by doubling
(IIRC) in size and rehashing from old to new. Consequently you get
periodical disturbances which are not usually noticed but may be
noticeable with large amounts of data.

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread Terry Reedy

Marc 'BlackJack' Rintsch wrote:

On Wed, 24 Dec 2008 03:23:00 -0500, python wrote:



collection, I don't see the advantage of using an iterator or a list.
I'm sure I'm missing a subtle point here :)


`keys()` creates a list in memory, `iterkeys()` does not.  With
``set(dict.keys())`` there is a point in time where the dictionary, the 
list, and the set co-exist in memory.  With ``set(dict.iterkeys())`` only 
the set and the dictionary exist in memory.


If you can, consider using 3.0, in which d.keys() is a set-like view of
the keys of d. Same for d.values and d.items. The time and space to
create such a view is O(1), I believe, since they are just alternate
read-only interfaces to the internal dict storage. There is not even a
separate set until you do something like intersect d1.keys() with
d2.keys() or compute d1.keys() - d2.keys(). I think this is an
under-appreciated feature of 3.0 which should be really useful with
large dicts.
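
A small illustration under Python 3.0 (untested here):

# Python 3
d1 = dict(a=1, b=2, c=3)
d2 = dict(b=2, c=30, d=4)

print(d1.keys() - d2.keys())    # keys only in d1 -> {'a'}
print(d2.keys() - d1.keys())    # keys only in d2 -> {'d'}
print(d1.keys() & d2.keys())    # keys in both    -> {'b', 'c'}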


tjr

--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread Scott David Daniels

Gabriel Genellina wrote:

En Wed, 24 Dec 2008 06:23:00 -0200,  escribió:

 ... k1 = set(dict1.iterkeys())
You've got an excellent explanation from Marc Rintsch. (Note that in
Python 3.0 keys() behaves as iterkeys() in previous versions, so the
above code is supposed to be written in Python 2.x)

And, in fact, a dictionary iterates its keys, so:
k1 = set(dict1)
works in 2.4, 2.5, 2.6, and 3.0

--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread bearophileHUGS
Peter Otten:
> >>> a = dict(a=1, b=2, c=3)
> >>> b = dict(b=2, c=30, d=4)
> >>> dict(set(a.iteritems()) ^ set(b.iteritems()))

For larger sets this may be better, may avoid the creation of the
second set:

dict(set(a.iteritems()).symmetric_difference(b.iteritems()))

Bye,
bearophile
--
http://mail.python.org/mailman/listinfo/python-list


Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Roger Binns

Martin wrote:
> I'd think he's talking about in memory SQLite Databases, this way you
> should be quite fast _and_ could dump all that to a persistent
> storage...

I was just talking about regular on disk SQLite databases.  In terms of
priming the pump, you can just open/read/close the whole database file
which will cause the database to be in the operating system cache
buffers, or just let SQLite do its thing.

For anyone who is curious about what Martin is referring to, SQLite does
support the database file being memory (although it isn't a file at that
point).  It has a fake filename of :memory:.  As an example you can copy
the table 'example' to memory using:

  ATTACH ':memory:' as memdb;
  CREATE TABLE memdb.memexample AS select * from example;

As with Python, all this stuff is premature optimization.  Get it
working right first and then try tweaks to improve performance.

Roger

--
http://mail.python.org/mailman/listinfo/python-list


Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Martin
Hi,

2008/12/24  :
> Hi Roger,
>
>> you may want to consider using SQLite
>
> Thank you for your suggestion about looking at SQLite. I haven't
> compared the performance of SQLite to Python dictionaries, but I'm
> skeptical that SQLite would be faster than in-memory Python dictionaries
> for the type of analysis I'm doing.

I'd think he's talking about in memory SQLite Databases, this way you
should be quite fast _and_ could dump all that to a persistent
storage...

regards
martin



-- 
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread Malcolm Greene
Hi Gabriel,

> in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above 
> code is supposed to be written in Python 2.x)

I understand. Thank you.

> note that dict comprehensions require Python 3.0

I'm relieved to know that I didn't miss that feature in my reading of
Python's 2.5/2.6 documentation :)

> You might use instead:
>
> dict((key,(dict1[key],dict2[key])) for key in ...)

Excellent. Thank you.

Regards,
Malcolm


- Original message -
From: "Gabriel Genellina" 
To: python-list@python.org
Date: Wed, 24 Dec 2008 07:10:16 -0200
Subject: Re: Strategy for determing difference between 2 very large
dictionaries

En Wed, 24 Dec 2008 06:23:00 -0200,  escribió:

> Hi Gabriel,
>
> Thank you very much for your feedback!
>
>> k1 = set(dict1.iterkeys())
>
> I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
> to using an iterator vs. a list as the basis for creating a set? I

You've got an excellent explanation from Marc Rintsch. (Note that in
Python 3.0 keys() behaves as iterkeys() in previous versions, so the
above code is supposed to be written in Python 2.x)

>>> can this last step be done via a simple list comprehension?
>
>> Yes; but isn't a dict comprehension more adequate?
>>
>> {key: (dict1[key], dict2[key]) for key in common_keys if
>> dict1[key] != dict2[key]}
>
> Cool!! I'm relatively new to Python and totally missed the ability to
> work with dictionary comprehensions. Yes, your dictionary comprehension
> technique is much better than the list comprehension approach I was
> struggling with. Your dictionary comprehension statement describes
> exactly what I wanted to write.

This time, note that dict comprehensions require Python 3.0 -- so the
code  
above won't work in Python 2.x.
(It's not a good idea to mix both versions in the same post, sorry!)
You might use instead:

dict((key,(dict1[key],dict2[key])) for key in ...)

but it's not as readable.

-- 
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread python
Hi Peter,


If you are not interested in the intermediate results and the dictionary
values are hashable you can get the difference by

>>> a = dict(a=1, b=2, c=3)
>>> b = dict(b=2, c=30, d=4)
>>> dict(set(a.iteritems()) ^ set(b.iteritems()))
{'a': 1, 'c': 3, 'd': 4}


That's very cool! Thanks for sharing that technique.

Regards,
Malcolm


- Original message -
From: "Peter Otten" <__pete...@web.de>
To: python-list@python.org
Date: Wed, 24 Dec 2008 09:46:51 +0100
Subject: Re: Strategy for determing difference between 2 very large
dictionaries

Gabriel Genellina wrote:

> En Wed, 24 Dec 2008 05:16:36 -0200,  escribió:

[I didn't see the original post]

>> I'm looking for suggestions on the best ('Pythonic') way to
>> determine the difference between 2 very large dictionaries
>> containing simple key/value pairs.
>> By difference, I mean a list of keys that are present in the
>> first dictionary, but not the second. And vice versa. And a list
>> of keys in common between the 2 dictionaries whose values are
>> different.
>> The 2 strategies I'm considering are:
>> 1. Brute force: Iterate through first dictionary's keys and
>> determine which keys it has that are missing from the second
>> dictionary. If keys match, then verify that the 2 dictionaries
>> have identical values for the same key. Repeat this process for
>> the second dictionary.
>> 2. Use sets: Create sets from each dictionary's list of keys and
>> use Python's set methods to generate a list of keys present in
>> one dictionary but not the other (for both dictionaries) as well
>> as a set of keys the 2 dictionaries have in common.
> 
> I cannot think of any advantage of the first approach - so I'd use sets.
> 
> k1 = set(dict1.iterkeys())
> k2 = set(dict2.iterkeys())
> k1 - k2 # keys in dict1 not in dict2
> k2 - k1 # keys in dict2 not in dict1
> k1 & k2 # keys in both
> 
>> Using the set
>> of keys in common, compare values across dictionaries to
>> determine which keys have different values (can this last step be
>> done via a simple list comprehension?)

If you are not interested in the intermediate results and the dictionary
values are hashable you can get the difference by

>>> a = dict(a=1, b=2, c=3)
>>> b = dict(b=2, c=30, d=4)
>>> dict(set(a.iteritems()) ^ set(b.iteritems()))
{'a': 1, 'c': 3, 'd': 4}

Peter
--
http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list


Re: Most efficient way to build very large dictionaries

2008-12-24 Thread python
Hi Roger,

> The performance improvements you are seeing with Python over Oracle are 
> exactly the same range people see with SQLite over Oracle. One common usage 
> reported on the SQLite mailing list is people copying data out of Oracle and 
> running their analysis in SQLite because of the performance advantages.

I wasn't aware that SQLite was so fast compared to Oracle. That's great
news. I will definitely take a look at SQLite for my current data
analysis project.

 ... hey, you're the author of
APSW! :)

For those following this thread, see APSW:
http://code.google.com/p/apsw/

> The pragmas tune things like cache sizes. The SQLite default is 2MB, relying 
> on the operating system for caching beyond that. Bumping up that kind of size 
> was my suggestion :-)

I missed that nuance - a side effect of emailing at 4am :)

Thanks again for your help Roger!

Regards,
Malcolm


- Original message -
From: "Roger Binns" 
To: python-list@python.org
Date: Wed, 24 Dec 2008 00:50:49 -0800
Subject: Re: Most efficient way to build very large dictionaries

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

pyt...@bdurham.com wrote:
> Thank you for your suggestion about looking at SQLite. I haven't
> compared the performance of SQLite to Python dictionaries, but I'm
> skeptical that SQLite would be faster than in-memory Python dictionaries
> for the type of analysis I'm doing.

I'd recommend at least trying a test just to see.  As an example SQLite
uses indices which will be faster than Python dicts for some set
operations.  (And if you aren't careful, your various Python based
optimizations will end up duplicating what SQLite does internally anyway
:-)

> Prior to my use of Python, my
> customer used a very expensive Oracle system to analyze their log files.
> My simple Python scripts are 4-20x faster than the Oracle PL/SQL they
> are replacing - and run on much cheaper hardware.

SQLite is not like Oracle or any similar database system.  It does not
operate over the network or similar connection. It is a library in your
process that has an optimized disk storage format (single file) and a
SQL parser that generates bytecode for a special purpose virtual machine
in pretty much the same way CPython operates.  The performance
improvements you are seeing with Python over Oracle are exactly the same
range people see with SQLite over Oracle.  One common usage reported on
the SQLite mailing list is people copying data out of Oracle and running
their analysis in SQLite because of the performance advantages.

> Note: Memory is currently not a concern for me so I don't need SQLite's
> ability to work with data sets larger than my physical memory.

The pragmas tune things like cache sizes.  The SQLite default is 2MB,
relying on the operating system for caching beyond that.  Bumping up
that kind of size was my suggestion :-)

Roger

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklR9+UACgkQmOOfHg372QSMbwCdGS5S2/96fWW8knjfBVqReAfV
AEwAn2Yc+L9BEZgT69OjwtyqxLtifVpU
=mPfy
-END PGP SIGNATURE-

--
http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread Gabriel Genellina

En Wed, 24 Dec 2008 06:23:00 -0200,  escribió:


Hi Gabriel,

Thank you very much for your feedback!


k1 = set(dict1.iterkeys())


I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
to using an iterator vs. a list as the basis for creating a set? I


You've got an excellent explanation from Marc Rintsch. (Note that in Python  
3.0 keys() behaves as iterkeys() in previous versions, so the above code  
is supposed to be written in Python 2.x)



can this last step be done via a simple list comprehension?



Yes; but isn't a dict comprehension more adequate?

[key: (dict1[key], dict2[key]) for key in common_keys if
dict1[key]!=dict2[key]}


Cool!! I'm relatively new to Python and totally missed the ability to
work with dictionary comprehensions. Yes, your dictionary comprehension
technique is much better than the list comprehension approach I was
struggling with. Your dictionary comprehension statement describes
exactly what I wanted to write.


This time, note that dict comprehensions require Python 3.0 -- so the code  
above won't work in Python 2.x.

(It's not a good idea to mix both versions in the same post, sorry!)
You might use instead:

dict((key,(dict1[key],dict2[key])) for key in ...)

but it's not as readable.
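
Spelled out in full for 2.x (a sketch, reusing dict1/dict2 and common_keys
as defined earlier in the thread):

    common_keys = set(dict1.iterkeys()) & set(dict2.iterkeys())
    # key -> (value in dict1, value in dict2), only where the values differ
    diffs = dict((key, (dict1[key], dict2[key]))
                 for key in common_keys
                 if dict1[key] != dict2[key])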

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread python
Hi James,

> For the purpose of perpetuating the annoying pedantry that has made 
> usenet great:
>
> http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists

Great tip! Thank you!

Malcolm
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread James Stroud

Marc 'BlackJack' Rintsch wrote:

On Wed, 24 Dec 2008 03:23:00 -0500, python wrote:


Hi Gabriel,

Thank you very much for your feedback!


k1 = set(dict1.iterkeys())

I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
to using an iterator vs. a list as the basis for creating a set? I
understand that an iterator makes sense if you're working with a large
set of items one at a time, but if you're creating a non-filtered
collection, I don't see the advantage of using an iterator or a list.
I'm sure I'm missing a subtle point here :)


`keys()` creates a list in memory, `iterkeys()` does not.  With
``set(dict.keys())`` there is a point in time where the dictionary, the 
list, and the set co-exist in memory.  With ``set(dict.iterkeys())`` only 
the set and the dictionary exist in memory.


For the purpose of perpetuating the annoying pedantry that has made 
usenet great:



http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists



James


--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com
--
http://mail.python.org/mailman/listinfo/python-list


Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

pyt...@bdurham.com wrote:
> Thank you for your suggestion about looking at SQLite. I haven't
> compared the performance of SQLite to Python dictionaries, but I'm
> skeptical that SQLite would be faster than in-memory Python dictionaries
> for the type of analysis I'm doing.

I'd recommend at least trying a test just to see.  As an example SQLite
uses indices which will be faster than Python dicts for some set
operations.  (And if you aren't careful, your various Python based
optimizations will end up duplicating what SQLite does internally anyway :-)

> Prior to my use of Python, my
> customer used a very expensive Oracle system to analyze their log files.
> My simple Python scripts are 4-20x faster than the Oracle PL/SQL they
> are replacing - and run on much cheaper hardware.

SQLite is not like Oracle or any similar database system.  It does not
operate over the network or similar connection. It is a library in your
process that has an optimized disk storage format (single file) and a
SQL parser that generates bytecode for a special purpose virtual machine
in pretty much the same way CPython operates.  The performance
improvements you are seeing with Python over Oracle are exactly the same
range people see with SQLite over Oracle.  One common usage reported on
the SQLite mailing list is people copying data out of Oracle and running
their analysis in SQLite because of the performance advantages.

> Note: Memory is currently not a concern for me so I don't need SQLite's
> ability to work with data sets larger than my physical memory.

The pragmas tune things like cache sizes.  The SQLite default is 2MB,
relying on the operating system for caching beyond that.  Bumping up
that kind of size was my suggestion :-)
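
As a rough sketch with the standard sqlite3 module (the file name and the
200000-page figure are just placeholders; with the historical 1KB default
page size that asks for roughly 200MB of cache):

    import sqlite3

    conn = sqlite3.connect('analysis.db')
    conn.execute('PRAGMA cache_size = 200000')   # default is ~2000 pages (~2MB)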

Roger

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklR9+UACgkQmOOfHg372QSMbwCdGS5S2/96fWW8knjfBVqReAfV
AEwAn2Yc+L9BEZgT69OjwtyqxLtifVpU
=mPfy
-END PGP SIGNATURE-

--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread Peter Otten
Gabriel Genellina wrote:

> En Wed, 24 Dec 2008 05:16:36 -0200,  escribió:

[I didn't see the original post]

>> I'm looking for suggestions on the best ('Pythonic') way to
>> determine the difference between 2 very large dictionaries
>> containing simple key/value pairs.
>> By difference, I mean a list of keys that are present in the
>> first dictionary, but not the second. And vice versa. And a list
>> of keys in common between the 2 dictionaries whose values are
>> different.
>> The 2 strategies I'm considering are:
>> 1. Brute force: Iterate through first dictionary's keys and
>> determine which keys it has that are missing from the second
>> dictionary. If keys match, then verify that the 2 dictionaries
>> have identical values for the same key. Repeat this process for
>> the second dictionary.
>> 2. Use sets: Create sets from each dictionary's list of keys and
>> use Python's set methods to generate a list of keys present in
>> one dictionary but not the other (for both dictionaries) as well
>> as a set of keys the 2 dictionaries have in common.
> 
> I cannot think of any advantage of the first approach - so I'd use sets.
> 
> k1 = set(dict1.iterkeys())
> k2 = set(dict2.iterkeys())
> k1 - k2 # keys in dict1 not in dict2
> k2 - k1 # keys in dict2 not in dict1
> k1 & k2 # keys in both
> 
>> Using the set
>> of keys in common, compare values across dictionaries to
>> determine which keys have different values (can this last step be
>> done via a simple list comprehension?)

If you are not interested in the intermediate results and the dictionary
values are hashable you can get the difference by

>>> a = dict(a=1, b=2, c=3)
>>> b = dict(b=2, c=30, d=4)
>>> dict(set(a.iteritems()) ^ set(b.iteritems()))
{'a': 1, 'c': 3, 'd': 4}

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread python
Hi Marc,

> `keys()` creates a list in memory, `iterkeys()` does not. With
> ``set(dict.keys())`` there is a point in time where the dictionary, the 
> list, and the set co-exist in memory.  With ``set(dict.iterkeys())``
> only the set and the dictionary exist in memory.

Perfect explanation.

Thank you!
Malcolm


- Original message -
From: "Marc 'BlackJack' Rintsch" 
To: python-list@python.org
Date: 24 Dec 2008 08:30:41 GMT
Subject: Re: Strategy for determing difference between 2 very large 
   dictionaries

On Wed, 24 Dec 2008 03:23:00 -0500, python wrote:

> Hi Gabriel,
> 
> Thank you very much for your feedback!
> 
>> k1 = set(dict1.iterkeys())
> 
> I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
> to using an iterator vs. a list as the basis for creating a set? I
> understand that an iterator makes sense if you're working with a large
> set of items one at a time, but if you're creating a non-filtered
> collection, I don't see the advantage of using an iterator or a list.
> I'm sure I'm missing a subtle point here :)

`keys()` creates a list in memory, `iterkeys()` does not.  With
``set(dict.keys())`` there is a point in time where the dictionary, the 
list, and the set co-exist in memory.  With ``set(dict.iterkeys())``
only 
the set and the dictionary exist in memory.

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list


Re: Most efficient way to build very large dictionaries

2008-12-24 Thread python
Hi Roger,

> you may want to consider using SQLite

Thank you for your suggestion about looking at SQLite. I haven't
compared the performance of SQLite to Python dictionaries, but I'm
skeptical that SQLite would be faster than in-memory Python dictionaries
for the type of analysis I'm doing. Prior to my use of Python, my
customer used a very expensive Oracle system to analyze their log files.
My simple Python scripts are 4-20x faster than the Oracle PL/SQL they
are replacing - and run on much cheaper hardware.

Note: Memory is currently not a concern for me so I don't need SQLite's
ability to work with data sets larger than my physical memory.

Regards,
Malcolm


- Original message -
From: "Roger Binns" 
To: python-list@python.org
Date: Wed, 24 Dec 2008 00:19:56 -0800
Subject: Re: Most efficient way to build very large dictionaries

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

pyt...@bdurham.com wrote:
> I would appreciate your thoughts on whether there are advantages to
> working with a pre-built dictionary and if so, what are the best ways to
> create a pre-loaded dictionary.

Based on this and your other thread, you may want to consider using
SQLite (standard Python module is available for it).  SQL queries are
very similar to set operations and indices take care of performance (by
using more storage).  You also get transactions for free.  If you look
into SQLite pragmas you can get it to use more RAM to improve
performance.

And by using a SQL layer you can later switch to another database should
you need really hard core storage.  It also makes the data available to
other programs in a uniform way.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklR8KgACgkQmOOfHg372QSrHQCfVJzueXVKme8QZcxoLf70BL4K
RL8AoM9QOFykOLrr5QXtpmZ5f7CFHm6e
=zAPG
-END PGP SIGNATURE-

--
http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread Marc 'BlackJack' Rintsch
On Wed, 24 Dec 2008 03:23:00 -0500, python wrote:

> Hi Gabriel,
> 
> Thank you very much for your feedback!
> 
>> k1 = set(dict1.iterkeys())
> 
> I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
> to using an iterator vs. a list as the basis for creating a set? I
> understand that an iterator makes sense if you're working with a large
> set of items one at a time, but if you're creating a non-filtered
> collection, I don't see the advantage of using an iterator or a list.
> I'm sure I'm missing a subtle point here :)

`keys()` creates a list in memory, `iterkeys()` does not.  With
``set(dict.keys())`` there is a point in time where the dictionary, the 
list, and the set co-exist in memory.  With ``set(dict.iterkeys())`` only 
the set and the dictionary exist in memory.

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

pyt...@bdurham.com wrote:
> Can I take advantage of this knowledge to optimize

You do the optimization last :-)  The first thing you need to do is make
sure you have a way of validating you got the correct results.  With 25M
entries it would be very easy for an optimization to get the wrong
results (maybe only one result wrong).  Once you can verify the results
correctness, write the simplest most readable code possible.

Then once your whole program is approaching completion time how long
things take to get an idea of where to optimize.  For example it is
pointless optimizing the loading phase if it only takes 2 minutes out of
a 3 hour runtime.  Finally you can optimize and always have a way of
verifying that the optimizations are correct.  You have the original
code to document what is supposed to be happening as optimized code
tends to get rather obfuscated and unreadable.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklR8mkACgkQmOOfHg372QQvLQCgu6NYNUuhgR06KQunPmIrZ64B
+rsAnAgQOKzMdmonF+zIhsX2r/Xg/72Y
=LFfW
-END PGP SIGNATURE-

--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread python
Hi Gabriel,

Thank you very much for your feedback!

> k1 = set(dict1.iterkeys())

I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
to using an iterator vs. a list as the basis for creating a set? I
understand that an iterator makes sense if you're working with a large
set of items one at a time, but if you're creating a non-filtered
collection, I don't see the advantage of using an iterator or a list.
I'm sure I'm missing a subtle point here :)

>> can this last step be done via a simple list comprehension?

> Yes; but isn't a dict comprehension more adequate?
>
> [key: (dict1[key], dict2[key]) for key in common_keys if  
> dict1[key]!=dict2[key]}

Cool!! I'm relatively new to Python and totally missed the ability to
work with dictionary comprehensions. Yes, your dictionary comprehension
technique is much better than the list comprehension approach I was
struggling with. Your dictionary comprehension statement describes
exactly what I wanted to write.

Regards,
Malcolm


- Original message -
From: "Gabriel Genellina" 
To: python-list@python.org
Date: Wed, 24 Dec 2008 05:46:04 -0200
Subject: Re: Strategy for determing difference between 2 very large
dictionaries

En Wed, 24 Dec 2008 05:16:36 -0200,  escribió:

> I'm looking for suggestions on the best ('Pythonic') way to
> determine the difference between 2 very large dictionaries
> containing simple key/value pairs.
> By difference, I mean a list of keys that are present in the
> first dictionary, but not the second. And vice versa. And a list
> of keys in common between the 2 dictionaries whose values are
> different.
> The 2 strategies I'm considering are:
> 1. Brute force: Iterate through first dictionary's keys and
> determine which keys it has that are missing from the second
> dictionary. If keys match, then verify that the 2 dictionaries
> have identical values for the same key. Repeat this process for
> the second dictionary.
> 2. Use sets: Create sets from each dictionary's list of keys and
> use Python's set methods to generate a list of keys present in
> one dictionary but not the other (for both dictionaries) as well
> as a set of keys the 2 dictionaries have in common.

I cannot think of any advantage of the first approach - so I'd use sets.

k1 = set(dict1.iterkeys())
k2 = set(dict2.iterkeys())
k1 - k2 # keys in dict1 not in dict2
k2 - k1 # keys in dict2 not in dict1
k1 & k2 # keys in both

> Using the set
> of keys in common, compare values across dictionaries to
> determine which keys have different values (can this last step be
> done via a simple list comprehension?)

Yes; but isn't a dict comprehension more adequate?

[key: (dict1[key], dict2[key]) for key in common_keys if  
dict1[key]!=dict2[key]}

(where common_keys=k1&k2 as above)

-- 
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list


Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

pyt...@bdurham.com wrote:
> I would appreciate your thoughts on whether there are advantages to
> working with a pre-built dictionary and if so, what are the best ways to
> create a pre-loaded dictionary.

Based on this and your other thread, you may want to consider using
SQLite (standard Python module is available for it).  SQL queries are
very similar to set operations and indices take care of performance (by
using more storage).  You also get transactions for free.  If you look
into SQLite pragmas you can get it to use more RAM to improve performance.
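
A small sketch of what that looks like with the standard sqlite3 module
(table and file names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect('events.db')
    conn.execute('CREATE TABLE IF NOT EXISTS run1 (key TEXT PRIMARY KEY, total INTEGER)')
    conn.execute('CREATE TABLE IF NOT EXISTS run2 (key TEXT PRIMARY KEY, total INTEGER)')
    # ... bulk load both tables with executemany(), then use set-style queries,
    # e.g. keys present in run1 but missing from run2:
    missing = conn.execute('SELECT key FROM run1 EXCEPT SELECT key FROM run2').fetchall()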

And by using a SQL layer you can later switch to another database should
you need really hard core storage.  It also makes the data available to
other programs in a uniform way.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklR8KgACgkQmOOfHg372QSrHQCfVJzueXVKme8QZcxoLf70BL4K
RL8AoM9QOFykOLrr5QXtpmZ5f7CFHm6e
=zAPG
-END PGP SIGNATURE-

--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-24 Thread Malcolm Greene
Hi Roger,

By very large dictionary, I mean about 25M items per dictionary. Each
item is a simple integer whose value will never exceed 2^15.

I populate these dictionaries by parsing very large ASCII text files
containing detailed manufacturing events. From each line in my log file
I construct one or more keys and increment the numeric values associated
with these keys using timing info also extracted from each line.

Some of our log files are generated by separate monitoring equipment
measuring the same process. In theory, these log files should be
identical, but of course they are not. I'm looking for a way to
determine the differences between the 2 dictionaries I will create from
so-called matching sets of log files.

At this point in time, I don't have concerns about memory as I'm running
my scripts on a dedicated 64-bit server with 32Gb of RAM (but with
budget approval to raise our total RAM to 64Gb if necessary).

My main concern is am I applying a reasonably pythonic approach to my
problem, eg. am I using appropriate python techniques and data
structures? I am also interested in using reasonable techniques that
will provide me with the fastest execution time.

Thank you for sharing your thoughts with me.

Regards,
Malcolm


- Original message -
From: "Roger Binns" 
To: python-list@python.org
Date: Tue, 23 Dec 2008 23:26:49 -0800
Subject: Re: Strategy for determing difference between 2 very large
dictionaries

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

pyt...@bdurham.com wrote:
> Feedback on my proposed strategies (or better strategies) would be
> greatly appreciated.

Both strategies will work but I'd recommend the second approach since it
uses already tested code written by other people - the chances of it
being wrong are far lower than new code.

You also neglected to mention what your concerns are or even what "very
large" is.  Example concerns are memory consumption, cpu consumption,
testability, utility of output (eg as a generator getting each result on
demand or a single list with complete results).  Some people will think
a few hundred entries is large.  My idea of large is a working set
larger than my workstation's 6GB of memory :-)

In general the Pythonic approach is:

1 - Get the correct result
2 - Simple code (developer time is precious)
3 - Optimise for your data and environment

Step 3 is usually not needed.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklR5DUACgkQmOOfHg372QSuWACgp0xrdpW+NSB6qqCM3oBY2e/I
LIEAn080VgNvmEYj47Mm7BtV69J1GwXN
=MKLl
-END PGP SIGNATURE-

--
http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-23 Thread Chris Rebert
On Tue, Dec 23, 2008 at 11:46 PM, Gabriel Genellina
 wrote:

>
> Yes; but isn't a dict comprehension more adequate?
>
> [key: (dict1[key], dict2[key]) for key in common_keys if
> dict1[key]!=dict2[key]}



That initial [ should be a { of course.

Cheers,
Chris

-- 
Follow the path of the Iguana...
http://rebertia.com
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-23 Thread Gabriel Genellina

En Wed, 24 Dec 2008 05:16:36 -0200,  escribió:


I'm looking for suggestions on the best ('Pythonic') way to
determine the difference between 2 very large dictionaries
containing simple key/value pairs.
By difference, I mean a list of keys that are present in the
first dictionary, but not the second. And vice versa. And a list
of keys in common between the 2 dictionaries whose values are
different.
The 2 strategies I'm considering are:
1. Brute force: Iterate through first dictionary's keys and
determine which keys it has that are missing from the second
dictionary. If keys match, then verify that the 2 dictionaries
have identical values for the same key. Repeat this process for
the second dictionary.
2. Use sets: Create sets from each dictionary's list of keys and
use Python's set methods to generate a list of keys present in
one dictionary but not the other (for both dictionaries) as well
as a set of keys the 2 dictionaries have in common.


I cannot think of any advantage of the first approach - so I'd use sets.

k1 = set(dict1.iterkeys())
k2 = set(dict2.iterkeys())
k1 - k2 # keys in dict1 not in dict2
k2 - k1 # keys in dict2 not in dict1
k1 & k2 # keys in both


Using the set
of keys in common, compare values across dictionaries to
determine which keys have different values (can this last step be
done via a simple list comprehension?)


Yes; but isn't a dict comprehension more adequate?

[key: (dict1[key], dict2[key]) for key in common_keys if  
dict1[key]!=dict2[key]}


(where common_keys=k1&k2 as above)

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


Most efficient way to build very large dictionaries

2008-12-23 Thread python
I'm working with some very large dictionaries (about 25M items
per dictionary) that get filled based on data parsed from text
based log files. I'm using Python dictionaries to track the
frequency of specific combinations of events, eg, I build a key
and then add various timing info from the current record to
that key's current value. If the key doesn't exist, I create it
and initialize it to its first value. Most keys will have anywhere
from 2 to 10,000 value increments.
Over time, I've noticed that 90%+ of the keys in my dictionaries
stay constant across daily runs of my program. Can I take
advantage of this knowledge to optimize my dictionary performance
by pre-building my dictionary based on a given list of keys whose
values are all set to 0? Specifically, is there a way to bulk
load a dictionary with keys/initial values that is faster than
individually adding most dictionary entries one at a time?
Here are 2 high level strategies I've been thinking about for
improving the performance of my dictionaries.
1. Create a list of keys from a current dictionary, pickle this
list of keys and then on my next program run, load the pickled
list of keys and somehow convert this list to a dictionary with
all key values set to 0.
2. After I've performed my analysis on a current dictionary,
iterate through this dictionary, setting all values to 0 (is
there a fast way to do this?), and then on my next program run,
load this pickled dictionary of keys and zero values.
I would appreciate your thoughts on whether there are advantages
to working with a pre-built dictionary and if so, what are the
best ways to create a pre-loaded dictionary.
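
To make strategy 1 concrete, here is a rough sketch of what I have in mind
('current_counts' is a stand-in for one of my real dictionaries) - is
dict.fromkeys() the kind of bulk load I should be using?

    import cPickle as pickle

    # end of a run: keep just the keys
    f = open('keys.pickle', 'wb')
    pickle.dump(current_counts.keys(), f, pickle.HIGHEST_PROTOCOL)
    f.close()

    # start of the next run: bulk-create a dictionary with every value at 0
    keys = pickle.load(open('keys.pickle', 'rb'))
    counts = dict.fromkeys(keys, 0)
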
Thank you,
Malcolm
--
http://mail.python.org/mailman/listinfo/python-list


Re: Strategy for determing difference between 2 very large dictionaries

2008-12-23 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

pyt...@bdurham.com wrote:
> Feedback on my proposed strategies (or better strategies) would be
> greatly appreciated.

Both strategies will work but I'd recommend the second approach since it
uses already tested code written by other people - the chances of it
being wrong are far lower than new code.

You also neglected to mention what your concerns are or even what "very
large" is.  Example concerns are memory consumption, cpu consumption,
testability, utility of output (eg as a generator getting each result on
demand or a single list with complete results).  Some people will think
a few hundred entries is large.  My idea of large is a working set
larger than my workstation's 6GB of memory :-)

In general the Pythonic approach is:

1 - Get the correct result
2 - Simple code (developer time is precious)
3 - Optimise for your data and environment

Step 3 is usually not needed.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklR5DUACgkQmOOfHg372QSuWACgp0xrdpW+NSB6qqCM3oBY2e/I
LIEAn080VgNvmEYj47Mm7BtV69J1GwXN
=MKLl
-END PGP SIGNATURE-

--
http://mail.python.org/mailman/listinfo/python-list


Strategy for determing difference between 2 very large dictionaries

2008-12-23 Thread python
I'm looking for suggestions on the best ('Pythonic') way to
determine the difference between 2 very large dictionaries
containing simple key/value pairs.
By difference, I mean a list of keys that are present in the
first dictionary, but not the second. And vice versa. And a list
of keys in common between the 2 dictionaries whose values are
different.
The 2 strategies I'm considering are:
1. Brute force: Iterate through first dictionary's keys and
determine which keys it has that are missing from the second
dictionary. If keys match, then verify that the 2 dictionaries
have identical values for the same key. Repeat this process for
the second dictionary.
2. Use sets: Create sets from each dictionary's list of keys and
use Python's set methods to generate a list of keys present in
one dictionary but not the other (for both dictionaries) as well
as a set of keys the 2 dictionaries have in common. Using the set
of keys in common, compare values across dictionaries to
determine which keys have different values (can this last step be
done via a simple list comprehension?)
Feedback on my proposed strategies (or better strategies) would
be greatly appreciated.
Thank you,
Malcolm
--
http://mail.python.org/mailman/listinfo/python-list


Re: Optimizing size of very large dictionaries

2008-07-31 Thread Terry Reedy



Raymond Hettinger wrote:


Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.


Tons of memory can be saved by not storing the contents of the
record.  Just make an initial pass to identify the digest values of
possible duplicates.  Then use a set to identify actual dups but only
store the records for those whose digest is a possible duplicate:

  bag = collections.defaultdict(int)
  for record in logs:
      bag[sha1(record).digest()] += 1
  possible_dups = set()
  while bag:
      hashval, cnt = bag.popitem()
      if cnt > 1:
          possible_dups.add(hashval)


Since actual counts above 1 are not needed, I believe a bit more memory 
could be saved by computing possible_dups incrementally.  The logic is a 
bit trickier, and seeing the above helps.


once, dups = set(), set()
for record in logs:
    d = sha1(record).digest()
    if d in once:
        once.remove(d)
        dups.add(d)
    elif d not in dups:
        once.add(d)


  seen = set()
  for record in logs:
      if record in seen:
          print 'Duplicate:', record
      elif sha1(record).digest() in possible_dups:
          seen.add(record)

Raymond

P.S.  If the log entries are one liners, maybe it would be better to
use the operating system's sort/uniq filters.
--
http://mail.python.org/mailman/listinfo/python-list



--
http://mail.python.org/mailman/listinfo/python-list


Re: Optimizing size of very large dictionaries

2008-07-31 Thread M.-A. Lemburg

On 2008-07-31 02:29, [EMAIL PROTECTED] wrote:

Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?

I'm working on a project where my available RAM is limited to 2G
and I would like to use very large dictionaries vs. a traditional
database.

Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.


If you don't have a problem with taking a small performance hit,
then I'd suggest to have a look at mxBeeBase, which is an on-disk
dictionary implementation:

http://www.egenix.com/products/python/mxBase/mxBeeBase/

Of course, you could also use a database table for this. Together
with a proper index that should work as well (but it's likely slower
than mxBeeBase).

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jul 31 2008)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
--
http://mail.python.org/mailman/listinfo/python-list


Re: Optimizing size of very large dictionaries

2008-07-31 Thread Raymond Hettinger
> > Are there any techniques I can use to strip a dictionary data
> > structure down to the smallest memory overhead possible?

Sure.  You can build your own version of a dict using
UserDict.DictMixin.  The underlying structure can be as space
efficient as you want.
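
A bare-bones sketch of that idea (a hypothetical class; it assumes the
records are strings, and it shrinks the keys by storing 20-byte binary
SHA1 digests instead of 40-character hex strings):

    import UserDict
    from hashlib import sha1

    class DigestDict(UserDict.DictMixin):
        """Dict-like store keyed by records, held internally as SHA1 digests."""
        def __init__(self):
            self._data = {}                    # 20-byte digest -> value
        def __getitem__(self, record):
            return self._data[sha1(record).digest()]
        def __setitem__(self, record, value):
            self._data[sha1(record).digest()] = value
        def __delitem__(self, record):
            del self._data[sha1(record).digest()]
        def keys(self):
            return self._data.keys()           # note: digests, not the original records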

FWIW, dictionaries automatically become more space efficient at
larger sizes (50,000+ records).  The size quadrupling strategy falls
back to just doubling whenever a dict gets two thirds full.

> > Background: I'm trying to identify duplicate records in very
> > large text based transaction logs. I'm detecting duplicate
> > records by creating a SHA1 checksum of each record and using this
> > checksum as a dictionary key. This works great except for several
> > files whose size is such that their associated checksum
> > dictionaries are too big for my workstation's 2G of RAM.

Tons of memory can be saved by not storing the contents of the
record.  Just make an initial pass to identify the digest values of
possible duplicates.  Then use a set to identify actual dups but only
store the records for those whose digest is a possible duplicate:

  bag = collections.defaultdict(int)
  for record in logs:
      bag[sha1(record).digest()] += 1
  possible_dups = set()
  while bag:
      hashval, cnt = bag.popitem()
      if cnt > 1:
          possible_dups.add(hashval)
  seen = set()
  for record in logs:
      if record in seen:
          print 'Duplicate:', record
      elif sha1(record).digest() in possible_dups:
          seen.add(record)

Raymond

P.S.  If the log entries are one liners, maybe it would be better to
use the operating system's sort/uniq filters.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Optimizing size of very large dictionaries

2008-07-30 Thread Miles
On Wed, Jul 30, 2008 at 8:29 PM,  <[EMAIL PROTECTED]> wrote:
> Background: I'm trying to identify duplicate records in very large text
> based transaction logs. I'm detecting duplicate records by creating a SHA1
> checksum of each record and using this checksum as a dictionary key. This
> works great except for several files whose size is such that their
> associated checksum dictionaries are too big for my workstation's 2G of RAM.

What are the values of this dictionary?

You can save memory by representing the checksums as long integers, if
you're currently using strings.
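
For example (a sketch; 'record' stands in for whatever line or chunk you
are hashing):

    from hashlib import sha1

    def digest_as_long(record):
        # a 160-bit Python long instead of a 40-character hex string
        return int(sha1(record).hexdigest(), 16)

    seen = {}
    seen[digest_as_long('2008-07-30 12:00:01 widget A assembled')] = None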

-Miles
--
http://mail.python.org/mailman/listinfo/python-list


Re: Optimizing size of very large dictionaries

2008-07-30 Thread Gabriel Genellina

En Wed, 30 Jul 2008 21:29:39 -0300, <[EMAIL PROTECTED]> escribió:


Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?

I'm working on a project where my available RAM is limited to 2G
and I would like to use very large dictionaries vs. a traditional
database.

Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.


You could use a different hash algorithm yielding a smaller value (crc32,  
by example, fits on an integer). At the expense of having more collisions,  
and more processing time to check those possible duplicates.
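
For instance (a sketch; collisions are expected, so the colliding records
themselves still have to be compared afterwards):

    import zlib

    def small_key(record):
        # 32-bit checksum, fits in a plain int
        return zlib.crc32(record) & 0xffffffff

    candidates = {}
    for record in ('line one', 'line two', 'line one'):
        # keep the records for each checksum so collisions can be resolved later
        candidates.setdefault(small_key(record), []).append(record)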


--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list

Optimizing size of very large dictionaries

2008-07-30 Thread python
Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?

I'm working on a project where my available RAM is limited to 2G
and I would like to use very large dictionaries vs. a traditional
database.

Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.

Thank you,

Malcolm
--
http://mail.python.org/mailman/listinfo/python-list

Re: Large Dictionaries

2006-06-12 Thread Klaas
Thomas Ganss wrote:
> Klaas schrieb:
>
> > 4. Insert your keys in sorted order.
> This advice is questionable -
> it depends on the at least on the db vendor and probably
> sometimes on the sort method, if inserting pre-sorted
> values is better.

The article I wrote that you quoted named a specific vendor (berkeley
db) for which my advice is unquestionable and well-known.

-Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-09 Thread Duncan Booth
Marcin Ciura wrote:

> Duncan Booth wrote:
>> Marcin Ciura wrote:
>>>See Figure 8 in
>>>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf
>> That isn't what the reference says. It only covers N up to a few
>> thousand. Practical values of N need to at least go up into the
>> millions. 
> 
> Please look at the performance graph of Tokuda's increment sequence.
> You can see that it scales pretty well at least up to 100 million.

Ah sorry, I misread it. It was sequences with several thousand elements, 
which corresponds to N of 100 million.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-09 Thread Marcin Ciura
Duncan Booth wrote:
> Marcin Ciura wrote:
>>See Figure 8 in
>>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf
> That isn't what the reference says. It only covers N up to a few thousand. 
> Practical values of N need to at least go up into the millions.

Please look at the performance graph of Tokuda's increment sequence.
You can see that it scales pretty well at least up to 100 million.
   Marcin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-09 Thread Duncan Booth
Marcin Ciura wrote:

> Tim Peters wrote:
>> shellsort is much more a refinement of insertion-sort than of
> bubblesort, but is not an O(N log N) algorithm regardless.
> 
> With a judiciously chosen increment sequence,
> the number of comparisons made by shellsort
> *is* of the order N log N for practical values of N.
> See Figure 8 in
> http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf

That isn't what the reference says. It only covers N up to a few thousand. 
Practical values of N need to at least go up into the millions.

Read the summary:
> Using sequential analysis, the search for optimal increment sequences
> for Shellsort was accelerated enough to find them for arrays up to
> several thousand elements.
...
> However, the sequences obtained so far are too short to draw a
> reliable conclusion whether an appropriate sequence of increments can
> make Shellsort a O(N logN) sorting method on the average. Some
> hypotheses may be possible when the sequences are prolonged to sort
> arrays of about 10^5 elements. 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-09 Thread Marcin Ciura
Tim Peters wrote:
> shellsort is much more a refinement of insertion-sort than of
> bubblesort, but is not an O(N log N) algorithm regardless.

With a judiciously chosen increment sequence,
the number of comparisons made by shellsort
*is* of the order N log N for practical values of N.
See Figure 8 in
http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf
   Marcin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-08 Thread Tim Peters
[Tim Peters]
>> ...
>> O(N log N) sorting algorithms helped by pre-existing order are
>> uncommon, unless they do extra work to detect and exploit
>> pre-existing order.

[Lawrence D'Oliveiro]
> Shellsort works well with nearly-sorted data. It's basically a smarter
> version of bubblesort with much improved efficiency. It's also very
> compact to implement.

shellsort is much more a refinement of insertion-sort than of
bubblesort, but is not an O(N log N) algorithm regardless.  A nice,
succinct summary of what was known about shellsort's behavior as of
Sedgewick's 1996 "Analysis of Shellsort and Related Algorithms" can be
found here:

http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/shell/shellen.htm
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-08 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>,
 "Iain King" <[EMAIL PROTECTED]> wrote:

>Lawrence D'Oliveiro wrote:
>> In article <[EMAIL PROTECTED]>,
>>  Scott David Daniels <[EMAIL PROTECTED]> wrote:
>>
>> >For example, time timsort (Python's internal sort) on pre-sorted
>> >data; you'll find it is handled faster than random data.
>>
>> But isn't that how a reasonable sorting algorithm should behave? Less
>> work to do if the data is already sorted?
>
>An already sorted list can be pathological for Quicksort, depending on
>how you code it.

I did say "reasonable". :)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-08 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>,
 "Tim Peters" <[EMAIL PROTECTED]> wrote:

>For example, the O(N log N) heapsort is unreasonable compared to the
>O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
>case for heapsort, but a good case for bubblesort)?  O(N log N)
>sorting algorithms helped by pre-existing order are uncommon, unless
>they do extra work to detect and exploit pre-existing order.

Shellsort works well with nearly-sorted data. It's basically a smarter 
version of bubblesort with much improved efficiency. It's also very 
compact to implement.
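
A compact sketch (the gap sequence here is Ciura's, from the paper cited
elsewhere in this thread; it is tuned for modest N, not millions):

    def shellsort(seq):
        gaps = [701, 301, 132, 57, 23, 10, 4, 1]
        for gap in gaps:
            # gapped insertion sort: nearly-sorted input needs few moves
            for i in range(gap, len(seq)):
                value = seq[i]
                j = i
                while j >= gap and seq[j - gap] > value:
                    seq[j] = seq[j - gap]
                    j -= gap
                seq[j] = value
        return seq
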
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Jim Segrave
In article <[EMAIL PROTECTED]>,
Tim Peters <[EMAIL PROTECTED]> wrote:
>[Jim Segrave]
>> Actually, presorted lists are not a bad case for heapsort - it's quite
>> immune to any existing order or lack thereof,
>
>Write a heapsort and time it.  It's not a difference in O() behavior,
>but more memory movement is required for a sorted list because
>transforming the list into a max-heap at the start requires shuffling
>the largest elements from the end of the list to its start.  Initially
>reverse-sorted is a "good case" for heapsort in the same sense
>(because the list is already a max-heap then; initially sorted is as
>far from being a max-heap as is possible).  "good" and "bad" are both
>O(N log N) here, though.

I did implement a crude heapsort before posting this and measured the
number of comparisions and the number of swaps.

==

#!/usr/local/bin/python

import random

class HeapSorter(object):
    def __init__(self, seq):
        self.compares = 0
        self.swaps = 0
        self.length = len(seq)
        self.seq = seq

    def __sift(self, i):
        # Sift the element at index i down until the max-heap property holds.
        while True:
            j = 2 * i + 1
            if j + 1 < self.length:
                self.compares += 1
                if self.seq[j + 1] > self.seq[j]:
                    j = j + 1
            if j >= self.length:
                return

            self.compares += 1
            if self.seq[i] > self.seq[j]:
                return

            self.swaps += 1
            self.seq[i], self.seq[j] = (self.seq[j], self.seq[i])
            i = j

    def __makeheap(self):
        # Build a max-heap bottom-up.
        for i in range(self.length / 2, -1, -1):
            self.__sift(i)

    def sort(self):
        # Repeatedly move the current maximum to the end and re-sift.
        self.__makeheap()
        for i in range(self.length - 1, -1, -1):
            self.swaps += 1
            self.seq[i], self.seq[0] = (self.seq[0], self.seq[i])
            self.length -= 1
            self.__sift(0)

if __name__ == '__main__':

    s = list(range(10**5))
    random.shuffle(s)

    heap = HeapSorter(s)
    heap.sort()
    print "Random list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)
    heap = HeapSorter(s)
    heap.sort()
    print "Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)

    s.reverse()
    heap = HeapSorter(s)
    heap.sort()
    print "Reverse Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)

==

Which gave the following results:

/usr/home/jes% python ./heapsort
Random list: compares 3019969, swaps 1575263
Sorted list: compares 3112517, swaps 1650855
Reverse Sorted list: compares 2926640, swaps 1497435

Assuming compares and swaps dominate other bookkeeping, then heapsort
is quite stable in its performance, although the constant in the O(N log N)
is much larger than for other sorts. 



-- 
Jim Segrave   ([EMAIL PROTECTED])

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Tim Peters
[Jim Segrave]
> Actually, presorted lists are not a bad case for heapsort - it's quite
> immune to any existing order or lack thereof,

Write a heapsort and time it.  It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a max-heap at the start requires shuffling
the largest elements from the end of the list to its start.  Initially
reverse-sorted is a "good case" for heapsort in the same sense
(because the list is already a max-heap then; initially sorted is as
far from being a max-heap as is possible).  "good" and "bad" are both
O(N log N) here, though.

BTW, most people have never heard of Smoothsort, which was Dijkstra's
excruciating attempt to make heapsort exploit pre-existing order:

http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796.PDF

That's one of a hundred algorithms I rejected for Python ;-)

> whereas some other sorts, quicksort being a prime example, require
> great care to avoid pathological cases.

Indeed, Python gave up on quicksort for that reason.  While the
samplesort hybrid that came between quicksort and the current sort
also had theoretical O(N**2) worst-case behavior, it was exceedingly
difficult to create such an input, and (unlike as for every quicksort
variant Python tried) no real-life case of undue sloth was ever
reported against it.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Jim Segrave
In article <[EMAIL PROTECTED]>,
Tim Peters <[EMAIL PROTECTED]> wrote:
>[Scott David Daniels]
>>> For example, time timsort (Python's internal sort) on pre-sorted
>>> data; you'll find it is handled faster than random data.
>
>O(N) vs O(N log N), in fact.
>
>[Lawrence D'Oliveiro]
>> But isn't that how a reasonable sorting algorithm should behave? Less
>> work to do if the data is already sorted?
>
>For example, the O(N log N) heapsort is unreasonable compared to the
>O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
>case for heapsort, but a good case for bubblesort)?  O(N log N)
>sorting algorithms helped by pre-existing order are uncommon, unless
>they do extra work to detect and exploit pre-existing order.

Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof, whereas some other
sorts, quicksort being a prime example, require great care to
avoid pathological cases.



-- 
Jim Segrave   ([EMAIL PROTECTED])

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Tim Peters
[Scott David Daniels]
>> For example, time timsort (Python's internal sort) on pre-sorted
>> data; you'll find it is handled faster than random data.

O(N) vs O(N log N), in fact.

[Lawrence D'Oliveiro]
> But isn't that how a reasonable sorting algorithm should behave? Less
> work to do if the data is already sorted?

For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)?  O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.

Python's current sorting algorithm exploits pre-existing order of many
kinds.  For example, take a sorted list, "cut it in half, and riffle
shuffle the two halves together"; e.g.,

[0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

That turns out to be a supernaturally good case too.
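
For anyone who wants to reproduce that case, a throwaway sketch that
builds the riffle of a sorted list of any length:

    def riffle(sorted_list):
        half = len(sorted_list) // 2
        a, b = sorted_list[:half], sorted_list[half:]
        out = []
        for x, y in zip(a, b):
            out.extend([x, y])
        out.extend(b[len(a):])   # leftover element when the length is odd
        return out

    print riffle(range(16))
    # -> [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
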
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Iain King

Lawrence D'Oliveiro wrote:
> In article <[EMAIL PROTECTED]>,
>  Scott David Daniels <[EMAIL PROTECTED]> wrote:
>
> >For example, time timsort (Python's internal sort) on pre-sorted
> >data; you'll find it is handled faster than random data.
>
> But isn't that how a reasonable sorting algorithm should behave? Less
> work to do if the data is already sorted?

An already sorted list can be pathological for Quicksort, depending on
how you code it.

Iain

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Aahz
In article <[EMAIL PROTECTED]>,
Lawrence D'Oliveiro  <[EMAIL PROTECTED]> wrote:
>In article <[EMAIL PROTECTED]>,
> Scott David Daniels <[EMAIL PROTECTED]> wrote:
>>
>>For example, time timsort (Python's internal sort) on pre-sorted
>>data; you'll find it is handled faster than random data.
>
>But isn't that how a reasonable sorting algorithm should behave? Less 
>work to do if the data is already sorted?

Read some of the old discussions in the python-dev archives.
-- 
Aahz ([EMAIL PROTECTED])   <*> http://www.pythoncraft.com/

"I saw `cout' being shifted "Hello world" times to the left and stopped
right there."  --Steve Gonedes
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Steve Holden
Lawrence D'Oliveiro wrote:
> In article <[EMAIL PROTECTED]>,
>  Scott David Daniels <[EMAIL PROTECTED]> wrote:
> 
> 
>>For example, time timsort (Python's internal sort) on pre-sorted
>>data; you'll find it is handled faster than random data.
> 
> 
> But isn't that how a reasonable sorting algorithm should behave? Less 
> work to do if the data is already sorted?

Isn't that just your definition of "reasonable"?

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd  http://www.holdenweb.com
Love me, love my blog  http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-06-05 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>,
 Scott David Daniels <[EMAIL PROTECTED]> wrote:

>For example, time timsort (Python's internal sort) on pre-sorted
>data; you'll find it is handled faster than random data.

But isn't that how a reasonable sorting algorithm should behave? Less 
work to do if the data is already sorted?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-29 Thread Martin v. Löwis
Roy Smith wrote:
> My guess would be that each resize grows the table by a constant
> factor, which IIRC, works out to amortized O(n). 

Right. For <50000 entries, the size is multiplied by 4 each time,
and doubled each time for large dictionaries.

> It's not as good as
> creating the dict the right size in the first place, but it's really
> not that bad.  Somewhat more troubling is that it can lead to memory
> fragmentation problems.

Depends on your operating system, of course. You get over a page size
fairly quickly, and then many systems use anonymous mappings, where
a range of pages gets mapped from swap on allocation, and unmapped
on deallocation. So while this can cause address space fragmentation,
it doesn't cause memory fragmentation (i.e. memory that is allocated
by the process but can't be used).

> I don't understand why Python doesn't have a way to give a size hint
> when creating a dict.  It seems so obvious to be able to say "d = dict
> (size=10*1000*1000)" if you know beforehand that's how many keys
> you're going to add.

There is no need for it. If dicts don't work well in some cases, these
cases should be investigated and fixed, instead of working around the
problem by loading the problem onto the developer.

As you said, it *should* have linear time, with the number of
insertions. If it doesn't (and it appears not to), that may be due to
hash collisions, or due to the time for computing hashes not being
constant.

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-29 Thread Scott David Daniels
Thomas Ganss wrote:
> Klaas schrieb:
>> 4. Insert your keys in sorted order.
> This advice is questionable -
> 
> My gut feeling on this matter is:
> IF the insert times of pre-sorted values is far better
> than the times of unsorted values, there is a chance
> that the resulting tree is unbalanced: only 1 compare
> operation after insert and no re-balancing of the tree.
> 
> re-indexing will probably give you far better access times
> in this case. Another option is to drop non RI indexes used only
> for query optimization and recreate them after batch insert.

Don't use your gut for such issues.  Pre-sorted data is such
a common special case (in fact, the only easily describable
special case) that many systems handle this specially.  For
example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.
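
A quick way to see it (a sketch; the exact numbers will vary by machine
and Python version):

    import random, time

    def time_sort(seq, repeat=20):
        start = time.clock()
        for _ in range(repeat):
            sorted(seq)          # sorted() works on a copy, the input is untouched
        return time.clock() - start

    data = [random.random() for _ in range(100000)]
    print "random input:     %.3fs" % time_sort(data)
    print "pre-sorted input: %.3fs" % time_sort(sorted(data))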

If you are using your gut without testing, go ahead and
presort.  In any case, reading documents and testing beats
gut feels every time.

--Scott David Daniels
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-28 Thread Thomas Ganss
Klaas schrieb:

> 4. Insert your keys in sorted order.
This advice is questionable -
it depends on the at least on the db vendor and probably
sometimes on the sort method, if inserting pre-sorted
values is better.

My gut feeling on this matter is:
IF the insert times of pre-sorted values is far better
than the times of unsorted values, there is a chance
that the resulting tree is unbalanced: only 1 compare
operation after insert and no re-balancing of the tree.

re-indexing will probably give you far better access times
in this case. Another option is to drop non RI indexes used only
for query optimization and recreate them after batch insert.

my 0.02 EUR

thomas
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-24 Thread Klaas
Chris:
> class StorageBerkeleyDB(StorageTest):
>     def runtest(self, number_hash):
>         db = bsddb.hashopen(None, flag='c', cachesize=8192)
>         for (num, wildcard_digits) in number_hash.keys():
>             key = '%d:%d' % (num, wildcard_digits)
>             db[key] = None
>         db.close()

BDBs can accomplish what you're looking to do, but they need to be
tuned carefully.  I won't get into too many details here, but you have
a few fatal flaws in that code.

1. 8KB of cache is _pathetic_.  Give it a few hundred megs.  This is by
far your biggest problem.
2. Use BTREE unless you have a good reason to use DBHASH
3. Use proper bdb env creation instead of the hash_open apis.
4. Insert your keys in sorted order.
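
As a rough, untested sketch of all four points (cache size and paths are
only illustrative; number_hash stands for the dict built by the test
script posted elsewhere in this thread):

from bsddb import db                        # the low-level pybsddb API

env = db.DBEnv()
env.set_cachesize(0, 256 * 1024 * 1024)     # point 1: a few hundred megs
env.open('/tmp/bdb-env',                    # point 3: explicit environment
         db.DB_CREATE | db.DB_INIT_MPOOL)   # (the directory must exist)

d = db.DB(env)
d.open('numbers.db', dbtype=db.DB_BTREE, flags=db.DB_CREATE)  # point 2

keys = sorted('%d:%d' % k for k in number_hash)  # point 4: sorted inserts
for key in keys:
    d.put(key, '')
d.close()
env.close()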

-Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-24 Thread Klaas
Chris:
> Berkeley DB is great for accessing data by key for things already
> stored on disk (i.e. read access), but write performance for each
> key-value pair is slow due to it being careful about flushing
> writes to disk by default. 

This is absolutely false.

-Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-18 Thread Claudio Grondi
Chris Foote wrote:
> Richie Hindle wrote:
>> [Chris]
>>> Has anyone written a fast hash module which is more optimal for
>>> large datasets ?
>>
>> PyJudy might be what you're looking for, though I've never used it:
>>
>>   http://www.dalkescientific.com/Python/PyJudy.html
>>
>> "Judy's key benefits are scalability, high performance, and memory
>> efficiency. A Judy array is extensible and can scale up to a very large
>> number of elements, bounded only by machine memory." ... "PyJudy arrays
>> are similar to Python dictionaries and sets."
> 
> Thanks for the suggestion Richie.  PyJudy works brilliantly :-)
> 
> Cheers,
> Chris
It seems that there is no Microsoft Windows version of Judy available,
so PyJudy won't work on Windows - am I right?

Claudio
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-18 Thread Claudio Grondi
Chris Foote wrote:
> Claudio Grondi wrote:
>> Chris Foote wrote:
>>> Klaas wrote:
>>>
> 22.2s  20m25s[3]

 20m to insert 1m keys?  You are doing something wrong.
>>>
>>> I've put together some simplified test code, but the bsddb
>>> module gives 11m for 1M keys:
>>>
>> I have run your code for the bsddb on my P4 2.8 GHz and have got:
>> Number generator test for 100 number ranges
>> with a maximum of 3 wildcard digits.
>> Wed May 17 16:34:06 2006 dictionary population started
>> Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
>> Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
>> Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, 
>> duration 104.3s
>  >
>> Surprising here, that the dictionary population gives the same time, 
>> but the BerkeleyDB inserts the records 6 times faster on my computer 
>> than on yours. I am running Python 2.4.2 on Windows XP SP2, and you?
> 
> Fedora core 5 with ext3 filesystem.  The difference will be due to
> the way that Windows buffers writes for the filesystem you're using
> (it sounds like you're using a FAT-based file system).
Ok, according to the Windows task manager, the Python process
reads/writes around 7 GByte(!) of data to the file system during the run
of the BerkeleyDB test, and the hard drive is continuously busy, whereas
the size of the file I found in the Temp directory always stays below
20 MByte. The hard drive access is probably the main reason for losing
time - so here is a question for the BerkeleyDB experts:

Can BerkeleyDB, via the Python bsddb3 interface, be tuned to use only
RAM? Or, since BerkeleyDB can scale to much larger amounts of data, does
it make little sense to try to force it into RAM?
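
Something like the following is what I imagine such tuning could look
like (an untested sketch; the cache size and path are made up): a
private environment with a big memory pool plus an anonymous database
should stay in the cache until it overflows to temporary files.

from bsddb3 import db

env = db.DBEnv()
env.set_cachesize(0, 512 * 1024 * 1024)     # 512 MB memory pool
env.open('/tmp/bdb-ram',                    # directory must exist
         db.DB_CREATE | db.DB_PRIVATE | db.DB_INIT_MPOOL)

d = db.DB(env)
d.open(None, dbtype=db.DB_BTREE, flags=db.DB_CREATE)  # None -> temporary db
d.put('61880000000:0', '')
d.close()
env.close()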

Chris, is a RAM disk perhaps the right way to go here, to save the time
lost accessing the file stored in the file system on the hard drive?

The RAM requirements, according to the Windows XP task manager, are below
100 MByte. I am using the NTFS file system (yes, I know that FAT is in
some configurations faster than NTFS) and XP Professional SP2 without
any tuning of file system caching. The CPU is 100% busy.

What CPU and RAM (SIMM, DDR, DDR2) do you have?  I have 2 GByte of fast
DDR PC400/3200 dual-channel RAM. It seems that you are still not getting
results within the range others experience running your code, so I
suppose it has something to do with the hardware you are using.

> 
>>> Number generator test for 100 number ranges
>>> with a maximum of 3 wildcard digits.
>>> Wed May 17 22:18:17 2006 dictionary population started
>>> Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
>>> Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
>>> Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, 
>>> duration 665.6s
>>> Wed May 17 22:29:33 2006 StorageSQLite population started
>>> Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 
>>> 65.5s
>> As I don't have SQLite installed, it is interesting to see if the 
>> factor 10 in the speed difference between BerkeleyDB and SQLite can be 
>> confirmed by someone else.
>> Why is SQLite faster here? I suppose, that SQLite first adds all the 
>> records and builds the index afterwards with all the records there 
>> (with db.commit()).
> 
> SQLite is way faster because BerkeleyDB always uses a disk file,
> and SQLite is in RAM only.
One of the reasons I have my eye on BerkeleyDB is that it claims to
scale to a huge amount (terabytes) of data, doesn't need as much RAM as
a Python dictionary, and doesn't require saving/loading a pickled
version of the data (i.e. here the dictionary) from/to RAM in order to
work with it.
I guess that in your case BerkeleyDB is, for the reasons named, probably
the right way to go, unless your data stays small and the Python
dictionary holding it will always fit into RAM.

Now I am curious to know which path you have decided to take, and why.

Claudio
> 
> Cheers,
> Chris

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-17 Thread Chris Foote
Claudio Grondi wrote:
> Chris Foote wrote:
>> Klaas wrote:
>>
 22.2s  20m25s[3]
>>>
>>> 20m to insert 1m keys?  You are doing something wrong.
>>
>> I've put together some simplified test code, but the bsddb
>> module gives 11m for 1M keys:
>>
> I have run your code for the bsddb on my P4 2.8 GHz and have got:
> Number generator test for 100 number ranges
> with a maximum of 3 wildcard digits.
> Wed May 17 16:34:06 2006 dictionary population started
> Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
> Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
> Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 
> 104.3s
 >
> Surprising here, that the dictionary population gives the same time, but 
> the BerkeleyDB inserts the records 6 times faster on my computer than on 
> yours. I am running Python 2.4.2 on Windows XP SP2, and you?

Fedora core 5 with ext3 filesystem.  The difference will be due to
the way that Windows buffers writes for the filesystem you're using
(it sounds like you're using a FAT-based file system).

>> Number generator test for 100 number ranges
>> with a maximum of 3 wildcard digits.
>> Wed May 17 22:18:17 2006 dictionary population started
>> Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
>> Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
>> Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, 
>> duration 665.6s
>> Wed May 17 22:29:33 2006 StorageSQLite population started
>> Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s
> As I don't have SQLite installed, it is interesting to see if the factor 
> 10 in the speed difference between BerkeleyDB and SQLite can be 
> confirmed by someone else.
> Why is SQLite faster here? I suppose, that SQLite first adds all the 
> records and builds the index afterwards with all the records there (with 
> db.commit()).

SQLite is way faster because BerkeleyDB always uses a disk file,
and SQLite is in RAM only.

Cheers,
Chris
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-17 Thread Chris Foote
Richie Hindle wrote:
> [Chris]
>> Has anyone written a fast hash module which is more optimal for
>> large datasets ?
> 
> PyJudy might be what you're looking for, though I've never used it:
> 
>   http://www.dalkescientific.com/Python/PyJudy.html
> 
> "Judy's key benefits are scalability, high performance, and memory
> efficiency. A Judy array is extensible and can scale up to a very large
> number of elements, bounded only by machine memory." ... "PyJudy arrays
> are similar to Python dictionaries and sets."

Thanks for the suggestion Richie.  PyJudy works brilliantly :-)

Cheers,
Chris
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-17 Thread Claudio Grondi
Chris Foote wrote:
> Klaas wrote:
> 
>>> 22.2s  20m25s[3]
>>
>>
>> 20m to insert 1m keys?  You are doing something wrong.
> 
> 
> Hi Mike.
> 
> I've put together some simplified test code, but the bsddb
> module gives 11m for 1M keys:
> 
I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 100 number ranges
 with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 
104.3s
Surprisingly, the dictionary population takes the same time here, but
BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2 - and you?

> Number generator test for 100 number ranges
> with a maximum of 3 wildcard digits.
> Wed May 17 22:18:17 2006 dictionary population started
> Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
> Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
> Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 
> 665.6s
> Wed May 17 22:29:33 2006 StorageSQLite population started
> Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s
As I don't have SQLite installed, it would be interesting to see whether
the factor-of-10 speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.
Why is SQLite faster here? I suppose that SQLite first adds all the
records and builds the index afterwards, once all the records are there
(with db.commit()).
Can the same be done in BerkeleyDB, or does BerkeleyDB not support
inserting records without updating the index on every single insert
command? If it does, how?
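
For what it's worth, the test script already follows the pattern that
makes SQLite fast: all inserts go into a ':memory:' database inside one
transaction, with a single db.commit() at the end. A sketch of the same
pattern using the stdlib sqlite3 module (an assumption on my side, since
the script itself uses the older pysqlite API; number_hash is the dict
from the script):

import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')
# all rows accumulate in one transaction; commit once at the end
cur.executemany('INSERT INTO numbers (num, wd) VALUES (?, ?)',
                number_hash.keys())
conn.commit()
conn.close()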

Claudio
> 
> test code is attached.
> 
>> With bdb's it is crucial to insert keys in bytestring-sorted order.
> 
> 
> For the bsddb test, I'm using a plain string.  (The module docs list a
> string being the only datatype supported for both keys & values).
> 
>> Also, be sure to give it a decent amount of cache.
> 
> 
> The bsddb.hashopen() factory seems to have a bug in this regard; if you
> supply a cachesize argument, then it barfs:
> 
> 
>   File "bsddb-test.py", line 67, in runtest
> db = bsddb.hashopen(None, flag='c', cachesize=8192)
>   File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
> if cachesize is not None: d.set_cachesize(0, cachesize)
> bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- 
> DB->set_cachesize: method not permitted when environment specified')
> 
> 
> I'll file a bug report on this if it isn't already fixed.
> 
> Cheers,
> Chris
> 
> 
> 
> 
> import bsddb
> import random
> import time
> import sqlite
> 
> import psyco
> psyco.full()
> 
> class WallClockTimer(object):
>     '''Simple timer for measuring tests.'''
>     def __init__(self, msg):
>         self.start = time.time()
>         self.msg = msg
>         print time.ctime(self.start), self.msg, 'started'
> 
>     def stop(self):
>         end = time.time()
>         duration = end - self.start
>         print time.ctime(end), self.msg, 'stopped, duration %.1fs' % duration
> 
> class NumberGen(object):
>     '''Phone number generator for testing different methods of storage.'''
>     def __init__(self, range_start, range_end, n, max_wildcards):
> 
>         print '''Number generator test for %d number ranges
> with a maximum of %d wildcard digits.''' % (n, max_wildcards)
> 
>         wildcards = range(0, max_wildcards + 1)
>         # generate n unique numbers and store as keys in number_hash
>         timer = WallClockTimer('dictionary population')
>         self.number_hash = {}
> 
>         for i in xrange(0, n):
>             unique = False
>             while not unique:
>                 wildcard_digits = self.gen_wildcard_digits(wildcards)
>                 num = self.gen_number(range_start, range_end)
>                 if wildcard_digits > 0:
>                     num /= (10 ** wildcard_digits)
>                 key = (num, wildcard_digits)
>                 if self.number_hash.has_key(key):
>                     unique = False
>                 else:
>                     unique = True
>                     self.number_hash[key] = None
>         timer.stop()
> 
>     def gen_wildcard_digits(self, wildcards):
>         return random.choice(wildcards)
> 
>     def gen_number(self, start, end):
>         return int(random.uniform(start, end))
> 
>     def storage_test(self, StorageTestClass):
>         test = StorageTestClass(self.number_hash)
> 
> class StorageTest(object):
>     '''base class for testing storage.  Derive a test
>     class and provide your own runtest() method.'''
>     def __init__(self, number_hash):
>         timer = W

Re: Large Dictionaries

2006-05-17 Thread Chris Foote

Klaas wrote:

22.2s  20m25s[3]


20m to insert 1m keys?  You are doing something wrong.


Hi Mike.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

Number generator test for 100 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

test code is attached.


With bdb's it is crucial to insert keys in bytestring-sorted order.


For the bsddb test, I'm using a plain string.  (The module docs list
strings as the only datatype supported for both keys & values.)


Also, be sure to give it a decent amount of cache.


The bsddb.hashopen() factory seems to have a bug in this regard; if you
supply a cachesize argument, then it barfs:

...
  File "bsddb-test.py", line 67, in runtest
db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- DB->set_cachesize: method not permitted when environment 
specified')



I'll file a bug report on this if it isn't already fixed.
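
In the meantime, a possible workaround (an untested sketch): skip
hashopen() and put the cache on a DB object that has no environment
attached, which appears to be the one case where DB->set_cachesize is
permitted:

from bsddb import db

d = db.DB()                             # no DBEnv attached
d.set_cachesize(0, 256 * 1024 * 1024)   # must be called before open()
d.open(None, dbtype=db.DB_HASH, flags=db.DB_CREATE)
d.put('61880000000:0', '')
d.close()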

Cheers,
Chris
import bsddb
import random
import time
import sqlite

import psyco
psyco.full()

class WallClockTimer(object):
    '''Simple timer for measuring tests.'''
    def __init__(self, msg):
        self.start = time.time()
        self.msg = msg
        print time.ctime(self.start), self.msg, 'started'

    def stop(self):
        end = time.time()
        duration = end - self.start
        print time.ctime(end), self.msg, 'stopped, duration %.1fs' % duration

class NumberGen(object):
    '''Phone number generator for testing different methods of storage.'''
    def __init__(self, range_start, range_end, n, max_wildcards):

        print '''Number generator test for %d number ranges
with a maximum of %d wildcard digits.''' % (n, max_wildcards)

        wildcards = range(0, max_wildcards + 1)
        # generate n unique numbers and store as keys in number_hash
        timer = WallClockTimer('dictionary population')
        self.number_hash = {}

        for i in xrange(0, n):
            unique = False
            while not unique:
                wildcard_digits = self.gen_wildcard_digits(wildcards)
                num = self.gen_number(range_start, range_end)
                if wildcard_digits > 0:
                    num /= (10 ** wildcard_digits)
                key = (num, wildcard_digits)
                if self.number_hash.has_key(key):
                    unique = False
                else:
                    unique = True
                    self.number_hash[key] = None
        timer.stop()

    def gen_wildcard_digits(self, wildcards):
        return random.choice(wildcards)

    def gen_number(self, start, end):
        return int(random.uniform(start, end))

    def storage_test(self, StorageTestClass):
        test = StorageTestClass(self.number_hash)

class StorageTest(object):
    '''base class for testing storage.  Derive a test
    class and provide your own runtest() method.'''
    def __init__(self, number_hash):
        timer = WallClockTimer('%s population' % type(self).__name__)
        self.runtest(number_hash)
        timer.stop()

class StorageBerkeleyDB(StorageTest):
    def runtest(self, number_hash):
        db = bsddb.hashopen(None, flag='c', cachesize=8192)
        for (num, wildcard_digits) in number_hash.keys():
            key = '%d:%d' % (num, wildcard_digits)
            db[key] = None
        db.close()

class StorageSQLite(StorageTest):
    def runtest(self, number_hash):
        db = sqlite.connect(':memory:')
        cursor = db.cursor()
        cursor.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')
        q = """insert into numbers (num, wd) values (%d, %d)"""
        for (num, wildcard_digits) in number_hash.keys():
            cursor.execute(q, num, wildcard_digits)
        db.commit()
        db.close()


if __name__ == '__main__':

    test_vals = {'range_start'   : 61880 * 10 ** 7,
                 'range_end'     : 61891 * 10 ** 7,
                 'n'             : 1 * 10 ** 4,
                 'max_wildcards' : 3}

    ngen = NumberGen(test_vals['range_start'], test_vals['range_end'],
                     test_vals['n'], test_vals['max_wildcards'])

    ngen.storage_test(StorageBerkeleyDB)
    ngen.storage_test(StorageSQLite)
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread skip

bob> If you have the same number of entries as buckets, and you have a
bob> good hash function, then if you have n buckets your longest chain
bob> should have length around ln(n).  The average length of a nonempty
bob> bucket would be somewhere around 1 1/2.

Yes, and it achieves that nice short chain length by consuming gobs of
memory.  A dictionary with 10**7 keys is going to chew up lots of memory.
There's nothing particularly magical about dictionaries in this respect.
They are good examples of a classic time-space tradeoff.

Skip

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread skip

Roy> If you're getting long hash chains, you're either using a bad hash
Roy> function, or your table isn't big enough.

Sure.  The original poster said something about 10 million keys I think.
Unless the key distribution is amazingly fortuitous and the number of unique
values is small, the dictionary is going to have a huge memory footprint.

On my Mac laptop this code:

>>> d = {}
>>> for n in xrange(10**7):
...   d[n] = hex(n)
... 

yields a process with 647MB of VM.  If I trim that to 1000 unique values:

>>> d = {}
>>> for n in xrange(10**7):
...   d[n] = hex(n % 1000)
... 

I still wind up with 647MB VM.  The dictionary and its keys are what consume
all the memory, and those keys are as simple as you can get.

Skip

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Large Dictionaries

2006-05-16 Thread Klaas
>22.2s  20m25s[3]

20m to insert 1m keys?  You are doing something wrong.

With bdb's it is crucial to insert keys in bytestring-sorted order.
Also, be sure to give it a decent amount of cache.

-Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread bob_jenkins
If you have the same number of entries as buckets, and you have a good
hash function, then if you have n buckets your longest chain should
have length around ln(n).  The average length of a nonempty bucket
would be somewhere around 1 1/2.
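
A quick simulation of that model (uniform hashing into separate chains;
note that CPython dicts actually use open addressing, so this is only
the textbook picture):

import random

n = 10**6
buckets = {}
for i in xrange(n):                     # throw n keys into n buckets
    b = random.randrange(n)
    buckets[b] = buckets.get(b, 0) + 1

print 'longest chain:', max(buckets.itervalues())
print 'average non-empty chain: %.2f' % (float(n) / len(buckets))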

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread Roy Smith
In article <[EMAIL PROTECTED]>,
 <[EMAIL PROTECTED]> wrote:
>
>Graham> Looking up a key in a dictionary is done in constant-time,
>Graham> i.e. it doesn't matter how large the dictionary is.
>
>Doesn't that depend on how many keys hash to the same value?  For small
>dictionaries keeping the max keys that hash to the same value small isn't a
>huge problem.  For large dictionaries (millions of keys) might you have some
>long chains?  Or in an effort to reduce the chain length, wind up using so
>much virtual memory that you wind up wrecking performance by swapping?

If you're getting long hash chains, you're either using a bad hash
function, or your table isn't big enough.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread Graham Fawcett
On 5/16/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Graham> Looking up a key in a dictionary is done in constant-time,
> Graham> i.e. it doesn't matter how large the dictionary is.
>
> Doesn't that depend on how many keys hash to the same value?  For small
> dictionaries keeping the max keys that hash to the same value small isn't a
> huge problem.

Hi Skip. On reflection, I guess I ought to have asked the OP what he
meant by "large". :-) Probably asked for details on his use-case as
well.

Sure, collisions are a reality. The introductory comments in
dictobject.c in the Python source [1] give good coverage of how these
issues are handled in CPython. And, they make for great bedtime
reading! :-)

Regardless, I can't imagine that using multiple dictionaries would
provide a sufficient speed-up for the vast majority of large
dictionary use-cases. There are still swapping issues, plus the
user-code overhead of managing the multiple dicts, plus the multiple
lookups. If the only improvement is in collision reduction (which is
questionable in the general case, since an unfortunate set of keys
could result in numerous collisions even in a smaller database), then
is the extra overhead really worth it?

Still, it's just my opinion. If someone wants to post code and prove
me wrong, I'll eat my hat and take my lickings. ;-)

Best,
Graham

[1] http://svn.python.org/view/python/trunk/Objects/dictobject.c
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread skip

Graham> Looking up a key in a dictionary is done in constant-time,
Graham> i.e. it doesn't matter how large the dictionary is.

Doesn't that depend on how many keys hash to the same value?  For small
dictionaries keeping the max keys that hash to the same value small isn't a
huge problem.  For large dictionaries (millions of keys) might you have some
long chains?  Or in an effort to reduce the chain length, wind up using so
much virtual memory that you wind up wrecking performance by swapping?

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread Graham Fawcett
Casey Hawthorne wrote:
> For Large Dictionaries Could One Use Separate Dictionaries Where Each
> Dictionary Covers an Interval of the Input Range?

One Could, But Why? :-) You wouldn't see any performance improvements.
Looking up a key in a dictionary is done in constant-time, i.e. it
doesn't matter how large the dictionary is. 
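A quick way to see that for yourself (a rough sketch; the absolute
numbers don't matter, only that the per-lookup cost stays flat as the
dict grows):

import timeit

for size in (10**3, 10**5, 10**7):
    # the 10**7 case needs a few hundred MB of RAM, as noted elsewhere
    # in the thread
    setup = 'd = dict.fromkeys(xrange(%d)); k = %d' % (size, size // 2)
    t = timeit.Timer('d[k]', setup).timeit(10**6)
    print size, '%.3fs per million lookups' % t
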

Graham

-- 
http://mail.python.org/mailman/listinfo/python-list


For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread Casey Hawthorne
For Large Dictionaries Could One Use Separate Dictionaries Where Each
Dictionary Covers an Interval of the Input Range? 

You would need to know something of the input range and its
uniformity!

Self-balancing between dictionaries for separate dictionaries?
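
For concreteness, a bare-bones sketch of the idea (a hypothetical
PartitionedDict that splits a known numeric key range into equal
intervals - whether it buys anything is exactly the question):

class PartitionedDict(object):
    '''Split a known numeric key range [lo, hi) into several dicts.'''
    def __init__(self, lo, hi, parts):
        self.lo = lo
        self.width = (hi - lo) // parts + 1
        self.parts = [{} for _ in xrange(parts)]

    def _bucket(self, key):
        return self.parts[(key - self.lo) // self.width]

    def __setitem__(self, key, value):
        self._bucket(key)[key] = value

    def __getitem__(self, key):
        return self._bucket(key)[key]

d = PartitionedDict(0, 10**7, 16)
d[12345] = 'example'
print d[12345]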
--
Regards,
Casey
-- 
http://mail.python.org/mailman/listinfo/python-list

