Antoine Pitrou <pit...@free.fr> added the comment:

> For example, on the x64 machine the following dict() mapping 
> 10,000,000 very short unicode keys (~7 chars) to integers eats 149
> bytes per entry. 

This is counting the keys too. Under 3.x:

>>> import sys
>>> d = {}
>>> for i in range(0, 10000000): d[str(i)] = i
... 
>>> sys.getsizeof(d)
402653464
>>> sys.getsizeof(d) / len(d)
40.2653464

So, the dict itself uses ~40 bytes/entry. Since this is a 64-bit Python build, 
each hash table entry takes three 8-byte words, that is 24 bytes per entry (one 
word for the key pointer, one word for the value pointer, one word for the 
cached hash of the key). The ratio of allocated entries in the hash table to 
used entries is therefore only about 1.7 (40/24), which is reasonable.
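For concreteness, here is a small script redoing that arithmetic outside the 
interactive session (a sketch; the 24-byte slot layout comes from the 
reasoning above, and the rough sizes in the comments are specific to this 
64-bit build):

    import sys

    d = {}
    for i in range(10000000):
        d[str(i)] = i

    entry_bytes = 3 * 8                           # key ptr + value ptr + cached hash
    table_overhead = sys.getsizeof(d) / len(d)    # ~40.3 bytes per used entry
    fill_ratio = table_overhead / entry_bytes     # ~1.7 allocated slots per used entry
    print(table_overhead, fill_ratio)

    # Adding the keys and values themselves approaches the ~149 bytes/entry
    # reported in the quoted message; the remainder is allocator overhead
    # and rounding (and shared small ints are counted once per entry here).
    keys_and_values = sum(sys.getsizeof(k) + sys.getsizeof(v)
                          for k, v in d.items()) / len(d)
    print(table_overhead + keys_and_values)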

Do note that unicode objects themselves are not that compact:

>>> sys.getsizeof("1000000")
72

If you have many of them, you might use bytestrings instead:

>>> sys.getsizeof(b"1000000")
40
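A minimal sketch of that substitution, assuming the keys are pure ASCII (the 
exact savings depend on the build):

    import sys

    d = {}
    for i in range(10000000):
        d[str(i).encode('ascii')] = i   # bytes keys instead of str keys

    # On the build above each key drops from ~72 bytes (str) to ~40 bytes
    # (bytes), saving roughly 32 bytes per entry; the dict's own ~40 bytes
    # of per-entry overhead is unchanged.
    print(sys.getsizeof(b"1000000"), sys.getsizeof("1000000"))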

I've modified your benchmark to run under 3.x and will post it in a later 
message (I don't know whether bio.trie exists for 3.x, though).

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9520>
_______________________________________