Re: [Python-Dev] PEP for new dictionary implementation

2012-02-19 Thread Martin v. Löwis
 Ah, now I understand; you do need a single ssize_t either on the dict
 or at the head of the values array to indicate how many slots it has
 actually allocated.  It *may* also be worthwhile to add a second
 ssize_t to indicate how many are currently in use, for faster results
 in case of len.  But the dict is guaranteed to have at least one free
 slot, so that extra index will never make the allocation larger than
 the current code.
 
 The dict already has a field indicating how many items are in use,
 the ma_used field.

So what do you think about Jim's proposal to make the values indexed
not by hash value, but by an index that is stored in the shared keys?

Since the load will be  2/3, this should save 1/3 of the value storage
(typically more than that, if you initialize the values array always to
the current number of shared keys).

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-17 Thread Mark Shannon

On 16/02/12 20:45, Antoine Pitrou wrote:


On Wed, 08 Feb 2012 19:18:14 +
Mark Shannonm...@hotpy.org  wrote:

Proposed PEP for new dictionary implementation, PEP 410?
is attached.



So, I'm running a few benchmarks using Twisted's test suite
(see https://bitbucket.org/pitrou/t3k/wiki/Home).

At the end of `python -i bin/trial twisted.internet.test`:
-  vanilla 3.3: RSS = 94 MB
-  new dict:RSS = 91 MB

At the end of `python -i bin/trial twisted.python.test`:
-  vanilla 3.3: RSS = 31.5 MB
-  new dict:RSS = 30 MB

At the end of `python -i bin/trial twisted.conch.test`:
-  vanilla 3.3: RSS = 68 MB
-  new dict:RSS = 42 MB (!)

At the end of `python -i bin/trial twisted.trial.test`:
-  vanilla 3.3: RSS = 32 MB
-  new dict:RSS = 30 MB

At the end of `python -i bin/trial twisted.test`:
-  vanilla 3.3: RSS = 62 MB
-  new dict:RSS = 78 MB (!)


In theory, new-dict should never use more a few kbs more than vanilla.
That looks like a serious leak. I'll investigate as soon as I get a chance.
Which revision of new-dict are you using?

Cheers,
Mark.


Runtimes were mostly similar in these test runs.

Perspective broker benchmark (doc/core/benchmarks/tpclient.py and
doc/core/benchmarks/tpserver.py):
-  vanilla 3.3: 422 MB/sec
-  new dict:402 MB/sec

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/mark%40hotpy.org


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-17 Thread Antoine Pitrou
On Fri, 17 Feb 2012 13:10:51 +
Mark Shannon m...@hotpy.org wrote:

 On 16/02/12 20:45, Antoine Pitrou wrote:
 
  On Wed, 08 Feb 2012 19:18:14 +
  Mark Shannonm...@hotpy.org  wrote:
  Proposed PEP for new dictionary implementation, PEP 410?
  is attached.
 
 
  So, I'm running a few benchmarks using Twisted's test suite
  (see https://bitbucket.org/pitrou/t3k/wiki/Home).
 
  At the end of `python -i bin/trial twisted.internet.test`:
  -  vanilla 3.3: RSS = 94 MB
  -  new dict:RSS = 91 MB
 
  At the end of `python -i bin/trial twisted.python.test`:
  -  vanilla 3.3: RSS = 31.5 MB
  -  new dict:RSS = 30 MB
 
  At the end of `python -i bin/trial twisted.conch.test`:
  -  vanilla 3.3: RSS = 68 MB
  -  new dict:RSS = 42 MB (!)
 
  At the end of `python -i bin/trial twisted.trial.test`:
  -  vanilla 3.3: RSS = 32 MB
  -  new dict:RSS = 30 MB
 
  At the end of `python -i bin/trial twisted.test`:
  -  vanilla 3.3: RSS = 62 MB
  -  new dict:RSS = 78 MB (!)
 
 In theory, new-dict should never use more a few kbs more than vanilla.
 That looks like a serious leak. I'll investigate as soon as I get a chance.
 Which revision of new-dict are you using?

6c4d5d9dfc6d

Thanks :)

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-17 Thread Jim Jewett
On Fri, Feb 17, 2012 at 1:50 AM, Martin v. Löwis mar...@v.loewis.de wrote:
 Good idea. However, how do you track per-dict how large the
 table is?

[Or, rather, what is the highest index needed to store any values
that are actually set for this instance.]

 To determine whether it needs to grow the array, it needs to find out
 how large the array is, no? So: how do you do that?

Ah, now I understand; you do need a single ssize_t either on the dict
or at the head of the values array to indicate how many slots it has
actually allocated.  It *may* also be worthwhile to add a second
ssize_t to indicate how many are currently in use, for faster results
in case of len.  But the dict is guaranteed to have at least one free
slot, so that extra index will never make the allocation larger than
the current code.

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-17 Thread Mark Shannon
 On 17 February 2012 at 17:42 Jim Jewett jimjjew...@gmail.com wrote:

 On Fri, Feb 17, 2012 at 1:50 AM, Martin v. Löwis mar...@v.loewis.de
wrote:
  Good idea. However, how do you track per-dict how large the
  table is?

 [Or, rather, what is the highest index needed to store any values
 that are actually set for this instance.]

  To determine whether it needs to grow the array, it needs to find out
  how large the array is, no? So: how do you do that?

 Ah, now I understand; you do need a single ssize_t either on the dict
 or at the head of the values array to indicate how many slots it has
 actually allocated.  It *may* also be worthwhile to add a second
 ssize_t to indicate how many are currently in use, for faster results
 in case of len.  But the dict is guaranteed to have at least one free
 slot, so that extra index will never make the allocation larger than
 the current code.

The dict already has a field indicating how many items are in use,
the ma_used field.

Cheers,
Mark.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] PEP for new dictionary implementation

2012-02-16 Thread Jim J. Jewett


PEP author Mark Shannon wrote
(in 
http://mail.python.org/pipermail/python-dev/attachments/20120208/05be469a/attachment.txt):

 ... allows ... (the ``__dict__`` attribute of an object) to share
 keys with other attribute dictionaries of instances of the same class.

Is the same class a deliberate restriction, or just a convenience
of implementation?  I have often created subclasses (or even families
of subclasses) where instances (as opposed to the type) aren't likely
to have additional attributes.  These would benefit from key-sharing
across classes, but I grant that it is a minority use case that isn't
worth optimizing if it complicates the implementation.

 By separating the keys (and hashes) from the values it is possible
 to share the keys between multiple dictionaries and improve memory use.

Have you timed not storing the hash (in the dict) at all, at least for
(unicode) str-only dicts?  Going to the string for its own cached hash
breaks locality a bit more, but saves 1/3 of the memory for combined
tables, and may make a big difference for classes that have relatively
few instances.

 Reduction in memory use is directly related to the number of dictionaries
 with shared keys in existence at any time. These dictionaries are typically
 half the size of the current dictionary implementation.

How do you measure that?  The limit for huge N across huge numbers
of dicts should be 1/3 (because both hashes and keys are shared); I
assume that gets swamped by object overhead in typical small dicts.

 If a table is split the values in the keys table are ignored,
 instead the values are held in a separate array.

If they're just dead weight, then why not use them to hold indices
into the array, so that values arrays only have to be as long as
the number of keys, rather than rounding them up to a large-enough
power-of-two?  (On average, this should save half the slots.)

 A combined-table dictionary never becomes a split-table dictionary.

I thought it did (at least temporarily) as part of resizing; are you
saying that it will be re-split by the time another thread is
allowed to see it, so that it is never observed as combined?



Given that this optimization is limited to class instances, I think
there should be some explanation of why you didn't just automatically
add slots for each variable assigned (by hard-coded name) within a
method; the keys would still be stored on the type, and array storage
could still be used for the values; the __dict__ slot could initially
be a NULL pointer, and instance dicts could be added exactly when they
were needed, covering only the oddball keys.


I would reword (or at least reformat) the Cons section; at the
moment, it looks like there are four separate objections, and seems
to be a bit dismissive towards backwards copmatibility.  Perhaps
something like:

While this PEP does not change any documented APIs or invariants,
it does break some de facto invariants.

C extension modules may be relying on the current physical layout
of a dictionary.  That said, extensions which rely on internals may
already need to be recompiled with each feature release; there are
already changes planned for both Unicode (for efficiency) and dicts
(for security) that would require authors of these extensions to
at least review their code.

Because iteration (and repr) order can depend on the order in which
keys are inserted, it will be possible to construct instances that
iterate in a different order than they would under the current
implementation.  Note, however, that this will happen very rarely
in code which does not deliberately trigger the differences, and
that test cases which rely on a particular iteration order will
already need to be corrected in order to take advantage of the
security enhancements being discussed under hash randomization, or
for use with Jython and PyPy.



-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-16 Thread Antoine Pitrou

On Wed, 08 Feb 2012 19:18:14 +
Mark Shannon m...@hotpy.org wrote:
 Proposed PEP for new dictionary implementation, PEP 410?
 is attached.
 

So, I'm running a few benchmarks using Twisted's test suite
(see https://bitbucket.org/pitrou/t3k/wiki/Home).

At the end of `python -i bin/trial twisted.internet.test`:
- vanilla 3.3: RSS = 94 MB
- new dict:RSS = 91 MB

At the end of `python -i bin/trial twisted.python.test`:
- vanilla 3.3: RSS = 31.5 MB
- new dict:RSS = 30 MB

At the end of `python -i bin/trial twisted.conch.test`:
- vanilla 3.3: RSS = 68 MB
- new dict:RSS = 42 MB (!)

At the end of `python -i bin/trial twisted.trial.test`:
- vanilla 3.3: RSS = 32 MB
- new dict:RSS = 30 MB

At the end of `python -i bin/trial twisted.test`:
- vanilla 3.3: RSS = 62 MB
- new dict:RSS = 78 MB (!)

Runtimes were mostly similar in these test runs.

Perspective broker benchmark (doc/core/benchmarks/tpclient.py and
doc/core/benchmarks/tpserver.py):
- vanilla 3.3: 422 MB/sec
- new dict:402 MB/sec

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-16 Thread Martin v. Löwis
Am 11.02.2012 22:22, schrieb Mark Shannon:
 Antoine Pitrou wrote:
 Hello Mark,

 I think the PEP should explain what happens when a keys table needs
 resizing when setting an object's attribute.
 
 If the object is the only instance of a class, it remains split,
 otherwise the table is combined.

Hi Mark,

Answering on-list is fine, but please do add such answers to the PEP
when requested.

I have such a question also: why does it provide storage for the value
slot in the keys array, where this slot is actually not used?

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-16 Thread Martin v. Löwis
Am 13.02.2012 13:46, schrieb Mark Shannon:
 Revised PEP for new dictionary implementation, PEP 412?
 is attached.

Committed as PEP 412.

Regards,
Martin

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-16 Thread Martin v. Löwis
Am 16.02.2012 19:24, schrieb Jim J. Jewett:
 
 
 PEP author Mark Shannon wrote
 (in 
 http://mail.python.org/pipermail/python-dev/attachments/20120208/05be469a/attachment.txt):
 
 ... allows ... (the ``__dict__`` attribute of an object) to share
 keys with other attribute dictionaries of instances of the same class.
 
 Is the same class a deliberate restriction, or just a convenience
 of implementation? 

It's about the implementation: the class keeps a pointer to the key set.
A subclass has a separate pointer for that.

 I have often created subclasses (or even families
 of subclasses) where instances (as opposed to the type) aren't likely
 to have additional attributes.  These would benefit from key-sharing
 across classes, but I grant that it is a minority use case that isn't
 worth optimizing if it complicates the implementation.

In particular, the potential savings are small: the instances of the
subclass will share the key sets per-class. So if you have S subclasses,
you could save up to S keysets, whereas you are already saving N-S-1
keysets (assuming you have a total of N objects across all classes).

 Have you timed not storing the hash (in the dict) at all, at least for
 (unicode) str-only dicts?  Going to the string for its own cached hash
 breaks locality a bit more, but saves 1/3 of the memory for combined
 tables, and may make a big difference for classes that have relatively
 few instances.

I'd be in favor of that, but it is actually an unrelated change: whether
or not you share key sets is unrelated to whether or not str-only dicts
drop the cached hash. Given a dict, it may be tricky to determine
whether or not it is str-only, i.e. what layout to use.

 Reduction in memory use is directly related to the number of dictionaries
 with shared keys in existence at any time. These dictionaries are typically
 half the size of the current dictionary implementation.
 
 How do you measure that?  The limit for huge N across huge numbers
 of dicts should be 1/3 (because both hashes and keys are shared); I
 assume that gets swamped by object overhead in typical small dicts.

It's more difficult than that. He also drops the smalltable (which I
think is a good idea), so accounting how this all plays together is tricky.

 If a table is split the values in the keys table are ignored,
 instead the values are held in a separate array.
 
 If they're just dead weight, then why not use them to hold indices
 into the array, so that values arrays only have to be as long as
 the number of keys, rather than rounding them up to a large-enough
 power-of-two?  (On average, this should save half the slots.)

Good idea. However, how do you track per-dict how large the table is?

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-16 Thread Jim Jewett
On Thu, Feb 16, 2012 at 4:34 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 Am 16.02.2012 19:24, schrieb Jim J. Jewett:

 PEP author Mark Shannon wrote
 (in 
 http://mail.python.org/pipermail/python-dev/attachments/20120208/05be469a/attachment.txt):

 ... allows ... (the ``__dict__`` attribute of an object) to share
 keys with other attribute dictionaries of instances of the same class.

 Is the same class a deliberate restriction, or just a convenience
 of implementation?

 It's about the implementation: the class keeps a pointer to the key set.
 A subclass has a separate pointer for that.

I would prefer to see that reason in the PEP; after a few years, I
have trouble finding email, even when I remember reading the
conversation.

 Have you timed not storing the hash (in the dict) at all, at least for
 (unicode) str-only dicts?  Going to the string for its own cached hash
 breaks locality a bit more, but saves 1/3 of the memory for combined
 tables, and may make a big difference for classes that have
 relatively few instances.

 I'd be in favor of that, but it is actually an unrelated change: whether
 or not you share key sets is unrelated to whether or not str-only dicts
 drop the cached hash.

Except that the biggest arguments against it are that it breaks cache
locality, and it changes the dictentry struct -- which this patch
already does anyway.

 Given a dict, it may be tricky to determine
 whether or not it is str-only, i.e. what layout to use.

Isn't that exactly the same determination needed when deciding whether
or not to use lookdict_unicode?  (It would make the switch to the more
general lookdict more expensive, as that would involve a new
allocation.)

 Reduction in memory use is directly related to the number of dictionaries
 with shared keys in existence at any time. These dictionaries are typically
 half the size of the current dictionary implementation.

 How do you measure that?  The limit for huge N across huge numbers
 of dicts should be 1/3 (because both hashes and keys are shared); I
 assume that gets swamped by object overhead in typical small dicts.

 It's more difficult than that. He also drops the smalltable (which I
 think is a good idea), so accounting how this all plays together is tricky.

All the more reason to explain in the PEP how he measured or approximated it.

 If a table is split the values in the keys table are ignored,
 instead the values are held in a separate array.

 If they're just dead weight, then why not use them to hold indices
 into the array, so that values arrays only have to be as long as
 the number of keys, rather than rounding them up to a large-enough
 power-of-two?  (On average, this should save half the slots.)

 Good idea. However, how do you track per-dict how large the table is?

Why would you want to?

The per-instance array needs to be at least as large as the highest
index used by any key for which it has a value; if the keys table gets
far larger (or even shrinks), that doesn't really matter to the
instance.  What does matter to the instance is getting a value of its
own for a new (to it) key -- and then the keys table can tell it which
index to use, which in turn tells it whether or not it needs to grow
the array.

Are are you thinking of len(o.__dict__), which will indeed be a bit
slower?  That will happen with split dicts and potentially missing
values, regardless of how much memory is set aside (or not) for the
missing values.

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-16 Thread Martin v. Löwis
 Good idea. However, how do you track per-dict how large the table is?
 
 Why would you want to?
 
 The per-instance array needs to be at least as large as the highest
 index used by any key for which it has a value; if the keys table gets
 far larger (or even shrinks), that doesn't really matter to the
 instance.  What does matter to the instance is getting a value of its
 own for a new (to it) key -- and then the keys table can tell it which
 index to use, which in turn tells it whether or not it needs to grow
 the array.

To determine whether it needs to grow the array, it needs to find out
how large the array is, no? So: how do you do that?

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-13 Thread Mark Shannon

Revised PEP for new dictionary implementation, PEP 412?
is attached.

Cheers,
Mark.

PEP: XXX
Title: Key-Sharing Dictionary
Version: $Revision$
Last-Modified: $Date$
Author: Mark Shannon m...@hotpy.org
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 08-Feb-2012
Python-Version: 3.3 or 3.4
Post-History: 08-Feb-2012


Abstract


This PEP proposes a change in the implementation of the builtin dictionary
type ``dict``. The new implementation allows dictionaries which are used as
attribute dictionaries (the ``__dict__`` attribute of an object) to share
keys with other attribute dictionaries of instances of the same class.

Motivation
==

The current dictionary implementation uses more memory than is necessary
when used as a container for object attributes as the keys are
replicated for each instance rather than being shared across many instances
of the same class.
Despite this, the current dictionary implementation is finely tuned and
performs very well as a general-purpose mapping object.

By separating the keys (and hashes) from the values it is possible to share
the keys between multiple dictionaries and improve memory use.
By ensuring that keys are separated from the values only when beneficial,
it is possible to retain the high-performance of the current dictionary
implementation when used as a general-purpose mapping object.

Behaviour
=

The new dictionary behaves in the same way as the old implementation.
It fully conforms to the Python API, the C API and the ABI.

Performance
===

Memory Usage


Reduction in memory use is directly related to the number of dictionaries
with shared keys in existence at any time. These dictionaries are typically
half the size of the current dictionary implementation.

Benchmarking shows that memory use is reduced by 10% to 20% for
object-oriented programs with no significant change in memory use
for other programs.

Speed
-

The performance of the new implementation is dominated by memory locality
effects. When keys are not shared (for example in module dictionaries
and dictionary explicitly created by dict() or {} ) then performance is
unchanged (within a percent or two) from the current implementation.

For the shared keys case, the new implementation tends to separate keys
from values, but reduces total memory usage. This will improve performance
in many cases as the effects of reduced memory usage outweigh the loss of
locality, but some programs may show a small slow down.

Benchmarking shows no significant change of speed for most benchmarks.
Object-oriented benchmarks show small speed ups when they create large
numbers of objects of the same class (the gcbench benchmark shows a 10%
speed up; this is likely to be an upper limit).

Implementation
==

Both the old and new dictionaries consist of a fixed-sized dict struct and
a re-sizeable table.
In the new dictionary the table can be further split into a keys table and
values array.
The keys table holds the keys and hashes and (for non-split tables) the
values as well. It differs only from the original implementation in that it
contains a number of fields that were previously in the dict struct.
If a table is split the values in the keys table are ignored, instead the
values are held in a separate array.

Split-Table dictionaries


When dictionaries are created to fill the __dict__ slot of an object, they are
created in split form. The keys table is cached in the type, potentially
allowing all attribute dictionaries of instances of one class to share keys.
In the event of the keys of these dictionaries starting to diverge,
individual dictionaries will lazily convert to the combined-table form.
This ensures good memory use in the common case, and correctness in all cases.

When resizing a split dictionary it is converted to a combined table.
If resizing is as a result of storing an instance attribute, and there is
only instance of a class, then the dictionary will be re-split immediately.
Since most OO code will set attributes in the __init__ method, all attributes
will be set before a second instance is created and no more resizing will be
necessary as all further instance dictionaries will have the correct size.
For more complex use patterns, it is impossible to know what is the best
approach, so the implementation allows extra insertions up to the point
of a resize when it reverts to the combined table (non-shared keys).

A deletion from a split dictionary does not change the keys table, it simply
removes the value from the values array.

Combined-Table dictionaries
---

Explicit dictionaries (dict() or {}), module dictionaries and most other
dictionaries are created as combined-table dictionaries.
A combined-table dictionary never becomes a split-table dictionary.
Combined tables are laid out in much the same way as the tables in the old
dictionary, resulting in very similar performance.

Re: [Python-Dev] PEP for new dictionary implementation

2012-02-11 Thread Antoine Pitrou

Hello Mark,

I think the PEP should explain what happens when a keys table needs
resizing when setting an object's attribute.
Reading the implementation, it seems the sharing can disappear
definitely, which seems a bit worrying.

Regards

Antoine.


On Wed, 08 Feb 2012 19:18:14 +
Mark Shannon m...@hotpy.org wrote:
 Proposed PEP for new dictionary implementation, PEP 410?
 is attached.
 
 Cheers,
 Mark.
 


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-11 Thread Mark Shannon

Antoine Pitrou wrote:

Hello Mark,

I think the PEP should explain what happens when a keys table needs
resizing when setting an object's attribute.


If the object is the only instance of a class, it remains split,
otherwise the table is combined.
Most OO code will set attributes in the __init__ method so all 
attributes are set before a second instance is created.

For more complex use patterns, it is impossible to know what is the
best approach, so the implementation allows extra insertions up to the 
point of a resize when it reverts to the combined table (non-shared keys).

(This may not be the case in the bitbucket repository,
I'll push the newer version tomorrow).


Reading the implementation, it seems the sharing can disappear
definitely, which seems a bit worrying.


It is immediately re-split (to allow sharing) when only one instance of 
the class exists.
I've implemented it that way (resize-combined then re-split) as most 
resizes (999 out of 1000) will be of combined tables, and I don't want 
to complicate the fast path.


Cheers,
Mark.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] PEP for new dictionary implementation

2012-02-08 Thread Mark Shannon

Proposed PEP for new dictionary implementation, PEP 410?
is attached.

Cheers,
Mark.
PEP: XXX
Title: Key-Sharing Dictionary
Version: $Revision$
Last-Modified: $Date$
Author: Mark Shannon m...@hotpy.org
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 08-Feb-2012
Python-Version: 3.3 or 3.4
Post-History: 08-Feb-2012


Abstract


This PEP proposes a change in the implementation of the builtin dictionary
type ``dict``. The new implementation allows dictionaries which are used as
attribute dictionaries (the ``__dict__`` attribute of an object) to share
keys with other attribute dictionaries of instances of the same class.

Motivation
==

The current dictionary implementation uses more memory than is necessary
when used as a container for object attributes as the keys are
replicated for each instance rather than being shared across many instances
of the same class.
Despite this, the current dictionary implementation is finely tuned and
performs very well as a general-purpose mapping object.

By separating the keys (and hashes) from the values it is possible to share
the keys between multiple dictionaries and improve memory use.
By ensuring that keys are separated from the values only when beneficial,
it is possible to retain the high-performance of the current dictionary
implementation when used as a general-purpose mapping object.

Behaviour
=

The new dictionary behaves in the same way as the old implementation.
It fully conforms to the Python API, the C API and the ABI.

Performance
===

Memory Usage


Reduction in memory use is directly related to the number of dictionaries
with shared keys in existence at any time. These dictionaries are typically
half the size of the current dictionary implementation.

Benchmarking shows that memory use is reduced by 10% to 20% for
object-oriented programs with no significant change in memory use
for other programs.

Speed
-

The performance of the new implementation is dominated by memory locality
effects. When keys are not shared (for example in module dictionaries
and dictionary explicitly created by dict() or {} ) then performance is
unchanged (within a percent or two) from the current implementation.

For the shared keys case, the new implementation tends to separate keys
from values, but reduces total memory usage. This will improve performance
in many cases as the effects of reduced memory usage outweigh the loss of
locality, but some programs may show a small slow down.

Benchmarking shows no significant change of speed for most benchmarks.
Object-oriented benchmarks show small speed ups when they create large
numbers of objects of the same class (the gcbench benchmark shows a 10%
speed up; this is likely to be an upper limit).

Implementation
==

Both the old and new dictionaries consist of a fixed-sized dict struct and
a re-sizeable table.
In the new dictionary the table can be further split into a keys table and
values array.
The keys table holds the keys and hashes and (for non-split tables) the
values as well. It differs only from the original implementation in that it
contains a number of fields that were previously in the dict struct.
If a table is split the values in the keys table are ignored, instead the
values are held in a separate array.

Split-Table dictionaries


When dictionaries are created to fill the __dict__ slot of an object, they are
created in split form. The keys table is cached in the type, potentially
allowing all attribute dictionaries of instances of one class to share keys.
In the event of the keys of these dictionaries starting to diverge,
individual dictionaries will lazily convert to the combined-table form.
This ensures good memory use in the common case, and correctness in all cases.

Combined-Table dictionaries
---

Explicit dictionaries (dict() or {}), module dictionaries and most other
dictionaries are created as combined-table dictionaries.
A combined-table dictionary never becomes a split-table dictionary.
Combined tables are laid out in much the same way as the tables in the old
dictionary, resulting in very similar performance.

Implementation
==

The new dictionary implementation is available at [1]_.

Pros and Cons
=

Pros


Significant memory savings for object-oriented applications.
Small improvement to speed for programs which create lots of objects.

Cons


Change to data structures:
Third party modules which meddle with the internals of the dictionary
implementation will break.
Changes to repr() output and iteration order:
For most cases, this will be unchanged.
However for some split-table dictionaries the iteration order will
change.

Neither of these cons should be a problem.
Modules which meddle with the internals of the dictionary
implementation are already broken and should be fixed to use the API.
The iteration order of dictionaries was never defined 

Re: [Python-Dev] PEP for new dictionary implementation

2012-02-08 Thread Terry Reedy

On 2/8/2012 2:18 PM, Mark Shannon wrote:

A pretty clear draft PEP.


Changes to repr() output and iteration order:
For most cases, this will be unchanged.
However for some split-table dictionaries the iteration order will
change.

Neither of these cons should be a problem.
Modules which meddle with the internals of the dictionary
implementation are already broken and should be fixed to use the API.


So are modules that depend on set and dict iteration order and the 
consequent representations.



The iteration order of dictionaries was never defined and has always been
arbitrary; it is different for Jython and PyPy.


I am pretty sure iteration order has changed between CPython versions in 
the past (and that when it did, people got caught). The documentation 
for doctest has section 25.2.3.6. Warnings. It starts with this very issue!

'''
doctest is serious about requiring exact matches in expected output. If 
even a single character doesn’t match, the test fails. This will 
probably surprise you a few times, as you learn exactly what Python does 
and doesn’t guarantee about output. For example, when printing a dict, 
Python doesn’t guarantee that the key-value pairs will be printed in any 
particular order, so a test like


 foo()
{Hermione: hippogryph, Harry: broomstick}
is vulnerable! One workaround is to do

 foo() == {Hermione: hippogryph, Harry: broomstick}
True
instead. Another is to do

 d = sorted(foo().items())
 d
[('Harry', 'broomstick'), ('Hermione', 'hippogryph')]
'''
(Object addresses and full-precision float representations are also 
discussed.)


--
Terry Jan Reedy


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP for new dictionary implementation

2012-02-08 Thread Mark Shannon

Terry Reedy wrote:

On 2/8/2012 2:18 PM, Mark Shannon wrote:

A pretty clear draft PEP.


Changes to repr() output and iteration order:
For most cases, this will be unchanged.
However for some split-table dictionaries the iteration order will
change.

Neither of these cons should be a problem.
Modules which meddle with the internals of the dictionary
implementation are already broken and should be fixed to use the API.


So are modules that depend on set and dict iteration order and the 
consequent representations.



The iteration order of dictionaries was never defined and has always been
arbitrary; it is different for Jython and PyPy.


I am pretty sure iteration order has changed between CPython versions in 
the past (and that when it did, people got caught). The documentation 
for doctest has section 25.2.3.6. Warnings. It starts with this very issue!

'''
doctest is serious about requiring exact matches in expected output. If 
even a single character doesn’t match, the test fails. This will 
probably surprise you a few times, as you learn exactly what Python does 
and doesn’t guarantee about output. For example, when printing a dict, 
Python doesn’t guarantee that the key-value pairs will be printed in any 
particular order, so a test like


  foo()
{Hermione: hippogryph, Harry: broomstick}
is vulnerable! One workaround is to do

  foo() == {Hermione: hippogryph, Harry: broomstick}
True
instead. Another is to do

  d = sorted(foo().items())
  d
[('Harry', 'broomstick'), ('Hermione', 'hippogryph')]
'''
(Object addresses and full-precision float representations are also 
discussed.)




There are a few things in the standard lib that rely on dict repr ordering:
http://bugs.python.org/issue13907
http://bugs.python.org/issue13909

I expect that the long-awaited fix to the hash-collision security issue
will expose a few more.

Version 2 of the new dict passes all these tests,
but that doesn't mean the tests are correct.

Cheers,
Mark.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com