Re: [Python-Dev] Memory bitmaps for the Python cyclic garbage collector

2017-09-08 Thread Tim Peters
[Tim]
>> In that case, it's because Python
>> _does_ mutate the objects' refcount members under the covers, and so
>> the OS ends up making fresh copies of the memory anyway.

[Greg Ewing ]
> Has anyone ever considered addressing that by moving the
> refcounts out of the objects and keeping them somewhere
> else?

Not that I know of.  I know Larry Hastings was considering doing it as
part of his experiments with removing the GIL, but that had nothing to
do with reducing cross-process copy-on-write surprises (it had to do
with "batching" refcount operations to eliminate a need for
fine-grained locking).

As-is, I'd say it's "a feature" that the refcount is part of the
object header.  Ref count manipulations are very frequent, and as part
of the object header a refcount tends to show up in cache lines "for
free" as a side effect of accessing the object's type pointer.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Memory bitmaps for the Python cyclic garbage collector

2017-09-08 Thread Greg Ewing

Tim Peters wrote:

In that case, it's because Python
_does_ mutate the objects' refcount members under the covers, and so
the OS ends up making fresh copies of the memory anyway.


Has anyone ever considered addressing that by moving the
refcounts out of the objects and keeping them somewhere
else?

--
Greg
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Memory bitmaps for the Python cyclic garbage collector

2017-09-08 Thread Tim Peters
[Neil Schemenauer ]
> Python objects that participate in cyclic GC (things like lists, dicts,
> sets but not strings, ints and floats) have extra memory overhead.  I
> think it is possible to mostly eliminate this overhead.  Also, while
> the GC is running, this GC state is mutated, which destroys
> copy-on-write optimizations.  This change would mostly fix that
> issue.
>
> All objects that participate in cyclic GC have the Py_TPFLAGS_HAVE_GC
> bit set in their type.  That causes an extra chunk of memory to be
> allocated *before* the ob_refcnt struct member.  This is the PyGC_Head
> struct.
>
> The whole object looks like this in memory (PyObject pointer is at
> arrow):
>
> union __gc_head *gc_next;
> union __gc_head *gc_prev;
> Py_ssize_t gc_refs;
> -->
> Py_ssize_t ob_refcnt
> struct _typeobject *ob_type;
> [rest of PyObject members]
>
>
> So, 24 bytes of overhead on a 64-bit machine.  The smallest Python
> object that can have a pointer to another object (e.g. a single PyObject
> * member) is 48 bytes.  Removing PyGC_Head would cut the size of these
> objects in half.
>
> Carl Shaprio questioned me today on why we use a double linked-list and
> not the memory bitmap.  I think the answer is that there is no good
> reason. We use a double linked list only due to historical constraints
> that are no longer present.

Since you wrote this code to begin with, it will come back to you ;-)
that the real purpose of the doubly-linked lists is to _partition_
(not just find) the tracked objects.  Between collections, they're
partitioned by generation, and within a collection equivalence classes
are first merged (up through the oldest generation to be scanned in
this run), and then temporarily partitioned internally in various ways
(based on things like whether objects turn out to be reachable from
outside, and whether they have finalizers).  The linked list
representation makes all the required operations cheap:  iteration,
merging classes, moving an object from one class to another, removing
an object entirely _while_ iterating over its equivalence class.
Don't know whether all that can be done efficiently with a bitmap
representation instead.

> Long ago, Python objects could be allocated using the system malloc or
> other memory allocators.  Since we could not control the memory
> location, bitmaps would be inefficient.  Today, we allocate all Python
> objects via our own function.  Python objects under a certain size are
> allocated using our own malloc, obmalloc, and are stored in memory
> blocks known "arenas".
>
> The PyGC_Head struct performs three functions.  First, it allows the GC
> to find all Python objects that will be checked for cycles (i.e. follow
> the linked list).

As above, the set of tracked objects is partitioned into more than one
linked list.

> Second, it stores a single bit of information to let
> the GC know if it is safe to traverse the object, set with
> PyObject_GC_Track().

?  An object is "tracked" if and only if it appears in _some_
doubly-linked list.  There is no bit set (or cleared) for this.
Untracking an object removes it entirely from whichever linked list
it's in (leaving it in no linked lists), and tracking an object
consists of adding it to the "generation 0" linked list.  Unless the
code has changed a whole lot recently.

For clarity, the top N-1 bits of gc_refs (which you cover next) are
also set to a special _PyGC_REFS_UNTRACKED constant when an object is
untracked:

"""
/* True if the object is currently tracked by the GC. */
#define _PyObject_GC_IS_TRACKED(o) \
(_PyGC_REFS(o) != _PyGC_REFS_UNTRACKED)
"""

But I believe it could just as well check to see whether the object's
gc_next is NULL.


>  Finally, it has a scratch area to compute the
> effective reference count while tracing refs (gc_refs).

As above, the top N-1 bits of that are also used between collections
to record whether an object is tracked.

The least significant bit of gc_refs now (not back when you or I were
mucking with this code) records whether the object has a finalizer
that has already been run, and that state needs to be preserved across
gc runs.  So that's another bit that would need to be stored somewhere
else.


> Here is a sketch of how we can remove the PyGC_Head struct for small
> objects (say less than 512 bytes).  Large objects or objects created by
> a different memory allocator will still have the PyGC_Head overhead.
>
> * Have memory arenas that contain only objects with the
>   Py_TPFLAGS_HAVE_GC flag.  Objects like ints, strings, etc will be
>   in different arenas, not have bitmaps, not be looked at by the
>   cyclic GC.
>
> * For those arenas, add a memory bitmap.  The bitmap is a bit array that
>   has a bit for each fixed size object in the arena.  The memory used by
>   the bitmap is a fraction of what is needed by PyGC_Head.  E.g. an
>   arena that holds up to 1024 objects of 48 bytes in size would have a
>   bitmap of 1024 bits.


Re: [Python-Dev] Memory bitmaps for the Python cyclic garbage collector

2017-09-08 Thread INADA Naoki
Big +1.  I love the idea.

str (especially, docstring), dict, and tuples are major memory eater in Python.
This may reduce tuple memory usage massively.

INADA Naoki  


On Fri, Sep 8, 2017 at 2:30 AM, Neil Schemenauer  wrote:
> Python objects that participate in cyclic GC (things like lists, dicts,
> sets but not strings, ints and floats) have extra memory overhead.  I
> think it is possible to mostly eliminate this overhead.  Also, while
> the GC is running, this GC state is mutated, which destroys
> copy-on-write optimizations.  This change would mostly fix that
> issue.
>
> All objects that participate in cyclic GC have the Py_TPFLAGS_HAVE_GC
> bit set in their type.  That causes an extra chunk of memory to be
> allocated *before* the ob_refcnt struct member.  This is the PyGC_Head
> struct.
>
> The whole object looks like this in memory (PyObject pointer is at
> arrow):
>
> union __gc_head *gc_next;
> union __gc_head *gc_prev;
> Py_ssize_t gc_refs;
> -->
> Py_ssize_t ob_refcnt
> struct _typeobject *ob_type;
> [rest of PyObject members]
>
>
> So, 24 bytes of overhead on a 64-bit machine.  The smallest Python
> object that can have a pointer to another object (e.g. a single PyObject
> * member) is 48 bytes.  Removing PyGC_Head would cut the size of these
> objects in half.
>
> Carl Shaprio questioned me today on why we use a double linked-list and
> not the memory bitmap.  I think the answer is that there is no good
> reason. We use a double linked list only due to historical constraints
> that are no longer present.
>
> Long ago, Python objects could be allocated using the system malloc or
> other memory allocators.  Since we could not control the memory
> location, bitmaps would be inefficient.  Today, we allocate all Python
> objects via our own function.  Python objects under a certain size are
> allocated using our own malloc, obmalloc, and are stored in memory
> blocks known "arenas".
>
> The PyGC_Head struct performs three functions.  First, it allows the GC
> to find all Python objects that will be checked for cycles (i.e. follow
> the linked list).  Second, it stores a single bit of information to let
> the GC know if it is safe to traverse the object, set with
> PyObject_GC_Track().  Finally, it has a scratch area to compute the
> effective reference count while tracing refs (gc_refs).
>
> Here is a sketch of how we can remove the PyGC_Head struct for small
> objects (say less than 512 bytes).  Large objects or objects created by
> a different memory allocator will still have the PyGC_Head overhead.
>
> * Have memory arenas that contain only objects with the
>   Py_TPFLAGS_HAVE_GC flag.  Objects like ints, strings, etc will be
>   in different arenas, not have bitmaps, not be looked at by the
>   cyclic GC.
>
> * For those arenas, add a memory bitmap.  The bitmap is a bit array that
>   has a bit for each fixed size object in the arena.  The memory used by
>   the bitmap is a fraction of what is needed by PyGC_Head.  E.g. an
>   arena that holds up to 1024 objects of 48 bytes in size would have a
>   bitmap of 1024 bits.
>
> * The bits will be set and cleared by PyObject_GC_Track/Untrack()
>
> * We also need an array of Py_ssize_t to take over the job of gc_refs.
>   That could be allocated only when GC is working and it only needs to
>   be the size of the number of true bits in the bitmap.  Or, it could be
>   allocated when the arena is allocated and be sized for the full arena.
>
> * Objects that are too large would still get the PyGC_Head struct
>   allocated "in front" of the PyObject.  Because they are big, the
>   overhead is not so bad.
>
> * The GC process would work nearly the same as it does now.  Rather than
>   only traversing the linked list, we would also have to crawl over the
>   GC object arenas, check blocks of memory that have the tracked bit
>   set.
>
> There are a lot of smaller details to work out but I see no reason
> why the idea should not work.  It should significantly reduce memory
> usage.  Also, because the bitmap and gc_refs are contiguous in
> memory, locality will be improved.  Łukasz Langa has mentioned that
> the current GC causes issues with copy-on-write memory in big
> applications.  This change should solve that issue.
>
> To implement, I think the easiest path is to create new malloc to be
> used by small GC objects, e.g. gcmalloc.c.  It would be similar to
> obmalloc but have the features needed to keep track of the bitmap.
> obmalloc has some quirks that makes it hard to use for this purpose.
> Once the idea is proven, gcmalloc could be merged or made to be a
> variation of obmalloc.  Or, maybe just optimized and remain
> separate.  obmalloc is complicated and highly optimized.  So, adding
> additional functionality to it will be challenging.
>
> I believe this change would be ABI compatible.
> ___
> Python-Dev mailing list
> 

Re: [Python-Dev] Memory bitmaps for the Python cyclic garbage collector

2017-09-07 Thread Antoine Pitrou
On Thu, 7 Sep 2017 11:30:12 -0600
Neil Schemenauer  wrote:
> 
> * The GC process would work nearly the same as it does now.  Rather than
>   only traversing the linked list, we would also have to crawl over the
>   GC object arenas, check blocks of memory that have the tracked bit
>   set.

Small note: the linked lists are also used for distinguishing between
generations. In other words, there is not one linked list but three of
them (one per generation).  This means you probably need two bits per
object, not one.

Other than that, it's an interesting proposal.  I'm looking forward to
the concrete results (performance, and maintenance overhead) :-)

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Memory bitmaps for the Python cyclic garbage collector

2017-09-07 Thread Neil Schemenauer
Python objects that participate in cyclic GC (things like lists, dicts,
sets but not strings, ints and floats) have extra memory overhead.  I
think it is possible to mostly eliminate this overhead.  Also, while
the GC is running, this GC state is mutated, which destroys
copy-on-write optimizations.  This change would mostly fix that
issue.

All objects that participate in cyclic GC have the Py_TPFLAGS_HAVE_GC
bit set in their type.  That causes an extra chunk of memory to be
allocated *before* the ob_refcnt struct member.  This is the PyGC_Head
struct.

The whole object looks like this in memory (PyObject pointer is at
arrow):

union __gc_head *gc_next;
union __gc_head *gc_prev;
Py_ssize_t gc_refs;
-->
Py_ssize_t ob_refcnt
struct _typeobject *ob_type;
[rest of PyObject members]


So, 24 bytes of overhead on a 64-bit machine.  The smallest Python
object that can have a pointer to another object (e.g. a single PyObject
* member) is 48 bytes.  Removing PyGC_Head would cut the size of these
objects in half.

Carl Shaprio questioned me today on why we use a double linked-list and
not the memory bitmap.  I think the answer is that there is no good
reason. We use a double linked list only due to historical constraints
that are no longer present.

Long ago, Python objects could be allocated using the system malloc or
other memory allocators.  Since we could not control the memory
location, bitmaps would be inefficient.  Today, we allocate all Python
objects via our own function.  Python objects under a certain size are
allocated using our own malloc, obmalloc, and are stored in memory
blocks known "arenas".

The PyGC_Head struct performs three functions.  First, it allows the GC
to find all Python objects that will be checked for cycles (i.e. follow
the linked list).  Second, it stores a single bit of information to let
the GC know if it is safe to traverse the object, set with
PyObject_GC_Track().  Finally, it has a scratch area to compute the
effective reference count while tracing refs (gc_refs).

Here is a sketch of how we can remove the PyGC_Head struct for small
objects (say less than 512 bytes).  Large objects or objects created by
a different memory allocator will still have the PyGC_Head overhead.

* Have memory arenas that contain only objects with the
  Py_TPFLAGS_HAVE_GC flag.  Objects like ints, strings, etc will be
  in different arenas, not have bitmaps, not be looked at by the
  cyclic GC.

* For those arenas, add a memory bitmap.  The bitmap is a bit array that
  has a bit for each fixed size object in the arena.  The memory used by
  the bitmap is a fraction of what is needed by PyGC_Head.  E.g. an
  arena that holds up to 1024 objects of 48 bytes in size would have a
  bitmap of 1024 bits.

* The bits will be set and cleared by PyObject_GC_Track/Untrack()

* We also need an array of Py_ssize_t to take over the job of gc_refs.
  That could be allocated only when GC is working and it only needs to
  be the size of the number of true bits in the bitmap.  Or, it could be
  allocated when the arena is allocated and be sized for the full arena.

* Objects that are too large would still get the PyGC_Head struct
  allocated "in front" of the PyObject.  Because they are big, the
  overhead is not so bad.

* The GC process would work nearly the same as it does now.  Rather than
  only traversing the linked list, we would also have to crawl over the
  GC object arenas, check blocks of memory that have the tracked bit
  set.

There are a lot of smaller details to work out but I see no reason
why the idea should not work.  It should significantly reduce memory
usage.  Also, because the bitmap and gc_refs are contiguous in
memory, locality will be improved.  Łukasz Langa has mentioned that
the current GC causes issues with copy-on-write memory in big
applications.  This change should solve that issue.

To implement, I think the easiest path is to create new malloc to be
used by small GC objects, e.g. gcmalloc.c.  It would be similar to
obmalloc but have the features needed to keep track of the bitmap.
obmalloc has some quirks that makes it hard to use for this purpose.
Once the idea is proven, gcmalloc could be merged or made to be a
variation of obmalloc.  Or, maybe just optimized and remain
separate.  obmalloc is complicated and highly optimized.  So, adding
additional functionality to it will be challenging.

I believe this change would be ABI compatible.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com