[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-09 Thread Inada Naoki
On Tue, Jul 9, 2019 at 9:46 AM Tim Peters  wrote:
>
> > In the end, all size classes have 1~3 used/cached memory blocks.
>
> No doubt part of it, but hard to believe it's most of it.  If the loop
> count above really is 10240, then there's only about 80K worth of
> pointers in the final `buf`.

You are right.  list.append is not the major memory consumer in the
"large" class (8 KiB+1 ~ 512 KiB).  There are several causes of large
allocations:

* bm_logging uses StringIO.seek(0); StringIO.truncate() to reset the buffer.
  So from the 2nd loop on, the internal buffer of StringIO becomes a Py_UCS4
  array instead of a list of strings.  This buffer uses the same policy as
  list to grow its capacity: `size + (size >> 8) + (size < 9 ? 3 : 6)`.
  Actually, when I use the `-n 1` option, memory usage is only 9 MiB.
* The intern dict.
* Many modules are loaded, and FileIO.readall() is used to read pyc files.
  This creates and deletes bytes objects of various sizes.
* The logging module uses several regular expressions.  `b'\0' * 0xff00` is
  used in sre_compile.
  https://github.com/python/cpython/blob/master/Lib/sre_compile.py#L320


>
> But does it really matter? ;-)  mimalloc "should have" done MADV_FREE
> on the pages holding the older `buf` instances, so it's not like the
> app is demanding to hold on to the RAM (albeit that it may well show
> up in the app's RSS unless/until the OS takes the RAM away).
>

mimalloc doesn't call madvise() for each free().  Each size class keeps
a 64 KiB mimalloc "page", and several OS pages (4 KiB) within that
"page" are committed but not used.

I dumped all "mimalloc page" stat.
https://paper.dropbox.com/doc/mimalloc-on-CPython--Agg3g6XhoX77KLLmN43V48cfAg-fFyIm8P9aJpymKQN0scpp#:uid=671467140288877659659079&h2=memory-usage-of-logging_format

For example:

bin block_size   used capacity reserved
 29   2560  1   22   25  (14 pages are committed, 2560 bytes are in use)
 29   2560 14   25   25  (16 pages are committed, 2560*14 bytes are in use)
 29   2560 11   25   25
 31   3584  15   18  (5 pages are committed, 3584 bytes are in use)
 33   5120  14   12
 33   5120  2   12   12
 33   5120  2   12   12
 37  10240  3   11  409
 41  20480  16  204
 57 327680  12   12

* Committed pages can be roughly calculated as `ceil(block_size * capacity / 4096)`.
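As a quick sanity check, a throwaway sketch of that rule of thumb reproduces
the page counts noted for the two annotated bin-29 rows:

from math import ceil

# (capacity, committed pages noted in the dump) for the two annotated bin-29 rows
for capacity, noted_pages in ((22, 14), (25, 16)):
    estimate = ceil(2560 * capacity / 4096)
    print(capacity, estimate, noted_pages)   # 22 -> 14, 25 -> 16: both match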

There are dozens of unused memory blocks and committed pages in each size
class.  This causes 10 MiB+ of memory overhead on the logging_format and
logging_simple benchmarks.


>> I was more intrigued by your first (speed) comparison:
>
> > - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
>
> Now _that's_ interesting ;-)  Looks like spectral_norm recycles many
> short-lived Python floats at a swift pace.  So memory management
> should account for a large part of its runtime (the arithmetic it does
> is cheap in comparison), and obmalloc and mimalloc should both excel
> at recycling mountains of small objects.  Why is mimalloc
> significantly faster?
[snip]
>  obmalloc's `address_in_range()` is definitely a major overhead in its
> fastest `free()` path, but then mimalloc has to figure out which
> thread is doing the freeing (looks cheaper than address_in_range, but
> not free).  Perhaps the layers of indirection that have been wrapped
> around obmalloc over the years are to blame?  Perhaps mimalloc's
> larger (16x) pools and arenas let it stay in its fastest paths more
> often?  I don't know why, but it would be interesting to find out :-)

Totally agree.  I'll investigate this next.

Regards,
-- 
Inada Naoki  
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/MXEE2NOEDAP72RFVTC7H4GJSE2CHP3SX/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-08 Thread Tim Peters
[Inada Naoki , trying mimalloc]
>>> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
>>> benchmarks.  I will look into it later.

>> ...
>> $ ./python  -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
>> Slower (60):
>> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
>> slower (+158%)
>> - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x
>> slower (+127%)

> I think I understand why mimalloc uses more than twice as much memory as
> pymalloc + glibc malloc in the logging_format and logging_simple benchmarks.
>
> These two benchmarks do something like this:
>
> buf = []  # in StringIO
> for _ in range(10*1024):
>     buf.append("important: some important information to be logged")
> s = "".join(buf)  # StringIO.getvalue()
> s.splitlines()
>
> mimalloc uses a size-segregated allocator for allocations up to ~512 KiB,
> and the size class is determined by the top three bits.
> On the other hand, list increases its capacity by about 9/8, which means
> the next size class is used on each realloc.

Often, but not always (multiplication by 9/8 may not change the top 3
bits - e.g., 128 * 9/8 = 144).
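For illustration, a tiny sketch that buckets a size by its top three
significant bits (only an approximation of mimalloc's real bin rule) makes
examples like that easy to check:

def top3_bucket(n):
    # keep the three leading bits of n, zero the rest
    shift = max(n.bit_length() - 3, 0)
    return (n >> shift) << shift

print(top3_bucket(128), top3_bucket(144))   # 128 128: same class
print(top3_bucket(160))                     # 160: the next class up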

> In the end, all size classes have 1~3 used/cached memory blocks.

No doubt part of it, but hard to believe it's most of it.  If the loop
count above really is 10240, then there's only about 80K worth of
pointers in the final `buf`.  To account for a difference of over 10M,
it would need to have left behind well over 100 _full_ size copies
from earlier reallocs.

In fact, the number of list elements across resizes goes like so:

0, 4, 8, 16, 25, 35, 46, ..., 7671, 8637, 9723, 10945

Adding all of those sums to 96,113, so accounts for less than 1M of
8-byte pointers if none were ever released.  mimalloc will, of course,
add its own slop on top of that - but not a factor of ten's worth.
Unless maybe it's using a dozen such buffers at once?
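For illustration, a minimal sketch assuming CPython 3.7's list
over-allocation rule from Objects/listobject.c (new = size + (size >> 3) +
(3 if size < 9 else 6)) reproduces that capacity sequence:

def list_capacities(nappends):
    # record every capacity the list passes through while appending
    caps, cap = [0], 0
    for size in range(1, nappends + 1):
        if size > cap:                      # list_resize() over-allocates
            cap = size + (size >> 3) + (3 if size < 9 else 6)
            caps.append(cap)
    return caps

caps = list_capacities(10 * 1024)
print(caps[:7], caps[-4:])   # [0, 4, 8, 16, 25, 35, 46] [7671, 8637, 9723, 10945]
print(sum(caps))             # the 96,113 total mentioned above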

But does it really matter? ;-)  mimalloc "should have" done MADV_FREE
on the pages holding the older `buf` instances, so it's not like the
app is demanding to hold on to the RAM (albeit that it may well show
up in the app's RSS unless/until the OS takes the RAM away).


> This is almost the worst case for mimalloc.  In a more complex application,
> there may be more chances to reuse memory blocks.
>
> In a complex or huge application, this overhead will become relatively small.
> Its speed is attractive.
>
> But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better for
> common cases.

The mimalloc page says that, in their benchmarks:

"""
In our benchmarks (see below), mimalloc always outperforms all other
leading allocators (jemalloc, tcmalloc, Hoard, etc), and usually uses
less memory (up to 25% more in the worst case).
"""

obmalloc is certainly more "memory efficient" (for some meanings of
that phrase) for smaller objects:  in 3.7 it splits objects of <= 512
bytes into 64 size classes.  mimalloc also has (close to) 64 "not
gigantic" size classes, but those cover a range of sizes over a
thousand times wider (up to about half a meg).  Everything obmalloc
handles fits in mimalloc's first 20 size classes.  So mimalloc
routinely needs more memory to satisfy a "small object" request than
obmalloc does.
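For concreteness, a minimal sketch of obmalloc's 3.7 size-class rounding
(8-byte alignment, hence the 64 classes mentioned above):

ALIGNMENT_SHIFT = 3          # 8-byte alignment in 3.7 (3.8 uses 16 bytes)

def obmalloc_class(nbytes):
    assert 0 < nbytes <= 512, "larger requests are punted to the raw allocator"
    index = (nbytes - 1) >> ALIGNMENT_SHIFT          # size class index, 0..63
    return index, (index + 1) << ALIGNMENT_SHIFT     # rounded-up block size

print(obmalloc_class(1), obmalloc_class(100), obmalloc_class(512))
# (0, 8) (12, 104) (63, 512)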

I was more intrigued by your first (speed) comparison:

> - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)

Now _that's_ interesting ;-)  Looks like spectral_norm recycles many
short-lived Python floats at a swift pace.  So memory management
should account for a large part of its runtime (the arithmetic it does
is cheap in comparison), and obmalloc and mimalloc should both excel
at recycling mountains of small objects.  Why is mimalloc
significantly faster?  This benchmark should stay in the "fastest
paths" of both allocators most often, and they both have very lean
fastest paths (they both use pool-local singly-linked sized-segregated
free lists, so malloc and free for both should usually amount to just
popping or pushing one block off/on the head of the appropriate list).
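A toy model of that shared fast path (illustrative only - the real
allocators thread the list through the free blocks themselves rather than a
side array):

class PoolFreeList:
    def __init__(self, nblocks):
        # next_block[i] is the block after i on the singly-linked free list
        self.next_block = list(range(1, nblocks)) + [None]
        self.head = 0

    def malloc(self):
        block = self.head
        if block is None:
            raise MemoryError("pool exhausted")
        self.head = self.next_block[block]      # pop the head block
        return block

    def free(self, block):
        self.next_block[block] = self.head      # push back onto the head
        self.head = block

pool = PoolFreeList(4)
a, b = pool.malloc(), pool.malloc()
pool.free(a)
assert pool.malloc() == a    # LIFO reuse: the block freed last is handed out first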

 obmalloc's `address_in_range()` is definitely a major overhead in its
fastest `free()` path, but then mimalloc has to figure out which
thread is doing the freeing (looks cheaper than address_in_range, but
not free).  Perhaps the layers of indirection that have been wrapped
around obmalloc over the years are to blame?  Perhaps mimalloc's
larger (16x) pools and arenas let it stay in its fastest paths more
often?  I don't know why, but it would be interesting to find out :-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/554D4PU6LBBIKWJCQI4VKU2BVZD4Z3PM/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-08 Thread Inada Naoki
On Thu, Jul 4, 2019 at 11:32 PM Inada Naoki  wrote:
>
> On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou  wrote:
> >
> > Ah, interesting.  Were you able to measure the memory footprint as well?
> >
>
> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
> benchmarks.  I will look into it later.
>
> ```
> $ ./python  -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> Slower (60):
> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
> slower (+158%)
> - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x
> slower (+127%)

I think I understand why mimalloc uses more than twice as much memory as
pymalloc + glibc malloc in the logging_format and logging_simple benchmarks.

These two benchmarks do something like this:

buf = []  # in StringIO
for _ in range(10*1024):
    buf.append("important: some important information to be logged")
s = "".join(buf)  # StringIO.getvalue()
s.splitlines()

mimalloc uses a size-segregated allocator for allocations up to ~512 KiB,
and the size class is determined by the top three bits.
On the other hand, list increases its capacity by about 9/8, which means
the next size class is used on each realloc.  In the end, all size classes
have 1~3 used/cached memory blocks.

This is almost the worst case for mimalloc.  In a more complex application,
there may be more chances to reuse memory blocks.

In a complex or huge application, this overhead will become relatively small.
Its speed is attractive.

But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better for
common cases.

Regards,
-- 
Inada Naoki  
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ORVLH5FAEO7LVE7SK44TQR6XK4YRRZ7L/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Tim Peters
[Victor Stinner ]
> I guess that INADA-san used pyperformance --track-memory.
>
> pyperf --track-memory doc:
> "--track-memory: get the memory peak usage. it is less accurate than
> tracemalloc, but has a lower overhead. On Linux, compute the sum of
> Private_Clean and Private_Dirty memory mappings of /proc/self/smaps.
> ...

So I'll take that as meaning essentially that it's reporting what RSS
would report if it ignored shared pages (so peak # of private pages
actually backed by physical RAM).

Clear as mud how MADV_FREE affects that.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/6H7HP6TXKGZBIPVNBTLULGYEDFJKVFCQ/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Inada Naoki
I found that the calibrated loop count is not stable, so memory usage differs
a lot in some benchmarks.
In particular, the RAM usage of the logging benchmark depends heavily on the
loop count:

$ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so
./python bm_logging.py simple --track-memory --fast --inherit-environ
PYTHONMALLOC,LD_PRELOAD -v
Run 1: calibrate the number of loops: 512
- calibrate 1: 12.7 MB (loops: 512)
Calibration: 1 warmup, 512 loops
Run 2: 0 warmups, 1 value, 512 loops
- value 1: 12.9 MB
Run 3: 0 warmups, 1 value, 512 loops
- value 1: 12.9 MB
...

$ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so
./python bm_logging.py simple --track-memory --fast --inherit-environ
PYTHONMALLOC,LD_PRELOAD -v -l1024
Run 1: 0 warmups, 1 value, 1024 loops
- value 1: 21.4 MB
Run 2: 0 warmups, 1 value, 1024 loops
- value 1: 21.4 MB
Run 3: 0 warmups, 1 value, 1024 loops
- value 1: 21.4 MB
...

-- 
Inada Naoki  
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/QBXLRFXDD5TLLDATV2PWE2QNLLDWRVXY/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Victor Stinner
I guess that INADA-san used pyperformance --track-memory.

pyperf --track-memory doc:
"--track-memory: get the memory peak usage. it is less accurate than
tracemalloc, but has a lower overhead. On Linux, compute the sum of
Private_Clean and Private_Dirty memory mappings of /proc/self/smaps.
On Windows, get PeakPagefileUsage of GetProcessMemoryInfo() (of the
current process): the peak value of the Commit Charge during the
lifetime of this process."
https://pyperf.readthedocs.io/en/latest/runner.html#misc

On Linux, pyperf uses read_smap_file() of pyperf._memory:
https://github.com/vstinner/pyperf/blob/master/pyperf/_memory.py

# Code to parse Linux /proc/%d/smaps files.
#
# See http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html for
# a quick introduction to smaps.
#
# Need Linux 2.6.16 or newer.
def read_smap_file():
    total = 0
    fp = open(proc_path("self/smaps"), "rb")
    with fp:
        for line in fp:
            # Include both Private_Clean and Private_Dirty sections.
            line = line.rstrip()
            if line.startswith(b"Private_") and line.endswith(b'kB'):
                parts = line.split()
                total += int(parts[1]) * 1024
    return total

It spawns a thread which reads /proc/self/smaps every millisecond and
then reports the *maximum*.
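A rough sketch of that sampling approach (illustrative, not pyperf's actual
implementation; it assumes a read_total() callable such as read_smap_file()
above):

import threading, time

class PeakTracker(threading.Thread):
    def __init__(self, read_total, interval=0.001):
        super().__init__(daemon=True)
        self.read_total = read_total
        self.interval = interval
        self.peak = 0
        self._stop_event = threading.Event()

    def run(self):
        # poll the private-memory total and remember the largest value seen
        while not self._stop_event.is_set():
            self.peak = max(self.peak, self.read_total())
            time.sleep(self.interval)

    def stop(self):
        self._stop_event.set()
        self.join()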

Victor

Le jeu. 4 juil. 2019 à 17:12, Tim Peters  a écrit :
>
> [Antoine Pitrou ]
> >> Ah, interesting.  Were you able to measure the memory footprint as well?
>
> [Inada Naoki ]
> > Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
> > benchmarks.  I will look into it later.
> >
> > ```
> > $ ./python  -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> > Slower (60):
> > - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
> > slower (+158%)
> > ...
>
> Could you say more about what's being measured here?  Like, if this is
> on Linux, is it reporting RSS?  VSS?  Something else?
>
> mimalloc is "most like" obmalloc among those I've looked at in recent
> weeks.  I noted before that its pools (they call them "pages") and
> arenas (called "segments") are at least 16x larger than obmalloc uses
> (64 KiB minimum pool/page size, and 4 MiB minimum arena/segment size,
> in mimalloc).
>
> Like all "modern" 64-bit allocators, it cares little about reserving
> largish blobs of address space, so I'd _expect_ Linuxy VSS to zoom.
> But it also releases (in some sense, like MADV_FREE) physical RAM
> back to the system at a granularity far smaller than an arena/segment.
>
> At an extreme, the SuperMalloc I linked to earlier reserves a 512 MiB
> vector at the start, so no program linking to that can consume less
> than half a gig of address space.  But it only expects to _use_ a few
> 4 KiB OS pages out of that.  mimalloc doesn't do anything anywhere
> near _that_ gonzo (& since mimalloc comes out of Microsoft, it never
> will - "total virtual memory" on Windows is a system-wide resource,
> limited to the sum of physical RAM + pagefile size - no "overcommit"
> is allowed).
> ___
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at 
> https://mail.python.org/archives/list/python-dev@python.org/message/ES3BFCBXS7N56XGUHHSOPHRT3UAEGKVA/



-- 
Night gathers, and now my watch begins. It shall not end until my death.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/HF4UIZP5J3KKWQMLCHKJD3G6YZLHWWBE/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Tim Peters
[Antoine Pitrou ]
>> Ah, interesting.  Were you able to measure the memory footprint as well?

[Inada Naoki ]
> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
> benchmarks.  I will look into it later.
>
> ```
> $ ./python  -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> Slower (60):
> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
> slower (+158%)
> ...

Could you say more about what's being measured here?  Like, if this is
on Linux, is it reporting RSS?  VSS?  Something else?

mimalloc is "most like" obmalloc among those I've looked at in recent
weeks.  I noted before that its pools (they call them "pages") and
arenas (called "segments") are at least 16x larger than obmalloc uses
(64 KiB minimum pool/page size, and 4 MiB minimum arena/segment size,
in mimalloc).

Like all "modern" 64-bit allocators, it cares little about reserving
largish blobs of address space, so I'd _expect_ Linuxy VSS to zoom.
But it also releases (in some sense, like MADV_FREE) physical RAM
back to the system at a granularity far smaller than an arena/segment.

At an extreme, the SuperMalloc I linked to earlier reserves a 512 MiB
vector at the start, so no program linking to that can consume less
than half a gig of address space.  But it only expects to _use_ a few
4 KiB OS pages out of that.  mimalloc doesn't do anything anywhere
near _that_ gonzo (& since mimalloc comes out of Microsoft, it never
will - "total virtual memory" on Windows is a system-wide resource,
limited to the sum of physical RAM + pagefile size - no "overcommit"
is allowed).
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ES3BFCBXS7N56XGUHHSOPHRT3UAEGKVA/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Antoine Pitrou
On Thu, 4 Jul 2019 23:32:55 +0900
Inada Naoki  wrote:
> On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou  wrote:
> >
> > Ah, interesting.  Were you able to measure the memory footprint as well?
> >  
> 
> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
> benchmarks.  I will look into it later.

Ah, indeed, MADV_FREE will make it complicated to measure actual memory
usage :-/

Regards

Antoine.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/46QXQLQWHB4ASSBL5LB5QXJIYJLTDY77/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Inada Naoki
On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou  wrote:
>
> Ah, interesting.  Were you able to measure the memory footprint as well?
>

Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
benchmarks.  I will look into it later.

```
$ ./python  -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
Slower (60):
- logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
slower (+158%)
- logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x
slower (+127%)
- regex_dna: 13.3 MB +- 19.1 kB -> 27.0 MB +- 12.1 kB: 2.02x slower (+102%)
- json_dumps: 8351.8 kB +- 19.8 kB -> 15.2 MB +- 18.0 kB: 1.87x slower (+87%)
- regex_v8: 8434.6 kB +- 20.9 kB -> 14.4 MB +- 11.0 kB: 1.75x slower (+75%)
- unpack_sequence: 7521.0 kB +- 17.0 kB -> 9980.8 kB +- 24.7 kB: 1.33x
slower (+33%)
- hexiom: 7412.2 kB +- 19.0 kB -> 9307.4 kB +- 8004 bytes: 1.26x slower (+26%)
- xml_etree_process: 12.2 MB +- 26.3 kB -> 15.0 MB +- 28.9 kB: 1.23x
slower (+23%)
- genshi_text: 10.2 MB +- 24.0 kB -> 12.5 MB +- 24.8 kB: 1.22x slower (+22%)
- crypto_pyaes: 7602.2 kB +- 35.7 kB -> 9242.8 kB +- 7873 bytes: 1.22x
slower (+22%)
- tornado_http: 24.9 MB +- 72.1 kB -> 30.1 MB +- 33.0 kB: 1.21x slower (+21%)
- chameleon: 15.8 MB +- 24.5 kB -> 19.1 MB +- 23.4 kB: 1.21x slower (+21%)
- genshi_xml: 10.9 MB +- 24.0 kB -> 12.9 MB +- 19.6 kB: 1.18x slower (+18%)
- go: 8662.6 kB +- 16.4 kB -> 10082.8 kB +- 26.2 kB: 1.16x slower (+16%)
- pathlib: 8863.6 kB +- 30.2 kB -> 10229.8 kB +- 19.4 kB: 1.15x slower (+15%)
- scimark_fft: 7473.4 kB +- 14.4 kB -> 8606.0 kB +- 28.6 kB: 1.15x slower (+15%)
- scimark_lu: 7463.2 kB +- 15.1 kB -> 8569.8 kB +- 28.6 kB: 1.15x slower (+15%)
- pidigits: 7380.2 kB +- 20.0 kB -> 8436.0 kB +- 24.2 kB: 1.14x slower (+14%)
- scimark_monte_carlo: 7354.4 kB +- 18.2 kB -> 8398.8 kB +- 27.0 kB:
1.14x slower (+14%)
- scimark_sparse_mat_mult: 7889.8 kB +- 20.1 kB -> 9006.2 kB +- 29.4
kB: 1.14x slower (+14%)
- scimark_sor: 7377.2 kB +- 18.9 kB -> 8402.0 kB +- 29.0 kB: 1.14x slower (+14%)
- chaos: 7693.0 kB +- 33.0 kB -> 8747.6 kB +- 10.5 kB: 1.14x slower (+14%)
- richards: 7364.2 kB +- 29.8 kB -> 8331.4 kB +- 20.2 kB: 1.13x slower (+13%)
- raytrace: 7696.0 kB +- 30.3 kB -> 8695.4 kB +- 30.0 kB: 1.13x slower (+13%)
- sqlite_synth: 8799.2 kB +- 25.5 kB -> 9937.4 kB +- 27.1 kB: 1.13x
slower (+13%)
- logging_silent: 7533.8 kB +- 32.0 kB -> 8488.2 kB +- 25.1 kB: 1.13x
slower (+13%)
- json_loads: 7317.8 kB +- 22.7 kB -> 8215.2 kB +- 21.5 kB: 1.12x slower (+12%)
- unpickle_list: 7513.4 kB +- 9790 bytes -> 8420.6 kB +- 25.6 kB:
1.12x slower (+12%)
- unpickle: 7519.8 kB +- 11.4 kB -> 8425.4 kB +- 27.1 kB: 1.12x slower (+12%)
- fannkuch: 7170.0 kB +- 14.9 kB -> 8033.0 kB +- 22.5 kB: 1.12x slower (+12%)
- pickle_list: 7514.6 kB +- 18.2 kB -> 8414.6 kB +- 24.0 kB: 1.12x slower (+12%)
- telco: 7685.2 kB +- 15.0 kB -> 8598.2 kB +- 17.6 kB: 1.12x slower (+12%)
- nbody: 7214.8 kB +- 10.7 kB -> 8070.2 kB +- 19.5 kB: 1.12x slower (+12%)
- pickle: 7523.2 kB +- 12.4 kB -> 8415.0 kB +- 21.0 kB: 1.12x slower (+12%)
- 2to3: 7171.2 kB +- 35.8 kB -> 8016.4 kB +- 21.7 kB: 1.12x slower (+12%)
- nqueens: 7425.2 kB +- 21.8 kB -> 8296.8 kB +- 25.5 kB: 1.12x slower (+12%)
- spectral_norm: 7212.6 kB +- 19.6 kB -> 8052.8 kB +- 18.4 kB: 1.12x
slower (+12%)
- regex_compile: 8538.0 kB +- 21.0 kB -> 9528.6 kB +- 22.1 kB: 1.12x
slower (+12%)
- pickle_pure_python: 7559.8 kB +- 19.4 kB -> 8430.0 kB +- 25.6 kB:
1.12x slower (+12%)
- unpickle_pure_python: 7545.4 kB +- 9233 bytes -> 8413.0 kB +- 15.6
kB: 1.11x slower (+11%)
- float: 23.9 MB +- 22.5 kB -> 26.6 MB +- 19.7 kB: 1.11x slower (+11%)
- sqlalchemy_imperative: 18.2 MB +- 46.2 kB -> 20.2 MB +- 36.5 kB:
1.11x slower (+11%)
- regex_effbot: 7910.8 kB +- 15.1 kB -> 8804.8 kB +- 20.9 kB: 1.11x
slower (+11%)
- pickle_dict: 7563.4 kB +- 15.3 kB -> 8415.2 kB +- 19.3 kB: 1.11x slower (+11%)
- sqlalchemy_declarative: 18.9 MB +- 40.2 kB -> 21.0 MB +- 26.4 kB:
1.11x slower (+11%)
- xml_etree_parse: 11.8 MB +- 12.6 kB -> 13.0 MB +- 16.3 kB: 1.11x slower (+11%)
- html5lib: 20.1 MB +- 44.9 kB -> 22.2 MB +- 46.6 kB: 1.10x slower (+10%)
- xml_etree_iterparse: 12.0 MB +- 26.5 kB -> 13.2 MB +- 31.3 kB: 1.10x
slower (+10%)
- sympy_integrate: 36.4 MB +- 26.7 kB -> 40.0 MB +- 33.2 kB: 1.10x slower (+10%)
- sympy_str: 37.2 MB +- 28.4 kB -> 40.7 MB +- 26.6 kB: 1.10x slower (+10%)
- sympy_expand: 36.2 MB +- 19.9 kB -> 39.7 MB +- 25.1 kB: 1.09x slower (+9%)
- mako: 15.3 MB +- 19.1 kB -> 16.7 MB +- 25.4 kB: 1.09x slower (+9%)
- django_template: 19.3 MB +- 14.9 kB -> 21.0 MB +- 14.6 kB: 1.09x slower (+9%)
- xml_etree_generate: 12.3 MB +- 39.5 kB -> 13.3 MB +- 26.9 kB: 1.08x
slower (+8%)
- deltablue: 8918.0 kB +- 19.8 kB -> 9615.8 kB +- 12.5 kB: 1.08x slower (+8%)
- dulwich_log: 12.2 MB +- 102.9 kB -> 13.1 MB +- 26.6 kB: 1.08x slower (+8%)
- meteor_contest: 9296.0 kB +- 11.9 kB -> 9996.8 kB +- 20.7 kB: 1.08x
slower (+8%)
- sympy_sum: 62.2 MB +- 20.8 kB -> 66.5 MB +- 21.1 kB: 1.07x slower (+7%)
- python_startup: 7946.6 kB +- 

[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Antoine Pitrou
On Thu, 4 Jul 2019 16:19:52 +0900
Inada Naoki  wrote:

> On Tue, Jun 25, 2019 at 5:49 AM Antoine Pitrou  wrote:
> >
> >
> > For the record, there's another contender in the allocator
> > competition now:
> > https://github.com/microsoft/mimalloc/
> >
> > Regards
> >
> > Antoine.  
> 
> It's a very strong competitor!
> 
> $ ./python -m pyperf compare_to pymalloc.json mimalloc.json -G --min-speed=3
> Faster (14):
> - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
> - unpickle: 19.7 us +- 1.9 us -> 17.6 us +- 1.3 us: 1.12x faster (-11%)
> - json_dumps: 17.1 ms +- 0.2 ms -> 15.7 ms +- 0.2 ms: 1.09x faster (-8%)
> - json_loads: 39.0 us +- 2.6 us -> 36.2 us +- 1.1 us: 1.08x faster (-7%)
> - crypto_pyaes: 162 ms +- 1 ms -> 150 ms +- 1 ms: 1.08x faster (-7%)
> - regex_effbot: 3.62 ms +- 0.04 ms -> 3.38 ms +- 0.01 ms: 1.07x faster (-7%)
> - pickle_pure_python: 689 us +- 53 us -> 650 us +- 5 us: 1.06x faster (-6%)
> - scimark_fft: 502 ms +- 2 ms -> 478 ms +- 2 ms: 1.05x faster (-5%)
> - float: 156 ms +- 2 ms -> 149 ms +- 1 ms: 1.05x faster (-5%)
> - pathlib: 29.0 ms +- 0.5 ms -> 27.7 ms +- 0.4 ms: 1.05x faster (-4%)
> - mako: 22.4 ms +- 0.1 ms -> 21.6 ms +- 0.2 ms: 1.04x faster (-4%)
> - scimark_sparse_mat_mult: 6.21 ms +- 0.04 ms -> 5.99 ms +- 0.08 ms:
> 1.04x faster (-3%)
> - xml_etree_parse: 179 ms +- 5 ms -> 173 ms +- 3 ms: 1.04x faster (-3%)
> - sqlalchemy_imperative: 42.0 ms +- 0.9 ms -> 40.7 ms +- 0.9 ms: 1.03x
> faster (-3%)
> 
> Benchmark hidden because not significant (46): ...

Ah, interesting.  Were you able to measure the memory footprint as well?

Regards

Antoine.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/47NHQNPQVR6GPZ3PPRCAVZLPRXGV4GNW/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Inada Naoki
On Tue, Jun 25, 2019 at 5:49 AM Antoine Pitrou  wrote:
>
>
> For the record, there's another contender in the allocator
> competition now:
> https://github.com/microsoft/mimalloc/
>
> Regards
>
> Antoine.

It's a very strong competitor!

$ ./python -m pyperf compare_to pymalloc.json mimalloc.json -G --min-speed=3
Faster (14):
- spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
- unpickle: 19.7 us +- 1.9 us -> 17.6 us +- 1.3 us: 1.12x faster (-11%)
- json_dumps: 17.1 ms +- 0.2 ms -> 15.7 ms +- 0.2 ms: 1.09x faster (-8%)
- json_loads: 39.0 us +- 2.6 us -> 36.2 us +- 1.1 us: 1.08x faster (-7%)
- crypto_pyaes: 162 ms +- 1 ms -> 150 ms +- 1 ms: 1.08x faster (-7%)
- regex_effbot: 3.62 ms +- 0.04 ms -> 3.38 ms +- 0.01 ms: 1.07x faster (-7%)
- pickle_pure_python: 689 us +- 53 us -> 650 us +- 5 us: 1.06x faster (-6%)
- scimark_fft: 502 ms +- 2 ms -> 478 ms +- 2 ms: 1.05x faster (-5%)
- float: 156 ms +- 2 ms -> 149 ms +- 1 ms: 1.05x faster (-5%)
- pathlib: 29.0 ms +- 0.5 ms -> 27.7 ms +- 0.4 ms: 1.05x faster (-4%)
- mako: 22.4 ms +- 0.1 ms -> 21.6 ms +- 0.2 ms: 1.04x faster (-4%)
- scimark_sparse_mat_mult: 6.21 ms +- 0.04 ms -> 5.99 ms +- 0.08 ms:
1.04x faster (-3%)
- xml_etree_parse: 179 ms +- 5 ms -> 173 ms +- 3 ms: 1.04x faster (-3%)
- sqlalchemy_imperative: 42.0 ms +- 0.9 ms -> 40.7 ms +- 0.9 ms: 1.03x
faster (-3%)

Benchmark hidden because not significant (46): ...
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/CTIOESA4NQSWSXH5SZ5D6D7YITDGK33S/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-24 Thread Tim Peters
[Antoine Pitrou ]
> For the record, there's another contender in the allocator
> competition now:
> https://github.com/microsoft/mimalloc/

Thanks!  From a quick skim, most of it is addressing things obmalloc doesn't:

1) Efficient thread safety (we rely on the GIL).

2) Directly handling requests of all sizes (we only deal with <= 512 bytes).

3) Funky stuff to help refcounting languages that want to guarantee
   that freeing an object takes bounded time (the problem is that large
   objects can contain an unbounded number of other objects that
   will then also die - so mimalloc has complications to interact with
   client-supplied functions that parcel out the tree of objects killed
   by a decref, so the freeing can be done incrementally over time).

I don't think it's been explicitly pointed out before, but #2 is an
alternative to my PR and Neil's radix tree.  If we insist that
programs play by the rules, and obmalloc controls every piece of
memory it hands out, then the new definition of address_in_range()
becomes

#define address_in_range(p, pool) true

;-)

This deserves more thought than I've ever given to it.

Leaving aside all the above, they emphasize this:

"""
Many allocators use a single free list per size class which can lead
to bad spatial locality where objects belonging to a single structure
can be spread out over the entire heap.
...
To improve the spatial locality of allocation, mimalloc use free list
sharding where the heap is divided into pages (per size-class) with a
free list per page (where pages are usually 64KiB for small objects).
"""

Which is exactly what obmalloc already does, except we call their
"pages" "pools", and ours are 16x smaller (<= Python 3.8) or 4x
smaller (my PR).

They call our "arenas" "segments", and theirs are a minimum of 4 MiB.

Their motivations for making these "large" are the same as mine:
maximize the time they can stay in the very zippy "fast paths".

Something I'm torn on:  within a pool, obmalloc has two ways to find
the next free block.  The pool has its own block free list (as in
mimalloc), but _also_ has a "bump allocator":  The pool is carved into
blocks one at a time, low address to high, whenever the pool's free
list is NULL.  This was part of Vladimir's desire to never touch a
piece of memory before it's actually needed to satisfy a client
request.

But it does complicate and slow the code on the second-fastest
allocation path (when the pool's free list _is_ NULL).  The
conditional "OK, the free list is NULL - so is there room for another
bump allocation?" is a waste of time after all the pool's blocks have
been passed out at least once.  The bump allocator can never succeed
again then, but we keep testing for it forever.
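A toy model of the two lookup paths being described (illustrative only,
using block indices instead of real addresses):

class BumpPool:
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.freelist = []       # blocks handed back by free()
        self.nextoffset = 0      # bump allocator: next never-used block

    def malloc(self):
        if self.freelist:                     # fastest path: pop the free list
            return self.freelist.pop()
        if self.nextoffset < self.nblocks:    # second-fastest path: bump-carve
            self.nextoffset += 1
            return self.nextoffset - 1
        return None                           # pool is full; caller moves on

    def free(self, block):
        self.freelist.append(block)

Once every block has been carved out at least once, the nextoffset test can
never succeed again, which is the wasted conditional described above.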

The alternative is to break the pool into blocks, and link them into a
free list, in one gulp when a size class is assigned to a free pool.
That mutates all the blocks in the pool (to store the list pointers),
even if all but one will never be used.  In return, the code for
testing whether the bump allocator can find room, and the pool header
members supporting the bump allocator, could be thrown out.  That's
what mimalloc appears to do, and they said it sped things up a bit
overall.

But I think I'll leave that one for someone else ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/RSBCIDD6YBDSEPQIBGTOZHVS63PS7LTU/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-24 Thread Antoine Pitrou


For the record, there's another contender in the allocator
competition now:
https://github.com/microsoft/mimalloc/

Regards

Antoine.


On Mon, 24 Jun 2019 00:20:03 -0500
Tim Peters  wrote:
> [Tim]
> > The radix tree generally appears to be a little more memory-frugal
> > than my PR (presumably because my need to break "big pools" into 4K
> > chunks, while the tree branch doesn't, buys the tree more space to
> > actually store objects than it costs for the new tree).  
> 
> It depends a whole lot on the size classes of the most popular
> objects.  A program below to compute it all.  For a 64-bit box using
> 3.8 alignment, and 16 KiB pools:
[snip]

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/LWVFZYVFLTCNL7AKJVH2HLD2CHFRATUB/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-23 Thread Tim Peters
[Tim]
> The radix tree generally appears to be a little more memory-frugal
> than my PR (presumably because my need to break "big pools" into 4K
> chunks, while the tree branch doesn't, buys the tree more space to
> actually store objects than it costs for the new tree).

It depends a whole lot on the size classes of the most popular
objects.  A program below to compute it all.  For a 64-bit box using
3.8 alignment, and 16 KiB pools:

pages per pool 4
pool size 16,384
alignment 16

The first output block:

size 16
SQ 1012  1.2%
PR 1018  0.6%
RT 1021  0.3%

SQ is the status quo:  we have to use four separate 4 KiB pools, and
each burns 48 bytes for a pool header.

PR:  my PR.  There's one pool spanning 4 pages, with 48 bytes for a
pool header in the first page, and 16 bytes to store the arena index
in each of the other 3 pages.

RT:  the radix tree.  One 16 KiB block that only "wastes" 48 bytes for
the pool header.

The first number on each line is the number of objects we can get from
a "big pool" under that scheme.  The second number is the % of total
pool space that cannot be used for client objects.

So, in the above, PR is a substantial improvement over SQ, and RT a
less substantial improvement over PR.  Regardless of size class, PR
never does worse than SQ, and RT never worse than PR.

The most dramatic difference is in the largest size class:

size 512
SQ   28 12.5%
PR   28 12.5%
RT   31  3.1%

RT is a huge win there.  And while it's generally true that RT's
advantages are more striking in the larger size classes, it doesn't
always crush.  For example, in the 2nd-largest size class, it doesn't
matter at all which scheme is used:

size 496
SQ   32  3.1%
PR   32  3.1%
RT   32  3.1%

However, in the 3rd-largest size class, RT crushes it again:

size 480
SQ   32  6.2%
PR   32  6.2%
RT   34  0.4%

I trust the general principle at work here is too obvious to need
explanation ;-)

Anyway, these differences can really add up when there are millions of
objects passed out from a size class where RT has an advantage.
That's very attractive to me.

On the other hand, this _effectively_ makes arenas even larger (they
can contain more objects), which apparently makes it even less likely
that arenas can eventually be returned to the system.

But, on the third hand, I've seen no evidence yet that increasing
arena size matters much at all to that after hitting 128 KiB arenas
(smaller than what we already use).  "Uncooperative" programs
essentially recycle nothing regardless, and "happy" programs
essentially recycle almost as many arena bytes with 1 MiB arenas as
with 8 KiB arenas.

Here's the code:

PAGES_PER_POOL = 4
ALIGNMENT = 16 # change to 8 for < Python 3.8

PAGE_SIZE = 2**12
POOL_SIZE = PAGE_SIZE * PAGES_PER_POOL
POOL_HEADER_SIZE = 48

def from_block(size, blocksize, overhead):
    return (blocksize - overhead) // size

def from_first_page(size, *, pagesize=PAGE_SIZE):
    return from_block(size, pagesize, POOL_HEADER_SIZE)

# using multiple 4K one-page pools - status quo
def nobj_4K(size):
    return from_first_page(size) * PAGES_PER_POOL

# using the PR
def nobj_PR(size):
    return (from_first_page(size) +
            from_block(size, PAGE_SIZE, ALIGNMENT)
            * (PAGES_PER_POOL - 1))

# using the radix tree branch
def nobj_RT(size):
    return from_first_page(size, pagesize=POOL_SIZE)

print("pages per pool", PAGES_PER_POOL)
print(f"pool size {POOL_SIZE:,}")
print("alignment", ALIGNMENT)

for size in range(ALIGNMENT, 512 + 1, ALIGNMENT):
    print("size", size)
    for tag, f in (("SQ", nobj_4K),
                   ("PR", nobj_PR),
                   ("RT", nobj_RT),
                  ):
        nobj = f(size)
        waste = POOL_SIZE - nobj * size
        print(f"{tag} {nobj:4} {waste/POOL_SIZE:5.1%}")
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/S5KMU6M6GZACRNFCF3TNPE7NKDCMQD5E/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-23 Thread Tim Peters
[Thomas]
>>> And what would be an efficient way of detecting allocations punted to
>>> malloc, if not address_in_range?

[Tim]
>> _The_ most efficient way is the one almost all allocators used long
>> ago:  use some "hidden" bits right before the address returned to the
>> user to store info about the block being returned.

[Antoine]
> There's a fundamental problem here: you can't be sure that all
> allocators reserve such space.  If some allocator doesn't, it can
> return a pointer just at the very start of the page, and if obmalloc
> tries to look at "a few bits before" that address, it could very well
> page-fault.

I snipped some technical but crucial context in my reply to Thomas:
this was assuming users are following the documented requirement to
never mix memory calls from different families.

What you describe certainly could happen in "illegal" code that, e.g.,
obtained a block from the system malloc, but passed it to obmalloc to
free.  Which, in reality, works fine today, although the docs forbid
it.  (I'm not sure, but there _may_ be some mode of running today that
actually enforces the doc requirements.)

In the other world, where code is playing by the rules, if an obmalloc
malloc/;realloc spelling is called, and it needs to punt to a
different allocator, no problem:  first it boosts the size request so
it has room to store "the bit" it needs before the address it actually
returns to the client.  Then it's "legal" only to free that memory
with an obmalloc spelling of free() later - obmalloc reads "the bit",
sees "oh - that's not my memory!", and adjusts the pointer accordingly
on _its_ call to the spelling of free() corresponding to the memory
family obmalloc() used to get the memory to begin with.
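A toy sketch of that flow (illustrative only, with made-up names and a dict
standing in for real memory):

ALIGNMENT = 16
POOL, RAW = 0, 1

_heap = {}                 # simulated memory: base address -> hidden tag
_next = 0x10000

def _reserve(nbytes):
    global _next
    base = _next
    _next += nbytes
    return base

def tagged_malloc(nbytes, punt_to_raw):
    base = _reserve(nbytes + ALIGNMENT)       # boost the request for the hidden tag
    _heap[base] = RAW if punt_to_raw else POOL
    return base + ALIGNMENT                   # the caller never sees the tag

def tagged_free(addr):
    tag = _heap.pop(addr - ALIGNMENT)         # read the hidden tag, then dispatch
    return "raw free()" if tag == RAW else "pool free"

p = tagged_malloc(100, punt_to_raw=True)
print(tagged_free(p))                         # raw free()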

>> Neil Schemenauer takes a different approach in the recent "radix tree
>> arena map for obmalloc" thread here. ...

> One open question is the cache efficiency of the two approaches.
> Intuitively, address_in_range() will often look at exactly the same
> cache line (since a significant number of allocations will share the
> same "page prefix").

I believe we haven't seen a program yet that used more than one node
at the tree's top level :-)  But who knows?  mmap() and VirtualAlloc()
don't appear to make any _guarantees_ that the high-order bits of
returned addresses aren't entirely random.  In real life so far, they
always appear to be zeroes.

While x86 has a 64-bit virtual address space, the hardware "only"
supports 48 bits of physical address space, and I haven't seen a
virtual address yet where any of the top 16 bits are set.

AMD requires that the top 16 bits of virtual addresses be copies of bit 2**47.

Blah blah blah - for the foreseeable future, the top level of the tree
has a very easy job.

And Neil keenly observed that the top level of the tree can be
_declared_  as being very broad (and so suck up a lot of the leading
bits), because it's a file static and is effectively just an address
space reservation (at least on Linux) until nodes in it actually get
used.

>  Apparently, this benefit may be offset by cache
> aliasing issues.  Cache aliasing can also be mitigated by the fact that
> CPU caches are most of the time N-way with N > 1 (but N generally
> remains small, from 2 to 8, for L1 and L2 caches).
>
> So I guess the a priori answer is "it's complicated" :-)

Indeed it is.

> I must also thank both you and Neil for running these experiments, even
> though sometimes I may disagree about the conclusions :-)

Well, there aren't any conclusions yet - just seeing the same things repeatedly.

Over the weekend, Neil ran many variations of a longish-running "real
job" related to his work, which goes through phases of processing bulk
database operations and "trying to" release the space (details are
complicated).

Arena recycling was essentially non-existent in either branch (my PR
or the radix tree).

In 3.7.3 it _appeared_ to recycle hundreds of thousands of arenas, but
on closer examination they were virtually all of the "useless arena
thrashing" flavor.  The arenas in use were almost always within one of
the highwater mark.

But it got much better in the branches if arenas were shrunk to a tiny 8 KiB.

Which is just another instance of the "256 KiB arenas are already way
too fat for effective arena recycling unless the program is
exceptionally friendly in its dynamic memory use patterns"
observation.

We haven't seen a case yet where 1 MiB arenas do materially worse than
256 KiB ones in this respect.

Speed is generally a wash between the branches, although they
consistently appear to be faster (by a little, not a lot) than 3.7.3.

The radix tree generally appears to be a little more memory-frugal
than my PR (presumably because my need to break "big pools" into 4K
chunks, while the tree branch doesn't, buys the tree more space to
actually store objects than it costs for the new tree).

We need more "real jobs", and especially ones where arena recycling is
actually doing good in 3.7.3 (wh

[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-23 Thread Antoine Pitrou
On Fri, 21 Jun 2019 17:18:18 -0500
Tim Peters  wrote:
> 
> > And what would be an efficient way of detecting allocations punted to
> > malloc, if not address_in_range?  
> 
> _The_ most efficient way is the one almost all allocators used long
> ago:  use some "hidden" bits right before the address returned to the
> user to store info about the block being returned.

There's a fundamental problem here: you can't be sure that all
allocators reserve such space.  If some allocator doesn't, it can
return a pointer just at the very start of the page, and if obmalloc
tries to look at "a few bits before" that address, it could very well
page-fault.

> Neil Schemenauer takes a different approach in the recent "radix tree
> arena map for obmalloc" thread here.  We exchanged ideas on that until
> it got to the point that the tree levels only need to trace out
> prefixes of obmalloc arena addresses.  That is, the new space burden
> of the radix tree appears quite reasonably small.

One open question is the cache efficiency of the two approaches.
Intuitively, address_in_range() will often look at exactly the same
cache line (since a significant number of allocation will share the
same "page prefix").  Apparently, this benefit may be offset by cache
aliasing issues.  Cache aliasing can also be mitigated by the fact that
CPU caches are most of the time N-way with N > 1 (but N generally
remains small, from 2 to 8, for L1 and L2 caches).

So I guess the a priori answer is "it's complicated" :-)

I must also thank both you and Neil for running these experiments, even
though sometimes I may disagree about the conclusions :-)

Regards

Antoine.

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/RUQVG2KOYVMUIIX5HIZKNVN4AUXKKURM/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-21 Thread Neil Schemenauer
For those who would like to test with something compatible with
Python 3.7.3, I made re-based branches here:

 https://github.com/nascheme/cpython/tree/obmalloc_radix_v37
 https://github.com/nascheme/cpython/tree/obmalloc_big_pools_v37

They should be ABI compatible with Python 3.7.3.  So, if you just
re-build the "python" executable, you don't have to rebuild anything
else.  Both those use the same arena/pool sizes and they both have
Tim's arena thrashing fix.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/K5DCROCGGVNWWLC6XM6XMCTJACESNEYS/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-21 Thread Neil Schemenauer
On 2019-06-21, Tim Peters wrote:
> [Thomas Wouters ]
> > Getting rid of address_in_range sounds like a nice idea, and I
> > would love to test how feasible it is -- I can run such a change
> > against a wide selection of code at work, including a lot of
> > third-party extension modules, but I don't see an easy way to do
> > it right now.
> 
> Neil's branch is here:
> 
>  https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

If you can test vs some real-world programs, that would be great.
I was trying to run some tests this afternoon.  Testing with Python
3.8+ is a pain because of the PyCode_New and tp_print changes.  I've
just added two fixes to the head of the obmalloc_radix_tree branch
so that you can compile code generated by old versions of Cython.
Without those fixes, building 3rd party extensions can be a real
pain.

> My PR uses 16K pools and 1M arenas, quadrupling the status quo.
> Because "why not?" ;-)
> 
> Neil's branch has _generally_, but not always, used 16 MiB arenas.
> The larger the arenas in his branch, the smaller the radix tree needs
> to grow.

Currently I have it like your big pool branch (16 KB, 1MB).
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/TF2JI7G5ZMCGUMM3AWNSCQDYVFNRPMQ4/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-21 Thread Victor Stinner
Le ven. 21 juin 2019 à 23:19, Thomas Wouters  a écrit :
> Is this really feasible in a world where the allocators can be selected (and 
> the default changed) at runtime?

The memory allocator must not be changed after Python pre-initialization.
What's done after pre-initialization is more about installing "hooks" which
execute code before/after an allocation, but they don't replace the
allocator.

It simply doesn't work to switch from pymalloc to malloc "at runtime".
Calling PyMem_Free(ptr) would call free(ptr).  If the memory block was
allocated by pymalloc, free(ptr) would simply crash.

Victor
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/C5AI3SL77AV6QLRNTJ4PZH7MCYR2ZQAC/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-21 Thread Tim Peters
[Tim]
>> I don't think we need to cater anymore to careless code that mixes
>> system memory calls with O calls (e.g., if an extension gets memory
>> via `malloc()`, it's its responsibility to call `free()`), and if not
>> then `address_in_range()` isn't really necessary anymore either, and
>> then we could increase the pool size.  O would, however, need a new
>> way to recognize when its version of malloc punted to the system
>> malloc.

[Thomas Wouters ]
> Is this really feasible in a world where the allocators can be selected (and
> the default changed) at runtime?

I think so.  See the "Memory Management" section of the Python/C API
Reference Manual.  It's always been "forbidden" to, e.g., allocate a
thing with PyMem_New() but release it with free().  Ditto mixing a
PyMem_Raw... allocator with a PyMem... deallocator, or PyObject...
one.  Etc.

A type's tp_dealloc implementation should damn well know which memory
family the type's allocator used.

However, no actual proposal on the table changes any "fact on the
ground" here.  They're all as forgiving of slop as the status quo.

> And what would be an efficient way of detecting allocations punted to
> malloc, if not address_in_range?

_The_ most efficient way is the one almost all allocators used long
ago:  use some "hidden" bits right before the address returned to the
user to store info about the block being returned.  Like 1 bit to
distinguish between "obmalloc took this out of one of its pools" and
"obmalloc got this from PyMem_Raw... (whatever that maps to - obmalloc
doesn't care)".  That would be much faster than what we do now.

But on current 64-bit boxes, "1 bit" turns into "16 bytes" to maintain
alignment, so space overhead becomes 100% for the smallest objects
obmalloc can return :-(

Neil Schemenauer takes a different approach in the recent "radix tree
arena map for obmalloc" thread here.  We exchanged ideas on that until
it got to the point that the tree levels only need to trace out
prefixes of obmalloc arena addresses.  That is, the new space burden
of the radix tree appears quite reasonably small.

It doesn't appear to be possible to make it faster than the current
address_in_range(), but in small-scale testing so far speed appears
comparable.
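A rough illustrative sketch of the idea (not Neil's actual code; it assumes
fixed-size, arena-aligned arenas just to show the indexing):

ARENA_BITS = 20              # hypothetical 1 MiB arenas
LEAF_BITS = 14               # hypothetical split of the remaining prefix bits

_root = {}                   # level-1 index -> set of level-2 indices

def _split(addr):
    prefix = addr >> ARENA_BITS
    return prefix >> LEAF_BITS, prefix & ((1 << LEAF_BITS) - 1)

def mark_arena(arena_base):
    i1, i2 = _split(arena_base)
    _root.setdefault(i1, set()).add(i2)

def address_in_range(addr):
    i1, i2 = _split(addr)
    return i2 in _root.get(i1, ())

mark_arena(0x7f3a_5550_0000)
print(address_in_range(0x7f3a_555c_1234))   # True: inside the marked arena
print(address_in_range(0x7f3a_5660_0000))   # False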


> Getting rid of address_in_range sounds like a nice idea, and I would love to 
> test
> how feasible it is -- I can run such a change against a wide selection of code
> at work, including a lot of third-party extension modules, but I don't see an 
> easy
> way to do it right now.

Neil's branch is here:

 https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

It's effectively a different _implementation_ of the current
address_in_range(), one that doesn't ever need to read possibly
uninitialized memory, and couldn't care less about the OS page size.

For the latter reason, it's by far the clearest way to enable
expanding pool size above 4 KiB.  My PR also eliminates the pool size
limitation:

https://github.com/python/cpython/pull/13934

but at the cost of breaking bigger pools up internally into 4K regions
so the excruciating current address_in_range black magic still works.

Neil and I are both keen _mostly_ to increase pool and arena sizes.
The bigger they are, the more time obmalloc can spend in its fastest
code paths.

A question we can't answer yet (or possibly ever) is how badly that
would hurt Python returning arenas to the system, in long-running apps
that go through phases of low and high memory need.

I don't run anything like that - not a world I've ever lived in.  All
my experiments so far say, for programs that are neither horrible nor
wonderful in this respect:

1. An arena size of 4 KiB is most effective for that.
2. There's significant degradation in moving even to 8 KiB arenas.
3. Which continues getting worse the larger the arenas.
4. Until reaching 128 KiB, at which point the rate of degradation falls a lot.

So the current 256 KiB arenas already suck for such programs.

For "horrible" programs, not even tiny 4K arenas help much.

For "wonderful" programs, not even 16 MiB arenas hurt arena recycling
effectiveness.

So if you have real programs keen to "return memory to the system"
periodically, it would be terrific to get info about how changing
arena size affects their behavior in that respect.

My PR uses 16K pools and 1M arenas, quadrupling the status quo.
Because "why not?" ;-)

Neil's branch has _generally_, but not always, used 16 MiB arenas.
The larger the arenas in his branch, the smaller the radix tree needs
to grow.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/7ZIFV2BEL64FQGC35F7QUPK3SHVR3VGT/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-21 Thread Thomas Wouters
On Sun, Jun 2, 2019 at 7:57 AM Tim Peters  wrote:

> I don't think we need to cater anymore to careless code that mixes
> system memory calls with O calls (e.g., if an extension gets memory
> via `malloc()`, it's its responsibility to call `free()`), and if not
> then `address_in_range()` isn't really necessary anymore either, and
> then we could increase the pool size.  O would, however, need a new
> way to recognize when its version of malloc punted to the system
> malloc.
>

Is this really feasible in a world where the allocators can be selected
(and the default changed) at runtime? And what would be an efficient way of
detecting allocations punted to malloc, if not address_in_range?

Getting rid of address_in_range sounds like a nice idea, and I would love
to test how feasible it is -- I can run such a change against a wide
selection of code at work, including a lot of third-party extension
modules, but I don't see an easy way to do it right now.

-- 
Thomas Wouters 

Hi! I'm an email virus! Think twice before sending your email to help me
spread!
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/OA4Z525KYYFHBTTVI5QXVTJH72ZCGQ2S/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-10 Thread Tim Peters
[Neil Schemenauer ]
> I've done a little testing of the pool overhead.  I have an application
> that uses many small dicts as holders of data.  The function:
>
> sys._debugmallocstats()
>
> is useful to get stats for the obmalloc pools.  Total data allocated
> by obmalloc is 262 MB.  At the 4*PAGE_SIZE pool size, the wasted
> space due to partly filled pools is only 0.18%.  For 16*PAGE_SIZE
> pools, 0.71%.
>
> I have a set of stats for another program.  In that case, total
> memory allocated by obmalloc is 14 MB.  For 4*PAGE_SIZE pools,
> wasted space is 0.78% of total.  At 16*PAGE_SIZE, it is 2.4%.

Definitely a tradeoff here:  increasing the number of pages per pool
is a pure win for objects of very popular size classes, but a pure
loss for objects of unpopular size classes.  New comments about
POOL_SIZE in the patch briefly hint at that.

> Based on that small set of data, using 4*PAGE_SIZE seems
> conservative.

Which is a nice way of saying "pulled out of Tim's ass" ;-)

> As I'm sure you realize, making pools bigger will
> waste actual memory, not just virtual address space because you
> write the arena pointer to each OS page.

Yes, but mostly no!  It remains the case that obmalloc neither writes
nor reads anything in an arena until a malloc/realloc call actually
needs to return a pointer to a never-before accessed page.  Writing
the arena index is NOT done to pages when the arena is allocated, or
even when a new pool is carved off an arena, but lazily, one page at a
time, lowest address to highest, as fresh pages actually need to be
returned to the caller.

So arena pointers aren't actually written more frequently than when
using pools of just one page, as is done now (which _still_ writes the
arena index into each page).

Subtlety:  currently, if you pass a nonsense address to
address_in_range that happens to be in one of the arena's pools,
address_in_range() says "yes".  However, after the patch, it may well
return "no" if the address is in one of the pool's pages that hasn't
yet been returned to a malloc/realloc caller (in which case the arena
index hasn't yet been stored at the start of the page).

I don't care about that, because it's 100% a logic error to pass an
address to free/realloc that wasn't obtained from a previous
malloc/realloc call.  So it's a change in undefined behavior.

> I want to do performance profiling using Linux perf.  That should
> show where the hotspot instructions in the obmalloc code are.  Maybe
> that will be useful to you.

Would be good to know, yes!  But may also depend on the app.

> Another thought about address_in_range(): some operating systems
> allow you to allocate memory at specific alignments.  Or, you can
> even allocate a chunk of memory at a fixed memory location if you do
> the correct magic incantation.  I noticed that Go does that.  I
> imagine doing that has a bunch of associated challenges with it.
> However, if we could control the alignment and memory location of
> obmalloc arenas, we would not have the segv problem of
> address_in_range().  It's probably not worth going down that path
> due to the problems involved.

I'm unclear on how that could be exploited.  It's not addresses that
come from arenas that create segv headaches, it's addresses that come
from the system's malloc/realloc.

If we were, e.g., to pick a maximum amount of address space obmalloc
can use in advance, we could live with a single arena of that size,
and just check whether an address is within it.  All the problems come
from the fact that address space may alternate, any number of times,
between addresses in obmalloc arenas and addresses from the system
malloc's internals.
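
The "single big arena" check really would be as cheap as it sounds - a
subtraction and a compare, with no memory read at all (sketch; `base` and
`reserved` would be fixed when the region is set up):

    #include <stdint.h>
    #include <stddef.h>

    /* Is p inside the one reserved obmalloc region?  Unsigned wraparound
       makes the single compare also handle p < base. */
    static int in_obmalloc_region(uintptr_t base, size_t reserved, void *p)
    {
        return (uintptr_t)p - base < reserved;
    }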


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-10 Thread Neil Schemenauer
On 2019-06-09, Tim Peters wrote:
> And now there's a PR that removes obmalloc's limit on pool sizes, and,
> for a start, quadruples pool (and arena!) sizes on 64-bit boxes:

Neat.

> As the PR says,
> 
> """
> It would be great to get feedback from 64-bit apps that do massive
> amounts of small-object allocations and deallocations.
> """

I've done a little testing of the pool overhead.  I have an application
that uses many small dicts as holders of data.  The function:

sys._debugmallocstats()

is useful to get stats for the obmalloc pools.  Total data allocated
by obmalloc is 262 MB.  At the 4*PAGE_SIZE pool size, the wasted
space due to partly filled pools is only 0.18%.  For 16*PAGE_SIZE
pools, 0.71%.

I have a set of stats for another program.  In that case, total
memory allocated by obmalloc is 14 MB.  For 4*PAGE_SIZE pools,
wasted space is 0.78% of total.  At 16*PAGE_SIZE, it is 2.4%.
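
(In absolute terms, that is roughly 0.5 MB vs 1.9 MB of waste for the
262 MB case, and roughly 0.1 MB vs 0.3 MB for the 14 MB one.)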

Based on that small set of data, using 4*PAGE_SIZE seems
conservative.  As I'm sure you realize, making pools bigger will
waste actual memory, not just virtual address space because you
write the arena pointer to each OS page.

I want to do performance profiling using Linux perf.  That should
show where the hotspot instructions in the obmalloc code are.  Maybe
that will be useful to you.

Another thought about address_in_range(): some operating systems
allow you to allocate memory at specific alignments.  Or, you can
even allocate a chunk of memory at a fixed memory location if you do
the correct magic incantation.  I noticed that Go does that.  I
imagine doing that has a bunch of associated challenges with it.
However, if we could control the alignment and memory location of
obmalloc arenas, we would not have the segv problem of
address_in_range().  It's probably not worth going down that path
due to the problems involved.

Regards,

  Neil


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-09 Thread Tim Peters
[Tim]
> For the current obmalloc, I have in mind a different way ...
> Not ideal, but ... captures the important part (more objects
> in a pool -> more times obmalloc can remain in its
> fastest "all within the pool" paths).

And now there's a PR that removes obmalloc's limit on pool sizes, and,
for a start, quadruples pool (and arena!) sizes on 64-bit boxes:

https://github.com/python/cpython/pull/13934

https://bugs.python.org/issue37211

As the PR says,

"""
It would be great to get feedback from 64-bit apps that do massive
amounts of small-object allocations and deallocations.
"""

:-)


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-07 Thread Tim Peters
To be clearer, while knowing the size of allocated objects may be of
some use to some other allocators, "not really" for obmalloc.  That
one does small objects by itself in a uniform way, and punts
everything else to the system malloc family.  The _only_ thing it
wants to know on a free/realloc is which of those two ways delivered
the address to begin with.

Knowing the original size would be an indirect way to accomplish that,
but it really only needs 1 bit of info.  If you're using a radix tree
to implement - in effect - a dict mapping addresses to "stuff", 1 flag
bit in the "stuff" would be ideal for that.  If you haven't already, I
assume you'll soon come around to believe you really should be
tracking the addresses of all gc'ed objects (not just the "small"
ones).

> If the above idea works, we know the object size at free() and realloc(),
> we don't need address_in_range() for those code paths.

Two things:  first, a single bit would not only be sufficient, it
would be _better_ than knowing the object size.  If the bit says the
object came from the system, the size is of no use at all.  If the bit
says it came from obmalloc, what's _really_ wanted is the index of the
object's "size class" in the vector of free lists, and that's already
directly available in the object's pool header (various parts of which
have to be read up &/or updated regardless).  Second, if that bit were
available, `address_in_range()` could be thrown away - obmalloc's free
and realloc entries are the only places it's used (or is of any
conceivable use to obmalloc).
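
A sketch of what the free() fast path could look like if such a bit
existed (hypothetical names, not real code, and ignoring the question of
where the bit would actually live):

    #include <stdint.h>
    #include <stdlib.h>

    #define POOL_SIZE 4096u

    struct pool_header {
        void *freeblock;       /* head of this pool's linked free list */
        unsigned int szidx;    /* index of this pool's size class */
    };

    /* 'from_obmalloc' is the single bit in question.  If it says "system",
       the size is of no use; if it says "obmalloc", the size class is
       already sitting in the pool header. */
    static void free_sketch(void *p, int from_obmalloc)
    {
        if (!from_obmalloc) {
            free(p);                    /* the system malloc owns this block */
            return;
        }
        struct pool_header *pool =
            (struct pool_header *)((uintptr_t)p & ~(uintptr_t)(POOL_SIZE - 1));
        *(void **)p = pool->freeblock;  /* push the block on the pool's list */
        pool->freeblock = p;
    }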

For the current obmalloc, I have in mind a different way (briefly:  let
a pool span a number of 4K pages, but teach pools about the page
structure so that address_in_range() continues to work after trivial
changes (read the arena index from the containing page's start rather
than from the containing pool's)).  Not ideal, but looks remarkably
easy to do, and captures the important part (more objects in a pool ->
more times obmalloc can remain in its fastest "all within the pool"
paths).

> My gut feeling is that the prev/next pointer updates done by
> move_unreachable() and similar functions must be quite expensive.
> Doing the traversal with an explicit stack is a lot less elegant but
> I think it should be faster.  At least, when you are dealing with a
> big set of GC objects that don't fit in the CPU cache.

At the start, it was the _reachable_ objects that were moved out of
the collected generation list rather than the unreachable objects.
Since it's (in all my experience) almost all objects that survive
almost all collections in almost all programs, that was an enormous
amount of moving overall.  So long ago I rewrote that code to move
the _un_reachable objects instead.

Debug output showed countless billions of pointer updates saved across
the programs I tried at the time, but the timing difference was in the
noise.

Probably a big part of "the problem":  when collecting the oldest
generation in a program with "a lot" of objects, no approach at all
will "fit in cache".  We have to crawl the entire object graph then.
Note that the test case in the parent thread had over a billion class
instances, and it was way whittled down(!) from the OP's real program.

But I don't _think_ that's typical quite yet ;-) , and:

> You are likely correct. I'm hoping to benchmark the radix tree idea.
> I'm not too far from having it working such that it can replace
> address_in_range().  Maybe allocating gc_refs as a block would
> offset the radix tree cost vs address_in_range().

I certainly agree gc_refs is the big-bang-for-the-buck thing to
target.  There's no way to avoid mutating that bunches and bunches of
times, even in small programs, and reducing the number of dirtied
cache lines due to that "should" pay off.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Neil Schemenauer
On 2019-06-06, Tim Peters wrote:
> The doubly linked lists in gc primarily support efficient
> _partitioning_ of objects for gc's purposes (a union of disjoint sets,
> with constant-time moving of an object from one set to another, and
> constant-time union of disjoint sets).  "All objects" is almost never
> interesting to it (it is only when the oldest non-frozen generation is
> being collected).

My current idea is to put partitioning flags on the interior radix
tree nodes.  If you mark an object as "finalizer reachable", for
example, it would mark all the nodes on the path from the root with
that flag.  Then, when you want to iterate over all the GC objects
with a flag, you can avoid uninteresting branches of the tree.
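
A rough sketch of that idea (invented structure, nothing like this exists
yet; a toy 4-level tree keyed on 32 bits of the address):

    #include <stdint.h>
    #include <stddef.h>

    #define FANOUT 256
    #define FLAG_FINALIZER_REACHABLE 0x1u

    typedef struct node {
        uint32_t flags;            /* union of flags set anywhere below */
        struct node *child[FANOUT];
    } node;

    /* Marking an object sets the flag on every node on the path from the
       root down to its leaf (assumes the path already exists). */
    static void mark_path(node *root, uint32_t addr, uint32_t flag)
    {
        node *n = root;
        for (int shift = 24; n != NULL && shift >= 0; shift -= 8) {
            n->flags |= flag;
            n = n->child[(addr >> shift) & 0xff];
        }
    }

    /* Iteration can then skip whole subtrees containing no flagged object. */
    static void visit_flagged(node *n, int depth, uint32_t flag,
                              void (*cb)(node *leaf))
    {
        if (n == NULL || (n->flags & flag) == 0)
            return;                /* prune: nothing below carries the flag */
        if (depth == 3) {          /* leaf level in this toy tree */
            cb(n);
            return;
        }
        for (size_t i = 0; i < FANOUT; i++)
            visit_flagged(n->child[i], depth + 1, flag, cb);
    }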

For generations, maybe tracking them at the pool level is good
enough.  Interior nodes can track generations too (i.e. the youngest
generation contained under them).

My gut feeling is that the prev/next pointer updates done by
move_unreachable() and similar functions must be quite expensive.
Doing the traversal with an explicit stack is a lot less elegant but
I think it should be faster.  At least, when you are dealing with a
big set of GC objects that don't fit in the CPU cache.
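
For illustration only, the shape of such a traversal - an explicit-stack
DFS over a toy object graph, nothing CPython-specific:

    #include <stdlib.h>

    typedef struct tobj {
        int visited;
        size_t n_children;
        struct tobj **children;
    } tobj;

    /* Depth-first marking with an explicit stack: no prev/next pointers
       are touched, just one flag per object plus a growable pointer array. */
    static void mark_reachable(tobj *root)
    {
        size_t cap = 64, top = 0;
        tobj **stack = malloc(cap * sizeof(*stack));
        if (stack == NULL)
            return;
        stack[top++] = root;
        while (top > 0) {
            tobj *o = stack[--top];
            if (o == NULL || o->visited)
                continue;
            o->visited = 1;
            for (size_t i = 0; i < o->n_children; i++) {
                if (top == cap) {             /* grow the stack as needed */
                    tobj **bigger = realloc(stack, 2 * cap * sizeof(*stack));
                    if (bigger == NULL) {
                        free(stack);          /* toy: just give up */
                        return;
                    }
                    stack = bigger;
                    cap *= 2;
                }
                stack[top++] = o->children[i];
            }
        }
        free(stack);
    }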

Regards,

  Neil


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Neil Schemenauer
On 2019-06-06, Tim Peters wrote:
> Like now:  if the size were passed in, obmalloc could test the size
> instead of doing the `address_in_range()` dance(*).  But if it's ever
> possible that the size won't be passed in, all the machinery
> supporting `address_in_range()` still needs to be there, and every
> obmalloc spelling of malloc/realloc needs to ensure that machinery
> will work if the returned address is passed back to an obmalloc
> free/realloc spelling without the size.


We can almost make it work for GC objects; the use of obmalloc is
quite well encapsulated.  I think I intentionally designed the
PyObject_GC_New/PyObject_GC_Del/etc APIs that way.

Quick and dirty experiment is here:

https://github.com/nascheme/cpython/tree/gc_malloc_free_size

The major hitch seems to be my new gc_obj_size() function.  We can't be
sure the 'nbytes' passed to _PyObject_GC_Malloc() is the same as
what is computed by gc_obj_size().  It usually works but there are
exceptions (freelists for frame objects and tuple objects, for example).

A nasty problem is the weirdness with PyType_GenericAlloc() and the
sentinel item.  _PyObject_GC_NewVar() doesn't include space for the
sentinel but PyType_GenericAlloc() does.  When you get to
gc_obj_size(), you don't know if you should use "nitems" or "nitems+1".

I'm not sure how to fix the sentinel issue.  Maybe a new type slot
or a type flag?  In any case, making a change like my git branch
above would almost certainly break extensions that don't play
nicely.  It won't be hard to make it a build option, like the
original gcmodule was.  Then, assuming there is a performance boost,
people can enable it if their extensions are friendly.


> The "only"problem with address_in_range is that it limits us to a
> maximum pool size of 4K.  Just for fun, I boosted that to 8K to see
> how likely segfaults really are, and a Python built that way couldn't
> even get to its first prompt before dying with an access violation
> (Windows-speak for segfault).

If we can make the above idea work, you could set the pool size to
8K without issue.  A possible problem is that the obmalloc and
gcmalloc arenas are separate.  I suppose that affects 
performance testing.

> We could eliminate the pool size restriction in many ways.  For
> example, we could store the addresses obtained from the system
> malloc/realloc - but not yet freed - in a set, perhaps implemented as
> a radix tree to cut the memory burden.  But digging through 3 or 4
> levels of a radix tree to determine membership is probably
> significantly slower than address_in_range.

You are likely correct. I'm hoping to benchmark the radix tree idea.
I'm not too far from having it working such that it can replace
address_in_range().  Maybe allocating gc_refs as a block would
offset the radix tree cost vs address_in_range().  If the above idea
works, we know the object size at free() and realloc(), we don't
need address_in_range() for those code paths.

Regards,

  Neil


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Antoine Pitrou
On Thu, 6 Jun 2019 17:26:17 -0500
Tim Peters  wrote:
> 
> The doubly linked lists in gc primarily support efficient
> _partitioning_ of objects for gc's purposes (a union of disjoint sets,
> with constant-time moving of an object from one set to another, and
> constant-time union of disjoint sets).  "All objects" is almost never
> interesting to it (it is only when the oldest non-frozen generation is
> being collected).

Right.  But the oldest generation is precisely the pain point, since
full collections can induce very long pauses.  IOW, perhaps it would be
fine to keep dedicated lists for the young generations, while doing a
heap walk for the full collection case.

> In any case, it appears to be "a feature" now that people can use any
> flavor of the malloc family they like in CPython, so I expect that any
> attempt to tie cyclic gc to a specific flavor of malloc would be a
> very hard sell.

Depends on the benefits of course ;-) Most people use some pre-built
binaries that "naturally" come with obmalloc enabled.

Of course, it's only a very small minority of applications that hit
real performance issues with the GC - and in probably many of those
cases, tuning the full collection threshold can still alleviate the
problem.

Regards

Antoine.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Tim Peters
[Tim]
>> But I don't know what you mean by "access memory in random order to
>> iterate over known objects".  obmalloc never needs to iterate over
>> known objects - indeed, it contains no code capable of doing that.
>> Our cyclic gc does, but that's independent of obmalloc.

[Antoine]
> It's not.  Cyclic GC needs its own linked lists *because* the allocator
> doesn't allow it to iterate over allocated objects.

The doubly linked lists in gc primarily support efficient
_partitioning_ of objects for gc's purposes (a union of disjoint sets,
with constant-time moving of an object from one set to another, and
constant-time union of disjoint sets).  "All objects" is almost never
interesting to it (it is only when the oldest non-frozen generation is
being collected).

Between collections, the partitioning is by generation.

During a collection, the youngest generations are combined into one,
and then that's sub-partitioned in various ways as collection goes
along, ending with a partition into reachable and unreachable objects.
In between, ephemeral partitions are constructed (e.g., the set of
"tentatively unreachable" objects).

None of that was driven by obmalloc's (or malloc's) inability to
iterate over objects.  Doubly linked lists were the obvious way to
implement the required operations on partitions efficiently and
simply.
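
Concretely, the operations the lists exist to support look like this (a
simplified sketch; CPython's actual PyGC_Head differs in detail):

    /* Each partition is a circular doubly linked list with a dummy head. */
    typedef struct gc_head {
        struct gc_head *next;
        struct gc_head *prev;
    } gc_head;

    static void list_init(gc_head *h) { h->next = h->prev = h; }

    /* Constant time: move object o out of whatever set it is in, into h. */
    static void move_to(gc_head *h, gc_head *o)
    {
        o->prev->next = o->next;       /* unlink from the old partition */
        o->next->prev = o->prev;
        o->prev = h->prev;             /* append at the tail of the new one */
        o->next = h;
        h->prev->next = o;
        h->prev = o;
    }

    /* Constant time: union of disjoint sets - splice all of src onto dst. */
    static void merge(gc_head *dst, gc_head *src)
    {
        if (src->next == src)
            return;                    /* src is empty */
        gc_head *first = src->next, *last = src->prev;
        first->prev = dst->prev;
        dst->prev->next = first;
        last->next = dst;
        dst->prev = last;
        list_init(src);
    }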

In any case, it appears to be "a feature" now that people can use any
flavor of the malloc family they like in CPython, so I expect that any
attempt to tie cyclic gc to a specific flavor of malloc would be a
very hard sell.  Which, BTW, was the intended meaning of
"independent":  cyclic gc right now couldn't care less which version
of malloc a user plugs in - nor could obmalloc care less which cyclic
gc algorithm is used.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Antoine Pitrou
On Thu, 6 Jun 2019 16:03:03 -0500
Tim Peters  wrote:
> But I don't know what you mean by "access memory in random order to
> iterate over known objects".  obmalloc never needs to iterate over
> known objects - indeed, it contains no code capable of doing that.
> Our cyclic gc does, but that's independent of obmalloc.

It's not.  Cyclic GC needs its own linked lists *because* the allocator
doesn't allow it to iterate over allocated objects.

Regards

Antoine.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Tim Peters
[Antoine Pitrou ]
> But my response was under the assumption that we would want obmalloc to
> deal with all allocations.

I didn't know that.  I personally have no interest in that:  if we
want an all-purpose allocator, there are several already to choose
from.  There's no reason to imagine we could write a better one.

> Which is more or less required anyway to have an efficient GC that doesn't
> have to walk linked lists and access memory in random order to iterate over
> known objects.

As the parent thread showed, obmalloc does at least as well as any
general-purpose allocator known _for Python's purposes_ (a great many
(de)allocations of "small" objects).  Already explained too that
size-segregated linked free lists are the core of _why_ it does so
well.  Besides making carving off, and freeing, blocks dirt cheap,
linked lists also naturally support recycling memory blocks in MRU
("freshness in cache") order.

But I don't know what you mean by "access memory in random order to
iterate over known objects".  obmalloc never needs to iterate over
known objects - indeed, it contains no code capable of doing that.
Our cyclic gc does, but that's independent of obmalloc.  Over on
Discourse, Neil is speculating about using radix trees for cyclic gc
instead of _its_ linked lists.  In obmalloc, allocated memory regions
aren't linked at all.  It's free regions that are linked, and
helpfully so in MRU order.
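
To make "size-segregated linked free lists" concrete, a toy sketch (not
obmalloc's code; obmalloc recovers the size class from the pool header
rather than taking it as an argument):

    #include <stddef.h>

    #define ALIGNMENT   16
    #define NUM_CLASSES 32       /* size classes for 16, 32, ..., 512 bytes */

    static void *freelist[NUM_CLASSES];  /* one LIFO list per size class */

    /* Freeing pushes a block onto its class's list; the next allocation of
       that class pops it right back - O(1) both ways, and blocks are reused
       in MRU order, i.e. while they are still warm in cache.
       (Toy: assumes 1 <= size <= 512.) */
    static void toy_free(void *p, size_t size)
    {
        size_t cls = (size - 1) / ALIGNMENT;
        *(void **)p = freelist[cls];
        freelist[cls] = p;
    }

    static void *toy_alloc(size_t size)
    {
        size_t cls = (size - 1) / ALIGNMENT;
        void *p = freelist[cls];
        if (p != NULL)
            freelist[cls] = *(void **)p;
        return p;    /* NULL means "carve a fresh block from a pool" */
    }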


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Antoine Pitrou
On Thu, 6 Jun 2019 13:57:37 -0500
Tim Peters  wrote:
> [Antoine Pitrou ]
> > The interesting thing here is that in many situations, the size is
> > known up front when deallocating - it is simply not communicated to the
> > deallocator because the traditional free() API takes a sole pointer,
> > not a size.  But CPython could communicate that size easily if we
> > would like to change the deallocation API.  Then there's no bother
> > looking up the allocated size in sophisticated lookup structures.  
> 
> That could work (to make it possible to increase obmalloc's pool
> size).  Except ...
> 
> > I'll note that jemalloc provides such APIs:
> > http://jemalloc.net/jemalloc.3.html
> >
> > """The dallocx() function causes the memory referenced by ptr to be
> > made available for future allocations.
> >
> > The sdallocx() function is an extension of dallocx() with a size
> > parameter to allow the caller to pass in the allocation size as an
> > optimization."""  
> 
> obmalloc doesn't intend to be a general-purpose allocator - it only
> aims at optimizing "small" allocations, punting to the system for
> everything beyond that.

But my response was under the assumption that we would want obmalloc to
deal with all allocations.  Which is more or less required anyway to
have an efficient GC that doesn't have to walk linked lists and access
memory in random order to iterate over known objects.

Regards

Antoine.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Tim Peters
[Antoine Pitrou ]
> The interesting thing here is that in many situations, the size is
> known up front when deallocating - it is simply not communicated to the
> deallocator because the traditional free() API takes a sole pointer,
> not a size.  But CPython could communicate that size easily if we
> would like to change the deallocation API.  Then there's no bother
> looking up the allocated size in sophisticated lookup structures.

That could work (to make it possible to increase obmalloc's pool
size).  Except ...

> I'll note that jemalloc provides such APIs:
> http://jemalloc.net/jemalloc.3.html
>
> """The dallocx() function causes the memory referenced by ptr to be
> made available for future allocations.
>
> The sdallocx() function is an extension of dallocx() with a size
> parameter to allow the caller to pass in the allocation size as an
> optimization."""

obmalloc doesn't intend to be a general-purpose allocator - it only
aims at optimizing "small" allocations, punting to the system for
everything beyond that.  Unless the size is _always_ passed in (on
every free() and realloc() spelling it supports), an "optimization"
doesn't help it much.  It needs a bulletproof way to determine whether
it, or system malloc/realloc, originally obtained an address passed
in.  If the size is always passed in, no problem (indeed, a single bit
could suffice).  But if it's ever possible that the size won't be
passed in, all the runtime machinery to figure that out on its own
needs to be able to handle all addresses.

Like now:  if the size were passed in, obmalloc could test the size
instead of doing the `address_in_range()` dance(*).  But if it's ever
possible that the size won't be passed in, all the machinery
supporting `address_in_range()` still needs to be there, and every
obmalloc spelling of malloc/realloc needs to ensure that machinery
will work if the returned address is passed back to an obmalloc
free/realloc spelling without the size.
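
A sketch of what a sized-free entry point could look like (hypothetical,
not a proposal of an actual API; the footnote (*) at the end of this
message describes a realloc corner case where deciding by size alone
goes wrong):

    #include <stdlib.h>

    #define SMALL_REQUEST_THRESHOLD 512  /* obmalloc handles requests <= this */

    void pymalloc_free_small(void *p);   /* stand-in for obmalloc's fast path */

    /* If every caller passed the original request size, the origin of the
       block could be decided from the size alone - no address_in_range()
       probe, no read of possibly unmapped memory. */
    static void sized_free_sketch(void *p, size_t size)
    {
        if (size > SMALL_REQUEST_THRESHOLD)
            free(p);                     /* it was punted to the system malloc */
        else
            pymalloc_free_small(p);      /* it came from an obmalloc pool */
    }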

The "only"problem with address_in_range is that it limits us to a
maximum pool size of 4K.  Just for fun, I boosted that to 8K to see
how likely segfaults really are, and a Python built that way couldn't
even get to its first prompt before dying with an access violation
(Windows-speak for segfault).

Alas, that makes it hard to guess how much value there would be for
Python programs if the pool size could be increased - can't even get
Python started.

We could eliminate the pool size restriction in many ways.  For
example, we could store the addresses obtained from the system
malloc/realloc - but not yet freed - in a set, perhaps implemented as
a radix tree to cut the memory burden.  But digging through 3 or 4
levels of a radix tree to determine membership is probably
significantly slower than address_in_range.
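
For a sense of what "3 or 4 levels" costs, a membership probe in such a
tree would look roughly like this (a toy 3-level tree over the high bits
of the address; nothing like this exists in obmalloc):

    #include <stdint.h>
    #include <stddef.h>

    #define BITS 12
    #define FANOUT (1u << BITS)

    typedef struct level2 { uint64_t map[FANOUT / 64]; } level2;  /* bitmap */
    typedef struct level1 { level2 *down[FANOUT]; } level1;
    typedef struct level0 { level1 *down[FANOUT]; } level0;

    /* Three dependent loads plus a bit test per free()/realloc() - that is
       the cost being weighed here against address_in_range(). */
    static int in_tree(const level0 *root, uintptr_t addr)
    {
        uintptr_t key = addr >> 12;      /* track 4 KiB granules (toy: the
                                            topmost address bits are ignored) */
        const level1 *l1 = root->down[(key >> (2 * BITS)) & (FANOUT - 1)];
        if (l1 == NULL)
            return 0;
        const level2 *l2 = l1->down[(key >> BITS) & (FANOUT - 1)];
        if (l2 == NULL)
            return 0;
        uintptr_t bit = key & (FANOUT - 1);
        return (int)((l2->map[bit / 64] >> (bit % 64)) & 1u);
    }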

I can think of a way to do it slightly faster than (but related to)
address_in_range, but it would (on a 64-bit box) require adding 24
more bytes for each system-malloc/realloc allocation.  8 of those
bytes would be pure waste, because the Alignment Gods appear to
require 16-byte alignment for every allocation on a 64-bit box now.

In stark contrast, the extra memory burden of the current
address_in_range is an insignificant 8 bytes per _arena_ (256 KB, and
"should be" larger now).

Another approach:  keep address_in_range as-is, but add new internal
structure to larger pools, to repeat the arena index every 4KB.  But
that fights somewhat against the goal of larger pools.

Etc. ;-)

(*) Actually not quite true.  If a "large" object is obtained from
obmalloc now (meaning it actually came from the system malloc), then
cut back to a "small" size by a realloc, it _remains_ under the
control of the system malloc now.  Passing in the newer "small" size
to a free() later would cause obmalloc to get fatally confused about
that.