[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Tue, Jul 9, 2019 at 9:46 AM Tim Peters wrote:
>
> > At last, all size classes have 1~3 used/cached memory blocks.
>
> No doubt part of it, but hard to believe it's most of it.  If the loop
> count above really is 10240, then there's only about 80K worth of
> pointers in the final `buf`.

You are right.  List.append is not the major consumer of memory in the
"large" class (8KiB+1 ~ 512KiB).  There are several causes of large
allocations:

* bm_logging uses StringIO.seek(0); StringIO.truncate() to reset the
  buffer.  So from the 2nd loop on, the internal buffer of StringIO
  becomes a Py_UCS4 array instead of a list of strings.  This buffer
  uses the same growth policy as list:
  `size + (size >> 3) + (size < 9 ? 3 : 6)`.  Actually, when I use the
  `-n 1` option, memory usage is only 9MiB.

* The intern dict.

* Many modules are loaded, and FileIO.readall() is used to read pyc
  files.  This creates and deletes bytes objects of various sizes.

* The logging module uses several regular expressions.  `b'\0' * 0xff00`
  is used in sre_compile.
  https://github.com/python/cpython/blob/master/Lib/sre_compile.py#L320

> But does it really matter? ;-)  mimalloc "should have" done MADV_FREE
> on the pages holding the older `buf` instances, so it's not like the
> app is demanding to hold on to the RAM (albeit that it may well show
> up in the app's RSS unless/until the OS takes the RAM away).

mimalloc doesn't call madvise for each free().  Each size class keeps a
64KiB "page", and several OS pages (4KiB) in that "page" are committed
but not used.  I dumped the stats for all "mimalloc pages":

https://paper.dropbox.com/doc/mimalloc-on-CPython--Agg3g6XhoX77KLLmN43V48cfAg-fFyIm8P9aJpymKQN0scpp#:uid=671467140288877659659079&h2=memory-usage-of-logging_format

For example:

    bin  block_size  used  capacity  reserved
     29        2560     1        22        25   (14 pages are committed, 2560 bytes are in use)
     29        2560    14        25        25   (16 pages are committed, 2560*14 bytes are in use)
     29        2560    11        25        25
     31        3584     1         5        18   (5 pages are committed, 3584 bytes are in use)
     33        5120    14        12
     33        5120     2        12        12
     33        5120     2        12        12
     37       10240     3        11       409
     41       20480    16       204
     57      327680    12        12

* Committed pages can be roughly calculated as
  `ceil(block_size * capacity / 4096)`.

There are dozens of unused memory blocks and committed pages in each
size class.  This caused a 10MiB+ memory usage overhead in the
logging_format and logging_simple benchmarks.

> I was more intrigued by your first (speed) comparison:
>
> > - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
>
> Now _that's_ interesting ;-)  Looks like spectral_norm recycles many
> short-lived Python floats at a swift pace.  So memory management
> should account for a large part of its runtime (the arithmetic it does
> is cheap in comparison), and obmalloc and mimalloc should both excel
> at recycling mountains of small objects.  Why is mimalloc
> significantly faster?

[snip]

> obmalloc's `address_in_range()` is definitely a major overhead in its
> fastest `free()` path, but then mimalloc has to figure out which
> thread is doing the freeing (looks cheaper than address_in_range, but
> not free).  Perhaps the layers of indirection that have been wrapped
> around obmalloc over the years are to blame?  Perhaps mimalloc's
> larger (16x) pools and arenas let it stay in its fastest paths more
> often?  I don't know why, but it would be interesting to find out :-)

Totally agree.  I'll investigate this next.
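As a quick cross-check, the committed-page counts quoted in parentheses
in the dump can be re-derived from that rough formula.  A small sketch
(the `(block_size, capacity)` pairs are copied from the rows above):

```python
import math

def committed_pages(block_size, capacity, os_page=4096):
    """Rough count of committed 4 KiB OS pages in a mimalloc "page"
    that has carved out `capacity` blocks of `block_size` bytes."""
    return math.ceil(block_size * capacity / os_page)

# Rows from the dump above, checked against the parenthetical notes.
print(committed_pages(2560, 22))   # 14  ("14 pages are committed")
print(committed_pages(2560, 25))   # 16  ("16 pages are committed")
print(committed_pages(3584, 5))    # 5   ("5 pages are committed")
```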
Regards,
--
Inada Naoki

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/MXEE2NOEDAP72RFVTC7H4GJSE2CHP3SX/
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Inada Naoki , trying mimalloc]
>>> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
>>> benchmarks.  I will look into it later.
>> ...
>> $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
>> Slower (60):
>> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
>> slower (+158%)
>> - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x
>> slower (+127%)

> I think I understand why mimalloc uses more than twice the memory of
> pymalloc + glibc malloc in the logging_format and logging_simple
> benchmarks.
>
> These two benchmarks do something like this:
>
>     buf = []  # in StringIO
>     for _ in range(10*1024):
>         buf.append("important: some important information to be logged")
>     s = "".join(buf)       # StringIO.getvalue()
>     s.splitlines()
>
> mimalloc uses a size-segregated allocator for sizes up to ~512KiB, and
> the size class is determined by the top three bits.
> On the other hand, list increases its capacity by 9/8.  It means the
> next size class is used on each realloc.

Often, but not always (multiplication by 9/8 may not change the top 3
bits - e.g., 128 * 9/8 = 144).

> At last, all size classes have 1~3 used/cached memory blocks.

No doubt part of it, but hard to believe it's most of it.  If the loop
count above really is 10240, then there's only about 80K worth of
pointers in the final `buf`.

To account for a difference of over 10M, it would need to have left
behind well over 100 _full_ size copies from earlier reallocs.

In fact the number of list elements across resizes goes like so:

    0, 4, 8, 16, 25, 35, 46, ..., 7671, 8637, 9723, 10945

Adding all of those sums to 96,113, so accounts for less than 1M of
8-byte pointers if none were ever released.  mimalloc will, of course,
add its own slop on top of that - but not a factor of ten's worth.
Unless maybe it's using a dozen such buffers at once?

But does it really matter? ;-)  mimalloc "should have" done MADV_FREE
on the pages holding the older `buf` instances, so it's not like the
app is demanding to hold on to the RAM (albeit that it may well show
up in the app's RSS unless/until the OS takes the RAM away).

> This is almost the worst case for mimalloc.  In a more complex
> application, there are more chances to reuse memory blocks.
>
> In a complex or huge application, this overhead will become
> relatively small.  Its speed is attractive.
>
> But for memory efficiency, pymalloc + jemalloc / tcmalloc may be
> better for common cases.

The mimalloc page says that, in their benchmarks:

"""
In our benchmarks (see below), mimalloc always outperforms all other
leading allocators (jemalloc, tcmalloc, Hoard, etc), and usually uses
less memory (up to 25% more in the worst case).
"""

obmalloc is certainly more "memory efficient" (for some meanings of
that phrase) for smaller objects:  in 3.7 it splits objects of <= 512
bytes into 64 size classes.  mimalloc also has (close to) 64 "not
gigantic" size classes, but those cover a range of sizes over a
thousand times wider (up to about half a meg).  Everything obmalloc
handles fits in mimalloc's first 20 size classes.  So mimalloc
routinely needs more memory to satisfy a "small object" request than
obmalloc does.

I was more intrigued by your first (speed) comparison:

> - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)

Now _that's_ interesting ;-)  Looks like spectral_norm recycles many
short-lived Python floats at a swift pace.  So memory management should
account for a large part of its runtime (the arithmetic it does is
cheap in comparison), and obmalloc and mimalloc should both excel at
recycling mountains of small objects.  Why is mimalloc significantly
faster?

This benchmark should stay in the "fastest paths" of both allocators
most often, and they both have very lean fastest paths (they both use
pool-local, singly-linked, size-segregated free lists, so malloc and
free for both should usually amount to just popping or pushing one
block off/on the head of the appropriate list).

obmalloc's `address_in_range()` is definitely a major overhead in its
fastest `free()` path, but then mimalloc has to figure out which
thread is doing the freeing (looks cheaper than address_in_range, but
not free).  Perhaps the layers of indirection that have been wrapped
around obmalloc over the years are to blame?  Perhaps mimalloc's
larger (16x) pools and arenas let it stay in its fastest paths more
often?  I don't know why, but it would be interesting to find out :-)
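The resize sequence and the 96,113 total above can be reproduced with a
few lines of Python modeling CPython's list over-allocation rule
(`new = size + (size >> 3) + (size < 9 ? 3 : 6)`, applied when an
append exceeds the current capacity - a sketch of the rule in
Objects/listobject.c, not the C code itself):

```python
def growth_sequence(final_len):
    """Capacities a CPython list passes through while appending
    final_len items one at a time."""
    caps = [0]
    allocated = 0
    for newsize in range(1, final_len + 1):
        if newsize > allocated:
            # Over-allocation rule from list_resize() in listobject.c.
            allocated = newsize + (newsize >> 3) + (3 if newsize < 9 else 6)
            caps.append(allocated)
    return caps

caps = growth_sequence(10 * 1024)
print(caps[:7])     # [0, 4, 8, 16, 25, 35, 46]
print(caps[-4:])    # [7671, 8637, 9723, 10945]
print(sum(caps))    # 96113 -- under 1M of 8-byte pointers even if
                    # every intermediate buffer were left behind
```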
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, Jul 4, 2019 at 11:32 PM Inada Naoki wrote:
>
> On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou wrote:
> >
> > Ah, interesting.  Were you able to measure the memory footprint as well?
> >
>
> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
> benchmarks.  I will look into it later.
>
> ```
> $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> Slower (60):
> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
> slower (+158%)
> - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x
> slower (+127%)

I think I understand why mimalloc uses more than twice the memory of
pymalloc + glibc malloc in the logging_format and logging_simple
benchmarks.

These two benchmarks do something like this:

    buf = []  # in StringIO
    for _ in range(10*1024):
        buf.append("important: some important information to be logged")
    s = "".join(buf)       # StringIO.getvalue()
    s.splitlines()

mimalloc uses a size-segregated allocator for sizes up to ~512KiB, and
the size class is determined by the top three bits.
On the other hand, list increases its capacity by 9/8.  It means the
next size class is used on each realloc.
At last, all size classes have 1~3 used/cached memory blocks.

This is almost the worst case for mimalloc.  In a more complex
application, there are more chances to reuse memory blocks.

In a complex or huge application, this overhead will become relatively
small.  Its speed is attractive.

But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better
for common cases.

Regards,
--
Inada Naoki
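The "size class from the top three bits" point can be made concrete
with a deliberately simplified model (this is only an illustration, not
mimalloc's real bin computation):

```python
def top3(n):
    """Top three bits of n (for n >= 4) -- a stand-in for the bin index
    of an allocator that classifies sizes by their top three bits."""
    return n >> (n.bit_length() - 3)

# One 9/8 growth step can stay in the same class...
print(top3(128), top3(128 * 9 // 8))   # 4 4  (128 and 144 share top bits 100)
# ...but repeated 9/8 steps soon cross into a new class, stranding the
# old block in its bin.
print(top3(1024), top3(1024 * 9 // 8 * 9 // 8))   # 4 5
```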
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Victor Stinner ]
> I guess that INADA-san used pyperformance --track-memory.
>
> pyperf --track-memory doc:
> "--track-memory: get the memory peak usage. it is less accurate than
> tracemalloc, but has a lower overhead. On Linux, compute the sum of
> Private_Clean and Private_Dirty memory mappings of /proc/self/smaps.
> ...

So I'll take that as meaning essentially that it's reporting what RSS
would report if it ignored shared pages (so the peak # of private pages
actually backed by physical RAM).  Clear as mud how MADV_FREE affects
that.
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
I found that the calibrated loop count is not stable, so memory usage
differs a lot in some benchmarks.  In particular, the RAM usage of the
logging benchmark is closely tied to the loop count:

$ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so ./python bm_logging.py simple --track-memory --fast --inherit-environ PYTHONMALLOC,LD_PRELOAD -v
Run 1: calibrate the number of loops: 512
- calibrate 1: 12.7 MB (loops: 512)
Calibration: 1 warmup, 512 loops
Run 2: 0 warmups, 1 value, 512 loops
- value 1: 12.9 MB
Run 3: 0 warmups, 1 value, 512 loops
- value 1: 12.9 MB
...

$ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so ./python bm_logging.py simple --track-memory --fast --inherit-environ PYTHONMALLOC,LD_PRELOAD -v -l1024
Run 1: 0 warmups, 1 value, 1024 loops
- value 1: 21.4 MB
Run 2: 0 warmups, 1 value, 1024 loops
- value 1: 21.4 MB
Run 3: 0 warmups, 1 value, 1024 loops
- value 1: 21.4 MB
...

--
Inada Naoki
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
I guess that INADA-san used pyperformance --track-memory.

pyperf --track-memory doc:

"--track-memory: get the memory peak usage. it is less accurate than
tracemalloc, but has a lower overhead. On Linux, compute the sum of
Private_Clean and Private_Dirty memory mappings of /proc/self/smaps.
On Windows, get PeakPagefileUsage of GetProcessMemoryInfo() (of the
current process): the peak value of the Commit Charge during the
lifetime of this process."
https://pyperf.readthedocs.io/en/latest/runner.html#misc

On Linux, pyperf uses read_smap_file() of pyperf._memory:
https://github.com/vstinner/pyperf/blob/master/pyperf/_memory.py

    # Code to parse Linux /proc/%d/smaps files.
    #
    # See http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html for
    # a quick introduction to smaps.
    #
    # Need Linux 2.6.16 or newer.
    def read_smap_file():
        total = 0
        fp = open(proc_path("self/smaps"), "rb")
        with fp:
            for line in fp:
                # Include both Private_Clean and Private_Dirty sections.
                line = line.rstrip()
                if line.startswith(b"Private_") and line.endswith(b'kB'):
                    parts = line.split()
                    total += int(parts[1]) * 1024
        return total

It spawns a thread which reads /proc/self/smaps every millisecond and
then reports the *maximum*.

Victor

On Thu, Jul 4, 2019 at 5:12 PM, Tim Peters wrote:
>
> [Antoine Pitrou ]
> >> Ah, interesting.  Were you able to measure the memory footprint as well?
>
> [Inada Naoki ]
> > Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
> > benchmarks.  I will look into it later.
> >
> > ```
> > $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> > Slower (60):
> > - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
> > slower (+158%)
> > ...
>
> Could you say more about what's being measured here?  Like, if this is
> on Linux, is it reporting RSS?  VSS?  Something else?
>
> mimalloc is "most like" obmalloc among those I've looked at in recent
> weeks.  I noted before that its pools (they call them "pages") and
> arenas (called "segments") are at least 16x larger than obmalloc uses
> (64 KiB minimum pool/page size, and 4 MiB minimum arena/segment size,
> in mimalloc).
>
> Like all "modern" 64-bit allocators, it cares little about reserving
> largish blobs of address space, so I'd _expect_ Linuxy VSS to zoom.
> But it also releases (in some sense, like MADV_FREE) physical RAM back
> to the system at a granularity far smaller than an arena/segment.
>
> At an extreme, the SuperMalloc I linked to earlier reserves a 512 MiB
> vector at the start, so no program linking to that can consume less
> than half a gig of address space.  But it only expects to _use_ a few
> 4 KiB OS pages out of that.  mimalloc doesn't do anything anywhere
> near _that_ gonzo (& since mimalloc comes out of Microsoft, it never
> will - "total virtual memory" on Windows is a system-wide resource,
> limited to the sum of physical RAM + pagefile size - no "overcommit"
> is allowed).

--
Night gathers, and now my watch begins. It shall not end until my death.
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Antoine Pitrou ]
>> Ah, interesting.  Were you able to measure the memory footprint as well?

[Inada Naoki ]
> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
> benchmarks.  I will look into it later.
>
> ```
> $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> Slower (60):
> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
> slower (+158%)
> ...

Could you say more about what's being measured here?  Like, if this is
on Linux, is it reporting RSS?  VSS?  Something else?

mimalloc is "most like" obmalloc among those I've looked at in recent
weeks.  I noted before that its pools (they call them "pages") and
arenas (called "segments") are at least 16x larger than obmalloc uses
(64 KiB minimum pool/page size, and 4 MiB minimum arena/segment size,
in mimalloc).

Like all "modern" 64-bit allocators, it cares little about reserving
largish blobs of address space, so I'd _expect_ Linuxy VSS to zoom.
But it also releases (in some sense, like MADV_FREE) physical RAM back
to the system at a granularity far smaller than an arena/segment.

At an extreme, the SuperMalloc I linked to earlier reserves a 512 MiB
vector at the start, so no program linking to that can consume less
than half a gig of address space.  But it only expects to _use_ a few
4 KiB OS pages out of that.  mimalloc doesn't do anything anywhere near
_that_ gonzo (& since mimalloc comes out of Microsoft, it never will -
"total virtual memory" on Windows is a system-wide resource, limited to
the sum of physical RAM + pagefile size - no "overcommit" is allowed).
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, 4 Jul 2019 23:32:55 +0900 Inada Naoki wrote:
> On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou wrote:
> >
> > Ah, interesting.  Were you able to measure the memory footprint as well?
> >
>
> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
> benchmarks.  I will look into it later.

Ah, indeed, MADV_FREE will make it complicated to measure actual memory
usage :-/

Regards

Antoine.
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou wrote:
>
> Ah, interesting.  Were you able to measure the memory footprint as well?
>

Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect some
benchmarks.  I will look into it later.

```
$ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
Slower (60):
- logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x slower (+158%)
- logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x slower (+127%)
- regex_dna: 13.3 MB +- 19.1 kB -> 27.0 MB +- 12.1 kB: 2.02x slower (+102%)
- json_dumps: 8351.8 kB +- 19.8 kB -> 15.2 MB +- 18.0 kB: 1.87x slower (+87%)
- regex_v8: 8434.6 kB +- 20.9 kB -> 14.4 MB +- 11.0 kB: 1.75x slower (+75%)
- unpack_sequence: 7521.0 kB +- 17.0 kB -> 9980.8 kB +- 24.7 kB: 1.33x slower (+33%)
- hexiom: 7412.2 kB +- 19.0 kB -> 9307.4 kB +- 8004 bytes: 1.26x slower (+26%)
- xml_etree_process: 12.2 MB +- 26.3 kB -> 15.0 MB +- 28.9 kB: 1.23x slower (+23%)
- genshi_text: 10.2 MB +- 24.0 kB -> 12.5 MB +- 24.8 kB: 1.22x slower (+22%)
- crypto_pyaes: 7602.2 kB +- 35.7 kB -> 9242.8 kB +- 7873 bytes: 1.22x slower (+22%)
- tornado_http: 24.9 MB +- 72.1 kB -> 30.1 MB +- 33.0 kB: 1.21x slower (+21%)
- chameleon: 15.8 MB +- 24.5 kB -> 19.1 MB +- 23.4 kB: 1.21x slower (+21%)
- genshi_xml: 10.9 MB +- 24.0 kB -> 12.9 MB +- 19.6 kB: 1.18x slower (+18%)
- go: 8662.6 kB +- 16.4 kB -> 10082.8 kB +- 26.2 kB: 1.16x slower (+16%)
- pathlib: 8863.6 kB +- 30.2 kB -> 10229.8 kB +- 19.4 kB: 1.15x slower (+15%)
- scimark_fft: 7473.4 kB +- 14.4 kB -> 8606.0 kB +- 28.6 kB: 1.15x slower (+15%)
- scimark_lu: 7463.2 kB +- 15.1 kB -> 8569.8 kB +- 28.6 kB: 1.15x slower (+15%)
- pidigits: 7380.2 kB +- 20.0 kB -> 8436.0 kB +- 24.2 kB: 1.14x slower (+14%)
- scimark_monte_carlo: 7354.4 kB +- 18.2 kB -> 8398.8 kB +- 27.0 kB: 1.14x slower (+14%)
- scimark_sparse_mat_mult: 7889.8 kB +- 20.1 kB -> 9006.2 kB +- 29.4 kB: 1.14x slower (+14%)
- scimark_sor: 7377.2 kB +- 18.9 kB -> 8402.0 kB +- 29.0 kB: 1.14x slower (+14%)
- chaos: 7693.0 kB +- 33.0 kB -> 8747.6 kB +- 10.5 kB: 1.14x slower (+14%)
- richards: 7364.2 kB +- 29.8 kB -> 8331.4 kB +- 20.2 kB: 1.13x slower (+13%)
- raytrace: 7696.0 kB +- 30.3 kB -> 8695.4 kB +- 30.0 kB: 1.13x slower (+13%)
- sqlite_synth: 8799.2 kB +- 25.5 kB -> 9937.4 kB +- 27.1 kB: 1.13x slower (+13%)
- logging_silent: 7533.8 kB +- 32.0 kB -> 8488.2 kB +- 25.1 kB: 1.13x slower (+13%)
- json_loads: 7317.8 kB +- 22.7 kB -> 8215.2 kB +- 21.5 kB: 1.12x slower (+12%)
- unpickle_list: 7513.4 kB +- 9790 bytes -> 8420.6 kB +- 25.6 kB: 1.12x slower (+12%)
- unpickle: 7519.8 kB +- 11.4 kB -> 8425.4 kB +- 27.1 kB: 1.12x slower (+12%)
- fannkuch: 7170.0 kB +- 14.9 kB -> 8033.0 kB +- 22.5 kB: 1.12x slower (+12%)
- pickle_list: 7514.6 kB +- 18.2 kB -> 8414.6 kB +- 24.0 kB: 1.12x slower (+12%)
- telco: 7685.2 kB +- 15.0 kB -> 8598.2 kB +- 17.6 kB: 1.12x slower (+12%)
- nbody: 7214.8 kB +- 10.7 kB -> 8070.2 kB +- 19.5 kB: 1.12x slower (+12%)
- pickle: 7523.2 kB +- 12.4 kB -> 8415.0 kB +- 21.0 kB: 1.12x slower (+12%)
- 2to3: 7171.2 kB +- 35.8 kB -> 8016.4 kB +- 21.7 kB: 1.12x slower (+12%)
- nqueens: 7425.2 kB +- 21.8 kB -> 8296.8 kB +- 25.5 kB: 1.12x slower (+12%)
- spectral_norm: 7212.6 kB +- 19.6 kB -> 8052.8 kB +- 18.4 kB: 1.12x slower (+12%)
- regex_compile: 8538.0 kB +- 21.0 kB -> 9528.6 kB +- 22.1 kB: 1.12x slower (+12%)
- pickle_pure_python: 7559.8 kB +- 19.4 kB -> 8430.0 kB +- 25.6 kB: 1.12x slower (+12%)
- unpickle_pure_python: 7545.4 kB +- 9233 bytes -> 8413.0 kB +- 15.6 kB: 1.11x slower (+11%)
- float: 23.9 MB +- 22.5 kB -> 26.6 MB +- 19.7 kB: 1.11x slower (+11%)
- sqlalchemy_imperative: 18.2 MB +- 46.2 kB -> 20.2 MB +- 36.5 kB: 1.11x slower (+11%)
- regex_effbot: 7910.8 kB +- 15.1 kB -> 8804.8 kB +- 20.9 kB: 1.11x slower (+11%)
- pickle_dict: 7563.4 kB +- 15.3 kB -> 8415.2 kB +- 19.3 kB: 1.11x slower (+11%)
- sqlalchemy_declarative: 18.9 MB +- 40.2 kB -> 21.0 MB +- 26.4 kB: 1.11x slower (+11%)
- xml_etree_parse: 11.8 MB +- 12.6 kB -> 13.0 MB +- 16.3 kB: 1.11x slower (+11%)
- html5lib: 20.1 MB +- 44.9 kB -> 22.2 MB +- 46.6 kB: 1.10x slower (+10%)
- xml_etree_iterparse: 12.0 MB +- 26.5 kB -> 13.2 MB +- 31.3 kB: 1.10x slower (+10%)
- sympy_integrate: 36.4 MB +- 26.7 kB -> 40.0 MB +- 33.2 kB: 1.10x slower (+10%)
- sympy_str: 37.2 MB +- 28.4 kB -> 40.7 MB +- 26.6 kB: 1.10x slower (+10%)
- sympy_expand: 36.2 MB +- 19.9 kB -> 39.7 MB +- 25.1 kB: 1.09x slower (+9%)
- mako: 15.3 MB +- 19.1 kB -> 16.7 MB +- 25.4 kB: 1.09x slower (+9%)
- django_template: 19.3 MB +- 14.9 kB -> 21.0 MB +- 14.6 kB: 1.09x slower (+9%)
- xml_etree_generate: 12.3 MB +- 39.5 kB -> 13.3 MB +- 26.9 kB: 1.08x slower (+8%)
- deltablue: 8918.0 kB +- 19.8 kB -> 9615.8 kB +- 12.5 kB: 1.08x slower (+8%)
- dulwich_log: 12.2 MB +- 102.9 kB -> 13.1 MB +- 26.6 kB: 1.08x slower (+8%)
- meteor_contest: 9296.0 kB +- 11.9 kB -> 9996.8 kB +- 20.7 kB: 1.08x slower (+8%)
- sympy_sum: 62.2 MB +- 20.8 kB -> 66.5 MB +- 21.1 kB: 1.07x slower (+7%)
- python_startup: 7946.6 kB +-
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, 4 Jul 2019 16:19:52 +0900 Inada Naoki wrote:
> On Tue, Jun 25, 2019 at 5:49 AM Antoine Pitrou wrote:
> >
> >
> > For the record, there's another contender in the allocator
> > competition now:
> > https://github.com/microsoft/mimalloc/
> >
> > Regards
> >
> > Antoine.
>
> It's a very strong competitor!
>
> $ ./python -m pyperf compare_to pymalloc.json mimalloc.json -G --min-speed=3
> Faster (14):
> - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
> - unpickle: 19.7 us +- 1.9 us -> 17.6 us +- 1.3 us: 1.12x faster (-11%)
> - json_dumps: 17.1 ms +- 0.2 ms -> 15.7 ms +- 0.2 ms: 1.09x faster (-8%)
> - json_loads: 39.0 us +- 2.6 us -> 36.2 us +- 1.1 us: 1.08x faster (-7%)
> - crypto_pyaes: 162 ms +- 1 ms -> 150 ms +- 1 ms: 1.08x faster (-7%)
> - regex_effbot: 3.62 ms +- 0.04 ms -> 3.38 ms +- 0.01 ms: 1.07x faster (-7%)
> - pickle_pure_python: 689 us +- 53 us -> 650 us +- 5 us: 1.06x faster (-6%)
> - scimark_fft: 502 ms +- 2 ms -> 478 ms +- 2 ms: 1.05x faster (-5%)
> - float: 156 ms +- 2 ms -> 149 ms +- 1 ms: 1.05x faster (-5%)
> - pathlib: 29.0 ms +- 0.5 ms -> 27.7 ms +- 0.4 ms: 1.05x faster (-4%)
> - mako: 22.4 ms +- 0.1 ms -> 21.6 ms +- 0.2 ms: 1.04x faster (-4%)
> - scimark_sparse_mat_mult: 6.21 ms +- 0.04 ms -> 5.99 ms +- 0.08 ms: 1.04x faster (-3%)
> - xml_etree_parse: 179 ms +- 5 ms -> 173 ms +- 3 ms: 1.04x faster (-3%)
> - sqlalchemy_imperative: 42.0 ms +- 0.9 ms -> 40.7 ms +- 0.9 ms: 1.03x faster (-3%)
>
> Benchmark hidden because not significant (46): ...

Ah, interesting.  Were you able to measure the memory footprint as well?

Regards

Antoine.
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Tue, Jun 25, 2019 at 5:49 AM Antoine Pitrou wrote:
>
>
> For the record, there's another contender in the allocator
> competition now:
> https://github.com/microsoft/mimalloc/
>
> Regards
>
> Antoine.

It's a very strong competitor!

$ ./python -m pyperf compare_to pymalloc.json mimalloc.json -G --min-speed=3
Faster (14):
- spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
- unpickle: 19.7 us +- 1.9 us -> 17.6 us +- 1.3 us: 1.12x faster (-11%)
- json_dumps: 17.1 ms +- 0.2 ms -> 15.7 ms +- 0.2 ms: 1.09x faster (-8%)
- json_loads: 39.0 us +- 2.6 us -> 36.2 us +- 1.1 us: 1.08x faster (-7%)
- crypto_pyaes: 162 ms +- 1 ms -> 150 ms +- 1 ms: 1.08x faster (-7%)
- regex_effbot: 3.62 ms +- 0.04 ms -> 3.38 ms +- 0.01 ms: 1.07x faster (-7%)
- pickle_pure_python: 689 us +- 53 us -> 650 us +- 5 us: 1.06x faster (-6%)
- scimark_fft: 502 ms +- 2 ms -> 478 ms +- 2 ms: 1.05x faster (-5%)
- float: 156 ms +- 2 ms -> 149 ms +- 1 ms: 1.05x faster (-5%)
- pathlib: 29.0 ms +- 0.5 ms -> 27.7 ms +- 0.4 ms: 1.05x faster (-4%)
- mako: 22.4 ms +- 0.1 ms -> 21.6 ms +- 0.2 ms: 1.04x faster (-4%)
- scimark_sparse_mat_mult: 6.21 ms +- 0.04 ms -> 5.99 ms +- 0.08 ms: 1.04x faster (-3%)
- xml_etree_parse: 179 ms +- 5 ms -> 173 ms +- 3 ms: 1.04x faster (-3%)
- sqlalchemy_imperative: 42.0 ms +- 0.9 ms -> 40.7 ms +- 0.9 ms: 1.03x faster (-3%)

Benchmark hidden because not significant (46): ...
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Antoine Pitrou ]
> For the record, there's another contender in the allocator
> competition now:
> https://github.com/microsoft/mimalloc/

Thanks!  From a quick skim, most of it is addressing things obmalloc
doesn't:

1) Efficient thread safety (we rely on the GIL).

2) Directly handling requests of all sizes (we only deal with <= 512
bytes).

3) Funky stuff to help refcounting languages that want to guarantee
that freeing an object takes bounded time (the problem is that large
objects can contain an unbounded number of other objects that will then
also die - so mimalloc has complications to interact with
client-supplied functions that parcel out the tree of objects killed by
a decref, so the freeing can be done incrementally over time).

I don't think it's been explicitly pointed out before, but #2 is an
alternative to my PR and Neil's radix tree.  If we insist that programs
play by the rules, and obmalloc controls every piece of memory it hands
out, then the new definition of address_in_range() becomes

    #define address_in_range(p, pool) true

;-)

This deserves more thought than I've ever given to it.

Leaving aside all the above, they emphasize this:

"""
Many allocators use a single free list per size class which can lead to
bad spatial locality where objects belonging to a single structure can
be spread out over the entire heap. ... To improve the spatial locality
of allocation, mimalloc use free list sharding where the heap is
divided into pages (per size-class) with a free list per page (where
pages are usually 64KiB for small objects).
"""

Which is exactly what obmalloc already does, except we call their
"pages" "pools", and ours are 16x smaller (<= Python 3.8) or 4x smaller
(my PR).  They call our "arenas" "segments", and theirs are a minimum
of 4 MiB.  Their motivations for making these "large" are the same as
mine: maximize the time they can stay in the very zippy "fast paths".

Something I'm torn on: within a pool, obmalloc has two ways to find the
next free block.  The pool has its own block free list (as in
mimalloc), but _also_ has a "bump allocator":  the pool is carved into
blocks one at a time, low address to high, whenever the pool's free
list is NULL.  This was part of Vladimir's desire to never touch a
piece of memory before it's actually needed to satisfy a client
request.

But it does complicate and slow the code on the second-fastest
allocation path (when the pool's free list _is_ NULL).  The conditional
"OK, the free list is NULL - so is there room for another bump
allocation?" is a waste of time after all the pool's blocks have been
passed out at least once.  The bump allocator can never succeed again
then, but we keep testing for it forever.

The alternative is to break the pool into blocks, and link them into a
free list, in one gulp when a size class is assigned to a free pool.
That mutates all the blocks in the pool (to store the list pointers),
even if all but one will never be used.  In return, the code for
testing whether the bump allocator can find room, and the pool header
members supporting the bump allocator, could be thrown out.  That's
what mimalloc appears to do, and they said it sped things up a bit
overall.

But I think I'll leave that one for someone else ;-)
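The trade-off described above can be sketched with a toy model (Python
stand-ins for the two pool designs; obmalloc's real implementation is
in C and differs in detail):

```python
class BumpPool:
    """Toy pool: serve from the free list; if it's empty, try the bump
    allocator (carve the next never-touched block); else fail.  The
    bump test is the 'second-fastest path' that stays in the code
    forever, even once it can never succeed again."""
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.next_untouched = 0   # bump pointer, in block units
        self.free = []            # stack standing in for the free list
    def malloc(self):
        if self.free:                            # fastest path
            return self.free.pop()
        if self.next_untouched < self.nblocks:   # bump allocation
            b = self.next_untouched
            self.next_untouched += 1
            return b
        return None                              # pool exhausted
    def free_block(self, b):
        self.free.append(b)

class PrelinkedPool:
    """Toy pool: every block is pushed on the free list up front,
    touching all blocks once, but leaving only one allocation path."""
    def __init__(self, nblocks):
        self.free = list(range(nblocks))
    def malloc(self):
        return self.free.pop() if self.free else None
    def free_block(self, b):
        self.free.append(b)

p = BumpPool(4)
print([p.malloc() for _ in range(5)])   # [0, 1, 2, 3, None]
p.free_block(2)
print(p.malloc())   # 2 -- served by the free list, not the bump pointer
```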
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
For the record, there's another contender in the allocator competition
now:
https://github.com/microsoft/mimalloc/

Regards

Antoine.

On Mon, 24 Jun 2019 00:20:03 -0500 Tim Peters wrote:
> [Tim]
> > The radix tree generally appears to be a little more memory-frugal
> > than my PR (presumably because my need to break "big pools" into 4K
> > chunks, while the tree branch doesn't, buys the tree more space to
> > actually store objects than it costs for the new tree).
>
> It depends a whole lot on the size classes of the most popular
> objects.  A program below to compute it all.  For a 64-bit box using
> 3.8 alignment, and 16 KiB pools:
> [snip]
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Tim] > The radix tree generally appears to be a little more memory-frugal > than my PR (presumably because my need to break "big pools" into 4K > chunks, while the tree branch doesn't, buys the tree more space to > actually store objects than it costs for the new tree). It depends a whole lot on the size classes of the most popular objects. A program below computes it all. For a 64-bit box using 3.8 alignment, and 16 KiB pools:

pages per pool 4
pool size 16,384
alignment 16

The first output block:

size 16
SQ 1012  1.2%
PR 1018  0.6%
RT 1021  0.3%

SQ is the status quo: we have to use four separate 4 KiB pools, and each burns 48 bytes for a pool header. PR: my PR. There's one pool spanning 4 pages, with 48 bytes for a pool header in the first page, and 16 bytes to store the arena index in each of the other 3 pages. RT: the radix tree. One 16 KiB block that only "wastes" 48 bytes for the pool header. The first number on each line is the number of objects we can get from a "big pool" under that scheme. The second number is the % of total pool space that cannot be used for client objects. So, in the above, PR is a substantial improvement over SQ, and RT a less substantial improvement over PR. Regardless of size class, PR never does worse than SQ, and RT never worse than PR. The most dramatic difference is in the largest size class:

size 512
SQ   28 12.5%
PR   28 12.5%
RT   31  3.1%

RT is a huge win there. And while it's generally true that RT's advantages are more striking in the larger size classes, it doesn't always crush. For example, in the 2nd-largest size class, it doesn't matter at all which scheme is used:

size 496
SQ   32  3.1%
PR   32  3.1%
RT   32  3.1%

However, in the 3rd-largest size class, RT crushes it again:

size 480
SQ   32  6.2%
PR   32  6.2%
RT   34  0.4%

I trust the general principle at work here is too obvious to need explanation ;-) Anyway, these differences can really add up when there are millions of objects passed out from a size class where RT has an advantage. 
That's very attractive to me. On the other hand, this _effectively_ makes arenas even larger (they can contain more objects), which apparently makes it even less likely that arenas can eventually be returned to the system. But, on the third hand, I've seen no evidence yet that increasing arena size matters much at all to that after hitting 128 KiB arenas (smaller than what we already use). "Uncooperative" programs essentially recycle nothing regardless, and "happy" programs essentially recycle almost as many arena bytes with 1 MiB arenas as with 8 KiB arenas. Here's the code:

PAGES_PER_POOL = 4
ALIGNMENT = 16  # change to 8 for < Python 3.8
PAGE_SIZE = 2**12
POOL_SIZE = PAGE_SIZE * PAGES_PER_POOL
POOL_HEADER_SIZE = 48

def from_block(size, blocksize, overhead):
    return (blocksize - overhead) // size

def from_first_page(size, *, pagesize=PAGE_SIZE):
    return from_block(size, pagesize, POOL_HEADER_SIZE)

# using multiple 4K one-page pools - status quo
def nobj_4K(size):
    return from_first_page(size) * PAGES_PER_POOL

# using the PR
def nobj_PR(size):
    return (from_first_page(size)
            + from_block(size, PAGE_SIZE, ALIGNMENT)
              * (PAGES_PER_POOL - 1))

# using the radix tree branch
def nobj_RT(size):
    return from_first_page(size, pagesize=POOL_SIZE)

print("pages per pool", PAGES_PER_POOL)
print(f"pool size {POOL_SIZE:,}")
print("alignment", ALIGNMENT)

for size in range(ALIGNMENT, 512 + 1, ALIGNMENT):
    print("size", size)
    for tag, f in (("SQ", nobj_4K),
                   ("PR", nobj_PR),
                   ("RT", nobj_RT),
                  ):
        nobj = f(size)
        waste = POOL_SIZE - nobj * size
        print(f"{tag} {nobj:4} {waste/POOL_SIZE:5.1%}")

___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/S5KMU6M6GZACRNFCF3TNPE7NKDCMQD5E/
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Thomas] >>> And what would be an efficient way of detecting allocations punted to >>> malloc, if not address_in_range? [Tim] >> _The_ most efficient way is the one almost all allocators used long >> ago: use some "hidden" bits right before the address returned to the >> user to store info about the block being returned. [Antoine] > There's a fundamental problem here: you can't be sure that all > allocators reserve such space. If some allocator doesn't, it can > return a pointer just at the very start of the page, and if obmalloc > tries to look at "a few bits before" that address, it could very well > page-fault. I snipped some technical but crucial context in my reply to Thomas: this was assuming users are following the documented requirement to never mix memory calls from different families. What you describe certainly could happen in "illegal" code that, e.g., obtained a block from the system malloc, but passed it to obmalloc to free. Which, in reality, works fine today, although the docs forbid it. (I'm not sure, but there _may_ be some mode of running today that actually enforces the doc requirements.) In the other world, where code is playing by the rules, if an obmalloc malloc/realloc spelling is called, and it needs to punt to a different allocator, no problem: first it boosts the size request so it has room to store "the bit" it needs before the address it actually returns to the client. Then it's "legal" only to free that memory with an obmalloc spelling of free() later - obmalloc reads "the bit", sees "oh - that's not my memory!", and adjusts the pointer accordingly on _its_ call to the spelling of free() corresponding to the memory family obmalloc used to get the memory to begin with. >> Neil Schemenauer takes a different approach in the recent "radix tree >> arena map for obmalloc" thread here. ... > One open question is the cache efficiency of the two approaches. 
> Intuitively, address_in_range() will often look at exactly the same > cache line (since a significant number of allocations will share the > same "page prefix"). I believe we haven't seen a program yet that used more than one node at the tree's top level :-) But who knows? mmap() and VirtualAlloc() don't appear to make any _guarantees_ that the high-order bits of returned addresses aren't entirely random. In real life so far, they always appear to be zeroes. While x86 has a 64-bit virtual address space, the hardware "only" supports 48 bits of virtual address space, and I haven't seen a virtual address yet where any of the top 16 bits are set. AMD requires that the top 16 bits of virtual addresses be copies of bit 47. Blah blah blah - for the foreseeable future, the top level of the tree has a very easy job. And Neil keenly observed that the top level of the tree can be _declared_ as being very broad (and so suck up a lot of the leading bits), because it's a file static and is effectively just an address space reservation (at least on Linux) until nodes in it actually get used. > Apparently, this benefit may be offset by cache > aliasing issues. Cache aliasing can also be mitigated by the fact that > CPU caches are most of the time N-way with N > 1 (but N generally > remains small, from 2 to 8, for L1 and L2 caches). > > So I guess the a priori answer is "it's complicated" :-) Indeed it is. > I must also thank both you and Neil for running these experiments, even > though sometimes I may disagree about the conclusions :-) Well, there aren't any conclusions yet - just seeing the same things repeatedly. Over the weekend, Neil ran many variations of a longish-running "real job" related to his work, which goes through phases of processing bulk database operations and "trying to" release the space (details are complicated). Arena recycling was essentially non-existent in either branch (my PR or the radix tree). 
In 3.7.3 it _appeared_ to recycle hundreds of thousands of arenas, but on closer examination they were virtually all of the "useless arena thrashing" flavor. The arenas in use were almost always within one of the highwater mark. But it got much better in the branches if arenas were shrunk to a tiny 8 KiB. Which is just another instance of the "256 KiB arenas are already way too fat for effective arena recycling unless the program is exceptionally friendly in its dynamic memory use patterns" observation. We haven't seen a case yet where 1 MiB arenas do materially worse than 256 KiB ones in this respect. Speed is generally a wash between the branches, although they consistently appear to be faster (by a little, not a lot) than 3.7.3. The radix tree generally appears to be a little more memory-frugal than my PR (presumably because my need to break "big pools" into 4K chunks, while the tree branch doesn't, buys the tree more space to actually store objects than it costs for the new tree). We need more "real jobs", and especially ones where arena recycling is actually doing good in 3.7.3 (wh
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Fri, 21 Jun 2019 17:18:18 -0500 Tim Peters wrote: > > > And what would be an efficient way of detecting allocations punted to > > malloc, if not address_in_range? > > _The_ most efficient way is the one almost all allocators used long > ago: use some "hidden" bits right before the address returned to the > user to store info about the block being returned. There's a fundamental problem here: you can't be sure that all allocators reserve such space. If some allocator doesn't, it can return a pointer just at the very start of the page, and if obmalloc tries to look at "a few bits before" that address, it could very well page-fault. > Neil Schemenauer takes a different approach in the recent "radix tree > arena map for obmalloc" thread here. We exchanged ideas on that until > it got to the point that the tree levels only need to trace out > prefixes of obmalloc arena addresses. That is, the new space burden > of the radix tree appears quite reasonably small. One open question is the cache efficiency of the two approaches. Intuitively, address_in_range() will often look at exactly the same cache line (since a significant number of allocations will share the same "page prefix"). Apparently, this benefit may be offset by cache aliasing issues. Cache aliasing can also be mitigated by the fact that CPU caches are most of the time N-way with N > 1 (but N generally remains small, from 2 to 8, for L1 and L2 caches). So I guess the a priori answer is "it's complicated" :-) I must also thank both you and Neil for running these experiments, even though sometimes I may disagree about the conclusions :-) Regards Antoine. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/RUQVG2KOYVMUIIX5HIZKNVN4AUXKKURM/
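Tim's "boost the size request and hide a tag before the returned address" scheme, which Antoine is objecting to here, can be sketched on a simulated heap. Everything below is illustrative (a bytearray stands in for raw memory), not CPython code:

```python
# Simulated heap: a bytearray, with "addresses" as offsets into it.
# malloc prepends one hidden tag byte so free() can tell which
# underlying allocator produced the block.

HEAP = bytearray(1 << 16)
FROM_POOL, FROM_SYSTEM = 0, 1
_next = 16                        # naive bump allocation, never reused

def tagged_malloc(size, came_from):
    global _next
    raw = _next
    _next += size + 1             # +1: room for the hidden tag
    HEAP[raw] = came_from         # stored just before the user block
    return raw + 1                # user-visible address

def tagged_free(addr):
    # Read the hidden byte back and route to the right deallocator.
    # Antoine's objection: a foreign allocator may have put the block
    # at the very start of a page, so reading HEAP[addr - 1] could
    # page-fault in real life; the bytearray makes it always safe here.
    return "pool" if HEAP[addr - 1] == FROM_POOL else "system"

p = tagged_malloc(24, FROM_POOL)
q = tagged_malloc(24, FROM_SYSTEM)
assert tagged_free(p) == "pool"
assert tagged_free(q) == "system"
```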
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
For those who would like to test with something compatible with Python 3.7.3, I made re-based branches here: https://github.com/nascheme/cpython/tree/obmalloc_radix_v37 https://github.com/nascheme/cpython/tree/obmalloc_big_pools_v37 They should be ABI compatible with Python 3.7.3. So, if you just re-build the "python" executable, you don't have to rebuild anything else. Both those use the same arena/pool sizes and they both have Tim's arena thrashing fix. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/K5DCROCGGVNWWLC6XM6XMCTJACESNEYS/
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On 2019-06-21, Tim Peters wrote: > [Thomas Wouters ] > > Getting rid of address_in_range sounds like a nice idea, and I > > would love to test how feasible it is -- I can run such a change > > against a wide selection of code at work, including a lot of > > third-party extension modules, but I don't see an easy way to do > > it right now. > > Neil's branch is here: > > https://github.com/nascheme/cpython/tree/obmalloc_radix_tree If you can test vs some real-world programs, that would be great. I was trying to run some tests this afternoon. Testing with Python 3.8+ is a pain because of the PyCode_New and tp_print changes. I've just added two fixes to the head of the obmalloc_radix_tree branch so that you can compile code generated by old versions of Cython. Without those fixes, building 3rd party extensions can be a real pain. > My PR uses 16K pools and 1M arenas, quadrupling the status quo. > Because "why not?" ;-) > > Neil's branch has _generally_, but not always, used 16 MiB arenas. > The larger the arenas in his branch, the smaller the radix tree needs > to grow. Currently I have it like your big pool branch (16 KB, 1MB). ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TF2JI7G5ZMCGUMM3AWNSCQDYVFNRPMQ4/
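The radix tree branch under discussion maps the high bits of an address through successive tree levels. A toy decomposition (the level widths below are invented for illustration; the actual branch chooses its own splits):

```python
# Split a 64-bit address into per-level radix-tree indices.
TOP_BITS, MID_BITS, LEAF_BITS = 16, 20, 12   # invented widths
OFFSET_BITS = 16                              # within-arena bits

def tree_indices(addr):
    a = addr >> OFFSET_BITS                   # drop within-arena bits
    leaf = a & ((1 << LEAF_BITS) - 1)
    a >>= LEAF_BITS
    mid = a & ((1 << MID_BITS) - 1)
    a >>= MID_BITS
    top = a & ((1 << TOP_BITS) - 1)
    return top, mid, leaf

# User-space addresses on current x86-64 hardware have zero top bits,
# so the top-level index is (so far) always 0 - a single top node
# suffices in practice.
top, mid, leaf = tree_indices(0x7F3A_DEAD_0000)
assert top == 0
assert (mid, leaf) == (0x7F3AD, 0xEAD)
```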
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Fri, 21 Jun 2019 at 23:19, Thomas Wouters wrote: > Is this really feasible in a world where the allocators can be selected (and > the default changed) at runtime? The memory allocator must not be changed after Python pre-initialization. What's done after pre-initialization is more to install "hooks" which execute code before/after an allocation, but don't replace the allocator. It simply doesn't work to switch from pymalloc to malloc "at runtime". Calling PyMem_Free(ptr) would call free(ptr). If the memory block was allocated by pymalloc, free(ptr) simply crashes. Victor ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/C5AI3SL77AV6QLRNTJ4PZH7MCYR2ZQAC/
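Victor's point can be illustrated with a toy model (invented classes, not the CPython C API): each allocator only understands pointers it handed out itself, so freeing across families is undefined behavior:

```python
# Toy model: each allocator only knows about blocks it handed out
# itself, so freeing a pymalloc block with the system free() is
# undefined behavior -- usually a crash in real life.

class ToyAllocator:
    def __init__(self, name):
        self.name = name
        self.live = {}            # ptr -> size: this allocator's bookkeeping
        self._next = 0x1000

    def malloc(self, size):
        ptr = self._next
        self._next += size
        self.live[ptr] = size
        return ptr

    def free(self, ptr):
        # a real allocator would corrupt its heap or segfault here;
        # the toy just raises KeyError
        del self.live[ptr]

pymalloc_like = ToyAllocator("pymalloc")
system_like = ToyAllocator("malloc")

p = pymalloc_like.malloc(32)
crashed = False
try:
    system_like.free(p)           # freed with the wrong family
except KeyError:
    crashed = True
assert crashed
```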
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Tim] >> I don't think we need to cater anymore to careless code that mixes >> system memory calls with obmalloc calls (e.g., if an extension gets memory >> via `malloc()`, it's its responsibility to call `free()`), and if not >> then `address_in_range()` isn't really necessary anymore either, and >> then we could increase the pool size. obmalloc would, however, need a new >> way to recognize when its version of malloc punted to the system >> malloc. [Thomas Wouters ] > Is this really feasible in a world where the allocators can be selected (and > the default changed) at runtime? I think so. See the "Memory Management" section of the Python/C API Reference Manual. It's always been "forbidden" to, e.g., allocate a thing with PyMem_New() but release it with free(). Ditto mixing a PyMem_Raw... allocator with a PyMem... deallocator, or PyObject... one. Etc. A type's tp_dealloc implementation should damn well know which memory family the type's allocator used. However, no actual proposal on the table changes any "fact on the ground" here. They're all as forgiving of slop as the status quo. > And what would be an efficient way of detecting allocations punted to > malloc, if not address_in_range? _The_ most efficient way is the one almost all allocators used long ago: use some "hidden" bits right before the address returned to the user to store info about the block being returned. Like 1 bit to distinguish between "obmalloc took this out of one of its pools" and "obmalloc got this from PyMem_Raw... (whatever that maps to - obmalloc doesn't care)". That would be much faster than what we do now. But on current 64-bit boxes, "1 bit" turns into "16 bytes" to maintain alignment, so space overhead becomes 100% for the smallest objects obmalloc can return :-( Neil Schemenauer takes a different approach in the recent "radix tree arena map for obmalloc" thread here. We exchanged ideas on that until it got to the point that the tree levels only need to trace out prefixes of obmalloc arena addresses. 
That is, the new space burden of the radix tree appears quite reasonably small. It doesn't appear to be possible to make it faster than the current address_in_range(), but in small-scale testing so far speed appears comparable. > Getting rid of address_in_range sounds like a nice idea, and I would love to > test > how feasible it is -- I can run such a change against a wide selection of code > at work, including a lot of third-party extension modules, but I don't see an > easy > way to do it right now. Neil's branch is here: https://github.com/nascheme/cpython/tree/obmalloc_radix_tree It's effectively a different _implementation_ of the current address_in_range(), one that doesn't ever need to read possibly uninitialized memory, and couldn't care less about the OS page size. For the latter reason, it's by far the clearest way to enable expanding pool size above 4 KiB. My PR also eliminates the pool size limitation: https://github.com/python/cpython/pull/13934 but at the cost of breaking bigger pools up internally into 4K regions so the excruciating current address_in_range black magic still works. Neil and I are both keen _mostly_ to increase pool and arena sizes. The bigger they are, the more time obmalloc can spend in its fastest code paths. A question we can't answer yet (or possibly ever) is how badly that would hurt Python returning arenas to the system, in long-running apps that go through phases of low and high memory need. I don't run anything like that - not a world I've ever lived in. All my experiments so far say, for programs that are neither horrible nor wonderful in this respect:

1. An arena size of 4 KiB is most effective for that.
2. There's significant degradation in moving even to 8 KiB arenas.
3. Which continues getting worse the larger the arenas.
4. Until reaching 128 KiB, at which point the rate of degradation falls a lot.

So the current 256 KiB arenas already suck for such programs. For "horrible" programs, not even tiny 4K arenas help much. 
For "wonderful" programs, not even 16 MiB arenas hurt arena recycling effectiveness. So if you have real programs keen to "return memory to the system" periodically, it would be terrific to get info about how changing arena size affects their behavior in that respect. My PR uses 16K pools and 1M arenas, quadrupling the status quo. Because "why not?" ;-) Neil's branch has _generally_, but not always, used 16 MiB arenas. The larger the arenas in his branch, the smaller the radix tree needs to grow. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7ZIFV2BEL64FQGC35F7QUPK3SHVR3VGT/
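The arena-size effect Tim reports here can be illustrated with a crude random-lifetime simulation (all parameters invented; real workloads differ): the chance that every block in an arena is simultaneously free collapses as arenas hold more blocks.

```python
import random

def fully_free_fraction(blocks_per_arena, n_arenas=200, live_fraction=0.1):
    """Fraction of arenas containing zero live blocks when a random
    10% of all blocks is still alive."""
    random.seed(0)
    total = blocks_per_arena * n_arenas
    live = random.sample(range(total), int(total * live_fraction))
    occupied = {b // blocks_per_arena for b in live}
    return 1 - len(occupied) / n_arenas

small = fully_free_fraction(blocks_per_arena=16)    # tiny arenas
large = fully_free_fraction(blocks_per_arena=1024)  # fat arenas
# Roughly 0.9**16 ~ 18% of the tiny arenas can be returned to the OS;
# essentially none of the fat ones can.
assert small > large
```

With independent survivors, the expected fully-free fraction is about (1 - live_fraction) ** blocks_per_arena, which is why even a modest growth in arena size makes "return memory to the system" dramatically less likely.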
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Sun, Jun 2, 2019 at 7:57 AM Tim Peters wrote: > I don't think we need to cater anymore to careless code that mixes > system memory calls with obmalloc calls (e.g., if an extension gets memory > via `malloc()`, it's its responsibility to call `free()`), and if not > then `address_in_range()` isn't really necessary anymore either, and > then we could increase the pool size. obmalloc would, however, need a new > way to recognize when its version of malloc punted to the system > malloc. > Is this really feasible in a world where the allocators can be selected (and the default changed) at runtime? And what would be an efficient way of detecting allocations punted to malloc, if not address_in_range? Getting rid of address_in_range sounds like a nice idea, and I would love to test how feasible it is -- I can run such a change against a wide selection of code at work, including a lot of third-party extension modules, but I don't see an easy way to do it right now. -- Thomas Wouters Hi! I'm an email virus! Think twice before sending your email to help me spread! ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OA4Z525KYYFHBTTVI5QXVTJH72ZCGQ2S/
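For context, a simplified Python model of what the current address_in_range() computes (the real C version reads the arena index straight out of the pool header with a raw memory read, which is where the possibly-uninitialized-memory reads come from):

```python
# Simplified model of obmalloc's address_in_range() (toy Python).

ARENA_SIZE = 256 * 1024
POOL_SIZE = 4 * 1024

arenas = []              # arena base addresses, indexed by arena index
pool_arena_index = {}    # pool base -> arena index; stands in for the
                         # index obmalloc writes into each pool header

def add_arena(base):
    arenas.append(base)
    for pool in range(base, base + ARENA_SIZE, POOL_SIZE):
        pool_arena_index[pool] = len(arenas) - 1

def address_in_range(addr):
    pool = addr & ~(POOL_SIZE - 1)        # round down to pool base
    idx = pool_arena_index.get(pool, -1)  # in C: an unchecked read
    if not 0 <= idx < len(arenas):
        return False
    base = arenas[idx]
    return base <= addr < base + ARENA_SIZE

add_arena(0x100000)
assert address_in_range(0x100000 + 12345)      # obmalloc memory
assert not address_in_range(0x900000)          # foreign memory
```

The dict lookup here replaces the C code's raw read of `pool->arenaindex`: for a foreign pointer that read may hit garbage, which is why the result must still be validated against the arena table.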
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Neil Schemenauer ] > I've done a little testing of the pool overhead. I have an application > that uses many small dicts as holders of data. The function: > > sys._debugmallocstats() > > is useful to get stats for the obmalloc pools. Total data allocated > by obmalloc is 262 MB. At the 4*PAGE_SIZE pool size, the wasted > space due to partly filled pools is only 0.18%. For 16*PAGE_SIZE > pools, 0.71%. > > I have a set of stats for another program. In that case, total > memory allocated by obmalloc is 14 MB. For 4*PAGE_SIZE pools, > wasted space is 0.78% of total. At 16*PAGE_SIZE, it is 2.4%. Definitely a tradeoff here: increasing the number of pages per pool is a pure win for objects of very popular size classes, but a pure loss for objects of unpopular size classes. New comments about POOL_SIZE in the patch briefly hint at that. > Based on that small set of data, using 4*PAGE_SIZE seems > conservative. Which is a nice way of saying "pulled out of Tim's ass" ;-) > As I'm sure you realize, making pools bigger will > waste actual memory, not just virtual address space because you > write the arena pointer to each OS page. Yes, but mostly no! It remains the case that obmalloc neither writes nor reads anything in an arena until a malloc/realloc call actually needs to return a pointer to a never-before accessed page. Writing the arena index is NOT done to pages when the arena is allocated, or even when a new pool is carved off an arena, but lazily, one page at a time, lowest address to highest, as fresh pages actually need to be returned to the caller. So arena pointers aren't actually written more frequently than when using pools of just one page, as is done now (which _still_ writes the arena index into each page). Subtlety: currently, if you pass a nonsense address to address_in_range that happens to be in one of the arena's pools, address_in_range() says "yes". 
However, after the patch, it may well return "no" if the address is in one of the pool's pages that hasn't yet been returned to a malloc/realloc caller (in which case the arena index hasn't yet been stored at the start of the page). I don't care about that, because it's 100% a logic error to pass an address to free/realloc that wasn't obtained from a previous malloc/realloc call. So it's a change in undefined behavior. > I want to do performance profiling using Linux perf. That should > show where the hotspot instructions are in the obmalloc code. Maybe > that will be useful to you. Would be good to know, yes! But may also depend on the app. > Another thought about address_in_range(): some operating systems > allow you to allocate memory at specific alignments. Or, you can > even allocate a chunk of memory at a fixed memory location if you do > the correct magic incantation. I noticed that Go does that. I > imagine doing that has a bunch of associated challenges with it. > However, if we could control the alignment and memory location of > obmalloc arenas, we would not have the segv problem of > address_in_range(). It's probably not worth going down that path > due to the problems involved. I'm unclear on how that could be exploited. It's not addresses that come from arenas that create segv headaches, it's addresses that come from the system's malloc/realloc. If we were, e.g., to pick a maximum amount of address space obmalloc can use in advance, we could live with a single arena of that size, and just check whether an address is within it. All the problems come from the fact that address space may alternate, any number of times, between addresses in obmalloc arenas and addresses from the system malloc's internals. 
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YUWFWPKN67EOGF5F7QFGB7IFRH5K2PFW/
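Tim's description above (the arena index is written to each page lazily, one page at a time, lowest address to highest, only as fresh pages are handed out) can be sketched as a toy pool. All names here are invented; the real logic is C inside obmalloc:

```python
# Toy pool spanning several OS pages: the arena index is stamped into
# a page only when the first block from that page is handed out.

PAGE_SIZE = 4096

class ToyPool:
    def __init__(self, n_pages, arena_index):
        self.arena_index = arena_index
        self.page_stamp = [None] * n_pages   # None = page never touched
        self.next_off = 0

    def alloc(self, block_size=64):
        off = self.next_off
        page = off // PAGE_SIZE
        if page >= len(self.page_stamp):
            return None                      # pool exhausted
        if self.page_stamp[page] is None:
            # first block returned from this page: only now does the
            # page get written at all
            self.page_stamp[page] = self.arena_index
        self.next_off += block_size
        return off

pool = ToyPool(n_pages=4, arena_index=7)
pool.alloc()                                 # touches only the first page
assert pool.page_stamp == [7, None, None, None]
```

This is also why the patched address_in_range() can answer "no" for an address inside a never-handed-out page: the stamp simply isn't there yet.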
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On 2019-06-09, Tim Peters wrote: > And now there's a PR that removes obmalloc's limit on pool sizes, and, > for a start, quadruples pool (and arena!) sizes on 64-bit boxes: Neat. > As the PR says, > > """ > It would be great to get feedback from 64-bit apps that do massive > amounts of small-object allocations and deallocations. > """ I've done a little testing of the pool overhead. I have an application that uses many small dicts as holders of data. The function: sys._debugmallocstats() is useful to get stats for the obmalloc pools. Total data allocated by obmalloc is 262 MB. At the 4*PAGE_SIZE pool size, the wasted space due to partly filled pools is only 0.18%. For 16*PAGE_SIZE pools, 0.71%. I have a set of stats for another program. In that case, total memory allocated by obmalloc is 14 MB. For 4*PAGE_SIZE pools, wasted space is 0.78% of total. At 16*PAGE_SIZE, it is 2.4%. Based on that small set of data, using 4*PAGE_SIZE seems conservative. As I'm sure you realize, making pools bigger will waste actual memory, not just virtual address space because you write the arena pointer to each OS page. I want to do performance profiling using Linux perf. That should show where the hotspot instructions are in the obmalloc code. Maybe that will be useful to you. Another thought about address_in_range(): some operating systems allow you to allocate memory at specific alignments. Or, you can even allocate a chunk of memory at a fixed memory location if you do the correct magic incantation. I noticed that Go does that. I imagine doing that has a bunch of associated challenges with it. However, if we could control the alignment and memory location of obmalloc arenas, we would not have the segv problem of address_in_range(). It's probably not worth going down that path due to the problems involved. 
Regards, Neil ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LJWXNQNZRVCCF36ALXKZVWYHINXTVMVB/
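sys._debugmallocstats(), which Neil uses above, is a real (underscore-private) CPython API. Its report is written to the C-level stderr stream, so a subprocess is the easy way to capture it programmatically:

```python
import subprocess
import sys

# Run in a subprocess so the C-level stderr output can be captured
# (contextlib.redirect_stderr cannot intercept writes made directly
# to the C `stderr` stream).
result = subprocess.run(
    [sys.executable, "-c",
     "import sys; [dict() for _ in range(10000)]; sys._debugmallocstats()"],
    capture_output=True, text=True,
)
# On a default --with-pymalloc build (an assumption here) the report
# includes per-size-class lines and arena totals.
assert result.returncode == 0
assert "arenas" in result.stderr
```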
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Tim] > For the current obmalloc, I have in mind a different way ... > Not ideal, but ... captures the important part (more objects > in a pool -> more times obmalloc can remain in its > fastest "all within the pool" paths). And now there's a PR that removes obmalloc's limit on pool sizes, and, for a start, quadruples pool (and arena!) sizes on 64-bit boxes: https://github.com/python/cpython/pull/13934 https://bugs.python.org/issue37211 As the PR says, """ It would be great to get feedback from 64-bit apps that do massive amounts of small-object allocations and deallocations. """ :-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/Z4YIHDGNZLP4WX5HVEBXDSFIDIWPTTYK/
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
To be clearer, while knowing the size of allocated objects may be of some use to some other allocators, "not really" for obmalloc. That one does small objects by itself in a uniform way, and punts everything else to the system malloc family. The _only_ thing it wants to know on a free/realloc is which of those two ways delivered the address to begin with. Knowing the original size would be an indirect way to accomplish that, but it really only needs 1 bit of info. If you're using a radix tree to implement - in effect - a dict mapping addresses to "stuff", 1 flag bit in the "stuff" would be ideal for that. If you haven't already, I assume you'll soon come around to believe you really should be tracking the addresses of all gc'ed objects (not just the "small" ones). > If the above idea works, we know the object size at free() and realloc(), we > don't > need address_in_range() for those code paths. Two things: first, a single bit would not only be sufficient, it would be _better_ than knowing the object size. If the bit says the object came from the system, the size is of no use at all. If the bit says it came from obmalloc, what's _really_ wanted is the index of the object's "size class" in the vector of free lists, and that's already directly available in the object's pool header (various parts of which have to be read up &/or updated regardless). Second, if that bit were available, `address_in_range()` could be thrown away - obmalloc's free and realloc entries are the only places it's used (or is of any conceivable use to obmalloc). For the current obmalloc, I have in mind a different way (briefly: let a pool span a number of 4K pages, but teach pools about the page structure so that address_in_range() continues to work after trivial changes (read the arena index from the containing page's start rather than from the containing pool's)). 
Not ideal, but looks remarkably easy to do, and captures the important part (more objects in a pool -> more times obmalloc can remain in its fastest "all within the pool" paths). > My gut feeling is that the prev/next pointer updates done by > move_unreachable() and similar functions must be quite expensive. > Doing the traversal with an explicit stack is a lot less elegant but > I think it should be faster. At least, when you are dealing with a > big set of GC objects that don't fit in the CPU cache. At the start, it was the _reachable_ objects that were moved out of the collected generation list rather than the unreachable objects. Since it's (in all my experience) almost all objects that survive almost all collections in almost all programs, that was an enormous amount of moving overall. So long ago I rewrote that code to move the _un_reachable objects instead. Debug output showed countless billions of pointer updates saved across the programs I tried at the time, but the timing difference was in the noise. Probably a big part of "the problem": when collecting the oldest generation in a program with "a lot" of objects, no approach at all will "fit in cache". We have to crawl the entire object graph then. Note that the test case in the parent thread had over a billion class instances, and it was way whittled down(!) from the OP's real program. But I don't _think_ that's typical quite yet ;-) , and: > You are likely correct. I'm hoping to benchmark the radix tree idea. > I'm not too far from having it working such that it can replace > address_in_range(). Maybe allocating gc_refs as a block would > offset the radix tree cost vs address_in_range(). I certainly agree gc_refs is the big-bang-for-the-buck thing to target. There's no way to avoid mutating that bunches and bunches of times, even in small programs, and reducing the number of dirtied cache lines due to that "should" pay off. 
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3DPAR4PPPM3Z675ELBGYIJ74INI6PL3X/
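The "1 bit of info" Tim wants can be sketched with a dict standing in for the radix tree's per-address "stuff" (all names below are invented):

```python
# One flag bit per tracked address: "did this block come from obmalloc?"

FROM_OBMALLOC = 1 << 0            # the single bit Tim says suffices

tracked = {}                      # stand-in for the radix tree "stuff"

def note_alloc(addr, from_obmalloc):
    tracked[addr] = FROM_OBMALLOC if from_obmalloc else 0

def route_free(addr):
    # free/realloc only need this one test, so address_in_range()
    # (and its page-size constraints) could go away entirely
    if tracked[addr] & FROM_OBMALLOC:
        return "obmalloc free list"
    return "system free()"

note_alloc(0x1000, True)
note_alloc(0x2000, False)
assert route_free(0x1000) == "obmalloc free list"
assert route_free(0x2000) == "system free()"
```

As Tim notes, once routing is known the size class itself comes for free from the pool header, so no size needs to be stored in the tree.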
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On 2019-06-06, Tim Peters wrote:

> The doubly linked lists in gc primarily support efficient
> _partitioning_ of objects for gc's purposes (a union of disjoint sets,
> with constant-time moving of an object from one set to another, and
> constant-time union of disjoint sets). "All objects" is almost never
> interesting to it (it is only when the oldest non-frozen generation is
> being collected).

My current idea is to put partitioning flags on the interior radix tree nodes. If you mark an object as "finalizer reachable", for example, it would mark all the nodes on the path from the root with that flag. Then, when you want to iterate over all the GC objects with a flag, you can avoid uninteresting branches of the tree.

For generations, maybe tracking them at the pool level is good enough. Interior nodes can track generations too (i.e. the youngest generation contained under them).

My gut feeling is that the prev/next pointer updates done by move_unreachable() and similar functions must be quite expensive. Doing the traversal with an explicit stack is a lot less elegant but I think it should be faster. At least, when you are dealing with a big set of GC objects that don't fit in the CPU cache.

Regards,

Neil

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/J422RWENKJAYHMXSZVRV5KGWSHNMAMJF/
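Neil's flag-propagation idea can be sketched in Python. This is a hedged toy model, not his actual branch: the level split, class names, and the FINALIZER_REACHABLE flag value are all assumptions for illustration:

```python
# Partition flags on interior radix-tree nodes: marking a leaf sets the
# flag on the whole root-to-leaf path, so iteration over flagged objects
# can prune any subtree whose root lacks the flag.

FINALIZER_REACHABLE = 1   # example partition flag

class RadixNode:
    def __init__(self):
        self.flags = 0
        self.children = {}   # sparse children keyed by address chunk
        self.objects = []    # leaves hold the tracked objects

class RadixTree:
    def __init__(self, bits_per_level=4, levels=3):
        self.root = RadixNode()
        self.bits = bits_per_level
        self.levels = levels

    def _path(self, addr):
        """Nodes from root to the leaf for addr, creating as needed."""
        path = [self.root]
        node = self.root
        for level in reversed(range(self.levels)):
            chunk = (addr >> (level * self.bits)) & ((1 << self.bits) - 1)
            node = node.children.setdefault(chunk, RadixNode())
            path.append(node)
        return path

    def insert(self, addr, obj):
        self._path(addr)[-1].objects.append(obj)

    def mark(self, addr, flag):
        # Set the flag on every node on the root-to-leaf path.
        for node in self._path(addr):
            node.flags |= flag

    def iter_flagged(self, flag):
        # Depth-first walk that skips unflagged subtrees entirely.
        stack = [self.root]
        while stack:
            node = stack.pop()
            if not node.flags & flag:
                continue
            yield from node.objects
            stack.extend(node.children.values())
```

Clearing flags after a collection would need a similar pruned walk; that bookkeeping is omitted here.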
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On 2019-06-06, Tim Peters wrote:

> Like now: if the size were passed in, obmalloc could test the size
> instead of doing the `address_in_range()` dance(*). But if it's ever
> possible that the size won't be passed in, all the machinery
> supporting `address_in_range()` still needs to be there, and every
> obmalloc spelling of malloc/realloc needs to ensure that machinery
> will work if the returned address is passed back to an obmalloc
> free/realloc spelling without the size.

We can almost make it work for GC objects; the use of obmalloc is quite well encapsulated. I think I intentionally designed the PyObject_GC_New/PyObject_GC_Del/etc APIs that way. A quick and dirty experiment is here:

https://github.com/nascheme/cpython/tree/gc_malloc_free_size

The major hitch seems to be my new gc_obj_size() function. We can't be sure the 'nbytes' passed to _PyObject_GC_Malloc() is the same as what is computed by gc_obj_size(). It usually works, but there are exceptions (freelists for frame objects and tuple objects, for example).

A nasty problem is the weirdness with PyType_GenericAlloc() and the sentinel item. _PyObject_GC_NewVar() doesn't include space for the sentinel but PyType_GenericAlloc() does. When you get to gc_obj_size(), you don't know if you should use "nitems" or "nitems+1". I'm not sure how to fix the sentinel issue. Maybe a new type slot or a type flag?

In any case, making a change like my git branch above would almost certainly break extensions that don't play nicely. It won't be hard to make it a build option, like the original gcmodule was. Then, assuming there is a performance boost, people can enable it if their extensions are friendly.

> The "only" problem with address_in_range is that it limits us to a
> maximum pool size of 4K. Just for fun, I boosted that to 8K to see
> how likely segfaults really are, and a Python built that way couldn't
> even get to its first prompt before dying with an access violation
> (Windows-speak for segfault).
If we can make the above idea work, you could set the pool size to 8K without issue. A possible problem is that the obmalloc and gcmalloc arenas are separate. I suppose that affects performance testing.

> We could eliminate the pool size restriction in many ways. For
> example, we could store the addresses obtained from the system
> malloc/realloc - but not yet freed - in a set, perhaps implemented as
> a radix tree to cut the memory burden. But digging through 3 or 4
> levels of a radix tree to determine membership is probably
> significantly slower than address_in_range.

You are likely correct. I'm hoping to benchmark the radix tree idea. I'm not too far from having it working such that it can replace address_in_range(). Maybe allocating gc_refs as a block would offset the radix tree cost vs address_in_range().

If the above idea works, we know the object size at free() and realloc(), so we don't need address_in_range() for those code paths.

Regards,

Neil

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ILFK2MTCVA7GB7JGBVSUWASKJ7T4LLJE/
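The payoff of the sdallocx-style idea under discussion is that a size-carrying free() reduces the small-vs-system routing decision to a single compare. The sketch below is a toy Python model, not CPython code; SMALL_REQUEST_THRESHOLD matches obmalloc's 512-byte cutoff, but everything else (class names, stats) is invented for illustration:

```python
# If the caller always passes the allocation size to free(), the
# allocator can route small vs. system allocations with one compare
# instead of the address_in_range() lookup.

SMALL_REQUEST_THRESHOLD = 512   # obmalloc's small-object cutoff

class SizeAwareAllocator:
    def __init__(self):
        self.small_pools = {}   # size class -> free list of block ids
        self.stats = {"small_free": 0, "system_free": 0}
        self._next_block = 0

    def malloc(self, nbytes):
        # Stand-in for a real pointer: just hand out unique ids.
        self._next_block += 1
        return self._next_block

    def free(self, ptr, nbytes):
        # The whole point: no ownership lookup needed, just a compare.
        if nbytes <= SMALL_REQUEST_THRESHOLD:
            size_class = (nbytes + 7) & ~7   # round up to 8-byte class
            self.small_pools.setdefault(size_class, []).append(ptr)
            self.stats["small_free"] += 1
        else:
            self.stats["system_free"] += 1   # would go to system free()
```

Note Tim's (*) caveat still applies in real CPython: a "large" block realloc'ed down to a "small" size stays with the system malloc, so the compare alone isn't safe unless every free/realloc path passes the size.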
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, 6 Jun 2019 17:26:17 -0500 Tim Peters wrote:

> The doubly linked lists in gc primarily support efficient
> _partitioning_ of objects for gc's purposes (a union of disjoint sets,
> with constant-time moving of an object from one set to another, and
> constant-time union of disjoint sets). "All objects" is almost never
> interesting to it (it is only when the oldest non-frozen generation is
> being collected).

Right. But the oldest generation is precisely the pain point, since full collections can induce very long pauses. IOW, perhaps it would be fine to keep dedicated lists for the young generations, while doing a heap walk for the full collection case.

> In any case, it appears to be "a feature" now that people can use any
> flavor of the malloc family they like in CPython, so I expect that any
> attempt to tie cyclic gc to a specific flavor of malloc would be a
> very hard sell.

Depends on the benefits, of course ;-) Most people use some pre-built binaries that "naturally" come with obmalloc enabled.

Of course, it's only a very small minority of applications that hit real performance issues with the GC - and in probably many of those cases, tuning the full collection threshold can still alleviate the problem.

Regards

Antoine.

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZQEHZTW77MFCKCJMR7XB4AVBF6KCB3ZW/
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Tim]
>> But I don't know what you mean by "access memory in random order to
>> iterate over known objects". obmalloc never needs to iterate over
>> known objects - indeed, it contains no code capable of doing that.
>> Our cyclic gc does, but that's independent of obmalloc.

[Antoine]
> It's not. Cyclic GC needs its own linked lists *because* the allocator
> doesn't allow it to iterate over allocated objects.

The doubly linked lists in gc primarily support efficient _partitioning_ of objects for gc's purposes (a union of disjoint sets, with constant-time moving of an object from one set to another, and constant-time union of disjoint sets). "All objects" is almost never interesting to it (it is only when the oldest non-frozen generation is being collected).

Between collections, the partitioning is by generation. During a collection, the youngest generations are combined into one, and then that's sub-partitioned in various ways as collection goes along, ending with a partition into reachable and unreachable objects. In between, ephemeral partitions are constructed (e.g., the set of "tentatively unreachable" objects).

None of that was driven by obmalloc's (or malloc's) inability to iterate over objects. Doubly linked lists were the obvious way to implement the required operations on partitions efficiently and simply.

In any case, it appears to be "a feature" now that people can use any flavor of the malloc family they like in CPython, so I expect that any attempt to tie cyclic gc to a specific flavor of malloc would be a very hard sell. Which, BTW, was the intended meaning of "independent": cyclic gc right now couldn't care less which version of malloc a user plugs in - nor could obmalloc care less which cyclic gc algorithm is used.
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OR4NFZRQIOCK2N3XBCPI6GESI5BYRD3D/
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, 6 Jun 2019 16:03:03 -0500 Tim Peters wrote:

> But I don't know what you mean by "access memory in random order to
> iterate over known objects". obmalloc never needs to iterate over
> known objects - indeed, it contains no code capable of doing that.
> Our cyclic gc does, but that's independent of obmalloc.

It's not. Cyclic GC needs its own linked lists *because* the allocator doesn't allow it to iterate over allocated objects.

Regards

Antoine.

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YZIG3RZKM4XSTM3PYPDAPG3UHOB5QKM4/
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Antoine Pitrou]
> But my response was under the assumption that we would want obmalloc to
> deal with all allocations.

I didn't know that. I personally have no interest in that: if we want an all-purpose allocator, there are several already to choose from. There's no reason to imagine we could write a better one.

> Which is more or less required anyway to have an efficient GC that doesn't
> have to walk linked lists and access memory in random order to iterate over
> known objects.

As the parent thread showed, obmalloc does at least as well as any general-purpose allocator known _for Python's purposes_ (a great many (de)allocations of "small" objects). Already explained too that size-segregated linked free lists are the core of _why_ it does so well. Besides making carving off, and freeing, blocks dirt cheap, linked lists also naturally support recycling memory blocks in MRU ("freshness in cache") order.

But I don't know what you mean by "access memory in random order to iterate over known objects". obmalloc never needs to iterate over known objects - indeed, it contains no code capable of doing that. Our cyclic gc does, but that's independent of obmalloc. Over on Discourse, Neil is speculating about using radix trees for cyclic gc instead of _its_ linked lists.

In obmalloc, allocated memory regions aren't linked at all. It's free regions that are linked, and helpfully so in MRU order.

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
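Tim's point about size-segregated free lists and MRU recycling can be modeled in a few lines of Python. This is a toy sketch, not obmalloc's C implementation; the class and method names are invented here:

```python
# Size-segregated free lists: carving off and freeing a block are both
# O(1), and LIFO reuse hands back the most recently freed (cache-warm)
# block first - the "freshness in cache" MRU order Tim describes.

ALIGNMENT = 8   # obmalloc rounds small requests up to 8-byte classes

class SegregatedFreeLists:
    def __init__(self):
        self.free_lists = {}   # size class -> LIFO stack of free blocks

    def size_class(self, nbytes):
        # Round the request up to its size class.
        return (nbytes + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT

    def alloc(self, nbytes):
        stack = self.free_lists.get(self.size_class(nbytes))
        if stack:
            return stack.pop()   # MRU: the block freed last is reused first
        return object()          # stand-in for carving a fresh block

    def free(self, block, nbytes):
        self.free_lists.setdefault(self.size_class(nbytes), []).append(block)
```

The LIFO discipline is what makes recently touched memory, still likely in cache, the first thing handed back out.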
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, 6 Jun 2019 13:57:37 -0500 Tim Peters wrote:

> [Antoine Pitrou]
> > The interesting thing here is that in many situations, the size is
> > known up front when deallocating - it is simply not communicated to the
> > deallocator because the traditional free() API takes a sole pointer,
> > not a size. But CPython could communicate that size easily if we
> > would like to change the deallocation API. Then there's no bother
> > looking up the allocated size in sophisticated lookup structures.
>
> That could work (to make it possible to increase obmalloc's pool
> size). Except ...
>
> > I'll note that jemalloc provides such APIs:
> > http://jemalloc.net/jemalloc.3.html
> >
> > """The dallocx() function causes the memory referenced by ptr to be
> > made available for future allocations.
> >
> > The sdallocx() function is an extension of dallocx() with a size
> > parameter to allow the caller to pass in the allocation size as an
> > optimization."""
>
> obmalloc doesn't intend to be a general-purpose allocator - it only
> aims at optimizing "small" allocations, punting to the system for
> everything beyond that.

But my response was under the assumption that we would want obmalloc to deal with all allocations. Which is more or less required anyway to have an efficient GC that doesn't have to walk linked lists and access memory in random order to iterate over known objects.

Regards

Antoine.

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Antoine Pitrou]
> The interesting thing here is that in many situations, the size is
> known up front when deallocating - it is simply not communicated to the
> deallocator because the traditional free() API takes a sole pointer,
> not a size. But CPython could communicate that size easily if we
> would like to change the deallocation API. Then there's no bother
> looking up the allocated size in sophisticated lookup structures.

That could work (to make it possible to increase obmalloc's pool size). Except ...

> I'll note that jemalloc provides such APIs:
> http://jemalloc.net/jemalloc.3.html
>
> """The dallocx() function causes the memory referenced by ptr to be
> made available for future allocations.
>
> The sdallocx() function is an extension of dallocx() with a size
> parameter to allow the caller to pass in the allocation size as an
> optimization."""

obmalloc doesn't intend to be a general-purpose allocator - it only aims at optimizing "small" allocations, punting to the system for everything beyond that. Unless the size is _always_ passed in (on every free() and realloc() spelling it supports), an "optimization" doesn't help it much. It needs a bulletproof way to determine whether it, or system malloc/realloc, originally obtained an address passed in. If the size is always passed in, no problem (indeed, a single bit could suffice). But if it's ever possible that the size won't be passed in, all the runtime machinery to figure that out on its own needs to be able to handle all addresses.

Like now: if the size were passed in, obmalloc could test the size instead of doing the `address_in_range()` dance(*). But if it's ever possible that the size won't be passed in, all the machinery supporting `address_in_range()` still needs to be there, and every obmalloc spelling of malloc/realloc needs to ensure that machinery will work if the returned address is passed back to an obmalloc free/realloc spelling without the size.
The "only" problem with address_in_range is that it limits us to a maximum pool size of 4K. Just for fun, I boosted that to 8K to see how likely segfaults really are, and a Python built that way couldn't even get to its first prompt before dying with an access violation (Windows-speak for segfault). Alas, that makes it hard to guess how much value there would be for Python programs if the pool size could be increased - can't even get Python started.

We could eliminate the pool size restriction in many ways. For example, we could store the addresses obtained from the system malloc/realloc - but not yet freed - in a set, perhaps implemented as a radix tree to cut the memory burden. But digging through 3 or 4 levels of a radix tree to determine membership is probably significantly slower than address_in_range.

I can think of a way to do it slightly faster than (but related to) address_in_range, but it would (on a 64-bit box) require adding 24 more bytes for each system-malloc/realloc allocation. 8 of those bytes would be pure waste, due to that the Alignment Gods appear to require 16-byte alignment for every allocation on a 64-bit box now. In stark contrast, the extra memory burden of the current address_in_range is an insignificant 8 bytes per _arena_ (256 KB, and "should be" larger now).

Another approach: keep address_in_range as-is, but add new internal structure to larger pools, to repeat the arena index every 4KB. But that fights somewhat against the goal of larger pools. Etc. ;-)

(*) Actually not quite true. If a "large" object is obtained from obmalloc now (meaning it actually came from the system malloc), then cut back to a "small" size by a realloc, it _remains_ under the control of the system malloc now. Passing in the newer "small" size to a free() later would cause obmalloc to get fatally confused about that.
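One way to picture the radix-tree membership idea Tim mentions: record which arena-sized address ranges the allocator owns in a small tree keyed by chunks of the arena number, so "is this address ours?" becomes a few indexed lookups. The sketch below is a toy Python model; the 256 KiB arena size matches current obmalloc, but the 3-level 10-bit split of the address space and all names are assumptions for illustration:

```python
# Radix-tree set over arena numbers: membership means "obmalloc owns the
# arena containing this address", replacing address_in_range()'s probe.

ARENA_BITS = 18   # 256 KiB arenas, as in current obmalloc
LEVEL_BITS = 10   # split the arena number into 3 chunks of 10 bits

class ArenaRadixSet:
    def __init__(self):
        self.root = {}

    def _chunks(self, addr):
        arena = addr >> ARENA_BITS   # which arena the address falls in
        for shift in (2 * LEVEL_BITS, LEVEL_BITS, 0):
            yield (arena >> shift) & ((1 << LEVEL_BITS) - 1)

    def add_arena(self, addr):
        """Record ownership of the arena containing addr."""
        c = list(self._chunks(addr))
        node = self.root.setdefault(c[0], {})
        node = node.setdefault(c[1], {})
        node[c[2]] = True

    def __contains__(self, addr):
        # 3 indexed loads - this is the cost being weighed against
        # address_in_range()'s single arena-index probe.
        node = self.root
        for chunk in self._chunks(addr):
            node = node.get(chunk)
            if node is None:
                return False
        return True
```

Any address inside an owned arena tests positive, so pool size no longer matters; the open question in the thread is whether these extra dereferences on every free() cost more than they save.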