[Python-Dev] Re: radix tree arena map for obmalloc
On Sat, 15 Jun 2019 19:56:58 -0500 Tim Peters wrote:
> At the start, obmalloc never returned arenas to the system.  The vast
> majority of users were fine with that.  A relative few weren't.  Evan
> Jones wrote all the (considerable!) code to change that, and I
> massaged it and checked it in - not because there was "scientific
> proof" that it was more beneficial than harmful (it certainly added
> new expenses!) overall, but because it seemed like a right thing to
> do, _anticipating_ that the issue would become more important in
> coming years.
>
> I'm still glad it was done, but no tests were checked in to _quantify_
> its presumed benefits - or even to verify that it ever returned arenas
> to the system.  Best I can tell, nobody actually has any informed idea
> how well it does.  Evan stared at programs that were important to him,
> and fiddled things until he was "happy enough".

We moved from malloc() to mmap() for allocating arenas because of user
requests to release memory more deterministically:

https://bugs.python.org/issue11849

And given the number of people who use Python for long-running
processes nowadays, I'm sure that they would notice (and be annoyed) if
Python did not release memory after memory consumption spikes.

> I've looked at obmalloc stats in other programs at various stages, and
> saw nothing concerning.  memcrunch.py appears to model object lifetimes
> as coming from a uniform distribution, but in real life they appear to
> be strongly multi-modal (with high peaks at the "way less than an eye
> blink" and "effectively immortal" ends).

I agree they will certainly be multi-modal, with the number of modes,
their respective weights and their temporal distance widely dependent
on use cases.
(the fact that they're multi-modal is the reason why generational GC
is useful, btw)

> We haven't been especially
> pro-active about giant machines, and are suffering from it:
>
> https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/

So you're definitely trying to solve a problem, right?

> Fixing the underlying cause put giant machines on my radar, and
> getting rid of obmalloc's pool size limit was the next obvious thing
> that would help them (although not in the same universe as cutting
> quadratic time to linear).

"Not in the same universe", indeed.

So the question becomes: does increasing the pool and arena size have
a negative outcome on *other* use cases?  Not everyone has giant
machines.  Actually a frequent usage model is to have many small VMs
or containers on a medium-size machine.

> For example,
> it has to allocate at least 56 bytes of separate bookkeeping info for
> each arena.  Nobody cares when they have 100 arenas, but when there
> are a million arenas (which I've seen), that adds up.

In relative terms, assuming that arenas are 50% full on average
(probably a pessimistic assumption?), that overhead is 0.08% of arena
memory used.  What is the point of worrying about that?

> > If the problem is the cost of mmap() and munmap() calls, then the
> > solution more or less exists at the system level: jemalloc and other
> > allocators use madvise() with MADV_FREE (see here:
> > https://lwn.net/Articles/593564/).
> >
> > A possible design is a free list of arenas on which you use MADV_FREE
> > to let the system know the memory *can* be reclaimed.  When the free
> > list overflows, call munmap() on extraneous arenas.
>
> People can certainly pursue that if they like.  I'm not interested in
> adding more complication that helps only one of obmalloc's slowest
> paths on only one platform.

MADV_FREE is available on multiple platforms (at least Linux, macOS,
FreeBSD).
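The free-list-plus-MADV_FREE design sketched above can be illustrated
from Python itself, since the mmap module exposes madvise() (3.8+).
This is only a sketch of the idea under discussion, not code from any
allocator; the arena size and the hasattr guard are assumptions for
portability:

```python
import mmap

# Sketch of the proposed design: instead of munmap()ing an empty arena
# immediately, mark its pages reclaimable with MADV_FREE.  The mapping
# stays valid and can be reused later without a fresh mmap() call.
ARENA_SIZE = 256 * 1024  # current 64-bit arena size in CPython

arena = mmap.mmap(-1, ARENA_SIZE)  # anonymous mapping, like an arena
arena[:16] = b"x" * 16             # touch a page so it is backed by RAM

# MADV_FREE lets the kernel reclaim the pages lazily, under memory
# pressure; it is not available on every platform, hence the guard.
if hasattr(mmap, "MADV_FREE"):
    arena.madvise(mmap.MADV_FREE)

arena.close()
```

When the free list of such arenas overflows, the extras would then be
munmap()ed for real, as the quoted proposal says.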
Windows seems to offer similar facilities:
https://devblogs.microsoft.com/oldnewthing/20170113-00/?p=95185

> The dead obvious, dead simple, way to reduce mmap() expense is to call
> it less often, which just requires changing a compile-time constant -
> which will also call VirtualAlloc() equally less often on Windows.

That's assuming the dominating term in mmap() cost is O(1) rather than
O(size).  That's not a given.  The system call cost is certainly O(1),
but the cost of reserving and mapping HW pages, and zeroing them out,
is most certainly O(# pages).

Regards

Antoine.

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/ER62CFARDT2RYUNW7WOP3L5JV6UVSHSI/
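Antoine's O(# pages) point rests on the fact that anonymous mappings
are delivered zero-filled: the kernel must zero every page the program
touches, so the per-page work scales with the mapping size even though
the mmap() call itself is constant-time.  A minimal sketch (standard
mmap behavior, nothing specific to obmalloc):

```python
import mmap

# Anonymous mmap pages are guaranteed zero-filled.  Each page is
# faulted in and zeroed on first touch, so the total cost of actually
# using a mapping grows with the number of pages, not the number of
# mmap() calls.
PAGE = mmap.PAGESIZE
m = mmap.mmap(-1, 16 * PAGE)

# Reading any byte faults that page in; the kernel zeroes it on demand.
assert m[0] == 0 and m[15 * PAGE] == 0
m.close()
```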
[Python-Dev] Re: radix tree arena map for obmalloc
On Sat, 15 Jun 2019 22:02:35 -0600 Neil Schemenauer wrote:
> On 2019-06-15, Antoine Pitrou wrote:
> > We should evaluate what problem we are trying to solve here, instead
> > of staring at micro-benchmark numbers on an idle system.
>
> I think a change to obmalloc is not going to get accepted unless we
> can show it doesn't hurt these micro-benchmarks.  To displace the
> status quo, it has to give other advantages as well.  I don't have
> any agenda or "problem to solve".  After Tim made a PR to allow
> obmalloc to use larger pools, I thought it would be interesting to
> see if an arena mapping scheme based on radix trees could be
> performance competitive.  I'm not proposing any changes to CPython
> at this point.  I'm sharing the results of an experiment.  I thought
> it was interesting.  I guess you don't.

I'm not saying it's not interesting.  I'm just saying that you can't
validate memory allocator changes only on short-running
micro-benchmarks with well-behaved allocation patterns.

Regards

Antoine.
[Python-Dev] Re: radix tree arena map for obmalloc
[Antoine]
> We moved from malloc() to mmap() for allocating arenas because of user
> requests to release memory more deterministically:
>
> https://bugs.python.org/issue11849

Which was a good change!  As was using VirtualAlloc() on Windows.
None of that is being disputed.  The change under discussion isn't
about mmap - mmap only incidentally gets sucked in here because it's
part of one of obmalloc's very slowest paths.  Again, I'm aiming at
_all_ of obmalloc's slower paths, on all platforms, at once.  mmap
isn't my focus.

That doesn't preclude anyone who cares lots about mmap from adding
more complication to cater specifically to it.

> And given the number of people who use Python for long-running
> processes nowadays, I'm sure that they would notice (and be annoyed) if
> Python did not release memory after memory consumption spikes.

The PR changes nothing about the "release arenas" heuristic or
implementation.  There's just unquantified speculation that boosting
arena size, on 64-bit boxes, from a trivial 256 KiB to a slightly less
trivial 1 MiB, may be disastrous.  The only evidence for that so far
is repetition of the speculation ;-)

>> ...
>> We haven't been especially
>> pro-active about giant machines, and are suffering from it:
>>
>> https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/

> So you're definitely trying to solve a problem, right?

By my definition of "a problem", no.  I have no quantified goals or
any criteria to say "and now it's solved".  I have open-ended concerns
about how Python will fare on giant machines slinging billions of
objects, and want to make things better _before_ users are provoked to
complain.  Which they'll do anyway, of course.  I'm not trying to
change human nature ;-)

> So the question becomes: does increasing the pool and arena size have
> a negative outcome on *other* use cases?
Which is why Neil & I have been running benchmarks, tests, Python
programs we run all the time, Python programs we care about ... and
actively soliciting feedback from people actually trying the code on
programs _they_ care about.

> Not everyone has giant machines.  Actually a frequent usage model is
> to have many small VMs or containers on a medium-size machine.

Something I've never done, so am wholly unqualified to judge.  I don't
even know what "many", "small", or "medium-size" might mean in this
context.  And I don't have a giant machine either, but spent most of
my professional career in the "supercomputer" business, so at least I
understand how those kinds of people think ;-)

>> For example, it has to allocate at least 56 bytes of separate
>> bookkeeping info for each arena.  Nobody cares when they have 100
>> arenas, but when there are a million arenas (which I've seen), that
>> adds up.

> In relative terms, assuming that arenas are 50% full on average
> (probably a pessimistic assumption?), that overhead is 0.08% per arena
> memory used.  What point is worrying about that?

You're only looking at one cost.  Those bytes aren't just address
reservations, they consume actual physical RAM.  The bookkeeping info
is periodically read, and mutated, over time.  In aggregate, that's
more than enough bytes to blow away an entire L3 cache.  The less of
that stuff needs to be read and written (i.e., the fewer arenas there
are), the less pressure that puts on the faster layers (caches) of the
memory hierarchy.

That bookkeeping info is also immortal:  all the bookkeeping
arena_object structs live in a single contiguously allocated vector.
It may move over time (via realloc()), but can never shrink, only
grow.
Asking the platform malloc/realloc for a 50 MB chunk sucks on the face
of it :-(  So while the 50 MB is first-order trivial when a million
arenas are in use, if the program enters a new phase releasing almost
all of it, leaving (say) only 10 arenas still active, the 50 MB is
still there, effectively wasting 200 arenas' worth of "invisible" (to
_debugmallocstats()) space forever.

About typical arena usage, I expect, but don't know, that 50% is quite
pessimistic.  It's a failure of project management (starting with me)
that _every_ step in the "free arenas" evolution was driven by a user
complaining about their program, and that nothing was ever checked in
to even ensure their problem remained solved, let alone to help find
out how effective arena-releasing is in an arbitrary program.  We've
been flying blind from the start, and remain blind.

That said, over the decades I've often looked at obmalloc stats, and
have generally been pleasantly surprised at how much of allocated
space is being put to good use.  80% seems more typical than 50% to me
based on that.  It's easy enough to contrive programs tending toward
only 16 bytes in use per arena, but nobody has ever reported anything
like that.  The aforementioned memcrunch.py program wasn't
particularly contrived, but was added to a bug report to "prove" that
obmalloc's arena releasing strategy
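The rough arithmetic behind the figures above can be checked directly.
The numbers are approximate by construction, since the arena_object
struct is "at least 56 bytes":

```python
# Approximate cost of the immortal arena_object vector at peak usage.
ARENA_OBJECT_SIZE = 56      # bytes of bookkeeping per arena (at least)
ARENA_SIZE = 256 * 1024     # 64-bit arena size before the PR
NUM_ARENAS = 1_000_000      # "a million arenas (which I've seen)"

vector_bytes = ARENA_OBJECT_SIZE * NUM_ARENAS
print(vector_bytes / 2**20)        # ~53 MiB: the "50 MB" vector

# If nearly all arenas are later freed, the vector still pins down the
# equivalent of roughly 200 arenas of invisible space:
print(vector_bytes // ARENA_SIZE)  # ~213 arenas' worth
```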
[Python-Dev] Re: radix tree arena map for obmalloc
obmalloc is very nice at allocating small (~224 bytes) memory blocks.
But it seems the current SMALL_REQUEST_THRESHOLD (512) is too large to
me.

```
>>> pool_size = 4096 - 48  # 48 is pool header size
>>> for bs in range(16, 513, 16):
...     n, r = pool_size // bs, pool_size % bs + 48
...     print(bs, n, r, 100 * r / 4096)
...
16 253 48 1.171875
32 126 64 1.5625
48 84 64 1.5625
64 63 64 1.5625
80 50 96 2.34375
96 42 64 1.5625
112 36 64 1.5625
128 31 128 3.125
144 28 64 1.5625
160 25 96 2.34375
176 23 48 1.171875
192 21 64 1.5625
208 19 144 3.515625
224 18 64 1.5625
240 16 256 6.25
256 15 256 6.25
272 14 288 7.03125
288 14 64 1.5625
304 13 144 3.515625
320 12 256 6.25
336 12 64 1.5625
352 11 224 5.46875
368 11 48 1.171875
384 10 256 6.25
400 10 96 2.34375
416 9 352 8.59375
432 9 208 5.078125
448 9 64 1.5625
464 8 384 9.375
480 8 256 6.25
496 8 128 3.125
512 7 512 12.5
```

There are two problems here.

First, pool overhead is at most about 3.5% up to 224 bytes.  But it
becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464 bytes, and
12.5% at 512 bytes.

Second, some size classes have the same number of memory blocks.
Classes 272 and 288 have 14 blocks.  320 and 336 have 12 blocks.  It
reduces utilization of pools.  This problem becomes bigger on 32bit
platforms.

Increasing the pool size is one obvious way to fix these problems.  I
think a 16 KiB pool size and a 2 MiB arena size (the huge page size of
x86) is a sweet spot for recent web servers (typically about 32
threads and 64 GiB RAM), but there is no evidence about it.  We need a
reference application and scenario to benchmark.  pyperformance is not
good for measuring memory usage of complex applications.

```
>>> header_size = 48
>>> pool_size = 16 * 1024
>>> for bs in range(16, 513, 16):
...     n = (pool_size - header_size) // bs
...     r = (pool_size - header_size) % bs + header_size
...     print(bs, n, r, 100 * r / pool_size)
...
16 1021 48 0.29296875
32 510 64 0.390625
48 340 64 0.390625
64 255 64 0.390625
80 204 64 0.390625
96 170 64 0.390625
112 145 144 0.87890625
128 127 128 0.78125
144 113 112 0.68359375
160 102 64 0.390625
176 92 192 1.171875
192 85 64 0.390625
208 78 160 0.9765625
224 72 256 1.5625
240 68 64 0.390625
256 63 256 1.5625
272 60 64 0.390625
288 56 256 1.5625
304 53 272 1.66015625
320 51 64 0.390625
336 48 256 1.5625
352 46 192 1.171875
368 44 192 1.171875
384 42 256 1.5625
400 40 384 2.34375
416 39 160 0.9765625
432 37 400 2.44140625
448 36 256 1.5625
464 35 144 0.87890625
480 34 64 0.390625
496 32 512 3.125
512 31 512 3.125
```

Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD
to 256 and believing malloc works well for medium-size memory blocks.

--
Inada Naoki
[Python-Dev] Re: radix tree arena map for obmalloc
[Inada Naoki]
> obmalloc is very nice at allocating small (~224 bytes) memory blocks.
> But it seems current SMALL_REQUEST_THRESHOLD (512) is too large to me.

For the "unavoidable memory waste" reasons you spell out here,
Vladimir deliberately set the threshold to 256 at the start.  As
things turned out, though, dicts drove the decision to boost it to 512
anyway.  A comment in the code says:

 * Note: a size threshold of 512 guarantees that newly created dictionaries
 * will be allocated from preallocated memory pools on 64-bit.

That was probably before "compact dicts", but, as-is, obmalloc can
handle small dicts now even if they're not empty.  And small dicts are
common.  Another relevant one pushing dicts to use obmalloc:

https://bugs.python.org/issue23601

> ...
> Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD
> to 256 and believe malloc works well for medium size memory blocks.

Alas, it doesn't ;-)  Well, everyone who ever timed it found obmalloc
was significantly faster than the system malloc for every size class
obmalloc handles.  That's partly because obmalloc runs under
protection of the GIL, while the system malloc has to worry about
thread safety.  But, regardless, obmalloc is just plain hard to beat
so long as it can stay in its fast paths (which increasing pool and
arena sizes aims at helping it do).

Its current pool and arena sizes are tiny compared to comparable
structures in the high-profile allocators often mentioned in these
discussions.  However, the latter generally use platform-specific ways
to spell "free this OS page" rather than waiting for an entire
arena-like structure to become empty.  For example, "SuperMalloc"[1]
effectively uses 2 MiB pools, in the sense of "want 16 bytes?  Cool,
we'll start by devoting 2 MiB to holding _only_ 16-byte objects.  Now
you want 388?  Cool, we'll reserve another 2 MiB for holding only
388-byte objects.  Etc."
[1] http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf

> ```
> >>> pool_size = 4096 - 48  # 48 is pool header size
> >>> for bs in range(16, 513, 16):
> ...     n, r = pool_size // bs, pool_size % bs + 48
> ...     print(bs, n, r, 100 * r / 4096)
> ...
> 16 253 48 1.171875
> 32 126 64 1.5625
> 48 84 64 1.5625
> 64 63 64 1.5625
> 80 50 96 2.34375
> 96 42 64 1.5625
> 112 36 64 1.5625
> 128 31 128 3.125
> 144 28 64 1.5625
> 160 25 96 2.34375
> 176 23 48 1.171875
> 192 21 64 1.5625
> 208 19 144 3.515625
> 224 18 64 1.5625
> 240 16 256 6.25
> 256 15 256 6.25
> 272 14 288 7.03125
> 288 14 64 1.5625
> 304 13 144 3.515625
> 320 12 256 6.25
> 336 12 64 1.5625
> 352 11 224 5.46875
> 368 11 48 1.171875
> 384 10 256 6.25
> 400 10 96 2.34375
> 416 9 352 8.59375
> 432 9 208 5.078125
> 448 9 64 1.5625
> 464 8 384 9.375
> 480 8 256 6.25
> 496 8 128 3.125
> 512 7 512 12.5
> ```
>
> There are two problems here.
>
> First, pool overhead is at most about 3.5% up to 224 bytes.
> But it becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464
> bytes, and 12.5% at 512 bytes.

Note that in sys._debugmallocstats() output, the total of the
"unusable for this reason" space is called "bytes lost to
quantization".  It's rarely a big deal in my experience, but certainly
tends to account for a higher percentage of arena space as the average
size of used objects increases.

> Second, some size classes have the same number of memory blocks.
> Classes 272 and 288 have 14 blocks.  320 and 336 have 12 blocks.
> It reduces utilization of pools.  This problem becomes bigger on
> 32bit platforms.

I don't understand the "32bit platform" comment.  On 32-bit boxes,
alignment is to 8 byte boundaries (not 16 as above), and there are 64
size classes (not 32 as above).  Which tends to make space efficiency
better, not worse, than on 64-bit Python (16-byte alignment is a waste
for the vast majority of objects Python programs use, which rarely
require better than pointer alignment (on 64-bit boxes) or double
alignment (on 32-bit ones)).
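The alignment point above can be made concrete with a small sketch.
This is not obmalloc code; it is just the described rounding rule,
with the alignment values taken from the discussion (16 bytes / 32
classes on 64-bit Python, 8 bytes / 64 classes on 32-bit):

```python
def size_class(request, alignment):
    """Round a request up to its obmalloc-style size class."""
    return (request + alignment - 1) // alignment * alignment

# With 16-byte alignment a 24-byte request burns 8 bytes per block;
# with 8-byte alignment it burns none.
print(size_class(24, 16))  # 32
print(size_class(24, 8))   # 24
```

Finer-grained classes waste less per block, which is why 8-byte
alignment tends to improve space efficiency on 32-bit builds.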
> Increasing pool size is one obvious way to fix these problems.

Yes & no ;-)  In Neil's radix-tree branch your code above works fine
when the pool is increased.  But my PR doesn't introduce a new tree,
building on what obmalloc already does:  it absolutely needs to
"steal" 4 bytes at the start of every 4 KiB page - which turns into 16
bytes on 64-bit boxes to preserve 16-byte alignment.  The number of
usable blocks in a pool is computed by this function in the PR:

/* Return total number of blocks in pool of size index `i`, as a uint. */
static uint
NUMBLOCKS(int i)
{
    assert(0 <= i && i < NB_SMALL_SIZE_CLASSES);
    /* The first page burns space for a pool header, and remaining
     * pages burn ALIGNMENT bytes for the arena index.
     */
    const uint size = INDEX2SIZE(i);
    uint usable1 = SYSTEM_PAGE_SIZE - POOL_OVERHEAD;
    uint usable2 = SYSTEM_PAGE_SIZE - ALIGNMENT;
    return usable1 / size + (usable2 / size) * (PAGES_PER_POOL - 1);
}

For a 16-KiB pool, this allows us to (e.g.) squeeze out 6 more
16-byte objects than we can
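The NUMBLOCKS() computation can be sketched in Python.  The constants
here are assumptions taken from the surrounding discussion (4 KiB
system pages, a 48-byte pool header, 16-byte alignment), not values
copied from the PR:

```python
# Python sketch of the NUMBLOCKS() function quoted above.
SYSTEM_PAGE_SIZE = 4096
POOL_OVERHEAD = 48   # pool header, burned from the first page only
ALIGNMENT = 16       # arena-index bytes burned from each later page

def numblocks(size, pages_per_pool):
    """Usable blocks of `size` bytes in a pool of `pages_per_pool` pages."""
    usable1 = SYSTEM_PAGE_SIZE - POOL_OVERHEAD
    usable2 = SYSTEM_PAGE_SIZE - ALIGNMENT
    return usable1 // size + (usable2 // size) * (pages_per_pool - 1)

print(numblocks(16, 1))  # one-page (4 KiB) pool, as in current obmalloc
print(numblocks(16, 4))  # four-page (16 KiB) pool under the PR's scheme
```

Note that numblocks(16, 1) reproduces the 253 blocks shown for the
16-byte class in Inada's 4 KiB table.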