[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Antoine Pitrou
On Sat, 15 Jun 2019 19:56:58 -0500
Tim Peters  wrote:
> 
> At the start, obmalloc never returned arenas to the system.  The vast
> majority of users were fine with that.  A relative few weren't.  Evan
> Jones wrote all the (considerable!) code to change that, and I
> massaged it and checked it in - not because there was "scientific
> proof" that it was more beneficial than harmful (it certainly added
> new expenses!) overall, but because it seemed like a right thing to
> do, _anticipating_ that the issue would become more important in
> coming years.
> 
> I'm still glad it was done, but no tests were checked in to _quantify_
> its presumed benefits - or even to verify that it ever returned arenas
> to the system.  Best I can tell, nobody actually has any informed idea
> how well it does.  Evan stared at programs that were important to him,
> and fiddled things until he was "happy enough".

We moved from malloc() to mmap() for allocating arenas because of user
requests to release memory more deterministically:
https://bugs.python.org/issue11849

And given the number of people who use Python for long-running
processes nowadays, I'm sure that they would notice (and be annoyed) if
Python did not release memory after memory consumption spikes.
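
For anyone who wants to check this on their own workload, one quick way to watch it from pure Python is sys._debugmallocstats(); the snippet below is only an illustration, and the exact wording of the report can differ between CPython versions:

```
import sys

# Simulate a spike of many small objects, then drop them all.
spike = [{} for _ in range(1_000_000)]
del spike

# Writes obmalloc statistics to stderr.  Lines such as
# "# arenas allocated total", "# arenas reclaimed" and
# "# arenas allocated current" show whether arenas really were
# returned to the system after the spike.
sys._debugmallocstats()
```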

> I've looked at obmalloc stats in other programs at various stages, and
> saw nothing concerning.  memcrunch.py appears to model object lifetimes
> as coming from a uniform distribution, but in real life they appear to
> be strongly multi-modal (with high peaks at the "way less than an eye
> blink" and "effectively immortal" ends).

I agree they will certainly be multi-modal, with the number of modes,
their respective weights and their temporal distance widely dependent on
use cases.

(the fact that they're multi-modal is the reason why generational GC is
useful, btw)

> We haven't been especially
> pro-active about giant machines, and are suffering from it:
> 
> https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/

So you're definitely trying to solve a problem, right?

> Fixing the underlying cause put giant machines on my radar, and
> getting rid of obmalloc's pool size limit was the next obvious thing
> that would help them (although not in the same universe as cutting
> quadratic time to linear).

"Not in the same universe", indeed.  So the question becomes: does the
increase in pool and arena sizes have a negative outcome on *other*
use cases?

Not everyone has giant machines.  Actually a frequent usage model is to
have many small VMs or containers on a medium-size machine.

> For example,
> it has to allocate at least 56 bytes of separate bookkeeping info for
> each arena.  Nobody cares when they have 100 arenas, but when there
> are a million arenas (which I've seen), that adds up.

In relative terms, assuming that arenas are 50% full on average
(probably a pessimistic assumption?), that overhead is 0.08% of the
arena memory used.  What is the point of worrying about that?

> > If the problem is the cost of mmap() and munmap() calls, then the
> > solution more or less exists at the system level: jemalloc and other
> > allocators use madvise() with MADV_FREE (see here:
> > https://lwn.net/Articles/593564/).
> >
> > A possible design is a free list of arenas on which you use MADV_FREE
> > to let the system know the memory *can* be reclaimed.  When the free
> > list overflows, call munmap() on extraneous arenas.  
> 
> People can certainly pursue that if they like.  I'm not interested in
> adding more complication that helps only one of obmalloc's slowest
> paths on only one platform.

MADV_FREE is available on multiple platforms (at least Linux, macOS,
FreeBSD).  Windows seems to offer similar facilities:
https://devblogs.microsoft.com/oldnewthing/20170113-00/?p=95185
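
For illustration only (nothing like this is in the PR), the free-list idea can be sketched with Python's own mmap module, which exposes madvise() since 3.8 and mmap.MADV_FREE where the platform defines it; a real implementation would of course live in C inside obmalloc:

```
import mmap
from collections import deque

ARENA_SIZE = 1 << 20   # assumed 1 MiB arenas
MAX_FREE = 16          # arbitrary free-list limit for this sketch

free_arenas = deque()

def release_arena(arena):
    """Put an arena on the free list, hinting the kernel that its pages
    *can* be reclaimed; unmap outright once the list overflows."""
    if hasattr(mmap, "MADV_FREE") and len(free_arenas) < MAX_FREE:
        arena.madvise(mmap.MADV_FREE)   # lazy reclamation hint
        free_arenas.append(arena)
    else:
        arena.close()                   # munmap() right away

def get_arena():
    """Reuse a cached arena if possible, else map a fresh anonymous one."""
    if free_arenas:
        return free_arenas.popleft()
    return mmap.mmap(-1, ARENA_SIZE)
```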

> The dead obvious, dead simple, way to reduce mmap() expense is to call
> it less often, which just requires changing a compile-time constant -
> which will also call VirtualAlloc() equally less often on Windows.

That's assuming the dominating term in mmap() cost is O(1) rather than
O(size).  That's not a given.  The system call cost is certainly O(1),
but the cost of reserving and mapping HW pages, and of zeroing them out,
is most certainly O(# pages).
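
A rough way to see where that cost lands (just a sketch; on Linux the zeroing is typically deferred until pages are first touched, so the mapping call and the page faults have to be timed separately):

```
import mmap
import time

PAGE = mmap.PAGESIZE

def time_map_and_touch(nbytes):
    t0 = time.perf_counter()
    m = mmap.mmap(-1, nbytes)      # anonymous mapping; pages not yet committed
    t1 = time.perf_counter()
    for offset in range(0, nbytes, PAGE):
        m[offset] = 1              # first write faults in (and zeroes) the page
    t2 = time.perf_counter()
    m.close()
    return t1 - t0, t2 - t1

for size in (256 * 1024, 1024 * 1024, 16 * 1024 * 1024):
    map_t, touch_t = time_map_and_touch(size)
    print(f"{size >> 10:6d} KiB: mmap {map_t * 1e6:8.1f} us, "
          f"touching pages {touch_t * 1e6:8.1f} us")
```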

Regards

Antoine.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Antoine Pitrou
On Sat, 15 Jun 2019 22:02:35 -0600
Neil Schemenauer  wrote:
> On 2019-06-15, Antoine Pitrou wrote:
> > We should evaluate what problem we are trying to solve here, instead
> > of staring at micro-benchmark numbers on an idle system.  
> 
> I think a change to obmalloc is not going to get accepted unless we
> can show it doesn't hurt these micro-benchmarks.  To displace the
> status quo, it has to give other advantages as well.  I don't have
> any agenda or "problem to solve".  After Tim made a PR to allow
> obmalloc to use larger pools, I thought it would be interesting to
> see if an arena mapping scheme based on radix trees would be
> performance competitive.  I'm not proposing any changes to CPython
> at this point.  I'm sharing the results of an experiment.  I thought
> it was interesting.  I guess you don't.

I'm not saying it's not interesting.  I'm just saying that you can't
validate memory allocator changes only on short-running
micro-benchmarks with well-behaved allocation patterns.

Regards

Antoine.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Tim Peters
[Antoine]
> We moved from malloc() to mmap() for allocating arenas because of user
> requests to release memory more deterministically:
>
> https://bugs.python.org/issue11849

Which was a good change!  As was using VirtualAlloc() on Windows.
None of that is being disputed.  The change under discussion isn't
about mmap - mmap only incidentally gets sucked in here because it's
part of obmalloc's very slowest paths.  Again, I'm aiming at
_all_ of obmalloc's slower paths, on all platforms, at once.  mmap
isn't my focus.

That doesn't preclude anyone who cares lots about mmap from adding
more complication to cater specifically to it.

> And given the number of people who use Python for long-running
> processes nowadays, I'm sure that they would notice (and be annoyed) if
> Python did not release memory after memory consumption spikes.

The PR changes nothing about the "release arenas" heuristic or
implementation.  There's just unquantified speculation that boosting
arena size, on 64-bit boxes, from a trivial 256 KiB to a slightly less
trivial 1 MiB, may be disastrous.  The only evidence for that so far
is repetition of the speculation ;-)
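
For scale, a back-of-the-envelope look at what that boost means for the number of arenas (and hence arena-level mmap()/VirtualAlloc() calls) behind a given heap footprint:

```
KiB = 1024
MiB = 1024 * KiB
GiB = 1024 * MiB

for footprint in (64 * MiB, 1 * GiB, 16 * GiB):
    arenas_256k = footprint // (256 * KiB)  # current arena size
    arenas_1m = footprint // (1 * MiB)      # size proposed in the PR
    print(f"{footprint // MiB:6d} MiB heap: "
          f"{arenas_256k:6d} arenas at 256 KiB vs {arenas_1m:6d} at 1 MiB")
```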

>> ...
>> We haven't been especially
>> pro-active about giant machines, and are suffering from it:
>>
>> https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/

> So you're definitely trying to solve a problem, right?

By my definition of "a problem", no.  I have no quantified goals or
any criteria to say "and now it's solved".  I have open-ended concerns
about how Python will fare on giant machines slinging billions of
objects, and want to make things better _before_ users are provoked to
complain.  Which they'll do anyway, of course.  I'm not trying to
change human nature ;-)

> So the question becomes: does the increase in pool and arena sizes
> have a negative outcome on *other* use cases?

Which is why Neil & I have been running benchmarks, tests, Python
programs we run all the time, Python programs we care about ... and
actively soliciting feedback from people actually trying the code on
programs _they_ care about.

> Not everyone has giant machines.  Actually a frequent usage model is to
> have many small VMs or containers on a medium-size machine.

Something I've never done, so am wholly unqualified to judge.  I don't
even know what "many", "small", or "medium-size" might mean in this
context.  And I don't have a giant machine either, but spent most of
my professional career in the "supercomputer" business so at least
understand how those kinds of people think ;-)


>> For example, it has to allocate at least 56 bytes of separate
>> bookkeeping info for each arena.  Nobody cares when they have 100
>> arenas, but when there are a million arenas (which I've seen), that
>> adds up.

> In relative terms, assuming that arenas are 50% full on average
> (probably a pessimistic assumption?), that overhead is 0.08% of the
> arena memory used.  What is the point of worrying about that?

You're only looking at one cost.  Those bytes aren't just address
reservations, they consume actual physical RAM.  The bookkeeping info
is periodically read, and mutated, over time.  In aggregate, that's
more than enough bytes to blow away an entire L3 cache.  The less of
that stuff needs to be read and written (i.e., the fewer arenas there
are), the less pressure that puts on the faster layers (caches) of the
memory hierarchy.

That bookkeeping info is also immortal:  all the bookkeeping
arena_object structs live in a single contiguously allocated vector.  It
may move over time (via realloc()), but can never shrink, only grow.
Asking the platform malloc/realloc for a 50 MB chunk sucks on the face
of it :-(

So while the 50 MB is first-order trivial when a million arenas are in
use, if the program enters a new phase and releases almost all of it,
leaving (say) only 10 arenas still active, the 50 MB is still there,
effectively wasting 200 arenas' worth of "invisible" (to
_debugmallocstats()) space forever.
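
The arithmetic behind those figures, assuming the 56-byte arena_object and the current 256 KiB arena size mentioned above:

```
ARENA_SIZE = 256 * 1024   # current arena size
ARENA_OBJECT = 56         # bookkeeping bytes per arena (at least)
N_ARENAS = 1_000_000      # "a million arenas (which I've seen)"

bookkeeping = N_ARENAS * ARENA_OBJECT
print(bookkeeping / 2**20)        # ~53.4 MiB - the "50 MB chunk" above
print(bookkeeping // ARENA_SIZE)  # ~213 arenas' worth of space that
                                  # survives even if only 10 arenas
                                  # remain in use
```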

About typical arena usage, I expect, but don't know, that 50% is quite
pessimistic.  It's a failure of project management (starting with me)
that _every_ step in the "free arenas" evolution was driven by a user
complaining about their program, and that nothing was ever checked in
to even ensure their problem remained solved, let alone to help find
out how effective arena-releasing is in an arbitrary program.  We've
been flying blind from the start, and remain blind.

That said, over the decades I've often looked at obmalloc stats, and
have generally been pleasantly surprised at how much of allocated
space is being put to good use.  80% seems more typical than 50% to me
based on that.

It's easy enough to contrive programs tending toward only 16 bytes in
use per arena, but nobody has ever reported anything like that.  The
aforementioned memcrunch.py program wasn't particularly contrived, but
was added to a bug report to "prove" that obmalloc's arena releasing
strategy 

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Inada Naoki
obmalloc is very nice at allocating small (~224 bytes) memory blocks.
But the current SMALL_REQUEST_THRESHOLD (512) seems too large to me.

```
>>> pool_size = 4096 - 48  # 48 is pool header size
>>> for bs in range(16, 513, 16):
... n,r = pool_size//bs, pool_size%bs + 48
... print(bs, n, r, 100*r/4096)
...
16 253 48 1.171875
32 126 64 1.5625
48 84 64 1.5625
64 63 64 1.5625
80 50 96 2.34375
96 42 64 1.5625
112 36 64 1.5625
128 31 128 3.125
144 28 64 1.5625
160 25 96 2.34375
176 23 48 1.171875
192 21 64 1.5625
208 19 144 3.515625
224 18 64 1.5625
240 16 256 6.25
256 15 256 6.25
272 14 288 7.03125
288 14 64 1.5625
304 13 144 3.515625
320 12 256 6.25
336 12 64 1.5625
352 11 224 5.46875
368 11 48 1.171875
384 10 256 6.25
400 10 96 2.34375
416 9 352 8.59375
432 9 208 5.078125
448 9 64 1.5625
464 8 384 9.375
480 8 256 6.25
496 8 128 3.125
512 7 512 12.5
```

There are two problems here.

First, pool overhead is at most about 3.5% for block sizes up to 224 bytes.
But it becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464
bytes, and 12.5% at 512 bytes.

Second, some size classes have the same number of memory blocks.
Classes 272 and 288 have 14 blocks, and 320 and 336 have 12 blocks.
It reduces the utilization of pools.  This problem becomes bigger on
32-bit platforms.

Increasing pool size is one obvious way to fix these problems.
I think a 16 KiB pool size and a 2 MiB arena size (the x86 huge page
size) would be a sweet spot for recent web servers (typically about 32
threads and 64 GiB of RAM), but there is no evidence for that yet.
We need a reference application and scenario to benchmark;
pyperformance is not good for measuring the memory usage of complex
applications.

```
>>> header_size = 48
>>> pool_size = 16*1024
>>> for bs in range(16, 513, 16):
... n = (pool_size - header_size) // bs
... r = (pool_size - header_size) % bs + header_size
... print(bs, n, r, 100 * r / pool_size)
...
16 1021 48 0.29296875
32 510 64 0.390625
48 340 64 0.390625
64 255 64 0.390625
80 204 64 0.390625
96 170 64 0.390625
112 145 144 0.87890625
128 127 128 0.78125
144 113 112 0.68359375
160 102 64 0.390625
176 92 192 1.171875
192 85 64 0.390625
208 78 160 0.9765625
224 72 256 1.5625
240 68 64 0.390625
256 63 256 1.5625
272 60 64 0.390625
288 56 256 1.5625
304 53 272 1.66015625
320 51 64 0.390625
336 48 256 1.5625
352 46 192 1.171875
368 44 192 1.171875
384 42 256 1.5625
400 40 384 2.34375
416 39 160 0.9765625
432 37 400 2.44140625
448 36 256 1.5625
464 35 144 0.87890625
480 34 64 0.390625
496 32 512 3.125
512 31 512 3.125
```

Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD
to 256 and trusting that malloc works well for medium-size memory blocks.

-- 
Inada Naoki  


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Tim Peters
[Inada Naoki ]
> obmalloc is very nice at allocating small (~224 bytes) memory blocks.
> But the current SMALL_REQUEST_THRESHOLD (512) seems too large to me.

For the "unavoidable memory waste" reasons you spell out here,
Vladimir deliberately set the threshold to 256 at the start.  As
things turned out, though, dicts drove the decision to boost it to 512
anyway.  A comment in the code says:

 * Note: a size threshold of 512 guarantees that newly created dictionaries
 * will be allocated from preallocated memory pools on 64-bit.

That was probably before "compact dicts", but, as-is, obmalloc can
handle small dicts now even if they're not empty.  And small dicts are
common.  Another relevant one pushing dicts to use obmalloc:

https://bugs.python.org/issue23601
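
For a concrete feel (exact numbers vary by CPython version and build), sys.getsizeof() shows small dicts landing comfortably under the 512-byte threshold on a typical 64-bit build:

```
import sys

print(sys.getsizeof({}))                          # empty dict
print(sys.getsizeof({i: i for i in range(5)}))    # small dict, 5 entries
print(sys.getsizeof({i: i for i in range(8)}))    # after one table resize
# On a typical 64-bit CPython 3.x all of these are well below 512 bytes,
# so such dicts are served by obmalloc rather than the system malloc.
```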

> ...
> Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD
> to 256 and trusting that malloc works well for medium-size memory blocks.

Alas, it doesn't ;-)  Well, everyone who ever timed it found obmalloc
was significantly faster than the system malloc for every size class
obmalloc handles.  That's partly because obmalloc runs under
protection of the GIL, while the system malloc has to worry about
thread safety.

But, regardless, obmalloc is just plain hard to beat so long as it can
stay in its fast paths (which increasing pool and arena sizes aims at
helping it do).  Its current pool and arena sizes are tiny compared to
comparable structures in the high-profile allocators often mentioned
in these discussions.  However, the latter generally use
platform-specific ways to spell "free this OS page" rather than
waiting for an entire arena-like structure to become empty.

For example, "SuperMalloc"[1] effectively uses 2 MiB pools, in the
sense of "want 16 bytes? Cool, we'll start by devoting 2 MiB to
holding _only_ 16-byte objects.  Now you want 388?  Cool, we'll
reserve another 2 MiB to hold only 388-byte objects.  Etc."

[1] http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf


> ```
> >>> pool_size = 4096 - 48  # 48 is pool header size
> >>> for bs in range(16, 513, 16):
> ... n,r = pool_size//bs, pool_size%bs + 48
> ... print(bs, n, r, 100*r/4096)
> ...
> 16 253 48 1.171875
> 32 126 64 1.5625
> 48 84 64 1.5625
> 64 63 64 1.5625
> 80 50 96 2.34375
> 96 42 64 1.5625
> 112 36 64 1.5625
> 128 31 128 3.125
> 144 28 64 1.5625
> 160 25 96 2.34375
> 176 23 48 1.171875
> 192 21 64 1.5625
> 208 19 144 3.515625
> 224 18 64 1.5625
> 240 16 256 6.25
> 256 15 256 6.25
> 272 14 288 7.03125
> 288 14 64 1.5625
> 304 13 144 3.515625
> 320 12 256 6.25
> 336 12 64 1.5625
> 352 11 224 5.46875
> 368 11 48 1.171875
> 384 10 256 6.25
> 400 10 96 2.34375
> 416 9 352 8.59375
> 432 9 208 5.078125
> 448 9 64 1.5625
> 464 8 384 9.375
> 480 8 256 6.25
> 496 8 128 3.125
> 512 7 512 12.5
> ```
>
> There are two problems here.
>
> First, pool overhead is at most about 3.5% for block sizes up to 224 bytes.
> But it becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464
> bytes, and 12.5% at 512 bytes.

Note that in sys._debugmallocstats() output, the total of the
"unusable for this reason" space is called "bytes lost to
quantization".  It's rarely a big deal in my experience, but certainly
tends to account for a higher percentage of arena space as the average
size of used objects increases.
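
The same per-pool quantization loss can be computed directly from the size-class layout (a sketch using the 4 KiB pool and 48-byte header figures assumed earlier in the thread; the pool header itself is reported separately):

```
POOL_SIZE = 4096
POOL_HEADER = 48

def lost_to_quantization(block_size):
    # Bytes in one pool that no block of `block_size` can ever use.
    return (POOL_SIZE - POOL_HEADER) % block_size

for bs in (224, 240, 416, 512):
    print(bs, lost_to_quantization(bs))   # 16, 208, 304, 464 bytes per pool
```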


> Second, some size classes have the same number of memory blocks.
> Classes 272 and 288 have 14 blocks, and 320 and 336 have 12 blocks.
> It reduces the utilization of pools.  This problem becomes bigger on
> 32-bit platforms.

I don't understand the "32bit platform" comment.  On 32-bit boxes,
alignment is to 8 byte boundaries (not 16 as above), and there are 64
size classes (not 32 as above).  Which tends to make space efficiency
better, not worse, than on 64-bit Python (16-byte alignment is a waste
for the vast majority of objects Python programs use, which rarely
require better than pointer alignment (on 64-bit boxes) or double
alignment (on 32-bit ones)).
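
The size-class arithmetic behind that (a small sketch; ALIGNMENT is 8 on 32-bit builds and 16 on 64-bit ones, with SMALL_REQUEST_THRESHOLD at 512 on both):

```
SMALL_REQUEST_THRESHOLD = 512

for alignment in (8, 16):
    nclasses = SMALL_REQUEST_THRESHOLD // alignment
    # A request of n bytes is rounded up to a multiple of ALIGNMENT, so
    # the average rounding waste per object is roughly ALIGNMENT / 2.
    print(f"ALIGNMENT={alignment:2d}: {nclasses} size classes, "
          f"~{alignment // 2} bytes average rounding waste per object")
```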


> Increasing pool size is one obvious way to fix these problems.

Yes & no ;-)  In Neil's radix-tree branch your code above works fine
when the pool is increased.  But my PR doesn't introduce a new tree;
building on what obmalloc already does, it absolutely needs to
"steal" 4 bytes at the start of every 4 KiB page - which turns into 16
bytes on 64-bit boxes to preserve 16-byte alignment.  The number of usable
blocks in a pool is computed by this function in the PR:

/* Return total number of blocks in pool of size index `i`, as a uint. */
static uint
NUMBLOCKS(int i)
{
assert(0 <= i && i < NB_SMALL_SIZE_CLASSES);
/* The first page burns space for a pool header, and remaining pages
 * burn ALIGNMENT bytes for the arena index.
 */
const uint size = INDEX2SIZE(i);
uint usable1 = SYSTEM_PAGE_SIZE - POOL_OVERHEAD;
uint usable2 = SYSTEM_PAGE_SIZE - ALIGNMENT;
return usable1 / size + (usable2 / size) * (PAGES_PER_POOL - 1);
}
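
For quick experimentation outside the C code, here is a Python rendition of the same computation (taking the block size directly instead of the size-class index, and assuming the 4 KiB pages, 48-byte pool header and 16-byte ALIGNMENT discussed in this thread):

```
SYSTEM_PAGE_SIZE = 4096
POOL_OVERHEAD = 48    # pool header size
ALIGNMENT = 16        # 64-bit builds

def numblocks(size, pool_size=16 * 1024):
    # The first page loses POOL_OVERHEAD bytes to the pool header; every
    # remaining page loses ALIGNMENT bytes to the stolen arena index.
    pages_per_pool = pool_size // SYSTEM_PAGE_SIZE
    usable1 = SYSTEM_PAGE_SIZE - POOL_OVERHEAD
    usable2 = SYSTEM_PAGE_SIZE - ALIGNMENT
    return usable1 // size + (usable2 // size) * (pages_per_pool - 1)

print(numblocks(16))                          # 1018 in one 16 KiB pool
print(4 * numblocks(16, pool_size=4 * 1024))  # 1012 in four 4 KiB pools
```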

For a 16-KiB pool, this allows us to (e.g.) squeeze out 6 more 16-byte
objects than we can