[Inada Naoki <songofaca...@gmail.com>]
> obmalloc is very nice at allocating small (~224 bytes) memory blocks.
> But the current SMALL_REQUEST_THRESHOLD (512) seems too large to me.

For the "unavoidable memory waste" reasons you spell out here,
Vladimir deliberately set the threshold to 256 at the start.  As
things turned out, though, dicts drove the decision to boost it to 512
anyway.  A comment in the code says:

 * Note: a size threshold of 512 guarantees that newly created dictionaries
 * will be allocated from preallocated memory pools on 64-bit.

That was probably written before "compact dicts", but, as things
stand, obmalloc can handle small dicts now even if they're not empty.
And small dicts are common.  Another relevant change pushing dicts to
use obmalloc:

https://bugs.python.org/issue23601
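
For instance, a quick check (a sketch only - exact sizes vary across
CPython versions and builds, and getsizeof() lumps the dict header
together with its keys table, which obmalloc may see as separate,
smaller requests):

```
import sys

# Do freshly created small dicts fall at or under the 512-byte
# threshold, so obmalloc (rather than the system malloc) serves them?
for n in (0, 1, 5, 8):
    d = {i: i for i in range(n)}
    size = sys.getsizeof(d)
    print(n, "entries:", size, "bytes",
          "(obmalloc)" if size <= 512 else "(system malloc)")
```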

> ...
> Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD
> to 256 and believing malloc works well for medium-size memory blocks.

Alas, it doesn't ;-)  Well, everyone who ever timed it found obmalloc
was significantly faster than the system malloc for every size class
obmalloc handles.  That's partly because obmalloc runs under
protection of the GIL, while the system malloc has to worry about
thread safety.
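
You can get a feel for that yourself via the PYTHONMALLOC environment
variable (3.6+), which can force the raw system malloc.  A sketch
(numbers vary by platform and libc, and the snippet timed here is
arbitrary):

```
import os
import subprocess
import sys

# Time an allocation-heavy snippet under obmalloc ("pymalloc") and
# under the raw system malloc ("malloc").  PYTHONMALLOC must be set
# before interpreter startup, hence the child processes.
SNIPPET = "for _ in range(10**5): d = {'a': 1, 'b': 2}"

for allocator in ("pymalloc", "malloc"):
    result = subprocess.run(
        [sys.executable, "-m", "timeit", SNIPPET],
        env={**os.environ, "PYTHONMALLOC": allocator},
        capture_output=True,
        text=True,
    )
    print(allocator, "->", result.stdout.strip())
```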

But, regardless, obmalloc is just plain hard to beat so long as it can
stay in its fast paths (which increasing pool and arena sizes aims at
helping it do).  Its current pool and arena sizes are tiny compared to
comparable structures in the high-profile allocators often mentioned
in these discussions.  However, the latter generally use
platform-specific ways to spell "free this OS page" rather than
waiting for an entire arena-like structure to become empty.
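
On Linux that's madvise(), which Python itself exposes these days, so
the idea is easy to demo (a sketch - requires 3.8+, and the constants
and semantics are platform-specific):

```
import mmap

# Anonymous 2 MiB mapping; touch the first page, then tell the kernel
# it may reclaim that page without unmapping the address range.
m = mmap.mmap(-1, 2 * 1024 * 1024)
m[0:4096] = b"x" * 4096
m.madvise(mmap.MADV_DONTNEED, 0, 4096)
m.close()
```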

For example, "SuperMalloc"[1] effectively uses 2 MiB pools, in the
sense of "want 16 bytes? Cool, we'll start by devoting 2 MiB to
holding _only_ 16-byte objects.  Now you want 388?  Cool, we'll
reserve another 2 MiB to hold only 388-byte objects.  Etc."

[1] http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf
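
The arithmetic of that scheme is easy to sketch (this just restates
the description above - it's not SuperMalloc's actual code):

```
# Each size class gets whole 2 MiB chunks to itself: how many blocks
# fit, and how much is unusable at the tail of a chunk?
CHUNK = 2 * 1024 * 1024

for size in (16, 388, 512):
    blocks = CHUNK // size
    tail = CHUNK - blocks * size
    print(size, "->", blocks, "blocks per chunk,", tail, "bytes unused")
```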


> ```
> >>> pool_size = 4096 - 48  # 48 is pool header size
> >>> for bs in range(16, 513, 16):
> ...     n,r = pool_size//bs, pool_size%bs + 48
> ...     print(bs, n, r, 100*r/4096)
> ...
> 16 253 48 1.171875
> 32 126 64 1.5625
> 48 84 64 1.5625
> 64 63 64 1.5625
> 80 50 96 2.34375
> 96 42 64 1.5625
> 112 36 64 1.5625
> 128 31 128 3.125
> 144 28 64 1.5625
> 160 25 96 2.34375
> 176 23 48 1.171875
> 192 21 64 1.5625
> 208 19 144 3.515625
> 224 18 64 1.5625
> 240 16 256 6.25
> 256 15 256 6.25
> 272 14 288 7.03125
> 288 14 64 1.5625
> 304 13 144 3.515625
> 320 12 256 6.25
> 336 12 64 1.5625
> 352 11 224 5.46875
> 368 11 48 1.171875
> 384 10 256 6.25
> 400 10 96 2.34375
> 416 9 352 8.59375
> 432 9 208 5.078125
> 448 9 64 1.5625
> 464 8 384 9.375
> 480 8 256 6.25
> 496 8 128 3.125
> 512 7 512 12.5
> ```
>
> There are two problems here.
>
> First, pool overhead is at most about 3.5% up to 224 bytes.
> But it becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464
> bytes, and 12.5% at 512 bytes.

Note that in sys._debugmallocstats() output, the total of the
"unusable for this reason" space is called "bytes lost to
quantization",  It's rarely a big deal in my experience, but certainly
tends to account for a higher percentage of arena space as the average
size of used objects increases.
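
Since sys._debugmallocstats() writes to the C-level stderr, the
easiest way to fish that line out programmatically is from a child
interpreter.  A sketch:

```
import subprocess
import sys

# Dump a fresh interpreter's allocator stats and pull out the
# "bytes lost to quantization" line.
out = subprocess.run(
    [sys.executable, "-c", "import sys; sys._debugmallocstats()"],
    capture_output=True,
    text=True,
).stderr

for line in out.splitlines():
    if "quantization" in line:
        print(line.strip())
```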


> Second, some size classes have the same number of memory blocks.
> Classes 272 and 288 have 14 blocks.  320 and 336 have 12 blocks.
> It reduces utilization of pools.  This problem becomes bigger on
> 32-bit platforms.

I don't understand the "32-bit platform" comment.  On 32-bit boxes,
alignment is to 8-byte boundaries (not 16 as above), and there are 64
size classes (not 32 as above).  That tends to make space efficiency
better, not worse, than on 64-bit Python (16-byte alignment is a waste
for the vast majority of objects Python programs use, which rarely
require better than pointer alignment (on 64-bit boxes) or double
alignment (on 32-bit ones)).
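
The per-object padding cost of the coarser classes is easy to
quantify (assuming, unrealistically, uniformly distributed request
sizes - real programs skew small, which only strengthens the point):

```
# Average padding added when a request of size s is rounded up to the
# next multiple of ALIGNMENT - the waste that finer 8-byte classes
# avoid relative to 16-byte ones.
def mean_padding(alignment, max_request=512):
    pads = [(-s) % alignment for s in range(1, max_request + 1)]
    return sum(pads) / len(pads)

for alignment in (8, 16):
    print(alignment, "-> average padding",
          mean_padding(alignment), "bytes per request")
```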


> Increasing pool size is one obvious way to fix these problems.

Yes & no ;-)  In Neil's radix-tree branch your code above works fine
when the pool size is increased.  But my PR doesn't introduce a new
tree; building on what obmalloc already does, it absolutely needs to
"steal" 4 bytes at the start of every 4 KiB page - which turns into 16
bytes on 64-bit boxes to preserve 16-byte alignment.  The number of
usable blocks in a pool is computed by this function in the PR:

/* Return total number of blocks in pool of size index `i`, as a uint. */
static uint
NUMBLOCKS(int i)
{
    assert(0 <= i && i < NB_SMALL_SIZE_CLASSES);
    /* The first page burns space for a pool header, and remaining pages
     * burn ALIGNMENT bytes for the arena index.
     */
    const uint size = INDEX2SIZE(i);
    uint usable1 = SYSTEM_PAGE_SIZE - POOL_OVERHEAD;
    uint usable2 = SYSTEM_PAGE_SIZE - ALIGNMENT;
    return usable1 / size + (usable2 / size) * (PAGES_PER_POOL - 1);
}

For a 16 KiB pool, this allows us to squeeze out, e.g., 6 more 16-byte
objects than we can get from four 4 KiB pools, but doesn't help at all
for, e.g., 512-byte objects.
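
A quick Python mirror of NUMBLOCKS() confirms that arithmetic (the
constants - 48-byte pool header, 16-byte ALIGNMENT, four 4 KiB pages
per 16 KiB pool - are the ones discussed above):

```
PAGE = 4096
POOL_OVERHEAD = 48   # pool header, burned from the first page
ALIGNMENT = 16       # arena index, burned from each later page
PAGES_PER_POOL = 4   # a 16 KiB pool

def numblocks(size):
    usable1 = PAGE - POOL_OVERHEAD
    usable2 = PAGE - ALIGNMENT
    return usable1 // size + (usable2 // size) * (PAGES_PER_POOL - 1)

for size in (16, 512):
    one_big = numblocks(size)                          # one 16 KiB pool
    four_small = 4 * ((PAGE - POOL_OVERHEAD) // size)  # four 4 KiB pools
    print(size, "->", one_big, "vs", four_small,
          "- gain", one_big - four_small)
```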


> I think a 16 KiB pool size and a 2 MiB arena size (the x86 huge page
> size) are a sweet spot for recent web servers (typically, about 32
> threads and 64 GiB of RAM), but there is no evidence for it.
>
> We need a reference application and scenario to benchmark.
> pyperformance is not good for measuring memory usage of complex
> applications.

On the other hand, when it comes to complex applications we're always
flying blind ;-)