[Python-Dev] Re: Optimizing pymalloc (was obmalloc
[Inada Naoki]
>> So I tried to use LIKELY/UNLIKELY macro to teach compiler hot part.
>> But I need to use "static inline" for pymalloc_alloc and
>> pymalloc_free yet [1].

[Neil Schemenauer]
> I think LIKELY/UNLIKELY is not helpful if you compile with LTO/PGO
> enabled.

I like adding those regardless of whether compilers find them helpful: they help _people_ reading the code focus on what's important to speed. While not generally crucial, speed is important in these very low-level, very heavily used functions.

Speaking of which, another possible teensy win: pymalloc's allocation has always started with:

    if (nbytes == 0) {
        return 0;
    }
    if (nbytes > SMALL_REQUEST_THRESHOLD) {
        return 0;
    }
    size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;

But it could be a bit leaner:

    size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;
    if (UNLIKELY(fatsize >= NB_SMALL_SIZE_CLASSES)) {
        return 0;
    }
    size = (uint)fatsize;

The `nbytes == 0` case ends up mapping to a very large size class then, although C may not guarantee that. But it doesn't matter: if it maps to "a real" size class, that's fine. We'll return a unique pointer into a pymalloc pool then, and "unique pointer" is all that's required.

An allocation requesting 0 bytes does happen at times, but it's very rare. It just doesn't merit its own dedicated test-&-branch.

> Good work looking into this. Should be some relatively easy
> performance win.

Ditto!
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/RE44X7IP464I4KDPJPG3LF5NV5P27DHU/
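The leaner size-class test proposed in the message above can be tried in isolation. The following is only a sketch: the constant values, the `UNLIKELY` definition via `__builtin_expect`, and the helper names `size_class_two_tests`, `size_class_one_test` and `check_equivalence` are assumptions made for illustration, not code taken from obmalloc.c. It returns -1 where pymalloc would decline the request.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the real ones are defined in obmalloc.c. */
#define ALIGNMENT_SHIFT 4
#define SMALL_REQUEST_THRESHOLD 512
#define NB_SMALL_SIZE_CLASSES (SMALL_REQUEST_THRESHOLD >> ALIGNMENT_SHIFT)

/* One common way to spell such a macro; an assumption, not CPython's definition. */
#if defined(__GNUC__) || defined(__clang__)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define UNLIKELY(x) (x)
#endif

/* Original shape: two tests before computing the size class. */
static int size_class_two_tests(size_t nbytes)
{
    if (nbytes == 0) {
        return -1;
    }
    if (nbytes > SMALL_REQUEST_THRESHOLD) {
        return -1;
    }
    return (int)((nbytes - 1) >> ALIGNMENT_SHIFT);
}

/* Proposed shape: one test. For nbytes == 0 the unsigned subtraction
 * wraps around, so fatsize is huge and the request is rejected here. */
static int size_class_one_test(size_t nbytes)
{
    size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;
    if (UNLIKELY(fatsize >= NB_SMALL_SIZE_CLASSES)) {
        return -1;
    }
    return (int)fatsize;
}

/* Compare both versions on small requests, zero, and the largest size_t. */
static void check_equivalence(void)
{
    for (size_t n = 0; n <= 2 * SMALL_REQUEST_THRESHOLD; n++) {
        assert(size_class_two_tests(n) == size_class_one_test(n));
    }
    assert(size_class_two_tests((size_t)-1) == size_class_one_test((size_t)-1));
}
```

With these assumed constants the two versions agree on every input, so under these assumptions the zero-byte request is simply rejected by the single test rather than landing in a real size class.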
[Python-Dev] Re: Optimizing pymalloc (was obmalloc
> Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +-
> 4 ms: 1.10x faster (-9%)
...
> I will try to split pymalloc_alloc and pymalloc_free to smaller functions.

I did it and pymalloc is now as fast as mimalloc.

$ ./python bm_spectral_norm.py --compare-to=./python-master
python-master: . 199 ms +- 1 ms
python: . 176 ms +- 1 ms

Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 176 ms +- 1 ms: 1.13x faster (-11%)

I filed a new issue for this: https://bugs.python.org/issue37543

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HKV6TQAHHLLLK4JS5F5JQ26MGWPLOD2M/
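The idea of splitting an allocator into smaller functions, as mentioned in the message above, can be pictured with a toy example. This is a sketch of the general technique only, not the change filed in the linked issue: `toy_pool`, `toy_alloc`, `toy_alloc_slow` and `toy_free` are hypothetical names, and the `noinline` attribute is a GCC/Clang extension used here as an assumption about how one might keep the cold path out of line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy pool: a free list of fixed-size blocks plus a small
 * backing store that the slow path carves new blocks from. */
typedef struct toy_block {
    struct toy_block *next;
} toy_block;

#define TOY_POOL_BLOCKS 8

typedef struct {
    toy_block *freelist;                /* blocks ready to be reused */
    toy_block storage[TOY_POOL_BLOCKS]; /* never-used blocks live here */
    size_t next_unused;                 /* index of first never-used block */
} toy_pool;

/* Cold path: rarely taken, so it is kept out of line to keep callers small. */
#if defined(__GNUC__) || defined(__clang__)
__attribute__((noinline))
#endif
static void *toy_alloc_slow(toy_pool *pool)
{
    if (pool->next_unused >= TOY_POOL_BLOCKS) {
        return NULL;                    /* pool exhausted */
    }
    return &pool->storage[pool->next_unused++];
}

/* Hot path: small enough to inline; pops a block from the free list and
 * defers everything else to the out-of-line slow path. */
static inline void *toy_alloc(toy_pool *pool)
{
    toy_block *block = pool->freelist;
    if (block != NULL) {
        pool->freelist = block->next;
        return block;
    }
    return toy_alloc_slow(pool);
}

/* Freeing pushes the block back onto the free list. */
static inline void toy_free(toy_pool *pool, void *p)
{
    assert(p != NULL);
    toy_block *block = (toy_block *)p;
    block->next = pool->freelist;
    pool->freelist = block;
}
```

Keeping the common case in a short `static inline` function lets the compiler inline it into callers even without profile data, while the rarely executed refill logic stays in one separate function.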
[Python-Dev] Re: Optimizing pymalloc (was obmalloc
On Wed, Jul 10, 2019 at 5:18 PM Neil Schemenauer wrote:
>
> On 2019-07-09, Inada Naoki wrote:
> > PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines
> > pymalloc_free.
> > But compiler doesn't know which is the hot part in pymalloc_alloc and
> > pymalloc_free.
>
> Hello Inada,
>
> I don't see this on my PC. I'm using GCC 8.3.0. I have configured
> the build with --enable-optimizations.

I didn't use PGO, and that's why GCC didn't know which part is hot.

Maybe pymalloc performance is similar to mimalloc when PGO is used, but I have not confirmed it.

While Linux distributions are using PGO, some people use non-PGO Python (Homebrew, pyenv, etc.). So better performance without PGO is worthwhile.

Regards,
--
Inada Naoki

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LKU5FDWGWHHEBUMTNZ5ME23RC73B5JIF/
[Python-Dev] Re: Optimizing pymalloc (was obmalloc
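On 2019-07-09, Inada Naoki wrote:
> PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines
> pymalloc_free.
> But compiler doesn't know which is the hot part in pymalloc_alloc and
> pymalloc_free.

Hello Inada,

I don't see this on my PC. I'm using GCC 8.3.0. I have configured the build with --enable-optimizations. To speed up the profile generation, I have changed PROFILE_TASK to only run these tests:

    test_shelve test_set test_pprint test_pickletools
    test_ordered_dict test_tabnanny test_difflib test_pickle
    test_json test_collections

I haven't spent much time trying to figure out what set of tests is best but the above set runs pretty quickly and seems to work okay.

I have run pyperformance to compare CPython 'master' with your PR 14674. There doesn't seem to be a difference (table below). If I look at the disassembly, it seems that the hot paths of pymalloc_alloc and pymalloc_free are being inlined as you would hope, without needing the LIKELY/UNLIKELY annotations.

OTOH, your addition of LIKELY() and UNLIKELY() in the PR is a pretty small change and probably doesn't hurt anything. So, I think it would be fine to merge it.

Regards,

Neil

+-------------------------+---------+-----------------------------+
| Benchmark               | master  | PR-14674                    |
+=========================+=========+=============================+
| 2to3                    | 305 ms  | 304 ms: 1.00x faster (-0%)  |
+-------------------------+---------+-----------------------------+
| chaos                   | 109 ms  | 110 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| crypto_pyaes            | 118 ms  | 117 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| django_template         | 112 ms  | 114 ms: 1.02x slower (+2%)  |
+-------------------------+---------+-----------------------------+
| fannkuch                | 446 ms  | 440 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| float                   | 119 ms  | 120 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| go                      | 247 ms  | 250 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| json_loads              | 25.1 us | 24.4 us: 1.03x faster (-3%) |
+-------------------------+---------+-----------------------------+
| logging_simple          | 8.86 us | 8.66 us: 1.02x faster (-2%) |
+-------------------------+---------+-----------------------------+
| meteor_contest          | 97.5 ms | 97.7 ms: 1.00x slower (+0%) |
+-------------------------+---------+-----------------------------+
| nbody                   | 140 ms  | 142 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| pathlib                 | 19.2 ms | 18.9 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| pickle                  | 8.95 us | 9.08 us: 1.02x slower (+2%) |
+-------------------------+---------+-----------------------------+
| pickle_dict             | 18.1 us | 18.0 us: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| pickle_list             | 2.75 us | 2.68 us: 1.03x faster (-3%) |
+-------------------------+---------+-----------------------------+
| pidigits                | 182 ms  | 184 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| python_startup          | 7.83 ms | 7.81 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| python_startup_no_site  | 5.36 ms | 5.36 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| raytrace                | 495 ms  | 499 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| regex_dna               | 173 ms  | 170 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| regex_effbot            | 2.79 ms | 2.67 ms: 1.05x faster (-4%) |
+-------------------------+---------+-----------------------------+
| regex_v8                | 21.1 ms | 21.2 ms: 1.00x slower (+0%) |
+-------------------------+---------+-----------------------------+
| richards                | 68.2 ms | 68.7 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| scimark_monte_carlo     | 103 ms  | 102 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| scimark_sparse_mat_mult | 4.37 ms | 4.35 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| spectral_norm           | 132 ms  | 133 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| sqlalchemy_imperative   | 30.3 ms | 30.7 ms: 1.01x slower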
[Python-Dev] Re: Keyword arguments with non-string names
I realized something that makes this even more tricky: dicts are mutable. So even if the dict contains only string keys at call time, it could theoretically be changed by the time that keywords are parsed. So for calling conventions passing dicts, I would leave it to the callee to sanity check the dict (this is the status quo).

For the vectorcall/FASTCALL calling convention, the situation is a lot better: the call arguments are immutable and there are not many places where vectorcall calls are made with keywords. So we could check it on the caller side. I'll try to implement that.

Jeroen.

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/PQYD4GARMKSURX7GYRSNCHJSLIWK22XD/