[Python-Dev] Re: Optimizing pymalloc (was obmalloc

2019-07-10 Thread Tim Peters
[Inada Naoki]
>> So I tried to use LIKELY/UNLIKELY macro to teach compiler hot part.
>> But I need to use
>> "static inline" for pymalloc_alloc and pymalloc_free yet [1].

[Neil Schemenauer]
> I think LIKELY/UNLIKELY is not helpful if you compile with LTO/PGO
> enabled.

I like adding those regardless of whether compilers find them helpful:
they help _people_ reading the code focus on what's important to
speed.  While not generally crucial, speed is important in these very
low-level, very heavily used functions.
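
For readers who haven't met such macros, here is a minimal sketch of how
they are commonly spelled.  The definitions are an assumption (the thread
does not show how the PR defines them), and sum_nonnegative() is a
hypothetical helper that exists only to show the annotation in use.

```c
#include <stddef.h>

/* Assumed definitions, not taken from the PR.  __builtin_expect is a
   GCC/Clang builtin; other compilers fall back to the plain condition. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Hypothetical helper: sums the entries, or returns -1 as soon as a
   negative entry is seen.  The annotation marks the error test as rare. */
static long sum_nonnegative(const int *values, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(values[i] < 0)) {
            return -1;  /* rare error path */
        }
        total += values[i];
    }
    return total;
}
```

The macros never change what the code computes, only how the compiler may
lay it out, which is why they also work as documentation.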

Speaking of which, another possible teensy win:  pymalloc's allocation
has always started with:

    if (nbytes == 0) {
        return 0;
    }
    if (nbytes > SMALL_REQUEST_THRESHOLD) {
        return 0;
    }
    size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;

But it could be a bit leaner:

    size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;
    if (UNLIKELY(fatsize >= NB_SMALL_SIZE_CLASSES)) {
        return 0;
    }
    size = (uint)fatsize;

The `nbytes == 0` case ends up mapping to a very large size class
then, although C may not guarantee that.  But it doesn't matter:  if
it maps to "a real" size class, that's fine.  We'll return a unique
pointer into a pymalloc pool then, and "unique pointer" is all that's
required.

An allocation requesting 0 bytes does happen at times, but it's very
rare.  It just doesn't merit its own dedicated test-&-branch.
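
A standalone sketch comparing the two forms.  The constant values are
assumptions chosen for illustration (the real ones live in obmalloc.c and
depend on the build), the function names are hypothetical, and -1 stands
in here for "not a pymalloc request".

```c
#include <stddef.h>

/* Assumed values, for illustration only. */
#define ALIGNMENT_SHIFT          4
#define SMALL_REQUEST_THRESHOLD  512
#define NB_SMALL_SIZE_CLASSES    (SMALL_REQUEST_THRESHOLD >> ALIGNMENT_SHIFT)

/* Two tests, as in the original code.  Returns the size class index,
   or -1 when the request is not one for the small-object allocator. */
static int size_class_two_tests(size_t nbytes)
{
    if (nbytes == 0) {
        return -1;
    }
    if (nbytes > SMALL_REQUEST_THRESHOLD) {
        return -1;
    }
    return (int)((nbytes - 1) >> ALIGNMENT_SHIFT);
}

/* One test.  size_t is unsigned, so with these definitions 0 - 1 wraps
   to SIZE_MAX, and the shifted value lands far above the last class. */
static int size_class_one_test(size_t nbytes)
{
    size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;
    if (fatsize >= NB_SMALL_SIZE_CLASSES) {
        return -1;
    }
    return (int)fatsize;
}
```

With these assumed constants both functions agree for every request size,
including 0, so the single comparison covers both of the original tests.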

> Good work looking into this.  Should be some relatively easy
> performance win.

Ditto!
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/RE44X7IP464I4KDPJPG3LF5NV5P27DHU/


[Python-Dev] Re: Optimizing pymalloc (was obmalloc

2019-07-10 Thread Inada Naoki
> Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +-
> 4 ms: 1.10x faster (-9%)
...
> I will try to split pymalloc_alloc and pymalloc_free to smaller functions.

I did it and pymalloc is now as fast as mimalloc.

$ ./python bm_spectral_norm.py --compare-to=./python-master
python-master: . 199 ms +- 1 ms
python: . 176 ms +- 1 ms

Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 176 ms +-
1 ms: 1.13x faster (-11%)

I filed a new issue for this: https://bugs.python.org/issue37543
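
To illustrate the general idea of splitting a function so the hot part
stays small enough to inline, here is a toy sketch.  It is not the actual
patch: toy_pool, toy_alloc, and toy_refill are hypothetical names, and the
"refill" simply recycles the buffer, which a real allocator would never do.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define NOINLINE    __attribute__((noinline))
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define NOINLINE
#  define UNLIKELY(x) (x)
#endif

/* Hypothetical bump allocator standing in for a pool. */
typedef struct {
    unsigned char buf[256];
    size_t used;     /* bytes handed out from buf so far */
    size_t refills;  /* how many times the cold path ran */
} toy_pool;

/* Cold path: rare, so it is kept out of line. */
static NOINLINE void toy_refill(toy_pool *p)
{
    p->used = 0;
    p->refills++;
}

/* Hot path: a couple of compares and an add, cheap to inline. */
static inline void *toy_alloc(toy_pool *p, size_t n)
{
    if (UNLIKELY(n > sizeof(p->buf))) {
        return NULL;  /* too big for this toy pool */
    }
    if (UNLIKELY(p->used + n > sizeof(p->buf))) {
        toy_refill(p);
    }
    void *result = p->buf + p->used;
    p->used += n;
    return result;
}
```

Callers that inline toy_alloc() only pay for the short common case; the
rare work sits behind an ordinary function call.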
___
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/HKV6TQAHHLLLK4JS5F5JQ26MGWPLOD2M/


[Python-Dev] Re: Optimizing pymalloc (was obmalloc

2019-07-10 Thread Inada Naoki
On Wed, Jul 10, 2019 at 5:18 PM Neil Schemenauer wrote:
>
> On 2019-07-09, Inada Naoki wrote:
> > PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines 
> > pymalloc_free.
> > But compiler doesn't know which is the hot part in pymalloc_alloc and
> > pymalloc_free.
>
> Hello Inada,
>
> I don't see this on my PC.  I'm using GCC 8.3.0.  I have configured
> the build with --enable-optimizations.

I didn't use PGO, which is why GCC didn't know which part is hot.
pymalloc performance may be similar to mimalloc when PGO is used,
but I have not confirmed it.

While Linux distributions use PGO, some people use non-PGO Python
(Homebrew, pyenv, etc.).  So better performance without PGO is worthwhile.

Regards,
-- 
Inada Naoki  
___
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/LKU5FDWGWHHEBUMTNZ5ME23RC73B5JIF/


[Python-Dev] Re: Optimizing pymalloc (was obmalloc

2019-07-10 Thread Neil Schemenauer
On 2019-07-09, Inada Naoki wrote:
> PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines 
> pymalloc_free.
> But compiler doesn't know which is the hot part in pymalloc_alloc and
> pymalloc_free.

Hello Inada,

I don't see this on my PC.  I'm using GCC 8.3.0.  I have configured
the build with --enable-optimizations.  To speed up the profile
generation, I have changed PROFILE_TASK to only run these tests:

test_shelve test_set test_pprint test_pickletools
test_ordered_dict test_tabnanny test_difflib test_pickle
test_json test_collections

I haven't spent much time trying to figure out what set of tests is
best but the above set runs pretty quickly and seems to work okay.

I have run pyperformance to compare CPython 'master' with your PR
14674.  There doesn't seem to be a difference (table below).  If I
look at the disassembly, it seems that the hot paths of
pymalloc_alloc and pymalloc_free are being inlined as you would
hope, without needing the LIKELY/UNLIKELY annotations.

OTOH, your addition of LIKELY() and UNLIKELY() in the PR is a pretty
small change and probably doesn't hurt anything.  So, I think it
would be fine to merge it.

Regards,

  Neil


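+-------------------------+---------+-----------------------------+
| Benchmark               | master  | PR-14674                    |
+=========================+=========+=============================+
| 2to3                    | 305 ms  | 304 ms: 1.00x faster (-0%)  |
+-------------------------+---------+-----------------------------+
| chaos                   | 109 ms  | 110 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| crypto_pyaes            | 118 ms  | 117 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| django_template         | 112 ms  | 114 ms: 1.02x slower (+2%)  |
+-------------------------+---------+-----------------------------+
| fannkuch                | 446 ms  | 440 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| float                   | 119 ms  | 120 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| go                      | 247 ms  | 250 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| json_loads              | 25.1 us | 24.4 us: 1.03x faster (-3%) |
+-------------------------+---------+-----------------------------+
| logging_simple          | 8.86 us | 8.66 us: 1.02x faster (-2%) |
+-------------------------+---------+-----------------------------+
| meteor_contest          | 97.5 ms | 97.7 ms: 1.00x slower (+0%) |
+-------------------------+---------+-----------------------------+
| nbody                   | 140 ms  | 142 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| pathlib                 | 19.2 ms | 18.9 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| pickle                  | 8.95 us | 9.08 us: 1.02x slower (+2%) |
+-------------------------+---------+-----------------------------+
| pickle_dict             | 18.1 us | 18.0 us: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| pickle_list             | 2.75 us | 2.68 us: 1.03x faster (-3%) |
+-------------------------+---------+-----------------------------+
| pidigits                | 182 ms  | 184 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| python_startup          | 7.83 ms | 7.81 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| python_startup_no_site  | 5.36 ms | 5.36 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| raytrace                | 495 ms  | 499 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| regex_dna               | 173 ms  | 170 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| regex_effbot            | 2.79 ms | 2.67 ms: 1.05x faster (-4%) |
+-------------------------+---------+-----------------------------+
| regex_v8                | 21.1 ms | 21.2 ms: 1.00x slower (+0%) |
+-------------------------+---------+-----------------------------+
| richards                | 68.2 ms | 68.7 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| scimark_monte_carlo     | 103 ms  | 102 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| scimark_sparse_mat_mult | 4.37 ms | 4.35 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| spectral_norm           | 132 ms  | 133 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| sqlalchemy_imperative   | 30.3 ms | 30.7 ms: 1.01x slower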

[Python-Dev] Re: Keyword arguments with non-string names

2019-07-10 Thread Jeroen Demeyer
I realized something that makes this even more tricky: dicts are 
mutable. So even if the dict contains only string keys at call time, it 
could theoretically be changed by the time that keywords are parsed. So 
for calling conventions passing dicts, I would leave it to the callee to 
sanity check the dict (this is the status quo).


For the vectorcall/FASTCALL calling convention, the situation is a lot 
better: the call arguments are immutable and there are not many places 
where vectorcall calls are made with keywords. So we could check it on 
the caller side. I'll try to implement that.


Jeroen.
___
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/PQYD4GARMKSURX7GYRSNCHJSLIWK22XD/