[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2017-05-17 Thread Jesús Cea Avión

Changes by Jesús Cea Avión :


--
nosy: +jcea

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2017-02-01 Thread STINNER Victor

STINNER Victor added the comment:

Victor: "FYI I wrote an article about this issue: 
https://haypo.github.io/analysis-python-performance-issue.html Sadly, it seems 
like I was just lucky when adding __attribute__((hot)) fixed the issue, because 
call_method is slow again!"

I upgraded the speed-python server (which runs the benchmarks) to Ubuntu 16.04 LTS
to support PGO compilation. I removed all old benchmark results and re-ran the
benchmarks with LTO+PGO. The benchmark results seem much better now.

I'm no longer sure that _Py_HOT_FUNCTION is really useful for getting stable
benchmarks, but it may help code placement a little bit. I don't think that it
hurts, so I suggest keeping it. Since benchmarks were still unstable with
_Py_HOT_FUNCTION, I'm not interested in continuing to tag more functions with
it. I will now focus on LTO+PGO for stable benchmarks, and ignore small
performance differences when PGO is not used.

I close this issue now.

--
resolution:  -> fixed
status: open -> closed

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-22 Thread STINNER Victor

STINNER Victor added the comment:

> But I failed to reproduce it.

Hey, performance issues with code placement are a mysterious secret :-)
Nobody understands them :-D

The server running the benchmarks has an Intel Xeon CPU from 2011. It seems
like code placement issues matter more on this CPU than on my more
recent laptop or desktop PC.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-22 Thread INADA Naoki

INADA Naoki added the comment:

I set up Ubuntu 14.04 on Azure and built Python with neither PGO nor LTO.
But I failed to reproduce it.

@haypo, would you give me two binaries?

$ ~/local/py-2a143/bin/python3 -c 'import sys; print(sys.version)'
3.7.0a0 (default:2a14385710dc, Nov 22 2016, 12:02:34) 
[GCC 4.8.4]

$ ~/local/py-acde8/bin/python3 -c 'import sys; print(sys.version)'
3.7.0a0 (default:acde821520fc, Nov 22 2016, 11:31:16) 
[GCC 4.8.4]

$ ~/local/py-2a143/bin/python3 bm_call_method.py 
.
call_method: Median +- std dev: 16.1 ms +- 0.6 ms

$ ~/local/py-acde8/bin/python3 bm_call_method.py
.
call_method: Median +- std dev: 16.1 ms +- 0.7 ms

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-22 Thread STINNER Victor

STINNER Victor added the comment:

Naoki: "Wow. It's sad that tagged version is accidentally slow..."

If you use PGO compilation, for example with "./configure
--enable-optimizations" (as configure itself suggests when the option is not
enabled), you don't get the issue.

I hope that most Linux distributions use PGO compilation. I'm quite
sure that it's the case for Ubuntu. I don't know about Fedora.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-22 Thread STINNER Victor

STINNER Victor added the comment:

2016-11-22 12:07 GMT+01:00 INADA Naoki :
> I want to reproduce it and check `perf record -e L1-icache-load-misses`.
> But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counters.

You don't need to go that far to check performance: just run
call_method and check the timings. You need to compare multiple
revisions.

The speed.python.org Timeline helps to track performance, to get an idea
of the "average performance" and to detect spikes.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-22 Thread INADA Naoki

INADA Naoki added the comment:

Wow. It's sad that the tagged version is accidentally slow...

I want to reproduce it and check `perf record -e L1-icache-load-misses`.
But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counters.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-22 Thread STINNER Victor

STINNER Victor added the comment:

FYI I wrote an article about this issue:
https://haypo.github.io/analysis-python-performance-issue.html

Sadly, it seems like I was just lucky when adding __attribute__((hot)) fixed 
the issue, because call_method is slow again!

* acde821520fc (Nov 21): 16.3 ms
* 2a14385710dc (Nov 22): 24.6 ms (+51%)

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-15 Thread STINNER Victor

STINNER Victor added the comment:

Serhiy Storchaka:
>> * json: scanstring_unicode()
>
> This doesn't look wise. It is specific to a single extension module and
> perhaps to a single particular benchmark. Most Python code doesn't use json
> at all.

Well, I tried different things to make these benchmarks more stable. I didn't 
say that we should merge hot3.patch as it is :-) It's just an attempt.


> What is the top of "perf report"?

For json_loads, it's:

 14.99%  _json.cpython-37m-x86_64-linux-gnu.so  scanstring_unicode
  8.34%  python _PyUnicode_FromUCS1
  8.32%  _json.cpython-37m-x86_64-linux-gnu.so  scan_once_unicode
  8.01%  python lookdict_unicode_nodummy
  6.72%  python siphash24
  4.45%  python PyDict_SetItem
  4.26%  python _PyObject_Malloc
  3.38%  python _PyEval_EvalFrameDefault
  3.16%  python _Py_HashBytes
  2.72%  python PyUnicode_New
  2.36%  python PyLong_FromString
  2.25%  python _PyObject_Free
  2.02%  libc-2.19.so   __memcpy_sse2_unaligned
  1.61%  python PyDict_GetItem
  1.40%  python dictresize
  1.24%  python unicode_hash
  1.11%  libc-2.19.so   _int_malloc
  1.07%  python unicode_dealloc
  1.00%  python free_keys_object

Result produced with:

   $ perf record ./python ~/performance/performance/benchmarks/bm_json_loads.py --worker -v -l 128 -w0 -n 100
   $ perf report


> How does this list intersect with the list of functions in the .text.hot
> section of a PGO build?

I checked which functions are considered "hot" by a PGO build: I found more
than 2,000 functions... I'm not interested in tagging so many functions with
_Py_HOT_FUNCTION. I would prefer to only tag something like the top 10 or top
25 functions.

I don't know the recommendations for tagging functions as hot. I guess that what
matters is the total size of the hot functions. Should it be smaller than the L2
cache? Smaller than the L3 cache? I'm talking about instructions, but data
also shares these caches...


> Make several PGO builds (perhaps on different computers). Is the .text.hot
> section stable?

In my experience PGO builds don't provide stable performance, but I was never
able to write an article about that because of so many bugs :-)

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-15 Thread STINNER Victor

STINNER Victor added the comment:

> New changeset cfc956f13ce2 by Victor Stinner in branch 'default':
> Issue #28618: Mark dict lookup functions as hot
> https://hg.python.org/cpython/rev/cfc956f13ce2

Here are benchmark results on the speed-python server:

haypo@speed-python$ PYTHONPATH=~/perf python -m perf compare_to 
2016-11-15_09-12-default-ac93d188ebd6.json 
2016-11-15_15-13-default-cfc956f13ce2.json -G --min-speed=1
Slower (6):
- json_loads: 62.8 us +- 1.1 us -> 65.8 us +- 2.6 us: 1.05x slower
- nbody: 243 ms +- 2 ms -> 253 ms +- 6 ms: 1.04x slower
- mako: 42.7 ms +- 0.2 ms -> 43.5 ms +- 0.3 ms: 1.02x slower
- chameleon: 29.2 ms +- 0.3 ms -> 29.7 ms +- 0.2 ms: 1.02x slower
- spectral_norm: 261 ms +- 2 ms -> 266 ms +- 3 ms: 1.02x slower
- pickle: 26.6 us +- 0.4 us -> 27.0 us +- 0.4 us: 1.01x slower

Faster (26):
- xml_etree_generate: 290 ms +- 4 ms -> 275 ms +- 3 ms: 1.06x faster
- float: 306 ms +- 5 ms -> 292 ms +- 7 ms: 1.05x faster
- logging_simple: 37.7 us +- 0.4 us -> 36.1 us +- 0.4 us: 1.04x faster
- hexiom: 25.6 ms +- 0.1 ms -> 24.5 ms +- 0.1 ms: 1.04x faster
- regex_effbot: 6.11 ms +- 0.31 ms -> 5.88 ms +- 0.43 ms: 1.04x faster
- sympy_expand: 1.19 sec +- 0.02 sec -> 1.15 sec +- 0.01 sec: 1.04x faster
- telco: 21.5 ms +- 0.4 ms -> 20.8 ms +- 0.4 ms: 1.03x faster
- raytrace: 1.41 sec +- 0.02 sec -> 1.37 sec +- 0.02 sec: 1.03x faster
- scimark_sor: 512 ms +- 11 ms -> 500 ms +- 12 ms: 1.03x faster
- logging_format: 44.6 us +- 0.5 us -> 43.6 us +- 0.7 us: 1.02x faster
- sympy_str: 532 ms +- 4 ms -> 520 ms +- 4 ms: 1.02x faster
- fannkuch: 1.11 sec +- 0.01 sec -> 1.08 sec +- 0.02 sec: 1.02x faster
- django_template: 475 ms +- 5 ms -> 467 ms +- 6 ms: 1.02x faster
- chaos: 308 ms +- 2 ms -> 303 ms +- 3 ms: 1.02x faster
- xml_etree_process: 244 ms +- 4 ms -> 240 ms +- 4 ms: 1.02x faster
- xml_etree_iterparse: 225 ms +- 5 ms -> 221 ms +- 4 ms: 1.02x faster
- pathlib: 51.1 ms +- 0.5 ms -> 50.3 ms +- 0.5 ms: 1.02x faster
- sqlite_synth: 10.5 us +- 0.2 us -> 10.3 us +- 0.2 us: 1.01x faster
- dulwich_log: 186 ms +- 1 ms -> 184 ms +- 1 ms: 1.01x faster
- sqlalchemy_imperative: 72.5 ms +- 1.6 ms -> 71.5 ms +- 1.6 ms: 1.01x faster
- deltablue: 18.5 ms +- 0.3 ms -> 18.3 ms +- 0.2 ms: 1.01x faster
- tornado_http: 438 ms +- 5 ms -> 433 ms +- 5 ms: 1.01x faster
- json_dumps: 30.4 ms +- 0.4 ms -> 30.1 ms +- 0.4 ms: 1.01x faster
- genshi_xml: 212 ms +- 3 ms -> 210 ms +- 3 ms: 1.01x faster
- scimark_monte_carlo: 273 ms +- 5 ms -> 271 ms +- 5 ms: 1.01x faster
- call_simple: 13.3 ms +- 0.3 ms -> 13.2 ms +- 0.4 ms: 1.01x faster

Benchmark hidden because not significant (32): 2to3, call_method, 
call_method_slots, call_method_unknown, crypto_pyaes, genshi_text, go, 
html5lib, logging_silent, meteor_contest, nqueens, pickle_dict, pickle_list, 
pickle_pure_python, pidigits, python_startup, python_startup_no_site, 
regex_compile, regex_dna, regex_v8, richards, scimark_fft, scimark_lu, 
scimark_sparse_mat_mult, sqlalchemy_declarative, sympy_integrate, sympy_sum, 
unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-15 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> * json: scanstring_unicode()

This doesn't look wise. It is specific to a single extension module and perhaps
to a single particular benchmark. Most Python code doesn't use json at all.

What is the top of "perf report"? How does this list intersect with the list of
functions in the .text.hot section of a PGO build? Make several PGO builds
(perhaps on different computers). Is the .text.hot section stable?

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-15 Thread STINNER Victor

STINNER Victor added the comment:

I wrote hot3.patch when trying to make the following benchmarks more reliable:

- logging_silent: rev 8ebaa546a033 is 20% slower than the 2016 average
- json_loads: rev 0bd618fe0639 is 30% slower and rev 8ebaa546a033 is
15% slower than the 2016 average
- regex_effbot: rev 573bc1f9900e (Nov 7) takes 6.0 ms, rev
cf7711887b4a (Nov 7) takes 5.2 ms, rev 8ebaa546a033 (Nov 10) takes 6.1
ms, etc.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-15 Thread STINNER Victor

STINNER Victor added the comment:

hot3.patch: Mark additional functions as hot

* PyNumber_AsSsize_t()
* _PyUnicode_FromUCS1()
* json: scanstring_unicode()
* siphash24()
* sre_ucs1_match, sre_ucs2_match, sre_ucs4_match

I'm not sure about this patch. It's hard to get reliable benchmark results on
microbenchmarks :-/ It's hard to tell whether a speedup comes from the hot
attribute, or whether the compiler decided on its own to change the code
placement. Without the hot attribute, the code placement seems random.

--
Added file: http://bugs.python.org/file45488/hot3.patch

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-15 Thread STINNER Victor

STINNER Victor added the comment:

> How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?

Ok, your benchmark results don't look bad, so I marked the following
functions as hot:

- lookdict
- lookdict_unicode
- lookdict_unicode_nodummy
- lookdict_split

It's common to see these functions in the top 3 of "perf report".
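
For illustration, a minimal sketch of how the attribute is attached to such a
static lookup helper (shown as a forward declaration); the exact prototype in
Objects/dictobject.c differs between CPython versions, so the parameter list
below is an assumption:

```
/* Sketch only: attaching the hot attribute to a static dict lookup helper.
   The real prototype in Objects/dictobject.c may differ. */
static Py_ssize_t _Py_HOT_FUNCTION
lookdict_unicode_nodummy(PyDictObject *mp, PyObject *key,
                         Py_hash_t hash, PyObject **value_addr);
```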

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-15 Thread Roundup Robot

Roundup Robot added the comment:

New changeset cfc956f13ce2 by Victor Stinner in branch 'default':
Issue #28618: Mark dict lookup functions as hot
https://hg.python.org/cpython/rev/cfc956f13ce2

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-15 Thread INADA Naoki

INADA Naoki added the comment:

> so I suggest running benchmarks and checking that it has a non-negligible
> effect on benchmarks ;-)

When I added _Py_HOT_FUNCTION to lookdict_unicode, lookdict_unicode_nodummy and
lookdict_split
(I can't measure L1 misses via `perf stat -d` because I use EC2 for benchmarking):

$ ~/local/python-master/bin/python3 -m perf compare_to -G all-master.json 
all-patched.json
Slower (28):
- pybench.CompareFloats: 106 ns +- 1 ns -> 112 ns +- 1 ns: 1.07x slower
- pybench.BuiltinFunctionCalls: 1.62 us +- 0.00 us -> 1.68 us +- 0.03 us: 1.04x 
slower
- pybench.CompareFloatsIntegers: 180 ns +- 3 ns -> 185 ns +- 3 ns: 1.03x slower
- sympy_sum: 163 ms +- 7 ms -> 167 ms +- 7 ms: 1.03x slower
- deltablue: 13.7 ms +- 0.4 ms -> 14.1 ms +- 0.5 ms: 1.02x slower
- pickle_list: 5.77 us +- 0.09 us -> 5.90 us +- 0.07 us: 1.02x slower
- pybench.PythonFunctionCalls: 1.20 us +- 0.02 us -> 1.22 us +- 0.02 us: 1.02x 
slower
- pybench.SpecialClassAttribute: 1.46 us +- 0.02 us -> 1.49 us +- 0.03 us: 
1.02x slower
- pybench.TryRaiseExcept: 207 ns +- 4 ns -> 210 ns +- 0 ns: 1.02x slower
- pickle_pure_python: 868 us +- 18 us -> 882 us +- 16 us: 1.02x slower
- genshi_text: 56.0 ms +- 0.7 ms -> 56.8 ms +- 0.6 ms: 1.01x slower
- json_dumps: 19.5 ms +- 0.3 ms -> 19.8 ms +- 0.2 ms: 1.01x slower
- richards: 137 ms +- 3 ms -> 139 ms +- 2 ms: 1.01x slower
- sqlalchemy_declarative: 272 ms +- 4 ms -> 276 ms +- 3 ms: 1.01x slower
- pickle_dict: 43.5 us +- 0.4 us -> 44.1 us +- 0.2 us: 1.01x slower
- go: 436 ms +- 4 ms -> 441 ms +- 4 ms: 1.01x slower
- pybench.SecondImport: 2.52 us +- 0.04 us -> 2.55 us +- 0.07 us: 1.01x slower
- pybench.NormalClassAttribute: 1.46 us +- 0.02 us -> 1.47 us +- 0.02 us: 1.01x 
slower
- genshi_xml: 118 ms +- 2 ms -> 118 ms +- 3 ms: 1.01x slower
- pybench.UnicodePredicates: 75.8 ns +- 0.6 ns -> 76.2 ns +- 0.9 ns: 1.01x 
slower
- pybench.ListSlicing: 415 us +- 4 us -> 417 us +- 4 us: 1.01x slower
- scimark_fft: 494 ms +- 2 ms -> 496 ms +- 12 ms: 1.01x slower
- logging_format: 23.7 us +- 0.3 us -> 23.9 us +- 0.4 us: 1.00x slower
- chaos: 200 ms +- 1 ms -> 201 ms +- 1 ms: 1.00x slower
- pybench.StringPredicates: 509 ns +- 3 ns -> 511 ns +- 4 ns: 1.00x slower
- call_method: 13.6 ms +- 0.1 ms -> 13.7 ms +- 0.2 ms: 1.00x slower
- pybench.StringSlicing: 530 ns +- 3 ns -> 532 ns +- 8 ns: 1.00x slower
- pybench.SimpleLongArithmetic: 535 ns +- 2 ns -> 536 ns +- 4 ns: 1.00x slower

Faster (47):
- html5lib: 169 ms +- 7 ms -> 158 ms +- 6 ms: 1.07x faster
- pybench.ConcatUnicode: 57.3 ns +- 3.0 ns -> 55.8 ns +- 1.3 ns: 1.03x faster
- pybench.IfThenElse: 60.5 ns +- 1.0 ns -> 59.0 ns +- 0.7 ns: 1.02x faster
- logging_silent: 606 ns +- 14 ns -> 593 ns +- 13 ns: 1.02x faster
- scimark_lu: 411 ms +- 5 ms -> 404 ms +- 4 ms: 1.02x faster
- pathlib: 29.1 ms +- 0.3 ms -> 28.7 ms +- 0.5 ms: 1.02x faster
- pybench.CreateStringsWithConcat: 2.87 us +- 0.01 us -> 2.82 us +- 0.00 us: 
1.02x faster
- pybench.DictCreation: 621 ns +- 10 ns -> 612 ns +- 8 ns: 1.01x faster
- meteor_contest: 167 ms +- 5 ms -> 164 ms +- 5 ms: 1.01x faster
- unpickle_pure_python: 656 us +- 19 us -> 647 us +- 9 us: 1.01x faster
- pybench.NestedForLoops: 20.2 ns +- 0.1 ns -> 20.0 ns +- 0.1 ns: 1.01x faster
- regex_effbot: 4.01 ms +- 0.07 ms -> 3.95 ms +- 0.06 ms: 1.01x faster
- pybench.CreateUnicodeWithConcat: 57.1 ns +- 0.2 ns -> 56.4 ns +- 0.2 ns: 
1.01x faster
- chameleon: 18.3 ms +- 0.2 ms -> 18.0 ms +- 0.3 ms: 1.01x faster
- python_startup: 13.7 ms +- 0.1 ms -> 13.5 ms +- 0.1 ms: 1.01x faster
- pybench.SmallTuples: 967 ns +- 6 ns -> 955 ns +- 8 ns: 1.01x faster
- pybench.TryFinally: 200 ns +- 3 ns -> 198 ns +- 2 ns: 1.01x faster
- pybench.SimpleIntegerArithmetic: 425 ns +- 3 ns -> 420 ns +- 4 ns: 1.01x 
faster
- pybench.Recursion: 1.34 us +- 0.02 us -> 1.33 us +- 0.03 us: 1.01x faster
- pybench.SimpleIntFloatArithmetic: 424 ns +- 1 ns -> 420 ns +- 1 ns: 1.01x 
faster
- float: 222 ms +- 2 ms -> 220 ms +- 3 ms: 1.01x faster
- 2to3: 531 ms +- 4 ms -> 527 ms +- 5 ms: 1.01x faster
- python_startup_no_site: 8.30 ms +- 0.04 ms -> 8.23 ms +- 0.05 ms: 1.01x faster
- xml_etree_parse: 196 ms +- 5 ms -> 194 ms +- 2 ms: 1.01x faster
- pybench.ComplexPythonFunctionCalls: 794 ns +- 7 ns -> 788 ns +- 7 ns: 1.01x 
faster
- logging_simple: 20.4 us +- 0.2 us -> 20.3 us +- 0.4 us: 1.01x faster
- fannkuch: 795 ms +- 9 ms -> 790 ms +- 3 ms: 1.01x faster
- hexiom: 18.7 ms +- 0.1 ms -> 18.6 ms +- 0.1 ms: 1.01x faster
- regex_compile: 322 ms +- 9 ms -> 320 ms +- 8 ms: 1.01x faster
- mako: 36.0 ms +- 0.1 ms -> 35.8 ms +- 0.2 ms: 1.01x faster
- pybench.UnicodeProperties: 91.7 ns +- 0.9 ns -> 91.1 ns +- 0.8 ns: 1.01x 
faster
- pybench.SimpleComplexArithmetic: 577 ns +- 8 ns -> 573 ns +- 3 ns: 1.01x 
faster
- xml_etree_process: 147 ms +- 2 ms -> 146 ms +- 2 ms: 1.01x faster
- pybench.CompareUnicode: 22.4 ns +- 0.1 ns -> 22.2 ns +- 0.1 ns: 1.01x faster
- crypto_pyaes: 175 ms +- 1 ms -> 174 ms +- 1 ms: 1.01x faster
- unpickle_list: 5.43 us +- 0.04 us -> 5.41 us +- 0.02 us: 

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-15 Thread INADA Naoki

INADA Naoki added the comment:

> I don't understand the effect of the hot attribute well

I compared the lookdict_unicode_nodummy assembly with `objdump -d dictobject.o`.
It looks exactly the same.

So I think the only difference is placement: hot functions are put in the
.text.hot section, and the linker groups the hot functions together. This
reduces the chance of cache conflicts.

When compiling Python with PGO, we can see which functions are hot with objdump.

```
~/work/cpython/Objects$ objdump -tj .text.hot dictobject.o

dictobject.o: file format elf64-x86-64

SYMBOL TABLE:
 ld  .text.hot   .text.hot
07a0 l F .text.hot  0574 
lookdict_unicode_nodummy
46d0 l F .text.hot  00e8 free_keys_object
01c0 l F .text.hot  0161 new_keys_object
03b0 l F .text.hot  03e8 insertdict
1180 l F .text.hot  081f dictresize
19a0 l F .text.hot  0165 find_empty_slot.isra.0
2180 l F .text.hot  05f1 lookdict
1b10 l F .text.hot  00c2 unicode_eq
2780 l F .text.hot  0184 dict_traverse
4c20 l F .text.hot  05f7 lookdict_unicode
6b20 l F .text.hot  0330 lookdict_split
...
```

The cold parts of hot functions are placed in the .text.unlikely section.

```
$ objdump -t  dictobject.o  | grep lookdict
07a0 l F .text.hot  0574 
lookdict_unicode_nodummy
2180 l F .text.hot  05f1 lookdict
013e l   .text.unlikely  
lookdict_unicode_nodummy.cold.6
0a38 l   .text.unlikely  lookdict.cold.15
4c20 l F .text.hot  05f7 lookdict_unicode
6b20 l F .text.hot  0330 lookdict_split
1339 l   .text.unlikely  
lookdict_unicode.cold.28
1d01 l   .text.unlikely  lookdict_split.cold.42
```

All lookdict* functions are put in the hot section, and all of their cold parts
are 0 bytes. It seems PGO puts all lookdict* functions entirely in the hot section.

compiler info:
```
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 
5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs 
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr 
--program-suffix=-5 --enable-shared --enable-linker-build-id 
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix 
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-libstdcxx-time=yes 
--with-default-libstdcxx-abi=new --enable-gnu-unique-object 
--disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib 
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo 
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home 
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar 
--enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686
--with-abi=m64 --with-multilib-list=m32,m64,mx32
--enable-multilib --with-tune=generic --enable-checking=release 
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
```

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-14 Thread STINNER Victor

STINNER Victor added the comment:

INADA Naoki added the comment:
> How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?

I don't understand the effect of the hot attribute well, so I suggest
running benchmarks and checking that it has a non-negligible effect on
benchmarks ;-)

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-14 Thread INADA Naoki

INADA Naoki added the comment:

How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?

--
nosy: +inada.naoki

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-12 Thread STINNER Victor

STINNER Victor added the comment:

> Can we commit this to 3.6 too?

I worked on patches to try to optimize json_loads and regex_effbot as well, but
it's still unclear to me how the hot attribute works, and I'm not 100% sure
that using the attribute explicitly does not introduce a performance regression.

So I prefer to experiment with such changes in the default branch for now.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-12 Thread Yury Selivanov

Yury Selivanov added the comment:

Can we commit this to 3.6 too?

--
nosy: +yselivanov

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-11 Thread STINNER Victor

STINNER Victor added the comment:

> - scimark_sparse_mat_mult: 8.71 ms +- 0.19 ms -> 9.28 ms +- 0.12 ms: 1.07x 
> slower

Same issue on this benchmark:

* average over one year: 8.8 ms
* peak at rev 59b91b4e9506: 9.3 ms
* run after rev 59b91b4e9506: 9.0 ms

The benchmark is unstable, but the difference is small, especially compared to
the difference seen on call_method without the hot attribute.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-11 Thread STINNER Victor

STINNER Victor added the comment:

> - json_loads: 71.4 us +- 0.8 us -> 72.9 us +- 1.4 us: 1.02x slower

Hum, sadly this benchmark is still unstable after my change 59b91b4e9506 ("Make
hot functions using __attribute__((hot))"; oops, I wanted to write Mark, not
Make :-/).

This benchmark was around 63.4 us for many months, whereas it reached 72.9 us
at rev 59b91b4e9506, and the new run (also using the hot attribute) went back to
63.0 us...

I understand that json_loads depends on the code placement of some other 
functions which are not currently marked with the hot attribute.

https://speed.python.org/timeline/#/?exe=4=json_loads=1=50=off=on=on

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-11 Thread STINNER Victor

STINNER Victor added the comment:

Final result on speed-python:

haypo@speed-python$ python3 -m perf compare_to 
json_8nov/2016-11-10_15-39-default-8ebaa546a033.json 
2016-11-11_02-13-default-59b91b4e9506.json -G

Slower (12):
- scimark_sparse_mat_mult: 8.71 ms +- 0.19 ms -> 9.28 ms +- 0.12 ms: 1.07x 
slower
- nbody: 244 ms +- 2 ms -> 252 ms +- 4 ms: 1.03x slower
- json_loads: 71.4 us +- 0.8 us -> 72.9 us +- 1.4 us: 1.02x slower
- fannkuch: 1.07 sec +- 0.01 sec -> 1.09 sec +- 0.01 sec: 1.01x slower
- scimark_lu: 502 ms +- 19 ms -> 509 ms +- 12 ms: 1.01x slower
- chaos: 302 ms +- 3 ms -> 305 ms +- 3 ms: 1.01x slower
- xml_etree_iterparse: 224 ms +- 3 ms -> 226 ms +- 6 ms: 1.01x slower
- regex_dna: 299 ms +- 1 ms -> 300 ms +- 1 ms: 1.00x slower
- pickle_list: 9.21 us +- 0.33 us -> 9.24 us +- 0.56 us: 1.00x slower
- crypto_pyaes: 245 ms +- 1 ms -> 246 ms +- 2 ms: 1.00x slower
- meteor_contest: 219 ms +- 1 ms -> 219 ms +- 1 ms: 1.00x slower
- unpack_sequence: 128 ns +- 2 ns -> 128 ns +- 0 ns: 1.00x slower

Faster (39):
- logging_silent: 997 ns +- 40 ns -> 803 ns +- 13 ns: 1.24x faster
- regex_effbot: 6.16 ms +- 0.24 ms -> 5.17 ms +- 0.27 ms: 1.19x faster
- mako: 45.9 ms +- 0.7 ms -> 42.9 ms +- 0.6 ms: 1.07x faster
- xml_etree_process: 253 ms +- 4 ms -> 237 ms +- 4 ms: 1.07x faster
- call_simple: 13.9 ms +- 0.3 ms -> 13.1 ms +- 0.4 ms: 1.06x faster
- spectral_norm: 274 ms +- 2 ms -> 260 ms +- 2 ms: 1.05x faster
- xml_etree_generate: 300 ms +- 4 ms -> 285 ms +- 5 ms: 1.05x faster
- call_method_slots: 17.1 ms +- 0.2 ms -> 16.2 ms +- 0.3 ms: 1.05x faster
- telco: 21.8 ms +- 0.5 ms -> 20.7 ms +- 0.3 ms: 1.05x faster
- call_method: 17.3 ms +- 0.3 ms -> 16.5 ms +- 0.2 ms: 1.05x faster
- pickle_pure_python: 1.42 ms +- 0.02 ms -> 1.36 ms +- 0.03 ms: 1.04x faster
- pathlib: 51.9 ms +- 0.8 ms -> 50.6 ms +- 0.4 ms: 1.03x faster
- xml_etree_parse: 295 ms +- 8 ms -> 287 ms +- 7 ms: 1.03x faster
- chameleon: 31.0 ms +- 0.3 ms -> 30.2 ms +- 0.2 ms: 1.03x faster
- deltablue: 19.3 ms +- 0.3 ms -> 18.8 ms +- 0.2 ms: 1.02x faster
- django_template: 484 ms +- 4 ms -> 472 ms +- 5 ms: 1.02x faster
- call_method_unknown: 18.7 ms +- 0.2 ms -> 18.3 ms +- 0.2 ms: 1.02x faster
- html5lib: 261 ms +- 6 ms -> 256 ms +- 6 ms: 1.02x faster
- unpickle_pure_python: 973 us +- 12 us -> 954 us +- 15 us: 1.02x faster
- regex_v8: 47.6 ms +- 0.8 ms -> 46.7 ms +- 0.4 ms: 1.02x faster
- richards: 202 ms +- 4 ms -> 198 ms +- 5 ms: 1.02x faster
- logging_simple: 37.8 us +- 0.6 us -> 37.1 us +- 0.4 us: 1.02x faster
- sympy_integrate: 50.8 ms +- 0.9 ms -> 49.9 ms +- 1.4 ms: 1.02x faster
- dulwich_log: 189 ms +- 2 ms -> 186 ms +- 1 ms: 1.02x faster
- sqlalchemy_declarative: 343 ms +- 3 ms -> 339 ms +- 3 ms: 1.01x faster
- hexiom: 25.0 ms +- 0.1 ms -> 24.7 ms +- 0.1 ms: 1.01x faster
- logging_format: 44.6 us +- 0.6 us -> 44.1 us +- 0.6 us: 1.01x faster
- 2to3: 787 ms +- 4 ms -> 777 ms +- 4 ms: 1.01x faster
- tornado_http: 440 ms +- 4 ms -> 435 ms +- 4 ms: 1.01x faster
- json_dumps: 30.7 ms +- 0.4 ms -> 30.5 ms +- 0.3 ms: 1.01x faster
- go: 637 ms +- 10 ms -> 632 ms +- 8 ms: 1.01x faster
- regex_compile: 397 ms +- 2 ms -> 394 ms +- 3 ms: 1.01x faster
- nqueens: 266 ms +- 2 ms -> 264 ms +- 2 ms: 1.01x faster
- python_startup: 16.8 ms +- 0.0 ms -> 16.7 ms +- 0.0 ms: 1.01x faster
- python_startup_no_site: 9.91 ms +- 0.01 ms -> 9.86 ms +- 0.01 ms: 1.01x faster
- scimark_sor: 513 ms +- 13 ms -> 510 ms +- 8 ms: 1.01x faster
- raytrace: 1.41 sec +- 0.02 sec -> 1.40 sec +- 0.02 sec: 1.00x faster
- genshi_text: 95.2 ms +- 1.1 ms -> 94.7 ms +- 0.8 ms: 1.00x faster
- sympy_str: 529 ms +- 5 ms -> 528 ms +- 4 ms: 1.00x faster

Benchmark hidden because not significant (13): float, genshi_xml, pickle, 
pickle_dict, pidigits, scimark_fft, scimark_monte_carlo, sqlalchemy_imperative, 
sqlite_synth, sympy_expand, sympy_sum, unpickle, unpickle_list

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-10 Thread STINNER Victor

STINNER Victor added the comment:

I tried different patches and ran many quick & dirty benchmarks.

I tried to use likely/unlikely macros (using GCC's __builtin_expect): the effect
is not significant on the call_simple microbenchmark. I gave up on this part.
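
For context, this is the usual shape of such macros, built on GCC's
__builtin_expect; it is a generic sketch rather than the content of my
experimental patch, and the helper function below is purely illustrative:

```
#ifdef __GNUC__
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif

/* Typical use: mark the error path as unlikely so the compiler keeps the
   common case on the straight-line fall-through path. */
static int
check_depth(long depth, long limit)
{
    if (unlikely(depth > limit)) {
        return -1;   /* rare error path */
    }
    return 0;        /* common case */
}
```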

__attribute__((hot)) on a few Python core functions fixes the major slowdown on
call_method at revision 83877018ef97 (described in the initial message).

I noticed tiny differences when using __attribute__((hot)), a speedup in most
cases. I sometimes noticed a slowdown, but a very small one (e.g. 1%, and 1% on
a microbenchmark doesn't mean anything).

I pushed my patch to try to keep stable performance when Python is not compiled 
with PGO.

If you would like to know more about the crazy effects of code placement on
modern Intel CPUs, I suggest you look at the slides of this recent talk by an
Intel engineer:
https://llvmdevelopersmeetingbay2016.sched.org/event/8YzY/causes-of-performance-instability-due-to-code-placement-in-x86
"Causes of Performance Swings Due to Code Placement in IA by Zia Ansari 
(Intel), November 2016"

--

About PGO or not PGO: this question is not simple; I suggest discussing it
somewhere else so as not to flood this issue ;-)

For my use case, I'm not convinced yet that PGO with our current build system
produces reliable performance.

Not all Linux distributions compile Python using PGO: Fedora and RHEL don't
compile Python using PGO, for example. Bugzilla for Fedora:
https://bugzilla.redhat.com/show_bug.cgi?id=613045

I guess that there are also some developers running benchmarks on Python compiled
with "./configure && make". I'm trying to enhance the documentation and tools
around Python benchmarks to advise developers to use LTO and/or PGO.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-10 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 59b91b4e9506 by Victor Stinner in branch 'default':
Issue #28618: Make hot functions using __attribute__((hot))
https://hg.python.org/cpython/rev/59b91b4e9506

--
nosy: +python-dev

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-08 Thread STINNER Victor

Changes by STINNER Victor :


Added file: http://bugs.python.org/file45397/patch.json.gz

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-08 Thread STINNER Victor

STINNER Victor added the comment:

>> Do you mean comparison between current Python with PGO and patched
>> Python without PGO?
>
> Yes.

Ok, here you go. As expected, PGO compilation is faster than default
compilation with my patch. PGO implements more optimizations than just
__attribute__((hot)); it also optimizes branches, for example.

haypo@smithers$ python3 -m perf compare_to pgo.json.gz patch.json.gz -G 
--min-speed=5
Slower (56):
- regex_effbot: 4.30 ms +- 0.26 ms -> 5.77 ms +- 0.33 ms: 1.34x slower
- telco: 16.0 ms +- 1.1 ms -> 20.6 ms +- 0.4 ms: 1.29x slower
- xml_etree_process: 174 ms +- 15 ms -> 218 ms +- 29 ms: 1.25x slower
- xml_etree_generate: 205 ms +- 16 ms -> 254 ms +- 4 ms: 1.24x slower
- unpickle_list: 6.04 us +- 1.12 us -> 7.47 us +- 0.18 us: 1.24x slower
- call_simple: 10.6 ms +- 1.4 ms -> 13.1 ms +- 0.3 ms: 1.24x slower
- mako: 33.5 ms +- 0.3 ms -> 41.3 ms +- 0.9 ms: 1.23x slower
- pathlib: 37.0 ms +- 2.3 ms -> 44.7 ms +- 2.0 ms: 1.21x slower
- sqlite_synth: 7.56 us +- 0.20 us -> 8.97 us +- 0.18 us: 1.19x slower
- unpickle: 24.2 us +- 3.9 us -> 28.7 us +- 0.3 us: 1.18x slower
- chameleon: 23.4 ms +- 2.6 ms -> 27.4 ms +- 1.5 ms: 1.17x slower
- spectral_norm: 214 ms +- 7 ms -> 249 ms +- 9 ms: 1.17x slower
- nqueens: 210 ms +- 2 ms -> 244 ms +- 36 ms: 1.16x slower
- unpickle_pure_python: 717 us +- 10 us -> 831 us +- 66 us: 1.16x slower
- pickle: 18.7 us +- 4.3 us -> 21.6 us +- 3.3 us: 1.15x slower
- sympy_expand: 829 ms +- 39 ms -> 957 ms +- 28 ms: 1.15x slower
- genshi_text: 73.1 ms +- 3.2 ms -> 84.3 ms +- 1.1 ms: 1.15x slower
- pickle_list: 6.82 us +- 0.20 us -> 7.86 us +- 0.05 us: 1.15x slower
- sympy_str: 372 ms +- 28 ms -> 428 ms +- 3 ms: 1.15x slower
- xml_etree_parse: 231 ms +- 7 ms -> 266 ms +- 9 ms: 1.15x slower
- call_method_slots: 14.0 ms +- 1.3 ms -> 16.1 ms +- 1.2 ms: 1.15x slower
- sympy_sum: 169 ms +- 6 ms -> 194 ms +- 19 ms: 1.15x slower
- logging_format: 29.3 us +- 2.5 us -> 33.7 us +- 1.6 us: 1.15x slower
- logging_simple: 25.7 us +- 2.1 us -> 29.3 us +- 0.4 us: 1.14x slower
- genshi_xml: 159 ms +- 15 ms -> 182 ms +- 1 ms: 1.14x slower
- xml_etree_iterparse: 178 ms +- 3 ms -> 203 ms +- 5 ms: 1.14x slower
- pickle_pure_python: 1.06 ms +- 0.17 ms -> 1.21 ms +- 0.16 ms: 1.14x slower
- logging_silent: 618 ns +- 11 ns -> 705 ns +- 62 ns: 1.14x slower
- hexiom: 19.0 ms +- 0.2 ms -> 21.7 ms +- 0.2 ms: 1.14x slower
- html5lib: 184 ms +- 11 ms -> 209 ms +- 31 ms: 1.14x slower
- call_method: 14.3 ms +- 0.7 ms -> 16.3 ms +- 0.1 ms: 1.14x slower
- django_template: 324 ms +- 18 ms -> 368 ms +- 3 ms: 1.14x slower
- sympy_integrate: 37.9 ms +- 0.3 ms -> 43.0 ms +- 2.7 ms: 1.13x slower
- deltablue: 15.0 ms +- 2.0 ms -> 16.9 ms +- 1.0 ms: 1.12x slower
- call_method_unknown: 16.0 ms +- 0.4 ms -> 17.9 ms +- 0.2 ms: 1.12x slower
- 2to3: 611 ms +- 12 ms -> 677 ms +- 57 ms: 1.11x slower
- regex_compile: 300 ms +- 3 ms -> 332 ms +- 21 ms: 1.11x slower
- json_loads: 50.5 us +- 2.5 us -> 55.8 us +- 1.2 us: 1.10x slower
- unpack_sequence: 111 ns +- 5 ns -> 122 ns +- 1 ns: 1.10x slower
- pickle_dict: 53.2 us +- 3.7 us -> 58.1 us +- 3.7 us: 1.09x slower
- scimark_sor: 420 ms +- 60 ms -> 458 ms +- 12 ms: 1.09x slower
- scimark_lu: 398 ms +- 74 ms -> 434 ms +- 18 ms: 1.09x slower
- regex_dna: 227 ms +- 1 ms -> 247 ms +- 9 ms: 1.09x slower
- pidigits: 266 ms +- 33 ms -> 290 ms +- 10 ms: 1.09x slower
- chaos: 243 ms +- 2 ms -> 265 ms +- 3 ms: 1.09x slower
- crypto_pyaes: 197 ms +- 16 ms -> 215 ms +- 28 ms: 1.09x slower
- dulwich_log: 129 ms +- 15 ms -> 140 ms +- 8 ms: 1.08x slower
- sqlalchemy_imperative: 50.8 ms +- 0.9 ms -> 55.0 ms +- 1.8 ms: 1.08x slower
- meteor_contest: 173 ms +- 22 ms -> 187 ms +- 5 ms: 1.08x slower
- sqlalchemy_declarative: 268 ms +- 11 ms -> 290 ms +- 3 ms: 1.08x slower
- tornado_http: 335 ms +- 4 ms -> 361 ms +- 3 ms: 1.08x slower
- python_startup: 20.6 ms +- 0.6 ms -> 22.1 ms +- 0.9 ms: 1.08x slower
- python_startup_no_site: 8.37 ms +- 0.08 ms -> 9.00 ms +- 0.07 ms: 1.08x slower
- go: 518 ms +- 36 ms -> 557 ms +- 39 ms: 1.07x slower
- raytrace: 1.14 sec +- 0.08 sec -> 1.22 sec +- 0.02 sec: 1.07x slower
- scimark_fft: 594 ms +- 29 ms -> 627 ms +- 13 ms: 1.06x slower

Benchmark hidden because not significant (8): fannkuch, float, json_dumps, 
nbody, regex_v8, richards, scimark_monte_carlo, scimark_sparse_mat_mult

--
Added file: http://bugs.python.org/file45396/pgo.json.gz

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-05 Thread STINNER Victor

STINNER Victor added the comment:

Antoine Pitrou added the comment:
>> Do you mean comparison between current Python with PGO and patched
>> Python without PGO?
>
> Yes.

Oh ok, sure. I will try to run these 2 benchmarks.

>>> Ubuntu 14.04 is old, and I don't think this is something we should worry 
>>> about.
>>
>> Well, it's a practical issue for me to run benchmarks for speed.python.org.
>
> Why isn't the OS updated on that machine?

I am not sure that I want to use PGO compilation to run benchmarks.
Last time I checked, I noticed performance differences between two
compilations. PGO compilation doesn't seem 100% deterministic.

Maybe PGO compilation is fine when you build Python to create a Linux
package. But to get reliable benchmarks, I'm not sure that it's a good
idea.

I'm still running benchmarks on Python recompiled many times with
different compiler options, to measure the impact of those options
(especially LTO and/or PGO) on benchmark stability.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-05 Thread Antoine Pitrou

Antoine Pitrou added the comment:

On 2016-11-05 at 16:37, STINNER Victor wrote:
> 
> Antoine Pitrou added the comment:
>> Can you compare against a PGO build?
> 
> Do you mean comparison between current Python with PGO and patched
> Python without PGO?

Yes.

>> Ubuntu 14.04 is old, and I don't think this is something we should worry 
>> about.
> 
> Well, it's a practical issue for me to run benchmarks for speed.python.org.

Why isn't the OS updated on that machine?

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-05 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> Moreover, I like the idea of getting a fast(er) Python even when no
> advanced optimization techniques like LTO or PGO are used.

Seconded.

--
nosy: +serhiy.storchaka

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-05 Thread STINNER Victor

STINNER Victor added the comment:

Antoine Pitrou added the comment:
> Can you compare against a PGO build?

Do you mean comparison between current Python with PGO and patched
Python without PGO?

The hot attribute is ignored by GCC when PGO compilation is used.

> Ubuntu 14.04 is old, and I don't think this is something we should worry 
> about.

Well, it's a practical issue for me to run benchmarks for speed.python.org.

Moreover, I like the idea of getting a fast(er) Python even when no
advanced optimization techniques like LTO or PGO are used. At least,
it's common to quickly build Python using "./configure && make" for a
quick benchmark.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-05 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Can you compare against a PGO build? Ubuntu 14.04 is old, and I don't think 
this is something we should worry about.

Overall I think this manual approach is really the wrong way to look at it. 
Compilers can do better than us.

--
nosy: +pitrou

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-05 Thread STINNER Victor

STINNER Victor added the comment:

Oh, I forgot to mention that I compiled Python with "./configure -C". The 
purpose of the patch is to optimize Python when LTO and/or PGO compilation are 
not explicitly used.

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-05 Thread STINNER Victor

STINNER Victor added the comment:

I ran benchmarks. Globally, it seems like the impact of the patch is positive.
regex_v8 and call_simple are slower, but these are microbenchmarks
impacted by low-level effects like the CPU L1 cache. Well, my patch was supposed
to optimize CPython for call_simple :-/ Maybe I should investigate a little bit
more.


Performance comparison (performance 0.3.2):

haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G
Slower (6):
- regex_v8: 40.6 ms +- 5.7 ms -> 47.1 ms +- 0.3 ms: 1.16x slower
- call_simple: 12.6 ms +- 0.2 ms -> 13.2 ms +- 1.3 ms: 1.05x slower
- regex_effbot: 4.58 ms +- 0.07 ms -> 4.70 ms +- 0.05 ms: 1.03x slower
- sympy_integrate: 43.4 ms +- 0.3 ms -> 44.0 ms +- 0.2 ms: 1.01x slower
- nqueens: 239 ms +- 2 ms -> 241 ms +- 1 ms: 1.01x slower
- scimark_fft: 674 ms +- 12 ms -> 680 ms +- 75 ms: 1.01x slower

Faster (32):
- scimark_monte_carlo: 255 ms +- 4 ms -> 234 ms +- 7 ms: 1.09x faster
- chameleon: 28.4 ms +- 3.1 ms -> 27.0 ms +- 0.4 ms: 1.05x faster
- scimark_sor: 488 ms +- 27 ms -> 467 ms +- 10 ms: 1.05x faster
- sqlite_synth: 9.16 us +- 1.03 us -> 8.82 us +- 0.23 us: 1.04x faster
- scimark_lu: 485 ms +- 20 ms -> 469 ms +- 14 ms: 1.03x faster
- xml_etree_process: 226 ms +- 30 ms -> 219 ms +- 4 ms: 1.03x faster
- logging_simple: 29.7 us +- 0.4 us -> 28.9 us +- 0.3 us: 1.03x faster
- pickle_list: 7.99 us +- 0.88 us -> 7.78 us +- 0.05 us: 1.03x faster
- raytrace: 1.26 sec +- 0.08 sec -> 1.23 sec +- 0.01 sec: 1.03x faster
- sympy_expand: 995 ms +- 31 ms -> 971 ms +- 35 ms: 1.03x faster
- deltablue: 17.0 ms +- 0.1 ms -> 16.6 ms +- 0.2 ms: 1.02x faster
- call_method_slots: 16.0 ms +- 0.1 ms -> 15.6 ms +- 0.2 ms: 1.02x faster
- fannkuch: 983 ms +- 12 ms -> 962 ms +- 29 ms: 1.02x faster
- pickle_pure_python: 1.25 ms +- 0.14 ms -> 1.22 ms +- 0.01 ms: 1.02x faster
- logging_format: 34.0 us +- 0.3 us -> 33.4 us +- 1.5 us: 1.02x faster
- xml_etree_parse: 274 ms +- 9 ms -> 270 ms +- 5 ms: 1.02x faster
- sympy_str: 441 ms +- 3 ms -> 433 ms +- 3 ms: 1.02x faster
- genshi_text: 87.6 ms +- 9.2 ms -> 86.0 ms +- 1.4 ms: 1.02x faster
- genshi_xml: 187 ms +- 17 ms -> 184 ms +- 1 ms: 1.02x faster
- django_template: 376 ms +- 4 ms -> 370 ms +- 2 ms: 1.02x faster
- json_dumps: 27.1 ms +- 0.4 ms -> 26.7 ms +- 0.4 ms: 1.02x faster
- sqlalchemy_declarative: 295 ms +- 3 ms -> 291 ms +- 3 ms: 1.01x faster
- call_method_unknown: 18.1 ms +- 0.1 ms -> 17.8 ms +- 0.1 ms: 1.01x faster
- nbody: 218 ms +- 4 ms -> 216 ms +- 2 ms: 1.01x faster
- regex_dna: 250 ms +- 24 ms -> 247 ms +- 2 ms: 1.01x faster
- go: 573 ms +- 2 ms -> 566 ms +- 3 ms: 1.01x faster
- richards: 173 ms +- 4 ms -> 171 ms +- 4 ms: 1.01x faster
- python_startup: 24.6 ms +- 0.1 ms -> 24.5 ms +- 0.1 ms: 1.00x faster
- regex_compile: 404 ms +- 6 ms -> 403 ms +- 5 ms: 1.00x faster
- dulwich_log: 143 ms +- 11 ms -> 143 ms +- 1 ms: 1.00x faster
- pidigits: 290 ms +- 1 ms -> 289 ms +- 0 ms: 1.00x faster
- pickle_dict: 58.3 us +- 6.5 us -> 58.3 us +- 0.7 us: 1.00x faster

Benchmark hidden because not significant (26): 2to3, call_method, chaos, 
crypto_pyaes, float, hexiom, html5lib, json_loads, logging_silent, mako, 
meteor_contest, pathlib, pickle, python_startup_no_site, 
scimark_sparse_mat_mult, spectral_norm, sqlalchemy_imperative, sympy_sum, 
telco, tornado_http, unpack_sequence, unpickle, unpickle_list, 
unpickle_pure_python, xml_etree_generate, xml_etree_iterparse

--

More readable output, only display differences >= 5%:

haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G --min-speed=5
Slower (1):
- regex_v8: 40.6 ms +- 5.7 ms -> 47.1 ms +- 0.3 ms: 1.16x slower

Faster (2):
- scimark_monte_carlo: 255 ms +- 4 ms -> 234 ms +- 7 ms: 1.09x faster
- chameleon: 28.4 ms +- 3.1 ms -> 27.0 ms +- 0.4 ms: 1.05x faster

Benchmark hidden because not significant (61): 2to3, call_method, 
call_method_slots, call_method_unknown, call_simple, chaos, crypto_pyaes, 
deltablue, django_template, dulwich_log, fannkuch, float, genshi_text, 
genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_format, 
logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, 
pickle, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, 
python_startup_no_site, raytrace, regex_compile, regex_dna, regex_effbot, 
richards, scimark_fft, scimark_lu, scimark_sor, scimark_sparse_mat_mult, 
spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, 
sympy_expand, sympy_integrate, sympy_str, sympy_sum, telco, tornado_http, 
unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, 
xml_etree_generate, xml_etree_iterparse, xml_etree_parse, xml_etree_process

--

[issue28618] Decorate hot functions using __attribute__((hot)) to optimize Python

2016-11-04 Thread STINNER Victor

New submission from STINNER Victor:

When analyzing results of Python performance benchmarks, I noticed that 
call_method was 70% slower (!) between revisions 83877018ef97 (Oct 18) and 
3e073e7b4460 (Oct 22), including these revisions, on the speed-python server.

On these revisions, the CPU L1 instruction cache is less efficient: 8% cache 
misses, whereas it was only 0.06% before and after these revisions.

Since the two mentioned revisions have no obvious impact on the call_method()
benchmark, I understand that the performance difference is caused by a different
layout of the machine code, maybe the exact location of functions.

IMO the best solution to such a compilation issue is to use PGO compilation.
Problem: PGO doesn't work on Ubuntu 14.04, the OS used by speed-python (the
server running the benchmarks for http://speed.python.org/).

I propose to manually decorate the "hot" functions using the GCC
__attribute__((hot)) function attribute:
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
(search for "hot")

Attached patch adds Py_HOT_FUNCTION and decorates the following functions:

* _PyEval_EvalFrameDefault()
* PyFrame_New()
* call_function()
* lookdict_unicode_nodummy()
* _PyFunction_FastCall()
* frame_dealloc()

These functions are the top 6 according to the Linux perf tool when running the 
call_simple benchmark of the performance project:

32.66%: _PyEval_EvalFrameDefault
13.09%: PyFrame_New
12.78%: call_function
12.24%: lookdict_unicode_nodummy
 9.85%: _PyFunction_FastCall
 8.47%: frame_dealloc
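
For illustration, a minimal sketch of how such a macro could be defined and
applied; the actual hot_function.patch attached to this issue may differ in
details (the guard, the header location, and the GCC version check are
assumptions):

```
/* Sketch only: a possible definition of the Py_HOT_FUNCTION macro.
   __attribute__((hot)) is available since GCC 4.3; on other compilers the
   macro expands to nothing. */
#if defined(__GNUC__) \
    && ((__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
#  define Py_HOT_FUNCTION __attribute__((hot))
#else
#  define Py_HOT_FUNCTION
#endif

/* In the patch the attribute is applied to the function definitions; for
   example, the declaration of the eval loop would look like this: */
PyObject * Py_HOT_FUNCTION
_PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag);
```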

--
components: Interpreter Core
files: hot_function.patch
keywords: patch
messages: 280097
nosy: haypo
priority: normal
severity: normal
status: open
title: Decorate hot functions using __attribute__((hot)) to optimize Python
type: performance
versions: Python 3.7
Added file: http://bugs.python.org/file45361/hot_function.patch
