Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions
Oh, I found a nice piece of CPython history in Modules/_pickle.c.

Extract of Python 3.3:

    /* A temporary cleaner API for fast single argument function call.

       XXX: Does caching the argument tuple provides any real performance
       benefits?

       A quick benchmark, on a 2.0GHz Athlon64 3200+ running Linux 2.6.24
       with glibc 2.7, tells me that it takes roughly 20,000,000
       PyTuple_New(1) calls when the tuple is retrieved from the freelist
       (i.e, call PyTuple_New() then immediately DECREF it) and 1,200,000
       calls when allocating brand new tuples (i.e, call PyTuple_New() and
       store the returned value in an array), to save one second (wall
       clock time). Either ways, the loading time a pickle stream large
       enough to generate this number of calls would be massively
       overwhelmed by other factors, like I/O throughput, the GC traversal
       and object allocation overhead. So, I really doubt these functions
       provide any real benefits.

       On the other hand, oprofile reports that pickle spends a lot of
       time in these functions. But, that is probably more related to the
       function call overhead, than the argument tuple allocation.

       XXX: And, what is the reference behavior of these? Steal, borrow?
       At first glance, it seems to steal the reference of 'arg' and
       borrow the reference of 'func'. */
    static PyObject *
    _Pickler_FastCall(PicklerObject *self, PyObject *func, PyObject *arg)

Extract of Python 3.4 (same function):

    /* Note: this function used to reuse the argument tuple. This used to
       give a slight performance boost with older pickle implementations
       where many unbuffered reads occurred (thus needing many function
       calls).

       However, this optimization was removed because it was too
       complicated to get right. It abused the C API for tuples to mutate
       them which led to subtle reference counting and concurrency bugs.
       Furthermore, the introduction of protocol 4 and the prefetching
       optimization via peek() significantly reduced the number of
       function calls we do. Thus, the benefits became marginal at
       best. */

It reminds me of the story of the property_descr_get() optimizations :-)

I hope that the new generic "fastcall" functions will provide a safe and reliable optimization for the pickle module, property_descr_get() and other optimized functions.

Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions
2016-08-22 10:01 GMT+02:00 Victor Stinner:
> The next step is to support keyword parameters. In fact, it's already
> supported in all cases except for Python functions:
> https://bugs.python.org/issue27809

Serhiy Storchaka proposed to use a single C array for positional and keyword arguments, with keyword arguments passed as (key, value) pairs. I just added this function:

    PyAPI_FUNC(PyObject *) _PyObject_FastCallKeywords(
        PyObject *func,
        PyObject **stack,
        Py_ssize_t nargs,
        Py_ssize_t nkwargs);

The function is not used yet. Serhiy proposed to enhance the argument-parsing functions to support this format, which would avoid the creation of a temporary dictionary in many cases. I proposed to use this format (instead of (PyObject **stack, Py_ssize_t nargs, PyObject *kwargs)) for a new METH_FASTCALL calling convention for C functions: https://bugs.python.org/issue27810

Victor
Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions
Hi,

I pushed the most basic implementation of _PyObject_FastCall(); it doesn't support keyword parameters yet:
https://hg.python.org/cpython/rev/a1a29d20f52d
https://bugs.python.org/issue27128

Then I patched a lot of call sites calling PyObject_Call(), PyObject_CallObject(), PyEval_CallObject(), etc. with a temporary tuple. Just one example:

    -args = PyTuple_Pack(1, match);
    -if (!args) {
    -    Py_DECREF(match);
    -    goto error;
    -}
    -item = PyObject_CallObject(filter, args);
    -Py_DECREF(args);
    +item = _PyObject_FastCall(filter, &match, 1, NULL);

The next step is to support keyword parameters. In fact, it's already supported in all cases except for Python functions: https://bugs.python.org/issue27809

Supporting keyword parameters will make it possible to patch much more code to avoid temporary tuples, but it is also required for a much more interesting change: https://bugs.python.org/issue27810 "Add METH_FASTCALL: new calling convention for C functions"

I propose to add a new METH_FASTCALL calling convention. An example using METH_VARARGS | METH_KEYWORDS:

    PyObject* func(DirEntry *self, PyObject *args, PyObject *kwargs)

becomes:

    PyObject* func(DirEntry *self, PyObject **args, int nargs, PyObject *kwargs)

Later, Argument Clinic will be modified to *generate* code using the new METH_FASTCALL calling convention. Code written with Argument Clinic will only need to be regenerated by Argument Clinic to get the new, faster calling convention (avoiding the creation of a temporary tuple for positional arguments).

Victor
Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions
On 2016-08-08 6:53 PM, Victor Stinner wrote:
> 2016-08-09 0:40 GMT+02:00 Guido van Rossum:
>>> tl;dr I found a way to make CPython 3.6 faster and I validated that
>>> there is no performance regression.
>>
>> But is there a performance improvement?
>
> Sure. On micro-benchmarks, you can see nice improvements:
>
> * getattr(1, "real") becomes 44% faster
> * list(filter(lambda x: x, list(range(1000)))) becomes 31% faster
> * namedtuple.attr becomes 23% faster
> * etc.
>
> See https://bugs.python.org/issue26814#msg263999 for default => patch,
> or https://bugs.python.org/issue26814#msg264003 for a comparison of
> Python 2.7 / 3.4 / 3.5 / 3.6 / 3.6 patched.
>
> On the CPython benchmark suite, I also saw many faster benchmarks:
>
> Faster (25):
> - pickle_list: 1.29x faster
> - etree_generate: 1.22x faster
> - pickle_dict: 1.19x faster
> - etree_process: 1.16x faster
> - mako_v2: 1.13x faster
> - telco: 1.09x faster
> - raytrace: 1.08x faster
> - etree_iterparse: 1.08x faster
> (...)

Exceptional results, congrats Victor. Will be happy to help with code review.

Yury
Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions
2016-08-09 1:36 GMT+02:00 Brett Cannon:
> I just wanted to say I'm excited about this and I'm glad someone is taking
> advantage of what Argument Clinic allows for and what I know Larry had
> initially hoped AC would make happen!

To make "Python" faster, not only a few specific functions, "all" C code should be updated to use the new FASTCALL calling convention. But it's a pain to have to rewrite the code parsing arguments, and we all hate having to put #ifdef in the code (for backward compatibility)...

This is where the magic happens: if your code is written using Argument Clinic, you get the optimization (FASTCALL) for free: just run Argument Clinic again to get the updated calling convention. It can be a very good motivation to rewrite your code using Argument Clinic: get better inline documentation (docstring, help(func) in the REPL) *and* performance ;-)

> I should also point out that Serhiy has a patch for faster keyword argument
> parsing thanks to AC: http://bugs.python.org/issue27574 . Not sure how your
> two patches would intertwine (if at all).

In a first implementation, I packed *all* arguments in the same C array: positional and keyword arguments. The problem is that all functions expect a dict to parse keyword arguments. A dict has an important property: O(1) lookup. It becomes O(n) if you pass keyword arguments as a list of (key, value) tuples in a C array. So I chose not to touch keyword arguments at all and to continue passing them as a dict. By the way, it's very rare to call a function using keyword arguments from C.

--

About http://bugs.python.org/issue27574 : it's really nice to see work done on this part! I recall a discussion of the performance of an operator versus a function call. In some cases, the overhead of parsing arguments is higher than the cost of the same feature implemented as an operator!

Hum, it was probably this issue: https://bugs.python.org/issue17170

Extract of the issue:
"""
Some crude C benchmarking on this computer:
- calling PyUnicode_Replace is 35 ns (per call)
- calling "hundred".replace is 125 ns
- calling PyArg_ParseTuple with the same signature as "hundred".replace is 80 ns
"""

Victor
Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions
I just wanted to say I'm excited about this and I'm glad someone is taking advantage of what Argument Clinic allows for and what I know Larry had initially hoped AC would make happen!

I should also point out that Serhiy has a patch for faster keyword argument parsing thanks to AC: http://bugs.python.org/issue27574 . Not sure how your two patches would intertwine (if at all).

On Mon, 8 Aug 2016 at 15:26 Victor Stinner wrote:
> Hi,
>
> tl;dr I found a way to make CPython 3.6 faster and I validated that
> there is no performance regression. I'm requesting approval of core
> developers to start pushing changes.
>
> In 2014 during a lunch at PyCon, Larry Hastings told me that he would
> like to get rid of temporary tuples to call functions in Python. In
> Python, positional arguments are passed as a tuple to C functions:
> "PyObject *args". Larry wrote Argument Clinic which gives more control
> on how C functions are called. But I guess that Larry didn't have time
> to finish his implementation, since he didn't publish a patch.
>
> While trying to optimize CPython 3.6, I wrote a proof-of-concept patch
> and results were promising:
> https://bugs.python.org/issue26814#msg264003
> https://bugs.python.org/issue26814#msg266359
>
> C functions get a C array "PyObject **args, int nargs". Getting the
> nth argument becomes "arg = args[n];" at the C level. This format is
> not new: it's already used internally in Python/ceval.c. A Python
> function call made from a Python function already avoids a temporary
> tuple in most cases: we pass the stack of the first function as the
> list of arguments to the second function. My patch generalizes the
> idea to C functions. It works in all directions (C=>Python, Python=>C,
> C=>C, etc.).
>
> Many function calls become not only faster than Python 3.5 with my
> full patch, but even faster than Python 2.7! For multiple reasons (not
> interesting here), tested functions are slower in Python 3.4 than
> Python 2.7. Python 3.5 is better than Python 3.4, but still slower
> than Python 2.7 in a few cases. Using my "FASTCALL" patch, all tested
> function calls become faster than or as fast as Python 2.7!
>
> But when I ran the CPython benchmark suite, I found some major
> performance regressions. In fact, it took me 3 months to understand
> that I didn't run benchmarks correctly and that most benchmarks of the
> CPython benchmark suite are very unstable. I wrote articles explaining
> how benchmarks should be run (to be stable) and I patched all
> benchmarks to use my new perf module, which runs benchmarks in
> multiple processes and computes the average (to make benchmarks more
> stable).
>
> At the end, my minimum FASTCALL patch (issue #27128) doesn't show any
> major performance regression if you run benchmarks "correctly" :-)
> https://bugs.python.org/issue27128#msg272197
>
> Most benchmarks are not significant, 14 are faster, and only 4 are
> slower.
>
> According to benchmarks on the "full" FASTCALL patch, the slowdowns
> are temporary and should quickly turn into speedups (with further
> changes).
>
> My question is now: can I push fastcall-2.patch of issue #27128? This
> patch only adds the infrastructure to start working on more useful
> optimizations; more patches will come, and I expect more exciting
> benchmark results.
>
> Overview of the initial FASTCALL patch, see my first message on the
> issue:
> https://bugs.python.org/issue27128#msg266422
>
> --
>
> Note: My full FASTCALL patch changes the C API: this is out of the
> scope of my first simple FASTCALL patch. I will open a new discussion
> to decide if it's worth it and if yes, how it should be done.
>
> Victor
Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions
2016-08-09 0:40 GMT+02:00 Guido van Rossum:
> Hm, I agree that those tuples are probably expensive. I recall that
> IronPython boasted faster Python calls by doing something closer to the
> platform (in their case I'm guessing C# or the CLR :-).

To be honest, I didn't expect *any* speedup just by avoiding the temporary tuples. The C structure of tuples is simple and the allocation of tuples is already optimized by a free list. I still don't understand how the cost of tuple creation/destruction can have such a large impact on performance.

The discussion with Larry was not really my first motivation to work on FASTCALL. I worked on this topic because CPython already uses some "hidden" tuples to avoid the cost of tuple creation/destruction in various places, but using really ugly code, and this ugly code caused crashes and surprising behaviours...

https://bugs.python.org/issue26811 is a recent crash related to the property_descr_get() optimization, whereas the optimization had already been "fixed" once:
https://hg.python.org/cpython/rev/5dbf3d932a59/

The fix is just another hack on top of the existing hack. Issue #26811 rewrote the optimization to avoid the crash using _PyObject_GC_UNTRACK():
https://hg.python.org/cpython/rev/a98ef122d73d

I tried to make this "optimization" the standard way to call functions, rather than a corner case, and to avoid hacks like PyTuple_SET_ITEM(args, 0, NULL) or _PyObject_GC_UNTRACK().

Victor
Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions
2016-08-09 0:40 GMT+02:00 Guido van Rossum:
>> tl;dr I found a way to make CPython 3.6 faster and I validated that
>> there is no performance regression.
>
> But is there a performance improvement?

Sure. On micro-benchmarks, you can see nice improvements:

* getattr(1, "real") becomes 44% faster
* list(filter(lambda x: x, list(range(1000)))) becomes 31% faster
* namedtuple.attr becomes 23% faster
* etc.

See https://bugs.python.org/issue26814#msg263999 for default => patch, or https://bugs.python.org/issue26814#msg264003 for a comparison of Python 2.7 / 3.4 / 3.5 / 3.6 / 3.6 patched.

On the CPython benchmark suite, I also saw many faster benchmarks:

Faster (25):
- pickle_list: 1.29x faster
- etree_generate: 1.22x faster
- pickle_dict: 1.19x faster
- etree_process: 1.16x faster
- mako_v2: 1.13x faster
- telco: 1.09x faster
- raytrace: 1.08x faster
- etree_iterparse: 1.08x faster
(...)

See https://bugs.python.org/issue26814#msg266359

Victor
Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions
On Mon, Aug 8, 2016 at 3:25 PM, Victor Stinner wrote:
> tl;dr I found a way to make CPython 3.6 faster and I validated that
> there is no performance regression.

But is there a performance improvement?

> I'm requesting approval of core
> developers to start pushing changes.
>
> In 2014 during a lunch at PyCon, Larry Hastings told me that he would
> like to get rid of temporary tuples to call functions in Python. In
> Python, positional arguments are passed as a tuple to C functions:
> "PyObject *args". Larry wrote Argument Clinic which gives more control
> on how C functions are called. But I guess that Larry didn't have time
> to finish his implementation, since he didn't publish a patch.

Hm, I agree that those tuples are probably expensive. I recall that IronPython boasted faster Python calls by doing something closer to the platform (in their case I'm guessing C# or the CLR :-).

Is this perhaps something that could wait until the core devs sprint in a few weeks? (I presume you're coming?!)

--
--Guido van Rossum (python.org/~guido)
[Python-Dev] New calling convention to avoid temporary tuples when calling functions
Hi,

tl;dr I found a way to make CPython 3.6 faster and I validated that there is no performance regression. I'm requesting approval of core developers to start pushing changes.

In 2014 during a lunch at PyCon, Larry Hastings told me that he would like to get rid of temporary tuples to call functions in Python. In Python, positional arguments are passed as a tuple to C functions: "PyObject *args". Larry wrote Argument Clinic which gives more control on how C functions are called. But I guess that Larry didn't have time to finish his implementation, since he didn't publish a patch.

While trying to optimize CPython 3.6, I wrote a proof-of-concept patch and results were promising:
https://bugs.python.org/issue26814#msg264003
https://bugs.python.org/issue26814#msg266359

C functions get a C array "PyObject **args, int nargs". Getting the nth argument becomes "arg = args[n];" at the C level. This format is not new: it's already used internally in Python/ceval.c. A Python function call made from a Python function already avoids a temporary tuple in most cases: we pass the stack of the first function as the list of arguments to the second function. My patch generalizes the idea to C functions. It works in all directions (C=>Python, Python=>C, C=>C, etc.).

Many function calls become not only faster than Python 3.5 with my full patch, but even faster than Python 2.7! For multiple reasons (not interesting here), tested functions are slower in Python 3.4 than Python 2.7. Python 3.5 is better than Python 3.4, but still slower than Python 2.7 in a few cases. Using my "FASTCALL" patch, all tested function calls become faster than or as fast as Python 2.7!

But when I ran the CPython benchmark suite, I found some major performance regressions. In fact, it took me 3 months to understand that I didn't run benchmarks correctly and that most benchmarks of the CPython benchmark suite are very unstable. I wrote articles explaining how benchmarks should be run (to be stable) and I patched all benchmarks to use my new perf module, which runs benchmarks in multiple processes and computes the average (to make benchmarks more stable).

At the end, my minimum FASTCALL patch (issue #27128) doesn't show any major performance regression if you run benchmarks "correctly" :-)
https://bugs.python.org/issue27128#msg272197

Most benchmarks are not significant, 14 are faster, and only 4 are slower.

According to benchmarks on the "full" FASTCALL patch, the slowdowns are temporary and should quickly turn into speedups (with further changes).

My question is now: can I push fastcall-2.patch of issue #27128? This patch only adds the infrastructure to start working on more useful optimizations; more patches will come, and I expect more exciting benchmark results.

Overview of the initial FASTCALL patch, see my first message on the issue:
https://bugs.python.org/issue27128#msg266422

--

Note: My full FASTCALL patch changes the C API: this is out of the scope of my first simple FASTCALL patch. I will open a new discussion to decide if it's worth it and if yes, how it should be done.

Victor