Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-24 Thread Victor Stinner
Oh, I found a nice piece of CPython history in Modules/_pickle.c.
Extract of Python 3.3:
-
/* A temporary cleaner API for fast single argument function call.

   XXX: Does caching the argument tuple provides any real performance benefits?

   A quick benchmark, on a 2.0GHz Athlon64 3200+ running Linux 2.6.24 with
   glibc 2.7, tells me that it takes roughly 20,000,000 PyTuple_New(1) calls
   when the tuple is retrieved from the freelist (i.e, call PyTuple_New() then
   immediately DECREF it) and 1,200,000 calls when allocating brand new tuples
   (i.e, call PyTuple_New() and store the returned value in an array), to save
   one second (wall clock time). Either ways, the loading time a pickle stream
   large enough to generate this number of calls would be massively
   overwhelmed by other factors, like I/O throughput, the GC traversal and
   object allocation overhead. So, I really doubt these functions provide any
   real benefits.

   On the other hand, oprofile reports that pickle spends a lot of time in
   these functions. But, that is probably more related to the function call
   overhead, than the argument tuple allocation.

   XXX: And, what is the reference behavior of these? Steal, borrow? At first
   glance, it seems to steal the reference of 'arg' and borrow the reference
   of 'func'. */
static PyObject *
_Pickler_FastCall(PicklerObject *self, PyObject *func, PyObject *arg)
-

Extract of Python 3.4 (same function):
-
/* Note: this function used to reuse the argument tuple. This used to give
   a slight performance boost with older pickle implementations where many
   unbuffered reads occurred (thus needing many function calls).

   However, this optimization was removed because it was too complicated
   to get right. It abused the C API for tuples to mutate them which led
   to subtle reference counting and concurrency bugs. Furthermore, the
   introduction of protocol 4 and the prefetching optimization via peek()
   significantly reduced the number of function calls we do. Thus, the
   benefits became marginal at best. */
-

It reminds me of the story of the property_descr_get() optimizations :-)

I hope that the new generic "fastcall" functions will provide a safe
and reliable optimization for the pickle module, property_descr_get(),
and other optimized functions.

Victor


Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-24 Thread Victor Stinner
2016-08-22 10:01 GMT+02:00 Victor Stinner :
> The next step is to support keyword parameters. In fact, it's already
> supported in all cases except of Python functions:
> https://bugs.python.org/issue27809

Serhiy Storchaka proposed to use a single C array for positional and
keyword arguments. Keyword arguments are passed as (key, value) pairs.
I just added this function:

PyAPI_FUNC(PyObject *) _PyObject_FastCallKeywords(
    PyObject *func,
    PyObject **stack,
    Py_ssize_t nargs,
    Py_ssize_t nkwargs);

The function is not used yet. Serhiy proposed to enhance the functions
that parse arguments to also accept this format, which would avoid the
creation of a temporary dictionary in many cases.
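
To make the layout concrete, here is a minimal sketch (the helper name is
made up, it is not code from the patch) of how a callee could walk such an
array, assuming stack[0..nargs-1] holds the positional arguments and the
following 2*nkwargs entries hold the (key, value) pairs:

#include <Python.h>

/* Hypothetical helper: print every argument passed in the combined array. */
static int
print_fastcall_args(PyObject **stack, Py_ssize_t nargs, Py_ssize_t nkwargs)
{
    Py_ssize_t i;

    for (i = 0; i < nargs; i++) {
        /* positional argument #i */
        if (PyObject_Print(stack[i], stdout, 0) < 0)
            return -1;
    }
    for (i = 0; i < nkwargs; i++) {
        PyObject *key = stack[nargs + 2 * i];       /* keyword name (str) */
        PyObject *value = stack[nargs + 2 * i + 1]; /* keyword value */
        if (PyObject_Print(key, stdout, 0) < 0
                || PyObject_Print(value, stdout, 0) < 0)
            return -1;
    }
    return 0;
}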

I proposed to use this format (instead of (PyObject **stack,
Py_ssize_t nargs, PyObject *kwargs)) for a new METH_FASTCALL calling
convention for C functions:
https://bugs.python.org/issue27810

Victor


Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-22 Thread Victor Stinner
Hi,

I pushed the most basic implementation of _PyObject_FastCall(), it
doesn't support keyword parameters yet:
https://hg.python.org/cpython/rev/a1a29d20f52d
https://bugs.python.org/issue27128

Then I patched a lot of call sites calling PyObject_Call(),
PyObject_CallObject(), PyEval_CallObject(), etc. with a temporary
tuple. Just one example:

-    args = PyTuple_Pack(1, match);
-    if (!args) {
-        Py_DECREF(match);
-        goto error;
-    }
-    item = PyObject_CallObject(filter, args);
-    Py_DECREF(args);
+    item = _PyObject_FastCall(filter, &match, 1, NULL);

The next step is to support keyword parameters. In fact, they are already
supported in all cases except for Python functions:
https://bugs.python.org/issue27809

Supporting keyword parameters will allow patching much more code to avoid
temporary tuples, but it is also required for a much more interesting
change:
https://bugs.python.org/issue27810
"Add METH_FASTCALL: new calling convention for C functions"

I propose to add a new METH_FASTCALL calling convention. For example, a
function currently declared with METH_VARARGS | METH_KEYWORDS:
   PyObject* func(DirEntry *self, PyObject *args, PyObject *kwargs)
becomes:
   PyObject* func(DirEntry *self, PyObject **args, int nargs, PyObject *kwargs)
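
For illustration, here is a minimal sketch (hypothetical module and function
names; the signature follows the proposal above, the final API may differ)
of a C function and its method table using the new convention:

#include <Python.h>

/* add(a, b): positional arguments arrive in a plain C array instead of a
   tuple; keyword arguments are still passed as a dict (possibly NULL). */
static PyObject *
mymod_add(PyObject *module, PyObject **args, int nargs, PyObject *kwargs)
{
    if (kwargs != NULL && PyDict_Size(kwargs) != 0) {
        PyErr_SetString(PyExc_TypeError, "add() takes no keyword arguments");
        return NULL;
    }
    if (nargs != 2) {
        PyErr_SetString(PyExc_TypeError, "add() takes exactly 2 arguments");
        return NULL;
    }
    /* args[n] replaces PyTuple_GET_ITEM(args, n): no temporary tuple. */
    return PyNumber_Add(args[0], args[1]);
}

static PyMethodDef mymod_methods[] = {
    /* METH_FASTCALL is the new flag proposed in
       https://bugs.python.org/issue27810 */
    {"add", (PyCFunction)mymod_add, METH_FASTCALL, "Add two objects."},
    {NULL, NULL, 0, NULL}
};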

Later, Argument Clinic will be modified to *generate* code using the
new METH_FASTCALL calling convention. Code written with Argument
Clinic will only need to be regenerated by Argument Clinic to get the
new, faster calling convention (avoiding the creation of a temporary
tuple for positional arguments).
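
As a reminder of what that looks like, here is a schematic Argument Clinic
declaration (hypothetical module and function, generated checksum markers
elided): only the generated parsing wrapper changes when Argument Clinic is
re-run, the hand-written body below stays untouched.

#include <Python.h>

/*[clinic input]
module mymod
[clinic start generated code]*/

/*[clinic input]
mymod.add

    a: object
    b: object
    /

Add two objects.
[clinic start generated code]*/

static PyObject *
mymod_add_impl(PyObject *module, PyObject *a, PyObject *b)
/*[clinic end generated code: output=... input=...]*/
{
    /* Hand-written body: untouched when the argument-parsing wrapper is
       regenerated for a new calling convention. */
    return PyNumber_Add(a, b);
}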

Victor


Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-08 Thread Yury Selivanov

On 2016-08-08 6:53 PM, Victor Stinner wrote:
> 2016-08-09 0:40 GMT+02:00 Guido van Rossum :
>>> tl;dr I found a way to make CPython 3.6 faster and I validated that
>>> there is no performance regression.
>>
>> But is there a performance improvement?
>
> Sure.
>
> On micro-benchmarks, you can see nice improvements:
>
> * getattr(1, "real") becomes 44% faster
> * list(filter(lambda x: x, list(range(1000)))) becomes 31% faster
> * namedtuple.attr becomes 23% faster
> * etc.
>
> See https://bugs.python.org/issue26814#msg263999 for default => patch,
> or https://bugs.python.org/issue26814#msg264003 for comparison python
> 2.7 / 3.4 / 3.5 / 3.6 / 3.6 patched.
>
> On the CPython benchmark suite, I also saw many faster benchmarks:
>
> Faster (25):
> - pickle_list: 1.29x faster
> - etree_generate: 1.22x faster
> - pickle_dict: 1.19x faster
> - etree_process: 1.16x faster
> - mako_v2: 1.13x faster
> - telco: 1.09x faster
> - raytrace: 1.08x faster
> - etree_iterparse: 1.08x faster
> (...)



Exceptional results, congrats Victor. Will be happy to help with code 
review.


Yury


Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-08 Thread Victor Stinner
2016-08-09 1:36 GMT+02:00 Brett Cannon :
> I just wanted to say I'm excited about this and I'm glad someone is taking
> advantage of what Argument Clinic allows for and what I know Larry had
> initially hoped AC would make happen!

To make "Python" faster, not only a few specific functions, "all" C
code should be updated to use the new "FASTCALL" calling convention.
But it's a pain to have to rewrite the code parsing arguments, we all
hate having to put #ifdef in the code... (for backward compatibility.)

This is where the magic happens: if your code is written using
Argument Clinic, you will get the optimization (FASTCALL) for free:
just run Argument Clinic again to get the updated calling convention.

It can be a very good motivation to rewrite your code using Argument
Clinic: you get better inline documentation (docstrings, help(func) in
the REPL) *and* better performance ;-)


> I should also point out that Serhiy has a patch for faster keyword argument
> parsing thanks to AC: http://bugs.python.org/issue27574 . Not sure how your
> two patches would intertwine (if at all).

In a first implementation, I packed *all* arguments in the same C
array: positional and keyword arguments. The problem is that all
functions expect a dict to parse keyword arguments. A dict has an
important property: O(1) for lookup. It becomes O(n) if you pass
keyword arguments as a list of (key, value) tuples in a C array.

So I chose not to touch keyword arguments at all: they continue to be
passed as a dict.
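
To illustrate the difference (a hypothetical helper, not code from any
patch): finding one keyword in a flat array of (key, value) pairs requires
a linear scan, whereas PyDict_GetItem() on a kwargs dict is O(1) on average:

#include <Python.h>
#include <string.h>

/* Look up one keyword by name in an array of nkwargs (key, value) pairs.
   Returns a borrowed reference, or NULL if the keyword is not present. */
static PyObject *
find_keyword(PyObject **kwstack, Py_ssize_t nkwargs, const char *name)
{
    Py_ssize_t i;

    for (i = 0; i < nkwargs; i++) {
        PyObject *key = kwstack[2 * i];
        const char *keystr = PyUnicode_AsUTF8(key);
        if (keystr != NULL && strcmp(keystr, name) == 0)
            return kwstack[2 * i + 1];
    }
    return NULL;  /* O(n) scan: every lookup may cost a full pass */
}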

By the way, it's very rare to call a function using keyword arguments from C.

--

About http://bugs.python.org/issue27574 : it's really nice to see work
done on this part!

I recall a discussion about the performance of operators versus function
calls. In some cases, the overhead of "parsing" arguments is higher
than the cost of the same feature implemented as an operator! Hmm, it
was probably this issue:
https://bugs.python.org/issue17170

Extract of the issue:
"""
Some crude C benchmarking on this computer:
- calling PyUnicode_Replace is 35 ns (per call)
- calling "hundred".replace is 125 ns
- calling PyArg_ParseTuple with the same signature as "hundred".replace is 80 ns
"""

Victor


Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-08 Thread Brett Cannon
I just wanted to say I'm excited about this and I'm glad someone is taking
advantage of what Argument Clinic allows for and what I know Larry had
initially hoped AC would make happen!

I should also point out that Serhiy has a patch for faster keyword argument
parsing thanks to AC: http://bugs.python.org/issue27574 . Not sure how your
two patches would intertwine (if at all).

On Mon, 8 Aug 2016 at 15:26 Victor Stinner  wrote:

> Hi,
>
> tl;dr I found a way to make CPython 3.6 faster and I validated that
> there is no performance regression. I'm requesting approval of core
> developers to start pushing changes.
>
> In 2014, during a lunch at PyCon, Larry Hastings told me that he would
> like to get rid of temporary tuples when calling functions in Python. In
> Python, positional arguments are passed as a tuple to C functions:
> "PyObject *args". Larry wrote Argument Clinic, which gives more control
> over how C functions are called. But I guess that Larry didn't have time
> to finish his implementation, since he didn't publish a patch.
>
> While trying to optimize CPython 3.6, I wrote a proof-of-concept patch
> and results were promising:
> https://bugs.python.org/issue26814#msg264003
> https://bugs.python.org/issue26814#msg266359
>
> C functions get a C array "PyObject **args, int nargs". Getting the
> nth argument becomes "arg = args[n];" at the C level. This format is
> not new, it's already used internally in Python/ceval.c. A Python
> function call made from a Python function already avoids a temporary
> tuple in most cases: we pass the stack of the first function as the
> list of arguments to the second function. My patch generalizes the
> idea to C functions. It works in all directions (C=>Python, Python=>C,
> C=>C, etc.).
>
> With my full patch, many function calls become not only faster than in
> Python 3.5, but even faster than in Python 2.7! For multiple reasons (not
> interesting here), the tested functions are slower in Python 3.4 than in
> Python 2.7. Python 3.5 is better than Python 3.4, but still slower than
> Python 2.7 in a few cases. With my "FASTCALL" patch, all tested function
> calls become as fast as, or faster than, Python 2.7!
>
> But when I ran the CPython benchmark suite, I found some major
> performance regressions. In fact, it took me 3 months to understand
> that I didn't run benchmarks correctly and that most benchmarks of the
> CPython benchmark suite are very unstable. I wrote articles explaining
> how benchmarks should be run (to be stable) and I patched all
> benchmarks to use my new perf module which runs benchmarks in multiple
> processes and computes the average (to make benchmarks more stable).
>
> In the end, my minimal FASTCALL patch (issue #27128) doesn't show any
> major performance regression if you run the benchmarks "correctly" :-)
> https://bugs.python.org/issue27128#msg272197
>
> Most benchmarks are not significant, 14 are faster, and only 4 are slower.
>
> According to benchmarks of the "full" FASTCALL patch, the slowdowns are
> temporary and should quickly turn into speedups (with further changes).
>
> My question is now: can I push fastcall-2.patch from issue #27128?
> This patch only adds the infrastructure needed to start working on more
> useful optimizations; more patches will come, and I expect more exciting
> benchmark results.
>
> For an overview of the initial FASTCALL patch, see my first message on the issue:
> https://bugs.python.org/issue27128#msg266422
>
> --
>
> Note: My full FASTCALL patch changes the C API: this is out of the
> scope of my first simple FASTCALL patch. I will open a new discussion
> to decide if it's worth it and if yes, how it should be done.
>
> Victor


Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-08 Thread Victor Stinner
2016-08-09 0:40 GMT+02:00 Guido van Rossum :
> Hm, I agree that those tuples are probably expensive. I recall that
> IronPython boasted faster Python calls by doing something closer to the
> platform (in their case I'm guessing C# or the CLR :-).

To be honest, I didn't expect *any* speedup just by avoiding the
temporary tuples. The C structure of tuples is simple and the
allocation of tuples is already optimized by a free list. I still
don't understand how the cost of tuple creation/destruction can have
such a "large" impact on performance.

The discussion with Larry was not really my first motivation to work
on FASTCALL.

I worked on this topic because CPython already uses some "hidden"
tuples to avoid the cost of tuple creation/destruction in various
places, but the code doing so is really ugly, and that ugly code has
caused crashes and surprising behaviour...

https://bugs.python.org/issue26811 is a recent crash related to the
property_descr_get() optimization, even though the optimization had
already been "fixed" once:
https://hg.python.org/cpython/rev/5dbf3d932a59/

The fix is just another hack on top of the existing hack. Issue
#26811 rewrote the optimization to avoid the crash using
_PyObject_GC_UNTRACK():
https://hg.python.org/cpython/rev/a98ef122d73d

I tried to make this "optimization" the standard way to call
functions, rather than a corner case, and avoid hacks like
PyTuple_SET_ITEM(args, 0, NULL) or _PyObject_GC_UNTRACK().
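
For readers who have not seen this pattern, here is a minimal sketch (not
the actual CPython code) of the fragile "reuse a cached 1-tuple" trick that
FASTCALL is meant to replace:

#include <Python.h>

static PyObject *cached_args = NULL;  /* not thread-safe: part of the problem */

static PyObject *
call_with_one_arg(PyObject *func, PyObject *arg)
{
    PyObject *args, *result;

    if (cached_args != NULL && Py_REFCNT(cached_args) == 1) {
        /* Nobody else holds the tuple: mutate it in place.  This abuses the
           C API (tuples are supposed to be immutable) and is exactly the
           kind of hack that caused the crashes mentioned above. */
        args = cached_args;
        cached_args = NULL;
        Py_INCREF(arg);
        PyTuple_SET_ITEM(args, 0, arg);
    }
    else {
        args = PyTuple_Pack(1, arg);
        if (args == NULL)
            return NULL;
    }

    result = PyObject_Call(func, args, NULL);

    if (cached_args == NULL && Py_REFCNT(args) == 1) {
        /* Recycle the tuple: clear its slot so it does not keep 'arg' alive. */
        PyObject *item = PyTuple_GET_ITEM(args, 0);
        PyTuple_SET_ITEM(args, 0, NULL);
        Py_DECREF(item);
        cached_args = args;
    }
    else {
        Py_DECREF(args);
    }
    return result;
}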

Victor


Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-08 Thread Victor Stinner
2016-08-09 0:40 GMT+02:00 Guido van Rossum :
>> tl;dr I found a way to make CPython 3.6 faster and I validated that
>> there is no performance regression.
>
> But is there a performance improvement?

Sure.


On micro-benchmarks, you can see nice improvements:

* getattr(1, "real") becomes 44% faster
* list(filter(lambda x: x, list(range(1000)))) becomes 31% faster
* namedtuple.attr becomes 23% faster
* etc.

See https://bugs.python.org/issue26814#msg263999 for default => patch,
or https://bugs.python.org/issue26814#msg264003 for comparison python
2.7 / 3.4 / 3.5 / 3.6 / 3.6 patched.


On the CPython benchmark suite, I also saw many faster benchmarks:

Faster (25):
- pickle_list: 1.29x faster
- etree_generate: 1.22x faster
- pickle_dict: 1.19x faster
- etree_process: 1.16x faster
- mako_v2: 1.13x faster
- telco: 1.09x faster
- raytrace: 1.08x faster
- etree_iterparse: 1.08x faster
(...)

See https://bugs.python.org/issue26814#msg266359

Victor


Re: [Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-08 Thread Guido van Rossum
On Mon, Aug 8, 2016 at 3:25 PM, Victor Stinner 
wrote:

> tl;dr I found a way to make CPython 3.6 faster and I validated that
> there is no performance regression.


But is there a performance improvement?


> I'm requesting approval of core
> developers to start pushing changes.
>
> In 2014, during a lunch at PyCon, Larry Hastings told me that he would
> like to get rid of temporary tuples when calling functions in Python. In
> Python, positional arguments are passed as a tuple to C functions:
> "PyObject *args". Larry wrote Argument Clinic, which gives more control
> over how C functions are called. But I guess that Larry didn't have time
> to finish his implementation, since he didn't publish a patch.
>

Hm, I agree that those tuples are probably expensive. I recall that
IronPython boasted faster Python calls by doing something closer to the
platform (in their case I'm guessing C# or the CLR :-).

Is this perhaps something that could wait until the Core devs sprint in a
few weeks? (I presume you're coming?!)

-- 
--Guido van Rossum (python.org/~guido)


[Python-Dev] New calling convention to avoid temporary tuples when calling functions

2016-08-08 Thread Victor Stinner
Hi,

tl;dr I found a way to make CPython 3.6 faster and I validated that
there is no performance regression. I'm requesting approval of core
developers to start pushing changes.

In 2014, during a lunch at PyCon, Larry Hastings told me that he would
like to get rid of temporary tuples when calling functions in Python. In
Python, positional arguments are passed as a tuple to C functions:
"PyObject *args". Larry wrote Argument Clinic, which gives more control
over how C functions are called. But I guess that Larry didn't have time
to finish his implementation, since he didn't publish a patch.

While trying to optimize CPython 3.6, I wrote a proof-of-concept patch
and results were promising:
https://bugs.python.org/issue26814#msg264003
https://bugs.python.org/issue26814#msg266359

C functions get a C array "PyObject **args, int nargs". Getting the
nth argument becomes "arg = args[n];" at the C level. This format is
not new, it's already used internally in Python/ceval.c. A Python
function call made from a Python function already avoids a temporary
tuple in most cases: we pass the stack of the first function as the
list of arguments to the second function. My patch generalizes the
idea to C functions. It works in all directions (C=>Python, Python=>C,
C=>C, etc.).
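
To make the idea concrete, here is a sketch (not code from the patch; the
4-argument _PyObject_FastCall() prototype follows the messages in this
thread and may differ in the final version) of calling func(x, y) from C
with the classic tuple-based API versus the FASTCALL array:

#include <Python.h>

/* Classic API: pack a temporary tuple just to pass two arguments. */
static PyObject *
call_two_args_tuple(PyObject *func, PyObject *x, PyObject *y)
{
    PyObject *args, *res;

    args = PyTuple_Pack(2, x, y);
    if (args == NULL)
        return NULL;
    res = PyObject_Call(func, args, NULL);
    Py_DECREF(args);
    return res;
}

/* FASTCALL: the arguments live in a plain C array on the caller's stack. */
static PyObject *
call_two_args_fast(PyObject *func, PyObject *x, PyObject *y)
{
    PyObject *stack[2] = {x, y};
    return _PyObject_FastCall(func, stack, 2, NULL);
}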

With my full patch, many function calls become not only faster than in
Python 3.5, but even faster than in Python 2.7! For multiple reasons (not
interesting here), the tested functions are slower in Python 3.4 than in
Python 2.7. Python 3.5 is better than Python 3.4, but still slower than
Python 2.7 in a few cases. With my "FASTCALL" patch, all tested function
calls become as fast as, or faster than, Python 2.7!

But when I ran the CPython benchmark suite, I found some major
performance regressions. In fact, it took me 3 months to understand
that I didn't run benchmarks correctly and that most benchmarks of the
CPython benchmark suite are very unstable. I wrote articles explaining
how benchmarks should be run (to be stable) and I patched all
benchmarks to use my new perf module which runs benchmarks in multiple
processes and computes the average (to make benchmarks more stable).

In the end, my minimal FASTCALL patch (issue #27128) doesn't show any
major performance regression if you run the benchmarks "correctly" :-)
https://bugs.python.org/issue27128#msg272197

Most benchmarks are not significant, 14 are faster, and only 4 are slower.

According to benchmarks of the "full" FASTCALL patch, the slowdowns are
temporary and should quickly turn into speedups (with further changes).

My question is now: can I push fastcall-2.patch from issue #27128?
This patch only adds the infrastructure needed to start working on more
useful optimizations; more patches will come, and I expect more exciting
benchmark results.

For an overview of the initial FASTCALL patch, see my first message on the issue:
https://bugs.python.org/issue27128#msg266422

--

Note: My full FASTCALL patch changes the C API: this is out of the
scope of my first simple FASTCALL patch. I will open a new discussion
to decide if it's worth it and if yes, how it should be done.

Victor