Re: pointless musings on performance

2009-11-26 Thread Paul Boddie
On 25 Nov, 13:11, Antoine Pitrou  wrote:
>
> When you say "executing each kind of bytecode instruction", are you
> talking about the overhead of bytecode dispatch and operand gathering, or
> the total cost including doing the useful work?

Strip away any overhead (dispatch, operand gathering) and just measure
the cumulative time spent doing the actual work for each kind of
instruction, then calculate the average "cost" by dividing by the
frequency of each instruction type. So, for a whole program you'd get
a table of results like this:

LOAD_CONST      <frequency>   <cumulative time>   <time per instruction>
LOAD_NAME       <frequency>   <cumulative time>   <time per instruction>
CALL_FUNCTION   <frequency>   <cumulative time>   <time per instruction>
...

A comparison of the "time per instruction" column would yield the
relative cost of each kind of instruction. Of course, a general
profiling of the interpreter would be useful, too, but I imagine that
this has been done many times before. To go back to the CISC vs. RISC
analogy, I'd expect substantial variation in relative costs, which one
could argue is a CISC-like trait (although a separate matter to
instruction set orthogonality).
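
For what it's worth, the frequency column can be gathered from outside the
interpreter with the dis module; the two timing columns would need an
instrumented interpreter build. A rough, untested sketch of the counting
part (assuming CPython 2.x bytecode layout and ignoring EXTENDED_ARG):

import dis
from collections import defaultdict

def opcode_frequencies(func):
    # Tally how often each kind of opcode appears in func's bytecode.
    counts = defaultdict(int)
    code = func.func_code.co_code        # raw bytecode string (CPython 2.x)
    i = 0
    while i < len(code):
        op = ord(code[i])
        counts[dis.opname[op]] += 1
        # opcodes >= HAVE_ARGUMENT carry a two-byte operand
        i += 3 if op >= dis.HAVE_ARGUMENT else 1
    return dict(counts)

Running that over a few representative functions would at least fill in
the frequency column; the timing columns are the hard part.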

Paul
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-25 Thread Antoine Pitrou
On Tue, 24 Nov 2009 16:09:10 -0800, Paul Boddie wrote:
> 
> I'm referring to what you're talking about at the end. The enhancements
> in Python 3 presumably came about after discussion of "threaded
> interpreters", confirming that the evaluation loop in Python 2 was not
> exactly optimal.

An optimal evaluation loop is an evaluation loop which doesn't get 
executed at all :-)
(which is what unladen-swallow, cython and pypy are trying to do)

> You need to draw the line between work done by system and external
> libraries and that done by Python, but a breakdown of the time spent
> executing each kind of bytecode instruction could be interesting.

When you say "executing each kind of bytecode instruction", are you 
talking about the overhead of bytecode dispatch and operand gathering, or 
the total cost including doing the useful work?

Regardless, it probably isn't easy to do such measurements. I once tried 
using AMD's CodeAnalyst (I have an AMD CPU) but I didn't manage to get 
any useful data out of it; the software felt very clumsy and it wasn't 
obvious how to make it take into account the source code of the Python 
interpreter.

Regards

Antoine.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-25 Thread Antoine Pitrou
On Tue, 24 Nov 2009 22:08:19 +, Benjamin Peterson wrote:
> 
>> Would it be worth in-lining the remaining part of PyObject_IsTrue in
>> ceval?
> 
> Inlining by hand is prone to error and maintainability problems.

Which is why we like to do it :-))


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Paul Boddie
On 24 Nov, 19:25, Antoine Pitrou  wrote:
>
> Sorry, I have trouble parsing your sentence. Do you mean bytecode
> interpretation overhead is minimal compared to the cost of actual useful
> work, or the contrary?
> (IMO both are wrong by the way)

I'm referring to what you're talking about at the end. The
enhancements in Python 3 presumably came about after discussion of
"threaded interpreters", confirming that the evaluation loop in Python
2 was not exactly optimal.

> > I imagine that someone (or a number of people) must have profiled the
> > Python interpreter and shown how much time goes on the individual
> > bytecode implementations and how much goes on the interpreter's own
> > housekeeping activities.
>
> Well, one problem is that it's not easy to draw a line. Another
> problem is that it depends on the workload. If you are compressing large
> data or running expensive regular expressions the answer won't be the
> same as if you compute a Mandelbrot set in pure Python.

You need to draw the line between work done by system and external
libraries and that done by Python, but a breakdown of the time spent
executing each kind of bytecode instruction could be interesting.
Certainly, without such actual cost estimations, a simple counting of
bytecodes should hardly give an indication of how optimal some Python
code might be.
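
A quick illustration of why counting alone isn't enough: a statement like
"a + b" compiles to the same load/load/BINARY_ADD sequence whatever a and
b happen to be, yet the cost of that one BINARY_ADD varies enormously with
the operands. An untested sketch (absolute timings will of course vary):

import timeit

# identical bytecode for the timed statement, very different useful work
small = timeit.timeit('a + b', setup='a, b = 1, 2')
big = timeit.timeit('a + b', setup='a, b = 10**5000, 10**5000')
print 'small ints:', small
print 'big ints:  ', big

Counting opcodes would rate both loops identically; only the time spent
inside the opcode implementation tells them apart.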

Paul
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Terry Reedy

Chris Rebert wrote:

On Tue, Nov 24, 2009 at 4:31 AM, Rob Williscroft  wrote:

mk wrote in news:mailman.915.1259064240.2873.python-l...@python.org in
comp.lang.python:


def pythonic():
def unpythonic():



Decidedly counterintuitive: are there special optimizations for "if
nonevar:" type of statements in cpython implementation?


from dis import dis

dis( unpythonic )

18          31 LOAD_FAST                0 (nonevar)
            34 LOAD_CONST               0 (None)
            37 COMPARE_OP               9 (is not)
            40 JUMP_IF_FALSE            4 (to 47)

dis( pythonic )

11          31 LOAD_FAST                0 (nonevar)
            34 JUMP_IF_FALSE            4 (to 41)


In other words, CPython doesn't happen to optimize `if nonevar is not
None` as much as it theoretically could (which would essentially
require a JUMP_IF_NONE opcode). Since CPython isn't known for doing
fancy optimizations, this isn't surprising.


There is a limit of 256 bytecodes. I believe there are currently fewer. 
It would seem that adding bytecodes that combine current pairs would 
speed the interpreter, depending on how often the combined pair is used. 
And people have looked for frequent pairs across a corpus of code, and 
proposed additions.


However, additional bytecodes enlarge the basic bytecode eval loop, 
possibly (and sometimes actually) leading to more processor cache 
misses and slower execution. At least some proposed non-essential 
additions have been rejected for that reason.


Also, even though CPython-specific bytecodes are outside the language 
definition, and could theoretically be revamped every minor (x.y) 
release, there is a cost to change: rewriting the ast-to-bytecode 
compiler, rewriting dis, rewriting the dis doc, and burdening anyone 
concerned with multiple versions with remembering the differences. So in 
practice, change is minimized and unassigned bytecodes are left open for 
the future.


Optimizations that make better use of a fixed set of bytecodes are a 
different topic. Guido is conservative because historically, compiler 
optimizations are too often not exactly equivalent to naive, obviously 
correct code in some overlooked corner case, leading to subtle, 
occasional bugs. Of course, bytecode changes could mess up current 
optimizations that are performed.
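
The thread's own pair of tests makes a handy example of the corner cases
involved: "if x:" and "if x is not None:" look interchangeable in the
benchmark, but no correct optimizer could rewrite one into the other. A
small, hypothetical demonstration:

def truth_test(x):
    if x:                      # boolean context: None, 0 and '' are all false
        return 'taken'
    return 'not taken'

def identity_test(x):
    if x is not None:          # identity only: 0 and '' still count as "taken"
        return 'taken'
    return 'not taken'

for value in (None, 0, ''):
    print repr(value), truth_test(value), identity_test(value)

The two functions disagree for 0 and '', which is exactly the kind of
subtle behaviour change that makes such rewrites unsafe in general.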


Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Benjamin Peterson
Tim Wintle writes:
> Out of interest - has anyone else spotted that the call to
> PyObject_IsTrue in the XXX_JUMP_IF_ blocks performs two unnecessary
> pointer comparisons?

I doubt two pointer comparisons will make much of a difference.

> Would it be worth in-lining the remaining part of PyObject_IsTrue in
> ceval?

Inlining by hand is prone to error and maintainability problems.




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Tim Wintle
On Tue, 2009-11-24 at 18:25 +, Antoine Pitrou wrote:
> On Tue, 24 Nov 2009 08:58:40 -0800, Paul Boddie wrote:
> > As you
> > point out, a lot of this RISC vs. CISC analysis (and inferences
> > drawn
> > from Python bytecode analysis) is somewhat academic: the cost of the
> > JUMP_IF_FALSE instruction is likely to be minimal in the context of
> > all the activity going on to evaluate the bytecodes.
> 
> Sorry, I have trouble parsing your sentence. Do you mean bytecode 
> interpretation overhead is minimal compared to the cost of actual
> useful work, or the contrary?
> (IMO both are wrong by the way)

Out of interest - has anyone else spotted that the call to
PyObject_IsTrue in the XXX_JUMP_IF_ blocks performs two unnecessary
pointer comparisons?

 ceval.c 
if (w == Py_True) {
    Py_DECREF(w);
    FAST_DISPATCH();
}
if (w == Py_False) {
    Py_DECREF(w);
    JUMPTO(oparg);
    FAST_DISPATCH();
}
err = PyObject_IsTrue(w);
Py_DECREF(w);
.
.
.
==

 object.c 
PyObject_IsTrue(PyObject *v)
{
    Py_ssize_t res;
    if (v == Py_True)
        return 1;
    if (v == Py_False)
        return 0;
.
.
.
==

Would it be worth in-lining the remaining part of PyObject_IsTrue in
ceval?

> Another data point I've heard is that people who have tried a very
> crude form of Python-to-C compilation (generating the exact C code
> corresponding to a function or method, using Python's C API and
> preserving dynamicity without attempting to be clever) have apparently
> reached speedups of up to 50% (in other words, "twice as fast").

That's roughly what I get with Cython - which does exactly that.

Tim

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Antoine Pitrou
On Tue, 24 Nov 2009 08:58:40 -0800, Paul Boddie wrote:
> As you
> point out, a lot of this RISC vs. CISC analysis (and inferences drawn
> from Python bytecode analysis) is somewhat academic: the cost of the
> JUMP_IF_FALSE instruction is likely to be minimal in the context of all
> the activity going on to evaluate the bytecodes.

Sorry, I have trouble parsing your sentence. Do you mean bytecode 
interpretation overhead is minimal compared to the cost of actual useful 
work, or the contrary?
(IMO both are wrong by the way)

> I imagine that someone (or a number of people) must have profiled the
> Python interpreter and shown how much time goes on the individual
> bytecode implementations and how much goes on the interpreter's own
> housekeeping activities.

Well, one problem is that it's not easy to draw a line. Another 
problem is that it depends on the workload. If you are compressing large 
data or running expensive regular expressions the answer won't be the 
same as if you compute a Mandelbrot set in pure Python.

One data point is that the "computed gotos" option in py3k generally 
makes the interpreter faster by ~15%. Another data point I've heard is 
that people who have tried a very crude form of Python-to-C compilation 
(generating the exact C code corresponding to a function or method, using 
Python's C API and preserving dynamicity without attempting to be clever) 
have apparently reached speedups of up to 50% (in other words, "twice as 
fast"). So you could say that the interpretation overhead is generally 
between 15% and 50%.
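
To spell out the arithmetic behind that last parenthesis: a 50% cut in 
running time is the same thing as doubled speed:

t_interpreted = 1.0                  # normalised time with the eval loop
t_compiled = 0.5 * t_interpreted     # "speedups of up to 50%": time halves
print t_interpreted / t_compiled     # 2.0, i.e. "twice as fast"

In that best case roughly half of the running time was interpretation 
overhead rather than useful work, hence the upper bound above.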


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Paul Boddie
On 24 Nov, 16:11, Antoine Pitrou  wrote:
>

[JUMP_IF_FALSE]

> It tries to evaluate the op of the stack (here nonevar) in a boolean
> context (which theoretically involves calling __nonzero__ on the type)
> and then jumps if the result is False (rather than True).

[...]

> As someone pointed out, the Python interpreter could grow CISC-like
> opcodes so as to collapse "is not None" (or generically "is not
> <constant>") into a single JUMP_IF_IS_NOT_CONST opcode.

Of course, JUMP_IF_FALSE is already quite CISC-like, whereas testing
if something is not None could involve some fairly RISC-like
instructions: just compare the address of an operand with the address
of None. As you point out, a lot of this RISC vs. CISC analysis (and
inferences drawn from Python bytecode analysis) is somewhat academic:
the cost of the JUMP_IF_FALSE instruction is likely to be minimal in
the context of all the activity going on to evaluate the bytecodes.

I imagine that someone (or a number of people) must have profiled the
Python interpreter and shown how much time goes on the individual
bytecode implementations and how much goes on the interpreter's own
housekeeping activities. It would be interesting to see such figures.

Paul
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Antoine Pitrou
On Tue, 24 Nov 2009 15:11:29 +, Antoine Pitrou wrote:
> Hello,
> 
> On Tue, 24 Nov 2009 14:41:19 +0100, mk wrote:
>> 
>> As Rob pointed out (thanks):
>> 
>> 11          31 LOAD_FAST                0 (nonevar)
>>             34 JUMP_IF_FALSE            4 (to 41)
>> 
>> I'm no good at py compiler or implementation internals and so I have no
>> idea what bytecode "JUMP_IF_FALSE" is actually doing.
> 
> It tries to evaluate the op of the stack (here nonevar)

I meant "the top of the stack" obviously.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Neil Cerutti
On 2009-11-24, Antoine Pitrou  wrote:
> It tries to evaluate the op of the stack (here nonevar) in a
> boolean context (which theoretically involves calling
> __nonzero__ on the type) 

...or __bool__ in Py3K.

-- 
Neil Cerutti
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Antoine Pitrou

Hello,

On Tue, 24 Nov 2009 14:41:19 +0100, mk wrote:
> 
> As Rob pointed out (thanks):
> 
> 11          31 LOAD_FAST                0 (nonevar)
>             34 JUMP_IF_FALSE            4 (to 41)
> 
> I'm no good at py compiler or implementation internals and so I have no
> idea what bytecode "JUMP_IF_FALSE" is actually doing.

It tries to evaluate the op of the stack (here nonevar) in a boolean 
context (which theoretically involves calling __nonzero__ on the type) 
and then jumps if the result is False (rather than True).

You are totally right that it does /more/ than "is not None", but since 
it is executed as a single opcode rather than a sequence of several 
opcodes, the additional work it has to do is compensated (in this case) 
by the smaller overhead in bytecode interpretation.
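
To make the "boolean context" point concrete, here is a tiny hypothetical
example (Python 2.x) showing that "if x:" may end up calling __nonzero__
while "x is not None" never looks at the value at all:

class Chatty(object):
    def __nonzero__(self):            # __bool__ in Py3k
        print '__nonzero__ called'
        return True

c = Chatty()
if c:                                 # prints '__nonzero__ called'
    pass
if c is not None:                     # identity test only, prints nothing
    pass

None itself is special-cased early in PyObject_IsTrue (IIRC), so no such
call happens in mk's benchmark; the extra work there is just a couple of
pointer comparisons.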

As someone pointed out, the Python interpreter could grow CISC-like 
opcodes so as to collapse "is not None" (or generically "is not 
<constant>") into a single JUMP_IF_IS_NOT_CONST opcode. Actually, that is 
the kind of optimization wpython does (http://code.google.com/p/wpython/).

Regards

Antoine.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Rob Williscroft
mk wrote in news:mailman.923.1259070092.2873.python-l...@python.org in 
comp.lang.python:

> MRAB wrote:
>> In what way is it counterintuitive? In 'pythonic' the conditions are
>> simpler, less work is being done, therefore it's faster.
> 
> But the pythonic condition is more general: nonevar or zerovar can be 
> '', 0, or None. So I thought it was more work for the interpreter to 
> compare those, while I thought that "is not None" is translated into a 
> single, more low-level and faster action. Apparently not.
> 
> As Rob pointed out (thanks):
> 
> 11          31 LOAD_FAST                0 (nonevar)
>             34 JUMP_IF_FALSE            4 (to 41)
> 
> I'm no good at py compiler or implementation internals and so I have no 
> idea what bytecode "JUMP_IF_FALSE" is actually doing.

IIUC it implements:

http://docs.python.org/3.1/reference/expressions.html#boolean-operations

"In the context of Boolean operations, and also when expressions are used 
by control flow statements, the following values are interpreted as false: 
False, None, numeric zero of all types, and empty strings and containers 
(including strings, tuples, lists, dictionaries, sets and frozensets). All 
other values are interpreted as true. User-defined objects can customize 
their truth value by providing a __bool__() method."

In particular it's implementing "... Boolean operation ... used by 
control flow ...", all in one handy op code.
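
The quoted list is easy to verify interactively; every one of those values
is false in a boolean context, so JUMP_IF_FALSE jumps for all of them:

>>> [bool(x) for x in (False, None, 0, 0.0, '', (), [], {}, set(), frozenset())]
[False, False, False, False, False, False, False, False, False, False]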

Rob.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread mk

MRAB wrote:

In what way is it counterintuitive? In 'pythonic' the conditions are
simpler, less work is being done, therefore it's faster.


But the pythonic condition is more general: nonevar or zerovar can be 
'', 0, or None. So I thought it was more work for the interpreter to 
compare those, while I thought that "is not None" is translated into a 
single, more low-level and faster action. Apparently not.


As Rob pointed out (thanks):

11          31 LOAD_FAST                0 (nonevar)
            34 JUMP_IF_FALSE            4 (to 41)

I'm no good at py compiler or implementation internals and so I have no 
idea what bytecode "JUMP_IF_FALSE" is actually doing.


Regards,
mk


--
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Chris Rebert
On Tue, Nov 24, 2009 at 4:31 AM, Rob Williscroft  wrote:
> mk wrote in news:mailman.915.1259064240.2873.python-l...@python.org in
> comp.lang.python:
>
>>
>> def pythonic():
>
>> def unpythonic():
>
>
>> Decidedly counterintuitive: are there special optimizations for "if
>> nonevar:" type of statements in cpython implementation?
>>
>
> from dis import dis
>
> dis( unpythonic )
>
> 18          31 LOAD_FAST                0 (nonevar)
>             34 LOAD_CONST               0 (None)
>             37 COMPARE_OP               9 (is not)
>             40 JUMP_IF_FALSE            4 (to 47)
>
> dis( pythonic )
>
> 11          31 LOAD_FAST                0 (nonevar)
>             34 JUMP_IF_FALSE            4 (to 41)

In other words, CPython doesn't happen to optimize `if nonevar is not
None` as much as it theoretically could (which would essentially
require a JUMP_IF_NONE opcode). Since CPython isn't known for doing
fancy optimizations, this isn't surprising.

Cheers,
Chris
--
http://blog.rebertia.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread Rob Williscroft
mk wrote in news:mailman.915.1259064240.2873.python-l...@python.org in 
comp.lang.python:

> 
> def pythonic():
 
> def unpythonic():
 
 
> Decidedly counterintuitive: are there special optimizations for "if 
> nonevar:" type of statements in cpython implementation?
> 

from dis import dis

dis( unpythonic )

18          31 LOAD_FAST                0 (nonevar)
            34 LOAD_CONST               0 (None)
            37 COMPARE_OP               9 (is not)
            40 JUMP_IF_FALSE            4 (to 47)

dis( pythonic )

11          31 LOAD_FAST                0 (nonevar)
            34 JUMP_IF_FALSE            4 (to 41)
 
Rob.
 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pointless musings on performance

2009-11-24 Thread MRAB

mk wrote:

#!/usr/local/bin/python

import timeit


def pythonic():
    nonevar = None
    zerovar = 0
    for x in range(100):
        if nonevar:
            pass
        if zerovar:
            pass

def unpythonic():
    nonevar = None
    zerovar = 0
    for x in range(100):
        if nonevar is not None:
            pass
        if zerovar > 0:
            pass

for f in [pythonic, unpythonic]:
    print f.func_name, timeit.timeit(f, number=10)



# ./t.py
pythonic 2.13092803955
unpythonic 2.82064604759

Decidedly counterintuitive: are there special optimizations for "if 
nonevar:" type of statements in cpython implementation?



In what way is it counterintuitive? In 'pythonic' the conditions are
simpler, less work is being done, therefore it's faster.
--
http://mail.python.org/mailman/listinfo/python-list