2010/1/29 Nick Coghlan <ncogh...@gmail.com>

> I wouldn't consider changing from bytecode to wordcode uncontroversial -
> the potential to have an effect on cache hit ratios means it needs to be
> benchmarked (the U-S performance tests should be helpful there).
>

It's quite strange, but from the tests made it seems that wpython perform
better with old architectures (such as my Athlon64 socket 754), which have
less resources like caches.

It'll be interesting to check how it works on more limited ISAs. I'm
especially curious about ARMs.


> It's the same basic problem where any changes to the ceval loop can have
> surprising performance effects due to the way they affect the compiled
> switch statements ability to fit into the cache and other low level
> processor weirdness.
>
> Cheers,
> Nick.
>

Sure, but consider that with wpython wordcodes require less space on
average. Also, less instructions are executed inside the ceval loop, thanks
to some natural instruction grouping.

For example, I recently introduced in wpython 1.1 a new opcode to handle
more efficiently expression generators. It's mapped as a unary operator, so
it exposes interesting properties which I'll show you with an example.

def f(a):
    return sum(x for x in a)

With CPython 2.6.4 it generates:

  0 LOAD_GLOBAL 0 (sum)
  3 LOAD_CONST 1 (<code object <genexpr> at 00512EC8, file "<stdin>", line
1>)
  6 MAKE_FUNCTION 0
  9 LOAD_FAST 0 (a)
12 GET_ITER
13 CALL_FUNCTION 1
16 CALL_FUNCTION 1
19 RETURN_VALUE

With wpython 1.1:

0 LOAD_GLOBAL 0 (sum)
1 LOAD_CONST 1 (<code object <genexpr> at 01F13208, file "<stdin>", line 1>)
2 MAKE_FUNCTION 0
3 FAST_BINOP get_generator a
5 QUICK_CALL_FUNCTION 1
6 RETURN_VALUE

The new opcode is GET_GENERATOR, which is equivalent (but more efficient,
using a faster internal function call) to:

GET_ITER
CALL_FUNCTION 1

The compiler initially generated the following opcodes:

LOAD_FAST 0 (a)
GET_GENERATOR

then the peepholer recognized the pattern UNARY(FAST), and produced the
single opcode:

FAST_BINOP get_generator a

In the end, the ceval loop executes a single instruction instead of three.
The wordcode requires 14 bytes to be stored instead of 20, so it will use 1
data cache line instead of 2 on CPUs with 16 bytes lines data cache.

The same grouping behavior happens with binary operators as well. Opcodes
aggregation is a natural and useful concept with the new wordcode structure.

Cheers,
Cesare
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to