Mark Shannon <m...@hotpy.org> wrote:
Dag Sverre Seljebotn wrote:
from numpy import sin
# assume sin is a Python callable and that NumPy decides to support
# our spec to also support getting a "double (*sinfuncptr)(double)".

# Our mission: avoid having the user manually import "sin" from C,
# but allow just using the NumPy object and still be fast.

# define a function to integrate
cpdef double f(double x):
    return sin(x * x) # guess on signature and use "fastcall"!

# the integrator
def integrate(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    cdef int i  # typed loop index so Cython can generate a plain C loop
    for i in range(n):
        # This is also a fastcall, but can be cached so doesn't
        # matter...
        s += func(a + i * dx)
    return s * dx

integrate(f, 0, 1, 1000000)

There are two problems here:

 - The "sin" global can be reassigned (monkey-patched) between each call
to "f", with no way for "f" to know. Even "sin" itself could do the
reassignment. So you'd need to check for reassignment to do caching...
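The hazard is easy to show in plain Python; the module below is hypothetical, standing in for numpy in the example above (any module-level name behaves the same way):

```python
import math
import types

# Hypothetical module standing in for "numpy" in the example above.
mod = types.ModuleType("fakemod")
mod.sin = math.sin

def f(x):
    # Resolves mod.sin on *every* call. A cached fast-path would
    # have to detect the reassignment below to stay correct.
    return mod.sin(x * x)

before = f(2.0)      # uses math.sin
mod.sin = math.cos   # monkey-patch the module global
after = f(2.0)       # silently uses math.cos now

assert before == math.sin(4.0)
assert after == math.cos(4.0)
```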

Since Cython allows static typing, why not just declare that func can
treat sin as if it can't be monkey-patched?

If you want to manually declare stuff, you can always use a C function pointer too...
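For comparison, here is a sketch of that manual route in pure Python, using ctypes to grab C's sin as a raw "double (*)(double)" pointer (library lookup details are platform-dependent; the fallback name assumes glibc on Linux, and in Cython you would cdef the pointer type directly):

```python
import ctypes
import ctypes.util
import math

# Load the C math library; "libm.so.6" is a glibc/Linux fallback.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the signature: double sin(double)
c_sin = libm.sin
c_sin.restype = ctypes.c_double
c_sin.argtypes = [ctypes.c_double]

# The pointer is bound once, up front: no per-call name lookup,
# but also no way to pick up a monkey-patched replacement.
assert abs(c_sin(1.0) - math.sin(1.0)) < 1e-12
```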

Moving the load of a global variable out of the loop does seem to be a
rather obvious optimisation, if it were declared to be legal.

In case you didn't notice, there were no global-variable loads inside the loop...

You can keep chasing this, but there are *always* cases where such optimizations don't apply (and you need to save the situation by manual typing).

Anyway: we should really discuss Cython on the Cython list. If my motivating example wasn't good enough for you, there's really nothing I can do.

Some rough numbers:

 - The tp_flags hack adds about 2 ns of overhead (something similar
with a metaclass; the problem there is more how to synchronize that
metaclass across multiple 3rd-party libraries)

Does your approach handle subtyping properly?

Not really.


 - Dict lookup 20 ns

Did you time _PyType_Lookup() ?

No, I didn't get around to it yet (thanks for pointing it out). (Though the GIL requirement is an issue for Cython too.)
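For what it's worth, the dict-lookup figure is easy to re-measure with timeit (numbers are machine-dependent; the point is only the order of magnitude relative to the ~35 ns sin call):

```python
import timeit

d = {"sin": object()}
n = 1_000_000

# Cost of one plain dict hit, in nanoseconds
per_lookup_ns = timeit.timeit("d['sin']", globals={"d": d}, number=n) / n * 1e9

# Cost of one call to math.sin, for comparison
per_sin_ns = timeit.timeit("sin(0.5)", setup="from math import sin", number=n) / n * 1e9

print(f"dict lookup: {per_lookup_ns:.1f} ns, sin call: {per_sin_ns:.1f} ns")
```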

 - The sin function is about 35 ns, and "f" is probably only 2-3 ns;
there could very easily be multiple such functions, defined in
different modules, chained together in order to build up a formula.


Such micro-timings are meaningless, because the working set often tends
to fit in the hardware cache. A level 2 cache miss can take 100s of
cycles.

I find this sort of response arrogant -- do you know the details of every use case for a programming language under the sun?

Many Cython users are scientists, and in scientific computing in particular you *really* have the whole range of problems and working sets. Honestly.

In some codes you only really care about the speed of the disk controller. In other cases you can spend *many seconds* working almost only in L1 or perhaps L2 cache (for instance when integrating ordinary differential equations in a few variables, which is not entirely different in nature from the example I posted). Then, those many seconds are replicated many million times for different parameters on a large cluster, and a 2x speedup translates directly into large amounts of saved money.

Also, with numerical codes you block up the problem so that loads to L2 are amortized over sufficient FLOPs (when you can).

Every time Cython becomes able to do stuff more easily in this domain, people thank us that they didn't have to dig up Fortran but can stay closer to Python.

Sorry for going off on a rant. I find that people will give well-meant advice about performance, but that advice is often just generalizing from computer programs in entirely different domains (web apps?), and sweeping generalizations have a way of giving the wrong answer.

Dag
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev