On 13 March 2012 09:19, Travis Oliphant <tra...@continuum.io> wrote:
>
> On Mar 13, 2012, at 12:58 AM, Dag Sverre Seljebotn wrote:
>
> On 03/10/2012 10:35 PM, Travis Oliphant wrote:
>
> Hey all,
>
> I gave a lightning talk this morning on numba, which is the start of a
> Python compiler to machine code through the LLVM tool-chain. It is at
> the proof-of-concept stage only (use it only if you are interested in
> helping develop the code at this point). The only thing that works is a
> fast-vectorize capability on a few functions (without for-loops), but it
> shows how functions created in Python can be used by the NumPy runtime
> in various ways. Several NEPs that will be discussed in the coming
> months will use this concept.
>
> Right now there is very little design documentation, but I will be
> adding some in the days ahead, especially if I get people who are
> interested in collaborating on the project. I did talk to Fijal and Alex
> of the PyPy project at PyCon, and they both graciously suggested that I
> look at some of the PyPy code which walks the byte-code and does
> translation to their intermediate representation for inspiration.
>
> Again, the code is not ready for use, it is only proof of concept, but I
> would like to get feedback and help, especially from people who might
> have written compilers before. The code lives at:
>
> https://github.com/ContinuumIO/numba
>
> Hi Travis,
>
> Mark F. and I have been talking today about whether some of the numba and
> Cython development could overlap -- not right away, but in the sense
> that if Cython gets some features for optimization of numerical code,
> numba could then easily reuse that functionality.
>
>
> That would be very, very interesting.
>
>
> This may be sort of off-topic re: the above-- but part of the goal of
> this post is to figure out numba's intended scope. If there isn't an
> overlap, that's good to know in itself.
>
> Question 1: Did you look at Clyther and/or Copperhead? Though similar,
> they target GPUs...but at first glance they look as though they may be
> parsing Python bytecode to get their ASTs... (didn't check though)
>
>
> I have looked at both projects, although Clyther in more depth.  Clyther is
> parsing bytecode to get the AST (through a sub-project by the same author
> called Meta:  http://srossross.github.com/Meta/html/index.html).
>
>
> Question 2: What kind of performance are you targeting -- in the short
> term, and in the long term? Is competing with "Fortran-level"
> performance a goal at all?
>
>
> In the short term, I'm targeting C-equivalent performance (like weave).  In
> the long term, I'm targeting optimized high-level expressions (i.e.
> Fortran-level) with GPU and multi-core support.
>
>
> E.g., for ufunc computations with different iteration orders such
> as "a + b.T" (a and b in C order), one must do blocking to get good
> performance. And when dealing with strided arrays, copying small chunks
> at a time will sometimes help performance (and sometimes not).
>
> These are optimization strategies which (as I understand it) are quite
> beyond what NumPy iterators etc. can provide.
>

As for blocking, this could be done by the numpy iterators themselves,
by simply introducing more dimensions with appropriate shape and
strides (not saying that's a solution :).
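
To make that concrete, here is a plain-NumPy sketch of what I mean: the
blocking falls out purely of viewing the operands as a grid of tiles
via extra dimensions and strides. It assumes square tiles that divide
both array dimensions evenly, which a real implementation of course
could not assume.

    # Plain-NumPy sketch, not numba/Cython output: cache blocking for
    # "a + b.T" expressed purely through extra dimensions and strides.
    # Assumes `block` divides both dimensions evenly and b.shape == (m, n).
    import numpy as np
    from numpy.lib.stride_tricks import as_strided

    def blocked_add_transpose(a, b, block=64):
        n, m = a.shape
        out = np.empty_like(a)

        def tiles(x):
            # View x as an (n//block, m//block) grid of (block, block) tiles.
            return as_strided(
                x,
                shape=(n // block, m // block, block, block),
                strides=(x.strides[0] * block, x.strides[1] * block,
                         x.strides[0], x.strides[1]))

        a4, b4, o4 = tiles(a), tiles(b.T), tiles(out)
        for i in range(n // block):
            for j in range(m // block):
                # Each tile of b.T is small enough to stay in cache while
                # it is traversed in the "wrong" memory order.
                o4[i, j] = a4[i, j] + b4[i, j]
        return out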

>
>
> And the LLVM level could
> be too low -- one has quite a lot of information when generating the
> ufunc/reduction/etc. that would be thrown away when generating LLVM
> code.
>
>
> It doesn't need to be thrown away at all.   It could be used to generate
> appropriate code for the arrays being used.   The long-term idea is to
> actually be aware of NumPy arrays and encourage expression of high-level
> constructs which generate optimized code using chunking, blocking, AVX
> instructions, multiple threads, etc.
>
> To do this, it may make more sense to actually emit OpenMP (unless LLVM
> grows standard threading intrinsics).   This is not out of the question.

That would be interesting. My experience with OpenMP is that the
standard (ironically enough) doesn't define the use of OpenMP in the
context of threading, and indeed, trying to use OpenMP outside of the
main thread simply segfaults your program. If LLVM were to get such
features, one must be prepared to make the OpenMP runtime thread-safe
as well (hopefully it will be thread-safe in the first place, as I
believe Intel's implementation is).

> Vectorizing compilers do their best to reconstruct this
> information; I know nothing about what actually exists here for
> LLVM. They are certainly a lot more complicated to implement and work
> with than making use of the higher-level information available before
> code generation.
>
> The idea we've been playing with is for Cython to define a limited
> subset of its syntax tree (essentially the "GIL-less" subset) separate
> from the rest of Cython, with a more well-defined API for optimization
> passes etc., and targeted for a numerical optimization pipeline.
>
> This subset would actually be pretty close to what numba needs to
> compile, even if the overlap isn't perfect. So such a pipeline could
> possibly be shared between Cython and numba, even if Cython would use
> it at compile-time and numba at runtime, and even if the code
> generation backend is different (the code generation backend is
> probably not the hard part...). To be concrete, the idea is:
>
>
>
> (Cython|numba) -> high-level numerical compiler and
> loop-structure/blocking optimizer (by us on a shared parse tree
> representation) -> (LLVM/C/OpenCL) -> low-level optimization (by the
> respective compilers)
>
> Some algorithms that could be shareable are iteration strategies
> (already in NumPy though), blocking strategies, etc.
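
(To make the shared representation a bit more concrete, here is a
strawman of what such a shared vector-expression tree could look like.
None of these class names exist in Cython or numba today; they are
invented purely for illustration.)

    # Hypothetical sketch of a tiny backend-agnostic vector-expression IR
    # that either frontend could build and hand to a shared optimizer.
    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class ArrayRef:
        name: str      # source-level variable name
        dtype: str     # e.g. "float64"
        ndim: int

    @dataclass
    class BinOp:
        op: str                          # "+", "*", ...
        lhs: Union["ArrayRef", "BinOp"]
        rhs: Union["ArrayRef", "BinOp"]

    @dataclass
    class ElementwiseKernel:
        # One fused elementwise loop: the unit a backend would lower to
        # LLVM, C + OpenMP, or OpenCL.
        target: ArrayRef
        expr: BinOp

    # out = a * b + c as a single fused kernel:
    kernel = ElementwiseKernel(
        target=ArrayRef("out", "float64", 1),
        expr=BinOp("+",
                   BinOp("*", ArrayRef("a", "float64", 1),
                              ArrayRef("b", "float64", 1)),
                   ArrayRef("c", "float64", 1)))
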
>
> Even if this may be beyond numba's (and perhaps Cython's) current
> ambition, it may be worth thinking about, if nothing else then just
> for how Cython's code should be structured.
>
>
> This kind of collaboration would be very nice.  I agree, there might be some
> kind of intermediate representation that would be good for both projects.
>
> -Travis
>
>
>
> (Mark F., how does the above match how you feel about this?)

I would like to collaborate, but from a technical perspective I think
this would be much more involved than just dumping the AST to an IR
and generating some code from there. For vector expressions I think
sharing code would be more feasible than for arbitrary (parallel)
loops, etc. Cython, as an ahead-of-time compiler, can make many
decisions that a Python (bytecode) compiler can't make, at least
without annotations and a well-defined subset of the language (not so
much the syntax as the semantics). I think that in numba, if
parallelism is to be supported, you will want a prange-like construct,
as proving independence between iterations can be anywhere from very
hard to near-impossible for a compiler; a small illustration of the
problem follows below.
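
To illustrate why I think an explicit construct is needed (plain Python
pseudocode, not numba or Cython syntax):

    # Every iteration writes a distinct out[i] from a[i] and b[i] only,
    # so a prange-like annotation could legally run them in parallel.
    def independent(a, b, out):
        for i in range(len(out)):
            out[i] = a[i] + b[i]

    # Here out[i] depends on out[i - 1]: parallelizing this loop silently
    # changes the result. Telling these two cases apart automatically
    # (especially with indirect indexing like out[idx[i]]) is what makes
    # auto-parallelization so hard; a prange puts the burden on the user.
    def prefix_sum(a, out):
        out[0] = a[0]
        for i in range(1, len(out)):
            out[i] = out[i - 1] + a[i]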

As for code generation, I'm not sure how LLVM would handle things like
slicing, reshaping and resizing arrays (for vector expressions you can
first evaluate all slicing and indexing operations and then compile the
remaining vector expression), but for loops and array reassignment
within loops this would have to invoke the actual slicing code from the
LLVM code (I presume). There are many other things, like bounds
checking, wraparound, etc., that are all supported in both NumPy and
Cython, but going through an LLVM layer would, as far as I can see,
require re-implementing those, at least if you want top-notch
performance. Personally, I think that for non-trivial
performance-critical code (for loops with indexing, slicing, function
calls, etc.) Cython is a better target.
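
What I mean by evaluating the slicing first, as a runnable sketch
(compiled_kernel is a stand-in for whatever a numba or Cython backend
would emit; here it is plain NumPy so the example works as-is):

    import numpy as np

    def compiled_kernel(x, y):
        # placeholder for generated code that only sees strided buffers
        return 2.0 * x + y

    def evaluate(a, b):
        # Slicing and transposing only create views (shape/strides
        # bookkeeping, no copies, no per-element bounds checks)...
        x = a[1:-1, ::2]
        y = b.T[1:-1, ::2]
        # ...so the compiled part never needs to understand slice syntax,
        # wraparound or bounds checking: it just receives strided arrays.
        return compiled_kernel(x, y)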

So for vector expressions I think Cython and numba could work together
by specifying AST transformations that operate on vector expressions.
For Cython's purposes it would go from the Cython AST to the IR and,
after transformations, either back to the Cython AST or directly to
LLVM. For Cython, going from that code to LLVM is not necessarily more
useful than going to C or OpenCL, as you know the types at compile time
anyway and you can immediately exploit multi-core as well as SIMD
parallelism. In the face of blocking and chunking etc., certain
specializations could be created in advance for Cython, or it could
even generate a C version (+ OpenMP + pragmas to appease the
auto-vectorizer), an OpenCL version for the CPU and possibly a
different one for the GPU, and a numba + numba-IR version, i.e. feed
the IR to numba at runtime and have it compile to LLVM. If the
compiler additionally fuses vector expressions together, this becomes
even more powerful.
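
As a toy example of the kind of analysis such a transformation pass
would start with (invented for this mail, not existing Cython or numba
code): walk an expression tree and decide whether it is a pure
elementwise combination of known array names, i.e. something the shared
IR could fuse into a single loop.

    import ast

    ELEMENTWISE_OPS = (ast.Add, ast.Sub, ast.Mult, ast.Div)

    def is_fusable(expr_src, array_names):
        tree = ast.parse(expr_src, mode="eval")

        def check(node):
            if isinstance(node, ast.BinOp) and isinstance(node.op, ELEMENTWISE_OPS):
                return check(node.left) and check(node.right)
            if isinstance(node, ast.Name):
                return node.id in array_names
            if isinstance(node, ast.Constant):
                return isinstance(node.value, (int, float))
            return False   # calls, slicing, ...: leave to the general path

        return check(tree.body)

    # is_fusable("a * b + 2.0 * c", {"a", "b", "c"})  -> True  (one fused kernel)
    # is_fusable("a[1:] + f(b)",    {"a", "b"})       -> False (general path)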

Finally, as for non-vector-expression code, I really believe Cython is
a better target. cython.inline can have high overhead (at least the
first time it has to compile), but with better (NumPy-aware) type
inference or profile-guided optimizations (see recent threads on the
cython-dev mailing list), in addition to things like prange, I
personally believe Cython covers most of the use cases where numba
would be able to generate performant code.
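
For reference, the kind of usage I have in mind, and where the
first-call overhead shows up (caching behaviour as I understand it;
details may differ between Cython versions):

    import cython

    def scaled_sum(a, b, factor=2.0):
        # The first call compiles the snippet to an extension module and
        # caches it on disk, which can take seconds; subsequent calls with
        # the same argument types reuse the cached module, so the overhead
        # drops to roughly a lookup plus the call itself.
        return cython.inline("return factor * (a + b)",
                             a=a, b=b, factor=factor)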

> Dag
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
