On 13 March 2012 09:19, Travis Oliphant <tra...@continuum.io> wrote:
>
> On Mar 13, 2012, at 12:58 AM, Dag Sverre Seljebotn wrote:
>
> On 03/10/2012 10:35 PM, Travis Oliphant wrote:
>
> > Hey all,
> >
> > I gave a lightning talk this morning on numba, which is the start of a
> > Python compiler to machine code through the LLVM tool-chain. It is at
> > the proof-of-concept stage only (use it only if you are interested
> > in helping develop the code at this point). The only thing that works is
> > a fast-vectorize capability on a few functions (without for-loops). But
> > it shows how to create functions in Python that can be used by the NumPy
> > runtime in various ways. Several NEPs that will be discussed in the
> > coming months will use this concept.
> >
> > Right now there is very little design documentation, but I will be
> > adding some in the days ahead, especially if I get people who are
> > interested in collaborating on the project. I did talk to Fijal and Alex
> > of the PyPy project at PyCon and they both graciously suggested that I
> > look at some of the PyPy code which walks the byte-code and does
> > translation to their intermediate representation for inspiration.
> >
> > Again, the code is not ready for use, it is only proof of concept, but I
> > would like to get feedback and help, especially from people who might
> > have written compilers before. The code lives at:
> > https://github.com/ContinuumIO/numba
>
> Hi Travis,
>
> Mark F. and I have been talking today about whether some of numba and
> Cython development could overlap -- not right away, but in the sense
> that if Cython gets some features for optimization of numerical code,
> then we make it easy for numba to reuse that functionality.
>
> That would be very, very interesting.
>
> This may be sort of off-topic re: the above -- but part of the goal of
> this post is to figure out numba's intended scope. If there isn't an
> overlap, that's good to know in itself.
>
> Question 1: Did you look at Clyther and/or Copperhead? Though similar,
> they target GPUs... but at first glance they look as though they may be
> parsing Python bytecode to get their ASTs... (didn't check though)
>
> I have looked at both projects, although Clyther more in depth. Clyther is
> parsing bytecode to get the AST (through a sub-project by the same author
> called Meta: http://srossross.github.com/Meta/html/index.html).
>
> Question 2: What kind of performance are you targeting -- in the short
> term, and in the long term? Is competing with "Fortran-level"
> performance a goal at all?
>
> In the short term, I'm targeting C-equivalent performance (like weave). In
> the long term, I'm targeting optimized high-level expressions (i.e.
> Fortran-level) with GPU and multi-core.
>
> E.g., for ufunc computations with different iteration orders such
> as "a + b.T" (a and b in C-order), one must do blocking to get good
> performance. And when dealing with strided arrays, copying small chunks
> at a time will sometimes help performance (and sometimes not).
>
> These are optimization strategies which (as I understand it) are quite
> beyond what NumPy iterators etc. can provide.
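
To make the blocking point for "a + b.T" concrete, here is a minimal sketch in plain Python. It assumes square n x n arrays stored as flat, C-ordered lists; the function names and the `block` parameter are made up for illustration and are not part of NumPy, numba or Cython. Pure Python will not show the cache benefit, so this only illustrates the traversal order a blocking code generator would emit.

```python
def add_transposed_naive(a, b, n):
    """out[i, j] = a[i, j] + b[j, i] for flat, C-ordered n x n buffers."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            # a is read contiguously, b is read with stride n.
            out[i * n + j] = a[i * n + j] + b[j * n + i]
    return out


def add_transposed_blocked(a, b, n, block=32):
    """Same result, visiting the index space one block x block tile at a time."""
    out = [0.0] * (n * n)
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            # Inside a tile, reads of a and b each touch at most `block`
            # rows, so the strided accesses stay in a small working set.
            for i in range(i0, min(i0 + block, n)):
                for j in range(j0, min(j0 + block, n)):
                    out[i * n + j] = a[i * n + j] + b[j * n + i]
    return out


n = 5
a = [float(k) for k in range(n * n)]
b = [float(10 * k) for k in range(n * n)]
assert add_transposed_blocked(a, b, n, block=2) == add_transposed_naive(a, b, n)
```

The tile size would normally be tuned to the cache; 32 here is an arbitrary placeholder.
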
As for blocking, this could be done by the numpy iterators themselves,
by simply introducing more dimensions with appropriate shape and strides
(not saying that's a solution :).

> And the LLVM level could
> be too low -- one has quite a lot of information when generating the
> ufunc/reduction/etc. that would be thrown away when generating LLVM
> code.
>
> It doesn't need to be thrown away at all. It could be used to generate
> appropriate code for the arrays being used. The long-term idea is to
> actually be aware of NumPy arrays and encourage expression of high-level
> constructs which generate optimized code using chunking, blocking, AVX
> instructions, multiple threads, etc.
>
> To do this, it may make more sense to actually emit OpenMP (unless LLVM
> grows standard threading intrinsics). This is not out of the question.

That would be interesting. My experience with OpenMP is that the
standard doesn't define (ironically enough) the use of OpenMP in the
context of threading, and indeed, trying to use OpenMP outside of the
main thread simply segfaults your program. If llvm were to get such
features, one must be prepared to make the OpenMP runtime thread-safe as
well (hopefully it will be in the first place, as I believe Intel's
implementation is).

> Vectorizing compilers do their best to reconstruct this
> information; I know nothing about what actually exists here for
> LLVM. They are certainly a lot more complicated to implement and work
> with than making use of the higher-level information available before
> code generation.
>
> The idea we've been playing with is for Cython to define a limited
> subset of its syntax tree (essentially the "GIL-less" subset) separate
> from the rest of Cython, with a more well-defined API for optimization
> passes etc., and targeted for a numerical optimization pipeline.
>
> This subset would actually be pretty close to what numba needs to
> compile, even if the overlap isn't perfect.
> So such a pipeline could
> possibly be shared between Cython and numba, even if Cython would use
> it at compile-time and numba at runtime, and even if the code
> generation backend is different (the code generation backend is
> probably not the hard part...). To be concrete, the idea is:
>
> (Cython|numba) -> high-level numerical compiler and
> loop-structure/blocking optimizer (by us on a shared parse tree
> representation) -> (LLVM/C/OpenCL) -> low-level optimization (by the
> respective compilers)
>
> Some algorithms that could be shareable are iteration strategies
> (already in NumPy though), blocking strategies, etc.
>
> Even if this may be beyond numba's (and perhaps Cython's) current
> ambition, it may be worth thinking about, if nothing else then just
> for how Cython's code should be structured.
>
> This kind of collaboration would be very nice. I agree, there might be some
> kind of intermediate representation that would be good for both projects.
>
> -Travis
>
> (Mark F., how does the above match how you feel about this?)

I would like collaboration, but from a technical perspective I think
this would be much more involved than just dumping the AST to an IR and
generating some code from there. For vector expressions I think sharing
code would be more feasible than for arbitrary (parallel) loops, etc.
Cython as a compiler can make many decisions that a Python (bytecode)
compiler can't make (at least without annotations and a well-defined
subset of the language (not so much the syntax as the semantics)). I
think in numba, if parallelism is to be supported, you will want a
prange-like construct, as proving independence between iterations can be
very hard to near impossible for a compiler.
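
A rough sketch of what "prange-like" means here, in plain Python with the standard library only. The helper `prange_map` is hypothetical (it is not numba or Cython API); it simply takes the programmer's word that the iterations are independent, which is the whole point of such a construct. In CPython the GIL means this gives no speedup for pure-Python bodies; it only illustrates the contract.

```python
from concurrent.futures import ThreadPoolExecutor


def prange_map(body, n, workers=4):
    """Hypothetical helper: run body(i) for i in range(n) on a thread pool.

    The caller asserts that iterations are independent; nothing is checked.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in index order, whatever the
        # order in which the iterations actually ran.
        return list(pool.map(body, range(n)))


x = [1.0, 2.0, 3.0, 4.0]

# Independent iterations: each result depends only on x[i], so any
# execution order gives the same answer.
squares = prange_map(lambda i: x[i] * x[i], len(x))

# Loop-carried dependency: iteration i needs the total from iteration
# i - 1, so this loop must stay sequential as written.
running = []
total = 0.0
for value in x:
    total += value
    running.append(total)
```

A compiler would have to prove that the first loop is safe to parallelize; with an explicit construct, the programmer states it.
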
As for code generation, I'm not sure how llvm would do things like
slicing arrays, reshaping, resizing etc. (for vector expressions you can
first evaluate all slicing and indexing operations and then compile the
remaining vector expression), but for loops and array reassignment
within loops this would have to invoke the actual slicing code from the
llvm code (I presume). There are many other things, like bounds
checking, wraparound, etc., that are all supported in both numpy and
Cython, but going through an llvm layer would, as far as I can see,
require re-implementing those, at least if you want top-notch
performance. Personally, I think for non-trivial performance-critical
code (for loops with indexing, slicing, function calls, etc.) Cython is
a better target.

So for vector expressions I think Cython and Numba could work together
by specifying AST transformations that operate on vector expressions.
For the purposes of Cython it would go from the Cython AST to the IR and
after transformations either back to the Cython AST, or directly to
llvm. For Cython, going from that code to llvm is not necessarily more
useful than C and OpenCL, as you will know the types anyway at compile
time and you can immediately exploit multicore as well as SIMD
parallelism.

In the face of blocking and chunking etc., certain specializations may
be created in advance for Cython, or it could even generate a C version
(+ openmp + auto-vectorization-appeasing pragmas), an OpenCL version for
the CPU and possibly a different one for the GPU, and a numba + numba IR
version, i.e. feed the IR at runtime to numba and have it compile to
llvm. If the compiler additionally fuses vector expressions together,
this will be even more powerful.

Finally, as for non-vector-expression code, I really believe Cython is a
better target.
cython.inline can have high overhead (at least the first time it has to
compile), but with better (numpy-aware) type inference or profile-guided
optimizations (see recent threads on the cython-dev mailing list), in
addition to things like prange, I personally believe Cython targets most
of the use cases where numba would be able to generate well-performing
code.

> Dag
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion