On 03/20/2012 09:20 PM, Dag Sverre Seljebotn wrote:
> On 03/20/2012 12:56 PM, Francesc Alted wrote:
>> On Mar 20, 2012, at 2:29 PM, Dag Sverre Seljebotn wrote:
>>> Francesc Alted <franc...@continuum.io> wrote:
>>>
>>>> On Mar 20, 2012, at 12:49 PM, mark florisson wrote:
>>>>>> Cython and Numba certainly overlap. However, Cython requires:
>>>>>>
>>>>>> 1) learning another language
>>>>>> 2) creating an extension module --- loading bit-code files
>>>>>> and dynamically executing (even on a different machine from the
>>>>>> one that initially created them) can be a powerful alternative
>>>>>> for run-time compilation and distribution of code.
>>>>>>
>>>>>> These aren't show-stoppers, obviously. But I think some users
>>>>>> would prefer an even simpler approach to getting fast code than
>>>>>> Cython (which currently doesn't do enough type inference and
>>>>>> requires building a dlopen extension module).
>>>>>
>>>>> Dag and I have been discussing this at PyCon, and here is my take
>>>>> on it (at this moment :).
>>>>>
>>>>> Definitely, if you can avoid Cython then that is easier and more
>>>>> desirable in many ways. So perhaps we can create a third project
>>>>> called X (I'm not very creative, maybe ArrayExprOpt), that takes
>>>>> an abstract syntax tree in a rather simple form, performs code
>>>>> optimizations such as rewriting loops with array accesses to
>>>>> vector expressions, fusing vector expressions and loops, etc.,
>>>>> and spits out a transformed AST containing these optimizations.
>>>>> If runtime information is given, such as actual shape and stride
>>>>> information, the transformations could figure out there and then
>>>>> whether to do things like collapsing, axis swapping, blocking (as
>>>>> in, introducing more axes or loops to retain discontiguous blocks
>>>>> in the cache), blocked memory copies to contiguous chunks, etc.
>>>>> The AST could then also say whether the final expressions are
>>>>> vectorizable. Part of this functionality is already in numpy's
>>>>> nditer, except that this would be implicit and do more (and
>>>>> hopefully with minimal overhead).
>>>>>
>>>>> So numba, Cython and maybe numexpr could use the functionality,
>>>>> simply by building the AST from Python and converting back (if
>>>>> necessary) to its own AST. As such, the AST optimizer would be
>>>>> only part of any (runtime) compiler's pipeline, and it should be
>>>>> very flexible, retaining any information (metadata regarding
>>>>> actual types, control flow information, etc.) provided by the
>>>>> original AST. It would not do control flow analysis, type
>>>>> inference or promotion, etc., but only deal with abstract types
>>>>> like integers, reals and arrays (C, Fortran or partly contiguous
>>>>> or strided). It would not deal with objects, but would allow
>>>>> inserting nodes like UnreorderableNode and SideEffectNode
>>>>> wrapping parts of the original AST. In short, it should be as
>>>>> easy as possible to convert from an original AST to this
>>>>> project's AST and back again afterwards.
>>>>
>>>> I think this is a very interesting project, and certainly projects
>>>> like numba can benefit from it. So, in order for us to get an idea
>>>> of what you are after, can we assume that your project (call it X)
>>>> would be a kind of compiler optimizer, and that the produced,
>>>> optimized code could then be fed into numba for optimized LLVM
>>>> code generation (which, in turn, can be run on top of CPUs or GPUs
>>>> or a combination)? Is that correct?
>>>
>>> I think so. Another way of thinking about it is that it is a
>>> reimplementation of the logic in the (good and closed-source)
>>> Fortran 90 compilers, as a reusable component for inclusion in
>>> various compilers.
>>>
>>> Various C++ metaprogramming libraries (like Blitz++) are similar
>>> too.
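
(For concreteness, a purely hypothetical sketch in plain Python of what
such a "rather simple form" of AST, together with one of the
transformations Mark mentions (collapsing the axes of a C-contiguous
operand), might look like; the names ArrayRef, BinOp and collapse_axes are
invented for illustration and come from no existing project:)

from dataclasses import dataclass

# Hypothetical sketch only: ArrayRef, BinOp and collapse_axes are invented
# names for illustration and are not part of numba, Cython, numexpr or any
# other existing project.

@dataclass
class ArrayRef:
    name: str
    shape: tuple     # runtime shape, if known
    strides: tuple   # strides in elements, innermost axis last

@dataclass
class BinOp:
    op: str          # '+', '*', ...
    left: object
    right: object

def is_c_contiguous(a):
    expected = 1
    for dim, stride in zip(reversed(a.shape), reversed(a.strides)):
        if stride != expected:
            return False
        expected *= dim
    return True

def collapse_axes(a):
    # One of the transformations mentioned above: a C-contiguous operand
    # can be treated as a flat 1-D array, so the backend (numba, Cython,
    # numexpr) only has to emit a single tight loop.
    if is_c_contiguous(a):
        size = 1
        for dim in a.shape:
            size *= dim
        return ArrayRef(a.name, (size,), (1,))
    return a

a = ArrayRef('a', (1000, 4), (4, 1))
b = ArrayRef('b', (1000, 4), (4, 1))
print(BinOp('+', collapse_axes(a), collapse_axes(b)))
# BinOp(op='+', left=ArrayRef(name='a', shape=(4000,), strides=(1,)),
#       right=ArrayRef(name='b', shape=(4000,), strides=(1,)))

A real pass would of course also have to handle Fortran order, partial
contiguity, strided operands, fusion and the blocking decisions discussed
above; the point here is only that the input representation can stay very
small.
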
>>
>> Aha, thanks.
>>
>>>> Given that my interpretation above is correct, it is a bit more
>>>> difficult for me to see how your X project could be of benefit for
>>>> numexpr. In fact, I actually see this the other way round: once the
>>>> optimizer has discovered the vectorization parts, go one step
>>>> further and generate code that uses numexpr automatically (call
>>>> this vectorization through numexpr). Is this what you mean, or am I
>>>> missing something?
>>>
>>> No. I think in some ways this is a competitor to numexpr -- you
>>> would gut out the middle of numexpr and keep the frontend and
>>> backend, but use this to optimize iteration order and blocking
>>> strategies.
>>
>> I see. Yes, I can easily see Mark's project X + numba more as a
>> competitor (killer?) to numexpr too.
>>
>>>
>>> I think the goal is higher performance than what I understand
>>> numexpr can provide (in some cases, not all!). For instance, can
>>> numexpr deal well with
>>>
>>> a + a.T
>>>
>>> where a is a C-contiguous array? Any numpy-like iteration order will
>>> not work well; one needs to use higher-dimensional (e.g. 2D)
>>> blocking, not 1D blocking.
>>
>> No, numexpr cannot deal with the above problem efficiently. numexpr
>> is about 1D blocking, so its approach is pretty naive (but effective
>> for these 1D blocking tasks).
>>
>>> (If numexpr can do this then great, the task might then reduce to
>>> refactoring numexpr so that Cython and numba can use the same
>>> logic.)
>>
>> Well, now that I think about it, the virtual machine in the latest
>> numexpr (2.0) is based on the new nditer iterator included in NumPy
>> 1.6, so I'm wondering whether, with a bit more effort, 2D blocking
>> could be implemented. While I'm pretty sure this is the case, I don't
>> know if it would really be worth the effort. Perhaps things like
>> numba + project X, or just Theano, would be better targets to
>> concentrate on. Perhaps Mark Wiebe (who contributed the new
>> nditer-aware VM) could say a bit more about this.
>
> My guess is that the answer is that nditer works well in many
> situations, but can be sub-optimal for some array shapes, particularly
> if the arrays fit in cache.
>
> Here's a contrived bad case (not sure if it is the worst): Consider
>
> arr[::2, :]
>
> where arr is C-contiguous with shape (n, 2), and let n be such that
> the array fits in L2 cache :-)
>
> The ::2 ensures that the array can't be flattened. Any nditer
> approach, as I understand it, would need to execute the inner
> subroutine for each row of 4 elements, and then invoke the iteration
> machinery, which I imagine is quite slow compared to LLVM code
> generated specifically for this array, which could even unroll the
> inner 4-element loop.

Sorry: "2 elements", "2-element loop".

Dag
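
(A quick way to see the chunk sizes involved, using only the public
np.nditer API; treat the buffered numbers as illustrative, since they
depend on NumPy's default buffer size:)

import numpy as np

# The contrived case above: arr[::2, :] with arr C-contiguous of shape
# (n, 2). The two axes cannot be merged, so without buffering nditer's
# external loop hands the inner routine only 2 elements at a time.
n = 1000
arr = np.arange(2 * n, dtype=np.float64).reshape(n, 2)
view = arr[::2, :]

chunks = [len(chunk) for chunk in np.nditer(view, flags=['external_loop'])]
print(len(chunks), chunks[0])      # 500 chunks of 2 elements each

# Buffering lets nditer pass bigger chunks, at the cost of copying through
# its internal buffer (chunk size then depends on the default buffersize).
buffered = [len(chunk) for chunk in
            np.nditer(view, flags=['external_loop', 'buffered'])]
print(len(buffered), buffered[0])

Specialized LLVM code, by contrast, could hard-code the 500 x 2 shape and
unroll the 2-element inner loop with no per-chunk dispatch at all.
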
Sorry: "2 elements", "2-element loop". Dag > > If one really wants to compete with Fortran, one must take into account > that the scientific end-user may already be structuring the computation > in a cache-efficient way. It doesn't always help with good performance > "as long as arrays are>10 MB". > > If the array is large, the memory bottleneck should makes a lot of the > CPU overhead of nditer irrelevant even for worst-case arrays; though I'd > be curious to see benchmarks of how much overhead is left, and my > (rather unfounded) suspicion is that hard-wired compiled LLVM code for a > specific array should play better with the CPUs cache predictor to > better mask memory access latencies (?) > > (Mark F., apropos benchmarks, let me know what kind of different > hardware you have access to; I could see if I can give you access to a > couple of boxes with different memory bus characteristics, i.e., Intel > vs. AMD and so on.) > > Dag > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion