On 5 September 2016 at 16:21, H. S. Teoh via Digitalmars-d
<digitalmars-d@puremagic.com> wrote:
> On Mon, Sep 05, 2016 at 03:08:53PM +1000, Manu via Digitalmars-d wrote:
>> I mostly code like this now:
>>   data.map!(x => transform(x)).copy(output);
>>
>> It's convenient and reads nicely, but it's generally inefficient.
>> This sort of one-by-one software design is the core performance
>> problem with OOP. It seems a shame to be suffering OOP's failures even
>> when there is no OOP in sight.
>
> In functional languages, where these ideas originated, it's mostly the
> compiler's job to make things efficient.  I'd argue that we should
> rather be looking at how to improve the compiler.

This is also a key reason that functional languages always seem to
perform significantly worse than imperative languages.
It might be possible, but when will it happen? How reliable will it be
when it does work? Does it make competing compilers impossible to
implement?


> As far as I know,
> currently even D compilers don't really take full advantage of the
> range idiom; i.e., their optimizers are written for the general case of
> arbitrary code, so while you do get some benefits, it may not
> necessarily translate to overall efficient code.  Dmd is particularly
> bad at optimizing range-based code, mainly because its inliner gives up
> too easily.  Nest the range API calls a level too deep, and the inliner
> fails to inline it, and as a result a whole series of subsequent
> optimizations are missed.

Word.


> IME, gdc does a bit better -- it's able to translate simple range
> compositions into vectorized operations in some cases that I've seen.
> Yes, I realize that there's some debate about whether auto-vectorization
> is a good thing or not, but the point is that gdc's optimizer is well
> able to translate what you call a one-by-one design into something
> that's closer to block-by-block. So it's not out of the realm of
> possibility that the compiler could do such things for you.

Auto-vectorisation is a great optimisation, but it's not reliable. If
you want to write reliably fast code, you need to do it manually, and
that tends to mean implementing the hot loop by hand.
It never hurts to hand-craft the hottest loops in your program; these
are the exact sorts of functions I'm talking about.
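
For illustration, the shape I mean in D: plain slices and one indexed
loop over contiguous memory. The name and the transform here are made
up, only the shape matters:

    // A minimal sketch of a hand-written batch hot loop; `scaleBias`
    // and the transform itself are illustrative, not from any real API.
    void scaleBias(const(float)[] input, float[] output,
                   float scale, float bias)
    {
        assert(input.length == output.length);
        // One indexed loop over contiguous memory: the shape that
        // gdc/ldc vectorise far more reliably than a nested range chain.
        foreach (i; 0 .. input.length)
            output[i] = input[i] * scale + bias;
    }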


> Now imagine if the D frontend is able to recognize certain code patterns
> characteristic of range-based code, and optimize accordingly. In the
> front end, you have access to the AST and to higher-level constructs in
> the language, so it's more likely to be able to reliably detect
> range-based code, and do something useful with it.  Conceptually, such
> an optimizer could see code like this:
>
>         data.map!(x => 2*x + 1)
>             .map!(x => f(x) - 1)
>             .copy(output);
>
> and produce, as an initial step, something like:
>
>         foreach (i, x; data) {
>                 output[i] = f(2*x + 1) - 1;
>         }
>
> Then this gets passed through the backend optimizer which does loop
> optimizations like strength reduction, unrolling, etc.. The unrolled
> loop could then get passed over another optimizer iteration to produce
> vectorized code, or the equivalent transformation into block-based
> operations.
>
> There's a lot of optimization potential with range-based code that
> currently isn't really exploited by existing D compilers. It would be
> interesting to investigate what optimizations are possible and to
> implement them in the D front-end.

I agree, there are heaps of optimisations that could be done, but
there are also a lot of optimisations that cannot be performed
automatically.
There are many cases where the user will need to hand-write the
`f(T[] data, T[] dest)` version, and we need to find a way for such a
function to work its way into these pipeline code constructs
naturally.
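
As a sketch of what that integration might look like (everything here
is hypothetical; `mapBatch` is not a Phobos function):

    import std.algorithm.iteration : joiner, map;
    import std.array : array;
    import std.range : chunks;

    // Hypothetical batch function in the `f(T[] data, T[] dest)` shape.
    void transformThings(const(int)[] data, int[] dest)
    {
        foreach (i, x; data)
            dest[i] = x * 2 + 1;    // illustrative transform
    }

    // Scaffolding sketch: buffer the source into blocks, hand each
    // block to the batch function, then flatten back into a range.
    auto mapBatch(R)(R r, size_t blockSize = 1024)
    {
        return r.chunks(blockSize).map!((chunk) {
            auto src = chunk.array;          // materialise the block
            auto dst = new int[src.length];  // scaffolding-owned buffer
            transformThings(src, dst);
            return dst;
        }).joiner;
    }

    // usage: data.mapBatch.copy(output);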


>> A central premise of performance-oriented programming which I've
>> employed my entire career, is "where there is one, there is probably
>> many", and if you do something to one, you should do it to many.  With
>> this in mind, the code I have always written doesn't tend to look like
>> this:
>>   R manipulate(Thing thing);
>>
>> Instead:
>>   void manipulateThings(Thing *things, size_t numThings, R *output,
>> size_t outputLen);
>>
>> Written this way for clarity. Obviously, the D equiv uses slices.
>
> Actually, isn't this the whole premise of range-based code? Instead of
> dealing with individual items, you work with a range of items, and not
> necessarily arrays either. Though for obvious reasons arrays would tend
> to perform better than the general range.

Absolutely, but we're not taking advantage of that at all yet.
All expressions are element-wise, and that alone inhibits many
possibilities for writing efficient code.
Again, we need to find a way to comfortably work those batch functions
into the pipeline. It needs to be just as valid to write a batch
version and have it work in the same place as an element-wise version,
with the scaffolding worrying about buffering up blocks of items or
whatever is required. Obviously, if src is an array and not a generic
range, there are great short-circuits to enjoy.
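
One way such scaffolding might choose between the two shapes; purely a
sketch, `apply` is a made-up name and the trait tests are illustrative:

    import std.algorithm.iteration : map;
    import std.traits : isDynamicArray;

    // Hypothetical front door accepting either shape: if `fn` can take
    // the whole slice and the source really is an array, short-circuit
    // and call it once; otherwise fall back to the element-wise path.
    auto apply(alias fn, R)(R src)
    {
        static if (isDynamicArray!R && is(typeof(fn(src))))
            return fn(src);     // batch short-circuit: one call for all
        else
            return src.map!fn;  // element-wise fallback (or buffer into
                                // blocks, as sketched earlier)
    }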


>> All functions are implemented with the presumption they will operate
>> on many things, rather than being called many times for each one.
>> This is the single most successful design pattern I have ever
>> encountered wrt high-performance code; ie, implement the array version
>> first.
>
> In D, that would be "implement the range version first" (and then you
> realize you don't need to implement any other version afterwards,
> because the range version already covers everything else :-P).

The range version is different from the array version. The array
version is the one that can be optimised; range versions can rarely be
tightly optimised, because you tend to require knowledge of memory
layout.
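
A minimal sketch of the contrast (invented names):

    import std.range.primitives : hasLvalueElements, isInputRange;

    // Range version: element-wise is the only option, since nothing is
    // known about memory layout.
    void scale(R)(R r, double factor)
        if (isInputRange!R && hasLvalueElements!R)
    {
        foreach (ref e; r)
            e *= factor;
    }

    // Array version: contiguity is guaranteed, so the whole slice can
    // be treated as one block (array op, or a hand-vectorised loop).
    void scale(double[] a, double factor)
    {
        a[] *= factor;
    }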


>> The problem with this API design, is that it doesn't plug into
>> algorithms or generic code well.
>>   data.map!(x => transformThings(&x, 1)).copy(output);
>>
>> I often wonder how we can integrate this design principle conveniently
>> (ie, seamlessly) into the design of algorithms, such that they can
>> make use of batching functions internally, and transparently?
>>
>> Has anyone done any work in this area?
>>
>> Ideas?
>
> One thought that occurs to me is to focus on optimizing code involving
> .map.  Obviously, a naive translation of a UFCS chain involving .map
> would involve calling the transformation function one item at a time.
> But it doesn't have to be that way, especially if the function is a
> lambda or inlineable.  Conceivably, the optimizer should first inline
> the function, then unroll the loop, then realize that further
> optimizations can be applied across multiple loop iterations to turn it
> into batched code.

Right, but map!(x => f(x)) calls a function element-wise, and there's
not really anything the optimiser can do with that, because that's
what the function's signature dictates.
This is the point I'm most interested in, and it's where the optimiser
loses all ability to do anything.

`T f(T t)` is an inefficient design. `T[] f(T[], T[])` is much better,
but it's harder to work into those pipeline expressions, and it's
still not clear to me how the optimiser would take advantage of it
without manual intervention on the whole pipeline.
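
Concretely, with illustrative names: the element-wise shape drops
straight into a pipeline, while the batch shape does not:

    import std.algorithm.iteration : map;
    import std.algorithm.mutation : copy;

    float elementWise(float t) { return t * 2 + 1; }

    float[] batch(const(float)[] src, float[] dest)
    {
        assert(dest.length >= src.length);
        foreach (i, t; src)
            dest[i] = t * 2 + 1;
        return dest[0 .. src.length];
    }

    void demo(float[] data, float[] output)
    {
        // Composes naturally, but one call per element:
        data.map!elementWise.copy(output);

        // One call for everything, but no seat in the pipeline:
        batch(data, output);
    }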


> Of course, compiler improvements are kinda more in the long term; in the
> short term, what you might be able to do is to explicitly work with the
> range in blocks, i.e.:
>
>         import std.range : chunks;
>         data.chunks(chunkSize)  // split into batches
>                 .map!(chunk => transformThings(chunk))
>                 .joiner // join transformed batches back into flat range
>                 .copy(output);

Right, this is the current solution. It's a bit clunky and very
manual. People aren't writing transform functions this way... and it's
difficult to encourage them to do so when it impacts the end-user's
experience of their API like that.
We won't get many people writing good, fast transformations while this
remains the end-user experience.


On a tangent, I made a suggestion to Walter a while back, which I
think he actually liked, which is that we might make better use of D's
array expressions by lowering to pipeline expressions:
    r[] * 10   ->  r.map!(e => e * 10)
    r[] + u[]  ->  zip(r, u).map!(e => e[0] + e[1])
    f(r[])     ->  if `f(R)(R r)` is available: f(r),
                   else if `T f(T)` is available: r.map!(e => f(e))
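
To ground it: this already works on slices today; the suggestion is to
give ranges the same surface syntax by lowering to the pipeline form,
e.g.:

    import std.algorithm.iteration : map;
    import std.range : zip;

    void example()
    {
        double[] r = [1, 2, 3];
        double[] u = [4, 5, 6];
        auto dst = new double[3];

        // Valid D today, but only for slices:
        dst[] = r[] * 10 + u[];

        // What the proposed lowering would produce for generic ranges:
        auto lowered = zip(r, u).map!(e => e[0] * 10 + e[1]);
    }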

You get the idea... I just feel like D's array syntax could easily be
extended to support ranges, which would make ranges feel much more
first-class and simplify pipeline expressions immensely. I think that
could be a better high-level representation for the sort of
optimisations I'm interested in: the operations are expressed in terms
of the whole range, not in the element-wise terms our existing
pipeline operations tend to use. This leaves more flexibility to the
compiler/optimiser.

There's a lot of boilerplate bangs, parens, and lambdas in our current
pipeline expressions. Enhancing array expressions to support ranges
would simplify the whole thing immensely, and I think it would also
improve the opportunity for optimisation.
