Taking pipeline processing to the next level

2016-09-04 Thread Manu via Digitalmars-d
I mostly code like this now:
  data.map!(x => transform(x)).copy(output);

It's convenient and reads nicely, but it's generally inefficient.
This sort of one-by-one software design is the core performance
problem with OOP. It seems a shame to be suffering OOP's failures even
when there is no OOP in sight.

A central premise of performance-oriented programming which I've
employed my entire career is "where there is one, there is probably
many", and if you do something to one, you should do it to many.
With this in mind, the code I have always written doesn't tend to look
like this:
  R manipulate(Thing thing);

Instead:
  void manipulateThings(Thing *things, size_t numThings, R *output,
size_t outputLen);

Written this way for clarity. Obviously, the D equiv uses slices.
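
As a sketch, that D equivalent with the same hypothetical names:
  void manipulateThings(const(Thing)[] things, R[] output);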

All functions are implemented with the presumption they will operate
on many things, rather than being called many times for each one.
This is the single most successful design pattern I have ever
encountered wrt high-performance code; ie, implement the array version
first.

The problem with this API design is that it doesn't plug into
algorithms or generic code well.
  data.map!(x => transformThings(&x, 1)).copy(output);

I often wonder how we can integrate this design principle conveniently
(ie, seamlessly) into the design of algorithms, such that they can
make use of batching functions internally, and transparently?

Has anyone done any work in this area?

Ideas?


Re: Taking pipeline processing to the next level

2016-09-04 Thread rikki cattermole via Digitalmars-d

On 05/09/2016 5:08 PM, Manu via Digitalmars-d wrote:

I mostly code like this now:
  data.map!(x => transform(x)).copy(output);

It's convenient and reads nicely, but it's generally inefficient.
This sort of one-by-one software design is the core performance
problem with OOP. It seems a shame to be suffering OOP's failures even
when there is no OOP in sight.

A central premise of performance-oriented programming which I've
employed my entire career, is "where there is one, there is probably
many", and if you do something to one, you should do it to many.
With this in mind, the code I have always written doesn't tend to look
like this:
  R manipulate(Thing thing);

Instead:
  void manipulateThings(Thing *things, size_t numThings, R *output,
size_t outputLen);

Written this way for clarity. Obviously, the D equiv uses slices.

All functions are implemented with the presumption they will operate
on many things, rather than being called many times for each one.
This is the single most successful design pattern I have ever
encountered wrt high-performance code; ie, implement the array version
first.

The problem with this API design, is that it doesn't plug into
algorithms or generic code well.
  data.map!(x => transformThings(&x, 1)).copy(output);

I often wonder how we can integrate this design principle conveniently
(ie, seamlessly) into the design of algorithms, such that they can
make use of batching functions internally, and transparently?

Has anyone done any work in this area?

Ideas?


Just a random idea:

import std.range : ElementType, empty, front, isInputRange, popFront;

int[] transformer(I)(I from, int[] buffer)
    if (isInputRange!I && is(ElementType!I == int))
{
    size_t used;

    // ... transformation algo; e.g. a placeholder identity transform:
    while (!from.empty && used < buffer.length)
    {
        buffer[used++] = from.front;
        from.popFront();
    }

    return buffer[0 .. used];
}

auto got = input.transformer(buffer); // `got` is the used slice of `buffer`

An input range is passed in instead of a straight array, so it works
on pretty much any input range, arrays included.


Re: Taking pipeline processing to the next level

2016-09-04 Thread H. S. Teoh via Digitalmars-d
On Mon, Sep 05, 2016 at 03:08:53PM +1000, Manu via Digitalmars-d wrote:
> I mostly code like this now:
>   data.map!(x => transform(x)).copy(output);
> 
> It's convenient and reads nicely, but it's generally inefficient.
> This sort of one-by-one software design is the core performance
> problem with OOP. It seems a shame to be suffering OOP's failures even
> when there is no OOP in sight.

In functional languages, where these ideas originated, it's mostly the
compiler's job to make things efficient.  I'd argue that we should
rather be looking at how to improve the compiler.  As far as I know,
currently even D compilers don't really take full advantage of the
range idiom; i.e., their optimizers are written for the general case of
arbitrary code, so while you do get some benefits, it may not
necessarily translate to overall efficient code.  Dmd is particularly
bad at optimizing range-based code, mainly because its inliner gives up
too easily.  Nest the range API calls a level too deep, and the inliner
fails to inline it, and as a result a whole series of subsequent
optimizations are missed.

IME, gdc does a bit better -- it's able to translate simple range
compositions into vectorized operations in some cases that I've seen.
Yes, I realize that there's some debate about whether auto-vectorization
is a good thing or not, but the point is that gdc's optimizer is well
able to translate what you call a one-by-one design into something
that's closer to block-by-block. So it's not out of the realm of
possibility that the compiler could do such things for you.

Now imagine if the D frontend is able to recognize certain code patterns
characteristic of range-based code, and optimize accordingly. In the
front end, you have access to the AST and to higher-level constructs in
the language, so it's more likely to be able to reliably detect
range-based code, and do something useful with it.  Conceptually, such
an optimizer could see code like this:

data.map!(x => 2*x + 1)
    .map!(x => f(x) - 1)
    .copy(output);

and produce, as an initial step, something like:

foreach (i, x; data) {
    output[i] = f(2*x + 1) - 1;
}

Then this gets passed through the backend optimizer which does loop
optimizations like strength reduction, unrolling, etc.. The unrolled
loop could then get passed over another optimizer iteration to produce
vectorized code, or the equivalent transformation into block-based
operations.

There's a lot of optimization potential with range-based code that
currently isn't really exploited by existing D compilers. It would be
interesting to investigate what optimizations are possible and to
implement them in the D front-end.


> A central premise of performance-oriented programming which I've
> employed my entire career, is "where there is one, there is probably
> many", and if you do something to one, you should do it to many.  With
> this in mind, the code I have always written doesn't tend to look like
> this:
>   R manipulate(Thing thing);
> 
> Instead:
>   void manipulateThings(Thing *things, size_t numThings, R *output,
> size_t outputLen);
> 
> Written this way for clarity. Obviously, the D equiv uses slices.

Actually, isn't this the whole premise of range-based code? Instead of
dealing with individual items, you work with a range of items, and not
necessarily arrays either. Though for obvious reasons arrays would tend
to perform better than the general range.


> All functions are implemented with the presumption they will operate
> on many things, rather than being called many times for each one.
> This is the single most successful design pattern I have ever
> encountered wrt high-performance code; ie, implement the array version
> first.

In D, that would be "implement the range version first" (and then you
realize you don't need to implement any other version afterwards,
because the range version already covers everything else :-P).
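
One range-based definition covers everything; a sketch, reusing the
hypothetical Thing/transform names from above:

    import std.algorithm : map;
    import std.range : ElementType, isInputRange;

    // works for arrays, lazy ranges, anything that iterates Things
    auto manipulate(R)(R things)
        if (isInputRange!R && is(ElementType!R == Thing))
    {
        return things.map!(t => transform(t));
    }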


> The problem with this API design, is that it doesn't plug into
> algorithms or generic code well.
>   data.map!(x => transformThings(&x, 1)).copy(output);
> 
> I often wonder how we can integrate this design principle conveniently
> (ie, seamlessly) into the design of algorithms, such that they can
> make use of batching functions internally, and transparently?
> 
> Has anyone done any work in this area?
> 
> Ideas?

One thought that occurs to me is to focus on optimizing code involving
.map.  Obviously, a naive translation of a UFCS chain involving .map
would involve calling the transformation function one item at a time.
But it doesn't have to be that way, especially if the function is a
lambda or inlineable.  Conceivably, the optimizer should first inline
the function, then unroll the loop, then realize that further
optimizations can be applied across multiple loop iterations to turn it
into batched code.

Of course, compiler improvements are kinda more in the long term; in the
short term, what you might be 

Re: Taking pipeline processing to the next level

2016-09-05 Thread Andrei Alexandrescu via Digitalmars-d

On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:

I mostly code like this now:
  data.map!(x => transform(x)).copy(output);

It's convenient and reads nicely, but it's generally inefficient.


What are the benchmarks and the numbers? What loss are you looking at? 
-- Andrei


Re: Taking pipeline processing to the next level

2016-09-05 Thread Johannes Pfau via Digitalmars-d
On Mon, 5 Sep 2016 10:21:53 +0200, Andrei Alexandrescu wrote:

> On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
> > I mostly code like this now:
> >   data.map!(x => transform(x)).copy(output);
> >
> > It's convenient and reads nicely, but it's generally inefficient.  
> 
> What are the benchmarks and the numbers? What loss are you looking
> at? -- Andrei

As Manu posted this question (and he's working on a color/image
library), it's not hard to guess that one problem is
SIMD/vectorization. E.g. if transform(x) => x + 2, it is faster to
perform one SIMD operation on 4 values than 4 individual adds.

As autovectorization is not very powerful in current compilers, I can
easily imagine that complex range-based examples can't compete with
hand-written SIMD loops.
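
A minimal sketch of that difference with core.simd (function names
made up; assumes an x86 target with SIMD support):

    import core.simd;

    // scalar: one add per element
    void addTwoScalar(int[] data)
    {
        foreach (ref x; data)
            x += 2;
    }

    // vector: one add per four elements
    void addTwoSimd(int4[] data)
    {
        int4 two = 2; // broadcast 2 into all four lanes
        foreach (ref v; data)
            v += two;
    }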

@Manu: Have you had a look at std.parallelism? I think it has some kind
of parallel map which could provide some inspiration?
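
For reference, std.parallelism's amap is batch-shaped already; a
sketch (the work-unit size is its batching knob):

    import std.math : sqrt;
    import std.parallelism : taskPool;

    void main()
    {
        import std.array : array;
        import std.range : iota;

        auto nums = iota(1.0, 1_000_000.0).array;
        auto results = new double[nums.length];

        // eager parallel map; 1_000 is a made-up work-unit size, i.e.
        // how many elements each worker task processes as a batch
        taskPool.amap!sqrt(nums, 1_000, results);
    }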



Re: Taking pipeline processing to the next level

2016-09-05 Thread Guillaume Piolat via Digitalmars-d

On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:

A central premise of performance-oriented programming which I've
employed my entire career is "where there is one, there is probably
many", and if you do something to one, you should do it to many.
With this in mind, the code I have always written doesn't tend to look
like this:
  R manipulate(Thing thing);

Instead:
  void manipulateThings(Thing *things, size_t numThings, R *output,
size_t outputLen);

Written this way for clarity. Obviously, the D equiv uses slices.


To chime in, processing in chunks is indeed absolutely key for
audio performance. Processing by element is only tolerable for
prototyping. I don't even use slices, but usually pointers and
length, since most slice lengths would be the same.


void processSamples(float* input, float* output, int samples);

Some kind of buffered input and output range concept would be
needed there if a lazy library were to be made.


Unfortunately for this case inputs and outputs are rather 
diverse. Can't wait for a more formalized DbI introduction.


As for graphics, ae.utils.graphics usually gives you a whole line
of pixels.


Re: Taking pipeline processing to the next level

2016-09-05 Thread Ethan Watson via Digitalmars-d
On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu 
wrote:
What are the benchmarks and the numbers? What loss are you 
looking at? -- Andrei


Just looking at the example, and referencing the map code in
std.algorithm.iteration, I can see multiple function calls instead
of one, thanks to every indexing of the new map doing a
transformation instead of caching it. I'm not sure if the lambda
declaration there will result in the argument being taken by ref
or by value, but let's assume by value for the sake of argument.
Depending on whether what it takes by value is a reference or a
value type, that could either be a cheap function call or an
expensive one.

But even if it took it by reference, it's still a function call.
Function calls are generally The Devil(TM) in a gaming
environment. The fewer you can make, the better.


Random aside: There are streaming store instructions available to 
me on x86 platforms so that I don't have to wait for the 
destination to hit L1 cache before writing. The pattern Manu 
talks about with a batching function can better exploit this. But 
I imagine copy could also take advantage of this when dealing 
with value types.


Re: Taking pipeline processing to the next level

2016-09-05 Thread Andrei Alexandrescu via Digitalmars-d

On 9/5/16 1:43 PM, Ethan Watson wrote:

On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:

What are the benchmarks and the numbers? What loss are you looking at?
-- Andrei


Just looking at the example, and referencing the map code in
std.algorithm.iteration, I can see multiple function calls instead of
one thanks to every indexing of the new map doing a transformation
instead of caching it. I'm not sure if the lambda declaration there will
result in the argument being taken by ref or by value, but let's assume
by value for the sake of argument. Depending on if it's taking by value
a reference or a value type, that could either be a cheap function call
or an expensive one.

But even if it took it by reference, it's still a function call.
Function calls are generally The Devil(TM) in a gaming environment. The
less you can make, the better.

Random aside: There are streaming store instructions available to me on
x86 platforms so that I don't have to wait for the destination to hit L1
cache before writing. The pattern Manu talks about with a batching
function can better exploit this. But I imagine copy could also take
advantage of this when dealing with value types.


Understood. Would explicitly asking for vectorized operations be 
acceptable? One school of thought has it that explicit invocation of 
parallel operations is preferable to autovectorization and its ilk. -- 
Andrei


Re: Taking pipeline processing to the next level

2016-09-05 Thread Manu via Digitalmars-d
On 5 September 2016 at 16:21, H. S. Teoh via Digitalmars-d
 wrote:
> On Mon, Sep 05, 2016 at 03:08:53PM +1000, Manu via Digitalmars-d wrote:
>> I mostly code like this now:
>>   data.map!(x => transform(x)).copy(output);
>>
>> It's convenient and reads nicely, but it's generally inefficient.
>> This sort of one-by-one software design is the core performance
>> problem with OOP. It seems a shame to be suffering OOP's failures even
>> when there is no OOP in sight.
>
> In functional languages, where these ideas originated, it's mostly the
> compiler's job to make things efficient.  I'd argue that we should
> rather be looking at how to improve the compiler.

This is also the key reason that functional languages always seem to
perform significantly slower than imperative languages.
It might be possible, but when will it happen? How reliable is it if
it does work? Does it make competing compilers impossible to
implement?


> As far as I know,
> currently even D compilers don't really take full advantage of the
> range idiom; i.e., their optimizers are written for the general case of
> arbitrary code, so while you do get some benefits, it may not
> necessarily translate to overall efficient code.  Dmd is particularly
> bad at optimizing range-based code, mainly because its inliner gives up
> too easily.  Nest the range API calls a level too deep, and the inliner
> fails to inline it, and as a result a whole series of subsequent
> optimizations are missed.

Word.


> IME, gdc does a bit better -- it's able to translate simple range
> compositions into vectorized operations in some cases that I've seen.
> Yes, I realize that there's some debate about whether auto-vectorization
> is a good thing or not, but the point is that gdc's optimizer is well
> able to translate what you call a one-by-one design into something
> that's closer to block-by-block. So it's not out of the realm of
> possibility that the compiler could do such things for you.

Auto-vectorisation is a great optimisation, but it's not reliable. If
you want to write reliably fast code, you need to do it manually, and
that tends to mean implementing the hot loop by hand.
It never hurts to hand-craft the hottest loops in your program; these
are the exact sorts of functions I'm talking about.


> Now imagine if the D frontend is able to recognize certain code patterns
> characteristic of range-based code, and optimize accordingly. In the
> front end, you have access to the AST and to higher-level constructs in
> the language, so it's more likely to be able to reliably detect
> range-based code, and do something useful with it.  Conceptually, such
> an optimizer could see code like this:
>
> data.map!(x => 2*x + 1)
> .map!(x => f(x) - 1)
> .copy(output);
>
> and produce, as an initial step, something like:
>
> foreach (i, x; data) {
> output[i] = f(2*x + 1) - 1;
> }
>
> Then this gets passed through the backend optimizer which does loop
> optimizations like strength reduction, unrolling, etc.. The unrolled
> loop could then get passed over another optimizer iteration to produce
> vectorized code, or the equivalent transformation into block-based
> operations.
>
> There's a lot of optimization potential with range-based code that
> currently isn't really exploited by existing D compilers. It would be
> interesting to investigate what optimizations are possible and to
> implement them in the D front-end.

I agree, there are heaps of optimisations that could be done, but
there are also a lot of optimisations that cannot be performed
automatically.
There are many cases where the user will need to hand write the `f(T[]
data, T[] dest)` version, and we need to find a way that such a
function can work its way into these pipeline code constructs
naturally.


>> A central premise of performance-oriented programming which I've
>> employed my entire career, is "where there is one, there is probably
>> many", and if you do something to one, you should do it to many.  With
>> this in mind, the code I have always written doesn't tend to look like
>> this:
>>   R manipulate(Thing thing);
>>
>> Instead:
>>   void manipulateThings(Thing *things, size_t numThings, R *output,
>> size_t outputLen);
>>
>> Written this way for clarity. Obviously, the D equiv uses slices.
>
> Actually, isn't this the whole premise of range-based code? Instead of
> dealing with individual items, you work with a range of items, and not
> necessarily arrays either. Though for obvious reasons arrays would tend
> to perform better than the general range.

Absolutely, but we're not taking advantage of that at all yet.
All expressions are element-wise, and that alone inhibits many
possibilities for writing efficient code.
Again, we need to find a way to comfortably work those batch functions
into the pipeline. It needs to be just as valid to write a batch
version and have it work in the same place as an element-wise version,
and the s

Re: Taking pipeline processing to the next level

2016-09-05 Thread Manu via Digitalmars-d
On 5 September 2016 at 18:21, Andrei Alexandrescu via Digitalmars-d
 wrote:
> On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
>>
>> I mostly code like this now:
>>   data.map!(x => transform(x)).copy(output);
>>
>> It's convenient and reads nicely, but it's generally inefficient.
>
>
> What are the benchmarks and the numbers? What loss are you looking at? --
> Andrei

Well, it totally depends. Like right now in my case, 'transform' is
some image processing code (in the past when I've had these same
thoughts, it has been audio filters). You can't touch pixels (or
samples) one at a time. They need manual SIMD deployment (I've never
seen an auto-vectoriser handle saturation arithmetic, or type
promotion), alpha components (every 4th byte) are treated differently,
and memory access patterns need to be tuned to be cache friendly.

I haven't done benchmarks right now, but I've done them professionally
in the past, and it's not unusual for a hand-written image processing
loop to see 1 or even 2 orders of magnitude improvement over calling a
function for each pixel in a loop. The sorts of low-level
optimisations you deploy in image and audio processing loops are not
things I've ever seen any optimiser even attempt.
Some core problems that tend to require manual intervention in hot-loops are:
  ubyte[16] <-> ushort[8][2] expansion/contraction
  ubyte[16] <-> float[4][4] expansion/contraction
  saturation
  scalar operator results promote to int, but wide-SIMD operations
don't, which means some scalar expressions can't be losslessly
collapsed to SIMD operations, and the compiler will always be
conservative on this matter. If the auto-vectoriser tries at all, you
will see a mountain of extra code to preserve those bits that the
scalar operator semantics would have guaranteed
  wide-vector multiplication is semantically different from scalar
multiplication, so the optimiser has a lot of trouble vectorising
muls
  assumptions about data alignment
  interleaved data; audio samples are usually [L,R] interleaved,
images often [RGB,A], and different processes are applied across the
separation; you want to unroll and shuffle the data so you have
vectors <LLLL>,<RRRR>, or [RGBRGBRGBRGB],<AAAA>, and I haven't seen an
optimiser go near that
  vector dot-product is always a nuisance

I could go on and on.
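
Saturation, for instance, is a single instruction when written by
hand; a sketch with DMD's core.simd (x86 assumed, function name made
up):

    import core.simd;

    // saturating unsigned add on 16 bytes at once; the scalar
    // equivalent needs a compare-and-clamp per byte
    ubyte16 addSaturated(ubyte16 a, ubyte16 b)
    {
        return cast(ubyte16) __simd(XMM.PADDUSB, a, b);
    }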

The point is, as an end-user, pipeline APIs are great. As a library
author, I want to present the best performing library I can, which I
think means we need to find a way to conveniently connect these 2
currently disconnected worlds.
I've explored to some extent, but I've never come up with anything
that I like.


Re: Taking pipeline processing to the next level

2016-09-05 Thread Manu via Digitalmars-d
On 5 September 2016 at 20:32, Johannes Pfau via Digitalmars-d
 wrote:
> On Mon, 5 Sep 2016 10:21:53 +0200, Andrei Alexandrescu wrote:
>
>> On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
>> > I mostly code like this now:
>> >   data.map!(x => transform(x)).copy(output);
>> >
>> > It's convenient and reads nicely, but it's generally inefficient.
>>
>> What are the benchmarks and the numbers? What loss are you looking
>> at? -- Andrei
>
> As Manu posted this question (and he's working on a color/image library)
> it's not hard to guess one problem is SIMD/vectorization. E.g if
> transform(x) => x + 2; It is faster to perfom 1 SIMD operation on 4
> values instead of 4 individual adds.
>
> As autovectorization is not very powerful in current compilers I can
> easily imagine that complex range based examples can't compete with
> hand-written SIMD loops.
>
> @Manu: Have you had a look at std.parallelism? I think it has some kind
> of parallel map which could provide some inspiration?

I have. Even just chunks() and joiner() can do the trick to a
reasonable extent, but it's still not great. It's definitely not where
I'd like it to be (a sketch of the manual strategy below). End-users
won't deploy these strategies correctly (or at all); I'd like to see a
design that enables more automatic deployment of batch processing. I
treat the end-user like a javascript user; they shouldn't need to do
hard work to make proper use of a lib; that's a poor API offering on
the part of the lib author.
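
Roughly what that manual chunks()/joiner() strategy looks like
(transformThings is the hypothetical batch function from my original
post; 64 is an arbitrary guess at a good chunk size, exactly the
thing end-users get wrong):

    import std.algorithm : joiner, map;
    import std.range : chunks;

    // batch, run the array version per batch, flatten back
    auto batched(T)(T[] data)
    {
        return data.chunks(64)
                   .map!(c => transformThings(c))
                   .joiner;
    }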


Re: Taking pipeline processing to the next level

2016-09-05 Thread Manu via Digitalmars-d
On 5 September 2016 at 23:38, Andrei Alexandrescu via Digitalmars-d
 wrote:
> On 9/5/16 1:43 PM, Ethan Watson wrote:
>>
>> On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:
>>>
>>> What are the benchmarks and the numbers? What loss are you looking at?
>>> -- Andrei
>>
>>
>> Just looking at the example, and referencing the map code in
>> std.algorithm.iteration, I can see multiple function calls instead of
>> one thanks to every indexing of the new map doing a transformation
>> instead of caching it. I'm not sure if the lambda declaration there will
>> result in the argument being taken by ref or by value, but let's assume
>> by value for the sake of argument. Depending on if it's taking by value
>> a reference or a value type, that could either be a cheap function call
>> or an expensive one.
>>
>> But even if it took it by reference, it's still a function call.
>> Function calls are generally The Devil(TM) in a gaming environment. The
>> less you can make, the better.
>>
>> Random aside: There are streaming store instructions available to me on
>> x86 platforms so that I don't have to wait for the destination to hit L1
>> cache before writing. The pattern Manu talks about with a batching
>> function can better exploit this. But I imagine copy could also take
>> advantage of this when dealing with value types.
>
>
> Understood. Would explicitly asking for vectorized operations be acceptable?
> One school of thought has it that explicit invocation of parallel operations
> are preferable to autovectorization and its ilk. -- Andrei

I still stand by this, and I listed some reasons above.
Auto-vectorisation is a nice opportunistic optimisation, but it can't
be relied on. The key reason is that scalar arithmetic semantics are
different than vector semantics, and auto-vectorisation tends to
produce a whole bunch of extra junk code to carefully (usually
pointlessly) preserve the scalar semantics that it's trying to
vectorise. This will never end well.
But the vectorisation isn't the interesting problem here, I'm really
just interested in how to work these batch-processing functions into
our nice modern pipeline statements without placing an unreasonable
burden on the end-user, who shouldn't be expected to go out of their
way. If they even have to start manually chunking, I think we've
already lost; they won't know optimal chunk-sizes, or anything about
alignment boundaries, cache, etc.


Re: Taking pipeline processing to the next level

2016-09-06 Thread Jerry via Digitalmars-d

On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:

I mostly code like this now:
  data.map!(x => transform(x)).copy(output);


So you basically want to make the lazy computation eager and store
the result?


data.map!(x => transform(x)).array

will allocate a new array and fill it with the result of map. And if
you want to recycle the buffer, I guess writing a buffer function
would be trivial.
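
A sketch of that buffer-recycling variant (transform is the
hypothetical function from the original post):

    import std.algorithm : copy, map;

    T[] intoBuffer(T)(T[] data, T[] buffer)
    {
        // copy returns the unfilled tail of the target
        auto rest = data.map!(x => transform(x)).copy(buffer);
        return buffer[0 .. $ - rest.length]; // the filled portion
    }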






Re: Taking pipeline processing to the next level

2016-09-06 Thread finalpatch via Digitalmars-d

On Tuesday, 6 September 2016 at 03:08:43 UTC, Manu wrote:

I still stand by this, and I listed some reasons above.
Auto-vectorisation is a nice opportunistic optimisation, but it 
can't
be relied on. The key reason is that scalar arithmetic 
semantics are

different than vector semantics, and auto-vectorisation tends to
produce a whole bunch of extra junk code to carefully (usually
pointlessly) preserve the scalar semantics that it's trying to
vectorise. This will never end well.
But the vectorisation isn't the interesting problem here, I'm 
really
just interested in how to work these batch-processing functions 
into
our nice modern pipeline statements without placing an 
unreasonable
burden on the end-user, who shouldn't be expected to go out of 
their

way. If they even have to start manually chunking, I think we've
already lost; they won't know optimal chunk-sizes, or anything 
about

alignment boundaries, cache, etc.


In a previous job I had successfully created a small C++ library
to perform pipelined SIMD image processing. Not sure how relevant
it is, but I thought I'd share the design here; perhaps it'll give
you guys some ideas.


Basically the users of this library only need to write simple 
kernel classes, something like this:


// A kernel that processes 4 pixels at a time
struct MySimpleKernel : Kernel<4>
{
    // Tell the library the input and output type
    using InputVector  = Vector<__m128, 1>;
    using OutputVector = Vector<__m128, 2>;

    template <typename T>
    OutputVector apply(const T& src)
    {
        // T will be deduced to Vector<__m128, 1>,
        // which is an array of one __m128 element.
        // Awesome SIMD code goes here...
        // And return the output vector
        return OutputVector(...);
    }
};

Of course the InputVector and OutputVector do not have to be 
__m128, they can totally be other types like int or float.


The cool thing is kernels can be chained together with >> 
operators.


So assume we have another kernel:

struct AnotherKernel : Kernel<3>
{
...
}

Then we can create a processing pipeline with these 2 kernels:

InputBuffer(...) >> MySimpleKernel() >> AnotherKernel() >> 
OutputBuffer(...);


Then some template magic will figure out that the LCM of the 2
kernels' pixel widths is 3*4=12, and therefore they are fused
together into a composite kernel of pixel width 12.  The above
line compiles down into a single function invocation, with a main
loop that reads the source buffer in 4 pixels step, call
MySimpleKernel 3 times, then call AnotherKernel 4 times.
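
In D terms, the stride arithmetic would be something like (a sketch):

    import std.numeric : gcd;

    // widths 4 and 3 line up at lcm(4, 3) = 12 pixels per fused step
    enum w1 = 4, w2 = 3;
    enum fused = w1 * w2 / gcd(w1, w2); // 12
    enum reps1 = fused / w1;            // MySimpleKernel: 3 calls per step
    enum reps2 = fused / w2;            // AnotherKernel: 4 calls per step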


Any number of kernels can be chained together in this way, as 
long as your compiler doesn't explode.


At that time, my benchmarks showed pipelines generated in this
way often rivalled the speed of hand-tuned loops.




Re: Taking pipeline processing to the next level

2016-09-06 Thread finalpatch via Digitalmars-d

On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:
Then some template magic will figure out the LCM of the 2 
kernels' pixel width is 3*4=12 and therefore they are fused 
together into a composite kernel of pixel width 12.  The above 
line compiles down into a single function invokation, with a 
main loop that reads the source buffer in 4 pixels step, call 
MySimpleKernel 3 times, then call AnotherKernel 4 times.


Correction:
with a main loop that reads the source buffer in *12* pixels 
step, call MySimpleKernel 3 times, then call AnotherKernel 4 
times.




Re: Taking pipeline processing to the next level

2016-09-06 Thread finalpatch via Digitalmars-d

On Tuesday, 6 September 2016 at 14:26:22 UTC, finalpatch wrote:

On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:
Then some template magic will figure out the LCM of the 2 
kernels' pixel width is 3*4=12 and therefore they are fused 
together into a composite kernel of pixel width 12.  The above 
line compiles down into a single function invokation, with a 
main loop that reads the source buffer in 4 pixels step, call 
MySimpleKernel 3 times, then call AnotherKernel 4 times.


Correction:
with a main loop that reads the source buffer in *12* pixels 
step, call MySimpleKernel 3 times, then call AnotherKernel 4 
times.


And of course the key to the speed is that all function calls get
inlined by the compiler.


Re: Taking pipeline processing to the next level

2016-09-06 Thread Manu via Digitalmars-d
On 7 September 2016 at 00:26, finalpatch via Digitalmars-d
 wrote:
> On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:
>>
>> Then some template magic will figure out the LCM of the 2 kernels' pixel
>> width is 3*4=12 and therefore they are fused together into a composite
>> kernel of pixel width 12.  The above line compiles down into a single
>> function invokation, with a main loop that reads the source buffer in 4
>> pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
>
>
> Correction:
> with a main loop that reads the source buffer in *12* pixels step, call
> MySimpleKernel 3 times, then call AnotherKernel 4 times.

Interesting thoughts. What did you do when buffers weren't a
multiple of the kernel width?


Re: Taking pipeline processing to the next level

2016-09-06 Thread Wyatt via Digitalmars-d

On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:


A central premise of performance-oriented programming which I've
employed my entire career, is "where there is one, there is 
probably

many", and if you do something to one, you should do it to many.


From a conceptual standpoint, this sounds like the sort of thing 
array languages like APL and J thrive on, so there's solid 
precedent for the concept.  I might suggest looking into 
optimising compilers in that space for inspiration and such; 
APEX, for example: http://www.snakeisland.com/apexup.htm


Of course, this comes with the caveat that this is (still!) some 
relatively heavily-academic stuff.  And I'm not sure to what 
extent that can help mitigate the problem of relaxing type 
requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" 
vector for SIMD on modern processors.


-Wyatt


Re: Taking pipeline processing to the next level

2016-09-06 Thread finalpatch via Digitalmars-d

On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote:

with a main loop that reads the source buffer in *12* pixels 
step, call

MySimpleKernel 3 times, then call AnotherKernel 4 times.


It's interesting thoughts. What did you do when buffers weren't 
multiple of the kernels?


The end of a scan line is special-cased. If I need 12 pixels for
the last iteration but there are only 8 left, an instance of
Kernel::InputVector is allocated on the stack, the 8 remaining
pixels are memcpy'd into it, and it is then sent to the kernel.
Output from the kernel is also assigned to a stack variable first,
then 8 pixels are memcpy'd to the output buffer.
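
In D, that tail bounce might look like this (a sketch; kernel and
its fused width W are hypothetical):

    // run a kernel over whole W-pixel steps, bouncing the ragged
    // tail through stack scratch buffers
    void runWithTail(alias kernel, size_t W)(const(float)[] src,
                                             float[] dst)
    {
        size_t i = 0;
        for (; i + W <= src.length; i += W)
            kernel(src[i .. i + W], dst[i .. i + W]);

        if (i < src.length)
        {
            float[W] tmpIn = 0, tmpOut = void;
            const n = src.length - i;
            tmpIn[0 .. n] = src[i .. $];
            kernel(tmpIn[], tmpOut[]);
            dst[i .. $] = tmpOut[0 .. n];
        }
    }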


Re: Taking pipeline processing to the next level

2016-09-06 Thread Manu via Digitalmars-d
On 7 September 2016 at 01:54, Wyatt via Digitalmars-d
 wrote:
> On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:
>>
>>
>> A central premise of performance-oriented programming which I've
>> employed my entire career, is "where there is one, there is probably
>> many", and if you do something to one, you should do it to many.
>
>
> From a conceptual standpoint, this sounds like the sort of thing array
> languages like APL and J thrive on, so there's solid precedent for the
> concept.  I might suggest looking into optimising compilers in that space
> for inspiration and such; APEX, for example:
> http://www.snakeisland.com/apexup.htm

Thanks, that's really interesting, I'll check it out.


> Of course, this comes with the caveat that this is (still!) some relatively
> heavily-academic stuff.  And I'm not sure to what extent that can help
> mitigate the problem of relaxing type requirements such that you can e.g.
> efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors.

That's not what I want, though.
I intend to hand-write that function (I was just giving examples of
how auto-vectorisation almost always fails); the question here is how
to work that new array function into our pipelines transparently...


Re: Taking pipeline processing to the next level

2016-09-06 Thread Manu via Digitalmars-d
On 7 September 2016 at 07:11, finalpatch via Digitalmars-d
 wrote:
> On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote:
>
>>> with a main loop that reads the source buffer in *12* pixels step, call
>>> MySimpleKernel 3 times, then call AnotherKernel 4 times.
>>
>>
>> It's interesting thoughts. What did you do when buffers weren't multiple
>> of the kernels?
>
>
> The end of a scan line is special cased . If I need 12 pixels for the last
> iteration but there are only 8 left, an instance of Kernel::InputVector is
> allocated on stack, 8 remaining pixels are memcpy into it then send to the
> kernel. Output from kernel are also assigned to a stack variable first, then
> memcpy 8 pixels to the output buffer.

Right, and this is a classic problem with this sort of function; it is
only more efficient if numElements is suitably long.
See, I often wonder if it would be worth being able to provide both
functions, a scalar and an array version, and have the algorithms
select between them intelligently.
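
A sketch of what that selection could look like (transformThings and
transform are the hypothetical batch/scalar pair from earlier):

    import std.algorithm : copy, map;

    void applyTransform(R, T)(R data, T[] output)
    {
        static if (is(R : const(T)[]))
            transformThings(data, output); // contiguous: batch version
        else
            data.map!(x => transform(x)).copy(output); // element-wise
    }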


Re: Taking pipeline processing to the next level

2016-09-06 Thread finalpatch via Digitalmars-d

On Wednesday, 7 September 2016 at 00:21:23 UTC, Manu wrote:
The end of a scan line is special cased . If I need 12 pixels 
for the last iteration but there are only 8 left, an instance 
of Kernel::InputVector is allocated on stack, 8 remaining 
pixels are memcpy into it then send to the kernel. Output from 
kernel are also assigned to a stack variable first, then 
memcpy 8 pixels to the output buffer.


Right, and this is a classic problem with this sort of 
function; it is

only more efficient if numElements is suitable long.
See, I often wonder if it would be worth being able to provide 
both
functions, a scalar and array version, and have the algorithms 
select

between them intelligently.


We normally process full HD or higher resolution images so the 
overhead of having to copy the last iteration was negligible.


It was fairly easy to put together a scalar version, as they are
much easier to write than the SIMD ones.  In fact I had a scalar
version for every SIMD kernel, and used them for unit testing.

It shouldn't be hard to have the framework look at the buffer
size and choose the scalar version when the number of elements is
small; it wasn't done that way simply because we didn't need it.




Re: Taking pipeline processing to the next level

2016-09-06 Thread Manu via Digitalmars-d
On 7 September 2016 at 11:04, finalpatch via Digitalmars-d
 wrote:
>
> It shouldn't be hard to have the framework look at the buffer size and
> choose the scalar version when number of elements are small, it wasn't done
> that way simply because we didn't need it.

No, what's hard is working this into D's pipeline patterns seamlessly.


Re: Taking pipeline processing to the next level

2016-09-06 Thread finalpatch via Digitalmars-d

On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote:
On 7 September 2016 at 11:04, finalpatch via Digitalmars-d 
 wrote:


It shouldn't be hard to have the framework look at the buffer 
size and choose the scalar version when number of elements are 
small, it wasn't done that way simply because we didn't need 
it.


No, what's hard is working this into D's pipeline patterns 
seamlessly.


The lesson I learned from this is that you need the user code to 
provide a lot of extra information about the algorithm at compile 
time for the  templates to work out a way to fuse pipeline stages 
together efficiently.


I believe it is possible to get something similar in D, because D
has more powerful templates than C++, and D also has some type
introspection which C++ lacks.  Unfortunately I'm not as good at
D, so I can only provide some ideas rather than actual working
code.


Once this problem is solved, the benefit is huge.  It allowed me 
to perform high level optimizations (streaming load/save, 
prefetching, dynamic dispatching depending on data alignment 
etc.) in the main loop which automatically benefits all kernels 
and pipelines.




Re: Taking pipeline processing to the next level

2016-09-06 Thread Manu via Digitalmars-d
On 7 September 2016 at 12:00, finalpatch via Digitalmars-d
 wrote:
> On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote:
>>
>> On 7 September 2016 at 11:04, finalpatch via Digitalmars-d
>>  wrote:
>>>
>>>
>>> It shouldn't be hard to have the framework look at the buffer size and
>>> choose the scalar version when number of elements are small, it wasn't done
>>> that way simply because we didn't need it.
>>
>>
>> No, what's hard is working this into D's pipeline patterns seamlessly.
>
>
> The lesson I learned from this is that you need the user code to provide a
> lot of extra information about the algorithm at compile time for the
> templates to work out a way to fuse pipeline stages together efficiently.
>
> I believe it is possible to get something similar in D because D has more
> powerful templates than C++ and D also has some type introspection which C++
> lacks.  Unfortunately I'm not as good on D so I can only provide some ideas
> rather than actual working code.
>
> Once this problem is solved, the benefit is huge.  It allowed me to perform
> high level optimizations (streaming load/save, prefetching, dynamic
> dispatching depending on data alignment etc.) in the main loop which
> automatically benefits all kernels and pipelines.

Exactly!


Re: Taking pipeline processing to the next level

2016-09-07 Thread finalpatch via Digitalmars-d

On Wednesday, 7 September 2016 at 02:09:17 UTC, Manu wrote:
The lesson I learned from this is that you need the user code 
to provide a lot of extra information about the algorithm at 
compile time for the templates to work out a way to fuse 
pipeline stages together efficiently.


I believe it is possible to get something similar in D because 
D has more powerful templates than C++ and D also has some 
type introspection which C++ lacks.  Unfortunately I'm not as 
good on D so I can only provide some ideas rather than actual 
working code.


Once this problem is solved, the benefit is huge.  It allowed 
me to perform high level optimizations (streaming load/save, 
prefetching, dynamic dispatching depending on data alignment 
etc.) in the main loop which automatically benefits all 
kernels and pipelines.


Exactly!


I think the problem here is twofold.

First question: how do we combine pipeline stages with minimal
overhead?


I think the key to this problem is reliable *forceinline*

for example, a pipeline like this

input.map!(x=>x.f1().f2().f3().store(output));

if we could make sure f1(), f2(), f3(), store(), and map() itself
are all inlined, then we end up with a single loop with no
function calls, and the compiler is free to perform cross-function
optimizations. This is about as good as you can get.
Unfortunately, at the moment I hear it's difficult to make sure D
functions get inlined.
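
For what it's worth, D does have a per-declaration hint (a sketch;
dmd honors it, and ldc/gdc have their own force-inline attributes):

    pragma(inline, true)
    float f1(float x) { return x * 2; }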


Second question: how do we combine SIMD pipeline stages with
minimal overhead?


Besides reliable inlining, we also need some template code to 
repeat stages until their strides match. This requires details 
about each stage's logical unit size, input/output type and size 
at compile time. I can't think of what the interface of this 
would look like but the current map!() is likely insufficient to 
support this.


I still don't believe auto-selection between scalar and vector
paths would be a very useful feature. Normally I would only
consider a SIMD solution when I know in advance that this is a
performance hotspot. When the amount of data is small I simply
don't care about performance and would just choose whatever is
the simplest way to do it, like map!(), because the performance
impact is not noticeable and definitely not worth the increased
complexity.




Re: Taking pipeline processing to the next level

2016-09-07 Thread Marc Schütz via Digitalmars-d

On Wednesday, 7 September 2016 at 10:31:13 UTC, finalpatch wrote:

I think the problem here is two fold.

First question, how do we combine pipeline stages with minimal 
overhead


I think the key to this problem is reliable *forceinline*

for example, a pipeline like this

input.map!(x=>x.f1().f2().f3().store(output));

if we could make sure f1(), f2(), f3(), store(), and map() 
itself are all inlined, then we end up with a single loop with 
no function calls and the compiler is free to perform cross 
function optimizations. This is about as good as you can get.  
Unfortunately at the moment I hear it's difficult to make sure 
D functions get inlined.




If the compiler is unable to inline (or wrongly decides it is too
costly), I'd consider this a compiler bug. Of course, workarounds
like `pragma(inline, true)` or `@forceinline` might be needed from
time to time in practice, but they shouldn't influence the design
of the pipeline interface.


Second question, how do we combine SIMD pipeline stages with 
minimal overhead


Besides reliable inlining, we also need some template code to 
repeat stages until their strides match. This requires details 
about each stage's logical unit size, input/output type and 
size at compile time. I can't think of what the interface of 
this would look like but the current map!() is likely 
insufficient to support this.


Would a `vectorize` range adapter be feasible that prepares the 
input to make it SIMD compatible? That is, force alignment, 
process left-over elements at the end, etc.? As far as I 
understand, the problems with auto vectorization stem from a 
difficulty of compilers to recognize vectorizing opportunities, 
and (as Manu described) from incompatible semantics of scalar and 
vector types that the compiler needs to preserve. But if that 
hypothetical `vectorize` helper forces the input data into one of 
a handful of well-known formats and types, wouldn't it be 
possible to make the compilers recognize those (especially if 
they are accompanied by suitable pragma or other compiler hints)?
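
A sketch of what such a (hypothetical) `vectorize` adapter could build
on: hand the downstream stage SIMD-width blocks instead of single
elements (a real version would also force alignment and special-case
the ragged tail):

    import std.range : chunks;

    auto vectorize(T)(T[] data, size_t width = 4)
    {
        // each element of the result is a width-sized slice, ready
        // for a batch/SIMD stage to consume whole
        return data.chunks(width);
    }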




I still don't believe auto-select between scalar or vector 
paths would be a very useful feature. Normally I would only 
consider SIMD solution when I know in advance that this is a 
performance hotspot. When the amount of data is small I simply 
don't care about performance and would just choose whatever 
simplest way to do it, like map!(), because the performance 
impact is not noticeable and definitely not worth the increased 
complexity.


In the above scenario, you can add `.vectorize` to the pipeline 
to enable vectorizing wherever you need it.


Re: Taking pipeline processing to the next level

2016-09-07 Thread finalpatch via Digitalmars-d

On Wednesday, 7 September 2016 at 11:53:00 UTC, Marc Schütz wrote:
Would a `vectorize` range adapter be feasible that prepares the 
input to make it SIMD compatible? That is, force alignment, 
process left-over elements at the end, etc.? As far as I 
understand, the problems with auto vectorization stem from a 
difficulty of compilers to recognize vectorizing opportunities, 
and (as Manu described) from incompatible semantics of scalar 
and vector types that the compiler needs to preserve. But if 
that hypothetical `vectorize` helper forces the input data into 
one of a handful of well-known formats and types, wouldn't it 
be possible to make the compilers recognize those (especially 
if they are accompanied by suitable pragma or other compiler 
hints)?




Contrary to popular belief, alignment is not a showstopper for
SIMD code. Both Intel and ARM processors have instructions to
access data at unaligned addresses, and on Intel processors there
is not even any speed penalty for using them on aligned
addresses.  Which means you can either forget about it (on Intel)
or just check the data alignment before you start and choose an
optimal specialization of the main loop.


However, regarding auto-vectorization, I'm with Manu. I won't put
my bet on auto-vectorization, because I have never seen any
non-trivial auto-vectorized code that comes even close to hand-tuned
SIMD code. The compiler always has to play conservatively.
The compiler has no idea that you are only using 10 bits of each
16-bit component in a vector. It can't even help you shuffle
RGBARGBARGBARGBA into <RRRR, GGGG, BBBB, AAAA>.  The best we can
do is to create something that makes writing SIMD kernels
easy/reusable/composable.


Re: Taking pipeline processing to the next level

2016-09-07 Thread Wyatt via Digitalmars-d

On Wednesday, 7 September 2016 at 00:18:59 UTC, Manu wrote:
On 7 September 2016 at 01:54, Wyatt via Digitalmars-d 
 wrote:


Thanks, that's really interesting, I'll check it out.


Here's some work on static rank polymorphism that might also be 
applicable?:

http://www.ccs.neu.edu/home/pete/pub/esop-2014.pdf

And in the Related Work, I just noticed Halide, which sounds like 
it's right up your alley:

http://halide-lang.org/

Of course, this comes with the caveat that this is (still!) 
some relatively heavily-academic stuff.  And I'm not sure to 
what extent that can help mitigate the problem of relaxing 
type requirements such that you can e.g. efficiently ,/⍉ your 
4 2⍴"LR" vector for SIMD on modern processors.


That's not what I want though.
I intend to hand-write that function (I was just giving 
examples of
how auto-vectorisation almost always fails), the question here 
is, how
to work that new array function into our pipelines 
transparently...


Ah, I misunderstood.  Sorry.  I had the impression that you 
wanted to be able to simply write:


data.map!(x => transform(x)).copy(output);

...for any data[] and have it lift the transformation to the 
whole vector.  If you're doing the work, I'm curious what you're 
hoping the end result to look like in terms of the code you want 
to be able to write.  Just a doodle is fine, it doesn't have to 
stand up to scrutiny.


-Wyatt