On Wednesday, 7 September 2016 at 02:09:17 UTC, Manu wrote:
The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently.

I believe it is possible to get something similar in D because D has more powerful templates than C++, and D also has some type introspection which C++ lacks. Unfortunately, I'm not as proficient in D, so I can only provide some ideas rather than actual working code.

Once this problem is solved, the benefit is huge. It allowed me to perform high level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment etc.) in the main loop which automatically benefits all kernels and pipelines.

Exactly!

I think the problem here is twofold.

First question: how do we combine pipeline stages with minimal overhead?

I think the key to this problem is reliable *forceinline*.

For example, take a pipeline like this:

input.map!(x=>x.f1().f2().f3().store(output));

If we could make sure f1(), f2(), f3(), store(), and map() itself are all inlined, then we end up with a single loop with no function calls, and the compiler is free to perform cross-function optimizations. This is about as good as you can get. Unfortunately, at the moment I hear it's difficult to make sure D functions get inlined.
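To make the inlining idea concrete, here is a minimal sketch. D does have `pragma(inline, true)`, which asks the compiler to always inline a function; how reliably each compiler honors it through template-heavy range code is not something this sketch proves. The bodies of f1, f2 and f3 and the helper `runPipeline` are made up for illustration, and `copy` stands in for the `store(output)` step of the pipeline above.

```d
import std.algorithm : copy, map;

// Hypothetical stages: the names come from the post, the bodies are invented.
pragma(inline, true) int f1(int x) { return x + 1; }
pragma(inline, true) int f2(int x) { return x * 2; }
pragma(inline, true) int f3(int x) { return x - 3; }

// Hypothetical helper: runs the three stages over `input` in one pass
// and writes the results into a freshly allocated array.
int[] runPipeline(const(int)[] input)
{
    auto output = new int[](input.length);
    // map is lazy; copy drives it, so the stages run once per element
    // inside a single loop over the input.
    input.map!(x => x.f1.f2.f3).copy(output);
    return output;
}

void main()
{
    assert(runPipeline([1, 2, 3, 4]) == [1, 3, 5, 7]);
}
```

If every call in that chain really is inlined, the loop body reduces to plain arithmetic on x, which is the situation described above where the compiler can optimize across stage boundaries.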

Second question: how do we combine SIMD pipeline stages with minimal overhead?

Besides reliable inlining, we also need some template code to repeat stages until their strides match. This requires knowing each stage's logical unit size, input/output types and sizes at compile time. I can't yet picture what the interface for this would look like, but the current map!() is likely insufficient to support it.
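One possible shape for the stride matching, as a rough sketch only. The stage interface here (a compile-time `stride` plus a static `run`) is an assumption of mine, not an existing Phobos API, and the names AddOne, Negate, commonStride and fused are hypothetical. Plain int slices stand in for real SIMD vectors, and the tail of the data is ignored.

```d
// Hypothetical stage interface: `stride` is the number of elements a stage
// handles per call, `run` transforms exactly that many elements in place.
struct AddOne
{
    enum size_t stride = 2;
    static void run(int[] chunk) { foreach (ref v; chunk) v += 1; }
}

struct Negate
{
    enum size_t stride = 4;
    static void run(int[] chunk) { foreach (ref v; chunk) v = -v; }
}

// Small helpers, usable at compile time through CTFE.
size_t gcdOf(size_t a, size_t b)
{
    while (b != 0) { auto t = a % b; a = b; b = t; }
    return a;
}

size_t lcmOf(size_t a, size_t b) { return a / gcdOf(a, b) * b; }

// Smallest block size that is a multiple of every stage's stride.
template commonStride(Stages...)
{
    static if (Stages.length == 0)
        enum size_t commonStride = 1;
    else
        enum size_t commonStride =
            lcmOf(Stages[0].stride, commonStride!(Stages[1 .. $]));
}

// One main loop over blocks; each stage is repeated until it has covered
// the whole block, so all strides line up at block boundaries.
void fused(Stages...)(int[] data)
{
    enum block = commonStride!Stages;
    assert(data.length % block == 0, "this sketch does not handle the tail");
    for (size_t base = 0; base < data.length; base += block)
    {
        foreach (Stage; Stages) // unrolled at compile time over the stage list
        {
            // The trip count is a compile-time constant for each stage.
            foreach (rep; 0 .. block / Stage.stride)
            {
                auto lo = base + rep * Stage.stride;
                Stage.run(data[lo .. lo + Stage.stride]);
            }
        }
    }
}

void main()
{
    auto data = [1, 2, 3, 4, 5, 6, 7, 8];
    fused!(AddOne, Negate)(data); // block = lcm(2, 4) = 4
    assert(data == [-2, -3, -4, -5, -6, -7, -8, -9]);
}
```

Here AddOne runs twice per block and Negate once, all inside the same outer loop. A real version would also need per-stage input/output types and sizes, which is exactly the information map!() has no way to express today.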

I still don't believe auto-selecting between scalar and vector paths would be a very useful feature. Normally I would only consider a SIMD solution when I know in advance that the code is a performance hotspot. When the amount of data is small, I simply don't care about performance and would just choose the simplest way to do it, like map!(), because the performance impact is not noticeable and definitely not worth the increased complexity.
