On Wednesday, 7 September 2016 at 02:09:17 UTC, Manu wrote:
The lesson I learned from this is that you need the user code
to provide a lot of extra information about the algorithm at
compile time for the templates to work out a way to fuse
pipeline stages together efficiently.
I believe it is possible to get something similar in D because
D has more powerful templates than C++ and D also has some
type introspection which C++ lacks. Unfortunately I'm not as
proficient in D, so I can only offer some ideas rather than
actual working code.
Once this problem is solved, the benefit is huge. It allowed
me to perform high level optimizations (streaming load/save,
prefetching, dynamic dispatching depending on data alignment
etc.) in the main loop which automatically benefits all
kernels and pipelines.
Exactly!
I think the problem here is twofold.
First question: how do we combine pipeline stages with minimal
overhead?
I think the key to this problem is reliable *forceinline*.
For example, take a pipeline like this:
input.map!(x=>x.f1().f2().f3().store(output));
If we could make sure f1(), f2(), f3(), store(), and map() itself
are all inlined, then we end up with a single loop with no
function calls and the compiler is free to perform cross-function
optimizations. This is about as good as you can get.
Unfortunately, from what I hear, it is currently difficult to
make sure D functions get inlined.
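
To make the first point concrete, here is a rough sketch of the "single loop" I have in mind, next to the map!() form. The bodies of f1, f2 and f3 are made up purely for illustration, and fusedLoop and viaMap are hypothetical names. pragma(inline, true) is the language's way of requesting inlining; how strictly it is honoured depends on the compiler and its version, so treat this as a statement of intent rather than a guarantee.

```d
import std.algorithm : map;
import std.array : array;

// Hypothetical stage functions. Only the names come from the
// pipeline above; the arithmetic is invented for the example.
pragma(inline, true)
float f1(float x) { return x * 2.0f; }

pragma(inline, true)
float f2(float x) { return x + 1.0f; }

pragma(inline, true)
float f3(float x) { return x * x; }

// The hand-written loop that a fully inlined pipeline should
// reduce to: one pass over the data, no calls left in the body.
void fusedLoop(const(float)[] input, float[] output)
{
    foreach (i, x; input)
        output[i] = f3(f2(f1(x)));
}

// The same computation expressed as a range pipeline. Whether the
// generated code matches fusedLoop depends on the lambda, map's
// internals and the stages all being inlined.
float[] viaMap(const(float)[] input)
{
    return input.map!(x => x.f1.f2.f3).array;
}
```

Both functions compute the same values; the open question is only whether the compiler produces the same machine code for them.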
Second question: how do we combine SIMD pipeline stages with
minimal overhead?
Besides reliable inlining, we also need some template code to
repeat stages until their strides match. This requires details
about each stage's logical unit size, input/output type and size
at compile time. I can't yet picture what such an interface
would look like, but the current map!() is likely insufficient
to support it.
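
As a very rough sketch of the stride matching idea, assume each stage is a type that exposes its logical unit size as a compile-time constant. Everything here (Scale4, Offset8, the stride member, lcm, fused) is hypothetical and only meant to show the shape of the problem; plain float slices stand in for real SIMD vectors, and the tail that does not fill a whole block is deliberately left unhandled.

```d
// Least common multiple, usable at compile time through CTFE.
size_t lcm(size_t a, size_t b)
{
    size_t x = a, y = b;
    while (y != 0) { auto t = x % y; x = y; y = t; }
    return a / x * b;
}

// Hypothetical stage: consumes and produces 4 elements per call.
struct Scale4
{
    enum size_t stride = 4;
    static void run(const(float)[] src, float[] dst)
    {
        foreach (i; 0 .. stride) dst[i] = src[i] * 2.0f;
    }
}

// Hypothetical stage: consumes and produces 8 elements per call.
struct Offset8
{
    enum size_t stride = 8;
    static void run(const(float)[] src, float[] dst)
    {
        foreach (i; 0 .. stride) dst[i] = src[i] + 1.0f;
    }
}

// Fuses two stages by repeating each one until both have covered
// the same block of lcm(A.stride, B.stride) elements.
void fused(A, B)(const(float)[] input, float[] output)
{
    enum size_t block = lcm(A.stride, B.stride);
    float[block] tmp; // intermediate results for one block

    for (size_t pos = 0; pos + block <= input.length; pos += block)
    {
        // Repeat stage A block / A.stride times, unrolled at compile time.
        static foreach (k; 0 .. block / A.stride)
            A.run(input[pos + k * A.stride .. pos + (k + 1) * A.stride],
                  tmp[k * A.stride .. (k + 1) * A.stride]);

        // Repeat stage B block / B.stride times on the intermediate data.
        static foreach (k; 0 .. block / B.stride)
            B.run(tmp[k * B.stride .. (k + 1) * B.stride],
                  output[pos + k * B.stride .. pos + (k + 1) * B.stride]);
    }
    // Elements past the last full block are not processed in this sketch.
}
```

With these two stages, fused!(Scale4, Offset8) runs Scale4 twice and Offset8 once per 8-element block. A real interface would also need each stage's input and output element types, which this sketch fixes to float.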
I still don't believe auto-selecting between scalar and vector
paths would be a very useful feature. Normally I would only
consider a SIMD solution when I know in advance that this is a
performance hotspot. When the amount of data is small I simply
don't care about performance and would just choose the simplest
way to do it, like map!(), because the performance impact is not
noticeable and definitely not worth the increased complexity.