On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote:
> On 7 September 2016 at 11:04, finalpatch via Digitalmars-d
> <digitalmars-d@puremagic.com> wrote:
>> It shouldn't be hard to have the framework look at the buffer
>> size and choose the scalar version when the number of elements
>> is small; it wasn't done that way simply because we didn't need
>> it.
> No, what's hard is working this into D's pipeline patterns
> seamlessly.
The lesson I learned from this is that you need the user code to
provide a lot of extra information about the algorithm at compile
time for the templates to work out a way to fuse pipeline stages
together efficiently.
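
To make that concrete, here is a minimal C++ sketch of the idea, not the actual framework. The stage names (`Scale2`, `AddOne`), the `elementwise` flag, and the `Fused` / `run_pipeline` helpers are all hypothetical, made up for illustration. Each stage advertises a compile-time fact about itself, and the template uses it to collapse the stages into one per-element function, so the data is traversed once with no intermediate buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stages. Each one exposes compile-time information
// (here just a flag) that the fusing template can inspect.
struct Scale2 {
    static constexpr bool elementwise = true;  // out[i] depends only on in[i]
    static float apply(float x) { return x * 2.0f; }
};

struct AddOne {
    static constexpr bool elementwise = true;
    static float apply(float x) { return x + 1.0f; }
};

// Fused<A, B, C>::apply(x) computes C(B(A(x))), composed at compile time.
template <typename... Stages>
struct Fused;

template <>
struct Fused<> {
    static float apply(float x) { return x; }  // empty pipeline: identity
};

template <typename First, typename... Rest>
struct Fused<First, Rest...> {
    static_assert(First::elementwise,
                  "only element-wise stages can be fused this way");
    static float apply(float x) {
        return Fused<Rest...>::apply(First::apply(x));
    }
};

// One pass over the data; no temporary buffer between stages.
template <typename... Stages>
void run_pipeline(const float* in, float* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = Fused<Stages...>::apply(in[i]);
}
```

With this, `run_pipeline<Scale2, AddOne>(in, out, n)` computes `in[i] * 2 + 1` in a single loop. A stage that needs neighbouring elements would have to declare that somehow, which is exactly the kind of extra information the user code has to supply.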
I believe it is possible to get something similar in D because D
has more powerful templates than C++ and D also has some type
introspection which C++ lacks. Unfortunately I'm not as good at
D, so I can only provide some ideas rather than actual working
code.
Once this problem is solved, the benefit is huge. It allowed me
to perform high-level optimizations (streaming load/save,
prefetching, dynamic dispatch depending on data alignment,
etc.) in the main loop, so all kernels and pipelines benefit
automatically.
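
A rough C++ sketch of what doing the dispatch in the main loop could look like, under stated assumptions: the `Square` kernel, the `Path` enum, the `kWidth` / `kSmall` constants, and the `run` driver are hypothetical names, and the "block" function is a plain loop standing in for real SIMD code. The point is only that the size and alignment checks live in one shared driver, so any kernel plugged into it gets them for free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWidth = 4;  // assumed vector width, in floats
constexpr std::size_t kSmall = 8;  // assumed threshold for the scalar path

enum class Path { Scalar, Unaligned, Aligned };

// Hypothetical kernel: a scalar version plus a block version.
struct Square {
    static float scalar(float x) { return x * x; }

    // Stand-in for SIMD code; Aligned would select aligned load/store.
    template <bool Aligned>
    static void block(const float* in, float* out) {
        for (std::size_t i = 0; i < kWidth; ++i) out[i] = in[i] * in[i];
    }
};

inline bool aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Full blocks first, then a scalar tail for the leftover elements.
template <typename Kernel, bool Aligned>
void block_loop(const float* in, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + kWidth <= n; i += kWidth)
        Kernel::template block<Aligned>(in + i, out + i);
    for (; i < n; ++i) out[i] = Kernel::scalar(in[i]);
}

// Shared main loop: picks a path at run time and reports which one ran.
template <typename Kernel>
Path run(const float* in, float* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    if (n < kSmall) {  // small buffer: not worth the block setup
        for (std::size_t i = 0; i < n; ++i) out[i] = Kernel::scalar(in[i]);
        return Path::Scalar;
    }
    if (aligned16(in) && aligned16(out)) {
        block_loop<Kernel, true>(in, out, n);
        return Path::Aligned;
    }
    block_loop<Kernel, false>(in, out, n);
    return Path::Unaligned;
}
```

Prefetching or streaming stores would go into `block_loop` in the same way, without touching any kernel.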