On 5 September 2016 at 18:21, Andrei Alexandrescu via Digitalmars-d <[email protected]> wrote:
> On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
>>
>> I mostly code like this now:
>>   data.map!(x => transform(x)).copy(output);
>>
>> It's convenient and reads nicely, but it's generally inefficient.
>
>
> What are the benchmarks and the numbers? What loss are you looking at? --
> Andrei
Well, it totally depends. Right now, in my case, 'transform' is some image processing code (in the past, when I've had these same thoughts, it has been audio filters). You can't touch pixels (or samples) one at a time: they need manual SIMD deployment (I've never seen an auto-vectoriser handle saturation arithmetic or type promotion), the alpha component (every 4th byte) is treated differently, and memory access patterns need to be tuned to be cache friendly.

I haven't done benchmarks right now, but I've done them professionally in the past, and it's not unusual for a hand-written image processing loop to see 1 or even 2 orders of magnitude improvement over calling a function for each pixel in a loop. The sorts of low-level optimisations you deploy in image and audio processing loops are not things I've ever seen any optimiser even attempt.

Some core problems that tend to require manual intervention in hot loops are:

- ubyte[16] <-> ushort[8][2] expansion/contraction
- ubyte[16] <-> float[4][4] expansion/contraction
- saturation
- scalar operator results promote to int, but wide SIMD operations don't, which means some scalar expressions can't be losslessly collapsed into SIMD operations, and the compiler will always be conservative on this matter. If the auto-vectoriser tries at all, you will see a mountain of extra code to preserve those bits that the scalar operator semantics would have guaranteed
- wide-vector multiplication is semantically different from scalar multiplication, so the optimiser has a lot of trouble vectorising muls
- assumptions about data alignment
- interleaved data: audio samples are usually [L,R] interleaved, images often [RGB,A], and different processes are applied across the separation. You want to unroll and shuffle the data so you have vectors [LLLL],[RRRR], or [RGBRGBRGBRGB],[AAAA], and I haven't seen an optimiser go near that
- vector dot-product is always a nuisance

I could go on and on. The point is, as an end user, pipeline APIs are great. As a library author, I want to present the best-performing library I can, which I think means we need to find a way to conveniently connect these two currently disconnected worlds. I've explored to some extent, but I've never come up with anything that I like.
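For the record, here's a very rough sketch of the kind of bridging I mean, in the terms of the example above. brighten() and brightenBlock() are just made-up stand-ins for a real transform; the only point is that a chunked pipeline hands the library a block at a time, so the SIMD decisions can live inside the library instead of being left to the auto-vectoriser:

import std.algorithm.comparison : min;
import std.algorithm.iteration : joiner, map;
import std.algorithm.mutation : copy;
import std.range : chunks;

// hypothetical per-pixel transform: a saturating brighten of one channel
ubyte brighten(ubyte x)
{
    // scalar ops promote to int; min() and the cast restore ubyte semantics
    return cast(ubyte) min(x + 64, 255);
}

// hypothetical block transform: the same operation, but the library author
// receives a whole slice and is free to hand-SIMD the inside of the loop
ubyte[] brightenBlock(ubyte[] block)
{
    auto result = new ubyte[block.length];
    foreach (i, x; block)
        result[i] = cast(ubyte) min(x + 64, 255);
    return result;
}

void main()
{
    auto data = new ubyte[4096];
    auto output = new ubyte[4096];

    // element-at-a-time pipeline: convenient, but brighten() runs per byte
    // and the optimiser has to vectorise across the call boundary
    data.map!(x => brighten(x)).copy(output);

    // block-at-a-time pipeline: the transform sees 16-byte chunks, so the
    // vectorisation work belongs to the library, not the glue code
    data.chunks(16)
        .map!(b => brightenBlock(b))
        .joiner
        .copy(output);
}

It's not a real solution (block size, alignment and interleaving still have to be negotiated between the caller and the library), but it shows the shape of the disconnect.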
