Manu:
The compiler would have to do some serious magic to optimise that;
flattening both sides of the if into parallel expressions, and then
applying the mask to combine...
I think it's a small amount of magic.
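As a rough illustration of the transformation being discussed (this is not code from the paper; the function names and the particular if/else bodies are made up for the sketch), here is the same selection written once with a branch and once "flattened": both sides are evaluated for every element, and a mask picks the result. It is plain scalar C that emulates what each SIMD lane would do, not real vector code.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: each element executes only one side of the if. */
void select_branchy(const int32_t *a, const int32_t *b,
                    int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] > b[i])
            out[i] = a[i] - b[i];
        else
            out[i] = a[i] + b[i];
    }
}

/* Flattened form: compute both sides unconditionally, then blend
   them with a mask, as a SIMD lane would. Unsigned arithmetic is
   used so the bitwise operations are well defined. */
void select_masked(const int32_t *a, const int32_t *b,
                   int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t then_val = (uint32_t)a[i] - (uint32_t)b[i];
        uint32_t else_val = (uint32_t)a[i] + (uint32_t)b[i];
        /* All ones when the condition holds, all zeros otherwise. */
        uint32_t mask = (uint32_t)0 - (uint32_t)(a[i] > b[i]);
        out[i] = (int32_t)((then_val & mask) | (else_val & ~mask));
    }
}
```

The flattened loop is only valid when both sides are safe to evaluate for every element (no side effects, no faulting loads), which is the part of the "magic" a compiler has to prove.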
The simple features shown in that paper are fully focused on SIMD
programming, so they don't introduce anything that is clearly
inefficient.
I'm personally not in favour of SIMD constructs that are anything less
than optimal (but I appreciate I'm probably in the minority here).
(The simple benchmarks of the paper show a 5-15% performance loss
compared to handwritten SIMD code.)
Right, as I suspected.
15% is a very small performance loss if, for the programmer, the
alternative is writing scalar code that is 2 or 3 times slower
:-)
The SIMD programmers who can't stand a 1% loss of performance
use the intrinsics manually (or write in asm) and ignore
everything else.
A much larger population of system programmers wish to use modern
CPUs efficiently, but they don't have the time (or the skill, which
means their programs are too often buggy) for assembly-level
programming. Currently they use smart numerical C++ libraries,
use modern Fortran versions, and/or write scalar C/C++ (or
Fortran) code, add "restrict" annotations, and look at the
produced asm hoping the modern compiler back-ends will vectorize
it. This is not good enough, and the loss is far larger than 15%.
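For readers who haven't seen the "restrict and hope" style, a minimal sketch in C99 follows. The function name and the loop are chosen only for illustration; nothing here comes from the paper. The annotation only promises the compiler that the two arrays don't overlap, which removes one obstacle to auto-vectorization. Whether SIMD instructions are actually emitted still depends on the compiler, its version, and the flags used.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i], written as ordinary scalar code.
   "restrict" asserts that x and y never alias, so the compiler
   is allowed (but not required) to vectorize the loop. */
void saxpy(size_t n, float alpha,
           const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

With GCC or Clang one would typically build this at -O3 and then either read the generated asm or ask for a vectorization report (for example GCC's -fopt-info-vec or Clang's -Rpass=loop-vectorize) to see whether the loop was vectorized.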
This paper shows a third way, making this kind of programming
simpler and approachable for a wider audience, with a small
performance loss compared to handwritten code. This is what
language designers have been doing for 60+ years :-)
Bye,
bearophile