https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #1 from Chris Elrod <elrodc at gmail dot com> ---
Here is a Godbolt example where I manually unroll the array into separate
struct fields; GCC generates excellent code for it: https://godbolt.org/z/sd4bhGW7e
I'm not sure it is 100% optimal, but with an inner Dual size of `7`, on
Skylake-X it is 38 uops for unrolled GCC with separate struct fields, vs 49
uops for Clang, vs 67 for GCC with arrays.
uiCA predicts <14 clock cycles for the manually unrolled version vs >23 for
the array version.
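
For readers who do not want to open the link, here is a minimal sketch of the two layouts being compared. It is not the code from the Godbolt link: the names `DualArray` and `DualFields2` are hypothetical, and it uses 2 partials instead of 7 to stay short. It only illustrates what "arrays" vs "separate struct fields" means for a dual number.

```cpp
#include <array>
#include <cstddef>

// Hypothetical array-backed dual number: a value plus N partial derivatives.
template <std::size_t N>
struct DualArray {
  double value;
  std::array<double, N> partials;
};

// Product rule, written as a loop over the partials array.
template <std::size_t N>
inline DualArray<N> operator*(const DualArray<N>& a, const DualArray<N>& b) {
  DualArray<N> r{a.value * b.value, {}};
  for (std::size_t i = 0; i < N; ++i)
    r.partials[i] = a.value * b.partials[i] + a.partials[i] * b.value;
  return r;
}

// Hypothetical hand-unrolled variant for N = 2: each partial is its own field.
struct DualFields2 {
  double value;
  double p0;
  double p1;
};

// Same product rule, with one expression per field and no loop.
inline DualFields2 operator*(const DualFields2& a, const DualFields2& b) {
  return {a.value * b.value,
          a.value * b.p0 + a.p0 * b.value,
          a.value * b.p1 + a.p1 * b.value};
}
```

Both versions compute the same values; the difference I am reporting is only in the code the compiler generates for them.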

My experience so far with expression templates has borne this out: compilers
seem to struggle with peeling away abstractions.
