https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824
--- Comment #1 from Chris Elrod <elrodc at gmail dot com> ---
Here I have added a godbolt example where I manually unroll the array, and GCC generates excellent code for it: https://godbolt.org/z/sd4bhGW7e

I'm not sure it is 100% optimal, but with an inner Dual size of `7`, on Skylake-X it is 38 uops for GCC with the array manually unrolled into separate struct fields, vs 49 uops for Clang, vs 67 uops for GCC with arrays. uiCA expects <14 clock cycles for the manually unrolled version vs >23 for the array version.

My experience so far with expression templates has borne this out: compilers seem to struggle with peeling away abstractions.
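
For readers who cannot open the godbolt link, here is a minimal sketch of the kind of contrast being described: the same dual-number product written once over an array of partials and once over separate struct fields. It is not the code from the bug report. The names `DualArray`, `DualFields3` and `mul` are hypothetical, and the size is reduced from 7 to 3 to keep the unrolled version short.

```cpp
#include <array>
#include <cstddef>

// Hypothetical dual number: a value plus N partial derivatives in an array.
template <std::size_t N>
struct DualArray {
  double value;
  std::array<double, N> partials;
};

// Product rule d(a*b) = a*db + b*da, written as a loop over the array.
template <std::size_t N>
DualArray<N> mul(const DualArray<N>& a, const DualArray<N>& b) {
  DualArray<N> r{a.value * b.value, {}};
  for (std::size_t i = 0; i < N; ++i)
    r.partials[i] = a.value * b.partials[i] + b.value * a.partials[i];
  return r;
}

// The same data for N = 3, manually unrolled into separate named fields.
struct DualFields3 {
  double value, p0, p1, p2;
};

// The same product rule with every partial spelled out explicitly.
DualFields3 mul(const DualFields3& a, const DualFields3& b) {
  return {a.value * b.value,
          a.value * b.p0 + b.value * a.p0,
          a.value * b.p1 + b.value * a.p1,
          a.value * b.p2 + b.value * a.p2};
}
```

Both versions compute the same result, so any difference in the generated assembly comes down to how well the compiler sees through the array. Whether this reduced sketch reproduces the uop counts above is untested; comparing the two `mul` functions at `-O3` in a compiler explorer would be the way to check.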