https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930
--- Comment #8 from Adam Hirst <adam at aphirst dot karoo.co.uk> ---
Ah, it seems that Jerry was tinkering with tp_array.f90 (the intrinsic-array version of the Vector type), while I was working with tp_xyz.f90 (explicit separate components). I was going to remark on how he didn't need -flto to get any of the matmul paths performing better than the DO/SUM paths. I'm curious whether he can reproduce my results on his system, but I'll first reproduce his.

1) When I use his modified TP_LEFT and compile under only -O2, I find, as he does, that the matmul path is faster than the DO/SUM path. Not by as large a margin, but I expect that varies from system to system.

2) I notice that he moved the matmul() calls out of the dot_product() calls, but didn't move the D%vec references out of matmul(). If I do the same in tp_xyz.f90 and recompile under simply -O2, I get the same kind of performance boost as Jerry does.

What do you think the reason could be that:

  Dx = D%x
  Dy = D%y
  Dz = D%z
  NUDx = matmul(NU, Dx)
  NUDy = matmul(NU, Dy)
  NUDz = matmul(NU, Dz)
  tensorproduct%x = ...

performs so much worse with -O2 than

  NUDx = matmul(NU, D%x)
  NUDy = matmul(NU, D%y)
  NUDz = matmul(NU, D%z)
  tensorproduct%x = ...

that the former needs -flto to be able to compete?

---

It's probably important that we stay clear on which version of the Vector type we're using for the tests, since (as someone, probably Jerry, commented to me earlier) array-stride shenanigans are bound to play some role.
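For reference, here is a minimal, self-contained sketch of the two variants being compared. The type layout, the names (Vector, D, NU, n), and the sizes are my assumptions based on the description of tp_xyz.f90 above, not the actual test file; with separate x/y/z scalar components, D%x is a stride-3 component array section, so the first variant pays an explicit contiguous copy before matmul even runs:

  program tp_variants
    implicit none
    integer, parameter :: n = 64
    ! Assumed layout: explicit separate components, as in tp_xyz.f90.
    type :: Vector
       real :: x, y, z
    end type Vector
    type(Vector) :: D(n)
    real :: NU(n,n), Dx(n), NUDx_a(n), NUDx_b(n)

    call random_number(NU)
    call random_number(D%x)

    ! Variant 1: copy the strided component section into a local
    ! contiguous array first. The assignment is an extra O(n) copy
    ! which -O2 alone may not be able to elide.
    Dx = D%x
    NUDx_a = matmul(NU, Dx)

    ! Variant 2: pass the component section directly; any packing
    ! needed for matmul is left entirely to the compiler/runtime.
    NUDx_b = matmul(NU, D%x)

    ! Both variants compute the same product.
    print *, maxval(abs(NUDx_a - NUDx_b))
  end program tp_variants

Timing each variant in a loop (with -O2 vs. -O2 -flto) should reproduce the discrepancy described above, if my reading of the setup is right.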