https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68775
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---

Ok, it looks like a ppc64le cross happily (eh) accepts sources preprocessed on
x86_64-linux and even required built modules. So I have the dump files myself
and the -fopt-info-vec difference (for BB vectorization only) is empty.

It looks like the only code difference is for the vectorization of the BB in
loop shell2.fppized.f90:971, that is

      do k = 1, k_max
         k1 = k_x(k); k2 = k_y(k); k3 = k_z(k)
         dot1 = k1*P1+k2*P2+k3*P3
         dot2 = g4 * (k1*k1+k2*k2+k3*k3)
         res_ij(k) = res_ij(k) + therm(k) * (fac1 * exp(cmplx(dot2,dot1,kind=kind((1.0d0,1.0d0)))))
      end do

which now has one less vector operand. If you can confirm this by "bisecting"
the file with -fdbg-cnt=vect_slp:N that would be nice.

The vectorized code looks ok to me so I suspect a target issue here. Note that
we do both a vector load from the realpart of a complex and a scalar load of
the imaginary part and then use that to construct another vector:

  _1371 = REALPART_EXPR <[shell2.fppized.f90:975:0] [shell2.fppized.f90:975:0] MEM[(complex(kind=8)[0:] *)res.0_420][_960]>;
  vectp.6451_10558 = &REALPART_EXPR <[shell2.fppized.f90:975:0] [shell2.fppized.f90:975:0] MEM[(complex(kind=8)[0:] *)res.0_420][_960]>;
  vect__1371.6452_10556 = MEM[(real(kind=8) *)vectp.6451_10558];
  _395 = IMAGPART_EXPR <[shell2.fppized.f90:975:0] [shell2.fppized.f90:975:0] MEM[(complex(kind=8)[0:] *)res.0_420][_960]>;
  [shell2.fppized.f90:975:0] _177 = _964 * _3980;
  [shell2.fppized.f90:975:0] vect_cst__10554 = {_177, _395};
  [shell2.fppized.f90:975:0] vect__455.6453_5427 = vect_cst__10554 + vect__1371.6452_10556;
  [shell2.fppized.f90:975:0] _389 = _395 + _3549;
  vectp.6455_5409 = &REALPART_EXPR <[shell2.fppized.f90:975:0] [shell2.fppized.f90:975:0] MEM[(complex(kind=8)[0:] *)res.0_420][_960]>;
  [shell2.fppized.f90:975:0] MEM[(real(kind=8) *)vectp.6455_5409] = vect__455.6453_5427;

In .optimized the above looks like

  vect__1371.6452_10556 = MEM[base: _9159, offset: 0B];
  _395 = MEM[base: _9159, offset: 8B];
  _9158 = (void *) ivtmp.7110_9170;
  [shell2.fppized.f90:975:0] _964 = MEM[base: _9158, offset: 0B];
  [shell2.fppized.f90:975:0] _486 = __builtin_exp (dot2_958);
  [shell2.fppized.f90:975:0] _508 = REALPART_EXPR <sincostmp_3746>;
  _1815 = _486 * fac1$real_1370;
  [shell2.fppized.f90:975:0] _518 = IMAGPART_EXPR <sincostmp_3746>;
  [shell2.fppized.f90:975:0] _178 = _508 * _1815;
  [shell2.fppized.f90:975:0] _201 = _518 * _1815;
  [shell2.fppized.f90:975:0] _967 = COMPLEX_EXPR <_178, _201>;
  [shell2.fppized.f90:975:0] _968 = ((_967));
  _3980 = REALPART_EXPR <_968>;
  [shell2.fppized.f90:975:0] _177 = _964 * _3980;
  [shell2.fppized.f90:975:0] vect_cst__10554 = {_177, _395};
  [shell2.fppized.f90:975:0] vect__455.6453_5427 = vect_cst__10554 + vect__1371.6452_10556;
  [shell2.fppized.f90:975:0] MEM[base: _9159, offset: 0B] = vect__455.6453_5427;

which might be enough to trigger later RTL opt confusion. I can only guess at
something CSEing the scalar load with the vector load and getting lane ordering
(endianness) wrong.

Maybe you can extract a small testcase from the above info that reproduces the
difference.
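
To make the suspected failure mode concrete, here is a rough C sketch. It is not derived from the RTL and is not the requested testcase. The names cplx, v2df and all helper functions are made up for illustration. It assumes the vector load at offset 0B covers both the real and the imaginary double. The point is only that, if some pass replaces the scalar load at offset 8B with a lane extract from the vector load, it has to pick lane 1 in memory order. Picking the other lane would silently yield the real part.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one complex(kind=8) element: real part at
   byte offset 0, imaginary part at byte offset 8.  */
typedef struct { double re, im; } cplx;

/* Two-lane vector modelled as a plain array; lane[i] is the i-th double
   in memory order, independent of the target's byte order.  */
typedef struct { double lane[2]; } v2df;

/* Models: vect = MEM[base: b, offset: 0B]  */
static v2df vec_load(const cplx *base)
{
    v2df v;
    memcpy(&v, base, sizeof v);
    return v;
}

/* Models: scalar = MEM[base: b, offset: byte_off]  */
static double scalar_load(const cplx *base, size_t byte_off)
{
    double d;
    assert(byte_off + sizeof d <= sizeof *base);
    memcpy(&d, (const char *)base + byte_off, sizeof d);
    return d;
}

/* A correct replacement of the scalar load at offset 8 by a lane extract
   must pick memory-order lane 1.  Returns 1 if that holds.  */
int extract_lane1_matches(cplx c)
{
    return vec_load(&c).lane[1] == scalar_load(&c, 8);
}

/* The guessed failure mode: picking lane 0 yields the real part instead.
   Returns 1 if lane 0 differs from the scalar load at offset 8.  */
int extract_lane0_differs(cplx c)
{
    return vec_load(&c).lane[0] != scalar_load(&c, 8);
}
```

Calling both functions on any element whose real and imaginary parts differ should return 1 on either endianness, since the model only uses memory order. A reduced testcase would want the same property to hold for the generated code on ppc64le.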