https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66740
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Can you still reproduce it? I don't see anything wrong in the dumps I've looked at. Without -Ofast, of course, the order of the floating point arithmetic is significantly different between -fopenmp and -fno-openmp, but that is to be expected; you've asked for that.

So, for the iterations that are vectorized, each SIMD lane has its own sum variable (with the given options the loop seems to be vectorized with vectorization factor 8, so there are 8 SIMD lanes and thus 8 separate sum vars). The vector version sums up into those (initialized with 0), any scalar iterations sum into the first SIMD lane's sum var, and finally at the end the 8 partial sums are summed together (one by one, rather than what the vectorizer normally does for -Ofast: reduce them by summing up 4 x 2 numbers, then 2 x 2 numbers, then 2 numbers).

If this is still a problem, can you cook up a small self-contained testcase out of it: a small function with just the #pragma omp simd loop in it, taking the args as parameters, ideally with the noinline/noclone attributes on it, and then a main that fills up an array with the problematic input values and checks what the function returned (sum)?