https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82471
Dominique d'Humieres <dominiq at lps dot ens.fr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2017-10-08
     Ever confirmed|0                           |1

--- Comment #3 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
> I know that
> DO CONCURRENT( I=1:L, J=1:M, K=1:N)
> is the fastest

 DO CONCURRENT :   8.19799995
 DO CONCURRENT :  0.284000009
 ORDINARY DO   :  0.116000004
 ARRAY DO      :  0.118000008

Note that the "right" ordered DO CONCURRENT is more than two times slower than
the ORDINARY DO or the ARRAY DO (C=A+B).

> but I expected that do-concurrent work like ordinary-do by varying
> the last index in nested loops.

Why? I would rather expect the left-most index to drive the innermost loop, the
second index the loop enclosing it, and so on (latin ordering).

Note also that the optimization you are expecting should be done in the
middle-end. Unfortunately "loop flattening" (the best optimization here) is not
done by the gcc middle-end (pr82450).

More complex optimizations, such as the one needed for the matrix
multiplication in
https://groups.google.com/forum/#!topic/comp.lang.fortran/jljio5HfSQc,
require exchanging the loop order, which is not (well) handled by the gcc
middle-end (pr61000).

Also, some loop interchanges require a cost model:

  do j = 1, N
    do k = 1, L
      do i = 1, M
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do
    end do
  end do

is faster on modern CPUs than

  do j = 1, N
    do i = 1, M
      do k = 1, L
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do
    end do
  end do

but reaching this conclusion a priori requires detailed knowledge of the CPU
and the memory system (while the exchange of i and j can never be profitable).
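
For anyone who wants to time the variants themselves, here is a minimal sketch of the loops discussed in this comment. The module name and the procedure names are hypothetical and do not come from the report. It also assumes that the first matrix-multiplication nest is meant to run in j, k, i order, so that the left-most index i of c and a is the innermost loop. All three procedures take assumed-shape arrays, so the bounds L, M and N come from the actual arguments. Nothing here claims which version is faster on a given machine; that is exactly the cost-model question raised above.

```fortran
! Sketch only: all names below are hypothetical, not taken from the report.
module loop_order_sketch
  implicit none
contains

  ! c = c + a*b with i innermost (assumed j, k, i order).
  ! a is M x L, b is L x N, c is M x N; the inner loop walks down a column.
  subroutine matmul_jki(a, b, c)
    real, intent(in)    :: a(:,:), b(:,:)
    real, intent(inout) :: c(:,:)
    integer :: i, j, k
    do j = 1, size(c, 2)
      do k = 1, size(a, 2)
        do i = 1, size(c, 1)
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
      end do
    end do
  end subroutine matmul_jki

  ! Same update with k innermost (j, i, k order); the inner loop now
  ! strides across a row of a.
  subroutine matmul_jik(a, b, c)
    real, intent(in)    :: a(:,:), b(:,:)
    real, intent(inout) :: c(:,:)
    integer :: i, j, k
    do j = 1, size(c, 2)
      do i = 1, size(c, 1)
        do k = 1, size(a, 2)
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
      end do
    end do
  end subroutine matmul_jik

  ! Element-wise sum written with the DO CONCURRENT index list quoted
  ! in the report (left-most array index listed first).
  subroutine add_concurrent(a, b, c)
    real, intent(in)  :: a(:,:,:), b(:,:,:)
    real, intent(out) :: c(:,:,:)
    integer :: i, j, k
    do concurrent (i = 1:size(c, 1), j = 1:size(c, 2), k = 1:size(c, 3))
      c(i,j,k) = a(i,j,k) + b(i,j,k)
    end do
  end subroutine add_concurrent

end module loop_order_sketch
```

A driver program would `use loop_order_sketch`, zero c, call each routine on the same a and b, and wrap the calls in `cpu_time` or `system_clock` to compare timings. Both multiplication routines should agree with the intrinsic `matmul(a, b)` up to rounding.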