[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #10 from Jeff Hammond ---
Thanks for the feedback. I agree that it is a huge amount of work to optimize this. For what it's worth, GCC and Clang perform about the same. Unfortunately, I do not have the means to evaluate IBM XLF, which may have an optimized implementation of this (https://rd.springer.com/chapter/10.1007%2F978-3-642-32820-6_23), so I do not have a good sense of what is achievable here, other than what I hand-optimize. I have no objection if you want to close this as invalid.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #9 from Jakub Jelinek ---
With schedule(static) or schedule(dynamic) etc., I believe the compiler is not allowed to do it, at least when it can't prove the difference wouldn't be observable. So, if you have

  int cnt = 0;
  #pragma omp parallel for schedule(static) ordered(2) collapse(2) \
      firstprivate(cnt)
  for (j = ...)
    for (i = ...)
      {
        #pragma omp ordered depend(sink: j, i-1) depend(sink: j-1, i) \
                            depend(sink: j-1, i-1)
        arr[i][j] = omp_get_thread_num ();
        arr2[i][j] = cnt++;
        grid[...] = ...;
        #pragma omp ordered depend(source)
      }

then you really can't schedule arbitrarily: the standard specifies the requirements, and the above code observes both the thread num each logical iteration is assigned to and the order in which the iterations run. The only cases in which you have complete freedom are no schedule clause (then what happens is implementation defined) or schedule(auto) (likewise). But it still requires advanced loop optimizations to find an optimal schedule for the particular loop (other loops would need different scheduling decisions), and the schedule would also need to take into account the constraints of the doacross post/wait operations (it has to ensure threads don't get stuck waiting for something that hasn't been scheduled yet). Plus, unless we want complex scheduling functions in the library, the compiler has to map the logical iterations to the logical iterations of the parallelized loop and similarly map the indices. It is doable, but a huge amount of work.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #8 from Jeff Hammond ---
I tried with schedule(dynamic) and schedule(static,n) for n=1,8,100. None of this made a positive difference. Is that expected? I'm happy to make any code changes that don't break correctness and are still readable. I don't fully understand the semantics of ordered, but is it illegal for the compiler to generate blocked code that still orders the iteration space? I achieve that by implementing it explicitly with tasks. The following code performs very well, at least with appropriate tiling (mc,nc).

  _Pragma("omp master")
  {
    for (auto i=1; i=1; j--) {   /* [first half of this code lost in the original] */
      _Pragma("omp for")
      for (auto i=1; i<=j; i++) {
        auto x = n+i-j-1;
        auto y = n-i;
        grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)]
                    - grid[(x-1)*n+(y-1)];
      }
    }
  _Pragma("omp master")
  grid[0*n+0] = -grid[(n-1)*n+(n-1)];
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #7 from Jakub Jelinek ---
GCC implements what is required when there is schedule(static), which is the implementation-defined schedule right now. That schedule fixes how the iterations are distributed to the different threads, and I don't see how you could get good performance with that distribution (if you have ideas, feel free to explain them here). In order to perform well on this testcase (which doesn't look very suitable for doacross, because the computation is inexpensive and so the needed synchronization dominates the execution time), we'd have to use a different schedule, specific to this exact loop (proceed diagonally from 2,2 to n,m or something like that).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108 --- Comment #6 from Jeff Hammond --- Created attachment 41566 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41566&action=edit tasks C++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108 --- Comment #5 from Jeff Hammond --- Created attachment 41565 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41565&action=edit header for C++ codes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108 --- Comment #4 from Jeff Hammond --- Created attachment 41564 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41564&action=edit doacross C++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108 --- Comment #3 from Jeff Hammond --- Created attachment 41563 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41563&action=edit sequential C++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #2 from Jeff Hammond ---
Created attachment 41562 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41562&action=edit tasks Fortran

This code runs faster than serial (assuming blocking is used), showing that this pattern can be parallelized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108 --- Comment #1 from Jeff Hammond --- Created attachment 41561 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41561&action=edit doacross Fortran