https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123272
--- Comment #1 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 63130
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63130&action=edit
gpu_compiler_test_xnvptx-none.ii

Here is the .ii file produced by -save-temps for the miscompiled matrix
multiplication.

Interestingly, if one uses #pragma omp target teams distribute for the first
loop and #pragma omp parallel for for the second, the results are correct.
However, the collapse(2) clause is valid for the first two loops of the matrix
multiplication, and it is needed for performance when the matrices are not
square.

The collapse clause works on the host, and it also works with gcc on nvptx
when using -O1. The wrong results are observed only when no optimization is
enabled, which is also strange, and only for the class with templates.
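
For readers without the attachment at hand, the two loop arrangements described above might look roughly like the sketch below. This is not the attached test case. The Matrix class, the function names matmul_collapse and matmul_split, the map clauses, and the exact combined directive carrying collapse(2) are all assumptions made for illustration, since the comment does not spell them out.

```cpp
// Hypothetical reduction, NOT the code from attachment 63130.
// Matrix, matmul_collapse and matmul_split are illustrative names.
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
struct Matrix {
    std::size_t rows, cols;
    std::vector<T> data;  // row-major storage

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, T{}) {}
    T& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const T& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Variant 1: the two outer loops collapsed into one iteration space.
// Assumption: the directive is written as a single combined construct.
template <typename T>
void matmul_collapse(const Matrix<T>& A, const Matrix<T>& B, Matrix<T>& C) {
    assert(A.cols == B.rows && C.rows == A.rows && C.cols == B.cols);
    const std::size_t n = A.rows, m = B.cols, k = A.cols;
    const T* a = A.data.data();
    const T* b = B.data.data();
    T* c = C.data.data();

    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:n * k], b[0:k * m]) map(from: c[0:n * m])
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            T sum = T{};
            for (std::size_t l = 0; l < k; ++l)
                sum += a[i * k + l] * b[l * m + j];
            c[i * m + j] = sum;
        }
    }
}

// Variant 2: the split form that the comment says gives correct results.
template <typename T>
void matmul_split(const Matrix<T>& A, const Matrix<T>& B, Matrix<T>& C) {
    assert(A.cols == B.rows && C.rows == A.rows && C.cols == B.cols);
    const std::size_t n = A.rows, m = B.cols, k = A.cols;
    const T* a = A.data.data();
    const T* b = B.data.data();
    T* c = C.data.data();

    #pragma omp target teams distribute \
        map(to: a[0:n * k], b[0:k * m]) map(from: c[0:n * m])
    for (std::size_t i = 0; i < n; ++i) {
        #pragma omp parallel for
        for (std::size_t j = 0; j < m; ++j) {
            T sum = T{};
            for (std::size_t l = 0; l < k; ++l)
                sum += a[i * k + l] * b[l * m + j];
            c[i * m + j] = sum;
        }
    }
}
```

Both functions should produce the same product, so comparing their outputs on a non-square input is one way to check for the discrepancy. Without -fopenmp the pragmas are ignored and both run serially on the host, which gives a reference result. Whether this reduced form actually triggers the wrong results on nvptx is untested here.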
