https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280

Benjamin Schulz <schulz.benjamin at googlemail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #62693|0                           |1
        is obsolete|                            |

--- Comment #13 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 62704
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62704&action=edit
main.cpp

Hi Tobias, thank you for your effort.

First, perhaps it is this: when I try to use
 -foffload-options=nvptx-none=-march=sm_120

then gcc 16 (I build it with crossdev on Gentoo) complains with:


x86_64-pc-linux-gnu-accel-nvptx-none-gcc: error: unrecognized
command-line argument in option '-march=sm_120'

note: valid arguments to '-misa=' are: sm_30 sm_35 sm_37 sm_52 sm_53
sm_61 sm_70 sm_75 sm_80 sm_89;


Well, I don't know why my gcc 16 does not support sm_120 for my card.
But at least before this change:

https://github.com/llvm/llvm-project/pull/159354

"Summary: Turns out the new CUDA ABI now applies retroactively to all the other
SMs if you upgrade to CUDA 13.0. This patch changes the scheme, keeping
all the SM flags consistent but using an offset."


the SMs were backwards compatible, but according to this LLVM notice
that changes with CUDA 13, and I guess I would definitely have to use
sm_120 for my card.


Perhaps it's just that. For this, Clang now has two CUDA ABI versions defined.


Anyway, I have now changed the test code a bit so that it runs the
multiplications in a loop from 1 to 200, prints out the values if they
don't agree, and then stops.

I have also simplified the OpenMP code a bit and attached the file to
the bug.



What I note is that clang++ with the options

-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall

always gets the result right on my system (at least for the 1 to 200
tries).


Only gcc with the options:

-fopenmp -foffload=nvptx-none -foffload-options=nvptx-none=-march=sm_89
-fno-stack-protector

does not: it usually fails to reproduce the single-threaded
multiplication results with a target teams distribute collapse(2)
multiplication, quite early in the loop.

The main.cpp will then print the matrices A and B and the different results for
the multiplication.


In my view, the fact that clang gets it right argues against a problem
with the GPU and points more to one with the architecture target sm_120
and the CUDA 13 of the recent nvidia driver, and perhaps an ABI change
or something similar.


I will also test your patch for the libgomp memory problem, once the
patched gcc version appears in Gentoo's sync tree.

They have one published for gcc 16 from git, dated 02.11, but it takes a
day to arrive on my machine.

I don't know if your patch is already in this, if not, it will soon appear in
the next version, perhaps next week or so.

I don't want to emerge the 9999 live version; I'd rather give sam a few
days until he releases it into Gentoo.


I will also test whether all the other example programs of my library from 

https://github.com/bschulz81/AcceleratedLinearAlgebra/tree/main 

compile again after your patch.

Maybe it was just this problem with the atomic that made the entire
DataBlock class invalid. Let's see...

If it compiles, I want to add more tensor support and Message Passing
Interface support, so that the library is pleasant to use on clusters
for distributed computing of relativistic problems... then I want to add
mathematical things...

But of course it would also be good if I could compile for sm_120 and
get correct numbers with gcc...


Meanwhile, by the way,

I think I found a problem in the mapper with OpenACC, which may be easy
to patch, but perhaps it was already fixed in gcc 16.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121178#c0
