https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280
--- Comment #18 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
As for the valgrind output of clang: valgrind of course can't access the GPU memory, so one would expect a bunch of pointers where it does not know whether the memory was freed. However, it also sees a leak in an ordinary glibc malloc connected to libcuda. Since the example program only creates an STL vector, which releases itself via its destructor, the host-side malloc that valgrind reports as not freed must be something from CUDA. (Valgrind can of course also have false positives here, but it marks memory it simply has no access to separately.)

For the gcc output, however, the messages that CUDA symbols cannot be found are of course devastating. Probably this is because the driver expects CUDA 13 code for sm_120, while gcc compiles CUDA 12 code for sm_89, which may not be binary compatible even if NVIDIA claims it is? Or it may be a problem with recent kernels (6.17.8) that have updated memory-protection code, where CUDA may have been lazy in the past? Could it be that gcc by mistake picks up the sm-related files installed on the system for clang and then compiles wrong code? Probably not...

When compiling with gcc, LD_LIBRARY_PATH is set to
LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/REDIST/compilers/lib/lib:
When compiling with clang, I have to set it to
LD_LIBRARY_PATH=/usr/lib64/nvptx64-nvidia-cuda/
as otherwise clang complains that it can't find a necessary .bc file. With that setting, the clang output works correctly. The gcc-built binary, however, reports that it cannot find some CUDA symbols and produces nonsense output.
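
For reference, what I mean by the test case is essentially a minimal sketch like the one below (not the actual reproducer attached to this PR; the vector size and the OpenMP target reduction are only placeholders for illustration). The only heap allocation in user code is the std::vector buffer, and that is freed by the destructor, so any glibc malloc that valgrind still reports as lost on such a program has to come from the CUDA/offload runtime rather than from the program itself.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1024;
    // the only user-side heap allocation; freed automatically by ~vector()
    std::vector<double> v(n, 1.0);
    double *p = v.data();

    double sum = 0.0;
    // trivial offload region, just so the offload runtime / libcuda gets initialized
    #pragma omp target teams distribute parallel for reduction(+:sum) map(to: p[0:n])
    for (int i = 0; i < n; ++i)
        sum += p[i];

    std::printf("sum = %f\n", sum);  // expected: 1024.000000
    return 0;
}

(Built with something like g++ -O2 -fopenmp -foffload=nvptx-none, or clang++ -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda; the exact flags of the real reproducer are in the earlier comments.)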
