https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280
Bug ID: 122280
Summary: target teams distribute parallel for collapse(2)
yields different results in a matmul than separate
loops (one with omp target teams distribute the
second with omp parallel for) on nvptx target. Clang
compiles the code correctly
Product: gcc
Version: 15.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: schulz.benjamin at googlemail dot com
CC: jakub at gcc dot gnu.org
Target Milestone: ---
Created attachment 62555
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62555&action=edit
archive.tar.gz
Hi there,
the following code for a matrix multiplication, at line 395 of gpu_mathfunctions.h,
#pragma omp target teams distribute parallel for collapse(2) shared(A,B,C) device(dev)
for (size_t i = 0; i < rows; ++i)
    for (size_t j = 0; j < cols; ++j)
    {
        T sum = 0;
        #pragma omp simd reduction(+:sum)
        for (size_t k = 0; k < inner_dim; ++k)
        {
            sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
        }
        C.dpdata[i*Cstr0+j*Cstr1] = sum;
    }
should, for the data that is passed to the function at line 156 of mathdemonstrations.cpp, yield
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689],
[525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637],
[575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689],
[491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637],
[557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663],
[509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663],
[500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754],
[587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641],
[485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697],
[561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617],
[549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677],
[572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
and it does so with clang.
With gcc 15.2, I get:
[[529, 529, 422, 422, 648, 648, 438, 438, 640, 1041, 689, 689],
[550, 550, 488, 488, 511, 1059, 470, 929, 431, 431, 637, 637],
[564, 564, 415, 415, 607, 607, 382, 382, 669, 1068, 689, 689],
[515, 515, 495, 495, 589, 589, 515, 515, 433, 433, 637, 637],
[508, 508, 395, 395, 631, 631, 456, 456, 449, 449, 663, 663],
[571, 571, 515, 515, 565, 565, 487, 928, 383, 383, 663, 663],
[500, 500, 476, 476, 551, 551, 517, 517, 382, 382, 412, 412],
[537, 537, 475, 475, 609, 609, 401, 401, 573, 573, 641, 641],
[473, 473, 915, 466, 516, 648, 414, 414, 457, 457, 697, 697],
[566, 566, 448, 448, 551, 551, 387, 387, 586, 586, 408, 1025],
[549, 549, 484, 484, 640, 640, 405, 405, 598, 598, 677, 677],
[613, 613, 507, 507, 570, 570, 491, 491, 537, 537, 359, 359]]
Obviously something different: most values are duplicated across adjacent columns. Note, for example, the first row, where the 422 from the fourth column also appears in the third column in place of the expected 457.
Writing:
#pragma omp target teams distribute shared(A,B,C) device(dev)
for (size_t i = 0; i < rows; ++i)
    #pragma omp parallel for shared(A,B,C)
    for (size_t j = 0; j < cols; ++j)
    {
        T sum = 0;
        #pragma omp simd reduction(+:sum)
        for (size_t k = 0; k < inner_dim; ++k)
        {
            sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
        }
        C.dpdata[i*Cstr0+j*Cstr1] = sum;
    }
I get from gcc 15.2:
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689],
[525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637],
[575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689],
[491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637],
[557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663],
[509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663],
[500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754],
[587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641],
[485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697],
[561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617],
[549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677],
[572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
This looks correct. But the collapse(2) version should yield the same result, since A, B and C are shared and each element of C is written exactly once. Note that the collapsed loop works on the host with a plain parallel for collapse(2). It should also work with target teams distribute parallel for collapse(2), and it does with Clang.
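For reference, the loop structure boils down to something like the following self-contained sketch; plain row-major double buffers and explicit map clauses stand in here for the strided dpdata members and the shared/device clauses used in the actual code, but it exercises the same two directive combinations:

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const size_t rows = 12, cols = 12, inner_dim = 12;
    std::vector<double> A(rows * inner_dim), B(inner_dim * cols);
    std::vector<double> C1(rows * cols), C2(rows * cols), Cref(rows * cols);

    // Deterministic, nontrivial input data.
    for (size_t i = 0; i < A.size(); ++i) A[i] = double(i % 7);
    for (size_t i = 0; i < B.size(); ++i) B[i] = double(i % 5);

    double *pA = A.data(), *pB = B.data(), *pC1 = C1.data(), *pC2 = C2.data();

    // Variant 1: collapsed loop nest.
    #pragma omp target teams distribute parallel for collapse(2) map(to: pA[0:rows*inner_dim], pB[0:inner_dim*cols]) map(from: pC1[0:rows*cols])
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
                sum += pA[i*inner_dim + k] * pB[k*cols + j];
            pC1[i*cols + j] = sum;
        }

    // Variant 2: split directives.
    #pragma omp target teams distribute map(to: pA[0:rows*inner_dim], pB[0:inner_dim*cols]) map(from: pC2[0:rows*cols])
    for (size_t i = 0; i < rows; ++i)
        #pragma omp parallel for
        for (size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
                sum += pA[i*inner_dim + k] * pB[k*cols + j];
            pC2[i*cols + j] = sum;
        }

    // Host reference.
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            for (size_t k = 0; k < inner_dim; ++k)
                sum += A[i*inner_dim + k] * B[k*cols + j];
            Cref[i*cols + j] = sum;
        }

    int bad1 = 0, bad2 = 0;
    for (size_t i = 0; i < rows * cols; ++i)
    {
        if (C1[i] != Cref[i]) ++bad1;
        if (C2[i] != Cref[i]) ++bad2;
    }
    std::printf("collapse(2) variant: %d mismatches, split variant: %d mismatches\n", bad1, bad2);
    return (bad1 || bad2) ? 1 : 0;
}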
This was compiled with gcc 15.2 and
-fopenmp -foffload=nvptx-none -fno-stack-protector -Wall
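i.e. roughly (exact source and output file names aside):

g++ -fopenmp -foffload=nvptx-none -fno-stack-protector -Wall mathdemonstrations.cpp -o mathdemonstrations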
linux kernel linux-6.12.52-gentoo-dist
dev-util/nvidia-cuda-toolkit-12.9.1-r1:0/12.9.1::gentoo
x11-drivers/nvidia-drivers-580.95.05
nvidia-smi (Tue Oct 14 13:50:25 2025), abridged:
NVIDIA-SMI 580.95.05    Driver Version: 580.95.05    CUDA Version: 13.0
GPU 0: NVIDIA GeForce RTX 5060 Ti (00000000:2D:00.0)
This is, by the way, part of a test suite. The program calling the matmul is
mathdemonstrations.cpp.
This program calls several functions. It later triggers errors like
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal
memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
Note, however, that Clang compiles the entire application just fine.
I also get this strange error in another test application, where I can verify that my code performs no out-of-bounds memory accesses. So I don't know why the program prints this message at the end, but I will open another bug for that.
I want to note that I have observed especially this last problem only recently. Perhaps it is because of the new NVIDIA drivers? I think there are reports of wrong matmuls with CUDA 13, but I deliberately installed an older nvidia-cuda-toolkit. My card is an RTX 5060 Ti. I temporarily tried older drivers, but that did not change anything, and I do not know what portage changed recently that could have caused this.
Clang compiles the application correctly, however.