https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280

            Bug ID: 122280
           Summary: target teams distribute parallel for collapse(2)
                    yields different results in a matmul than separate
                    loops (one with omp target teams distribute  the
                    second with omp parallel for) on nvptx target. Clang
                    compiles the code correctly
           Product: gcc
           Version: 15.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schulz.benjamin at googlemail dot com
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 62555
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62555&action=edit
archive.tar.gz

Hi there,

the following code for a matrix multiplication, at line 395 of gpu_mathfunctions.h:

 #pragma omp target teams distribute parallel for collapse(2) shared(A,B,C) device(dev)
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1]= sum;
        }


should yield, for the data passed to the function at line 156 of
mathdemonstrations.cpp, the result

[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], 
 [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], 
 [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], 
 [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], 
 [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], 
 [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], 
 [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], 
 [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], 
 [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], 
 [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], 
 [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], 
 [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]

and it does so with clang.

With gcc 15.2, I get:


[[529, 529, 422, 422, 648, 648, 438, 438, 640, 1041, 689, 689], 
 [550, 550, 488, 488, 511, 1059, 470, 929, 431, 431, 637, 637], 
 [564, 564, 415, 415, 607, 607, 382, 382, 669, 1068, 689, 689], 
 [515, 515, 495, 495, 589, 589, 515, 515, 433, 433, 637, 637], 
 [508, 508, 395, 395, 631, 631, 456, 456, 449, 449, 663, 663], 
 [571, 571, 515, 515, 565, 565, 487, 928, 383, 383, 663, 663], 
 [500, 500, 476, 476, 551, 551, 517, 517, 382, 382, 412, 412], 
 [537, 537, 475, 475, 609, 609, 401, 401, 573, 573, 641, 641], 
 [473, 473, 915, 466, 516, 648, 414, 414, 457, 457, 697, 697], 
 [566, 566, 448, 448, 551, 551, 387, 387, 586, 586, 408, 1025], 
 [549, 549, 484, 484, 640, 640, 405, 405, 598, 598, 677, 677], 
 [613, 613, 507, 507, 570, 570, 491, 491, 537, 537, 359, 359]]


Obviously, something different. Note, for example, the 422 that belongs in the
first row, fourth column: gcc's output has it in both the third and fourth
columns, and many other entries are likewise duplicated pairwise.

Writing:

   #pragma omp target teams distribute shared(A,B,C) device(dev)
    for (size_t i = 0; i < rows; ++i)
        #pragma omp parallel for shared(A,B,C)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1]= sum;
        }

I get from gcc 15.2:

[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], 
 [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], 
 [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], 
 [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], 
 [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], 
 [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], 
 [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], 
 [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], 
 [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], 
 [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], 
 [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], 
 [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]


Which looks correct. But the collapse(2) version should yield the same result,
since A, B and C are shared and each element of C is written exactly once. Note
that the collapsed loop works on the host with a plain parallel for
collapse(2). It should also work with teams distribute parallel for
collapse(2), and does so with clang.

This was compiled with gcc 15.2 and 
-fopenmp -foffload=nvptx-none -fno-stack-protector -Wall

linux kernel linux-6.12.52-gentoo-dist

dev-util/nvidia-cuda-toolkit-12.9.1-r1:0/12.9.1::gentoo

x11-drivers/nvidia-drivers-580.95.05


nvidia-smi
Tue Oct 14 13:50:25 2025
NVIDIA-SMI 580.95.05     Driver Version: 580.95.05     CUDA Version: 13.0
  0  NVIDIA GeForce RTX 5060 Ti     Off   00000000:2D:00.0  On   N/A


This is, by the way, part of a test suite. The program calling the matmul is
mathdemonstrations.cpp.

This program calls several functions. It later triggers errors like


libgomp: cuCtxSynchronize error: an illegal memory access was encountered

libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal memory access was encountered

libgomp: cuMemFree_v2 error: an illegal memory access was encountered

libgomp: device finalization failed


Note, however, that clang compiles and runs the entire application just fine.

I also get this strange error in another test application, where I can verify
that my code makes no out-of-bounds memory access, so I don't know why the
program prints this message at the end. I will open a separate bug for that.

I want to note that I have only observed these problems, especially the last
one, recently.



Perhaps it is because of the new nvidia drivers? I believe there are reports of
wrong matmuls with cuda 13, but I deliberately installed an older
nvidia-cuda-toolkit. My card is an rtx 5060. I temporarily tried older drivers,
but that did not change anything. Since I only observe this recently, I do not
know what portage changed that could have caused it.

Clang, however, compiles the application correctly.
