https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280
Bug ID: 122280
Summary: target teams distribute parallel for collapse(2)
yields different results in a matmul than separate
loops (one with omp target teams distribute the
second with omp parallel for) on nvptx target. Clang
compiles the code correctly
Product: gcc
Version: 15.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: schulz.benjamin at googlemail dot com
CC: jakub at gcc dot gnu.org
Target Milestone: ---
Created attachment 62555
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62555&action=edit
archive.tar.gz
Hi there,
the following code for a matrix multiplication, at line 395 of gpu_mathfunctions.h,
#pragma omp target teams distribute parallel for collapse(2) shared(A,B,C) device(dev)
for (size_t i = 0; i < rows; ++i)
    for (size_t j = 0; j < cols; ++j)
    {
        T sum = 0;
        #pragma omp simd reduction(+:sum)
        for (size_t k = 0; k < inner_dim; ++k)
        {
            sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
        }
        C.dpdata[i*Cstr0+j*Cstr1] = sum;
    }
should, for the data that is passed to the function at line 156 of mathdemonstrations.cpp, yield
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689],
[525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637],
[575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689],
[491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637],
[557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663],
[509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663],
[500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754],
[587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641],
[485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697],
[561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617],
[549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677],
[572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
and it does so with clang.
With gcc 15.2, I get:
[[529, 529, 422, 422, 648, 648, 438, 438, 640, 1041, 689, 689],
[550, 550, 488, 488, 511, 1059, 470, 929, 431, 431, 637, 637],
[564, 564, 415, 415, 607, 607, 382, 382, 669, 1068, 689, 689],
[515, 515, 495, 495, 589, 589, 515, 515, 433, 433, 637, 637],
[508, 508, 395, 395, 631, 631, 456, 456, 449, 449, 663, 663],
[571, 571, 515, 515, 565, 565, 487, 928, 383, 383, 663, 663],
[500, 500, 476, 476, 551, 551, 517, 517, 382, 382, 412, 412],
[537, 537, 475, 475, 609, 609, 401, 401, 573, 573, 641, 641],
[473, 473, 915, 466, 516, 648, 414, 414, 457, 457, 697, 697],
[566, 566, 448, 448, 551, 551, 387, 387, 586, 586, 408, 1025],
[549, 549, 484, 484, 640, 640, 405, 405, 598, 598, 677, 677],
[613, 613, 507, 507, 570, 570, 491, 491, 537, 537, 359, 359]]
Obviously something different: most values are duplicated across adjacent columns. Note, for example, the first row, where the 422 from the fourth column also appears in the third column in place of the expected 457.
Writing:
#pragma omp target teams distribute shared(A,B,C) device(dev)
for (size_t i = 0; i < rows; ++i)
    #pragma omp parallel for shared(A,B,C)
    for (size_t j = 0; j < cols; ++j)
    {
        T sum = 0;
        #pragma omp simd reduction(+:sum)
        for (size_t k = 0; k < inner_dim; ++k)
        {
            sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
        }
        C.dpdata[i*Cstr0+j*Cstr1] = sum;
    }
I get from gcc 15.2:
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689],
[525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637],
[575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689],
[491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637],
[557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663],
[509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663],
[500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754],
[587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641],
[485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697],
[561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617],
[549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677],
[572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
This looks correct. But the collapse(2) version should yield the same result, since A, B and C are shared and each element of C is written exactly once. Note that the collapsed loop works on the host with a plain parallel for collapse(2). It should also work with target teams distribute parallel for collapse(2), and it does with Clang.
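For reference, the loop structure boils down to something like the following self-contained sketch; plain row-major double buffers and explicit map clauses stand in here for the strided dpdata members and the shared/device clauses used in the actual code, but it exercises the same two directive combinations:

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const size_t rows = 12, cols = 12, inner_dim = 12;
    std::vector<double> A(rows * inner_dim), B(inner_dim * cols);
    std::vector<double> C1(rows * cols), C2(rows * cols), Cref(rows * cols);

    // Deterministic, nontrivial input data.
    for (size_t i = 0; i < A.size(); ++i) A[i] = double(i % 7);
    for (size_t i = 0; i < B.size(); ++i) B[i] = double(i % 5);

    double *pA = A.data(), *pB = B.data(), *pC1 = C1.data(), *pC2 = C2.data();

    // Variant 1: collapsed loop nest.
    #pragma omp target teams distribute parallel for collapse(2) map(to: pA[0:rows*inner_dim], pB[0:inner_dim*cols]) map(from: pC1[0:rows*cols])
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
                sum += pA[i*inner_dim + k] * pB[k*cols + j];
            pC1[i*cols + j] = sum;
        }

    // Variant 2: split directives.
    #pragma omp target teams distribute map(to: pA[0:rows*inner_dim], pB[0:inner_dim*cols]) map(from: pC2[0:rows*cols])
    for (size_t i = 0; i < rows; ++i)
        #pragma omp parallel for
        for (size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
                sum += pA[i*inner_dim + k] * pB[k*cols + j];
            pC2[i*cols + j] = sum;
        }

    // Host reference.
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            for (size_t k = 0; k < inner_dim; ++k)
                sum += A[i*inner_dim + k] * B[k*cols + j];
            Cref[i*cols + j] = sum;
        }

    int bad1 = 0, bad2 = 0;
    for (size_t i = 0; i < rows * cols; ++i)
    {
        if (C1[i] != Cref[i]) ++bad1;
        if (C2[i] != Cref[i]) ++bad2;
    }
    std::printf("collapse(2) variant: %d mismatches, split variant: %d mismatches\n", bad1, bad2);
    return (bad1 || bad2) ? 1 : 0;
}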
This was compiled with gcc 15.2 and
-fopenmp -foffload=nvptx-none -fno-stack-protector -Wall
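i.e. roughly (exact source and output file names aside):

g++ -fopenmp -foffload=nvptx-none -fno-stack-protector -Wall mathdemonstrations.cpp -o mathdemonstrations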
linux kernel linux-6.12.52-gentoo-dist
dev-util/nvidia-cuda-toolkit-12.9.1-r1:0/12.9.1::gentoo
x11-drivers/nvidia-drivers-580.95.05
nvidia-smi (Tue Oct 14 13:50:25 2025), abridged:
NVIDIA-SMI 580.95.05    Driver Version: 580.95.05    CUDA Version: 13.0
GPU 0: NVIDIA GeForce RTX 5060 Ti (00000000:2D:00.0)
This is, by the way, part of a test suite. The program calling the matmul is
mathdemonstrations.cpp.
This program calls several functions. It later triggers errors like
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal
memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
Note, however, that Clang compiles the entire application just fine.
I also get this strange error in another test application, where I can verify that my code performs no out-of-bounds memory accesses. So I don't know why the program prints this message at the end, but I will open another bug for that.
I want to note that I have observed especially this last problem only recently. Perhaps it is because of the new NVIDIA drivers? I think there are reports of wrong matmuls with CUDA 13, but I deliberately installed an older nvidia-cuda-toolkit. My card is an RTX 5060 Ti. I temporarily tried older drivers, but that did not change anything, and I do not know what portage changed recently that could have caused this.
Clang compiles the application correctly, however.