https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123272
--- Comment #2 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 63131
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63131&action=edit
Archiv.tar.gz

Attached is a tar.gz file with a library and a cmakelists.txt. By default, the cmakelists.txt compiles with clang and generates various programs. Apart from an incompatibility between clang and a CUDA-aware message passing interface in the program arraytest-mpi, the programs run with the CUDA sanitizer reporting zero errors across many CUDA kernels.

If one comments out the line for the clang compilation in the cmakelists.txt file and instead activates the line for the gcc compilation, the library compiles with gcc-16 without warnings. But the program sparsetests, when compiled with gcc-16, crashes in a rather simple loop with:

libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed

The crash occurs in the function build_blocks_rank2 at line 292 of the file datablockcontainer.h, which contains an atomic capture construct and where the memory was allocated to the correct bounds with omp_target_alloc.

Similarly, the program mathdemonstrations, when compiled with gcc-16, crashes with:

libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed

I think this happens in the function lu_decomposition_g, at line 868.

Tobias tried to run the program sparsetests on his card (I think it was an AMD) and could not reproduce the problem; the program finished correctly. The loops where the failure occurs are indeed rather simple.
If gcc miscompiled such loops in general on nvptx, this would have been noticed long ago. The CUDA sanitizer, when the tar.gz attachment is compiled with clang, shows zero errors for the memcheck tool and the other tools.

Why do I post this here, in a bug about a miscompiled matrix multiplication? Because the code in the tar.gz archive of this comment uses classes and structs whose member fields have template types, similar to the small reproducer main.cpp where the matrix multiplication fails. Unfortunately, unlike with the main.cpp attachment for the matrix multiplication, I am not able to remove the error by switching to -O1 when compiling the code in Archiv.tar.gz.

The loops in the programs of Archiv.tar.gz are rather simple, but they run over template-typed arrays that are member fields of a class. This may suggest that the memory problems that occur when the sources in Archiv.tar.gz are compiled with gcc-16 are similar to the memory problems behind the random results in the matrix multiplication of the smaller reproducer main.cpp above, which also occurred in loops over member fields of templated type.

Tobias could reproduce neither the matrix multiplication miscompilation nor the memory problem of sparsetests in Archiv.tar.gz on his AMD card. This suggests that there may be memory problems specific to nvptx when it comes to mapping templated types. Perhaps a fix for the matrix multiplication problem in the main.cpp of the first post will then also help to resolve the problems of the programs in the attached Archiv.tar.gz.
