https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123272
--- Comment #2 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 63131
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63131&action=edit
Archiv.tar.gz

Attached is a tar.gz file with a library and a cmakelists.txt. By default, the cmakelists.txt compiles with clang and generates various programs. Apart from an incompatibility between clang and a CUDA-aware message passing interface in the program arraytest-mpi, the programs run with the CUDA sanitizer reporting zero errors across many CUDA kernels.

If one comments out the line for the clang compilation in the cmakelists.txt file and instead activates the line for the gcc compilation, the library compiles with gcc-16 without warnings. But the program sparsetests, when compiled with gcc-16, crashes in a rather simple loop with:

libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed

The crash occurs in the function build_blocks_rank2 at line 292 of the file datablockcontainer.h, which contains an atomic capture construct and where the memory was allocated to the correct bounds with omp_target_alloc.

Similarly, the program mathdemonstrations, when compiled with gcc-16, crashes with:

libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed

I think this happens in the function lu_decomposition_g, at line 868.

Tobias tried to run the program sparsetests on his card (I think it was an AMD) and could not reproduce the problem; the program finished correctly. The loops where the failure occurs are indeed rather simple.
If gcc miscompiled such loops in general on nvptx, this would have been noticed long ago. The CUDA sanitizer, when the tar.gz attachment is compiled with clang, shows zero errors for the memcheck tool and the other tools.

Why do I post this here, in a bug about a miscompiled matrix multiplication? Because the code in the tar.gz archive of this comment uses classes and structs whose member fields have template types, similar to the small reproducer main.cpp where the matrix multiplication fails. Unfortunately, unlike with the main.cpp attachment for the matrix multiplication, I am not able to remove the error by switching to -O1 when compiling the code in Archiv.tar.gz.

The loops in the programs of Archiv.tar.gz are rather simple, but they run over template-typed arrays that are member fields of a class. This may suggest that the memory problems that occur when the sources in Archiv.tar.gz are compiled with gcc-16 are similar to the memory problems behind the random results in the matrix multiplication of the smaller reproducer main.cpp above, which also occurred in loops over member fields of templated type.

Tobias could reproduce neither the matrix multiplication miscompilation nor the memory problem of sparsetests in Archiv.tar.gz on his AMD card. This suggests that there may be memory problems specific to nvptx when it comes to mapping templated types. Perhaps a fix for the matrix multiplication problem in the main.cpp of the first post will then also help to resolve the problems of the programs in the attached Archiv.tar.gz.
