https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122281

--- Comment #20 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 62844
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62844&action=edit
compute-sanitizer-log.txt

Hi there, for the sparsetests.cpp file I want to note that there is a dmesg
message concerning my GPU.


When I run it, I get:

[ 4841.758732] perf: interrupt took too long (2519 > 2500), lowering
kernel.perf_event_max_sample_rate to 79200

[21014.299014] NVRM: GPU at PCI:0000:2d:00:
GPU-959d8d20-bc99-9be9-c49f-adc2913c511c
[21014.299020] NVRM: GPU Board Serial Number: 0
[21014.299022] NVRM: Xid (PCI:0000:2d:00): 31, pid=25568, name=sparsetests,
channel 0x00000018, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC0
GPCCLIENT_T1_5 faulted @ 0x7fff_7be14000. Fault is of type FAULT_PDE
ACCESS_TYPE_VIRT_READ

When compiling with clang, I get no such message and the application runs fine.

When I run mathdemonstrations.cpp, I get a similar error with gcc:

[21456.631660] NVRM: Xid (PCI:0000:2d:00): 31, pid=26607, name=mathdemonstrati,
channel 0x00000018, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC0
GPCCLIENT_T1_1 faulted @ 0x7ffd_91678000. Fault is of type FAULT_PDE
ACCESS_TYPE_VIRT_READ

With clang, I also do not get such a message and the application runs fine.

Running

compute-sanitizer --tool memcheck ./sparsetests

shows no errors with the clang-generated binary.

My CUDA toolkit, driver, and kernel versions are:
dev-util/nvidia-cuda-toolkit-12.9.1-r1:0/12.9.1::gentoo
x11-drivers/nvidia-drivers-580.95.05:0/580::gentoo 
sys-kernel/gentoo-kernel-6.17.8:6.17.8::gentoo  


If I run compute-sanitizer --tool memcheck on the gcc-generated ./sparsetests,
I get errors like this:

now an example with sparse matrx multiplication and the mdspan class
of course we offload the data first to device
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid
device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: GOMP_OFFLOAD_init_device in plugin-nvptx.c:1371
[0x5677] in libgomp-plugin-nvptx.so.1
=========         Host Frame: gomp_init_device in target.c:3000 [0x3fc05] in
libgomp.so.1
=========         Host Frame: resolve_device in target.c:190 [0x3ffd4] in
libgomp.so.1
=========         Host Frame: omp_target_alloc in target.c:4667 [0x44fc2] in
libgomp.so.1
=========         Host Frame:
DataBlock_GPU_Memory_Functions<double>::alloc_device_ptr(unsigned long, int)
[0xb173] in sparsetests
=========         Host Frame:
DataBlock_GPU_Memory_Functions<double>::copy_data_to_device_set_devptr(DataBlock<double>&,
int) [0x96a5] in sparsetests
=========         Host Frame: mdspan<double, std::vector<unsigned long,
std::allocator<unsigned long> > >::device_data_upload(bool, int) [0x8045] in
sparsetests
=========         Host Frame: main [0x465c] in sparsetests
========= 
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetGlobal_v2.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1800
[0x76d7] in libgomp-plugin-nvptx.so.1
=========         Host Frame: gomp_load_image_to_device in target.c:2608
[0x34100] in libgomp.so.1
=========         Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in
libgomp.so.1
=========         Host Frame: resolve_device in target.c:190 [0x3ffd4] in
libgomp.so.1
=========         Host Frame: omp_target_alloc in target.c:4667 [0x44fc2] in
libgomp.so.1
=========         Host Frame:
DataBlock_GPU_Memory_Functions<double>::alloc_device_ptr(unsigned long, int)
[0xb173] in sparsetests
=========         Host Frame:
DataBlock_GPU_Memory_Functions<double>::copy_data_to_device_set_devptr(DataBlock<double>&,
int) [0x96a5] in sparsetests
=========         Host Frame: mdspan<double, std::vector<unsigned long,
std::allocator<unsigned long> > >::device_data_upload(bool, int) [0x8045] in
sparsetests
=========         Host Frame: main [0x465c] in sparsetests
========= 
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetFunction.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: nvptx_do_global_cdtors in plugin-nvptx.c:1464
[0x3cdb] in libgomp-plugin-nvptx.so.1
=========         Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1802
[0x7739] in libgomp-plugin-nvptx.so.1
=========         Host Frame: gomp_load_image_to_device in target.c:2608
[0x34100] in libgomp.so.1
=========         Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in
libgomp.so.1
=========         Host Frame: resolve_device in target.c:190 [0x3ffd4] in
libgomp.so.1
=========         Host Frame: omp_target_alloc in target.c:4667 [0x44fc2] in
libgomp.so.1
=========         Host Frame:
DataBlock_GPU_Memory_Functions<double>::alloc_device_ptr(unsigned long, int)
[0xb173] in sparsetests
=========         Host Frame:
DataBlock_GPU_Memory_Functions<double>::copy_data_to_device_set_devptr(DataBlock<double>&,
int) [0x96a5] in sparsetests
=========         Host Frame: mdspan<double, std::vector<unsigned long,
std::allocator<unsigned long> > >::device_data_upload(bool, int) [0x8045] in
sparsetests
=========         Host Frame: main [0x465c] in sparsetests
========= 
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetFunction.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: nvptx_do_global_cdtors in plugin-nvptx.c:1482
[0x3ef6] in libgomp-plugin-nvptx.so.1
=========         Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1802
[0x7739] in libgomp-plugin-nvptx.so.1
=========         Host Frame: gomp_load_image_to_device in target.c:2608
[0x34100] in libgomp.so.1
=========         Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in
libgomp.so.1
=========         Host Frame: resolve_device in target.c:190 [0x3ffd4] in
libgomp.so.1
=========         Host Frame: omp_target_alloc in target.c:4667 [0x44fc2] in
libgomp.so.1
=========         Host Frame:
DataBlock_GPU_Memory_Functions<double>::alloc_device_ptr(unsigned long, int)
[0xb173] in sparsetests
=========         Host Frame:
DataBlock_GPU_Memory_Functions<double>::copy_data_to_device_set_devptr(DataBlock<double>&,
int) [0x96a5] in sparsetests
=========         Host Frame: mdspan<double, std::vector<unsigned long,
std::allocator<unsigned long> > >::device_data_upload(bool, int) [0x8045] in
sparsetests
=========         Host Frame: main [0x465c] in sparsetests
========= 

and:
========= Invalid __global__ read of size 8 bytes
=========     at [clone BlockedDataView<double>::build_blocks_rank2(unsigned
long, unsigned long, bool)] _omp_fn$1+0x1530
=========     by thread (0,1,0) in block (0,0,0)
=========     Access at 0x7ffeceecd760 is out of bounds
=========     and is 243.440.342.369 bytes after the nearest allocation at
0x7fc620c00000 of size 512 bytes
=========         Device Frame: gomp_nvptx_main+0x1030 in team.c:135
=========         Device Frame: [clone
BlockedDataView<double>::build_blocks_rank2(unsigned long, unsigned long,
bool)] _omp_fn$0+0x240
=========     Saved host backtrace up to driver entry point at kernel launch
time
=========         Host Frame: cuLaunchKernel [0x39d6c4] in libcuda.so.1
=========         Host Frame: GOMP_OFFLOAD_run in plugin-nvptx.c:2965 [0xa2d9]
in libgomp-plugin-nvptx.so.1
=========         Host Frame: GOMP_target_ext in target.c:3544 [0x4132d] in
libgomp.so.1
=========         Host Frame:
BlockedDataView<double>::build_blocks_rank2(unsigned long, unsigned long, bool)
[0xa3be] in sparsetests
=========         Host Frame: BlockedDataView<double>::build_blocks(unsigned
long const*, bool) [0x8f0a] in sparsetests
=========         Host Frame:
BlockedDataView<double>::BlockedDataView(DataBlock<double> const&, unsigned
long const*, bool) [0x7447] in sparsetests
=========         Host Frame: main [0x4705] in sparsetests



Since clang compiles this fine and I can see the CUDA kernels running on the
GPU, these seem to be gcc problems.

The out-of-bounds error is interesting. Why does it appear? If the access were
out of bounds because of my code, I should also get an out-of-bounds error with
clang. In fact, I checked this against the source code: I can reserve the
arrays pooled_offsets_flat and pooled_offsets_starts so that they hold 2000
elements, and I still get this out-of-bounds error, even though the loop that
indexes them only runs from 0 to 128.

None of this seems to make sense, especially since the clang build does not
show any such errors.
