Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new one that is simpler and more precise.

I reduced the code to the minimum needed to reproduce the bug. I have pasted it here:
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI, allocates memory with cudaMalloc, then frees the memory and finalizes MPI. Nothing else.

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following stack trace:
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***


The stack trace is the same whether I use OpenMPI 1.6.5 (not CUDA-aware) or OpenMPI 1.8.1 (CUDA-aware).

I know this is more likely a problem with CUDA than with OpenMPI (since it behaves the same with two different versions), but I figured I would ask here in case somebody has a clue about what might be going on. I have not yet been able to file a bug report for CUDA on NVIDIA's website.


Thanks,


--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique
