Also, you need to check the return code from cudaMalloc before calling cudaFree -
the pointer may be invalid if CUDA was not initialized properly.
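
Something along these lines (just a sketch with an arbitrary buffer size, not your actual code):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    void *d_buf = NULL;
    /* check the return code before touching the pointer */
    cudaError_t err = cudaMalloc(&d_buf, 1 << 20);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;  /* do not cudaFree a pointer that was never allocated */
    }
    cudaFree(d_buf);
    return 0;
}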

Alex

-----Original Message----- From: Maxime Boissonneault
Sent: Tuesday, August 19, 2014 2:19 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Same thing:

[mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1
[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple
malloc: using debugging hooks
malloc: using debugging hooks
[gpu-k20-07:47628] *** Process received signal ***
[gpu-k20-07:47628] Signal: Segmentation fault (11)
[gpu-k20-07:47628] Signal code: Address not mapped (1)
[gpu-k20-07:47628] Failing at address: 0x8
[gpu-k20-07:47628] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b14cf850710]
[gpu-k20-07:47628] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2b14d4e9facf]
[gpu-k20-07:47628] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2b14d4e65a83]
[gpu-k20-07:47628] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2b14d4d972da]
[gpu-k20-07:47628] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2b14d4d83933]
[gpu-k20-07:47628] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b14cf0cf965]
[gpu-k20-07:47628] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b14cf0cfa0a]
[gpu-k20-07:47628] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b14cf0cfa3b]
[gpu-k20-07:47628] [ 8]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2b14cf0f0532]
[gpu-k20-07:47628] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:47628] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b14cfa7cd1d]
[gpu-k20-07:47628] [11] cudampi_simple[0x400699]
[gpu-k20-07:47628] *** End of error message ***
... (same segfault from the other node)

Maxime


On 2014-08-18 16:52, Alex A. Granovsky wrote:
Try the following:

export MALLOC_CHECK_=1

and then run it again

Kind regards,
Alex Granovsky



-----Original Message----- From: Maxime Boissonneault
Sent: Tuesday, August 19, 2014 12:23 AM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and simpler one.

I reduced the code to the minimum that reproduces the bug. I have
pasted it here:
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI and allocates memory with
cudaMalloc, then frees the memory and finalizes MPI. Nothing else.
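
Roughly, it looks like this (a simplified sketch; see the pastebin for the exact code):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* allocate and free a small device buffer; on two nodes the
       crash below happens inside cudaMalloc (via cuInit) */
    void *d_buf = NULL;
    cudaMalloc(&d_buf, 1 << 20);
    cudaFree(d_buf);

    MPI_Finalize();
    return 0;
}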

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following
stack trace:
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***


The stack trace is the same whether I use OpenMPI 1.6.5 (not CUDA-aware)
or OpenMPI 1.8.1 (CUDA-aware).

I know this is more likely a problem with CUDA than with OpenMPI
(since it behaves the same with two different versions), but I figured I
would ask here in case somebody has a clue about what might be going on.
I have not yet been able to file a bug report on NVIDIA's website for CUDA.


Thanks,




--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics

