Hi:
This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver. Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there? Also, maybe run dmesg on gpu-k20-08 and see if there is anything in the log?
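To be concrete, the no-MPI test can be as small as the sketch below. This is only a guess at what the stripped-down program looks like; the file name, device number, and allocation size are placeholders, not your actual code.

/* cuda_only.c -- hypothetical stand-alone test, no MPI.
   Build with something like: nvcc cuda_only.c -o cuda_only
   (or gcc with the CUDA include path and -lcudart).         */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    void *buf = NULL;
    cudaError_t err;

    err = cudaSetDevice(0);            /* same call that segfaults in your trace  */
    printf("cudaSetDevice: %s\n", cudaGetErrorString(err));

    err = cudaMalloc(&buf, 1024);      /* small allocation; the size is arbitrary */
    printf("cudaMalloc:    %s\n", cudaGetErrorString(err));

    if (buf)
        cudaFree(buf);
    return 0;
}

If that binary also dies inside cuInit() on gpu-k20-08, MPI is definitely out of the picture.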
Also, does your program run if you just run it on gpu-k20-07? Can you include the output from nvidia-smi on each node?

Thanks,
Rolf

>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
>Sent: Tuesday, August 19, 2014 8:55 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
>Hi,
>I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me much more information.
>[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
>                  Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
>  Internal debug support: yes
>Memory debugging support: no
>
>Is there something I need to do at run time to get more information out of it?
>
>[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple
>[gpu-k20-08:46045] *** Process received signal ***
>[gpu-k20-08:46045] Signal: Segmentation fault (11)
>[gpu-k20-08:46045] Signal code: Address not mapped (1)
>[gpu-k20-08:46045] Failing at address: 0x8
>[gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
>[gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
>[gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
>[gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
>[gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
>[gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
>[gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
>[gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
>[gpu-k20-08:46045] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
>[gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
>[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
>[gpu-k20-07:61816] Signal: Segmentation fault (11)
>[gpu-k20-07:61816] Signal code: Address not mapped (1)
>[gpu-k20-07:61816] Failing at address: 0x8
>[gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
>[gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
>[gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
>[gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
>[gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
>[gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
>[gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
>[gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
>[gpu-k20-07:61816] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647]
>[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
>[gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
>[gpu-k20-07:61816] [11] cudampi_simple[0x400699]
>[gpu-k20-07:61816] *** End of error message ***
>/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
>[gpu-k20-08:46045] [11] cudampi_simple[0x400699]
>[gpu-k20-08:46045] *** End of error message ***
>--------------------------------------------------------------------------
>mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08 exited on signal 11 (Segmentation fault).
>--------------------------------------------------------------------------
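Both backtraces above die inside cuInit(), which the CUDA runtime invokes the first time cudaSetDevice() touches the driver, so this really does look like a driver initialization problem rather than anything in the MPI layer. If you want to take the CUDA runtime out of the equation as well, a driver-API-only probe along these lines should tell you whether cuInit() itself comes back cleanly on each node (just a sketch; the file name is made up, and it links against -lcuda rather than libcudart):

/* cuinit_probe.c -- hypothetical driver-API probe, no MPI, no CUDA runtime.
   Build with something like: gcc cuinit_probe.c -o cuinit_probe -lcuda     */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUresult rc;
    int ndev = 0;

    rc = cuInit(0);                        /* the call that segfaults in the traces */
    printf("cuInit returned %d\n", (int)rc);
    if (rc != CUDA_SUCCESS)
        return 1;

    rc = cuDeviceGetCount(&ndev);
    printf("cuDeviceGetCount returned %d, %d device(s) visible\n", (int)rc, ndev);
    return 0;
}

If this probe also segfaults on gpu-k20-08, the problem is purely in the driver on that node, and the dmesg output mentioned above is the next place to look.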
>
>Thanks,
>
>Maxime
>
>
>On 2014-08-18 16:45, Rolf vandeVaart wrote:
>> Just to help reduce the scope of the problem, can you retest with a non-CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace?
>>
>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
>>> Sent: Monday, August 18, 2014 4:23 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>
>>> Hi,
>>> Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one.
>>>
>>> I reduced the code to the minimal that would reproduce the bug. I have pasted it here:
>>> http://pastebin.com/1uAK4Z8R
>>> Basically, it is a program that initializes MPI, cudaMallocs some memory, then frees the memory and finalizes MPI. Nothing else.
>>>
>>> When I compile and run this on a single node, everything works fine.
>>>
>>> When I compile and run this on more than one node, I get the following stack trace:
>>> [gpu-k20-07:40041] *** Process received signal ***
>>> [gpu-k20-07:40041] Signal: Segmentation fault (11)
>>> [gpu-k20-07:40041] Signal code: Address not mapped (1)
>>> [gpu-k20-07:40041] Failing at address: 0x8
>>> [gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
>>> [gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
>>> [gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
>>> [gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
>>> [gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
>>> [gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
>>> [gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
>>> [gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
>>> [gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
>>> [gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
>>> [gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
>>> [gpu-k20-07:40041] [11] cudampi_simple[0x400699]
>>> [gpu-k20-07:40041] *** End of error message ***
>>>
>>>
>>> The stack trace is the same whether I use OpenMPI 1.6.5 (not CUDA-aware) or OpenMPI 1.8.1 (CUDA-aware).
>>>
>>> I know this is more likely a problem with CUDA than with OpenMPI (since it does the same for two different versions), but I figured I would ask here in case somebody has a clue of what might be going on. I have yet to be able to file a bug report on NVIDIA's website for CUDA.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> --
>>> ---------------------------------
>>> Maxime Boissonneault
>>> Computing analyst - Calcul Québec, Université Laval
>>> Ph. D. in physics
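For anyone reading this thread without following the pastebin link: going by the description in the quoted message above, the reproducer is presumably something along the lines of the sketch below. This is a reconstruction from that description, not the actual pasted code; the cudaSetDevice() call is inferred from the later backtraces, and the file name and allocation size are made up.

/* cudampi_simple.c -- hypothetical reconstruction of the reproducer.
   Build with something like:
   mpicc cudampi_simple.c -o cudampi_simple -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    void *dev_buf = NULL;

    MPI_Init(&argc, &argv);            /* initialize MPI                                   */
    cudaSetDevice(0);                  /* where the multi-node run dies (ends up in cuInit) */
    cudaMalloc(&dev_buf, 1 << 20);     /* allocate some device memory                       */
    cudaFree(dev_buf);                 /* free it                                           */
    MPI_Finalize();                    /* finalize MPI                                      */

    printf("done\n");
    return 0;
}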
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25064.php
>> -----------------------------------------------------------------------------------
>> This email message is for the sole use of the intended recipient(s) and may contain
>> confidential information. Any unauthorized review, use, disclosure or distribution
>> is prohibited. If you are not the intended recipient, please contact the sender by
>> reply email and destroy all copies of the original message.
>> -----------------------------------------------------------------------------------
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25065.php
>
>
>--
>---------------------------------
>Maxime Boissonneault
>Computing analyst - Calcul Québec, Université Laval
>Ph. D. in physics
>
>_______________________________________________
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25074.php