Hi:
This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver. Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there? Also, maybe run dmesg on gpu-k20-08 and see if there is anything in the log?
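To be concrete, the no-MPI test can be as small as the sketch below. This is only a guess at what the stripped-down program looks like; the file name, device number, and allocation size are placeholders, not your actual code.

/* cuda_only.c -- hypothetical stand-alone test, no MPI.
   Build with something like: nvcc cuda_only.c -o cuda_only
   (or gcc with the CUDA include path and -lcudart).         */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    void *buf = NULL;
    cudaError_t err;

    err = cudaSetDevice(0);            /* same call that segfaults in your trace  */
    printf("cudaSetDevice: %s\n", cudaGetErrorString(err));

    err = cudaMalloc(&buf, 1024);      /* small allocation; the size is arbitrary */
    printf("cudaMalloc:    %s\n", cudaGetErrorString(err));

    if (buf)
        cudaFree(buf);
    return 0;
}

If that binary also dies inside cuInit() on gpu-k20-08, MPI is definitely out of the picture.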
Also, does your program run if you just run it on gpu-k20-07? Can you include the output from nvidia-smi on each node?

Thanks,
Rolf

>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
>Sent: Tuesday, August 19, 2014 8:55 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
>Hi,
>I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me much more information.
>[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
>                  Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
>  Internal debug support: yes
>Memory debugging support: no
>
>Is there something I need to do at run time to get more information out of it?
>
>[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple
>[gpu-k20-08:46045] *** Process received signal ***
>[gpu-k20-08:46045] Signal: Segmentation fault (11)
>[gpu-k20-08:46045] Signal code: Address not mapped (1)
>[gpu-k20-08:46045] Failing at address: 0x8
>[gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
>[gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
>[gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
>[gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
>[gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
>[gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
>[gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
>[gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
>[gpu-k20-08:46045] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
>[gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
>[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
>[gpu-k20-07:61816] Signal: Segmentation fault (11)
>[gpu-k20-07:61816] Signal code: Address not mapped (1)
>[gpu-k20-07:61816] Failing at address: 0x8
>[gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
>[gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
>[gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
>[gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
>[gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
>[gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
>[gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
>[gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
>[gpu-k20-07:61816] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647]
>[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
>[gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
>[gpu-k20-07:61816] [11] cudampi_simple[0x400699]
>[gpu-k20-07:61816] *** End of error message ***
>/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
>[gpu-k20-08:46045] [11] cudampi_simple[0x400699]
>[gpu-k20-08:46045] *** End of error message ***
>--------------------------------------------------------------------------
>mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08 exited on signal 11 (Segmentation fault).
>--------------------------------------------------------------------------
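Both backtraces above die inside cuInit(), which the CUDA runtime invokes the first time cudaSetDevice() touches the driver, so this really does look like a driver initialization problem rather than anything in the MPI layer. If you want to take the CUDA runtime out of the equation as well, a driver-API-only probe along these lines should tell you whether cuInit() itself comes back cleanly on each node (just a sketch; the file name is made up, and it links against -lcuda rather than libcudart):

/* cuinit_probe.c -- hypothetical driver-API probe, no MPI, no CUDA runtime.
   Build with something like: gcc cuinit_probe.c -o cuinit_probe -lcuda     */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUresult rc;
    int ndev = 0;

    rc = cuInit(0);                        /* the call that segfaults in the traces */
    printf("cuInit returned %d\n", (int)rc);
    if (rc != CUDA_SUCCESS)
        return 1;

    rc = cuDeviceGetCount(&ndev);
    printf("cuDeviceGetCount returned %d, %d device(s) visible\n", (int)rc, ndev);
    return 0;
}

If this probe also segfaults on gpu-k20-08, the problem is purely in the driver on that node, and the dmesg output mentioned above is the next place to look.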
>
>Thanks,
>
>Maxime
>
>
>On 2014-08-18 16:45, Rolf vandeVaart wrote:
>> Just to help reduce the scope of the problem, can you retest with a non-CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace?
>>
>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
>>> Sent: Monday, August 18, 2014 4:23 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>
>>> Hi,
>>> Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one.
>>>
>>> I reduced the code to the minimal that would reproduce the bug. I have pasted it here:
>>> http://pastebin.com/1uAK4Z8R
>>> Basically, it is a program that initializes MPI, cudaMallocs some memory, then frees the memory and finalizes MPI. Nothing else.
>>>
>>> When I compile and run this on a single node, everything works fine.
>>>
>>> When I compile and run this on more than one node, I get the following stack trace:
>>> [gpu-k20-07:40041] *** Process received signal ***
>>> [gpu-k20-07:40041] Signal: Segmentation fault (11)
>>> [gpu-k20-07:40041] Signal code: Address not mapped (1)
>>> [gpu-k20-07:40041] Failing at address: 0x8
>>> [gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
>>> [gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
>>> [gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
>>> [gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
>>> [gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
>>> [gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
>>> [gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
>>> [gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
>>> [gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
>>> [gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
>>> [gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
>>> [gpu-k20-07:40041] [11] cudampi_simple[0x400699]
>>> [gpu-k20-07:40041] *** End of error message ***
>>>
>>>
>>> The stack trace is the same whether I use OpenMPI 1.6.5 (not CUDA-aware) or OpenMPI 1.8.1 (CUDA-aware).
>>>
>>> I know this is more likely a problem with CUDA than with OpenMPI (since it does the same for two different versions), but I figured I would ask here in case somebody has a clue of what might be going on. I have yet to be able to file a bug report on NVIDIA's website for CUDA.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> --
>>> ---------------------------------
>>> Maxime Boissonneault
>>> Computing analyst - Calcul Québec, Université Laval
>>> Ph. D. in physics
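For anyone reading this thread without following the pastebin link: going by the description in the quoted message above, the reproducer is presumably something along the lines of the sketch below. This is a reconstruction from that description, not the actual pasted code; the cudaSetDevice() call is inferred from the later backtraces, and the file name and allocation size are made up.

/* cudampi_simple.c -- hypothetical reconstruction of the reproducer.
   Build with something like:
   mpicc cudampi_simple.c -o cudampi_simple -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    void *dev_buf = NULL;

    MPI_Init(&argc, &argv);            /* initialize MPI                                   */
    cudaSetDevice(0);                  /* where the multi-node run dies (ends up in cuInit) */
    cudaMalloc(&dev_buf, 1 << 20);     /* allocate some device memory                       */
    cudaFree(dev_buf);                 /* free it                                           */
    MPI_Finalize();                    /* finalize MPI                                      */

    printf("done\n");
    return 0;
}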
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25064.php
>> -----------------------------------------------------------------------------------
>> This email message is for the sole use of the intended recipient(s) and may contain
>> confidential information. Any unauthorized review, use, disclosure or distribution
>> is prohibited. If you are not the intended recipient, please contact the sender by
>> reply email and destroy all copies of the original message.
>> -----------------------------------------------------------------------------------
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25065.php
>
>
>--
>---------------------------------
>Maxime Boissonneault
>Computing analyst - Calcul Québec, Université Laval
>Ph. D. in physics
>
>_______________________________________________
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25074.php