I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded today) built
against OpenMPI 1.8.4 with CUDA support activated to asynchronously send GPU
arrays between multiple Tesla GPUs (Fermi generation). Each MPI process is
associated with a single GPU; the process has a run loop that starts several
Isends to transmit the contents of GPU arrays to destination processes and
several Irecvs to receive data from source processes into GPU arrays on the
process' GPU. Some of the sends/recvs use one tag, while the remainder use a
second tag. A single Waitall invocation is used to wait for all of these sends
and receives to complete before the next iteration of the loop can commence. All
GPU arrays are preallocated before the run loop starts (a stripped-down sketch
of the pattern follows the traceback below). While this pattern works most of
the time, it sometimes fails with a segfault that appears to occur during an
Isend:

[myhost:05471] *** Process received signal ***
[myhost:05471] Signal: Segmentation fault (11)
[myhost:05471] Signal code:  (128)
[myhost:05471] Failing at address: (nil)
[myhost:05471] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x2ac2bb176340]
[myhost:05471] [ 1] /usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x1f6b18)[0x2ac2c48bfb18]
[myhost:05471] [ 2] /usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x16dcc3)[0x2ac2c4836cc3]
[myhost:05471] [ 3] /usr/lib/x86_64-linux-gnu/libcuda.so.1(cuIpcGetEventHandle+0x5d)[0x2ac2c480bccd]
[myhost:05471] [ 4] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_common_cuda_construct_event_and_handle+0x27)[0x2ac2c27d3087]
[myhost:05471] [ 5] /opt/openmpi-1.8.4/lib/libmpi.so.1(ompi_free_list_grow+0x199)[0x2ac2c277b8e9]
[myhost:05471] [ 6] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_mpool_gpusm_register+0xf4)[0x2ac2c28c9fd4]
[myhost:05471] [ 7] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_rdma_cuda_btls+0xcd)[0x2ac2c28f8afd]
[myhost:05471] [ 8] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_send_request_start_cuda+0xbf)[0x2ac2c28f8d5f]
[myhost:05471] [ 9] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_isend+0x60e)[0x2ac2c28eb6fe]
[myhost:05471] [10] /opt/openmpi-1.8.4/lib/libmpi.so.1(MPI_Isend+0x137)[0x2ac2c27b7cc7]
[myhost:05471] [11] /home/lev/Work/miniconda/envs/MYENV/lib/python2.7/site-packages/mpi4py/MPI.so(+0xd3bb2)[0x2ac2c24b3bb2]
(Python-related debug lines omitted.)
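
For reference, here is a stripped-down sketch of the communication pattern
(not the actual application code; the array length, tag values, peer ranks,
and iteration count are placeholders, and handing the device pointer to
mpi4py via DeviceAllocation.as_buffer is just the schematic approach shown
here):

import numpy as np
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each MPI process is associated with a single GPU:
drv.init()
ctx = drv.Device(rank % drv.Device.count()).make_context()

N = 1000                                  # placeholder array length
TAG_A, TAG_B = 100, 101                   # the two tags mentioned above
dest = (rank + 1) % size                  # placeholder peer ranks
src = (rank - 1) % size

# All GPU arrays are preallocated before the run loop:
out_a = gpuarray.to_gpu(np.random.rand(N))
out_b = gpuarray.to_gpu(np.random.rand(N))
in_a = gpuarray.empty(N, np.double)
in_b = gpuarray.empty(N, np.double)

for step in range(1000):
    reqs = []
    # Device pointers are passed to MPI; with a CUDA-aware OpenMPI build the
    # library moves the GPU data directly (no explicit host staging here):
    reqs.append(comm.Isend([out_a.gpudata.as_buffer(out_a.nbytes),
                            MPI.DOUBLE], dest=dest, tag=TAG_A))
    reqs.append(comm.Isend([out_b.gpudata.as_buffer(out_b.nbytes),
                            MPI.DOUBLE], dest=dest, tag=TAG_B))
    reqs.append(comm.Irecv([in_a.gpudata.as_buffer(in_a.nbytes),
                            MPI.DOUBLE], source=src, tag=TAG_A))
    reqs.append(comm.Irecv([in_b.gpudata.as_buffer(in_b.nbytes),
                            MPI.DOUBLE], source=src, tag=TAG_B))
    # A single Waitall blocks until every send/recv completes before the
    # next iteration:
    MPI.Request.Waitall(reqs)

ctx.pop()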

Any ideas as to what could be causing this problem?

I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
