Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
>
> >-----Original Message-----
> >From: users [mailto:[email protected]] On Behalf Of Lev Givon
> >Sent: Tuesday, May 19, 2015 6:30 PM
> >To: [email protected]
> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
> >1.8.5 with CUDA 7.0 and Multi-Process Service
> >
> >I'm encountering intermittent errors while trying to use the Multi-Process
> >Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
> >by multiple MPI processes that perform GPU-to-GPU communication with
> >each other (i.e., GPU pointers are passed to the MPI transmission
> >primitives).
> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
> >which is in turn built against CUDA 7.0. In my current configuration, I
> >have 4 MPS server daemons running, each of which controls access to one of
> >4 GPUs;
> >the MPI processes spawned by my program are partitioned into 4 groups
> >(which might contain different numbers of processes) that each talk to a
> >separate daemon. For certain transmission patterns between these
> >processes, the program runs without any problems. For others (e.g., 16
> >processes partitioned into 4 groups), however, it dies with the following
> >error:
> >
> >[node05:20562] Failed to register remote memory, rc=-1
> >--------------------------------------------------------------------------
> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
> >will cause the program to abort.
> > cuIpcOpenMemHandle return value: 21199360
> > address: 0x1
> >Check the cuda.h file for what the return value means. Perhaps a reboot of
> >the node will clear the problem.
(snip)
> >After the above error occurs, I notice that /dev/shm/ is littered with
> >cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
> >but that doesn't seem to have any effect upon the problem. Rebooting the
> >machine also doesn't have any effect. I should also add that my program runs
> >without any error if the groups of MPI processes talk directly to the GPUs
> >instead of via MPS.
> >
> >Does anyone have any ideas as to what could be going on?
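For reference, the partitioning described above (16 ranks split into 4 groups, each group talking to the MPS daemon for one GPU) can be sketched as follows. This is a minimal illustration only; the contiguous-block assignment and the /tmp/mps_<gpu> pipe-directory layout are hypothetical, not taken from this thread:

```python
# Minimal sketch, assuming contiguous blocks of ranks per GPU; the
# /tmp/mps_<gpu> pipe-directory naming is a hypothetical example.
def assign_group(rank, n_ranks=16, n_gpus=4):
    """Map an MPI rank to (gpu_id, MPS pipe directory)."""
    gpu = rank * n_gpus // n_ranks  # ranks 0-3 -> GPU 0, 4-7 -> GPU 1, ...
    return gpu, "/tmp/mps_%d" % gpu

# Each process would point its CUDA runtime at the matching MPS control
# daemon before making any CUDA call, e.g.:
#   os.environ["CUDA_MPS_PIPE_DIRECTORY"] = assign_group(rank)[1]
```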
>
> I am not sure why you are seeing this. One thing that is clear is that you
> have found a bug in the error reporting: the message is a little garbled,
> and I will fix that.
>
> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0?
> My expectation is that you will not see any errors, but you may lose some
> performance.
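As a concrete invocation of the suggestion above (the program name and process count below are placeholders, not taken from this thread):

```shell
# Disable CUDA IPC in the smcuda BTL for a 16-process run
# (program name is a placeholder):
mpiexec -n 16 --mca btl_smcuda_use_cuda_ipc 0 python my_program.py
```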
>
> What does your hardware configuration look like? Can you send me the output
> of "nvidia-smi topo -m"?
        GPU0   GPU1   GPU2   GPU3   CPU Affinity
GPU0     X     PHB    SOC    SOC    0-23
GPU1    PHB     X     SOC    SOC    0-23
GPU2    SOC    SOC     X     PHB    0-23
GPU3    SOC    SOC    PHB     X     0-23
Legend:
X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/