Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
> >-----Original Message-----
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Tuesday, May 19, 2015 6:30 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
> >1.8.5 with CUDA 7.0 and Multi-Process Service
> >
> >I'm encountering intermittent errors while trying to use the Multi-Process
> >Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
> >by multiple MPI processes that perform GPU-to-GPU communication with each
> >other (i.e., GPU pointers are passed to the MPI transmission primitives).
> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
> >which is in turn built against CUDA 7.0. In my current configuration, I have
> >4 MPS server daemons running, each of which controls access to one of 4
> >GPUs; the MPI processes spawned by my program are partitioned into 4 groups
> >(which might contain different numbers of processes) that each talk to a
> >separate daemon. For certain transmission patterns between these processes,
> >the program runs without any problems. For others (e.g., 16 processes
> >partitioned into 4 groups), however, it dies with the following error:
> >
> >[node05:20562] Failed to register remote memory, rc=-1
> >--------------------------------------------------------------------------
> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
> >will cause the program to abort.
> >  cuIpcOpenMemHandle return value:   21199360
> >  address: 0x1
> >Check the cuda.h file for what the return value means. Perhaps a reboot of
> >the node will clear the problem.

(snip)

> >After the above error occurs, I notice that /dev/shm/ is littered with
> >cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
> >but that doesn't seem to have any effect upon the problem. Rebooting the
> >machine also doesn't have any effect. I should also add that my program runs
> >without any error if the groups of MPI processes talk directly to the GPUs
> >instead of via MPS.
> >
> >Does anyone have any ideas as to what could be going on?
>
> I am not sure why you are seeing this.  One thing that is clear is that you
> have found a bug in our error reporting: the message you received is a bit
> garbled because of a bug in what we report. I will fix that.
> 
> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0?  My
> expectation is that you will not see any errors, but you may lose some
> performance.
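
For reference, I passed the parameter straight to mpirun, i.e. something along
the lines of

    mpirun --mca btl_smcuda_use_cuda_ipc 0 -np 16 python <my script>

(with <my script> standing in for my actual program).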

The error does indeed go away when IPC is disabled, although I do want to
avoid degrading the performance of data transfers between GPU memory locations.
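
In case it helps to clarify what I mean by passing GPU pointers to the MPI
transmission primitives, the transfers are roughly of the following form. This
is a simplified sketch rather than my actual code: it assumes a CUDA-aware
OpenMPI build underneath mpi4py, uses pycuda, and devbuf() is just an
illustrative helper that wraps the raw device pointer in a buffer object that
mpi4py's buffer-based calls will accept.

    import ctypes

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context (device/MPS env selection omitted)
    import pycuda.gpuarray as gpuarray
    from mpi4py import MPI

    def devbuf(a):
        # Illustrative helper: expose a GPUArray's device memory as a plain
        # buffer object so it can be passed to mpi4py's Send/Recv. This only
        # works because the underlying MPI library is CUDA-aware.
        return ctypes.cast(int(a.gpudata),
                           ctypes.POINTER(ctypes.c_byte * a.nbytes)).contents

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    N = 1024
    if rank == 0:
        x_gpu = gpuarray.to_gpu(np.random.rand(N))
        # The device pointer itself (not a host copy) is what MPI sees:
        comm.Send([devbuf(x_gpu), MPI.DOUBLE], dest=1, tag=0)
    elif rank == 1:
        y_gpu = gpuarray.empty(N, np.float64)
        comm.Recv([devbuf(y_gpu), MPI.DOUBLE], source=0, tag=0)

As I understand it, turning off CUDA IPC forces these intranode device-to-device
transfers to be staged through host memory, which is the slowdown I'd like to
avoid.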

> What does your hardware configuration look like?  Can you send me the output
> from "nvidia-smi topo -m"?
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
