I'm encountering intermittent errors while trying to use the Multi-Process
Service (MPS) with CUDA 7.0 to improve concurrent access to Kepler K20Xm GPUs by
multiple MPI processes that perform GPU-to-GPU communication with each other
(i.e., GPU pointers are passed to the MPI transmission primitives). I'm using
GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5, which is in turn
built against CUDA 7.0. In my current configuration, I have 4 MPS daemons
running, each of which controls access to one of the node's 4 GPUs; the MPI
processes spawned by my program are partitioned into 4 groups (which may contain
different numbers of processes), each of which talks to a separate daemon. For
certain transmission patterns between these processes the program runs without
any problems; for others (e.g., 16 processes partitioned into 4 groups),
however, it dies with the following error:

[node05:20562] Failed to register remote memory, rc=-1
--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
  cuIpcOpenMemHandle return value:   21199360
  address: 0x1
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at 
line 477
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send]
mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
[node05:20564] Failed to register remote memory, rc=-1
[node05:20564] [[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at 
line 477
[node05:20566] Failed to register remote memory, rc=-1
[node05:20566] [[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at 
line 477
[node05:20567] Failed to register remote memory, rc=-1
[node05:20567] [[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at 
line 477
[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node05:20569] Failed to register remote memory, rc=-1
[node05:20569] [[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c 
at line 477
[node05:20571] Failed to register remote memory, rc=-1
[node05:20571] [[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c 
at line 477
[node05:20572] Failed to register remote memory, rc=-1
[node05:20572] [[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c 
at line 477
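
For context, the GPU-to-GPU transfers that trigger this look roughly like the
stripped-down sketch below (not my actual code; the pycuda usage, buffer size,
and the MPI.memory.fromaddress wrapper for the raw device pointer are just
illustrative):

    from mpi4py import MPI
    import numpy as np
    import pycuda.autoinit            # creates a context on the visible GPU
    import pycuda.gpuarray as gpuarray

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    N = 1024
    if rank == 0:
        x_gpu = gpuarray.to_gpu(np.random.rand(N))
        # Hand the raw device pointer to mpi4py so that CUDA-aware OpenMPI
        # can move the data directly (via CUDA IPC for intranode peers):
        buf = MPI.memory.fromaddress(int(x_gpu.gpudata), x_gpu.nbytes)
        comm.Send([buf, MPI.DOUBLE], dest=1, tag=0)
    elif rank == 1:
        y_gpu = gpuarray.empty(N, np.float64)
        buf = MPI.memory.fromaddress(int(y_gpu.gpudata), y_gpu.nbytes)
        comm.Recv([buf, MPI.DOUBLE], source=0, tag=0)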

After the above error occurs, I notice that /dev/shm/ is littered with
cuda.shm.* files. Cleaning up /dev/shm before running my program doesn't seem
to affect the problem, and rebooting the machine makes no difference either. I
should also add that the program runs without any errors if the groups of MPI
processes talk to the GPUs directly rather than via MPS.
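
In case the way I attach processes to the daemons matters: each rank selects its
MPS daemon by pointing CUDA_MPS_PIPE_DIRECTORY at that daemon's pipe directory
before creating its CUDA context, roughly as follows (simplified; the paths and
the group assignment are placeholders, and my actual partitioning is more
involved):

    import os
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # 4 MPS control daemons were started beforehand, one per GPU, each with
    # its own pipe/log directories (the /tmp/mps_<n> paths are placeholders).
    group = rank % 4
    os.environ['CUDA_MPS_PIPE_DIRECTORY'] = '/tmp/mps_%d/pipe' % group

    # Only now initialize CUDA, so the context attaches to that daemon;
    # the daemon exposes its GPU to clients as device 0.
    import pycuda.driver as drv
    drv.init()
    ctx = drv.Device(0).make_context()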

Does anyone have any ideas as to what could be going on?
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
