I'm encountering intermittent errors while trying to use the Multi-Process Service (MPS) with CUDA 7.0 to improve concurrent access to Kepler K20Xm GPUs by multiple MPI processes that perform GPU-to-GPU communication with each other (i.e., GPU pointers are passed to the MPI transmission primitives). I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5, which is in turn built against CUDA 7.0. In my current configuration, I have 4 MPS server daemons running, each of which controls access to one of 4 GPUs; the MPI processes spawned by my program are partitioned into 4 groups (which may contain different numbers of processes), each of which talks to a separate daemon.
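To give a concrete idea of what I mean by passing GPU pointers to the MPI transmission primitives, the per-group transfers look roughly like the sketch below. This is heavily simplified and not my actual code; it assumes pycuda is installed and that each process has already been pointed at its group's MPS daemon (e.g., via CUDA_MPS_PIPE_DIRECTORY) before its CUDA context is created:

# Simplified sketch of GPU-to-GPU transfers between MPI processes;
# assumes pycuda and an OpenMPI build with CUDA support (as above).
import ctypes

import numpy as np
import pycuda.autoinit  # creates the CUDA context (through MPS if configured)
import pycuda.gpuarray as gpuarray
from mpi4py import MPI

def bufint(a):
    # Expose a GPUArray's raw device pointer through the buffer interface
    # so that it can be handed to mpi4py's transmission primitives.
    return ctypes.cast(ctypes.c_void_p(int(a.gpudata)),
                       ctypes.POINTER(ctypes.c_byte * a.nbytes)).contents

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Partition the ranks into 4 groups, one per GPU/MPS daemon.
group = comm.Split(color=rank % 4, key=rank)

x_gpu = gpuarray.to_gpu(np.arange(100, dtype=np.double))

# Device pointers are passed straight to Send/Recv; the CUDA-aware
# OpenMPI build takes care of moving the data between GPUs.
if group.Get_rank() == 0:
    group.Send([bufint(x_gpu), MPI.DOUBLE], dest=1, tag=0)
elif group.Get_rank() == 1:
    y_gpu = gpuarray.empty(100, np.double)
    group.Recv([bufint(y_gpu), MPI.DOUBLE], source=0, tag=0)

The bufint() helper above is only meant to illustrate handing the raw device pointer to MPI so that the CUDA-aware OpenMPI build recognizes it as device memory; the actual program is more involved.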
For certain transmission patterns between these processes, the program runs without any problems. For others (e.g., 16 processes partitioned into 4 groups), however, it dies with the following error:

[node05:20562] Failed to register remote memory, rc=-1
--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
  cuIpcOpenMemHandle return value:   21199360
  address: 0x1
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
[node05:20564] Failed to register remote memory, rc=-1
[node05:20564] [[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20566] Failed to register remote memory, rc=-1
[node05:20566] [[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20567] Failed to register remote memory, rc=-1
[node05:20567] [[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node05:20569] Failed to register remote memory, rc=-1
[node05:20569] [[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20571] Failed to register remote memory, rc=-1
[node05:20571] [[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20572] Failed to register remote memory, rc=-1
[node05:20572] [[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477

After the above error occurs, I notice that /dev/shm/ is littered with cuda.shm.* files. I tried cleaning up /dev/shm before running my program, but that doesn't seem to have any effect upon the problem. Rebooting the machine also doesn't have any effect.

I should also add that my program runs without any error if the groups of MPI processes talk directly to the GPUs instead of via MPS.

Does anyone have any ideas as to what could be going on?

--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/