Hi All,

We have run into issues that don't actually seem to materialize into incorrect results; nonetheless, we hope to figure out why we are getting them.
We have several environments under test, ranging from one machine with 1-16 processes per node to several machines with 1-16 processes each. All systems are certified by NVIDIA and use NVIDIA Tesla K40 GPUs. We frequently notice situations like the following:

--------------------------------------------------------------------------
The call to cuEventCreate failed. This is a unrecoverable error and will
cause the program to abort.
  Hostname: aHost
  cuEventCreate return value: 304
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuIpcGetEventHandle failed. This is a unrecoverable error and
will cause the program to abort.
  cuIpcGetEventHandle return value: 304
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value: 304
  address: 0x700fd0400
Check the cuda.h file for what the return value means.
Perhaps a reboot of the node will clear the problem.
--------------------------------------------------------------------------

Our test suite still verifies results, but when this happens it also causes the following:

--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value: 400
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[37290,1],2]
  Exit code: 1
--------------------------------------------------------------------------

We have traced the code back to the following file:

- ompi/mca/common/cuda/common_cuda.c :: mca_common_cuda_construct_event_and_handle()

We also know the following:

- It happens on every machine, on the very first entry to the function mentioned above.
- It does not happen if the buffer size is under 128 bytes; likely a different mechanism is used for the IPC.

Last, here is an intermittent one. It produces a lot of failed tests in our suite when in fact they are solid, aside from this error. It causes notifications and annoyances, and it would be nice to clean it up:

[mpi_rank_3][cudaipc_allocate_ipc_region] [src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_ipc.c:487] cuda failed with mapping of buffer object failed

We have not been able to duplicate these errors in other MPI libraries.

Thank you for your time, and looking forward to your response.

Kindest Regards,

Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV, Samsung Electronics,
1732 North First Street, San Jose, CA 95112
Work: +1 408-652-1976
Work: +1 408-544-5781 (Wednesdays)
Cell: +1 408-819-4407