Can you post the full mpirun command, or at least the relevant MPI MCA params?
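For example, something along these lines would show both - the MCA settings and, via pml_base_verbose, which PML Open MPI actually selects (the rank count and executable name below are just placeholders):

    mpirun -np 8 \
        --mca btl_smcuda_use_cuda_ipc 0 \
        --mca btl_smcuda_use_cuda_ipc_same_gpu 0 \
        --mca pml_base_verbose 10 \
        ./my_app

It's also worth confirming the build is CUDA-aware, e.g. with
"ompi_info --parsable --all | grep mpi_built_with_cuda_support:value".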
" I'm still curious about your input on whether or not those mca parameters I mentioned yesterday are disabling GPUDirect RDMA as well?" Even if you disable sm_cuda_ipc, it's possible you're still using cuda ipc via ucx for example. The mentioned mca params disable it for sm_cuda btl, but UCX doesn't use smcuda as a transport so it's irrelevant for ucx pml. Do you know which pml you're using? -Tommy On Saturday, May 31, 2025 at 1:26:58 PM UTC-5 Mike Adams wrote: > Interestingly, I made an error - Delta on 4.1.5 did fail like some of the > cases on Bridges2 on 4.0.5, but at 16 ranks per GPU. This is the core > count of the AMD processor on Delta with 4 GPUs. So, it looks like > Bridges2 needs an OpenMPI upgrade. > > Tommy, I'm still curious about your input on whether or not those mca > parameters I mentioned yesterday are disabling GPUDirect RDMA as well? > > Thank you both for your help! > > Mike Adams > > On Friday, May 30, 2025 at 11:39:49 AM UTC-6 Mike Adams wrote: > >> Dmitry, >> >> I'm not too familiar with the internals of OpenMPI, but I just tried >> 4.1.5 on NCSA Delta and received the same IPC errors (no mca flags >> switched). The actual calls didn't fail this time to perform the actual >> operation, so maybe that's an improvement from v4.0.x to v4.1.x? >> >> Thanks, >> >> Mike Adams >> >> On Friday, May 30, 2025 at 11:21:16 AM UTC-6 Dmitry N. Mikushin wrote: >> >>> There is a relevant explanation of the same issue reported for Julia: >>> https://github.com/JuliaGPU/CUDA.jl/issues/1053 >>> >>> пт, 30 мая 2025 г. в 19:05, Mike Adams <mikeca...@gmail.com>: >>> >>>> Hi Tommy, >>>> >>>> I'm setting btl_smcuda_use_cuda_ipc_same_gpu 0 and >>>> btl_smcuda_use_cuda_ipc 0. >>>> So, are you saying that with these params, it is also not using >>>> GPUDirect RDMA? >>>> >>>> PSC Bridges 2 only has v4 OpenMPI, but they may be working on >>>> installing v5 now. Everything works on v5 on NCSA Delta - I'll try to >>>> test >>>> on an older OpenMPI. >>>> >>>> Mike Adams >>>> On Friday, May 30, 2025 at 10:54:23 AM UTC-6 Tomislav Janjusic US wrote: >>>> >>>>> Hi, >>>>> >>>>> I'm not sure if it's a known issue, in v4.0 possibly, not sure about >>>>> v4.1 or v5.0 - can you try? >>>>> As far as CUDA IPC - how are you disabling it? I don't remember the >>>>> mca params in v4.0 >>>>> If it's either through pml ucx, or smcuda then no, it won't use it. >>>>> -Tommy >>>>> >>>>> >>>>> On Saturday, May 24, 2025 at 8:56:50 AM UTC-7 Mike Adams wrote: >>>>> >>>>>> Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2. I'm >>>>>> calling collectives like MPI_Allreduce on buffers that have already been >>>>>> shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle. >>>>>> >>>>>> On these buffers, I receive the following message and some >>>>>> communication sizes fail: >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol >>>>>> cannot be used. >>>>>> cuIpcGetMemHandle return value: 1 >>>>>> address: 0x147d54000068 >>>>>> Check the cuda.h file for what the return value means. Perhaps a >>>>>> reboot >>>>>> of the node will clear the problem. >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> If I pass in the two mca parameters to disable OpenMPI IPC, >>>>>> everything works. >>>>>> >>>>>> I'm wondering two things: >>>>>> Is this failure to handle IPC buffers in OpenMPI 4 a known issue? 
>>>>>> When I disable OpenMPI CUDA IPC with mca parameters, does OpenMPI >>>>>> still use GPUDirect RDMA? >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Mike Adams >>>>>> >>>>> To unsubscribe from this group and stop receiving emails from it, send >>>> an email to users+un...@lists.open-mpi.org. >>>> >>> To unsubscribe from this group and stop receiving emails from it, send an email to users+unsubscr...@lists.open-mpi.org.
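For anyone who finds this thread later, here is a minimal sketch of the
pattern described in the original post: rank 0 exports a device buffer with
cudaIpcGetMemHandle, the other ranks (assumed to be on the same node and
GPU) open it with cudaIpcOpenMemHandle, and the mapped pointer is then
passed to MPI_Allreduce. It is illustrative only - buffer size, variable
names, and the separate result buffer are assumptions, not taken from the
actual application - but it follows the same call path the warning was
reported on.

/* Illustrative sketch (not the original application): a device buffer that
 * is itself a CUDA IPC mapping is handed to a CUDA-aware MPI_Allreduce.
 * All ranks are assumed to share one node and one GPU; error checking is
 * omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *shared = NULL;   /* rank 0's allocation, IPC-mapped everywhere */
    float *result = NULL;   /* per-rank result buffer */
    cudaIpcMemHandle_t handle;

    if (rank == 0) {
        cudaMalloc((void **)&shared, n * sizeof(float));
        cudaMemset(shared, 0, n * sizeof(float));
        cudaIpcGetMemHandle(&handle, shared);   /* export the allocation */
    }

    /* Pass the opaque IPC handle to the other ranks on the node. */
    MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);

    if (rank != 0)
        cudaIpcOpenMemHandle((void **)&shared, handle,
                             cudaIpcMemLazyEnablePeerAccess);

    cudaMalloc((void **)&result, n * sizeof(float));

    /* On ranks != 0 the send buffer is a CUDA IPC mapping that Open MPI did
     * not allocate itself; this is the situation in which the
     * cuIpcGetMemHandle warning quoted above was observed. */
    MPI_Allreduce(shared, result, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(result);
    if (rank != 0)
        cudaIpcCloseMemHandle(shared);
    else
        cudaFree(shared);

    MPI_Finalize();
    return 0;
}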