Can you post the full mpirun command, or at least the relevant MPI MCA
params?

" I'm still curious about your input on whether or not those mca parameters 
I mentioned yesterday are disabling GPUDirect RDMA as well?"
Even if you disable smcuda's CUDA IPC, it's possible you're still using CUDA
IPC via UCX, for example.
The MCA params you mentioned disable it for the smcuda BTL, but UCX doesn't
use smcuda as a transport, so they're irrelevant for the UCX PML.
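
Roughly, and just as a sketch (the exact knobs depend on the Open MPI and UCX
versions installed; ./your_app and the flags shown are placeholders, not a
recommended command line):

  # smcuda BTL path: turn off its CUDA IPC support
  mpirun --mca btl_smcuda_use_cuda_ipc 0 ./your_app

  # UCX PML path: cuda_ipc is a UCX transport, so it's excluded on the UCX side
  mpirun --mca pml ucx -x UCX_TLS=^cuda_ipc ./your_app
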
Do you know which PML you're using?
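
If you're not sure, something like this should show which PML gets picked
(again just a sketch; the rank count and ./your_app are placeholders):

  # ask the pml framework to print component selection info at startup
  mpirun -np 2 --mca pml_base_verbose 10 ./your_app

  # or force a specific PML and see whether the behavior changes
  mpirun -np 2 --mca pml ucx ./your_app
  mpirun -np 2 --mca pml ob1 ./your_app
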
-Tommy 


On Saturday, May 31, 2025 at 1:26:58 PM UTC-5 Mike Adams wrote:

> Interestingly, I made an error: Delta on 4.1.5 did fail like some of the 
> cases on Bridges-2 on 4.0.5, but only at 16 ranks per GPU, which across the 
> 4 GPUs matches the core count of the AMD processor on Delta.  So it looks 
> like Bridges-2 needs an OpenMPI upgrade.
>
> Tommy, I'm still curious about your input on whether or not those mca 
> parameters I mentioned yesterday are disabling GPUDirect RDMA as well?
>
> Thank you both for your help!
>
> Mike Adams
>
> On Friday, May 30, 2025 at 11:39:49 AM UTC-6 Mike Adams wrote:
>
>> Dmitry, 
>>
>> I'm not too familiar with the internals of OpenMPI, but I just tried 
>> 4.1.5 on NCSA Delta and received the same IPC errors (no MCA flags 
>> switched).  The calls didn't actually fail to perform the operation this 
>> time, though, so maybe that's an improvement from v4.0.x to v4.1.x?
>>
>> Thanks,
>>
>> Mike Adams
>>
>> On Friday, May 30, 2025 at 11:21:16 AM UTC-6 Dmitry N. Mikushin wrote:
>>
>>> There is a relevant explanation of the same issue reported for Julia: 
>>> https://github.com/JuliaGPU/CUDA.jl/issues/1053
>>>
>>> Fri, May 30, 2025 at 19:05, Mike Adams <mikeca...@gmail.com>:
>>>
>>>> Hi Tommy,
>>>>
>>>> I'm setting btl_smcuda_use_cuda_ipc_same_gpu 0 and 
>>>> btl_smcuda_use_cuda_ipc 0. 
>>>> So, are you saying that with these params, it is also not using 
>>>> GPUDirect RDMA?
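>>>>
>>>> For concreteness, I'm setting them on the mpirun command line, roughly 
>>>> like this (just a sketch - the rank count and ./my_app are placeholders, 
>>>> not my actual job):
>>>>
>>>>   mpirun -np 8 \
>>>>     --mca btl_smcuda_use_cuda_ipc 0 \
>>>>     --mca btl_smcuda_use_cuda_ipc_same_gpu 0 \
>>>>     ./my_app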
>>>>
>>>> PSC Bridges-2 only has v4 OpenMPI, but they may be working on 
>>>> installing v5 now.  Everything works on v5 on NCSA Delta - I'll try to 
>>>> test on an older OpenMPI.
>>>>
>>>> Mike Adams
>>>> On Friday, May 30, 2025 at 10:54:23 AM UTC-6 Tomislav Janjusic US wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm not sure if it's a known issue - possibly in v4.0; not sure about 
>>>>> v4.1 or v5.0 - can you try?
>>>>> As far as CUDA IPC goes - how are you disabling it? I don't remember 
>>>>> the MCA params in v4.0.
>>>>> If it's disabled either through the UCX PML or smcuda, then no, it 
>>>>> won't use it.
>>>>> -Tommy
>>>>>
>>>>>
>>>>> On Saturday, May 24, 2025 at 8:56:50 AM UTC-7 Mike Adams wrote:
>>>>>
>>>>>> Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2.  I'm 
>>>>>> calling collectives like MPI_Allreduce on buffers that have already been 
>>>>>> shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle.
>>>>>>
>>>>>> On these buffers, I receive the following message and some 
>>>>>> communication sizes fail:
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
>>>>>> cannot be used.
>>>>>>   cuIpcGetMemHandle return value:   1
>>>>>>   address: 0x147d54000068
>>>>>> Check the cuda.h file for what the return value means. Perhaps a 
>>>>>> reboot
>>>>>> of the node will clear the problem.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> If I pass in the two mca parameters to disable OpenMPI IPC, 
>>>>>> everything works.
>>>>>>
>>>>>> I'm wondering two things:
>>>>>> Is this failure to handle IPC buffers in OpenMPI 4 a known issue?
>>>>>> When I disable OpenMPI CUDA IPC with mca parameters, does OpenMPI 
>>>>>> still use GPUDirect RDMA?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Mike Adams
>>>>>>
