As a workaround for now, I have found that setting OMPI_MCA_pml=ucx seems to 
get around this issue. I'm not sure why this works, but perhaps the 
initialization happens differently, such that the offending device search 
problem doesn't occur?
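
For anyone who wants to try the same workaround, a minimal sketch follows. 
The variable name and value are the ones mentioned above; the ./xhpl path and 
the rank count are placeholders for illustration, not taken from the original 
report.

```shell
# Force the ucx pml for every rank through the environment.
export OMPI_MCA_pml=ucx

# Equivalent per-invocation form: mpirun --mca pml ucx ...
# ./xhpl and "-np 4" are hypothetical placeholders.
if command -v mpirun >/dev/null 2>&1 && [ -x ./xhpl ]; then
    mpirun -np 4 ./xhpl
else
    # Nothing to launch here, so just show what would be inherited.
    echo "OMPI_MCA_pml=${OMPI_MCA_pml}"
fi
```

Since the setting lives in the environment, it should also be picked up when 
the job is launched through a batch script, as long as the environment is 
passed on to the ranks.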


Thanks,

David


________________________________
From: Shrader, David Lee
Sent: Tuesday, November 2, 2021 2:09 PM
To: Open MPI Users
Cc: Michael Di Domenico
Subject: Re: [EXTERNAL] [OMPI users] strange pml error


I too have been getting this with 4.1.1, but not with the master nightly 
tarballs from mid-October. I still have it on my to-do list to open a GitHub 
issue. The problem seems to come from device detection in the ucx pml: on some 
ranks, it fails to find a device and thus the ucx pml disqualifies itself, 
which leaves only the ob1 pml on those ranks while the others still select ucx.
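
One way to look into the device detection, sketched under assumptions: raise 
the verbosity of the pml framework so that each rank reports which component 
it selected, and ask UCX directly which devices it sees on the node. The 
verbosity level of 10 is an arbitrary choice, and the grep pattern assumes the 
usual "Transport:" and "Device:" lines in the ucx_info output.

```shell
# Ask Open MPI to report pml component selection on each rank.
# The level 10 is an assumption; higher values print more detail.
export OMPI_MCA_pml_base_verbose=10

# List the transports and devices UCX can see on this node, if the
# ucx_info tool is installed.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | grep -E 'Transport|Device' || true
fi

echo "OMPI_MCA_pml_base_verbose=${OMPI_MCA_pml_base_verbose}"
```

Comparing the device list between a node where the run fails and one where it 
works may show whether the device is actually missing or only missed during 
startup.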


Thanks,

David


________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Michael Di Domenico 
via users <users@lists.open-mpi.org>
Sent: Tuesday, November 2, 2021 1:35 PM
To: Open MPI Users
Cc: Michael Di Domenico
Subject: [EXTERNAL] [OMPI users] strange pml error

fairly frequently, but not every time, when trying to run xhpl on a new
machine i'm bumping into this error.  it happens with a single node or
multiple nodes

node1 selected pml ob1, but peer on node1 selected pml ucx

if i rerun the exact same command a few minutes later, it works fine.
the machine is new and i'm the only one using it so there are no user
conflicts

the software stack is

slurm 21.8.2.1
ompi 4.1.1
pmix 3.2.3
ucx 1.9.0

the hardware is HPE w/ mellanox edr cards (but i doubt that matters)

any thoughts?
