Yo folks

I was working on the PMIx integration in support of connect/accept, and 
happened to take a closer look at the “disconnect” function we call during 
finalize. I then realized that when we made the grpcomm changes some time 
back, we broke this function for the use-case where two jobs started by 
different mpiruns connect.

I’m still scratching my head about how to correctly fix this problem. 
Obviously, this isn’t something people do very often in the real world - 
ompi-server is required, and it hasn’t been operational in quite some time. I 
checked and found exactly one query about it over the last several years. 
Likewise, I haven’t found any RM out there that supports it.

PMIx gives us the ability to provide that support, but it will take a little 
time to figure out how to provide the required backend “fence” in all use-cases:

* two jobs started by the same mpirun - supported today by ORTE

* two jobs started by different mpiruns - we used to support this, but it is 
now broken in grpcomm/barrier

* two direct-launched jobs  - never supported

* one direct-launched job and one started by mpirun  - never supported

Given the lack of use out there, I don’t see a reason to hold release of the 
2.x series over this issue. I will keep you posted on progress towards a 
resolution.

Ralph


