Yo folks

I was working on the PMIx integration in support of connect/accept, and happened to take a closer look at the “disconnect” function we call during finalize. I then realized that, when we made the grpcomm changes some time back, we broke this function for the use-case where two jobs started by different mpiruns connect to each other.

I’m still scratching my head about how to correctly fix this problem. Obviously, this isn’t something people do very often in the real world - ompi-server is required, and it hasn’t been operational in quite some time. I checked and found exactly one query about it over the last several years. Likewise, I haven’t found any RM out there that supports it either.

PMIx gives us the ability to provide that support, but it will take a little time to figure out how to provide the required backend “fence” in all use-cases:

* two jobs started by the same mpirun - supported today by ORTE

* two jobs started by different mpiruns - we used to support, but is broken in grpcomm/barrier

* two direct-launched jobs - never supported

* one direct-launched job and one started by mpirun - never supported

Given lack of use out there, I don’t see a reason to hold release of the 2.x series over this issue. Will keep you posted on progress towards a resolution.

Ralph