Just reread your suggestions in our off-list discussion and realized that I had misunderstood them. So no parallel PMI! All the shared code goes into opal/mca/common/pmi.

To clarify further, which is the preferred way:
1. to create one combined PMI module with switches that decide which functionality to use, or
2. to have two separate common modules, one for PMI1 and one for PMI2, and does that fit the opal/mca/common/ ideology at all?
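To make option 1 concrete, here is a rough sketch of what I have in mind. The wrapper and type names are made up for illustration, not actual OMPI code; only the PMI calls themselves are the standard ones from SLURM's pmi.h/pmi2.h:

#include <pmi.h>
#include <pmi2.h>

/* Illustrative sketch: one common module where the PMI version is
 * chosen once at init time, and everything else calls through a
 * table of function pointers. */
typedef struct {
    int (*init)(int *rank, int *size);
    int (*fini)(void);
} common_pmi_ops_t;

static int pmi1_init(int *rank, int *size)
{
    int spawned;
    if (PMI_Init(&spawned) != PMI_SUCCESS) return -1;
    if (PMI_Get_rank(rank) != PMI_SUCCESS) return -1;
    if (PMI_Get_size(size) != PMI_SUCCESS) return -1;
    return 0;
}

static int pmi1_fini(void) { return PMI_Finalize(); }

static int pmi2_init(int *rank, int *size)
{
    int spawned, appnum;
    if (PMI2_Init(&spawned, size, rank, &appnum) != PMI2_SUCCESS)
        return -1;
    return 0;
}

static int pmi2_fini(void) { return PMI2_Finalize(); }

static common_pmi_ops_t pmi_ops;

/* The "switch": set once, e.g. from an MCA parameter. */
void common_pmi_select(int use_pmi2)
{
    pmi_ops.init = use_pmi2 ? pmi2_init : pmi1_init;
    pmi_ops.fini = use_pmi2 ? pmi2_fini : pmi1_fini;
}

With option 2 the same two implementations would instead live in two separate directories under opal/mca/common/, and the choice would move out into the components that use them.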
2014-05-08 6:44 GMT+07:00 Artem Polyakov <artpo...@gmail.com>:

> 2014-05-08 5:54 GMT+07:00 Ralph Castain <r...@open-mpi.org>:
>
>> Ummm... no, I don't think that's right. I believe we decided to instead
>> create the separate components, default to PMI-2 if available, print a
>> nice error message if not, and otherwise use PMI-1.
>>
>> I don't want to initialize both PMIs in parallel, as most installations
>> won't support it.
>
> Ok, I agree. Besides the lack of support, there can be a performance hit
> caused by PMI1 initialization at scale. This is not the case for SLURM's
> PMI1, since it is quite simple and local, but I didn't consider other
> implementations.
>
> On May 7, 2014, at 3:49 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>
>> We discussed Joshua's concerns with Ralph and decided to first try
>> automatic PMI2 correctness detection, as was initially intended. Here is
>> my idea. The universal way to decide whether PMI2 is correct is to
>> compare the rank and size reported by PMI_Init and PMI2_Init; they
>> should be equal. If they are, we proceed with PMI2 and finalize PMI1;
>> otherwise we finalize PMI2 and proceed with PMI1.
>> I need to clarify with the SLURM guys whether parallel initialization of
>> both PMIs is legal. If not, we'll do it sequentially.
>> Everywhere else we'll just use a flag saying which PMI version to use.
>> Does that sound reasonable?
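>>
>> In code the check might look roughly like this (a sketch only, with
>> error paths elided; it assumes both PMIs may be initialized at once,
>> and if that turns out to be illegal the same comparison would be done
>> sequentially):
>>
>> #include <pmi.h>
>> #include <pmi2.h>
>>
>> /* Try both PMIs, compare what they report, and finalize the one we
>>  * will not use.  Returns 1 to proceed with PMI2, 0 to use PMI1. */
>> static int pmi2_looks_correct(void)
>> {
>>     int spawned, appnum;
>>     int rank1 = -1, size1 = -1, rank2 = -2, size2 = -2;
>>
>>     int pmi1_ok = (PMI_Init(&spawned) == PMI_SUCCESS &&
>>                    PMI_Get_rank(&rank1) == PMI_SUCCESS &&
>>                    PMI_Get_size(&size1) == PMI_SUCCESS);
>>     int pmi2_ok = (PMI2_Init(&spawned, &size2, &rank2, &appnum)
>>                    == PMI2_SUCCESS);
>>
>>     if (pmi1_ok && pmi2_ok && rank1 == rank2 && size1 == size2) {
>>         PMI_Finalize();   /* PMI2 agrees with PMI1: proceed with PMI2 */
>>         return 1;
>>     }
>>     if (pmi2_ok) {
>>         PMI2_Finalize();  /* PMI2 broken or inconsistent: use PMI1 */
>>     }
>>     return 0;
>> }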
>>
>> 2014-05-07 23:10 GMT+07:00 Artem Polyakov <artpo...@gmail.com>:
>>
>>> That's a good point. There is actually a bunch of modules in ompi, opal,
>>> and orte that would have to be duplicated.
>>>
>>> On Wednesday, May 7, 2014, Joshua Ladd wrote:
>>>
>>>> +1 Sounds like a good idea - but decoupling the two and adding all the
>>>> right selection mojo might be a bit of a pain. There are several places
>>>> in OMPI where the distinction between PMI1 and PMI2 is made, not only
>>>> in grpcomm: the DB and ESS frameworks, off the top of my head.
>>>>
>>>> Josh
>>>>
>>>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov <artpo...@gmail.com>
>>>> wrote:
>>>>
>>>>> Good idea :)!
>>>>>
>>>>> On Wednesday, May 7, 2014, Ralph Castain wrote:
>>>>>
>>>>> Jeff actually had a useful suggestion (gasp!). He proposed that we
>>>>> separate the PMI-1 and PMI-2 code into separate components so you
>>>>> could select them at runtime. Thus, we would build both (assuming both
>>>>> PMI-1 and PMI-2 libs are found), default to PMI-1, but users could
>>>>> select to try PMI-2. If the PMI-2 component failed, we would emit a
>>>>> show_help indicating that they probably have a broken PMI-2 version
>>>>> and should try PMI-1.
>>>>>
>>>>> Make sense?
>>>>> Ralph
>>>>>
>>>>> On May 7, 2014, at 8:00 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> On May 7, 2014, at 7:56 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>
>>>>> Ah, I see. Sorry for the reactionary comment - but this feature falls
>>>>> squarely within my "jurisdiction", and we've invested a lot in
>>>>> improving OMPI jobstart under srun.
>>>>>
>>>>> That being said (now that I've taken some deep breaths and carefully
>>>>> read your original email :)), what you're proposing isn't a bad idea.
>>>>> I think it would be good to add a "--with-pmi2" flag to configure,
>>>>> since "--with-pmi" automagically uses PMI2 if it finds the header and
>>>>> lib. This way, we could experiment with PMI1/PMI2 without having to
>>>>> rebuild SLURM or hack the installation.
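>>>>>
>>>>> For example (hypothetical invocations, since the proposed flag does
>>>>> not exist yet, and the SLURM install path is made up):
>>>>>
>>>>>   # proposed: pick the PMI version explicitly at configure time
>>>>>   ./configure --with-pmi=/opt/slurm --with-pmi2
>>>>>
>>>>>   # today: --with-pmi alone automagically prefers PMI2 whenever the
>>>>>   # pmi2.h header and library are found
>>>>>   ./configure --with-pmi=/opt/slurm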
>>>>>
>>>>> That would be a much simpler solution than what Artem proposed
>>>>> (off-list), where we would try PMI2 and then, if it didn't work, try
>>>>> to figure out how to fall back to PMI1. I'll add this for now, and if
>>>>> Artem wants to try his more automagic solution and can make it work,
>>>>> then we can reconsider that option.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> Josh
>>>>>
>>>>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <r...@open-mpi.org>
>>>>> wrote:
>>>>>
>>>>> Okay, then we'll just have to develop a workaround for all those Slurm
>>>>> releases where PMI-2 is borked :-(
>>>>>
>>>>> FWIW: I think people misunderstood my statement. I specifically did
>>>>> *not* propose to *lose* PMI-2 support. I suggested that we change it
>>>>> to "on-by-request" instead of the current "on-by-default" so we
>>>>> wouldn't keep getting asked about PMI-2 bugs in Slurm. Once the Slurm
>>>>> implementation stabilized, then we could reverse that policy.
>>>>>
>>>>> However, given that both you and Chris appear to prefer to keep it
>>>>> "on-by-default", we'll see if we can find a way to detect that PMI-2
>>>>> is broken and then fall back to PMI-1.
>>>>>
>>>>> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>
>>>>> Just saw this thread, and I second Chris's observations: at scale we
>>>>> are seeing huge gains in jobstart performance with PMI2 over PMI1. We
>>>>> *CANNOT* lose this functionality. For competitive reasons, I cannot
>>>>> provide exact numbers, but let's say the difference is in the ballpark
>>>>> of a full order of magnitude at 20K ranks versus PMI1. PMI1 is
>>>>> completely unacceptable/unusable at scale. Certainly PMI2 still has
>>>>> scaling issues, but there is no contest between PMI1 and PMI2. We
>>>>> (MLNX) are actively working to resolve some of the scalability issues
>>>>> in PMI2.
>>>>>
>>>>> Josh
>>>>>
>>>>> Joshua S. Ladd
>>>>> Staff Engineer, HPC Software
>>>>> Mellanox Technologies
>>>>>
>>>>> Email: josh...@mellanox.com
>>>>>
>>>>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <r...@open-mpi.org>
>>>>> wrote:
>>>>>
>>>>> Interesting - how many nodes were involved? As I said, the bad scaling
>>>>> becomes more evident at a fairly high node count.
>>>>>
>>>>> On May 7, 2014, at 12:07 AM, Christopher Samuel
>>>>> <sam...@unimelb.edu.au> wrote:
>>>>>
>>>>> > Hiya Ralph,
>>>>> >
>>>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>>>> >
>>>>> >> I should have looked closer to see the numbers you posted, Chris -
>>>>> >> those include time for MPI wireup. So what you are seeing is that
>>>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>>>> >> than PMI. I suspect that PMI2 is not much better, as the primary
>>>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>>>> >> requires that everything b

--
Best regards, Artem Y. Polyakov