Ah, I see. Sorry for the reactionary comment - but this feature falls squarely within my "jurisdiction", and we've invested a lot in improving OMPI jobstart under srun.

That being said (now that I've taken some deep breaths and carefully read your original email :)), what you're proposing isn't a bad idea. I think it would be worth adding a "--with-pmi2" flag to configure, since "--with-pmi" automagically uses PMI2 if it finds the header and lib. This way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or hack the installation.
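To make that concrete, the usage I have in mind is something like this - a sketch only, since the "--with-pmi2" argument (and its yes/no values) is exactly the thing that doesn't exist yet:

    # today: --with-pmi quietly selects PMI2 whenever pmi2.h and its
    # library are found under the given installation
    ./configure --with-pmi=/opt/slurm

    # proposed: pick the PMI version explicitly, no SLURM rebuild needed
    ./configure --with-pmi=/opt/slurm --with-pmi2=no    # force PMI1
    ./configure --with-pmi=/opt/slurm --with-pmi2=yes   # require PMI2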
Josh

On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, then we'll just have to develop a workaround for all those Slurm
> releases where PMI-2 is borked :-(
>
> FWIW: I think people misunderstood my statement. I specifically did *not*
> propose to *lose* PMI-2 support. I suggested that we change it to
> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
> stabilized, then we could reverse that policy.
>
> However, given that both you and Chris appear to prefer to keep it
> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
> broken and then fall back to PMI-1.
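Catching the common case at init time should be straightforward. Here's a minimal sketch of the detect-and-fall-back idea, assuming a build that links both of Slurm's PMI client libraries (not actual OMPI code, and it obviously won't catch a PMI2 that initializes cleanly but misbehaves later):

    #include <pmi2.h>   /* Slurm's PMI2 client API */
    #include <pmi.h>    /* PMI1, kept as the fallback */

    /* Returns the job size on success, -1 if no PMI is usable. */
    static int pmi_bootstrap(int *rank)
    {
        int spawned, size, appnum;

        /* Try PMI2 first - it is the faster wire-up when it works. */
        if (PMI2_Init(&spawned, &size, rank, &appnum) == PMI2_SUCCESS)
            return size;

        /* PMI2 refused to start: fall back to PMI1. */
        if (PMI_Init(&spawned) == PMI_SUCCESS &&
            PMI_Get_size(&size) == PMI_SUCCESS &&
            PMI_Get_rank(rank) == PMI_SUCCESS)
            return size;

        return -1;
    }

For the harder case - PMI2 that initializes but misbehaves later - we'd probably still want a runtime knob to force the PMI1 path.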
> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
> Just saw this thread, and I second Chris' observations: at scale we are
> seeing huge gains in jobstart performance with PMI2 over PMI1. We *CANNOT*
> lose this functionality. For competitive reasons, I cannot provide exact
> numbers, but let's say the difference is in the ballpark of a full
> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
> but there is no contest between PMI1 and PMI2. We (MLNX) are actively
> working to resolve some of the scalability issues in PMI2.
>
> Josh
>
> Joshua S. Ladd
> Staff Engineer, HPC Software
> Mellanox Technologies
>
> Email: josh...@mellanox.com
>
> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Interesting - how many nodes were involved? As I said, the bad scaling
>> becomes more evident at a fairly high node count.
>>
>> On May 7, 2014, at 12:07 AM, Christopher Samuel <sam...@unimelb.edu.au>
>> wrote:
>>
>> > Hiya Ralph,
>> >
>> > On 07/05/14 14:49, Ralph Castain wrote:
>> >
>> >> I should have looked closer to see the numbers you posted, Chris -
>> >> those include time for MPI wireup. So what you are seeing is that
>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>> >> than PMI. I suspect that PMI2 is not much better, as the primary
>> >> reason for the difference is that mpirun sends blobs, while PMI
>> >> requires that everything be encoded into strings and sent in little
>> >> pieces.
>> >>
>> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>> >> operation) much faster, and MPI_Init completes faster. The rest of
>> >> the computation should be the same, so long compute apps will see
>> >> the difference narrow considerably.
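That string-encoding tax is visible right in the PMI1 KVS interface: a binary endpoint blob has to be hex-encoded into bounded key=value pieces before peers can fetch it. Roughly like this - a sketch, not OMPI's actual modex code, and real code would query PMI_KVS_Get_value_length_max() instead of hard-coding buffer sizes:

    #include <stdio.h>
    #include <pmi.h>

    /* Publish a binary endpoint blob through the PMI1 KVS: every byte
     * gets string-encoded (hex here) and split into value-sized pieces. */
    static int publish_endpoint(int rank, const unsigned char *blob, int len)
    {
        char kvs[256], key[64], val[256];
        int i, j, n, piece = 0;
        int bytes_per_piece = ((int)sizeof(val) - 1) / 2;  /* 2 hex chars/byte */

        if (PMI_KVS_Get_my_name(kvs, (int)sizeof(kvs)) != PMI_SUCCESS)
            return -1;

        for (i = 0; i < len; i += bytes_per_piece, piece++) {
            n = (len - i < bytes_per_piece) ? (len - i) : bytes_per_piece;
            for (j = 0; j < n; j++)
                sprintf(val + 2 * j, "%02x", blob[i + j]);
            snprintf(key, sizeof(key), "ep-%d-%d", rank, piece);
            if (PMI_KVS_Put(kvs, key, val) != PMI_SUCCESS)
                return -1;
        }
        /* Peers later PMI_KVS_Get() and decode the pieces - many little
         * string exchanges where mpirun ships a single blob. */
        return PMI_KVS_Commit(kvs) == PMI_SUCCESS ? 0 : -1;
    }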
>> > Unfortunately it looks like I had an enthusiastic cleanup at some point
>> > and so I cannot find the out files from those runs at the moment, but
>> > I did find some comparisons from around that time.
>> >
>> > This first pair compares NAMD runs with OMPI 1.7.3a1r29103, launched
>> > with mpirun and then with srun from inside the same Slurm job:
>> >
>> > mpirun namd2 macpf.conf
>> > srun --mpi=pmi2 namd2 macpf.conf
>> >
>> > Firstly the mpirun output (grep'ing the interesting bits):
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB memory
>> > WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB
>> >
>> > Now the srun output:
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB memory
>> > WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB
>> >
>> > The next two pairs were launched first with mpirun from 1.6.x and then
>> > with srun from 1.7.3a1r29103. Again, each pair ran inside the same
>> > Slurm job with the same inputs.
>> >
>> > First pair mpirun:
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
>> > Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
>> > Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
>> > Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
>> > Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
>> > WallClock: 8341.524414 CPUTime: 8341.524414 Memory: 975.015625 MB
>> >
>> > First pair srun:
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
>> > Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
>> > Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
>> > Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
>> > Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory
>> > WallClock: 7476.643555 CPUTime: 7476.643555 Memory: 968.867188 MB
>> >
>> > Second pair mpirun:
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
>> > Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
>> > Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
>> > Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
>> > Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory
>> > WallClock: 7842.831543 CPUTime: 7842.831543 Memory: 1004.050781 MB
>> >
>> > Second pair srun:
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
>> > Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
>> > Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
>> > Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
>> > Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
>> > WallClock: 7522.677246 CPUTime: 7522.677246 Memory: 969.433594 MB
>> >
>> > So to me it looks like (for NAMD on our system at least)
>> > PMI2 does give better scalability.
>> >
>> > All the best!
>> > Chris
>> > --
>> > Christopher Samuel        Senior Systems Administrator
>> > VLSCI - Victorian Life Sciences Computation Initiative
>> > Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
>> > http://www.vlsci.org.au/  http://twitter.com/vlsci