Jeff actually had a useful suggestion (gasp!). He proposed that we separate the PMI-1 and PMI-2 code into separate components so you could select between them at runtime. Thus, we would build both (assuming both the PMI-1 and PMI-2 libs are found) and default to PMI-1, but users could opt to try PMI-2. If the PMI-2 component failed, we would emit a show_help message indicating that they probably have a broken PMI-2 version and should try PMI-1.

Make sense?
Ralph
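A minimal sketch of how that selection-with-fallback could behave (purely illustrative: these function names are stand-ins, not Open MPI's actual component interfaces):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stub init hooks standing in for the two PMI components; a real
     * build would pull these from the respective component libraries. */
    static bool pmi2_component_init(void) { return false; /* pretend broken */ }
    static bool pmi1_component_init(void) { return true; }

    /* Runtime selection: PMI-1 by default, PMI-2 only on request, with
     * a help message and fallback if PMI-2 turns out to be broken. */
    static bool pmi_select(bool user_requested_pmi2)
    {
        if (user_requested_pmi2) {
            if (pmi2_component_init()) {
                return true;                 /* PMI-2 works; use it */
            }
            /* stand-in for the show_help message described above */
            fprintf(stderr, "PMI-2 init failed -- your PMI-2 library may "
                            "be broken; falling back to PMI-1.\n");
        }
        return pmi1_component_init();        /* default path */
    }

    int main(void)
    {
        return pmi_select(true) ? 0 : 1;     /* user opted in to PMI-2 */
    }

The point of the component split is exactly this: the PMI-2 path becomes an opt-in experiment, and a broken Slurm PMI-2 degrades to a help message rather than a failed job.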
On May 7, 2014, at 8:00 AM, Ralph Castain <r...@open-mpi.org> wrote:

> On May 7, 2014, at 7:56 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
>> Ah, I see. Sorry for the reactionary comment - but this feature falls
>> squarely within my "jurisdiction", and we've invested a lot in
>> improving OMPI jobstart under srun.
>>
>> That being said (now that I've taken some deep breaths and carefully
>> read your original email :)), what you're proposing isn't a bad idea.
>> I think it would be good to add a "--with-pmi2" flag to configure,
>> since "--with-pmi" automagically uses PMI2 if it finds the header and
>> lib. This way, we could experiment with PMI1/PMI2 without having to
>> rebuild SLURM or hack the installation.
>
> That would be a much simpler solution than what Artem proposed
> (off-list), where we would try PMI2 and then, if it didn't work, try
> to figure out how to fall back to PMI1. I'll add this for now, and if
> Artem wants to try his more automagic solution and can make it work,
> then we can reconsider that option.
>
> Thanks
> Ralph
>
>> Josh
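For illustration, the gating such a flag would add might look like this at the preprocessor level (OMPI_WANT_PMI2 is a made-up name for the new opt-in symbol, and HAVE_PMI2_H stands for configure's existing header check; neither is the actual OMPI symbol):

    /* With plain --with-pmi, today's behavior is effectively "use PMI2
     * whenever the PMI2 header and lib are found".  A separate
     * --with-pmi2 flag would make that an explicit opt-in instead: */
    #if defined(OMPI_WANT_PMI2) && defined(HAVE_PMI2_H)
    #    define WIREUP_USE_PMI2 1   /* user asked for PMI2 and it is present */
    #else
    #    define WIREUP_USE_PMI2 0   /* default: PMI-1 */
    #endif

Either way the build works against an unmodified SLURM install; the flag only decides which wireup path gets compiled in.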
>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Okay, then we'll just have to develop a workaround for all those
>> Slurm releases where PMI-2 is borked :-(
>>
>> FWIW: I think people misunderstood my statement. I specifically did
>> *not* propose to *lose* PMI-2 support. I suggested that we change it
>> to "on-by-request" instead of the current "on-by-default" so we
>> wouldn't keep getting asked about PMI-2 bugs in Slurm. Once the Slurm
>> implementation stabilized, we could reverse that policy.
>>
>> However, given that both you and Chris appear to prefer to keep it
>> "on-by-default", we'll see if we can find a way to detect that PMI-2
>> is broken and then fall back to PMI-1.
>>
>> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>>> Just saw this thread, and I second Chris' observations: at scale we
>>> are seeing huge gains in jobstart performance with PMI2 over PMI1.
>>> We CANNOT lose this functionality. For competitive reasons, I cannot
>>> provide exact numbers, but let's say the difference is in the
>>> ballpark of a full order of magnitude on 20K ranks versus PMI1. PMI1
>>> is completely unacceptable/unusable at scale. Certainly PMI2 still
>>> has scaling issues, but there is no contest between PMI1 and PMI2.
>>> We (MLNX) are actively working to resolve some of the scalability
>>> issues in PMI2.
>>>
>>> Josh
>>>
>>> Joshua S. Ladd
>>> Staff Engineer, HPC Software
>>> Mellanox Technologies
>>> Email: josh...@mellanox.com
>>>
>>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Interesting - how many nodes were involved? As I said, the bad
>>> scaling becomes more evident at a fairly high node count.
>>>
>>> On May 7, 2014, at 12:07 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
>>>
>>> > Hiya Ralph,
>>> >
>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>> >
>>> >> I should have looked closer to see the numbers you posted, Chris -
>>> >> those include time for MPI wireup. So what you are seeing is that
>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>> >> than PMI. I suspect that PMI2 is not much better, as the primary
>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>> >> requires that everything be encoded into strings and sent in
>>> >> little pieces. Hence, mpirun can exchange the endpoint info (the
>>> >> dreaded "modex" operation) much faster, and MPI_Init completes
>>> >> faster. The rest of the computation should be the same, so
>>> >> long-running compute apps will see the difference narrow
>>> >> considerably.
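To make the "blobs vs. strings" point concrete, here is a rough sketch of what publishing one rank's endpoint info through a string-only key-value store involves (everything here is illustrative: kvs_put stands in for the PMI key-value put call, and the 256-character value limit is made up):

    #include <stdio.h>
    #include <string.h>

    #define PMI_VAL_MAX 256                  /* illustrative KVS value limit */

    /* Stand-in for the PMI key-value put call. */
    static void kvs_put(const char *key, const char *value)
    {
        printf("put %-8s %zu chars\n", key, strlen(value));
    }

    /* Publish a binary endpoint blob through a string-only KVS: encode
     * it as hex, then split the result into pieces small enough for the
     * value limit.  mpirun can instead ship the raw blob in one message. */
    static void publish_endpoint(int rank, const unsigned char *blob, size_t len)
    {
        char hex[2 * len + 1];               /* 2 printable chars per byte */
        for (size_t i = 0; i < len; i++)
            sprintf(&hex[2 * i], "%02x", (unsigned)blob[i]);

        size_t nchunks = (2 * len + PMI_VAL_MAX - 1) / PMI_VAL_MAX;
        for (size_t c = 0; c < nchunks; c++) {
            char key[32], val[PMI_VAL_MAX + 1];
            snprintf(key, sizeof key, "ep-%d-%zu", rank, c);
            snprintf(val, sizeof val, "%.*s", PMI_VAL_MAX, &hex[c * PMI_VAL_MAX]);
            kvs_put(key, val);               /* one little piece at a time */
        }
    }

    int main(void)
    {
        unsigned char blob[600];             /* pretend endpoint info */
        memset(blob, 0xab, sizeof blob);
        publish_endpoint(0, blob, sizeof blob);
        return 0;
    }

Every rank pays this encode-and-chunk cost on the put side and the matching decode cost on the get side, while mpirun aggregates and forwards the raw blobs; that is the MPI_Init-time gap reflected in the numbers below.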
>>> > Unfortunately it looks like I had an enthusiastic cleanup at some
>>> > point and so I cannot find the output files from those runs at the
>>> > moment, but I did find some comparisons from around that time.
>>> >
>>> > This first pair compares running NAMD with OMPI 1.7.3a1r29103,
>>> > launched with mpirun and then with srun, successively, from inside
>>> > the same Slurm job:
>>> >
>>> > mpirun namd2 macpf.conf
>>> > srun --mpi=pmi2 namd2 macpf.conf
>>> >
>>> > Firstly the mpirun output (grep'ing the interesting bits):
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB memory
>>> > WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB
>>> >
>>> > Now the srun output:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB memory
>>> > WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB
>>> >
>>> > The next two pairs were first launched using mpirun from 1.6.x and
>>> > then with srun from 1.7.3a1r29103. Again, each pair ran inside the
>>> > same Slurm job with the same inputs.
>>> >
>>> > First pair mpirun:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
>>> > WallClock: 8341.524414 CPUTime: 8341.524414 Memory: 975.015625 MB
>>> >
>>> > First pair srun:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory
>>> > WallClock: 7476.643555 CPUTime: 7476.643555 Memory: 968.867188 MB
>>> >
>>> > Second pair mpirun:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory
>>> > WallClock: 7842.831543 CPUTime: 7842.831543 Memory: 1004.050781 MB
>>> >
>>> > Second pair srun:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
>>> > WallClock: 7522.677246 CPUTime: 7522.677246 Memory: 969.433594 MB
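Pulling the WallClock totals out of the logs above into one place (percentages computed from those numbers; note that the two 64-CPU mpirun runs used 1.6.x while their srun counterparts used 1.7.3a1r29103):

    Run               mpirun (s)    srun --mpi=pmi2 (s)    srun faster by
    512 CPUs           1403.39        1230.78                ~12.3%
    64 CPUs, pair 1    8341.52        7476.64                ~10.4%
    64 CPUs, pair 2    7842.83        7522.68                ~4.1%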
>>> >
>>> > So to me it looks like (for NAMD on our system at least) PMI2 does
>>> > give better scalability.
>>> >
>>> > All the best!
>>> > Chris
>>> >
>>> > --
>>> > Christopher Samuel    Senior Systems Administrator
>>> > VLSCI - Victorian Life Sciences Computation Initiative
>>> > Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
>>> > http://www.vlsci.org.au/    http://twitter.com/vlsci