2014-05-08 9:54 GMT+07:00 Ralph Castain <r...@open-mpi.org>:

>
> On May 7, 2014, at 6:15 PM, Christopher Samuel <sam...@unimelb.edu.au>
> wrote:
>
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Hi all,
> >
> > Apologies for having dropped out of the thread, night intervened here. ;-)
> >
> > On 08/05/14 00:45, Ralph Castain wrote:
> >
> >> Okay, then we'll just have to develop a workaround for all those
> >> Slurm releases where PMI-2 is borked :-(
> >
> > Do you know what these releases are? Are we talking 2.6.x or 14.03?
> > The 14.03 series has had a fair few rapid point releases and doesn't
> > appear to be anywhere as near as stable as 2.6 was when it came out. :-(
>
> Yeah :-(
>
> I think there was one 2.6.x that was borked, and definitely problems in
> the 14.03.x line. Can't pinpoint it for you, though.
>

The bug I experienced with abnormal OMPI termination persists from 2.6.3
through the latest Slurm release. It may appear in earlier releases too; I
didn't check. However, the SLURM guys haven't actually confirmed that it's a
bug. Things should become clearer in about 2 weeks, when the person who
maintains the code reviews the patch. But I am pretty sure that's a bug.
Refer to this thread:
http://thread.gmane.org/gmane.comp.distributed.slurm.devel/5213

>
> >
> >> FWIW: I think people misunderstood my statement. I specifically
> >> did *not* propose to *lose* PMI-2 support. I suggested that we
> >> change it to "on-by-request" instead of the current "on-by-default"
> >> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
> >> the Slurm implementation stabilized, then we could reverse that
> >> policy.
> >>
> >> However, given that both you and Chris appear to prefer to keep it
> >> "on-by-default", we'll see if we can find a way to detect that
> >> PMI-2 is broken and then fall back to PMI-1.
> >
> > My intention was to provide the data that led us to want PMI2, but if
> > configure had an option to enable PMI2 by default so that only those
> > who requested it got it then I'd be more than happy - we'd just add it
> > to our script to build it.
>
> Sounds good. I'm going to have to dig deeper into those numbers, though,
> as they don't entirely add up to me. Once the job gets launched, the launch
> method itself should have no bearing on computational speed - IF all things
> are equal. In other words, if the process layout is the same, and the
> binding pattern is the same, then computational speed should be roughly
> equivalent regardless of how the procs were started.
>
> My guess is that your data might indicate a difference in the layout
> and/or binding pattern as opposed to PMI2 vs mpirun. At the scale you
> mention later in the thread (only 70 nodes x 16 ppn), the difference in
> launch timing would be zilch.
> So I'm betting you would find (upon further
> exploration) that (a) you might not have been binding processes when
> launching by mpirun, since we didn't bind by default until the 1.8 series,
> but were binding under direct srun launch, and (b) your process mapping
> would quite likely be different as we default to byslot mapping, and I
> believe srun defaults to bynode?
>
> Might be worth another comparison run when someone has time.
>
>
> >
> > All the best!
> > Chris
> > - --
> > Christopher Samuel Senior Systems Administrator
> > VLSCI - Victorian Life Sciences Computation Initiative
> > Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> > http://www.vlsci.org.au/ http://twitter.com/vlsci
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.14 (GNU/Linux)
> > Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
> >
> > iEYEARECAAYFAlNq2poACgkQO2KABBYQAh+7DwCfeahirvoQ9Wom4VNhJIIdufeP
> > 7uIAnAruTnXZBn6HXhuMAlzzSsoKkXlt
> > =OvH4
> > -----END PGP SIGNATURE-----
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/05/14733.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14738.php
>

--
Best regards, Artem Y. Polyakov
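
P.S. To make the byslot vs. bynode point in the quoted message concrete, here is a minimal sketch of how the two mapping patterns place ranks on nodes. The function names (map_by_slot, map_by_node) are hypothetical and written only for illustration; this is not Open MPI or Slurm code, and it makes no claim about which launcher actually defaults to which pattern.

```python
def map_by_slot(num_ranks, num_nodes, slots_per_node):
    """Fill every slot on a node before moving on to the next node."""
    if num_ranks > num_nodes * slots_per_node:
        raise ValueError("more ranks than available slots")
    # Rank r lands on node r // slots_per_node.
    return [rank // slots_per_node for rank in range(num_ranks)]


def map_by_node(num_ranks, num_nodes, slots_per_node):
    """Place consecutive ranks on consecutive nodes, round-robin."""
    if num_ranks > num_nodes * slots_per_node:
        raise ValueError("more ranks than available slots")
    # Rank r lands on node r % num_nodes.
    return [rank % num_nodes for rank in range(num_ranks)]


if __name__ == "__main__":
    # Illustrative small case: 2 nodes with 4 slots each, 8 ranks.
    print("byslot:", map_by_slot(8, 2, 4))
    print("bynode:", map_by_node(8, 2, 4))
```

With the same rank count, neighbouring ranks share a node under the first pattern and sit on different nodes under the second, so a communication-heavy code could run at a different speed even when the launch method is otherwise irrelevant.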