[OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-07-23 Thread Christopher Samuel

Hi there slurm-dev and OMPI devel lists,

While bringing up a new IBM Sandy Bridge cluster I've been running a NAMD
test case and noticed that if I run it with srun rather than mpirun it goes
over 20% slower.  These runs are all launched from an sbatch script.

Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB.

Here are some timings, as reported by NAMD itself as its WallClock time
(so not including startup/teardown overhead from Slurm).

srun:

run1/slurm-93744.out:WallClock: 695.079773  CPUTime: 695.079773
run4/slurm-94011.out:WallClock: 723.907959  CPUTime: 723.907959
run5/slurm-94013.out:WallClock: 726.156799  CPUTime: 726.156799
run6/slurm-94017.out:WallClock: 724.828918  CPUTime: 724.828918

Average of about 717 seconds.

mpirun:

run2/slurm-93746.out:WallClock: 559.311035  CPUTime: 559.311035
run3/slurm-93910.out:WallClock: 544.116333  CPUTime: 544.116333
run7/slurm-94019.out:WallClock: 586.072693  CPUTime: 586.072693

Average of 563 seconds.

So srun is roughly 27% slower.

Everything is identical (they're all symlinks to the same golden
master) *except* for the srun / mpirun invocation, which is changed by
copying the batch script and substituting mpirun for srun.

While they are running I can see that jobs launched with srun are
direct children of slurmstepd, whereas jobs started with mpirun are
children of Open-MPI's orted (or of mpirun itself on the launch node),
which is in turn a child of slurmstepd.
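
Roughly, the two process trees look like this (an illustrative sketch,
not actual ps/pstree output):

  srun:    slurmstepd -> namd2            (one per rank)
  mpirun:  slurmstepd -> orted -> namd2   (with mpirun itself standing in
           for orted on the launch node)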

Has anyone else seen anything like this, or got any ideas?

cheers,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au     Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/         http://twitter.com/vlsci



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-07-23 Thread Joshua Ladd
Hi, Chris

Funny you should mention this now. We identified and diagnosed the issue some 
time ago as a combination of SLURM's PMI1 implementation and some of, what I'll 
call, OMPI's topology requirements (probably not the right word). Here's what 
is happening, in a nutshell, when you launch with srun:

1. Each process pushes its endpoint data up to the PMI "cloud" via PMI put (I
think it's about five or six puts per process; bottom line, O(1)).
2. Each process then executes a PMI commit and a PMI barrier to ensure all other
processes have finished committing their data to the "cloud".
3. Subsequent to this, each process executes O(N) PMI gets (N being the number
of procs in the job) in order to fetch the endpoint data for every other
process, regardless of whether or not it ever communicates with that endpoint
(see the sketch below).
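
To make that concrete, here is a minimal sketch of the exchange written
directly against the PMI-1 API (illustrative only: this is not OMPI's actual
grpcomm/db code, and the key names and payloads are invented):

#include <stdio.h>
#include <stdlib.h>
#include <pmi.h>                       /* Slurm's PMI-1 header */

/* Sketch of the put / commit+barrier / O(N)-get pattern described above.
 * Error checking omitted; key names and values are hypothetical. */
int main(void)
{
    int spawned, rank, size, nlen, klen, vlen;

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_name_length_max(&nlen);
    PMI_KVS_Get_key_length_max(&klen);
    PMI_KVS_Get_value_length_max(&vlen);

    char *kvs = malloc(nlen), *key = malloc(klen), *val = malloc(vlen);
    PMI_KVS_Get_my_name(kvs, nlen);

    /* Step 1: O(1) puts of this rank's endpoint data. */
    snprintf(key, klen, "endpoint.%d", rank);
    snprintf(val, vlen, "lid=...;qpn=...");        /* made-up payload */
    PMI_KVS_Put(kvs, key, val);

    /* Step 2: commit, then barrier, so everyone's data is published. */
    PMI_KVS_Commit(kvs);
    PMI_Barrier();

    /* Step 3: O(N) gets -- every rank pulls every other rank's endpoint
     * data, whether or not it will ever talk to that peer. */
    for (int peer = 0; peer < size; peer++) {
        snprintf(key, klen, "endpoint.%d", peer);
        PMI_KVS_Get(kvs, key, val, vlen);
    }

    PMI_Finalize();
    free(kvs); free(key); free(val);
    return 0;
}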

"We" (MLNX et al.) undertook an in-depth scaling study of this and identified 
several poorly scaling pieces with the worst offenders being:

1. The PMI barrier scales worse than linearly.
2. At scale, the PMI get phase starts to look quadratic (a back-of-the-envelope
illustration follows).
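
As a rough, purely illustrative calculation (these numbers are assumed, not
taken from the study): with N = 1024 ranks each publishing ~6 keys and then
reading every other rank's keys, the job issues on the order of
1024 x 1024 x 6 ~= 6.3 million KVS gets, and the total volume of endpoint
data handed out grows as O(N^2) even though each rank only put O(1) entries.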

The proposed solution that "we" (OMPI + SLURM) have come up with is to modify 
OMPI to support PMI2 and to use SLURM 2.6 which has support for PMI2 and is 
(allegedly) much more scalable than PMI1. Several folks in the combined 
communities are working hard, as we speak, trying to get this functional to see 
if it indeed makes a difference. Stay tuned, Chris. Hopefully we will have some 
data by the end of the week.  
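
For comparison, here is an equally minimal sketch of the same exchange over
the PMI-2 interface (again illustrative only: key names and payloads are
invented, error handling omitted). The practical difference is that a single
collective PMI2_KVS_Fence() replaces the separate commit and barrier, which
at least gives the implementation a natural point to aggregate and distribute
the KVS data more scalably:

#include <stdio.h>
#include <pmi2.h>                      /* Slurm's PMI-2 header */

/* PMI-2 version of the sketch above; not OMPI's actual code. */
int main(void)
{
    int spawned, size, rank, appnum, vallen;
    char key[PMI2_MAX_KEYLEN], val[PMI2_MAX_VALLEN];

    PMI2_Init(&spawned, &size, &rank, &appnum);

    /* Publish this rank's (hypothetical) endpoint data. */
    snprintf(key, sizeof(key), "endpoint.%d", rank);
    PMI2_KVS_Put(key, "lid=...;qpn=...");

    /* One collective fence instead of separate commit + barrier. */
    PMI2_KVS_Fence();

    /* Gets can still be issued per peer; how cheap each one is depends on
     * what the PMI-2 implementation did behind the fence. */
    for (int peer = 0; peer < size; peer++) {
        snprintf(key, sizeof(key), "endpoint.%d", peer);
        PMI2_KVS_Get(NULL /* this job */, PMI2_ID_NULL, key,
                     val, sizeof(val), &vallen);
    }

    PMI2_Finalize();
    return 0;
}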

Best regards,

Josh


Joshua S. Ladd, PhD
HPC Algorithms Engineer
Mellanox Technologies 

Email: josh...@mellanox.com
Cell: +1 (865) 258 - 8898








Re: [OMPI devel] basename: a faulty warning 'extra operand --test-name' in tests causes test-driver to fail

2013-07-23 Thread Jeff Squyres (jsquyres)
Sorry for the delay in replying...

Great!

In run_tests, does changing

progname="`basename $*`"

to

progname="`basename $1`"

fix the problem for you?  (With $*, any extra arguments such as --test-name
get passed to basename as additional operands, which it rejects; $1 hands it
only the test program's path.)


On Jul 14, 2013, at 3:51 AM, Vasiliy  wrote:

> I'm happy to provide you with an update on 'extra operand --test-name'
> occasionally being fed to 'basename' by Open MPI's testsuite, which
> was fixed by Automake maintainers:
> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=14840
> 
> You may still want to look at 'test/asm/run_tests' to see why it was passed through.
> 
> On Fri, Jul 12, 2013 at 9:30 PM, Vasiliy  wrote:
>> I've just gone through the test suite, and in 'test/asm/run_tests' there
>> is a statement:
>> 
>> progname="`basename $*`"
>> 
>> where '--test-name' could accidentally get in, causing the reported
>> issues, since 'basename' does not have such an option. Somebody
>> familiar with the test suite may want to look into it.
>> 
>> On Fri, Jul 12, 2013 at 5:17 PM, Vasiliy  wrote:
>>> Sorry again, my report was a stub because I didn't have enough time to
>>> investigate the issue. Because the verbose level was set to zero, I had
>>> assumed from the log that 'basename' belongs to the Open MPI source,
>>> whereas it does not. Thank you for pointing out that it's actually a
>>> utility from the 'coreutils' Cygwin package. I'll report it to their team.
>>> I've also filed a report with the Automake team about their part.
>>> 
>>> 1. I'm testing the Open MPI SVN patched source, that is, 1.9a1-svn
>>> with the latest autotools assembled from their git/svn sources, and my
>>> humble patches, yet have to be polished.
>>> 
>>> 2. Indeed, I'm running 'make check' when seeing those failures.
>>> Unfortunately, the failure with 'test-driver' obscures how many (or how
>>> few), if any, real tests have failed. I've just run it now on
>>> the latest sources (by the way, there's still some old rot with 'trace.c')
>>> and, once I manage to get 'test-driver' working, it passes *ALL* the
>>> tests except those with the bogus 'test-driver' crashes, namely:
>>> 
>>> atomic_spinlock_noinline.exe
>>> atomic_cmpset_noinline.exe
>>> atomic_math_noinline.exe
>>> atomic_spinlock_noinline.exe
>>> atomic_cmpset_noinline.exe
>>> atomic_spinlock_noinline.exe
>>> atomic_math_noinline.exe
>>> atomic_cmpset_noinline.exe
>>> atomic_spinlock_noinline.exe
>>> atomic_math_noinline.exe
>>> atomic_spinlock_noinline.exe
>>> atomic_cmpset_noinline.exe
>>> atomic_math_noinline.exe
>>> 
>>> Clearly, they're inline/noinline issues that need to be looked into
>>> at some later time.
>>> 
>>> I can now give some feedback on why I got the earlier-reported warnings
>>> about shared libraries not being created, and the flood of
>>> 'undefined symbols'. Indeed, that was a problem with the Makefile.am files.
>>> I've checked just two out of the roughly hundred successfully
>>> compiled static libraries whose DSO counterparts weren't created during
>>> the build, even though they were requested:
>>> 
>>> - 'ompi/datatype's Makefile compiles 'libdatatype' without the much-needed
>>> 'libopen-pal' and 'libmpi' libraries, which causes the shared
>>> library not to be created because of undefined symbols; by the way, even
>>> if they are added to the libtool (v2.4.2-374) invocation command line the
>>> shared libraries are still not produced, while gcc doesn't have this kind
>>> of problem;
>>> 
>>> - 'ompi/debuggers's Makefile does not create a 'libompi_dbg_msgq.dll.a'
>>> import library (though there is a shared library); the corresponding
>>> part has to be created manually;
>>> 
>>> I haven't checked the other ~95.
>>> 
>>> 
>>> 
>>> On Fri, Jul 12, 2013 at 2:26 PM, Jeff Squyres (jsquyres)
>>>  wrote:
 I'm sorry, I'm still unclear what you're trying to tell us.  :-(
 
 1. What version of Open MPI are you testing?  If you're testing Open MPI 
 1.6.x with very new Automake, I'm not surprised that there are some 
 failures.  We usually pick the newest GNU Autotools when we begin a 
 release series, and then stick with those tool versions for the life of 
 that series.  We do not attempt to forward-port to newer Autotools on that 
 series, meaning that sometimes newer versions of the Autotools will break 
 the builds of that series.  That's ok.
 
 2. Assumedly, you're seeing this failure when you run "make check".  Is 
 that correct?  What test, exactly, is failing?  It's very difficult to 
 grok what you're reporting when you only include the last few lines of 
 output, which exclude the majority of the context that we need to know 
 what you're talking about.
 
 Your bug reports have been *extremely* helpful in cleaning out some old 
 kruft from our tree, but could you include more context in the future?  
 E.g., include all the "compile problems" items from here:
 
http://www.open-mpi.org/community/help/
 
 3. We don't have a test named "basename" or "test-driver"; basename is 
 usually an OS utility, and test-driver is part of the new Automake testing 
 framework.

Re: [OMPI devel] 'make re-install' : remove 'ortecc' symlink also

2013-07-23 Thread Jeff Squyres (jsquyres)
Hmm, I think we do, but it looks like we might have done it wrong for OSs that 
have an $(EXEEXT), namely Windows.  Can you test this trunk patch and see if it 
fixes the issue?



On Jul 14, 2013, at 5:35 PM, Vasiliy  wrote:

> Makefile: please remove/check for the 'ortecc' symlink before proceeding
> with install
> 
> make[4]: Entering directory
> '/usr/src/64bit/release/openmpi/openmpi-1.9.0-a1/build/orte/tools/wrappers'
> test -z "/usr/bin" || /usr/bin/mkdir -p "/usr/bin"
> make  install-data-hook
> (cd /usr/bin; rm -f ortecc.exe; ln -s opal_wrapper ortecc)
> ln: failed to create symbolic link `ortecc': File exists
> make[4]: Entering directory
> '/usr/src/64bit/release/openmpi/openmpi-1.9.0-a1/build/orte/tools/wrappers'
> make[4]: Nothing to be done for 'install-data-hook'.
> make[4]: Leaving directory
> '/usr/src/64bit/release/openmpi/openmpi-1.9.0-a1/build/orte/tools/wrappers'
> Makefile:1668: recipe for target 'install-exec-hook-always' failed
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


wrappers.diff
Description: wrappers.diff


[OMPI devel] OpenSHMEM up on bitbucket

2013-07-23 Thread Joshua Ladd
Dear OMPI Developers,

I have put Mellanox OpenSHMEM up for review on my Bitbucket. Please "git" and 
test at your leisure. Questions, comments, and critiques are most welcome.

git clone 
https://jladd_m...@bitbucket.org/jladd_math/mlnx-oshmem.git

To build with OSHMEM support, build as you would OMPI but simply include 
'--with-oshmem' on your configure line. This will get you started.

Best regards,

Josh



Joshua S. Ladd, PhD
HPC Algorithms Engineer
Mellanox Technologies

Email: josh...@mellanox.com
Cell: +1 (865) 258 - 8898




Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-07-23 Thread Christopher Samuel

On 23/07/13 19:34, Joshua Ladd wrote:

> Hi, Chris

Hi Joshua,

I've quoted you in full as I don't think your message made it through
to the slurm-dev list (at least I've not received it from there yet).

> Funny you should mention this now. We identified and diagnosed the 
> issue some time ago as a combination of SLURM's PMI1
> implementation and some of, what I'll call, OMPI's topology
> requirements (probably not the right word.) Here's what is
> happening, in a nutshell, when you launch with srun:
> 
> 1. Each process pushes his endpoint data up to the PMI "cloud" via
> PMI put (I think it's about five or six puts, bottom line, O(1).)
> 2. Then executes a PMI commit and PMI barrier to ensure all other
> processes have finished committing their data to the "cloud".
> 3. Subsequent to this, each process executes O(N) (N is the number of
> procs in the job) PMI gets in order to get all of the endpoint
> data for every process regardless of whether or not the process
> communicates with that endpoint.
> 
> "We" (MLNX et al.) undertook an in-depth scaling study of this and
> identified several poorly scaling pieces with the worst offenders
> being:
> 
> 1. PMI Barrier scales worse than linear.
> 2. At scale, the PMI get phase starts to look quadratic.
> 
> The proposed solution that "we" (OMPI + SLURM) have come up with is
> to modify OMPI to support PMI2 and to use SLURM 2.6 which has
> support for PMI2 and is (allegedly) much more scalable than PMI1.
> Several folks in the combined communities are working hard, as we
> speak, trying to get this functional to see if it indeed makes a
> difference. Stay tuned, Chris. Hopefully we will have some data by
> the end of the week.

Wonderful, great to know that what we're seeing is actually real and
not just pilot error on our part!  We're happy enough to tell users to
keep on using mpirun, as they'll be used to it from our other Intel
systems, and to only use srun where a code requires it (one or two
commercial apps that use Intel MPI).

Can I ask: if the PMI2 ideas work out, is that likely to get backported
to OMPI 1.6.x?

All the best,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au     Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/         http://twitter.com/vlsci



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-07-23 Thread Ralph Castain
Not to the 1.6 series, but it is in the about-to-be-released 1.7.3 and will be
there from that point onwards. Still waiting to see whether it resolves the
difference.


On Jul 23, 2013, at 4:28 PM, Christopher Samuel  wrote:

> Can I ask: if the PMI2 ideas work out, is that likely to get backported
> to OMPI 1.6.x?