[OMPI devel] Memory accounting issues with mpirun (was Re: [slurm-dev] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)

2013-08-07 Thread Christopher Samuel

On 23/07/13 17:06, Christopher Samuel wrote:

> Bringing up a new IBM SandyBridge cluster I'm running a NAMD test 
> case and noticed that if I run it with srun rather than mpirun it 
> goes over 20% slower.

Following on from this issue, we've found that whilst mpirun gives
acceptable performance, the memory accounting doesn't appear to be correct.

Anyone seen anything similar, or any ideas on what could be going on?

Here are two identical NAMD jobs running over 69 nodes using 16 cores
per node, this one launched with mpirun (Open-MPI 1.6.5):


==> slurm-94491.out <==
WallClock: 101.176193  CPUTime: 101.176193  Memory: 1268.554688 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94491 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94491
94491.batch    6504068K  11167820K
94491.0        5952048K   9028060K
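
For context, a minimal sketch of the kind of batch script behind the
mpirun case (script and input file names here are hypothetical; only
the launch line differs between the two jobs):

#!/bin/bash
#SBATCH --nodes=69
#SBATCH --ntasks-per-node=16

# mpirun picks the allocation up from the SLURM environment and
# starts one orted per node; the namd2 ranks run as children of
# those orteds rather than of slurmstepd.
mpirun namd2 benchmark.namd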


This one launched with srun (about 60% slower):

==> slurm-94505.out <==
WallClock: 163.314163  CPUTime: 163.314163  Memory: 1253.511719 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94505 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94505
94505.batch       7248K   1582692K
94505.0        1022744K   1307112K
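
The srun job is identical apart from the launch line (again a sketch;
with Open-MPI 1.6.5 this needs a build configured with PMI support):

# slurmstepd starts all 69 x 16 = 1104 namd2 processes directly, so
# every MPI rank is an accounted task in its own right.
srun namd2 benchmark.namd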



cheers!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] Memory accounting issues with mpirun (was Re: [slurm-dev] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)

2013-08-07 Thread Christopher Samuel

On 07/08/13 16:18, Christopher Samuel wrote:

> Anyone seen anything similar, or any ideas on what could be going
> on?

Apologies, forgot to mention that Slurm is set up with:

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

We are testing with cgroups now.
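
For anyone wanting to compare, a sketch of what we are trying (our
reading of the relevant knobs, not a recommendation):

# slurm.conf - gather accounting from cgroups rather than polling
# /proc on each compute node
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
ProctrackType=proctrack/cgroup

# cgroup.conf
CgroupAutomount=yes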




Re: [OMPI devel] [slurm-dev] Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)

2013-08-07 Thread Christopher Samuel

On 07/08/13 16:19, Christopher Samuel wrote:

> Anyone seen anything similar, or any ideas on what could be going
> on?

Sorry, this was with:

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

Since those initial tests we've started enforcing memory limits (the
system is not yet in full production) and found that this causes jobs
to get killed.

We tried the cgroups gathering method, but jobs still die with mpirun,
and now the numbers don't seem to be right for either mpirun or srun:

mpirun (killed):

[samuel@barcoo-test Mem]$ sacct -j 94564 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94564
94564.batch    -523362K          0
94564.0         394525K          0

srun:

[samuel@barcoo-test Mem]$ sacct -j 94565 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94565
94565.batch        998K          0
94565.0          88663K          0
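
As an aside, one way to confirm the kills come from the limit
enforcement rather than from the application itself would be to pull
the step states out of accounting, e.g. (a sketch; State and ExitCode
are standard sacct fields):

sacct -j 94564 -o JobID,State,ExitCode,MaxRSS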


All the best,
Chris



Re: [OMPI devel] [slurm-dev] Re: Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)

2013-08-07 Thread Christopher Samuel

On 07/08/13 16:59, Janne Blomqvist wrote:

> That is, the memory accounting is per task, and when launching
> using mpirun the number of tasks does not correspond to the number
> of MPI processes, but rather to the number of "orted" processes (1
> per node).

That appears to be correct: with mpirun I see 1 task for the batch
step and 68 tasks for orted, whilst with srun I see 1 task for the
batch step and 1104 tasks for namd2.
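
For anyone wanting to check the same split, the per-step task counts
come straight out of accounting; NTasks is a standard sacct field
(job ID from the earlier mpirun run):

# Under mpirun the step's NTasks is the orted count (one per node);
# under the equivalent srun job it is the number of MPI ranks.
sacct -j 94491 -o JobID,JobName%12,NTasks,MaxRSS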

I can understand how that might result in Slurm (wrongly) thinking
that a single task is using more than its allowed memory per task,
but I'm not sure how that could lead to Slurm thinking the job is
using vastly more memory than it actually is.


cheers,
Chris



[OMPI devel] Reminder: scheduled maintenance in about 40 mins

2013-08-07 Thread Jeff Squyres (jsquyres)
Friendly reminder that our hosting provider has a multiple-hour scheduled 
maintenance window starting in about 40 mins for www.open-mpi.org. The web site 
and all mailing lists will be down.

SVN should still be up. 

Sent from my phone. No type good. 


[OMPI devel] Migration is complete

2013-08-07 Thread Jeff Squyres (jsquyres)
Let me know if you find any problems with the web site and/or the mailing lists.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/