[OMPI devel] Memory accounting issues with mpirun (was Re: [slurm-dev] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 23/07/13 17:06, Christopher Samuel wrote:
> Bringing up a new IBM SandyBridge cluster I'm running a NAMD test
> case and noticed that if I run it with srun rather than mpirun it
> goes over 20% slower.

Following on from this issue, we've found that whilst mpirun gives acceptable performance, the memory accounting doesn't appear to be correct. Anyone seen anything similar, or any ideas on what could be going on?

Here are two identical NAMD jobs running over 69 nodes with 16 cores per node, this one launched with mpirun (Open-MPI 1.6.5):

==> slurm-94491.out <==
WallClock: 101.176193  CPUTime: 101.176193  Memory: 1268.554688 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94491 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94491
94491.batch    6504068K  11167820K
94491.0        5952048K   9028060K

This one launched with srun (about 60% slower):

==> slurm-94505.out <==
WallClock: 163.314163  CPUTime: 163.314163  Memory: 1253.511719 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94505 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94505
94505.batch       7248K   1582692K
94505.0        1022744K   1307112K

cheers!
Chris

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/    http://twitter.com/vlsci
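For reference, the two launch paths being compared look roughly like this; this is only a sketch, and the NAMD input file name and output redirection are illustrative, not copied from the real job script:

#!/bin/bash
#SBATCH --nodes=69
#SBATCH --ntasks-per-node=16

# Launch path 1: Open MPI's mpirun starts the ranks itself
# (one orted daemon per remote node inside the allocation).
mpirun namd2 stmv.namd > namd-mpirun.log

# Launch path 2: Slurm starts every MPI rank directly.
srun namd2 stmv.namd > namd-srun.log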
Re: [OMPI devel] Memory accounting issues with mpirun (was Re: [slurm-dev] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 07/08/13 16:18, Christopher Samuel wrote:
> Anyone seen anything similar, or any ideas on what could be going
> on?

Apologies, forgot to mention that Slurm is set up with:

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

We are testing with cgroups now.

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/    http://twitter.com/vlsci
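For the cgroup test, the relevant slurm.conf changes look roughly like this; a sketch only, and the exact plugin combination we settle on may differ by Slurm version:

# ACCOUNTING (cgroup-based gathering instead of jobacct_gather/linux)
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
# The cgroup gatherer is normally paired with cgroup process tracking
# and the cgroup task plugin:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup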
Re: [OMPI devel] [slurm-dev] Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 07/08/13 16:19, Christopher Samuel wrote:
> Anyone seen anything similar, or any ideas on what could be going
> on?

Sorry, this was with:

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

Since those initial tests we've started enforcing memory limits (the system is not yet in full production) and found that this causes jobs to get killed. We tried the cgroup gathering method, but jobs still die with mpirun, and now the numbers don't seem to be right for either mpirun or srun:

mpirun (killed):

[samuel@barcoo-test Mem]$ sacct -j 94564 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94564
94564.batch    -523362K          0
94564.0         394525K          0

srun:

[samuel@barcoo-test Mem]$ sacct -j 94565 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94565
94565.batch        998K          0
94565.0          88663K          0

All the best,
Chris

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/    http://twitter.com/vlsci
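By "enforcing memory limits" I mean cgroup-style constraints along these lines; the values below are illustrative only and not our production cgroup.conf:

# cgroup.conf -- sketch of cgroup-based memory enforcement
CgroupAutomount=yes
ConstrainRAMSpace=yes       # constrain/kill steps that exceed their requested memory
ConstrainSwapSpace=yes
AllowedRAMSpace=100         # percent of the requested memory the step may use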
Re: [OMPI devel] [slurm-dev] Re: Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 07/08/13 16:59, Janne Blomqvist wrote:
> That is, the memory accounting is per task, and when launching
> using mpirun the number of tasks does not correspond to the number
> of MPI processes, but rather to the number of "orted" processes (1
> per node).

That appears to be correct: I see 1 task in the batch step and 68 "orted" tasks when I use mpirun, whilst I see 1 task in the batch step and 1104 "namd2" tasks when I use srun.

I can understand how that might result in Slurm (wrongly) thinking that a single task is using more than its allowed memory per task, but I'm not sure I understand how that could lead to Slurm thinking the job is using vastly more memory than it actually is.

cheers,
Chris

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/    http://twitter.com/vlsci
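One way to see the per-step task counts behind those numbers (the job ID below is a placeholder):

[samuel@barcoo-test Mem]$ sacct -j <jobid> -o JobID,NTasks,MaxRSS,MaxVMSize

# With srun:   69 nodes x 16 cores/node = 1104 namd2 tasks, so the
#              per-task accounting sees each MPI rank individually.
# With mpirun: Slurm only sees the 68 orted daemons (one per remote
#              node), so presumably everything each orted spawns gets
#              rolled up into that single task's MaxRSS/MaxVMSize,
#              which would explain the much larger per-task figures.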
[OMPI devel] Reminder: scheduled maintenance in about 40 mins
Friendly reminder that our hosting provider has a multi-hour scheduled maintenance window starting in about 40 mins for www.open-MPI.org. The web site and all mailing lists will be down. SVN should still be up.

Sent from my phone. No type good.
[OMPI devel] Migration is complete
Let me know if you find any problems with the web site and/or the mailing lists.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/