Re: [OMPI devel] Open-MPI backwards compatibility and library version changes
Hi Brian,

On 17/11/18 5:13 am, Barrett, Brian via devel wrote:
> Unfortunately, I don't have a good idea of what to do now. We already
> did the damage on the 3.x series. Our backwards compatibility testing
> (as lame as it is) just links libmpi, so it's all good. But if anyone
> uses libtool, we'll have a problem, because we install the .la files
> that allow libtool to see the dependency of libmpi on libopen-pal, and
> it gets too excited.
>
> We'll need to talk about how we think about this change in the future.

Thanks for that - personally I think it's a misfeature in libtool to add
these extra dependencies, and it would be handy if there were a way to
turn it off - but that's not your problem.

For us it just means that when we bring in a new Open-MPI we need to build
new versions of our installed libraries and codes against it; fortunately
that's something that EasyBuild makes (relatively) easy.

Thanks for your time everyone - this is my last week at Swinburne before
I leave Australia to start at NERSC in December!

All the best,
Chris

-- Christopher Samuel

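[Editor's note: for anyone bitten by the same thing, one possible workaround
- a sketch only, not an Open-MPI project recommendation - is to prune or
remove the installed .la files so libtool never sees the internal
dependencies. The prefix below is a placeholder modelled on the install
paths later in this thread:]

    # Hypothetical Open-MPI install prefix - adjust to your site.
    OMPI_LIB=/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib

    # Option 1: remove the .la files entirely; libtool then links
    # against the .so directly and records only libmpi as NEEDED.
    rm -f $OMPI_LIB/*.la

    # Option 2: keep the .la files but empty their recorded dependency
    # lists, which is what libtool reads to "get too excited".
    sed -i "s/^dependency_libs=.*/dependency_libs=''/" $OMPI_LIB/*.la
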
Re: [OMPI devel] Open-MPI backwards compatibility and library version changes
On 15/11/18 12:10 pm, Christopher Samuel wrote:
> I wonder if it's because they use libtool instead?

Yup, it's libtool - using it to compile my toy example shows the same
behaviour, with "readelf -d" showing the private libraries pulled in
directly. :-(

[csamuel@farnarkle2 libtool]$ cat hhgttg.c
int answer(void)
{
    return(42);
}

[csamuel@farnarkle2 libtool]$ libtool compile gcc hhgttg.c -c -o hhgttg.o
libtool: compile: gcc hhgttg.c -c -fPIC -DPIC -o .libs/hhgttg.o
libtool: compile: gcc hhgttg.c -c -o hhgttg.o >/dev/null 2>&1

[csamuel@farnarkle2 libtool]$ libtool link gcc -o libhhgttg.la hhgttg.lo -lmpi -rpath /usr/local/lib
libtool: link: gcc -shared -fPIC -DPIC .libs/hhgttg.o -Wl,-rpath -Wl,/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib -Wl,-rpath -Wl,/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libmpi.so -L/apps/skylake/software/core/gcccore/6.4.0/lib64 -L/apps/skylake/software/core/gcccore/6.4.0/lib -L/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-rte.so /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-pal.so -ldl -lrt -lutil -lm -lpthread -lz -lhwloc -Wl,-soname -Wl,libhhgttg.so.0 -o .libs/libhhgttg.so.0.0.0
libtool: link: (cd ".libs" && rm -f "libhhgttg.so.0" && ln -s "libhhgttg.so.0.0.0" "libhhgttg.so.0")
libtool: link: (cd ".libs" && rm -f "libhhgttg.so" && ln -s "libhhgttg.so.0.0.0" "libhhgttg.so")
libtool: link: ar cru .libs/libhhgttg.a hhgttg.o
libtool: link: ranlib .libs/libhhgttg.a
libtool: link: ( cd ".libs" && rm -f "libhhgttg.la" && ln -s "../libhhgttg.la" "libhhgttg.la" )

[csamuel@farnarkle2 libtool]$ readelf -d .libs/libhhgttg.so.0 | fgrep -i lib
 0x0001 (NEEDED)  Shared library: [libmpi.so.40]
 0x0001 (NEEDED)  Shared library: [libopen-rte.so.40]
 0x0001 (NEEDED)  Shared library: [libopen-pal.so.40]
 0x0001 (NEEDED)  Shared library: [libdl.so.2]
 0x0001 (NEEDED)  Shared library: [librt.so.1]
 0x0001 (NEEDED)  Shared library: [libutil.so.1]
 0x0001 (NEEDED)  Shared library: [libm.so.6]
 0x0001 (NEEDED)  Shared library: [libpthread.so.0]
 0x0001 (NEEDED)  Shared library: [libz.so.1]
 0x0001 (NEEDED)  Shared library: [libhwloc.so.5]
 0x0001 (NEEDED)  Shared library: [libc.so.6]
 0x000e (SONAME)  Library soname: [libhhgttg.so.0]
 0x001d (RUNPATH) Library runpath: [/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib]

All the best,
Chris

-- Christopher Samuel

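[Editor's note: if rebuilding against a fixed setup isn't practical, one
possible after-the-fact fix - a sketch, assuming the patchelf tool is
installed; the sonames come from the readelf output above - is to strip
the over-linked NEEDED entries from the finished library:]

    # Remove the private Open-MPI sonames recorded by the libtool link.
    patchelf --remove-needed libopen-rte.so.40 .libs/libhhgttg.so.0.0.0
    patchelf --remove-needed libopen-pal.so.40 .libs/libhhgttg.so.0.0.0

    # Verify that only the public libmpi (plus system libs) remains.
    readelf -d .libs/libhhgttg.so.0.0.0 | fgrep NEEDED
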
Re: [OMPI devel] Open-MPI backwards compatibility and library version changes
On 15/11/18 11:45 am, Christopher Samuel wrote:
> Unfortunately that's not the case, just creating a shared library
> that only links in libmpi.so will create dependencies on the private
> libraries too in the final shared library. :-(

Hmm, I might be misinterpreting the output of "ldd" - it looks like it
reports the dependencies of dependencies, not just the direct
dependencies. "readelf -d" seems more reliable.

[csamuel@farnarkle2 libtool]$ readelf -d libhhgttg.so.1 | fgrep -i lib
 0x0001 (NEEDED)  Shared library: [libmpi.so.40]
 0x0001 (NEEDED)  Shared library: [libc.so.6]
 0x000e (SONAME)  Library soname: [libhhgttg.so.1]

Whereas the HDF5 libraries really do have them listed as a dependency:

[csamuel@farnarkle2 1.10.1]$ readelf -d ./lib/libhdf5_fortran.so.100 | fgrep -i lib
 0x0001 (NEEDED)  Shared library: [libhdf5.so.101]
 0x0001 (NEEDED)  Shared library: [libsz.so.2]
 0x0001 (NEEDED)  Shared library: [libmpi_usempif08.so.40]
 0x0001 (NEEDED)  Shared library: [libmpi_usempi_ignore_tkr.so.40]
 0x0001 (NEEDED)  Shared library: [libmpi_mpifh.so.40]
 0x0001 (NEEDED)  Shared library: [libmpi.so.40]
 0x0001 (NEEDED)  Shared library: [libopen-rte.so.40]
 0x0001 (NEEDED)  Shared library: [libopen-pal.so.40]
 0x0001 (NEEDED)  Shared library: [libdl.so.2]
 0x0001 (NEEDED)  Shared library: [librt.so.1]
 0x0001 (NEEDED)  Shared library: [libutil.so.1]
 0x0001 (NEEDED)  Shared library: [libpthread.so.0]
 0x0001 (NEEDED)  Shared library: [libz.so.1]
 0x0001 (NEEDED)  Shared library: [libhwloc.so.5]
 0x0001 (NEEDED)  Shared library: [libgfortran.so.3]
 0x0001 (NEEDED)  Shared library: [libm.so.6]
 0x0001 (NEEDED)  Shared library: [libquadmath.so.0]
 0x0001 (NEEDED)  Shared library: [libc.so.6]
 0x0001 (NEEDED)  Shared library: [libgcc_s.so.1]
 0x000e (SONAME)  Library soname: [libhdf5_fortran.so.100]
 0x001d (RUNPATH) Library runpath: [/apps/skylake/software/mpi/gcc/6.4.0/openmpi/3.0.0/hdf5/1.10.1/lib:/apps/skylake/software/core/szip/2.1.1/lib:/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib:/apps/skylake/software/core/gcccore/6.4.0/lib/../lib64]

I wonder if it's because they use libtool instead?

All the best,
Chris

-- Christopher Samuel

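[Editor's note: the distinction here is that ldd resolves the whole
dependency tree - it effectively loads the library - while readelf -d lists
only the DT_NEEDED entries recorded in that one ELF file. A quick sketch of
the two views side by side, using the toy library from earlier in the
thread:]

    readelf -d libhhgttg.so.1 | grep NEEDED   # direct dependencies only
    ldd libhhgttg.so.1                        # transitive closure, as the loader sees it
    objdump -p libhhgttg.so.1 | grep NEEDED   # same direct view, via binutils objdump
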
Re: [OMPI devel] Open-MPI backwards compatibility and library version changes
On 15/11/18 2:16 am, Barrett, Brian via devel wrote:
> In practice, this should not be a problem. The wrapper compilers (and
> our instructions for linking when not using the wrapper compilers)
> only link against libmpi.so (or a set of libraries if using Fortran),
> as libmpi.so contains the public interface. libmpi.so has a
> dependency on libopen-pal.so so the loader will load the version of
> libopen-pal.so that matches the version of Open MPI used to build
> libmpi.so. However, if someone explicitly links against libopen-pal.so
> you end up where we are today.

Unfortunately that's not the case, just creating a shared library that
only links in libmpi.so will create dependencies on the private libraries
too in the final shared library. :-(

Here's a toy example to illustrate that.

[csamuel@farnarkle2 libtool]$ cat hhgttg.c
int answer(void)
{
    return(42);
}

[csamuel@farnarkle2 libtool]$ gcc hhgttg.c -c -o hhgttg.o
[csamuel@farnarkle2 libtool]$ gcc -shared -Wl,-soname,libhhgttg.so.1 -o libhhgttg.so.1 hhgttg.o -lmpi
[csamuel@farnarkle2 libtool]$ ldd libhhgttg.so.1
    linux-vdso.so.1 => (0x7ffc625b3000)
    libmpi.so.40 => /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libmpi.so.40 (0x7f018a582000)
    libc.so.6 => /lib64/libc.so.6 (0x7f018a09e000)
    libopen-rte.so.40 => /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-rte.so.40 (0x7f018a4b5000)
    libopen-pal.so.40 => /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-pal.so.40 (0x7f0189fde000)
    libdl.so.2 => /lib64/libdl.so.2 (0x7f0189dda000)
    librt.so.1 => /lib64/librt.so.1 (0x7f0189bd2000)
    libutil.so.1 => /lib64/libutil.so.1 (0x7f01899cf000)
    libm.so.6 => /lib64/libm.so.6 (0x7f01896cd000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x7f01894b1000)
    libz.so.1 => /lib64/libz.so.1 (0x7f018929b000)
    libhwloc.so.5 => /lib64/libhwloc.so.5 (0x7f018905e000)
    /lib64/ld-linux-x86-64.so.2 (0x7f018a46b000)
    libnuma.so.1 => /lib64/libnuma.so.1 (0x7f0188e52000)
    libltdl.so.7 => /lib64/libltdl.so.7 (0x7f0188c48000)
    libgcc_s.so.1 => /apps/skylake/software/core/gcccore/6.4.0/lib64/libgcc_s.so.1 (0x7f018a499000)

All the best,
Chris

-- Christopher Samuel

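[Editor's note: where the private libraries really do end up on the link
command line - as happens with libtool later in this thread - the classic
counter is the linker's --as-needed flag, which records a NEEDED entry only
for libraries that actually resolve a symbol. A sketch under that
assumption; MPI_LIB is a placeholder for the Open-MPI lib directory:]

    # Same toy link, but with the internal libraries explicitly listed
    # (as libtool does) and --as-needed in force:
    gcc -shared -Wl,-soname,libhhgttg.so.1 -Wl,--as-needed \
        -o libhhgttg.so.1 hhgttg.o -L$MPI_LIB -lmpi -lopen-rte -lopen-pal

    # NEEDED entries now appear only for libraries that resolved a symbol
    # from hhgttg.o (for this toy object, which calls no MPI functions,
    # that could even drop libmpi itself).
    readelf -d libhhgttg.so.1 | fgrep NEEDED
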
[OMPI devel] Open-MPI backwards compatibility and library version changes
Hi folks,

Just resub'd after a long time to ask a question about binary/backwards
compatibility.

We got bitten when upgrading from 3.0.0 to 3.0.3, which we assumed would be
binary compatible, and so (after some testing to confirm it was) we
replaced our existing 3.0.0 install with the 3.0.3 one. (Because we're
using hierarchical namespaces in Lmod this meant we avoided needing to
recompile everything we'd already built over the last 12 months with
3.0.0.)

However, once we'd done that we heard from a user that their code would no
longer run because it couldn't find libopen-pal.so.40 - instead 3.0.3 had
libopen-pal.so.42. Initially we thought this was some odd build system
problem, but on digging further we realised that they were linking against
libraries that were in turn built against OpenMPI (HDF5) and that those
had embedded the libopen-pal.so.40 name. Of course our testing hadn't
found that, because we weren't linking against anything like those for our
MPI tests. :-(

But I was really surprised to see these version numbers changing - I
thought the idea was to keep things backwardly compatible within these
series?

Fortunately our reason for doing the forced upgrade (we found our 3.0.0
didn't work with our upgrade to Slurm 18.08.3) turned out to be one
combination missed from our testing whilst fault-finding, and having got
that going we've been able to drop back to the original 3.0.0, which fixed
it for them.

But is this something that you folks have come across before?

All the best,
Chris

-- Christopher Samuel

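[Editor's note: a quick way to catch this class of problem before an
upgrade is to scan everything built against the MPI install for the
private sonames it recorded. A sketch; the /apps path is an assumption
modelled on the prefixes later in this thread:]

    # List every shared library in an install tree that has an Open-MPI
    # internal library recorded as a direct (DT_NEEDED) dependency.
    find /apps -name '*.so*' -type f 2>/dev/null | while read -r lib; do
        readelf -d "$lib" 2>/dev/null | grep -q 'libopen-pal\.so' && echo "$lib"
    done
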
Re: [OMPI devel] subcommunicator OpenMPI issues on K
On 08/11/17 12:30, Kawashima, Takahiro wrote:
> As other people said, Fujitsu MPI used in K is based on old
> Open MPI (v1.6.3 with bug fixes).

I guess the obvious question is: will vanilla Open-MPI work on K?

-- Christopher Samuel

Re: [OMPI devel] Open-MPI killing nodes with mlx5 drivers?
On 30/10/17 14:07, Christopher Samuel wrote:
> We have an issue where codes compiled with Open-MPI kill nodes with
> ConnectX-4 and ConnectX-5 cards connected to Mellanox Ethernet switches
> using the mlx5 driver from the latest Mellanox OFED

For the record: this crash is fixed in Mellanox OFED 4.2, which came out
with the necessary fix after I wrote that.

-- Christopher Samuel

[OMPI devel] Open-MPI killing nodes with mlx5 drivers?
Hi folks,

Trying the devel list to see if folks here have hit this issue when
testing, as I suspect it's not something many users will have access to
yet.

We have an issue where codes compiled with Open-MPI kill nodes with
ConnectX-4 and ConnectX-5 cards connected to Mellanox Ethernet switches
using the mlx5 driver from the latest Mellanox OFED: the kernel hangs with
no oops (or any other error) and we have to power cycle the node to get it
back.

This happens with even a singleton (no srun or mpirun), and from what I
can see from strace before the node hangs, Open-MPI is starting to probe
for what fabrics are available.

The folks I'm helping have engaged Mellanox support, but I was wondering
if anyone else had run across this?

Distro:   RHEL 7.4 (x86-64)
Kernel:   4.12.9 (needed for the CephFS filesystem they use)
OFED:     4.1-1.0.2.0
Open-MPI: 1.10.x, 2.0.2, 3.0.0

All the best,
Chris

-- Christopher Samuel

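[Editor's note: a sketch of the sort of strace capture that can narrow
down which fabric was being probed when the hang hit; the binary name is a
placeholder, and the syscall filter is one reasonable choice rather than
the only one:]

    # Trace device opens and ioctls from a singleton run; the last lines
    # written before the node hangs hint at which device was being probed.
    strace -f -e trace=open,openat,ioctl -o probe.log ./mpi_hello
    grep -i -e uverbs -e mlx probe.log | tail
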
Re: [OMPI devel] Segfault on MPI init
On 15/02/17 00:45, Gilles Gouaillardet wrote:
> i would expect orted generate a core, and then you can use gdb post
> mortem to get the stack trace.
> there should be several threads, so you can
> info threads
> bt
> you might have to switch to an other thread

You can also get a backtrace from all threads at once with:

thread apply all bt

It's not just limited to 'bt' either:

(gdb) help thread apply
Apply a command to a list of threads.

List of thread apply subcommands:

thread apply all -- Apply a command to all threads

-- Christopher Samuel

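[Editor's note: this also works non-interactively, which is handy when the
core file is on a compute node. A sketch; the binary and core file names
are placeholders:]

    # Dump full backtraces from every thread in batch mode, no prompts.
    gdb -batch -ex 'thread apply all bt full' ./orted core.12345 > backtrace.txt
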
Re: [OMPI devel] RFC: Rename nightly snapshot tarballs
On 18/10/16 07:17, Jeff Squyres (jsquyres) wrote:
> NOTE: It may be desirable to add HHMM in there; it's not common, but
> *sometimes* we do make more than one snapshot in a day (e.g., if one
> snapshot is borked, so we fix it and then generate another
> snapshot).

If it's happened before then I'd suggest allowing for it to happen again
by adding HHMM. Otherwise looks sensible to me (YMMV).

-- Christopher Samuel

Re: [OMPI devel] Off-topic re: supporting old systems
On 31/08/16 14:01, Paul Hargrove wrote:
> So, the sparc platform is a bit more orphaned than it already was when
> support stopped at Wheezy.

Ah sorry, I didn't realise you were on a non-LTS Wheezy architecture.

-- Christopher Samuel

Re: [OMPI devel] Off-topic re: supporting old systems
On 31/08/16 12:05, Paul Hargrove wrote:
> As Giles mentions the http: redirects to https: before anything is fetched.
> Replacing "-nv" in the wget command with "-v" shows that redirect clearly.

Agreed, but it still just works on Debian Wheezy for me. :-) What does
"apt-cache policy wget" say for you?

root@db3:/tmp# apt-cache policy wget
wget:
  Installed: 1.13.4-3+deb7u3
  Candidate: 1.13.4-3+deb7u3
  [...]

Here's the plain wget, with redirect - I don't even need to disable the
certificate check here on Debian Wheezy (though it still works if you do).

root@db3:/tmp# wget http://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
--2016-08-31 12:11:59-- http://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
Resolving www.open-mpi.org (www.open-mpi.org)... 192.185.39.252
Connecting to www.open-mpi.org (www.open-mpi.org)|192.185.39.252|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2 [following]
--2016-08-31 12:11:59-- https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
Connecting to www.open-mpi.org (www.open-mpi.org)|192.185.39.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8192091 (7.8M) [application/x-tar]
Saving to: `openmpi-2.0.1rc2.tar.bz2'

100%[>] 8,192,091 1.75M/s in 7.3s

2016-08-31 12:12:08 (1.07 MB/s) - `openmpi-2.0.1rc2.tar.bz2' saved [8192091/8192091]

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] Off-topic re: supporting old systems
On 31/08/16 06:22, Paul Hargrove wrote:
> It seems that a stock Debian Wheezy system cannot even *download* Open
> MPI any more:

Works for me, with both http (which shouldn't be using SSL anyway) and
https. Are you behind some weird intercepting proxy?

root@db3:/tmp# wget -nv --no-check-certificate http://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
2016-08-31 10:42:34 URL:https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2 [8192091/8192091] -> "openmpi-2.0.1rc2.tar.bz2" [1]
root@db3:/tmp# wget -nv --no-check-certificate https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
2016-08-31 10:43:10 URL:https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2 [8192091/8192091] -> "openmpi-2.0.1rc2.tar.bz2.1" [1]
root@db3:/tmp# cat /etc/issue
Debian GNU/Linux 7 \n \l

cheers,
Chris

-- Christopher Samuel

Re: [OMPI devel] Migration of mailman mailing lists
On 19/07/16 02:05, Brice Goglin wrote:
> Yes, kill all netloc lists.

Will the archives be preserved somewhere for historical reference?

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] [1.10.3rc4] testing results
On 06/06/16 15:09, Larry Baker wrote:
> An impressive accomplishment by the development team. And impressive
> coverage by Paul's testbed. Well done!

Agreed, it is very impressive to watch, on both the breaking and the
fixing side of things. :-)

Thanks so much to all involved with this.

-- Christopher Samuel

Re: [OMPI devel] RFC: Public Test Repo
Hi Josh,

On 19/05/16 13:54, Josh Hursey wrote:
> Let me know what you think. Certainly everything here is open for
> discussion, and we will likely need to refine aspects as we go.

I think having an open test suite in conjunction with the current private
one is a great way to go; it sends the right message about openness and
hopefully allows a community to build around MPI testing in general.

Certainly happy to try it out!

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] Github pricing plan changes announced today
On 18/05/16 09:59, Gilles Gouaillardet wrote:
> the (main) reason is none of us are lawyers and none of us know whether
> all test suites can be redistributed for general public use or not.

Thanks Gilles,

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] Github pricing plan changes announced today
On 12/05/16 06:21, Jeff Squyres (jsquyres) wrote:
> We basically have one important private repo (the tests repo).

Possibly a dumb question (sorry), but what's the reason for that repo
being private?

I ask as someone on the Beowulf list today was looking for an MPI
regression test tool and found MTT, but commented:

# OpenMPI has the MPI Testing Tool which looks like it would work,
# but most of their tests seem private.

and so moved on to look at other options instead.

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] [2.0.0rc2] xlc-13.1.0 ICE (hwloc)
On 03/05/16 18:11, Paul Hargrove wrote:
> xlc-13.1.0 on Linux dies compiling the embedded hwloc in this rc
> (details below).

In case it's useful: xlc 12.1.0.9-140729 (yay for BG/Q living in the past)
doesn't ICE on RHEL6 on Power7.

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran
Hi Gilles,

On 04/03/16 13:33, Gilles Gouaillardet wrote:
> there is clearly no hope when you use mpi.mod and mpi_f08.mod
> my point was, it is not even possible to expect "legacy" mpif.h work
> with different compilers.

Sorry, my knowledge of FORTRAN is limited to trying to debug why their
code wouldn't compile. :-) Apologies for the noise.

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran
On 04/03/16 12:17, Dave Turner wrote:
> My understanding is that OpenMPI built with either Intel or
> GNU compilers should be able to use the other compilers using the
> OMPI_CC and OMPI_FC environmental variables.

Sadly not. We tried this, but when one of our very few FORTRAN users (who
happened to be our director) tried to use it, it failed because the
mpi.mod module created during the build is compiler dependent. :-(

So ever since we've done separate builds for GCC and for Intel.

All the best!
Chris

-- Christopher Samuel

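[Editor's note: a sketch of the failure mode, assuming an Open MPI built
with the Intel compilers; OMPI_FC is the documented wrapper-compiler
override, and hello.f90 is a placeholder:]

    # Try to compile a gfortran user code against an ifort-built Open MPI:
    OMPI_FC=gfortran mpifort hello.f90 -o hello
    # This fails at "use mpi": mpi.mod was generated by ifort, and Fortran
    # .mod files are not portable between compilers (or sometimes even
    # between versions of the same compiler).
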
Re: [OMPI devel] Problem with the 1.8.8 version
On 05/12/15 01:52, Baldassari Caroline wrote:
> I have installed OpenMPI 1.8.8 (the last version 1.8.8 downloaded on
> your site)

v1.8 morphed into the v1.10 series, so I'd suggest trying that:

http://www.slideshare.net/jsquyres/open-mpi-new-version-number-scheme-and-roadmap

cheers!
Chris

-- Christopher Samuel

Re: [OMPI devel] Slides from the Open MPI SC'15 State of the Union BOF
On 20/11/15 03:31, Dasari, Annapurna wrote:
> Jeff, could you check the link, it didn't work for me.
> I tried to check out the slides by opening the link and downloading the
> file, I am getting a file damaged error on my system.

Not sure if Jeff has fixed them up since this, but they open fine for me
at the moment (using KDE's Okular PDF viewer).

Thanks for putting them up Jeff!

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] PMI2 in Slurm 14.11.8
On 02/09/15 13:09, Christopher Samuel wrote:
> Instead PMI2 is in a contrib directory which appears to need manual
> intervention to install.

Confirming from the Slurm list that PMI2 is not built by default; it's
only the RPM build process that will include it without intervention.

Thanks for your help Ralph!

cheers,
Chris

-- Christopher Samuel

[OMPI devel] PMI2 in Slurm 14.11.8
Hi all,

The OpenMPI FAQ says:

https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps

# Yes, if you have configured OMPI --with-pmi=foo, where foo is
# the path to the directory where pmi.h/pmi2.h is located.
# Slurm (> 2.6, > 14.03) installs PMI-2 support by default.

However, we've found on a new system we're bringing up that this doesn't
appear to be true for the vanilla Slurm 14.11.8 we're installing. Instead
PMI2 is in a contrib directory which appears to need manual intervention
to install.

I've sent an email to the Slurm list to query this behaviour, but I was
wondering if anyone here had run into this too?

All the best,
Chris

-- Christopher Samuel

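[Editor's note: for reference, a sketch of the manual step; paths assume a
Slurm 14.11 source tree where the top-level configure has already been
run, and the contrib directory mentioned above is contribs/pmi2:]

    # Build and install the PMI2 support from the Slurm contribs tree:
    cd slurm-14.11.8/contribs/pmi2
    make
    sudo make install
    # ...then configure Open MPI with --with-pmi pointing at the Slurm
    # install prefix so it picks up pmi2.h and libpmi2.
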
Re: [OMPI devel] 1.10.0rc6 - slightly different mx problem
On 25/08/15 05:08, Jeff Squyres (jsquyres) wrote:
> FWIW, we have had verbal agreement in the past that the v1.8 series
> was the last one to contain MX support. I think it would be fine for
> all MX-related components to disappear from v1.10.
>
> Don't forget that Myricom as an HPC company no longer exists.

INRIA does have Open-MX (Myrinet Express over Generic Ethernet Hardware),
last release December 2014. No idea if it's still developed or used:

http://open-mx.gforge.inria.fr/

Brice? Open-MPI is listed as working with it there. ;-)

-- Christopher Samuel

Re: [OMPI devel] 1.8.7rc1 testing results
On 14/07/15 01:49, Ralph Castain wrote:
> Okay, 1.8.7rc3 (we already had an rc2) is now out with all these changes
> - please take one last look.

Looks OK for XRC here, thanks!

-- Christopher Samuel

Re: [OMPI devel] Error in ./configure for Yocto
On 10/07/15 01:38, Jeff Squyres (jsquyres) wrote:
> Just curious -- what's Yocto?

It's a system for building embedded Linux distros:

https://www.yoctoproject.org/

Intel announced the switch to Yocto for their MPSS distro for Xeon Phi a
couple of years ago (v3 and later):

https://software.intel.com/en-us/articles/intelr-mpss-transition-to-yocto-faq

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] Proposal: update Open MPI's version number and release process
On 20/05/15 14:37, Howard Pritchard wrote:
> It would also be easy to trap the
> I-want-to-bypass-PR-because-I-know-what-I'm-doing developer with a
> second level of protection. Just set up a jenkins project that does a
> smoke test after every commit to master. If the smoke test fails, send
> a naughty-gram to the committer and copy devel. Pretty soon the
> developer will get trained to use the PR process, unless they are that
> engineer I've yet to meet who always writes flawless code.

VMware used to have a bot that tweeted info about their testing, including
"$USER just broke the build at VMware"; for example:

https://twitter.com/vmwarepbs/status/4634524702

:-)

-- Christopher Samuel

Re: [OMPI devel] Proposal: update Open MPI's version number and release process
On 19/05/15 05:11, Jeff Squyres (jsquyres) wrote:
> We've reached internal consensus, and would like to present this to the
> larger community for feedback.

My gut feeling is that this is very good; from a cluster admin point of
view it means we keep a system tracking one level up from where we are
currently, i.e. at v4.x.x (for example) rather than v1.6.x or v1.8.x.

We've got a new system coming up in the next few months (fingers crossed),
so it'll be interesting to see whether we land on the v1.10 or v2
releases, but either way I see this as making our lives easier.

Thanks!
Chris

-- Christopher Samuel

Re: [OMPI devel] Chris Yeoh
On 01/05/15 00:41, Jeff Squyres (jsquyres) wrote:
> I am saddened to inform the Open MPI developer community of the death
> of Chris Yeoh.

There is a page for donations to lung cancer research in his memory here
(Chris was not a smoker, but it still took his life):

http://participate.freetobreathe.org/site/TR?px=1582460_id=2710=personal#.VSscH5SUd90

# Chris never smoked, yet was taken too early by this dreadful
# disease. Lung cancer has the greatest kill rate with the
# smallest funding rate because of the stigma associated with
# it being a "smoker's disease", but anyone with lungs can get
# it. We hope that further funding to research will one day
# provide a cure and better trials for others.

Valē Chris.

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] RFC: Remove embedded libltdl
On 03/02/15 05:09, Ralph Castain wrote:
> Just out of curiosity: I see you are reporting about a build on the
> headnode of a BG cluster. We've never ported OMPI to BG - are you using
> it on such a system? Or were you just test building the code on a
> convenient server?

Just a convenient server with a not-so-mainstream architecture (and an
older RHEL release through necessity). Sorry to get your hopes up! :-)

All the best,
Chris

-- Christopher Samuel

Re: [OMPI devel] RFC: Remove embedded libltdl
On 31/01/15 10:51, Jeff Squyres (jsquyres) wrote:
> New tarball posted (same location). Now featuring 100% fewer "make
> check" failures.

On our BG/Q front-end node (PPC64, RHEL 6.4) I see:

../../config/test-driver: line 95: 30173 Segmentation fault (core dumped) "$@" > $log_file 2>&1
FAIL: opal_lifo

Stack trace implies the culprit is in:

#0  0x10001048 in opal_atomic_swap_32 (addr=0x20, newval=1)
    at /vlsci/VLSCI/samuel/tmp/OMPI/openmpi-gitclone/opal/include/opal/sys/atomic_impl.h:51
51          old = *addr;

I've attached a script of gdb doing "thread apply all bt full" in case
that's helpful.

All the best,
Chris

-- Christopher Samuel

[Attached gdb transcript:]

Script started on Mon 02 Feb 2015 12:32:56 EST
[samuel@avoca class]$ gdb /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/test/class/.libs/lt-opal_lifo core.32444
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "ppc64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/test/class/.libs/lt-opal_lifo...done.
[New Thread 32465]
[New Thread 32464]
[New Thread 32466]
[New Thread 32444]
[New Thread 32469]
[New Thread 32467]
[New Thread 32470]
[New Thread 32463]
[New Thread 32468]
Missing separate debuginfo for /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/opal/.libs/libopen-pal.so.0
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/de/a09192aa84bbc15579ae5190dc8acd16eb94fe
Missing separate debuginfo for /usr/local/slurm/14.03.10/lib/libpmi.so.0
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/28/09dfc4706ed44259cc31a5898c8d1a9b76b949
Missing separate debuginfo for /usr/local/slurm/14.03.10/lib/libslurm.so.27
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/e2/39d8a2994ae061ab7ada0ebb7719b8efa5de96
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/1a/063e3d64bb5560021ec2ba5329fb1e420b470f
Reading symbols from /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/opal/.libs/libopen-pal.so.0...done.
Loaded symbols for /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/opal/.libs/libopen-pal.so.0
Reading symbols from /usr/local/slurm/14.03.10/lib/libpmi.so.0...done.
Loaded symbols for /usr/local/slurm/14.03.10/lib/libpmi.so.0
Reading symbols from /usr/local/slurm/14.03.10/lib/libslurm.so.27...done.
Loaded symbols for /usr/local/slurm/14.03.10/lib/libslurm.so.27
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld64.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld64.so.1
Core was generated by `/vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/test/class/.libs/lt-opal_lifo '.
Program terminated with signal 11, Segmentation fault.
#0  0x10001048 in opal_atomic_swap_32 (addr=0x20, newval=1)
    at /vlsci/VLSCI/samuel/tmp/OMPI/openmpi-gitclone/opal/include/opal/sys/atomic_impl.h:51
51          old = *addr;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6_4.5.ppc64
(gdb) thread apply all bt full

Thread 9 (Thread 0xfff7a0ef200 (LWP 32468)):
#0  0x0080adb6629c in .__libc_write () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0fff7d6905b4 in show_stackframe (signo=11, info=0xfff7a0ee3d8, p=0xfff7a0edd00)
    at /vlsci/VLSCI/samuel/tmp/OMPI/openmpi-gitclone/opal/util/stacktrace.c:81
        print_buffer = "[avoca:32444] *** Process received signal ***\n", '\000'
        tmp = 0xfff7a0ed858 "[avoca:32444] *** Process received signal ***\n"
        size = 1024
        ret = 46
        si_code_str

Re: [OMPI devel] mlx4 QP operation err
Hi Dave,

On 29/01/15 11:31, Dave Turner wrote:
> I've found some old references to similar mlx4 errors dating back to
> 2009 that lead me to believe this may be a firmware error. I believe we're
> running the most up to date version of the firmware.

There was a new version released a few days ago, 2.33.5100:

http://www.mellanox.com/page/firmware_table_ConnectX3ProEN

Release notes are here:

http://www.mellanox.com/pdf/firmware/ConnectX3Pro-FW-2_33_5100-release_notes.pdf

Bug fixes start on page 23; it looks like there are 29 fixes in this
version, and fix 1 is for RoCE (though of course it may not be relevant):
"The first Read response was not treated as implicit ACK" (discovered in
2.30.8000).

All the best,
Chris

-- Christopher Samuel

Re: [hwloc-devel] Using hwloc to detect Hard Disks
On 24/09/14 00:57, Ralph Castain wrote:
> Memory info is available from lshw, though they are a GPL code:

FWIW on this laptop (Intel Haswell) lshw only reports DIMM info when run
as root, which I suspect points to it accessing DMI information via
/dev/mem. Using strace supports this:

3405 open("/dev/mem", O_RDONLY) = -1 EACCES (Permission denied)

FWIW dmidecode does the same:

samuel@haswell:~$ dmidecode
# dmidecode 2.12
/dev/mem: Permission denied

All the best,
Chris

-- Christopher Samuel

[OMPI devel] Grammar error in git master: 'You job will now abort'
Hi all,

We spotted this in 1.6.5 and git grep shows it's fixed in the v1.8 branch,
but in master it's still there:

samuel@haswell:~/Code/OMPI/ompi-svn-mirror$ git grep -n 'You job will now abort'
orte/tools/orterun/help-orterun.txt:679:You job will now abort.
samuel@haswell:~/Code/OMPI/ompi-svn-mirror$

I'm using https://github.com/open-mpi/ompi-svn-mirror.git so let me know
if I should be using something else now.

cheers,
Chris

-- Christopher Samuel

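[Editor's note: a sketch of the one-character fix, using the file path
from the git grep output above:]

    # Correct the typo in place in a checkout of master.
    sed -i 's/You job will now abort/Your job will now abort/' \
        orte/tools/orterun/help-orterun.txt
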
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On 09/05/14 00:16, Joshua Ladd wrote:
> The necessary packages will be supported and available in community
> OFED.

We're constrained to what is in RHEL6 I'm afraid. This is because we have
to run GPFS over IB to BG/Q from the same NSDs that talk GPFS to all our
Intel clusters.

We did try MOFED 2.x (in connected mode) on a new Intel cluster during its
bring-up last year, which worked for MPI but stopped it talking to the
NSDs. Reverting to vanilla RHEL6 fixed it.

Not your problem though. :-) As Ralph has said, there is work on an
alternative solution that we will be able to use.

Thanks!
Chris

-- Christopher Samuel

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On 08/05/14 23:45, Ralph Castain wrote:
> Artem and I are working on a new PMIx plugin that will resolve it
> for non-Mellanox cases.

Ah yes of course, sorry, my bad!

-- Christopher Samuel

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Hi all,

Apologies for having dropped out of the thread, night intervened here. ;-)

On 08/05/14 12:54, Ralph Castain wrote:
> I think there was one 2.6.x that was borked, and definitely
> problems in the 14.03.x line. Can't pinpoint it for you, though.

No worries, thanks.

> Sounds good. I'm going to have to dig deeper into those numbers,
> though, as they don't entirely add up to me. Once the job gets
> launched, the launch method itself should have no bearing on
> computational speed - IF all things are equal. In other words, if
> the process layout is the same, and the binding pattern is the
> same, then computational speed should be roughly equivalent
> regardless of how the procs were started.

Not sure if it's significant, but when mpirun was launching processes it
was using srun to start orted, which then started the MPI ranks, whereas
with PMI/PMI2 it appeared to start the ranks directly.

> My guess is that your data might indicate a difference in the
> layout and/or binding pattern as opposed to PMI2 vs mpirun. At the
> scale you mention later in the thread (only 70 nodes x 16 ppn), the
> difference in launch timing would be zilch. So I'm betting you
> would find (upon further exploration) that (a) you might not have
> been binding processes when launching by mpirun, since we didn't
> bind by default until the 1.8 series, but were binding under direct
> srun launch, and (b) your process mapping would quite likely be
> different as we default to byslot mapping, and I believe srun
> defaults to bynode?

FWIW all our environment modules that do OMPI have:

setenv OMPI_MCA_orte_process_binding core

> Might be worth another comparison run when someone has time.

Yeah, I'll try and queue up some more tests - unfortunately the cluster we
tested on then is flat out at the moment, but I'll try to sneak in a
64-core job using identical configs and compare mpirun, srun on its own,
and srun with PMI2.

All the best,
Chris

-- Christopher Samuel

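[Editor's note: a sketch of how the layout/binding hypothesis could be
checked; mpirun's --report-bindings is a real Open MPI option, while the
srun verbose-binding spelling is assumed for the Slurm of that era, and
./a.out is a placeholder:]

    # Open MPI prints one binding report line per rank to stderr:
    mpirun -np 16 --report-bindings ./a.out

    # Slurm reports its own binding decisions under direct launch:
    srun --mpi=pmi2 --cpu_bind=verbose ./a.out
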
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Hi all,

On 07/05/14 18:00, Ralph Castain wrote:
> Interesting - how many nodes were involved? As I said, the bad
> scaling becomes more evident at a fairly high node count.

Our x86-64 systems are low node counts (we've got BG/Q for capacity); the
cluster that those tests were run on has 70 nodes, each with 16 cores, so
I suspect we're a long long way away from that pain point.

All the best!
Chris

-- Christopher Samuel

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Hi all,

Apologies for having dropped out of the thread, night intervened here. ;-)

On 08/05/14 00:45, Ralph Castain wrote:
> Okay, then we'll just have to develop a workaround for all those
> Slurm releases where PMI-2 is borked :-(

Do you know what these releases are? Are we talking 2.6.x or 14.03? The
14.03 series has had a fair few rapid point releases and doesn't appear to
be anywhere near as stable as 2.6 was when it came out. :-(

> FWIW: I think people misunderstood my statement. I specifically
> did *not* propose to *lose* PMI-2 support. I suggested that we
> change it to "on-by-request" instead of the current "on-by-default"
> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
> the Slurm implementation stabilized, then we could reverse that
> policy.
>
> However, given that both you and Chris appear to prefer to keep it
> "on-by-default", we'll see if we can find a way to detect that
> PMI-2 is broken and then fall back to PMI-1.

My intention was to provide the data that led us to want PMI2; but if
configure had an option to enable PMI2 by default, so that only those who
requested it got it, then I'd be more than happy - we'd just add it to our
script to build it.

All the best!
Chris

-- Christopher Samuel

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Hiya Ralph,

On 07/05/14 14:49, Ralph Castain wrote:
> I should have looked closer to see the numbers you posted, Chris -
> those include time for MPI wireup. So what you are seeing is that
> mpirun is much more efficient at exchanging the MPI endpoint info
> than PMI. I suspect that PMI2 is not much better as the primary
> reason for the difference is that mpirun sends blobs, while PMI
> requires that everything be encoded into strings and sent in little
> pieces.
>
> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
> operation) much faster, and MPI_Init completes faster. Rest of the
> computation should be the same, so long compute apps will see the
> difference narrow considerably.

Unfortunately it looks like I had an enthusiastic cleanup at some point
and so I cannot find the out files from those runs at the moment, but I
did find some comparisons from around that time.

This first pair compares running NAMD with OMPI 1.7.3a1r29103, run with
mpirun and then srun successively from inside the same Slurm job:

mpirun namd2 macpf.conf
srun --mpi=pmi2 namd2 macpf.conf

Firstly the mpirun output (grep'ing the interesting bits):

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB memory
Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB memory
Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB memory
Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB memory
Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB memory
WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB

Now the srun output:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB memory
Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB memory
Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB memory
Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB memory
Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB memory
WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB

The next two pairs were first launched using mpirun from 1.6.x and then
with srun from 1.7.3a1r29103 - again each pair inside the same Slurm job
with the same inputs.

First pair mpirun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
WallClock: 8341.524414 CPUTime: 8341.524414 Memory: 975.015625 MB

First pair srun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory
WallClock: 7476.643555 CPUTime: 7476.643555 Memory: 968.867188 MB

Second pair mpirun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory
WallClock: 7842.831543 CPUTime: 7842.831543 Memory: 1004.050781 MB

Second pair srun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
WallClock: 7522.677246 CPUTime: 7522.677246 Memory: 969.433594 MB

So to me it looks like (for NAMD on our system at least) PMI2 does seem to
give better scalability.

All the best!
Chris

-- Christopher Samuel

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On 07/05/14 13:37, Moody, Adam T. wrote:
> Hi Chris,

Hi Adam,

> I'm interested in SLURM / OpenMPI startup numbers, but I haven't
> done this testing myself. We're stuck with an older version of
> SLURM for various internal reasons, and I'm wondering whether it's
> worth the effort to back port the PMI2 support. Can you share some
> of the differences in times at different scales?

We've not looked at startup times I'm afraid, this was time to solution.
We noticed it with Slurm when we first started using it on x86-64 for our
NAMD tests (this from a posting to the list last year when I raised the
issue and was told PMI2 would be the solution):

> Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB.
>
> Here are some timings as reported as the WallClock time by NAMD
> itself (so not including startup/tear-down overhead from Slurm).
>
> srun:
>
> run1/slurm-93744.out:WallClock: 695.079773 CPUTime: 695.079773
> run4/slurm-94011.out:WallClock: 723.907959 CPUTime: 723.907959
> run5/slurm-94013.out:WallClock: 726.156799 CPUTime: 726.156799
> run6/slurm-94017.out:WallClock: 724.828918 CPUTime: 724.828918
>
> Average of 692 seconds
>
> mpirun:
>
> run2/slurm-93746.out:WallClock: 559.311035 CPUTime: 559.311035
> run3/slurm-93910.out:WallClock: 544.116333 CPUTime: 544.116333
> run7/slurm-94019.out:WallClock: 586.072693 CPUTime: 586.072693
>
> Average of 563 seconds.
>
> So that's about 23% slower.
>
> Everything is identical (they're all symlinks to the same golden
> master) *except* for the srun / mpirun which is modified by
> copying the batch script and substituting mpirun for srun.

-- Christopher Samuel

Re: [hwloc-devel] GIT: hwloc branch master updated. 0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec
On 30/03/14 02:04, Ralph Castain wrote:
> turns out that some linux distro's automatically set LS_COLORS in
> your environment when running old versions of csh/tcsh via their
> default dot files

For example RHEL6 does this.

-- Christopher Samuel

Re: [OMPI devel] SC13 birds of a feather
On 05/12/13 01:52, Jeff Squyres (jsquyres) wrote:
> Ralph -- let's chat about this in Chicago next Friday. I'll add
> it to the agenda on the wiki. I assume this would not be
> difficult stuff; we don't really need to do anything fancy at all.
> I think we just want to sketch out what exactly we want to do, and
> it could probably be done in a day or three.

There's also stuff that ACPI can expose under:

/sys/class/thermal/thermal_zone*/temp

though it might need a bit more prodding to work out what's what there.

> (Thanks for the idea, Samuel!)

My pleasure!

All the best,
Chris

-- Christopher Samuel

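[Editor's note: a sketch of what that sysfs interface exposes; the paths
are standard Linux thermal-zone entries and the values are in millidegrees
Celsius:]

    # Print each zone's type label alongside its current temperature.
    for z in /sys/class/thermal/thermal_zone*; do
        echo "$z ($(cat $z/type)): $(cat $z/temp) millidegrees C"
    done
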
Re: [OMPI devel] SC13 birds of a feather
On 04/12/13 09:27, Jeff Squyres (jsquyres) wrote:
> 2. The MPI_T performance variables are new. There's only a few
> created right now (e.g., in the Cisco usnic BTL). But the field
> is pretty wide open here -- the infrastructure is there, but we're
> really not exposing much information yet. There's lots that can
> be done here.

Random thought - please shoot it down if crazy...

Would it make any sense to expose system/environmental/thermal information
to the application via MPI_T?

For our sort of systems with a grab bag of jobs it's not likely useful,
but if you had a system dedicated to running an in-house code then you
could conceive of situations where you might want to react to
over-temperature cores, nodes, etc.

cheers,
Chris

-- Christopher Samuel

Re: [OMPI devel] Openmpi 1.6.5 is freezing under GNU/Linux ia64
On 03/12/13 23:50, Sylvestre Ledru wrote:
> FYI, Debian has stopped supporting ia64 for its next release
> So, I stopped working on that issue.

Yeah, it's not looking good - here's the context for this:

http://lists.debian.org/debian-devel-announce/2013/11/msg7.html

# We have stopped considering ia64 as a blocker for testing
# migration. This means that the out-of-date and uninstallability
# criteria on ia64 will not hold up transitions from unstable to
# testing for packages. It is expected that unless drastic
# improvement occurs, ia64 will be removed from testing on
# Friday 24th January 2014.

-- Christopher Samuel

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29615 - in trunk: . contrib contrib/dist/linux debian debian/source
On 07/11/13 04:40, Mike Dubman wrote:

> I did not find debian packaging files in the OMPI tree, could you
> please point me to it?

As Sylvestre explained, Debian (and presumably Ubuntu too) will
automatically delete any /debian/ directory in an upstream tarball and
substitute their own packaging. You can see what they put in for sid
(testing) here:

http://ftp.de.debian.org/debian/pool/main/o/openmpi/openmpi_1.6.5-5.debian.tar.gz

Whilst I can understand the enthusiasm, I don't think it's going to be
very helpful to Debian; perhaps a better way to assist would be to help
out Sylvestre and the other Debian maintainers? This might be a handy
place to start:

http://qa.debian.org/developer.php?login=pkg-openmpi-maintainers%40lists.alioth.debian.org

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] 1.6.5 large matrix test doesn't pass (decode) ?
On 05/10/13 01:49, KAWASHIMA Takahiro wrote:

> It is a bug in the test program, test/datatype/ddt_raw.c, and it
> was fixed at r24328 in trunk.
>
> https://svn.open-mpi.org/trac/ompi/changeset/24328
>
> I've confirmed the failure occurs with plain v1.6.5 and it doesn't
> occur with patched v1.6.5.

Perfect, thanks! Sorry for the delay, been away on holiday.

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
[OMPI devel] 1.6.5 large matrix test doesn't pass (decode) ?
Not sure if this is important, or expected, but I ran a make check out
of interest after seeing recent emails and saw the final one of these
tests be reported as "NOT PASSED" (it seems to be the only failure).
No idea if this is important or not. The text I see is:

# * TEST UPPER MATRIX
# test upper matrix complete raw in 7 microsec
decode [NOT PASSED]

This happens on both our Nehalem and SandyBridge clusters and we are
building with the system GCC. I've attached the full log from our
Nehalem cluster (RHEL 6.4). Our configure script is:

#!/bin/bash
BASE=`basename $PWD | sed -e s,-,/,`
module purge
./configure --prefix=/usr/local/${BASE} --with-slurm --with-openib \
    --enable-static --enable-shared
make -j

I'm away on leave next week (first break for a year, yay!) but back the
week after..

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

[Attachment: full "make check" log from the Nehalem cluster; almost
entirely recursive "Making check in ..." / "Nothing to be done" output
from openmpi-1.6.5.1, truncated in the archive.]
Re: [OMPI devel] Openmpi 1.6.5 is freezing under GNU/Linux ia64
On 21/09/13 14:33, Ralph Castain wrote:

> I think you misunderstood the issue here. The problem is that
> mpirun appears to be hanging before it ever gets to the point of
> launching something.

Ah, quite correct, I hadn't realised the debug info hadn't shown it
getting to the point of launching the executable. Mea culpa. I blame
jet-lag. ;-)

cheers,
Chris (about to get a second dose)

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Openmpi 1.6.5 is freezing under GNU/Linux ia64
On 21/09/13 05:49, Sylvestre Ledru wrote:

> Does it ring a bell to anyone ?

Possibly - if you run the binary without mpirun does it do the same?
If so, could you try running it with strace -f and see if you see
repeating SEGVs?

cheers!
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
On 06/09/13 14:14, Christopher Samuel wrote:

> However, modifying the test program confirms that variable is getting
> propagated as expected with both mpirun and srun for 1.6.5 and the 1.7
> snapshot. :-(

Investigating further by setting:

export OMPI_MCA_orte_report_bindings=1
export SLURM_CPU_BIND=core
export SLURM_CPU_BIND_VERBOSE=verbose

reveals that only OMPI 1.6.5 with mpirun reports bindings being set
(see below). We cannot understand why Slurm doesn't *appear* to be
setting bindings, as we have the correct settings according to the
documentation.

Whilst it may explain the difference between 1.6.5 mpirun and srun, it
doesn't explain why the 1.7 snapshot is so much better, as you'd expect
them to be hurt in the same way.

==OPENMPI 1.6.5==

==mpirun==
[barcoo003:03633] System has detected external process binding to cores 0001
[barcoo003:03633] MCW rank 0 bound to socket 0[core 0]: [B]
[barcoo004:04504] MCW rank 1 bound to socket 0[core 0]: [B]
Hello, World, I am 0 of 2 on host barcoo003 from app number 0 universe size 2 universe envar 2
Hello, World, I am 1 of 2 on host barcoo004 from app number 0 universe size 2 universe envar 2

==srun==
Hello, World, I am 0 of 2 on host barcoo003 from app number 1 universe size 2 universe envar NULL
Hello, World, I am 1 of 2 on host barcoo004 from app number 1 universe size 2 universe envar NULL
=====

==OPENMPI 1.7.3==
DANGER: YOU ARE LOADING A TEST VERSION OF OPENMPI. THIS MAY BE BAD.

==mpirun==
Hello, World, I am 0 of 2 on host barcoo003 from app number 0 universe size 2 universe envar 2
Hello, World, I am 1 of 2 on host barcoo004 from app number 0 universe size 2 universe envar 2

==srun==
Hello, World, I am 0 of 2 on host barcoo003 from app number 0 universe size 2 universe envar NULL
Hello, World, I am 1 of 2 on host barcoo004 from app number 0 universe size 2 universe envar NULL
=====

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
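One way to cross-check the binding reports above is to have each rank
ask the kernel directly for its own affinity mask; a minimal
Linux-specific sketch (it assumes glibc's CPU_* macros, and is not the
test program used in this thread):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t set;
    char host[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* ask the kernel which cores this process may run on */
    if (sched_getaffinity(0, sizeof(set), &set) == 0) {
        printf("rank %d on %s bound to cores:", rank, host);
        for (int c = 0; c < CPU_SETSIZE; c++)
            if (CPU_ISSET(c, &set))
                printf(" %d", c);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}

If srun really isn't binding, every rank reports the full set of cores
on its node regardless of what the launcher claims.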
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
On 06/09/13 00:23, Hjelm, Nathan T wrote:

> I assume that process binding is enabled for both mpirun and srun?
> If not that could account for a difference between the runtimes.

You raise an interesting point, we have been doing that with:

[samuel@barcoo ~]$ module show openmpi 2>&1 | grep binding
setenv   OMPI_MCA_orte_process_binding core

However, modifying the test program confirms that variable is getting
propagated as expected with both mpirun and srun for 1.6.5 and the 1.7
snapshot. :-(

cheers,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
Hi Ralph,

On 05/09/13 12:50, Ralph Castain wrote:

> Jeff and I were looking at a similar issue today and suddenly
> realized that the mappings were different - i.e., what ranks are
> on what nodes differs depending on how you launch. You might want
> to check if that's the issue here as well. Just launch the
> attached program using mpirun vs srun and check to see if the maps
> are the same or not.

Very interesting, the rank-to-node mappings are identical in all cases
(mpirun and srun for 1.6.5 and my test 1.7.3 snapshot) but what is
different is as follows.

For the 1.6.5 build I see mpirun report:

number 0 universe size 64 universe envar 64

whereas srun reports:

number 1 universe size 64 universe envar NULL

For the 1.7.3 snapshot both report "number 0", so the only difference
there is that mpirun has:

envar 64

whereas srun has:

envar NULL

Are these differences significant? I'm intrigued that the problem child
(srun 1.6.5) is the only one where number is 1.

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
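For reference, the universe size can also be queried from inside MPI
via the predefined MPI_UNIVERSE_SIZE attribute, alongside the
environment variable the test program prints; a minimal sketch
(OMPI_UNIVERSE_SIZE is the variable Open-MPI's mpirun sets for its
processes - the assumption here is that srun does not set it, which
would explain the NULLs above):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, flag;
    int *usize;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* predefined attribute: how many slots the universe has */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &usize, &flag);

    if (rank == 0) {
        const char *envar = getenv("OMPI_UNIVERSE_SIZE");
        if (flag)
            printf("universe size attr %d\n", *usize);
        printf("universe envar %s\n", envar ? envar : "NULL");
    }

    MPI_Finalize();
    return 0;
}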
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
On 04/09/13 18:33, George Bosilca wrote:

> You can confirm that the slowdown happen during the MPI
> initialization stages by profiling the application (especially the
> MPI_Init call).

NAMD helpfully prints benchmark and timing numbers during the initial
part of the simulation, so here's what they say. For both seconds per
step and days per nanosecond of simulation less is better.

I've included the benchmark numbers (every 100 steps or so from the
start) and the final timing number after 25000 steps. It looks to me
(as a sysadmin and not an MD person) that the final timing number
includes CPU time in seconds per step and wallclock time in seconds
per step.

64 cores over 10 nodes:

OMPI 1.7.3a1r29103 mpirun

Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
TIMING: 25000 CPU: 8247.2, 0.330157/step Wall: 8247.2, 0.330157/step, 0.0229276 hours remaining, 921.894531 MB of memory in use.

OMPI 1.7.3a1r29103 srun

Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory
TIMING: 25000 CPU: 7390.15, 0.296/step Wall: 7390.15, 0.296/step, 0.020 hours remaining, 915.746094 MB of memory in use.

64 cores over 18 nodes:

OMPI 1.6.5 mpirun

Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory
TIMING: 25000 CPU: 7754.17, 0.312071/step Wall: 7754.17, 0.312071/step, 0.0216716 hours remaining, 950.929688 MB of memory in use.

OMPI 1.7.3a1r29103 srun

Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
TIMING: 25000 CPU: 7420.91, 0.296029/step Wall: 7420.91, 0.296029/step, 0.0205575 hours remaining, 916.312500 MB of memory in use.

Hope this is useful!

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
On 04/09/13 11:29, Ralph Castain wrote:

> Your code is obviously doing something much more than just
> launching and wiring up, so it is difficult to assess the
> difference in speed between 1.6.5 and 1.7.3 - my guess is that it
> has to do with changes in the MPI transport layer and nothing to do
> with PMI or not.

I'm testing with what would be our most used application in aggregate
across our systems, the NAMD molecular dynamics code from here:

http://www.ks.uiuc.edu/Research/namd/

so yes, you're quite right, it's doing a lot more than that and has a
reputation for being a *very* chatty MPI code.

For comparison, whilst users see GROMACS also suffer with srun under
1.6.5 they don't see anything like the slow down that NAMD gets.

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c
On 04/09/13 04:47, Jeff Squyres (jsquyres) wrote:

> Hmm. Are you building Open MPI in a special way? I ask because I'm
> unable to replicate the issue -- I've run your test (and a C
> equivalent) a few hundred times now:

I don't think we do anything unusual, the script we are using is fairly
simple (it does a module purge to ensure we are just using the system
compilers and don't pick up anything strange) and is as follows:

#!/bin/bash
BASE=`basename $PWD | sed -e s,-,/,`
module purge
./configure --prefix=/usr/local/${BASE} --with-slurm --with-openib \
    --enable-static --enable-shared
make -j

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
On 03/09/13 10:56, Ralph Castain wrote:

> Yeah - --with-pmi=

Actually I found that just --with-pmi=/usr/local/slurm/latest
worked. :-)

I've got some initial numbers for 64 cores; as I mentioned, the system
I found this on initially is so busy at the moment I won't be able to
run anything bigger for a while, so I'm going to move my testing to
another system which is a bit quieter, but slower (it's Nehalem vs
SandyBridge).

All the below tests are with the same NAMD 2.9 binary and within the
same Slurm job so it runs on the same cores each time. It's nice to
find that C code at least seems to be backwardly compatible!

64 cores over 18 nodes:

Open-MPI 1.6.5 with mpirun       - 7842 seconds
Open-MPI 1.7.3a1r29103 with srun - 7522 seconds

so that's about a 4% speedup.

64 cores over 10 nodes:

Open-MPI 1.7.3a1r29103 with mpirun - 8341 seconds
Open-MPI 1.7.3a1r29103 with srun   - 7476 seconds

So that's about 11% faster, and the mpirun speed has decreased - though
of course that's built using PMI so perhaps that's the cause?

cheers,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
On 31/08/13 02:42, Ralph Castain wrote:

> We did some work on the OMPI side and removed the O(N) calls to
> "get", so it should behave better now. If you get the chance,
> please try the 1.7.3 nightly tarball. We hope to officially release
> it soon.

Stupid question, but never having played with PMI before: is it just a
case of appending the --with-pmi option to our current configure?

thanks,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [hwloc-devel] lstopo - please add the information about the kernel to the graphical output
On 03/09/13 02:03, Jiri Hladky wrote:

> I vote for --append-legend

I like that too, though the idea of an additional undocumented --jirka
option also appeals. :-)

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
On 31/08/13 02:42, Ralph Castain wrote:

> Hi Chris et al

Hiya,

> We did some work on the OMPI side and removed the O(N) calls to
> "get", so it should behave better now. If you get the chance,
> please try the 1.7.3 nightly tarball. We hope to officially release
> it soon.

Thanks so much, I'll get our folks to rebuild a test version of NAMD
against 1.7.3a1r29103 which I built this afternoon.

It might be some time until I can get a test job of a suitable size to
run though, looks like our systems are flat out!

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c
On 30/08/13 16:01, Christopher Samuel wrote:

> Thanks for this, I'll take a look further next week..

The code where it's SEGV'ing is here:

/* check that one of the above allocation paths succeeded */
if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE)) {
    remainder_size = size - nb;
    remainder = chunk_at_offset(p, nb);
    av->top = remainder;
    set_head(p, nb | PREV_INUSE | (av != &main_arena ? NON_MAIN_ARENA : 0));
    set_head(remainder, remainder_size | PREV_INUSE);
    check_malloced_chunk(av, p, nb);
    return chunk2mem(p);
}

It dies when it does:

set_head(remainder, remainder_size | PREV_INUSE);

where remainder_size=0. This implies that size and nb are the same, so
I'm wondering if the test at the top of that block should not have the
equals, so instead be this?

/* check that one of the above allocation paths succeeded */
if ((unsigned long)(size) > (unsigned long)(nb + MINSIZE)) {

It would ensure that the set_head() macro would never get called with a
0 argument. The code would then fall through to the malloc failure part
(which is what I suspect we want).

Thoughts?

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c
Hiya Jeff,

On 30/08/13 11:13, Jeff Squyres (jsquyres) wrote:

> FWIW, the stack traces you sent are not during MPI_INIT.

I did say it was a suspicion. ;-)

> What happens with OMPI's memory manager is that it inserts itself
> to be *the* memory allocator for the entire process before main()
> even starts. We have to do this as part of the horribleness of
> that is OpenFabrics/verbs and how it just doesn't match the MPI
> programming model at all. :-( (I think I wrote some blog entries
> about this a while ago... Ah, here's a few:

Thanks! I'll take a look next week (just got out of a 5.5 hour meeting
and have to head home now).

> Therefore, (in C) if you call malloc() before MPI_Init(), it'll be
> calling OMPI's ptmalloc. The stack traces you sent imply that
> it's just when your app is calling the fortran allocate -- which is
> after MPI_Init().

OK, that makes sense.

> FWIW, you can build OMPI with --without-memory-manager, or you can
> setenv OMPI_MCA_memory_linux_disable to 1 (note: this is NOT a
> regular MCA parameter -- it *must* be set in the environment
> before the MPI app starts). If this env variable is set, OMPI will
> *not* interpose its own memory manager in the pre-main hook. That
> should be a quick/easy way to try with and without the memory
> manager and see what happens.

Well with OMPI_MCA_memory_linux_disable=1 I don't get the crash at all,
or the spin with the Intel compiler build. Nice!

Thanks for this, I'll take a look further next week..

Very much obliged,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [hwloc-devel] lstopo - please add the information about the kernel to the graphical output
On 28/08/13 02:19, Brice Goglin wrote:

> The problem I have while playing with this is that it takes a lot
> of space. Putting the entire uname on a single line will be
> truncated when the topology drawing isn't large (on machines with 2
> cores for instance). And using multiple lines would make the legend
> huge.

Would there be any benefit of inserting it into the EXIF information
for the image (every time) instead? That way it would be accessible for
those who need it (now and in the future) whilst not cluttering up the
image.

cheers,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [hwloc-devel] hwloc-distrib - please add the option to distribute the jobs in the reverse direction
On 27/08/13 00:07, Brice Goglin wrote:

> But there's a more general problem here, some people may want
> something similar for other cases. I need to think about it.

Something like a sort order perhaps, combined with some method to
exclude or weight PUs based on some metrics (including a user defined
weight)?

I had a quick poke around looking at /proc/irq/*/ and it would appear
you can gather info about which CPUs are eligible to handle IRQs from
the smp_affinity bitmask (or smp_affinity_list).

The node file there just "shows the node to which the device using the
IRQ reports itself as being attached. This hardware locality
information does not include information about any possible driver
locality preference."

cheers,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
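A minimal sketch of slurping that IRQ affinity information out of /proc
(assuming the smp_affinity_list cpulist format, which is a string such
as "0-3,8"):

#include <stdio.h>
#include <string.h>
#include <glob.h>

int main(void)
{
    glob_t g;

    if (glob("/proc/irq/*/smp_affinity_list", 0, NULL, &g) != 0)
        return 1;

    /* each file holds the cpulist of CPUs eligible for that IRQ */
    for (size_t i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        char cpus[256];

        if (f && fgets(cpus, sizeof(cpus), f)) {
            cpus[strcspn(cpus, "\n")] = '\0';
            printf("%s: %s\n", g.gl_pathv[i], cpus);
        }
        if (f)
            fclose(f);
    }

    globfree(&g);
    return 0;
}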
[OMPI devel] How to deal with F90 mpi.mod with single stack and multiple compiler suites?
Hi folks,

We've got what we thought would be a fairly standard OMPI (1.6.5)
install: a single install built with GCC, then setting the appropriate
variables to use the Intel compilers when someone loads our "intel"
module:

$ module show intel
[...]
setenv   OMPI_CC icc
setenv   OMPI_CXX icpc
setenv   OMPI_F77 ifort
setenv   OMPI_FC ifort
setenv   OMPI_CFLAGS -xHOST -O3 -mkl=sequential
setenv   OMPI_FFLAGS -xHOST -O3 -mkl=sequential
setenv   OMPI_FCFLAGS -xHOST -O3 -mkl=sequential
setenv   OMPI_CXXFLAGS -xHOST -O3 -mkl=sequential

This works wonderfully, *except* that when our director attempted to
build an F90 program with the Intel compilers it failed to build,
because the mpi.mod F90 module was produced with gfortran rather than
the Intel compilers. :-(

Is there any way to avoid having to do parallel installs of OMPI with
GCC and Intel compilers just to have two different versions of these
files? My brief googling hasn't indicated anything, and I don't see
anything in the mpif90 manual page (though I have to admit I've had to
rush to try and get this done before I need to leave for the day). :-(

cheers,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] [slurm-dev] slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
Hi Ralph,

On 12/08/13 06:17, Ralph Castain wrote:

> 1. Slurm has no direct knowledge or visibility into the
> application procs themselves when launched by mpirun. Slurm only
> sees the ORTE daemons. I'm sure that Slurm rolls up all the
> resources used by those daemons and their children, so the totals
> should include them
>
> 2. Since all Slurm can do is roll everything up, the resources
> shown in sacct will include those used by the daemons and mpirun as
> well as the application procs. Slurm doesn't include their daemons
> or the slurmctld in their accounting, so the two numbers will be
> significantly different. If you are attempting to limit overall
> resource usage, you may need to leave some slack for the daemons
> and mpirun.

Thanks for that explanation, makes a lot of sense.

In the end, due to time pressure, we decided to just do what we did
with Torque and patch Slurm to set RLIMIT_AS instead of RLIMIT_DATA for
jobs, so no single sub-process can request more RAM than the job has
asked for. Works nicely and our users are used to it from Torque; we've
not hit any issues with it so far.

In the long term I suspect the jobacct_gather/cgroup plugin will give
better numbers once it's had more work.

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
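For anyone curious, the guts of that approach are just a different
setrlimit() call made before the job's processes start; a minimal
sketch of the idea (not the actual Slurm or Torque patch, and the 1 GB
value is a made-up example - in practice it comes from the job's
memory request):

#include <sys/resource.h>
#include <stdio.h>

/* Cap the total address space rather than just the data segment, so
 * no single process can map more memory than the job requested. */
static int set_job_mem_limit(rlim_t bytes)
{
    struct rlimit rl;

    rl.rlim_cur = bytes;
    rl.rlim_max = bytes;
    return setrlimit(RLIMIT_AS, &rl);   /* rather than RLIMIT_DATA */
}

int main(void)
{
    if (set_job_mem_limit(1024UL * 1024UL * 1024UL) != 0)
        perror("setrlimit");
    return 0;
}

RLIMIT_DATA only constrains brk()/sbrk() growth, which modern
allocators largely bypass via mmap(); RLIMIT_AS catches those mappings
too, which is why the swap helps.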
Re: [OMPI devel] [slurm-dev] slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 07/08/13 16:19, Christopher Samuel wrote:

> Anyone seen anything similar, or any ideas on what could be going
> on?

Sorry, this was with:

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

Since those initial tests we've started enforcing memory limits (the
system is not yet in full production) and found that this causes jobs
to get killed. We tried the cgroups gathering method, but jobs still
die with mpirun and now the numbers don't seem to be right for either
mpirun or srun:

mpirun (killed):

[samuel@barcoo-test Mem]$ sacct -j 94564 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94564
94564.batch     523362K          0
94564.0         394525K          0

srun:

[samuel@barcoo-test Mem]$ sacct -j 94565 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94565
94565.batch        998K          0
94565.0          88663K          0

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Memory accounting issues with mpirun (was Re: [slurm-dev] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 07/08/13 16:18, Christopher Samuel wrote:

> Anyone seen anything similar, or any ideas on what could be going
> on?

Apologies, forgot to mention that Slurm is set up with:

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

We are testing with cgroups now.

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
[OMPI devel] Memory accounting issues with mpirun (was Re: [slurm-dev] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 23/07/13 17:06, Christopher Samuel wrote:

> Bringing up a new IBM SandyBridge cluster I'm running a NAMD test
> case and noticed that if I run it with srun rather than mpirun it
> goes over 20% slower.

Following on from this issue, we've found that whilst mpirun gives
acceptable performance the memory accounting doesn't appear to be
correct.

Anyone seen anything similar, or any ideas on what could be going on?

Here are two identical NAMD jobs running over 69 nodes using 16 cores
per node, this one launched with mpirun (Open-MPI 1.6.5):

==> slurm-94491.out <==
WallClock: 101.176193  CPUTime: 101.176193  Memory: 1268.554688 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94491 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94491
94491.batch    6504068K  11167820K
94491.0        5952048K   9028060K

This one launched with srun (about 60% slower):

==> slurm-94505.out <==
WallClock: 163.314163  CPUTime: 163.314163  Memory: 1253.511719 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94505 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94505
94505.batch       7248K   1582692K
94505.0        1022744K   1307112K

cheers!
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
On 24/07/13 09:42, Ralph Castain wrote:

> Not to 1.6 series, but it is in the about-to-be-released 1.7.3,
> and will be there from that point onwards.

Oh dear, I cannot delay this machine any more to change to 1.7.x. :-(

> Still waiting to see if it resolves the difference.

When I've got the current rush out of the way I'll try a private build
of 1.7 and see how that goes with NAMD.

cheers!
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun
On 23/07/13 19:34, Joshua Ladd wrote:

> Hi, Chris

Hi Joshua,

I've quoted you in full as I don't think your message made it through
to the slurm-dev list (at least I've not received it from there yet).

> Funny you should mention this now. We identified and diagnosed the
> issue some time ago as a combination of SLURM's PMI1
> implementation and some of, what I'll call, OMPI's topology
> requirements (probably not the right word.) Here's what is
> happening, in a nutshell, when you launch with srun:
>
> 1. Each process pushes his endpoint data up to the PMI "cloud" via
> PMI put (I think it's about five or six puts, bottom line, O(1).)
> 2. Then executes a PMI commit and PMI barrier to ensure all other
> processes have finished committing their data to the "cloud".
> 3. Subsequent to this, each process executes O(N) (N is the number
> of procs in the job) PMI gets in order to get all of the endpoint
> data for every process regardless of whether or not the process
> communicates with that endpoint.
>
> "We" (MLNX et al.) undertook an in-depth scaling study of this and
> identified several poorly scaling pieces with the worst offenders
> being:
>
> 1. PMI Barrier scales worse than linear.
> 2. At scale, the PMI get phase starts to look quadratic.
>
> The proposed solution that "we" (OMPI + SLURM) have come up with is
> to modify OMPI to support PMI2 and to use SLURM 2.6 which has
> support for PMI2 and is (allegedly) much more scalable than PMI1.
> Several folks in the combined communities are working hard, as we
> speak, trying to get this functional to see if it indeed makes a
> difference. Stay tuned, Chris. Hopefully we will have some data by
> the end of the week.

Wonderful, great to know that what we're seeing is actually real and
not just pilot error on our part!

We're happy enough to tell users to keep on using mpirun as they will
be used to from our other Intel systems and to only use srun if the
code requires it (one or two commercial apps that use Intel MPI).

Can I ask, if the PMI2 ideas work out is that likely to get backported
to OMPI 1.6.x?

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
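To make that scaling description concrete, here's a rough sketch of the
per-rank exchange Joshua describes, using the classic PMI1 API (the key
names and payload are made up for illustration; real endpoint data is
opaque to PMI):

#include <pmi.h>
#include <stdio.h>

/* A handful of puts per rank, then O(N) gets - so the get phase is
 * quadratic in aggregate across the whole job. */
void exchange_endpoints(const char *kvsname, int rank, int nprocs)
{
    char key[64], val[256];

    /* 1. publish our own endpoint data (O(1) puts) */
    snprintf(key, sizeof(key), "ep-%d", rank);   /* illustrative key */
    snprintf(val, sizeof(val), "endpoint-data-for-rank-%d", rank);
    PMI_KVS_Put(kvsname, key, val);

    /* 2. make it visible, then wait for everyone else */
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();

    /* 3. fetch every other rank's endpoint, needed or not: O(N) gets */
    for (int peer = 0; peer < nprocs; peer++) {
        snprintf(key, sizeof(key), "ep-%d", peer);
        PMI_KVS_Get(kvsname, key, val, sizeof(val));
    }
}

int main(void)
{
    int spawned, rank, nprocs;
    char kvsname[256];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&nprocs);
    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

    exchange_endpoints(kvsname, rank, nprocs);

    PMI_Finalize();
    return 0;
}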
Re: [OMPI devel] Any plans to support Intel MIC (Xeon Phi) in Open-MPI?
On 03/05/13 10:47, Ralph Castain wrote:

> We had something similar at one time - I developed it for the
> Roadrunner cluster so you could run MPI tasks on the GPUs. Worked
> well, but eventually fell into disrepair due to lack of use.

OK, interesting! RR was Cell rather than GPU though, wasn't it?

> In this case, I suspect it will be much easier to do as the Phis
> appear to be a lot more visible to the host than the GPU did on RR.
> Looking at the documentation, the Phis just sit directly on the
> PCIe bus, so they should look just like any other processor,

Yup, they show up in lspci:

[root@barcoo061 ~]# lspci -d 8086:2250
2a:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
90:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)

> and they are Xeon binary compatible - so there is no issue with
> tracking which binary to run on which processor.

Sadly they're not binary compatible, you have to cross-compile for them
(or compile on the Phi itself). I haven't got any further than having
xCAT install the (rebuilt) kernel module so far, so I can't log into
them yet.

> Brice: do the Phis appear in the hwloc topology object?

They appear in lstopo as mic0 and mic1.

> Chris: can you run lstopo on one of the nodes and send me the
> output (off-list)?

One of the hosts? Not a problem, will do.

All the best!
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
[OMPI devel] Any plans to support Intel MIC (Xeon Phi) in Open-MPI?
Hi folks,

The new system we're bringing up has 10 nodes with dual Xeon Phi MIC
cards; are there any plans to support them by launching MPI tasks
directly on the Phis themselves (rather than just as offload devices
for code on the hosts)?

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [OMPI devel] Choosing an Open-MPI release for a new cluster
Hi Ralph, Jeff, Paul,

On 02/05/13 14:14, Ralph Castain wrote:

> Depends on what you think you might want, and how tolerant you and
> your users are about bugs.
>
> The 1.6 series is clearly more mature and stable. It has nearly
> all the MPI-2 stuff now, but no MPI-3.

Great, thanks!

> If you think there is something in MPI-3 you might want, then the
> 1.7 series could be the way to go - though you'll have to suffer
> thru its growing pains. [...]

Well our users are life sciences researchers and as a result very few
of them are developers; they are mostly using applications we build for
them on request (or Java and the occasional commercial package).

So from the sound of it 1.6 is the way to go, and if we ever hit
something that needs MPI-3 then we'll install that in parallel but
leave the default at 1.6.

Thanks so much to you all!

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
[OMPI devel] Choosing an Open-MPI release for a new cluster
Hi folks,

We're about to bring up a new cluster (IBM iDataplex with SandyBridge
CPUs, including 10 nodes with two Intel Xeon Phi cards) and I'm at the
stage where we need to pick an OMPI release to put on.

Given that this system is at the start of its life, whatever we pick
now is likely to be baked in for the next 4 years or so (with OMPI
point release updates of course), and so I'm thinking that I should be
going with the 1.7.x release rather than the 1.6.x one.

For comparison, the Nehalem iDP this is going in next to is still at
1.4.x; it wouldn't be worth the effort to take it to a later release
given it has probably only another 18 months of life left.

However, not having been able to keep up with this list for some time,
I'd like to throw myself on your tender mercies for advice on whether
that's a good plan or not!

Thoughts please?

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [hwloc-devel] [hwloc-svn] svn:hwloc r5324 - branches/libpciaccess/doc
On 17/02/13 01:22, Jeff Squyres (jsquyres) wrote:

> No, it's not. RHEL6, for example, does have libpciaccess, but
> does not have a libpciaccess-dev (or devel). Ergo, you have to
> get it externally.

It's in the server-optional repo:

[root@imgr ~]# yum info libpciaccess-devel
Loaded plugins: downloadonly, etckeeper, product-id, rhnplugin, security
Available Packages
Name        : libpciaccess-devel
Arch        : i686
Version     : 0.12.1
Release     : 1.el6
Size        : 11 k
Repo        : rhel-x86_64-server-optional-6
Summary     : PCI access library development package
License     : MIT
Description : Development package for libpciaccess.

Name        : libpciaccess-devel
Arch        : x86_64
Version     : 0.12.1
Release     : 1.el6
Size        : 11 k
Repo        : rhel-x86_64-server-optional-6
Summary     : PCI access library development package
License     : MIT
Description : Development package for libpciaccess.

If they're not enabled then you won't see it.

cheers,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [hwloc-devel] libpci: GPL
/*
 * Trying to catch up with email, but I've not seen the question of
 * whether or not linking proprietary->BSD->GPL was OK or not addressed
 * yet.
 */

On 06/02/13 08:50, Jeff Squyres (jsquyres) wrote:

> It was just pointed out to me that libpci is licensed under the GPL
> (not the LGPL).
>
> Hence, even though hwloc is BSD, if it links to libpci.*, it's
> tainted.

I wouldn't say hwloc is tainted, more that you were tainting the GPL'd
code by linking the proprietary code to it, but that's just a case of
perspective. ;-)

After a brief search of the GPL FAQs I'd say the closest I can get is:

http://www.gnu.org/licenses/old-licenses/gpl-2.0-faq.html#GPLWrapper

# I'd like to incorporate GPL-covered software in my proprietary
# system. Can I do this by putting a "wrapper" module, under a
# GPL-compatible lax permissive license (such as the X11 license)
# in between the GPL-covered part and the proprietary part?
#
# No. The X11 license is compatible with the GPL, so you can add a
# module to the GPL-covered program and put it under the X11 license.
# But if you were to incorporate them both in a larger program, that
# whole would include the GPL-covered part, so it would have to be
# licensed as a whole under the GNU GPL.
#
# The fact that proprietary module A communicates with GPL-covered
# module C only through X11-licensed module B is legally irrelevant;
# what matters is the fact that module C is included in the whole.

So yes, if you want to permit proprietary code to link to hwloc then
you need to stick to permissive licenses in hwloc's dependencies.

Disclaimer: IANAL, I don't play a lawyer on TV (or the Internet),
batteries not included, caveat emptor, dates in calendar are closer
than they appear, etc, etc, etc...

Of course it might be possible to ask the pciutils maintainer to split
out libpci from pciutils and LGPL it.

Interestingly, Steam for Linux appears to have linked to libpci..

http://steamcommunity.com/app/221410/discussions/1/846938351130480716/

cheers!
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
[OMPI devel] CRIU checkpoint support in Open-MPI?
Hi folks,

I don't know if people have seen that the Linux kernel community is
following its own different checkpoint/restart path to those currently
supported by OMPI, namely that of the OpenVZ developers'
"checkpoint/restore in user space" project (CRIU).

You can read more about its current state here:

https://lwn.net/Articles/525675/

The CRIU website is here:

http://criu.org/

CRIU will also be up for discussion at LCA2013 in Canberra this year
(though I won't be there):

http://linux.conf.au/schedule/30116/view_talk?day=thursday

Is there interest from OMPI in supporting this, given it looks like
it's quite likely to make it into the mainline kernel? Or is it better
to wait for it to be merged, and then take a look?

All the best,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [hwloc-devel] Cgroup resource limits
On 03/11/12 09:05, Ralph Castain wrote:

> System resource managers don't usually provide this capability, so we
> will do it at the ORTE level.

Interestingly one of the Torque developers posted this overnight:

http://www.supercluster.org/pipermail/torqueusers/2012-November/015183.html

# We are interested in incorporated cgroups into TORQUE. One
# of the things that is delaying it is that we haven't found
# a good library to manage the cgroups - it is obviously a much
# larger project if we have to write such a library ourselves,
# and also much harder to maintain. Does anyone know of a good
# library for cgroups?

So I've pointed them at this thread and strongly encouraged them to get
involved.

cheers,
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [hwloc-devel] Cgroup resource limits
On 06/11/12 13:01, Ralph Castain wrote:

> Depends on the use-case. If you are going to direct-launch the
> processes (e.g., using srun), then you are correct.

Yup.

> However, that isn't the case in other scenarios. For example, if
> you get an allocation and then use mpirun to launch your job, you
> definitely do *not* want the RM setting the cgroup constraints as
> the RM only launches the orteds - it never sees the MPI procs. The
> constraints are to apply to the individual procs as separate
> entities - if you apply them to the orteds, then all procs will be
> constrained to the same container. Ick.

That's not been my experience recently; for instance Torque currently
creates a cpuset for your job containing all the procs you've been
given there, and then you can use mpirun/mpiexec to launch orted across
all the nodes you've been given. Those processes are then constrained
to the allocation set up on each node. They are free to bind themselves
to the cores present within that cpuset should they so wish.

In the very beginning (when I was at VPAC, and when we were using
MVAPICH2 rather than Open-MPI) Torque would bind processes to a core
within the allocation, which worked fine for that but of course broke
in the way you explain when we moved to Open-MPI. I fixed that bug up
very quickly.. ;-)

We've only ever run Slurm on BlueGene where this isn't an issue, so I
don't know if that does things differently.

> Similarly, if you are running MapReduce, your application has to
> figure out what nodes to run on, how much memory will be required,
> etc. All that goes into the allocation request (made by the
> equivalent of mpirun in that scenario) sent to the RM. Again, the
> orteds need to set those constraints on a per-process basis.

But for the scheduler to be able to plan workload well, I believe that
once your job has started the best you can do is ask for less than you
have been given; otherwise you're free to game the system by queuing a
short small job and, once it's started, asking for many more cores or
RAM.. :-)

> So we need the capability in ORTE to support the non-direct-launch
> cases.

I'm pretty sure we're agreeing here, just in different ways of
expressing ourselves.. :-)

cheers!
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Re: [hwloc-devel] Cgroup resource limits
On 06/11/12 01:43, Ralph Castain wrote:

> On Nov 4, 2012, at 7:28 PM, Christopher Samuel
> <sam...@unimelb.edu.au> wrote:
>
>> I would argue that the resource managers *should* be doing it
>
> No argument from me - I would love for them to provide me with an
> easy API that mpirun can use to specify the requirements for a
> given application.

Wouldn't it be the other way around, with the resource manager setting limits and then having the job run inside them? Basically like the current cpuset support in Torque et al., but on steroids. That way mpirun and/or orted could learn from the kernel the details of the cgroup it is in and arrange itself appropriately (see the sketch after this message).

I believe that Slurm has some support for cgroups already:

http://www.schedmd.com/slurmdocs/cgroups.html

[memcg performance]

> Yick! However, I would expect the community to reduce that impact
> over time. If systems don't want that capability, then they can
> and should disable it. On the other hand, if they do want it, then
> we want to support it.

Indeed!

cheers,
Chris
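To make "learn from the kernel" concrete: the kernel already reports a process's cgroup membership through /proc, so the discovery side needs nothing more exotic than this sketch (v1-era hierarchy-ID:controllers:path format assumed):

    #include <stdio.h>

    /* Sketch: print the cgroups the current process belongs to,
     * straight from the kernel (one line per hierarchy). */
    int main(void)
    {
        char line[512];
        FILE *f = fopen("/proc/self/cgroup", "r");

        if (!f) {
            perror("/proc/self/cgroup");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }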
Re: [hwloc-devel] Cgroup resource limits
On 03/11/12 09:05, Ralph Castain wrote:

> System resource managers don't usually provide this capability, so
> we will do it at the ORTE level.

I would argue that the resource managers *should* be doing it - however, I will also argue that they should be doing it via hwloc (so I'm afraid it's not an out for you folks :-) ).

It's also worth remembering that the memcg code has an appalling reputation with the kernel developers in terms of performance overhead; for instance, at the recent Kernel Summit numbers were reported showing a substantial impact from just having the code present but unused. Following that, a patch set was sent out trying to avoid that impact when memcg is not in use - which doesn't help here, but does give a measure of the performance hit:

http://lwn.net/Articles/517562/

# So as one can see, the difference between base and nomemcg in terms
# of both system time and elapsed time is quite drastic, and consistent
# with the figures shown by Mel Gorman in the Kernel summit. This is a
# ~7 % drop in performance, just by having memcg enabled. memcg
# functions appear heavily in the profiles, even if all tasks lives in
# the root memcg.

cheers,
Chris
Re: [hwloc-devel] backends and plugins
On 07/08/12 19:02, Brice Goglin wrote:

> Aside from the main "discover" callback, backends may also define
> some callbacks to be invoked when new object are created. The main
> example is Linux creating "OS devices" when a new "PCI device" is
> added by the PCI backend.

That could also be useful to some folks for non-PCI devices, say if a CPU gets hotplugged in/out (or, more likely, added/removed from a cpuset/cgroup you're in).
Re: [hwloc-devel] [PATCH] Use plain "inline" in C++
On 10/05/12 07:40, Jeff Squyres wrote:

> Huh -- really? I always thought that the C++ language itself
> included the keyword "inline".

I asked via Twitter and got these responses..

# Inline was part of C++98 - the first C++ standard, and
# the inline kwd is in the cfront 1.0 ('86) source. So
# functionally, yes.

...and...

# This may be a different question than "have all C++
# compilers always accepted inline?"

I note that autoconf has an inline test for C:

http://www.gnu.org/software/autoconf/manual/autoconf-2.67/html_node/C-Compiler.html

But not for C++:

http://www.gnu.org/savannah-checkouts/gnu/autoconf/manual/autoconf-2.69/html_node/C_002b_002b-Compiler.html

So perhaps the fact that they've never needed to implement such a test is in itself a good guide?

cheers,
Chris
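For what it's worth, the C-side probe boils down to checking that something like the following compiles (a sketch of the idea only; if memory serves, AC_C_INLINE also tries __inline__ and __inline as fallbacks for pre-C99 compilers):

    /* Does the compiler accept the "inline" qualifier at all?  If this
     * translation unit compiles, the answer is yes. */
    static inline int answer(void)
    {
        return 42;
    }

    int main(void)
    {
        return answer() == 42 ? 0 : 1;
    }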
Re: [hwloc-devel] lstopo-nox strikes back
On 26/04/12 02:35, Brice Goglin wrote:

> I think I would vote for lstopo (no X/cairo) and lstopo so
> that completion helps.

Not sure if that's an option with Debian given the policy: the hwloc package would have to ship lstopo with X enabled, and then a nox package would install the X-less variant and use the alternatives system to select which one is used.

cheers,
Chris
Re: [hwloc-devel] lstopo-nox strikes back
On 25/04/12 23:44, Jeffrey Squyres wrote:

> FWIW: Having lstopo plugins for output would obviate the need for
> having two executable names.

IIRC that's generally handled via the alternatives system (or diversions, if you don't like alternatives) in Debian/Ubuntu.
Re: [hwloc-devel] interoperability with X displays
On 30/03/12 01:08, Brice Goglin wrote:

> * The code uses NVIDIA's apparently-open-source nvctrl library. The
> lib is unfortunately only built as a static lib in at least debian
> and ubuntu (without -fPIC), which is annoying.

I don't see that reported as a bug in the BTS, so I'd suggest reporting it and seeing what happens:

http://bugs.debian.org/cgi-bin/pkgreport.cgi?src=nvidia-settings

cheers!
Chris
Re: [hwloc-devel] Fwd: BGQ empty topology with MPI
On 26/03/12 17:14, Brice Goglin wrote:

> Thanks, that would explain such a strange behavior.

Not a problem.

> For the record, you can run "lstopo -v" or even "lstopo -.xml" to
> get more info, especially machine attributes.

OK, please find attached both the lstopo -v output (with debug enabled) and the XML file requested. This is BG/P, not BG/Q, of course!

cheers!
Chris

could not open /proc/cpuinfo
* CPU cpusets *
cpu 0 (os 0) has cpuset 0x0001
cpu 1 (os 1) has cpuset 0x0002
cpu 2 (os 2) has cpuset 0x0004
cpu 3 (os 3) has cpuset 0x0008
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0xf...f complete 0x000f online 0xf...f allowed 0xf...f nodeset 0x0 completeN 0x0 allowedN 0xf...f
  PU#0 cpuset 0x0001
  PU#1 cpuset 0x0002
  PU#2 cpuset 0x0004
  PU#3 cpuset 0x0008
Restrict topology cpusets to existing PU and NODE objects
Propagate offline and disallowed cpus down and up
Propagate nodesets
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008
Removing unauthorized and offline cpusets from all cpusets
Removing disallowed memory according to nodesets
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008
Removing ignored objects
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008
Removing empty objects except numa nodes and PCI devices
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008
Removing objects whose type has HWLOC_IGNORE_TYPE_KEEP_STRUCTURE and have only one child or are the only child
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008
Add default object sets
Ok, finished tweaking, now connect
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f nodeset 0xf...f completeN 0xf...f allowedN
Re: [hwloc-devel] Fwd: BGQ empty topology with MPI
On 25/03/12 17:43, Brice Goglin wrote:

> But it'd be good to understand what's going on in /sys on this
> machine. And I still don't understand why MPI changes things here.

My guess (looking at the BG/P CNK kernel code) is that /sys is not present on a BG/Q compute node, only on its I/O nodes (which run a Linux kernel), and so the code only picks it up when the I/O is being redirected via an I/O node (i.e. when MPI is in play). Now I'd have thought that would happen with or without MPI, but who knows..

cheers,
Chris
Re: [hwloc-devel] Fwd: BGQ empty topology with MPI
On 25/03/12 09:04, Daniel Ibanez wrote:

> Additional printfs confirm that with MPI in the code,
> hwloc_accessat succeeds on the various /sys/ directories, but the
> overall procedure for getting PUs from these fails. Without MPI,
> access to /sys/ directories fails but the fallback
> hwloc_setup_pu_level works.

Sounds like with MPI your I/O is being redirected to the I/O node (and hence finding /sys from the Linux kernel there), whereas without MPI it's trying to open files on the compute node, where the CNK isn't presenting the /sys directories, causing it to fall back.

I've run lstopo on our BG/P and I see the 4 cores there whether it's the stock code or I add an MPI_Init() to the start. The output from lstopo when built with --enable-debug confirms it's reporting kernel and hostname info from the I/O node associated with the block:

Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n04.pcf.vlsci.unimelb.edu.au Architecture=BGP)
[...]

It might be interesting to build something like ls with the BG/Q compilers to see if you can run it on a compute node and look at what /proc and /sys contain in each case.

cheers,
Chris
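Such a probe could be a single fork()/exec()-free binary that only reads directories - exactly the kind of I/O the CNK can function-ship to its I/O node. A minimal sketch (untested on BlueGene; "minils" is just an illustrative name):

    #include <dirent.h>
    #include <stdio.h>

    /* Sketch: a fork()/exec()-free stand-in for ls, usable on a
     * compute-node kernel that only forwards basic I/O calls.
     * Usage: minils /proc /sys */
    int main(int argc, char *argv[])
    {
        int i;

        for (i = 1; i < argc; i++) {
            DIR *d = opendir(argv[i]);
            struct dirent *e;

            if (!d) {
                perror(argv[i]);
                continue;
            }
            printf("%s:\n", argv[i]);
            while ((e = readdir(d)) != NULL)
                printf("  %s\n", e->d_name);
            closedir(d);
        }
        return 0;
    }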
Re: [hwloc-devel] PCI device name question
On 21/03/12 08:07, Brice Goglin wrote:

> New patch attached, it doesn't add port numbers for non-IB
> devices.

Extract from lstopo on an SGI XE270 box with a Mellanox dual-port IB card:

  PCIBridge
    PCI 15b3:673c
      Net L#2 "ib1"
      Net L#3 "ib0"
      OpenFabrics L#4 "mlx4_0"

Looks OK to me.
Re: [hwloc-devel] BGQ empty topology with MPI
On 22/03/12 20:58, Brice Goglin wrote:

> So there's something strange going on when MPI is added. Which MPI
> are using? Is this a derivative of MPICH that embeds hwloc? (MPICH
> >= 1.2.1 if I remember correctly)

Not sure about BG/Q, but BG/P uses code derived from MPICH2 according to:

http://wiki.bg.anl-external.org/index.php/Main_Page

Our BG/P seems to claim it's from MPICH2 1.1:

samuel@tambo:~> mpicc -v
mpicc for 1.1

cheers,
Chris
Re: [hwloc-devel] BGQ empty topology with MPI
On 22/03/12 01:08, Daniel Ibanez wrote:

> Attached is the stderr and stdout from lstopo compiled as you
> said.

Interesting, so it's not correctly detecting the topology: BG/Q has 16 compute cores, each with 4 hardware threads, but instead it's detecting all 64 hardware threads and treating them as cores, if I'm reading that right.

I was puzzled by the OS info output too, it says:

Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.32-220.el6.bgq110_20120104.ppc64 OSVersion=1 HostName=R00-ID-J04.i2b.cetus Architecture=) cpuset 0xf...f complete 0x,0x online 0xf...f allowed 0xf...f nodeset 0x0 completeN 0x0 allowedN 0xf...f

However, looking at the (open) source code for the CNK [1] (at least for BG/P), the uname info seems to be derived from the I/O nodes when it's running in CIOD mode, so I suspect that's what's happening here (it looks like a RHEL6-derived kernel from that).

> I can't run hwloc-gather-topology.sh on the compute nodes since its
> a script, but I can run it on the front end node.

For those unfamiliar with BlueGene (at least P, and I suspect the same is true for Q), this is because the CNK doesn't implement fork() or execve(); it's designed to start your code and just keep running it until it dies.

[1] - http://wiki.bg.anl-external.org/index.php/Cnk

cheers!
Chris
Re: [hwloc-devel] BGQ empty topology with MPI
On 21/03/12 13:37, Daniel Ibanez wrote:

> Please let me know if theres a hint of what could be causing it,
> where to post, and what info to provide.

Are you running Linux or CNK on the compute nodes for this?

cheers!
Chris
Re: [OMPI devel] [OMPI svn] svn:open-mpi r26077 (fwd)
On 02/03/12 02:56, Nathan Hjelm wrote:

> Found a pretty nasty frag leak (and a minor one) in ob1 (see
> commit below). If this fix addresses some hangs we are seeing on
> infiniband LANL might want a 1.4.6 rolled (or a faster rollout for
> 1.6.0).

What symptoms would an affected job show? Does it fail with an OMPI error or does it just hang using 0% CPU?

cheers,
Chris
Re: [OMPI devel] Open MPI nightly tarballs suspended / 1.5.5rc3
On 29/02/12 07:44, Jeffrey Squyres wrote:

> - BlueGene fixes

rc3 fixes the builds on our front end node, thanks!
Re: [OMPI devel] poor btl sm latency
On 13/02/12 22:11, Matthias Jurenz wrote:

> Do you have any idea? Please help!

Do you see the same bad latency in the old branch (1.4.5)?

cheers,
Chris
Re: [OMPI devel] 1.5.5rc2
On 24/02/12 15:12, Christopher Samuel wrote:

> I suspect this is irrelevant, but I got a build failure trying to
> compile it on our BG/P front end node (login node) with the IBM XL
> compilers.

Oops, forgot how I built it..

export PATH=/opt/ibmcmp/vac/bg/9.0/bin/:/opt/ibmcmp/vacpp/bg/9.0/bin:/opt/ibmcmp/xlf/bg/11.1/bin:$PATH
CC=xlc CXX=xlC F77=xlf ./configure && make