Re: [OMPI devel] Open-MPI backwards compatibility and library version changes

2018-11-18 Thread Christopher Samuel
Hi Brian,

On 17/11/18 5:13 am, Barrett, Brian via devel wrote:

> Unfortunately, I don’t have a good idea of what to do now. We already 
> did the damage on the 3.x series. Our backwards compatibility testing 
> (as lame as it is) just links libmpi, so it’s all good. But if anyone 
> uses libtool, we’ll have a problem, because we install the .la files 
> that allow libtool to see the dependency of libmpi on libopen-pal, and 
> it gets too excited.
> 
> We’ll need to talk about how we think about this change in the future.

Thanks for that - personally I think it's a misfeature in libtool to add 
these extra dependencies; it would be handy if there were a way to turn 
it off - but that's not your problem.
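
In case it helps anyone else who trips over this: libtool picks those 
extra dependencies up from the dependency_libs= line in the installed 
.la files, so one workaround (a sketch only - I believe this is what 
several distros do, but do test it before relying on it) is to delete 
the .la files from the Open-MPI install tree:

# Sketch: assumes our install prefix, adjust to taste.  Without the
# .la files libtool cannot expand dependency_libs, so only the
# libraries actually named on the link line (-lmpi) get recorded as
# DT_NEEDED in the resulting shared object.
find /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib \
    -name '*.la' -delete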

For us it just means that when we bring in a new Open-MPI we need to 
build new versions of our installed libraries and codes against it; 
fortunately that's something that EasyBuild makes (relatively) easy.

Thanks for your time everyone - this is my last week at Swinburne before 
I leave Australia to start at NERSC in December!

All the best,
Chris
-- 
  Christopher Samuel OzGrav Senior Data Science Support
  ARC Centre of Excellence for Gravitational Wave Discovery
  http://www.ozgrav.org/  http://twitter.com/ozgrav

Re: [OMPI devel] Open-MPI backwards compatibility and library version changes

2018-11-14 Thread Christopher Samuel
On 15/11/18 12:10 pm, Christopher Samuel wrote:

> I wonder if it's because they use libtool instead?

Yup, it's libtool - using it to compile my toy example shows the same
behaviour, with "readelf -d" showing the private libraries pulled in
directly. :-(

[csamuel@farnarkle2 libtool]$ cat hhgttg.c
int answer(void)
{
return(42);
}


[csamuel@farnarkle2 libtool]$ libtool compile gcc hhgttg.c -c -o hhgttg.o
libtool: compile:  gcc hhgttg.c -c  -fPIC -DPIC -o .libs/hhgttg.o
libtool: compile:  gcc hhgttg.c -c -o hhgttg.o >/dev/null 2>&1


[csamuel@farnarkle2 libtool]$ libtool link gcc -o libhhgttg.la hhgttg.lo -lmpi -rpath /usr/local/lib
libtool: link: gcc -shared  -fPIC -DPIC  .libs/hhgttg.o -Wl,-rpath -Wl,/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib -Wl,-rpath -Wl,/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libmpi.so -L/apps/skylake/software/core/gcccore/6.4.0/lib64 -L/apps/skylake/software/core/gcccore/6.4.0/lib -L/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-rte.so /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-pal.so -ldl -lrt -lutil -lm -lpthread -lz -lhwloc -Wl,-soname -Wl,libhhgttg.so.0 -o .libs/libhhgttg.so.0.0.0
libtool: link: (cd ".libs" && rm -f "libhhgttg.so.0" && ln -s "libhhgttg.so.0.0.0" "libhhgttg.so.0")
libtool: link: (cd ".libs" && rm -f "libhhgttg.so" && ln -s "libhhgttg.so.0.0.0" "libhhgttg.so")
libtool: link: ar cru .libs/libhhgttg.a  hhgttg.o
libtool: link: ranlib .libs/libhhgttg.a
libtool: link: ( cd ".libs" && rm -f "libhhgttg.la" && ln -s "../libhhgttg.la" "libhhgttg.la" )


[csamuel@farnarkle2 libtool]$ readelf -d .libs/libhhgttg.so.0 | fgrep -i lib
 0x0000000000000001 (NEEDED)  Shared library: [libmpi.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libopen-rte.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libopen-pal.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)  Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)  Shared library: [libutil.so.1]
 0x0000000000000001 (NEEDED)  Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)  Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)  Shared library: [libz.so.1]
 0x0000000000000001 (NEEDED)  Shared library: [libhwloc.so.5]
 0x0000000000000001 (NEEDED)  Shared library: [libc.so.6]
 0x000000000000000e (SONAME)  Library soname: [libhhgttg.so.0]
 0x000000000000001d (RUNPATH) Library runpath: [/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib]


All the best,
Chris
-- 
  Christopher Samuel OzGrav Senior Data Science Support
  ARC Centre of Excellence for Gravitational Wave Discovery
  http://www.ozgrav.org/  http://twitter.com/ozgrav



Re: [OMPI devel] Open-MPI backwards compatibility and library version changes

2018-11-14 Thread Christopher Samuel
On 15/11/18 11:45 am, Christopher Samuel wrote:

> Unfortunately that's not the case, just creating a shared library
> that only links in libmpi.so will create dependencies on the private
> libraries too in the final shared library. :-(

Hmm, I might be misinterpreting the output of "ldd" - it looks like it
reports the dependencies of dependencies, not just the direct
dependencies.  "readelf -d" seems more reliable.

[csamuel@farnarkle2 libtool]$ readelf -d libhhgttg.so.1 | fgrep -i lib
 0x0000000000000001 (NEEDED)  Shared library: [libmpi.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libc.so.6]
 0x000000000000000e (SONAME)  Library soname: [libhhgttg.so.1]

Whereas the HDF5 libraries really do have them listed as a dependency.

[csamuel@farnarkle2 1.10.1]$ readelf -d ./lib/libhdf5_fortran.so.100 | fgrep -i lib
 0x0000000000000001 (NEEDED)  Shared library: [libhdf5.so.101]
 0x0000000000000001 (NEEDED)  Shared library: [libsz.so.2]
 0x0000000000000001 (NEEDED)  Shared library: [libmpi_usempif08.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libmpi_usempi_ignore_tkr.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libmpi_mpifh.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libmpi.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libopen-rte.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libopen-pal.so.40]
 0x0000000000000001 (NEEDED)  Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)  Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)  Shared library: [libutil.so.1]
 0x0000000000000001 (NEEDED)  Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)  Shared library: [libz.so.1]
 0x0000000000000001 (NEEDED)  Shared library: [libhwloc.so.5]
 0x0000000000000001 (NEEDED)  Shared library: [libgfortran.so.3]
 0x0000000000000001 (NEEDED)  Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)  Shared library: [libquadmath.so.0]
 0x0000000000000001 (NEEDED)  Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)  Shared library: [libgcc_s.so.1]
 0x000000000000000e (SONAME)  Library soname: [libhdf5_fortran.so.100]
 0x000000000000001d (RUNPATH) Library runpath: [/apps/skylake/software/mpi/gcc/6.4.0/openmpi/3.0.0/hdf5/1.10.1/lib:/apps/skylake/software/core/szip/2.1.1/lib:/apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib:/apps/skylake/software/core/gcccore/6.4.0/lib/../lib64]

I wonder if it's because they use libtool instead?

All the best,
Chris
-- 
  Christopher Samuel OzGrav Senior Data Science Support
  ARC Centre of Excellence for Gravitational Wave Discovery
  http://www.ozgrav.org/  http://twitter.com/ozgrav



Re: [OMPI devel] Open-MPI backwards compatibility and library version changes

2018-11-14 Thread Christopher Samuel
On 15/11/18 2:16 am, Barrett, Brian via devel wrote:

> In practice, this should not be a problem. The wrapper compilers (and
>  our instructions for linking when not using the wrapper compilers)
> only link against libmpi.so (or a set of libraries if using Fortran),
> as libmpi.so contains the public interface. libmpi.so has a
> dependency on libopen-pal.so so the loader will load the version of
> libopen-pal.so that matches the version of Open MPI used to build
> libmpi.so However, if someone explicitly links against libopen-pal.so
> you end up where we are today.

Unfortunately that's not the case, just creating a shared library
that only links in libmpi.so will create dependencies on the private
libraries too in the final shared library. :-(

Here's a toy example to illustrate that.

[csamuel@farnarkle2 libtool]$ cat hhgttg.c
int answer(void)
{
return(42);
}

[csamuel@farnarkle2 libtool]$ gcc hhgttg.c -c -o hhgttg.o

[csamuel@farnarkle2 libtool]$ gcc -shared -Wl,-soname,libhhgttg.so.1 -o libhhgttg.so.1 hhgttg.o -lmpi

[csamuel@farnarkle2 libtool]$ ldd libhhgttg.so.1
        linux-vdso.so.1 =>  (0x00007ffc625b3000)
        libmpi.so.40 => /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libmpi.so.40 (0x00007f018a582000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f018a09e000)
        libopen-rte.so.40 => /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-rte.so.40 (0x00007f018a4b5000)
        libopen-pal.so.40 => /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-pal.so.40 (0x00007f0189fde000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f0189dda000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f0189bd2000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f01899cf000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f01896cd000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f01894b1000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f018929b000)
        libhwloc.so.5 => /lib64/libhwloc.so.5 (0x00007f018905e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f018a46b000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f0188e52000)
        libltdl.so.7 => /lib64/libltdl.so.7 (0x00007f0188c48000)
        libgcc_s.so.1 => /apps/skylake/software/core/gcccore/6.4.0/lib64/libgcc_s.so.1 (0x00007f018a499000)


All the best,
Chris
-- 
  Christopher Samuel OzGrav Senior Data Science Support
  ARC Centre of Excellence for Gravitational Wave Discovery
  http://www.ozgrav.org/  http://twitter.com/ozgrav


[OMPI devel] Open-MPI backwards compatibility and library version changes

2018-11-14 Thread Christopher Samuel
Hi folks,

Just resub'd after a long time to ask a question about binary/backwards 
compatibility.

We got bitten when upgrading from 3.0.0 to 3.0.3, which we assumed would be 
binary compatible, and so (after some testing to confirm it was) we replaced 
our existing 3.0.0 install with the 3.0.3 one (because we're using 
hierarchical namespaces in Lmod it meant we avoided needing to recompile 
everything we'd already built over the last 12 months with 3.0.0).

However, once we'd done that we heard from a user that their code would no 
longer run because it couldn't find libopen-pal.so.40, and we saw that 
3.0.3 instead had libopen-pal.so.42.

Initially we thought this was some odd build system problem, but on 
digging further we realised that they were linking against libraries 
(HDF5) that in turn were built against Open-MPI, and those had embedded 
the libopen-pal.so.40 name.

Of course our testing hadn't found that because we weren't linking against 
anything like those for our MPI tests. :-(

But I was really surprised to see these version numbers changing - I 
thought the idea was to keep things backwards compatible within these 
series?

Now fortunately our reason for doing the forced upgrade (we found our 3.0.0 
didn't work with our upgrade to Slurm 18.08.3) turned out to be us missing 
one combination in our testing whilst fault-finding; having gotten that 
going we've been able to drop back to the original 3.0.0 and fix it for 
them.

But is this something that you folks have come across before?

All the best,
Chris
-- 
  Christopher Samuel OzGrav Senior Data Science Support
  ARC Centre of Excellence for Gravitational Wave Discovery
  http://www.ozgrav.org/  http://twitter.com/ozgrav





Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Christopher Samuel
On 08/11/17 12:30, Kawashima, Takahiro wrote:

> As other people said, Fujitsu MPI used in K is based on old
> Open MPI (v1.6.3 with bug fixes). 

I guess the obvious question is: will the vanilla Open-MPI work on K?

-- 
 Christopher SamuelSenior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545



Re: [OMPI devel] Open-MPI killing nodes with mlx5 drivers?

2017-11-05 Thread Christopher Samuel
On 30/10/17 14:07, Christopher Samuel wrote:

> We have an issue where codes compiled with Open-MPI kill nodes with
> ConnectX-4 and ConnectX-5 cards connected to Mellanox Ethernet switches
> using the mlx5 driver from the latest Mellanox OFED

For the record, this crash is fixed in Mellanox OFED 4.2, which came out
after I wrote that.

-- 
 Christopher SamuelSenior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545



[OMPI devel] Open-MPI killing nodes with mlx5 drivers?

2017-10-29 Thread Christopher Samuel
Hi folks,

Trying the devel list to see if folks here have hit this issue in their
testing, as I suspect it's not something many users will have access
to yet.

We have an issue where codes compiled with Open-MPI kill nodes with
ConnectX-4 and ConnectX-5 cards connected to Mellanox Ethernet switches
using the mlx5 driver from the latest Mellanox OFED: the kernel hangs
with no oops (or any other error) and we have to power cycle the node to
get it back.

This happens with even a singleton (no srun or mpirun) and from what I
can see from strace before the node hangs Open-MPI is starting to probe
for what fabrics are available.

The folks I'm helping have engaged Mellanox support but I was wondering
if anyone else had run across this?

Distro: RHEL 7.4 (x86-64)
Kernel: 4.12.9 (needed for the CephFS filesystem they use)
OFED: 4.1-1.0.2.0
Open-MPI: 1.10.x, 2.0.2, 3.0.0

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545



Re: [OMPI devel] Segfault on MPI init

2017-02-21 Thread Christopher Samuel
On 15/02/17 00:45, Gilles Gouaillardet wrote:

> i would expect orted generate a core, and then you can use gdb post
> mortem to get the stack trace.
> there should be several threads, so you can
> info threads
> bt
> you might have to switch to an other thread

You can also get a backtrace from all threads at once with:

thread apply all bt

It's not just limited to 'bt' either:

(gdb) help thread apply
Apply a command to a list of threads.

List of thread apply subcommands:

thread apply all -- Apply a command to all threads
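
If you're doing this often it also works non-interactively (a sketch -
the binary and core file names here are just placeholders):

# Post-mortem all-thread backtrace without an interactive session.
gdb -batch -ex 'thread apply all bt' ./orted core.12345 > backtrace.txt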


-- 
 Christopher SamuelSenior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545



Re: [OMPI devel] RFC: Rename nightly snapshot tarballs

2016-10-17 Thread Christopher Samuel
On 18/10/16 07:17, Jeff Squyres (jsquyres) wrote:

> NOTE: It may be desirable to add HHMM in there; it's not common, but
> *sometimes* we do make more than one snapshot in a day (e.g., if one
> snapshot is borked, so we fix it and then generate another
> snapshot).

If it's happened before then I'd suggest allowing for it to happen
again by adding HHMM.  Otherwise looks sensible to me (YMMV).

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Off-topic re: supporting old systems

2016-08-31 Thread Christopher Samuel
On 31/08/16 14:01, Paul Hargrove wrote:

> So, the sparc platform is a bit more orphaned than it already was when
> support stopped at Wheezy.

Ah sorry, I didn't realise you were on a non-LTS Wheezy architecture.

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Off-topic re: supporting old systems

2016-08-30 Thread Christopher Samuel
On 31/08/16 12:05, Paul Hargrove wrote:

> As Giles mentions the http: redirects to https: before anything is fetched.
> Replacing "-nv" in the wget command with "-v" shows that redirect clearly.

Agreed, but it still just works on Debian Wheezy for me. :-)

What does "apt-cache policy wget" say for you?

root@db3:/tmp# apt-cache policy wget
wget:
  Installed: 1.13.4-3+deb7u3
  Candidate: 1.13.4-3+deb7u3
[...]

Here's the plain wget, with redirect, don't even need to disable the
certificate check here on Debian Wheezy (though it still works if you do).

root@db3:/tmp# wget http://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
--2016-08-31 12:11:59--  http://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
Resolving www.open-mpi.org (www.open-mpi.org)... 192.185.39.252
Connecting to www.open-mpi.org (www.open-mpi.org)|192.185.39.252|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2 [following]
--2016-08-31 12:11:59--  https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
Connecting to www.open-mpi.org (www.open-mpi.org)|192.185.39.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8192091 (7.8M) [application/x-tar]
Saving to: `openmpi-2.0.1rc2.tar.bz2'

100%[======================================>] 8,192,091   1.75M/s   in 7.3s

2016-08-31 12:12:08 (1.07 MB/s) - `openmpi-2.0.1rc2.tar.bz2' saved [8192091/8192091]


All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Off-topic re: supporting old systems

2016-08-30 Thread Christopher Samuel
On 31/08/16 06:22, Paul Hargrove wrote:

> It seems that a stock Debian Wheezy system cannot even *download* Open
> MPI any more:

Works for me, both http (which shouldn't be using SSL anyway) and https.

Are you behind some weird intercepting proxy?

root@db3:/tmp# wget -nv --no-check-certificate http://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
2016-08-31 10:42:34 URL:https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2 [8192091/8192091] -> "openmpi-2.0.1rc2.tar.bz2" [1]

root@db3:/tmp# wget -nv --no-check-certificate https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2
2016-08-31 10:43:10 URL:https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1rc2.tar.bz2 [8192091/8192091] -> "openmpi-2.0.1rc2.tar.bz2.1" [1]

root@db3:/tmp# cat /etc/issue
Debian GNU/Linux 7 \n \l

cheers,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Migration of mailman mailing lists

2016-07-18 Thread Christopher Samuel
On 19/07/16 02:05, Brice Goglin wrote:

> Yes, kill all netloc lists.

Will the archives be preserved somewhere for historical reference?

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] [1.10.3rc4] testing results

2016-06-06 Thread Christopher Samuel
On 06/06/16 15:09, Larry Baker wrote:

> An impressive accomplishment by the development team.  And impressive
> coverage by Paul's testbed.  Well done!

Agreed, it is very impressive to watch both on the breaking & the fixing
side of things. :-)

Thanks so much to all involved with this.

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] RFC: Public Test Repo

2016-05-19 Thread Christopher Samuel
Hi Josh,

On 19/05/16 13:54, Josh Hursey wrote:

> Let me know what you think. Certainly everything here is open for
> discussion, and we will likely need to refine aspects as we go.

I think having an open test suite in conjunction with the current
private one is a great way to go; it sends the right message about
openness and hopefully allows a community to build around MPI testing
in general.

Certainly happy to try it out!

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Github pricing plan changes announced today

2016-05-17 Thread Christopher Samuel
On 18/05/16 09:59, Gilles Gouaillardet wrote:

> the (main) reason is none of us are lawyers and none of us know whether
> all test suites can be redistributed for general public use or not.

Thanks Gilles,

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Github pricing plan changes announced today

2016-05-17 Thread Christopher Samuel
On 12/05/16 06:21, Jeff Squyres (jsquyres) wrote:

> We basically have one important private repo (the tests repo).

Possibly a dumb question (sorry), but what's the reason for that repo
being private?

I ask as someone on the Beowulf list today was looking for an MPI
regression test tool and found MTT but commented:

# OpenMPI has the MPI Testing Tool which looks like it would work,
# but most of there tests seem private.

and so moved on to look at other options instead.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] [2.0.0rc2] xlc-13.1.0 ICE (hwloc)

2016-05-05 Thread Christopher Samuel
On 03/05/16 18:11, Paul Hargrove wrote:

> xlc-13.1.0 on Linux dies compiling the embedded hwloc in this rc
> (details below).

In case it's useful xlc 12.1.0.9-140729 (yay for BGQ living in the past)
doesn't ICE on RHEL6 on Power7.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

2016-03-03 Thread Christopher Samuel
Hi Gilles,

On 04/03/16 13:33, Gilles Gouaillardet wrote:

> there is clearly no hope when you use mpi.mod and mpi_f08.mod
> my point was, it is not even possible to expect "legacy" mpif.h work
> with different compilers.

Sorry, my knowledge of FORTRAN is limited to trying to debug why their
code wouldn't compile. :-)

Apologies for the noise.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

2016-03-03 Thread Christopher Samuel
On 04/03/16 12:17, Dave Turner wrote:

>  My understanding is that OpenMPI built with either Intel or
> GNU compilers should be able to use the other compilers using the
> OMPI_CC and OMPI_FC environmental variables.

Sadly not; we tried this, but when one of our very few FORTRAN users
(who happened to be our director) tried to use it, it failed because the
mpi.mod module created during the build is compiler-dependent. :-(

So ever since we've done separate builds for GCC and for Intel.
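
For anyone curious, the failure mode looks roughly like this
(illustrative only - the exact error text varies by compiler version):

# Open-MPI tree built with ifort, wrapper pointed at gfortran:
OMPI_FC=gfortran mpif90 hello_mpi.f90 -o hello_mpi
# Fails at the "use mpi" line, because gfortran cannot read the
# ifort-format mpi.mod generated when Open-MPI itself was built.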

All the best!
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] Problem with the 1.8.8 version

2015-12-06 Thread Christopher Samuel
On 05/12/15 01:52, Baldassari Caroline wrote:

> I have installed OpenMPI 1.8.8 (the last version 1.8.8 downloaded on
> your site)

v1.8 morphed into the v1.10 series, so I'd suggest trying that:

http://www.slideshare.net/jsquyres/open-mpi-new-version-number-scheme-and-roadmap

cheers!
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] Slides from the Open MPI SC'15 State of the Union BOF

2015-11-19 Thread Christopher Samuel
On 20/11/15 03:31, Dasari, Annapurna wrote:

> Jeff, could you check the link it didn't work for me..
> I tried to check out the slides by opening the link and downloading the
> file, I am getting a file damaged error on my system.

Not sure if Jeff has fixed them up since this, but they open fine for me
at the moment (using KDE's Okular PDF viewer).

Thanks for putting them up Jeff!

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] PMI2 in Slurm 14.11.8

2015-09-02 Thread Christopher Samuel
On 02/09/15 13:09, Christopher Samuel wrote:

> Instead PMI2 is in a contrib directory which appears to need manual
> intervention to install.

Confirming from the Slurm list that PMI2 is not built by default, it's
only the RPM build process that will include it without intervention.

Thanks for your help Ralph!

cheers,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



[OMPI devel] PMI2 in Slurm 14.11.8

2015-09-02 Thread Christopher Samuel
Hi all,

The OpenMPI FAQ says:

https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps

# Yes, if you have configured OMPI --with-pmi=foo, where foo is
# the path to the directory where pmi.h/pmi2.h is located.
# Slurm (> 2.6, > 14.03) installs PMI-2 support by default.

However, we've found on a new system we're bringing up that this doesn't
appear to be true for the vanilla Slurm 14.11.8 we're installing.

Instead PMI2 is in a contrib directory which appears to need manual
intervention to install.
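
For the archives, the manual step looks roughly like this (a sketch
against a vanilla Slurm source tree - paths are examples only):

# In an already-configured Slurm 14.11.x source tree:
make -C contribs/pmi2 install   # builds and installs the PMI2 library

# Then build Open-MPI against it, per the FAQ above:
./configure --with-pmi=/usr/local/slurm ...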

I've sent an email to the Slurm list to query this behaviour, but I was
wondering whether anyone here had run into this too?

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] 1.10.0rc6 - slightly different mx problem

2015-08-25 Thread Christopher Samuel
On 25/08/15 05:08, Jeff Squyres (jsquyres) wrote:

> FWIW, we have had verbal agreement in the past that the v1.8 series
> was the last one to contain MX support.  I think it would be fine for
> all MX-related components to disappear from v1.10.
> 
> Don't forget that Myricom as an HPC company no longer exists.

INRIA does have Open-MX (Myrinet Express over Generic Ethernet
Hardware), last release December 2014.  No idea if it's still developed
or used..

http://open-mx.gforge.inria.fr/

Brice?

Open-MPI is listed as working with it there. ;-)

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] 1.8.7rc1 testing results

2015-07-14 Thread Christopher Samuel
On 14/07/15 01:49, Ralph Castain wrote:

> Okay, 1.8.7rc3 (we already had an rc2) is now out with all these changes
> - please take one last look.

Looks OK for XRC here, thanks!

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] Error in ./configure for Yocto

2015-07-09 Thread Christopher Samuel
On 10/07/15 01:38, Jeff Squyres (jsquyres) wrote:

> Just curious -- what's Yocto?

It's a system for building embedded Linux distros:

https://www.yoctoproject.org/

Intel announced the switch to Yocto for their MPSS distro
for Xeon Phi a couple of years ago (v3 and later).

https://software.intel.com/en-us/articles/intelr-mpss-transition-to-yocto-faq

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] Proposal: update Open MPI's version number and release process

2015-05-20 Thread Christopher Samuel
On 20/05/15 14:37, Howard Pritchard wrote:

> It would also be easy to trap the I-want-to-bypass-PR-because-I
> know-what-I'm-doing-developer with a second level of protection.  Just
> set up a jenkins project that does a smoke test after every commit to
> master.  If the smoke test fails, send a naughty-gram to the committer
> and copy devel. Pretty soon the developer will get trained to use the PR
> process, unless they are that engineer I've yet to meet who always
> writes flawless code.

VMware used to have a bot that tweeted info about their testing,
including "$USER just broke the build at VMWare"; for example:

https://twitter.com/vmwarepbs/status/4634524702

:-)

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] Proposal: update Open MPI's version number and release process

2015-05-18 Thread Christopher Samuel
On 19/05/15 05:11, Jeff Squyres (jsquyres) wrote:

>  We've reached internal consensus, and would like to present this to the 
> larger community for feedback.

My gut feeling is that this is very good; from a cluster admin point of
view it means we keep a system tracking one level up from where we are
currently, i.e. at v4.x.x (for example) rather than v1.6.x or v1.8.x.

We've got a new system coming up in the next few months (fingers
crossed) and so it'll be interesting to see where we fall in terms of
the v1.10 or v2 releases but either way I see this as making our lives
easier.

Thanks!
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] Chris Yeoh

2015-04-30 Thread Christopher Samuel
On 01/05/15 00:41, Jeff Squyres (jsquyres) wrote:

> I am saddened to inform the Open MPI developer community of the death
> of Chris Yeoh.

There is page for donations to lung cancer research in his memory here
(Chris was not a smoker, but it still took his life):

http://participate.freetobreathe.org/site/TR?px=1582460_id=2710=personal#.VSscH5SUd90

# Chris never smoked, yet was taken too early by this dreadful
# disease. Lung cancer has the greatest kill rate with the
# smallest funding rate because of the stigma associated with
# it being a "smoker's disease", but anyone with lungs can get
# it. We hope that further funding to research will one day
# provide a cure and better trials for others.

Valē Chris.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] RFC: Remove embedded libltdl

2015-02-02 Thread Christopher Samuel
On 03/02/15 05:09, Ralph Castain wrote:

> Just out of curiosity: I see you are reporting about a build on the
> headnode of a BG cluster. We've never ported OMPI to BG - are you using
> it on such a system? Or were you just test building the code on a
> convenient server?

Just a convenient server with a not-so-mainstream architecture (and an
older RHEL release through necessity).  Sorry to get your hopes up! :-)

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Remove embedded libltdl

2015-02-01 Thread Christopher Samuel
On 31/01/15 10:51, Jeff Squyres (jsquyres) wrote:

> New tarball posted (same location).  Now featuring 100% fewer "make check" 
> failures.

On our BG/Q front-end node (PPC64, RHEL 6.4) I see:

../../config/test-driver: line 95: 30173 Segmentation fault  (core dumped) 
"$@" > $log_file 2>&1
FAIL: opal_lifo

Stack trace implies the culprit is in:

#0  0x10001048 in opal_atomic_swap_32 (addr=0x20, newval=1)
at 
/vlsci/VLSCI/samuel/tmp/OMPI/openmpi-gitclone/opal/include/opal/sys/atomic_impl.h:51
51  old = *addr;

I've attached a script of gdb doing "thread apply all bt full" in
case that's helpful.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci

Script started on Mon 02 Feb 2015 12:32:56 EST

[samuel@avoca class]$ gdb /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/test/class/.libs/lt-opal_lifo core.32444
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "ppc64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/test/class/.libs/lt-opal_lifo...done.
[New Thread 32465]
[New Thread 32464]
[New Thread 32466]
[New Thread 32444]
[New Thread 32469]
[New Thread 32467]
[New Thread 32470]
[New Thread 32463]
[New Thread 32468]
Missing separate debuginfo for /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/opal/.libs/libopen-pal.so.0
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/de/a09192aa84bbc15579ae5190dc8acd16eb94fe
Missing separate debuginfo for /usr/local/slurm/14.03.10/lib/libpmi.so.0
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/28/09dfc4706ed44259cc31a5898c8d1a9b76b949
Missing separate debuginfo for /usr/local/slurm/14.03.10/lib/libslurm.so.27
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/e2/39d8a2994ae061ab7ada0ebb7719b8efa5de96
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/1a/063e3d64bb5560021ec2ba5329fb1e420b470f
Reading symbols from /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/opal/.libs/libopen-pal.so.0...done.
Loaded symbols for /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/opal/.libs/libopen-pal.so.0
Reading symbols from /usr/local/slurm/14.03.10/lib/libpmi.so.0...done.
Loaded symbols for /usr/local/slurm/14.03.10/lib/libpmi.so.0
Reading symbols from /usr/local/slurm/14.03.10/lib/libslurm.so.27...done.
Loaded symbols for /usr/local/slurm/14.03.10/lib/libslurm.so.27
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld64.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld64.so.1
Core was generated by `/vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/test/class/.libs/lt-opal_lifo '.
Program terminated with signal 11, Segmentation fault.
#0  0x10001048 in opal_atomic_swap_32 (addr=0x20, newval=1)
at /vlsci/VLSCI/samuel/tmp/OMPI/openmpi-gitclone/opal/include/opal/sys/atomic_impl.h:51
51	old = *addr;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6_4.5.ppc64
(gdb) thread apply all bt full

Thread 9 (Thread 0xfff7a0ef200 (LWP 32468)):
#0  0x0080adb6629c in .__libc_write () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0fff7d6905b4 in show_stackframe (signo=11, info=0xfff7a0ee3d8, p=0xfff7a0edd00)
at /vlsci/VLSCI/samuel/tmp/OMPI/openmpi-gitclone/opal/util/stacktrace.c:81
print_buffer = "[avoca:32444] *** Process received signal ***\n", '\000' 
tmp = 0xfff7a0ed858 "[avoca:32444] *** Process received signal ***\n"
size = 1024
ret = 46
si_code_str

Re: [OMPI devel] mlx4 QP operation err

2015-01-28 Thread Christopher Samuel
Hi Dave,

On 29/01/15 11:31, Dave Turner wrote:

>   I've found some old references to similar mlx4 errors dating back to
> 2009 that lead me to believe this may be a firmware error.  I believe we're
> running the most up to date version of the firmware.

There was a new version released a few days ago, 2.33.5100:

http://www.mellanox.com/page/firmware_table_ConnectX3ProEN

Release notes are here:

http://www.mellanox.com/pdf/firmware/ConnectX3Pro-FW-2_33_5100-release_notes.pdf

Bug fixes start on page 23; it looks like there are 29 fixes
in this version, and fix 1 is for RoCE (though of course it may
not be relevant) - "The first Read response was not treated as
implicit ACK" (discovered in 2.30.8000).

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [hwloc-devel] Using hwloc to detect Hard Disks

2014-09-23 Thread Christopher Samuel
On 24/09/14 00:57, Ralph Castain wrote:

> Memory info is available from lshw, though they are a GPL code:

FWIW on this laptop (Intel Haswell) lshw only reports DIMM info when run
as root, which I suspect points to it accessing DMI information
via /dev/mem.

Using strace supports this:

3405  open("/dev/mem", O_RDONLY) = -1 EACCES (Permission denied)

FWIW dmidecode does the same.

samuel@haswell:~$ dmidecode
# dmidecode 2.12
/dev/mem: Permission denied

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



[OMPI devel] Grammar error in git master: 'You job will now abort'

2014-08-13 Thread Christopher Samuel
Hi all,

We spotted this in 1.6.5 and git grep shows it's fixed in the
v1.8 branch but in master it's still there:

samuel@haswell:~/Code/OMPI/ompi-svn-mirror$ git grep -n 'You job will now abort'
orte/tools/orterun/help-orterun.txt:679:You job will now abort.
samuel@haswell:~/Code/OMPI/ompi-svn-mirror$ 

I'm using https://github.com/open-mpi/ompi-svn-mirror.git so
let me know if I should be using something else now.

cheers,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Christopher Samuel
On 09/05/14 00:16, Joshua Ladd wrote:

> The necessary packages will be supported and available in community
> OFED.

We're constrained to what is in RHEL6 I'm afraid.

This is because we have to run GPFS over IB to BG/Q from the same NSDs
that talk GPFS to all our Intel clusters.   We did try MOFED 2.x (in
connected mode) on a new Intel cluster during its bring up last year
which worked for MPI but stopped it talking to the NSDs.  Reverting to
vanilla RHEL6 fixed it.

Not your problem though. :-)  As Ralph has said there is work on an
alternative solution that we will be able to use.

Thanks!
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Christopher Samuel
On 08/05/14 23:45, Ralph Castain wrote:

> Artem and I are working on a new PMIx plugin that will resolve it 
> for non-Mellanox cases.

Ah yes of course, sorry my bad!

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Christopher Samuel
On 08/05/14 12:54, Ralph Castain wrote:

> I think there was one 2.6.x that was borked, and definitely
> problems in the 14.03.x line. Can't pinpoint it for you, though.

No worries, thanks.

> Sounds good. I'm going to have to dig deeper into those numbers, 
> though, as they don't entirely add up to me. Once the job gets 
> launched, the launch method itself should have no bearing on 
> computational speed - IF all things are equal. In other words, if
> the process layout is the same, and the binding pattern is the
> same, then computational speed should be roughly equivalent
> regardless of how the procs were started.

Not sure if it's significant, but when mpirun was launching processes
it was using srun to start orted, which then started the MPI ranks,
whereas with PMI/PMI2 it appeared to start the ranks directly.

> My guess is that your data might indicate a difference in the
> layout and/or binding pattern as opposed to PMI2 vs mpirun. At the
> scale you mention later in the thread (only 70 nodes x 16 ppn), the
> difference in launch timing would be zilch. So I'm betting you
> would find (upon further exploration) that (a) you might not have
> been binding processes when launching by mpirun, since we didn't
> bind by default until the 1.8 series, but were binding under direct
> srun launch, and (b) your process mapping would quite likely be
> different as we default to byslot mapping, and I believe srun
> defaults to bynode?

FWIW all our environment modules that do OMPI have:

setenv OMPI_MCA_orte_process_binding core

> Might be worth another comparison run when someone has time.

Yeah, I'll try and queue up some more tests - unfortunately the
cluster we tested on then is flat out at the moment but I'll try and
sneak a 64-core job using identical configs and compare mpirun, srun
on its own and srun with PMI2.
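
Something like this is what I have in mind (a sketch only - the
site-specific Slurm directives and NAMD input are omitted):

#!/bin/bash
#SBATCH --nodes=4 --ntasks-per-node=16
# Same binary, same input, identical config; only the launcher varies.
mpirun namd2 macpf.conf          > out.mpirun
srun namd2 macpf.conf            > out.srun-default
srun --mpi=pmi2 namd2 macpf.conf > out.srun-pmi2
grep WallClock out.*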

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Christopher Samuel
On 07/05/14 18:00, Ralph Castain wrote:

> Interesting - how many nodes were involved? As I said, the bad 
> scaling becomes more evident at a fairly high node count.

Our x86-64 systems are low node counts (we've got BG/Q for capacity),
the cluster that those tests were run on has 70 nodes, each with 16
cores, so I suspect we're a long long way away from that pain point.

All the best!
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Christopher Samuel
Hi all,

Apologies for having dropped out of the thread; night intervened here. ;-)

On 08/05/14 00:45, Ralph Castain wrote:

> Okay, then we'll just have to develop a workaround for all those 
> Slurm releases where PMI-2 is borked :-(

Do you know what these releases are?  Are we talking 2.6.x or 14.03?
The 14.03 series has had a fair few rapid point releases and doesn't
appear to be anywhere near as stable as 2.6 was when it came out. :-(

> FWIW: I think people misunderstood my statement. I specifically
> did *not* propose to *lose* PMI-2 support. I suggested that we
> change it to "on-by-request" instead of the current "on-by-default"
> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
> the Slurm implementation stabilized, then we could reverse that
> policy.
> 
> However, given that both you and Chris appear to prefer to keep it 
> "on-by-default", we'll see if we can find a way to detect that
> PMI-2 is broken and then fall back to PMI-1.

My intention was to provide the data that led us to want PMI2, but if
configure had an option to enable PMI2, so that only those who
requested it got it, then I'd be more than happy - we'd just add it to
our build script.

All the best!
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Christopher Samuel
Hiya Ralph,

On 07/05/14 14:49, Ralph Castain wrote:

> I should have looked closer to see the numbers you posted, Chris -
> those include time for MPI wireup. So what you are seeing is that
> mpirun is much more efficient at exchanging the MPI endpoint info
> than PMI. I suspect that PMI2 is not much better as the primary
> reason for the difference is that mpriun sends blobs, while PMI
> requires that everything be encoded into strings and sent in little
> pieces.
> 
> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
> operation) much faster, and MPI_Init completes faster. Rest of the
> computation should be the same, so long compute apps will see the
> difference narrow considerably.

Unfortunately it looks like I had an enthusiastic cleanup at some point
and so I cannot find the out files from those runs at the moment, but
I did find some comparisons from around that time.

This first pair compares running NAMD with OMPI 1.7.3a1r29103,
run with mpirun and srun successively from inside the same Slurm job.

mpirun namd2 macpf.conf 
srun --mpi=pmi2 namd2 macpf.conf 

Firstly the mpirun output (grep'ing the interesting bits):

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB memory
Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB memory
Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB memory
Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB memory
Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB memory
WallClock: 1403.388550  CPUTime: 1403.388550  Memory: 1119.085938 MB

Now the srun output:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB memory
Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB memory
Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB memory
Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB memory
Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB memory
WallClock: 1230.784424  CPUTime: 1230.784424  Memory: 1100.648438 MB


The next two pairs are first launched using mpirun from 1.6.x and then with srun
from 1.7.3a1r29103.  Again each pair inside the same Slurm job with the same 
inputs.

First pair mpirun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
WallClock: 8341.524414  CPUTime: 8341.524414  Memory: 975.015625 MB

First pair srun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory
WallClock: 7476.643555  CPUTime: 7476.643555  Memory: 968.867188 MB


Second pair mpirun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory
WallClock: 7842.831543  CPUTime: 7842.831543  Memory: 1004.050781 MB

Second pair srun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
WallClock: 7522.677246  CPUTime: 7522.677246  Memory: 969.433594 MB


So to me it looks like (for NAMD on our system at least)
PMI2 does seem to give better scalability.

All the best!
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Christopher Samuel
On 07/05/14 13:37, Moody, Adam T. wrote:

> Hi Chris,

Hi Adam,

> I'm interested in SLURM / OpenMPI startup numbers, but I haven't
> done this testing myself.  We're stuck with an older version of
> SLURM for various internal reasons, and I'm wondering whether it's
> worth the effort to back port the PMI2 support.  Can you share some
> of the differences in times at different scales?

We've not looked at startup times I'm afraid; this was time to
solution. We noticed it with Slurm when we first started using it on
x86-64 for our NAMD tests (this is from a posting to the list last year
when I raised the issue and was told PMI2 would be the solution):

> Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB.
> 
> Here are some timings as reported as the WallClock time by NAMD 
> itself (so not including startup/tear down overhead from Slurm).
> 
> srun:
> 
> run1/slurm-93744.out:WallClock: 695.079773  CPUTime: 695.079773 
> run4/slurm-94011.out:WallClock: 723.907959  CPUTime: 723.907959 
> run5/slurm-94013.out:WallClock: 726.156799  CPUTime: 726.156799 
> run6/slurm-94017.out:WallClock: 724.828918  CPUTime: 724.828918
> 
> Average of 692 seconds
> 
> mpirun:
> 
> run2/slurm-93746.out:WallClock: 559.311035  CPUTime: 559.311035 
> run3/slurm-93910.out:WallClock: 544.116333  CPUTime: 544.116333 
> run7/slurm-94019.out:WallClock: 586.072693  CPUTime: 586.072693
> 
> Average of 563 seconds.
> 
> So that's about 23% slower.
> 
> Everything is identical (they're all symlinks to the same golden 
> master) *except* for the srun / mpirun which is modified by
> copying the batch script and substituting mpirun for srun.



-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [hwloc-devel] GIT: hwloc branch master updated. 0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec

2014-03-30 Thread Christopher Samuel
On 30/03/14 02:04, Ralph Castain wrote:

> turns out that some linux distro's automatically set LS_COLORS in 
> your environment when running old versions of csh/tcsh via their 
> default dot files

For example RHEL6 does this..

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] SC13 birds of a feather

2013-12-05 Thread Christopher Samuel
On 05/12/13 01:52, Jeff Squyres (jsquyres) wrote:

> Ralph -- let's chat about this in Chicago next Friday.  I'll add
> it to the agenda on the wiki.  I assume this would not be
> difficult stuff; we don't really need to do anything fancy at all.
> I think we just want to sketch out what exactly we want to do, and
> it could probably be done in a day or three.

There's also stuff that ACPI can expose under:

/sys/class/thermal/thermal_zone*/temp

though it might need a bit more prodding to work out what's what there.
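
Each of those files just holds an integer in millidegrees Celsius, so
a minimal C reader looks something like the sketch below (zone0 is
hard-coded for illustration; a real tool would loop over all the
thermal_zone* directories):

#include <stdio.h>

int main(void)
{
    /* Each temp file holds millidegrees Celsius as plain text. */
    FILE *f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");
    long mdeg;

    if (f == NULL || fscanf(f, "%ld", &mdeg) != 1) {
        perror("thermal_zone0");
        return 1;
    }
    fclose(f);
    printf("zone0: %.1f C\n", mdeg / 1000.0);
    return 0;
}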

> (Thanks for the idea, Samuel!)

My pleasure!

All the best,

Chris

--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] SC13 birds of a feather

2013-12-03 Thread Christopher Samuel

On 04/12/13 09:27, Jeff Squyres (jsquyres) wrote:

> 2. The MPI_T performance variables are new.  There's only a few 
> created right now (e.g., in the Cisco usnic BTL).  But the field
> is pretty wide open here -- the infrastructure is there, but we're 
> really not exposing much information yet.  There's lots that can
> be done here.

Random thought - please shoot it down if crazy...

Would it make any sense to expose system/environmental/thermal
information to the application via MPI_T ?

For our sort of systems with a grab bag of jobs it's not likely
useful, but if you had a system dedicated to running an in-house code
then you could conceive of situations where you might want to react to
over-temperature cores, nodes, etc.
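
To make the thought concrete, here's a minimal sketch against the
MPI-3 tools interface of how an application might hunt for such a
variable (the "temp" name match is purely hypothetical - no such
performance variable exists today):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, npvar, i;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_num(&npvar);

    for (i = 0; i < npvar; i++) {
        char name[256], desc[256];
        int nlen = sizeof(name), dlen = sizeof(desc);
        int verb, vclass, bind, readonly, continuous, atomic;
        MPI_Datatype dtype;
        MPI_T_enum etype;

        MPI_T_pvar_get_info(i, name, &nlen, &verb, &vclass, &dtype,
                            &etype, desc, &dlen, &bind, &readonly,
                            &continuous, &atomic);
        /* A thermal pvar would presumably show up by name here. */
        if (strstr(name, "temp") != NULL)
            printf("pvar %d: %s (%s)\n", i, name, desc);
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}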

cheers,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Openmpi 1.6.5 is freezing under GNU/Linux ia64

2013-12-03 Thread Christopher Samuel

On 03/12/13 23:50, Sylvestre Ledru wrote:

> FYI, Debian has stopped supporting ia64 for its next release
> So, I stopped working on that issue.

Yeah, it's not looking good - here's the context for this:

http://lists.debian.org/debian-devel-announce/2013/11/msg7.html

# We have stopped considering ia64 as a blocker for testing
# migration. This means that the out-of-date and uninstallability
# criteria on ia64 will not hold up transitions from unstable to
# testing for packages. It is expected that unless drastic
# improvement occurs, ia64 will be removed from testing on
# Friday 24th January 2014.


--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29615 - in trunk: . contrib contrib/dist/linux debian debian/source

2013-11-06 Thread Christopher Samuel

On 07/11/13 04:40, Mike Dubman wrote:

> I did not find debian packaging files in the OMPI tree, could you please
> point me to it?

As Sylvestre explained, Debian (and presumably Ubuntu too) will
automatically delete any /debian/ directory in an upstream tarball
and substitute their own packaging.

You can see what they put in for sid (testing) here:

http://ftp.de.debian.org/debian/pool/main/o/openmpi/openmpi_1.6.5-5.debian.tar.gz

Whilst I can understand the enthusiasm, I don't think it's
going to be very helpful to Debian; perhaps a better way to
assist would be to help out Sylvestre and the other Debian
maintainers?  This might be a handy place to start:

http://qa.debian.org/developer.php?login=pkg-openmpi-maintainers%40lists.alioth.debian.org

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] 1.6.5 large matrix test doesn't pass (decode) ?

2013-10-16 Thread Christopher Samuel

On 05/10/13 01:49, KAWASHIMA Takahiro wrote:

> It is a bug in the test program, test/datatype/ddt_raw.c, and it
> was fixed at r24328 in trunk.
> 
> https://svn.open-mpi.org/trac/ompi/changeset/24328
> 
> I've confirmed the failure occurs with plain v1.6.5 and it doesn't 
> occur with patched v1.6.5.

Perfect, thanks!

Sorry for the delay, been away on holiday.

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[OMPI devel] 1.6.5 large matrix test doesn't pass (decode) ?

2013-10-04 Thread Christopher Samuel

Not sure if this is important, or expected, but I ran a make check out
of interest after seeing recent emails and saw the final one of these
tests reported as "NOT PASSED" (it seems to be the only failure).
The text I see is:

 ####################################
 * TEST UPPER MATRIX
 ####################################

test upper matrix
complete raw in 7 microsec
decode [NOT PASSED]


This happens on both our Nehalem and SandyBridge clusters and we are
building with the system GCC.  I've attached the full log from our
Nehalem cluster (RHEL 6.4).


Our configure script is:

#!/bin/bash

BASE=`basename $PWD | sed -e s,-,/,`

module purge

./configure --prefix=/usr/local/${BASE} --with-slurm --with-openib \
--enable-static  --enable-shared

make -j


I'm away on leave next week (first break for a year, yay!) but back
the week after.

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci

Making check in config
make[1]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/config'
make[1]: Nothing to be done for `check'.
make[1]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/config'
Making check in contrib
make[1]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/contrib'
make[1]: Nothing to be done for `check'.
make[1]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/contrib'
Making check in opal
make[1]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal'
Making check in include
make[2]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/include'
make[2]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/include'
Making check in libltdl
make[2]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/libltdl'
make  check-am
make[3]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/libltdl'
make[3]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/libltdl'
make[2]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/libltdl'
Making check in asm
make[2]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/asm'
make[2]: Nothing to be done for `check'.
make[2]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/asm'
Making check in datatype
make[2]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/datatype'
make[2]: Nothing to be done for `check'.
make[2]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/datatype'
Making check in etc
make[2]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/etc'
make[2]: Nothing to be done for `check'.
make[2]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/etc'
Making check in event
make[2]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event'
Making check in compat
make[3]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event/compat'
Making check in sys
make[4]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event/compat/sys'
make[4]: Nothing to be done for `check'.
make[4]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event/compat/sys'
make[4]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event/compat'
make[4]: Nothing to be done for `check-am'.
make[4]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event/compat'
make[3]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event/compat'
make[3]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event'
make[3]: Nothing to be done for `check-am'.
make[3]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event'
make[2]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/event'
Making check in util
make[2]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/util'
Making check in keyval
make[3]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/util/keyval'
make[3]: Nothing to be done for `check'.
make[3]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/util/keyval'
make[3]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/util'
make[3]: Nothing to be done for `check-am'.
make[3]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/util'
make[2]: Leaving directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/util'
Making check in mca/base
make[2]: Entering directory `/usr/local/src/OPENMPI/openmpi-1.6.5.1/opal/mca/base'
make[2]: Nothing to be done for `check'.
make[2]: Leaving directory `/usr/local/src/OPENM

Re: [OMPI devel] Openmpi 1.6.5 is freezing under GNU/Linux ia64

2013-09-21 Thread Christopher Samuel

On 21/09/13 14:33, Ralph Castain wrote:

> I think you misunderstood the issue here. The problem is that
> mpirun appears to be hanging before it ever gets to the point of
> launching something.

Ah, quite correct, I hadn't realised the debug info hadn't shown it
getting to the point of launching the executable. Mea culpa.
I blame jet-lag. ;-)

cheers,
Chris (about to get a second dose)
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Openmpi 1.6.5 is freezing under GNU/Linux ia64

2013-09-20 Thread Christopher Samuel

On 21/09/13 05:49, Sylvestre Ledru wrote:

> Does it ring a bell to anyone ?

Possibly, if you run the binary without mpirun does it do the same?

If so, could you try running it with strace -f and see if you see
repeating SEGVs?

cheers!
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-09-06 Thread Christopher Samuel

On 06/09/13 14:14, Christopher Samuel wrote:

> However, modifying the test program confirms that variable is getting
> propagated as expected with both mpirun and srun for 1.6.5 and the 1.7
> snapshot. :-(

Investigating further by setting:

export OMPI_MCA_orte_report_bindings=1
export SLURM_CPU_BIND=core
export SLURM_CPU_BIND_VERBOSE=verbose

reveals that only OMPI 1.6.5 with mpirun reports bindings being set
(see below).   We cannot understand why Slurm doesn't *appear* to be
setting bindings as we have the correct settings according to the
documentation.

Whilst it may explain the difference between 1.6.5 mpirun and srun,
it doesn't explain why the 1.7 snapshot is so much better, as you'd
expect them both to be hurt in the same way.


==OPENMPI 1.6.5==
==mpirun==
[barcoo003:03633] System has detected external process binding to cores 0001
[barcoo003:03633] MCW rank 0 bound to socket 0[core 0]: [B]
[barcoo004:04504] MCW rank 1 bound to socket 0[core 0]: [B]
Hello, World, I am 0 of 2 on host barcoo003 from app number 0 universe size 2 
universe envar 2
Hello, World, I am 1 of 2 on host barcoo004 from app number 0 universe size 2 
universe envar 2
==srun==
Hello, World, I am 0 of 2 on host barcoo003 from app number 1 universe size 2 
universe envar NULL
Hello, World, I am 1 of 2 on host barcoo004 from app number 1 universe size 2 
universe envar NULL
=
==OPENMPI 1.7.3==
DANGER: YOU ARE LOADING A TEST VERSION OF OPENMPI. THIS MAY BE BAD.
==mpirun==
Hello, World, I am 0 of 2 on host barcoo003 from app number 0 universe size 2 
universe envar 2
Hello, World, I am 1 of 2 on host barcoo004 from app number 0 universe size 2 
universe envar 2
==srun==
Hello, World, I am 0 of 2 on host barcoo003 from app number 0 universe size 2 
universe envar NULL
Hello, World, I am 1 of 2 on host barcoo004 from app number 0 universe size 2 
universe envar NULL
=



--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-09-06 Thread Christopher Samuel

On 06/09/13 00:23, Hjelm, Nathan T wrote:

> I assume that process binding is enabled for both mpirun and srun?
> If not that could account for a difference between the runtimes.

You raise an interesting point, we have been doing that with:

[samuel@barcoo ~]$ module show openmpi 2>&1 | grep binding
setenv   OMPI_MCA_orte_process_binding core

However, modifying the test program confirms that variable is getting
propagated as expected with both mpirun and srun for 1.6.5 and the 1.7
snapshot. :-(

cheers,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-09-05 Thread Christopher Samuel

Hi Ralph,

On 05/09/13 12:50, Ralph Castain wrote:

> Jeff and I were looking at a similar issue today and suddenly 
> realized that the mappings were different - i.e., what ranks are
> on what nodes differs depending on how you launch. You might want
> to check if that's the issue here as well. Just launch the
> attached program using mpirun vs srun and check to see if the maps
> are the same or not.

Very interesting: the ranks-to-node mappings are identical in all
cases (mpirun and srun for 1.6.5 and my test 1.7.3 snapshot), but what
is different is as follows.


For the 1.6.5 build I see mpirun report:

number 0 universe size 64 universe envar 64

whereas srun reports:

number 1 universe size 64 universe envar NULL



For the 1.7.3 snapshot both report "number 0" so the only difference
there is that mpirun has:

envar 64

whereas srun has:

envar NULL


Are these differences significant?

I'm intrigued that the problem child (srun 1.6.5) is the only one
where number is 1.
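
For reference, here is my understanding of where those three values
come from, as a sketch of what I assume the attached test program
roughly does (the OMPI_UNIVERSE_SIZE environment variable name is my
guess):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int *appnum, *usize, flag;
    char *envar;

    MPI_Init(&argc, &argv);

    /* "app number" would be the MPI_APPNUM attribute... */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_APPNUM, &appnum, &flag);
    if (flag) printf("app number %d\n", *appnum);

    /* ...and "universe size" the MPI_UNIVERSE_SIZE attribute. */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &usize, &flag);
    if (flag) printf("universe size %d\n", *usize);

    /* The "envar" value presumably comes from the environment. */
    envar = getenv("OMPI_UNIVERSE_SIZE");
    printf("universe envar %s\n", envar ? envar : "NULL");

    MPI_Finalize();
    return 0;
}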

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-09-04 Thread Christopher Samuel

On 04/09/13 18:33, George Bosilca wrote:

> You can confirm that the slowdown happen during the MPI
> initialization stages by profiling the application (especially the
> MPI_Init call).

NAMD helpfully prints benchmark and timing numbers during the initial
part of the simulation, so here's what they say.  For both seconds
per step and days per nanosecond of simulation less is better.

I've included the benchmark numbers (every 100 steps or so from the
start) and the final timing number after 25000 steps.  It looks to
me (as a sysadmin and not an MD person) like the final timing
number includes CPU time in seconds per step and wallclock time in
seconds per step.

64 cores over 10 nodes:

OMPI 1.7.3a1r29103 mpirun

Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory

TIMING: 25000  CPU: 8247.2, 0.330157/step  Wall: 8247.2, 0.330157/step, 
0.0229276 hours remaining, 921.894531 MB of memory in use.

OMPI 1.7.3a1r29103 srun

Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory

TIMING: 25000  CPU: 7390.15, 0.296/step  Wall: 7390.15, 0.296/step, 0.020 
hours remaining, 915.746094 MB of memory in use.


64 cores over 18 nodes:

OMPI 1.6.5 mpirun

Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory

TIMING: 25000  CPU: 7754.17, 0.312071/step  Wall: 7754.17, 0.312071/step, 
0.0216716 hours remaining, 950.929688 MB of memory in use.

OMPI 1.7.3a1r29103 srun

Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory

TIMING: 25000  CPU: 7420.91, 0.296029/step  Wall: 7420.91, 0.296029/step, 
0.0205575 hours remaining, 916.312500 MB of memory in use.


Hope this is useful!

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-09-04 Thread Christopher Samuel

On 04/09/13 11:29, Ralph Castain wrote:

> Your code is obviously doing something much more than just
> launching and wiring up, so it is difficult to assess the
> difference in speed between 1.6.5 and 1.7.3 - my guess is that it
> has to do with changes in the MPI transport layer and nothing to do
> with PMI or not.

I'm testing with what would be our most used application in aggregate
across our systems, the NAMD molecular dynamics code from here:

http://www.ks.uiuc.edu/Research/namd/

so yes,  you're quite right, it's doing a lot more than that and has a
reputation for being a *very* chatty MPI code.

For comparison whilst users see GROMACS also suffer with srun under
1.6.5 they don't see anything like the slow down that NAMD gets.

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-09-03 Thread Christopher Samuel

On 04/09/13 04:47, Jeff Squyres (jsquyres) wrote:

> Hmm.  Are you building Open MPI in a special way?  I ask because I'm
> unable to replicate the issue -- I've run your test (and a C
> equivalent) a few hundred times now:

I don't think we do anything unusual; the script we are using is
fairly simple (it does a module purge to ensure we are just using the
system compilers and don't pick up anything strange) and is as follows:

#!/bin/bash

BASE=`basename $PWD | sed -e s,-,/,`

module purge

./configure --prefix=/usr/local/${BASE} --with-slurm --with-openib \
--enable-static  --enable-shared

make -j


--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-09-03 Thread Christopher Samuel

On 03/09/13 10:56, Ralph Castain wrote:

> Yeah - --with-pmi=

Actually I found that just --with-pmi=/usr/local/slurm/latest worked. :-)

I've got some initial numbers for 64 cores, as I mentioned the system
I found this on initially is so busy at the moment I won't be able to
run anything bigger for a while, so I'm going to move my testing to
another system which is a bit quieter, but slower (it's Nehalem vs
SandyBridge).

All the below tests are with the same NAMD 2.9 binary and within the
same Slurm job so it runs on the same cores each time. It's nice to
find that C code at least seems to be backwards compatible!

64 cores over 18 nodes:

Open-MPI 1.6.5 with mpirun - 7842 seconds
Open-MPI 1.7.3a1r29103 with srun - 7522 seconds

so that's about a 4% speedup.

64 cores over 10 nodes:

Open-MPI 1.7.3a1r29103 with mpirun - 8341 seconds
Open-MPI 1.7.3a1r29103 with srun - 7476 seconds

So that's about 11% faster, and the mpirun speed has decreased,
though of course that build uses PMI, so perhaps that's the cause?

cheers,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-09-02 Thread Christopher Samuel

On 31/08/13 02:42, Ralph Castain wrote:

> We did some work on the OMPI side and removed the O(N) calls to 
> "get", so it should behave better now. If you get the chance,
> please try the 1.7.3 nightly tarball. We hope to officially release
> it soon.

Stupid question, but never having played with PMI before, is it just
a case of appending the --with-pmi option to our current configure?

thanks,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [hwloc-devel] lstopo - please add the information about the kernel to the graphical output

2013-09-02 Thread Christopher Samuel

On 03/09/13 02:03, Jiri Hladky wrote:

> I vote for --append-legend

I like that too, though the idea of an additional undocumented --jirka
option also appeals. :-)

--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-09-02 Thread Christopher Samuel

On 31/08/13 02:42, Ralph Castain wrote:

> Hi Chris et al

Hiya,

> We did some work on the OMPI side and removed the O(N) calls to 
> "get", so it should behave better now. If you get the chance,
> please try the 1.7.3 nightly tarball. We hope to officially release
> it soon.

Thanks so much, I'll get our folks to rebuild a test version of NAMD
against 1.7.3a1r29103 which I built this afternoon.

It might be some time until I can get a test job of a suitable size to
run though, looks like our systems are flat out!

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-09-02 Thread Christopher Samuel

On 30/08/13 16:01, Christopher Samuel wrote:

> Thanks for this, I'll take a look further next week..

The code where it's SEGV'ing is here:

  /* check that one of the above allocation paths succeeded */
  if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE)) {
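    /* Split the chunk p: the caller gets the first nb bytes and the
       remainder becomes the new top of the arena. */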
remainder_size = size - nb;
remainder = chunk_at_offset(p, nb);
av->top = remainder;
set_head(p, nb | PREV_INUSE | (av != &main_arena ? NON_MAIN_ARENA : 0));
set_head(remainder, remainder_size | PREV_INUSE);
check_malloced_chunk(av, p, nb);
return chunk2mem(p);
  }


It dies when it does:

set_head(remainder, remainder_size | PREV_INUSE);

where remainder_size=0.
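
(From memory, set_head() in the bundled ptmalloc is just the macro
below, so this stores a size of zero - well, zero plus the PREV_INUSE
bit - into the header of the chunk at remainder:)

#define set_head(p, s)  ((p)->size = (s))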

This implies that size and nb are the same, so I'm wondering
whether the test at the top of that block should drop the equals
and instead be this:

  /* check that one of the above allocation paths succeeded */
  if ((unsigned long)(size) > (unsigned long)(nb + MINSIZE)) {

It would ensure that the set_head() macro would never get called
with a 0 argument.

The code would then fall through to the malloc failure part
(which is what I suspect we want).

Thoughts?

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-08-30 Thread Christopher Samuel

Hiya Jeff,

On 30/08/13 11:13, Jeff Squyres (jsquyres) wrote:

> FWIW, the stack traces you sent are not during MPI_INIT.

I did say it was a suspicion. ;-)

> What happens with OMPI's memory manager is that it inserts itself
> to be *the* memory allocator for the entire process before main()
> even starts.  We have to do this as part of the horribleness of
> that is OpenFabrics/verbs and how it just doesn't match the MPI
> programming model at all.  :-(  (I think I wrote some blog entries
> about this a while ago...  Ah, here's a few:

Thanks!  I'll take a look next week (just got out of a 5.5 hour
meeting and have to head home now).

> Therefore, (in C) if you call malloc() before MPI_Init(), it'll be 
> calling OMPI's ptmalloc.  The stack traces you sent imply that
> it's just when your app is calling the fortran allocate -- which is
> after MPI_Init().

OK, that makes sense.

> FWIW, you can build OMPI with --without-memory-manager, or you can 
> setenv OMPI_MCA_memory_linux_disable to 1 (note: this is NOT a 
> regular MCA parameter -- it *must* be set in the environment
> before the MPI app starts).  If this env variable is set, OMPI will
> *not* interpose its own memory manager in the pre-main hook.  That
> should be a quick/easy way to try with and without the memory
> manager and see what happens.

Well with OMPI_MCA_memory_linux_disable=1 I don't get the crash at all,
or the spin with the Intel compiler build.  Nice!

Thanks for this, I'll take a look further next week..

Very much obliged,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [hwloc-devel] lstopo - please add the information about the kernel to the graphical output

2013-08-29 Thread Christopher Samuel

On 28/08/13 02:19, Brice Goglin wrote:

> The problem I have while playing with this is that it takes a lot
> of space. Putting the entire uname on a single line will be
> truncated when the topology drawing isn't large (on machines with 2
> cores for instance). And using multiple lines would make the legend
> huge.

Would there be any benefit in inserting it into the EXIF information
for the image (every time) instead?

That way it would be accessible for those who need it (now and in the
future) whilst not cluttering up the image.

cheers,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [hwloc-devel] hwloc-distrib - please add the option to distribute the jobs in the reverse direction

2013-08-27 Thread Christopher Samuel

On 27/08/13 00:07, Brice Goglin wrote:

> But there's a more general problem here, some people may want
> something similar for other cases. I need to think about it.

Something like a sort order perhaps, combined with some method to
exclude or weight PUs based on some metrics (including a user defined
weight)?

I had a quick poke around looking at /proc/irq/*/ and it would appear
you can gather info about which CPUs are eligible to handle IRQs from
the smp_affinity bitmask (or smp_affinity_list).

The node file there just "shows the node to which the device using the
IRQ reports itself as being attached. This hardware locality
information does not include information about any possible driver
locality preference."
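
A quick C sketch (error handling mostly elided) that walks those
files with glob(3):

#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t g;
    size_t i;
    char line[256];

    /* Each file lists the CPUs allowed to service that IRQ. */
    glob("/proc/irq/*/smp_affinity_list", 0, NULL, &g);
    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        if (f != NULL && fgets(line, sizeof(line), f))
            printf("%s: %s", g.gl_pathv[i], line);
        if (f != NULL)
            fclose(f);
    }
    globfree(&g);
    return 0;
}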

cheers,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[OMPI devel] How to deal with F90 mpi.mod with single stack and multiple compiler suites?

2013-08-22 Thread Christopher Samuel

Hi folks,

We've got what we thought would be a fairly standard OMPI (1.6.5)
install: a single install built with GCC, with the appropriate
variables set to use the Intel compilers when someone loads our
"intel" module:

$ module show intel
[...]
setenv   OMPI_CC icc
setenv   OMPI_CXX icpc
setenv   OMPI_F77 ifort
setenv   OMPI_FC ifort
setenv   OMPI_CFLAGS -xHOST -O3 -mkl=sequential
setenv   OMPI_FFLAGS -xHOST -O3 -mkl=sequential
setenv   OMPI_FCFLAGS -xHOST -O3 -mkl=sequential
setenv   OMPI_CXXFLAGS -xHOST -O3 -mkl=sequential

This works wonderfully, *except* that when our director attempted to
build an F90 program with the Intel compilers it failed, because the
mpi.mod F90 module was produced with gfortran rather than the Intel
compilers. :-(

Is there any way to avoid having to do parallel installs of OMPI with
GCC and Intel compilers just to have two different versions of these
files?

My brief googling hasn't indicated anything, and I don't see anything
in the mpif90 manual page (though I have to admit I've had to rush to
try and get this done before I need to leave for the day). :-(

cheers,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] [slurm-dev] slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun)

2013-08-19 Thread Christopher Samuel

Hi Ralph,

On 12/08/13 06:17, Ralph Castain wrote:

> 1. Slurm has no direct knowledge or visibility into the
> application procs themselves when launched by mpirun. Slurm only
> sees the ORTE daemons. I'm sure that Slurm rolls up all the
> resources used by those daemons and their children, so the totals
> should include them
> 
> 2. Since all Slurm can do is roll everything up, the resources
> shown in sacct will include those used by the daemons and mpirun as
> well as the application procs. Slurm doesn't include their daemons
> or the slurmctld in their accounting, so the two numbers will be
> significantly different. If you are attempting to limit overall 
> resource usage, you may need to leave some slack for the daemons
> and mpirun.

Thanks for that explanation, makes a lot of sense.

In the end due to time pressure we decided to just do what we did with
Torque and patch Slurm to set RLIMIT_AS instead of RLIMIT_DATA for
jobs so no single sub-process can request more RAM than the job has
asked for.
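
(For the curious, the guts of that patch is just which resource gets
capped in the setrlimit() call - a sketch, with job_bytes standing in
for whatever Slurm computes from the job request:)

#include <sys/resource.h>

/* Sketch only: cap the whole address space rather than just the data
 * segment, so no single allocation can exceed the job's request. */
static int cap_job_memory(rlim_t job_bytes)
{
    struct rlimit rl;

    rl.rlim_cur = job_bytes;
    rl.rlim_max = job_bytes;
    return setrlimit(RLIMIT_AS, &rl);   /* was RLIMIT_DATA */
}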

Works nicely and our users are used to it from Torque, we've not hit
any issues with it so far.

In the long term I suspect the jobacct_gather/cgroup plugin will give
better numbers once it's had more work.

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] [slurm-dev] slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun)

2013-08-07 Thread Christopher Samuel

On 07/08/13 16:19, Christopher Samuel wrote:

> Anyone seen anything similar, or any ideas on what could be going
> on?

Sorry, this was with:

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

Since those initial tests we've started enforcing memory limits (the
system is not yet in full production) and found that this causes jobs
to get killed.

We tried the cgroups gathering method, but jobs still die with mpirun
and now the numbers don't seem to be right for mpirun or srun either:

mpirun (killed):

[samuel@barcoo-test Mem]$ sacct -j 94564 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94564
94564.batch    -523362K          0
94564.0         394525K          0

srun:

[samuel@barcoo-test Mem]$ sacct -j 94565 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94565
94565.batch        998K          0
94565.0          88663K          0


All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Memory accounting issues with mpirun (was Re: [slurm-dev] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun)

2013-08-07 Thread Christopher Samuel

On 07/08/13 16:18, Christopher Samuel wrote:

> Anyone seen anything similar, or any ideas on what could be going
> on?

Apologies, forgot to mention that Slurm is set up with:

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

We are testing with cgroups now.

--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[OMPI devel] Memory accounting issues with mpirun (was Re: [slurm-dev] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun)

2013-08-07 Thread Christopher Samuel

On 23/07/13 17:06, Christopher Samuel wrote:

> Bringing up a new IBM SandyBridge cluster I'm running a NAMD test 
> case and noticed that if I run it with srun rather than mpirun it 
> goes over 20% slower.

Following on from this issue, we've found that whilst mpirun gives
acceptable performance, the memory accounting doesn't appear to be correct.

Anyone seen anything similar, or any ideas on what could be going on?

Here are two identical NAMD jobs running over 69 nodes using 16 cores
per node; this one was launched with mpirun (Open-MPI 1.6.5):


==> slurm-94491.out <==
WallClock: 101.176193  CPUTime: 101.176193  Memory: 1268.554688 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94491 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94491
94491.batch    6504068K  11167820K
94491.0        5952048K   9028060K


This one launched with srun (about 60% slower):

==> slurm-94505.out <==
WallClock: 163.314163  CPUTime: 163.314163  Memory: 1253.511719 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94505 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94505
94505.batch       7248K   1582692K
94505.0        1022744K   1307112K



cheers!
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-07-24 Thread Christopher Samuel

On 24/07/13 09:42, Ralph Castain wrote:

> Not to 1.6 series, but it is in the about-to-be-released 1.7.3,
> and will be there from that point onwards.

Oh dear, I cannot delay this machine any more to change to 1.7.x. :-(

> Still waiting to see if it resolves the difference.

When I've got the current rush out of the way I'll try a private build
of 1.7 and see how that goes with NAMD.

cheers!
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun

2013-07-23 Thread Christopher Samuel

On 23/07/13 19:34, Joshua Ladd wrote:

> Hi, Chris

Hi Joshua,

I've quoted you in full as I don't think your message made it through
to the slurm-dev list (at least I've not received it from there yet).

> Funny you should mention this now. We identified and diagnosed the 
> issue some time ago as a combination of SLURM's PMI1
> implementation and some of, what I'll call, OMPI's topology
> requirements (probably not the right word.) Here's what is
> happening, in a nutshell, when you launch with srun:
> 
> 1. Each process pushes his endpoint data up to the PMI "cloud" via
> PMI put (I think it's about five or six puts, bottom line, O(1).)
> 2. Then executes a PMI commit and PMI barrier to ensure all other
> processes have finished committing their data to the "cloud".
> 3. Subsequent to this, each process executes O(N) (N is the number
> of procs in the job) PMI gets in order to get all of the endpoint
> data for every process regardless of whether or not the process
> communicates with that endpoint.
> 
> "We" (MLNX et al.) undertook an in-depth scaling study of this and
> identified several poorly scaling pieces with the worst offenders
> being:
> 
> 1. PMI Barrier scales worse than linear.
> 2. At scale, the PMI get phase starts to look quadratic.
> 
> The proposed solution that "we" (OMPI + SLURM) have come up with is
> to modify OMPI to support PMI2 and to use SLURM 2.6 which has
> support for PMI2 and is (allegedly) much more scalable than PMI1.
> Several folks in the combined communities are working hard, as we
> speak, trying to get this functional to see if it indeed makes a
> difference. Stay tuned, Chris. Hopefully we will have some data by
> the end of the week.
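
For anyone following along, the put/commit/barrier/get pattern Joshua
describes maps onto Slurm's PMI-1 calls roughly as below (a sketch
only; the key names and buffer sizes are made up for illustration):

#include <stdio.h>
#include <pmi.h>

/* Each rank publishes its own endpoint, then fetches everyone
 * else's - the O(N) gets are where the time goes at scale. */
void modex(const char *kvs, int rank, int nprocs, const char *myep)
{
    char key[64], val[256];
    int i;

    /* O(1) puts of our own endpoint data... */
    snprintf(key, sizeof(key), "endpoint-%d", rank);
    PMI_KVS_Put(kvs, key, myep);
    PMI_KVS_Commit(kvs);

    /* ...one barrier so everyone's data is visible... */
    PMI_Barrier();

    /* ...then O(N) gets, one per peer, needed or not. */
    for (i = 0; i < nprocs; i++) {
        snprintf(key, sizeof(key), "endpoint-%d", i);
        PMI_KVS_Get(kvs, key, val, (int)sizeof(val));
    }
}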

Wonderful, great to know that what we're seeing is actually real and
not just pilot error on our part!  We're happy enough to tell users
to keep on using mpirun, as they're used to from our other Intel
systems, and to only use srun if the code requires it (one or two
commercial apps that use Intel MPI).

Can I ask, if the PMI2 ideas work out, is that likely to get
backported to OMPI 1.6.x?

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Any plans to support Intel MIC (Xeon Phi) in Open-MPI?

2013-05-02 Thread Christopher Samuel

On 03/05/13 10:47, Ralph Castain wrote:

> We had something similar at one time - I developed it for the 
> Roadrunner cluster so you could run MPI tasks on the GPUs. Worked 
> well, but eventually fell into disrepair due to lack of use.

OK, interesting!  RR was Cell rather than GPU though, wasn't it?

> In this case, I suspect it will be much easier to do as the Phis 
> appear to be a lot more visible to the host than the GPU did on RR.
>  Looking at the documentation, the Phis just sit directly on the
> PCIe bus, so they should look just like any other processor,

Yup, they show up in lspci:

[root@barcoo061 ~]# lspci -d 8086:2250
2a:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
90:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)

> and they are Xeon binary compatible - so there is no issue with 
> tracking which binary to run on which processor.

Sadly they're not binary compatible; you have to cross-compile for
them (or compile on the Phi itself).

I haven't got any further than have xCAT install the (rebuilt) kernel
module so far, so I can't log into them yet.

> Brice: do the Phis appear in the hwloc topology object?

They appear in lstopo as mic0 and mic1.

> Chris: can you run lstopo on one of the nodes and send me the
> output (off-list)?

One of the hosts?  Not a problem, will do.

All the best!
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[OMPI devel] Any plans to support Intel MIC (Xeon Phi) in Open-MPI?

2013-05-02 Thread Christopher Samuel

Hi folks,

The new system we're bringing up has 10 nodes with dual Xeon Phi MIC
cards, are there any plans to support them by launching MPI tasks
directly on the Phis themselves (rather than just as offload devices
for code on the hosts)?

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [OMPI devel] Choosing an Open-MPI release for a new cluster

2013-05-02 Thread Christopher Samuel

Hi Ralph, Jeff, Paul,

On 02/05/13 14:14, Ralph Castain wrote:

> Depends on what you think you might want, and how tolerant you and 
> your users are about bugs.
> 
> The 1.6 series is clearly more mature and stable. It has nearly
> all the MPI-2 stuff now, but no MPI-3.

Great, thanks!

> If you think there is something in MPI-3 you might want, then the
> 1.7 series could be the way to go - though you'll have to suffer
> thru its growing pains.
[...]

Well, our users are life sciences researchers and as a result very
few of them are developers; they mostly use applications we build
for them on request (or Java and the occasional commercial package).

So from the sound of it 1.6 is the way to go and if we ever hit
something that needs MPI-3 then we'll install that in parallel but
leave the default at 1.6.

Thanks so much to you all!

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[OMPI devel] Choosing an Open-MPI release for a new cluster

2013-05-02 Thread Christopher Samuel

Hi folks,

We're about to bring up a new cluster (IBM iDataplex with SandyBridge
CPUs including 10 nodes with two Intel Xeon Phi cards) and I'm at the
stage where we need to pick an OMPI release to put on.

Given that this system is at the start of its life, whatever we pick
now is likely to be baked in for the next 4 years or so (with OMPI
point release updates of course), so I'm thinking that I should be
going with the 1.7.x release rather than the 1.6.x one.

For comparison, the Nehalem iDP this is going in next to is still at
1.4.x; it wouldn't be worth the effort to take it to a later release
given it probably has only another 18 months of life left.

However, not having been able to keep up with this list for some
time, I'd like to throw myself on your tender mercies for advice on
whether that's a good plan or not!

Thoughts please?

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [hwloc-devel] [hwloc-svn] svn:hwloc r5324 - branches/libpciaccess/doc

2013-02-18 Thread Christopher Samuel

On 17/02/13 01:22, Jeff Squyres (jsquyres) wrote:

> No, it's not.  RHEL6, for example, does have libpciaccess, but
> does not have a libpciaccess-dev (or devel).  Ergo, you have to
> get externally.

It's in the server-optional repo:

[root@imgr ~]# yum info libpciaccess-devel
Loaded plugins: downloadonly, etckeeper, product-id, rhnplugin, security
Available Packages
Name: libpciaccess-devel
Arch: i686
Version : 0.12.1
Release : 1.el6
Size: 11 k
Repo: rhel-x86_64-server-optional-6
Summary : PCI access library development package
License : MIT
Description : Development package for libpciaccess.

Name: libpciaccess-devel
Arch: x86_64
Version : 0.12.1
Release : 1.el6
Size: 11 k
Repo: rhel-x86_64-server-optional-6
Summary : PCI access library development package
License : MIT
Description : Development package for libpciaccess.

If they're not enabled then you won't see it.

cheers,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


Re: [hwloc-devel] libpci: GPL

2013-02-18 Thread Christopher Samuel

/*
 * Trying to catch up with email, but I've not seen the question of
 * whether or not linking proprietary->BSD->GPL was OK or not addressed
 * yet.
 */

On 06/02/13 08:50, Jeff Squyres (jsquyres) wrote:

> It was just pointed out to me that libpci is licensed under the GPL
> (not the LGPL).
> 
> Hence, even though hwloc is BSD, if it links to libpci.*, it's
> tainted.

I wouldn't say hwloc is tainted, more that you were tainting the GPL'd
code by linking the proprietary code to it, but that's just a case
of perspective. ;-)

After a brief search of the GPL FAQs I'd say the closest I can get is:

http://www.gnu.org/licenses/old-licenses/gpl-2.0-faq.html#GPLWrapper

# I'd like to incorporate GPL-covered software in my proprietary
# system. Can I do this by putting a “wrapper” module, under a
# GPL-compatible lax permissive license (such as the X11 license)
# in between the GPL-covered part and the proprietary part?
#
#No. The X11 license is compatible with the GPL, so you can add a
# module to the GPL-covered program and put it under the X11 license.
# But if you were to incorporate them both in a larger program, that
# whole would include the GPL-covered part, so it would have to be
# licensed as a whole under the GNU GPL.
#
# The fact that proprietary module A communicates with GPL-covered
# module C only through X11-licensed module B is legally irrelevant;
# what matters is the fact that module C is included in the whole.

So yes, if you want to permit proprietary code to link to hwloc then
you need to stick to permissive licenses in hwlocs dependencies.

Disclaimer: IANAL, I don't play a lawyer on TV (or the Internet),
batteries not included, caveat emptor, dates in calendar are closer
than they appear, etc, etc, etc...

Of course it might be possible to ask the pciutils maintainer to split
out libpci from pciutils and LGPL it.

Interestingly, Steam for Linux appears to have linked to libpci:

http://steamcommunity.com/app/221410/discussions/1/846938351130480716/

cheers!
Chris
- -- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



[OMPI devel] CRIU checkpoint support in Open-MPI?

2012-12-05 Thread Christopher Samuel
Hi folks,

I don't know if people have seen that the Linux kernel community is
following its own checkpoint/restart path, different to those currently
supported by OMPI, namely the OpenVZ developers'
"checkpoint/restore in userspace" (CRIU) project.

You can read more about its current state here:

 https://lwn.net/Articles/525675/

The CRIU website is here:

 http://criu.org/
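
For anyone who hasn't played with it yet, the intended user-space
workflow is roughly this (PID and image directory illustrative):

# checkpoint a running process tree into an image directory
criu dump -t 1234 --images-dir /tmp/ckpt --shell-job
# later (potentially after a reboot) restore it from those images
criu restore --images-dir /tmp/ckpt --shell-job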

CRIU will also be up for discussion at LCA2013 in Canberra this year
(though I won't be there):

http://linux.conf.au/schedule/30116/view_talk?day=thursday

Is there interest from OMPI in supporting this, given it looks like
it's quite likely to make it into the mainline kernel?

Or is it better to wait for it to be merged, and then take a look?

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [hwloc-devel] Cgroup resource limits

2012-11-05 Thread Christopher Samuel
On 03/11/12 09:05, Ralph Castain wrote:

> System resource managers don't usually provide this capability, so we
> will do it at the ORTE level.

Interestingly one of the Torque developers posted this overnight:

http://www.supercluster.org/pipermail/torqueusers/2012-November/015183.html

# We are interested in incorporating cgroups into TORQUE. One
# of the things that is delaying it is that we haven't found
# a good library to manage the cgroups - it is obviously a much
# larger project if we have to write such a library ourselves,
# and also much harder to maintain. Does anyone know of a good
# library for cgroups?

So I've pointed them at this thread and strongly encouraged them
to get involved.
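
For what it's worth, libcgroup's command line tools already wrap most
of the fiddly bits; a per-job cgroup built with them might look
something like this (controller paths and job names illustrative):

# create a cgroup for the job in the cpuset and memory controllers
cgcreate -g cpuset,memory:/torque/job1234
# constrain it to four cores on NUMA node 0 and 8GB of RAM
cgset -r cpuset.cpus=0-3 /torque/job1234
cgset -r cpuset.mems=0 /torque/job1234
cgset -r memory.limit_in_bytes=8G /torque/job1234
# then launch the job's processes inside it
cgexec -g cpuset,memory:/torque/job1234 ./my_mpi_app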

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [hwloc-devel] Cgroup resource limits

2012-11-05 Thread Christopher Samuel
On 06/11/12 13:01, Ralph Castain wrote:

> Depends on the use-case. If you are going to direct-launch the 
> processes (e.g., using srun), then you are correct.

Yup.

> However, that isn't the case in other scenarios. For example, if
> you get an allocation and then use mpirun to launch your job, you 
> definitely do *not* want the RM setting the cgroup constraints as
> the RM only launches the orteds - it never sees the MPI procs. The 
> constraints are to apply to the individual procs as separate
> entities - if you apply them to the orteds, then all procs will be
> constrained to the same container. Ick.

That's not been my experience recently; for instance, Torque currently
creates a cpuset for your job containing all the procs you've been
given on each node, and then you can use mpirun/mpiexec to launch orted
across all the nodes you've been given. Those processes are then
constrained to the allocation set up on each node, but are free to bind
themselves to the cores present within that cpuset should they so wish.
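
(Easy enough to see from inside a job, too - the Torque paths here are
illustrative:)

cat /proc/self/cpuset   # which cpuset has the RM put this shell in?
cat /dev/cpuset/torque/$PBS_JOBID/cpus   # and which cores does that give us?
hwloc-bind --get   # hwloc's view of the same thing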

In the very beginning (when I was at VPAC and we were using
MVAPICH2 rather than Open-MPI) Torque would bind processes to a core
within the allocation, which worked fine for that, but of course broke
in the way you explain when we moved to Open-MPI.  I fixed that bug up
very quickly.. ;-)

We've only ever run Slurm on BlueGene where this isn't an issue, so I
don't know if that does things differently.

> Similarly, if you are running MapReduce, your application has to 
> figure out what nodes to run on, how much memory will be required, 
> etc. All that goes into the allocation request (made by the
> equivalent of mpirun in that scenario) sent to the RM. Again, the
> orteds need to set those constraints on a per-process basis.

But for the scheduler to be able to plan workload well, I believe that
once your job has started the best you should be able to do is ask for
less than you have been given; otherwise you'd be free to game the
system by queuing a short, small job and, once it's started, asking for
many more cores or RAM.. :-)

> So we need the capability in ORTE to support the non-direct-launch
> cases.

I'm pretty sure we're agreeing here, just in different ways of
expressing ourselves.. :-)

cheers!
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [hwloc-devel] Cgroup resource limits

2012-11-05 Thread Christopher Samuel
On 06/11/12 01:43, Ralph Castain wrote:

> On Nov 4, 2012, at 7:28 PM, Christopher Samuel 
> <sam...@unimelb.edu.au> wrote:
> 
>> I would argue that the resource managers *should* be doing it
> 
> No argument from me - I would love for them to provide me with an 
> easy API that mpirun can use to specify the requirements for a
> given application.

Wouldn't it be the other way around, with the resource manager setting
limits and then having the job run inside them?  Basically like the
current cpuset support in Torque et al., but on steroids.

That way mpirun and/or orted could learn from the kernel the details
of the cgroup they are in and arrange themselves appropriately.

I believe that Slurm has some support for cgroups already:

http://www.schedmd.com/slurmdocs/cgroups.html
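
That discovery step is cheap, too - something like the below, though
the exact hierarchy paths vary by site and distro (these ones are
illustrative):

cat /proc/self/cgroup   # which cgroups is this process a member of?
cat /sys/fs/cgroup/cpuset/slurm/job_42/cpuset.cpus
cat /sys/fs/cgroup/memory/slurm/job_42/memory.limit_in_bytes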

[memcg performance]
> Yick! However, I would expect the community to reduce that impact 
> over time. If systems don't want that capability, then they can
> and should disable it. On the other hand, if they do want it, then
> we want to support it.

Indeed!

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [hwloc-devel] Cgroup resource limits

2012-11-04 Thread Christopher Samuel
On 03/11/12 09:05, Ralph Castain wrote:

> System resource managers don't usually provide this capability, so
> we will do it at the ORTE level.

I would argue that the resource managers *should* be doing it - however,
I will also argue that the resource managers should be doing it via
hwloc (so I'm afraid it's not an out for you folks :-) ).

It's also worth remembering that the memcg code has an appalling
reputation with the kernel developers in terms of performance overhead;
for instance, at the recent Kernel Summit numbers were reported showing
a substantial impact from just having the code present, but not used.

Following that, a patch set was sent out trying to avoid that impact
when memcg is not in use, which doesn't help here but does give a
measure of the performance hit:

http://lwn.net/Articles/517562/

# So as one can see, the difference between base and nomemcg in terms
# of both system time and elapsed time is quite drastic, and consistent
# with the figures shown by Mel Gorman in the Kernel summit. This is a
# ~7 % drop in performance, just by having memcg enabled. memcg
# functions appear heavily in the profiles, even if all tasks lives in
# the root memcg.
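
(For what it's worth, sites that decide it isn't worth the cost can
keep memcg compiled in but turned off - just boot with this on the
kernel command line:)

cgroup_disable=memory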

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [hwloc-devel] backends and plugins

2012-08-13 Thread Christopher Samuel
On 07/08/12 19:02, Brice Goglin wrote:

> Aside from the main "discover" callback, backends may also define
> some callbacks to be invoked when new objects are created. The main
> example is Linux creating "OS devices" when a new "PCI device" is
> added by the PCI backend.

That could also be useful to some folks for non-PCI devices, say if a
CPU gets hotplugged in/out (or more likely added/removed from a
cpuset/cgroup you're in).
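
(That's easy to fake up for testing, too - offlining a CPU from
userspace as root is just the following, cpu number arbitrary:)

echo 0 > /sys/devices/system/cpu/cpu3/online
echo 1 > /sys/devices/system/cpu/cpu3/online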

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [hwloc-devel] [PATCH] Use plain "inline" in C++

2012-05-10 Thread Christopher Samuel
On 10/05/12 07:40, Jeff Squyres wrote:

> Huh -- really?  I always thought that the C++ language itself
> included the keyword "inline".

I asked via Twitter and got these responses..

# Inline was part of C++98 - the first c++ standard, and
# the inline kwd is in the cfront 1.0 ('86) source. So
# functionally, yes.

...and...

# This may be a different question than "have all C++
# compilers always accepted inline?"


I note that autoconf has an inline test for C:

http://www.gnu.org/software/autoconf/manual/autoconf-2.67/html_node/C-Compiler.html

But not for C++:

http://www.gnu.org/savannah-checkouts/gnu/autoconf/manual/autoconf-2.69/html_node/C_002b_002b-Compiler.html

So perhaps the fact that they've never needed to implement such a test
is in itself a good guide?
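
(Easy enough to check a given compiler by hand, of course - feed it a
trivial inline function and see if it objects:)

echo 'inline int f(void) { return 42; } int main(void) { return f(); }' \
  | g++ -x c++ -o /dev/null -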

cheers,
Chris
-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [hwloc-devel] lstopo-nox strikes back

2012-04-26 Thread Christopher Samuel
On 26/04/12 02:35, Brice Goglin wrote:

> I think I would vote for lstopo (no X/cairo) and lstopo so
> that completion helps.

Not sure if that's an option with Debian given the policy; the hwloc
package would have to have lstopo with X enabled and then a nox
package would install that variant of lstopo and use the alternatives
system to select which to use.

cheers,
Chris
-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [hwloc-devel] lstopo-nox strikes back

2012-04-26 Thread Christopher Samuel
On 25/04/12 23:44, Jeffrey Squyres wrote:

> FWIW: Having lstopo plugins for output would obviate the need for
> having two executable names.

IIRC that's generally handled via the alternatives system (or
diversions if you don't like alternatives) in Debian/Ubuntu.
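
(Roughly like this for the lstopo case - variant names and priorities
illustrative:)

update-alternatives --install /usr/bin/lstopo lstopo /usr/bin/lstopo-x11 60
update-alternatives --install /usr/bin/lstopo lstopo /usr/bin/lstopo-nox 40
update-alternatives --config lstopo   # or pick one explicitly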

-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [hwloc-devel] interoperability with X displays

2012-03-29 Thread Christopher Samuel
On 30/03/12 01:08, Brice Goglin wrote:

> * The code uses NVIDIA's apparently-open-source nvctrl library. The
> lib is unfortunately only built as a static lib in at least debian
> and ubuntu (without -fPIC), which is annoying.

I don't see that reported as a bug in the BTS, so I'd suggest
reporting it and seeing what happens.

http://bugs.debian.org/cgi-bin/pkgreport.cgi?src=nvidia-settings
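
(A quick way to confirm the archive really was built without -fPIC
before filing - absolute relocations like R_X86_64_32 in its objects
mean the code isn't position-independent; library path illustrative:)

readelf -r /usr/lib/libXNVCtrl.a | grep R_X86_64_32 | head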

cheers!
Chris
-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [hwloc-devel] Fwd: BGQ empty topology with MPI

2012-03-26 Thread Christopher Samuel
On 26/03/12 17:14, Brice Goglin wrote:

> Thanks, that would explain such a strange behavior.

Not a problem.

> For the record, you can run "lstopo -v" or even "lstopo -.xml" to
> get more info, especially machine attributes.

OK, please find attached both lstopo -v (with debug enabled) and also
the XML file requested.  This is BG/P, not BG/Q of course!

cheers!
Chris
-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/


[Attachment: topology XML output - markup stripped by the list archive]

could not open /proc/cpuinfo


 * CPU cpusets *

cpu 0 (os 0) has cpuset 0x0001
cpu 1 (os 1) has cpuset 0x0002
cpu 2 (os 2) has cpuset 0x0004
cpu 3 (os 3) has cpuset 0x0008
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0xf...f complete 0x000f online 0xf...f allowed 0xf...f nodeset 0x0 completeN 0x0 allowedN 0xf...f
  PU#0 cpuset 0x0001
  PU#1 cpuset 0x0002
  PU#2 cpuset 0x0004
  PU#3 cpuset 0x0008

Restrict topology cpusets to existing PU and NODE objects

Propagate offline and disallowed cpus down and up

Propagate nodesets
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008

Removing unauthorized and offline cpusets from all cpusets

Removing disallowed memory according to nodesets
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008

Removing ignored objects
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008

Removing empty objects except numa nodes and PCI devices
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008

Removing objects whose type has HWLOC_IGNORE_TYPE_KEEP_STRUCTURE and have only one child or are the only child
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f
  PU#0 cpuset 0x0001 complete 0x0001 online 0x0001 allowed 0x0001
  PU#1 cpuset 0x0002 complete 0x0002 online 0x0002 allowed 0x0002
  PU#2 cpuset 0x0004 complete 0x0004 online 0x0004 allowed 0x0004
  PU#3 cpuset 0x0008 complete 0x0008 online 0x0008 allowed 0x0008

Add default object sets

Ok, finished tweaking, now connect
Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1 HostName=r00-m1-n02.pcf.vlsci.unimelb.edu.au Architecture=BGP) cpuset 0x000f complete 0x000f online 0x000f allowed 0x000f nodeset 0xf...f completeN 0xf...f allowedN

Re: [hwloc-devel] Fwd: BGQ empty topology with MPI

2012-03-26 Thread Christopher Samuel
On 25/03/12 17:43, Brice Goglin wrote:

> But it'd be good to understand what's going on in /sys on this
> machine. And I still don't understand why MPI changes things here.

My guess (looking at the BG/P CNK kernel code) is that /sys is not
present on a BG/Q compute node, only on its I/O nodes (which run a
Linux kernel), and so the code is only picking them up when the I/O is
being redirected via an I/O node (i.e. when MPI is in play).

Now I'd have thought that would happen with or without MPI, but who
knows..

cheers,
Chris
-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [hwloc-devel] Fwd: BGQ empty topology with MPI

2012-03-26 Thread Christopher Samuel
On 25/03/12 09:04, Daniel Ibanez wrote:

> Additional printfs confirm that with MPI in the code, 
> hwloc_accessat succeeds on the various /sys/ directories, but the
> overall procedure for getting PUs from these fails. Without MPI,
> access to /sys/ directories fails but the fallback
> hwloc_setup_pu_level works.

Sounds like your I/O with MPI is getting redirected to the I/O node
(and hence finding /sys from the Linux kernel there) but when you're
running without MPI it's trying to open files on the compute node and
the CNK isn't presenting the /sys directories, causing it to fall back.

I've run lstopo on our BG/P and I get to see the 4 cores there whether
it's the stock code or if I add an MPI_Init() to the start.  The
output from lstopo when built with --enable-debug confirms it's
reporting kernel and hostname info from the I/O node associated with
the block:

Machine#0(Backend=Linux OSName=CNK OSRelease=2.6.16.60-304 OSVersion=1
HostName=r00-m1-n04.pcf.vlsci.unimelb.edu.au Architecture=BGP) [...]

It might be interesting to build something like ls with the BG/Q
compilers to see if you can run it on a compute node to see what /proc
or /sys look like in each case.
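
Something like this ought to do as a stand-in, given there's no shell
on the compute nodes (the compiler wrapper and submission step are
illustrative):

cat > lsdir.c <<'EOF'
#include <stdio.h>
#include <dirent.h>
/* list whatever directory is named on the command line */
int main(int argc, char **argv)
{
    DIR *d = opendir(argc > 1 ? argv[1] : "/sys");
    struct dirent *e;
    if (!d) { perror("opendir"); return 1; }
    while ((e = readdir(d)) != NULL)
        printf("%s\n", e->d_name);
    closedir(d);
    return 0;
}
EOF
mpicc lsdir.c -o lsdir
# then submit "./lsdir /sys" (and /proc) as a single-task job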

cheers,
Chris
-- 
    Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [hwloc-devel] PCI device name question

2012-03-22 Thread Christopher Samuel
On 21/03/12 08:07, Brice Goglin wrote:

> New patch attached, it doesn't add port numbers for non-IB
> devices.

Extract from lstopo on SGI XE270 box with Mellanox dual port IB card:

PCIBridge
  PCI 15b3:673c
Net L#2 "ib1"
Net L#3 "ib0"
OpenFabrics L#4 "mlx4_0"

Looks OK to me.
-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [hwloc-devel] BGQ empty topology with MPI

2012-03-22 Thread Christopher Samuel
On 22/03/12 20:58, Brice Goglin wrote:

> So there's something strange going on when MPI is added. Which MPI
> are you using? Is this a derivative of MPICH that embeds hwloc? (MPICH
> >= 1.2.1 if I remember correctly)

Not sure about BG/Q, but BG/P uses code derived from MPICH2 according
to: http://wiki.bg.anl-external.org/index.php/Main_Page

Our BG/P seems to claim it's from MPICH2 1.1:

samuel@tambo:~> mpicc -v
mpicc for 1.1

cheers,
Chris
-- 
    Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [hwloc-devel] BGQ empty topology with MPI

2012-03-22 Thread Christopher Samuel
On 22/03/12 01:08, Daniel Ibanez wrote:

> Attached is the stderr and stdout from lstopo compiled as you
> said.

Interesting, so it's not correctly detecting the topology, as BG/Q has
16 compute cores, each with 4 hardware threads.  Instead it's detecting
all 64 hardware threads and treating them as cores, if I'm reading that
right.

I was puzzled by the OS info output too, it says:

Machine#0(Backend=Linux OSName=CNK
OSRelease=2.6.32-220.el6.bgq110_20120104.ppc64 OSVersion=1
HostName=R00-ID-J04.i2b.cetus Architecture=) cpuset 0xf...f complete
0x,0x online 0xf...f allowed 0xf...f nodeset 0x0
completeN 0x0 allowedN 0xf...f

However, looking at the (open) source code for the CNK [1] (at least
for BG/P), the uname info seems to be derived from the I/O nodes when
it's running in CIOD mode, so I suspect that's what's happening here
(it looks like a RHEL6-derived kernel from that).

> I can't run hwloc-gather-topology.sh on the compute nodes since its
> a script, but I can run it on the front end node.

For those unfamiliar with BlueGene (at least P, and I suspect the same
is true for Q), this is because the CNK doesn't implement fork() or
execve(), they're designed to start your code and just keep running it
until it dies.

[1] - http://wiki.bg.anl-external.org/index.php/Cnk

cheers!
Chris
-- 
    Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [hwloc-devel] BGQ empty topology with MPI

2012-03-20 Thread Christopher Samuel
On 21/03/12 13:37, Daniel Ibanez wrote:

> Please let me know if there's a hint of what could be causing it,
> where to post, and what info to provide.

Are you running Linux or CNK on the compute nodes for this?

cheers!
Chris
-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [OMPI devel] [OMPI svn] svn:open-mpi r26077 (fwd)

2012-03-01 Thread Christopher Samuel
On 02/03/12 02:56, Nathan Hjelm wrote:

> Found a pretty nasty frag leak (and a minor one) in ob1 (see
> commit below). If this fix addresses some hangs we are seeing on
> infiniband LANL might want a 1.4.6 rolled (or a faster rollout for
> 1.6.0).

What symptoms would an affected job show?  Does it fail with an OMPI
error or does it just hang using 0% CPU?

cheers,
Chris
-- 
    Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [OMPI devel] Open MPI nightly tarballs suspended / 1.5.5rc3

2012-02-28 Thread Christopher Samuel
On 29/02/12 07:44, Jeffrey Squyres wrote:

> - BlueGene fixes

rc3 fixes the builds on our front end node, thanks!

-- 
    Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [OMPI devel] poor btl sm latency

2012-02-28 Thread Christopher Samuel
On 13/02/12 22:11, Matthias Jurenz wrote:

> Do you have any idea? Please help!

Do you see the same bad latency in the old branch (1.4.5)?

cheers,
Chris
-- 
    Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [OMPI devel] 1.5.5rc2

2012-02-23 Thread Christopher Samuel
On 24/02/12 15:12, Christopher Samuel wrote:

> I suspect this is irrelevant, but I got a build failure trying to 
> compile it on our BG/P front end node (login node) with the IBM XL 
> compilers.

Oops, forgot how I built it..

export PATH=/opt/ibmcmp/vac/bg/9.0/bin/:/opt/ibmcmp/vacpp/bg/9.0/bin:/opt/ibmcmp/xlf/bg/11.1/bin:$PATH

CC=xlc CXX=xlC F77=xlf ./configure && make

-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/


