Re: [OMPI devel] [OMPI users] Warning about not enough registerable memory on SL6.6

2014-12-08 Thread Gilles Gouaillardet
Folks, FWIW, I observe a similar behaviour on my system. IMHO, the root cause is that OFED has been upgraded from a (quite) older version to the latest 3.12 version. Here is the relevant part of the code (btl_openib.c from master): static uint64_t calculate_max_reg (void) { if (0 == stat("/sys/modu…
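For reference, here is a simplified sketch of the kind of calculation that snippet performs: deriving the maximum registerable memory from the mlx4_core module parameters. The parameter paths, the fallback values, and the helper read_module_param() are illustrative assumptions, not a verbatim copy of btl_openib.c.

/* Sketch: estimate max registerable memory from the mlx4_core
 * module parameters, in the spirit of calculate_max_reg().
 * Paths and fallbacks are illustrative assumptions. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <unistd.h>
#include <sys/stat.h>

static uint64_t read_module_param(const char *file, uint64_t fallback)
{
    FILE *fp = fopen(file, "r");
    uint64_t value = fallback;

    if (NULL == fp) {
        return fallback;
    }
    if (1 != fscanf(fp, "%" SCNu64, &value)) {
        value = fallback;
    }
    fclose(fp);
    return value;
}

static uint64_t calculate_max_reg(void)
{
    struct stat statinfo;
    uint64_t page_size = (uint64_t) sysconf(_SC_PAGESIZE);

    if (0 == stat("/sys/module/mlx4_core/parameters/log_num_mtt", &statinfo)) {
        uint64_t mtts_per_seg = UINT64_C(1) << read_module_param(
            "/sys/module/mlx4_core/parameters/log_mtts_per_seg", 1);
        uint64_t num_mtt = UINT64_C(1) << read_module_param(
            "/sys/module/mlx4_core/parameters/log_num_mtt", 1);
        /* Each MTT entry maps mtts_per_seg pages worth of memory. */
        return num_mtt * mtts_per_seg * page_size;
    }
    /* Unknown driver layout (e.g. after an OFED upgrade): be
     * pessimistic rather than over-reporting. */
    return 0;
}

int main(void)
{
    printf("max registerable memory: %" PRIu64 " bytes\n",
           calculate_max_reg());
    return 0;
}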

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Ralph Castain
Hmmm… they probably linked that to the external, system hwloc version, so it sounds like one or more of your nodes has a different hwloc RPM on it. I couldn’t leaf through your output well enough to see all the lstopo versions, but you might check to ensure they are the same. Looking at the code ba…
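One quick way to run that check from inside a job, rather than eyeballing logs: have every rank report the hwloc it was compiled against and the one it linked at run time. This is a minimal sketch assuming hwloc headers and an MPI wrapper compiler are available (e.g. mpicc check.c -lhwloc); note it reports the hwloc your test program sees, which matches the lstopo RPM only if both come from the same install.

/* Each rank reports compile-time vs run-time hwloc versions, so a
 * node with a stray hwloc shows up immediately in the output. */
#include <stdio.h>
#include <mpi.h>
#include <hwloc.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* HWLOC_API_VERSION: what the headers said at compile time.
     * hwloc_get_api_version(): what the loaded library says now. */
    printf("rank %d on %s: compiled 0x%x, runtime 0x%x\n",
           rank, host, (unsigned) HWLOC_API_VERSION,
           hwloc_get_api_version());

    MPI_Finalize();
    return 0;
}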

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Pim Schellart
It is the default openmpi that comes with Ubuntu 14.04.
> On 08 Dec 2014, at 17:17, Ralph Castain wrote:
> Pim: is this an OMPI you built, or one you were given somehow? If you built it, how did you configure it?
>> On Dec 8, 2014, at 8:12 AM, Brice Goglin wrote:
>> It likely depend…

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Ralph Castain
Pim: is this an OMPI you built, or one you were given somehow? If you built it, how did you configure it?
> On Dec 8, 2014, at 8:12 AM, Brice Goglin wrote:
> It likely depends on how SLURM allocates the cpuset/cgroup inside the nodes. The XML warning is related to these restrictions inside…

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Brice Goglin
It likely depends on how SLURM allocates the cpuset/cgroup inside the nodes. The XML warning is related to these restrictions inside the node. Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere. How do we check after install whether OMPI uses the embedded or the system-wide hwloc?
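One way to probe this from inside a running MPI process, offered as a hedged sketch: OMPI's embedded hwloc is normally built with prefixed symbol names (opal_hwloc..._*), so the plain public symbol should only resolve when an external libhwloc is loaded. That prefixing convention is an assumption here, so treat the result as a hint rather than proof. Build with something like mpicc probe.c -ldl.

#define _GNU_SOURCE   /* for RTLD_DEFAULT on glibc */
#include <stdio.h>
#include <dlfcn.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Look up the unprefixed hwloc entry point anywhere in the
     * running process image. */
    void *sym = dlsym(RTLD_DEFAULT, "hwloc_get_api_version");
    if (NULL != sym) {
        unsigned (*get_ver)(void) = (unsigned (*)(void)) sym;
        printf("public hwloc symbols visible: API version 0x%x "
               "(a system-wide hwloc is loaded)\n", get_ver());
    } else {
        printf("no public hwloc symbols: likely only the embedded, "
               "symbol-prefixed copy\n");
    }

    MPI_Finalize();
    return 0;
}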

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Pim Schellart
Dear Ralph, the nodes are called coma## and, as you can see in the logs, the nodes of the broken example are the same as the nodes of the working one, so that doesn’t seem to be the cause. Unless (very likely) I’m missing something. Anything else I can check? Regards, Pim
> On 08 Dec 2014, at…

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Ralph Castain
As Brice said, OMPI has its own embedded version of hwloc that we use, so there is no SLURM interaction to be considered. The most likely cause is that one or more of your nodes is picking up a different version of OMPI. So things “work” if you happen to get nodes where all the versions match, a…

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-08 Thread Pim Schellart
Dear Brice, I am not sure why this is happening, since all code seems to be using the same hwloc library version (1.8), but it does :) An MPI program is started through SLURM on two nodes with four CPU cores total (divided over the nodes) using the following script:
#! /bin/bash
#SBATCH -N 2 -n…

Re: [OMPI devel] (no subject)

2014-12-08 Thread Howard Pritchard
Hello Kevin, Could you try testing with Open MPI 1.8.3? There was a bug in 1.8.1 that you are likely hitting in your testing. Thanks, Howard
2014-12-07 17:18 GMT-07:00 Kevin Buckley <kevin.buckley.ecs.vuw.ac...@gmail.com>:
> Apologies for the lack of a subject line: cut and pasted the body…
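Before re-running the tests, it is worth confirming which Open MPI release the job actually picks up at run time. A minimal sketch using the MPI-3 call MPI_Get_library_version(), which is available in the 1.8 series:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Reports the library the job actually linked against, e.g.
     * "Open MPI v1.8.3, ...", independent of which mpicc built it. */
    MPI_Get_library_version(version, &len);
    if (0 == rank) {
        printf("%s\n", version);
    }

    MPI_Finalize();
    return 0;
}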