Folks,
FWIW, I observe a similar behaviour on my system.
IMHO, the root cause is that OFED has been upgraded from a (quite) old
version to the latest 3.12 version.
Here is the relevant part of the code (btl_openib.c from master):
static uint64_t calculate_max_reg (void)
{
if (0 == stat("/sys/modu
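(For what it's worth, a quick way to see whether the mlx4 parameters that
this function stats are still there after the OFED upgrade; the
/sys/module/mlx4_core paths below are my assumption of what the truncated
stat() refers to, so adjust if your tree differs:)

#!/bin/bash
# Check whether the mlx4 MTT parameters are exposed on this node; these
# are the sysfs files the stat() above most likely checks (assumed paths,
# not copied from the truncated snippet).
for f in /sys/module/mlx4_core/parameters/log_num_mtt \
         /sys/module/mlx4_core/parameters/log_mtts_per_seg; do
    if [ -r "$f" ]; then
        echo "$f = $(cat "$f")"
    else
        echo "$f : not present"
    fi
done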
Hmmm… they probably linked it against the external, system hwloc, so it
sounds like one or more of your nodes has a different hwloc rpm on it.
I couldn’t leaf through your output well enough to see all the lstopo versions,
but you might check to ensure they are the same.
Looking at the code ba
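(If it helps, a trivial way to compare them, assuming you can ssh to the
nodes; coma01/coma02 below are just example node names:)

#!/bin/bash
# Print the lstopo/hwloc version reported on each node so they can be
# compared; replace the node list with your own.
for node in coma01 coma02; do
    echo -n "$node: "
    ssh "$node" lstopo --version
done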
It is the default openmpi that comes with Ubuntu 14.04.
> On 08 Dec 2014, at 17:17, Ralph Castain wrote:
>
> Pim: is this an OMPI you built, or one you were given somehow? If you built
> it, how did you configure it?
>
>> On Dec 8, 2014, at 8:12 AM, Brice Goglin wrote:
>>
>> It likely depend
Pim: is this an OMPI you built, or one you were given somehow? If you built it,
how did you configure it?
> On Dec 8, 2014, at 8:12 AM, Brice Goglin wrote:
>
> It likely depends on how SLURM allocates the cpuset/cgroup inside the
> nodes. The XML warning is related to these restrictions inside
It likely depends on how SLURM allocates the cpuset/cgroup inside the
nodes. The XML warning is related to these restrictions inside the node.
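(One rough way to see what SLURM actually gave the job is to print the
allowed CPU list inside the allocation; the srun options below just mirror
the 2-node/4-task example discussed in this thread:)

#!/bin/bash
# Inside a SLURM allocation, show which CPUs each task is restricted to.
srun -N 2 -n 4 bash -c 'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'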
Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
How do we check after install whether OMPI uses the embedded or the
system-wide hwl
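(One rough check, assuming a standard install: a build against an external
hwloc usually shows a libhwloc.so dependency on the Open MPI libraries,
while the embedded copy does not, e.g.)

#!/bin/bash
# Look for a system libhwloc dependency; no hit usually means the embedded
# copy is used. Library paths are examples and may differ per distro.
ldd "$(which mpirun)" | grep -i hwloc
ldd /usr/lib/libmpi.so* 2>/dev/null | grep -i hwloc
# ompi_info also reports hwloc-related build/MCA information:
ompi_info --all | grep -i hwloc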
Dear Ralph,
The nodes are called coma## and, as you can see in the logs, the nodes of the
broken example are the same as the nodes of the working one, so that doesn’t
seem to be the cause. Unless (very likely) I’m missing something. Anything else
I can check?
Regards,
Pim
> On 08 Dec 2014, at
As Brice said, OMPI has its own embedded version of hwloc that we use, so there
is no Slurm interaction to be considered. The most likely cause is that one or
more of your nodes is picking up a different version of OMPI. So things “work”
if you happen to get nodes where all the versions match, a
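(An easy way to rule that out is to print the mpirun path and version each
node resolves; the node names below are placeholders:)

#!/bin/bash
# Show which Open MPI each node picks up; adjust the node list.
for node in coma01 coma02; do
    echo "== $node =="
    ssh "$node" 'which mpirun; mpirun --version'
done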
Dear Brice,
I am not sure why this is happening, since all the code seems to be using the
same hwloc library version (1.8), but it does :) An MPI program is started
through SLURM on two nodes with four CPU cores in total (divided over the
nodes) using the following script:
#! /bin/bash
#SBATCH -N 2 -n
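(The script got cut off above; for context, a minimal sketch of what such a
two-node, four-task submission could look like. Everything below, including
the program name, is illustrative rather than the original script:)

#!/bin/bash
#SBATCH -N 2 -n 4
# Two nodes, four MPI tasks in total (illustrative values matching the
# description above); ./my_mpi_prog is a placeholder program name.
mpirun ./my_mpi_prog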
Hello Kevin,
Could you try testing with Open MPI 1.8.3? There was a bug in 1.8.1
that you are likely hitting in your testing.
Thanks,
Howard
2014-12-07 17:18 GMT-07:00 Kevin Buckley <kevin.buckley.ecs.vuw.ac...@gmail.com>:
> Apologies for the lack of a subject line: cut and pasted the body