Hi Ralph,
On 2014/06/25 2:51, Ralph Castain wrote:
> Had a chance to review this with folks here, and we think that having
> oversubscribe automatically set overload makes some sense. However, we do
> want to retain the ability to separately specify oversubscribe and overload
> as well since these
Ralph,
I pushed the change (r32079) and updated the wiki.
The RFC can now be closed. The consensus is that the semantics of
opal_hwloc_base_get_relative_locality
will not be changed, since this is not needed: the hang is a coll/ml
bug, so it will be fixed within coll/ml.
Cheers,
Gilles
On 2014/06/25
Hi Gilles
Had a chance to review this with folks here, and we think that having
oversubscribe automatically set overload makes some sense. However, we do
want to retain the ability to separately specify oversubscribe and overload
as well since these two terms don't mean quite the same thing.
Our
Yeah, we should make that change, if you wouldn't mind doing it.
On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET <
gilles.gouaillar...@gmail.com> wrote:
> Ralph,
>
> That makes perfect sense.
>
> What about FCA_IS_LOCAL_PROCESS ?
> Shall we keep it or shall we use directly OPAL_PROC_ON_LOC
Ralph,
That makes perfect sense.
What about FCA_IS_LOCAL_PROCESS?
Shall we keep it, or shall we use OPAL_PROC_ON_LOCAL_NODE directly?
Cheers
Gilles
Ralph Castain wrote:
>Hi Gilles
>
>
>We discussed this at the devel conference this morning. The root cause of the
>problem is a test
Hi Gilles
We discussed this at the devel conference this morning. The root cause of
the problem is a test in coll/ml that we feel is incorrect - it basically
checks to see if the proc itself is bound, and then assumes that all other
procs are similarly bound. This in fact is never guaranteed to be
Folks,
This issue is related to the failures reported by MTT on the trunk when
the IBM test suite invokes MPI_Comm_spawn.
My test bed consists of 3 (virtual) machines, each with 2 sockets and
8 CPUs per socket.
If I run on one host (without any batch manager)
mpirun -np 16 --host slurm1 --oversub
WHAT: semantic change of opal_hwloc_base_get_relative_locality
WHY: make it closer to what coll/ml expects.
Currently, opal_hwloc_base_get_relative_locality means "at what level do
these procs share cpus";
however, coll/ml is using it as "at what level are these procs commonly
bound
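
To make the difference between the two readings concrete, here is a minimal sketch in C. It is not the Open MPI implementation: the cpuset_t type, the SOCKET0/SOCKET1/WHOLE_NODE masks and both helper functions are hypothetical names made up for illustration, and "commonly bound" is taken here to mean "both procs are confined to the same object", which is an assumption about what coll/ml expects. The layout (2 sockets, 8 cpus per socket) mirrors the test bed described in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per cpu on a node with 2 sockets x 8 cpus. */
typedef uint16_t cpuset_t;

#define SOCKET0    ((cpuset_t)0x00FFu)  /* cpus 0-7 */
#define SOCKET1    ((cpuset_t)0xFF00u)  /* cpus 8-15 */
#define WHOLE_NODE ((cpuset_t)0xFFFFu)  /* an unbound proc may run anywhere */

/* Reading 1, "share cpus": both procs can run on at least one cpu
 * that belongs to the given object. */
static bool share_cpus_on(cpuset_t object, cpuset_t a, cpuset_t b)
{
    return (a & object) != 0 && (b & object) != 0;
}

/* Reading 2, "commonly bound" (assumed meaning): both procs are
 * confined to the given object, i.e. neither can run outside it. */
static bool commonly_bound_to(cpuset_t object, cpuset_t a, cpuset_t b)
{
    cpuset_t outside = (cpuset_t)~object;
    bool a_inside = a != 0 && (a & outside) == 0;
    bool b_inside = b != 0 && (b & outside) == 0;
    return a_inside && b_inside;
}
```

With one proc unbound (WHOLE_NODE) and the other bound to SOCKET0, share_cpus_on(SOCKET0, ...) is true while commonly_bound_to(SOCKET0, ...) is false. Under the assumptions above, that is the case where the two readings disagree.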