Re: [OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-24 Thread Gilles Gouaillardet
Hi Ralph, On 2014/06/25 2:51, Ralph Castain wrote: > Had a chance to review this with folks here, and we think that having > oversubscribe automatically set overload makes some sense. However, we do > want to retain the ability to separately specify oversubscribe and overload > as well since these

Re: [OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Gilles Gouaillardet
Ralph, I pushed the change (r32079) and updated the wiki. The RFC can now be closed. The consensus is that the semantics of opal_hwloc_base_get_relative_locality will not be changed, since that is not needed: the hang is a coll/ml bug, so it will be fixed within coll/ml. Cheers, Gilles On 2014/06/25

Re: [OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-24 Thread Ralph Castain
Hi Gilles Had a chance to review this with folks here, and we think that having oversubscribe automatically set overload makes some sense. However, we do want to retain the ability to separately specify oversubscribe and overload as well since these two terms don't mean quite the same thing. Our

Re: [OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Ralph Castain
Yeah, we should make that change, if you wouldn't mind doing it. On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET < gilles.gouaillar...@gmail.com> wrote: > Ralph, > > That makes perfect sense. > > What about FCA_IS_LOCAL_PROCESS ? > Shall we keep it or shall we use directly OPAL_PROC_ON_LOC

Re: [OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Gilles GOUAILLARDET
Ralph, That makes perfect sense. What about FCA_IS_LOCAL_PROCESS? Shall we keep it, or shall we use OPAL_PROC_ON_LOCAL_NODE directly? Cheers Gilles Ralph Castain wrote: >Hi Gilles > > >We discussed this at the devel conference this morning. The root cause of the >problem is a test

Re: [OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Ralph Castain
Hi Gilles We discussed this at the devel conference this morning. The root cause of the problem is a test in coll/ml that we feel is incorrect - it basically checks to see if the proc itself is bound, and then assumes that all other procs are similarly bound. This in fact is never guaranteed to be

[OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-24 Thread Gilles Gouaillardet
Folks, this issue is related to the failures reported by MTT on the trunk when the IBM test suite invokes MPI_Comm_spawn. My test bed consists of 3 (virtual) machines, each with 2 sockets and 8 CPUs per socket. If I run on one host (without any batch manager) mpirun -np 16 --host slurm1 --oversub

[OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Gilles Gouaillardet
WHAT: semantic change of opal_hwloc_base_get_relative_locality WHY: make it closer to what coll/ml expects. Currently, opal_hwloc_base_get_relative_locality means "at what level do these procs share cpus"; however, coll/ml is using it as "at what level are these procs commonly bound