Ralph,

That makes perfect sense.
What about FCA_IS_LOCAL_PROCESS? Shall we keep it, or shall we use OPAL_PROC_ON_LOCAL_NODE directly?

Cheers

Gilles

Ralph Castain <r...@open-mpi.org> wrote:
>Hi Gilles
>
>We discussed this at the devel conference this morning. The root cause of the
>problem is a test in coll/ml that we feel is incorrect - it basically checks
>to see if the proc itself is bound, and then assumes that all other procs are
>similarly bound. This in fact is never guaranteed to be true as someone could
>use the rank_file method to specify that some procs are to be left unbound,
>while others are to be bound to specified cpus.
>
>Nathan has looked at that check before and believes it isn't necessary. All
>coll/ml really needs to know is that the two procs share the same node, and
>the current locality algorithm will provide that information. We have asked
>him to "fix" the coll/ml selection logic to resolve that situation.
>
>After discussing the various locality definitions, it was our feeling
>that the current definition is probably the better one unless you have a
>reason for changing it other than coll/ml. If so, we'd be happy to revisit the
>proposal.
>
>Make sense?
>
>Ralph
>
>
>On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet
><gilles.gouaillar...@iferc.org> wrote:
>
>WHAT: semantic change of opal_hwloc_base_get_relative_locality
>
>WHY: make it closer to what coll/ml expects.
>
>     Currently, opal_hwloc_base_get_relative_locality means "at what level do these procs share cpus"
>     however, coll/ml is using it as "at what level are these procs commonly bound".
>
>     it is important to note that if a task is bound to all the available cpus, locality should
>     be set to OPAL_PROC_ON_NODE only.
>     /* e.g. on a single socket Sandy Bridge system, use OPAL_PROC_ON_NODE instead of OPAL_PROC_ON_L3CACHE */
>
>     This was initially discussed on the devel mailing list
>     http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>
>     as advised by Ralph, I browsed the source code looking for how the (ompi_proc_t *)->proc_flags is used.
>     so far, it is mainly used to figure out whether the proc is on the same node or not.
>
>     notable exceptions are :
>     a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c : OPAL_PROC_ON_LOCAL_SOCKET
>     b) ompi/mca/coll/fca/coll_fca_module.c and oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
>
>     about a) the new definition fixes a hang in coll/ml
>     about b) FCA_IS_LOCAL_PROCESS looks like legacy code /* I could only find OMPI_PROC_FLAG_LOCAL in v1.3 */
>     so this macro can be simply removed and replaced with OPAL_PROC_ON_LOCAL_NODE
>
>     at this stage, I cannot find any objection to making the described change.
>     please report if any and/or feel free to comment.
>
>WHERE: see the two attached patches
>
>TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago, June 24-26.
>     The RFC will become final only after the meeting.
>     /* Ralph already added this topic to the agenda */
>
>Thanks
>
>Gilles
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2014/06/15046.php