Ralph,

I pushed the change (r32079) and updated the wiki.

The RFC can now be closed. The consensus is that the semantics of
opal_hwloc_base_get_relative_locality will not be changed, since this is
not needed: the hang is a coll/ml bug, so it will be fixed within coll/ml.

Cheers,

Gilles

On 2014/06/25 1:12, Ralph Castain wrote:
> Yeah, we should make that change, if you wouldn't mind doing it.
>
>
>
> On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> That makes perfect sense.
>>
>> What about FCA_IS_LOCAL_PROCESS ?
>> Shall we keep it or shall we use OPAL_PROC_ON_LOCAL_NODE directly ?
>>
>> Cheers
>>
>> Gilles
>>
>> Ralph Castain <r...@open-mpi.org> wrote:
>> Hi Gilles
>>
>> We discussed this at the devel conference this morning. The root cause of
>> the problem is a test in coll/ml that we feel is incorrect - it basically
>> checks to see if the proc itself is bound, and then assumes that all other
>> procs are similarly bound. This in fact is never guaranteed to be true as
>> someone could use the rank_file method to specify that some procs are to be
>> left unbound, while others are to be bound to specified cpus.
>>
>> Nathan has looked at that check before and believes it isn't necessary.
>> All coll/ml really needs to know is that the two procs share the same node,
>> and the current locality algorithm will provide that information. We have
>> asked him to "fix" the coll/ml selection logic to resolve that situation.
>>
>> After then discussing the various locality definitions, it was our feeling
>> that the current definition is probably the better one unless you have a
>> reason for changing it other than coll/ml. If so, we'd be happy to revisit
>> the proposal.
>>
>> Make sense?
>> Ralph
>>
>>
>>
>> On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>>> WHAT: semantic change of opal_hwloc_base_get_relative_locality
>>>
>>> WHY: make it closer to what coll/ml expects.
>>>
>>> Currently, opal_hwloc_base_get_relative_locality means "at what
>>> level do these procs share cpus"
>>> however, coll/ml is using it as "at what level are these procs
>>> commonly bound".
>>>
>>> it is important to note that if a task is bound to all the
>>> available cpus, locality should
>>> be set to OPAL_PROC_ON_NODE only.
>>> /* e.g. on a single socket Sandy Bridge system, use
>>> OPAL_PROC_ON_NODE instead of OPAL_PROC_ON_L3CACHE */
>>>
>>> This has been initially discussed in the devel mailing list
>>> http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>>>
>>> as advised by Ralph, i browsed the source code looking for how the
>>> (ompi_proc_t *)->proc_flags is used.
>>> so far, it is mainly used to figure out whether the proc is on the
>>> same node or not.
>>>
>>> notable exceptions are :
>>> a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c :
>>> OPAL_PROC_ON_LOCAL_SOCKET
>>> b) ompi/mca/coll/fca/coll_fca_module.c and
>>> oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
>>>
>>> about a) the new definition fixes a hang in coll/ml
>>> about b) FCA_IS_LOCAL_PROCESS looks like legacy code /* i could only
>>> find OMPI_PROC_FLAG_LOCAL in v1.3 */
>>> so this macro can be simply removed and replaced with
>>> OPAL_PROC_ON_LOCAL_NODE
>>>
>>> at this stage, i cannot find any objection to making the described
>>> change.
>>> please report if any and/or feel free to comment.
>>>
>>> WHERE: see the two attached patches
>>>
>>> TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago,
>>> June 24-26.
>>> The RFC will become final only after the meeting.
>>> /* Ralph already added this topic to the agenda */
>>>
>>> Thanks
>>>
>>> Gilles
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/06/15046.php
>>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/06/15049.php
>>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/15050.php