Yeah, we should make that change, if you wouldn't mind doing it.
On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET <
gilles.gouaillar...@gmail.com> wrote:

> Ralph,
>
> That makes perfect sense.
>
> What about FCA_IS_LOCAL_PROCESS ?
> Shall we keep it, or shall we use OPAL_PROC_ON_LOCAL_NODE directly ?
>
> Cheers
>
> Gilles
>
> Ralph Castain <r...@open-mpi.org> wrote:
>
> Hi Gilles
>
> We discussed this at the devel conference this morning. The root cause of
> the problem is a test in coll/ml that we feel is incorrect - it basically
> checks to see if the proc itself is bound, and then assumes that all other
> procs are similarly bound. This is in fact never guaranteed to be true, as
> someone could use the rank_file method to specify that some procs are to be
> left unbound while others are to be bound to specified cpus.
>
> Nathan has looked at that check before and believes it isn't necessary.
> All coll/ml really needs to know is whether the two procs share the same
> node, and the current locality algorithm will provide that information. We
> have asked him to "fix" the coll/ml selection logic to resolve that
> situation.
>
> After then discussing the various locality definitions, it was our feeling
> that the current definition is probably the better one, unless you have a
> reason for changing it other than coll/ml. If so, we'd be happy to revisit
> the proposal.
>
> Make sense?
> Ralph
>
>
> On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> WHAT: semantic change of opal_hwloc_base_get_relative_locality
>>
>> WHY: make it closer to what coll/ml expects.
>>
>> Currently, opal_hwloc_base_get_relative_locality means "at what
>> level do these procs share cpus";
>> however, coll/ml is using it as "at what level are these procs
>> commonly bound".
>>
>> It is important to note that if a task is bound to all the
>> available cpus, locality should be set to OPAL_PROC_ON_NODE only.
>> /* e.g. on a single-socket Sandy Bridge system, use
>> OPAL_PROC_ON_NODE instead of OPAL_PROC_ON_L3CACHE */
>>
>> This was initially discussed on the devel mailing list:
>> http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>>
>> As advised by Ralph, I browsed the source code looking for how
>> (ompi_proc_t *)->proc_flags is used.
>> So far, it is mainly used to figure out whether the proc is on the
>> same node or not.
>>
>> Notable exceptions are:
>> a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c :
>> OPAL_PROC_ON_LOCAL_SOCKET
>> b) ompi/mca/coll/fca/coll_fca_module.c and
>> oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
>>
>> About a), the new definition fixes a hang in coll/ml.
>> About b), FCA_IS_LOCAL_PROCESS looks like legacy code /* i could only
>> find OMPI_PROC_FLAG_LOCAL in v1.3 */,
>> so this macro can simply be removed and replaced with
>> OPAL_PROC_ON_LOCAL_NODE.
>>
>> At this stage, I cannot find any objection to the described change.
>> Please report if you have any, and/or feel free to comment.
>>
>> WHERE: see the two attached patches
>>
>> TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago,
>> June 24-26.
>> The RFC will become final only after the meeting.
>> /* Ralph already added this topic to the agenda */
>>
>> Thanks
>>
>> Gilles
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/06/15046.php
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/15049.php