Saw it and will review - thanks!
On Tue, Jun 24, 2014 at 9:51 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:

> Ralph,
>
> I pushed the change (r32079) and updated the wiki.
>
> The RFC can now be closed. The consensus is that the semantics of
> opal_hwloc_base_get_relative_locality will not be changed, since that
> is not needed: the hang is a coll/ml bug, so it will be fixed within
> coll/ml.
>
> Cheers,
>
> Gilles
>
> On 2014/06/25 1:12, Ralph Castain wrote:
>
> > Yeah, we should make that change, if you wouldn't mind doing it.
> >
> > On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET
> > <gilles.gouaillar...@gmail.com> wrote:
> >
> > > Ralph,
> > >
> > > That makes perfect sense.
> > >
> > > What about FCA_IS_LOCAL_PROCESS? Shall we keep it, or shall we use
> > > OPAL_PROC_ON_LOCAL_NODE directly?
> > >
> > > Cheers,
> > >
> > > Gilles
> > >
> > > Ralph Castain <r...@open-mpi.org> wrote:
> > >
> > > > Hi Gilles,
> > > >
> > > > We discussed this at the devel conference this morning. The root
> > > > cause of the problem is a test in coll/ml that we feel is
> > > > incorrect: it basically checks whether the proc itself is bound,
> > > > and then assumes that all other procs are similarly bound. That is
> > > > in fact never guaranteed to be true, as someone could use the
> > > > rank_file method to specify that some procs are to be left unbound
> > > > while others are bound to specified cpus.
> > > >
> > > > Nathan has looked at that check before and believes it isn't
> > > > necessary. All coll/ml really needs to know is that the two procs
> > > > share the same node, and the current locality algorithm will
> > > > provide that information. We have asked him to "fix" the coll/ml
> > > > selection logic to resolve that situation.
> > > >
> > > > After discussing the various locality definitions, our feeling was
> > > > that the current definition is probably the better one, unless you
> > > > have a reason for changing it other than coll/ml. If so, we'd be
> > > > happy to revisit the proposal.
> > > >
> > > > Make sense?
> > > > Ralph
> > > >
> > > > On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet
> > > > <gilles.gouaillar...@iferc.org> wrote:
> > > >
> > > > > WHAT: semantic change of opal_hwloc_base_get_relative_locality
> > > > >
> > > > > WHY: make it closer to what coll/ml expects.
> > > > >
> > > > > Currently, opal_hwloc_base_get_relative_locality means "at what
> > > > > level do these procs share cpus"; however, coll/ml is using it
> > > > > as "at what level are these procs commonly bound".
> > > > >
> > > > > It is important to note that if a task is bound to all the
> > > > > available cpus, locality should be set to OPAL_PROC_ON_NODE only.
> > > > > /* e.g. on a single socket Sandy Bridge system, use
> > > > > OPAL_PROC_ON_NODE instead of OPAL_PROC_ON_L3CACHE */
> > > > > [editor's note: a toy model of this distinction appears after
> > > > > the end of the quoted thread]
> > > > >
> > > > > This was initially discussed on the devel mailing list:
> > > > > http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
> > > > >
> > > > > As advised by Ralph, I browsed the source code looking for how
> > > > > (ompi_proc_t *)->proc_flags is used. So far, it is mainly used
> > > > > to figure out whether the proc is on the same node or not.
> > > > >
> > > > > Notable exceptions are:
> > > > > a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c :
> > > > >    OPAL_PROC_ON_LOCAL_SOCKET
> > > > > b) ompi/mca/coll/fca/coll_fca_module.c and
> > > > >    oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
> > > > >
> > > > > About a): the new definition fixes a hang in coll/ml.
> > > > > About b): FCA_IS_LOCAL_PROCESS looks like legacy code /* I could
> > > > > only find OMPI_PROC_FLAG_LOCAL in v1.3 */, so this macro can
> > > > > simply be removed and replaced with OPAL_PROC_ON_LOCAL_NODE.
> > > > >
> > > > > At this stage, I cannot find any objection to the described
> > > > > change. Please report one if you have any, and/or feel free to
> > > > > comment.
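
[Editor's note: below is a minimal, self-contained sketch of the
replacement suggested in b) above. It follows the thread's claim that
FCA_IS_LOCAL_PROCESS is a thin legacy wrapper around the generic node
test; the flag value, the proc_flags array, and the counting loop are
all hypothetical, shown only for illustration. The quoted RFC continues
below.]

    /* Hypothetical, self-contained illustration -- not the actual
     * coll/fca source. */
    #include <stdio.h>

    #define OPAL_PROC_ON_NODE          0x01u        /* toy flag value */
    #define OPAL_PROC_ON_LOCAL_NODE(f) ((f) & OPAL_PROC_ON_NODE)
    /* legacy wrapper being removed:
     * #define FCA_IS_LOCAL_PROCESS(f) OPAL_PROC_ON_LOCAL_NODE(f) */

    int main(void)
    {
        unsigned proc_flags[] = { 0x01u, 0x00u, 0x01u };  /* toy peers */
        int comm_size = 3, num_local_procs = 0;

        for (int i = 0; i < comm_size; i++) {
            /* before: if (FCA_IS_LOCAL_PROCESS(proc_flags[i])) ... */
            if (OPAL_PROC_ON_LOCAL_NODE(proc_flags[i])) {
                num_local_procs++;    /* peer i shares this node */
            }
        }
        printf("local peers: %d of %d\n", num_local_procs, comm_size);
        return 0;
    }

Since the wrapper adds no information beyond the generic flag, removing
it keeps coll/fca and scoll/fca in line with the other components that
test proc_flags directly.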
> > > > > WHERE: see the two attached patches.
> > > > >
> > > > > TIMEOUT: June 30th, after the Open MPI developers meeting in
> > > > > Chicago, June 24-26. The RFC will become final only after the
> > > > > meeting. /* Ralph already added this topic to the agenda */
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Gilles
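
[Editor's note: to make the WHAT/WHY distinction in the RFC concrete,
here is a self-contained toy model. Every name and flag value in it is
hypothetical -- this is not the real opal_hwloc_base_get_relative_locality,
just a sketch of the two semantics on an imaginary single-socket node
whose 8 cpus all sit under one L3 cache.]

    #include <stdio.h>

    #define ON_NODE   0x1u
    #define ON_SOCKET 0x2u
    #define ON_L3     0x4u

    #define NODE_CPUS 0xffu   /* cpus 0-7: one node, one socket, one L3 */

    /* Current semantics: "at what level do these procs share cpus".
     * Any overlap of the two cpusets lies inside the single socket/L3,
     * so any overlap at all yields L3-level locality. */
    static unsigned share_cpus_locality(unsigned a, unsigned b)
    {
        unsigned loc = ON_NODE;
        if (a & b) {
            loc |= ON_SOCKET | ON_L3;
        }
        return loc;
    }

    /* Proposed semantics: "at what level are these procs commonly
     * bound". A proc bound to ALL available cpus is effectively
     * unbound, so the result degrades to node-level only. */
    static unsigned commonly_bound_locality(unsigned a, unsigned b)
    {
        if (a == NODE_CPUS || b == NODE_CPUS) {
            return ON_NODE;
        }
        return share_cpus_locality(a, b);
    }

    int main(void)
    {
        /* both procs "bound" to every cpu on the node */
        unsigned a = NODE_CPUS, b = NODE_CPUS;

        printf("share-cpus:     0x%x\n", share_cpus_locality(a, b));     /* 0x7 */
        printf("commonly-bound: 0x%x\n", commonly_bound_locality(a, b)); /* 0x1 */
        return 0;
    }

Under the consensus reached at the top of the thread, the "share cpus"
behaviour is the one that stays; the fix lands in coll/ml instead, which
per Ralph's note only needs the node-level bit anyway.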