Ralph,

That makes perfect sense.

What about FCA_IS_LOCAL_PROCESS?
Shall we keep it, or shall we use OPAL_PROC_ON_LOCAL_NODE directly?
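
To make the question concrete, here is a minimal sketch of what a call site could look like if the locality flags were tested directly. The flag values and the proc_flags_t type are stand-ins so the snippet compiles on its own (the real definitions live in the OPAL hwloc headers), and count_local_procs is a hypothetical helper, not code taken from coll/fca.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in locality bits: illustrative values, NOT copied from Open MPI. */
typedef uint16_t proc_flags_t;
#define OPAL_PROC_ON_NODE    0x0004
#define OPAL_PROC_ON_SOCKET  0x0020

/* Stand-in for the OPAL test: non-zero when the peer shares our node. */
#define OPAL_PROC_ON_LOCAL_NODE(n) (!!((n) & OPAL_PROC_ON_NODE))

/* Hypothetical helper: count the peers located on the local node by
 * testing the locality flags directly, with no FCA-specific wrapper. */
static size_t count_local_procs(const proc_flags_t *flags, size_t nprocs)
{
    size_t i, local = 0;

    for (i = 0; i < nprocs; i++) {
        if (OPAL_PROC_ON_LOCAL_NODE(flags[i])) {
            local++;
        }
    }
    return local;
}
```

If FCA_IS_LOCAL_PROCESS ends up being a plain alias for this test, dropping it would only change the spelling at each call site, not the behaviour.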

Cheers

Gilles

Ralph Castain <r...@open-mpi.org> wrote:
>Hi Gilles
>
>
>We discussed this at the devel conference this morning. The root cause of the 
>problem is a test in coll/ml that we feel is incorrect - it basically checks 
>to see if the proc itself is bound, and then assumes that all other procs are 
>similarly bound. This in fact is never guaranteed to be true as someone could 
>use the rank_file method to specify that some procs are to be left unbound, 
>while others are to be bound to specified cpus.
>
>
>Nathan has looked at that check before and believes it isn't necessary. All 
>coll/ml really needs to know is that the two procs share the same node, and 
>the current locality algorithm will provide that information. We have asked 
>him to "fix" the coll/ml selection logic to resolve that situation.
>
>
>After discussing the various locality definitions, it was our feeling 
>that the current definition is probably the better one unless you have a 
>reason for changing it other than coll/ml. If so, we'd be happy to revisit the 
>proposal.
>
>
>Make sense?
>
>Ralph
>
>
>
>
>On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet 
><gilles.gouaillar...@iferc.org> wrote:
>
>WHAT: semantic change of opal_hwloc_base_get_relative_locality
>
>WHY:  make it closer to what coll/ml expects.
>
>      Currently, opal_hwloc_base_get_relative_locality means "at what level do 
>these procs share cpus"
>      however, coll/ml is using it as "at what level are these procs commonly 
>bound".
>
>      it is important to note that if a task is bound to all the available 
>cpus, locality should
>      be set to OPAL_PROC_ON_NODE only.
>      /* e.g. on a single socket Sandy Bridge system, use OPAL_PROC_ON_NODE 
>instead of OPAL_PROC_ON_L3CACHE */
>
>      This has been initially discussed in the devel mailing list
>      http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>
>      As advised by Ralph, I browsed the source code looking for how the 
>(ompi_proc_t *)->proc_flags is used.
>      So far, it is mainly used to figure out whether the proc is on the same 
>node or not.
>
>      notable exceptions are :
>       a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c : 
>OPAL_PROC_ON_LOCAL_SOCKET
>       b) ompi/mca/coll/fca/coll_fca_module.c and 
>oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
>
>      about a) the new definition fixes a hang in coll/ml
>      about b) FCA_IS_LOCAL_PROCESS looks like legacy code /* I could only 
>find OMPI_PROC_FLAG_LOCAL in v1.3 */
>      so this macro can simply be removed and replaced with 
>OPAL_PROC_ON_LOCAL_NODE
>
>      At this stage, I cannot find any objection to the described change.
>      Please report any objection and/or feel free to comment.
>
>WHERE: see the two attached patches
>
>TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago, June 
>24-26.
>         The RFC will become final only after the meeting.
>         /* Ralph already added this topic to the agenda */
>
>Thanks
>
>Gilles
>
>
