Ralph,

I pushed the change (r32079) and updated the wiki.

The RFC can now be closed: the consensus is that the semantics of
opal_hwloc_base_get_relative_locality
will not be changed, since this is not needed. The hang is a coll/ml
bug, so it will be fixed within coll/ml.

Cheers,

Gilles

On 2014/06/25 1:12, Ralph Castain wrote:
> Yeah, we should make that change, if you wouldn't mind doing it.
>
>
>
> On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> That makes perfect sense.
>>
>> What about FCA_IS_LOCAL_PROCESS ?
>> Shall we keep it, or shall we use OPAL_PROC_ON_LOCAL_NODE directly?
>>
>> Cheers
>>
>> Gilles
>>
>> Ralph Castain <r...@open-mpi.org> wrote:
>> Hi Gilles
>>
>> We discussed this at the devel conference this morning. The root cause of
>> the problem is a test in coll/ml that we feel is incorrect - it basically
>> checks to see if the proc itself is bound, and then assumes that all other
>> procs are similarly bound. This in fact is never guaranteed to be true as
>> someone could use the rank_file method to specify that some procs are to be
>> left unbound, while others are to be bound to specified cpus.
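>>
>> To make this concrete, here is a minimal standalone sketch (mock data,
>> not the actual coll/ml code): rank 0 checks only its own binding and
>> reaches the wrong conclusion as soon as a rankfile leaves a peer unbound.
>>
>>     #include <stdbool.h>
>>     #include <stdio.h>
>>
>>     /* mock per-proc binding state; rank 1 was left unbound via a rankfile */
>>     static const bool proc_is_bound[] = { true, false, true };
>>
>>     int main(void)
>>     {
>>         /* flawed check: "I (rank 0) am bound, so everyone must be" */
>>         bool assumed = proc_is_bound[0];
>>
>>         /* what actually holds: every peer has to be inspected */
>>         bool actual = true;
>>         for (int r = 0; r < 3; r++) {
>>             actual = actual && proc_is_bound[r];
>>         }
>>
>>         printf("assumed all bound: %d, actually all bound: %d\n",
>>                assumed, actual);
>>         return 0;
>>     }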
>>
>> Nathan has looked at that check before and believes it isn't necessary.
>> All coll/ml really needs to know is that the two procs share the same node,
>> and the current locality algorithm will provide that information. We have
>> asked him to "fix" the coll/ml selection logic to resolve that situation.
>>
>> After then discussing the various locality definitions, our feeling was
>> that the current definition is probably the better one, unless you have a
>> reason for changing it other than coll/ml. If you do, we'd be happy to
>> revisit the proposal.
>>
>> Make sense?
>> Ralph
>>
>>
>>
>> On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>>> WHAT: semantic change of opal_hwloc_base_get_relative_locality
>>>
>>> WHY:  make it closer to what coll/ml expects.
>>>
>>>       Currently, opal_hwloc_base_get_relative_locality means "at what
>>> level do these procs share cpus";
>>>       however, coll/ml uses it as "at what level are these procs
>>> commonly bound".
>>>
>>>       It is important to note that if a task is bound to all the
>>> available cpus, its locality should
>>>       be set to OPAL_PROC_ON_NODE only.
>>>       /* e.g. on a single-socket Sandy Bridge system, use
>>> OPAL_PROC_ON_NODE instead of OPAL_PROC_ON_L3CACHE */
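>>>
>>>       As a standalone illustration (the flag names mirror the OPAL
>>> locality flags, but the values and the surrounding logic are mocked):
>>>
>>>       #include <stdint.h>
>>>       #include <stdio.h>
>>>
>>>       /* mock values for illustration; not the real OPAL definitions */
>>>       #define OPAL_PROC_ON_NODE    0x0001
>>>       #define OPAL_PROC_ON_L3CACHE 0x0002
>>>
>>>       int main(void)
>>>       {
>>>           /* both procs bound to all available cpus => effectively unbound */
>>>           int bound_to_all_cpus = 1;
>>>
>>>           /* current semantics: "at what level do these procs share cpus" */
>>>           uint16_t current = OPAL_PROC_ON_NODE | OPAL_PROC_ON_L3CACHE;
>>>
>>>           /* proposed semantics: "at what level are they commonly bound" */
>>>           uint16_t proposed = bound_to_all_cpus ? OPAL_PROC_ON_NODE
>>>                                                 : current;
>>>
>>>           printf("current: 0x%x, proposed: 0x%x\n", current, proposed);
>>>           return 0;
>>>       }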
>>>
>>>       This was initially discussed on the devel mailing list:
>>>       http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>>>
>>>       As advised by Ralph, I browsed the source code looking for how
>>> the (ompi_proc_t *)->proc_flags field is used.
>>>       So far, it is mainly used to figure out whether the proc is on the
>>> same node or not.
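>>>
>>>       For instance, the typical pattern is (illustrative fragment,
>>> assuming proc points to an ompi_proc_t):
>>>
>>>       if (OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
>>>           /* the peer shares this node, e.g. shared memory is usable */
>>>       }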
>>>
>>>       Notable exceptions are:
>>>        a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c :
>>> OPAL_PROC_ON_LOCAL_SOCKET
>>>        b) ompi/mca/coll/fca/coll_fca_module.c and
>>> oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
>>>
>>>       About a): the new definition fixes a hang in coll/ml.
>>>       About b): FCA_IS_LOCAL_PROCESS looks like legacy code /* I could
>>> only find OMPI_PROC_FLAG_LOCAL in v1.3 */,
>>>       so this macro can simply be removed and replaced with
>>> OPAL_PROC_ON_LOCAL_NODE.
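>>>
>>>       i.e. roughly the following (a sketch: the legacy definition is
>>> reconstructed from memory and the call site is hypothetical):
>>>
>>>       /* legacy wrapper; the v1.3-era OMPI_PROC_FLAG_LOCAL is gone */
>>>       #define FCA_IS_LOCAL_PROCESS(n) OPAL_PROC_ON_LOCAL_NODE(n)
>>>
>>>       /* proposed: use the OPAL macro directly at the call sites */
>>>       if (OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
>>>           local_peers++;  /* hypothetical counter, for illustration */
>>>       }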
>>>
>>>       At this stage, I cannot find any objection to the described
>>> change.
>>>       Please report any, and/or feel free to comment.
>>>
>>> WHERE: see the two attached patches
>>>
>>> TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago,
>>> June 24-26.
>>>          The RFC will become final only after the meeting.
>>>          /* Ralph already added this topic to the agenda */
>>>
>>> Thanks
>>>
>>> Gilles
>>>
>>>
>
>
