Saw it and will review - thanks!


On Tue, Jun 24, 2014 at 9:51 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Ralph,
>
> I pushed the change (r32079) and updated the wiki.
>
> The RFC can now be closed. The consensus is that the semantics of
> opal_hwloc_base_get_relative_locality will not be changed, since the
> change is not needed: the hang is a coll/ml bug, so it will be fixed
> within coll/ml.
>
> Cheers,
>
> Gilles
>
>
> On 2014/06/25 1:12, Ralph Castain wrote:
>
> Yeah, we should make that change, if you wouldn't mind doing it.
>
>
>
> On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET 
> <gilles.gouaillar...@gmail.com> wrote:
>
>
>  Ralph,
>
> That makes perfect sense.
>
> What about FCA_IS_LOCAL_PROCESS?
> Shall we keep it, or shall we use OPAL_PROC_ON_LOCAL_NODE directly?
>
> Cheers
>
> Gilles
>
> Ralph Castain <r...@open-mpi.org> wrote:
> Hi Gilles
>
> We discussed this at the devel conference this morning. The root cause of
> the problem is a test in coll/ml that we feel is incorrect - it basically
> checks whether the proc itself is bound and then assumes that all other
> procs are similarly bound. In fact, this is never guaranteed to be true, as
> someone could use the rank_file method to specify that some procs are to be
> left unbound while others are bound to specified cpus.
>
> Nathan has looked at that check before and believes it isn't necessary.
> All coll/ml really needs to know is that the two procs share the same node,
> and the current locality algorithm will provide that information. We have
> asked him to "fix" the coll/ml selection logic to resolve that situation.
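>
> Roughly speaking, the fix we have in mind looks something like the sketch
> below. This is illustrative only, not the actual coll/ml code: the helper
> name is made up, and only the proc_flags field and the
> OPAL_PROC_ON_LOCAL_NODE macro come from the existing code base.
>
>     #include <stdbool.h>
>     #include "ompi/proc/proc.h"   /* ompi_proc_t and its proc_flags */
>     /* plus the opal hwloc header that provides OPAL_PROC_ON_LOCAL_NODE */
>
>     /* coll/ml only needs to know whether a peer lives on the same node.
>      * The locality flags answer that directly, with no assumption about
>      * how (or whether) the peer is bound. */
>     static bool peer_is_on_my_node(const ompi_proc_t *peer)
>     {
>         return OPAL_PROC_ON_LOCAL_NODE(peer->proc_flags);
>     }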
>
> After discussing the various locality definitions, our feeling was that
> the current definition is probably the better one, unless you have a
> reason for changing it other than coll/ml. If so, we'd be happy to revisit
> the proposal.
>
> Make sense?
> Ralph
>
>
>
> On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
>
>
>  WHAT: semantic change of opal_hwloc_base_get_relative_locality
>
> WHY:  make it closer to what coll/ml expects.
>
>       Currently, opal_hwloc_base_get_relative_locality means "at what
> level do these procs share cpus"; however, coll/ml uses it as "at what
> level are these procs commonly bound".
>
>       It is important to note that if a task is bound to all the
> available cpus, its locality should be set to OPAL_PROC_ON_NODE only.
>       /* e.g. on a single-socket Sandy Bridge system, use
> OPAL_PROC_ON_NODE instead of OPAL_PROC_ON_L3CACHE */
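>
>       To make the difference concrete, here is a sketch of what a caller
>       would see under the two interpretations on that single-socket Sandy
>       Bridge example. The signature below is how I read the current trunk;
>       the variable names and comments are illustrative, not real code.
>
>           /* both procs are bound to *all* cpus of a single-socket node */
>           opal_hwloc_locality_t loc =
>               opal_hwloc_base_get_relative_locality(topo, cpuset_a, cpuset_b);
>
>           /* current semantics ("at what level do these procs share cpus"):
>            *   loc reports sharing down to OPAL_PROC_ON_L3CACHE, since the
>            *   whole socket sits under a single L3
>            * proposed semantics ("at what level are these procs commonly
>            * bound"):
>            *   loc would be OPAL_PROC_ON_NODE only, because a binding that
>            *   covers every available cpu carries no binding information */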
>
>       This was initially discussed on the devel mailing list:
>       http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>
>       As advised by Ralph, I browsed the source code looking for how
> (ompi_proc_t *)->proc_flags is used.
>       So far, it is mainly used to figure out whether the proc is on the
> same node or not.
>
>       Notable exceptions are:
>        a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c :
> OPAL_PROC_ON_LOCAL_SOCKET
>        b) ompi/mca/coll/fca/coll_fca_module.c and
> oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
>
>       About a): the new definition fixes a hang in coll/ml.
>       About b): FCA_IS_LOCAL_PROCESS looks like legacy code /* I could
> only find OMPI_PROC_FLAG_LOCAL in v1.3 */,
>       so this macro can simply be removed and replaced with
> OPAL_PROC_ON_LOCAL_NODE.
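>
>       In practice, the change for b) boils down to something like the
>       following (a sketch only; the real call sites are in the attached
>       patches, and the variable name here is made up):
>
>           /* before: module-private macro left over from the v1.3 days */
>           if (FCA_IS_LOCAL_PROCESS(proc->proc_flags)) {
>               /* treat this proc as node-local */
>           }
>
>           /* after: use the generic locality flag directly */
>           if (OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
>               /* treat this proc as node-local */
>           }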
>
>       At this stage, I cannot find any objection to making the described
> change.
>       Please report any objection and/or feel free to comment.
>
> WHERE: see the two attached patches
>
> TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago,
> June 24-26.
>          The RFC will become final only after the meeting.
>          /* Ralph already added this topic to the agenda */
>
> Thanks
>
> Gilles
>
>