Folks,

I was reviewing the sources of the coll/sync module, and two questions came up:


1) i noticed that the same pattern is used in *every* source file:

    if (s->in_operation) {
        return s->c_coll.coll_xxx(...);
    } else {
        COLL_SYNC(s, s->c_coll.coll_xxx(...));
    }


Is there any rationale for not moving the if (s->in_operation) test into the COLL_SYNC macro itself?
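
For illustration, here is roughly what I have in mind. This is only a sketch, not a patch: COLL_SYNC_GUARDED is a name I made up, and the comment about COLL_SYNC's body is just my reading of this thread (a barrier injected every N collectives around the real call).

    /* hypothetical variant: absorb the in_operation test so that every
     * caller collapses to a single, unconditional macro invocation */
    #define COLL_SYNC_GUARDED(s, op)                              \
    do {                                                          \
        if ((s)->in_operation) {                                  \
            /* already inside a sync'ed collective: forward */    \
            /* directly to the underlying module             */   \
            return (op);                                          \
        }                                                         \
        /* otherwise defer to the existing COLL_SYNC, which  */   \
        /* injects a barrier every N collectives around (op) */   \
        COLL_SYNC((s), (op));                                     \
    } while (0)

Every backend function would then boil down to a single COLL_SYNC_GUARDED(s, s->c_coll.coll_xxx(...)); line.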


2) i could not find a rationale for using s->in_operation at all:
- if a barrier must be performed, the barrier of the underlying module (e.g. coll/tuned) is invoked directly, so coll/sync is not somehow re-entered;
- with MPI_THREAD_MULTIPLE, it is the end user's responsibility to ensure that two threads never simultaneously invoke a collective operation on the *same* communicator (and s->in_operation is a per-communicator boolean).

So I do not see how s->in_operation could ever be observed as true in a valid MPI program (see the sketch below).
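
The only kind of program I can construct in which the flag could conceivably be observed as true on entry is something like this, and it is already erroneous under the MPI standard (the code is mine, purely for illustration):

    #include <mpi.h>
    #include <pthread.h>
    #include <stddef.h>

    static MPI_Comm comm;   /* the *same* communicator, shared by both threads */

    static void *worker(void *arg)
    {
        int x = 0;
        /* two threads issue a collective on the same communicator with
         * no ordering between them: even with MPI_THREAD_MULTIPLE this
         * is erroneous, since the standard makes per-communicator
         * ordering of collectives the user's responsibility */
        MPI_Bcast(&x, 1, MPI_INT, 0, comm);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        pthread_t t[2];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_dup(MPI_COMM_WORLD, &comm);

        pthread_create(&t[0], NULL, worker, NULL);
        pthread_create(&t[1], NULL, worker, NULL);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);

        MPI_Comm_free(&comm);
        MPI_Finalize();
        return 0;
    }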


Though the first point can be seen as a "matter of style", I am quite curious about the second one.


Cheers,


Gilles

On 8/21/2016 3:44 AM, George Bosilca wrote:
Ralph,

Bringing back coll/sync is a cheap shot at hiding a real issue behind a smoke screen. As Nathan described in his email, Open MPI's lack of flow control on eager messages is the real culprit here, and a loop around any one-to-many collective (bcast and scatter*) only helps to exacerbate the issue. However, a loop around a small MPI_Send will also end in memory exhaustion, an issue that would not be easily circumvented by adding synchronizations deep inside the library.

  George.


On Sat, Aug 20, 2016 at 12:30 AM, r...@open-mpi.org wrote:

    I cannot provide the user report as it is proprietary. However, it
    consists of a large loop of calls to MPI_Bcast that crashes due to
    unexpected messages. We have been looking at instituting flow
    control, but that would have far too widespread an impact. The
    coll/sync component would be a simple solution.

    I honestly don’t believe the issue I was resolving was due to a
    bug - it was simply a problem of one proc running slow and
    creating an overload of unexpected messages that eventually
    consumed too much memory. Rather, I think you solved a different
    problem - by the time you arrived at LANL, the app I was working
    with had already been modified so that it no longer created the
    problem (essentially refactoring the algorithm to avoid the
    massive loop over allreduce).

    I have no issue supporting it as it takes near-zero effort to
    maintain, and this is a fairly common problem with legacy codes
    that don’t want to refactor their algorithms.


    > On Aug 19, 2016, at 8:48 PM, Nathan Hjelm <hje...@me.com> wrote:
    >
    >> On Aug 19, 2016, at 4:24 PM, r...@open-mpi.org wrote:
    >>
    >> Hi folks
    >>
    >> I had a question arise regarding a problem being seen by an
    OMPI user - it has to do with the old bugaboo I originally dealt
    with back in my LANL days. The problem is with an app that
    repeatedly hammers on a collective, and gets overwhelmed by
    unexpected messages when one of the procs falls behind.
    >
    > I did some investigation on Roadrunner several years ago and
    determined that the user-code issue coll/sync was attempting to
    fix was really due to a bug in ob1/cksum (I really can’t remember
    which). coll/sync was simply masking a live-lock problem. I
    committed a workaround for the bug in r26575
    (https://github.com/open-mpi/ompi/commit/59e529cf1dfe986e40d14ec4d2a2e5ef0cea5e35)
    and tested it with the user code. After this change the user code
    ran fine without coll/sync. Since LANL no longer had any users of
    coll/sync, we stopped supporting it.
    >
    >> I solved this back then by introducing the “sync” component in
    ompi/mca/coll, which injected a barrier operation every N
    collectives. You could even “tune” it by injecting only for
    specific collectives.
    >>
    >> However, I can no longer find that component in the code base -
    I find it in the 1.6 series, but someone removed it during the 1.7
    series.
    >>
    >> Can someone tell me why this was done??? Is there any reason
    not to bring it back? It solves a very real, not uncommon, problem.
    >> Ralph
    >
    > This was discussed during one (or several) telecons years ago.
    We agreed to kill it and bring it back if there was 1) a use case
    and 2) someone willing to support it. See
    https://github.com/open-mpi/ompi/commit/5451ee46bd6fcdec002b333474dec919475d2d62.
    >
    > Can you link the user email?
    >
    > -Nathan