Thanks, George! So the function you mentioned is the one used when I turn off HCOLL and use OpenMPI's tuned coll component instead. That helps a lot. Another thing on my mind is that in my case the data is sent to the targets asynchronously; it is a 'put' operation in nature, and the targets don't know when the data is ready. I guess the tree algorithms you mentioned require active participation of all ranks, otherwise the algorithm will not progress? Is it enough to call any MPI routine to ensure progress, or do I have to call the matching Bcast?
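
To make the progress question concrete, here is roughly the pattern I have in mind (a sketch only; MPI_Ibcast and the helper names are my own illustration, nothing HCOLL-specific):

#include <mpi.h>

/* Target-side sketch: post the matching nonblocking broadcast and
 * poll it, so a tree algorithm can progress through this rank while
 * it does unrelated work. */
void target_side(void *buf, int count, int root, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    /* Every rank posts the matching collective. */
    MPI_Ibcast(buf, count, MPI_BYTE, root, comm, &req);
    while (!done) {
        /* ... unrelated local work ... */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* drives progress */
    }
}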

Is anyone from Mellanox here who knows how HCOLL does this internally, especially on the EDR architecture? Is there any hardware assist?

Thanks!

Marcin


On 3/20/19 5:10 PM, George Bosilca wrote:
If you have support for FCA, the collective might use the hardware support. In any case, most of the bcast algorithms have logarithmic behavior, so there will be at most O(log(P)) memory accesses on the root.
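
For illustration, a textbook binomial-tree bcast looks roughly like this (a sketch, not the actual OMPI code); note the root sends only ceil(log2(P)) times:

#include <mpi.h>

/* Binomial-tree broadcast sketch: every rank receives the buffer at
 * most once and forwards it on the root's behalf, so the root only
 * touches its buffer ceil(log2(P)) times. */
void binomial_bcast(void *buf, int count, MPI_Datatype dt,
                    int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;  /* rank relative to root */
    int mask;

    /* Receive once from the parent (lowest set bit of vrank). */
    for (mask = 1; mask < size; mask <<= 1) {
        if (vrank & mask) {
            int parent = (vrank - mask + root) % size;
            MPI_Recv(buf, count, dt, parent, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
    }

    /* Forward to children at decreasing distances. */
    for (mask >>= 1; mask > 0; mask >>= 1) {
        if (vrank + mask < size) {
            int child = (vrank + mask + root) % size;
            MPI_Send(buf, count, dt, child, 0, comm);
        }
    }
}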

If you want to take a look at the code in OMPI to understand what function is called in your specific case, head to ompi/mca/coll/tuned/ and search for the ompi_coll_tuned_bcast_intra_dec_fixed function in coll_tuned_decision_fixed.c.
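
(To experiment, you can also force a specific algorithm at runtime via the coll_tuned_use_dynamic_rules and coll_tuned_bcast_algorithm MCA parameters, e.g. "mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 6 ..."; if I recall the enumeration correctly, 6 is the binomial tree, and "ompi_info --all" lists the valid values.)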

  George.


On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

    Hi!

    I'm wondering about the details of the Bcast implementation in
    OpenMPI. I'm specifically interested in IB interconnects, but
    information about other architectures (and OpenMPI in general)
    would also be very useful.

    I am working with a code that sends the same (large) message to a
    bunch of 'neighboring' processes - somewhat like a ghost-zone
    exchange, but with the message identical for all neighbors. Since
    memory bandwidth is a scarce resource, I'd like to make sure we
    send the message with the fewest possible memory accesses.

    Hence the question: what does OpenMPI (and specifically, for the
    IB case, HPCX) do in such a case? Does it read the buffer from
    memory O(1) times to send it to n peers, with the broadcast
    orchestrated by the hardware, or does it have to read the memory
    O(n) times? Is it more efficient to use Bcast, or is that the
    same as implementing the operation with n distinct send / put
    operations? Finally, is there any way to issue an RMA put to
    multiple targets, so that I only have to read the host memory
    once and have the switches / HCA take care of the rest? (The two
    variants I mean are sketched below.)
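
    For concreteness, the two variants I am comparing (just a sketch;
    'neighbors' and the communicator are hypothetical names from my
    code, not an existing API):

    #include <mpi.h>

    #define MAX_NEIGHBORS 64  /* assume few neighbors for the sketch */

    /* Variant A: n point-to-point sends of the same buffer. Naively,
     * the root reads the buffer from host memory up to n times. */
    void send_to_neighbors(const void *buf, int count, MPI_Datatype dt,
                           const int *neighbors, int n, MPI_Comm comm)
    {
        MPI_Request reqs[MAX_NEIGHBORS];
        for (int i = 0; i < n; i++)
            MPI_Isend(buf, count, dt, neighbors[i], 0, comm, &reqs[i]);
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
    }

    /* Variant B: a broadcast over a communicator holding the root
     * and its neighbors. A tree algorithm reads the root buffer
     * O(log n) times and lets intermediate ranks forward the data. */
    void bcast_to_neighbors(void *buf, int count, MPI_Datatype dt,
                            MPI_Comm neighbor_comm)
    {
        MPI_Bcast(buf, count, dt, 0 /* root */, neighbor_comm);
    }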

    Thanks a lot for any insights!

    Marcin


    _______________________________________________
    devel mailing list
    devel@lists.open-mpi.org
    https://lists.open-mpi.org/mailman/listinfo/devel

