Thanks, George! So the function you mentioned is the one used when I turn off HCOLL and use OpenMPI's tuned coll component instead. That helps a lot. Another thing on my mind is that in my case the data is sent to the targets asynchronously; it is a 'put' operation in nature, and the targets don't know when the data is ready. I guess the tree algorithms you mentioned require active participation of all ranks, otherwise the algorithm will not progress? Is it enough to call any MPI routine to ensure progress, or do I have to call the matching Bcast?
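
To make the progress question concrete, here is roughly the pattern I have in mind (a sketch only; MPI_Ibcast and the helper names are my own illustration, nothing HCOLL-specific):

#include <mpi.h>

/* Target-side sketch: post the matching nonblocking broadcast and
 * poll it, so a tree algorithm can progress through this rank while
 * it does unrelated work. */
void target_side(void *buf, int count, int root, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    /* Every rank posts the matching collective. */
    MPI_Ibcast(buf, count, MPI_BYTE, root, comm, &req);
    while (!done) {
        /* ... unrelated local work ... */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* drives progress */
    }
}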

Is anyone from Mellanox here who knows how HCOLL does this internally, especially on the EDR architecture? Is there any hardware assist?

Thanks!

Marcin


On 3/20/19 5:10 PM, George Bosilca wrote:
If you have support for FCA, the collective might use the hardware support. In any case, most of the bcast algorithms have logarithmic behavior, so there will be at most O(log(P)) memory accesses on the root.
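
For illustration, a textbook binomial-tree bcast looks roughly like this (a sketch, not the actual OMPI code); note the root sends only ceil(log2(P)) times:

#include <mpi.h>

/* Binomial-tree broadcast sketch: every rank receives the buffer at
 * most once and forwards it on the root's behalf, so the root only
 * touches its buffer ceil(log2(P)) times. */
void binomial_bcast(void *buf, int count, MPI_Datatype dt,
                    int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;  /* rank relative to root */
    int mask;

    /* Receive once from the parent (lowest set bit of vrank). */
    for (mask = 1; mask < size; mask <<= 1) {
        if (vrank & mask) {
            int parent = (vrank - mask + root) % size;
            MPI_Recv(buf, count, dt, parent, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
    }

    /* Forward to children at decreasing distances. */
    for (mask >>= 1; mask > 0; mask >>= 1) {
        if (vrank + mask < size) {
            int child = (vrank + mask + root) % size;
            MPI_Send(buf, count, dt, child, 0, comm);
        }
    }
}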

If you want to take a look at the code in OMPI to understand what function is called in your specific case, head to ompi/mca/coll/tuned/ and search for the ompi_coll_tuned_bcast_intra_dec_fixed function in coll_tuned_decision_fixed.c.
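
(To experiment, you can also force a specific algorithm at runtime via the coll_tuned_use_dynamic_rules and coll_tuned_bcast_algorithm MCA parameters, e.g. "mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 6 ..."; if I recall the enumeration correctly, 6 is the binomial tree, and "ompi_info --all" lists the valid values.)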

  George.


On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

    Hi!

    I'm wondering about the details of the Bcast implementation in
    OpenMPI. I'm specifically interested in IB interconnects, but
    information about other architectures (and OpenMPI in general)
    would also be very useful.

    I am working with a code that sends the same (large) message to a
    bunch of 'neighboring' processes - somewhat like a ghost-zone
    exchange, but with the message identical for all neighbors. Since
    memory bandwidth is a scarce resource, I'd like to make sure we
    send the message with the fewest possible memory accesses.

    Hence the question: what does OpenMPI (and specifically, for the
    IB case, HPCX) do in such a case? Does it read the buffer from
    memory O(1) times to send it to n peers, with the broadcast
    orchestrated by the hardware, or does it have to read the memory
    O(n) times? Is it more efficient to use Bcast, or is that the
    same as implementing the operation with n distinct send / put
    operations? Finally, is there any way to issue an RMA put to
    multiple targets, so that I only have to read the host memory
    once and have the switches / HCA take care of the rest? (The two
    variants I mean are sketched below.)
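
    For concreteness, the two variants I am comparing (just a sketch;
    'neighbors' and the communicator are hypothetical names from my
    code, not an existing API):

    #include <mpi.h>

    #define MAX_NEIGHBORS 64  /* assume few neighbors for the sketch */

    /* Variant A: n point-to-point sends of the same buffer. Naively,
     * the root reads the buffer from host memory up to n times. */
    void send_to_neighbors(const void *buf, int count, MPI_Datatype dt,
                           const int *neighbors, int n, MPI_Comm comm)
    {
        MPI_Request reqs[MAX_NEIGHBORS];
        for (int i = 0; i < n; i++)
            MPI_Isend(buf, count, dt, neighbors[i], 0, comm, &reqs[i]);
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
    }

    /* Variant B: a broadcast over a communicator holding the root
     * and its neighbors. A tree algorithm reads the root buffer
     * O(log n) times and lets intermediate ranks forward the data. */
    void bcast_to_neighbors(void *buf, int count, MPI_Datatype dt,
                            MPI_Comm neighbor_comm)
    {
        MPI_Bcast(buf, count, dt, 0 /* root */, neighbor_comm);
    }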

    Thanks a lot for any insights!

    Marcin


    _______________________________________________
    devel mailing list
    devel@lists.open-mpi.org
    https://lists.open-mpi.org/mailman/listinfo/devel

