Thanks, George! So the function you mentioned is the one used when I turn
off HCOLL and use Open MPI's tuned coll component instead. That helps a
lot. Another thing that puzzles me is that in my case the data is sent to
the targets asynchronously - it is a 'put' operation in nature, and the
targets don't know when the data is ready. I guess the tree algorithms
you mentioned require active participation of all ranks, otherwise the
algorithm will not progress? Is it enough to call any MPI routine to
ensure progress, or do I have to call the matching Bcast?
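To make the question concrete: as far as I understand, MPI_Bcast is
collective, so every rank of the communicator has to make the matching
call - roughly like this (a minimal sketch of plain MPI semantics, not
anything HCOLL-specific):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double buf[1024];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)                      /* root fills the buffer */
            for (int i = 0; i < 1024; i++)
                buf[i] = (double)i;

        /* Every rank must reach this matching call; a rank that is
         * busy in user code and never enters MPI cannot receive or
         * forward the data. */
        MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }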
Is there anyone from Mellanox here who knows how HCOLL does this
internally, especially on the EDR architecture? Is there any hardware
assist?
Thanks!
Marcin
On 3/20/19 5:10 PM, George Bosilca wrote:
If you have support for FCA then the collective might use the hardware
support. In any case, most of the bcast algorithms have logarithmic
behavior, so there will be at most O(log(P)) memory accesses on the
root.
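For illustration, the classic binomial-tree pattern looks roughly like
this (a simplified sketch with the root fixed at rank 0, not the actual
OMPI code) - the root sends once per round, so at most ceil(log2(P))
sends and buffer reads in total:

    #include <mpi.h>

    /* Binomial-tree broadcast from rank 0: each round doubles the
     * number of ranks holding the data. */
    static void bcast_binomial(void *buf, int count, MPI_Datatype dtype,
                               MPI_Comm comm)
    {
        int rank, size, mask = 1;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Non-root ranks receive once, from their parent. */
        while (mask < size) {
            if (rank & mask) {
                MPI_Recv(buf, count, dtype, rank - mask, 0, comm,
                         MPI_STATUS_IGNORE);
                break;
            }
            mask <<= 1;
        }
        /* Then forward to the children. */
        mask >>= 1;
        while (mask > 0) {
            if (rank + mask < size)
                MPI_Send(buf, count, dtype, rank + mask, 0, comm);
            mask >>= 1;
        }
    }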
If you want to take a look at the code in OMPI to understand what
function is called in your specific case, head to ompi/mca/coll/tuned/
and search for the ompi_coll_tuned_bcast_intra_dec_fixed function in
coll_tuned_decision_fixed.c.
George.
On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
Hi!
I'm wondering about the details of the Bcast implementation in Open MPI.
I'm specifically interested in IB interconnects, but information about
other architectures (and Open MPI in general) would also be very useful.
I am working with a code that sends the same (large) message to a bunch
of 'neighboring' processes - somewhat like a ghost-zone exchange, but
with the same message for all neighbors. Since memory bandwidth is a
scarce resource, I'd like to make sure we send the message with the
fewest possible memory accesses.
Hence the question: what does Open MPI (and specifically, in the IB case,
HPC-X) do here? Does it read the buffer from memory O(1) times to send it
to n peers, with the broadcast orchestrated by the hardware, or does it
have to read the memory O(n) times? Is it more efficient to use Bcast, or
is it the same as implementing the operation with n distinct send / put
operations? Finally, is there any way to use an RMA put with multiple
targets, so that I only have to read the host memory once and the
switches / HCA take care of the rest?
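For concreteness, the n-puts variant I have in mind looks roughly like
this (a sketch; 'neighbors' and 'nneigh' stand in for my actual neighbor
list):

    #include <mpi.h>

    /* One put per target, all reading the same local buffer; whether
     * the origin memory is actually read once or n times is exactly
     * my question. */
    void put_to_neighbors(const double *buf, int count, MPI_Win win,
                          const int *neighbors, int nneigh)
    {
        MPI_Win_lock_all(0, win);
        for (int i = 0; i < nneigh; i++)
            MPI_Put(buf, count, MPI_DOUBLE, neighbors[i],
                    0 /* target displacement */, count, MPI_DOUBLE, win);
        MPI_Win_flush_all(win);    /* complete all puts */
        MPI_Win_unlock_all(win);
    }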
Thanks a lot for any insights!
Marcin
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel