Hi All,
I am trying to learn about OpenMPI’s tuning options for MPI_Alltoall.
ompi_info shows that there are a number of different coll components
(see below). I need more information about the intended purpose of each and
what knobs each provides. I’d like to make an informed choice about which is
worth using in specific situations, but I feel like I’m missing the information
needed to make use of them.

Driving my curiosity is an observed change in alltoall throughput at 64 ranks
across a wide range of payload sizes on a specific system. In runs of 64 ranks
or fewer I observe higher throughput; with a 1 MB payload it is 2x higher than
with 72 ranks. I would like to see whether I can make OpenMPI keep doing what
it does for runs with <= 64 ranks in larger runs with > 64 ranks. Is there a
knob that sets 64 ranks as the point at which it switches between two internal
implementations? Or is there something else going on?

Of the coll components listed below, I’m most interested in learning more about:
tuned, adapt, basic, han, inter, libnbc
That said, it would be very useful to know more about all of them.

Thanks,
Burlen

$ ompi_info | grep " coll"
MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: hcoll (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.7)
MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.7)