This may have changed since, but these used to be the relevant points. Overall, the Open MPI FAQ has lots of good suggestions: https://www.open-mpi.org/faq/
Some are specific to performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics
1) Make sure you are not using Ethernet TCP/IP, which is widely available on compute nodes:

   mpirun --mca btl self,sm,openib ...

https://www.open-mpi.org/faq/?category=tuning#selecting-components
However, this may have changed lately:
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable

2) The maximum locked memory used by IB, and its system limit. Start here:
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

3) The eager vs. rendezvous message-size threshold. I wonder if it may sit right where you see the latency spike.
https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user

4) Processor and memory locality/affinity and binding (please check the current options and syntax):
https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4

Quick command sketches for each of these points, and for the algorithm selection Benson mentions below, follow here.
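For 1), a quick way to see which transports your build actually has, and to pin a run to shared memory plus IB verbs. The component names are an assumption: depending on the release the shared-memory BTL is called sm or vader, and on 4.x builds the openib BTL is deprecated in favor of the ucx PML, so go by what ompi_info reports:

   # list the BTL components compiled into this Open MPI installation
   ompi_info | grep btl

   # exclude TCP by naming the transports explicitly (3.x-era names)
   mpirun --mca btl self,vader,openib -np 200 --map-by node ./osu_allreduce

   # rough equivalent on an Open MPI 4.x build with UCX
   mpirun --mca pml ucx -np 200 --map-by node ./osu_allreduce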
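For 2), the standard checks on every node (assuming the usual Linux limits setup):

   # should print 'unlimited', or at least a very large value
   ulimit -l

   # typically configured in /etc/security/limits.conf on each node:
   * soft memlock unlimited
   * hard memlock unlimited

Note that for jobs launched through a resource manager, the limit inherited by the launching daemon is what counts, not the one in your login shell.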
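For 3), a sketch for inspecting and moving the protocol switch point; the parameter name assumes the openib BTL, so verify it with ompi_info on your build:

   # show the current eager limit and related sizes
   ompi_info --param btl openib --level 9 | grep -i eager

   # experiment: push the eager limit past 16 KB and rerun the benchmark
   mpirun --mca btl_openib_eager_limit 32768 -np 200 --map-by node ./osu_allreduce

If the spike follows the threshold when you move it, you are looking at the eager-to-rendezvous switch rather than the fabric itself.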
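For 4), with one process per node binding matters less, but for fuller nodes something like this (3.x/4.x syntax) binds ranks and prints where each one lands so you can confirm the placement:

   mpirun --map-by node --bind-to core --report-bindings -np 200 ./osu_allreduce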
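And related to Benson's coll_tuned_allreduce_algorithm suggestion below: to force a specific algorithm you also need to enable the dynamic rules, e.g. (a sketch; check ompi_info for the exact names in your version):

   # try the ring algorithm (4) instead of the built-in decision logic
   mpirun --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_allreduce_algorithm 4 \
          -np 200 --map-by node ./osu_allreduce

Sweeping the algorithm number around the 16384-byte size should tell you whether the spike is specific to one algorithm.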
On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users <users@lists.open-mpi.org> wrote:

> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
>    mpirun --verbose --display-map
>
> Have you tried newer Open MPI versions?
>
> Do you get similar behavior for the osu_reduce and osu_gather benchmarks?
>
> Typically internal buffer sizes as well as your hardware will affect
> performance. Can you give specifications similar to what is available at
> http://mvapich.cse.ohio-state.edu/performance/collectives/
> where the operating system, switch, node type and memory are indicated?
>
> If you need good performance, you may also want to specify the algorithm
> used. You can find some of the parameters you can tune [1] using:
>
>    ompi_info --all
>
> A particularly helpful parameter is:
>
>    MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
>    value: "ignore", data source: default, level: 5 tuner/detail, type: int)
>    Which allreduce algorithm is used. Can be locked down to any of:
>    0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast),
>    3 recursive doubling, 4 ring, 5 segmented ring
>    Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping",
>    3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"
>    MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize"
>    (current value: "0", data source: default, level: 5 tuner/detail, type: int)
>
> For Open MPI 4.0, there is a tuning program [2] that might also be helpful.
>
> [1] https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
> [2] https://github.com/open-mpi/ompi-collectives-tuning
>
> On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> > Hi
> >
> > When I repeat, I always get the huge discrepancy at the message size
> > of 16384.
> >
> > Maybe there is a way to run MPI in verbose mode in order to
> > investigate this behaviour further?
> >
> > Best
> > Denis
> >
> > ------------------------------------------------------------------------
> > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Benson
> > Muite via users <users@lists.open-mpi.org>
> > *Sent:* Monday, February 7, 2022 2:27:34 PM
> > *To:* users@lists.open-mpi.org
> > *Cc:* Benson Muite
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> > network
> > Hi,
> > Do you get similar results when you repeat the test? Another job could
> > have interfered with your run.
> > Benson
> > On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
> >> Hi
> >>
> >> I am using the OSU microbenchmarks compiled with Open MPI 3.1.6 in
> >> order to check/benchmark the InfiniBand network for our cluster.
> >>
> >> For that I use the collective all_reduce benchmark and run over 200
> >> nodes, using 1 process per node.
> >>
> >> These are the results I obtained 😎
> >>
> >> ################################################################
> >> # OSU MPI Allreduce Latency Test v5.7.1
> >> # Size    Avg Latency(us)  Min Latency(us)  Max Latency(us)  Iterations
> >> 4                  114.65            83.22           147.98        1000
> >> 8                  133.85           106.47           164.93        1000
> >> 16                 116.41            87.57           150.58        1000
> >> 32                 112.17            93.25           130.23        1000
> >> 64                 106.85            81.93           134.74        1000
> >> 128                117.53            87.50           152.27        1000
> >> 256                143.08           115.63           173.97        1000
> >> 512                130.34           100.20           167.56        1000
> >> 1024               155.67           111.29           188.20        1000
> >> 2048               151.82           116.03           198.19        1000
> >> 4096               159.11           122.09           199.24        1000
> >> 8192               176.74           143.54           221.98        1000
> >> 16384            48862.85         39270.21         54970.96        1000
> >> 32768             2737.37          2614.60          2802.68        1000
> >> 65536             2723.15          2585.62          2813.65        1000
> >> ####################################################################
> >>
> >> Could someone explain to me what is happening at message size 16384?
> >> One can notice a huge latency (~300 times larger) compared to message
> >> size 8192. I do not really understand what could create such an
> >> increase in the latency.
> >> The reason I use the OSU microbenchmarks is that we sporadically
> >> experience a drop in the bandwidth for typical collective operations
> >> such as MPI_Reduce in our cluster, which is difficult to understand.
> >> I would be grateful if somebody could share their expertise on such a
> >> problem with me.
> >>
> >> Best,
> >> Denis
> >>
> >> ---------
> >> Denis Bertini
> >> Abteilung: CIT
> >> Ort: SB3 2.265a
> >>
> >> Tel: +49 6159 71 2240
> >> Fax: +49 6159 71 2986
> >> E-Mail: d.bert...@gsi.de
> >>
> >> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> >> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
> >>
> >> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
> >> Managing Directors / Geschäftsführung:
> >> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
> >> Chairman of the GSI Supervisory Board / Vorsitzender des
> >> GSI-Aufsichtsrats: Ministerialdirigent Dr. Volkmar Dietz