Hi George,

Maybe I did not explain properly the point I wanted to have clarified.

What I was trying to say is that I had the impression that MPI_T has been 
developed for tuning internal MPI parameters for a given underlying network fabric.

Furthermore, it is mainly used by MPI developers for debugging / improvement, 
right?

Or is it meant for users too?

If that is the case, how can I, as a user, make use of the MPI_T interface to find 
out whether the fabric itself (the system) has a problem (and not the MPI 
implementation on top of it)?

Cheers,

Denis



________________________________
From: George Bosilca <bosi...@icl.utk.edu>
Sent: Saturday, February 12, 2022 7:38:02 AM
To: Bertini, Denis Dr.
Cc: Open MPI Users; Joseph Schuchart
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

I am not sure I understand the comment about MPI_T.

Each network card has internal counters that can be gathered by any process on 
the node. Similarly, some information is available from the switches, but I 
always assumed that information is aggregated across all ongoing jobs. But by 
merging the switch-level information with the MPI-level data, the necessary trend 
can be highlighted.
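
As an illustration (a sketch only, not an authoritative recipe): on Linux those 
per-port HCA counters are typically exposed under sysfs, so something like the code 
below, run on each node before and after a job, can sample them. The device name 
(mlx5_0), the port number (1) and the counter names are assumptions that depend on 
the installation; as far as I know, port_xmit_data and port_rcv_data are reported 
in units of 4 bytes.

  /* Sketch: sample a few per-port InfiniBand counters from sysfs.
   * The device name (mlx5_0) and port number (1) are placeholders. */
  #include <stdio.h>

  static long long read_counter(const char *path)
  {
      long long v = -1;
      FILE *f = fopen(path, "r");
      if (f != NULL) {
          if (fscanf(f, "%lld", &v) != 1)
              v = -1;
          fclose(f);
      }
      return v;
  }

  int main(void)
  {
      /* Adjust for the HCA/port actually used on the node. */
      const char *base = "/sys/class/infiniband/mlx5_0/ports/1/counters";
      const char *names[] = { "port_xmit_data", "port_rcv_data",
                              "port_rcv_errors", "link_downed" };
      char path[512];
      unsigned i;

      for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
          snprintf(path, sizeof(path), "%s/%s", base, names[i]);
          printf("%-16s %lld\n", names[i], read_counter(path));
      }
      return 0;
  }

Comparing the deltas across a run, in particular of the error counters, is one way 
to look for fabric-level problems independently of the MPI layer.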

  George.


On Fri, Feb 11, 2022 at 12:43 PM Bertini, Denis Dr. 
<d.bert...@gsi.de> wrote:

Maybe I am wrong, but MPI_T seems to be aimed at internal Open MPI parameters, 
right?


So by what kind of magic can a tool like OSU INAM get info from the network 
fabric and even switches related to a particular MPI job ...


There should be more info gathered in the background ....


________________________________
From: George Bosilca <bosi...@icl.utk.edu>
Sent: Friday, February 11, 2022 4:25:42 PM
To: Open MPI Users
Cc: Joseph Schuchart; Bertini, Denis Dr.
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Collecting data during execution is possible in OMPI either with an external 
tool, such as mpiP, or with the internal SPC infrastructure. Take a look at 
./examples/spc_example.c or ./test/spc/spc_test.c to see how to use this.
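
A rough sketch of what that usage looks like (the SPC counters are exposed through 
the MPI_T performance-variable interface; the counter name below is only a guess, 
and the counters may have to be enabled with an MCA parameter at launch time -- 
spc_example.c has the authoritative details):

  /* Sketch: read one SPC counter through an MPI_T pvar session.
   * The counter name is a guess; enumerate the pvars for the real names. */
  #include <mpi.h>
  #include <stdio.h>
  #include <string.h>

  int main(int argc, char *argv[])
  {
      const char *target = "runtime_spc_OMPI_SPC_BYTES_RECEIVED_USER"; /* assumed */
      int provided, num, i, idx = -1;

      MPI_Init(&argc, &argv);
      MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

      /* Find the variable by name. */
      MPI_T_pvar_get_num(&num);
      for (i = 0; i < num && idx < 0; i++) {
          char name[256], desc[256];
          int nlen = sizeof(name), dlen = sizeof(desc);
          int verb, cls, bind, ro, cont, atomic;
          MPI_Datatype dt;
          MPI_T_enum et;

          MPI_T_pvar_get_info(i, name, &nlen, &verb, &cls, &dt, &et,
                              desc, &dlen, &bind, &ro, &cont, &atomic);
          if (strcmp(name, target) == 0)
              idx = i;
      }

      if (idx >= 0) {
          MPI_T_pvar_session session;
          MPI_T_pvar_handle handle;
          int count;
          unsigned long long value = 0;   /* assumes a 64-bit counter */

          MPI_T_pvar_session_create(&session);
          /* NULL assumes the variable is not bound to an MPI object. */
          MPI_T_pvar_handle_alloc(session, idx, NULL, &handle, &count);
          MPI_T_pvar_start(session, handle);

          /* ... the communication to be observed would go here ... */

          MPI_T_pvar_read(session, handle, &value);
          printf("%s = %llu\n", target, value);

          MPI_T_pvar_handle_free(session, &handle);
          MPI_T_pvar_session_free(&session);
      }

      MPI_T_finalize();
      MPI_Finalize();
      return 0;
  }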

  George.


On Fri, Feb 11, 2022 at 9:43 AM Bertini, Denis Dr. via users 
<users@lists.open-mpi.org> wrote:

I have seen in the OSU INAM paper:

"
While we chose MVAPICH2 for implementing our designs, any MPI
runtime (e.g.: OpenMPI [12]) can be modified to perform similar data collection 
and
transmission.
"

But I do not know what is meant by a "modified" Open MPI?


Cheers,

Denis


________________________________
From: Joseph Schuchart <schuch...@icl.utk.edu>
Sent: Friday, February 11, 2022 3:02:36 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work
with other MPI implementations? Would be worth investigating...

Joseph

On 2/11/22 06:54, Bertini, Denis Dr. wrote:
>
> Hi Joseph
>
> Looking at MVAPICH I noticed that, in this MPI implementation,
>
> an InfiniBand Network Analysis and Profiling Tool is provided:
>
>
> OSU-INAM
>
>
> Is there something equivalent using Open MPI?
>
> Best
>
> Denis
>
>
> ------------------------------------------------------------------------
> *From:* users 
> <users-boun...@lists.open-mpi.org> 
> on behalf of Joseph
> Schuchart via users 
> <users@lists.open-mpi.org>
> *Sent:* Tuesday, February 8, 2022 4:02:53 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Joseph Schuchart
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
> Hi Denis,
>
> Sorry if I missed it in your previous messages but could you also try
> running a different MPI implementation (MVAPICH) to see whether Open MPI
> is at fault or the system is somehow to blame for it?
>
> Thanks
> Joseph
>
> On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
> >
> > Hi
> >
> > Thanks for all this information!
> >
> >
> > But I have to confess that, in this multi-parameter tuning space,
> >
> > I got somewhat lost.
> >
> > Furthermore, it sometimes mixes user space and kernel space.
> >
> > I only have the possibility to act on user space.
> >
> >
> > 1) So on the system the max locked memory is:
> >
> >                         - ulimit -l unlimited (default )
> >
> >   and I do not see any warnings/errors related to that when
> > launching MPI.
> >
> >
> > 2) I tried different algorithms for the MPI_Allreduce operation, all showing a
> > drop in
> >
> > BW for size=16384.
> >
> >
> > 3) I disabled openib (no RDMA) and used only TCP, and I noticed
> >
> > the same behaviour.
> >
> >
> > 4) I realized that increasing the so-called warm-up parameter in the
> >
> > OSU benchmark (argument -x, 200 by default) reduces the discrepancy.
> >
> > On the contrary, setting a lower value (-x 10) can increase this BW
> >
> > discrepancy up to a factor of 300 at message size 16384 compared to
> >
> > message size 8192, for example.
> >
> > So does it mean that there are some caching effects
> >
> > in the internode communication?
> >
> >
> > From my experience, tuning parameters is a time-consuming and
> > cumbersome
> >
> > task.
> >
> >
> > Could it also be that the problem is not really in the Open MPI
> > implementation but in the
> >
> > system?
> >
> >
> > Best
> >
> > Denis
> >
> > ------------------------------------------------------------------------
> > *From:* users 
> > <users-boun...@lists.open-mpi.org> 
> > on behalf of Gus
> > Correa via users <users@lists.open-mpi.org>
> > *Sent:* Monday, February 7, 2022 9:14:19 PM
> > *To:* Open MPI Users
> > *Cc:* Gus Correa
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> > Infiniband network
> > This may have changed since, but these used to be relevant points.
> > Overall, the Open MPI FAQ has lots of good suggestions:
> > https://www.open-mpi.org/faq/
> > some specific for performance tuning:
> > https://www.open-mpi.org/faq/?category=tuning
> > https://www.open-mpi.org/faq/?category=openfabrics
> >
> > 1) Make sure you are not using the Ethernet TCP/IP, which is widely
> > available in compute nodes:
> > mpirun  --mca btl self,sm,openib  ...
> >
> > https://www.open-mpi.org/faq/?category=tuning#selecting-components
> >
> > However, this may have changed lately:
> > https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> > 2) Maximum locked memory used by IB and its system limit. Start
> > here:
> >
> https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
> > 3) The eager vs. rendezvous message size threshold. I wonder if it may
> > sit right where you see the latency spike.
> > https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
> > 4) Processor and memory locality/affinity and binding (please check
> > the current options and syntax)
> > https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
> >
> > On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users
> > <users@lists.open-mpi.org> wrote:
> >
> >     Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
> >
> >     mpirun --verbose --display-map
> >
> >     Have you tried newer OpenMPI versions?
> >
> >     Do you get similar behavior for the osu_reduce and osu_gather
> >     benchmarks?
> >
> >     Typically internal buffer sizes as well as your hardware will affect
> >     performance. Can you give specifications similar to what is
> >     available at:
> > http://mvapich.cse.ohio-state.edu/performance/collectives/
> >     where the operating system, switch, node type and memory are
> >     indicated.
> >
> >     If you need good performance, you may want to also specify the algorithm
> >     used. You can find some of the parameters you can tune using:
> >
> >     ompi_info --all
> >
> >     A particularly helpful parameter is:
> >
> >     MCA coll tuned: parameter "coll_tuned_allreduce_algorithm"
> >       (current value: "ignore", data source: default, level: 5 tuner/detail,
> >       type: int)
> >       Which allreduce algorithm is used. Can be locked down to any of:
> >       0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast),
> >       3 recursive doubling, 4 ring, 5 segmented ring
> >       Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping",
> >       3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"
> >     MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize"
> >       (current value: "0", data source: default, level: 5 tuner/detail,
> >       type: int)
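> >
> >     For example, to pin allreduce to the ring algorithm one would use
> >     something like the line below (as far as I know the dynamic-rules
> >     switch has to be enabled for the forced algorithm to take effect):
> >
> >     mpirun --mca coll_tuned_use_dynamic_rules 1 \
> >            --mca coll_tuned_allreduce_algorithm 4 ./osu_allreduce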
> >
> >     For OpenMPI 4.0, there is a tuning program [2] that might also be
> >     helpful.
> >
> >     [1]
> >
> https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
> >     [2] https://github.com/open-mpi/ompi-collectives-tuning
> >
> >     On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> >     > Hi
> >     >
> >     > When I repeat the run I always get the huge discrepancy at the
> >     >
> >     > message size of 16384.
> >     >
> >     > Maybe there is a way to run MPI in verbose mode in order
> >     >
> >     > to further investigate this behaviour?
> >     >
> >     > Best
> >     >
> >     > Denis
> >     >
> >     >
> > ------------------------------------------------------------------------
> >     > *From:* users 
> > <users-boun...@lists.open-mpi.org> 
> > on behalf of
> >     Benson
> >     > Muite via users 
> > <users@lists.open-mpi.org>
> >     > *Sent:* Monday, February 7, 2022 2:27:34 PM
> >     > *To:* users@lists.open-mpi.org
> >     > *Cc:* Benson Muite
> >     > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> >     Infiniband
> >     > network
> >     > Hi,
> >     > Do you get similar results when you repeat the test? Another job
> >     could
> >     > have interfered with your run.
> >     > Benson
> >     > On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
> >     >> Hi
> >     >>
> >     >> I am using the OSU microbenchmarks compiled with Open MPI 3.1.6 in
> >     >> order to check/benchmark
> >     >>
> >     >> the InfiniBand network of our cluster.
> >     >>
> >     >> For that I use the collective allreduce benchmark and run over 200
> >     >> nodes, using 1 process per node.
> >     >>
> >     >> And these are the results I obtained 😎
> >     >>
> >     >>
> >     >>
> >     >> ################################################################
> >     >>
> >     >> # OSU MPI Allreduce Latency Test v5.7.1
> >     >> # Size     Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
> >     >> 4                   114.65             83.22            147.98         1000
> >     >> 8                   133.85            106.47            164.93         1000
> >     >> 16                  116.41             87.57            150.58         1000
> >     >> 32                  112.17             93.25            130.23         1000
> >     >> 64                  106.85             81.93            134.74         1000
> >     >> 128                 117.53             87.50            152.27         1000
> >     >> 256                 143.08            115.63            173.97         1000
> >     >> 512                 130.34            100.20            167.56         1000
> >     >> 1024                155.67            111.29            188.20         1000
> >     >> 2048                151.82            116.03            198.19         1000
> >     >> 4096                159.11            122.09            199.24         1000
> >     >> 8192                176.74            143.54            221.98         1000
> >     >> 16384             48862.85          39270.21          54970.96         1000
> >     >> 32768              2737.37           2614.60           2802.68         1000
> >     >> 65536              2723.15           2585.62           2813.65         1000
> >     >>
> >     >>
> > ####################################################################
> >     >>
> >     >> Could someone explain to me what is happening for message size = 16384?
> >     >> One can notice a huge latency (~300 times larger) compared to
> >     >> message size = 8192.
> >     >> I do not really understand what could create such an increase
> >     >> in the latency.
> >     >> The reason I use the OSU microbenchmarks is that we
> >     >> sporadically experience a drop
> >     >> in the bandwidth for typical collective operations such as
> >     >> MPI_Reduce in our cluster,
> >     >> which is difficult to understand.
> >     >> I would be grateful if somebody could share their expertise on such
> >     >> a problem with me.
> >     >>
> >     >> Best,
> >     >> Denis
> >     >>
> >     >>
> >     >>
> >     >> ---------
> >     >> Denis Bertini
> >     >> Abteilung: CIT
> >     >> Ort: SB3 2.265a
> >     >>
> >     >> Tel: +49 6159 71 2240
> >     >> Fax: +49 6159 71 2986
> >     >> E-Mail: d.bert...@gsi.de
> >     >>
> >     >> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> >     >> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
> >     >>
> >     >> Commercial Register / Handelsregister: Amtsgericht Darmstadt,
> >     HRB 1528
> >     >> Managing Directors / Geschäftsführung:
> >     >> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
> >     >> Chairman of the GSI Supervisory Board / Vorsitzender des
> >     GSI-Aufsichtsrats:
> >     >> Ministerialdirigent Dr. Volkmar Dietz
> >     >>
> >     >
> >
>
