Patrick,

A few more questions for you:

1. What version of IFS are you running?
2. Are you using CUDA cards by any chance? If so, what version of CUDA?
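
In case it helps, something like the following should show both (just a sketch: the grep pattern is a guess based on the "ifs+deb9" suffix in your dpkg output below, and nvidia-smi only applies if the nodes have NVIDIA GPUs):

    dpkg -l | grep ifs        # IFS-provided packages and their versions
    nvidia-smi | head -n 4    # banner reports the driver and CUDA versions, if present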

-----Original Message-----
From: Heinz, Michael William 
Sent: Wednesday, January 27, 2021 3:45 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: RE: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

Patrick,

Do you have any PSM2_* or HFI_* environment variables defined in your runtime 
environment that could be affecting things?
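
For example, a quick sanity check (just a sketch) would be:

    env | grep -E '^(PSM2_|HFI_)'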


-----Original Message-----
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Heinz, Michael 
William via users
Sent: Wednesday, January 27, 2021 3:37 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Heinz, Michael William <michael.william.he...@cornelisnetworks.com>
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or by 
Cornelis Networks - but I should point out that you can download the latest official 
source for PSM2 and the drivers from GitHub.
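
For example, the PSM2 library source can be fetched with something like this (repository path from memory, please double-check):

    git clone https://github.com/cornelisnetworks/opa-psm2.git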

-----Original Message-----
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Michael Di Domenico 
via users
Sent: Wednesday, January 27, 2021 3:32 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Michael Di Domenico <mdidomeni...@gmail.com>
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

If you have OPA cards, for Open MPI you only need --with-ofi; you don't need 
psm/psm2/verbs/ucx. But this assumes you're running a RHEL-based distro and 
have installed the OPA fabric suite of software from Intel/Cornelis Networks, 
which is what I have. Perhaps there's something really odd in Debian, or 
there's an incompatibility with the older OFED drivers that may be included with 
Debian. Unfortunately I don't have access to a Debian system, so I can't be much more 
help.
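
For example, a configure line along those lines would be roughly (just a sketch; adjust the install prefix and paths for your site):

    ./configure --prefix=$HOME/openmpi-4.0.5 \
        --with-ofi --without-psm --without-psm2 --without-ucx --without-verbs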

If I had to guess, totally pulling junk from the air, there's probably something 
incompatible with PSM and OPA when running specifically on Debian (likely due 
to library versioning). I don't know how common that setup is, so it's not clear how 
fleshed out and tested it is.




On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users 
<users@lists.open-mpi.org> wrote:
>
> Hi Howard and Michael
>
> first, many thanks for testing with my short application. Yes, when the 
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it goes wrong it prints a message about each invalid value found.
>
> As I said, I also deployed OpenMPI on various clusters (in the DELL 
> data center at Austin) when I was testing some architectures some 
> months ago, and I had no problem on either AMD/Mellanox IB or Intel/Omni-Path. 
> The goal was to run my tests with the same software stack and to be 
> sure I could deploy my software stack on the selected solution.
> But like your clusters (and my small local clusters), they were all 
> running RedHat (or similar Linux flavors) and a modern GNU compiler (9 or 10).
> The university cluster I have access to is running Debian stretch and 
> provides GCC 6 as the default compiler.
>
> I cannot ask for a different OS, but I can deploy a local GCC 10 and 
> build OpenMPI again. UCX is not available on this cluster; should I 
> deploy a local UCX too?
>
> Libpsm2 seems good:
> dahu103 : dpkg -l | grep psm
> ii  libfabric-psm          1.10.0-2-1ifs+deb9        amd64 Dynamic PSM provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2         1.10.0-2-1ifs+deb9        amd64 Dynamic PSM2 provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1     3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development files for libpsm-infinipath1
> ii  libpsm2-2              11.2.185-1-1ifs+deb9      amd64 Intel PSM2 Libraries
> ii  libpsm2-2-compat       11.2.185-1-1ifs+deb9      amd64 Compat library for Intel PSM2
> ii  libpsm2-dev            11.2.185-1-1ifs+deb9      amd64 Development files for Intel PSM2
> ii  psmisc                 22.21-2.1+b2              amd64 utilities that use the proc file system
>
> This will be my next attempt at installing OpenMPI on this cluster.
>
> Patrick
>
>
> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> > Hi Folks,
> >
> > I'm also having problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > cluster runs RHEL 7.8
> >
> > hca_id:       hfi1_0
> >       transport:                      InfiniBand (0)
> >       fw_ver:                         1.27.0
> >       node_guid:                      0011:7501:0179:e2d7
> >       sys_image_guid:                 0011:7501:0179:e2d7
> >       vendor_id:                      0x1175
> >       vendor_part_id:                 9456
> >       hw_ver:                         0x11
> >       board_id:                       Intel Omni-Path Host Fabric Interface Adapter 100 Series
> >       phys_port_cnt:                  1
> >               port:   1
> >                       state:                  PORT_ACTIVE (4)
> >                       max_mtu:                4096 (5)
> >                       active_mtu:             4096 (5)
> >                       sm_lid:                 1
> >                       port_lid:               99
> >                       port_lmc:               0x00
> >                       link_layer:             InfiniBand
> >
> > using gcc/gfortran 9.3.0
> >
> > Built Open MPI 4.0.5 without any special configure options.
> >
> > Howard
> >
> > On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" 
> > <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> 
> > wrote:
> >
> >     For whatever it's worth, running the test program on my OPA cluster
> >     seems to work. Well, it keeps spitting out [INFO MEMORY] lines; I'm not
> >     sure if it's supposed to stop at some point.
> >
> >     I'm running RHEL 7, gcc 10.1, OpenMPI 4.0.5rc2, --with-ofi, 
> >     --without-{psm,ucx,verbs}.
> >
> >     On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
> >     <users@lists.open-mpi.org> wrote:
> >     >
> >     > Hi Michael
> >     >
> >     > Indeed I'm a little bit lost with all these parameters in OpenMPI, 
> > mainly because for years it has worked just fine out of the box in all my 
> > deployments on various architectures, interconnects and Linux flavors. Some 
> > weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc10, Slurm and UCX on an 
> > AMD Epyc2 cluster with ConnectX-6, and it just worked fine.  It is the first 
> > time I've had such trouble deploying this library.
> >     >
> >     > If you have my mail posted on 25/01/2021 in this discussion at 
> > 18h54 (maybe Paris TZ), there is a small test case attached that shows the 
> > problem. Did you get it, or did the list strip the attachments? I can 
> > provide it again.
> >     >
> >     > Many thanks
> >     >
> >     > Patrick
> >     >
> >     > On 26/01/2021 at 19:25, Heinz, Michael William wrote:
> >     >
> >     > Patrick, how are you using the original PSM if you’re using Omni-Path 
> > hardware? The original PSM was written for QLogic DDR and QDR InfiniBand 
> > adapters.
> >     >
> >     > As far as needing openib - the issue is that the PSM2 MTL doesn’t 
> > support a subset of MPI operations that we previously used the pt2pt BTL 
> > for. For recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
> >     >
> >     > Is there any chance you can give us a sample MPI app that reproduces 
> > the problem? I can’t think of another way I can give you more help without 
> > being able to see what’s going on. It’s always possible there’s a bug in 
> > the PSM2 MTL but it would be surprising at this point.
> >     >
> >     > Sent from my iPad
> >     >
> >     > On Jan 26, 2021, at 1:13 PM, Patrick Begou via users 
> > <users@lists.open-mpi.org> wrote:
> >     >
> >     >
> >     > Hi all,
> >     >
> >     > I ran many tests today. I saw that an older 4.0.2 version of OpenMPI 
> > packaged with Nix was running using openib, so I added the --with-verbs 
> > option to set up this module.
> >     >
> >     > What I can see now is that:
> >     >
> >     > mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca 
> > btl_openib_allow_ib true ....
> >     >
> >     > - the testcase test_layout_array is running without error
> >     >
> >     > - the bandwidth measured with osu_bw is half of what it should be:
> >     >
> >     > # OSU MPI Bandwidth Test v5.7
> >     > # Size      Bandwidth (MB/s)
> >     > 1                       0.54
> >     > 2                       1.13
> >     > 4                       2.26
> >     > 8                       4.51
> >     > 16                      9.06
> >     > 32                     17.93
> >     > 64                     33.87
> >     > 128                    69.29
> >     > 256                   161.24
> >     > 512                   333.82
> >     > 1024                  682.66
> >     > 2048                 1188.63
> >     > 4096                 1760.14
> >     > 8192                 2166.08
> >     > 16384                2036.95
> >     > 32768                3466.63
> >     > 65536                6296.73
> >     > 131072               7509.43
> >     > 262144               9104.78
> >     > 524288               6908.55
> >     > 1048576              5530.37
> >     > 2097152              4489.16
> >     > 4194304              3498.14
> >     >
> >     > mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca 
> > btl_openib_allow_ib true ...
> >     >
> >     > - the testcase test_layout_array is not giving correct results
> >     >
> >     > - the bandwidth measured with osu_bw is the right one:
> >     >
> >     > # OSU MPI Bandwidth Test v5.7
> >     > # Size      Bandwidth (MB/s)
> >     > 1                       3.73
> >     > 2                       7.96
> >     > 4                      15.82
> >     > 8                      31.22
> >     > 16                     51.52
> >     > 32                    107.61
> >     > 64                    196.51
> >     > 128                   438.66
> >     > 256                   817.70
> >     > 512                  1593.90
> >     > 1024                 2786.09
> >     > 2048                 4459.77
> >     > 4096                 6658.70
> >     > 8192                 8092.95
> >     > 16384                8664.43
> >     > 32768                8495.96
> >     > 65536               11458.77
> >     > 131072              12094.64
> >     > 262144              11781.84
> >     > 524288              12297.58
> >     > 1048576             12346.92
> >     > 2097152             12206.53
> >     > 4194304             12167.00
> >     >
> >     > But yes, I know openib is deprecated too in 4.0.5.
> >     >
> >     > Patrick
> >     >
> >     >
> >
> >
>
