Hi, just to update this discussion and also answer some questions here:
1. What version of IFS are you running?

ii  ifs-kernel-updates-dev                  1:3.10.0-1062-2123-1ifs+deb9  amd64  Development headers for Intel HFI1 driver interface
ii  kmod-ifs-kernel-updates-4.9.0-13-amd64  1:3.10.0-1062-2123-1ifs+deb9  amd64  Updated kernel modules for Omni-Path
ii  kmod-ifs-kernel-updates-4.9.0-6-amd64   1:3.10.0-514-724-2ifs+deb9    amd64  Updated kernel modules for Omni-Path
rc  kmod-ifs-kernel-updates-4.9.0-8-amd64   1:3.10.0-957-1793-1ifs+deb9   amd64  Updated kernel modules for Omni-Path

2. Are you using CUDA cards by any chance?

No, these nodes do not have any GPU and the code is MPI only (no hybrid implementation).

These last days:

- I have compiled libpsm2 from the GitHub sources, but it seems to be at the same level of development as the one already installed, and it does not solve the problem.
- Another user tried to deploy OpenMPI "automagically" with the Spack tool, but the problem shows up there as well.
- The problem also exists with the OpenMPI 4.0.3 provided by the O.S.
- I tried to run a test with MPICH (installed in the O.S.), but it is not compatible with the local batch scheduler and the install is not functional.
- I've downgraded my simulation code back to point-to-point communications (12% slower) as a workaround, so the PhD students on this supercomputer can keep working while a solution is found.
- I've opened an issue on https://github.com/cornelisnetworks/opa-psm2 describing the problem and providing the test case. Thanks to Michael who is looking at this.

Patrick

On 28/01/2021 at 17:52, Heinz, Michael William via users wrote:
> Patrick,
>
> A few more questions for you:
>
> 1. What version of IFS are you running?
> 2. Are you using CUDA cards by any chance? If so, what version of CUDA?
>
> -----Original Message-----
> From: Heinz, Michael William
> Sent: Wednesday, January 27, 2021 3:45 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: RE: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
>
> Patrick,
>
> Do you have any PSM2_* or HFI_* environment variables defined in your
> runtime environment that could be affecting things?
>
> -----Original Message-----
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Heinz, Michael
> William via users
> Sent: Wednesday, January 27, 2021 3:37 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Heinz, Michael William <michael.william.he...@cornelisnetworks.com>
> Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
>
> Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or
> by Cornelis Networks - but I should point out that you can download the
> latest official source for PSM2 and the drivers from GitHub.
>
> -----Original Message-----
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Michael Di
> Domenico via users
> Sent: Wednesday, January 27, 2021 3:32 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Michael Di Domenico <mdidomeni...@gmail.com>
> Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
>
> if you have OPA cards, for openmpi you only need --with-ofi, you don't need
> psm/psm2/verbs/ucx. but this assumes you're running a rhel based distro and
> have installed the OPA fabric suite of software from Intel/CornelisNetworks,
> which is what i have. perhaps there's something really odd in debian, or
> there's an incompatibility with the older ofed drivers included with debian.
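>
> as a rough sketch, the kind of build i mean is just the following (the
> prefix is only an example, and it assumes the libfabric development files
> from the fabric suite are installed where configure can find them):
>
>     # example only: prefix and paths are assumptions, adjust to your site
>     ./configure --prefix=$HOME/ompi-4.0.5 --with-ofi \
>         --without-psm --without-psm2 --without-verbs --without-ucx
>     make -j 8 && make install
>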
> unfortunately i don't have access to a debian box, so i can't be much
> more help.
>
> if i had to guess, totally pulling junk from the air, there's probably
> something incompatible with PSM and OPA when running specifically on debian
> (likely due to library versioning). i don't know how common that setup is,
> so it's not clear how fleshed out and tested it is.
>
> On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users
> <users@lists.open-mpi.org> wrote:
>> Hi Howard and Michael,
>>
>> First, many thanks for testing with my short application. Yes, when the
>> test code runs fine it just shows the max RSS size of the rank 0 process.
>> When it runs wrong it prints a message about each invalid value found.
>>
>> As I said, I have also deployed OpenMPI on various clusters (in the DELL
>> data center at Austin) when I was testing some architectures a few months
>> ago, and I had no problem on either AMD/Mellanox_IB or Intel/Omni-Path.
>> The goal was to run my tests with the same software stacks and to be sure
>> I could deploy my software stack on the selected solution. But like your
>> clusters (and my small local clusters), they were all running RedHat (or
>> similar Linux flavors) and a modern GNU compiler (9 or 10). The
>> university cluster I have access to is running Debian stretch and
>> provides GCC 6 as the default compiler.
>>
>> I cannot ask for a different OS, but I can deploy a local gcc10 and
>> build OpenMPI again. UCX is not available on this cluster, should I
>> deploy a local UCX too?
>>
>> Libpsm2 seems good:
>>
>> dahu103 : dpkg -l | grep psm
>> ii  libfabric-psm           1.10.0-2-1ifs+deb9         amd64  Dynamic PSM provider for user-space Open Fabric Interfaces
>> ii  libfabric-psm2          1.10.0-2-1ifs+deb9         amd64  Dynamic PSM2 provider for user-space Open Fabric Interfaces
>> ii  libpsm-infinipath1      3.3-19-g67c0807-2ifs+deb9  amd64  PSM Messaging library for Intel Truescale adapters
>> ii  libpsm-infinipath1-dev  3.3-19-g67c0807-2ifs+deb9  amd64  Development files for libpsm-infinipath1
>> ii  libpsm2-2               11.2.185-1-1ifs+deb9       amd64  Intel PSM2 Libraries
>> ii  libpsm2-2-compat        11.2.185-1-1ifs+deb9       amd64  Compat library for Intel PSM2
>> ii  libpsm2-dev             11.2.185-1-1ifs+deb9       amd64  Development files for Intel PSM2
>> ii  psmisc                  22.21-2.1+b2               amd64  utilities that use the proc file system
>>
>> This will be my next attempt to install OpenMPI on this cluster.
>>
>> Patrick
>>
>>
>> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
>>> Hi Folks,
>>>
>>> I'm also having problems reproducing this on one of our OPA clusters:
>>>
>>> libpsm2-11.2.78-1.el7.x86_64
>>> libpsm2-devel-11.2.78-1.el7.x86_64
>>>
>>> The cluster runs RHEL 7.8.
>>>
>>> hca_id: hfi1_0
>>>     transport:       InfiniBand (0)
>>>     fw_ver:          1.27.0
>>>     node_guid:       0011:7501:0179:e2d7
>>>     sys_image_guid:  0011:7501:0179:e2d7
>>>     vendor_id:       0x1175
>>>     vendor_part_id:  9456
>>>     hw_ver:          0x11
>>>     board_id:        Intel Omni-Path Host Fabric Interface Adapter 100 Series
>>>     phys_port_cnt:   1
>>>     port: 1
>>>         state:       PORT_ACTIVE (4)
>>>         max_mtu:     4096 (5)
>>>         active_mtu:  4096 (5)
>>>         sm_lid:      1
>>>         port_lid:    99
>>>         port_lmc:    0x00
>>>         link_layer:  InfiniBand
>>>
>>> using gcc/gfortran 9.3.0
>>>
>>> Built Open MPI 4.0.5 without any special configure options.
>>>
>>> Howard
>>>
>>> On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users"
>>> <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org>
>>> wrote:
>>>
>>> for whatever it's worth, running the test program on my OPA cluster
>>> seems to work. well, it keeps spitting out [INFO MEMORY] lines, not
>>> sure if it's supposed to stop at some point.
>>>
>>> i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi,
>>> without-{psm,ucx,verbs}.
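>>>
>>> if it helps, a quick way to confirm which transport layer actually gets
>>> selected at run time is to raise the component verbosity, something like
>>> (just a sketch: the binary name and process count are placeholders, and
>>> the launch line needs adapting to your scheduler):
>>>
>>>     # example only: test_layout_array / -np 2 are placeholders
>>>     mpirun -np 2 --mca pml_base_verbose 10 --mca mtl_base_verbose 10 \
>>>         ./test_layout_array
>>>
>>> the selection messages should name the mtl (psm2, ofi, ...) that gets
>>> picked.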
>>>
>>> On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
>>> <users@lists.open-mpi.org> wrote:
>>> >
>>> > Hi Michael,
>>> >
>>> > Indeed I'm a little bit lost with all these parameters in OpenMPI,
>>> > mainly because for years it has worked just fine out of the box in all
>>> > my deployments on various architectures, interconnects and Linux
>>> > flavors. Some weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc10,
>>> > Slurm and UCX on an AMD Epyc2 cluster with ConnectX-6, and it just works
>>> > fine. It is the first time I've had such trouble deploying this library.
>>> >
>>> > If you have my mail posted on 25/01/2021 in this discussion at 18h54
>>> > (maybe Paris TZ), there is a small test case attached that shows the
>>> > problem. Did you get it, or did the list strip the attachments? I can
>>> > provide it again.
>>> >
>>> > Many thanks
>>> >
>>> > Patrick
>>> >
>>> > On 26/01/2021 at 19:25, Heinz, Michael William wrote:
>>> >
>>> > Patrick, how are you using the original PSM if you're using Omni-Path
>>> > hardware? The original PSM was written for QLogic DDR and QDR
>>> > InfiniBand adapters.
>>> >
>>> > As far as needing openib - the issue is that the PSM2 MTL doesn't
>>> > support a subset of MPI operations that we previously used the pt2pt
>>> > BTL for. For recent versions of OMPI, the preferred BTL to use with
>>> > PSM2 is OFI.
>>> >
>>> > Is there any chance you can give us a sample MPI app that reproduces
>>> > the problem? I can't think of another way I can give you more help
>>> > without being able to see what's going on. It's always possible there's
>>> > a bug in the PSM2 MTL, but it would be surprising at this point.
>>> >
>>> > Sent from my iPad
>>> >
>>> > On Jan 26, 2021, at 1:13 PM, Patrick Begou via users
>>> > <users@lists.open-mpi.org> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I ran many tests today. I saw that an older 4.0.2 version of OpenMPI
>>> > packaged with Nix was running using openib, so I added the --with-verbs
>>> > option to set up this module.
>>> >
>>> > What I can see now is that with:
>>> >
>>> > mpirun -hostfile $OAR_NODEFILE --mca mtl psm -mca btl_openib_allow_ib true ....
>>> >
>>> > - the testcase test_layout_array is running without error
>>> >
>>> > - the bandwidth measured with osu_bw is half of what it should be:
>>> >
>>> > # OSU MPI Bandwidth Test v5.7
>>> > # Size      Bandwidth (MB/s)
>>> > 1                       0.54
>>> > 2                       1.13
>>> > 4                       2.26
>>> > 8                       4.51
>>> > 16                      9.06
>>> > 32                     17.93
>>> > 64                     33.87
>>> > 128                    69.29
>>> > 256                   161.24
>>> > 512                   333.82
>>> > 1024                  682.66
>>> > 2048                 1188.63
>>> > 4096                 1760.14
>>> > 8192                 2166.08
>>> > 16384                2036.95
>>> > 32768                3466.63
>>> > 65536                6296.73
>>> > 131072               7509.43
>>> > 262144               9104.78
>>> > 524288               6908.55
>>> > 1048576              5530.37
>>> > 2097152              4489.16
>>> > 4194304              3498.14
>>> >
>>> > and with:
>>> >
>>> > mpirun -hostfile $OAR_NODEFILE --mca mtl psm2 -mca btl_openib_allow_ib true ...
>>> >
>>> > - the testcase test_layout_array is not giving correct results
>>> >
>>> > - the bandwidth measured with osu_bw is the right one:
>>> >
>>> > # OSU MPI Bandwidth Test v5.7
>>> > # Size      Bandwidth (MB/s)
>>> > 1                       3.73
>>> > 2                       7.96
>>> > 4                      15.82
>>> > 8                      31.22
>>> > 16                     51.52
>>> > 32                    107.61
>>> > 64                    196.51
>>> > 128                   438.66
>>> > 256                   817.70
>>> > 512                  1593.90
>>> > 1024                 2786.09
>>> > 2048                 4459.77
>>> > 4096                 6658.70
>>> > 8192                 8092.95
>>> > 16384                8664.43
>>> > 32768                8495.96
>>> > 65536               11458.77
>>> > 131072              12094.64
>>> > 262144              11781.84
>>> > 524288              12297.58
>>> > 1048576             12346.92
>>> > 2097152             12206.53
>>> > 4194304             12167.00
>>> >
>>> > But yes, I know openib is deprecated too in 4.0.5.
>>> >
>>> > Patrick
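>>> >
>>> > PS: if the problem really is in the PSM2 MTL itself, one more thing I
>>> > could try is to go through libfabric instead, something like (only a
>>> > guess on my side, assuming the ofi component was built into this
>>> > OpenMPI):
>>> >
>>> >     # example only: forces the cm PML with the ofi MTL instead of psm2
>>> >     mpirun -hostfile $OAR_NODEFILE --mca pml cm --mca mtl ofi ./osu_bw
>>> >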