Hi,

just to update this discussion and answer some questions here as well:

1. What version of IFS are you running?
ii  ifs-kernel-updates-dev                 1:3.10.0-1062-2123-1ifs+deb9      
amd64        Development headers for Intel HFI1 driver interface
ii  kmod-ifs-kernel-updates-4.9.0-13-amd64 1:3.10.0-1062-2123-1ifs+deb9      
amd64        Updated kernel modules for Omni-Path
ii  kmod-ifs-kernel-updates-4.9.0-6-amd64  1:3.10.0-514-724-2ifs+deb9        
amd64        Updated kernel modules for Omni-Path
rc  kmod-ifs-kernel-updates-4.9.0-8-amd64  1:3.10.0-957-1793-1ifs+deb9       
amd64        Updated kernel modules for Omni-Path
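
For reference, this listing comes from a dpkg query along these lines (the
grep pattern is simply what matches the IFS package names here):

dpkg -l | grep ifs-kernel-updates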

2. Are you using CUDA cards by any chance?
No, these nodes do not have any GPUs and the code is MPI-only (no hybrid
implementation).

Over the last few days:
- I have compiled libpsm2 from the GitHub sources, but it seems to be at
the same level of development as the one installed, and it does not
solve the problem.
- Another user tried to deploy OpenMPI "automagically" with the Spack
tool, but the same problem shows up there as well.
- The problem also exists with the OpenMPI 4.0.3 provided by the OS.
- I tried to run a test with MPICH (installed in the OS), but it is not
compatible with the local batch scheduler and the install is not
functional.
- I've downgraded my simulation code back to point-to-point
communications (12% slower) as a workaround, so the PhD students on this
supercomputer can keep working while a solution is found.
- I've opened an issue on https://github.com/cornelisnetworks/opa-psm2
describing the problem and providing the test case (the reproduction
command is recalled just below). Thanks to Michael, who is looking at this.
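
For reference, a run along these lines reproduces the wrong results here
(the hostfile is site-specific, and test_layout_array is the test case
attached to the issue):

mpirun -hostfile $OAR_NODEFILE --mca mtl psm2 ./test_layout_array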

Patrick

On 28/01/2021 at 17:52, Heinz, Michael William via users wrote:
> Patrick,
>
> A few more questions for you:
>
> 1. What version of IFS are you running?
> 2. Are you using CUDA cards by any chance? If so, what version of CUDA?
>
> -----Original Message-----
> From: Heinz, Michael William 
> Sent: Wednesday, January 27, 2021 3:45 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: RE: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
>
> Patrick,
>
> Do you have any PSM2_* or HFI_* environment variables defined in your run 
> time environment that could be affecting things?
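>
> (A quick way to check on the compute nodes, for what it's worth:
>
> env | grep -E '^(PSM2_|HFI_)'
>
> prints nothing if none are set.)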
>
>
> -----Original Message-----
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Heinz, Michael 
> William via users
> Sent: Wednesday, January 27, 2021 3:37 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Heinz, Michael William <michael.william.he...@cornelisnetworks.com>
> Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
>
> Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or 
> by Cornelis Networks - but I should point out you can download the latest 
> official source for PSM2 and the drivers from Github.
>
> -----Original Message-----
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Michael Di 
> Domenico via users
> Sent: Wednesday, January 27, 2021 3:32 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Michael Di Domenico <mdidomeni...@gmail.com>
> Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
>
> if you have OPA cards, for openmpi you only need --with-ofi, you don't need
> psm/psm2/verbs/ucx.  but this assumes you're running a rhel-based distro and
> have installed the OPA fabric suite of software from Intel/CornelisNetworks,
> which is what i have.  perhaps there's something really odd in debian, or
> there's an incompatibility with the older ofed drivers that may be included
> with debian.  unfortunately i don't have access to a debian system, so i
> can't be much more help
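>
> (for illustration, a configure line along those lines would look something
> like this -- the prefix is just an example:
>
> ./configure --prefix=$HOME/ompi-4.0.5 --with-ofi \
>             --without-psm --without-psm2 --without-ucx --without-verbs
>
> assuming libfabric is installed where configure can find it.)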
>
> if i had to guess, totally pulling junk from the air, there's probably
> something incompatible with PSM and OPA when running specifically on debian
> (likely due to library versioning).  i don't know how common that setup is,
> so it's not clear how fleshed out and tested it is
>
>
>
>
> On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users 
> <users@lists.open-mpi.org> wrote:
>> Hi Howard and Michael
>>
>> First, many thanks for testing with my short application. Yes, when the
>> test code runs fine it just shows the max RSS size of the rank 0 process.
>> When it runs wrong it prints a message for each invalid value found.
>>
>> As I said, I have also deployed OpenMPI on various clusters (in the DELL
>> data center at Austin) when I was testing some architectures a few
>> months ago, and neither on AMD/Mellanox_IB nor on Intel/Omni-Path did I
>> hit any problem. The goal was to run my tests with the same software
>> stacks and to be sure I could deploy my software stack on the selected
>> solution. But like your clusters (and my small local clusters), they
>> were all running RedHat (or similar Linux flavors) and a modern GNU
>> compiler (9 or 10).
>> The university cluster I have access to is running Debian stretch and
>> provides GCC 6 as the default compiler.
>>
>> I cannot ask for a different OS, but I can deploy a local GCC 10 and
>> build OpenMPI again.  UCX is not available on this cluster; should I
>> deploy a local UCX too?
>>
>> Libpsm2 seems good:
>> dahu103 : dpkg -l |grep psm
>> ii  libfabric-psm          1.10.0-2-1ifs+deb9        amd64 Dynamic PSM
>> provider for user-space Open Fabric Interfaces
>> ii  libfabric-psm2         1.10.0-2-1ifs+deb9        amd64 Dynamic PSM2
>> provider for user-space Open Fabric Interfaces
>> ii  libpsm-infinipath1     3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging
>> library for Intel Truescale adapters
>> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development 
>> files for libpsm-infinipath1
>> ii  libpsm2-2              11.2.185-1-1ifs+deb9      amd64 Intel PSM2
>> Libraries
>> ii  libpsm2-2-compat       11.2.185-1-1ifs+deb9      amd64 Compat
>> library for Intel PSM2
>> ii  libpsm2-dev            11.2.185-1-1ifs+deb9      amd64 Development
>> files for Intel PSM2
>> ii  psmisc                 22.21-2.1+b2              amd64 utilities
>> that use the proc file system
>>
>> This will be my next attempt to install OpenMPI on this cluster.
>>
>> Patrick
>>
>>
>>> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
>>> Hi Folks,
>>>
>>> I'm also having problems reproducing this on one of our OPA clusters:
>>>
>>> libpsm2-11.2.78-1.el7.x86_64
>>> libpsm2-devel-11.2.78-1.el7.x86_64
>>>
>>> cluster runs RHEL 7.8
>>>
>>> hca_id:       hfi1_0
>>>       transport:                      InfiniBand (0)
>>>       fw_ver:                         1.27.0
>>>       node_guid:                      0011:7501:0179:e2d7
>>>       sys_image_guid:                 0011:7501:0179:e2d7
>>>       vendor_id:                      0x1175
>>>       vendor_part_id:                 9456
>>>       hw_ver:                         0x11
>>>       board_id:                       Intel Omni-Path Host Fabric Interface 
>>> Adapter 100 Series
>>>       phys_port_cnt:                  1
>>>               port:   1
>>>                       state:                  PORT_ACTIVE (4)
>>>                       max_mtu:                4096 (5)
>>>                       active_mtu:             4096 (5)
>>>                       sm_lid:                 1
>>>                       port_lid:               99
>>>                       port_lmc:               0x00
>>>                       link_layer:             InfiniBand
>>>
>>> using gcc/gfortran 9.3.0
>>>
>>> Built Open MPI 4.0.5 without any special configure options.
>>>
>>> Howard
>>>
>>> On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" 
>>> <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> 
>>> wrote:
>>>
>>>     for whatever it's worth, running the test program on my OPA cluster
>>>     seems to work.  well, it keeps spitting out [INFO MEMORY] lines, not
>>>     sure if it's supposed to stop at some point
>>>
>>>     i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, 
>>> without-{psm,ucx,verbs}
>>>
>>>     On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
>>>     <users@lists.open-mpi.org> wrote:
>>>     >
>>>     > Hi Michael
>>>     >
>>>     > Indeed I'm a little bit lost with all these parameters in OpenMPI,
>>> mainly because for years it has worked just fine out of the box in all my
>>> deployments on various architectures, interconnects and Linux flavors. Some
>>> weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc10, Slurm and UCX on
>>> an AMD Epyc2 cluster with ConnectX-6, and it just worked fine.  It is the
>>> first time I've had such trouble deploying this library.
>>>     >
>>>     > If you have my mail posted on 25/01/2021 in this discussion at
>>> 18:54 (maybe Paris TZ), there is a small test case attached that shows the
>>> problem. Did you get it, or did the list strip the attachments? I can
>>> provide it again.
>>>     >
>>>     > Many thanks
>>>     >
>>>     > Patrick
>>>     >
>>>     > On 26/01/2021 at 19:25, Heinz, Michael William wrote:
>>>     >
>>>     > Patrick, how are you using the original PSM if you’re using Omni-Path
>>> hardware? The original PSM was written for QLogic DDR and QDR InfiniBand
>>> adapters.
>>>     >
>>>     > As far as needing openib - the issue is that the PSM2 MTL doesn’t
>>> support a subset of MPI operations that we previously used the pt2pt BTL
>>> for. For recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
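>>>     >
>>>     > (For illustration only, that would mean launching with something
>>> like "mpirun --mca pml cm --mca mtl psm2 --mca btl ofi,self,vader ..."
>>> rather than forcing the openib BTL.)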
>>>     >
>>>     > Is there any chance you can give us a sample MPI app that reproduces 
>>> the problem? I can’t think of another way I can give you more help without 
>>> being able to see what’s going on. It’s always possible there’s a bug in 
>>> the PSM2 MTL but it would be surprising at this point.
>>>     >
>>>     > Sent from my iPad
>>>     >
>>>     > On Jan 26, 2021, at 1:13 PM, Patrick Begou via users 
>>> <users@lists.open-mpi.org> wrote:
>>>     >
>>>     >
>>>     > Hi all,
>>>     >
>>>     > I ran many tests today. I saw that an older 4.0.2 version of OpenMPI
>>> packaged with Nix was running using openib, so I added the --with-verbs
>>> option to set up this module.
>>>     >
>>>     > What I can see now is that:
>>>     >
>>>     > mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca 
>>> btl_openib_allow_ib true ....
>>>     >
>>>     > - the testcase test_layout_array is running without error
>>>     >
>>>     > - the bandwidth measured with osu_bw is half of what it should be:
>>>     >
>>>     > # OSU MPI Bandwidth Test v5.7
>>>     > # Size      Bandwidth (MB/s)
>>>     > 1                       0.54
>>>     > 2                       1.13
>>>     > 4                       2.26
>>>     > 8                       4.51
>>>     > 16                      9.06
>>>     > 32                     17.93
>>>     > 64                     33.87
>>>     > 128                    69.29
>>>     > 256                   161.24
>>>     > 512                   333.82
>>>     > 1024                  682.66
>>>     > 2048                 1188.63
>>>     > 4096                 1760.14
>>>     > 8192                 2166.08
>>>     > 16384                2036.95
>>>     > 32768                3466.63
>>>     > 65536                6296.73
>>>     > 131072               7509.43
>>>     > 262144               9104.78
>>>     > 524288               6908.55
>>>     > 1048576              5530.37
>>>     > 2097152              4489.16
>>>     > 4194304              3498.14
>>>     >
>>>     > mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca 
>>> btl_openib_allow_ib true ...
>>>     >
>>>     > - the testcase test_layout_array is not giving correct results
>>>     >
>>>     > - the bandwidth measured with osu_bw is the right one:
>>>     >
>>>     > # OSU MPI Bandwidth Test v5.7
>>>     > # Size      Bandwidth (MB/s)
>>>     > 1                       3.73
>>>     > 2                       7.96
>>>     > 4                      15.82
>>>     > 8                      31.22
>>>     > 16                     51.52
>>>     > 32                    107.61
>>>     > 64                    196.51
>>>     > 128                   438.66
>>>     > 256                   817.70
>>>     > 512                  1593.90
>>>     > 1024                 2786.09
>>>     > 2048                 4459.77
>>>     > 4096                 6658.70
>>>     > 8192                 8092.95
>>>     > 16384                8664.43
>>>     > 32768                8495.96
>>>     > 65536               11458.77
>>>     > 131072              12094.64
>>>     > 262144              11781.84
>>>     > 524288              12297.58
>>>     > 1048576             12346.92
>>>     > 2097152             12206.53
>>>     > 4194304             12167.00
>>>     >
>>>     > But yes, I know openib is deprecated too in 4.0.5.
>>>     >
>>>     > Patrick
>>>     >
>>>     >
>>>
>>>
