Hi Howard and Michael

First, many thanks for testing with my short application. Yes, when the
test code runs fine it just shows the max RSS size of the rank 0 process.
When it goes wrong it prints a message for each invalid value found.

As I said, I had also deployed OpenMPI on various clusters (in the DELL
data center at Austin) some months ago while I was evaluating
architectures, and I had no problem on either AMD/Mellanox_IB or
Intel/Omni-Path. The goal was to run my tests with the same software
stack and to be sure I could deploy that stack on the selected solution.
But like your clusters (and my small local clusters), they were all
running RedHat (or similar Linux flavors) with a modern GNU compiler (9
or 10). The university cluster I have access to runs Debian stretch and
provides GCC 6 as the default compiler.

I cannot ask for a different OS, but I can deploy a local GCC 10 and
rebuild OpenMPI with it. UCX is not available on this cluster; should I
deploy a local UCX too?
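
Something like this is what I plan to try (the install prefixes below
are only placeholders for my local builds, not actual paths on the
cluster, and the PSM2/OFI options follow what was suggested earlier in
this thread):

# use the locally built GCC 10 for the whole build
export PATH=$HOME/local/gcc-10/bin:$PATH
cd openmpi-4.0.5
./configure CC=gcc CXX=g++ FC=gfortran \
            --prefix=$HOME/local/openmpi-4.0.5 \
            --with-psm2 --with-ofi
# add --with-ucx=$HOME/local/ucx here if a local UCX turns out to be needed
make -j 8 && make install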

Libpsm2 seems fine:
dahu103 : dpkg -l | grep psm
ii  libfabric-psm          1.10.0-2-1ifs+deb9        amd64 Dynamic PSM provider for user-space Open Fabric Interfaces
ii  libfabric-psm2         1.10.0-2-1ifs+deb9        amd64 Dynamic PSM2 provider for user-space Open Fabric Interfaces
ii  libpsm-infinipath1     3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging library for Intel Truescale adapters
ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development files for libpsm-infinipath1
ii  libpsm2-2              11.2.185-1-1ifs+deb9      amd64 Intel PSM2 Libraries
ii  libpsm2-2-compat       11.2.185-1-1ifs+deb9      amd64 Compat library for Intel PSM2
ii  libpsm2-dev            11.2.185-1-1ifs+deb9      amd64 Development files for Intel PSM2
ii  psmisc                 22.21-2.1+b2              amd64 utilities that use the proc file system
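
If the fi_info tool from libfabric is installed on the nodes (I still
have to check), I can also verify that the PSM2 provider is actually
usable with something like:

fi_info -p psm2 | head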

Rebuilding OpenMPI this way will be my next attempt on this cluster.

Patrick


On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> Hi Folks,
>
> I'm also having problems reproducing this on one of our OPA clusters:
>
> libpsm2-11.2.78-1.el7.x86_64
> libpsm2-devel-11.2.78-1.el7.x86_64
>
> cluster runs RHEL 7.8
>
> hca_id:       hfi1_0
>       transport:                      InfiniBand (0)
>       fw_ver:                         1.27.0
>       node_guid:                      0011:7501:0179:e2d7
>       sys_image_guid:                 0011:7501:0179:e2d7
>       vendor_id:                      0x1175
>       vendor_part_id:                 9456
>       hw_ver:                         0x11
>       board_id:                       Intel Omni-Path Host Fabric Interface 
> Adapter 100 Series
>       phys_port_cnt:                  1
>               port:   1
>                       state:                  PORT_ACTIVE (4)
>                       max_mtu:                4096 (5)
>                       active_mtu:             4096 (5)
>                       sm_lid:                 1
>                       port_lid:               99
>                       port_lmc:               0x00
>                       link_layer:             InfiniBand
>
> using gcc/gfortran 9.3.0
>
> Built Open MPI 4.0.5 without any special configure options.
>
> Howard
>
> On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" 
> <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> 
> wrote:
>
>     for whatever it's worth running the test program on my OPA cluster
>     seems to work.  well it keeps spitting out [INFO MEMORY] lines, not
>     sure if it's supposed to stop at some point
>     
>     i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, 
> without-{psm,ucx,verbs}
>     
>     On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
>     <users@lists.open-mpi.org> wrote:
>     >
>     > Hi Michael
>     >
>     > indeed I'm a little bit lost with all these parameters in OpenMPI,
> mainly because for years it has worked just fine out of the box in all my
> deployments on various architectures, interconnects and Linux flavors. Some
> weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc10, Slurm and UCX on an
> AMD epyc2 cluster with ConnectX-6, and it just works fine. It is the first
> time I have had such trouble deploying this library.
>     >
>     > If you have my mail posted on 25/01/2021 in this discussion at 18h54
> (maybe Paris TZ), there is a small test case attached that shows the problem.
> Did you get it, or did the list strip the attachments? I can provide it
> again.
>     >
>     > Many thanks
>     >
>     > Patrick
>     >
>     > On 26/01/2021 at 19:25, Heinz, Michael William wrote:
>     >
>     > Patrick, how are you using the original PSM if you're using Omni-Path
> hardware? The original PSM was written for QLogic DDR and QDR InfiniBand
> adapters.
>     >
>     > As far as needing openib - the issue is that the PSM2 MTL doesn't
> support a subset of MPI operations that we previously used the pt2pt BTL for.
> For recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
>     >
>     > Is there any chance you can give us a sample MPI app that reproduces 
> the problem? I can’t think of another way I can give you more help without 
> being able to see what’s going on. It’s always possible there’s a bug in the 
> PSM2 MTL but it would be surprising at this point.
>     >
>     > Sent from my iPad
>     >
>     > On Jan 26, 2021, at 1:13 PM, Patrick Begou via users 
> <users@lists.open-mpi.org> wrote:
>     >
>     >
>     > Hi all,
>     >
>     > I ran many tests today. I saw that an older 4.0.2 version of OpenMPI
> packaged with Nix was running using openib, so I added the --with-verbs option
> to set up this module.
>     >
>     > What I can see now is that:
>     >
>     > mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca btl_openib_allow_ib 
> true ....
>     >
>     > - the testcase test_layout_array is running without error
>     >
>     > - the bandwidth measured with osu_bw is half of what it should be:
>     >
>     > # OSU MPI Bandwidth Test v5.7
>     > # Size      Bandwidth (MB/s)
>     > 1                       0.54
>     > 2                       1.13
>     > 4                       2.26
>     > 8                       4.51
>     > 16                      9.06
>     > 32                     17.93
>     > 64                     33.87
>     > 128                    69.29
>     > 256                   161.24
>     > 512                   333.82
>     > 1024                  682.66
>     > 2048                 1188.63
>     > 4096                 1760.14
>     > 8192                 2166.08
>     > 16384                2036.95
>     > 32768                3466.63
>     > 65536                6296.73
>     > 131072               7509.43
>     > 262144               9104.78
>     > 524288               6908.55
>     > 1048576              5530.37
>     > 2097152              4489.16
>     > 4194304              3498.14
>     >
>     > mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca btl_openib_allow_ib 
> true ...
>     >
>     > - the testcase test_layout_array is not giving correct results
>     >
>     > - the bandwidth measured with osu_bw is the right one:
>     >
>     > # OSU MPI Bandwidth Test v5.7
>     > # Size      Bandwidth (MB/s)
>     > 1                       3.73
>     > 2                       7.96
>     > 4                      15.82
>     > 8                      31.22
>     > 16                     51.52
>     > 32                    107.61
>     > 64                    196.51
>     > 128                   438.66
>     > 256                   817.70
>     > 512                  1593.90
>     > 1024                 2786.09
>     > 2048                 4459.77
>     > 4096                 6658.70
>     > 8192                 8092.95
>     > 16384                8664.43
>     > 32768                8495.96
>     > 65536               11458.77
>     > 131072              12094.64
>     > 262144              11781.84
>     > 524288              12297.58
>     > 1048576             12346.92
>     > 2097152             12206.53
>     > 4194304             12167.00
>     >
>     > But yes, I know openib is deprecated too in 4.0.5.
>     >
>     > Patrick
>     >
>     >
>     
>
