Re: [OMPI users] mpi-test-suite shows errors on openmpi 4.1.x

2022-05-03 Thread Alois Schlögl via users

Hello Gilles,

thanks for your response. I'm testing with 20 tasks, each using 8
threads. When using a single node or only a few nodes, we do not see
this issue either.


Attached are the slurm script we use (which also reports the environment
variables) and the output logs from three different runs: with srun,
with mpirun, and with mpirun --mca ...
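
For orientation, a minimal sketch of the kind of job script in question
(the task/thread counts and the module name match the attached log; all
other paths and options are placeholders, not the actual job-mpi-test3.sh):

   #!/bin/bash
   #SBATCH --ntasks=20          # 20 MPI tasks, as reported in the log
   #SBATCH --cpus-per-task=8    # 8 threads per task
   module load openmpi/4.1.3d   # module name from the environment dump

   # run 1: direct launch through slurm
   srun ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective

   # run 2: launch through Open MPI's mpirun
   mpirun -np 20 ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective

   # run 3: mpirun with the MCA options suggested by Gilles
   mpirun -np 20 --mca pml ob1 --mca btl tcp,self \
       ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective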


It is correct that when running with mpirun we do not see this issue;
the errors are only observed when running with srun.

Moreover, I notice that fewer tests are performed when using mpirun.

From that we can conclude that the issue is related to the interaction
between slurm and openmpi.


Switching from srun to mpirun also has some negative implications with
respect to scheduling and robustness. Therefore, we would like to start
the job with srun.
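
In case those MCA settings should also be tried with an srun launch:
Open MPI also picks up MCA parameters from environment variables with
the OMPI_MCA_ prefix, so the equivalent of Gilles' mpirun options can be
exported in the job script before the srun line. A sketch of that
general mechanism (not part of the attached script):

   # make an srun-launched job use the same settings as
   # "mpirun --mca pml ob1 --mca btl tcp,self"
   export OMPI_MCA_pml=ob1
   export OMPI_MCA_btl=tcp,self
   srun ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective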



Cheers,
    Alois








On 5/3/22 at 12:52, Gilles Gouaillardet via users wrote:

Alois,

Thanks for the report.

FWIW, I am not seeing any errors on my Mac with Open MPI from brew (4.1.3)

How many MPI tasks are you running?
Can you please confirm you can reproduce the error with

mpirun -np  ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective



Also, can you try the same command with
mpirun --mca pml ob1 --mca btl tcp,self ...

Cheers,

Gilles

On Tue, May 3, 2022 at 7:08 PM Alois Schlögl via users wrote:



Within our cluster (debian10/slurm16, debian11/slurm20) with InfiniBand,
we have several instances of openmpi installed through the Lmod module
system. When testing the openmpi installations with the mpi-test-suite
1.1 [1], we see errors like these:

...
Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
...

when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of
errors may vary, but the first errors are always of the form

    ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD

When testing with openmpi/3.1.3, the tests run successfully and there
are no failed tests.

Typically, the openmpi/4.1.x installation is configured with
 ./configure --prefix=${PREFIX} \
 --with-ucx=$UCX_HOME \
 --enable-orterun-prefix-by-default  \
 --enable-mpi-cxx \
 --with-hwloc \
 --with-pmi \
 --with-pmix \
 --with-cuda=$CUDA_HOME \
 --with-slurm

but I've also tried different compilation options, including with and
without --enable-mpi1-compatibility, with and without ucx, and using
hwloc from the OS or compiled from source, but I could not identify
any pattern.

Therefore, I'd like to ask you what the issue might be. Specifically,
I would like to know:

- Am I right in assuming that mpi-test-suite [1] is suitable for
testing openmpi?
- What are possible causes for these types of errors?
- How would you recommend debugging these issues?

Kind regards,
   Alois


[1] https://github.com/open-mpi/mpi-test-suite/t



Attachment: job-mpi-test3.sh (application/shellscript)
delta197
/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/ompi_info
 running on 20*8 cores with 20 MPI-tasks and 8 threads
SHELL=/bin/bash
SLURM_JOB_USER=schloegl
SLURM_TASKS_PER_NODE=2(x10)
SLURM_JOB_UID=10103
SLURM_TASK_PID=50793
PKG_CONFIG_PATH=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/lib/pkgconfig:/mnt/nfs/clustersw/Debian/bullseye/hwloc/2.7.1/lib/pkgconfig:/mnt/nfs/clustersw/shared/cuda/11.2.2/pkgconfig
SLURM_LOCALID=0
SLURM_SUBMIT_DIR=/nfs/scistore16/jonasgrp/schloegl/slurm
HOSTNAME=delta197
LANGUAGE=en_US:en
SLURMD_NODENAME=delta197
_ModuleTable002_=ewpmbiA9ICIvbW50L25mcy9jbHVzdGVyc3cvRGViaWFuL2J1bGxzZXllL21vZHVsZWZpbGVzL0NvcmUvaHdsb2MvMi43LjEubHVhIiwKZnVsbE5hbWUgPSAiaHdsb2MvMi43LjEiLApsb2FkT3JkZXIgPSAzLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMSwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gImh3bG9jLzIuNy4xIiwKd1YgPSAiMDAwMDAwMDAyLjAwMDAwMDAwNy4wMDAwMDAwMDEuKnpmaW5hbCIsCn0sCm9wZW5tcGkgPSB7CmZuID0gIi9tbnQvbmZzL2NsdXN0ZXJzdy9EZWJpYW4vYnVsbHNleWUvbW9kdWxlZmlsZXMvQ29yZS9vcGVubXBpLzQuMS4zZC5sdWEiLApmdWxsTmFtZSA9ICJvcGVubXBpLzQuMS4zZCIsCmxvYWRPcmRlciA9IDQsCnByb3BUID0g
MPICC=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/mpicc
__LMOD_REF_COUNT_MODULEPATH=/mnt/nfs/clustersw/Debian/bullseye/modulefiles/MPI/openmpi/4.1.3d:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Linux:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Core:1;/mnt/nfs/clustersw/Debian/bullseye/lmod/lmod/modulefiles/Core:1
OMPI_MCA_btl=s

[OMPI users] mpi-test-suite shows errors on openmpi 4.1.x

2022-05-03 Thread Alois Schlögl via users



Within our cluster (debian10/slurm16, debian11/slurm20) with InfiniBand,
we have several instances of openmpi installed through the Lmod module
system. When testing the openmpi installations with the mpi-test-suite
1.1 [1], we see errors like these:


...
Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000

...

when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of
errors may vary, but the first errors are always of the form

   ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD

When testing with openmpi/3.1.3, the tests run successfully and there
are no failed tests.


Typically, the openmpi/4.1.x installation is configured with
    ./configure --prefix=${PREFIX} \
    --with-ucx=$UCX_HOME \
    --enable-orterun-prefix-by-default  \
    --enable-mpi-cxx \
    --with-hwloc \
    --with-pmi \
    --with-pmix \
    --with-cuda=$CUDA_HOME \
    --with-slurm

but I've also tried different compilation options, including with and
without --enable-mpi1-compatibility, with and without ucx, and using
hwloc from the OS or compiled from source, but I could not identify
any pattern.
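
One way to verify which of these options actually ended up in a given
installation is to query its ompi_info; a quick check along these lines
(a sketch, assuming the standard ompi_info output format):

   # show the configure line the installation was actually built with
   ${PREFIX}/bin/ompi_info | grep "Configure command line"

   # list the pml and btl components that were built (e.g. ucx, ob1, tcp, self)
   ${PREFIX}/bin/ompi_info | grep -E "MCA (pml|btl):"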


Therefore, I'd like to ask you what the issue might be. Specifically,
I would like to know:

- Am I right in assuming that mpi-test-suite [1] is suitable for
testing openmpi?
- What are possible causes for these types of errors?
- How would you recommend debugging these issues?

Kind regards,
  Alois


[1] https://github.com/open-mpi/mpi-test-suite/t