[OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-03 Thread Scott Sayres via users
Hello,
I am new to Open MPI, but would like to use it for ORCA calculations, and plan
to run codes on the 10 processors of my MacBook Pro.  I installed it both
manually and through homebrew, with similar results.  I am able to compile
codes with mpicc and run them as native executables, but everything that
I attempt with mpirun or mpiexec just freezes.  I can end the program by
pressing Ctrl-C twice, but it continues to run in the background and
requires me to 'kill '.
Even something as simple as 'mpirun uname' freezes.

I have tried one installation with 'arch -arm64 brew install openmpi'
and a second by downloading the source and running './configure
--prefix=/usr/local', 'make all', and 'make install'.

The commands 'which mpicc', 'which mpirun', etc. are able to find them on
the path... it just hangs.
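For reference, this is the kind of minimal test I have in mind (the file name
hello.c is just a placeholder):

cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc hello.c -o hello   # compiling with mpicc works
mpirun -np 2 ./hello     # hangs, just like 'mpirun uname'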

Can anyone suggest how to fix the problem of the program hanging?
Thanks!
Scott


Re: [OMPI users] mpi-test-suite shows errors on openmpi 4.1.x

2022-05-03 Thread Alois Schlögl via users

Hello Gilles,

Thanks for your response. I'm testing with 20 tasks, each using 8 
threads. When using a single node or only a few nodes, we do not see this 
issue either.


Attached are the slurm script used (which also reports the environment 
variables) and the output logs from three different runs: with srun, with 
mpirun, and with mpirun --mca 


It is correct that when running with mpirun we do not see this issue; 
the errors are only observed when running with "srun".

Moreover, I notice that fewer tests are performed when using mpirun.

From that we can conclude that the issue is related to the slurm-openmpi 
interaction.
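
For reference, a few quick checks on that interaction (a sketch only; the
exact plugin names depend on how slurm and openmpi were built):

srun --mpi=list                              # PMI plugins slurm offers to srun
ompi_info | grep -i pmi                      # PMI support in this openmpi build
scontrol show config | grep -i MpiDefault    # cluster-wide default MPI plugin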


Switching from srun to mpirun also has some negative implications with 
respect to scheduling and robustness. Therefore, we would like to start the 
job with srun.
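
For completeness, the kind of launch we would like to keep is of this form (a
sketch only: the pmix plugin name is an assumption and has to match what slurm
was built against; the test arguments are the ones from your mail):

#!/bin/bash
#SBATCH --ntasks=20          # 20 MPI tasks, as in the attached script
#SBATCH --cpus-per-task=8    # 8 threads per task
export OMP_NUM_THREADS=8
# launch through slurm's PMI/PMIx interface instead of mpirun
srun --mpi=pmix ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective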



Cheers,
    Alois








On 5/3/22 at 12:52, Gilles Gouaillardet via users wrote:

Alois,

Thanks for the report.

FWIW, I am not seeing any errors on my Mac with Open MPI from brew (4.1.3)

How many MPI tasks are you running?
Can you please confirm you can evidence the error with

mpirun -np  ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective



Also, can you try the same command with
mpirun --mca pml ob1 --mca btl tcp,self ...

Cheers,

Gilles

On Tue, May 3, 2022 at 7:08 PM Alois Schlögl via users wrote:



Within our cluster (debian10/slurm16, debian11/slurm20), with
infiniband, we have several instances of openmpi installed through
the Lmod module system. When testing the openmpi installations
with the mpi-test-suite 1.1 [1], it shows errors like these

...
Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
...

when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The
number of errors may vary, but the first errors are always about
    ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD

When testing on openmpi/3.1.3, the tests run successfully, and there
are no failed tests.

Typically, the openmpi/4.1.x installation is configured with
 ./configure --prefix=${PREFIX} \
 --with-ucx=$UCX_HOME \
 --enable-orterun-prefix-by-default  \
 --enable-mpi-cxx \
 --with-hwloc \
 --with-pmi \
 --with-pmix \
 --with-cuda=$CUDA_HOME \
 --with-slurm

but I've also tried different compilation options, including w/ and
w/o --enable-mpi1-compatibility, w/ and w/o ucx, and using hwloc from the
OS or compiled from source. But I could not identify any pattern.

Therefore, I'd like to ask you what the issue might be. Specifically,
I would like to know:

- Am I right in assuming that mpi-test-suite [1] is suitable for testing
openmpi?
- What are possible causes for these types of errors?
- What would you recommend for debugging these issues?

Kind regards,
   Alois


[1] https://github.com/open-mpi/mpi-test-suite/t



job-mpi-test3.sh
Description: application/shellscript
delta197
/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/ompi_info
 running on 20*8 cores with 20 MPI-tasks and 8 threads
SHELL=/bin/bash
SLURM_JOB_USER=schloegl
SLURM_TASKS_PER_NODE=2(x10)
SLURM_JOB_UID=10103
SLURM_TASK_PID=50793
PKG_CONFIG_PATH=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/lib/pkgconfig:/mnt/nfs/clustersw/Debian/bullseye/hwloc/2.7.1/lib/pkgconfig:/mnt/nfs/clustersw/shared/cuda/11.2.2/pkgconfig
SLURM_LOCALID=0
SLURM_SUBMIT_DIR=/nfs/scistore16/jonasgrp/schloegl/slurm
HOSTNAME=delta197
LANGUAGE=en_US:en
SLURMD_NODENAME=delta197
_ModuleTable002_=ewpmbiA9ICIvbW50L25mcy9jbHVzdGVyc3cvRGViaWFuL2J1bGxzZXllL21vZHVsZWZpbGVzL0NvcmUvaHdsb2MvMi43LjEubHVhIiwKZnVsbE5hbWUgPSAiaHdsb2MvMi43LjEiLApsb2FkT3JkZXIgPSAzLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMSwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gImh3bG9jLzIuNy4xIiwKd1YgPSAiMDAwMDAwMDAyLjAwMDAwMDAwNy4wMDAwMDAwMDEuKnpmaW5hbCIsCn0sCm9wZW5tcGkgPSB7CmZuID0gIi9tbnQvbmZzL2NsdXN0ZXJzdy9EZWJpYW4vYnVsbHNleWUvbW9kdWxlZmlsZXMvQ29yZS9vcGVubXBpLzQuMS4zZC5sdWEiLApmdWxsTmFtZSA9ICJvcGVubXBpLzQuMS4zZCIsCmxvYWRPcmRlciA9IDQsCnByb3BUID0g
MPICC=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/mpicc
__LMOD_REF_COUNT_MODULEPATH=/mnt/nfs/clustersw/Debian/bullseye/modulefiles/MPI/openmpi/4.1.3d:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Linux:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Core:1;/mnt/nfs/clustersw/Debian/bullseye/lmod/lmod/modulefiles/Core:1
OMPI_MCA_btl=self,openib

Re: [OMPI users] mpi-test-suite shows errors on openmpi 4.1.x

2022-05-03 Thread Gilles Gouaillardet via users
Alois,

Thanks for the report.

FWIW, I am not seeing any errors on my Mac with Open MPI from brew (4.1.3)

How many MPI tasks are you running?
Can you please confirm you can evidence the error with

mpirun -np  ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective


Also, can you try the same command with
mpirun --mca pml ob1 --mca btl tcp,self ...
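
(Spelled out against the command above, that would be something like the
following; the task count of 20 is only an example:)

mpirun -np 20 --mca pml ob1 --mca btl tcp,self \
    ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective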

Cheers,

Gilles

On Tue, May 3, 2022 at 7:08 PM Alois Schlögl via users <users@lists.open-mpi.org> wrote:

>
> Within our cluster (debian10/slurm16, debian11/slurm20), with
> infiniband, we have several instances of openmpi installed through
> the Lmod module system. When testing the openmpi installations with the
> mpi-test-suite 1.1 [1], it shows errors like these
>
> ...
> Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
> (Rank:0) tst_test_array[46]:Allreduce Sum
> (Rank:0) tst_test_array[47]:Alltoall
> Number of failed tests: 130
> Summary of failed tests:
> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
> (4), type MPI_TYPE_MIX (27) number of values:1000
> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
> (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
> ...
>
> when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of
> errors may vary, but the first errors are always about
> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
>
> When testing on openmpi/3.1.3, the tests run successfully, and there
> are no failed tests.
>
> Typically, the openmpi/4.1.x installation is configured with
>  ./configure --prefix=${PREFIX} \
>  --with-ucx=$UCX_HOME \
>  --enable-orterun-prefix-by-default  \
>  --enable-mpi-cxx \
>  --with-hwloc \
>  --with-pmi \
>  --with-pmix \
>  --with-cuda=$CUDA_HOME \
>  --with-slurm
>
> but I've also tried different compilation options including w/ and w/o
> --enable-mpi1-compatibility, w/ and w/o ucx, using hwloc from the OS, or
> compiled from source. But I could not identify any pattern.
>
> Therefore, I'd like to ask you what the issue might be. Specifically,
> I would like to know:
>
> - Am I right in assuming that mpi-test-suite [1] is suitable for testing
> openmpi?
> - What are possible causes for these types of errors?
> - What would you recommend for debugging these issues?
>
> Kind regards,
>Alois
>
>
> [1] https://github.com/open-mpi/mpi-test-suite/t
>
>


[OMPI users] mpi-test-suite shows errors on openmpi 4.1.x

2022-05-03 Thread Alois Schlögl via users



Within our cluster (debian10/slurm16, debian11/slurm20), with 
infiniband, we have several instances of openmpi installed through 
the Lmod module system. When testing the openmpi installations with the 
mpi-test-suite 1.1 [1], it shows errors like these


...
Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000

...

when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of 
errors may vary, but the first errors are always about

   ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD

When testing on openmpi/3.1.3, the tests run successfully, and there 
are no failed tests.


Typically, the openmpi/4.1.x installation is configured with
    ./configure --prefix=${PREFIX} \
    --with-ucx=$UCX_HOME \
    --enable-orterun-prefix-by-default  \
    --enable-mpi-cxx \
    --with-hwloc \
    --with-pmi \
    --with-pmix \
    --with-cuda=$CUDA_HOME \
    --with-slurm

but I've also tried different compilation options including w/ and w/o 
--enable-mpi1-compatibility, w/ and w/o ucx, using hwloc from the OS, or 
compiled from source. But I could not identify any pattern.
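
(As a debugging sketch, the components actually selected at run time can be
made visible by raising the verbosity of the pml and btl frameworks; the
verbosity level of 10 below is arbitrary:)

ompi_info | grep -E "pml|btl"    # components compiled into this build
mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 ./mpi_test_suite -t collective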


Therefore, I'd like to ask you what the issue might be. Specifically, 
I would like to know:

- Am I right in assuming that mpi-test-suite [1] is suitable for testing 
openmpi?
- What are possible causes for these types of errors?
- What would you recommend for debugging these issues?

Kind regards,
  Alois


[1] https://github.com/open-mpi/mpi-test-suite/t