[OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3
Hello, I am new to Open MPI, but would like to use it for ORCA calculations, and plan to run codes on the 10 processors of my MacBook Pro. I installed it both manually and through Homebrew, with similar results.

I am able to compile codes with mpicc and run them as native executables, but everything that I attempt with mpirun or mpiexec just freezes. I can end the program by typing Control-C twice, but it continues to run in the background and requires me to 'kill ' it. Even something as simple as 'mpirun uname' freezes.

I have tried one installation with:

    arch -arm64 brew install openmpi

and a second by downloading the source tarball and running:

    ./configure --prefix=/usr/local
    make all
    make install

The commands 'which mpicc', 'which mpirun', etc. are able to find them on the path... it just hangs.

Can anyone suggest how to fix the problem of the program hanging?

Thanks!
Scott <>
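For others hitting the same symptom: a common first step is to rule out network-interface selection, since mpirun on macOS often stalls at startup when it probes external interfaces (VPNs and firewalls interfere) or cannot resolve the local hostname. The commands below are a sketch, not a definitive fix — the MCA parameter names are standard Open MPI 4.x options, but whether they resolve this particular hang is an assumption:

```shell
# Run the simplest possible job first; if even this hangs,
# the problem is in process launch, not in your code.
mpirun -np 2 hostname

# Restrict Open MPI to the shared-memory and self transports so it
# does not touch any network interface at all:
mpirun --mca btl self,vader -np 2 hostname

# If TCP must be used, pin it to the loopback interface:
mpirun --mca btl self,tcp --mca btl_tcp_if_include lo0 -np 2 hostname
```

It is also worth checking that the machine's hostname resolves (e.g. that the name reported by `hostname` has an entry in /etc/hosts); Open MPI's launcher can hang silently when local name resolution fails.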
Re: [OMPI users] mpi-test-suite shows errors on openmpi 4.1.x
Hello Gilles,

thanks for your response. I'm testing with 20 tasks, each using 8 threads. When using a single node or only a few nodes, we do not see this either. Attached are the slurm script that was used (which also reports the environment variables) and the output logs from three different runs: with srun, with mpirun, and with mpirun --mca.

It is correct that when running with mpirun we do not see this issue; the errors are only observed when running with srun. Moreover, I notice that fewer tests are performed when using mpirun. From that we can conclude that the issue is related to the slurm-openmpi interaction.

Switching from srun to mpirun also has some negative implications w.r.t. scheduling and robustness. Therefore, we would like to start the job with srun.

Cheers,
Alois

On 5/3/22 at 12:52, Gilles Gouaillardet via users wrote:
> Alois,
>
> Thanks for the report.
>
> FWIW, I am not seeing any errors on my Mac with Open MPI from brew (4.1.3)
>
> How many MPI tasks are you running?
>
> Can you please confirm you can evidence the error with
> mpirun -np ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective
>
> Also, can you try the same command with
> mpirun --mca pml ob1 --mca btl tcp,self ...
>
> Cheers,
>
> Gilles
>
> On Tue, May 3, 2022 at 7:08 PM Alois Schlögl via users wrote:
>> Within our cluster (debian10/slurm16, debian11/slurm20), with
>> infiniband, we have several instances of openmpi installed through
>> the Lmod module system. When testing the openmpi installations with
>> the mpi-test-suite 1.1 [1], it shows errors like these
>>
>> ...
>> Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
>> (Rank:0) tst_test_array[46]:Allreduce Sum
>> (Rank:0) tst_test_array[47]:Alltoall
>> Number of failed tests: 130
>> Summary of failed tests:
>> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX (27) number of values:1000
>> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
>> ...
>> when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number
>> of errors may vary, but the first errors are always about
>> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
>>
>> When testing on openmpi/3.1.3, the tests run successfully, and there
>> are no failed tests.
>>
>> Typically, the openmpi/4.1.x installation is configured with
>> ./configure --prefix=${PREFIX} \
>>   --with-ucx=$UCX_HOME \
>>   --enable-orterun-prefix-by-default \
>>   --enable-mpi-cxx \
>>   --with-hwloc \
>>   --with-pmi \
>>   --with-pmix \
>>   --with-cuda=$CUDA_HOME \
>>   --with-slurm
>>
>> but I have also tried different compilation options, including w/ and
>> w/o --enable-mpi1-compatibility, w/ and w/o ucx, and using hwloc from
>> the OS or compiled from source. But I could not identify any pattern.
>>
>> Therefore, I'd like to ask you what the issue might be. Specifically,
>> I would like to know:
>>
>> - Am I right in assuming that the mpi-test-suite [1] is suitable for testing openmpi?
>> - What are possible causes for these types of errors?
>> - What would you recommend for debugging these issues?
>> Kind regards,
>> Alois
>>
>> [1] https://github.com/open-mpi/mpi-test-suite/t

job-mpi-test3.sh
Description: application/shellscript

delta197
/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/ompi_info
running on 20*8 cores with 20 MPI-tasks and 8 threads
SHELL=/bin/bash
SLURM_JOB_USER=schloegl
SLURM_TASKS_PER_NODE=2(x10)
SLURM_JOB_UID=10103
SLURM_TASK_PID=50793
PKG_CONFIG_PATH=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/lib/pkgconfig:/mnt/nfs/clustersw/Debian/bullseye/hwloc/2.7.1/lib/pkgconfig:/mnt/nfs/clustersw/shared/cuda/11.2.2/pkgconfig
SLURM_LOCALID=0
SLURM_SUBMIT_DIR=/nfs/scistore16/jonasgrp/schloegl/slurm
HOSTNAME=delta197
LANGUAGE=en_US:en
SLURMD_NODENAME=delta197
MPICC=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/mpicc
__LMOD_REF_COUNT_MODULEPATH=/mnt/nfs/clustersw/Debian/bullseye/modulefiles/MPI/openmpi/4.1.3d:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Linux:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Core:1;/mnt/nfs/clustersw/Debian/bullseye/lmod/lmod/modulefiles/Core:1
OMPI_MCA_btl=self,openib
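One detail worth noting in the environment dump above: OMPI_MCA_btl=self,openib forces the openib BTL, which is deprecated in Open MPI 4.x in favor of UCX, even though this build was configured --with-ucx. A sketch of how one might isolate the transport under srun follows; the MCA parameter names and srun options are standard, but whether they change the outcome on this cluster is an assumption, and the test-suite arguments are simply those used elsewhere in this thread:

```shell
# See which PMI flavors this Slurm build can offer to Open MPI:
srun --mpi=list

# Re-run one failing case with the UCX PML instead of the
# environment-forced openib BTL:
export OMPI_MCA_pml=ucx
unset OMPI_MCA_btl
srun --mpi=pmix ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective

# ...and once more over plain TCP, taking InfiniBand out of the picture:
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=self,tcp
srun --mpi=pmix ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective
```

If the errors disappear under ob1/tcp but persist under ucx (or vice versa), that narrows the fault to one transport stack rather than to the slurm launch itself.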
Re: [OMPI users] mpi-test-suite shows errors on openmpi 4.1.x
Alois,

Thanks for the report.

FWIW, I am not seeing any errors on my Mac with Open MPI from brew (4.1.3)

How many MPI tasks are you running?

Can you please confirm you can evidence the error with

mpirun -np ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective

Also, can you try the same command with

mpirun --mca pml ob1 --mca btl tcp,self ...

Cheers,

Gilles

On Tue, May 3, 2022 at 7:08 PM Alois Schlögl via users <users@lists.open-mpi.org> wrote:
>
> Within our cluster (debian10/slurm16, debian11/slurm20), with
> infiniband, we have several instances of openmpi installed through
> the Lmod module system. When testing the openmpi installations with
> the mpi-test-suite 1.1 [1], it shows errors like these
>
> ...
> Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
> (Rank:0) tst_test_array[46]:Allreduce Sum
> (Rank:0) tst_test_array[47]:Alltoall
> Number of failed tests: 130
> Summary of failed tests:
> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
> (4), type MPI_TYPE_MIX (27) number of values:1000
> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
> (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
> ...
>
> when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of
> errors may vary, but the first errors are always about
> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
>
> When testing on openmpi/3.1.3, the tests run successfully, and there
> are no failed tests.
>
> Typically, the openmpi/4.1.x installation is configured with
> ./configure --prefix=${PREFIX} \
>   --with-ucx=$UCX_HOME \
>   --enable-orterun-prefix-by-default \
>   --enable-mpi-cxx \
>   --with-hwloc \
>   --with-pmi \
>   --with-pmix \
>   --with-cuda=$CUDA_HOME \
>   --with-slurm
>
> but I have also tried different compilation options, including w/ and
> w/o --enable-mpi1-compatibility, w/ and w/o ucx, and using hwloc from
> the OS or compiled from source. But I could not identify any pattern.
>
> Therefore, I'd like to ask you what the issue might be. Specifically,
> I would like to know:
>
> - Am I right in assuming that the mpi-test-suite [1] is suitable for
>   testing openmpi?
> - What are possible causes for these types of errors?
> - What would you recommend for debugging these issues?
>
> Kind regards,
> Alois
>
> [1] https://github.com/open-mpi/mpi-test-suite/t
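Spelled out, the two runs Gilles suggests would look like the following sketch. The task count of 20 is taken from elsewhere in this thread (Gilles's original left the -np value blank) and is an assumption; the test-suite arguments are his verbatim:

```shell
# First with the default transport selection:
mpirun -np 20 ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective

# Then forcing the ob1 PML over the TCP and self BTLs, which bypasses
# UCX/InfiniBand; if the errors vanish here, the fast-path transport
# is the prime suspect:
mpirun --mca pml ob1 --mca btl tcp,self -np 20 \
    ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective
```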
[OMPI users] mpi-test-suite shows errors on openmpi 4.1.x
Within our cluster (debian10/slurm16, debian11/slurm20), with infiniband, we have several instances of openmpi installed through the Lmod module system. When testing the openmpi installations with the mpi-test-suite 1.1 [1], it shows errors like these

...
Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
...

when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of errors may vary, but the first errors are always about

ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD

When testing on openmpi/3.1.3, the tests run successfully, and there are no failed tests.

Typically, the openmpi/4.1.x installation is configured with

./configure --prefix=${PREFIX} \
  --with-ucx=$UCX_HOME \
  --enable-orterun-prefix-by-default \
  --enable-mpi-cxx \
  --with-hwloc \
  --with-pmi \
  --with-pmix \
  --with-cuda=$CUDA_HOME \
  --with-slurm

but I have also tried different compilation options, including w/ and w/o --enable-mpi1-compatibility, w/ and w/o ucx, and using hwloc from the OS or compiled from source. But I could not identify any pattern.

Therefore, I'd like to ask you what the issue might be. Specifically, I would like to know:

- Am I right in assuming that the mpi-test-suite [1] is suitable for testing openmpi?
- What are possible causes for these types of errors?
- What would you recommend for debugging these issues?

Kind regards,
Alois

[1] https://github.com/open-mpi/mpi-test-suite/t
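When comparing several module-provided builds like this, it can help to first confirm what each installed build actually contains. A sketch using ompi_info (shipped with every Open MPI installation); the grep patterns are illustrative, not exhaustive:

```shell
# Confirm the version and configure options of the module actually loaded:
ompi_info | head -n 30

# Was UCX compiled in, and which PML/BTL components are present?
ompi_info | grep -i -e ucx -e 'MCA pml' -e 'MCA btl'

# Inspect transport-selection parameters and their current defaults,
# including any values forced via OMPI_MCA_* environment variables:
ompi_info --param pml all --level 9
ompi_info --param btl all --level 9
```

Comparing this output between the 3.1.3 and 4.1.x modules would show whether the passing and failing installations are actually selecting the same transports.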