You didn’t show your srun direct launch cmd line or what version of Slurm is being used (and how it was configured), so I can only provide some advice. If you want to use PMIx, then you have to do two things:
1. Slurm must be configured to use PMIx - depending on the version, that might be there by default in the rpm.

2. You have to tell srun to use the pmix plugin (IIRC you add --mpi=pmix to the cmd line - you should check that).

If your intent was to use Slurm’s PMI-1 or PMI-2, then you need to configure OMPI --with-pmi=<path-to-those-libraries>; a configure sketch for that route follows the quoted message below.
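For the PMIx route (items 1 and 2 above), something along these lines should work - the exact plugin name depends on how Slurm was built against PMIx (it may show up as pmix, pmix_v2, etc.), so check the list first:

  # ask this Slurm installation which PMI flavors srun can provide
  srun --mpi=list

  # if a pmix entry is listed, direct-launch the test with it, e.g.
  # for the two-node, two-task case from your mpirun run
  srun --nodes=2 --ntasks=2 --mpi=pmix ./test_mpi

If no pmix entry shows up in that list, the Slurm rpm was not built with PMIx support, and that has to be fixed on the Slurm side before anything on the OMPI side will help.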
Ralph

> On Jun 7, 2018, at 5:21 AM, Bennet Fauber <ben...@umich.edu> wrote:
>
> We are trying out MPI on an aarch64 cluster.
>
> Our system administrators installed SLURM and PMIx 2.0.2 from .rpm.
>
> I compiled OpenMPI using the ARM distributed gcc/7.1.0 using the
> configure flags shown in this snippet from the top of config.log
>
> It was created by Open MPI configure 3.1.0, which was
> generated by GNU Autoconf 2.69.  Invocation command line was
>
>   $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0
>     --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man
>     --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
>     --with-hwloc=external --with-slurm CC=gcc CXX=g++ FC=gfortran
>
> ## --------- ##
> ## Platform. ##
> ## --------- ##
>
> hostname = cavium-hpc.arc-ts.umich.edu
> uname -m = aarch64
> uname -r = 4.11.0-45.4.1.el7a.aarch64
> uname -s = Linux
> uname -v = #1 SMP Fri Feb 2 17:11:57 UTC 2018
>
> /usr/bin/uname -p = aarch64
>
> It checks for pmi and reports it found,
>
> configure:12680: checking if user requested external PMIx support(/opt/pmix/2.0.2)
> configure:12690: result: yes
> configure:12701: checking --with-external-pmix value
> configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
> configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
> configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
> configure:12794: checking PMIx version
> configure:12804: result: version file found
>
> It fails on the test for PMIx 3, which is expected, but then reports
>
> configure:12843: checking version 2x
> configure:12861: gcc -E -I/opt/pmix/2.0.2/include conftest.c
> configure:12861: $? = 0
> configure:12862: result: found
>
> I have a small test MPI program, and it runs when launched with
> mpirun.  The processes running on the first node of a two-node job are
>
> [bennet@cav02 ~]$ ps -ef | grep bennet | egrep 'test_mpi|srun'
> bennet 20340 20282  0 08:04 ?  00:00:00 mpirun ./test_mpi
> bennet 20346 20340  0 08:04 ?  00:00:00 srun --ntasks-per-node=1
>   --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=cav03
>   --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "3609657344"
>   -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
>   orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
>   "3609657344.0;tcp://10.242.15.36:58681"
> bennet 20347 20346  0 08:04 ?  00:00:00 srun --ntasks-per-node=1
>   --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=cav03
>   --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "3609657344"
>   -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
>   orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
>   "3609657344.0;tcp://10.242.15.36:58681"
> bennet 20352 20340 98 08:04 ?  00:01:50 ./test_mpi
> bennet 20353 20340 98 08:04 ?  00:01:50 ./test_mpi
>
> However, when I run it using srun directly, I get the following output:
>
> srun: Step created for job 87
> [cav02.arc-ts.umich.edu:19828] OPAL ERROR: Not initialized in file
> pmix2x_client.c at line 109
> --------------------------------------------------------------------------
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute.  There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
>
>   version 16.05 or later: you can use SLURM's PMIx support. This
>   requires that you configure and build SLURM --with-pmix.
>
>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>   install PMI-2. You must then build Open MPI using --with-pmi pointing
>   to the SLURM PMI library location.
>
> Please configure as appropriate and try again.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [cav02.arc-ts.umich.edu:19828] Local abort before MPI_INIT completed
> completed successfully, but am not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
>
> Using the same scheme to set this up on x86_64 worked, and I am taking
> installation parameters, test files, and job parameters from the
> working x86_64 installation.
>
> Other than the architecture, the main difference between the two
> clusters is that the aarch64 cluster has only ethernet networking,
> whereas there is infiniband on the x86_64 cluster.  I removed the
> --with-verbs from the configure line, though, and I thought that
> would be sufficient.
>
> Anyone have suggestions what might be wrong, how to fix it, or for
> further diagnostics?
>
> Thank you,
> -- bennet
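For the PMI-1/PMI-2 route (the second option in the error text above), the Open MPI build would point at Slurm's PMI libraries instead of the external PMIx tree. A rough sketch, reusing the flags from the quoted config.log - the /usr prefix is only an assumption and should be whatever directory actually holds Slurm's pmi.h/pmi2.h and libpmi/libpmi2 on that cluster:

  # rebuild Open MPI against Slurm's PMI-1/PMI-2 rather than external PMIx
  # (--with-pmi=/usr is a guess; point it at the Slurm PMI install prefix)
  ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0 \
      --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man \
      --with-libevent=external --with-hwloc=external \
      --with-slurm --with-pmi=/usr \
      CC=gcc CXX=g++ FC=gfortran
  make && make install

With that build, the direct launch would typically use srun --mpi=pmi2 ./test_mpi rather than --mpi=pmix.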