Bug#954272: slurmd: SLURM not working with OpenMPI
On 20/07/2020 14:52, Lars Veldscholte wrote:
> srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
> srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
> srun: error: cannot create mpi context for mpi/pmix
> srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
>
> Running `strace srun --mpi=pmix ./a.out` revealed that SLURM is looking for
> the pmix library at `/usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so`, which
> does not exist, only `libpmix.so.2` exists.
>
> Installing the package `libpmix-dev` installs this library (it symlinks it
> to the same file `libpmix.so.2` is symlinked to). Now, `srun --mpi=pmix
> ./a.out` is working!
>
> I'm not 100% sure, but I think that the package `libpmix2` should also
> install the file `libpmix.so`. The dev package shouldn't be required for
> that, right?
>
> Lars

pmix is transitioning from pmix2 -> pmix3 (at least in the bullseye
timeframe), so it was important that the modules in $libdir/pmix/lib/pmix be
versioned, so I renamed it to $libdir/pmix2/lib/pmix. I had thought that only
libpmix.so.2 accessed these modules, so the path was OK, but looking at
slurm-llnl's debian/rules it's clear slurmd uses
"--with-pmix=/usr/lib/x86_64-linux-gnu/pmix", which needs to be updated.

libX.so files are normally development-only; I'll move libpmix.so into
libpmix2 from libpmix-dev to fix the above error.

Alastair

--
Alastair McKinstry, https://diaspora.sceal.ie/u/amckinstry
Misentropy: doubting that the Universe is becoming more disordered.
Bug#954272: slurmd: SLURM not working with OpenMPI
On 20-07-2020 16:27, Alastair McKinstry wrote:
> On 20/07/2020 14:52, Lars Veldscholte wrote:
>> Hi,
>>
>> I believe I have found a solution. I must confess that I still don't fully
>> understand the difference between the various PMI APIs, and which ones are
>> supported by OpenMPI, but I found that the recommended way is to use PMIx.
>>
>> However, PMIx was not working on my system even though libpmix2 is
>> installed:
>>
>> # srun --mpi pmix ./a.out
>> srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
>> srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
>> srun: error: cannot create mpi context for mpi/pmix
>> srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
>>
>> Running `strace srun --mpi=pmix ./a.out` revealed that SLURM is looking for
>> the pmix library at `/usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so`, which
>> does not exist, only `libpmix.so.2` exists.
>
> Which code is trying to load libpmix.so? The compiled code should be
> loading libpmix.so.2 directly; the libpmix.so should only be needed at
> build-time.

srun is, I suppose. I've attached the entire strace output in case it's of
any help.

> Perhaps the problem is that libpmix-dev is not installed at compile time.
> I can add it as a Dependency of libopenmpi-dev.
>
>> Installing the package `libpmix-dev` installs this library (it symlinks it
>> to the same file `libpmix.so.2` is symlinked to). Now, `srun --mpi=pmix
>> ./a.out` is working!
>>
>> I'm not 100% sure, but I think that the package `libpmix2` should also
>> install the file `libpmix.so`. The dev package shouldn't be required for
>> that, right?
>>
>> Lars
>
> Regards
> Alastair

Regards,
Lars

# strace srun --mpi=pmix_v3 ./a.out
execve("/usr/bin/srun", ["srun", "--mpi=pmix_v3", "./a.out"], 0x7ffd4f380370 /* 17 vars */) = 0
brk(NULL) = 0x563aea4e4000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/tls", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm", {st_mode=S_IFDIR|0755, st_size=20480, ...}) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=34055, ...}) = 0
mmap(NULL, 34055, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f2eddeec000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0203\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=113088, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2eddeea000
mmap(NULL, 115088, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f2eddecd000
mprotect(0x7f2edded, 98304, PROT_NONE) = 0
mmap(0x7f2edded, 69632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f2edded
mmap(0x7f2eddee1000, 24576, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x7f2eddee1000
mmap(0x7f2eddee8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7f2eddee8000
close(3) = 0
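The trace excerpt above stops before any PMIx lookup appears, and a full trace is mostly dynamic-loader noise like the libz.so.1 search. As a rough sketch of how the relevant line could be pulled out, the helper below keeps only failed openat() calls whose path mentions a given name. The function name `failed_opens` is made up for illustration and is not part of strace or SLURM; it only assumes the line layout shown in the trace above.

```shell
#!/bin/sh
# Hypothetical helper (not part of strace or SLURM): read strace output on
# stdin and print the path of every openat() call that failed with ENOENT
# and whose line contains the given substring.
failed_opens() {
    # $1: substring to look for, e.g. "libpmix"
    grep 'openat(' | grep 'ENOENT' | grep -F "$1" |
        sed -n 's/^[^"]*"\([^"]*\)".*$/\1/p'   # keep the first quoted string
}

# Usage (strace writes its trace to stderr, hence 2>&1):
#   strace -f -e trace=openat srun --mpi=pmix ./a.out 2>&1 | failed_opens libpmix
```

If the diagnosis in this thread applies, the output should include the unversioned `libpmix.so` path that srun tries to open.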
Bug#954272: slurmd: SLURM not working with OpenMPI
Hi,

I believe I have found a solution. I must confess that I still don't fully
understand the difference between the various PMI APIs, and which ones are
supported by OpenMPI, but I found that the recommended way is to use PMIx.

However, PMIx was not working on my system even though libpmix2 is installed:

# srun --mpi pmix ./a.out
srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types

Running `strace srun --mpi=pmix ./a.out` revealed that SLURM is looking for
the pmix library at `/usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so`, which
does not exist, only `libpmix.so.2` exists.

Installing the package `libpmix-dev` installs this library (it symlinks it to
the same file `libpmix.so.2` is symlinked to). Now, `srun --mpi=pmix ./a.out`
is working!

I'm not 100% sure, but I think that the package `libpmix2` should also
install the file `libpmix.so`. The dev package shouldn't be required for
that, right?

Lars
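For anyone wanting to confirm the same condition on their own system, here is a minimal sketch that checks whether the unversioned `libpmix.so` name is present in a directory and, if not, lists the versioned files that are. The function name `check_pmix_lib` is hypothetical, and the directory in the usage line is simply the one reported in the message above; it may differ on other architectures or after the pmix2/pmix3 renaming.

```shell
#!/bin/sh
# Hypothetical helper: report whether the unversioned libpmix.so exists in
# the given directory. Returns 0 if present, 1 if missing.
check_pmix_lib() {
    dir=$1
    if [ -e "$dir/libpmix.so" ]; then
        echo "present: $dir/libpmix.so"
    else
        echo "missing: $dir/libpmix.so"
        # List versioned candidates (e.g. libpmix.so.2), if any are installed.
        ls "$dir"/libpmix.so.* 2>/dev/null || true
        return 1
    fi
}

# Usage (path taken from the report above):
#   check_pmix_lib /usr/lib/x86_64-linux-gnu/pmix/lib
#   dpkg -S libpmix.so    # shows which installed packages own matching files
```

A "missing" result together with a listed `libpmix.so.2` would match the situation described here, where only the runtime package is installed.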
Bug#954272: slurmd: SLURM not working with OpenMPI
On 20/07/2020 14:52, Lars Veldscholte wrote:
> Hi,
>
> I believe I have found a solution. I must confess that I still don't fully
> understand the difference between the various PMI APIs, and which ones are
> supported by OpenMPI, but I found that the recommended way is to use PMIx.
>
> However, PMIx was not working on my system even though libpmix2 is
> installed:
>
> # srun --mpi pmix ./a.out
> srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
> srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
> srun: error: cannot create mpi context for mpi/pmix
> srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
>
> Running `strace srun --mpi=pmix ./a.out` revealed that SLURM is looking for
> the pmix library at `/usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so`, which
> does not exist, only `libpmix.so.2` exists.

Which code is trying to load libpmix.so? The compiled code should be loading
libpmix.so.2 directly; the libpmix.so should only be needed at build-time.

Perhaps the problem is that libpmix-dev is not installed at compile time. I
can add it as a Dependency of libopenmpi-dev.

> Installing the package `libpmix-dev` installs this library (it symlinks it
> to the same file `libpmix.so.2` is symlinked to). Now, `srun --mpi=pmix
> ./a.out` is working!
>
> I'm not 100% sure, but I think that the package `libpmix2` should also
> install the file `libpmix.so`. The dev package shouldn't be required for
> that, right?
>
> Lars

Regards
Alastair

--
Alastair McKinstry, https://diaspora.sceal.ie/u/amckinstry
Misentropy: doubting that the Universe is becoming more disordered.
Bug#954272: slurmd: SLURM not working with OpenMPI
Hello Gennaro,

On 2020-03-24 00:54, Gennaro Oliva wrote:
> Hi Lars,
>
> On Thu, Mar 19, 2020 at 03:16:15PM +0100, Lars Veldscholte wrote:
>> A simple test like `srun hostname` works, even on multiple cores. However,
>> when trying to use MPI, it crashes with the following error message:
>>
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>>
>> This happens even in the most simple "Hello World" case, as long as the
>> program is MPI-enabled.
>>
>> I am trying to use OpenMPI (4.0.2) from the Debian repositories. `srun --mpi
>> list` returns:
>>
>> srun: MPI types are...
>> srun: openmpi
>> srun: pmi2
>> srun: none
>>
>> I have tried all options, but the result is the same in all cases.
>>
>> Maybe this is user error, as this is my first time setting up SLURM, but I
>> have not been able to find any possible causes/solutions and I am kind of
>> stuck at this point.
>
> I don't know why srun doesn't execute openmpi directly, and I'll try to
> investigate this issue but as a workaround you can use both sbatch and
> salloc as in [1]:
>
> salloc -n 4 mpirun mympiprogram ...
>
> or
>
> sbatch -n 4 mympiprogram.sh
>
> where mympiprogram.sh is something like:
>
> #!/bin/sh
> mpirun mympiprogram ...
>
> Notice you don't need to specify the number of processes to mpirun, as it
> takes it from SLURM.
>
> [1] https://www.open-mpi.org/faq/?category=slurm
>
> Best regards,

Thanks a lot, this seems to be working! I hadn't realised that you could
simply use mpirun instead of srun inside salloc/sbatch.

Regards,
Lars
Bug#954272: slurmd: SLURM not working with OpenMPI
Hi Lars,

On Thu, Mar 19, 2020 at 03:16:15PM +0100, Lars Veldscholte wrote:
> A simple test like `srun hostname` works, even on multiple cores. However,
> when trying to use MPI, it crashes with the following error message:
>
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
>
> This happens even in the most simple "Hello World" case, as long as the
> program is MPI-enabled.
>
> I am trying to use OpenMPI (4.0.2) from the Debian repositories. `srun --mpi
> list` returns:
>
> srun: MPI types are...
> srun: openmpi
> srun: pmi2
> srun: none
>
> I have tried all options, but the result is the same in all cases.
>
> Maybe this is user error, as this is my first time setting up SLURM, but I
> have not been able to find any possible causes/solutions and I am kind of
> stuck at this point.

I don't know why srun doesn't execute openmpi directly, and I'll try to
investigate this issue but as a workaround you can use both sbatch and salloc
as in [1]:

salloc -n 4 mpirun mympiprogram ...

or

sbatch -n 4 mympiprogram.sh

where mympiprogram.sh is something like:

#!/bin/sh
mpirun mympiprogram ...

Notice you don't need to specify the number of processes to mpirun, as it
takes it from SLURM.

[1] https://www.open-mpi.org/faq/?category=slurm

Best regards,
--
Gennaro Oliva
Bug#954272: slurmd: SLURM not working with OpenMPI
Package: slurmd
Version: 19.05.3.2-2+b1
Severity: important

Dear Maintainer,

I am trying to get SLURM working on a single node. I have installed and
configured slurmd and slurmctld.

A simple test like `srun hostname` works, even on multiple cores. However,
when trying to use MPI, it crashes with the following error message:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

This happens even in the most simple "Hello World" case, as long as the
program is MPI-enabled.

I am trying to use OpenMPI (4.0.2) from the Debian repositories. `srun --mpi
list` returns:

srun: MPI types are...
srun: openmpi
srun: pmi2
srun: none

I have tried all options, but the result is the same in all cases.

Maybe this is user error, as this is my first time setting up SLURM, but I
have not been able to find any possible causes/solutions and I am kind of
stuck at this point.

Regards,
Lars

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.4.0-3-amd64 (SMP w/64 CPU cores)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages slurmd depends on:
ii  libc6                    2.30-2
ii  libhwloc15               2.1.0+dfsg-4
ii  liblz4-1                 1.9.2-2
ii  libnuma1                 2.0.12-1+b1
ii  libpam0g                 1.3.1-5
ii  lsb-base                 11.1.0
ii  munge                    0.5.13-2+b1
ii  openssl                  1.1.1d-2
ii  slurm-wlm-basic-plugins  19.05.3.2-2+b1
ii  ucf                      3.0038+nmu1
ii  zlib1g                   1:1.2.11.dfsg-2

slurmd recommends no packages.

slurmd suggests no packages.

-- no debconf information