Bug#954272: slurmd: SLURM not working with OpenMPI

2020-10-13 Thread Alastair McKinstry



On 20/07/2020 14:52, Lars Veldscholte wrote:
> srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
> srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
> srun: error: cannot create mpi context for mpi/pmix
> srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
>
> Running `strace srun --mpi=pmix ./a.out` revealed that SLURM is
> looking for the pmix library at
> `/usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so`, which does not exist,
> only `libpmix.so.2` exists.
>
> Installing the package `libpmix-dev` installs this library (it
> symlinks it to the same file `libpmix.so.2` is symlinked to).
>
> Now, `srun --mpi=pmix ./a.out` is working!
>
> I'm not 100% sure, but I think that the package `libpmix2` should also
> install the file `libpmix.so`. The dev package shouldn't be required
> for that, right?
>
> Lars

pmix is transitioning from pmix2 -> pmix3 (at least in the bullseye
timeframe), so it was important that the modules in $libdir/pmix/lib/pmix
be versioned, so I renamed it to $libdir/pmix2/lib/pmix. I had thought
that only libpmix.so.2 accessed these modules so the path was ok, but
looking at slurm-llnl debian/rules it's clear slurmd uses
"--with-pmix=/usr/lib/x86_64-linux-gnu/pmix", which needs to be updated.

libX.so files are normally development-only; I'll move libpmix.so from
libpmix-dev into libpmix2 to fix the above error.
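
As a rough illustration of the path question above, the following sketch looks for the unversioned `libpmix.so` under the two prefix names mentioned in this message. The function name `find_pmix_prefix` is made up for this example, and treating `pmix` and `pmix2` as the only candidates is an assumption; it is not taken from the slurm-llnl packaging.

```shell
#!/bin/sh
# find_pmix_prefix LIBDIR
# Print the first candidate prefix under LIBDIR that contains
# lib/libpmix.so, i.e. a value that could be passed to --with-pmix.
# The candidate names "pmix" and "pmix2" are assumptions based on the
# directories named in this thread.
find_pmix_prefix() {
    libdir=$1
    for name in pmix pmix2; do
        if [ -e "$libdir/$name/lib/libpmix.so" ]; then
            echo "$libdir/$name"
            return 0
        fi
    done
    echo "no libpmix.so under $libdir/pmix or $libdir/pmix2" >&2
    return 1
}

# Example (path as quoted in the thread):
#   find_pmix_prefix /usr/lib/x86_64-linux-gnu
```

If neither prefix has the file, the function prints a note on stderr and returns non-zero, which matches the failure srun reported.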



Alastair


--
Alastair McKinstry
https://diaspora.sceal.ie/u/amckinstry
Misentropy: doubting that the Universe is becoming more disordered.



Bug#954272: slurmd: SLURM not working with OpenMPI

2020-07-20 Thread Lars Veldscholte

On 20-07-2020 16:27, Alastair McKinstry wrote:
> On 20/07/2020 14:52, Lars Veldscholte wrote:
>> Hi,
>>
>> I believe I have found a solution.
>>
>> I must confess that I still don't fully understand the difference
>> between the various PMI APIs, and which ones are supported by OpenMPI,
>> but I found that the recommended way is to use PMIx.
>>
>> However, PMIx was not working on my system even though libpmix2 is
>> installed:
>>
>> # srun --mpi pmix ./a.out
>> srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
>> srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
>> srun: error: cannot create mpi context for mpi/pmix
>> srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
>>
>> Running `strace srun --mpi=pmix ./a.out` revealed that SLURM is
>> looking for the pmix library at
>> `/usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so`, which does not exist,
>> only `libpmix.so.2` exists.
>
> Which code is trying to load libpmix.so? The compiled code should be
> loading libpmix.so.2 directly; the libpmix.so should only be needed
> at build-time.

srun is, I suppose. I've attached the entire strace output in case it's
of any help.

> Perhaps the problem is that libpmix-dev is not installed at compile
> time. I can add it as a Dependency of libopenmpi-dev.
>
>> Installing the package `libpmix-dev` installs this library (it
>> symlinks it to the same file `libpmix.so.2` is symlinked to).
>>
>> Now, `srun --mpi=pmix ./a.out` is working!
>>
>> I'm not 100% sure, but I think that the package `libpmix2` should also
>> install the file `libpmix.so`. The dev package shouldn't be required
>> for that, right?
>>
>> Lars
>
> Regards
>
> Alastair

Regards,

Lars

# strace srun --mpi=pmix_v3 ./a.out
execve("/usr/bin/srun", ["srun", "--mpi=pmix_v3", "./a.out"], 0x7ffd4f380370 /* 17 vars */) = 0
brk(NULL) = 0x563aea4e4000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/tls/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/tls", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm/x86_64", 0x7ffe2154f320) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/slurm-wlm/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/slurm-wlm", {st_mode=S_IFDIR|0755, st_size=20480, ...}) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=34055, ...}) = 0
mmap(NULL, 34055, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f2eddeec000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0203\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=113088, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2eddeea000
mmap(NULL, 115088, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f2eddecd000
mprotect(0x7f2edded, 98304, PROT_NONE) = 0
mmap(0x7f2edded, 69632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f2edded
mmap(0x7f2eddee1000, 24576, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x7f2eddee1000
mmap(0x7f2eddee8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7f2eddee8000
close(3) = 0
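
A full trace like this is long, so it can help to keep only the failed library lookups. The sketch below assumes the trace was saved with one syscall per line (for example with strace's `-o` option); the helper name `failed_opens` and the file name `srun.trace` are invented for this example.

```shell
#!/bin/sh
# failed_opens LOG PATTERN
# Print the paths that open()/openat() could not find (ENOENT) and whose
# line contains PATTERN. Handles plain lines as well as lines prefixed
# with a pid, as written by "strace -f".
failed_opens() {
    log=$1
    pat=$2
    grep -E '^(\[pid +[0-9]+\] |[0-9]+ +)?open(at)?\(' "$log" \
        | grep 'ENOENT' \
        | grep -F -- "$pat" \
        | sed -e 's/^[^"]*"\([^"]*\)".*/\1/'
}

# Example:
#   strace -f -e trace=open,openat -o srun.trace srun --mpi=pmix ./a.out
#   failed_opens srun.trace libpmix
```

Restricting strace to `open` and `openat` keeps the log small; the function then prints just the quoted path from each failing call.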

Bug#954272: slurmd: SLURM not working with OpenMPI

2020-07-20 Thread Lars Veldscholte

Hi,

I believe I have found a solution.

I must confess that I still don't fully understand the difference 
between the various PMI APIs, and which ones are supported by OpenMPI, 
but I found that the recommended way is to use PMIx.


However, PMIx was not working on my system even though libpmix2 is 
installed:


# srun --mpi pmix ./a.out

srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types

Running `strace srun --mpi=pmix ./a.out` revealed that SLURM is looking 
for the pmix library at `/usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so`, 
which does not exist, only `libpmix.so.2` exists.


Installing the package `libpmix-dev` installs this library (it symlinks 
it to the same file `libpmix.so.2` is symlinked to).


Now, `srun --mpi=pmix ./a.out` is working!

I'm not 100% sure, but I think that the package `libpmix2` should also 
install the file `libpmix.so`. The dev package shouldn't be required for 
that, right?
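
The observation above, that `libpmix.so` and `libpmix.so.2` end up pointing at the same file once `libpmix-dev` is installed, can be checked with a small script. This is only a sketch: the function name `same_target` is made up here, the default soname version `2` comes from the file name quoted above, and `readlink -f` is assumed to be the GNU coreutils version shipped with Debian.

```shell
#!/bin/sh
# same_target DIR [BASENAME] [SOVERSION]
# Compare the unversioned library name with the versioned one in DIR.
# Returns 0 if both resolve to the same real file, 1 if the unversioned
# name is missing, 2 if the versioned name is missing, 3 if they differ.
same_target() {
    dir=$1
    base=${2:-libpmix.so}
    sover=${3:-2}
    dev_link="$dir/$base"
    run_link="$dir/$base.$sover"
    if [ ! -e "$run_link" ]; then
        echo "missing: $run_link"
        return 2
    fi
    if [ ! -e "$dev_link" ]; then
        echo "missing: $dev_link"
        return 1
    fi
    if [ "$(readlink -f "$dev_link")" = "$(readlink -f "$run_link")" ]; then
        echo "same target: $(readlink -f "$dev_link")"
        return 0
    fi
    echo "different targets"
    return 3
}

# Example (directory as quoted in the thread):
#   same_target /usr/lib/x86_64-linux-gnu/pmix/lib
```

A return value of 1 corresponds to the situation described here before `libpmix-dev` was installed.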


Lars





Bug#954272: slurmd: SLURM not working with OpenMPI

2020-07-20 Thread Alastair McKinstry



On 20/07/2020 14:52, Lars Veldscholte wrote:
> Hi,
>
> I believe I have found a solution.
>
> I must confess that I still don't fully understand the difference
> between the various PMI APIs, and which ones are supported by OpenMPI,
> but I found that the recommended way is to use PMIx.
>
> However, PMIx was not working on my system even though libpmix2 is
> installed:
>
> # srun --mpi pmix ./a.out
> srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
> srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
> srun: error: cannot create mpi context for mpi/pmix
> srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
>
> Running `strace srun --mpi=pmix ./a.out` revealed that SLURM is
> looking for the pmix library at
> `/usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so`, which does not exist,
> only `libpmix.so.2` exists.

Which code is trying to load libpmix.so? The compiled code should be
loading libpmix.so.2 directly; the libpmix.so should only be needed
at build-time.

Perhaps the problem is that libpmix-dev is not installed at compile
time. I can add it as a Dependency of libopenmpi-dev.

> Installing the package `libpmix-dev` installs this library (it
> symlinks it to the same file `libpmix.so.2` is symlinked to).
>
> Now, `srun --mpi=pmix ./a.out` is working!
>
> I'm not 100% sure, but I think that the package `libpmix2` should also
> install the file `libpmix.so`. The dev package shouldn't be required
> for that, right?
>
> Lars

Regards

Alastair

--
Alastair McKinstry
https://diaspora.sceal.ie/u/amckinstry
Misentropy: doubting that the Universe is becoming more disordered.



Bug#954272: slurmd: SLURM not working with OpenMPI

2020-03-24 Thread Lars Veldscholte

Hello Gennaro,

On 2020-03-24 00:54, Gennaro Oliva wrote:
> Hi Lars,
>
> On Thu, Mar 19, 2020 at 03:16:15PM +0100, Lars Veldscholte wrote:
>> A simple test like `srun hostname` works, even on multiple cores. However,
>> when trying to use MPI, it crashes with the following error message:
>>
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>>
>> This happens even in the most simple "Hello World" case, as long as the
>> program is MPI-enabled.
>>
>> I am trying to use OpenMPI (4.0.2) from the Debian repositories.
>> `srun --mpi list` returns:
>>
>> srun: MPI types are...
>> srun: openmpi
>> srun: pmi2
>> srun: none
>>
>> I have tried all options, but the result is the same in all cases.
>>
>> Maybe this is user error, as this is my first time setting up SLURM, but I
>> have not been able to find any possible causes/solutions and I am kind of
>> stuck at this point.
>
> I don't know why srun doesn't execute openmpi directly, and I'll try to
> investigate this issue but as a workaround you can use both sbatch and
> salloc as in [1]:
>
> salloc -n 4 mpirun mympiprogram ...
>
> or
>
> sbatch -n 4 mympiprogram.sh
>
> where mympiprogram.sh is something like:
>
> #!/bin/sh
> mpirun mympiprogram ...
>
> Notice you don't need to specify the number of processes to mpirun, as
> it takes it from SLURM.
>
> [1] https://www.open-mpi.org/faq/?category=slurm
>
> Best regards,

Thanks a lot, this seems to be working!

I hadn't realised that you could simply use mpirun instead of srun 
inside salloc/sbatch.


Regards,

Lars





Bug#954272: slurmd: SLURM not working with OpenMPI

2020-03-23 Thread Gennaro Oliva
Hi Lars,

On Thu, Mar 19, 2020 at 03:16:15PM +0100, Lars Veldscholte wrote:
> A simple test like `srun hostname` works, even on multiple cores. However, 
> when trying to use MPI, it crashes with the following error message:
> 
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> 
> This happens even in the most simple "Hello World" case, as long as the 
> program is MPI-enabled.
> 
> I am trying to use OpenMPI (4.0.2) from the Debian repositories.
> `srun --mpi list` returns:
> 
> srun: MPI types are...
> srun: openmpi
> srun: pmi2
> srun: none
> 
> I have tried all options, but the result is the same in all cases.
> 
> Maybe this is user error, as this is my first time setting up SLURM, but I 
> have not been able to find any possible causes/solutions and I am kind of 
> stuck at this point.

I don't know why srun doesn't execute openmpi directly, and I'll try to
investigate this issue but as a workaround you can use both sbatch and
salloc as in [1]:

salloc -n 4 mpirun mympiprogram ...

or

sbatch -n 4 mympiprogram.sh

where mympiprogram.sh is something like:

#!/bin/sh
mpirun mympiprogram ...

Notice you don't need to specify the number of processes to mpirun, as
it takes it from SLURM.

[1] https://www.open-mpi.org/faq/?category=slurm
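
As an alternative to passing `-n 4` on the sbatch command line, the task count can be written into the batch script as an `#SBATCH --ntasks` directive. The sketch below prints such a script; the helper name `make_batch_script` is made up for this example, and the program name is the same placeholder used above.

```shell
#!/bin/sh
# make_batch_script NTASKS PROGRAM
# Print a minimal SLURM batch script to stdout. The task count goes into
# an #SBATCH directive so it does not have to be repeated on the sbatch
# command line.
make_batch_script() {
    ntasks=$1
    program=$2
    cat <<EOF
#!/bin/sh
#SBATCH --ntasks=$ntasks
# mpirun takes the number of processes from the SLURM allocation,
# so no process count is passed to it here.
mpirun $program
EOF
}

# Example:
#   make_batch_script 4 ./mympiprogram > mympiprogram.sh
#   sbatch mympiprogram.sh
```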

Best regards,
-- 
Gennaro Oliva



Bug#954272: slurmd: SLURM not working with OpenMPI

2020-03-19 Thread Lars Veldscholte
Package: slurmd
Version: 19.05.3.2-2+b1
Severity: important

Dear Maintainer,

I am trying to get SLURM working on a single node. I have installed and 
configured slurmd and slurmctld.

A simple test like `srun hostname` works, even on multiple cores. However, when 
trying to use MPI, it crashes with the following error message:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

This happens even in the most simple "Hello World" case, as long as the program 
is MPI-enabled.

I am trying to use OpenMPI (4.0.2) from the Debian repositories.
`srun --mpi list` returns:

srun: MPI types are...
srun: openmpi
srun: pmi2
srun: none

I have tried all options, but the result is the same in all cases.
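
To check for one particular type in a listing like the one above without reading it by eye, something along these lines could be used. It is a sketch only: the function name `has_mpi_type` is made up, and the parsing assumes the `srun: <type>` line layout shown in this report.

```shell
#!/bin/sh
# has_mpi_type TYPE
# Read "srun --mpi=list" style output on stdin and succeed (exit 0) if
# TYPE appears as one of the listed MPI types.
has_mpi_type() {
    awk -v want="$1" '
        { sub(/^srun: */, "") }          # drop the "srun: " prefix
        /^MPI types are/ { next }        # skip the heading line
        $1 == want { found = 1 }
        END { exit !found }
    '
}

# Example (2>&1 in case srun writes the list to stderr):
#   srun --mpi=list 2>&1 | has_mpi_type pmix && echo "pmix available"
```

With the listing shown above, `pmi2` would be found and `pmix` would not.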

Maybe this is user error, as this is my first time setting up SLURM, but I have 
not been able to find any possible causes/solutions and I am kind of stuck at 
this point.

Regards,

Lars

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.4.0-3-amd64 (SMP w/64 CPU cores)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), 
LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages slurmd depends on:
ii  libc6                    2.30-2
ii  libhwloc15               2.1.0+dfsg-4
ii  liblz4-1                 1.9.2-2
ii  libnuma1                 2.0.12-1+b1
ii  libpam0g                 1.3.1-5
ii  lsb-base                 11.1.0
ii  munge                    0.5.13-2+b1
ii  openssl                  1.1.1d-2
ii  slurm-wlm-basic-plugins  19.05.3.2-2+b1
ii  ucf                      3.0038+nmu1
ii  zlib1g                   1:1.2.11.dfsg-2

slurmd recommends no packages.

slurmd suggests no packages.

-- no debconf information