[OMPI devel] mpirun 4.1.0 segmentation fault

2021-01-31 Thread Andrej Prsa via devel

Hello list,

I just upgraded openmpi from 4.0.3 to 4.1.0 to see if it would solve a 
weird openpmix problem we've been having; I configured it using:


./configure --prefix=/usr/local --with-pmix=internal --with-slurm 
--without-tm --without-moab --without-singularity --without-fca 
--without-hcoll --without-ime --without-lustre --without-psm 
--without-psm2 --without-mxm --with-gnu-ld


(I also have an external pmix version installed and tried using that 
instead of internal, but it doesn't change anything). Here's the output 
of configure:


Open MPI configuration:
---
Version: 4.1.0
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
MPI Build Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)

Miscellaneous
---
CUDA support: no
HWLOC support: external
Libevent support: external
PMIx support: Internal

Transports
---
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
---
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems
---
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no
Lustre: no
PVFS2/OrangeFS: no

Once configured, make and sudo make install worked without a glitch; but 
when I run mpirun, I get this:


andrej@terra:~/system/openmpi-4.1.0$ mpirun --version
mpirun (Open MPI) 4.1.0

Report bugs to http://www.open-mpi.org/community/help/
andrej@terra:~/system/openmpi-4.1.0$ mpirun
malloc(): corrupted top size
Aborted (core dumped)

No matter what I try to run, it always segfaults. Any suggestions on 
what I can try to resolve this?


Oh, I should also mention that I tried to remove the global libevent; 
openmpi configured its internal copy but then failed to build.


Thanks,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-01-31 Thread Andrej Prsa via devel

Hi Ralph,


Just trying to understand - why are you saying this is a pmix problem? 
Obviously, something to do with mpirun is failing, but I don't see any 
indication here that it has to do with pmix.


No -- 4.0.3 had the pmix problem -- whenever I tried to submit jobs 
across multiple nodes using slurm (i.e. -mca plm slurm), I'd get this:


--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--

But the same command would work if I submitted it with rsh (i.e. -mca plm 
rsh). I read online that there were issues with CPU binding, so I thought 
4.1.0 might have resolved them.


So, back to the problem at hand. I reconfigured with --enable-debug and 
this is what I get:


andrej@terra:~/system/openmpi-4.1.0$ mpirun
[terra:4145441] *** Process received signal ***
[terra:4145441] Signal: Segmentation fault (11)
[terra:4145441] Signal code:  (128)
[terra:4145441] Failing at address: (nil)
[terra:4145441] [ 0] 
/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f487ebf4210]
[terra:4145441] [ 1] 
/usr/local/lib/openmpi/mca_pmix_pmix3x.so(opal_pmix_pmix3x_check_evars+0x15c)[0x7f487a340b3c]
[terra:4145441] [ 2] 
/usr/local/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_server_init+0x496)[0x7f487a3422e6]
[terra:4145441] [ 3] 
/usr/local/lib/libopen-rte.so.40(pmix_server_init+0x5da)[0x7f487ef2f5ec]
[terra:4145441] [ 4] 
/usr/local/lib/openmpi/mca_ess_hnp.so(+0x58d5)[0x7f487e90a8d5]
[terra:4145441] [ 5] 
/usr/local/lib/libopen-rte.so.40(orte_init+0x354)[0x7f487efab836]
[terra:4145441] [ 6] 
/usr/local/lib/libopen-rte.so.40(orte_submit_init+0x123b)[0x7f487efad0cd]

[terra:4145441] [ 7] mpirun(+0x16bc)[0x55d26c3bb6bc]
[terra:4145441] [ 8] mpirun(+0x134d)[0x55d26c3bb34d]
[terra:4145441] [ 9] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f487ebd50b3]

[terra:4145441] [10] mpirun(+0x126e)[0x55d26c3bb26e]
[terra:4145441] *** End of error message ***
Segmentation fault (core dumped)

gdb backtrace:

(gdb) r
Starting program: /usr/local/bin/mpirun
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x73302b3c in opal_pmix_pmix3x_check_evars () from 
/usr/local/lib/openmpi/mca_pmix_pmix3x.so

(gdb) bt
#0  0x73302b3c in opal_pmix_pmix3x_check_evars () from 
/usr/local/lib/openmpi/mca_pmix_pmix3x.so
#1  0x733042e6 in pmix3x_server_init () from 
/usr/local/lib/openmpi/mca_pmix_pmix3x.so
#2  0x77ef15ec in pmix_server_init () at 
orted/pmix/pmix_server.c:296

#3  0x778cc8d5 in rte_init () at ess_hnp_module.c:329
#4  0x77f6d836 in orte_init (pargc=0x7fffddbc, 
pargv=0x7fffddb0, flags=4) at runtime/orte_init.c:271
#5  0x77f6f0cd in orte_submit_init (argc=1, argv=0x7fffe478, 
opts=0x0) at orted/orted_submit.c:570
#6  0x56bc in orterun (argc=1, argv=0x7fffe478) at 
orterun.c:136

#7  0x534d in main (argc=1, argv=0x7fffe478) at main.c:13

This build is using the latest openpmix from github master.

Thanks,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Gilles,


I invite you to do some cleanup
sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix
and then
sudo make install
and try again.


Good catch! Alright, I deleted /usr/local/lib/openmpi and 
/usr/local/lib/pmix, then I rebuilt (make clean; make) and installed 
pmix from the latest master (should I use 3.1.6 instead?), and rebuilt 
(make clean; make) and installed the debug-enabled version of openmpi. 
Now I'm getting this:


[terra:199344] [[43961,0],0] ORTE_ERROR_LOG: Not found in file 
ess_hnp_module.c at line 320

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--


if the issue persists, please post the output of the following commands
$ env | grep ^OPAL_
$ env | grep ^PMIX_


I don't have any env variables defined.

Cheers,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Gilles,


what is your mpirun command line?
is mpirun invoked from a batch allocation?


I call mpirun directly; here's a full output:

andrej@terra:~/system/tests/MPI$ mpirun --mca ess_base_verbose 10 --mca 
pmix_base_verbose 10 -np 4 python testmpi.py
[terra:203257] mca: base: components_register: registering framework ess 
components

[terra:203257] mca: base: components_register: found loaded component slurm
[terra:203257] mca: base: components_register: component slurm has no 
register or open function

[terra:203257] mca: base: components_register: found loaded component env
[terra:203257] mca: base: components_register: component env has no 
register or open function

[terra:203257] mca: base: components_register: found loaded component pmi
[terra:203257] mca: base: components_register: component pmi has no 
register or open function

[terra:203257] mca: base: components_register: found loaded component tool
[terra:203257] mca: base: components_register: component tool register 
function successful

[terra:203257] mca: base: components_register: found loaded component hnp
[terra:203257] mca: base: components_register: component hnp has no 
register or open function
[terra:203257] mca: base: components_register: found loaded component 
singleton
[terra:203257] mca: base: components_register: component singleton 
register function successful

[terra:203257] mca: base: components_open: opening ess components
[terra:203257] mca: base: components_open: found loaded component slurm
[terra:203257] mca: base: components_open: component slurm open function 
successful

[terra:203257] mca: base: components_open: found loaded component env
[terra:203257] mca: base: components_open: component env open function 
successful

[terra:203257] mca: base: components_open: found loaded component pmi
[terra:203257] mca: base: components_open: component pmi open function 
successful

[terra:203257] mca: base: components_open: found loaded component tool
[terra:203257] mca: base: components_open: component tool open function 
successful

[terra:203257] mca: base: components_open: found loaded component hnp
[terra:203257] mca: base: components_open: component hnp open function 
successful

[terra:203257] mca: base: components_open: found loaded component singleton
[terra:203257] mca: base: components_open: component singleton open 
function successful

[terra:203257] mca:base:select: Auto-selecting ess components
[terra:203257] mca:base:select:(  ess) Querying component [slurm]
[terra:203257] mca:base:select:(  ess) Querying component [env]
[terra:203257] mca:base:select:(  ess) Querying component [pmi]
[terra:203257] mca:base:select:(  ess) Querying component [tool]
[terra:203257] mca:base:select:(  ess) Querying component [hnp]
[terra:203257] mca:base:select:(  ess) Query of component [hnp] set 
priority to 100

[terra:203257] mca:base:select:(  ess) Querying component [singleton]
[terra:203257] mca:base:select:(  ess) Selected component [hnp]
[terra:203257] mca: base: close: component slurm closed
[terra:203257] mca: base: close: unloading component slurm
[terra:203257] mca: base: close: component env closed
[terra:203257] mca: base: close: unloading component env
[terra:203257] mca: base: close: component pmi closed
[terra:203257] mca: base: close: unloading component pmi
[terra:203257] mca: base: close: component tool closed
[terra:203257] mca: base: close: unloading component tool
[terra:203257] mca: base: close: component singleton closed
[terra:203257] mca: base: close: unloading component singleton
[terra:203257] mca: base: components_register: registering framework 
pmix components

[terra:203257] mca: base: components_register: found loaded component flux
[terra:203257] mca: base: components_register: component flux register 
function successful

[terra:203257] mca: base: components_open: opening pmix components
[terra:203257] mca: base: components_open: found loaded component flux
[terra:203257] mca:base:select: Auto-selecting pmix components
[terra:203257] mca:base:select:( pmix) Querying component [flux]
[terra:203257] mca:base:select:( pmix) No component selected!
[terra:203257] [[47344,0],0] ORTE_ERROR_LOG: Not found in file 
ess_hnp_module.c at line 320

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--

Thanks,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Gilles,


it seems only flux is a PMIx option, which is very suspicious.

can you check other components are available?

ls -l /usr/local/lib/openmpi/mca_pmix_*.so


andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so
-rwxr-xr-x 1 root root 97488 Feb  1 08:20 
/usr/local/lib/openmpi/mca_pmix_flux.so
-rwxr-xr-x 1 root root 92240 Feb  1 08:20 
/usr/local/lib/openmpi/mca_pmix_isolated.so


Thank you for your continued help!

Cheers,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Gilles,


that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
with the internal pmix)


Ah, I didn't -- I linked against the latest git pmix; here's the 
configure line:


./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm 
--without-tm --without-moab --without-singularity --without-fca 
--without-hcoll --without-ime --without-lustre --without-psm 
--without-psm2 --without-mxm --with-gnu-ld --enable-debug


I'll try nuking the install again and configuring it to use internal pmix.

Cheers,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Joseph,

Thanks -- I did that and checked that the configure summary says 
internal for pmix. I also distcleaned the tree just to be sure. It's 
building as we speak.
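
For anyone following along, the full rebuild sequence is roughly this 
(exact paths assumed; ompi_info is just a sanity check of which PMIx the 
build picked up):

$ cd ~/system/openmpi-4.1.0
$ make distclean
$ ./configure --prefix=/usr/local --with-pmix=internal --with-slurm \
      --enable-debug   # plus the other --without-* flags from before
$ make && sudo make install
$ ls /usr/local/lib/openmpi/mca_pmix_*.so   # mca_pmix_pmix3x.so should now be there
$ ompi_info | grep -i pmix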


Cheers,
Andrej


On 2/1/21 9:55 AM, Joseph Schuchart via devel wrote:

Andrej,

If your installation originally picked up a preinstalled PMIx and you 
deleted it, it's better to run OMPI's configure again (make/make 
install might not be sufficient to install the internal PMIx).


Cheers
Joseph

On 2/1/21 3:48 PM, Gilles Gouaillardet via devel wrote:

Andrej,

that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
with the internal pmix)

what was your exact configure command line?

fwiw, in your build tree, there should be a
opal/mca/pmix/pmix3x/.libs/mca_pmix_pmix3x.so
if it's there, try running
sudo make install
once more and see if it helps

Cheers,

Gilles

On Mon, Feb 1, 2021 at 11:05 PM Andrej Prsa via devel wrote:


Hi Gilles,


it seems only flux is a PMIx option, which is very suspicious.

can you check other components are available?

ls -l /usr/local/lib/openmpi/mca_pmix_*.so


andrej@terra:~/system/tests/MPI$ ls -l 
/usr/local/lib/openmpi/mca_pmix_*.so

-rwxr-xr-x 1 root root 97488 Feb  1 08:20
/usr/local/lib/openmpi/mca_pmix_flux.so
-rwxr-xr-x 1 root root 92240 Feb  1 08:20
/usr/local/lib/openmpi/mca_pmix_isolated.so

Thank you for your continued help!

Cheers,
Andrej





Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Alright, I rebuilt mpirun and it's working on the local machine. But now 
I'm back to my original problem: running this works:


mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96 
python testmpi.py


but running this doesn't:

mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 
python testmpi.py
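
For context, any minimal mpi4py script along these lines serves as the 
test payload here; the exact contents of testmpi.py aren't important, 
since the failure happens while launching the daemons, before the Python 
code ever runs:

from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank %d of %d on %s"
      % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))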


Here's the verbose output from the latter command:

andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca 
pmix_base_verbose 10 -mca plm slurm -np 384 -H 
node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:387112] mca: base: components_register: registering framework ess 
components

[terra:387112] mca: base: components_register: found loaded component slurm
[terra:387112] mca: base: components_register: component slurm has no 
register or open function

[terra:387112] mca: base: components_register: found loaded component env
[terra:387112] mca: base: components_register: component env has no 
register or open function

[terra:387112] mca: base: components_register: found loaded component pmi
[terra:387112] mca: base: components_register: component pmi has no 
register or open function

[terra:387112] mca: base: components_register: found loaded component tool
[terra:387112] mca: base: components_register: component tool register 
function successful

[terra:387112] mca: base: components_register: found loaded component hnp
[terra:387112] mca: base: components_register: component hnp has no 
register or open function
[terra:387112] mca: base: components_register: found loaded component 
singleton
[terra:387112] mca: base: components_register: component singleton 
register function successful

[terra:387112] mca: base: components_open: opening ess components
[terra:387112] mca: base: components_open: found loaded component slurm
[terra:387112] mca: base: components_open: component slurm open function 
successful

[terra:387112] mca: base: components_open: found loaded component env
[terra:387112] mca: base: components_open: component env open function 
successful

[terra:387112] mca: base: components_open: found loaded component pmi
[terra:387112] mca: base: components_open: component pmi open function 
successful

[terra:387112] mca: base: components_open: found loaded component tool
[terra:387112] mca: base: components_open: component tool open function 
successful

[terra:387112] mca: base: components_open: found loaded component hnp
[terra:387112] mca: base: components_open: component hnp open function 
successful

[terra:387112] mca: base: components_open: found loaded component singleton
[terra:387112] mca: base: components_open: component singleton open 
function successful

[terra:387112] mca:base:select: Auto-selecting ess components
[terra:387112] mca:base:select:(  ess) Querying component [slurm]
[terra:387112] mca:base:select:(  ess) Querying component [env]
[terra:387112] mca:base:select:(  ess) Querying component [pmi]
[terra:387112] mca:base:select:(  ess) Querying component [tool]
[terra:387112] mca:base:select:(  ess) Querying component [hnp]
[terra:387112] mca:base:select:(  ess) Query of component [hnp] set 
priority to 100

[terra:387112] mca:base:select:(  ess) Querying component [singleton]
[terra:387112] mca:base:select:(  ess) Selected component [hnp]
[terra:387112] mca: base: close: component slurm closed
[terra:387112] mca: base: close: unloading component slurm
[terra:387112] mca: base: close: component env closed
[terra:387112] mca: base: close: unloading component env
[terra:387112] mca: base: close: component pmi closed
[terra:387112] mca: base: close: unloading component pmi
[terra:387112] mca: base: close: component tool closed
[terra:387112] mca: base: close: unloading component tool
[terra:387112] mca: base: close: component singleton closed
[terra:387112] mca: base: close: unloading component singleton
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--

This was the exact problem that prompted me to try and upgrade from 
4.0.3 to 4.1.0. Openmpi 4.1.0 (in debug mode, with internal pmix) is now 
installed on the head node and on all compute nodes.


I'd appreciate any ideas on what to try to overcome this.

Cheers,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Gilles,


srun -N 1 -n 1 orted

that is expected to fail, but it should at least find all its
dependencies and start


This was quite illuminating!

andrej@terra:~/system/tests/MPI$ srun -N 1 -n 1 orted
srun: /usr/local/lib/slurm/switch_generic.so: Incompatible Slurm plugin 
version (20.02.6)
srun: error: Couldn't load specified plugin name for switch/generic: 
Incompatible plugin version
srun: /usr/local/lib/slurm/mpi_pmix.so: Incompatible Slurm plugin 
version (20.02.6)
srun: error: Couldn't load specified plugin name for mpi/pmix: 
Incompatible plugin version

srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types

So it looks like there were conflicting slurm versions running -- 
20.02.6 (slurmdbd) and 20.11.3 (slurmctld/slurmd). I deleted all slurm 
stuff in /usr/local and reconfigured/rebuilt/reinstalled 20.11.3. Now 
I'm getting this:


andrej@terra:~$ srun -N 1 -n 1 orted
srun: error: Couldn't find the specified plugin name for mpi/pmix 
looking at all files

srun: error: cannot find mpi plugin for mpi/pmix
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types

It seems that slurm doesn't see pmix:

andrej@terra:~$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2

I'll try to point slurm to use openmpi's internal pmix and rebuild, but 
posting this now in case I'm going down the rabbit hole and someone has 
a better idea.
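
Concretely, something like this (a sketch; Slurm's configure wants a 
standalone PMIx installation rather than the copy bundled inside Open 
MPI, so I'd point it at the pmix installed under /usr/local earlier; 
paths assumed):

$ cd ~/system/slurm-20.11.3        # source dir assumed
$ ./configure --prefix=/usr/local --with-pmix=/usr/local
$ make && sudo make install
$ srun --mpi=list                  # pmix should now be listed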


Cheers,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

The saga continues.

I managed to build slurm with pmix support by first patching slurm with 
the patch from this bug report and then manually building the plugin:


https://bugs.schedmd.com/show_bug.cgi?id=10683
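
Roughly, that amounted to something like the following (source path, 
patch name and -p level are assumed; the patch comes from the bug report 
above):

$ cd ~/system/slurm-20.11.3                 # path assumed
$ patch -p1 < pmix-v4.patch                 # name/level assumed
$ ./configure --prefix=/usr/local --with-pmix=/usr/local
$ cd src/plugins/mpi/pmix && make && sudo make install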

Now srun shows pmix as an option:

andrej@terra:~/system/tests/MPI$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v4

But when I try to run mpirun with the slurm plm component, it still fails:

andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca 
pmix_base_verbose 10 -mca plm slurm -np 384 -H 
node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:149214] mca: base: components_register: registering framework ess 
components

[terra:149214] mca: base: components_register: found loaded component slurm
[terra:149214] mca: base: components_register: component slurm has no 
register or open function

[terra:149214] mca: base: components_register: found loaded component env
[terra:149214] mca: base: components_register: component env has no 
register or open function

[terra:149214] mca: base: components_register: found loaded component pmi
[terra:149214] mca: base: components_register: component pmi has no 
register or open function

[terra:149214] mca: base: components_register: found loaded component tool
[terra:149214] mca: base: components_register: component tool register 
function successful

[terra:149214] mca: base: components_register: found loaded component hnp
[terra:149214] mca: base: components_register: component hnp has no 
register or open function
[terra:149214] mca: base: components_register: found loaded component 
singleton
[terra:149214] mca: base: components_register: component singleton 
register function successful

[terra:149214] mca: base: components_open: opening ess components
[terra:149214] mca: base: components_open: found loaded component slurm
[terra:149214] mca: base: components_open: component slurm open function 
successful

[terra:149214] mca: base: components_open: found loaded component env
[terra:149214] mca: base: components_open: component env open function 
successful

[terra:149214] mca: base: components_open: found loaded component pmi
[terra:149214] mca: base: components_open: component pmi open function 
successful

[terra:149214] mca: base: components_open: found loaded component tool
[terra:149214] mca: base: components_open: component tool open function 
successful

[terra:149214] mca: base: components_open: found loaded component hnp
[terra:149214] mca: base: components_open: component hnp open function 
successful

[terra:149214] mca: base: components_open: found loaded component singleton
[terra:149214] mca: base: components_open: component singleton open 
function successful

[terra:149214] mca:base:select: Auto-selecting ess components
[terra:149214] mca:base:select:(  ess) Querying component [slurm]
[terra:149214] mca:base:select:(  ess) Querying component [env]
[terra:149214] mca:base:select:(  ess) Querying component [pmi]
[terra:149214] mca:base:select:(  ess) Querying component [tool]
[terra:149214] mca:base:select:(  ess) Querying component [hnp]
[terra:149214] mca:base:select:(  ess) Query of component [hnp] set 
priority to 100

[terra:149214] mca:base:select:(  ess) Querying component [singleton]
[terra:149214] mca:base:select:(  ess) Selected component [hnp]
[terra:149214] mca: base: close: component slurm closed
[terra:149214] mca: base: close: unloading component slurm
[terra:149214] mca: base: close: component env closed
[terra:149214] mca: base: close: unloading component env
[terra:149214] mca: base: close: component pmi closed
[terra:149214] mca: base: close: unloading component pmi
[terra:149214] mca: base: close: component tool closed
[terra:149214] mca: base: close: unloading component tool
[terra:149214] mca: base: close: component singleton closed
[terra:149214] mca: base: close: unloading component singleton
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--

I'm at my wits' end about what to try next, and all ears if anyone has 
any leads or suggestions.


Thanks,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Ralph, Gilles,


I fail to understand why you continue to think that PMI has anything to do with 
this problem. I see no indication of a PMIx-related issue in anything you have 
provided to date.


Oh, I went off the traceback that yelled about pmix, and slurm not being 
able to find it until I patched the latest version; I'm an 
astrophysicist pretending to be a sysadmin for our research cluster, so 
while I can hold my ground with C, Python and technical computing, I'm 
out of my depth when it comes to MPI, PMIx, slurm and all that good 
stuff. So I appreciate your patience. I am trying, though. :)



In the output below, it is clear what the problem is - you locked it to the "slurm" launcher (with 
-mca plm slurm) and the "slurm" launcher was not found. Try adding "--mca plm_base_verbose 
10" to your cmd line and let's see why that launcher wasn't accepted.


andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca 
plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python 
testmpi.py
[terra:168998] mca: base: components_register: registering framework plm 
components

[terra:168998] mca: base: components_register: found loaded component slurm
[terra:168998] mca: base: components_register: component slurm register 
function successful

[terra:168998] mca: base: components_open: opening plm components
[terra:168998] mca: base: components_open: found loaded component slurm
[terra:168998] mca: base: components_open: component slurm open function 
successful

[terra:168998] mca:base:select: Auto-selecting plm components
[terra:168998] mca:base:select:(  plm) Querying component [slurm]
[terra:168998] mca:base:select:(  plm) No component selected!
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--

Gilles, I did try all the suggestions from the previous email but that 
led me to think that slurm is the culprit, and now I'm back to openmpi.


Cheers,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Gilles,


I can reproduce this behavior ... when running outside of a slurm allocation.


I just tried from slurm (sbatch run.sh) and I get the exact same error.
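
For reference, run.sh is just a thin batch wrapper, roughly along these 
lines (exact contents assumed):

#!/bin/bash
#SBATCH -N 4
#SBATCH -n 384
mpirun -mca plm slurm python testmpi.py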



What does
$ env | grep ^SLURM_
reports?


Empty; no environment variables have been defined.

Thanks,
Andrej



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
[terra:177267] mca: base: close: component slurm closed
[terra:177267] mca: base: close: unloading component slurm

Thanks, as always,
Andrej


On 2/1/21 7:50 PM, Gilles Gouaillardet via devel wrote:

Andrej,

you *have* to invoke
mpirun --mca plm slurm ...
from a SLURM allocation, and SLURM_* environment variables should have
been set by SLURM
(otherwise, this is a SLURM error out of the scope of Open MPI).

Here is what you can try (and send the logs if that fails)

$ salloc -N 4 -n 384
and once you get the allocation
$ env | grep ^SLURM_
$ mpirun --mca plm_base_verbose 10 --mca plm slurm true


Cheers,

Gilles

On Tue, Feb 2, 2021 at 9:27 AM Andrej Prsa via devel wrote:

Hi Gilles,


I can reproduce this behavior ... when running outside of a slurm allocation.

I just tried from slurm (sbatch run.sh) and I get the exact same error.


What does
$ env | grep ^SLURM_
reports?

Empty; no environment variables have been defined.

Thanks,
Andrej





Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Gilles,


Here is what you can try

$ salloc -N 4 -n 384
/* and then from the allocation */

$ srun -n 1 orted
/* that should fail, but the error message can be helpful */

$ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true


andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
salloc: Granted job allocation 837
andrej@terra:~/system/tests/MPI$ srun -n 1 orted
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
srun: launch/slurm: launch_p_step_launch: StepId=837.0 aborted before 
step completely launched.

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
andrej@terra:~/system/tests/MPI$ /usr/local/bin/mpirun -mca plm slurm 
-mca plm_base_verbose 10 true
[terra:179991] mca: base: components_register: registering framework plm 
components

[terra:179991] mca: base: components_register: found loaded component slurm
[terra:179991] mca: base: components_register: component slurm register 
function successful

[terra:179991] mca: base: components_open: opening plm components
[terra:179991] mca: base: components_open: found loaded component slurm
[terra:179991] mca: base: components_open: component slurm open function 
successful

[terra:179991] mca:base:select: Auto-selecting plm components
[terra:179991] mca:base:select:(  plm) Querying component [slurm]
[terra:179991] [[INVALID],INVALID] plm:slurm: available for selection
[terra:179991] mca:base:select:(  plm) Query of component [slurm] set 
priority to 75

[terra:179991] mca:base:select:(  plm) Selected component [slurm]
[terra:179991] plm:base:set_hnp_name: initial bias 179991 nodename hash 
2928217987

[terra:179991] plm:base:set_hnp_name: final jobfam 7711
[terra:179991] [[7711,0],0] plm:base:receive start comm
[terra:179991] [[7711,0],0] plm:base:setup_job
[terra:179991] [[7711,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[terra:179991] [[7711,0],0] plm:base:setup_vm
[terra:179991] [[7711,0],0] plm:base:setup_vm creating map
[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],1]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon 
[[7711,0],1] to node node9

[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],2]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon 
[[7711,0],2] to node node10

[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],3]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon 
[[7711,0],3] to node node11

[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],4]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon 
[[7711,0],4] to node node12
[terra:179991] [[7711,0],0] plm:slurm: launching on nodes 
node9,node10,node11,node12

[terra:179991] [[7711,0],0] plm:slurm: Set prefix:/usr/local
[terra:179991] [[7711,0],0] plm:slurm: final top-level argv:
    srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=4 orted -mca 
ess "slurm" -mca ess_base_jobid "505348096" -mca ess_base_vpid "1" -mca 
ess_base_num_procs "5" -mca orte_node_regex 
"terra,node[1:9],node[2:10-12]@0(5)" -mca orte_hnp_uri 
"505348096.0;tcp://10.9.2.10,192.168.1.1:38995" -mca plm_base_verbose "10"
[terra:179991] [[7711,0],0] plm:slurm: reset PATH: 
/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

[terra:179991] [[7711,0],0] plm:slurm: reset LD_LIBRARY_PATH: /usr/local/lib
srun: launch/slurm: launch_p_step_launch: StepId=837.1 aborted before 
step completely launched.

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 3 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error
srun: error: task 2 launch failed: Unspecified error
srun: error: task 0 launch failed: Unspecified error
[terra:179991] [[7711,0],0] plm:slurm: primary daemons complete!
[terra:179991] [[7711,0],0] plm:base:receive stop comm
[terra:179991] mca: base: close: component slurm closed
[terra:179991] mca: base: close: unloading component slurm

This is what I'm seeing in slurmctld.log:

[2021-02-01T20:15:18.358] sched: _slurm_rpc_allocate_resources JobId=837 
NodeList=node[9-12] usec=537
[2021-02-01T20:15:26.815] error: mpi_hook_slurmstepd_prefork failure for 
0x557ce5b92960s on node9
[2021-02-01T20:15:59.621] error: mpi_hook_slurmstepd_prefork failure for 
0x55cc6c89a7e0s on node12
[2021-02-01T20:15:59.621] error: mpi_hook_slurmstepd_prefork failure for 
0x55b7b8b467e0s on node10
[2021-02-01T20:15:59.622] error: mpi_hook_slurmstepd_prefork failure for 
0x55f8cd69a7e0s on node11
[2021-02-01T20:15:59.628] error: mpi_hook_slurmstepd_prefork failure for 
0xb45bc7e0s on node9


And this is in slurmd.node9.log (and similar for the remaining 3 nodes):

[2021-02-01T20:15:59.592] task/affinity: lllp_distribution: JobId=837 
manual binding: none
[2021-02-01T20:15:59.624] [837.1] error: node9 [0] pmixp_client_v2.c:246 
[pmixp_lib_init] m

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel

Hi Ralph,


Andrej - what version of Slurm are you using here?


It's slurm 20.11.3, i.e. the latest release afaik.
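
Given the earlier plugin-version mix-up, it's probably worth 
double-checking on every node that all of these report 20.11.3:

$ srun --version
$ slurmd --version
$ slurmctld -V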

But Gilles is correct; the proposed test failed:

andrej@terra:~/system/tests/MPI$ salloc -N 2 -n 2
salloc: Granted job allocation 838
andrej@terra:~/system/tests/MPI$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=838.0 aborted before 
step completely launched.

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 1 launch failed: Unspecified error
srun: error: task 0 launch failed: Unspecified error

Now I'll dig in and try to figure out why slurm is failing. I'll post 
the update once I've figured it out so that it may help others who find 
themselves in a similar situation. (provided I do figure it out %-))


Guys, my sincere thanks for all your help! I truly appreciate it!!

Cheers,
Andrej