[OMPI devel] mpirun 4.1.0 segmentation fault
Hello list,

I just upgraded openmpi from 4.0.3 to 4.1.0 to see if it would solve a weird openpmix problem we've been having; I configured it using:

./configure --prefix=/usr/local --with-pmix=internal --with-slurm --without-tm --without-moab --without-singularity --without-fca --without-hcoll --without-ime --without-lustre --without-psm --without-psm2 --without-mxm --with-gnu-ld

(I also have an external pmix version installed and tried using that instead of internal, but it doesn't change anything.) Here's the output of configure:

Open MPI configuration:
-----------------------
Version: 4.1.0
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
MPI Build Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
CUDA support: no
HWLOC support: external
Libevent support: external
PMIx support: Internal

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no
Lustre: no
PVFS2/OrangeFS: no

Once configured, make and sudo make install worked without a glitch; but when I run mpirun, I get this:

andrej@terra:~/system/openmpi-4.1.0$ mpirun --version
mpirun (Open MPI) 4.1.0

Report bugs to http://www.open-mpi.org/community/help/

andrej@terra:~/system/openmpi-4.1.0$ mpirun
malloc(): corrupted top size
Aborted (core dumped)

No matter what I try to run, it always segfaults. Any suggestions on what I can try to resolve this?

Oh, I should also mention that I tried to remove the global libevent; openmpi configured its internal copy but then failed to build.

Thanks,
Andrej
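A quick sanity check that the new 4.1.0 install is really the one being picked up (a sketch, assuming the /usr/local prefix from the configure line above):

    # Confirm which mpirun is on PATH and which core libraries it resolves
    which mpirun
    mpirun --version
    ldd "$(which mpirun)" | grep -E 'open-rte|open-pal'

    # ompi_info reports the PMIx support the build was configured with
    ompi_info | grep -i pmix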
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Ralph,

> Just trying to understand - why are you saying this is a pmix problem? Obviously, something to do with mpirun is failing, but I don't see any indication here that it has to do with pmix.

No -- 4.0.3 had the pmix problem -- whenever I tried to submit jobs across multiple nodes using slurm (i.e. -mca plm slurm), I'd get this:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun. This could be caused by a number of factors, including an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------

But the same would work if I submitted it with rsh (i.e. -mca plm rsh). I read online that there were issues with cpu bind so I thought 4.1.0 might have resolved it.

So, back to the problem at hand. I reconfigured with --enable-debug and this is what I get:

andrej@terra:~/system/openmpi-4.1.0$ mpirun
[terra:4145441] *** Process received signal ***
[terra:4145441] Signal: Segmentation fault (11)
[terra:4145441] Signal code: (128)
[terra:4145441] Failing at address: (nil)
[terra:4145441] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f487ebf4210]
[terra:4145441] [ 1] /usr/local/lib/openmpi/mca_pmix_pmix3x.so(opal_pmix_pmix3x_check_evars+0x15c)[0x7f487a340b3c]
[terra:4145441] [ 2] /usr/local/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_server_init+0x496)[0x7f487a3422e6]
[terra:4145441] [ 3] /usr/local/lib/libopen-rte.so.40(pmix_server_init+0x5da)[0x7f487ef2f5ec]
[terra:4145441] [ 4] /usr/local/lib/openmpi/mca_ess_hnp.so(+0x58d5)[0x7f487e90a8d5]
[terra:4145441] [ 5] /usr/local/lib/libopen-rte.so.40(orte_init+0x354)[0x7f487efab836]
[terra:4145441] [ 6] /usr/local/lib/libopen-rte.so.40(orte_submit_init+0x123b)[0x7f487efad0cd]
[terra:4145441] [ 7] mpirun(+0x16bc)[0x55d26c3bb6bc]
[terra:4145441] [ 8] mpirun(+0x134d)[0x55d26c3bb34d]
[terra:4145441] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f487ebd50b3]
[terra:4145441] [10] mpirun(+0x126e)[0x55d26c3bb26e]
[terra:4145441] *** End of error message ***
Segmentation fault (core dumped)

gdb backtrace:

(gdb) r
Starting program: /usr/local/bin/mpirun
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x73302b3c in opal_pmix_pmix3x_check_evars () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
(gdb) bt
#0  0x73302b3c in opal_pmix_pmix3x_check_evars () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
#1  0x733042e6 in pmix3x_server_init () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
#2  0x77ef15ec in pmix_server_init () at orted/pmix/pmix_server.c:296
#3  0x778cc8d5 in rte_init () at ess_hnp_module.c:329
#4  0x77f6d836 in orte_init (pargc=0x7fffddbc, pargv=0x7fffddb0, flags=4) at runtime/orte_init.c:271
#5  0x77f6f0cd in orte_submit_init (argc=1, argv=0x7fffe478, opts=0x0) at orted/orted_submit.c:570
#6  0x56bc in orterun (argc=1, argv=0x7fffe478) at orterun.c:136
#7  0x534d in main (argc=1, argv=0x7fffe478) at main.c:13

This build is using the latest openpmix from github master.

Thanks,
Andrej
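One possibility worth ruling out at this point (an assumption, not a confirmed diagnosis): MCA plugins left over from the old 4.0.3 install being mixed with the new 4.1.0 core libraries. A quick check, using the paths that appear in the backtrace above:

    # Leftover plugins from an earlier install stand out by timestamp
    ls -lt /usr/local/lib/openmpi | head -n 20

    # Check what the crashing component actually links against
    ldd /usr/local/lib/openmpi/mca_pmix_pmix3x.so | grep -Ei 'pmix|open-pal'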
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Gilles,

> I invite you to do some cleanup
>     sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix
> and then
>     sudo make install
> and try again.

Good catch! Alright, I deleted /usr/local/lib/openmpi and /usr/local/lib/pmix, then I rebuilt (make clean; make) and installed pmix from the latest master (should I use 3.1.6 instead?), and rebuilt (make clean; make) and installed the debug-enabled version of openmpi. Now I'm getting this:

[terra:199344] [[43961,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

> if the issue persists, please post the output of the following commands
>     $ env | grep ^OPAL_
>     $ env | grep ^PMIX_

I don't have any env variables defined.

Cheers,
Andrej
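A way to see which PMIx components the installed mpirun can actually find, before reaching for verbose MCA flags (a sketch; the component names present depend on how Open MPI was configured):

    # Components known to the install
    ompi_info | grep "MCA pmix"

    # The corresponding plugin files under the install prefix
    ls -l /usr/local/lib/openmpi/mca_pmix_*.so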
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Gilles,

> what is your mpirun command line? is mpirun invoked from a batch allocation?

I call mpirun directly; here's a full output:

andrej@terra:~/system/tests/MPI$ mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 -np 4 python testmpi.py
[terra:203257] mca: base: components_register: registering framework ess components
[terra:203257] mca: base: components_register: found loaded component slurm
[terra:203257] mca: base: components_register: component slurm has no register or open function
[terra:203257] mca: base: components_register: found loaded component env
[terra:203257] mca: base: components_register: component env has no register or open function
[terra:203257] mca: base: components_register: found loaded component pmi
[terra:203257] mca: base: components_register: component pmi has no register or open function
[terra:203257] mca: base: components_register: found loaded component tool
[terra:203257] mca: base: components_register: component tool register function successful
[terra:203257] mca: base: components_register: found loaded component hnp
[terra:203257] mca: base: components_register: component hnp has no register or open function
[terra:203257] mca: base: components_register: found loaded component singleton
[terra:203257] mca: base: components_register: component singleton register function successful
[terra:203257] mca: base: components_open: opening ess components
[terra:203257] mca: base: components_open: found loaded component slurm
[terra:203257] mca: base: components_open: component slurm open function successful
[terra:203257] mca: base: components_open: found loaded component env
[terra:203257] mca: base: components_open: component env open function successful
[terra:203257] mca: base: components_open: found loaded component pmi
[terra:203257] mca: base: components_open: component pmi open function successful
[terra:203257] mca: base: components_open: found loaded component tool
[terra:203257] mca: base: components_open: component tool open function successful
[terra:203257] mca: base: components_open: found loaded component hnp
[terra:203257] mca: base: components_open: component hnp open function successful
[terra:203257] mca: base: components_open: found loaded component singleton
[terra:203257] mca: base: components_open: component singleton open function successful
[terra:203257] mca:base:select: Auto-selecting ess components
[terra:203257] mca:base:select:( ess) Querying component [slurm]
[terra:203257] mca:base:select:( ess) Querying component [env]
[terra:203257] mca:base:select:( ess) Querying component [pmi]
[terra:203257] mca:base:select:( ess) Querying component [tool]
[terra:203257] mca:base:select:( ess) Querying component [hnp]
[terra:203257] mca:base:select:( ess) Query of component [hnp] set priority to 100
[terra:203257] mca:base:select:( ess) Querying component [singleton]
[terra:203257] mca:base:select:( ess) Selected component [hnp]
[terra:203257] mca: base: close: component slurm closed
[terra:203257] mca: base: close: unloading component slurm
[terra:203257] mca: base: close: component env closed
[terra:203257] mca: base: close: unloading component env
[terra:203257] mca: base: close: component pmi closed
[terra:203257] mca: base: close: unloading component pmi
[terra:203257] mca: base: close: component tool closed
[terra:203257] mca: base: close: unloading component tool
[terra:203257] mca: base: close: component singleton closed
[terra:203257] mca: base: close: unloading component singleton
[terra:203257] mca: base: components_register: registering framework pmix components
[terra:203257] mca: base: components_register: found loaded component flux
[terra:203257] mca: base: components_register: component flux register function successful
[terra:203257] mca: base: components_open: opening pmix components
[terra:203257] mca: base: components_open: found loaded component flux
[terra:203257] mca:base:select: Auto-selecting pmix components
[terra:203257] mca:base:select:( pmix) Querying component [flux]
[terra:203257] mca:base:select:( pmix) No component selected!
[terra:203257] [[47344,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Thanks,
Andrej
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Gilles,

> it seems only flux is a PMIx option, which is very suspicious.
> can you check other components are available?
>     ls -l /usr/local/lib/openmpi/mca_pmix_*.so

andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so
-rwxr-xr-x 1 root root 97488 Feb 1 08:20 /usr/local/lib/openmpi/mca_pmix_flux.so
-rwxr-xr-x 1 root root 92240 Feb 1 08:20 /usr/local/lib/openmpi/mca_pmix_isolated.so

Thank you for your continued help!

Cheers,
Andrej
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Gilles,

> that's odd, there should be a mca_pmix_pmix3x.so (assuming you built with the internal pmix)

Ah, I didn't -- I linked against the latest git pmix; here's the configure line:

./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm --without-tm --without-moab --without-singularity --without-fca --without-hcoll --without-ime --without-lustre --without-psm --without-psm2 --without-mxm --with-gnu-ld --enable-debug

I'll try nuking the install again and configuring it to use internal pmix.

Cheers,
Andrej
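For reference, a minimal sketch of the clean rebuild against the bundled PMIx that is being attempted here (the extra --without flags from the original configure line are omitted for brevity; the build-tree path is the one Gilles mentions in the next message):

    # Start from a clean install prefix and a clean build tree
    sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix
    cd ~/system/openmpi-4.1.0
    make distclean
    ./configure --prefix=/usr/local --with-pmix=internal --with-slurm --enable-debug
    make -j"$(nproc)"

    # The internal-PMIx glue component should now exist in the build tree
    ls opal/mca/pmix/pmix3x/.libs/mca_pmix_pmix3x.so

    sudo make install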
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Joseph,

Thanks -- I did that and checked that the configure summary says internal for pmix. I also distcleaned the tree just to be sure. It's building as we speak.

Cheers,
Andrej

On 2/1/21 9:55 AM, Joseph Schuchart via devel wrote:
> Andrej,
>
> If your installation originally picked up a preinstalled PMIx and you deleted it, it's better to run OMPI's configure again (make/make install might not be sufficient to install the internal PMIx).
>
> Cheers
> Joseph
>
> On 2/1/21 3:48 PM, Gilles Gouaillardet via devel wrote:
>> Andrej,
>>
>> that's odd, there should be a mca_pmix_pmix3x.so (assuming you built with the internal pmix)
>>
>> what was your exact configure command line?
>>
>> fwiw, in your build tree, there should be a opal/mca/pmix/pmix3x/.libs/mca_pmix_pmix3x.so
>> if it's there, try running sudo make install once more and see if it helps
>>
>> Cheers,
>> Gilles
>>
>> On Mon, Feb 1, 2021 at 11:05 PM Andrej Prsa via devel wrote:
>>> Hi Gilles,
>>>
>>>> it seems only flux is a PMIx option, which is very suspicious.
>>>> can you check other components are available?
>>>>     ls -l /usr/local/lib/openmpi/mca_pmix_*.so
>>>
>>> andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so
>>> -rwxr-xr-x 1 root root 97488 Feb 1 08:20 /usr/local/lib/openmpi/mca_pmix_flux.so
>>> -rwxr-xr-x 1 root root 92240 Feb 1 08:20 /usr/local/lib/openmpi/mca_pmix_isolated.so
>>>
>>> Thank you for your continued help!
>>>
>>> Cheers,
>>> Andrej
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Alright, I rebuilt mpirun and it's working on a local machine. But now I'm back to my original problem: running this works:

mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py

but running this doesn't:

mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py

Here's the verbose output from the latter command:

andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca pmix_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:387112] mca: base: components_register: registering framework ess components
[terra:387112] mca: base: components_register: found loaded component slurm
[terra:387112] mca: base: components_register: component slurm has no register or open function
[terra:387112] mca: base: components_register: found loaded component env
[terra:387112] mca: base: components_register: component env has no register or open function
[terra:387112] mca: base: components_register: found loaded component pmi
[terra:387112] mca: base: components_register: component pmi has no register or open function
[terra:387112] mca: base: components_register: found loaded component tool
[terra:387112] mca: base: components_register: component tool register function successful
[terra:387112] mca: base: components_register: found loaded component hnp
[terra:387112] mca: base: components_register: component hnp has no register or open function
[terra:387112] mca: base: components_register: found loaded component singleton
[terra:387112] mca: base: components_register: component singleton register function successful
[terra:387112] mca: base: components_open: opening ess components
[terra:387112] mca: base: components_open: found loaded component slurm
[terra:387112] mca: base: components_open: component slurm open function successful
[terra:387112] mca: base: components_open: found loaded component env
[terra:387112] mca: base: components_open: component env open function successful
[terra:387112] mca: base: components_open: found loaded component pmi
[terra:387112] mca: base: components_open: component pmi open function successful
[terra:387112] mca: base: components_open: found loaded component tool
[terra:387112] mca: base: components_open: component tool open function successful
[terra:387112] mca: base: components_open: found loaded component hnp
[terra:387112] mca: base: components_open: component hnp open function successful
[terra:387112] mca: base: components_open: found loaded component singleton
[terra:387112] mca: base: components_open: component singleton open function successful
[terra:387112] mca:base:select: Auto-selecting ess components
[terra:387112] mca:base:select:( ess) Querying component [slurm]
[terra:387112] mca:base:select:( ess) Querying component [env]
[terra:387112] mca:base:select:( ess) Querying component [pmi]
[terra:387112] mca:base:select:( ess) Querying component [tool]
[terra:387112] mca:base:select:( ess) Querying component [hnp]
[terra:387112] mca:base:select:( ess) Query of component [hnp] set priority to 100
[terra:387112] mca:base:select:( ess) Querying component [singleton]
[terra:387112] mca:base:select:( ess) Selected component [hnp]
[terra:387112] mca: base: close: component slurm closed
[terra:387112] mca: base: close: unloading component slurm
[terra:387112] mca: base: close: component env closed
[terra:387112] mca: base: close: unloading component env
[terra:387112] mca: base: close: component pmi closed
[terra:387112] mca: base: close: unloading component pmi
[terra:387112] mca: base: close: component tool closed
[terra:387112] mca: base: close: unloading component tool
[terra:387112] mca: base: close: component singleton closed
[terra:387112] mca: base: close: unloading component singleton
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

This was the exact problem that prompted me to try and upgrade from 4.0.3 to 4.1.0. Openmpi 4.1.0 (in debug mode, with internal pmix) is now installed on the head and on all compute nodes. I'd appreciate any ideas on what to try to overcome this.

Cheers,
Andrej

On 2/1/21 9:57 AM, Andrej Prsa wrote:
> Hi Gilles,
>
>> that's odd, there should be a mca_pmix_pmix3x.so (assuming you built with the internal pmix)
>
> Ah, I didn't -- I linked against the latest git pmix; here's the configure line:
>
> ./configure --pr
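Two things worth checking when the slurm launcher refuses to select itself (both are assumptions at this point, not a confirmed diagnosis): whether the slurm plm component is installed at all, and whether mpirun can see srun and an active Slurm allocation:

    # Is the slurm launcher component installed?
    ompi_info | grep "MCA plm"
    ls -l /usr/local/lib/openmpi/mca_plm_*.so

    # Can mpirun find srun, and is a Slurm allocation active?
    which srun
    env | grep ^SLURM_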
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Gilles,

>     srun -N 1 -n 1 orted
> that is expected to fail, but it should at least find all its dependencies and start

This was quite illuminating!

andrej@terra:~/system/tests/MPI$ srun -N 1 -n 1 orted
srun: /usr/local/lib/slurm/switch_generic.so: Incompatible Slurm plugin version (20.02.6)
srun: error: Couldn't load specified plugin name for switch/generic: Incompatible plugin version
srun: /usr/local/lib/slurm/mpi_pmix.so: Incompatible Slurm plugin version (20.02.6)
srun: error: Couldn't load specified plugin name for mpi/pmix: Incompatible plugin version
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types

So it looks like there were conflicting slurm versions running -- 20.02.6 (slurmdbd) and 20.11.3 (slurmctld/slurmd). I deleted all slurm stuff in /usr/local and reconfigured/rebuilt/reinstalled 20.11.3. Now I'm getting this:

andrej@terra:~$ srun -N 1 -n 1 orted
srun: error: Couldn't find the specified plugin name for mpi/pmix looking at all files
srun: error: cannot find mpi plugin for mpi/pmix
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types

It seems that slurm doesn't see pmix:

andrej@terra:~$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2

I'll try to point slurm to use openmpi's internal pmix and rebuild, but posting this now in case I'm going down the rabbit hole and someone has a better idea.

Cheers,
Andrej
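For reference, a rough sketch of rebuilding Slurm against a standalone PMIx so that srun --mpi=list advertises pmix. Slurm's pmix plugin builds against a separately installed PMIx (headers and libraries), not the copy embedded inside Open MPI, so this assumes the external PMIx under /usr/local; the source directory name is hypothetical and the exact flags depend on the Slurm version:

    cd ~/src/slurm-20.11.3        # hypothetical source directory
    ./configure --prefix=/usr/local --with-pmix=/usr/local
    make -j"$(nproc)" && sudo make install

    # after restarting slurmctld/slurmd, pmix should show up here
    srun --mpi=list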
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
The saga continues. I managed to build slurm with pmix by patching slurm with the patch from this bug report and manually building the plugin: https://bugs.schedmd.com/show_bug.cgi?id=10683

Now srun shows pmix as an option:

andrej@terra:~/system/tests/MPI$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v4

But when I try to run mpirun with the slurm plugin, it still fails:

andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca pmix_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:149214] mca: base: components_register: registering framework ess components
[terra:149214] mca: base: components_register: found loaded component slurm
[terra:149214] mca: base: components_register: component slurm has no register or open function
[terra:149214] mca: base: components_register: found loaded component env
[terra:149214] mca: base: components_register: component env has no register or open function
[terra:149214] mca: base: components_register: found loaded component pmi
[terra:149214] mca: base: components_register: component pmi has no register or open function
[terra:149214] mca: base: components_register: found loaded component tool
[terra:149214] mca: base: components_register: component tool register function successful
[terra:149214] mca: base: components_register: found loaded component hnp
[terra:149214] mca: base: components_register: component hnp has no register or open function
[terra:149214] mca: base: components_register: found loaded component singleton
[terra:149214] mca: base: components_register: component singleton register function successful
[terra:149214] mca: base: components_open: opening ess components
[terra:149214] mca: base: components_open: found loaded component slurm
[terra:149214] mca: base: components_open: component slurm open function successful
[terra:149214] mca: base: components_open: found loaded component env
[terra:149214] mca: base: components_open: component env open function successful
[terra:149214] mca: base: components_open: found loaded component pmi
[terra:149214] mca: base: components_open: component pmi open function successful
[terra:149214] mca: base: components_open: found loaded component tool
[terra:149214] mca: base: components_open: component tool open function successful
[terra:149214] mca: base: components_open: found loaded component hnp
[terra:149214] mca: base: components_open: component hnp open function successful
[terra:149214] mca: base: components_open: found loaded component singleton
[terra:149214] mca: base: components_open: component singleton open function successful
[terra:149214] mca:base:select: Auto-selecting ess components
[terra:149214] mca:base:select:( ess) Querying component [slurm]
[terra:149214] mca:base:select:( ess) Querying component [env]
[terra:149214] mca:base:select:( ess) Querying component [pmi]
[terra:149214] mca:base:select:( ess) Querying component [tool]
[terra:149214] mca:base:select:( ess) Querying component [hnp]
[terra:149214] mca:base:select:( ess) Query of component [hnp] set priority to 100
[terra:149214] mca:base:select:( ess) Querying component [singleton]
[terra:149214] mca:base:select:( ess) Selected component [hnp]
[terra:149214] mca: base: close: component slurm closed
[terra:149214] mca: base: close: unloading component slurm
[terra:149214] mca: base: close: component env closed
[terra:149214] mca: base: close: unloading component env
[terra:149214] mca: base: close: component pmi closed
[terra:149214] mca: base: close: unloading component pmi
[terra:149214] mca: base: close: component tool closed
[terra:149214] mca: base: close: unloading component tool
[terra:149214] mca: base: close: component singleton closed
[terra:149214] mca: base: close: unloading component singleton
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

I'm at my wits' end what to try, and all ears if anyone has any leads or suggestions.

Thanks,
Andrej
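One way to separate a Slurm/PMIx problem from an mpirun problem is to exercise the new pmix plugin with srun alone, inside an allocation, before involving Open MPI at all (a sketch; the node and task counts are illustrative):

    salloc -N 2 -n 2
    # from inside the allocation:
    srun --mpi=pmix_v4 hostname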
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Ralph, Gilles,

> I fail to understand why you continue to think that PMI has anything to do with this problem. I see no indication of a PMIx-related issue in anything you have provided to date.

Oh, I went off the traceback that yelled about pmix, and slurm not being able to find it until I patched the latest version; I'm an astrophysicist pretending to be a sys admin for our research cluster, so while I can hold my ground with c, python and technical computing, I'm out of my depth when it comes to mpi, pmix, slurm and all that good stuff. So I appreciate your patience. I am trying, though. :)

> In the output below, it is clear what the problem is - you locked it to the "slurm" launcher (with -mca plm slurm) and the "slurm" launcher was not found. Try adding "--mca plm_base_verbose 10" to your cmd line and let's see why that launcher wasn't accepted.

andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:168998] mca: base: components_register: registering framework plm components
[terra:168998] mca: base: components_register: found loaded component slurm
[terra:168998] mca: base: components_register: component slurm register function successful
[terra:168998] mca: base: components_open: opening plm components
[terra:168998] mca: base: components_open: found loaded component slurm
[terra:168998] mca: base: components_open: component slurm open function successful
[terra:168998] mca:base:select: Auto-selecting plm components
[terra:168998] mca:base:select:( plm) Querying component [slurm]
[terra:168998] mca:base:select:( plm) No component selected!
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Gilles, I did try all the suggestions from the previous email, but they led me to think that slurm was the culprit, and now I'm back to openmpi.

Cheers,
Andrej
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Gilles,

> I can reproduce this behavior ... when running outside of a slurm allocation.

I just tried from slurm (sbatch run.sh) and I get the exact same error.

> What does
>     $ env | grep ^SLURM_
> reports?

Empty; no environment variables have been defined.

Thanks,
Andrej
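Inside an sbatch job the SLURM_* variables should be set by Slurm itself, so an empty env | grep ^SLURM_ is surprising on its own. A minimal batch script (a sketch; run.sh is the name mentioned above, and the directives are illustrative) that records the job environment before calling mpirun:

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --ntasks=384

    # capture the environment the job script actually sees
    env | grep ^SLURM_ > slurm_env.txt

    mpirun --mca plm slurm --mca plm_base_verbose 10 python testmpi.py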
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
[terra:177267] mca: base: close: component slurm closed
[terra:177267] mca: base: close: unloading component slurm

Thanks, as always,
Andrej

On 2/1/21 7:50 PM, Gilles Gouaillardet via devel wrote:
> Andrej,
>
> you *have* to invoke mpirun --mca plm slurm ... from a SLURM allocation, and SLURM_* environment variables should have been set by SLURM (otherwise, this is a SLURM error out of the scope of Open MPI).
>
> Here is what you can try (and send the logs if that fails)
>     $ salloc -N 4 -n 384
> and once you get the allocation
>     $ env | grep ^SLURM_
>     $ mpirun --mca plm_base_verbose 10 --mca plm slurm true
>
> Cheers,
> Gilles
>
> On Tue, Feb 2, 2021 at 9:27 AM Andrej Prsa via devel wrote:
>> Hi Gilles,
>>
>>> I can reproduce this behavior ... when running outside of a slurm allocation.
>>
>> I just tried from slurm (sbatch run.sh) and I get the exact same error.
>>
>>> What does
>>>     $ env | grep ^SLURM_
>>> reports?
>>
>> Empty; no environment variables have been defined.
>>
>> Thanks,
>> Andrej
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Gilles,

> Here is what you can try
>     $ salloc -N 4 -n 384
> /* and then from the allocation */
>     $ srun -n 1 orted
> /* that should fail, but the error message can be helpful */
>     $ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true

andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
salloc: Granted job allocation 837
andrej@terra:~/system/tests/MPI$ srun -n 1 orted
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
srun: launch/slurm: launch_p_step_launch: StepId=837.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
andrej@terra:~/system/tests/MPI$ /usr/local/bin/mpirun -mca plm slurm -mca plm_base_verbose 10 true
[terra:179991] mca: base: components_register: registering framework plm components
[terra:179991] mca: base: components_register: found loaded component slurm
[terra:179991] mca: base: components_register: component slurm register function successful
[terra:179991] mca: base: components_open: opening plm components
[terra:179991] mca: base: components_open: found loaded component slurm
[terra:179991] mca: base: components_open: component slurm open function successful
[terra:179991] mca:base:select: Auto-selecting plm components
[terra:179991] mca:base:select:( plm) Querying component [slurm]
[terra:179991] [[INVALID],INVALID] plm:slurm: available for selection
[terra:179991] mca:base:select:( plm) Query of component [slurm] set priority to 75
[terra:179991] mca:base:select:( plm) Selected component [slurm]
[terra:179991] plm:base:set_hnp_name: initial bias 179991 nodename hash 2928217987
[terra:179991] plm:base:set_hnp_name: final jobfam 7711
[terra:179991] [[7711,0],0] plm:base:receive start comm
[terra:179991] [[7711,0],0] plm:base:setup_job
[terra:179991] [[7711,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[terra:179991] [[7711,0],0] plm:base:setup_vm
[terra:179991] [[7711,0],0] plm:base:setup_vm creating map
[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],1]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],1] to node node9
[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],2]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],2] to node node10
[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],3]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],3] to node node11
[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],4]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],4] to node node12
[terra:179991] [[7711,0],0] plm:slurm: launching on nodes node9,node10,node11,node12
[terra:179991] [[7711,0],0] plm:slurm: Set prefix:/usr/local
[terra:179991] [[7711,0],0] plm:slurm: final top-level argv: srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=4 orted -mca ess "slurm" -mca ess_base_jobid "505348096" -mca ess_base_vpid "1" -mca ess_base_num_procs "5" -mca orte_node_regex "terra,node[1:9],node[2:10-12]@0(5)" -mca orte_hnp_uri "505348096.0;tcp://10.9.2.10,192.168.1.1:38995" -mca plm_base_verbose "10"
[terra:179991] [[7711,0],0] plm:slurm: reset PATH: /usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
[terra:179991] [[7711,0],0] plm:slurm: reset LD_LIBRARY_PATH: /usr/local/lib
srun: launch/slurm: launch_p_step_launch: StepId=837.1 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 3 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error
srun: error: task 2 launch failed: Unspecified error
srun: error: task 0 launch failed: Unspecified error
[terra:179991] [[7711,0],0] plm:slurm: primary daemons complete!
[terra:179991] [[7711,0],0] plm:base:receive stop comm
[terra:179991] mca: base: close: component slurm closed
[terra:179991] mca: base: close: unloading component slurm

This is what I'm seeing in slurmctld.log:

[2021-02-01T20:15:18.358] sched: _slurm_rpc_allocate_resources JobId=837 NodeList=node[9-12] usec=537
[2021-02-01T20:15:26.815] error: mpi_hook_slurmstepd_prefork failure for 0x557ce5b92960s on node9
[2021-02-01T20:15:59.621] error: mpi_hook_slurmstepd_prefork failure for 0x55cc6c89a7e0s on node12
[2021-02-01T20:15:59.621] error: mpi_hook_slurmstepd_prefork failure for 0x55b7b8b467e0s on node10
[2021-02-01T20:15:59.622] error: mpi_hook_slurmstepd_prefork failure for 0x55f8cd69a7e0s on node11
[2021-02-01T20:15:59.628] error: mpi_hook_slurmstepd_prefork failure for 0xb45bc7e0s on node9

And this is in slurmd.node9.log (and similar for the remaining 3 nodes):

[2021-02-01T20:15:59.592] task/affinity: lllp_distribution: JobId=837 manual binding: none
[2021-02-01T20:15:59.624] [837.1] error: node9 [0] pmixp_client_v2.c:246 [pmixp_lib_init] m
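The mpi_hook_slurmstepd_prefork / pmixp_lib_init errors come from slurmstepd on the compute nodes, so the next check is whether the patched mpi_pmix plugin and a matching libpmix are present and loadable there (a sketch; the node name and plugin path are taken from the logs above, and the exact ldd output will differ per system):

    # run against a compute node, e.g. node9
    ssh node9 'ls -l /usr/local/lib/slurm/mpi_pmix*.so; ldd /usr/local/lib/slurm/mpi_pmix.so | grep -i pmix'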
Re: [OMPI devel] mpirun 4.1.0 segmentation fault
Hi Ralph,

> Andrej - what version of Slurm are you using here?

It's slurm 20.11.3, i.e. the latest release afaik. But Gilles is correct; the proposed test failed:

andrej@terra:~/system/tests/MPI$ salloc -N 2 -n 2
salloc: Granted job allocation 838
andrej@terra:~/system/tests/MPI$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=838.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 1 launch failed: Unspecified error
srun: error: task 0 launch failed: Unspecified error

Now I'll dig in and try to figure out why slurm is failing. I'll post the update once I've figured it out so that it may help others who find themselves in a similar situation. (provided I do figure it out %-))

Guys, my sincere thanks for all your help! I truly appreciate it!!

Cheers,
Andrej
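Since plain srun hostname fails with no MPI involved at all, a version/consistency sweep across the nodes is a reasonable next step (a sketch; the node names are the ones used earlier in the thread):

    for node in node9 node10 node11 node12; do
        echo "== $node =="
        ssh "$node" 'slurmd -V; ls /usr/local/lib/slurm/mpi_pmix*.so 2>/dev/null'
    done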