Hi Gilles,

Here is what you can try:

$ salloc -N 4 -n 384
/* and then from the allocation */

$ srun -n 1 orted
/* that should fail, but the error message can be helpful */

$ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true

andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
salloc: Granted job allocation 837
andrej@terra:~/system/tests/MPI$ srun -n 1 orted
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
srun: launch/slurm: launch_p_step_launch: StepId=837.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
andrej@terra:~/system/tests/MPI$ /usr/local/bin/mpirun -mca plm slurm -mca plm_base_verbose 10 true
[terra:179991] mca: base: components_register: registering framework plm components
[terra:179991] mca: base: components_register: found loaded component slurm
[terra:179991] mca: base: components_register: component slurm register function successful
[terra:179991] mca: base: components_open: opening plm components
[terra:179991] mca: base: components_open: found loaded component slurm
[terra:179991] mca: base: components_open: component slurm open function successful
[terra:179991] mca:base:select: Auto-selecting plm components
[terra:179991] mca:base:select:(  plm) Querying component [slurm]
[terra:179991] [[INVALID],INVALID] plm:slurm: available for selection
[terra:179991] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[terra:179991] mca:base:select:(  plm) Selected component [slurm]
[terra:179991] plm:base:set_hnp_name: initial bias 179991 nodename hash 2928217987
[terra:179991] plm:base:set_hnp_name: final jobfam 7711
[terra:179991] [[7711,0],0] plm:base:receive start comm
[terra:179991] [[7711,0],0] plm:base:setup_job
[terra:179991] [[7711,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[terra:179991] [[7711,0],0] plm:base:setup_vm
[terra:179991] [[7711,0],0] plm:base:setup_vm creating map
[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],1]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],1] to node node9
[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],2]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],2] to node node10
[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],3]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],3] to node node11
[terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],4]
[terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],4] to node node12
[terra:179991] [[7711,0],0] plm:slurm: launching on nodes node9,node10,node11,node12
[terra:179991] [[7711,0],0] plm:slurm: Set prefix:/usr/local
[terra:179991] [[7711,0],0] plm:slurm: final top-level argv:
    srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=4 orted -mca ess "slurm" -mca ess_base_jobid "505348096" -mca ess_base_vpid "1" -mca ess_base_num_procs "5" -mca orte_node_regex "terra,node[1:9],node[2:10-12]@0(5)" -mca orte_hnp_uri "505348096.0;tcp://10.9.2.10,192.168.1.1:38995" -mca plm_base_verbose "10"
[terra:179991] [[7711,0],0] plm:slurm: reset PATH: /usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
[terra:179991] [[7711,0],0] plm:slurm: reset LD_LIBRARY_PATH: /usr/local/lib
srun: launch/slurm: launch_p_step_launch: StepId=837.1 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 3 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error
srun: error: task 2 launch failed: Unspecified error
srun: error: task 0 launch failed: Unspecified error
[terra:179991] [[7711,0],0] plm:slurm: primary daemons complete!
[terra:179991] [[7711,0],0] plm:base:receive stop comm
[terra:179991] mca: base: close: component slurm closed
[terra:179991] mca: base: close: unloading component slurm
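
Side note: the plm output above looks sane to me (the slurm component gets selected and the four daemons are mapped onto node9-node12), so the failure happens in the srun step launch itself, just like the bare "srun -n 1 orted" above. One thing worth comparing from inside the allocation, just as a suggestion, is a step with Slurm's MPI plugin disabled versus one with the pmix plugin forced:

$ srun --mpi=none -N 4 hostname
/* step launch with the MPI plugin disabled */

$ srun --mpi=pmix -N 4 hostname
/* same step with the pmix plugin forced; expected to hit the prefork error shown in the logs below */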

This is what I'm seeing in slurmctld.log:

[2021-02-01T20:15:18.358] sched: _slurm_rpc_allocate_resources JobId=837 NodeList=node[9-12] usec=537
[2021-02-01T20:15:26.815] error: mpi_hook_slurmstepd_prefork failure for 0x557ce5b92960s on node9
[2021-02-01T20:15:59.621] error: mpi_hook_slurmstepd_prefork failure for 0x55cc6c89a7e0s on node12
[2021-02-01T20:15:59.621] error: mpi_hook_slurmstepd_prefork failure for 0x55b7b8b467e0s on node10
[2021-02-01T20:15:59.622] error: mpi_hook_slurmstepd_prefork failure for 0x55f8cd69a7e0s on node11
[2021-02-01T20:15:59.628] error: mpi_hook_slurmstepd_prefork failure for 0x5555b45bc7e0s on node9

And this is from slurmd.node9.log (the other three nodes show the same):

[2021-02-01T20:15:59.592] task/affinity: lllp_distribution: JobId=837 manual binding: none
[2021-02-01T20:15:59.624] [837.1] error: node9 [0] pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init failed with error -2
: Success (0)
[2021-02-01T20:15:59.624] [837.1] error: node9 [0] pmixp_client.c:518 [pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with error -1
: Success (0)
[2021-02-01T20:15:59.624] [837.1] error: node9 [0] pmixp_server.c:423 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
[2021-02-01T20:15:59.624] [837.1] error: (null) [0] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T20:15:59.627] [837.1] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T20:15:59.650] [837.1] error: job_manager exiting abnormally, rc = -1
[2021-02-01T20:16:02.000] [837.1] done with job
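
The part that stands out is PMIx_server_init failing inside Slurm's mpi/pmix plugin right at step start. From what I could find, this often comes down to the plugin picking up a different libpmix at runtime than the one Slurm was built against, so a few checks along these lines might help (the plugin path and library locations below are only guesses based on the /usr/local prefix):

$ srun --mpi=list
/* which MPI plugin types slurmd was built with */

$ ldd /usr/local/lib/slurm/mpi_pmix*.so | grep -i pmix
/* which libpmix the Slurm plugin resolves at runtime; the plugin path is a guess */

$ ls -l /usr/local/lib/libpmix*
/* PMIx libraries under the prefix that mpirun resets LD_LIBRARY_PATH to */

$ ompi_info | grep -i pmix
/* what PMIx support this Open MPI build reports */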

Cheers,
Andrej
