I'm running Singularity 3.5.3 on a RHEL 7 cluster with OpenMPI 4.0.3 on
the host, and OpenMPI 4.0.3 also lives inside an OpenFOAM version 10
container. I've confirmed the OpenMPI versions are the same. Perhaps this
is a question for Singularity users as well, but how can I troubleshoot
why mpirun just returns "step creation temporarily disabled, retrying
(Requested nodes are busy)"?

Singularity> mpirun -V
mpirun (Open MPI) 4.0.3
Report bugs to http://www.open-mpi.org/community/help/
Singularity> which mpirun
/usr/bin/mpirun
Singularity>

$ mpirun -V
mpirun (Open MPI) 4.0.3

[myuser@node047 motorBike]$ mpirun -n 2 -mca plm_base_verbose 100 --mca
ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose
100  singularity exec openfoam   simpleFoam -fileHandler uncollated
-parallel | tee log.simpleFoam
openfoam10/          openfoam10.sif       openfoamtestfile.sh
 openfoam_v2012.sif
[myuser@node047 motorBike]$ mpirun -n 2 -mca plm_base_verbose 100 --mca
ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose
100  singularity exec openfoam10.sif   simpleFoam  -parallel | tee
log.simpleFoam
[node047:11650] mca: base: components_register: registering framework plm
components
[node047:11650] mca: base: components_register: found loaded component slurm
[node047:11650] mca: base: components_register: component slurm register
function successful
[node047:11650] mca: base: components_register: found loaded component
isolated
[node047:11650] mca: base: components_register: component isolated has no
register or open function
[node047:11650] mca: base: components_register: found loaded component rsh
[node047:11650] mca: base: components_register: component rsh register
function successful
[node047:11650] mca: base: components_open: opening plm components
[node047:11650] mca: base: components_open: found loaded component slurm
[node047:11650] mca: base: components_open: component slurm open function
successful
[node047:11650] mca: base: components_open: found loaded component isolated
[node047:11650] mca: base: components_open: component isolated open
function successful
[node047:11650] mca: base: components_open: found loaded component rsh
[node047:11650] mca: base: components_open: component rsh open function
successful
[node047:11650] mca:base:select: Auto-selecting plm components
[node047:11650] mca:base:select:(  plm) Querying component [slurm]
[node047:11650] mca:base:select:(  plm) Query of component [slurm] set
priority to 75
[node047:11650] mca:base:select:(  plm) Querying component [isolated]
[node047:11650] mca:base:select:(  plm) Query of component [isolated] set
priority to 0
[node047:11650] mca:base:select:(  plm) Querying component [rsh]
[node047:11650] mca:base:select:(  plm) Query of component [rsh] set
priority to 10
[node047:11650] mca:base:select:(  plm) Selected component [slurm]
[node047:11650] mca: base: close: component isolated closed
[node047:11650] mca: base: close: unloading component isolated
[node047:11650] mca: base: close: component rsh closed
[node047:11650] mca: base: close: unloading component rsh
[node047:11650] mca: base: components_register: registering framework ras
components
[node047:11650] mca: base: components_register: found loaded component slurm
[node047:11650] mca: base: components_register: component slurm register
function successful
[node047:11650] mca: base: components_register: found loaded component
simulator
[node047:11650] mca: base: components_register: component simulator
register function successful
[node047:11650] mca: base: components_open: opening ras components
[node047:11650] mca: base: components_open: found loaded component slurm
[node047:11650] mca: base: components_open: component slurm open function
successful
[node047:11650] mca: base: components_open: found loaded component simulator
[node047:11650] mca:base:select: Auto-selecting ras components
[node047:11650] mca:base:select:(  ras) Querying component [slurm]
[node047:11650] mca:base:select:(  ras) Query of component [slurm] set
priority to 50
[node047:11650] mca:base:select:(  ras) Querying component [simulator]
[node047:11650] mca:base:select:(  ras) Selected component [slurm]
[node047:11650] mca: base: close: unloading component simulator
[node047:11650] mca: base: components_register: registering framework rmaps
components
[node047:11650] mca: base: components_register: found loaded component seq
[node047:11650] mca: base: components_register: component seq register
function successful
[node047:11650] mca: base: components_register: found loaded component
rank_file
[node047:11650] mca: base: components_register: component rank_file
register function successful
[node047:11650] mca: base: components_register: found loaded component
resilient
[node047:11650] mca: base: components_register: component resilient
register function successful
[node047:11650] mca: base: components_register: found loaded component
mindist
[node047:11650] mca: base: components_register: component mindist register
function successful
[node047:11650] mca: base: components_register: found loaded component
round_robin
[node047:11650] mca: base: components_register: component round_robin
register function successful
[node047:11650] mca: base: components_register: found loaded component ppr
[node047:11650] mca: base: components_register: component ppr register
function successful
[node047:11650] [[57513,0],0] rmaps:base set policy with NULL device NONNULL
[node047:11650] mca: base: components_open: opening rmaps components
[node047:11650] mca: base: components_open: found loaded component seq
[node047:11650] mca: base: components_open: component seq open function
successful
[node047:11650] mca: base: components_open: found loaded component rank_file
[node047:11650] mca: base: components_open: component rank_file open
function successful
[node047:11650] mca: base: components_open: found loaded component resilient
[node047:11650] mca: base: components_open: component resilient open
function successful
[node047:11650] mca: base: components_open: found loaded component mindist
[node047:11650] mca: base: components_open: component mindist open function
successful
[node047:11650] mca: base: components_open: found loaded component
round_robin
[node047:11650] mca: base: components_open: component round_robin open
function successful
[node047:11650] mca: base: components_open: found loaded component ppr
[node047:11650] mca: base: components_open: component ppr open function
successful
[node047:11650] mca:rmaps:select: checking available component seq
[node047:11650] mca:rmaps:select: Querying component [seq]
[node047:11650] mca:rmaps:select: checking available component rank_file
[node047:11650] mca:rmaps:select: Querying component [rank_file]
[node047:11650] mca:rmaps:select: checking available component resilient
[node047:11650] mca:rmaps:select: Querying component [resilient]
[node047:11650] mca:rmaps:select: checking available component mindist
[node047:11650] mca:rmaps:select: Querying component [mindist]
[node047:11650] mca:rmaps:select: checking available component round_robin
[node047:11650] mca:rmaps:select: Querying component [round_robin]
[node047:11650] mca:rmaps:select: checking available component ppr
[node047:11650] mca:rmaps:select: Querying component [ppr]
[node047:11650] [[57513,0],0]: Final mapper priorities
[node047:11650] Mapper: ppr Priority: 90
[node047:11650] Mapper: seq Priority: 60
[node047:11650] Mapper: resilient Priority: 40
[node047:11650] Mapper: mindist Priority: 20
[node047:11650] Mapper: round_robin Priority: 10
[node047:11650] Mapper: rank_file Priority: 0
[node047:11650] [[57513,0],0] plm:slurm: final top-level argv:
srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node048
--ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "3769171968" -mca
ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex
"t[3:47-48]@0(2)" -mca orte_hnp_uri
"3769171968.0;tcp://10.x.x.47,10.x.x.47:50819" -mca plm_base_verbose "100"
--mca ras_base_verbose "100" --mca rss_base_verbose "100" --mca
rmaps_base_verbose "100"

======================   ALLOCATED NODES   ======================
node047: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
node048: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================

My process:
myuser    11650  10965  0 22:28 pts/0    00:00:00 mpirun -n 2 -mca
plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose 100
--mca rmaps_base_verbose 100 singularity exec openfoam10.sif simpleFoam
-parallel

strace just hangs at:
strace: Process 11650 attached
restart_syscall(<... resuming interrupted poll ...>^Cstrace: Process 11650
detached
 <detached ...>
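
Since attaching to mpirun only shows it waiting in poll, it may be more useful to look at its child processes (the srun that plm:slurm starts) instead. A minimal sketch, assuming pgrep is available on the node; the helper name list_children is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: list the direct children of a given PID.
list_children() {
    parent_pid="$1"
    # -P selects processes whose parent is parent_pid;
    # -l prints the process name next to each PID.
    pgrep -l -P "$parent_pid" 2>/dev/null \
        || echo "no children found for PID $parent_pid"
}

# Usage, with the mpirun PID from the ps output above:
#   list_children 11650
```

Any srun PID it prints can then be given to strace -p in the same way as the mpirun PID.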

With or without the --exclusive option all I get is:

srun: Job 12525169 step creation temporarily disabled, retrying (Requested
nodes are busy)
srun: Job 12525169 step creation temporarily disabled, retrying (Requested
nodes are busy)
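
As far as I understand, srun prints this message when it cannot create a new job step inside the existing allocation, so one check that might narrow things down is to list the steps that already exist for the job. A minimal sketch, assuming the Slurm client tools are on the PATH; the helper name show_steps is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: show the steps and the allocation of one Slurm job.
show_steps() {
    job_id="$1"
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found; run this where the Slurm client tools are installed"
        return 1
    fi
    # -s lists job steps rather than jobs; -j restricts the listing to this job.
    squeue -s -j "$job_id"
    # Prints the job's node list, task counts and other allocation details.
    scontrol show job "$job_id"
}

# Usage, with the job ID from the srun messages above:
#   show_steps 12525169
```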

Are the options not in the correct order?
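
For comparison, the ordering I understand to be usual when launching through Singularity is the mpirun options first, then singularity exec, the image, and finally the solver with its own arguments, which is what the last command above does. A small dry-run sketch that only prints the command rather than running it; build_cmd is a hypothetical helper, not part of either tool:

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the launch command instead of running it.
build_cmd() {
    nprocs="$1"; image="$2"; shift 2
    # mpirun options come first, then singularity exec, the image, and
    # finally the solver with its own arguments ("$*").
    echo "mpirun -n $nprocs singularity exec $image $*"
}

build_cmd 2 openfoam10.sif simpleFoam -parallel
```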

Thanks,

Rob
