I'm running Singularity 3.5.3 on a RHEL 7 cluster, with OpenMPI 4.0.3 inside an OpenFOAM version 10 container (openfoam10.sif). I've confirmed that the OpenMPI versions on the host and in the container are the same. Perhaps this is a question for Singularity users as well, but how can I troubleshoot why mpirun just returns "step creation temporarily disabled, retrying (Requested nodes are busy)"?

Singularity> mpirun -V
mpirun (Open MPI) 4.0.3
Report bugs to http://www.open-mpi.org/community/help/
Singularity> which mpirun
/usr/bin/mpirun
Singularity>

$ mpirun -V
mpirun (Open MPI) 4.0.3

mpirun -n 2 -mca plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose 100 singularity exec openfoam simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam
openfoam10/ openfoam10.sif openfoamtestfile.sh openfoam_v2012.sif
[myuser@node047 motorBike]$ mpirun -n 2 -mca plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose 100 singularity exec openfoam simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam
openfoam10/ openfoam10.sif openfoamtestfile.sh openfoam_v2012.sif
[myuser@node047 motorBike]$ mpirun -n 2 -mca plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose 100 singularity exec openfoam10.sif simpleFoam -parallel | tee log.simpleFoam
[node047:11650] mca: base: components_register: registering framework plm components
[node047:11650] mca: base: components_register: found loaded component slurm
[node047:11650] mca: base: components_register: component slurm register function successful
[node047:11650] mca: base: components_register: found loaded component isolated
[node047:11650] mca: base: components_register: component isolated has no register or open function
[node047:11650] mca: base: components_register: found loaded component rsh
[node047:11650] mca: base: components_register: component rsh register function successful
[node047:11650] mca: base: components_open: opening plm components
[node047:11650] mca: base: components_open: found loaded component slurm
[node047:11650] mca: base: components_open: component slurm open function successful
[node047:11650] mca: base: components_open: found loaded component isolated
[node047:11650] mca: base: components_open: component isolated open function successful
[node047:11650] mca: base: components_open: found loaded component rsh
[node047:11650] mca: base: components_open: component rsh open function successful
[node047:11650] mca:base:select: Auto-selecting plm components
[node047:11650] mca:base:select:( plm) Querying component [slurm]
[node047:11650] mca:base:select:( plm) Query of component [slurm] set priority to 75
[node047:11650] mca:base:select:( plm) Querying component [isolated]
[node047:11650] mca:base:select:( plm) Query of component [isolated] set priority to 0
[node047:11650] mca:base:select:( plm) Querying component [rsh]
[node047:11650] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node047:11650] mca:base:select:( plm) Selected component [slurm]
[node047:11650] mca: base: close: component isolated closed
[node047:11650] mca: base: close: unloading component isolated
[node047:11650] mca: base: close: component rsh closed
[node047:11650] mca: base: close: unloading component rsh
[node047:11650] mca: base: components_register: registering framework ras components
[node047:11650] mca: base: components_register: found loaded component slurm
[node047:11650] mca: base: components_register: component slurm register function successful
[node047:11650] mca: base: components_register: found loaded component simulator
[node047:11650] mca: base: components_register: component simulator register function successful
[node047:11650] mca: base: components_open: opening ras components
[node047:11650] mca: base: components_open: found loaded component slurm
[node047:11650] mca: base: components_open: component slurm open function successful
[node047:11650] mca: base: components_open: found loaded component simulator
[node047:11650] mca:base:select: Auto-selecting ras components
[node047:11650] mca:base:select:( ras) Querying component [slurm]
[node047:11650] mca:base:select:( ras) Query of component [slurm] set priority to 50
[node047:11650] mca:base:select:( ras) Querying component [simulator]
[node047:11650] mca:base:select:( ras) Selected component [slurm]
[node047:11650] mca: base: close: unloading component simulator
[node047:11650] mca: base: components_register: registering framework rmaps components
[node047:11650] mca: base: components_register: found loaded component seq
[node047:11650] mca: base: components_register: component seq register function successful
[node047:11650] mca: base: components_register: found loaded component rank_file
[node047:11650] mca: base: components_register: component rank_file register function successful
[node047:11650] mca: base: components_register: found loaded component resilient
[node047:11650] mca: base: components_register: component resilient register function successful
[node047:11650] mca: base: components_register: found loaded component mindist
[node047:11650] mca: base: components_register: component mindist register function successful
[node047:11650] mca: base: components_register: found loaded component round_robin
[node047:11650] mca: base: components_register: component round_robin register function successful
[node047:11650] mca: base: components_register: found loaded component ppr
[node047:11650] mca: base: components_register: component ppr register function successful
[node047:11650] [[57513,0],0] rmaps:base set policy with NULL device NONNULL
[node047:11650] mca: base: components_open: opening rmaps components
[node047:11650] mca: base: components_open: found loaded component seq
[node047:11650] mca: base: components_open: component seq open function successful
[node047:11650] mca: base: components_open: found loaded component rank_file
[node047:11650] mca: base: components_open: component rank_file open function successful
[node047:11650] mca: base: components_open: found loaded component resilient
[node047:11650] mca: base: components_open: component resilient open function successful
[node047:11650] mca: base: components_open: found loaded component mindist
[node047:11650] mca: base: components_open: component mindist open function successful
[node047:11650] mca: base: components_open: found loaded component round_robin
[node047:11650] mca: base: components_open: component round_robin open function successful
[node047:11650] mca: base: components_open: found loaded component ppr
[node047:11650] mca: base: components_open: component ppr open function successful
[node047:11650] mca:rmaps:select: checking available component seq
[node047:11650] mca:rmaps:select: Querying component [seq]
[node047:11650] mca:rmaps:select: checking available component rank_file
[node047:11650] mca:rmaps:select: Querying component [rank_file]
[node047:11650] mca:rmaps:select: checking available component resilient
[node047:11650] mca:rmaps:select: Querying component [resilient]
[node047:11650] mca:rmaps:select: checking available component mindist
[node047:11650] mca:rmaps:select: Querying component [mindist]
[node047:11650] mca:rmaps:select: checking available component round_robin
[node047:11650] mca:rmaps:select: Querying component [round_robin]
[node047:11650] mca:rmaps:select: checking available component ppr
[node047:11650] mca:rmaps:select: Querying component [ppr]
[node047:11650] [[57513,0],0]: Final mapper priorities
[node047:11650] Mapper: ppr Priority: 90
[node047:11650] Mapper: seq Priority: 60
[node047:11650] Mapper: resilient Priority: 40
[node047:11650] Mapper: mindist Priority: 20
[node047:11650] Mapper: round_robin Priority: 10
[node047:11650] Mapper: rank_file Priority: 0
[node047:11650] [[57513,0],0] plm:slurm: final top-level argv: srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node048 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "3769171968" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "t[3:47-48]@0(2)" -mca orte_hnp_uri "3769171968.0;tcp://10.x.x.47,10.x.x.47:50819" -mca plm_base_verbose "100" --mca ras_base_verbose "100" --mca rss_base_verbose "100" --mca rmaps_base_verbose "100"
====================== ALLOCATED NODES ======================
node047: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
node048: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================

My process:

myuser 11650 10965 0 22:28 pts/0 00:00:00 mpirun -n 2 -mca plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose 100 singularity exec openfoam10.sif simpleFoam -parallel

strace just hangs at:

strace: Process 11650 attached
restart_syscall(<... resuming interrupted poll ...>^Cstrace: Process 11650 detached
 <detached ...>

With or without the --exclusive option, all I get is:

srun: Job 12525169 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 12525169 step creation temporarily disabled, retrying (Requested nodes are busy)

Are the options not in the correct order?

Thanks,
Rob
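
P.S. Since the verbose output is long, here is a rough sketch for pulling out only the lines that seem to matter: which plm/ras component was selected, the srun command line that mpirun built, and the srun retry messages. It assumes the output was saved to a file first; the function name summarize_launch and the temporary sample file are made up for illustration, and the sample lines are copied from the log in this post. I believe the MCA verbose lines go to stderr, so a plain "| tee" may not capture them unless "2>&1" is added before the pipe, but treat that as an assumption.

```shell
#!/bin/sh
# summarize_launch (hypothetical helper): print only the launcher-related
# lines from a saved mpirun verbose log passed as the first argument.
summarize_launch() {
    grep -E 'Selected component|final top-level argv|step creation temporarily disabled' "$1"
}

# Build a small sample log from lines quoted in this post.
log=$(mktemp)
cat > "$log" <<'EOF'
[node047:11650] mca:base:select:( plm) Querying component [slurm]
[node047:11650] mca:base:select:( plm) Selected component [slurm]
[node047:11650] mca:base:select:( ras) Selected component [slurm]
[node047:11650] mca: base: close: component rsh closed
srun: Job 12525169 step creation temporarily disabled, retrying (Requested nodes are busy)
EOF

# Show only the component selections and the srun retry message.
summarize_launch "$log"
```

On the real output this would be run as summarize_launch followed by whatever file the mpirun output was saved to.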