Another possibility to check: it is entirely possible that Moab is miscommunicating the values to Slurm. You might want to verify that on your end. Meanwhile, I'll install a copy of 2.6.5 on my machines and see if I get similar issues when Slurm does the allocation itself.
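One quick way to check it (just a rough sketch, assuming scontrol is available inside the interactive session and that the MOAB_* variables from your listing are populated) is to compare what Slurm itself derives from the nodelist against what the count envars claim:

$ scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l   # expands the nodelist; should print 3 for your allocation
$ echo "$SLURM_NNODES $SLURM_NTASKS $SLURM_NPROCS"        # these currently all report 1
$ env | grep '^MOAB_' | sort                              # MOAB_NODECOUNT / MOAB_PROCCOUNT show what Moab passed along

If the MOAB_* counts look correct while the SLURM_* counts are stuck at 1, that would point at the Moab-to-Slurm handoff rather than at Slurm itself.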
On Feb 12, 2014, at 7:47 AM, Ralph Castain <[email protected]> wrote:

> On Feb 12, 2014, at 7:32 AM, Adrian Reber <[email protected]> wrote:
>
>> $ msub -I -l nodes=3:ppn=8
>> salloc: Job is in held state, pending scheduler release
>> salloc: Pending job allocation 131828
>> salloc: job 131828 queued and waiting for resources
>> salloc: job 131828 has been allocated resources
>> salloc: Granted job allocation 131828
>> sh-4.1$ echo $SLURM_TASKS_PER_NODE
>> 1
>> sh-4.1$ rpm -q slurm
>> slurm-2.6.5-1.el6.x86_64
>> sh-4.1$ echo $SLURM_NNODES
>> 1
>> sh-4.1$ echo $SLURM_JOB_NODELIST
>> xxxx[107-108,176]
>> sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE
>> 8(x3)
>> sh-4.1$ echo $SLURM_NODELIST
>> xxxx[107-108,176]
>> sh-4.1$ echo $SLURM_NPROCS
>> 1
>> sh-4.1$ echo $SLURM_NTASKS
>> 1
>> sh-4.1$ echo $SLURM_TASKS_PER_NODE
>> 1
>>
>> The information in *_NODELIST seems to make sense, but all the other
>> variables (PROCS, TASKS, NODES) report '1', which seems wrong.
>
> Indeed - and that's the problem. Slurm 2.6.5 is the most recent release, and
> my guess is that SchedMD once again has changed the @$!#%#@ meaning of their
> envars. Frankly, it is nearly impossible to track all the variants they have
> created over the years.
>
> Please check to see if someone did a little customizing on your end as
> sometimes people do that to Slurm. Could also be they did something in the
> Slurm config file that is causing the changed behavior.
>
> Meantime, I'll try to ponder a potential solution in case this really is the
> "latest" Slurm screwup.
>
>
>> On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
>>> ...and your version of Slurm?
>>>
>>> On Feb 12, 2014, at 7:19 AM, Ralph Castain <[email protected]> wrote:
>>>
>>>> What is your SLURM_TASKS_PER_NODE?
>>>>
>>>> On Feb 12, 2014, at 6:58 AM, Adrian Reber <[email protected]> wrote:
>>>>
>>>>> No, the system has only a few MOAB_* variables and many SLURM_*
>>>>> variables:
>>>>>
>>>>> $BASH              $IFS               $SECONDS                     $SLURM_PTY_PORT
>>>>> $BASHOPTS          $LINENO            $SHELL                       $SLURM_PTY_WIN_COL
>>>>> $BASHPID           $LINES             $SHELLOPTS                   $SLURM_PTY_WIN_ROW
>>>>> $BASH_ALIASES      $MACHTYPE          $SHLVL                       $SLURM_SRUN_COMM_HOST
>>>>> $BASH_ARGC         $MAILCHECK         $SLURMD_NODENAME             $SLURM_SRUN_COMM_PORT
>>>>> $BASH_ARGV         $MOAB_CLASS        $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
>>>>> $BASH_CMDS         $MOAB_GROUP        $SLURM_CONF                  $SLURM_STEP_ID
>>>>> $BASH_COMMAND      $MOAB_JOBID        $SLURM_CPUS_ON_NODE          $SLURM_STEP_LAUNCHER_PORT
>>>>> $BASH_LINENO       $MOAB_NODECOUNT    $SLURM_DISTRIBUTION          $SLURM_STEP_NODELIST
>>>>> $BASH_SOURCE       $MOAB_PARTITION    $SLURM_GTIDS                 $SLURM_STEP_NUM_NODES
>>>>> $BASH_SUBSHELL     $MOAB_PROCCOUNT    $SLURM_JOBID                 $SLURM_STEP_NUM_TASKS
>>>>> $BASH_VERSINFO     $MOAB_SUBMITDIR    $SLURM_JOB_CPUS_PER_NODE     $SLURM_STEP_TASKS_PER_NODE
>>>>> $BASH_VERSION      $MOAB_USER         $SLURM_JOB_ID                $SLURM_SUBMIT_DIR
>>>>> $COLUMNS           $OPTERR            $SLURM_JOB_NODELIST          $SLURM_SUBMIT_HOST
>>>>> $COMP_WORDBREAKS   $OPTIND            $SLURM_JOB_NUM_NODES         $SLURM_TASKS_PER_NODE
>>>>> $DIRSTACK          $OSTYPE            $SLURM_LAUNCH_NODE_IPADDR    $SLURM_TASK_PID
>>>>> $EUID              $PATH              $SLURM_LOCALID               $SLURM_TOPOLOGY_ADDR
>>>>> $GROUPS            $POSIXLY_CORRECT   $SLURM_NNODES                $SLURM_TOPOLOGY_ADDR_PATTERN
>>>>> $HISTCMD           $PPID              $SLURM_NODEID                $SRUN_DEBUG
>>>>> $HISTFILE          $PS1               $SLURM_NODELIST              $TERM
>>>>> $HISTFILESIZE      $PS2               $SLURM_NPROCS                $TMPDIR
>>>>> $HISTSIZE          $PS4               $SLURM_NTASKS                $UID
>>>>> $HOSTNAME          $PWD               $SLURM_PRIO_PROCESS          $_
>>>>> $HOSTTYPE          $RANDOM            $SLURM_PROCID
>>>>>
>>>>>
>>>>> On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
>>>>>> Seems rather odd - since this is managed by Moab, you shouldn't be
>>>>>> seeing SLURM envars at all. What you should see are PBS_* envars,
>>>>>> including a PBS_NODEFILE that actually contains the allocation.
>>>>>>
>>>>>>
>>>>>> On Feb 12, 2014, at 4:42 AM, Adrian Reber <[email protected]> wrote:
>>>>>>
>>>>>>> I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
>>>>>>> with slurm and moab. I requested an interactive session using:
>>>>>>>
>>>>>>> msub -I -l nodes=3:ppn=8
>>>>>>>
>>>>>>> and started a simple test case which fails:
>>>>>>>
>>>>>>> $ mpirun -np 2 ./mpi-test 1
>>>>>>> --------------------------------------------------------------------------
>>>>>>> There are not enough slots available in the system to satisfy the 2 slots
>>>>>>> that were requested by the application:
>>>>>>>   ./mpi-test
>>>>>>>
>>>>>>> Either request fewer slots for your application, or make more slots
>>>>>>> available for use.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> srun: error: xxxx108: task 1: Exited with exit code 1
>>>>>>> srun: Terminating job step 131823.4
>>>>>>> srun: error: xxxx107: task 0: Exited with exit code 1
>>>>>>> srun: Job step aborted
>>>>>>> slurmd[xxxx108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH SIGNAL 9 ***
>>>>>>>
>>>>>>>
>>>>>>> requesting only one core works:
>>>>>>>
>>>>>>> $ mpirun ./mpi-test 1
>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
>>>>>>>
>>>>>>>
>>>>>>> using openmpi-1.6.5 works with multiple cores:
>>>>>>>
>>>>>>> $ mpirun -np 24 ./mpi-test 2
>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 24: 0.000000
>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on xxxx106 out of 24: 12.000000
>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on xxxx108 out of 24: 11.000000
>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on xxxx106 out of 24: 18.000000
>>>>>>>
>>>>>>> $ echo $SLURM_JOB_CPUS_PER_NODE
>>>>>>> 8(x3)
>>>>>>>
>>>>>>> I have never used slurm before, so this could also be a user error on my side.
>>>>>>> But as 1.6.5 works, it seems something has changed, and I wanted to let
>>>>>>> you know in case it was not intentional.
>>>>>>>
>>>>>>> Adrian
>>>>>
>>>>> Adrian
>>
>> Adrian
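P.S. Until we sort out the envar handling, a possible interim workaround (only a sketch - untested here, and it assumes a stock scontrol plus mpirun's --hostfile option; the "myhosts" filename is just an example) would be to hand mpirun an explicit hostfile built from the allocation instead of relying on the Slurm counts:

$ scontrol show hostnames "$SLURM_JOB_NODELIST" | sed 's/$/ slots=8/' > myhosts   # 8 = the ppn you requested
$ mpirun -np 24 --hostfile myhosts ./mpi-test 2

It may be that mpirun still prefers the allocation it reads from the Slurm envars, in which case this won't help - but if it works, it should behave like your 1.6.5 runs while we track down where the '1' values are coming from.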
