$ msub -I -l nodes=3:ppn=8
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 131828
salloc: job 131828 queued and waiting for resources
salloc: job 131828 has been allocated resources
salloc: Granted job allocation 131828
sh-4.1$ echo $SLURM_TASKS_PER_NODE
1
sh-4.1$ rpm -q slurm
slurm-2.6.5-1.el6.x86_64
sh-4.1$ echo $SLURM_NNODES
1
sh-4.1$ echo $SLURM_JOB_NODELIST
xxxx[107-108,176]
sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE
8(x3)
sh-4.1$ echo $SLURM_NODELIST
xxxx[107-108,176]
sh-4.1$ echo $SLURM_NPROCS
1
sh-4.1$ echo $SLURM_NTASKS
1
sh-4.1$ echo $SLURM_TASKS_PER_NODE
1
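[Editor's note: the compressed SLURM_JOB_CPUS_PER_NODE value above ("8(x3)",
i.e. 8 CPUs on each of 3 nodes) describes 24 slots in total. The small C
sketch below only illustrates how that compressed format expands into a slot
count; it is not Open MPI's actual SLURM allocator code, and all names in it
are made up for the example.]

/* Illustration only: expand a compressed SLURM_JOB_CPUS_PER_NODE value
 * such as "8(x3)" or "8(x2),4" into a total slot count. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int total_slots(const char *spec)
{
    char *copy = strdup(spec);          /* strtok modifies its input */
    int total = 0;

    for (char *tok = strtok(copy, ","); tok != NULL; tok = strtok(NULL, ",")) {
        int cpus = 0, repeat = 1;
        /* "8(x3)" -> cpus=8, repeat=3; a bare "8" -> cpus=8, repeat=1 */
        if (sscanf(tok, "%d(x%d)", &cpus, &repeat) < 1) {
            free(copy);
            return -1;                  /* unparsable token */
        }
        total += cpus * repeat;
    }
    free(copy);
    return total;
}

int main(void)
{
    const char *spec = getenv("SLURM_JOB_CPUS_PER_NODE");
    if (spec == NULL)
        spec = "8(x3)";                 /* value seen in the session above */
    printf("%s -> %d slots\n", spec, total_slots(spec));
    return 0;
}

[Run inside the allocation above it would print "8(x3) -> 24 slots"; the
"not enough slots" error quoted later in this thread would be consistent
with a launcher that instead trusts SLURM_TASKS_PER_NODE, which here is 1.]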
The information in *_NODELIST seems to make sense, but all the other
variables (PROCS, TASKS, NODES) report '1', which seems wrong.

On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
> ...and your version of Slurm?
>
> On Feb 12, 2014, at 7:19 AM, Ralph Castain <[email protected]> wrote:
>
> > What is your SLURM_TASKS_PER_NODE?
> >
> > On Feb 12, 2014, at 6:58 AM, Adrian Reber <[email protected]> wrote:
> >
> >> No, the system has only a few MOAB_* variables and many SLURM_*
> >> variables:
> >>
> >> $BASH  $IFS  $SECONDS  $SLURM_PTY_PORT
> >> $BASHOPTS  $LINENO  $SHELL  $SLURM_PTY_WIN_COL
> >> $BASHPID  $LINES  $SHELLOPTS  $SLURM_PTY_WIN_ROW
> >> $BASH_ALIASES  $MACHTYPE  $SHLVL  $SLURM_SRUN_COMM_HOST
> >> $BASH_ARGC  $MAILCHECK  $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
> >> $BASH_ARGV  $MOAB_CLASS  $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
> >> $BASH_CMDS  $MOAB_GROUP  $SLURM_CONF  $SLURM_STEP_ID
> >> $BASH_COMMAND  $MOAB_JOBID  $SLURM_CPUS_ON_NODE  $SLURM_STEP_LAUNCHER_PORT
> >> $BASH_LINENO  $MOAB_NODECOUNT  $SLURM_DISTRIBUTION  $SLURM_STEP_NODELIST
> >> $BASH_SOURCE  $MOAB_PARTITION  $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
> >> $BASH_SUBSHELL  $MOAB_PROCCOUNT  $SLURM_JOBID  $SLURM_STEP_NUM_TASKS
> >> $BASH_VERSINFO  $MOAB_SUBMITDIR  $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
> >> $BASH_VERSION  $MOAB_USER  $SLURM_JOB_ID  $SLURM_SUBMIT_DIR
> >> $COLUMNS  $OPTERR  $SLURM_JOB_NODELIST  $SLURM_SUBMIT_HOST
> >> $COMP_WORDBREAKS  $OPTIND  $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
> >> $DIRSTACK  $OSTYPE  $SLURM_LAUNCH_NODE_IPADDR  $SLURM_TASK_PID
> >> $EUID  $PATH  $SLURM_LOCALID  $SLURM_TOPOLOGY_ADDR
> >> $GROUPS  $POSIXLY_CORRECT  $SLURM_NNODES  $SLURM_TOPOLOGY_ADDR_PATTERN
> >> $HISTCMD  $PPID  $SLURM_NODEID  $SRUN_DEBUG
> >> $HISTFILE  $PS1  $SLURM_NODELIST  $TERM
> >> $HISTFILESIZE  $PS2  $SLURM_NPROCS  $TMPDIR
> >> $HISTSIZE  $PS4  $SLURM_NTASKS  $UID
> >> $HOSTNAME  $PWD  $SLURM_PRIO_PROCESS  $_
> >> $HOSTTYPE  $RANDOM  $SLURM_PROCID
> >>
> >> On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
> >>> Seems rather odd - since this is managed by Moab, you shouldn't be seeing
> >>> SLURM envars at all. What you should see are PBS_* envars, including a
> >>> PBS_NODEFILE that actually contains the allocation.
> >>>
> >>> On Feb 12, 2014, at 4:42 AM, Adrian Reber <[email protected]> wrote:
> >>>
> >>>> I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
> >>>> with slurm and moab. I requested an interactive session using:
> >>>>
> >>>> msub -I -l nodes=3:ppn=8
> >>>>
> >>>> and started a simple test case which fails:
> >>>>
> >>>> $ mpirun -np 2 ./mpi-test 1
> >>>> --------------------------------------------------------------------------
> >>>> There are not enough slots available in the system to satisfy the 2 slots
> >>>> that were requested by the application:
> >>>>   ./mpi-test
> >>>>
> >>>> Either request fewer slots for your application, or make more slots
> >>>> available for use.
> >>>> --------------------------------------------------------------------------
> >>>> srun: error: xxxx108: task 1: Exited with exit code 1
> >>>> srun: Terminating job step 131823.4
> >>>> srun: error: xxxx107: task 0: Exited with exit code 1
> >>>> srun: Job step aborted
> >>>> slurmd[xxxx108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH SIGNAL 9 ***
> >>>>
> >>>> requesting only one core works:
> >>>>
> >>>> $ mpirun ./mpi-test 1
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
> >>>>
> >>>> using openmpi-1.6.5 works with multiple cores:
> >>>>
> >>>> $ mpirun -np 24 ./mpi-test 2
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 24: 0.000000
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on xxxx106 out of 24: 12.000000
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on xxxx108 out of 24: 11.000000
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on xxxx106 out of 24: 18.000000
> >>>>
> >>>> $ echo $SLURM_JOB_CPUS_PER_NODE
> >>>> 8(x3)
> >>>>
> >>>> I have never used slurm before, so this could also be a user error on my
> >>>> side. But as 1.6.5 works it seems something has changed, and I wanted to
> >>>> let you know in case it was not intentional.
> >>>>
> >>>> Adrian
> >>
> >> Adrian
> >>
> >> --
> >> Adrian Reber <[email protected]>    http://lisas.de/~adrian/
> >> "Let us all bask in television's warm glowing warming glow." -- Homer Simpson

Adrian

--
Adrian Reber <[email protected]>    http://lisas.de/~adrian/
    There's got to be more to life than compile-and-go.
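[Editor's note: the mpi-test source was never posted in this thread. For
anyone trying to reproduce the report, a minimal MPI program that prints
lines of the same shape (compiler version, rank, host, world size, rank as
a float) could look like the sketch below; it is only a guess at what
mpi-test does, not the actual program, and it ignores the numeric argument
used in the quoted commands.]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* __VERSION__ is the compiler's version string, matching the
       "4.4.7 20120313 (Red Hat 4.4.7-4)" prefix in the quoted output */
    printf("%s:Process %d on %s out of %d: %f\n",
           __VERSION__, rank, host, size, (double)rank);

    MPI_Finalize();
    return 0;
}

[Built with mpicc and launched as in the thread (mpirun -np 24 ./mpi-test),
it should print one line per rank in the same format as the openmpi-1.6.5
output quoted above.]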
