Interesting - good to know. Thanks

On Feb 12, 2014, at 10:38 AM, Adrian Reber <adr...@lisas.de> wrote:
> It seems this is indeed a Moab bug for interactive jobs. At least a bug
> was opened against Moab. Using non-interactive jobs the variables have
> the correct values and mpirun has no problems detecting the correct
> number of cores.
>
> On Wed, Feb 12, 2014 at 07:50:40AM -0800, Ralph Castain wrote:
>> Another possibility to check - it is entirely possible that Moab is
>> miscommunicating the values to Slurm. You might need to check it - I'll
>> install a copy of 2.6.5 on my machines and see if I get similar issues
>> when Slurm does the allocation itself.
>>
>> On Feb 12, 2014, at 7:47 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>>
>>> On Feb 12, 2014, at 7:32 AM, Adrian Reber <adr...@lisas.de> wrote:
>>>
>>>>
>>>> $ msub -I -l nodes=3:ppn=8
>>>> salloc: Job is in held state, pending scheduler release
>>>> salloc: Pending job allocation 131828
>>>> salloc: job 131828 queued and waiting for resources
>>>> salloc: job 131828 has been allocated resources
>>>> salloc: Granted job allocation 131828
>>>> sh-4.1$ echo $SLURM_TASKS_PER_NODE
>>>> 1
>>>> sh-4.1$ rpm -q slurm
>>>> slurm-2.6.5-1.el6.x86_64
>>>> sh-4.1$ echo $SLURM_NNODES
>>>> 1
>>>> sh-4.1$ echo $SLURM_JOB_NODELIST
>>>> xxxx[107-108,176]
>>>> sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE
>>>> 8(x3)
>>>> sh-4.1$ echo $SLURM_NODELIST
>>>> xxxx[107-108,176]
>>>> sh-4.1$ echo $SLURM_NPROCS
>>>> 1
>>>> sh-4.1$ echo $SLURM_NTASKS
>>>> 1
>>>> sh-4.1$ echo $SLURM_TASKS_PER_NODE
>>>> 1
>>>>
>>>> The information in *_NODELIST seems to make sense, but all the other
>>>> variables (PROCS, TASKS, NODES) report '1', which seems wrong.
>>>
>>> Indeed - and that's the problem. Slurm 2.6.5 is the most recent release,
>>> and my guess is that SchedMD has once again changed the @$!#%#@ meaning
>>> of their envars. Frankly, it is nearly impossible to track all the
>>> variants they have created over the years.
>>>
>>> Please check whether someone did a little customizing of Slurm on your
>>> end, as people sometimes do. It could also be that something in the
>>> Slurm config file is causing the changed behavior.
>>>
>>> Meantime, I'll try to ponder a potential solution in case this really
>>> is the "latest" Slurm screwup.
>>>
>>>
>>>>
>>>>
>>>> On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
>>>>> ...and your version of Slurm?
>>>>>
>>>>> On Feb 12, 2014, at 7:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> What is your SLURM_TASKS_PER_NODE?
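[The compressed value quoted above - SLURM_JOB_CPUS_PER_NODE reporting "8(x3)",
i.e. eight CPUs on each of three nodes - is the format a launcher has to expand
into per-node slot counts. A minimal bash sketch of that expansion, purely for
illustration (the function name is made up and this is not the actual Open MPI
parser):

    expand_cpus_per_node() {
        # Expand e.g. "8(x3)" or "8,4(x2)" into one slot count per node.
        local entry count repeat i entries
        IFS=',' read -ra entries <<< "$1"
        for entry in "${entries[@]}"; do
            if [[ $entry =~ ^([0-9]+)\(x([0-9]+)\)$ ]]; then
                count=${BASH_REMATCH[1]}; repeat=${BASH_REMATCH[2]}
            else
                count=$entry; repeat=1
            fi
            for ((i = 0; i < repeat; i++)); do echo "$count"; done
        done
    }

    expand_cpus_per_node "$SLURM_JOB_CPUS_PER_NODE"   # "8(x3)" -> 8 / 8 / 8

Summing the expanded counts gives 24 slots for this allocation; the interactive
job's SLURM_NTASKS=1 and SLURM_TASKS_PER_NODE=1 are what make it look like a
single-slot allocation instead.]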
>>>>>>
>>>>>> On Feb 12, 2014, at 6:58 AM, Adrian Reber <adr...@lisas.de> wrote:
>>>>>>
>>>>>>> No, the system has only a few MOAB_* variables and many SLURM_*
>>>>>>> variables:
>>>>>>>
>>>>>>> $BASH             $IFS              $SECONDS                    $SLURM_PTY_PORT
>>>>>>> $BASHOPTS         $LINENO           $SHELL                      $SLURM_PTY_WIN_COL
>>>>>>> $BASHPID          $LINES            $SHELLOPTS                  $SLURM_PTY_WIN_ROW
>>>>>>> $BASH_ALIASES     $MACHTYPE         $SHLVL                      $SLURM_SRUN_COMM_HOST
>>>>>>> $BASH_ARGC        $MAILCHECK        $SLURMD_NODENAME            $SLURM_SRUN_COMM_PORT
>>>>>>> $BASH_ARGV        $MOAB_CLASS       $SLURM_CHECKPOINT_IMAGE_DIR $SLURM_STEPID
>>>>>>> $BASH_CMDS        $MOAB_GROUP       $SLURM_CONF                 $SLURM_STEP_ID
>>>>>>> $BASH_COMMAND     $MOAB_JOBID       $SLURM_CPUS_ON_NODE         $SLURM_STEP_LAUNCHER_PORT
>>>>>>> $BASH_LINENO      $MOAB_NODECOUNT   $SLURM_DISTRIBUTION         $SLURM_STEP_NODELIST
>>>>>>> $BASH_SOURCE      $MOAB_PARTITION   $SLURM_GTIDS                $SLURM_STEP_NUM_NODES
>>>>>>> $BASH_SUBSHELL    $MOAB_PROCCOUNT   $SLURM_JOBID                $SLURM_STEP_NUM_TASKS
>>>>>>> $BASH_VERSINFO    $MOAB_SUBMITDIR   $SLURM_JOB_CPUS_PER_NODE    $SLURM_STEP_TASKS_PER_NODE
>>>>>>> $BASH_VERSION     $MOAB_USER        $SLURM_JOB_ID               $SLURM_SUBMIT_DIR
>>>>>>> $COLUMNS          $OPTERR           $SLURM_JOB_NODELIST         $SLURM_SUBMIT_HOST
>>>>>>> $COMP_WORDBREAKS  $OPTIND           $SLURM_JOB_NUM_NODES        $SLURM_TASKS_PER_NODE
>>>>>>> $DIRSTACK         $OSTYPE           $SLURM_LAUNCH_NODE_IPADDR   $SLURM_TASK_PID
>>>>>>> $EUID             $PATH             $SLURM_LOCALID              $SLURM_TOPOLOGY_ADDR
>>>>>>> $GROUPS           $POSIXLY_CORRECT  $SLURM_NNODES               $SLURM_TOPOLOGY_ADDR_PATTERN
>>>>>>> $HISTCMD          $PPID             $SLURM_NODEID               $SRUN_DEBUG
>>>>>>> $HISTFILE         $PS1              $SLURM_NODELIST             $TERM
>>>>>>> $HISTFILESIZE     $PS2              $SLURM_NPROCS               $TMPDIR
>>>>>>> $HISTSIZE         $PS4              $SLURM_NTASKS               $UID
>>>>>>> $HOSTNAME         $PWD              $SLURM_PRIO_PROCESS         $_
>>>>>>> $HOSTTYPE         $RANDOM           $SLURM_PROCID
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
>>>>>>>> Seems rather odd - since this is managed by Moab, you shouldn't be
>>>>>>>> seeing SLURM envars at all. What you should see are PBS_* envars,
>>>>>>>> including a PBS_NODEFILE that actually contains the allocation.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Feb 12, 2014, at 4:42 AM, Adrian Reber <adr...@lisas.de> wrote:
>>>>>>>>
>>>>>>>>> I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a
>>>>>>>>> system with slurm and moab. I requested an interactive session
>>>>>>>>> using:
>>>>>>>>>
>>>>>>>>> msub -I -l nodes=3:ppn=8
>>>>>>>>>
>>>>>>>>> and started a simple test case which fails:
>>>>>>>>>
>>>>>>>>> $ mpirun -np 2 ./mpi-test 1
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> There are not enough slots available in the system to satisfy the 2
>>>>>>>>> slots that were requested by the application:
>>>>>>>>>   ./mpi-test
>>>>>>>>>
>>>>>>>>> Either request fewer slots for your application, or make more slots
>>>>>>>>> available for use.
>>>>>>>>> --------------------------------------------------------------------------
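[The "not enough slots" failure above, and the question of whether the job
should have carried PBS_* or SLURM_* variables, can be checked directly from
the interactive shell. A small sketch in plain shell, nothing Open MPI
specific, that shows which resource-manager environment the job actually
received:

    # List whatever Moab, PBS/TORQUE, or Slurm put into the environment.
    env | grep -E '^(MOAB_|PBS_|SLURM_)' | sort

    # If the job had come in via TORQUE/PBS, this file would hold the
    # allocated node list; on this system it is apparently unset.
    [ -n "$PBS_NODEFILE" ] && cat "$PBS_NODEFILE"

On this system the output would show only MOAB_* and SLURM_* variables, which
is why mpirun falls back to the Slurm values rather than a PBS nodefile.]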
>>>>>>>>> srun: error: xxxx108: task 1: Exited with exit code 1
>>>>>>>>> srun: Terminating job step 131823.4
>>>>>>>>> srun: error: xxxx107: task 0: Exited with exit code 1
>>>>>>>>> srun: Job step aborted
>>>>>>>>> slurmd[xxxx108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32
>>>>>>>>> WITH SIGNAL 9 ***
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> requesting only one core works:
>>>>>>>>>
>>>>>>>>> $ mpirun ./mpi-test 1
>>>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
>>>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> using openmpi-1.6.5 works with multiple cores:
>>>>>>>>>
>>>>>>>>> $ mpirun -np 24 ./mpi-test 2
>>>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 24: 0.000000
>>>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on xxxx106 out of 24: 12.000000
>>>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on xxxx108 out of 24: 11.000000
>>>>>>>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on xxxx106 out of 24: 18.000000
>>>>>>>>>
>>>>>>>>> $ echo $SLURM_JOB_CPUS_PER_NODE
>>>>>>>>> 8(x3)
>>>>>>>>>
>>>>>>>>> I have never used Slurm before, so this could also be a user error
>>>>>>>>> on my side. But as 1.6.5 works it seems something has changed, and
>>>>>>>>> I wanted to let you know in case it was not intentional.
>>>>>>>>>
>>>>>>>>> Adrian
>>>>>>>
>>>>>>> Adrian
>>>>>>>
>>>>>>> --
>>>>>>> Adrian Reber <adr...@lisas.de> http://lisas.de/~adrian/
>>>>>>> "Let us all bask in television's warm glowing warming glow."
>>>>>>>     -- Homer Simpson
>>>>
>>>> Adrian
>>>>
>>>> --
>>>> Adrian Reber <adr...@lisas.de> http://lisas.de/~adrian/
>>>> There's got to be more to life than compile-and-go.
>
> Adrian
>
> --
> Adrian Reber <adr...@lisas.de> http://lisas.de/~adrian/
> Killing is wrong.
>     -- Losira, "That Which Survives", stardate unknown
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
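[Since *_NODELIST is reported correctly while the task counts are not, one way
to confirm the allocation itself is usable is to hand mpirun an explicit
hostfile built from the nodelist. A possible workaround sketch, assuming
scontrol is available in the interactive shell, that SLURM_JOB_NODELIST is
accurate, and using the ppn=8 value from the msub request; the file name
moab_hostfile is made up, and this only sidesteps the envar parsing rather
than fixing it:

    # Expand the compact nodelist ("xxxx[107-108,176]") into one host per
    # line and attach the per-node slot count from the msub request (ppn=8).
    scontrol show hostnames "$SLURM_JOB_NODELIST" | \
        sed 's/$/ slots=8/' > moab_hostfile

    # Launch with the explicit hostfile instead of relying on SLURM_NTASKS.
    mpirun -np 24 --hostfile moab_hostfile ./mpi-test 2

Whether the processes are then started via srun or ssh still depends on how
Open MPI was configured on the cluster.]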