$ msub -I -l nodes=3:ppn=8
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 131828
salloc: job 131828 queued and waiting for resources
salloc: job 131828 has been allocated resources
salloc: Granted job allocation 131828
sh-4.1$ echo $SLURM_TASKS_PER_NODE
1
sh-4.1$ rpm -q slurm
slurm-2.6.5-1.el6.x86_64
sh-4.1$ echo $SLURM_NNODES
1
sh-4.1$ echo $SLURM_JOB_NODELIST
xxxx[107-108,176]
sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE
8(x3)
sh-4.1$ echo $SLURM_NODELIST
xxxx[107-108,176]
sh-4.1$ echo $SLURM_NPROCS
1
sh-4.1$ echo $SLURM_NTASKS
1
sh-4.1$ echo $SLURM_TASKS_PER_NODE
1
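[Editor's note: the compressed SLURM_JOB_CPUS_PER_NODE value above ("8(x3)",
i.e. 8 CPUs on each of 3 nodes) describes 24 slots in total. The small C
sketch below only illustrates how that compressed format expands into a slot
count; it is not Open MPI's actual SLURM allocator code, and all names in it
are made up for the example.]

/* Illustration only: expand a compressed SLURM_JOB_CPUS_PER_NODE value
 * such as "8(x3)" or "8(x2),4" into a total slot count. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int total_slots(const char *spec)
{
    char *copy = strdup(spec);          /* strtok modifies its input */
    int total = 0;

    for (char *tok = strtok(copy, ","); tok != NULL; tok = strtok(NULL, ",")) {
        int cpus = 0, repeat = 1;
        /* "8(x3)" -> cpus=8, repeat=3; a bare "8" -> cpus=8, repeat=1 */
        if (sscanf(tok, "%d(x%d)", &cpus, &repeat) < 1) {
            free(copy);
            return -1;                  /* unparsable token */
        }
        total += cpus * repeat;
    }
    free(copy);
    return total;
}

int main(void)
{
    const char *spec = getenv("SLURM_JOB_CPUS_PER_NODE");
    if (spec == NULL)
        spec = "8(x3)";                 /* value seen in the session above */
    printf("%s -> %d slots\n", spec, total_slots(spec));
    return 0;
}

[Run inside the allocation above it would print "8(x3) -> 24 slots"; the
"not enough slots" error quoted later in this thread would be consistent
with a launcher that instead trusts SLURM_TASKS_PER_NODE, which here is 1.]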
The information in *_NODELIST seems to make sense, but all the other
variables (PROCS, TASKS, NODES) report '1', which seems wrong.

On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
> ...and your version of Slurm?
>
> On Feb 12, 2014, at 7:19 AM, Ralph Castain <[email protected]> wrote:
>
> > What is your SLURM_TASKS_PER_NODE?
> >
> > On Feb 12, 2014, at 6:58 AM, Adrian Reber <[email protected]> wrote:
> >
> >> No, the system has only a few MOAB_* variables and many SLURM_*
> >> variables:
> >>
> >> $BASH  $IFS  $SECONDS  $SLURM_PTY_PORT
> >> $BASHOPTS  $LINENO  $SHELL  $SLURM_PTY_WIN_COL
> >> $BASHPID  $LINES  $SHELLOPTS  $SLURM_PTY_WIN_ROW
> >> $BASH_ALIASES  $MACHTYPE  $SHLVL  $SLURM_SRUN_COMM_HOST
> >> $BASH_ARGC  $MAILCHECK  $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
> >> $BASH_ARGV  $MOAB_CLASS  $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
> >> $BASH_CMDS  $MOAB_GROUP  $SLURM_CONF  $SLURM_STEP_ID
> >> $BASH_COMMAND  $MOAB_JOBID  $SLURM_CPUS_ON_NODE  $SLURM_STEP_LAUNCHER_PORT
> >> $BASH_LINENO  $MOAB_NODECOUNT  $SLURM_DISTRIBUTION  $SLURM_STEP_NODELIST
> >> $BASH_SOURCE  $MOAB_PARTITION  $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
> >> $BASH_SUBSHELL  $MOAB_PROCCOUNT  $SLURM_JOBID  $SLURM_STEP_NUM_TASKS
> >> $BASH_VERSINFO  $MOAB_SUBMITDIR  $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
> >> $BASH_VERSION  $MOAB_USER  $SLURM_JOB_ID  $SLURM_SUBMIT_DIR
> >> $COLUMNS  $OPTERR  $SLURM_JOB_NODELIST  $SLURM_SUBMIT_HOST
> >> $COMP_WORDBREAKS  $OPTIND  $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
> >> $DIRSTACK  $OSTYPE  $SLURM_LAUNCH_NODE_IPADDR  $SLURM_TASK_PID
> >> $EUID  $PATH  $SLURM_LOCALID  $SLURM_TOPOLOGY_ADDR
> >> $GROUPS  $POSIXLY_CORRECT  $SLURM_NNODES  $SLURM_TOPOLOGY_ADDR_PATTERN
> >> $HISTCMD  $PPID  $SLURM_NODEID  $SRUN_DEBUG
> >> $HISTFILE  $PS1  $SLURM_NODELIST  $TERM
> >> $HISTFILESIZE  $PS2  $SLURM_NPROCS  $TMPDIR
> >> $HISTSIZE  $PS4  $SLURM_NTASKS  $UID
> >> $HOSTNAME  $PWD  $SLURM_PRIO_PROCESS  $_
> >> $HOSTTYPE  $RANDOM  $SLURM_PROCID
> >>
> >> On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
> >>> Seems rather odd - since this is managed by Moab, you shouldn't be seeing
> >>> SLURM envars at all. What you should see are PBS_* envars, including a
> >>> PBS_NODEFILE that actually contains the allocation.
> >>>
> >>> On Feb 12, 2014, at 4:42 AM, Adrian Reber <[email protected]> wrote:
> >>>
> >>>> I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
> >>>> with slurm and moab. I requested an interactive session using:
> >>>>
> >>>> msub -I -l nodes=3:ppn=8
> >>>>
> >>>> and started a simple test case which fails:
> >>>>
> >>>> $ mpirun -np 2 ./mpi-test 1
> >>>> --------------------------------------------------------------------------
> >>>> There are not enough slots available in the system to satisfy the 2 slots
> >>>> that were requested by the application:
> >>>>   ./mpi-test
> >>>>
> >>>> Either request fewer slots for your application, or make more slots
> >>>> available for use.
> >>>> --------------------------------------------------------------------------
> >>>> srun: error: xxxx108: task 1: Exited with exit code 1
> >>>> srun: Terminating job step 131823.4
> >>>> srun: error: xxxx107: task 0: Exited with exit code 1
> >>>> srun: Job step aborted
> >>>> slurmd[xxxx108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH SIGNAL 9 ***
> >>>>
> >>>> requesting only one core works:
> >>>>
> >>>> $ mpirun ./mpi-test 1
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
> >>>>
> >>>> using openmpi-1.6.5 works with multiple cores:
> >>>>
> >>>> $ mpirun -np 24 ./mpi-test 2
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 24: 0.000000
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on xxxx106 out of 24: 12.000000
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on xxxx108 out of 24: 11.000000
> >>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on xxxx106 out of 24: 18.000000
> >>>>
> >>>> $ echo $SLURM_JOB_CPUS_PER_NODE
> >>>> 8(x3)
> >>>>
> >>>> I have never used slurm before, so this could also be a user error on my
> >>>> side. But as 1.6.5 works it seems something has changed, and I wanted to
> >>>> let you know in case it was not intentional.
> >>>>
> >>>> Adrian
> >>
> >> Adrian
> >>
> >> --
> >> Adrian Reber <[email protected]>    http://lisas.de/~adrian/
> >> "Let us all bask in television's warm glowing warming glow." -- Homer Simpson

Adrian

--
Adrian Reber <[email protected]>    http://lisas.de/~adrian/
    There's got to be more to life than compile-and-go.
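[Editor's note: the mpi-test source was never posted in this thread. For
anyone trying to reproduce the report, a minimal MPI program that prints
lines of the same shape (compiler version, rank, host, world size, rank as
a float) could look like the sketch below; it is only a guess at what
mpi-test does, not the actual program, and it ignores the numeric argument
used in the quoted commands.]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* __VERSION__ is the compiler's version string, matching the
       "4.4.7 20120313 (Red Hat 4.4.7-4)" prefix in the quoted output */
    printf("%s:Process %d on %s out of %d: %f\n",
           __VERSION__, rank, host, size, (double)rank);

    MPI_Finalize();
    return 0;
}

[Built with mpicc and launched as in the thread (mpirun -np 24 ./mpi-test),
it should print one line per rank in the same format as the openmpi-1.6.5
output quoted above.]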
