What is your SLURM_TASKS_PER_NODE?

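For reference, it can be checked from inside the interactive session; the
value uses Slurm's compressed notation, where e.g. "8(x3)" means 8 tasks on
each of 3 nodes:

$ echo "$SLURM_TASKS_PER_NODE"
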
On Feb 12, 2014, at 6:58 AM, Adrian Reber <adr...@lisas.de> wrote:

> No, the system has only a few MOAB_* variables and many SLURM_*
> variables:
> 
> $BASH              $IFS               $SECONDS                      $SLURM_PTY_PORT
> $BASHOPTS          $LINENO            $SHELL                        $SLURM_PTY_WIN_COL
> $BASHPID           $LINES             $SHELLOPTS                    $SLURM_PTY_WIN_ROW
> $BASH_ALIASES      $MACHTYPE          $SHLVL                        $SLURM_SRUN_COMM_HOST
> $BASH_ARGC         $MAILCHECK         $SLURMD_NODENAME              $SLURM_SRUN_COMM_PORT
> $BASH_ARGV         $MOAB_CLASS        $SLURM_CHECKPOINT_IMAGE_DIR   $SLURM_STEPID
> $BASH_CMDS         $MOAB_GROUP        $SLURM_CONF                   $SLURM_STEP_ID
> $BASH_COMMAND      $MOAB_JOBID        $SLURM_CPUS_ON_NODE           $SLURM_STEP_LAUNCHER_PORT
> $BASH_LINENO       $MOAB_NODECOUNT    $SLURM_DISTRIBUTION           $SLURM_STEP_NODELIST
> $BASH_SOURCE       $MOAB_PARTITION    $SLURM_GTIDS                  $SLURM_STEP_NUM_NODES
> $BASH_SUBSHELL     $MOAB_PROCCOUNT    $SLURM_JOBID                  $SLURM_STEP_NUM_TASKS
> $BASH_VERSINFO     $MOAB_SUBMITDIR    $SLURM_JOB_CPUS_PER_NODE      $SLURM_STEP_TASKS_PER_NODE
> $BASH_VERSION      $MOAB_USER         $SLURM_JOB_ID                 $SLURM_SUBMIT_DIR
> $COLUMNS           $OPTERR            $SLURM_JOB_NODELIST           $SLURM_SUBMIT_HOST
> $COMP_WORDBREAKS   $OPTIND            $SLURM_JOB_NUM_NODES          $SLURM_TASKS_PER_NODE
> $DIRSTACK          $OSTYPE            $SLURM_LAUNCH_NODE_IPADDR     $SLURM_TASK_PID
> $EUID              $PATH              $SLURM_LOCALID                $SLURM_TOPOLOGY_ADDR
> $GROUPS            $POSIXLY_CORRECT   $SLURM_NNODES                 $SLURM_TOPOLOGY_ADDR_PATTERN
> $HISTCMD           $PPID              $SLURM_NODEID                 $SRUN_DEBUG
> $HISTFILE          $PS1               $SLURM_NODELIST               $TERM
> $HISTFILESIZE      $PS2               $SLURM_NPROCS                 $TMPDIR
> $HISTSIZE          $PS4               $SLURM_NTASKS                 $UID
> $HOSTNAME          $PWD               $SLURM_PRIO_PROCESS           $_
> $HOSTTYPE          $RANDOM            $SLURM_PROCID
> 
> On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
>> Seems rather odd - since this is managed by Moab, you shouldn't be seeing 
>> SLURM envars at all. What you should see are PBS_* envars, including a 
>> PBS_NODEFILE that actually contains the allocation.
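>> 
>> A quick way to check would be something along these lines (a minimal
>> sketch; the grep pattern is only illustrative):
>> 
>> $ env | grep '^PBS_'
>> $ cat "$PBS_NODEFILE"   # under Torque/Moab, typically one line per allocated slot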
>> 
>> 
>> On Feb 12, 2014, at 4:42 AM, Adrian Reber <adr...@lisas.de> wrote:
>> 
>>> I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
>>> with Slurm and Moab. I requested an interactive session using:
>>> 
>>> msub -I -l nodes=3:ppn=8
>>> 
>>> and started a simple test case which fails:
>>> 
>>> $ mpirun -np 2 ./mpi-test 1
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 2 slots
>>> that were requested by the application:
>>>   ./mpi-test
>>> 
>>> Either request fewer slots for your application, or make more slots
>>> available for use.
>>> --------------------------------------------------------------------------
>>> srun: error: xxxx108: task 1: Exited with exit code 1
>>> srun: Terminating job step 131823.4
>>> srun: error: xxxx107: task 0: Exited with exit code 1
>>> srun: Job step aborted
>>> slurmd[xxxx108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH SIGNAL 9 ***
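>>> 
>>> To see what the allocation looks like from inside the msub session, the
>>> relevant variables could be dumped with something like this (a minimal
>>> sketch; SLURM_NODELIST and SLURM_TASKS_PER_NODE are, as far as I can
>>> tell, what Open MPI's Slurm allocation support reads):
>>> 
>>> $ echo "nodes:      $SLURM_NODELIST"
>>> $ echo "tasks/node: $SLURM_TASKS_PER_NODE"
>>> $ echo "cpus/node:  $SLURM_JOB_CPUS_PER_NODE"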
>>> 
>>> 
>>> requesting only one core works:
>>> 
>>> $ mpirun  ./mpi-test 1
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
>>> 
>>> 
>>> using openmpi-1.6.5 works with multiple cores:
>>> 
>>> $ mpirun -np 24 ./mpi-test 2
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 24: 0.000000
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on xxxx106 out of 24: 12.000000
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on xxxx108 out of 24: 11.000000
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on xxxx106 out of 24: 18.000000
>>> 
>>> $ echo $SLURM_JOB_CPUS_PER_NODE 
>>> 8(x3)
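>>> 
>>> That compressed value means 8 CPUs on each of 3 nodes, i.e. 24 slots in
>>> total. A small helper (hypothetical, bash-only) can expand the notation
>>> for inspection:
>>> 
>>> # Expand Slurm's "count(xrepeat)" notation, e.g. "8(x3)" -> 8 8 8,
>>> # printing one CPU count per allocated node.
>>> expand_cpus_per_node() {
>>>     local field i
>>>     for field in ${1//,/ }; do                 # fields are comma-separated
>>>         if [[ $field =~ ^([0-9]+)\(x([0-9]+)\)$ ]]; then
>>>             for ((i = 0; i < BASH_REMATCH[2]; i++)); do
>>>                 echo "${BASH_REMATCH[1]}"      # repeated count
>>>             done
>>>         else
>>>             echo "$field"                      # plain count, no repeat suffix
>>>         fi
>>>     done
>>> }
>>> 
>>> $ expand_cpus_per_node "$SLURM_JOB_CPUS_PER_NODE"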
>>> 
>>> I have never used Slurm before, so this could also be a user error on
>>> my side. But since 1.6.5 works, it seems something has changed, and I
>>> wanted to let you know in case it was not intentional.
>>> 
>>>             Adrian
>> 
> 
>               Adrian
> 
> -- 
> Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
> "Let us all bask in television's warm glowing warming glow." -- Homer Simpson
