The --distribution option controls the layout of tasks across the resources already allocated to a job; it does not control which nodes separate jobs are placed on. SLURM packs jobs onto nodes in order to minimize fragmentation, which improves the locality of parallel jobs.
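For example, the distribution only becomes visible within a single job that spans more than one node. A minimal sketch, assuming a two-node allocation is available in the partition:

  # Block distribution: consecutive tasks fill the first node before the
  # second (tasks 0-1 on the first node, tasks 2-3 on the second).
  srun -N2 -n4 --distribution=block hostname

  # Cyclic distribution: tasks are assigned round-robin across the allocated
  # nodes (tasks 0 and 2 on the first node, tasks 1 and 3 on the second).
  srun -N2 -n4 --distribution=cyclic hostname

Since your jobs each run a single task, every job fits on one node and the scheduler is free to pack them onto the same node.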
Quoting Tal Hazan <tha...@doc.com>:

> Moe,
>
>  16428      regs  simv_tes     nniv   R       0:11      1 node002
>  16429      regs  simv_tes     nniv   R       0:11      1 node002
>
> I now see jobs being allocated to node002; apparently the licenses were
> set wrong on the head node.
>
> We are using --distribution=cyclic:block, shouldn't it round-robin
> jobs between the nodes?
>
> Regards,
> Tal
>
>
> -----Original Message-----
> From: Moe Jette [mailto:je...@schedmd.com]
> Sent: Thursday, May 10, 2012 10:28 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: Slurm not allocating jobs in node (IDLE STATE all time)
>
>
> The squeue command should report the reason for the jobs not running.
>
>
> Quoting Tal Hazan <tha...@doc.com>:
>
>> Hi,
>>
>> We have two nodes, one in MIXED state and the second in IDLE state.
>> Currently no tasks are running on the second node and no errors are
>> showing up.
>>
>> Below are the script we use to submit jobs and the SLURM configuration.
>>
>> Currently over 1000 jobs are queued.
>>
>> Submission script:
>> sbatch $ADDITIONAL --mail-type=FAIL --mail-user=$u...@doc.com -J $jname_static --partition=regs <<-EOF
>> #!/bin/bash
>> #SBATCH --get-user-env
>> #SBATCH --cpu_bind=cores
>> #SBATCH --distribution=cyclic
>> #SBATCH --partition=regs
>> #SBATCH --time=0
>> srun -q -o out_%j_%t --cpu_bind=cores --ntasks-per-core=1 --ntasks=1 $*
>> EOF
>>
>> Show config:
>> Configuration data as of 2012-05-10T20:31:02
>> AccountingStorageBackupHost = (null)
>> AccountingStorageEnforce = none
>> AccountingStorageHost = localhost
>> AccountingStorageLoc = N/A
>> AccountingStoragePort = 6819
>> AccountingStorageType = accounting_storage/slurmdbd
>> AccountingStorageUser = N/A
>> AuthType = auth/munge
>> BackupAddr = tlvhpcbcm2
>> BackupController = tlvhpcbcm2
>> BatchStartTimeout = 10 sec
>> BOOT_TIME = 2012-05-10T20:27:01
>> CacheGroups = 1
>> CheckpointType = checkpoint/none
>> ClusterName = slurm_cluster
>> CompleteWait = 0 sec
>> ControlAddr = tlvhpcbcm1
>> ControlMachine = tlvhpcbcm1
>> CryptoType = crypto/munge
>> DebugFlags = (null)
>> DefMemPerCPU = UNLIMITED
>> DisableRootJobs = NO
>> EnforcePartLimits = NO
>> Epilog = (null)
>> EpilogMsgTime = 2000 usec
>> EpilogSlurmctld = (null)
>> FastSchedule = 0
>> FirstJobId = 1
>> GetEnvTimeout = 2 sec
>> GresTypes = gpu
>> GroupUpdateForce = 0
>> GroupUpdateTime = 600 sec
>> HashVal = Match
>> HealthCheckInterval = 0 sec
>> HealthCheckProgram = (null)
>> InactiveLimit = 0 sec
>> JobAcctGatherFrequency = 30 sec
>> JobAcctGatherType = jobacct_gather/linux
>> JobCheckpointDir = /var/slurm/checkpoint
>> JobCompHost = localhost
>> JobCompLoc = /tmp/slurmCompLog
>> JobCompPort = 0
>> JobCompType = jobcomp/none
>> JobCompUser = root
>> JobCredentialPrivateKey = (null)
>> JobCredentialPublicCertificate = (null)
>> JobFileAppend = 0
>> JobRequeue = 1
>> JobSubmitPlugins = (null)
>> KillOnBadExit = 0
>> KillWait = 30 sec
>> Licenses = vcsruntime*6
>> MailProg = /bin/mail
>> MaxJobCount = 10000
>> MaxMemPerCPU = UNLIMITED
>> MaxTasksPerNode = 12
>> MessageTimeout = 10 sec
>> MinJobAge = 300 sec
>> MpiDefault = none
>> MpiParams = (null)
>> NEXT_JOB_ID = 18656
>> OverTimeLimit = 0 min
>> PluginDir = /cm/shared/apps/slurm/2.2.7/lib64/slurm
>> PlugStackConfig = /etc/slurm/plugstack.conf
>> PreemptMode = REQUEUE
>> PreemptType = preempt/partition_prio
>> PriorityType = priority/basic
>> PrivateData = none
>> ProctrackType = proctrack/pgid
>> Prolog = (null)
>> PrologSlurmctld = /cm/local/apps/cmd/scripts/prolog
>> PropagatePrioProcess = 0
>> PropagateResourceLimits = ALL
>> PropagateResourceLimitsExcept = (null)
>> ResumeProgram = (null)
>> ResumeRate = 300 nodes/min
>> ResumeTimeout = 60 sec
>> ResvOverRun = 0 min
>> ReturnToService = 1
>> SallocDefaultCommand = (null)
>> SchedulerParameters = (null)
>> SchedulerPort = 7321
>> SchedulerRootFilter = 1
>> SchedulerTimeSlice = 30 sec
>> SchedulerType = sched/backfill
>> SelectType = select/cons_res
>> SelectTypeParameters = CR_CORE
>> SlurmUser = slurm(117)
>> SlurmctldDebug = 3
>> SlurmctldLogFile = /var/log/slurmctld
>> SlurmSchedLogFile = (null)
>> SlurmctldPort = 6817
>> SlurmctldTimeout = 20 sec
>> SlurmdDebug = 3
>> SlurmdLogFile = /var/log/slurmd
>> SlurmdPidFile = /var/run/slurmd.pid
>> SlurmdPort = 6818
>> SlurmdSpoolDir = /cm/local/apps/slurm/2.2.4/spool
>> SlurmdTimeout = 20 sec
>> SlurmdUser = root(0)
>> SlurmSchedLogLevel = 0
>> SlurmctldPidFile = /var/run/slurmctld.pid
>> SLURM_CONF = /etc/slurm/slurm.conf
>> SLURM_VERSION = 2.2.7
>> SrunEpilog = (null)
>> SrunProlog = (null)
>> StateSaveLocation = /cm/shared/apps/slurm/current/cm/statesave
>> SuspendExcNodes = (null)
>> SuspendExcParts = (null)
>> SuspendProgram = (null)
>> SuspendRate = 60 nodes/min
>> SuspendTime = NONE
>> SuspendTimeout = 30 sec
>> SwitchType = switch/none
>> TaskEpilog = (null)
>> TaskPlugin = task/affinity
>> TaskPluginParam = (null type)
>> TaskProlog = (null)
>> TmpFS = /tmp
>> TopologyPlugin = topology/none
>> TrackWCKey = 0
>> TreeWidth = 50
>> UsePam = 0
>> UnkillableStepProgram = (null)
>> UnkillableStepTimeout = 60 sec
>> VSizeFactor = 0 percent
>> WaitTime = 0 sec
>>
>> [root@tlvhpc root]# sinfo -l
>> Thu May 10 20:31:19 2012
>> PARTITION AVAIL TIMELIMIT   JOB_SIZE ROOT SHARE GROUPS NODES STATE     NODELIST
>> defq*     up    infinite  1-infinite no   NO    all        1 allocated node001
>> defq*     up    infinite  1-infinite no   NO    all        1 idle      node002
>> regs      up    infinite  1-infinite no   NO    all        1 allocated node001
>> regs      up    infinite  1-infinite no   NO    all        1 idle      node002
>>
>> Scheduling pool data:
>> -------------------------------------------------------------
>> Pool     Memory   Cpus  Total Usable   Free  Other Traits
>> -------------------------------------------------------------
>> defq*   96865Mb     12      2      2      1
>> regs    96865Mb     12      2      2      1
>>
>> slurm.conf:
>>
>> ClusterName=SLURM_CLUSTER
>> #ControlAddr=
>> #BackupAddr=
>> #
>> SlurmUser=slurm
>> #SlurmdUser=root
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> AuthType=auth/munge
>> #JobCredentialPrivateKey=
>> #JobCredentialPublicCertificate=
>> StateSaveLocation=/cm/shared/apps/slurm/current/cm/statesave
>> SlurmdSpoolDir=/cm/local/apps/slurm/2.2.4/spool
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPidFile=/var/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>> #PluginDir=
>> CacheGroups=1
>> #FirstJobId=
>> ReturnToService=1
>> #MaxJobCount=
>> #PlugStackConfig=
>> #PropagatePrioProcess=
>> #PropagateResourceLimits=
>> #PropagateResourceLimitsExcept=
>> #Prolog=
>> PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog
>> #Epilog=
>> #SrunProlog=
>> #SrunEpilog=
>> #TaskProlog=
>> #TaskEpilog=
>> TaskPlugin=task/affinity
>> TaskPluginParam=sched
>> #TaskPlugin=task/none
>> #TaskPluginParam=Cores
>> #TrackWCKey=no
>> #TreeWidth=50
>> #TmpFs=
>> #UsePAM=
>> #
>> # TIMERS
>> SlurmctldTimeout=20
>> SlurmdTimeout=20
>> InactiveLimit=0
>> MinJobAge=300
>> KillWait=30
>> Waittime=0
>> #
>> # SCHEDULING
>> #SchedulerAuth=
>> #SchedulerPort=
>> #SchedulerRootFilter=
>> FastSchedule=0
>> #PriorityType=priority/multifactor
>> #PriorityDecayHalfLife=14-0
>> #PriorityUsageResetPeriod=14-0
>> #PriorityWeightFairshare=100000
>> #PriorityWeightAge=1000
>> #PriorityWeightPartition=10000
>> #PriorityWeightJobSize=1000
>> #PriorityMaxAge=1-0
>> #
>> # LOGGING
>> SlurmctldDebug=3
>> SlurmctldLogFile=/var/log/slurmctld
>> SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurmd
>> JobCompType=jobcomp/none
>> JobCompLoc=/tmp/slurmCompLog
>> #
>> # ACCOUNTING
>> JobAcctGatherType=jobacct_gather/linux
>> JobAcctGatherFrequency=30
>> #
>> AccountingStorageType=accounting_storage/slurmdbd
>> # AccountingStorageHost=localhost
>> # AccountingStorageLoc=slurm_acct_db
>> # AccountingStoragePass=SLURMDBD_USERPASS
>> # AccountingStorageUser=slurm
>> #
>> # GENERIC RESOURCES
>> GresTypes=gpu
>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>> # Scheduler
>> SchedulerType=sched/backfill
>> # Master nodes
>> ControlMachine=tlvhpcbcm1
>> ControlAddr=tlvhpcbcm1
>> BackupController=tlvhpcbcm2
>> BackupAddr=tlvhpcbcm2
>> # Nodes
>> PartitionName=defq Nodes=node[001,002] Default=YES MinNodes=1 MaxNodes=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL Priority=10 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO
>> PartitionName=regs Nodes=node[001,002] Default=NO MinNodes=1 MaxNodes=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL Priority=5 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO
>> # END AUTOGENERATED SECTION -- DO NOT REMOVE
>> # Plugins:
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_Core
>> MaxTasksPerNode=12
>> NodeName=node[001,002] Procs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96865 TmpDisk=1922
>> Licenses=vcsruntime*6
>> PreemptType=preempt/partition_prio
>> PreemptMode=REQUEUE
>>
>>
>> Best Regards,
>>
>> Tal Hazan, IT Specialist
>> DigitalOptics Corporation Israel Ltd.
>> www.doc.com
>> Mobile: +972-54-332-3338
>> Desk: +972-732-404-777
>> 6a Habarzel st. Tel Aviv, 69710 Israel
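On the earlier suggestion to check squeue for the reason jobs are not running, something along these lines works (the format string below is just one possibility; %R prints the reason for pending jobs):

  # List pending jobs together with the scheduler's reason for holding them.
  squeue -t PD -o "%.10i %.9P %.12j %.8u %.2t %.10M %R"

  # Or inspect a single job in detail; its Reason= field will show, for
  # example, "Licenses" when the vcsruntime license count is exhausted.
  scontrol show job 16428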