I have noticed in production that our "serial-long" partition has been
sharing nodes with jobs from our "background" partition, and the NumCPUs of
those jobs sum to more than the CPU count of the node they are scheduled
onto.  This has been seen in production on 14.03.10 and reproduced in our
test environment, also on 14.03.10.  The jobs in production MAY have been
scheduled before we updated to 14.03.10 earlier this week; previously we
were on 14.03.6.

Node:

NodeName=c0218 NodeAddr=192.168.200.68 CPUs=8 Sockets=2 CoresPerSocket=4
ThreadsPerCore=1 RealMemory=32200 TmpDisk=16000
Feature=core8,mem32gb,gig,harpertown State=UNKNOWN

Partitions:

PartitionName=DEFAULT Nodes=c0218,c0[931-932]n[1-2] DefMemPerCPU=1900
MaxMemPerCPU=2000
PartitionName=serial-long Nodes=c0218 Priority=10 PreemptMode=OFF
MaxNodes=1 MaxTime=720:00:00 DefMemPerCPU=3900 MaxMemPerCPU=4000 State=UP
PartitionName=background Priority=10 MaxNodes=1 MaxTime=96:00:00 State=UP

slurm.conf items:

PreemptMode             = GANG,SUSPEND
PreemptType             = preempt/partition_prio
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU,CR_CORE_DEFAULT_DIST_BLOCK

Using a simple batch script that just calls the "stress" program to
generate load, I scheduled two jobs.  SLURM correctly adjusted each job's
NumCPUs based on the partitions' MaxMemPerCPU values:

sbatch --mem=14400 -p background batches/stress.slrm
# jobID 11323
# NumCPUs=8

sbatch --mem=15360 -p serial-long batches/stress.slrm
# jobID 11324
# NumCPUs=4
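For reference, the NumCPUs values above match the arithmetic Slurm appears
to apply when --mem exceeds what the requested CPUs allow: the job's CPU
count is raised to ceil(mem / MaxMemPerCPU).  A minimal sketch (the
`cpus_from_mem` helper is mine, not a Slurm tool; "background" inherits
MaxMemPerCPU=2000 from the DEFAULT partition line):

```shell
#!/bin/sh
# cpus_from_mem MEM_MB MAX_MEM_PER_CPU_MB
# Minimum CPUs so that per-CPU memory stays within MaxMemPerCPU
# (integer ceiling division).
cpus_from_mem() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

cpus_from_mem 14400 2000   # background:   prints 8, matches job 11323
cpus_from_mem 15360 4000   # serial-long:  prints 4, matches job 11324
```

So individually each job's NumCPUs is correct; the question is why 8 + 4
CPUs were packed onto an 8-CPU node.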

Below is the debug output from slurmctld.log for the scheduling of the
second job.  I'm wondering if this is a bug or a configuration issue.  It
seems like a bug, because more CPUs were scheduled than are available on
the node, and all partitions have Shared=NO by default.

[2014-11-13T15:38:25.278] debug:  Setting job's pn_min_cpus to 4 due to
memory limit
[2014-11-13T15:38:25.278] debug3: acct_policy_validate: MPN: job_memory set
to 15360
[2014-11-13T15:38:25.278] debug3: before alteration asking for nodes
1-4294967294 cpus 1-4294967294
[2014-11-13T15:38:25.278] debug3: after alteration asking for nodes
1-4294967294 cpus 1-4294967294
[2014-11-13T15:38:25.279] debug2: initial priority for job 11324 is 30668109
[2014-11-13T15:38:25.279] debug2: found 1 usable nodes from config
containing c0218
[2014-11-13T15:38:25.279] debug3: _pick_best_nodes: job 11324 idle_nodes 4
share_nodes 5
[2014-11-13T15:38:25.279] debug2: select_p_job_test for job 11324
[2014-11-13T15:38:25.279] debug3: acct_policy_job_runnable_post_select: job
11324: MPN: job_memory set to 15360
[2014-11-13T15:38:25.279] debug2: _adjust_limit_usage: job 11324: MPN:
job_memory set to 0
[2014-11-13T15:38:25.279] debug2: sched: JobId=11324 allocated resources:
NodeList=(null)
[2014-11-13T15:38:25.279] _slurm_rpc_submit_batch_job JobId=11324 usec=1582
[2014-11-13T15:38:25.281] debug3: Writing job id 11324 to header record of
job_state file
[2014-11-13T15:38:26.187] debug:  sched: Running job scheduler
[2014-11-13T15:38:26.187] debug2: found 1 usable nodes from config
containing c0218
[2014-11-13T15:38:26.187] debug3: _pick_best_nodes: job 11324 idle_nodes 4
share_nodes 5
[2014-11-13T15:38:26.187] debug2: select_p_job_test for job 11324
[2014-11-13T15:38:26.188] debug3: cons_res: best_fit: node[0]: required
cpus: 4, min req boards: 1,
[2014-11-13T15:38:26.188] debug3: cons_res: best_fit: node[0]: min req
sockets: 1, min avail cores: 8
[2014-11-13T15:38:26.188] debug3: cons_res: best_fit: using node[0]:
board[0]: socket[1]: 4 cores available
[2014-11-13T15:38:26.188] debug3: acct_policy_job_runnable_post_select: job
11324: MPN: job_memory set to 15360
[2014-11-13T15:38:26.188] debug3: cons_res: _add_job_to_res: job 11324 act 0
[2014-11-13T15:38:26.188] debug3: cons_res: adding job 11324 to part
serial-long row 0
[2014-11-13T15:38:26.188] debug2: _adjust_limit_usage: job 11324: MPN:
job_memory set to 15360
[2014-11-13T15:38:26.188] debug3: sched: JobId=11324 initiated
[2014-11-13T15:38:26.188] sched: Allocate JobId=11324 NodeList=c0218 #CPUs=4
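The oversubscription is easy to confirm by summing the allocated CPUs of
the jobs on the node.  A sketch, here fed the two jobs above as literal
input; on a live cluster the same lines should come from something like
`squeue -h -w c0218 -o '%i %P %C'` (jobid, partition, CPUs):

```shell
#!/bin/sh
# Columns: jobid partition allocated_cpus (sample data from the jobs above)
squeue_out="11323 background 8
11324 serial-long 4"

# Sum column 3 and compare against the node's CPU count (CPUs=8 on c0218).
total=$(echo "$squeue_out" | awk '{sum += $3} END {print sum}')
echo "allocated=$total node_cpus=8"   # allocated=12 exceeds the node's 8 CPUs
```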

Thanks,
- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]