I ended up just doing ‘scancel’ on all the jobs and resubmitting them.

I seem to be making progress.
Now I am having trouble figuring out the --distribution option.
I want each node to run one task of the array job while sharing its remaining 
resources with other jobs.

Here is what is in my script:
#SBATCH --nodes=1
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=5
#SBATCH --threads-per-core=2
#SBATCH --distribution=cyclic:block,NoPack

So I am getting 10 threads on a box (1 socket x 5 cores x 2 threads). This is to run an MPI program.
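The body of slurm_array.sh after those #SBATCH lines is essentially just an srun launch of the MPI binary, roughly like this (the program name is a placeholder, not the real one):

# launch one MPI run per array task
srun ./my_mpi_program $SLURM_ARRAY_TASK_ID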

I do an sbatch:
sbatch --array=1-100%2 slurm_array.sh


I would expect one array task to be running on node1 and one on node2, but both 
start on node1.
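(In case it matters, placement can be checked quickly with standard squeue format fields, e.g.:

squeue -u $USER -o "%.14i %.8T %.6D %N"

which lists each array task with its state and the node(s) it is running on.)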



From: John Desantis [mailto:desan...@mail.usf.edu]
Sent: Wednesday, January 27, 2016 7:37 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Update job and partition for shared jobs

Brian,

I haven't run into that message with SLURM yet.

Have you tried releasing the jobs with scontrol, e.g. "scontrol release ID" 
where "ID" is the job number?

We do not automatically requeue jobs due to a bug (fixed!) which caused the 
controller to crash because of an empty task_id_bitmap.

John DeSantis

2016-01-26 20:05 GMT-05:00 Andrus, Brian Contractor 
<bdand...@nps.edu>:
John,

Thanks. That seemed to help: a job started on a node that already had a job on it, 
once the job that had been ‘using’ all the memory completed.

But now none of my jobs will start; they all have a status of ‘JobHoldMaxRequeue’.

From the docs, it seems that is because MAX_BATCH_REQUEUE is too low, but I 
don’t see where to change that.

Even worse, I cannot seem to scancel any of those jobs just to clean things up 
and test stuff.

Anyone know how to get rid of jobs with a status of ‘JobHoldMaxRequeue’?

Brian Andrus


From: John Desantis [mailto:desan...@mail.usf.edu]
Sent: Tuesday, January 26, 2016 12:37 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Update job and partition for shared jobs

Brian,

Try setting a default memory per CPU in the partition definition.  Later 
versions of SLURM (>= 14.11.6?) require this value to be set; otherwise, all of 
a node's memory is allocated to the job.
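
For example, something like this on the partition line in slurm.conf (the 2000 MB figure is only a placeholder; pick a value that suits your nodes):

PartitionName=debug Nodes=compute[45-49] Default=YES Shared=FORCE:4 DefMemPerCPU=2000 State=UP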

HTH,
John DeSantis

2016-01-26 15:20 GMT-05:00 Andrus, Brian Contractor 
<bdand...@nps.edu>:
All,

I am in the process of transitioning from Torque to Slurm.
So far it is doing very well, especially handling arrays.

Now I have one array job that is running across several nodes but only using 
some of each node's resources. I would like Slurm to start sharing the nodes so 
that more of the array tasks will start where there are unused resources.

I ran a scontrol update to force sharing and see the partition did change:

#scontrol show partitions
PartitionName=debug
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=compute[45-49]
   Priority=1 RootOnly=NO ReqResv=NO Shared=FORCE:4 PreemptMode=OFF
   State=UP TotalCPUs=280 TotalNodes=5 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
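
(The update itself was a one-liner roughly like the following; exact options from memory:

scontrol update PartitionName=debug Shared=FORCE:4 )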

But it is not starting job 416_37 on any node as I would expect.

#squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   416_[37-1013%6]     debug slurm_ar  user1 PD       0:00      1 (Resources)
            416_36     debug slurm_ar  user1  R      35:46      1 compute49
            416_35     debug slurm_ar  user1  R    1:47:25      1 compute46
            416_33     debug slurm_ar  user1  R    7:30:50      1 compute45
            416_32     debug slurm_ar  user1  R    7:38:39      1 compute47
            416_31     debug slurm_ar  user1  R    8:53:26      1 compute48

In my config, I have:
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
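
(In case it helps with diagnosis, the allocation of one of the running tasks can be inspected with, e.g.:

scontrol show job 416_36 )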


What am I missing to get more than one job to run on a node?

Thanks in advance,

Brian Andrus

