I ended up just doing ‘scancel’ on all the jobs and resubmitting them, and I seem to be making progress. Now I am having trouble figuring out the --distribution option. I want each node to run one task of the array job while sharing its remaining resources with other jobs.
Here is what is in my script:

#SBATCH --nodes=1
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=5
#SBATCH --threads-per-core=2
#SBATCH --distribution=cyclic:block,NoPack

So I am getting 10 threads on a box. This is to run an MPI program. I submit with:

sbatch --array=1-100%2 slurm_array.sh

I would expect one task to be running on node1 and one on node2, but both start on node1.

From: John Desantis [mailto:desan...@mail.usf.edu]
Sent: Wednesday, January 27, 2016 7:37 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Update job and partition for shared jobs

Brian,

I've never run into that message with SLURM yet. Have you tried releasing the jobs with scontrol, e.g. "scontrol release ID", where "ID" is the job number?

We do not automatically requeue jobs due to a bug (fixed!) which caused the controller to crash because of an empty task_id_bitmap.

John DeSantis

2016-01-26 20:05 GMT-05:00 Andrus, Brian Contractor <bdand...@nps.edu>:

John,

Thanks. That seemed to help; a job started on a node that already had a job on it, once the job that had been ‘using’ all the memory completed.

But now none of my jobs will start; they all have a status of ‘JobHoldMaxRequeue’. From the docs, it seems that is because MAX_BATCH_REQUEUE is too low, but I don't see where to change that. Even worse, I cannot seem to scancel any of those jobs just to clean things up and test.

Anyone know how to get rid of jobs with a status of ‘JobHoldMaxRequeue’?

Brian Andrus

From: John Desantis [mailto:desan...@mail.usf.edu]
Sent: Tuesday, January 26, 2016 12:37 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Update job and partition for shared jobs

Brian,

Try setting a default memory per CPU in the partition definition. Later versions of SLURM (>= 14.11.6?) require this value to be set; otherwise all of a node's memory is scheduled.

HTH,
John DeSantis

2016-01-26 15:20 GMT-05:00 Andrus, Brian Contractor <bdand...@nps.edu>:

All,

I am in the process of transitioning from Torque to Slurm. So far it is doing very well, especially handling arrays.

Now I have one array job that is running across several nodes but only using some of each node's resources. I would like Slurm to start sharing the nodes so that some of the waiting array tasks will start where there are unused resources.

I ran an scontrol update to force sharing and can see that the partition did change:

# scontrol show partitions
PartitionName=debug
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=compute[45-49]
   Priority=1 RootOnly=NO ReqResv=NO Shared=FORCE:4 PreemptMode=OFF
   State=UP TotalCPUs=280 TotalNodes=5 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

But it is not starting job 416_37 on any node as I would expect:

# squeue
            JOBID PARTITION     NAME   USER ST     TIME NODES NODELIST(REASON)
  416_[37-1013%6]     debug slurm_ar  user1 PD     0:00     1 (Resources)
           416_36     debug slurm_ar  user1  R    35:46     1 compute49
           416_35     debug slurm_ar  user1  R  1:47:25     1 compute46
           416_33     debug slurm_ar  user1  R  7:30:50     1 compute45
           416_32     debug slurm_ar  user1  R  7:38:39     1 compute47
           416_31     debug slurm_ar  user1  R  8:53:26     1 compute48

In my config, I have:

SelectType           = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY

What am I missing to get more than one job to run on a node?
Thanks in advance,
Brian Andrus
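
For what it's worth, here is a hedged sketch of the partition-level default memory John suggests above, assuming a Slurm version whose slurm.conf accepts DefMemPerCPU on the partition line; the 1000 MB figure is only an illustration and is not taken from this thread:

PartitionName=debug Nodes=compute[45-49] Default=YES Shared=FORCE:4 DefMemPerCPU=1000 State=UP

With SelectTypeParameters=CR_CORE_MEMORY and no per-CPU or per-node default, a job that does not request memory is allocated the node's full memory, which would explain why a second job cannot share the node.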