Re: [slurm-users] unable to Hold and release the job using scontrol

2021-05-23 Thread Brian Andrus

Yep, job 28 is already running.

If you want it to start out held, submit it with 'sbatch -H test.sh'
(capital -H, i.e. --hold; lowercase -h just prints sbatch's help) and it
will start out in a held state.
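For example, something along these lines (job ID illustrative; adjust to
your own submission):

sbatch -H test.sh         # same as --hold; job is submitted in a held state
squeue                    # the job sits in PD with reason (JobHeldUser)
scontrol release <jobid>  # clears the hold so the scheduler can start it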


Brian Andrus

On 5/22/2021 11:36 PM, Chris Samuel wrote:

> On Saturday, 22 May 2021 11:05:54 PM PDT Zainul Abiddin wrote:
>
>> I am trying to hold the job with scontrol but am not able to hold it.
>
> It looks like you're trying to hold a running job, which isn't possible.
>
> I see from the Slurm FAQ that you should be able to use "scontrol requeuehold"
> for what you are trying to achieve.
>
> https://slurm.schedmd.com/faq.html#req
>
> # Slurm supports requeuing jobs in a hold state with the command:
> #
> # scontrol requeuehold job_id
> #
> # The job can be in state RUNNING, SUSPENDED, COMPLETED or FAILED before
> # being requeued.
>
> Best of luck,
> Chris




Re: [slurm-users] unable to Hold and release the job using scontrol

2021-05-23 Thread Chris Samuel
On Saturday, 22 May 2021 11:05:54 PM PDT Zainul Abiddin wrote:

> I am trying to hold the job with scontrol but am not able to hold it.

It looks like you're trying to hold a running job, which isn't possible.

I see from the Slurm FAQ that you should be able to use "scontrol requeuehold" 
for what you are trying to achieve.

https://slurm.schedmd.com/faq.html#req

# Slurm supports requeuing jobs in a hold state with the command:
#
# scontrol requeuehold job_id
#
# The job can be in state RUNNING, SUSPENDED, COMPLETED or FAILED before
# being requeued.
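So for your job 28 (which was already running) the sequence should look
something like this (behaviour as described in the FAQ; I've not re-tested
it here):

scontrol requeuehold 28   # stops the running job and requeues it held
squeue -j 28              # it should now show PD with reason (JobHeldUser)
scontrol release 28       # allow it to be scheduled again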

Best of luck,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






[slurm-users] unable to Hold and release the job using scontrol

2021-05-23 Thread Zainul Abiddin
Hi All,
I am trying to hold a job with scontrol but am not able to hold it.
I am not able to understand why. Can anyone please explain the concepts of
Hold and Release, and of Suspend and Resume?

Please find the below steps which i have tried.

[root@master ~]# cat test.sh
#!/bin/bash

#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p hpc
#SBATCH -t 01:00:00
#SBATCH -J testjob
#SBATCH -o testjob.o%j
#SBATCH -e testjob.e%j

cd $SLURM_SUBMIT_DIR
/bin/hostname
date
sleep 120

[root@master ~]# sbatch test.sh
Submitted batch job 28
[root@master ~]# sbatch test.sh
Submitted batch job 29
[root@master ~]# sbatch test.sh
Submitted batch job 30
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root PD   0:00      1 (Resources)
    28       hpc  testjob root  R   0:06      1 master
    29       hpc  testjob root  R   0:05      1 master
[root@master ~]# sinfo -Nl
Sun May 23 11:16:55 2021
NODELIST   NODES PARTITION     STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
master         1      hpc* allocated    2 2:1:1  10240               1   (null) none
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root PD   0:00      1 (Resources)
    28       hpc  testjob root  R   0:39      1 master
    29       hpc  testjob root  R   0:38      1 master
[root@master ~]# scontrol hold 28
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root PD   0:00      1 (Resources)
    29       hpc  testjob root  R   1:04      1 master
    28       hpc  testjob root  R   1:05      1 master
[root@master ~]# scontrol hold 28
[root@master ~]# scontrol hold 28
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root PD   0:00      1 (Resources)
    29       hpc  testjob root  R   1:14      1 master
    28       hpc  testjob root  R   1:15      1 master
[root@master ~]# scontrol suspend 28
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    29       hpc  testjob root  R   1:38      1 master
    30       hpc  testjob root  R   0:01      1 master
    28       hpc  testjob root  S   1:37      1 master
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    29       hpc  testjob root  R   1:59      1 master
    30       hpc  testjob root  R   0:22      1 master
    28       hpc  testjob root  S   1:37      1 master
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root  R   0:41      1 master
    28       hpc  testjob root  S   1:37      1 master
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root  R   0:55      1 master
    28       hpc  testjob root  S   1:37      1 master
[root@master ~]# scontrol release 28
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root  R   1:20      1 master
    28       hpc  testjob root  S   1:37      1 master
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root  R   1:22      1 master
    28       hpc  testjob root  S   1:37      1 master
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root  R   1:23      1 master
    28       hpc  testjob root  S   1:37      1 master
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root  R   1:25      1 master
    28       hpc  testjob root  S   1:37      1 master
[root@master ~]# scontrol resume 28
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root  R   1:40      1 master
[root@master ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME  NODES NODELIST(REASON)
    30       hpc  testjob root  R