[slurm-users] Backfill pushing jobs back

2020-12-09 Thread David Baker
Hello,


We are seeing an issue where smaller jobs push back larger jobs. We are 
using Slurm 19.05.8, so we are not sure whether this is patched in newer 
releases. With a 4-node test partition, I submit 3 jobs as 2 users:



ssh hpcdev1@navy51 'sbatch --nodes=3 --ntasks-per-node=40 --partition=backfilltest --time=120 --wrap="sleep 7200"'

ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'

ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'



Then I increase the priority of the pending jobs significantly. Reading the 
manual, my understanding is that nodes should be held for these jobs.

for job in $(squeue -h -p backfilltest -t pd -o %i); do scontrol update job ${job} priority=10; done



squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"

JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE
28482 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING
28483 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING
28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING



So, there is one node free in our 4-node partition. Naturally, a small job with 
a walltime of less than 1 hour could run there, but we are also seeing 
backfill start longer jobs.



backfilltest up 2-12:00:00  3  alloc reddev[001-003]
backfilltest up 2-12:00:00  1   idle reddev004





ssh hpcdev3@navy51 'sbatch --nodes=1 --ntasks-per-node=40 --partition=backfilltest --time=720 --wrap="sleep 432000"'





squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"

JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE
28482 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING
28483 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING
28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING
28484 | hpcdev3 | 40 | 37541 | 12:00:00 | 2020-12-08T09:54:48 | RUNNING



Is this expected behaviour? It is also odd that the pending jobs don't have a 
start time. I have increased the backfill parameters significantly, but that 
doesn't seem to affect this at all.
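
To put a number on the pushback, here is a rough sketch of the arithmetic using only the start times and time limits from the squeue output above. It assumes both running jobs run up to their time limits and that both started on the same day (2020-12-08), so only the time of day is compared. The helper names are made up for illustration.

```shell
#!/usr/bin/env bash
# Convert H:MM:SS to seconds (10# forces base 10 so "09" is not read as octal).
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Convert seconds back to HH:MM:SS.
seconds_to_hms() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# Start time (time of day on 2020-12-08) plus time limit, taken from squeue.
end_28481=$(( $(hms_to_seconds 09:44:15) + $(hms_to_seconds 2:00:00) ))
end_28484=$(( $(hms_to_seconds 09:54:48) + $(hms_to_seconds 12:00:00) ))

echo "28481 ends by $(seconds_to_hms "$end_28481")"
echo "28484 ends by $(seconds_to_hms "$end_28484")"
echo "4-node jobs may wait $(seconds_to_hms $(( end_28484 - end_28481 ))) longer"
```

Under those assumptions, the earliest time all four nodes can be free moves from the end of job 28481 to the end of job 28484.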



SchedulerParameters=bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60
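
For reference, a small sketch that splits the SchedulerParameters value above into one option per line and annotates the time-based ones. The units are as I read them in the slurm.conf man page (bf_window in minutes, bf_resolution and bf_interval in seconds), so please check them against the documentation for your release. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# The SchedulerParameters value quoted above.
params='bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60'

# Print one option per line, adding units for the time-based backfill options.
describe_params() {
    local item key val
    local -a items
    IFS=',' read -r -a items <<< "$1"
    for item in "${items[@]}"; do
        key=${item%%=*}   # text before the first '='
        val=${item#*=}    # text after the first '=' (the whole item for bare flags)
        case "$key" in
            bf_window)     echo "bf_window: $val min ($(( val / 1440 )) days)" ;;
            bf_resolution) echo "bf_resolution: $val s ($(( val / 60 )) min)" ;;
            bf_interval)   echo "bf_interval: $val s" ;;
            *)             echo "$item" ;;
        esac
    done
}

describe_params "$params"
```

With these values the backfill window is 10 days at a 40-minute resolution, which is well beyond the 1-hour and 12-hour time limits in the test.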


Best regards,

David



Re: [slurm-users] Question about unit tests

2020-12-09 Thread Andy Riebs
Did you first run "make check" from the top-level Slurm directory 
(not testsuite/slurm_unit)?


On 12/8/2020 11:15 PM, Rikimaru Honjo wrote:

Hi,

I ran unit tests according to the following document.

https://github.com/SchedMD/slurm/blob/master/testsuite/slurm_unit/README

As a result, all unit tests passed.

However, I am concerned that the number of test cases is too small. The 
total is 5.

Is this correct? The amount of test code does not seem small.

This is the console log.
--
root@ho-slurmctld:/work/slurm/testsuite/slurm_unit# make check
[...]
PASS: api-test

Testsuite summary for slurm 20.02

# TOTAL: 1
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

[...]
Making check in common
[...]

Testsuite summary for slurm 20.02

# TOTAL: 0
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

[...]
Making check in slurmdb_pack
[...]

Testsuite summary for slurm 20.02

# TOTAL: 0
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

[...]
make  bitstring-test job-resources-test log-test pack-test
[...]
PASS: bitstring-test
PASS: job-resources-test
PASS: log-test
PASS: pack-test

Testsuite summary for slurm 20.02

# TOTAL: 4
# PASS:  4
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

make[4]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
make[3]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
make[2]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
make[1]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
make[1]: Entering directory '/work/slurm/testsuite/slurm_unit'
make[1]: Nothing to be done for 'check-am'.
make[1]: Leaving directory '/work/slurm/testsuite/slurm_unit'
--
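
The total of 5 can be checked by adding up the per-directory summaries in the log above. A minimal sketch, assuming the log uses the "# TOTAL:", "# PASS:" and "# FAIL:" summary lines shown there; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sum the per-directory summary counts from a test log read on stdin.
tally_checks() {
    awk '
        /^# TOTAL:/ { total += $3 }
        /^# PASS:/  { pass  += $3 }
        /^# FAIL:/  { fail  += $3 }
        END { printf "total=%d pass=%d fail=%d\n", total, pass, fail }
    '
}

# The four summaries from the log above (1, 0, 0 and 4 tests).
printf '%s\n' \
    '# TOTAL: 1' '# PASS:  1' '# FAIL:  0' \
    '# TOTAL: 0' '# PASS:  0' '# FAIL:  0' \
    '# TOTAL: 0' '# PASS:  0' '# FAIL:  0' \
    '# TOTAL: 4' '# PASS:  4' '# FAIL:  0' | tally_checks
```

The same function could be fed a saved log, for example `tally_checks < check.log`, where check.log is a hypothetical file holding the captured "make check" output.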

Best regards,




Re: [slurm-users] Backfill pushing jobs back

2020-12-09 Thread Chris Samuel

Hi David,

On 9/12/20 3:35 am, David Baker wrote:

We see the following issue with smaller jobs pushing back large jobs. We 
are using slurm 19.05.8 so not sure if this is patched in newer releases.


This sounds like a problem that we had at NERSC (small jobs pushing back 
multi-thousand node jobs). We carried a local patch for it, which Doug 
managed to get upstreamed in 20.02.x (I think it landed in 20.02.3, but 
20.02.6 is the current version).


Hope this helps!
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Question about unit tests

2020-12-09 Thread Rikimaru Honjo

Hi,

On 2020/12/09 23:30, Andy Riebs wrote:

Did you do the first "make check" from the top-level Slurm directory (not 
testsuite/slurm_unit)?


Oh, I had not tried that.
Thank you for the information.
I will report the result after trying it.


