Re: [slurm-users] Question about unit tests
Hi,

On 2020/12/09 23:30, Andy Riebs wrote:
> Did you do the first "make check" from the top-level Slurm directory
> (not testsuite/slurm_unit)?

Oh, I had never tried that. Thank you for the information. I will
report the result after trying it.

> On 12/8/2020 11:15 PM, Rikimaru Honjo wrote:
>> Hi,
>>
>> I ran the unit tests according to the following document:
>> https://github.com/SchedMD/slurm/blob/master/testsuite/slurm_unit/README
>>
>> As a result, all the unit tests passed. However, I am concerned that
>> the number of test cases is very small: the total is only 5. Is this
>> correct? The amount of test code does not look that small.
>>
>> This is the console log:
>> --
>> root@ho-slurmctld:/work/slurm/testsuite/slurm_unit# make check
>> [...]
>> PASS: api-test
>> Testsuite summary for slurm 20.02
>> # TOTAL: 1
>> # PASS:  1
>> # SKIP:  0
>> # XFAIL: 0
>> # FAIL:  0
>> # XPASS: 0
>> # ERROR: 0
>> [...]
>> Making check in common
>> [...]
>> Testsuite summary for slurm 20.02
>> # TOTAL: 0
>> # PASS:  0
>> # SKIP:  0
>> # XFAIL: 0
>> # FAIL:  0
>> # XPASS: 0
>> # ERROR: 0
>> [...]
>> Making check in slurmdb_pack
>> [...]
>> Testsuite summary for slurm 20.02
>> # TOTAL: 0
>> # PASS:  0
>> # SKIP:  0
>> # XFAIL: 0
>> # FAIL:  0
>> # XPASS: 0
>> # ERROR: 0
>> [...]
>> make bitstring-test job-resources-test log-test pack-test
>> [...]
>> PASS: bitstring-test
>> PASS: job-resources-test
>> PASS: log-test
>> PASS: pack-test
>> Testsuite summary for slurm 20.02
>> # TOTAL: 4
>> # PASS:  4
>> # SKIP:  0
>> # XFAIL: 0
>> # FAIL:  0
>> # XPASS: 0
>> # ERROR: 0
>> make[4]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
>> make[3]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
>> make[2]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
>> make[1]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
>> make[1]: Entering directory '/work/slurm/testsuite/slurm_unit'
>> make[1]: Nothing to be done for 'check-am'.
>> make[1]: Leaving directory '/work/slurm/testsuite/slurm_unit'
>> --

Best regards,
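P.S. For reference, a minimal sketch of what I plan to try, assuming a
standard autotools checkout (the source path is from the log above; the
configure prefix is a placeholder, not taken from this thread):

--
# From the top-level Slurm source tree, not testsuite/slurm_unit.
cd /work/slurm
./configure --prefix=/opt/slurm   # prefix is illustrative
make -j"$(nproc)"
# A top-level "make check" descends into testsuite/slurm_unit and is
# expected to build and run many more tests than the 5 seen above.
make check
--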
Re: [slurm-users] Backfill pushing jobs back
Hi David,

On 9/12/20 3:35 am, David Baker wrote:
> We see the following issue with smaller jobs pushing back large jobs.
> We are using slurm 19.05.8 so not sure if this is patched in newer
> releases.

This sounds like a problem that we had at NERSC (small jobs pushing
back multi-thousand-node jobs). We carried a local patch for it, which
Doug managed to get upstreamed in 20.02.x (I think it landed in
20.02.3, but 20.02.6 is the current version).

Hope this helps!
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
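For anyone wanting to compare their own cluster against 20.02.3, either
of these standard commands prints the Slurm version in use (a
convenience note only, not part of the patch discussion):

--
sinfo -V          # prints the Slurm version of the client tools
scontrol version  # likewise prints the Slurm version
--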
Re: [slurm-users] Question about unit tests
Did you do the first "make check" from the top-level Slurm directory
(not testsuite/slurm_unit)?

On 12/8/2020 11:15 PM, Rikimaru Honjo wrote:
> Hi,
>
> I ran the unit tests according to the following document:
> https://github.com/SchedMD/slurm/blob/master/testsuite/slurm_unit/README
>
> As a result, all the unit tests passed. However, I am concerned that
> the number of test cases is very small: the total is only 5. Is this
> correct? The amount of test code does not look that small.
>
> This is the console log:
> --
> root@ho-slurmctld:/work/slurm/testsuite/slurm_unit# make check
> [...]
> PASS: api-test
> Testsuite summary for slurm 20.02
> # TOTAL: 1
> # PASS:  1
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
> [...]
> Making check in common
> [...]
> Testsuite summary for slurm 20.02
> # TOTAL: 0
> # PASS:  0
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
> [...]
> Making check in slurmdb_pack
> [...]
> Testsuite summary for slurm 20.02
> # TOTAL: 0
> # PASS:  0
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
> [...]
> make bitstring-test job-resources-test log-test pack-test
> [...]
> PASS: bitstring-test
> PASS: job-resources-test
> PASS: log-test
> PASS: pack-test
> Testsuite summary for slurm 20.02
> # TOTAL: 4
> # PASS:  4
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
> make[4]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
> make[3]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
> make[2]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
> make[1]: Leaving directory '/work/slurm/testsuite/slurm_unit/common'
> make[1]: Entering directory '/work/slurm/testsuite/slurm_unit'
> make[1]: Nothing to be done for 'check-am'.
> make[1]: Leaving directory '/work/slurm/testsuite/slurm_unit'
> --
> Best regards,
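As a side note, the individual tests listed in that log can also be
built and run one at a time; a minimal sketch, assuming the tree has
already been configured (the binary name is taken from the log above;
running the check binary directly is an assumption, not something shown
in this thread):

--
cd /work/slurm/testsuite/slurm_unit/common
# Build a single test program instead of the whole suite.
make bitstring-test
# Assumption: each test is a standalone binary that prints its own
# PASS/FAIL summary when executed directly.
./bitstring-test
--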
[slurm-users] Backfill pushing jobs back
Hello,

We see the following issue with smaller jobs pushing back large jobs.
We are using slurm 19.05.8, so we are not sure whether this is patched
in newer releases.

With a 4-node test partition, I submit 3 jobs as 2 users:

ssh hpcdev1@navy51 'sbatch --nodes=3 --ntasks-per-node=40 --partition=backfilltest --time=120 --wrap="sleep 7200"'
ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'
ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'

Then I increase the priority of the pending jobs significantly. Reading
the manual, my understanding is that nodes should now be held for these
jobs.

for job in $(squeue -h -p backfilltest -t pd -o %i); do scontrol update job ${job} priority=10; done

squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"
JOBID | USER    | CPUS | PRIORITY | TIME_LIMIT | START_TIME          | STATE
28482 | hpcdev2 | 160  | 10       | 1:00:00    | N/A                 | PENDING
28483 | hpcdev2 | 160  | 10       | 1:00:00    | N/A                 | PENDING
28481 | hpcdev1 | 120  | 50083    | 2:00:00    | 2020-12-08T09:44:15 | RUNNING

So, there is one node free in our 4-node partition. Naturally, a small
job with a walltime of less than 1 hour could run in that gap, but we
are also seeing backfill start longer jobs:

backfilltest up 2-12:00:00 3 alloc reddev[001-003]
backfilltest up 2-12:00:00 1 idle  reddev004

ssh hpcdev3@navy51 'sbatch --nodes=1 --ntasks-per-node=40 --partition=backfilltest --time=720 --wrap="sleep 432000"'

squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"
JOBID | USER    | CPUS | PRIORITY | TIME_LIMIT | START_TIME          | STATE
28482 | hpcdev2 | 160  | 10       | 1:00:00    | N/A                 | PENDING
28483 | hpcdev2 | 160  | 10       | 1:00:00    | N/A                 | PENDING
28481 | hpcdev1 | 120  | 50083    | 2:00:00    | 2020-12-08T09:44:15 | RUNNING
28484 | hpcdev3 | 40   | 37541    | 12:00:00   | 2020-12-08T09:54:48 | RUNNING

Is this expected behaviour? It also seems odd that the pending jobs
don't have a start time. I have increased the backfill parameters
significantly, but that doesn't seem to affect this at all:

SchedulerParameters=bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60

Best regards,
David
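A useful first step when backfill behaves unexpectedly is to look at
the scheduler's own statistics and at the parameters the running
controller actually has in effect; a minimal sketch using standard
Slurm tools (the grep patterns are just a convenience, not exact
section names):

--
# sdiag reports scheduling statistics, including a backfilling section
# (total backfilled jobs, last cycle time, depth reached, and so on).
sdiag | grep -i -A 12 backfill
# Confirm which SchedulerParameters the running slurmctld is using.
scontrol show config | grep -i schedulerparameters
--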