[slurm-users] cpu-bind=MASK at output files

2023-06-27 Thread Gestió Servidors
Hello,

Running this simple script:
#!/bin/bash
#
#SBATCH --job-name=mega_job
#SBATCH --output=mega_job.out
#SBATCH --tasks=3
#SBATCH --array=0-5
#SBATCH --partition=cuda.q
echo "STARTING"
srun echo "hello world" >> file_${SLURM_ARRAY_TASK_ID}.out
echo "ENDING"


I always get this output:
STARTING
STARTING
cpu-bind=MASK - aoclsd, task  0  0 [14072]: mask 0xc00c00 set
STARTING
cpu-bind=MASK - aoclsd, task  0  0 [14080]: mask 0xc00c set
cpu-bind=MASK - aoclsd, task  0  0 [14081]: mask 0x30030 set
cpu-bind=MASK - aoclsd, task  1  1 [14136]: mask 0xc00c0 set
STARTING
cpu-bind=MASK - aoclsd, task  0  0 [14144]: mask 0x3003 set
cpu-bind=MASK - aoclsd, task  1  1 [14145]: mask 0x3003 set
cpu-bind=MASK - aoclsd, task  2  2 [14150]: mask 0x3003 set
cpu-bind=MASK - aoclsd, task  2  2 [14158]: mask 0x300300 set
cpu-bind=MASK - aoclsd, task  2  2 [14137]: mask 0xc00c0 set
cpu-bind=MASK - aoclsd, task  0  0 [14135]: mask 0xc00c0 set
STARTING
STARTING
cpu-bind=MASK - aoclsd, task  0  0 [14156]: mask 0x300300 set
cpu-bind=MASK - aoclsd, task  1  1 [14157]: mask 0x300300 set
cpu-bind=MASK - aoclsd, task  2  2 [14175]: mask 0xc00c00 set
cpu-bind=MASK - aoclsd, task  1  1 [14174]: mask 0xc00c00 set
cpu-bind=MASK - aoclsd, task  0  0 [14173]: mask 0xc00c00 set
cpu-bind=MASK - aoclsd, task  1  1 [14197]: mask 0xc00c set
cpu-bind=MASK - aoclsd, task  0  0 [14196]: mask 0xc00c set
cpu-bind=MASK - aoclsd, task  2  2 [14198]: mask 0xc00c set
cpu-bind=MASK - aoclsd, task  0  0 [14206]: mask 0x30030 set
cpu-bind=MASK - aoclsd, task  1  1 [14207]: mask 0x30030 set
cpu-bind=MASK - aoclsd, task  2  2 [14208]: mask 0x30030 set
ENDING
ENDING
ENDING
ENDING
ENDING
ENDING

As you can see, some "cpu-bind=MASK" lines appear that I would like to prevent 
from showing up in the output file. I have reviewed the configuration files, and 
here are the main lines from my slurm.conf file:
ControlMachine=my_server
ProctrackType=proctrack/linuxproc
AuthType=auth/munge
SwitchType=switch/none
TaskPlugin=task/none,task/affinity,task/cgroup
TaskPluginParam=none
DebugFlags=NO_CONF_HASH,BackfillMap,SelectType,Steps,TraceJobs
PropagateResourceLimits=ALL
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completions
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=my_server
AccountingStorageLoc=/var/log/slurm/slurm_job_accounting.txt
GresTypes=gpu
KillOnBadExit=1
OverTimeLimit=2
TCPTimeout=5
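
Is the fix perhaps just to request quiet CPU binding explicitly on the srun line? 
Something like this is what I had in mind (only a guess on my part; I don't know 
whether these lines come from verbose CPU-bind reporting or from something else 
in my configuration):
# same srun line as before, but asking for quiet binding
srun --cpu-bind=quiet echo "hello world" >> file_${SLURM_ARRAY_TASK_ID}.out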


I would appreciate any help with this.

Thanks.


[slurm-users] Unconfigured GPUs being allocated

2023-06-27 Thread Wilson, Steven M
Hi,

I manually configure the GPUs in our Slurm configuration (AutoDetect=off in 
gres.conf) and everything works fine when all the GPUs in a node are configured 
in gres.conf and available to Slurm.  But we have some nodes where a GPU is 
reserved for running the display and is specifically not configured in 
gres.conf.  In these cases, Slurm includes this unconfigured GPU and makes it 
available to Slurm jobs.  Using a simple Slurm job that executes "nvidia-smi 
-L", it will display the unconfigured GPU along with as many configured GPUs as 
requested by the job.
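
For reference, the test job is essentially just the following (a minimal sketch 
of what I am running; the job name here is made up):
#!/bin/bash
#SBATCH --job-name=gpu_check
#SBATCH --gres=gpu:1
nvidia-smi -L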

For example, in a node configured with this line in slurm.conf:
NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
and this line in gres.conf:
Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
I will get the following results from a job running "nvidia-smi -L" that 
requested a single GPU:
GPU 0: NVIDIA GeForce GT 710 (UUID: 
GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: 
GPU-0dc4da58-5026-6173-1156-c4559a268bf5)

But in another node that has all GPUs configured in Slurm like this in 
slurm.conf:
NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 Gres=gpu:TITANX:2
and this line in gres.conf:
Nodename=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
I get the expected results from the job running "nvidia-smi -L" that requested 
a single GPU:
GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)

I'm running Slurm 22.05.5.

Thanks in advance for any suggestions to help correct this problem!

Steve


Re: [slurm-users] Backfill Scheduling

2023-06-27 Thread Reed Dier
> On Jun 27, 2023, at 1:10 AM, Loris Bennett  wrote:
> 
> Hi Reed,
> 
>> Reed Dier <reed.d...@focusvq.com> writes:
> 
>> Is this an issue with the relative FIFO nature of the priority scheduling 
>> currently with all of the other factors disabled,
>> or since my queue is fairly deep, is this due to bf_max_job_test being
>> the default 100, and it can’t look deep enough into the queue to find
>> a job that will fit into what is unoccupied?
> 
> It could be that bf_max_job_test is too low.  On our system some users
> think it is a good idea to submit lots of jobs with identical resource
> requirements by writing a loop around sbatch.  Such jobs will exhaust
> the bf_max_job_test very quickly.  Thus we increased the limit to 1000
> and try to persuade users to use job arrays instead of home-grown loops.
> This seems to work OK[1].
> 
> Cheers,
> 
> Loris
> 
> -- 
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin


Thanks Loris,
I think this will be the next knob to turn, and it gives me a bit more confidence 
in that direction, as we too have many such identical jobs.
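
If I understand correctly, that just means bumping the backfill test depth via 
SchedulerParameters in slurm.conf, something along these lines (a guess on my 
part; any existing SchedulerParameters options would stay in place):
# sketch: raise the backfill queue depth from the default of 100
SchedulerParameters=bf_max_job_test=1000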

> On Jun 26, 2023, at 9:10 PM, Brian Andrus  wrote:
> 
> Reed,
> 
> You may want to look at the timelimit aspect of the job(s).
> 
> For one to 'squeeze in', it needs to be able to finish before the resources 
> in use are expected to become available.
> 
> Consider:
> Job A is running on 2 nodes of a 3 node cluster. It will finish in 1 hour.
> Pending job B will run for 2 hours and needs 2 nodes, but only 1 is free, so it 
> waits.
> Pending job C (with a lower priority) needs 1 node for 2 hours. Hmm, well it 
> won't finish before the time job B is expected to start, so it waits.
> Pending job D (with even lower priority) needs 1 node for 30 minutes. That 
> can squeeze in before the additional node for Job B is expected to be 
> available, so it runs on the idle node.
> 
> Brian Andrus


Thanks Brian,

Our layout is a bit less exciting, in that none of these are >1 node per job.
So blocking out nodes for job:node Tetris isn’t really at play here.
The timing, however, is something I may turn an eye towards.
Most jobs have a “sanity” time limit applied, in that it is not so much an 
expected time limit, but rather an “if it goes this long, something obviously 
went awry and we shouldn’t keep holding on to resources” limit.
So it’s a bit hard to quantify the timing portion, but I haven’t looked into 
Slurm’s guesses of when it thinks the next job will start, etc.

The pretty simplistic example at play here is that there are nodes that are 
~50-60% loaded for CPU and memory.
The next job up is a “whale” job that wants a ton of resources, CPU and/or 
memory, but down the line there is a job with 2 CPUs and 2 GB of memory that 
can easily slot into the unused resources.

So my thinking was that the job_test list may be too short to actually get that 
far down the queue to see that it could shove that job into some holes.

I’ll report back any findings after testing Loris’s suggestions.

Appreciate everyone’s help and suggestions,
Reed



Re: [slurm-users] Backfill Scheduling

2023-06-27 Thread Loris Bennett
Hi Reed,

Reed Dier  writes:

>  On Jun 27, 2023, at 1:10 AM, Loris Bennett  
> wrote:
>
>  Hi Reed,
>
>  Reed Dier  writes:
>
>  Is this an issue with the relative FIFO nature of the priority scheduling 
> currently with all of the other factors disabled,
>  or since my queue is fairly deep, is this due to bf_max_job_test being
>  the default 100, and it can’t look deep enough into the queue to find
>  a job that will fit into what is unoccupied?
>
>  It could be that bf_max_job_test is too low.  On our system some users
>  think it is a good idea to submit lots of jobs with identical resource
>  requirements by writing a loop around sbatch.  Such jobs will exhaust
>  the bf_max_job_test very quickly.  Thus we increased the limit to 1000
>  and try to persuade users to use job arrays instead of home-grown loops.
>  This seems to work OK[1].
>
>  Cheers,
>
>  Loris
>
>  -- 
>  Dr. Loris Bennett (Herr/Mr)
>  ZEDAT, Freie Universität Berlin
>
> Thanks Loris,
> I think this will be the next knob to turn, and it gives me a bit more confidence 
> in that direction, as we too have many such identical jobs.
>
>  On Jun 26, 2023, at 9:10 PM, Brian Andrus  wrote:
>
>  Reed,
>
>  You may want to look at the timelimit aspect of the job(s).
>
>  For one to 'squeeze in', it needs to be able to finish before the resources 
> in use are expected to become available.
>
>  Consider:
>  Job A is running on 2 nodes of a 3 node cluster. It will finish in 1 hour.
>  Pending job B will run for 2 hours and needs 2 nodes, but only 1 is free, so it 
> waits.
>  Pending job C (with a lower priority) needs 1 node for 2 hours. Hmm, well it 
> won't finish before the time job B is expected to start, so it waits.
>  Pending job D (with even lower priority) needs 1 node for 30 minutes. That 
> can squeeze in before the additional node for Job B is expected to be
>  available, so it runs on the idle node.
>
>  Brian Andrus
>
> Thanks Brian,
>
> Our layout is a bit less exciting, in that none of these are >1 node per job.
> So blocking out nodes for job:node Tetris isn’t really at play here.
> The timing, however, is something I may turn an eye towards.
> Most jobs have a “sanity” time limit applied, in that it is not so much an 
> expected time limit, but rather an “if it goes this long, something obviously 
> went awry and we shouldn’t keep holding on to resources” limit.
> So it’s a bit hard to quantify the timing portion, but I haven’t looked into 
> Slurm’s guesses of when it thinks the next job will start, etc.
>
> The pretty simplistic example at play here is that there are nodes that are 
> ~50-60% loaded for CPU and memory.
> The next job up is a “whale” job that wants a ton of resources, CPU and/or 
> memory, but down the line there is a job with 2 CPUs and 2 GB of memory
> that can easily slot into the unused resources.
>
> So my thinking was that the job_test list may be too short to actually get 
> that far down the queue to see that it could shove that job into some holes.

You might also want to look at increasing bf_window to the maximum time
limit, as suggested in 'man slurm.conf'.  If backfill is not looking far
enough into the future to know whether starting a job early will
negatively impact a 'whale', then that 'whale' could potentially wait
indefinitely.  This is what happened on our system when we had a maximum
runtime of 14 days but the 1 day default for bf_window.  With both set
to 14 days the problem was solved.
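
For example, with a 14-day maximum runtime that would be something like the 
following in slurm.conf (a sketch; bf_window is specified in minutes, and any 
other SchedulerParameters options would be kept as well):
# 14 days = 20160 minutes; also test more jobs per backfill cycle
SchedulerParameters=bf_window=20160,bf_max_job_test=1000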

Cheers,

Loris

> I’ll report back any findings after testing Loris’s suggestions.
>
> Appreciate everyone’s help and suggestions,
> Reed
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin