[slurm-users] OverSubscribe=YES:4 starting 5 jobs

2023-01-13 Thread Richard Ems
Hi all, We configured a partition with OverSubscribe=YES:4, expecting that partition to start a maximum of 4 jobs. But we see that 5 jobs get started on a node. We also use --mem=34G, and since most nodes have 192G, 5 jobs would fit, but we still want only 4 jobs to start. Setting a higher mem value
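
A minimal sketch of the partition setup being described (node and partition names and RealMemory are hypothetical; only OverSubscribe=YES:4 and the --mem=34G request come from the post):

    # slurm.conf (sketch, hypothetical names)
    NodeName=node[01-10] RealMemory=192000
    PartitionName=batch Nodes=node[01-10] OverSubscribe=YES:4 Default=YES

    # jobs submitted roughly like:
    sbatch --mem=34G job.sh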

Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-13 Thread Cristóbal Navarro
Many thanks Rodrigo and Daniel. Indeed I misunderstood that part of Slurm, so thanks for clarifying this aspect; now it makes a lot of sense. Regarding the approach, I went with the cgroup.conf approach as suggested by both. I will start doing some synthetic tests to make sure the job gets killed
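
For reference, the cgroup-based memory enforcement mentioned here is usually configured along these lines (a sketch with illustrative values, not the poster's actual files):

    # slurm.conf
    TaskPlugin=task/cgroup
    ProctrackType=proctrack/cgroup

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedSwapSpace=0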

Re: [slurm-users] Cannot enable Gang scheduling

2023-01-13 Thread Helder Daniel
Thanks for all your help, Kevin. I really did miss the OverSubscribe option in the docs :-( But now CPU job scheduling is working, and I have a picture of the problem with GPU job scheduling to dig into further :-) On Fri, 13 Jan 2023 at 13:01, Kevin Broch wrote: > Sorry to hear that. Hopefully

Re: [slurm-users] Cannot enable Gang scheduling

2023-01-13 Thread Kevin Broch
Sorry to hear that. Hopefully others in the group have some ideas/explanations. I haven't had to deal with GPU resources in Slurm. On Fri, Jan 13, 2023 at 4:51 AM Helder Daniel wrote: > Oh, ok. > I guess I was expecting that the GPU job was suspended copying GPU memory > to RAM memory. > > I

Re: [slurm-users] Cannot enable Gang scheduling

2023-01-13 Thread Helder Daniel
Oh, ok. I guess I was expecting that the GPU job would be suspended by copying GPU memory to RAM. I also tried REQUEUE,GANG and CANCEL,GANG. None of these options seems able to preempt GPU jobs. On Fri, 13 Jan 2023 at 12:30, Kevin Broch wrote: > My guess, is that this isn't possible with
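
For context, the preemption modes being tried map to slurm.conf settings roughly like the sketch below (values illustrative; only the mode names come from the message):

    # slurm.conf (illustrative)
    PreemptType=preempt/partition_prio
    PreemptMode=SUSPEND,GANG     # also tried: REQUEUE,GANG and CANCEL,GANG
    SchedulerTimeSlice=30        # seconds per gang-scheduling time slice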

Re: [slurm-users] Cannot enable Gang scheduling

2023-01-13 Thread Helder Daniel
PS: I checked the resources while running the 3 GPU jobs, which were launched with: sbatch --gpus-per-task=2 --cpus-per-task=1 cnn-multi.sh The server has 64 cores (32 x 2 with hyperthreading; cat /proc/cpuinfo | grep processor | tail -n1 shows processor : 63) and 128 GB main memory:
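
Commands along these lines can confirm what Slurm itself sees on the node (node name asimov taken from later messages in the thread; output omitted):

    cat /proc/cpuinfo | grep processor | tail -n1   # last logical CPU index, 63 => 64 threads
    free -h                                         # total RAM
    scontrol show node asimov | grep -E 'CPUTot|RealMemory|Gres'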

Re: [slurm-users] Cannot enable Gang scheduling

2023-01-13 Thread Kevin Broch
My guess is that this isn't possible with GANG,SUSPEND. GPU memory isn't managed in Slurm, so the idea of suspending GPU memory for another job to use the rest simply isn't possible. On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel wrote: > Hi Kevin > > I did a "scontrol show partition". >

Re: [slurm-users] Cannot enable Gang scheduling

2023-01-13 Thread Helder Daniel
Hi Kevin, I did a "scontrol show partition". OverSubscribe was not enabled. I enabled it in slurm.conf with:
(...)
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES
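
For GPUs to be usable as a GRES, GresTypes=gpu in slurm.conf is normally paired with a gres.conf on the node; a minimal sketch (device paths assumed, not shown in the thread):

    # gres.conf on asimov (sketch; device files assumed)
    Name=gpu File=/dev/nvidia[0-3]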

Re: [slurm-users] Cannot enable Gang scheduling

2023-01-13 Thread Kevin Broch
The problem might be that OverSubscribe is not enabled? Without it, I don't believe the time-slicing can be GANG scheduled. Can you do a "scontrol show partition" to verify that it is? On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel wrote: > Hi, > > I am trying to enable gang scheduling on a server with
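
The suggested check is simply the following (partition name taken from a later message; the OverSubscribe field appears in the command's output):

    scontrol show partition asimov01
    # look for the OverSubscribe= field, e.g. OverSubscribe=NO vs OverSubscribe=FORCE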