Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory

2023-11-01 Thread Rémi Palancher
Hello Gérard,

> On 30/10/2023 15:46, Gérard Henry (AMU) wrote:
>> Hello all,
>> …
>> when it fails, sacct gives the following information:
>> JobID         JobName   Elapsed  NCPUS   TotalCPU   CPUTime  ReqMem     MaxRSS MaxDiskRead MaxDiskWrite      State ExitCode
>> ------------ -------- --------- ------ ---------- --------- ------ ---------- ----------- ------------ ---------- --------
>> 8500578      analyse5  00:03:04     60   02:57:58  03:04:00     9M                                     OUT_OF_ME+    0:125
>> 8500578.bat+ batch     00:03:04     16  46:34.302  00:49:04        21465736K       0.23M        0.01M  OUT_OF_ME+    0:125
>> 8500578.0    orted     00:03:05     44   02:11:24  02:15:40           40952K       0.42M        0.03M   COMPLETED      0:0
>>
>> I don't understand why MaxRSS=21GB leads to "out of memory" with 16 CPUs
>> and 1500M per CPU (24GB)

Due to job accounting sampling intervals, tasks whose memory consumption 
increases quickly might not be accurately reported by `sacct`. The default 
JobAcctGatherFrequency is 30 seconds, so your batch step may have reached 
its limit within the 30-second window following the 21GB measurement.
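
If you want finer-grained sampling, you can check the current interval 
with scontrol and, assuming you administer the cluster, lower it in 
slurm.conf (the 5-second value below is only an example):

    # show the current accounting sampling interval
    scontrol show config | grep JobAcctGatherFrequency

    # in slurm.conf: sample task memory every 5 seconds instead of 30
    JobAcctGatherFrequency=task=5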

You can probably retrieve the exact memory consumption from the nodes' 
kernel logs at the time the tasks were killed.
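
For example, on the node that ran the batch step, something along these 
lines should show the OOM killer messages (exact wording depends on the 
kernel and on whether the node uses journald):

    # look for OOM killer messages around the time the job died
    dmesg -T | grep -i -E 'out of memory|killed process'

    # or, on a systemd-based node:
    journalctl -k --since "2023-10-30" | grep -i oom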

On 30/10/2023 at 15:53, Gérard Henry wrote:
 > if I try to request just nodes and memory, for instance:
 > #SBATCH -N 2
 > #SBATCH --mem=0
 > to request all memory on a node, and 2 nodes seem sufficient for a
 > program that consumes 100GB, I got this error:
 > sbatch: error: CPU count per node can not be satisfied
 > sbatch: error: Batch job submission failed: Requested node configuration
 > is not available

Do you have MaxMemPerCPU set on the cluster or on the partition? If this 
value is too low, it can make the job fail on the CPU count limit.
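
You can check both levels with scontrol (the partition name below is a 
placeholder):

    scontrol show config | grep -i MaxMemPerCPU
    scontrol show partition <partition_name> | grep -i MemPerCPU

As far as I know, when a job asks for more memory per CPU than 
MaxMemPerCPU allows (which --mem=0 typically does), Slurm increases the 
job's CPU count to keep the memory-per-CPU under the limit, and that 
inflated CPU count can exceed what the nodes offer, hence the "CPU count 
per node can not be satisfied" error.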

-- 
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/




Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory

2023-10-30 Thread AMU

if I try to request just nodes and memory, for instance:
#SBATCH -N 2
#SBATCH --mem=0
to request all memory on a node, and 2 nodes seem sufficient for a 
program that consumes 100GB, I got this error:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration 
is not available


thanks

On 30/10/2023 15:46, Gérard Henry (AMU) wrote:

Hello all,


I can't get the Slurm script configured correctly. My program needs 100GB 
of memory; that is the only requirement. But the job always fails with an 
out-of-memory error.

Here's the cluster configuration I'm using:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

partition:
DefMemPerCPU=5770 MaxMemPerCPU=5778
TRES=cpu=5056,mem=3002M,node=158
for each node: CPUAlloc=32 RealMemory=19 AllocMem=184640

my script contains:
#SBATCH -N 5
#SBATCH --ntasks=60
#SBATCH --mem-per-cpu=1500M
#SBATCH --cpus-per-task=1
...
mpirun ../zsimpletest_analyse

when it fails, sacct gives the following information:
JobID         JobName   Elapsed  NCPUS   TotalCPU   CPUTime  ReqMem     MaxRSS MaxDiskRead MaxDiskWrite      State ExitCode
------------ -------- --------- ------ ---------- --------- ------ ---------- ----------- ------------ ---------- --------
8500578      analyse5  00:03:04     60   02:57:58  03:04:00     9M                                     OUT_OF_ME+    0:125
8500578.bat+ batch     00:03:04     16  46:34.302  00:49:04        21465736K       0.23M        0.01M  OUT_OF_ME+    0:125
8500578.0    orted     00:03:05     44   02:11:24  02:15:40           40952K       0.42M        0.03M   COMPLETED      0:0


I don't understand why MaxRSS=21GB leads to "out of memory" with 16 CPUs 
and 1500M per CPU (24GB)


Can anybody help?

thanks in advance



--
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue 
Escadrille Normandie Niemen, 13013 Marseille

Site : https://fresnel.fr/
To help protect the environment, please print this email only if 
necessary.



