Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory
Hello Gérard,

> On 30/10/2023 15:46, Gérard Henry (AMU) wrote:
>> Hello all,
>> …
>> when it fails, sacct gives the following information:
>>
>> JobID         JobName   Elapsed  NCPUS    TotalCPU   CPUTime  ReqMem     MaxRSS  MaxDiskRead  MaxDiskWrite       State  ExitCode
>> ------------ --------  -------- ------  ----------  -------- -------  ---------  -----------  ------------  ----------  --------
>> 8500578      analyse5  00:03:04     60    02:57:58  03:04:00      9M                                        OUT_OF_ME+     0:125
>> 8500578.bat+    batch  00:03:04     16   46:34.302  00:49:04         21465736K        0.23M         0.01M   OUT_OF_ME+     0:125
>> 8500578.0       orted  00:03:05     44    02:11:24  02:15:40            40952K        0.42M         0.03M    COMPLETED       0:0
>>
>> I don't understand why MaxRSS ≈ 21GB leads to "out of memory" with 16
>> CPUs and 1500M per CPU (24GB).

Due to job accounting sampling intervals, tasks whose memory consumption increases quickly may not be reported accurately by `sacct`. The default JobAcctGatherFrequency is 30 seconds, so your batch step may have reached its limit within the 30-second window following the 21GB sample. You can probably retrieve the exact memory consumption from the nodes' kernel logs (the OOM killer messages) from when the tasks were killed.

On 30/10/2023 at 15:53, Gérard Henry wrote:
> If I try to request just nodes and memory, for instance:
>
> #SBATCH -N 2
> #SBATCH --mem=0
>
> to request all the memory on a node, and 2 nodes seem sufficient for a
> program that consumes 100GB, I get this error:
>
> sbatch: error: CPU count per node can not be satisfied
> sbatch: error: Batch job submission failed: Requested node configuration is not available

Do you have a MaxMemPerCPU set on the cluster or on the partition? If that value is too low, it could make the job fail on the CPU count limit.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/
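[Editor's note] Rémi's MaxMemPerCPU point can be checked with a little arithmetic: when a job's memory per allocated CPU would exceed MaxMemPerCPU, Slurm raises the job's CPU count until memory/CPUs fits under the limit, and submission fails if that count cannot be satisfied on the node. A minimal sketch, assuming 184640 MB of usable memory per node (taken from the AllocMem figure quoted in the thread; the actual RealMemory value is garbled in the archive) and the partition's MaxMemPerCPU=5778:

```shell
# Hypothetical numbers: 184640 MB per node (assumption), MaxMemPerCPU=5778 MB.
mem_per_node_mb=184640
max_mem_per_cpu_mb=5778

# To satisfy --mem=0 (whole-node memory), Slurm needs
# cpus = ceil(mem / MaxMemPerCPU) CPUs on that node.
cpus_needed=$(( (mem_per_node_mb + max_mem_per_cpu_mb - 1) / max_mem_per_cpu_mb ))
echo "CPUs implied per node: $cpus_needed"
```

With these (assumed) numbers the whole-node memory request implies 32 CPUs per node, so any node with even one core already allocated cannot satisfy it, which would match the "CPU count per node can not be satisfied" error.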
Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory
If I try to request just nodes and memory, for instance:

#SBATCH -N 2
#SBATCH --mem=0

to request all the memory on a node, and 2 nodes seem sufficient for a program that consumes 100GB, I get this error:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

thanks

On 30/10/2023 15:46, Gérard Henry (AMU) wrote:
> Hello all,
>
> I can't configure the Slurm script correctly. My program needs 100GB of
> memory; that is the only requirement, but the job always fails with an
> out-of-memory error.
>
> Here's the cluster configuration I'm using:
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
>
> partition: DefMemPerCPU=5770 MaxMemPerCPU=5778
> TRES=cpu=5056,mem=3002M,node=158
>
> for each node: CPUAlloc=32 RealMemory=19 AllocMem=184640
>
> my script contains:
>
> #SBATCH -N 5
> #SBATCH --ntasks=60
> #SBATCH --mem-per-cpu=1500M
> #SBATCH --cpus-per-task=1
> ...
> mpirun ../zsimpletest_analyse
>
> when it fails, sacct gives the following information:
>
> JobID         JobName   Elapsed  NCPUS    TotalCPU   CPUTime  ReqMem     MaxRSS  MaxDiskRead  MaxDiskWrite       State  ExitCode
> ------------ --------  -------- ------  ----------  -------- -------  ---------  -----------  ------------  ----------  --------
> 8500578      analyse5  00:03:04     60    02:57:58  03:04:00      9M                                        OUT_OF_ME+     0:125
> 8500578.bat+    batch  00:03:04     16   46:34.302  00:49:04         21465736K        0.23M         0.01M   OUT_OF_ME+     0:125
> 8500578.0       orted  00:03:05     44    02:11:24  02:15:40            40952K        0.42M         0.03M    COMPLETED       0:0
>
> I don't understand why MaxRSS ≈ 21GB leads to "out of memory" with 16
> CPUs and 1500M per CPU (24GB).
>
> Can anybody help? Thanks in advance.
>
> --
> Gérard HENRY
> Institut Fresnel - UMR 7249
> +33 413945457
> Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue
> Escadrille Normandie Niemen, 13013 Marseille
> Site: https://fresnel.fr/
> To respect the environment, please print this email only if necessary.
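[Editor's note] One hedged way around that "CPU count per node can not be satisfied" error, assuming it really is caused by a low MaxMemPerCPU on the partition: request the CPUs explicitly so the per-CPU share of the node's memory stays under the limit. The core count below is an assumption (the thread's CPUAlloc=32 suggests 32-core nodes), not a confirmed figure, and this is a sketch rather than a verified fix:

```shell
#!/bin/sh
#SBATCH -N 2
#SBATCH --ntasks-per-node=32   # assumption: 32-core nodes; keeps mem/CPU under MaxMemPerCPU
#SBATCH --mem=0                # all the memory on each allocated node

mpirun ../zsimpletest_analyse
```

If MaxMemPerCPU is the real constraint, asking the administrators whether it can be raised is the cleaner solution.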
[slurm-users] how to configure correctly node and memory when a script fails with out of memory
Hello all,

I can't configure the Slurm script correctly. My program needs 100GB of memory; that is the only requirement, but the job always fails with an out-of-memory error.

Here's the cluster configuration I'm using:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

partition: DefMemPerCPU=5770 MaxMemPerCPU=5778
TRES=cpu=5056,mem=3002M,node=158

for each node: CPUAlloc=32 RealMemory=19 AllocMem=184640

my script contains:

#SBATCH -N 5
#SBATCH --ntasks=60
#SBATCH --mem-per-cpu=1500M
#SBATCH --cpus-per-task=1
...
mpirun ../zsimpletest_analyse

when it fails, sacct gives the following information:

JobID         JobName   Elapsed  NCPUS    TotalCPU   CPUTime  ReqMem     MaxRSS  MaxDiskRead  MaxDiskWrite       State  ExitCode
------------ --------  -------- ------  ----------  -------- -------  ---------  -----------  ------------  ----------  --------
8500578      analyse5  00:03:04     60    02:57:58  03:04:00      9M                                        OUT_OF_ME+     0:125
8500578.bat+    batch  00:03:04     16   46:34.302  00:49:04         21465736K        0.23M         0.01M   OUT_OF_ME+     0:125
8500578.0       orted  00:03:05     44    02:11:24  02:15:40            40952K        0.42M         0.03M    COMPLETED       0:0

I don't understand why MaxRSS ≈ 21GB leads to "out of memory" with 16 CPUs and 1500M per CPU (24GB).

Can anybody help? Thanks in advance.

--
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue Escadrille Normandie Niemen, 13013 Marseille
Site: https://fresnel.fr/
To respect the environment, please print this email only if necessary.
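[Editor's note] The arithmetic in the question checks out against the sacct output. A small sketch (MB values derived from the quoted figures):

```shell
# Batch step: 16 CPUs at 1500 MB per CPU, vs. the last sampled MaxRSS.
cpus=16
mem_per_cpu_mb=1500
limit_mb=$(( cpus * mem_per_cpu_mb ))   # memory limit of the batch step
maxrss_kb=21465736                      # MaxRSS reported by sacct
maxrss_mb=$(( maxrss_kb / 1024 ))
echo "limit=${limit_mb}MB sampled_peak=${maxrss_mb}MB"
```

The sampled peak (about 20962 MB) is indeed below the 24000 MB limit, so the kill is consistent with the step crossing the limit between two accounting samples rather than with the reported MaxRSS itself, as the reply earlier in the thread explains.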