On 08/03/2024 14:50, Jarsulic, Michael [BSD] wrote:

Ibán,

What are you using for your scheduler?

On my compute nodes, I am setting the pagepool to 16 GB and setting aside specialized memory for GPFS that will not be allocated to jobs.


What you would normally do is create a node class

mmcrnodeclass compute -N node001,node002,node003,node004,.....

then set the pagepool appropriately

mmchconfig pagepool=16G -i -N compute

We then use Slurm to cap the RAM a job can have on a node at physical RAM minus the pagepool size, minus a bit more for good measure to leave room for the OS.

If the OOM killer is still kicking in, then you need to reduce the RAM limit in Slurm some more until it stops.
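
As a rough sketch of that arithmetic (the function name and the example figures below are illustrative assumptions, not values from our cluster), something like this gives a starting point for the per-node job memory ceiling in MB. The result could then be fed into whatever Slurm setting you use for this, for example RealMemory or MemSpecLimit in slurm.conf, and lowered further if the OOM killer still fires.

```shell
#!/bin/sh
# Sketch: job memory ceiling = physical RAM - GPFS pagepool - OS headroom.
# All values are in MB. Name and example numbers are assumptions for illustration.
job_mem_limit_mb() {
    phys_mb=$1        # physical RAM on the node
    pagepool_mb=$2    # GPFS pagepool size (16G = 16384 MB)
    os_reserve_mb=$3  # "a bit more for good measure" for the OS
    limit=$((phys_mb - pagepool_mb - os_reserve_mb))
    if [ "$limit" -le 0 ]; then
        echo "error: reservations exceed physical RAM" >&2
        return 1
    fi
    echo "$limit"
}

# Example: hypothetical 192 GB node, 16 GB pagepool, 4 GB held back for the OS.
job_mem_limit_mb 196608 16384 4096   # prints 176128
```

The OS headroom is the knob to turn: increase it (which lowers the job limit) until the OOM killer stops kicking in.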

Note we also twiddle with some other limits for compute nodes

mmchconfig maxFilesToCache=8000 -N compute
mmchconfig maxStatCache=16000 -N compute

We have a slew of node classes where these settings are tweaked to account for the nodes' RAM and their role: dssg, compute, gpu, protocol, teaching, and login. All nodes belong to one or more node classes. Which reminds me, I need a gui node class now.


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


