Curious — could you say something about how you ended up with page pool values that high on the client side? For what use cases does 64 GB, for example, make a difference?
--
#BlackLivesMatter
 ____ ||    \\UTGERS,   |---------------------------*O*---------------------------
 ||_// the State        | Ryan Novosielski - novos...@rutgers.edu
 || \\ University       | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ        | Office of Advanced Research Computing - MSB A555B, Newark
      `'

On Mar 8, 2024, at 11:32, Wahl, Edward <ew...@osc.edu> wrote:

Yikes! Those must be some mighty large memory compute nodes! That is an OK setting for a large-memory ESS/DSS server, but NOT for the compute nodes at my site, as that value is in bytes (so ~324 GB). Even on our 1 TB+ memory machines we do not tune it that high.

You can set pagepool per nodeclass, for example for all of your compute nodes, but pagepool is one of those settings where you will have to restart the clients for it to take effect (as with most of the RDMA settings, etc.). You should look into creating a "nodeclass" for each of your "node types" if you have not already, so you can avoid OOM issues from the pagepool alone, and tune other settings per node type (RDMA/network settings, etc.).

I would address this here, rather than on the Slurm side. Then you can advertise (total memory minus the pagepool) to Slurm as the overall memory addressable by user jobs. Leave some spare memory for the system itself, or you will see more memory issues and whatnot when users get close to OOM, even in their cgroup.

Example from a cross-mounted compute-side cluster. Default is 1GB:

[root@nostorage-manager1 ~]# mmlsconfig pagepool
pagepool 1024M
pagepool 4G [k8,pitzer]
pagepool 64G [ascend]
pagepool 16G [ib-spire-login,owenslogin,pitzerlogin]
pagepool 48G [dm]
pagepool 4G [cardinal]
pagepool 64G [cardinal_quadport]

Example from the ESS/DSS server side. Later ESS versions set things by mmvdisk groups, rather than by server type:
# mmlsconfig pagepool
pagepool 32G
pagepool 358G [gss_ppc64]
pagepool 16384M [ibmems11-hs,ems]
pagepool 324383477760 [ess3200_mmvdisk_ibmessio13_hs_ibmessio14_hs,ess3200_mmvdisk_ibmessio15_hs_ibmessio16_hs,ess3200_mmvdisk_ibmessio17_hs_ibmessio18_hs]
pagepool 64G [sp]
pagepool 384399572992 [ibmgssio1_hsibmgssio2_hs,ibmgssio3_hsibmgssio4_hs,ibmgssio5_hsibmgssio6_hs]
pagepool 573475966156 [ess5k_mmvdisk_ibmessio11_hs_ibmessio12_hs]
pagepool 96G [ces]

Example of nodeclasses used to address other settings, such as which InfiniBand port(s) to use:

# mmlsconfig verbsports
verbsPorts mlx5_0
verbsPorts mlx5_0 mlx5_2 [pitzer_dualport]
verbsPorts mlx4_1/1 mlx4_1/2 [dm]
verbsPorts mlx5_0 mlx5_2 [k8_dualport]
verbsPorts mlx5_0 mlx5_1 mlx5_2 mlx5_3 [cardinal_quadport]

Ed Wahl
Ohio Supercomputer Center

From: gpfsug-discuss <gpfsug-discuss-boun...@gpfsug.org> On Behalf Of Iban Cabrillo
Sent: Friday, March 8, 2024 9:40 AM
To: gpfsug-discuss <gpfsug-disc...@spectrumscale.org>
Subject: [gpfsug-discuss] pagepool

Good afternoon,

We are new to DSS system configuration. Reviewing the configuration, I have seen that the default pagepool is set to this value:

pagepool 323908133683

and not only on the DSS servers, but also on the rest of the HPC nodes, and I don't know if it is an excessive value. We are noticing that some jobs are dying with "Memory cgroup out of memory: Killed process XXX", and my doubt is whether this pagepool is reserving too much memory for the mmfs process, to the detriment of job execution.
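[Editor's sketch] For readers wondering how large that raw pagepool value actually is: mmlsconfig reports it in bytes when no unit suffix was used, and a quick conversion (plain Python, using only the value quoted above) makes it readable:

```python
# Convert the raw pagepool value (bytes) into decimal and binary units.
pagepool_bytes = 323908133683

gb = pagepool_bytes / 10**9    # decimal gigabytes
gib = pagepool_bytes / 2**30   # binary gibibytes

print(f"{gb:.1f} GB")    # ~323.9 GB, matching Ed's "~324 GB"
print(f"{gib:.1f} GiB")  # ~301.7 GiB
```

That is roughly a third of a terabyte pinned by the GPFS daemon on every node carrying this setting, which would readily explain cgroup OOM kills on compute nodes with less headroom.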
Any advice is welcomed,
Regards, I
--
================================================================
Ibán Cabrillo Bartolomé
Instituto de Física de Cantabria (IFCA-CSIC)
Santander, Spain
Tel: +34942200969/+34669930421
Responsible for advanced computing service (RSC)
=========================================================================================
All our suppliers must know and accept IFCA policy available at:
https://confluence.ifca.es/display/IC/Information+Security+Policy+for+External+Suppliers
==========================================================================================
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
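[Editor's sketch] Putting Ed's advice together for anyone following along: the node names and class name below are hypothetical, the mm* commands are the standard Spectrum Scale administration commands, and the restart is included because pagepool does not take effect on running clients. This is a sketch of the approach, not something to paste blindly into a production cluster:

```shell
# Group the compute nodes into a nodeclass (hypothetical node names).
mmcrnodeclass compute -N node001,node002,node003

# Give that class a modest client-side pagepool instead of the ~324 GB default.
mmchconfig pagepool=4G -N compute

# pagepool changes require a daemon restart on the affected nodes.
mmshutdown -N compute
mmstartup -N compute

# Verify the per-class override.
mmlsconfig pagepool
```

On the Slurm side, the memory advertised for jobs would then be roughly total RAM minus the pagepool minus an OS reserve; for example (illustrative numbers), a 512 GB node with a 4 GB pagepool might advertise something like `RealMemory=500000` (MB) in slurm.conf, keeping some headroom for the system itself.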