Ashton, on a compute node with 256 GB of RAM I would not configure any swap at all. None. I managed an SGI UV1 machine at an F1 team which had 1 TB of RAM - and no swap. Our ICE clusters were diskless too - SGI very smartly configured swap over iSCSI - but we disabled it, the reasoning being that if one node in a job starts swapping, the likelihood is that all the nodes are swapping, and things turn to treacle from there.

Another issue: if you have lots of RAM you need to look at the VM tunings for dirty ratio, background ratio and the centisecs settings. Linux will aggressively cache data which is written to disk - you can get into a situation where your processes THINK the data is on disk when it is actually still cached, and then what happens if there is a power loss? So get those caches flushed often.
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
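For anyone following along, a quick sketch of what inspecting and adjusting those writeback tunings looks like - the specific numbers below are illustrative assumptions on my part, not values from the article, so tune for your own workload:

```shell
# Inspect the current writeback tunings (standard Linux sysctl keys).
sysctl vm.dirty_ratio vm.dirty_background_ratio \
       vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

# Example of more aggressive flushing for a large-RAM node. The numbers
# are illustrative only. Put them in /etc/sysctl.d/90-writeback.conf to
# persist across reboots:
#   vm.dirty_background_ratio = 5
#   vm.dirty_ratio = 10
#   vm.dirty_expire_centisecs = 1500
#   vm.dirty_writeback_centisecs = 500
# Or apply at runtime:
#   sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10
```

On a 256 GB box even a small dirty_ratio is a lot of unflushed data - 10 percent is still ~25 GB - which is why the ratios that are fine on a desktop can be dangerous on big-memory nodes.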
Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously small on default Linux systems. I call this the 'wriggle room' when a system is short on RAM. Think of it like those square sliding-letter puzzles - min_free_kbytes is the empty square which permits the letter tiles to move. So look at your min_free_kbytes and increase it (if I'm not mistaken, on RHEL 7 and CentOS 7 systems it is a reasonable value already).
https://bbs.archlinux.org/viewtopic.php?id=184655

Oh, and it is good to keep a terminal open with 'watch cat /proc/meminfo'. I have spent many a happy hour staring at that when looking at NFS performance etc. etc.

Back to your specific case. My point is that for HPC work you should never go into swap (with a normally running process, i.e. no job pre-emption). I find that the 20 percent rule is out of date. Yes, you should probably have some swap on a workstation, and yes, disk space is cheap these days.

However, you do talk about job pre-emption and suspending/resuming jobs. I have never actually seen that used in production. At this point I would be grateful for some education from the choir - is this commonly used, and am I just hopelessly out of date? Honestly, anywhere I have managed systems, lower-priority jobs were either allowed to finish, or in the case of F1 we checkpointed and killed low-priority jobs manually if there was a super-high-priority job to run.

On Fri, 21 Sep 2018 at 22:34, A <andrealp...@gmail.com> wrote:
>
> I have a single node slurm config on my workstation (18 cores, 256 gb ram,
> 40 Tb disk space). I recently just extended the array size to its current
> config and am reconfiguring my LVM logical volumes.
>
> I'm curious on people's thoughts on swap sizes for a node. Redhat these
> days recommends up to 20% of ram size for swap size, but no less than 4 gb.
>
> But......according to slurm faq;
> "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT
> signals respectively, so swap and disk space should be sufficient to
> accommodate all jobs allocated to a node, either running or suspended."
>
> So I'm wondering if 20% is enough, or whether it should scale by the
> number of single jobs I might be running at any one time. E.g. if I'm
> running 10 jobs that all use 20 gb of ram, and I suspend, should I need
> 200 gb of swap?
>
> any thoughts?
>
> -ashton
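P.S. If you do decide to rely on Slurm suspend/resume, the FAQ's sizing rule really is just arithmetic over your worst case - a sketch using the job count and size from your own example:

```shell
#!/bin/sh
# Worst-case swap needed if every job allocated to the node could be
# suspended at once (per the Slurm FAQ, swap plus disk must be able to
# hold all jobs on the node, running or suspended).
jobs=10          # concurrently suspended jobs (from the example)
per_job_gb=20    # resident RAM per job, in GB (from the example)
swap_gb=$((jobs * per_job_gb))
echo "suspended jobs alone need ${swap_gb} GB of swap"
# prints: suspended jobs alone need 200 GB of swap
```

So yes - with suspend/resume the requirement scales with job count and size, and Redhat's 20 percent rule (about 51 GB on a 256 GB box) would not come close to covering ten suspended 20 GB jobs.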