Re: [slurm-users] swap size
Hi Chris,

On Mon, Sep 24, 2018 at 7:36 AM Christopher Samuel wrote:
> On 24/09/18 00:46, Raymond Wan wrote:
>
> > Hmm, I'm way out of my comfort zone but I am curious about what
> > happens. Unfortunately, I don't think I'm able to read kernel code, but
> > someone here
> > (https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel)
> > seems to suggest that SIGSTOP and SIGCONT move a process
> > between the runnable and waiting queues.
>
> SIGSTOP is a non-catchable signal that immediately stops a process from
> running, and so it will sit there until either resumed, killed or the
> system is rebooted. :-)
>
> It's like doing ^Z in the shell (which generates SIGTSTP) but isn't
> catchable via signal handlers, so you can't do anything about it (same
> as SIGKILL).
>
> Regarding memory, yes, its memory is still used until the process
> either resumes and releases it or is killed. This is why, if you want
> to do preemption in this mode, you'll want swap so that the kernel has
> somewhere to page out the memory it's using for the incoming
> process(es).

Ah!!! Yes, this clears things up for me -- thank you!

Somehow, I thought what you meant was that SLURM suspends a job and "immediately" saves its state. I then guessed that if SLURM could do that, the saved state ought to be outside of the main memory + swap space managed by the OS. But now I see what you mean. It's just using the signal mechanism provided by the OS. The job gets stopped but remains in main memory; that is, it doesn't "immediately" shift to swap space. Having more swap space, though, gives the suspended job somewhere to move to so that a job that needs the CPU can run.

Of course, if an HPC system has enough main memory to hold all suspended jobs plus whatever needs to run while they are suspended, then I also see why swap space isn't necessary.

Thank you for taking the time to clarify things!

Ray
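The behaviour Chris describes can be checked directly on any Linux box. A rough sketch (the python3 allocation size is arbitrary, chosen just to make the resident-set size easy to spot; the /proc field positions assume Linux):

```shell
# Start a process that allocates ~50 MB, then stop it with SIGSTOP and check
# that it is in state "T" (stopped) yet still holds its memory (VmRSS).
python3 -c 'import time; buf = bytearray(50000000); time.sleep(60)' &
pid=$!
sleep 1                                        # give it time to allocate
kill -STOP "$pid"
sleep 1
state=$(awk '{print $3}' "/proc/$pid/stat")    # "T" means stopped
rss_kb=$(awk '/^VmRSS/ {print $2}' "/proc/$pid/status")
echo "state=$state rss_kb=$rss_kb"
kill -CONT "$pid"                              # resume, then clean up
kill "$pid"
```

The RSS stays at roughly the allocated size while stopped: the kernel only moves those pages to swap later, under memory pressure, which is exactly why suspended jobs need swap sized to cover them.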
Re: [slurm-users] swap size
Ray,

I'm also on Ubuntu. I'll try the same test, but do it with and without swap on (e.g. by running the swapoff and swapon commands first). To complicate things, I also don't know if the swappiness level makes a difference.

Thanks
Ashton

On Sun, Sep 23, 2018, 7:48 AM Raymond Wan wrote:
>
> Hi Chris,
>
> On Sunday, September 23, 2018 09:34 AM, Chris Samuel wrote:
> > On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:
> >
> >> SLURM's ability to suspend jobs must be storing the state in a
> >> location outside of this 512 GB. So, you're not helping this by
> >> allocating more swap.
> >
> > I don't believe that's the case. My understanding is that in this mode it's
> > just sending processes SIGSTOP and then launching the incoming job, so you
> > should really have enough swap for the previous job to get swapped out to in
> > order to free up RAM for the incoming job.
>
> Hmm, I'm way out of my comfort zone but I am curious
> about what happens. Unfortunately, I don't think I'm able
> to read kernel code, but someone here
> (https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel)
> seems to suggest that SIGSTOP and SIGCONT move a process
> between the runnable and waiting queues.
>
> I'm not sure if I did the correct test, but I wrote a C
> program that allocated a lot of memory:
>
> -
> #include <stdlib.h>
>
> #define memsize 16000
>
> int main () {
>     char *foo = NULL;
>
>     foo = (char *) malloc (sizeof (char) * memsize);
>
>     for (int i = 0; i < memsize; i++) {
>         foo[i] = 0;
>     }
>
>     do {
>     } while (1);
> }
> -
>
> Then, I ran it and sent a SIGSTOP to it. According to htop
> (I don't know if it's correct), it seems to still be
> occupying memory, but just not any CPU cycles.
>
> Perhaps I've done something wrong? I did read elsewhere
> that how SIGSTOP is treated can vary from system to
> system... I happen to be on an Ubuntu system.
>
> Ray
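For reference, the current swap state and swappiness can be read without root before running the test (the swapoff/swapon toggling itself needs root, e.g. `sudo swapoff -a` and `sudo swapon -a`):

```shell
# Read-only checks relevant to the with/without-swap experiment described above.
cat /proc/sys/vm/swappiness                   # kernel's eagerness to swap; often 60 by default
grep -E '^Swap(Total|Free):' /proc/meminfo    # how much swap is provisioned and free
```

With swap disabled, SwapTotal reads 0 kB and a suspended job simply cannot be paged out, so the comparison should be visible immediately in these numbers.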
Re: [slurm-users] swap size
Hi Chris,

On Sunday, September 23, 2018 09:34 AM, Chris Samuel wrote:
> On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:
>
>> SLURM's ability to suspend jobs must be storing the state in a
>> location outside of this 512 GB. So, you're not helping this by
>> allocating more swap.
>
> I don't believe that's the case. My understanding is that in this mode it's
> just sending processes SIGSTOP and then launching the incoming job, so you
> should really have enough swap for the previous job to get swapped out to in
> order to free up RAM for the incoming job.

Hmm, I'm way out of my comfort zone but I am curious about what happens. Unfortunately, I don't think I'm able to read kernel code, but someone here (https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel) seems to suggest that SIGSTOP and SIGCONT move a process between the runnable and waiting queues.

I'm not sure if I did the correct test, but I wrote a C program that allocated a lot of memory:

-
#include <stdlib.h>

#define memsize 16000

int main () {
    char *foo = NULL;

    foo = (char *) malloc (sizeof (char) * memsize);

    for (int i = 0; i < memsize; i++) {
        foo[i] = 0;
    }

    do {
    } while (1);
}
-

Then, I ran it and sent a SIGSTOP to it. According to htop (I don't know if it's correct), it seems to still be occupying memory, but just not any CPU cycles.

Perhaps I've done something wrong? I did read elsewhere that how SIGSTOP is treated can vary from system to system... I happen to be on an Ubuntu system.

Ray
Re: [slurm-users] swap size
On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:

> SLURM's ability to suspend jobs must be storing the state in a
> location outside of this 512 GB. So, you're not helping this by
> allocating more swap.

I don't believe that's the case. My understanding is that in this mode it's just sending processes SIGSTOP and then launching the incoming job, so you should really have enough swap for the previous job to get swapped out to in order to free up RAM for the incoming job.

-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [slurm-users] swap size
If your workflows are primarily CPU-bound rather than memory-bound, and since you're the only user, you could ensure all your Slurm scripts 'nice' their Python commands, or use the -n flag for slurmd and the PropagatePrioProcess configuration parameter. Both of these are in the thread at https://lists.schedmd.com/pipermail/slurm-users/2018-September/001926.html

-- 
Mike Renfro / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Sep 22, 2018, at 1:01 AM, A wrote:
>
> Hi John! Thanks for the reply, lots to think about.
>
> In terms of suspending/resuming, my situation might be a bit different than
> other people's. As I mentioned, this is an install on a single-node workstation.
> This is my daily office machine. I run a lot of python processing scripts that
> have low CPU need but lots of iterations. I found it easier to manage these
> in slurm, as opposed to writing mpi/parallel processing routines in python
> directly.
>
> Given this, sometimes I might submit a slurm array with 10K jobs, that might
> take a week to run, but I still need to sometimes do work during the day that
> requires more CPU power. In those cases I suspend the background array, crank
> through whatever I need to do and then resume in the evening when I go home.
> Sometimes I can wait for jobs to finish, sometimes I have to break in the
> middle of running jobs.
>
> On Fri, Sep 21, 2018, 10:07 PM John Hearns wrote:
> Ashton, on a compute node with 256 Gbytes of RAM I would not
> configure any swap at all. None.
> I managed an SGI UV1 machine at an F1 team which had 1 Tbyte of RAM -
> and no swap.
> Also our ICE clusters were diskless - SGI very smartly configured swap
> over ISCSI - but we disabled this, the reason being that if one node
> in a job starts swapping the likelihood is that all the nodes are
> swapping, and things turn to treacle from there.
> Also, as another issue, if you have lots of RAM you need to look at
> the vm tunings for dirty ratio, background ratio and centisecs. Linux
> will aggressively cache data which is written to disk - you can get a
> situation where your processes THINK data is written to disk but it is
> cached; then what happens if there is a power loss? So get those
> caches flushed often.
> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>
> Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously
> small on default Linux systems. I call this the 'wriggle room' when a
> system is short on RAM. Think of it like those square sliding-letter
> puzzles - min_free_kbytes is the empty square which permits the letter
> tiles to move.
> So look at your min_free_kbytes and increase it (if I'm not wrong, in
> RH7 and CentOS7 systems it is a reasonable value already)
> https://bbs.archlinux.org/viewtopic.php?id=184655
>
> Oh, and it is good to keep a terminal open with 'watch cat
> /proc/meminfo'. I have spent many a happy hour staring at that when
> looking at NFS performance etc.
>
> Back to your specific case. My point is that for HPC work you should
> never go into swap (with a normally running process, i.e. no job
> pre-emption). I find that the 20 percent rule is out of date. Yes,
> probably you should have some swap on a workstation. And yes, disk
> space is cheap these days.
>
> However, you do talk about job pre-emption and suspending/resuming
> jobs. I have never actually seen that being used in production.
> At this point I would be grateful for some education from the choir -
> is this commonly used and am I just hopelessly out of date?
> Honestly, anywhere I have managed systems, lower priority jobs are
> either allowed to finish, or in the case of F1 we checkpointed and
> killed low priority jobs manually if there was a super high priority
> job to run.
>
> On Fri, 21 Sep 2018 at 22:34, A wrote:
> >
> > I have a single node slurm config on my workstation (18 cores, 256 gb ram,
> > 40 Tb disk space). I recently just extended the array size to its current
> > config and am reconfiguring my LVM logical volumes.
> >
> > I'm curious on people's thoughts on swap sizes for a node. Redhat these
> > days recommends up to 20% of ram size for swap size, but no less than 4 gb.
> >
> > But... according to the slurm faq:
> > "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals
> > respectively, so swap and disk space should be sufficient to accommodate
> > all jobs allocated to a node, either running or suspended."
> >
> > So I'm wondering if 20% is enough, or whether it should scale by the number
> > of single jobs I might be running at any one time. E.g. if I'm running 10
> > jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?
> >
> > any thoughts?
> >
> > -ashton
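For what it's worth, a sketch of the slurm.conf side of the suggestion at the top of this message. PropagatePrioProcess and slurmd's -n flag are real Slurm options, but the values chosen here are only examples:

```
# slurm.conf fragment (illustrative value):
PropagatePrioProcess=1    # spawned job steps inherit slurmd's nice value

# ...then start the daemon itself niced, e.g.:   slurmd -n 19
# The per-script alternative is simply:          nice -n 19 python3 script.py
```

Either way the effect is the same: background array tasks yield the CPU to interactive desktop work without any suspend/resume or swap involvement.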
Re: [slurm-users] swap size
I would say that, yes, you have a good workflow here with Slurm.

As another aside - is anyone working with suspending and resuming containers? I see on the Singularity site that suspend/resume is on the roadmap (I am not talking about checkpointing here).

Also, it is worth saying that these days one would be swapping to SSDs, or even better NVRAM devices, so the penalties for swapping will be less.

Warming to my theme, what we should be looking at for large memory machines is tiered memory: fast DRAM for the computations which are actively being written to, then slower tiers of cheaper memory. Diablo had implemented this; I believe they are no longer active. Also there is Optane - which seems to have gone a bit quiet. But having read up on Diablo, the drivers for tiered memory are in the Linux kernel.

Enough of my ramblings! Maybe one day you will have a system with Tbytes of memory, and only 256 gig of real fast DRAM.

On Sat, 22 Sep 2018 at 07:20, Raymond Wan wrote:
>
> Hi Ashton,
>
> On Sat, Sep 22, 2018 at 5:34 AM A wrote:
> > So I'm wondering if 20% is enough, or whether it should scale by the number
> > of single jobs I might be running at any one time. E.g. if I'm running 10
> > jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?
>
> Perhaps I'm a bit clueless here, but maybe someone can correct me if I'm
> wrong.
>
> I don't think swap space or a swap file is used like that. If you
> have 256 GB of memory and a 256 GB swap file (I don't suggest this
> size... it just makes my math easier :-) ), then from the point of view
> of the OS, it will appear there is 512 GB of memory. So, this is
> memory that is used while it is running... for reading in data, etc.
>
> SLURM's ability to suspend jobs must be storing the state in a
> location outside of this 512 GB. So, you're not helping this by
> allocating more swap.
> > What you are doing is perhaps allowing more jobs to run concurrently, > but I would caution against allocating more swap space. After all, > disk read/write is much slower than memory. If you can run 10 jobs > within 256 GB of memory but 20 jobs within 512 GB of (memory + swap > space), I think you should do some kind of test to see if it would be > faster to just let 10 jobs run. Since disk I/O is slower, I doubt > you're going to get double the running time. > > Personally, I still create swap space, but I agree with John that a > server with 256 GB of memory shouldn't need any swap at all. With > what I run, if it uses more than the amount of memory that I have, I > tend to stop it and find another computer to run it. If there isn't > one, I need to admit I can't do it. Because once it exceeds the > amount of main memory, it will start thrashing and, thus, take a lot > of time to run. i.e., a day versus a week or more... > > On the other hand, we do have servers that double as desktops during > the day. An alternative for you to consider is to only allocate 200 > GB of memory to slurm, for example, leaving 56 GB for your own use. > Yes, this means that, at night, 56 GB of RAM is wasted, but during the > day, they can also continue running. Of course, you should set aside > an amount that is enough for you...56 GB was chosen to make my math > easier as well. :-) > > If something I said here isn't quite correct, I'm happy to have > someone correct me... > > Ray >
Re: [slurm-users] swap size
Hi Ashton,

On Sat, Sep 22, 2018 at 5:34 AM A wrote:
> So I'm wondering if 20% is enough, or whether it should scale by the number
> of single jobs I might be running at any one time. E.g. if I'm running 10
> jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?

Perhaps I'm a bit clueless here, but maybe someone can correct me if I'm wrong.

I don't think swap space or a swap file is used like that. If you have 256 GB of memory and a 256 GB swap file (I don't suggest this size... it just makes my math easier :-) ), then from the point of view of the OS, it will appear there is 512 GB of memory. So, this is memory that is used while it is running... for reading in data, etc.

SLURM's ability to suspend jobs must be storing the state in a location outside of this 512 GB. So, you're not helping this by allocating more swap.

What you are doing is perhaps allowing more jobs to run concurrently, but I would caution against allocating more swap space. After all, disk read/write is much slower than memory. If you can run 10 jobs within 256 GB of memory but 20 jobs within 512 GB of (memory + swap space), I think you should do some kind of test to see if it would be faster to just let 10 jobs run. Since disk I/O is slower, I doubt you're going to get double the running time.

Personally, I still create swap space, but I agree with John that a server with 256 GB of memory shouldn't need any swap at all. With what I run, if it uses more than the amount of memory that I have, I tend to stop it and find another computer to run it. If there isn't one, I need to admit I can't do it, because once it exceeds the amount of main memory, it will start thrashing and, thus, take a lot of time to run; i.e., a day versus a week or more...

On the other hand, we do have servers that double as desktops during the day. An alternative for you to consider is to only allocate 200 GB of memory to slurm, for example, leaving 56 GB for your own use.
Yes, this means that, at night, 56 GB of RAM is wasted, but during the day, they can also continue running. Of course, you should set aside an amount that is enough for you... 56 GB was chosen to make my math easier as well. :-)

If something I said here isn't quite correct, I'm happy to have someone correct me...

Ray
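A minimal sketch of what Ray's 200 GB suggestion would look like on a single-node install. RealMemory is a real slurm.conf node parameter and is specified in megabytes; the hostname here is a placeholder, and the CPU count just mirrors the workstation described in the thread:

```
# Hypothetical slurm.conf node line: advertise only 200 GB of the machine's
# 256 GB to Slurm (200 * 1024 = 204800 MB), keeping ~56 GB for desktop use.
NodeName=workstation CPUs=18 RealMemory=204800 State=UNKNOWN
```

Slurm will then never schedule jobs whose combined requested memory exceeds 200 GB on the node, regardless of what the OS reports.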
Re: [slurm-users] swap size
Hi John! Thanks for the reply, lots to think about.

In terms of suspending/resuming, my situation might be a bit different than other people's. As I mentioned, this is an install on a single-node workstation. This is my daily office machine. I run a lot of python processing scripts that have low CPU need but lots of iterations. I found it easier to manage these in slurm, as opposed to writing mpi/parallel processing routines in python directly.

Given this, sometimes I might submit a slurm array with 10K jobs, that might take a week to run, but I still need to sometimes do work during the day that requires more CPU power. In those cases I suspend the background array, crank through whatever I need to do and then resume in the evening when I go home. Sometimes I can wait for jobs to finish, sometimes I have to break in the middle of running jobs.

On Fri, Sep 21, 2018, 10:07 PM John Hearns wrote:
> Ashton, on a compute node with 256 Gbytes of RAM I would not
> configure any swap at all. None.
> I managed an SGI UV1 machine at an F1 team which had 1 Tbyte of RAM -
> and no swap.
> Also our ICE clusters were diskless - SGI very smartly configured swap
> over ISCSI - but we disabled this, the reason being that if one node
> in a job starts swapping the likelihood is that all the nodes are
> swapping, and things turn to treacle from there.
> Also, as another issue, if you have lots of RAM you need to look at
> the vm tunings for dirty ratio, background ratio and centisecs. Linux
> will aggressively cache data which is written to disk - you can get a
> situation where your processes THINK data is written to disk but it is
> cached; then what happens if there is a power loss? So get those
> caches flushed often.
>
> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>
> Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously
> small on default Linux systems. I call this the 'wriggle room' when a
> system is short on RAM.
> Think of it like those square sliding-letter
> puzzles - min_free_kbytes is the empty square which permits the letter
> tiles to move.
> So look at your min_free_kbytes and increase it (if I'm not wrong, in
> RH7 and CentOS7 systems it is a reasonable value already)
> https://bbs.archlinux.org/viewtopic.php?id=184655
>
> Oh, and it is good to keep a terminal open with 'watch cat
> /proc/meminfo'. I have spent many a happy hour staring at that when
> looking at NFS performance etc.
>
> Back to your specific case. My point is that for HPC work you should
> never go into swap (with a normally running process, i.e. no job
> pre-emption). I find that the 20 percent rule is out of date. Yes,
> probably you should have some swap on a workstation. And yes, disk
> space is cheap these days.
>
> However, you do talk about job pre-emption and suspending/resuming
> jobs. I have never actually seen that being used in production.
> At this point I would be grateful for some education from the choir -
> is this commonly used and am I just hopelessly out of date?
> Honestly, anywhere I have managed systems, lower priority jobs are
> either allowed to finish, or in the case of F1 we checkpointed and
> killed low priority jobs manually if there was a super high priority
> job to run.
>
> On Fri, 21 Sep 2018 at 22:34, A wrote:
> >
> > I have a single node slurm config on my workstation (18 cores, 256 gb
> > ram, 40 Tb disk space). I recently just extended the array size to its
> > current config and am reconfiguring my LVM logical volumes.
> >
> > I'm curious on people's thoughts on swap sizes for a node. Redhat these
> > days recommends up to 20% of ram size for swap size, but no less than 4 gb.
> >
> > But... according to the slurm faq:
> > "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT
> > signals respectively, so swap and disk space should be sufficient to
> > accommodate all jobs allocated to a node, either running or suspended."
> >
> > So I'm wondering if 20% is enough, or whether it should scale by the
> > number of single jobs I might be running at any one time. E.g. if I'm
> > running 10 jobs that all use 20 gb of ram, and I suspend, should I need
> > 200 gb of swap?
> >
> > any thoughts?
> >
> > -ashton
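The arithmetic in the quoted question follows directly from the Slurm FAQ's rule: swap has to cover everything that might be suspended at once, not a fixed fraction of RAM. Using the example numbers from the question:

```shell
# Sizing sketch: 10 suspended jobs at 20 GB each must all fit in swap.
jobs=10
gb_per_job=20
needed=$((jobs * gb_per_job))
echo "swap needed to hold all suspended jobs: ${needed} GB"   # 200 GB
```

So with this preemption pattern the 20%-of-RAM guideline (about 51 GB on a 256 GB machine) falls well short; the suspended-job footprint, not the RAM size, is what should drive the number.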
Re: [slurm-users] swap size
Ashton, on a compute node with 256 Gbytes of RAM I would not configure any swap at all. None. I managed an SGI UV1 machine at an F1 team which had 1 Tbyte of RAM - and no swap. Also our ICE clusters were diskless - SGI very smartly configured swap over ISCSI - but we disabled this, the reason being that if one node in a job starts swapping, the likelihood is that all the nodes are swapping, and things turn to treacle from there.

Also, as another issue, if you have lots of RAM you need to look at the vm tunings for dirty ratio, background ratio and centisecs. Linux will aggressively cache data which is written to disk - you can get a situation where your processes THINK data is written to disk but it is cached; then what happens if there is a power loss? So get those caches flushed often.
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/

Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously small on default Linux systems. I call this the 'wriggle room' when a system is short on RAM. Think of it like those square sliding-letter puzzles - min_free_kbytes is the empty square which permits the letter tiles to move. So look at your min_free_kbytes and increase it (if I'm not wrong, in RH7 and CentOS7 systems it is a reasonable value already).
https://bbs.archlinux.org/viewtopic.php?id=184655

Oh, and it is good to keep a terminal open with 'watch cat /proc/meminfo'. I have spent many a happy hour staring at that when looking at NFS performance etc.

Back to your specific case. My point is that for HPC work you should never go into swap (with a normally running process, i.e. no job pre-emption). I find that the 20 percent rule is out of date. Yes, probably you should have some swap on a workstation. And yes, disk space is cheap these days.

However, you do talk about job pre-emption and suspending/resuming jobs. I have never actually seen that being used in production.
At this point I would be grateful for some education from the choir - is this commonly used and am I just hopelessly out of date? Honestly, anywhere I have managed systems, lower priority jobs are either allowed to finish, or in the case of F1 we checkpointed and killed low priority jobs manually if there was a super high priority job to run.

On Fri, 21 Sep 2018 at 22:34, A wrote:
>
> I have a single node slurm config on my workstation (18 cores, 256 gb ram,
> 40 Tb disk space). I recently just extended the array size to its current
> config and am reconfiguring my LVM logical volumes.
>
> I'm curious on people's thoughts on swap sizes for a node. Redhat these days
> recommends up to 20% of ram size for swap size, but no less than 4 gb.
>
> But... according to the slurm faq:
> "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals
> respectively, so swap and disk space should be sufficient to accommodate all
> jobs allocated to a node, either running or suspended."
>
> So I'm wondering if 20% is enough, or whether it should scale by the number
> of single jobs I might be running at any one time. E.g. if I'm running 10
> jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?
>
> any thoughts?
>
> -ashton
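The vm knobs John mentions can all be inspected read-only before touching anything; raising them needs root, e.g. `sysctl -w vm.min_free_kbytes=262144` or an /etc/sysctl.d drop-in (that value is only an example, not a recommendation):

```shell
# Read-only look at the tunables discussed above, on any Linux system.
cat /proc/sys/vm/min_free_kbytes              # the 'wriggle room' reserve, in kB
cat /proc/sys/vm/dirty_ratio                  # % of RAM dirty before writers block
cat /proc/sys/vm/dirty_background_ratio       # % of RAM dirty before background flush starts
grep -E '^(Dirty|Writeback):' /proc/meminfo   # pages currently dirty / being written back
```

Watching the Dirty line shrink after a `sync` is a quick way to see the writeback caching behaviour John describes.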