Re: [slurm-users] swap size

2018-09-23 Thread Raymond Wan
Hi Chris,


On Mon, Sep 24, 2018 at 7:36 AM Christopher Samuel  wrote:
> On 24/09/18 00:46, Raymond Wan wrote:
>
> > Hmm, I'm way out of my comfort zone but I am curious about what
> > happens.  Unfortunately, I don't think I'm able to read kernel code, but
> > someone here
> > (https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel)
> > seems to suggest that SIGSTOP and SIGCONT move a process
> > between the runnable and waiting queues.
>
> SIGSTOP is a non-catchable signal that immediately stops a process from
> running, and so it will sit there until either resumed, killed or the
> system is rebooted. :-)
>
> It's like doing ^Z in the shell (which generates SIGTSTP) but isn't
> catchable via signal handlers, so you can't do anything about it (same
> as SIGKILL).
>
> Regarding memory, yes its memory is still used until the process
> either resumes and releases it or is killed.  This is why if you want
> to do preemption in this mode you'll want swap so that the kernel has
> somewhere to page out the memory it's using to make room for the incoming
> process(es).


Ah!!!  Yes, this clears things up for me -- thank you!  Somehow, I
thought what you meant was that SLURM suspends a job and "immediately"
saves its state somewhere.  Then I guessed that, if SLURM could do that,
the saved state ought to live outside of the main memory + swap space
managed by the OS.

But now I see what you mean.  It's just doing it within the signal
communication provided by the OS.

The job gets stopped but it remains in main memory.  That is, it
doesn't "immediately" shift to swap space.  But having more swap space
gives a suspended job somewhere to be paged out to, so that an incoming
job that needs the CPU and memory can actually run.  Of course, if an
HPC system has enough main memory to hold all suspended jobs plus
whatever else needs to run while they are suspended, then I also see
why swap space isn't necessary.
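
For what it's worth, here is a minimal sketch of that suspend/resume dance
at the signal level (just an illustration of the OS mechanism, not Slurm's
actual code; the timings are arbitrary):

-
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

int main (void) {
  pid_t pid = fork ();

  if (pid == 0) {
    /* child: stand-in for a job step; it just burns CPU */
    for (;;) { }
  }

  sleep (2);              /* let the "job" run for a bit */
  kill (pid, SIGSTOP);    /* suspend: memory stays allocated, CPU use stops */
  sleep (2);
  kill (pid, SIGCONT);    /* resume: it carries on where it left off */
  sleep (1);
  kill (pid, SIGKILL);    /* clean up */
  waitpid (pid, NULL, 0);
  return 0;
}
-

Watching the child in htop while this runs shows exactly what you described:
its memory stays put while it is stopped, and only the CPU usage goes away.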

Thank you for taking the time to clarify things!

Ray



Re: [slurm-users] swap size

2018-09-23 Thread A
Ray

I'm also on Ubuntu. I'll try the same test, but do it with and without swap
on (e.g. by running the swapoff and swapon commands first). To complicate
things, I also don't know if the swappiness level makes a difference.

Thanks
Ashton

On Sun, Sep 23, 2018, 7:48 AM Raymond Wan  wrote:

>
> Hi Chris,
>
>
> On Sunday, September 23, 2018 09:34 AM, Chris Samuel wrote:
> > On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:
> >
> >> SLURM's ability to suspend jobs must be storing the state in a
> >> location outside of this 512 GB.  So, you're not helping this by
> >> allocating more swap.
> >
> > I don't believe that's the case.  My understanding is that in this mode
> it's
> > just sending processes SIGSTOP and then launching the incoming job so you
> > should really have enough swap for the previous job to get swapped out
> to in
> > order to free up RAM for the incoming job.
>
>
> Hmm, I'm way out of my comfort zone but I am curious
> about what happens.  Unfortunately, I don't think I'm able
> to read kernel code, but someone here
> (
> https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel)
>
> seems to suggest that SIGSTOP and SIGCONT move a process
> between the runnable and waiting queues.
>
> I'm not sure if I did the correct test, but I wrote a C
> program that allocated a lot of memory:
>
> -
> #include <stdlib.h>
>
> #define memsize 16000
>
> int main () {
>char *foo = NULL;
>
>foo = (char *) malloc (sizeof (char) * memsize);
>
>for (int i = 0; i < memsize; i++) {
>  foo[i] = 0;
>}
>
>do {
>} while (1);
> }
> -
>
> Then, I ran it and sent a SIGSTOP to it.  According to htop
> (I don't know if it's correct), it seems to still be
> occupying memory, but just not any CPU cycles.
>
> Perhaps I've done something wrong?  I did read elsewhere
> that how SIGSTOP is treated can vary from system to
> system...  I happen to be on an Ubuntu system.
>
> Ray
>
>
>
>


Re: [slurm-users] swap size

2018-09-23 Thread Raymond Wan



Hi Chris,


On Sunday, September 23, 2018 09:34 AM, Chris Samuel wrote:

On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:


SLURM's ability to suspend jobs must be storing the state in a
location outside of this 512 GB.  So, you're not helping this by
allocating more swap.


I don't believe that's the case.  My understanding is that in this mode it's
just sending processes SIGSTOP and then launching the incoming job so you
should really have enough swap for the previous job to get swapped out to in
order to free up RAM for the incoming job.



Hmm, I'm way out of my comfort zone but I am curious 
about what happens.  Unfortunately, I don't think I'm able 
to read kernel code, but someone here 
(https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel) 
seems to suggest that SIGSTOP and SIGCONT move a process 
between the runnable and waiting queues.


I'm not sure if I did the correct test, but I wrote a C 
program that allocated a lot of memory:


-
#include <stdlib.h>

#define memsize 16000

int main () {
  char *foo = NULL;

  foo = (char *) malloc (sizeof (char) * memsize);

  for (int i = 0; i < memsize; i++) {
foo[i] = 0;
  }

  do {
  } while (1);
}
-

Then, I ran it and sent a SIGSTOP to it.  According to htop 
(I don't know if it's correct), it seems to still be 
occupying memory, but just not any CPU cycles.
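
In case anyone wants to reproduce it, something along these lines should do
(the file name is arbitrary):

-
gcc -o memtest memtest.c
./memtest &
kill -STOP %1                  # state shows as 'T' in ps/htop, CPU drops to zero
grep VmRSS /proc/$!/status     # the resident memory is still there
kill -CONT %1                  # resumes as if nothing happened
kill %1
-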


Perhaps I've done something wrong?  I did read elsewhere 
that how SIGSTOP is treated can vary from system to 
system...  I happen to be on an Ubuntu system.


Ray





Re: [slurm-users] swap size

2018-09-22 Thread Chris Samuel
On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:

> SLURM's ability to suspend jobs must be storing the state in a
> location outside of this 512 GB.  So, you're not helping this by
> allocating more swap.

I don't believe that's the case.  My understanding is that in this mode it's 
just sending processes SIGSTOP and then launching the incoming job so you 
should really have enough swap for the previous job to get swapped out to in 
order to free up RAM for the incoming job.

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC






Re: [slurm-users] swap size

2018-09-22 Thread Renfro, Michael
If your workflows are primarily CPU-bound rather than memory-bound, and since 
you’re the only user, you could ensure all your Slurm scripts ‘nice’ their 
Python commands, or use the -n flag for slurmd and the PropagatePrioProcess 
configuration parameter. Both of these are in the thread at 
https://lists.schedmd.com/pipermail/slurm-users/2018-September/001926.html
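
Roughly, the two approaches look like this (a sketch only -- check the
slurm.conf and slurmd documentation for your version; the script name is
made up):

-
# Option 1: nice the heavy command inside the batch script itself, e.g.
#   nice -n 19 python my_script.py
#
# Option 2: in slurm.conf, have jobs inherit the priority (nice value)
# of the process that submitted them:
PropagatePrioProcess=1
-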

-- 
Mike Renfro  / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Sep 22, 2018, at 1:01 AM, A  wrote:
> 
> Hi John! Thanks for the reply, lots to think about.
> 
> In terms of suspending/resuming, my situation might be a bit different from 
> other people's. As I mentioned, this is an install on a single-node workstation. 
> This is my daily office machine. I run a lot of python processing scripts that 
> have low CPU need but lots of iterations. I found it easier to manage these 
> in slurm, as opposed to writing mpi/parallel processing routines in python 
> directly.
> 
> Given this, sometimes I might submit a slurm array with 10K jobs, that might 
> take a week to run, but I still need to sometimes do work during the day that 
> requires more CPU power. In those cases I suspend the background array, crank 
> through whatever I need to do and then resume in the evening when I go home. 
> Sometimes I can wait for jobs to finish; sometimes I have to break in the 
> middle of running jobs.
> 
> On Fri, Sep 21, 2018, 10:07 PM John Hearns  wrote:
> Ashton,   on a compute node with 256Gbytes of RAM I would not
> configure any swap at all. None.
> I managed an SGI UV1 machine at an F1 team which had 1Tbyte of RAM -
> and no swap.
> Also our ICE clusters were diskless - SGI very smartly configured swap
> over ISCSI - but we disabled this, the reason being that if one node
> in a job starts swapping the likelihood is that all the nodes are
> swapping, and things turn to treacle from there.
> Also, as another issue, if you have lots of RAM you need to look at
> the vm tunings for dirty ratio, background ratio and centisecs. Linux
> will aggressively cache data which is written to disk - you can get a
> situation where your processes THINK data is written to disk but it is
> cached, so what happens if there is a power loss? So get those
> caches flushed often.
> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
> 
> Oh, and my other tip.  In the past vm.min_free_kbytes was ridiculously
> small on default Linux systems. I call this the 'wriggle room' when a
> system is short on RAM. Think of it like those square sliding letters
> puzzles - min_free_kbytes is the empty square which permits the letter
> tiles to move.
> So look at your min_free_kbytes and increase it (if I'm not wrong, on
> RHEL 7 and CentOS 7 systems it is a reasonable value already).
> https://bbs.archlinux.org/viewtopic.php?id=184655
> 
> Oh, and  it is good to keep a terminal open with 'watch cat
> /proc/meminfo'  I have spent many a happy hour staring at that when
> looking at NFS performance etc. etc.
> 
> Back to your specific case. My point is that for HPC work you should
> never go into swap (with a normally running process, ie no job
> pre-emption). I find that the 20 percent rule is out of date. Yes,
> probably you should have some swap on a workstation. And yes disk
> space is cheap these days.
> 
> 
> However, you do talk about job pre-emption and suspending/resuming
> jobs. I have never actually seen that being used in production.
> At this point I would be grateful for some education from the choir -
> is this commonly used and am I just hopelessly out of date?
> Honestly, anywhere I have managed systems, lower priority jobs are
> either allowed to finish, or in the case of F1 we checkpointed and
> killed low priority jobs manually if there was a super high priority
> job to run.
> 
> 
> On Fri, 21 Sep 2018 at 22:34, A  wrote:
> >
> > I have a single node slurm config on my workstation (18 cores, 256 gb ram, 
> > 40 Tb disk space). I recently just extended the array size to its current 
> > config and am reconfiguring my LVM logical volumes.
> >
> > I'm curious on people's thoughts on swap sizes for a node. Redhat these 
> > days recommends up to 20% of ram size for swap size, but no less than 4 gb.
> >
> > But... according to the Slurm FAQ:
> > "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals 
> > respectively, so swap and disk space should be sufficient to accommodate 
> > all jobs allocated to a node, either running or suspended."
> >
> > So I'm wondering if 20% is enough, or whether it should scale by the number 
> > of single jobs I might be running at any one time. E.g. if I'm running 10 
> > jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?
> >
> > any thoughts?
> >
> > -ashton
> 



Re: [slurm-users] swap size

2018-09-22 Thread John Hearns
I would say that, yes, you have a good workflow here with Slurm.
As another aside - is anyone working with suspending and resuming containers?
I see on the Singularity site that suspend/resume is on the roadmap (I
am not talking about checkpointing here).

Also it is worth saying that these days one would be swapping to SSDs
and, even better, NVRAM devices, so the penalties for swapping will be
less.
Warming to my theme, what we should be looking at for large-memory
machines is tiered memory: fast DRAM for the data the computations are
actively writing to, then slower tiers of cheaper memory.
Diablo had implemented this; I believe they are no longer active. Also
there is Optane, which seems to have gone a bit quiet.
But having read up on Diablo, the drivers for tiered memory are in the
Linux kernel.

Enough of my ramblings!
Maybe one day you will have a system with terabytes of memory, and only
256 GB of real fast DRAM.

On Sat, 22 Sep 2018 at 07:20, Raymond Wan  wrote:
>
> Hi Ashton,
>
> On Sat, Sep 22, 2018 at 5:34 AM A  wrote:
> > So I'm wondering if 20% is enough, or whether it should scale by the number 
> > of single jobs I might be running at any one time. E.g. if I'm running 10 
> > jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?
>
>
> Perhaps I'm a bit clueless here, but maybe someone can correct me if I'm 
> wrong.
>
> I don't think swap space or a swap file is used like that.  If you
> have 256 GB of memory and a 256 GB swap file (I don't suggest this
> size...it just makes my math easier :-) ), then from the point of view
> of the OS, it will appear there is 512 GB of memory.  So, this is
> memory that is used while it is running...for reading in data, etc.
>
> SLURM's ability to suspend jobs must be storing the state in a
> location outside of this 512 GB.  So, you're not helping this by
> allocating more swap.
>
> What you are doing is perhaps allowing more jobs to run concurrently,
> but I would caution against allocating more swap space.  After all,
> disk read/write is much slower than memory.  If you can run 10 jobs
> within 256 GB of memory but 20 jobs within 512 GB of (memory + swap
> space), I think you should do some kind of test to see if it would be
> faster to just let 10 jobs run.  Since disk I/O is slower, I doubt
> you're going to get twice the work done in the same time.
>
> Personally, I still create swap space, but I agree with John that a
> server with 256 GB of memory shouldn't need any swap at all.  With
> what I run, if it uses more than the amount of memory that I have, I
> tend to stop it and find another computer to run it.  If there isn't
> one, I need to admit I can't do it.  Because once it exceeds the
> amount of main memory, it will start thrashing and, thus, take a lot
> of time to run.  i.e., a day versus a week or more...
>
> On the other hand, we do have servers that double as desktops during
> the day.  An alternative for you to consider is to only allocate 200
> GB of memory to slurm, for example, leaving 56 GB for your own use.
> Yes, this means that, at night, 56 GB of RAM is wasted, but during the
> day, they can also continue running.  Of course, you should set aside
> an amount that is enough for you...56 GB was chosen to make my math
> easier as well.  :-)
>
> If something I said here isn't quite correct, I'm happy to have
> someone correct me...
>
> Ray
>



Re: [slurm-users] swap size

2018-09-22 Thread Raymond Wan
Hi Ashton,

On Sat, Sep 22, 2018 at 5:34 AM A  wrote:
> So I'm wondering if 20% is enough, or whether it should scale by the number 
> of single jobs I might be running at any one time. E.g. if I'm running 10 
> jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?


Perhaps I'm a bit clueless here, but maybe someone can correct me if I'm wrong.

I don't think swap space or a swap file is used like that.  If you
have 256 GB of memory and a 256 GB swap file (I don't suggest this
size...it just makes my math easier :-) ), then from the point of view
of the OS, it will appear there is 512 GB of memory.  So, this is
memory that is used while it is running...for reading in data, etc.

SLURM's ability to suspend jobs must be storing the state in a
location outside of this 512 GB.  So, you're not helping this by
allocating more swap.

What you are doing is perhaps allowing more jobs to run concurrently,
but I would caution against allocating more swap space.  After all,
disk read/write is much slower than memory.  If you can run 10 jobs
within 256 GB of memory but 20 jobs within 512 GB of (memory + swap
space), I think you should do some kind of test to see if it would be
faster to just let 10 jobs run.  Since disk I/O is slower, I doubt
you're going to get twice the work done in the same time.

Personally, I still create swap space, but I agree with John that a
server with 256 GB of memory shouldn't need any swap at all.  With
what I run, if it uses more than the amount of memory that I have, I
tend to stop it and find another computer to run it.  If there isn't
one, I need to admit I can't do it.  Because once it exceeds the
amount of main memory, it will start thrashing and, thus, take a lot
of time to run.  i.e., a day versus a week or more...

On the other hand, we do have servers that double as desktops during
the day.  An alternative for you to consider is to only allocate 200
GB of memory to slurm, for example, leaving 56 GB for your own use.
Yes, this means that, at night, 56 GB of RAM is wasted, but during the
day, they can also continue running.  Of course, you should set aside
an amount that is enough for you...56 GB was chosen to make my math
easier as well.  :-)

If something I said here isn't quite correct, I'm happy to have
someone correct me...

Ray



Re: [slurm-users] swap size

2018-09-22 Thread A
Hi John! Thanks for the reply, lots to think about.

In terms of suspending/resuming, my situation might be a bit different from
other people's. As I mentioned, this is an install on a single-node
workstation. This is my daily office machine. I run a lot of python
processing scripts that have low CPU need but lots of iterations. I found
it easier to manage these in slurm, as opposed to writing mpi/parallel
processing routines in python directly.

Given this, sometimes I might submit a slurm array with 10K jobs, that
might take a week to run, but I still need to sometimes do work during the
day that requires more CPU power. In those cases I suspend the background
array, crank through whatever I need to do and then resume in the evening
when I go home. Sometimes I can wait for jobs to finish; sometimes I have to
break in the middle of running jobs.
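
(For the archives: the suspend/resume step itself is just scontrol; the job
ID below is made up.)

-
scontrol suspend 12345   # SIGSTOPs the job's processes; they keep their RAM
# ... do the interactive work ...
scontrol resume 12345    # SIGCONTs them again
-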

On Fri, Sep 21, 2018, 10:07 PM John Hearns  wrote:

> Ashton,   on a compute node with 256Gbytes of RAM I would not
> configure any swap at all. None.
> I managed an SGI UV1 machine at an F1 team which had 1Tbyte of RAM -
> and no swap.
> Also our ICE clusters were diskless - SGI very smartly configured swap
> over iSCSI - but we disabled this, the reason being that if one node
> in a job starts swapping the likelihood is that all the nodes are
> swapping, and things turn to treacle from there.
> Also, as another issue, if you have lots of RAM you need to look at
> the vm tunings for dirty ratio, background ratio and centisecs. Linux
> will aggressively cache data which is written to disk - you can get a
> situation where your processes THINK data is written to disk but it is
> cached, so what happens if there is a power loss? So get those
> caches flushed often.
>
> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>
> Oh, and my other tip.  In the past vm.min_free_kbytes was ridiculously
> small on default Linux systems. I call this the 'wriggle room' when a
> system is short on RAM. Think of it like those square sliding letters
> puzzles - min_free_kbytes is the empty square which permits the letter
> tiles to move.
> So look at your min_free_kbytes and increase it (if I'm not wrong, on
> RHEL 7 and CentOS 7 systems it is a reasonable value already).
> https://bbs.archlinux.org/viewtopic.php?id=184655
>
> Oh, and  it is good to keep a terminal open with 'watch cat
> /proc/meminfo'  I have spent many a happy hour staring at that when
> looking at NFS performance etc. etc.
>
> Back to your specific case. My point is that for HPC work you should
> never go into swap (with a normally running process, ie no job
> pre-emption). I find that the 20 percent rule is out of date. Yes,
> probably you should have some swap on a workstation. And yes disk
> space is cheap these days.
>
>
> However, you do talk about job pre-emption and suspending/resuming
> jobs. I have never actually seen that being used in production.
> At this point I would be grateful for some education from the choir -
> is this commonly used and am I just hopelessly out of date?
> Honestly, anywhere I have managed systems, lower priority jobs are
> either allowed to finish, or in the case of F1 we checkpointed and
> killed low priority jobs manually if there was a super high priority
> job to run.
>
>
> On Fri, 21 Sep 2018 at 22:34, A  wrote:
> >
> > I have a single node slurm config on my workstation (18 cores, 256 gb
> ram, 40 Tb disk space). I recently just extended the array size to its
> current config and am reconfiguring my LVM logical volumes.
> >
> > I'm curious on people's thoughts on swap sizes for a node. Redhat these
> days recommends up to 20% of ram size for swap size, but no less than 4 gb.
> >
> > But... according to the Slurm FAQ:
> > "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT
> signals respectively, so swap and disk space should be sufficient to
> accommodate all jobs allocated to a node, either running or suspended."
> >
> > So I'm wondering if 20% is enough, or whether it should scale by the
> number of single jobs I might be running at any one time. E.g. if I'm
> running 10 jobs that all use 20 gb of ram, and I suspend, should I need 200
> gb of swap?
> >
> > any thoughts?
> >
> > -ashton
>
>


Re: [slurm-users] swap size

2018-09-21 Thread John Hearns
Ashton,   on a compute node with 256Gbytes of RAM I would not
configure any swap at all. None.
I managed an SGI UV1 machine at an F1 team which had 1Tbyte of RAM -
and no swap.
Also our ICE clusters were diskless - SGI very smartly configured swap
over iSCSI - but we disabled this, the reason being that if one node
in a job starts swapping the likelihood is that all the nodes are
swapping, and things turn to treacle from there.
Also, as another issue, if you have lots of RAM you need to look at
the vm tunings for dirty ratio, background ratio and centisecs. Linux
will aggressively cache data which is written to disk - you can get a
situation where your processes THINK data is written to disk but it is
cached, so what happens if there is a power loss? So get those
caches flushed often.
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/

Oh, and my other tip.  In the past vm.min_free_kbytes was ridiculously
small on default Linux systems. I call this the 'wriggle room' when a
system is short on RAM. Think of it like those square sliding letters
puzzles - min_free_kbytes is the empty square which permits the letter
tiles to move.
So look at your min_free_kbytes and increase it (if I'm not wrong, on
RHEL 7 and CentOS 7 systems it is a reasonable value already).
https://bbs.archlinux.org/viewtopic.php?id=184655
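
To make that concrete, the knobs I mean live in sysctl; the numbers below
are only placeholders to experiment from, not recommendations:

-
# e.g. /etc/sysctl.d/90-hpc.conf  (illustrative values only - tune and test)
vm.dirty_background_ratio = 5        # start background writeback sooner
vm.dirty_ratio = 10                  # cap how much dirty data may sit in cache
vm.dirty_writeback_centisecs = 100   # wake the flusher threads more often
vm.min_free_kbytes = 1048576         # keep ~1 GB of 'wriggle room' free
-

Load it with 'sysctl --system', or set individual keys with 'sysctl -w'.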

Oh, and  it is good to keep a terminal open with 'watch cat
/proc/meminfo'  I have spent many a happy hour staring at that when
looking at NFS performance etc. etc.

Back to your specific case. My point is that for HPC work you should
never go into swap (with a normally running process, ie no job
pre-emption). I find that the 20 percent rule is out of date. Yes,
probably you should have some swap on a workstation. And yes disk
space is cheap these days.


However, you do talk about job pre-emption and suspending/resuming
jobs. I have never actually seen that being used in production.
At this point I would be grateful for some education from the choir -
is this commonly used and am I just hopelessly out of date?
Honestly, anywhere I have managed systems, lower priority jobs are
either allowed to finish, or in the case of F1 we checkpointed and
killed low priority jobs manually if there was a super high priority
job to run.


On Fri, 21 Sep 2018 at 22:34, A  wrote:
>
> I have a single node slurm config on my workstation (18 cores, 256 gb ram, 40 
> Tb disk space). I recently just extended the array size to its current config 
> and am reconfiguring my LVM logical volumes.
>
> I'm curious on people's thoughts on swap sizes for a node. Redhat these days 
> recommends up to 20% of ram size for swap size, but no less than 4 gb.
>
> But... according to the Slurm FAQ:
> "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals 
> respectively, so swap and disk space should be sufficient to accommodate all 
> jobs allocated to a node, either running or suspended."
>
> So I'm wondering if 20% is enough, or whether it should scale by the number 
> of single jobs I might be running at any one time. E.g. if I'm running 10 
> jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?
>
> any thoughts?
>
> -ashton