Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-03-29 Thread Nathan R Crawford
Sounds pretty safe, but with the current COVID-19 difficulties, including
Spring quarter classes being taught remotely (starts tomorrow, fun times
ahead), I'm a bit reluctant to poke a running system. This will get put on
the giant list of things waiting for a scheduled downtime once campus
re-opens.
Thanks,
Nate

On Thu, Mar 26, 2020 at 8:26 AM Steven Dick wrote:

> When I changed this on a running system, no jobs were killed, but
> slurm lost track of jobs on nodes and was unable to kill them or tell
> when they were finished until slurmd on each node was restarted.  I
> let running jobs complete and monitored them manually, and restarted
> slurmd on each node as they finished.
>
> In desperation, you can do it, but it might be better to wait until no
> jobs (or few jobs) are running.
>
> On Thu, Mar 26, 2020 at 10:40 AM Pär Lindfors wrote:
> >
> > Hi Nate,
> >
> > On Fri, 2020-02-21 at 11:38 -0800, Nathan R Crawford wrote:
> > >   If it just requires restarting slurmctld and the slurmd processes
> > > on the nodes, I will be happy! Can you confirm that no running or
> > > pending jobs were lost in the transition?
> >
> > Did you change your SelectType to cons_tres? How did it go?
> >
> > We need to do the same change on one of our clusters. I have done a few
> > tests on a tiny test cluster which so far indicate that changing works
> > even with jobs running, but a configuration change with even a small
> > risk of purging the job list makes me a little nervous.
> >
> > Regards,
> > Pär Lindfors,
> > UPPMAX
> >
> > E-mailing Uppsala University means that we will process your personal
> > data. For more information on how this is performed, please read here:
> > http://www.uu.se/en/about-uu/data-protection-policy
>
>

-- 

Dr. Nathan Crawford  nathan.crawf...@uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA


Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-03-26 Thread Steven Dick
When I changed this on a running system, no jobs were killed, but
slurm lost track of jobs on nodes and was unable to kill them or tell
when they were finished until slurmd on each node was restarted.  I
let running jobs complete and monitored them manually, and restarted
slurmd on each node as they finished.

In desperation, you can do it, but it might be better to wait until no
jobs (or few jobs) are running.
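
The node-by-node approach described above (let jobs finish, then restart slurmd on each node as it empties) can be sketched as a small bookkeeping helper. This is only an illustration under assumptions: the function name and the node names are hypothetical, the lists are assumed to be gathered by hand or from your usual tools, and the sketch does not call Slurm or restart anything itself.

```python
def nodes_ready_for_slurmd_restart(all_nodes, busy_nodes, restarted_nodes):
    """Return nodes with no running jobs whose slurmd has not been restarted yet.

    all_nodes:       every compute node in the cluster (hypothetical names)
    busy_nodes:      nodes that still have jobs running on them
    restarted_nodes: nodes where slurmd was already restarted after the change
    """
    return sorted(set(all_nodes) - set(busy_nodes) - set(restarted_nodes))


# Hypothetical snapshot of a four-node cluster partway through the change.
all_nodes = ["n01", "n02", "n03", "n04"]
busy = ["n02", "n03"]   # jobs still running here, leave slurmd alone for now
done = ["n01"]          # slurmd already restarted here

print(nodes_ready_for_slurmd_restart(all_nodes, busy, done))  # ['n04']
```

Re-running the helper with fresh lists as jobs complete gives the next batch of nodes to restart.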

On Thu, Mar 26, 2020 at 10:40 AM Pär Lindfors wrote:
>
> Hi Nate,
>
> On Fri, 2020-02-21 at 11:38 -0800, Nathan R Crawford wrote:
> >   If it just requires restarting slurmctld and the slurmd processes
> > on the nodes, I will be happy! Can you confirm that no running or
> > pending jobs were lost in the transition?
>
> Did you change your SelectType to cons_tres? How did it go?
>
> We need to do the same change on one of our clusters. I have done a few
> tests on a tiny test cluster which so far indicate that changing works
> even with jobs running, but a configuration change with even a small
> risk of purging the job list makes me a little nervous.
>
> Regards,
> Pär Lindfors,
> UPPMAX



Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-03-26 Thread Pär Lindfors
Hi Nate,

On Fri, 2020-02-21 at 11:38 -0800, Nathan R Crawford wrote:
>   If it just requires restarting slurmctld and the slurmd processes
> on the nodes, I will be happy! Can you confirm that no running or
> pending jobs were lost in the transition?

Did you change your SelectType to cons_tres? How did it go?

We need to do the same change on one of our clusters. I have done a few
tests on a tiny test cluster which so far indicate that changing works
even with jobs running, but a configuration change with even a small
risk of purging the job list makes me a little nervous.

Regards,
Pär Lindfors,
UPPMAX


Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-02-21 Thread Nathan R Crawford
Hi Chris,

  If it just requires restarting slurmctld and the slurmd processes on the
nodes, I will be happy! Can you confirm that no running or pending jobs
were lost in the transition?

Thanks,
Nate

On Thu, Feb 20, 2020 at 6:54 PM Chris Samuel wrote:

> On 20/2/20 2:16 pm, Nathan R Crawford wrote:
>
> >I interpret this as, in general, changing SelectType will nuke
> > existing jobs, but that since cons_tres uses the same state format as
> > cons_res, it should work.
>
> We got caught with just this on our GPU nodes (though it was fixed
> before I got to see what was going on) - it seems that the format of the
> RPCs changes when you go from cons_res to cons_tres and we were having
> issues until we restarted slurmd on the compute nodes as well.
>
> My memory is that this kept new jobs from starting at all. I'm not sure
> what the consequences were for running jobs (though I suspect it would
> not have been great for them).
>
> If Doug sees this he may remember this (he caught and fixed it).
>
> All the best,
> Chris
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>



Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-02-20 Thread Chris Samuel

On 20/2/20 2:16 pm, Nathan R Crawford wrote:

> I interpret this as, in general, changing SelectType will nuke
> existing jobs, but that since cons_tres uses the same state format as
> cons_res, it should work.


We got caught with just this on our GPU nodes (though it was fixed 
before I got to see what was going on) - it seems that the format of the 
RPCs changes when you go from cons_res to cons_tres and we were having 
issues until we restarted slurmd on the compute nodes as well.


My memory is that this kept new jobs from starting at all. I'm not sure
what the consequences were for running jobs (though I suspect it would
not have been great for them).


If Doug sees this he may remember this (he caught and fixed it).

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



[slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-02-20 Thread Nathan R Crawford
Hi All,

  I have 19.05.4 and want to change SelectType from select/cons_res to
select/cons_tres without losing running or pending jobs. The documentation
is a bit conflicting.

From the man page:
SelectType
  Identifies the type of resource selection algorithm to be used. Changing
this value can only be done by restarting the slurmctld daemon and will
result in the loss of all job information (running and pending) since the
job state save format used by each plugin is different.

From slurm.schedmd.com/SLUG19/Slurm_19.05.pdf, slide 6:
● Can revert to cons_res without losing the queue
  ○ Although jobs using new cons_tres options cannot run
  ○ Both share a common state format to make this possible
    ■ Unlike cons_tres ⇎ serial which will drop the queue

  I interpret this as, in general, changing SelectType will nuke existing
jobs, but that since cons_tres uses the same state format as cons_res, it
should work.

  Has anyone done this on a running system?
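
For concreteness, the edit in question is a one-line change to slurm.conf. Below is a minimal sketch of that edit, applied to an in-memory copy of the file so it can be reviewed before anything is installed. The helper name and the sample config are hypothetical, and the sketch only rewrites text; it does not restart slurmctld or slurmd, which would still have to follow.

```python
import re


def set_select_type(conf_text, new_type="select/cons_tres"):
    """Return conf_text with its single SelectType line set to new_type.

    Commented-out lines and other keys such as SelectTypeParameters are left
    untouched. Raises ValueError unless exactly one SelectType line is found.
    """
    pattern = re.compile(
        r"^([ \t]*SelectType[ \t]*=[ \t]*)\S+",
        re.IGNORECASE | re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + new_type, conf_text)
    if count != 1:
        raise ValueError(f"expected exactly one SelectType line, found {count}")
    return new_text


# Hypothetical slurm.conf excerpt, for illustration only.
sample = """\
ClusterName=demo
#SelectType=select/linear
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
"""

updated = set_select_type(sample)
print(updated)
```

Working on a copy makes it easy to diff the result against the live file first.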

Thanks,
Nate

