Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?
Sounds pretty safe, but with the current COVID-19 difficulties, including Spring quarter classes being taught remotely (starts tomorrow, fun times ahead), I'm a bit reluctant to poke a running system. This will get put on the giant list of things waiting for a scheduled downtime once campus re-opens.

Thanks,
Nate

On Thu, Mar 26, 2020 at 8:26 AM Steven Dick wrote:
> When I changed this on a running system, no jobs were killed, but
> slurm lost track of jobs on nodes and was unable to kill them or tell
> when they were finished until slurmd on each node was restarted. I
> let running jobs complete and monitored them manually, and restarted
> slurmd on each node as they finished.
>
> In desperation, you can do it, but it might be better to wait until no
> jobs (or few jobs) are running.

--
Dr. Nathan Crawford nathan.crawf...@uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall
Office: 2101 Natural Sciences II
University of California, Irvine
Phone: 949-824-4508
Irvine, CA 92697-2025, USA
Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?
When I changed this on a running system, no jobs were killed, but slurm lost track of jobs on nodes and was unable to kill them or tell when they were finished until slurmd on each node was restarted. I let running jobs complete and monitored them manually, and restarted slurmd on each node as they finished.

In desperation, you can do it, but it might be better to wait until no jobs (or few jobs) are running.

On Thu, Mar 26, 2020 at 10:40 AM Pär Lindfors wrote:
> Hi Nate,
>
> On Fri, 2020-02-21 at 11:38 -0800, Nathan R Crawford wrote:
> > If it just requires restarting slurmctld and the slurmd processes
> > on the nodes, I will be happy! Can you confirm that no running or
> > pending jobs were lost in the transition?
>
> Did you change your SelectType to cons_tres? How did it go?
>
> We need to do the same change on one of our clusters. I have done a few
> tests on a tiny test cluster which so far indicates that changing works
> even with jobs running, but a configuration change with even a small
> risk of purging the job list makes me a little nervous.
>
> Regards,
> Pär Lindfors,
> UPPMAX
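Editor's sketch of the bookkeeping described above (let running jobs finish, watch them by hand, restart slurmd node by node). This is a minimal illustration, not the procedure Steven actually used. It assumes a snapshot is taken before the change with `squeue -h -t RUNNING -o "%i %N"`, and that hostlists are expanded with `scontrol show hostnames`. The function names `jobs_by_node` and `nodes_ready_for_restart` are hypothetical helpers invented for this example.

```python
import subprocess
from collections import defaultdict


def expand_hostlist(hostlist):
    """Expand a Slurm hostlist expression (e.g. node[01-03]) using scontrol."""
    out = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()


def jobs_by_node(squeue_text, expand=expand_hostlist):
    """Map node name -> set of job IDs.

    squeue_text is assumed to be the output of
    `squeue -h -t RUNNING -o "%i %N"`, captured BEFORE changing SelectType.
    """
    nodes = defaultdict(set)
    for line in squeue_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        job_id, hostlist = parts
        for node in expand(hostlist):
            nodes[node].add(job_id)
    return dict(nodes)


def nodes_ready_for_restart(snapshot, finished_jobs):
    """Return nodes whose snapshotted jobs were all confirmed finished by hand."""
    done = set(finished_jobs)
    return sorted(node for node, jobs in snapshot.items() if jobs <= done)
```

Nodes that had no running jobs at snapshot time do not appear in the snapshot at all, and jobs started after the snapshot are not tracked, so the result is only a checklist aid for the manual monitoring, not a replacement for it.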
Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?
Hi Nate,

On Fri, 2020-02-21 at 11:38 -0800, Nathan R Crawford wrote:
> If it just requires restarting slurmctld and the slurmd processes
> on the nodes, I will be happy! Can you confirm that no running or
> pending jobs were lost in the transition?

Did you change your SelectType to cons_tres? How did it go?

We need to make the same change on one of our clusters. I have done a few tests on a tiny test cluster, which so far indicate that changing works even with jobs running, but a configuration change with even a small risk of purging the job list makes me a little nervous.

Regards,
Pär Lindfors,
UPPMAX

E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/en/about-uu/data-protection-policy
Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?
Hi Chris,

If it just requires restarting slurmctld and the slurmd processes on the nodes, I will be happy! Can you confirm that no running or pending jobs were lost in the transition?

Thanks,
Nate

On Thu, Feb 20, 2020 at 6:54 PM Chris Samuel wrote:
> On 20/2/20 2:16 pm, Nathan R Crawford wrote:
> > I interpret this as, in general, changing SelectType will nuke
> > existing jobs, but that since cons_tres uses the same state format as
> > cons_res, it should work.
>
> We got caught with just this on our GPU nodes (though it was fixed
> before I got to see what was going on) - it seems that the format of the
> RPCs changes when you go from cons_res to cons_tres and we were having
> issues until we restarted slurmd on the compute nodes as well.
>
> My memory is that this was causing issues for starting new jobs (in a
> failing completely type of manner), I'm not sure what the consequences
> were for running jobs (though I suspect it would not have been great for
> them).
>
> If Doug sees this he may remember this (he caught and fixed it).
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?
On 20/2/20 2:16 pm, Nathan R Crawford wrote:
> I interpret this as, in general, changing SelectType will nuke
> existing jobs, but that since cons_tres uses the same state format as
> cons_res, it should work.

We got caught with just this on our GPU nodes (though it was fixed before I got to see what was going on) - it seems that the format of the RPCs changes when you go from cons_res to cons_tres and we were having issues until we restarted slurmd on the compute nodes as well.

My memory is that this was causing issues for starting new jobs (in a failing completely type of manner). I'm not sure what the consequences were for running jobs (though I suspect it would not have been great for them).

If Doug sees this he may remember this (he caught and fixed it).

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Is it safe to convert cons_res to cons_tres on a running system?
Hi All,

I am running Slurm 19.05.4 and want to change SelectType from select/cons_res to select/cons_tres without losing running or pending jobs. The documentation is a bit conflicting.

From the man page:

  SelectType
    Identifies the type of resource selection algorithm to be used. Changing this value can only be done by restarting the slurmctld daemon and will result in the loss of all job information (running and pending) since the job state save format used by each plugin is different.

From slurm.schedmd.com/SLUG19/Slurm_19.05.pdf, slide 6:

  ● Can revert to cons_res without losing the queue
    ○ Although jobs using new cons_tres options cannot run
    ○ Both share a common state format to make this possible
      ■ Unlike cons_tres ⇎ serial which will drop the queue

I interpret this as, in general, changing SelectType will nuke existing jobs, but that since cons_tres uses the same state format as cons_res, it should work.

Has anyone done this on a running system?

Thanks,
Nate
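Editor's sketch of a pre-flight check for the question above. It reads the active SelectType from `scontrol show config` output and tests whether a planned change stays within the cons_res / cons_tres pair that the quoted slide describes as sharing a state format. The helper names and the `SHARED_STATE_PLUGINS` set are assumptions made for illustration. The check encodes only the slide's claim and says nothing about the slurmd restart issues reported elsewhere in this thread.

```python
import re
import subprocess

# The pair the quoted SLUG19 slide says share a common state format.
# This set is an illustrative assumption, not something read from Slurm.
SHARED_STATE_PLUGINS = {"select/cons_res", "select/cons_tres"}


def live_config_text():
    """Return the output of `scontrol show config` (needs a working Slurm install)."""
    return subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    ).stdout


def current_select_type(config_text):
    """Extract the SelectType value from `scontrol show config` output, or None."""
    for line in config_text.splitlines():
        # Must not match the separate SelectTypeParameters line.
        match = re.match(r"\s*SelectType\s*=\s*(\S+)", line)
        if match:
            return match.group(1)
    return None


def shares_state_format(old, new):
    """True if both plugins are in the pair the slide says share a state format."""
    return old in SHARED_STATE_PLUGINS and new in SHARED_STATE_PLUGINS
```

For example, `shares_state_format(current_select_type(live_config_text()), "select/cons_tres")` could be evaluated before editing slurm.conf. A False result would correspond to the man page's warning that job information is lost.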