[slurm-dev] Re: Impact to jobs when reconfiguring partitions?
On 10/27/2016 09:42 AM, Loris Bennett wrote:

>> So is restarting slurmctld the only way to let it pick up changes in
>> slurm.conf?
>
> No. You can also do
>
>   scontrol reconfigure
>
> This does not restart slurmctld.

Question: How are the slurmd daemons notified about the changes in slurm.conf? Will slurmctld force the slurmd daemons to reread slurm.conf? There is no corresponding "slurmd reconfigure" command.

Question: If the cgroup.conf file is changed, will that also be picked up by a scontrol reconfigure?

Thanks,
Ole
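A minimal sketch of the step being discussed. Per the scontrol man page, `reconfigure` instructs all Slurm daemons to re-read the configuration file, which addresses the first question: the controller propagates the request to every slurmd, so there is no per-node command to run. The grep'd parameter below is just an example:

```shell
# Ask the Slurm daemons to re-read slurm.conf. slurmctld forwards the
# reconfigure request to every slurmd, so no per-node command is needed.
scontrol reconfigure

# Spot-check that the running configuration matches the edited file
# (SchedulerType is just an example parameter to inspect).
scontrol show config | grep SchedulerType
```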
[slurm-dev] Re: Impact to jobs when reconfiguring partitions?
Tuo Chen Peng writes:

> I thought the 'scontrol update' command was for letting slurmctld pick up
> any change in slurm.conf.
>
> But after reading the manual again, it seems this command is instead for
> changing settings at runtime, rather than reading changes from slurm.conf.
>
> So is restarting slurmctld the only way to let it pick up changes in
> slurm.conf?

No. You can also do

  scontrol reconfigure

This does not restart slurmctld.

Cheers,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin       Email loris.benn...@fu-berlin.de
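The distinction Loris draws can be sketched as follows; the node name `node01` is hypothetical:

```shell
# 'scontrol update' changes runtime state directly; it does not read
# slurm.conf and its changes are not written back to the file.
scontrol update NodeName=node01 State=DRAIN Reason="maintenance"

# 'scontrol reconfigure' is the command that makes the daemons
# re-read slurm.conf without a restart.
scontrol reconfigure
```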
[slurm-dev] Re: Impact to jobs when reconfiguring partitions?
On 25 October 2016 at 09:17, Tuo Chen Peng wrote:

> Oh ok, thanks for pointing this out.
>
> I thought the 'scontrol update' command was for letting slurmctld pick up
> any change in slurm.conf. But after reading the manual again, it seems
> this command is instead for changing settings at runtime, rather than
> reading changes from slurm.conf.
>
> So is restarting slurmctld the only way to let it pick up changes in
> slurm.conf?
>
> And if I change (2.2) in my plan to:
>
> (2.2) restart slurmctld to pick up changes in slurm.conf, then use
> 'scontrol reconfigure' to push the changes to all nodes
>
> do you see any impact to the running jobs in the cluster?

There shouldn't be any impact on running jobs at all, but of course there are always caveats:

- While slurmctld is restarting, no one will be able to submit jobs (although it should take ~5 seconds to restart unless you have made an error, in which case it will take probably 1 minute to restart while you fix and/or roll back, so no one should even notice).

- As an extension of the above, a job on the queue could have a running job as a dependency, and that job could finish in the few seconds that slurmctld is down... but I doubt it.

- I can't remember exactly what they do, but if you look over the list, I think some people save the contents of /var/spool/slurmd (which I believe holds the "state" information of all running jobs).

(Note that none of these is a real concern; they are just possibilities.)

L.

--
The most dangerous phrase in the language is, "We've always done it this way."
- Grace Hopper
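For the third caveat, a hedged sketch of backing up the state directory before a restart. The authoritative location is whatever StateSaveLocation points to in the running configuration (the controller's job-state files live there; /var/spool/slurmd is typically the per-node SlurmdSpoolDir), so it is safer to query it than to hard-code a path. The backup destination below is an assumption:

```shell
# Look up the controller's state directory from the running config
# ("StateSaveLocation = /path" in 'scontrol show config' output) and
# archive it before restarting slurmctld. Paths are site-specific.
STATE_DIR=$(scontrol show config | awk '/StateSaveLocation/ {print $NF}')
tar czf "/root/slurm-state-$(date +%F).tar.gz" "$STATE_DIR"
```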
[slurm-dev] Re: Impact to jobs when reconfiguring partitions?
Oh ok, thanks for pointing this out.

I thought the 'scontrol update' command was for letting slurmctld pick up any change in slurm.conf. But after reading the manual again, it seems this command is instead for changing settings at runtime, rather than reading changes from slurm.conf.

So is restarting slurmctld the only way to let it pick up changes in slurm.conf?

And if I change (2.2) in my plan to:

(2.2) restart slurmctld to pick up changes in slurm.conf, then use 'scontrol reconfigure' to push the changes to all nodes

do you see any impact to the running jobs in the cluster?

Thanks

From: Lachlan Musicman [mailto:data...@gmail.com]
Sent: Monday, October 24, 2016 2:58 PM
To: slurm-dev
Subject: [slurm-dev] Re: Impact to jobs when reconfiguring partitions?

On 25 October 2016 at 08:42, Tuo Chen Peng <tp...@nvidia.com> wrote:

> Hello all,
>
> This is my first post in the mailing list - nice to join the community!

Welcome!

> I have a general question regarding slurm partition changes:
>
> If I move one node from one partition to another, will it cause any impact
> to the jobs that are still running on other nodes, in both partitions?

No, it shouldn't, depending on how you execute the plan...

> But we would like to do this without interrupting existing, running jobs.
> What would be the safe way to do this?
>
> And here's my plan:
>
> (1) Drain the node in the main partition for the move, and only drain that
> node - keep the other nodes available for job submission.
>
> (2) Move the node from the main partition to the short-job partition:
>
> (2.1) Update slurm.conf on both the control node and the node to be moved,
> so that this node is listed under the short-job partition.
>
> (2.2) Run 'scontrol update' on both the control node and the node just
> moved, to let Slurm pick up the configuration change.
>
> (3) The node should now be in the short-job partition; set the node back
> to the normal/idle state.
>
> Is 'scontrol update' the right command to use in this case?
>
> Does anyone see any impact / concern in the above sequence?
> I'm mostly worried about whether such a partition change could cause
> users' existing jobs to be killed or to fail for some reason.

Looks correct except for 2.2 - my understanding is that you would need to restart the slurmctld process (`systemctl restart slurm`) at this point - which is when the Slurm "head" node picks up the changes to slurm.conf - and then run `scontrol reconfigure` to distribute that change to the nodes.

Cheers

L.
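Putting the corrected plan together as a sketch. The partition names `main` and `short`, the node name `node01`, and the systemd unit name are all assumptions (depending on the install, the unit may be `slurm`, `slurmctld`, or something else):

```shell
# (1) Drain only the node being moved; other nodes keep accepting jobs.
scontrol update NodeName=node01 State=DRAIN Reason="move to short partition"

# (2.1) Edit slurm.conf on the controller and on node01 so that node01
#       appears in the short partition's Nodes= line instead of main's.

# (2.2) Restart the controller so it re-reads slurm.conf, then push the
#       change out to the compute nodes.
systemctl restart slurmctld
scontrol reconfigure

# (3) Return the node to service in its new partition.
scontrol update NodeName=node01 State=RESUME
```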
[slurm-dev] Re: Impact to jobs when reconfiguring partitions?
On 25 October 2016 at 08:42, Tuo Chen Peng wrote:

> Hello all,
>
> This is my first post in the mailing list - nice to join the community!

Welcome!

> I have a general question regarding slurm partition changes:
>
> If I move one node from one partition to another, will it cause any impact
> to the jobs that are still running on other nodes, in both partitions?

No, it shouldn't, depending on how you execute the plan...

> But we would like to do this without interrupting existing, running jobs.
> What would be the safe way to do this?
>
> And here's my plan:
>
> (1) Drain the node in the main partition for the move, and only drain that
> node - keep the other nodes available for job submission.
>
> (2) Move the node from the main partition to the short-job partition:
>
> (2.1) Update slurm.conf on both the control node and the node to be moved,
> so that this node is listed under the short-job partition.
>
> (2.2) Run 'scontrol update' on both the control node and the node just
> moved, to let Slurm pick up the configuration change.
>
> (3) The node should now be in the short-job partition; set the node back
> to the normal/idle state.
>
> Is 'scontrol update' the right command to use in this case?
>
> Does anyone see any impact / concern in the above sequence?
>
> I'm mostly worried about whether such a partition change could cause
> users' existing jobs to be killed or to fail for some reason.

Looks correct except for 2.2 - my understanding is that you would need to restart the slurmctld process (`systemctl restart slurm`) at this point - which is when the Slurm "head" node picks up the changes to slurm.conf - and then run `scontrol reconfigure` to distribute that change to the nodes.

Cheers

L.
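After step (3), a quick check (node name hypothetical) that the move took effect before telling users the short partition has the extra node:

```shell
# Confirm node01 now shows up under its new partition and is idle again.
sinfo -N -l -n node01
scontrol show node node01 | grep -E 'State|Partitions'
```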