[slurm-users] Re: How to exclude master from computing? Set to DRAINED?
Thanks Steffen, that makes a lot of sense. I will simply not start slurmd in the master's Ansible role when the master is not meant to be used for computing.

Best regards,
Xaver

On 24.06.24 14:23, Steffen Grunewald via slurm-users wrote:
On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote:
> Dear Slurm users, in our project we exclude the master from computing before starting slurmctld. We used to do this by simply not mentioning it in the configuration, i.e. by not having "PartitionName=SomePartition Nodes=master" or something similar. Apparently this is no longer the way to do it, as it now produces a fatal error: "fatal: Unable to determine this slurmd's NodeName".

You're attempting to start the slurmd - which isn't required on this machine, as you say. Disable it. Keep slurmctld enabled (and declared in the config).

> Therefore, my question: what is the best practice for excluding the master node from work?

Not defining it as a worker node.

> I personally primarily see the option to set the node into DOWN, DRAINED or RESERVED.

These states are slurmd states, and therefore meaningless for a machine that doesn't have a running slurmd. (It's the nodes that are defined in the config that are supposed to be able to run slurmd.)

> So is DRAINED the correct setting in such a case?

Since this only applies to a node that has been defined in the config, and you (correctly) didn't do so, there's no need (and no means) to "drain" it.

Best
Steffen
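For readers finding this thread later, a minimal sketch of what that ends up looking like (node and partition names are made up, not taken from the thread): the master appears only as SlurmctldHost, the NodeName and PartitionName lines list only the workers, and slurmd is simply never enabled on the master.

```
# slurm.conf sketch: the master runs slurmctld only and has no NodeName entry
SlurmctldHost=master
NodeName=worker[1-4] CPUs=8 RealMemory=8000 State=CLOUD
PartitionName=SomePartition Nodes=worker[1-4] Default=YES MaxTime=INFINITE State=UP
```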
[slurm-users] How to exclude master from computing? Set to DRAINED?
Dear Slurm users,

in our project we exclude the master from computing before starting slurmctld. We used to do this by simply not mentioning it in the configuration, i.e. by not having:

PartitionName=SomePartition Nodes=master

or something similar. Apparently, this is not the way to do it, as it now leads to a fatal error:

fatal: Unable to determine this slurmd's NodeName

Therefore, my *question:* What is the best practice for excluding the master node from work? I personally primarily see the option to set the node into DOWN, DRAINED or RESERVED. Since we use ReturnToService=2, I guess DOWN is not the way to go. RESERVED fits with the second part of its description, "The node is in an advanced reservation and *not generally available*", and DRAINED, "The node is unavailable for use per system administrator request", fits completely. So is *DRAINED* the correct setting in such a case?

Best regards,
Xaver
[slurm-users] Slurm.conf and workers
Dear slurm-user list,

as far as I understand it, the slurm.conf needs to be present on the master and on the workers at the same location (unless another path is set via SLURM_CONF). However, I noticed that when adding a partition only in the master's slurm.conf, all workers were able to "correctly" show the added partition when calling sinfo on them. Is the stored slurm.conf on every instance just a fallback for when the connection is down, or what is its purpose? The documentation only says: "This file should be consistent across all nodes in the cluster."

Best regards,
Xaver
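One observation that may explain part of this: sinfo and scontrol query slurmctld over RPC, so their output reflects the controller's view of the configuration rather than the local file on the worker. For keeping the files themselves consistent, a very simple check across hosts (hostnames are examples):

```
# compare the local slurm.conf with a worker's copy
md5sum /etc/slurm/slurm.conf
ssh worker1 md5sum /etc/slurm/slurm.conf
```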
[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?
Thank you Brian,

while ResumeRate might be able to keep the CPU usage within an acceptable margin, it's not really a fix, but a workaround. I would prefer a solution that groups resume requests and therefore needs a single Ansible playbook run per second instead of up to ResumeRate runs. As we completely destroy our instances when powering down, we need to set them up from scratch using Ansible. Running Ansible on the worker nodes would be possible, but that comes with additional steps to save all log files on the master in case the startup fails and you want to investigate. For now I feel that using the master to set up the workers is the better structure.

Best regards,
Xaver

On 08.04.24 18:18, Brian Andrus via slurm-users wrote:
Xaver,

You may want to look at the ResumeRate option in slurm.conf:

ResumeRate
The rate at which nodes in power save mode are returned to normal operation by ResumeProgram. The value is a number of nodes per minute and it can be used to prevent power surges if a large number of nodes in power save mode are assigned work at the same time (e.g. a large job starts). A value of zero results in no limits being imposed. The default value is 300 nodes per minute.

I have all our nodes in the cloud and they power down/deallocate when idle for a bit. I do not use ansible to start them and use the cli interface directly, so the only cpu usage is by that command. I do plan on having ansible run from the node to do any hot-fix/updates from the base image or changes. By running it from the node, it would alleviate any cpu spikes on the slurm head node.

Just a possible path to look at.

Brian Andrus

On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote:
Dear slurm user list,

we make use of elastic cloud computing, i.e. node instances are created on demand and are destroyed when they are not used for a certain amount of time. Created instances are set up via Ansible. If more than one instance is requested at the exact same time, Slurm will pass those into the resume script together and one Ansible call will handle all those instances. However, more often than not, workflows will request multiple instances within the same second, but not at the exact same time. This leads to multiple resume script calls and therefore to multiple Ansible calls, which in turn leads to less clear log files, greater CPU consumption by the multiple running Ansible calls, and so on.

What I am looking for is an option to force Slurm to wait a certain amount of time and then perform a single resume call for all instances within that time frame (let's say 1 second). Is this somehow possible?

Best regards,
Xaver
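In case it is useful to anyone with the same setup, here is a rough sketch of how the batching could be done inside the ResumeProgram itself, since Slurm (at least in the versions discussed here) has no built-in "wait and group" option. Paths, the window length and the playbook name are placeholders, and the locking is deliberately simple:

```
#!/bin/bash
# resume-batch.sh (sketch): Slurm passes the nodes to resume as a hostlist
# expression in $1. Requests are appended to a queue file; whichever
# invocation grabs the lock waits briefly and runs Ansible once for the batch.
QUEUE=/var/spool/slurm/resume_queue
LOCK=/var/spool/slurm/resume_batch.lock
WINDOW=2   # seconds to collect further requests

scontrol show hostnames "$1" >> "$QUEUE"

(
  flock -n 9 || exit 0              # someone else is already batching
  sleep "$WINDOW"
  sort -u "$QUEUE" > "$QUEUE.batch" && : > "$QUEUE"
  ansible-playbook start_workers.yml --limit "$(paste -sd, "$QUEUE.batch")"
) 9>"$LOCK" &
```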
[slurm-users] Elastic Computing: Is it possible to incentivize grouping power_up calls?
Dear slurm user list,

we make use of elastic cloud computing, i.e. node instances are created on demand and are destroyed when they are not used for a certain amount of time. Created instances are set up via Ansible. If more than one instance is requested at the exact same time, Slurm will pass those into the resume script together and one Ansible call will handle all those instances. However, more often than not, workflows will request multiple instances within the same second, but not at the exact same time. This leads to multiple resume script calls and therefore to multiple Ansible calls, which in turn leads to less clear log files, greater CPU consumption by the multiple running Ansible calls, and so on.

What I am looking for is an option to force Slurm to wait a certain amount of time and then perform a single resume call for all instances within that time frame (let's say 1 second). Is this somehow possible?

Best regards,
Xaver
[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING
I am wondering why my question (below) didn't catch anyone's attention. Just as feedback for me: is it unclear where my problem lies, or is it clear but no solution is known? I looked through the documentation and have now searched the Slurm repository, but am still unable to clearly identify how to handle "NOT_RESPONDING". I would really like to improve my question if necessary.

Best regards,
Xaver

On 23.02.24 18:55, Xaver Stiensmeier wrote:
Dear slurm-user list,

I have a cloud node that is powered up and down on demand. Rarely it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 in order to avoid the node being marked DOWN, because the instance behind that node is created on demand, so after a failure nothing prevents the system from starting the node again as a different instance. I thought this would be enough, but apparently the node is still marked with "NOT_RESPONDING", which leads to Slurm not scheduling on it. After a while NOT_RESPONDING is removed, but I would like to remove it directly from within my fail script if possible, so that the node can return to service immediately and not be blocked by "NOT_RESPONDING".

Best regards,
Xaver
[slurm-users] Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING
Dear slurm-user list,

I have a cloud node that is powered up and down on demand. Rarely it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 in order to avoid the node being marked DOWN, because the instance behind that node is created on demand, so after a failure nothing prevents the system from starting the node again as a different instance. I thought this would be enough, but apparently the node is still marked with "NOT_RESPONDING", which leads to Slurm not scheduling on it. After a while NOT_RESPONDING is removed, but I would like to remove it directly from within my fail script if possible, so that the node can return to service immediately and not be blocked by "NOT_RESPONDING".

Best regards,
Xaver
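For what it's worth, one thing that could be tried from the fail script (hedged - I have not verified that it clears the NOT_RESPONDING flag in every version) is to push the node explicitly through the power-down path before resuming it, so the next scheduling attempt starts from a clean powered-down state:

```
# fail-script sketch: $1 is the node name handed over by Slurm
scontrol update NodeName="$1" State=POWER_DOWN_FORCE Reason="resume timed out"
scontrol update NodeName="$1" State=RESUME
```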
[slurm-users] Slurm Power Saving Guide: Why doesn't Slurm mark the node as failed when ResumeProgram returns =/= 0?
Dear slurm-user list,

I had cases where our ResumeProgram failed due to temporary cloud timeouts. In that case the ResumeProgram returns a value =/= 0. Why does Slurm still wait until ResumeTimeout instead of just accepting the startup as failed, which should then lead to a rescheduling of the job? Is there some way to achieve the described effect, i.e. tell Slurm "You can stop waiting, the node won't come alive", or am I missing the correct way this should be handled in Slurm?

Best regards,
Xaver
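As far as I know, slurmctld does not treat a non-zero exit code of ResumeProgram as a failed power-up on its own. A pattern that is sometimes used instead (sketch only, with provision_instances standing in for whatever actually creates the instances) is to have the resume script mark the node down itself when provisioning fails, so the job gets rescheduled without waiting for ResumeTimeout:

```
#!/bin/bash
# ResumeProgram sketch: $1 is the hostlist of nodes to power up
if ! provision_instances "$1"; then
    # tell Slurm immediately instead of letting it run into ResumeTimeout;
    # with ReturnToService=2 (or a later RESUME) the node can come back
    scontrol update NodeName="$1" State=DOWN Reason="cloud startup failed"
    exit 1
fi
```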
[slurm-users] Re: Errors upgrading to 23.11.0 -- jwt-secret.key
Thank you for your response. I have found out why there was no error in the log: I've been looking at the wrong log. The error didn't occur on the master, but on our vpn-gateway (it is a hybrid cloud setup) - but you can think of it as just another worker in the same network. The error I get there is:

Feb 08 11:38:25 cluster-vpngtw-3ts770ji3a8ubr1-0 slurmctld[32014]: slurmctld: fatal: auth/jwt: cannot stat '/etc/slurm/jwt-secret.key': No such file or directory
Feb 08 11:38:25 cluster-vpngtw-3ts770ji3a8ubr1-0 systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 08 11:38:25 cluster-vpngtw-3ts770ji3a8ubr1-0 systemd[1]: slurmctld.service: Failed with result 'exit-code'.
Feb 08 11:38:25 cluster-vpngtw-3ts770ji3a8ubr1-0 systemd[1]: Failed to start Slurm controller daemon.

In the past we have created the `jwt-secret.key` on the master at `/etc/slurm` and that was enough. I must admit that I am not completely familiar with it, but I will now look into it more closely and also double-check whether such a key is stored there in the old Slurm version.

Best regards,
Xaver

On 08.02.24 11:07, Luke Sudbery via slurm-users wrote:
Your systemctl output shows that slurmctld is running OK, but that doesn't match with your first entry, so it's hard to tell what's going on. But if slurmctld won't start under systemd and it's not clear why, the first step would be to enable something like `SlurmctldDebug=debug` and check the full logs in journalctl, or just run slurmctld in the foreground with:

/usr/sbin/slurmctld -D -vvv

Make sure the system service is properly stopped and there aren't any rogue slurmctld processes anywhere.

Many thanks,
Luke
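In case someone else runs into the same message: the key only has to be readable at the path configured via AuthAltParameters=jwt_key=... on every daemon that has auth/jwt enabled. A minimal sketch of creating and distributing it (the path and target host are taken from the log above; the generation method shown is just one common way):

```
# create the key once (on the master), readable only by the slurm user
dd if=/dev/random of=/etc/slurm/jwt-secret.key bs=32 count=1
chown slurm:slurm /etc/slurm/jwt-secret.key
chmod 0600 /etc/slurm/jwt-secret.key
# copy it to the other host(s) running slurmctld/slurmd with auth/jwt enabled
scp /etc/slurm/jwt-secret.key cluster-vpngtw-3ts770ji3a8ubr1-0:/etc/slurm/
```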
[slurm-users] Errors upgrading to 23.11.0
Dear slurm-user list,

I got this error:

Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code.\nSee \"systemctl status slurmctld.service\" and \"journalctl -xeu slurmctld.service\" for details.

but in slurmctld.service I see nothing suspicious:

slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: active (running) since Wed 2024-02-07 15:50:56 UTC; 19min ago
   Main PID: 51552 (slurmctld)
      Tasks: 21 (limit: 9363)
     Memory: 10.4M
        CPU: 1min 16.088s
     CGroup: /system.slice/slurmctld.service
             ├─51552 /usr/sbin/slurmctld --systemd
             └─51553 "slurmctld: slurmscriptd" "" "" "" "" "" ""

Feb 07 15:58:21 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: sched: _slurm_rpc_allocate_resources JobId=3 NodeList=(null) usec=959
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 WTERMSIG 2
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 cancelled by interactive user
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 done
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _slurm_rpc_complete_job_allocation: JobId=3 error Job/step already completing or completed
Feb 07 15:58:42 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: sched: _slurm_rpc_allocate_resources JobId=4 NodeList=cluster-master-2vt2bqh7ahec04c,cluster-worker-2vt2bqh7ahec04c-2 usec=512
Feb 07 16:06:04 cluster-master-2vt2bqh7ahec04c slurmctld[51553]: slurmctld: error: _run_script: JobId=0 resumeprog exit status 1:0
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=4 WTERMSIG 2
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=4 done
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _slurm_rpc_complete_job_allocation: JobId=4 error Job/step already completing or completed

I am unsure how to debug this further. It might be coming from a previous problem I tried to fix (basically a few deprecated keys in the configuration). I will try to restart the entire cluster with the added changes to rule out any follow-up errors, but maybe it's something obvious a fellow list user can see.

Best regards,
Xaver
Re: [slurm-users] SlurmdSpoolDir full
Hello Brian Andrus,

we ran 'df -h' to determine the amount of free space I mentioned below. I also should add that at the time we inspected the node, there was still around 38 GB of space left - however, we were unable to watch the remaining space while the error occurred, so maybe the large file(s) got removed immediately. I will take a look at /var/log. That's a good idea. I don't think that there will be anything unusual, but it's something I haven't thought about yet (the reason for the error being somewhere else).

Best regards
Xaver

On 10.12.23 00:41, Brian Andrus wrote:
Xaver,

It is likely your /var or /var/spool mount. That may be a separate partition or part of your root partition. It is the partition that is full, not the directory itself. So the cause could very well be log files in /var/log. I would check to see what (if any) partitions are getting filled on the node. You can run 'df -h' and see some info that would get you started.

Brian Andrus

On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote:
Dear slurm-user list,

during a larger cluster run (the same I mentioned earlier, 242 nodes), I got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a directory on the workers that is used for job state information (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However, I was unable to find more precise information on that directory. We compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB of free space where nothing is intentionally put during the run. This error only occurred on very few nodes. I would like to understand what slurmd is placing in this directory that fills up the space. Do you have any ideas? Due to the workflow used, we have a hard time reconstructing the exact scenario that caused this error. I guess the "fix" is to just pick a somewhat larger disk, but I am unsure whether Slurm behaves normally here.

Best regards
Xaver Stiensmeier
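A quick way to see what is actually eating the space on an affected worker (the default SlurmdSpoolDir path is shown; adjust it to whatever slurm.conf sets):

```
df -h /var/spool/slurmd /var/log
du -xh --max-depth=2 /var/spool/slurmd | sort -h | tail -20
du -xh --max-depth=2 /var/log | sort -h | tail -20
```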
[slurm-users] SlurmdSpoolDir full
Dear slurm-user list,

during a larger cluster run (the same I mentioned earlier, 242 nodes), I got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a directory on the workers that is used for job state information (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However, I was unable to find more precise information on that directory. We compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB of free space where nothing is intentionally put during the run. This error only occurred on very few nodes. I would like to understand what slurmd is placing in this directory that fills up the space. Do you have any ideas? Due to the workflow used, we have a hard time reconstructing the exact scenario that caused this error. I guess the "fix" is to just pick a somewhat larger disk, but I am unsure whether Slurm behaves normally here.

Best regards
Xaver Stiensmeier
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Ole,

for multiple reasons we build it ourselves. I am not really involved in that process, but I will contact the person who is. Thanks for the recommendation! We should probably implement a regular check for whether there is a new Slurm version. I am not 100% sure whether this will fix our issues or not, but it's worth a try.

Best regards
Xaver

On 06.12.23 12:03, Ole Holm Nielsen wrote:
On 12/6/23 11:51, Xaver Stiensmeier wrote:
Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the upgrade change anything in that regard? I will keep reading.

There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving Power with Slurm" at https://slurm.schedmd.com/publications.html For reasons of security and functionality it is recommended to follow Slurm's releases (maybe not the first few minor versions of new major releases like 23.11). FYI, I've collected information about upgrading Slurm in the Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

/Ole
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Ole, Good idea. Here's our current version: ``` sinfo -V slurm 22.05.7 ``` Quick googling told me that the latest version is 23.11. Does the upgrade change anything in that regard? I will keep reading. Xaver On 06.12.23 11:09, Ole Holm Nielsen wrote: Hi Xaver, Your version of Slurm may matter for your power saving experience. Do you run an updated version? /Ole On 12/6/23 10:54, Xaver Stiensmeier wrote: Hi Ole, I will double check, but I am very sure that giving a reason is possible as it has been done at least 20 other times without error during that exact run. It might be ignored though. You can also give a reason when defining the states POWER_UP and POWER_DOWN. Slurm's documentation is not always giving all information. We run our solution for about a year now so I don't think there's a general problem (as in something that necessarily occurs) with the command. But I will take a closer look. I really feel like it has to be something more conditional though as otherwise the error would've occurred more often (i.e. every time when handling a fail and the command is execute). >> IHTH, Ole
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Ole,

I will double check, but I am very sure that giving a reason is possible, as it has been done at least 20 other times without error during that exact run. It might be ignored, though. You can also give a reason when defining the states POWER_UP and POWER_DOWN. Slurm's documentation does not always give all the information. We have run our solution for about a year now, so I don't think there's a general problem (as in something that necessarily occurs) with the command. But I will take a closer look. I really feel like it has to be something more conditional, though, as otherwise the error would've occurred more often (i.e. every time a failure is handled and the command is executed).

Your repository would've been really helpful for me when we started implementing the cloud scheduling, but I feel like we have already implemented most things you mention there. I will take a look at `DebugFlags=Power`, though. `PrivateData=cloud` was an annoying thing to find out; Slurm plans/planned to change that in the future (the cloud key behaves differently from every other key in PrivateData). Of course our setup differs a little in the details.

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:
Hi Xavier,

On 12/6/23 09:28, Xaver Stiensmeier wrote:
using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in

slurm_update error: Invalid node state specified

when we called

scontrol update NodeName="$1" state=RESUME reason=FailedStartup

in the Fail script. We run this to make 100% sure that the instances - which are created on demand - are again `~idle` after being removed by the fail program. They are set to RESUME before the actual instance gets destroyed. I remember running into this case manually before, but I don't remember when it occurs. Maybe someone has a great idea how to tackle this problem.

Probably you can't assign a "reason" when you update a node with state=RESUME. The scontrol manual page says:

Reason= Identify the reason the node is in a "DOWN", "DRAINED", "DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving and in my power saving tools at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH, Ole
[slurm-users] Power Save: When is RESUME an invalid node state?
Dear Slurm User list,

using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in

slurm_update error: Invalid node state specified

when we called

scontrol update NodeName="$1" state=RESUME reason=FailedStartup

in the Fail script. We run this to make 100% sure that the instances - which are created on demand - are again `~idle` after being removed by the fail program. They are set to RESUME before the actual instance gets destroyed. I remember running into this case manually before, but I don't remember when it occurs. Maybe someone has a great idea how to tackle this problem.

Best regards
Xaver Stiensmeier
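One defensive variant of that fail-script call, sketched here, is to look at the node's current state first and only issue RESUME when it is in a state that accepts it, logging the odd cases for later inspection ($1 is the node name, the log path is made up):

```
state=$(sinfo -h -n "$1" -o '%t' | head -n1)
case "$state" in
  down*|drain*|fail*)
      scontrol update NodeName="$1" State=RESUME ;;
  *)
      echo "$(date) $1 is in state '$state', skipping RESUME" >> /var/log/slurm/fail_extra.log ;;
esac
```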
Re: [slurm-users] GRES and GPUs
Hey everyone,

I am answering my own question: it wasn't working because I need to *reload slurmd* on the machine, too. So the full "test GPU management without a GPU" workflow is:

1. Start your Slurm cluster.

2. Add a GPU to an instance of your choice in the *slurm.conf*. For example:

DebugFlags=GRES  # consider this for the initial setup
SelectType=select/cons_tres
GresTypes=gpu
NodeName=master SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 Gres=gpu:1 State=UNKNOWN

3. Register it in *gres.conf* and give it *some file*:

NodeName=master Name=gpu File=/dev/tty0 Count=1  # Count seems to be optional

4. Reload slurmctld (on the master) and slurmd (on the GPU node):

sudo systemctl restart slurmctld
sudo systemctl restart slurmd

I haven't tested this solution thoroughly yet, but at least commands like "sudo systemctl restart slurmd" on the master run without any issues afterwards.

Thank you for all your help!

Best regards,
Xaver

On 19.07.23 17:05, Xaver Stiensmeier wrote:
Hi Hermann,

count doesn't make a difference, but I noticed that when I reconfigure slurm and do reloads afterwards, the error "gpu count lower than configured" no longer appears - so maybe it is just because a reconfigure is needed after reloading slurmctld - or maybe it doesn't show the error anymore because the node is still invalid? However, I still get the error:

error: _slurm_rpc_node_registration node=NName: Invalid argument

If I understand correctly, this is telling me that there's something wrong with my slurm.conf. I know that all pre-existing parameters are correct, so I assume it must be the gpus entry, but I don't see where it's wrong:

NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 Gres=gpu:1 State=CLOUD # bibiserv

Thanks for all the help,
Xaver

On 19.07.23 15:04, Hermann Schwärzler wrote:
Hi Xaver,

I think you are missing the "Count=..." part in gres.conf. It should read

NodeName=NName Name=gpu File=/dev/tty0 Count=1

in your case.

Regards,
Hermann

On 7/19/23 14:19, Xaver Stiensmeier wrote:
Okay, thanks to S. Zhang I was able to figure out why nothing changed. While I did restart slurmctld at the beginning of my tests, I didn't do so later, because I felt like it was unnecessary, but it is right there in the fourth line of the log that this is needed. Somehow I misread it and thought it automatically restarted slurmctld. Given the setup:

slurm.conf
...
GresTypes=gpu
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 Gres=gpu:1 State=UNKNOWN
...

gres.conf
NodeName=NName Name=gpu File=/dev/tty0

When restarting, I get the following error:

error: Setting node NName state to INVAL with reason:gres/gpu count reported lower than configured (0 < 1)

So it is still not working, but at least I get a more helpful log message. Because I know that this /dev/tty trick works, I am still unsure where the current error lies, but I will try to investigate it further. I am thankful for any ideas in that regard.

Best regards,
Xaver

On 19.07.23 10:23, Xaver Stiensmeier wrote:
Alright, I tried a few more things, but I still wasn't able to get past:

srun: error: Unable to allocate resources: Invalid generic resource (gres) specification.

I should mention that the node I am trying to test GPU with doesn't really have a GPU, but Rob was so kind to find out that you do not need a GPU as long as you just link to a file in /dev/ in the gres.conf. As mentioned: this is just for testing purposes - in the end we will run this on a node with a GPU, but it is not available at the moment.

*The error isn't changing*

If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same error.

*Debug Info*

I added the gpu debug flag and logged the following:

[2023-07-18T14:59:45.026] restoring original state of nodes
[2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change GresPlugins
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change GresPlugins
[2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set
[2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
[2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed usec=5898
[2023-07-18T14:59:45.952] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_ma
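A quick way to verify the four steps at the top of this message after both restarts (node name as in the example above):

```
scontrol show node master | grep -i -A1 gres
srun --gres=gpu:1 hostname
```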
Re: [slurm-users] GRES and GPUs
Hi Hermann, count doesn't make a difference, but I noticed that when I reconfigure slurm and do reloads afterwards, the error "gpu count lower than configured" no longer appears - so maybe it is just because a reconfigure is needed after reloading slurmctld - or maybe it doesn't show the error anymore, because the node is still invalid? However, I still get the error: error: _slurm_rpc_node_registration node=NName: Invalid argument If I understand correctly, this is telling me that there's something wrong with my slurm.conf. I know that all pre-existing parameters are correct, so I assume it must be the gpus entry, but I don't see where it's wrong: NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 Gres=gpu:1 State=CLOUD # bibiserv Thanks for all the help, Xaver On 19.07.23 15:04, Hermann Schwärzler wrote: Hi Xaver, I think you are missing the "Count=..." part in gres.conf It should read NodeName=NName Name=gpu File=/dev/tty0 Count=1 in your case. Regards, Hermann On 7/19/23 14:19, Xaver Stiensmeier wrote: Okay, thanks to S. Zhang I was able to figure out why nothing changed. While I did restart systemctld at the beginning of my tests, I didn't do so later, because I felt like it was unnecessary, but it is right there in the fourth line of the log that this is needed. Somehow I misread it and thought it automatically restarted slurmctld. Given the setup: slurm.conf ... GresTypes=gpu NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 GRES=gpu:1 State=UNKNOWN ... gres.conf NodeName=NName Name=gpu File=/dev/tty0 When restarting, I get the following error: error: Setting node NName state to INVAL with reason:gres/gpu count reported lower than configured (0 < 1) So it is still not working, but at least I get a more helpful log message. Because I know that this /dev/tty trick works, I am still unsure where the current error lies, but I will try to investigate it further. I am thankful for any ideas in that regard. Best regards, Xaver On 19.07.23 10:23, Xaver Stiensmeier wrote: Alright, I tried a few more things, but I still wasn't able to get past: srun: error: Unable to allocate resources: Invalid generic resource (gres) specification. I should mention that the node I am trying to test GPU with, doesn't really have a gpu, but Rob was so kind to find out that you do not need a gpu as long as you just link to a file in /dev/ in the gres.conf. As mentioned: This is just for testing purposes - in the end we will run this on a node with a gpu, but it is not available at the moment. *The error isn't changing* If I omitt "GresTypes=gpu" and "Gres=gpu:1", I still get the same error. 
*Debug Info* I added the gpu debug flag and logged the following: [2023-07-18T14:59:45.026] restoring original state of nodes [2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu ignored [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change GresPlugins [2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu ignored [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change GresPlugins [2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure [2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions [2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set [2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set. [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed usec=5898 [2023-07-18T14:59:45.952] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2 I am a bit unsure what to do next to further investigate this issue. Best regards, Xaver On 17.07.23 15:57, Groner, Rob wrote: That would certainly do it. If you look at the slurmctld log when it comes up, it will say that it's marking that node as invalid because it has less (0) gres resources then you say it should have. That's because slurmd on that node will come up and say "What gres resources??" For testing purposes, you can just create a dummy file on the node, then in gres.conf, point to that file as the "graphics file" interface. As long as you don't try to actually use it as a graphics file, that should be enough for that node to think it has gres/gpu resources. That's what I do in my vagrant slurm cluster. Rob -------- *From:* slurm-users on behalf of Xaver Stiensmeier *Sent:* Monday, July 17, 2023 9:43 AM *To:* slurm-users@lists.schedmd.com *Subject:* Re: [slurm-users] GRES and GPUs Hi Hermann, G
Re: [slurm-users] GRES and GPUs
Okay, thanks to S. Zhang I was able to figure out why nothing changed. While I did restart systemctld at the beginning of my tests, I didn't do so later, because I felt like it was unnecessary, but it is right there in the fourth line of the log that this is needed. Somehow I misread it and thought it automatically restarted slurmctld. Given the setup: slurm.conf ... GresTypes=gpu NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 GRES=gpu:1 State=UNKNOWN ... gres.conf NodeName=NName Name=gpu File=/dev/tty0 When restarting, I get the following error: error: Setting node NName state to INVAL with reason:gres/gpu count reported lower than configured (0 < 1) So it is still not working, but at least I get a more helpful log message. Because I know that this /dev/tty trick works, I am still unsure where the current error lies, but I will try to investigate it further. I am thankful for any ideas in that regard. Best regards, Xaver On 19.07.23 10:23, Xaver Stiensmeier wrote: Alright, I tried a few more things, but I still wasn't able to get past: srun: error: Unable to allocate resources: Invalid generic resource (gres) specification. I should mention that the node I am trying to test GPU with, doesn't really have a gpu, but Rob was so kind to find out that you do not need a gpu as long as you just link to a file in /dev/ in the gres.conf. As mentioned: This is just for testing purposes - in the end we will run this on a node with a gpu, but it is not available at the moment. *The error isn't changing* If I omitt "GresTypes=gpu" and "Gres=gpu:1", I still get the same error. *Debug Info* I added the gpu debug flag and logged the following: [2023-07-18T14:59:45.026] restoring original state of nodes [2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu ignored [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change GresPlugins [2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu ignored [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change GresPlugins [2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure [2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions [2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set [2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set. [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed usec=5898 [2023-07-18T14:59:45.952] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2 I am a bit unsure what to do next to further investigate this issue. Best regards, Xaver On 17.07.23 15:57, Groner, Rob wrote: That would certainly do it. If you look at the slurmctld log when it comes up, it will say that it's marking that node as invalid because it has less (0) gres resources then you say it should have. That's because slurmd on that node will come up and say "What gres resources??" For testing purposes, you can just create a dummy file on the node, then in gres.conf, point to that file as the "graphics file" interface. As long as you don't try to actually use it as a graphics file, that should be enough for that node to think it has gres/gpu resources. 
That's what I do in my vagrant slurm cluster. Rob ---- *From:* slurm-users on behalf of Xaver Stiensmeier *Sent:* Monday, July 17, 2023 9:43 AM *To:* slurm-users@lists.schedmd.com *Subject:* Re: [slurm-users] GRES and GPUs Hi Hermann, Good idea, but we are already using `SelectType=select/cons_tres`. After setting everything up again (in case I made an unnoticed mistake), I saw that the node got marked STATE=inval. To be honest, I thought I can just claim that a node has a gpu even if it doesn't have one - just for testing purposes. Could this be the issue? Best regards, Xaver Stiensmeier On 17.07.23 14:11, Hermann Schwärzler wrote: > Hi Xaver, > > what kind of SelectType are you using in your slurm.conf? > > Per https://slurm.schedmd.com/gres.html you have to consider: > "As for the --gpu* option, these options are only supported by Slurm's > select/cons_tres plugin." > > So you can use "--gpus ..." only when you state > SelectType = select/cons_tres > in your slurm.conf. > > But "--gres=gpu:1" should work always. > > Regards > Hermann > > > On 7/17/23 13:43, Xaver Stiensmeier wrote: >> Hey, >> >> I am currently trying to understand how I can schedule a job that >> needs a GPU. >> >> I read about GRES https://slurm.schedmd.com/gres.html and tried to use: >> >> GresTypes=gpu >> NodeName=test Gres=gpu:1 >> >> But calling - after a 'sud
Re: [slurm-users] GRES and GPUs
Alright, I tried a few more things, but I still wasn't able to get past: srun: error: Unable to allocate resources: Invalid generic resource (gres) specification. I should mention that the node I am trying to test GPU with, doesn't really have a gpu, but Rob was so kind to find out that you do not need a gpu as long as you just link to a file in /dev/ in the gres.conf. As mentioned: This is just for testing purposes - in the end we will run this on a node with a gpu, but it is not available at the moment. *The error isn't changing* If I omitt "GresTypes=gpu" and "Gres=gpu:1", I still get the same error. *Debug Info* I added the gpu debug flag and logged the following: [2023-07-18T14:59:45.026] restoring original state of nodes [2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu ignored [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change GresPlugins [2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu ignored [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change GresPlugins [2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure [2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions [2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set [2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set. [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed usec=5898 [2023-07-18T14:59:45.952] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2 I am a bit unsure what to do next to further investigate this issue. Best regards, Xaver On 17.07.23 15:57, Groner, Rob wrote: That would certainly do it. If you look at the slurmctld log when it comes up, it will say that it's marking that node as invalid because it has less (0) gres resources then you say it should have. That's because slurmd on that node will come up and say "What gres resources??" For testing purposes, you can just create a dummy file on the node, then in gres.conf, point to that file as the "graphics file" interface. As long as you don't try to actually use it as a graphics file, that should be enough for that node to think it has gres/gpu resources. That's what I do in my vagrant slurm cluster. Rob ---- *From:* slurm-users on behalf of Xaver Stiensmeier *Sent:* Monday, July 17, 2023 9:43 AM *To:* slurm-users@lists.schedmd.com *Subject:* Re: [slurm-users] GRES and GPUs Hi Hermann, Good idea, but we are already using `SelectType=select/cons_tres`. After setting everything up again (in case I made an unnoticed mistake), I saw that the node got marked STATE=inval. To be honest, I thought I can just claim that a node has a gpu even if it doesn't have one - just for testing purposes. Could this be the issue? Best regards, Xaver Stiensmeier On 17.07.23 14:11, Hermann Schwärzler wrote: > Hi Xaver, > > what kind of SelectType are you using in your slurm.conf? 
> > Per https://slurm.schedmd.com/gres.html you have to consider: > "As for the --gpu* option, these options are only supported by Slurm's > select/cons_tres plugin." > > So you can use "--gpus ..." only when you state > SelectType = select/cons_tres > in your slurm.conf. > > But "--gres=gpu:1" should work always. > > Regards > Hermann > > > On 7/17/23 13:43, Xaver Stiensmeier wrote: >> Hey, >> >> I am currently trying to understand how I can schedule a job that >> needs a GPU. >> >> I read about GRES https://slurm.schedmd.com/gres.html and tried to use: >> >> GresTypes=gpu >> NodeName=test Gres=gpu:1 >> >> But calling - after a 'sud
Re: [slurm-users] GRES and GPUs
Hi Hermann, Good idea, but we are already using `SelectType=select/cons_tres`. After setting everything up again (in case I made an unnoticed mistake), I saw that the node got marked STATE=inval. To be honest, I thought I can just claim that a node has a gpu even if it doesn't have one - just for testing purposes. Could this be the issue? Best regards, Xaver Stiensmeier On 17.07.23 14:11, Hermann Schwärzler wrote: Hi Xaver, what kind of SelectType are you using in your slurm.conf? Per https://slurm.schedmd.com/gres.html you have to consider: "As for the --gpu* option, these options are only supported by Slurm's select/cons_tres plugin." So you can use "--gpus ..." only when you state SelectType = select/cons_tres in your slurm.conf. But "--gres=gpu:1" should work always. Regards Hermann On 7/17/23 13:43, Xaver Stiensmeier wrote: Hey, I am currently trying to understand how I can schedule a job that needs a GPU. I read about GRES https://slurm.schedmd.com/gres.html and tried to use: GresTypes=gpu NodeName=test Gres=gpu:1 But calling - after a 'sudo scontrol reconfigure': srun --gpus 1 hostname didn't work: srun: error: Unable to allocate resources: Invalid generic resource (gres) specification so I read more https://slurm.schedmd.com/gres.conf.html but that didn't really help me. I am rather confused. GRES claims to be generic resources but then it comes with three defined resources (GPU, MPS, MIG) and using one of those didn't work in my case. Obviously, I am misunderstanding something, but I am unsure where to look. Best regards, Xaver Stiensmeier
[slurm-users] GRES and GPUs
Hey, I am currently trying to understand how I can schedule a job that needs a GPU. I read about GRES https://slurm.schedmd.com/gres.html and tried to use: GresTypes=gpu NodeName=test Gres=gpu:1 But calling - after a 'sudo scontrol reconfigure': srun --gpus 1 hostname didn't work: srun: error: Unable to allocate resources: Invalid generic resource (gres) specification so I read more https://slurm.schedmd.com/gres.conf.html but that didn't really help me. I am rather confused. GRES claims to be generic resources but then it comes with three defined resources (GPU, MPS, MIG) and using one of those didn't work in my case. Obviously, I am misunderstanding something, but I am unsure where to look. Best regards, Xaver Stiensmeier
[slurm-users] Prevent CLOUD node from being shutdown after startup
Dear slurm-users,

I am currently looking into options for disabling automatic suspending of nodes. I am interested in both the general case - all nodes can be powered up, but no node is suspended automatically, only when power down is triggered manually - and the special case - all nodes can be powered up, but some nodes are never suspended automatically, only when power down is triggered manually.

I tried using negative times for SuspendTime, but that didn't seem to work, as no nodes are powered up then.

Best regards,
Xaver Stiensmeier
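In case it helps, the knob I would look at first is SuspendExcNodes (and SuspendExcParts), which keeps power saving enabled in general while exempting selected nodes from automatic suspension; manual power-down via scontrol should still work for them. A sketch with made-up node names:

```
# slurm.conf sketch
SuspendTime=300
SuspendExcNodes=worker-keepalive-[0-1]   # never suspended automatically
# manual power down of an exempted node is still possible:
# scontrol update NodeName=worker-keepalive-0 State=POWER_DOWN
```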
[slurm-users] Submit sbatch to multiple partitions
Dear slurm-users list, let's say I want to submit a large batch job that should run on 8 nodes. I have two partitions, each holding 4 nodes. Slurm will now tell me that "Requested node configuration is not available". However, my desired output would be that slurm makes use of both partitions and allocates all 8 nodes. Best regards, Xaver Stiensmeier
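As far as I know, a single job can be *submitted* to several partitions but will always run entirely inside one of them, so spanning both 4-node partitions with one 8-node job needs a partition that contains all 8 nodes. Both variants sketched below (partition and node names are examples):

```
# submit to whichever partition can start the job first (still at most 4 nodes here)
sbatch --partition=partition1,partition2 --nodes=4 job.sh

# to run one 8-node job, define an umbrella partition in slurm.conf:
# PartitionName=all Nodes=worker[1-8] MaxTime=INFINITE State=UP
```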
Re: [slurm-users] Multiple default partitions
I found a solution that works for me, but it doesn't really answer the question: it's the all_partitions option (https://slurm.schedmd.com/slurm.conf.html#OPT_all_partitions) for JobSubmitPlugins. It works for me because all partitions are default in my case, but it doesn't *really* answer my question, which was how to have multiple default partitions - which could include having other partitions that are not default.

Best regards,
Xaver Stiensmeier

On 17.04.23 11:12, Xaver Stiensmeier wrote:
Dear slurm-users list,

is it possible to somehow have two default partitions? In the best case in a way that Slurm schedules to partition1 by default and only to partition2 when partition1 can't handle the job right now.

Best regards,
Xaver Stiensmeier
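For reference, the relevant pieces in slurm.conf would look roughly like this (partition and node names are examples); all_partitions makes every submission consider all accessible partitions, while Default=YES still marks the single "classic" default:

```
# slurm.conf sketch
JobSubmitPlugins=all_partitions
PartitionName=partition1 Nodes=worker[1-4] Default=YES MaxTime=INFINITE State=UP
PartitionName=partition2 Nodes=worker[5-8] MaxTime=INFINITE State=UP
```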
[slurm-users] Multiple default partitions
Dear slurm-users list,

is it possible to somehow have two default partitions? In the best case in a way that Slurm schedules to partition1 by default and only to partition2 when partition1 can't handle the job right now.

Best regards,
Xaver Stiensmeier
[slurm-users] Evaluation: How collect data regarding slurms cloud scheduling performance?
Dear slurm-user list,

I am currently investigating ways to evaluate Slurm's cloud scheduling performance. As we are all aware, there are many knobs to turn when it comes to cloud scheduling: the regular scheduling (prioritizing, ...), power-up and power-down times, and probably a lot more. However, my question today is not about improving cloud scheduling performance, but about how to collect data such as:

- When were nodes powered up or down?
- To what degree were the powered-up machines used?
- Were the "right" instances started for the given jobs, or were larger instances started than needed?
- ...

I know that this question is currently very open, but I am still trying to narrow down where I have to look. The final goal is of course to use this evaluation to pick better timeout values and improve cloud scheduling.

Best regards,
Xaver Stiensmeier
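One starting point that works with data Slurm already records: pull per-job timing and node usage from the accounting database with sacct; the gap between Submit/Eligible and Start is a rough proxy for power-up cost, and power events show up in slurmctld.log when DebugFlags=Power is set. The date range below is just an example:

```
sacct -a -X -S 2024-01-01 -E now \
      --format=JobID,Submit,Eligible,Start,End,Elapsed,NNodes,NodeList,State \
      --parsable2 > jobs.csv
```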
[slurm-users] Request nodes with a custom resource?
Dear slurm-user list,

how would you implement a custom resource requirement? For example, you have a group of nodes with direct access to a large database, so it would be best to run jobs regarding that database on those nodes. How would you schedule a job (let's say using srun) so that it lands on these nodes? Of course this would also be interesting in a dynamic case (assuming the database is downloaded to nodes during job execution), but for now I would be happy with a static solution, where it is okay to mark the nodes beforehand with something like "hasDatabase=1". So I am basically looking for custom requirements.

Best regards,
Xaver Stiensmeier
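The usual static answer, as far as I know, is node Features combined with --constraint; a sketch with made-up node names:

```
# slurm.conf: tag the nodes that can reach the database
NodeName=dbnode[1-2] CPUs=8 RealMemory=8000 Features=hasDatabase State=UNKNOWN

# job submission: request the feature
# srun --constraint=hasDatabase ./run_database_job.sh
```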
[slurm-users] How to set default partition in slurm configuration
Dear slurm-user list,

I am aware that this question sounds very simple and should be resolved by just taking one look at the documentation, but for some reason I am unable to find where I can set the default partition in the Slurm configuration. It feels to me as if the last partition defined simply becomes the default automatically, but I might be wrong in that assessment. However, I would prefer setting a key instead of fiddling around with the position of the partition definitions. I only found reference to the "default partition" in `JobSubmitPlugins`, and this might be the solution, but this is something so basic that it probably shouldn't need a plugin, so I am unsure. Can anyone point me towards how the default partition is set?

Best regards,
Xaver Stiensmeier
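For reference, the explicit key is the Default flag on the partition line (names below are examples):

```
# slurm.conf sketch: jobs submitted without -p/--partition go to "main"
PartitionName=main  Nodes=worker[1-4] Default=YES MaxTime=INFINITE State=UP
PartitionName=extra Nodes=worker[5-8] MaxTime=INFINITE State=UP
```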
[slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system
Hello slurm-users,

The question can be found in a similar fashion here: https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system

Issue

Current behavior and problem description: When a node fails to POWER_UP, it is marked DOWN. While this is a great idea in general, it is not useful when working with CLOUD nodes, because such a CLOUD node is likely to be started on a different machine next time and therefore to POWER_UP without issues. But since the node is marked DOWN, that cloud resource is no longer used and is never started again until freed manually.

Wanted behavior: Ideally Slurm would not mark the node as DOWN, but just attempt to start another. If that's not possible, automatically resuming DOWN nodes would also be an option.

Question

How can I prevent Slurm from marking nodes that fail to POWER_UP as DOWN, or make Slurm restore DOWN nodes automatically, so that Slurm does not forget cloud resources?

Attempts and Thoughts

ReturnToService: I tried solving this using ReturnToService (https://slurm.schedmd.com/slurm.conf.html#OPT_ReturnToService), but that didn't seem to solve my issue, since, if I understand it correctly, it only applies to nodes starting up by themselves or manually, and they are not considered when scheduling jobs until they have been started.

SlurmctldParameters=idle_on_node_suspend: While this is great and definitely helpful, it doesn't solve the issue at hand, since a node that failed during power up is not suspended.

ResumeFailProgram: I considered using ResumeFailProgram (https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeFailProgram), but it sounds odd that you have to write yourself a script for returning your nodes to service if they fail on startup. This case sounds too common not to be handled by Slurm itself. However, this will be my next attempt: implement a script that calls, for every given node,

sudo scontrol update NodeName=$NODE_NAME state=RESUME reason=FailedShutdown

Additional Information

In the POWER_UP script I terminate the server if the setup fails for any reason and return an exit code unequal to 0. In our cloud scheduling (https://slurm.schedmd.com/elastic_computing.html) instances are created once they are needed and deleted once they are no longer needed. This means that Slurm stores that a node is DOWN while no real instance behind it exists anymore. If that node weren't marked DOWN and a job were scheduled towards it at a later time, it would simply start an instance and run on that new instance. I am just stating this to be maximally explicit.

Best regards,
Xaver Stiensmeier

PS: This is the first time I use the slurm-users list and I hope I am not violating any rules with this question. Please let me know if I do.
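Following up on the ResumeFailProgram idea above, a minimal sketch of such a script (assuming the failed nodes are passed as a hostlist expression in $1, as for ResumeProgram):

```
#!/bin/bash
# ResumeFailProgram sketch: put nodes whose power-up failed back into service
# so the scheduler can try a fresh cloud instance instead of leaving them DOWN
for node in $(scontrol show hostnames "$1"); do
    scontrol update NodeName="$node" State=RESUME
done
```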