[slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system

2022-11-23 Thread Xaver Stiensmeier
e maximum explicit. Best regards, Xaver Stiensmeier PS: This is the first time I am using the slurm-users list and I hope I am not violating any rules with this question. Please let me know if I do.

[slurm-users] How to set default partition in slurm configuration

2023-01-25 Thread Xaver Stiensmeier
ition" in `JobSubmitPlugins` and this might be the solution. However, I think this is something so basic that it probably shouldn't need a plugin so I am unsure. Can anyone point me towards how setting the default partition is done? Best regards, Xaver Stiensmeier

[slurm-users] Evaluation: How collect data regarding slurms cloud scheduling performance?

2023-02-28 Thread Xaver Stiensmeier
Were larger instances started than needed? ... I know that this question is currently very open, but I am still trying to narrow down where I have to look. The final goal is of course to use this evaluation to pick better timeout values and improve cloud scheduling. Best regards, Xaver Stiensmeier
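One modest way to start collecting such numbers (a sketch, not a full evaluation pipeline) is Slurm's accounting database: the gap between a job's Submit and Start times on cloud nodes approximates the power-up overhead, and oversized allocations show up when comparing requested resources against what jobs actually used. The date and format fields below are examples only:

```
# submit/start times and node lists for all allocations since a given date
sacct -X --allusers --starttime=2023-02-01 \
      --format=JobID,Partition,Submit,Start,Elapsed,State,NodeList,AllocCPUS
```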

[slurm-users] Multiple default partitions

2023-04-17 Thread Xaver Stiensmeier
Dear slurm-users list, is it possible to somehow have two default partitions? In the best case in a way that Slurm schedules to partition1 by default and only to partition2 when partition1 can't handle the job right now. Best regards, Xaver Stiensmeier
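As far as I know Slurm only allows a single `Default=YES` partition, but the fallback behaviour can be approximated by letting jobs name both partitions (see the "Submit sbatch to multiple partitions" thread below) and biasing the preferred one with `PriorityTier`. A hedged slurm.conf sketch with placeholder names:

```
# slurm.conf
PartitionName=partition1 Nodes=node[1-4] Default=YES PriorityTier=10 State=UP
PartitionName=partition2 Nodes=node[5-8]             PriorityTier=1  State=UP
```

Pending jobs in the higher-tier partition are dispatched first, so a job submitted to both partitions tends to run in partition1 unless it cannot start there.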

Re: [slurm-users] Multiple default partitions

2023-04-17 Thread Xaver Stiensmeier
question asks how to have multiple default partitions which could include having others that are not default. Best regards, Xaver Stiensmeier On 17.04.23 11:12, Xaver Stiensmeier wrote: Dear slurm-users list, is it possible to somehow have two default partitions? In the best case in a way

[slurm-users] Submit sbatch to multiple partitions

2023-04-17 Thread Xaver Stiensmeier
partitions and allocates all 8 nodes. Best regards, Xaver Stiensmeier
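For completeness: a single submission may name several partitions, and Slurm starts the job in whichever listed partition offers the earliest start. Partition and script names below are placeholders:

```
# comma-separated partition list on the command line
sbatch --partition=partition1,partition2 job.sh

# or inside the batch script
#SBATCH --partition=partition1,partition2
```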

[slurm-users] Request nodes with a custom resource?

2023-02-05 Thread Xaver Stiensmeier
. So I am basically looking for custom requirements. Best regards, Xaver Stiensmeier
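Custom, non-countable requirements are usually expressed as node Features plus a job constraint rather than as GRES; a small sketch with made-up feature and node names:

```
# slurm.conf: tag nodes with arbitrary feature strings
NodeName=worker[1-2] Features=highmem,nvme
NodeName=worker[3-4] Features=standard

# job side: only nodes carrying the feature are eligible
sbatch --constraint=highmem job.sh
```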

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier
Hi Hermann, Good idea, but we are already using `SelectType=select/cons_tres`. After setting everything up ag

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier
her. I am thankful for any ideas in that regard. Best regards, Xaver On 19.07.23 10:23, Xaver Stiensmeier wrote: Alright, I tried a few more things, but I still wasn't able to get past: srun: error: Unable to allocate resources: Invalid generic resource (gres) specification. I should ment

[slurm-users] GRES and GPUs

2023-07-17 Thread Xaver Stiensmeier
) and using one of those didn't work in my case. Obviously, I am misunderstanding something, but I am unsure where to look. Best regards, Xaver Stiensmeier

Re: [slurm-users] GRES and GPUs

2023-07-17 Thread Xaver Stiensmeier
for testing purposes. Could this be the issue? Best regards, Xaver Stiensmeier On 17.07.23 14:11, Hermann Schwärzler wrote: Hi Xaver, what kind of SelectType are you using in your slurm.conf? Per https://slurm.schedmd.com/gres.html you have to consider: "As for the --gpu* option, these op

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier
the "Count=..." part in gres.conf It should read NodeName=NName Name=gpu File=/dev/tty0 Count=1 in your case. Regards, Hermann On 7/19/23 14:19, Xaver Stiensmeier wrote: Okay, thanks to S. Zhang I was able to figure out why nothing changed. While I did restart systemctld at the begi

Re: [slurm-users] GRES and GPUs

2023-07-20 Thread Xaver Stiensmeier
art slurmd* # master run without any issues afterwards. Thank you for all your help! Best regards, Xaver On 19.07.23 17:05, Xaver Stiensmeier wrote: Hi Hermann, count doesn't make a difference, but I noticed that when I reconfigure slurm and do reloads afterwards, the error "gpu c

[slurm-users] Prevent CLOUD node from being shutdown after startup

2023-05-12 Thread Xaver Stiensmeier
: Allowing all nodes to be powered up, but without automatic suspension for some nodes, except when power-down is triggered manually. --- I tried using negative times for SuspendTime, but that didn't seem to work, as no nodes are powered up then. Best regards, Xaver Stiensmeier
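Negative SuspendTime values are not a documented mechanism; the usual way to exempt individual nodes from automatic suspension while keeping manual power-down available is `SuspendExcNodes` (or `SuspendExcParts` for whole partitions). A sketch with placeholder names:

```
# slurm.conf
SuspendTime=300
SuspendExcNodes=worker[1-2]     # never suspended automatically
SuspendExcParts=keepalive       # or exempt a whole partition

# manual power-down still works
scontrol update NodeName=worker1 State=POWER_DOWN
```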

Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Xaver Stiensmeier
. You can run 'df -h' and see some info that would get you started. Brian Andrus On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote: Dear slurm-user list, during a larger cluster run (the same I mentioned earlier 242 nodes), I got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is

[slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
Maybe someone has a great idea of how to tackle this problem. Best regards, Xaver Stiensmeier

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
: Hi Xavier, On 12/6/23 09:28, Xaver Stiensmeier wrote: using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in `slurm_update error: Invalid node state specified` when we called: `scontrol update NodeName="$1" state=
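For context, the failing call presumably sits in a resume/fail helper script and looks roughly like this (a sketch; the reason text is made up, and the thread later confirms that supplying a reason normally works fine):

```
# mark a freshly started cloud node as available again
scontrol update NodeName="$1" State=RESUME Reason="cloud instance started"
```

The error message suggests the node was not, at that moment, in a state from which RESUME is an allowed transition.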

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
matter for your power saving experience.  Do you run an updated version? /Ole On 12/6/23 10:54, Xaver Stiensmeier wrote: Hi Ole, I will double check, but I am very sure that giving a reason is possible as it has been done at least 20 other times without error during that exact run. It might

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
or not, but it's worth a try. Best regards, Xaver On 06.12.23 12:03, Ole Holm Nielsen wrote: On 12/6/23 11:51, Xaver Stiensmeier wrote: Good idea. Here's our current version: `sinfo -V` reports `slurm 22.05.7`. Quick googling told me that the latest version is 23.11. Does the upgrade change anything

[slurm-users] SlurmdSpoolDir full

2023-12-08 Thread Xaver Stiensmeier
hat Slurmd is placing in this dir that fills up the space. Do you have any ideas? Due to the workflow used, we have a hard time reconstructing the exact scenario that caused this error. I guess the "fix" is to just pick a somewhat larger disk, but I am unsure whether Slurm behaves normally here.
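For whoever hits this later: SlurmdSpoolDir is a slurm.conf setting (default shown below), and its usage can be inspected directly on the affected node. Paths are the defaults and may differ in your setup:

```
# slurm.conf (default value)
SlurmdSpoolDir=/var/spool/slurmd

# on the node: how full is the filesystem holding the spool dir, and what is in it?
df -h /var/spool/slurmd
du -sh /var/spool/slurmd/*
```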

[slurm-users] Re: Errors upgrading to 23.11.0 -- jwt-secret.key

2024-02-08 Thread Xaver Stiensmeier via slurm-users
Thank you for your response. I have found out why there was no error in the log: I had been looking at the wrong log. The error didn't occur on the master, but on our vpn-gateway (it is a hybrid cloud setup) - but you can think of it as just another worker in the same network. The error I

[slurm-users] Slurm Power Saving Guide: Why doesn't Slurm mark the node as failed when ResumeProgram returns != 0

2024-02-19 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I had cases where our ResumeProgram failed due to temporary cloud timeouts. In that case the ResumeProgram returns a non-zero value. Why does Slurm still wait until ResumeTimeout instead of just accepting the startup as failed, which then should lead to a rescheduling of the
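As far as I understand the power-save mechanism, the exit code of ResumeProgram is not what marks the boot as failed; a common pattern is for the resume script itself to flag nodes it could not start so Slurm does not sit out the full ResumeTimeout. A rough sketch (the provisioning helper is hypothetical; whether DOWN or POWER_DOWN is the better target state depends on your setup):

```
#!/bin/bash
# ResumeProgram sketch: $1 is a hostlist expression such as "worker[1-8]"
for host in $(scontrol show hostnames "$1"); do
    if ! start_cloud_instance "$host"; then       # hypothetical cloud call
        scontrol update NodeName="$host" State=DOWN Reason="cloud start failed"
    fi
done
```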

[slurm-users] Errors upgrading to 23.11.0

2024-02-07 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I got this error: Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code. See "systemctl status slurmctld.service" and "journalctl -xeu slurmctld.service" for details. but in slurmctld.service I see

[slurm-users] Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-23 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I have a cloud node that is powered up and down on demand. Rarely it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 in order to avoid the node being marked down, because the instance behind that node is
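When a node gets stuck with the NOT_RESPONDING flag, one thing worth trying (an educated guess, not a documented cure) is to force it back through the powered-down path so it is clean for the next resume cycle. The node name is a placeholder:

```
# inspect the current flags
scontrol show node worker1 | grep -i state

# force the node back to the powered-down state
scontrol update NodeName=worker1 State=POWER_DOWN_FORCE
```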

[slurm-users] Slurm.conf and workers

2024-04-15 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, as far as I understood it, the slurm.conf needs to be present on the master and on the workers at slurm.conf (if no other path is set via SLURM_CONF). However, I noticed that when adding a partition only in the master's slurm.conf, all workers were able to "correctly" show
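If keeping slurm.conf in sync on every worker is the underlying concern, Slurm's "configless" mode may help: slurmctld serves the configuration and slurmd fetches it at startup, so only the master copy has to be edited. A sketch (host name is a placeholder; 6817 is the default slurmctld port):

```
# slurm.conf on the controller
SlurmctldParameters=enable_configless

# slurmd on a worker, started without a local slurm.conf
slurmd --conf-server master-host:6817
```

As for the observation itself: client commands such as sinfo and squeue query slurmctld, so they reflect the master's configuration regardless of what a worker's local slurm.conf contains, which would explain why the new partition showed up "correctly".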

[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-09 Thread Xaver Stiensmeier via slurm-users
hot-fix/updates from the base image or changes. By running it from the node, it would alleviate any CPU spikes on the Slurm head node. Just a possible path to look at. Brian Andrus On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote: Dear slurm user list, we make use of elastic cl

[slurm-users] Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-08 Thread Xaver Stiensmeier via slurm-users
Dear slurm user list, we make use of elastic cloud computing, i.e. node instances are created on demand and destroyed when they have not been used for a certain amount of time. Created instances are set up via Ansible. If more than one instance is requested at the exact same time, Slurm will pass
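Note that when several nodes are resumed in the same scheduling cycle, Slurm already hands ResumeProgram a single hostlist expression, so some batching can happen inside the script itself; the rate of power-up calls can additionally be capped with ResumeRate in slurm.conf. A rough sketch (the batched provisioner is hypothetical):

```
#!/bin/bash
# ResumeProgram sketch: $1 arrives as e.g. "worker[1-8]"
hosts=$(scontrol show hostnames "$1")
# one batched cloud/Ansible call instead of one call per node
start_instances_batch $hosts        # hypothetical helper
```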

[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-29 Thread Xaver Stiensmeier via slurm-users
to handle "NOT_RESPONDING". I would really like to improve my question if necessary. Best regards, Xaver On 23.02.24 18:55, Xaver Stiensmeier wrote: Dear slurm-user list, I have a cloud node that is powered up and down on demand. Rarely it can happen that slurm's resumeTimeout