Re: [slurm-users] [External] How to exclude nodes in sbatch/srun?

2020-06-22 Thread Florian Zillner
Durai, to overcome this, we use noXXX features like below. Users can then request “8268&noGPU&EDR” to select nodes with 8268s on EDR without GPUs, for example.
# scontrol show node node5000 | grep AvailableFeatures
AvailableFeatures=192GB,2933MHz,SD530,Platinum,8268,rack25,EDR,sb7890_0416,enc2
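For reference, a job would then request such a combination via the constraint option (job script name is illustrative):
$ sbatch --constraint="8268&noGPU&EDR" job.sh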

Re: [slurm-users] [External] [slurm 20.02.3] don't suspend nodes in down state

2020-08-26 Thread Florian Zillner
Hi Herbert, just like Angelos described, we also have logic in our poweroff script that checks whether the node is really IDLE and only sends the poweroff command if that's the case. Excerpt:
hosts=$(scontrol show hostnames $1)
for host in $hosts; do scontrol show node $host | tr ' ' '\n' |
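A minimal sketch of how that check could be completed (the State parsing and the IDLE test are my reconstruction of the truncated excerpt, not the original script):
#!/bin/bash
# $1 is the hostlist Slurm hands to the SuspendProgram
hosts=$(scontrol show hostnames "$1")
for host in $hosts; do
    state=$(scontrol show node "$host" | tr ' ' '\n' | grep -oP 'State=\K\S+')
    if [[ "$state" == IDLE* ]]; then
        echo "powering off $host"   # replace with the real poweroff command
    fi
done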

Re: [slurm-users] [External] Re: Cluster nodes on multiple cluster networks

2021-01-23 Thread Florian Zillner
Chiming in on Michael's suggestion. You can specify the same hostname in slurm.conf, but for the on-premise nodes you either set the DNS or the /etc/hosts entry to the local (= private) IP address. For the cloud nodes you set DNS or the hosts entry to the publicly reachable IP. Example /etc/hosts entries:
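(Illustrative addresses and hostname, not from the original mail; the point is that the same name resolves differently on each side.)
# /etc/hosts on the on-premise network
10.0.0.15      node15
# /etc/hosts as seen by the cloud nodes
203.0.113.15   node15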

Re: [slurm-users] [External] Autoset job TimeLimit to fit in a reservation

2021-03-29 Thread Florian Zillner
Hi, well, I think you're putting the cart before the horse, but anyway, you could write a script that extracts the next reservation and does some simple math to display the remaining time in hours (or another unit) to the user. It's the user's job to set the time their job needs to finish. Auto-squeezing a job that
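A rough sketch of such a helper (the parsing and the GNU date arithmetic are my assumptions, not from the original mail):
next=$(scontrol show reservation | grep -oP 'StartTime=\K\S+' | sort | head -1)
echo "Next reservation starts in $(( ($(date -d "$next" +%s) - $(date +%s)) / 3600 )) hours"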

Re: [slurm-users] [External] Slurm Configuration assistance: Unable to use srun after installation (slurm on fedora 33)

2021-04-19 Thread Florian Zillner
Hi Johnsy, 1. Do you have an active support contract with SchedMD? AFAIK they only offer paid support. 2. The error message is pretty straightforward: slurmctld is not running. Did you start it (systemctl start slurmctld)? 3. slurmd needs to run on the node(s) you want to run on as well.
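The usual first checks, assuming the stock systemd unit names:
systemctl status slurmctld   # on the controller
systemctl status slurmd      # on each compute node
scontrol ping                # confirms the controller answers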

Re: [slurm-users] [External] Re: Testing Lua job submit plugins

2021-05-06 Thread Florian Zillner
I've used that approach too. If the submitting user ID is mine, then do this or that; all others take the else clause. That way, you can actually run on the production system without having to replicate the whole environment in a sandbox. Certainly not the cleanest approach, but it doesn't hurt

Re: [slurm-users] [External] jobs stuck in "CG" state

2021-08-20 Thread Florian Zillner
Hi, scancel the job, then set the nodes to a "down" state like so: "scontrol update nodename=<node> state=down reason=cg" and resume them afterwards. However, if there are tasks stuck, then in most cases a reboot is needed to bring the node back in a clean state. Best, Florian
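As a worked example (job ID and node name are hypothetical):
scancel 12345
scontrol update nodename=node01 state=down reason=cg
scontrol update nodename=node01 state=resume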

Re: [slurm-users] [External] Sinfo or squeue stuck for some seconds

2021-08-30 Thread Florian Zillner
Hi Navin, could it be that you're using LDAP/AD/NIS for user management? If so, check whether the LDAP server's response is slow or gets slowed down when retrieving hundreds or thousands of users. Also, CacheGroups=1 was last supported in v15.08. Best, Florian
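A quick, generic way to test that hypothesis (my suggestion, not from the original thread):
time getent passwd > /dev/null   # a slow run here points at the directory service, not Slurm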

Re: [slurm-users] [External] Node utilization for 24 hours

2021-09-07 Thread Florian Zillner
Hi, you can run sreport like this:
sreport cluster AccountUtilizationByUser Start=$(date -d "last month" +%D) End=$(date -d "this month" +%D)
or
sreport cluster Utilization Start=$(date -d "last month" +%D) End=$(date -d "this month" +%D)
and script something around it to show what you're looking for.
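A minimal wrapper along those lines (the percent output format is my choice, not from the original mail):
#!/bin/bash
start=$(date -d "last month" +%D)
end=$(date -d "this month" +%D)
sreport -t percent cluster Utilization Start=$start End=$end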

Re: [slurm-users] [External] How can I do to prevent a specific job from being prempted?

2021-09-14 Thread Florian Zillner
See the no-requeue option for sbatch: --no-requeue Specifies that the batch job should never be requeued under any circumstances. Setting this option will prevent system administrators from being able to restart the job (for example, after a scheduled downtime), recover from a node failure, or be requeued upon preemption by a higher priority job.
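In a job script this would look like (script name illustrative):
#SBATCH --no-requeue
or on the command line: sbatch --no-requeue job.sh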

Re: [slurm-users] [External] Re: Request nodes with a custom resource?

2023-02-06 Thread Florian Zillner
+1 for features. Features can also be added/changed at runtime like this: "scontrol update Node=$(hostname -s) AvailableFeatures=$FEAT ActiveFeatures=$FEAT". Cheers, Florian

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-06 Thread Florian Zillner
Hi, follow this guide: https://slurm.schedmd.com/power_save.html Create poweroff/poweron scripts and configure Slurm to do the poweroff after X minutes. Works well for us. Make sure to set an appropriate time (ResumeTimeout) to allow the node to come back into service. Note that we did not achieve
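The relevant slurm.conf pieces look roughly like this (paths and timings are placeholders, not production values):
SuspendProgram=/etc/slurm/poweroff.sh
ResumeProgram=/etc/slurm/poweron.sh
SuspendTime=1800     # power off after 30 minutes idle
ResumeTimeout=600    # allow 10 minutes for a node to boot and register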

[slurm-users] xcpuinfo_abs_to_mac: failed // cgroups v1 problem

2023-02-09 Thread Florian Zillner
Hi, I'm experiencing a strange issue related to a CPU swap (8352Y -> 6326) on two of our nodes. I adapted slurm.conf to accommodate the new CPU:
NodeName=ice27[57-58] CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257550 MemSpecLimit=12000
which is also what slurmd -C reports.
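To cross-check the conf line against what the node actually detects, run this on the node itself:
slurmd -C   # prints the NodeName line slurmd derives from the hardware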

Re: [slurm-users] [External] Re: actual time of start (or finish) of a job

2023-02-20 Thread Florian Zillner
Hi, note that times reported by sacct may differ from the net times. For example, imagine a test job like this:
date
sleep 1m
date
sacct reports:
$ sacct -j 225145 -X -o jobid,start,end
JobID Start End
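To see the scheduler's wall clock next to the job's own timestamps, the Elapsed field helps (same job ID as above):
sacct -j 225145 -X -o jobid,start,end,elapsed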

[slurm-users] Upgrading SLURM from 17.02.7 to 18.08.8 - Job ID gets reset

2019-10-18 Thread Florian Zillner
Hi all, we're using OpenHPC packages to run SLURM. Current OpenHPC Version is 1.3.8 (SLURM 18.08.8), though we're still at 1.3.3 (SLURM 17.02.7), for now. I've successfully attempted an upgrade in a separate testing environment, which works fine once you adhere to the upgrading notes... So the

Re: [slurm-users] [External] Re: Upgrading SLURM from 17.02.7 to 18.08.8 - Job ID gets reset

2019-10-18 Thread Florian Zillner
You can use the FirstJobId option from slurm.conf to continue the JobIds seamlessly. Kind Regards, Lech
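For example, if the old cluster's last job ID was around 50000, the slurm.conf entry would be (value is illustrative):
FirstJobId=50001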

Re: [slurm-users] [External] Re: Filter slurm e-mail notification

2019-11-26 Thread Florian Zillner
Hi, I guess you could use a Lua job_submit script to filter out flags you don't want. I haven't tried it with mail flags, but I'm using a script like the one referenced to enforce accounts/time limits, etc. https://funinit.wordpress.com/2018/06/07/how-to-use-job_submit_lua-with-slurm/ Cheers, Florian

[slurm-users] Node suspend / Power saving - for *idle* nodes only?

2020-05-14 Thread Florian Zillner
Hi, I'm experimenting with Slurm's power saving feature. Shutting down "idle" nodes works in general, and powering up works when "idle~" nodes are requested. So far so good, but Slurm is also shutting down nodes that are not explicitly "idle". Previously I drained a node to debug something on it
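If specific nodes should never be powered down at all, e.g. while debugging on them, slurm.conf also offers an exclusion list (node names illustrative; whether it fits this case is a judgment call):
SuspendExcNodes=node[01-02]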

Re: [slurm-users] [External] Re: Node suspend / Power saving - for *idle* nodes only?

2020-05-14 Thread Florian Zillner
From: slurm-users on behalf of Steffen Grunewald, Sent: Thursday, 14 May 2020 15:34. On Thu, 2020-05-14 at 13:10:04 +0000, Florian Zillner wrote: > Hi, > > I'm experimenting w

Re: [slurm-users] [External] Re: Node suspend / Power saving - for *idle* nodes only?

2020-05-15 Thread Florian Zillner
From: slurm-users on behalf of Florian Zillner, Sent: Thursday, 14 May 2020 15:43. Well, the documentation is rather clear on

Re: [slurm-users] [External] How to detect Job submission by srun / interactive jobs

2020-05-18 Thread Florian Zillner
Hi Stephan, from the scontrol docs: --- BatchFlag: Jobs submitted using the sbatch command have BatchFlag set to 1. Jobs submitted using other commands have BatchFlag set to 0. --- You can look that up, e.g. with scontrol show job <jobid>. I haven't checked, though, how to access that via Lua. If you know
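For a quick check from the shell (job ID is hypothetical):
scontrol show job 12345 | grep -o 'BatchFlag=[0-9]'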