Durai,
To overcome this, we use noXXX features, as shown below. Users can then request
“8268&noGPU&EDR” to select nodes with 8268s on EDR without GPUs, for example.
# scontrol show node node5000 |grep AvailableFeatures
AvailableFeatures=192GB,2933MHz,SD530,Platinum,8268,rack25,EDR,sb7890_0416,enc2
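For completeness, a user-side sketch of selecting such nodes at submit time (the job script name is just a placeholder):
$ sbatch --constraint="8268&noGPU&EDR" job.sh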
Hi Herbert,
just like Angelos described, we also have logic in our poweroff script that
checks if the node is really IDLE and only sends the poweroff command if that's
the case.
Excerpt:
hosts=$(scontrol show hostnames $1)
for host in $hosts; do
scontrol show node $host | tr ' ' '\n' |
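The excerpt is cut off above; a minimal sketch of how such a check could continue (the grep/case logic and the final power-off step are assumptions, not the original script):
hosts=$(scontrol show hostnames "$1")
for host in $hosts; do
    # isolate the State=... field, as the excerpt above starts to do
    state=$(scontrol show node "$host" | tr ' ' '\n' | grep '^State=' | cut -d= -f2)
    case "$state" in
        IDLE*)
            echo "powering off $host"
            # the actual power-off command goes here (ipmitool, cloud API, ...)
            ;;
        *)
            echo "skipping $host, state is $state"
            ;;
    esac
done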
Chiming in on Michael's suggestion.
You can specify the same hostname in slurm.conf, but for the on-premises nodes
you set either the DNS or the /etc/hosts entry to the local (= private) IP
address.
For the cloud nodes you set DNS or the hosts entry to the publicly reachable IP.
example /etc/h
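A sketch of what that could look like, with made-up names and addresses (here the controller, but the same trick works for node names):
# /etc/hosts on the on-premises nodes
10.0.0.10      slurmctl
# /etc/hosts on the cloud nodes
203.0.113.10   slurmctl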
Hi,
well, I think you're putting the cart before the horse, but anyway: you could
write a script that extracts the next reservation and does some simple math to
display the remaining time (in hours or whatever) to the user. It's the user's
job to set the time their job needs to finish. Auto-squeezing a job that
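A sketch of such a script, assuming GNU date and that scontrol show reservation lists the upcoming reservations (the StartTime= field name comes from its output):
#!/bin/bash
# print the number of hours until the earliest upcoming reservation starts
next=$(scontrol show reservation | grep -o 'StartTime=[^ ]*' | cut -d= -f2 | sort | head -1)
if [ -z "$next" ]; then
    echo "no reservations defined"
    exit 0
fi
now=$(date +%s)
start=$(date -d "$next" +%s)
echo "next reservation starts in $(( (start - now) / 3600 )) hours"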
Hi Johnsy,
1. Do you have an active support contract with SchedMD? AFAIK they only
offer paid support.
2. The error message is pretty straightforward: slurmctld is not running.
Did you start it (systemctl start slurmctld)? See the quick checks below.
3. slurmd needs to run on the node(s) you want to run on as well.
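For points 2 and 3, a few quick checks, assuming a systemd-based installation:
# on the controller
systemctl enable --now slurmctld
scontrol ping                  # verifies the client can reach slurmctld
# on every compute node
systemctl enable --now slurmd
sinfo                          # nodes should not stay in unknown/down* state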
I've used that approach too.
If the submitting user ID is mine, then do this or that; all others take the
else clause. That way, you can actually run on the production system without
having to replicate the whole environment in a sandbox. Certainly not the
cleanest approach, but it doesn't hurt
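As a rough illustration of that pattern, a hypothetical submit wrapper (the user name and both branches are made up):
#!/bin/bash
if [ "$USER" = "testadmin" ]; then
    # experimental path: only affects me
    exec sbatch --partition=debug "$@"
else
    # everyone else gets the unchanged behaviour
    exec sbatch "$@"
fi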
Hi,
scancel the job, then set the nodes to a "down" state like so: "scontrol update
nodename=<nodename> state=down reason=cg" and resume them afterwards.
However, if there are tasks stuck, then in most cases a reboot is needed to
bring the node back in a clean state.
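Spelled out with placeholder IDs and names, the sequence would be:
scancel 123456
scontrol update NodeName=node042 State=DOWN Reason=cg
scontrol update NodeName=node042 State=RESUME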
Best,
Florian
Hi Navin,
could it be that you're using LDAP/AD/NIS for user management? If so, check
whether the LDAP server's response is slow or gets slowed down when retrieving
hundreds or thousands of users.
Also, CacheGroups=1 was last supported in v15.08.
Best,
Florian
Hi,
you can run sreport like this:
sreport cluster AccountUtilizationByUser Start=$(date -d "last month" +%D) End=$(date -d "this month" +%D)
or
sreport cluster Utilization Start=$(date -d "last month" +%D) End=$(date -d "this month" +%D)
and script something around it, to show what you're look
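A small wrapper along those lines, reporting in hours instead of the default minutes (-t Hours is a standard sreport option; the rest is just scripting):
#!/bin/bash
START=$(date -d "last month" +%D)
END=$(date -d "this month" +%D)
sreport -t Hours cluster AccountUtilizationByUser Start="$START" End="$END"
sreport -t Hours cluster Utilization Start="$START" End="$END"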
See the no-requeue option for SBATCH:
--no-requeue
Specifies that the batch job should never be requeued under any circumstances.
Setting this option will prevent system administrators from being able to
restart the job (for example, after a scheduled downtime), recover from a node
failure, or
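For example, either as a directive in the job script or on the command line:
#SBATCH --no-requeue
# or:
sbatch --no-requeue job.sh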
+1 for features.
Features can also be added / changed during runtime like this "scontrol update
Node=$(hostname -s) AvailableFeatures=$FEAT ActiveFeatures=$FEAT"
Cheers,
Florian
From: slurm-users on behalf of Ward Poelmans
Sent: Monday, February 6, 2023 09:03
To:
Hi,
follow this guide: https://slurm.schedmd.com/power_save.html
Create poweroff / poweron scripts and configure Slurm to power nodes off after
X idle minutes. Works well for us. Make sure to set an appropriate time
(ResumeTimeout) to allow the node to come back into service.
Note that we did not achi
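A minimal slurm.conf sketch of such a setup (paths and timings are illustrative, not our actual values):
SuspendProgram=/etc/slurm/poweroff.sh
ResumeProgram=/etc/slurm/poweron.sh
SuspendTime=1800        # power off after 30 idle minutes
SuspendTimeout=120
ResumeTimeout=600       # give the node enough time to boot and start slurmd
SuspendExcNodes=login[01-02]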
Hi,
I'm experiencing a strange issue related to a CPU swap (8352Y -> 6326) on two
of our nodes. I adapted the slurm.conf to accommodate the new CPU:
slurm.conf: NodeName=ice27[57-58] CPUs=64 Sockets=2 CoresPerSocket=16
ThreadsPerCore=2 RealMemory=257550 MemSpecLimit=12000
which is also what slur
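One way to cross-check on the node itself: slurmd -C prints the hardware configuration as slurmd detects it, in slurm.conf syntax, so it can be compared against the line above:
# run on the node, compare with the NodeName line in slurm.conf
slurmd -C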
Hi,
note that times reported by sacct may differ from the net times. For example,
imagine a test job like this:
date
sleep 1m
date
sacct reports:
$ sacct -j 225145 -X -o jobid,start,end
JobID Start End
--- ---
2251
Hi all,
we're using OpenHPC packages to run SLURM. Current OpenHPC Version is 1.3.8
(SLURM 18.08.8), though we're still at 1.3.3 (SLURM 17.02.7), for now.
I've successfully attempted an upgrade in a separate testing environment, which
works fine once you adhere to the upgrading notes... So the
You can use the FirstJobId option from slurm.conf to continue the JobIds
seamlessly.
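For illustration (the value is made up; it just has to be higher than the last job ID on the old setup):
# slurm.conf
FirstJobId=50001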
Kind Regards,
Lech
> Am 18.10.2019 um 11:47 schrieb Florian Zillner :
>
> Hi all,
>
> we’re using OpenHPC packages to run SLURM. Current OpenHPC Version is 1.3.8
> (SLURM 18.08.8), though we’re still a
Hi,
I guess you could use a lua script to filter out flags you don't want. I
haven't tried it with mail flags, but I'm using a script like the one
referenced to enforce accounts/time limits, etc.
https://funinit.wordpress.com/2018/06/07/how-to-use-job_submit_lua-with-slurm/
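For reference, the plugin is enabled in slurm.conf and then loads job_submit.lua from the same directory as slurm.conf:
# slurm.conf
JobSubmitPlugins=lua
# the script itself lives at e.g. /etc/slurm/job_submit.lua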
Cheers,
Florian
-
Hi,
I'm experimenting with slurm's power saving feature and shutdown of "idle"
nodes works in general, also the power up works when "idle~" nodes are
requested.
So far so good, but slurm is also shutting down nodes that are not explicitly
"idle". Previously I drained a node to debug something o
behalf of Steffen Grunewald
Sent: Thursday, 14 May 2020 15:34
To: Slurm User Community List
Subject: [External] Re: [slurm-users] Node suspend / Power saving - for *idle*
nodes only?
On Thu, 2020-05-14 at 13:10:04 +0000, Florian Zillner wrote:
> Hi,
>
> I'm experimenting w
From: slurm-users on behalf of Florian Zillner
Sent: Thursday, 14 May 2020 15:43
To: Slurm User Community List
Subject: Re: [slurm-users] [External] Re: Node suspend / Power saving - for
*idle* nodes only?
Well, the documentation is rather clear o
Hi Stephan,
From the slurm.conf docs:
---
BatchFlag
Jobs submitted using the sbatch command have BatchFlag set to 1. Jobs submitted
using other commands have BatchFlag set to 0.
---
You can look that up e.g. with scontrol show job <jobid>. I haven't checked,
though, how to access that via lua. If you kno
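From the shell, the flag shows up in scontrol's job output (the job ID is a placeholder):
scontrol show job 12345 | grep -o 'BatchFlag=[0-9]*'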