That SIGTERM message means something is telling slurmdbd to quit.
Check your cron jobs, maintenance scripts, etc. Slurmdbd is being told
to shut down. If you are running it in the foreground, a ^C does that. If
you run a kill or killall on it, you will get the same message.
Brian Andrus
On 5
Oh, to address the passed train:
Restore the archive data with "sacctmgr archive load", then you can do
as you need.
From man sacctmgr:
*archive* {dump|load}
Write database information to a flat file or load information that
has previously been written to a file.
Brian Andrus
Instead of using the archive files, couldn't you query the db directly
for the info you need?
I would recommend sacct/sreport if those can get the info you need.
Brian Andrus
On 5/28/2024 9:59 AM, O'Neal, Doug (NIH/NCI) [C] via slurm-users wrote:
My organization needs to access historic job
I would guess either you install GPU drivers on the non-GPU nodes or
build slurm without GPU support for that to work due to package
dependencies.
Both viable options. I have done installs where we just don't compile
GPU support in and that is left to the users to manage.
Brian Andrus
On 5
as versions are compatible, they can work together.
You will need to be aware of differences for jobs and configs, but it is
possible.
Brian Andrus
On 5/22/2024 12:45 AM, Arnuld via slurm-users wrote:
We have several nodes, most of which have different Linux
distributions (distro for short
Rike,
Assuming the data, scripts and other dependencies are already on the
cluster, you could just ssh and execute the sbatch command in a single
shot: ssh submitnode sbatch some_script.sh
It will ask for a password if appropriate and could use ssh keys to
bypass that need.
Brian Andrus
or added an override file, that
will affect things.
Brian Andrus
On 4/19/2024 10:15 AM, Jeffrey Layton wrote:
I like it, however, it was working before without a slurm.conf in
/etc/slurm.
Plus the environment variable SLURM_CONF is pointing to the correct
slurm.conf file (the one in /cm
/slurm/ on the node(s).
Brian Andrus
On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
Good afternoon,
I'm working on a cluster of NVIDIA DGX A100's that is using BCM 10
(Base Command Manager which is based on Bright Cluster Manager). I ran
into an error and only just learned that Slurm
will want to sync the config across all nodes and then 'scontrol
reconfigure'
You may want to look into configless if you can set DNS entries and your
config is basically monolithic or all parts are in /etc/slurm/
Brian Andrus
On 4/15/2024 2:55 AM, Xaver Stiensmeier via slurm-users wrote:
Dear
Yes. You can build the 8 rpms on 9. Look at 'mock' to do so. I did
similar when I still had to support EL7
Fairly generic plan, the devil is in the details and verifying each
step, but those are the basic bases you need to touch.
Brian Andrus
On 4/10/2024 1:48 PM, Steve Berg via slurm
path to look at.
Brian Andrus
On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote:
Dear slurm user list,
we make use of elastic cloud computing i.e. node instances are created
on demand and are destroyed when they are not used for a certain amount
of time. Created instances are set up
by the name it has in the
slurm.conf file.
Also, a quick way to do the failover check is to run (from the backup
controller): scontrol takeover
Brian Andrus
On 3/25/2024 1:39 PM, Miriam Olmi wrote:
Hi Brian,
Thanks for replying.
In my first message I forgot to specify that the primary
Quick correction, it is StateSaveLocation, not SlurmSaveState.
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
Dear all,
I am having trouble finalizing the configuration of the backup
controller for my slurm cluster.
In principle, if no job is running everything seems
Miriam,
You need to ensure the SlurmSaveState directory is the same for both.
And by 'the same', I mean all contents are exactly the same.
This is usually achieved by using a shared drive or replication.
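In slurm.conf terms, that means pointing both controllers at one shared directory; the path here is only an example:

```
# slurm.conf on both the primary and backup controller
StateSaveLocation=/shared/slurm/state
```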
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
Dear all,
I am
Wow, snazzy!
Looks very good. My compliments.
Brian Andrus
On 3/12/2024 11:24 AM, Victoria Hobson via slurm-users wrote:
Our website has gone through some much needed change and we'd love for
you to explore it!
The new SchedMD.com is equipped with the latest information about
Slurm, your
Chip,
I use 'sacct' rather than sreport and get individual job data. That is
ingested into a db and PowerBI, which can then aggregate as needed.
sreport is pretty general and likely not the best for accurate
chargeback data.
Brian Andrus
On 3/4/2024 6:09 AM, Chip Seraphine via slurm-users
Joseph,
You will likely get many perspectives on this. I disable swap completely
on our compute nodes. I can be draconian that way. For the workflow
supported, this works and is a good thing.
Other workflows may benefit from swap.
Brian Andrus
On 3/3/2024 11:04 PM, John Joseph via slurm
Brian Andrus
On 2/28/2024 12:54 PM, Dan Healy wrote:
Are most of us using HAProxy or something else?
On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users
wrote:
Magnus,
That is a feature of the load balancer. Most of them have that
these days.
Brian
Magnus,
That is a feature of the load balancer. Most of them have that these days.
Brian Andrus
On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote:
On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:
for us, we put a load balancer in front
disconnection
for any reason even for X-based apps.
Personally, I don't care much for interactive sessions in HPC, but there
is a large body that only knows how to do things that way, so it is there.
Brian Andrus
On 2/26/2024 12:27 AM, Josef Dvoracek via slurm-users wrote:
What is the recommended way
I imagine you could create a reservation for the node and then when you
are completely done, remove the reservation.
Each helper could then target the reservation for the job.
Brian Andrus
On 2/9/2024 5:52 PM, Alan Stange via slurm-users wrote:
Chip,
Thank you for your prompt response. We
elsewhere.
Brian Andrus
On 1/26/2024 6:38 AM, Michael Lewis wrote:
Hi All,
I’m trying to get slurm-23.11.3 running on Ubuntu 20.04 on a
standalone system. I’m running into an issue I cannot find the
answer to. After compiling and installing when I fire up slurmctld
While I am not sure of your specifics, you could easily add lines to
your suspend/resume scripts to check/wait/etc if there are tasks waiting.
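For example, a wrapper along these lines for your SuspendProgram would refuse to power a node down while jobs are still on it. This is only a sketch: the helper names and the POWER_DOWN call are assumptions, so adjust to your setup.

```shell
node_busy() {
  # squeue prints one line per job still on the node; non-empty output means busy
  [ -n "$(squeue -h -w "$1" 2>/dev/null)" ]
}

suspend_node() {
  local node="$1"
  if node_busy "$node"; then
    echo "skipping $node: jobs still present"
    return 1
  fi
  scontrol update NodeName="$node" State=POWER_DOWN
}
```

A check/wait loop around node_busy works the same way if you would rather delay the suspend than skip it.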
Brian Andrus
On 1/15/2024 12:22 AM, 김종록 wrote:
Hello.
I'm going to use Slurm's cloud feature in private cloud.
The problem is that the scale out
a submit/login node.
Brian Andrus
On 12/15/2023 2:00 AM, Felix wrote:
Hello
we are installing a new server with slurm on ALMA Linux 9.2
we did the following:
dnf install slurm
The result is
rpm -qa | grep slurm
slurm-libs-22.05.9-1.el9.x86_64
slurm-22.05.9-1.el9.x86_64
Now when trying
filled on the node. You can run 'df -h' and see
some info that would get you started.
Brian Andrus
On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote:
Dear slurm-user list,
during a larger cluster run (the same I mentioned earlier 242 nodes), I
got the error "SlurmdSpoolDir full". The Slur
y of being able to have "keep at
least X nodes up and idle" would be nice, that is not how I see this
documented or working.
Brian Andrus
On 11/23/2023 5:12 AM, Davide DelVento wrote:
Thanks for confirming, Brian. That was my understanding as well. Do
you have it working that way o
As I understand it, that setting means "Always have at least X nodes
up", which includes running jobs. So it stops any wait time for the
first X jobs being submitted, but any jobs after that will need to wait
for the power_up sequence.
Brian Andrus
On 11/22/2023 6:58 AM, David
Eg,
Could you be more specific as to what you want?
Is there a specific user you want to control, or no user should get more
than x cpus in the partition? Or no single job should get more than x cpus?
The details matter to determine the right approach and right settings.
Brian Andrus
On 11
slurm users
belong to and add them to slurmdbd. Once they are in there, you can set
defaults with exceptions for specific users.
If you are only looking to have settings apply to all users, you don't
have to import the users. Set the QoS for the partition.
Brian Andrus
On 11/20/2023 1:45 PM, ego
How do you 'manually create a directory'? That is when the root
ownership would be set. After creating it, you can chown/chmod it as
well.
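Something like this sketch, with the path and the slurm user as assumptions you would match to your own SlurmdSpoolDir and accounts:

```shell
# Create the spool directory yourself with the ownership slurmd expects.
setup_spool_dir() {
  local dir="${1:-/var/spool/slurmd}"
  mkdir -p "$dir"
  chmod 0755 "$dir"
  # chown needs root; guarded so the sketch also runs unprivileged
  if [ "$(id -u)" -eq 0 ] && id slurm >/dev/null 2>&1; then
    chown slurm:slurm "$dir"
  fi
}

# e.g.: setup_spool_dir /var/spool/slurmd
```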
Brian Andrus
On 11/18/2023 7:35 AM, Arsene Marian Alain wrote:
Dear slurm community,
I run slurm 21.08.1 under Rocky Linux 8.5 on my
Vlad,
Actually, it looks like it is working. You are using v0.39 for the
parser, which is trying to use OpenAPI calls. Unless you compiled with
OpenAPI, that won't work.
Try using the 0.37 version and you may see a simpler result that is
successful.
Brian Andrus
On 6/28/2023 11:05 AM
in before the additional node for Job B is expected to
be available, so it runs on the idle node.
Brian Andrus
On 6/26/2023 3:48 PM, Reed Dier wrote:
Hoping this will be an easy one for the community.
The priority schema was recently reworked for our cluster, with only
PriorityWeightQOS
do what (a node-locked
license, for example). Then you can send a job to a specific subset of
nodes.
Quite a few other ways to design the ability you describe, but separate
clusters is not one of them.
Brian Andrus
On 6/26/2023 6:11 AM, mohammed shambakey wrote:
Hi
Just out of interest, I
Second that.
Prometheus+slurm exporter+grafana works great.
Brian Andrus
On 6/12/2023 8:20 AM, Josef Dvoracek wrote:
> But I'd be interested to see what other places do.
we installed this: https://github.com/vpenso/prometheus-slurm-exporter
and scrape this exporter with "inputs.pr
Make sure you have configured the RebootProgram in slurm.conf, that it
exists on the nodes and is executable by the user.
This is usually /sbin/reboot
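In slurm.conf that is just:

```
RebootProgram="/sbin/reboot"
```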
Brian Andrus
On 6/7/2023 7:50 AM, Heinz, Michael wrote:
Hey, all.
So I added slurmdbd to our slurm-23.02 install and made my account
That output of slurmd -C is your answer.
Slurmd only sees 6GB of memory and you are claiming it has 10GB.
I would run some memtests, look at meminfo on the node, etc.
Maybe even check that the type/size of memory in there is what you think
it is.
Brian Andrus
On 5/25/2023 7:30 AM, Roger
are
running on.
Brian Andrus
On 5/17/2023 10:49 AM, Groner, Rob wrote:
I'm not sure what you mean by "if they have the permissions". I'm
talking about someone who is specifically designated as "coordinator"
of an account in slurm. With that designation, and no other
running jobs, that would take a bit more effort
to set up, but is an alternative.
Brian Andrus
On 5/17/2023 6:40 AM, Groner, Rob wrote:
I was asked to see if coordinators could do anything in this scenario:
* Within the account that they coordinated, User A submitted 1000s
of jobs and left
unts in a comma separated list (e.g
"nid[10-20]:4,nid[80-90]:2"). By default no nodes are excluded. This
value may be updated with scontrol. See
ReconfigFlags=KeepPowerSaveSettings for setting persistence.
Brian Andrus
On 5/12/2023 2:35 AM, Xaver Stiensmeier wrote:
D
Something I have been impressed with is Netdata
It is in the standard repositories and will auto-detect quite a bit of
things on a node. It is great for real-time monitoring of a node/job.
I also use Prometheus and Grafana for historic data (anything over 5
minutes).
Brian Andrus
On 5/5
:
Hello,
Brian Andrus writes:
Ole is spot on with his federated suggestion. That is exactly what fits the bill
for you, given your requirements. You can have everything you want, but you
don't get to have it how you want (separate databases).
When/If you looked deeper into it, you will find
erent part of the world and trying to federate them
in a performant manner was prohibitively expensive.
Brian Andrus
On 4/29/2023 10:53 PM, Angel de Vicente wrote:
Hi Ole,
Ole Holm Nielsen writes:
Maybe you want to use Slurm federated clusters with a single database
thanks for
at the HA database. One would be primary and the other a
failover (AccountingStorageBackupHost). Although, technically, they
would both be able to be active at the same time.
Brian Andrus
On 4/13/2023 2:49 AM, Shaghuf Rahman wrote:
Hi,
I am setting up Slurmdb in my system and I need some inputs
the user exists on the node, however you are propagating
the uids.
Brian Andrus
On 4/11/2023 9:48 AM, Jason Simms wrote:
Hello all,
Regularly I'm seeing array jobs fail, and the only log info from the
compute node is this:
[2023-04-11T11:41:12.336] error: /opt/slurm/prolog.sh: exited
:
[Unit]
After=autofs.service getty.target sssd.service
That makes it wait for all of those before trying to start.
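If you would rather not edit the packaged unit file, the same line can go in a systemd drop-in (the unit name here is an assumption, adjust for slurmd vs slurmctld):

```
# /etc/systemd/system/slurmd.service.d/override.conf
[Unit]
After=autofs.service getty.target sssd.service
```

Then run 'systemctl daemon-reload' to pick it up.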
Brian Andrus
On 3/10/2023 7:41 AM, Tristan LEFEBVRE wrote:
Hello to all,
I'm trying to do an installation of Slurm with cgroupv2 activated.
But I'm facing an odd thing : when
to do as well.
I would be interested in any alternatives. Could you point me to some doc?
Best wishes
Gizo
Brian Andrus
On 2/28/2023 7:44 AM, Gizo Nanava wrote:
Hello,
it seems that if slurm power saving is enabled then the parameter
"Weight" seems to be ignored
to get (resource-wise) and how do you want
to use them?
Brian Andrus
On 2/28/2023 9:49 AM, Jake Jellinek wrote:
Hi all
I come from a SGE/UGE background and am used to the convention that I can qrsh
to a node and, from there, start a new qrsh to a different node with different
parameters.
I've
You may be able to use the alternate approach that I was able to do as well.
Brian Andrus
On 2/28/2023 7:44 AM, Gizo Nanava wrote:
Hello,
it seems that if slurm power saving is enabled then the parameter
"Weight" seems to be ignored for nodes that are in a power down state.
ow any of the processes
work, which is why some of us do so many experimental runs of jobs and
gather timings. We have yet to see a 100% efficient process, but folks
are improving things all the time.
Brian Andrus
On 2/13/2023 9:56 PM, Diego Zuccato wrote:
I think that's incorrect:
> Th
HPC jobs. The goal is that every process is utilizing the CPU
as close to 100% as possible, which would render hyper-threading moot.
Brian Andrus
On 2/13/2023 12:15 AM, Hermann Schwärzler wrote:
Hi Sebastian,
I am glad I could help (although not exactly as expected :-).
With your node
commands
are xterm, a shell script containing srun commands, and srun (see the
EXAMPLES section). *If no command is specified, then salloc runs the
user's default shell.*
Brian Andrus
On 2/8/2023 7:01 AM, Jeffrey T Frey wrote:
You may need srun to allocate a pty for the command
Then cluster_run.sh would call sbatch along with the appropriate commands.
Brian Andrus
On 2/7/2023 9:31 AM, Groner, Rob wrote:
I'm trying to setup the capability where a user can execute:
$: sbatch script_to_run.sh
and the end result is that a job is created on a node, and that job
will execute
with the new (known good) config.
Brian Andrus
On 1/17/2023 12:36 PM, Groner, Rob wrote:
So, you have two equal sized clusters, one for test and one for
production? Our test cluster is a small handful of machines compared
to our production.
We have a test slurm control node on a test cluster
.
Brian Andrus
On 1/4/2023 9:22 AM, Groner, Rob wrote:
We currently have a test cluster and a production cluster, all on the
same network. We try things on the test cluster, and then we gather
those changes and make a change to the production cluster. We're
doing that through two different repos
.conf"
You can change those as needed. This made it listen on port 8081 only
(no socket and not 6820)
I was then able to just use curl on port 8081 to test things.
Hope that helps.
Brian Andrus
On 12/29/2022 6:49 AM, Chris Stackpole wrote:
Greetings,
Thanks for responding!
On 12/2
I suspect if you delete /var/lib/slurmrestd.socket and then start
slurmrestd, it will create it as the user you need it to be.
Or just change the owner of it to the slurmrestd owner.
I have been running slurmrestd as a separate user for some time.
Brian Andrus
On 12/28/2022 3:20 PM, Chris
Seems like the time may have been off on the db server at the insert/update.
You may want to dump the database, find what table/records need updated
and try updating them. If anything went south, you could restore from
the dump.
Brian Andrus
On 12/20/2022 11:51 AM, Reed Dier wrote:
Just
Try:
sacctmgr list runawayjobs
Brian Andrus
On 12/20/2022 7:54 AM, Reed Dier wrote:
Hoping this is a fairly simple one.
This is a small internal cluster that we’ve been using for about 6
months now, and we’ve had some infrastructure instability in that
time, which I think may
as the many articles, wikis and videos
out there.
TLDR; If you are going to be running efficient HPC jobs, you are indeed
better off with HT turned off.
Brian Andrus
On 12/13/2022 8:03 AM, Gary Mansell wrote:
Hi, thanks for getting back to me.
I have been doing some more experimenting, and I
to it. Also check the state of the nodes with 'sinfo'
It would also be good to ensure the node settings are right. Run 'slurmd
-C' on a node and see if the output matches what is in the config.
Brian Andrus
On 12/13/2022 1:38 AM, Gary Mansell wrote:
Dear Slurm Users, perhaps you can help me
You may want to look here:
https://slurm.schedmd.com/heterogeneous_jobs.html
Brian Andrus
On 12/7/2022 12:42 AM, Le, Viet Duc wrote:
Dear slurm community,
I am encountering a unique situation where I need to allocate jobs to
nodes with different numbers of CPU cores. For instance
I successfully build it for Rocky straight from the tgz file as usual
with rpmbuild -ta
Brian Andrus
On 12/2/2022 9:21 AM, David Thompson wrote:
Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8
Slurm cluster. We would like to be able to use the sbatch –prefer
option
to submit at all? The reservation method can cause an sbatch
command to be rejected, if that is what you are looking for.
Brian Andrus
On 11/30/2022 6:29 AM, Richard Ems wrote:
Hi all,
I have to change our set up to be able to update the total number of
available licenses due to users checking
Steve,
I suspect you did not install the packages.
You need to install slurm-slurmctld to get the slurmctld systemd files:
# rpm -qlp slurm-slurmctld-20.11.9-1.el7.x86_64.rpm
/run/slurm/slurmctld.pid
/usr/lib/systemd/system/slurmctld.service
/usr/sbin/slurmctld
and reset/recreate it.
That addresses even a miffed software change.
Brian Andrus
On 11/23/2022 5:11 AM, Xaver Stiensmeier wrote:
Hello slurm-users,
The question can be found in a similar fashion here:
https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up
data. There are many ways to do that, but those designs fall under
MariaDB and not Slurm.
Brian Andrus
On 11/1/2022 6:49 PM, Richard Chang wrote:
Does it mean it is best to use a single slurmdbd host in my case?
My primary slurmctld is the backup slurmdbd host, and my worry is if
the primary
Ole,
Fair enough, it is actually slurmctld that does the caching. Technical
typo on my part there.
Just trying to let the user know, there is a window that they have to
ensure no information is lost during a database outage.
Brian Andrus
On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote:
Hi
It caches up to a point. As I understand it, that is about an hour
(depending on size and how busy the cluster is, as well as available
memory, etc).
Brian Andrus
On 10/31/2022 9:20 PM, Richard Chang wrote:
Hi,
Just for my info, I would like to know what happens when SlurmDBD
loses
, but if you aren't having excessive traffic to the
share, you should be good. I have yet to discover what would be
excessive enough to impact things.
The only use I have had for the HA is being able to keep the cluster
running/happy during maintenance.
Brian Andrus
On 10/24/2022 1:14 AM, Ole
is the preferred method?
Rob
*From:* slurm-users on behalf
of Brian Andrus
*Sent:* Friday, September 23, 2022 10:24 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] slurmd and dynamic nodes
of the mix.
Brian Andrus
On 9/23/2022 7:09 AM, Groner, Rob wrote:
I'm working through how to use the new dynamic node features in order
to take down a particular node, reconfigure it (using nvidia MIG to
change the number of graphic cores available) and give it back to slurm.
I'm at the point
Paul,
You are likely spot on with the InactiveLimit change. It may also be an
environment variable of TMOUT (under bash) set.
Brian Andrus
On 9/19/2022 5:46 AM, Paul Raines wrote:
I have had two nights where right at 3:35am a bunch of jobs were
killed early with TIMEOUT way before
on a large cluster
sometime in your career, I would not recommend using it there.
Brian Andrus
On 9/16/2022 3:06 PM, Davide DelVento wrote:
Hi Brian,
From your response, I speculate that my wording sounded harsh or
unrespectful. That was not my intention and therefore I sincerely
apologize
free to do that. It is not something that
scales well, but it looks like you have a rather beginner cluster that
would never be impacted by such choices.
Brian Andrus
On 9/16/2022 10:00 AM, Davide DelVento wrote:
Thanks Brian.
I am still perplexed. What is a database to install, administer
) Update the database (sacctmgr command)
As you can see, that 1st step would be highly dependent on you and your
environment. The 2nd step would be dependent on what things you are
tracking within that.
Brian Andrus
On 9/16/2022 5:01 AM, Davide DelVento wrote:
So if I understand correctly
You can dynamically modify the weight of nodes with:
scontrol update NodeName=<nodename> Weight=<weight>
So, in theory, you could do that periodically to adjust the weights you
may want.
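A periodic pass could look like this sketch; the load metric and the formula are assumptions, any number you can script works the same way:

```shell
# Re-weight a node from some current metric (e.g. load average).
set_weight_from_load() {
  local node="$1" load="$2"
  # busier node -> larger weight -> the scheduler picks it later
  local weight=$(( 1 + load * 10 ))
  scontrol update NodeName="$node" Weight="$weight"
}

# e.g. from cron: read each node's load, then call set_weight_from_load
```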
Brian Andrus
On 9/15/2022 4:27 PM, Russell Smithies wrote:
Can I set dynamic or calculated “weights” for nodes
ense count that is in the
database updated to match the number free from flexlm to stop license
starvation due to users outside slurm using them up so they really
aren't available to slurm.
Brian Andrus
On 9/15/2022 3:34 PM, Davide DelVento wrote:
I am a bit confused by remote licenses.
https://lists.s
configured?
Brian Andrus
On 9/15/2022 2:49 PM, Davide DelVento wrote:
I have a super simple prolog script, as follows (very similar to the
example one)
#!/bin/bash
if [[ $VAR == 1 ]]; then
    echo "True"
fi
exit 0
This fails (and obviously causes great disruption to my production
j
I had completely forgotten about HETJOB supporting multiple partitions.
Thanks for reminding me.
Brian Andrus
On 9/12/2022 6:06 AM, Marcus Wagner wrote:
yes, that is possible by submitting a hetjob.
Best
Marcus
Am 08.09.2022 um 20:38 schrieb Purvesh Parmar:
We require more nodes to run
No, however a node can reside in multiple partitions.
So if you add those nodes to the partition they are running in, they
will be available to them.
Brian Andrus
On 9/8/2022 11:38 AM, Purvesh Parmar wrote:
We require more nodes to run a single job which requires more nodes
than present
Possibly way off base, but did you happen to do any of the editing in
Windows? Maybe running into the cr/lf issue for how windows saves text
files?
Brian Andrus
On 9/7/2022 5:21 AM, Davide DelVento wrote:
Thanks Ole, your wiki page sheds some light on this mystery.
Very frustrating that even
Try setting logging to debug mode, then you can get some info from the logs.
Brian Andrus
On 9/1/2022 8:15 PM, Davide DelVento wrote:
Thanks.
I did try a lua script as soon as I got your first email, but that
never worked (yes, I enabled it in slurm.conf and ran "scontrol
reconfigure&q
)
Usually it would be found at /usr/lib64/slurm/job_submit_lua.so
If that is there, you should be good with trying out a job_submit lua
script.
Brian Andrus
On 9/1/2022 1:24 PM, Davide DelVento wrote:
Thanks again, Brian, indeed that grep returns many hits, but none of
them includes lua, i.e
I would be surprised if it were compiled without the support. However,
you could check and run something like:
strings /sbin/slurmctld | grep job_submit
(or where ever your slurmctld binary is). There should be quite a few
lines with that in it.
Brian Andrus
On 9/1/2022 10:54 AM, Davide
Not sure if you can do all the things you intend, but the job_submit
script is precisely where you want to check submission options.
https://slurm.schedmd.com/job_submit_plugins.html
Brian Andrus
On 8/30/2022 12:58 PM, Davide DelVento wrote:
Hi,
I would like to soft-enforce license
It looks to me like you have the same node in multiple partitions. If
the output you are getting is basically what you want just pipe it to
'sort -u' or 'uniq'
Brian Andrus
On 8/8/2022 10:14 AM, Borchert, Christopher B ERDC-RDE-ITL-MS CIV wrote:
Hello. How can I simply show the status
This is actually brilliant!
Brian Andrus
On 8/3/2022 10:20 PM, Gerhard Strangar wrote:
Phil Chiu wrote:
- Individual slurm jobs which reboot nodes - With a for loop, I could
submit a reboot job for each node. But I'm not sure how to limit this so at
most N jobs are running
So an example of using slurm to reboot all nodes 3 at a time:
sinfo -h -o %n | sort -u | xargs -I{} --max-procs=3 scontrol reboot {}
If you want to get fancy, make a script that does the reboot and waits
for the node to be back up before exiting and use that instead of the
'scontrol reboot' part.
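A sketch of that wrapper, with the caveat that the state parsing is an assumption (sinfo marks non-responding nodes with a '*' suffix):

```shell
# Reboot one node and block until sinfo stops flagging it as not responding.
reboot_and_wait() {
  local node="$1" poll="${2:-30}"
  scontrol reboot "$node"
  # a production version would first wait for the node to actually go down
  until sinfo -h -n "$node" -o %T | grep -qv '\*'; do
    sleep "$poll"
  done
}
```

With 'export -f reboot_and_wait' it can replace the bare 'scontrol reboot' in the xargs pipeline so --max-procs really limits concurrency.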
Brian
the compute
nodes do.
Brian Andrus
On 8/2/2022 6:45 AM, Paul Edmon wrote:
No, the node running the slurmctld does not need access to any of the
customer facing filesystems or home directories. While all the login
and client nodes do, the slurmctld does not.
-Paul Edmon-
On 8/2/2022 9:30 AM
Lloyd,
You could check out the order of entries in your pam.d/sshd (and
related/included) files.
See where the slurm_pam_adopt module is, how it is being called and if
there are settings that are interfering.
Does this occur only on a single node, or all of them?
Brian Andrus
On 7/27/2022 9:29
Verify that their uid on the node is the same as the uid your master sees
Brian Andrus
On 7/27/2022 8:53 AM, byron wrote:
Hi
When a user tries to login into a compute node on which they have a
running job they get the error
Access denied: user blahblah (uid=) has no active jobs
Hmm. That would imply you could still use the tar file with something like:
rpmbuild -v -ta --define "_lto_cflags %{nil}" slurm-22.05.2.tar.bz2
Note, I have not tried this (no immediate access to RHEL9 derivative),
so YMMV.
Brian Andrus
On 7/21/2022 10:15 AM, Kilian Cavalotti
that ensures both are getting
accurate and current information.
Brian Andrus
On 6/27/2022 9:15 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
We noticed that slurmdbd provide the conf option *DbdBackupHost* for
user to set a secondary slurmdbd node. Since slurmdbd is closely
related to database
respectively.
*NOTE*: The suffix "*" identifies nodes that are presently not
responding.
Brian Andrus
On 6/26/2022 5:39 AM, z1...@arcor.de wrote:
Hello,
if I call "sinfo -o %all", the worker state includes only a single state
word like "DRNG".
It is clearer in
/groups.
Brian Andrus
On 6/9/2022 5:19 PM, Willy Markuske wrote:
Hello All,
I have a request from users for the ability to have persistent
interactive jobs. Currently some users are using srun to allocate and
interactive job and run their scripts but sshd will close connections
after 2 hours
Offhand, I would suggest double-checking munge and the versions of
slurmd/slurmctld.
Brian Andrus
On 6/3/2022 3:17 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
Our cluster set up 2 slurm control node and scontrol show config as below:
> scontrol show config
…
SlurmctldHost[0] = slu
Add it to your slurm.conf
Then it is always there after a restart.
Brian Andrus
On 6/2/2022 12:05 PM, Hanby, Mike wrote:
Howdy,
I can’t seem to find a solution in ‘man slurm.conf’ for this. How can
I make the following persist a slurmctld restart:
scontrol update NodeName="
you give the
permission to that they will not abuse it.
Brian Andrus
On 5/18/2022 12:22 AM, GHui wrote:
Hi, Brian Andrus
I think the main poblem is that container can cheat Slurm.
On 5/17/22 06:58:20, Brian Andrus wrote:
You are starting to understand a major issue with most containers.
I
watch
the logs to see when it is happy). Don't start slurmctld until that is
done. Waiting makes things easier.
Brian Andrus
On 5/17/2022 9:29 AM, Paul Edmon wrote:
I think it should be, but you should be able to run a test and find out.
-Paul Edmon-
On 5/17/22 12:13 PM, byron wrote:
Sorry, I
You need to step upgrade through major versions (not minor).
So 19.05=>20.x
I would highly recommend going to 21.08 while you are at it.
I just did the same migration (although they started at 18.x) with no
issues. Running jobs were not impacted and users didn't even notice.
Brian Andrus