, the devil is in the details on how to
define/get what you want.
Brian Andrus
On 8/17/2020 10:13 AM, Gerhard Strangar wrote:
Hello,
I'm wondering if it's possible to have slurm 19 run two partitions (low
and high prio) that share all the nodes and limit the high prio
partition in number of nodes
to
schedule them in that fashion outweighs the resources needed by far.
Brian Andrus
On 8/28/2020 3:30 AM, navin srivastava wrote:
Hi Team,
We are facing one issue: several users are submitting 2 jobs in a single batch
job, which are very short jobs (say 1-2 sec), so while submitting more
jobs slurmctld
That is where you have it call a bash script and within the script you
do as needed.
Like Ahmet's suggested script.
So use his as a template and add the headers you desire.
Brian Andrus
On 8/28/2020 11:36 AM, Chris Samuel wrote:
On 8/27/20 3:42 pm, Brian Andrus wrote:
Actually, you can add
Actually, you can add headers of all kinds:
Quick search of "sendmail add headers" discovers:
https://serverfault.com/questions/347602/sending-e-mail-from-sendmail-with-headers
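A minimal sketch of such a wrapper (untested; you would set it as MailProg in slurm.conf, and the X-Cluster header, path, and invocation convention are assumptions to adapt):
#!/bin/bash
# Sketch of a MailProg wrapper. slurmctld invokes MailProg like
# /bin/mail: wrapper -s "subject" recipient [recipient ...]
subject="$2"; shift 2
{
  printf 'To: %s\n' "$*"            # note: multiple recipients may need commas
  printf 'Subject: %s\n' "$subject"
  printf 'X-Cluster: mycluster\n'   # any extra headers you desire
  printf '\n'
  cat                               # message body arrives on stdin
} | /usr/sbin/sendmail -t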
Brian Andrus
On 8/26/2020 10:02 PM, Andrew Elwell wrote:
Hi folks,
I'm getting fed up receiving out
IIRC, that is because it is trying to do the 'configless' feature of
slurm 20 where it uses DNS entries to find the config.
This will happen if /etc/slurm.conf does not exist on the node.
Check that you have that and that it is the same as the one on the master.
Brian Andrus
On 8/24/2020 7
een places where that can take 24 hours.
Brian Andrus
On 9/29/2020 6:18 AM, Diego Zuccato wrote:
Hello all.
One of the users is unable to submit jobs to our cluster.
The first time he tries, he gets
$ sbatch test.job
sbatch: fatal: Invalid user id: 621049927
then:
$ sbatch test.job
sbatch: er
on the node waiting to be
resumed, but the node resources may get assigned to other jobs while
they wait to resume.
Brian Andrus
On 9/22/2020 2:33 PM, Ransom, Geoffrey M. wrote:
Hello
We had a user post a large number of array jobs with a short actual
run time (20-80 seconds, but mostly
Heh. That is the on-going "user education"
You could change the amount of ram requested using a job_submit lua
script, but that could bite those that are accurate with their requests.
Or set a max ram for the partition.
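For the partition route, something like this in slurm.conf (a sketch; partition name, node list, and value are placeholders):
PartitionName=batch Nodes=node[01-10] MaxMemPerCPU=4000 Default=YES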
Brian Andrus
On 5/27/2020 3:46 PM, Marcelo Z. Silva wrote:
packages. Source control for me
is just that spec file.
Brian Andrus
On 10/20/2020 8:46 AM, Michael Jennings wrote:
On Tuesday, 20 October 2020, at 15:49:25 (+0800),
Kevin Buckley wrote:
On 2020/10/20 11:50, Christopher Samuel wrote:
I forgot I do have access to a SLES15 SP1 system, that has
slurm daemons
going down.
Brian Andrus
On 7/21/2020 7:44 AM, Peter Mayes wrote:
Hi,
My first post to the list, so apologies if this is a FAQ,
My configuration has two nodes allocated for Slurm masters, with a
highly-available NFS server mounting a filesystem across the two nodes.
I need advice
This is very likely by design of the cluster and/or network. Otherwise
users could use the cluster to mine bitcoin and such.
Brian Andrus
On 8/2/2020 7:11 AM, Mahmood Naderan wrote:
I thought that maybe srun doesn't transfer all settings from the head
node to the compute node.
The wget
, the partition is
used to determine which node(s) to use and to filter/order jobs. You should add
the node to the new partition, but also leave it in the 'test'
partition. If you are looking to remove the 'test' partition, set it to
down and once all the running jobs that are in it finish, then remove it.
Brian
you set that in the slurm.conf to continue the numbering from where
you left off so there are no entries in accounting that get replaced.
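The parameter is FirstJobId, e.g. (the value is a placeholder; pick something above the old cluster's last job id):
# slurm.conf
FirstJobId=500000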
Brian Andrus
On 7/8/2020 3:15 AM, Simon Kainz wrote:
Hello,
we have a long-running slurm cluster, accounting into slurmdbd/mysql
backend on the cluster
thentication
<https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Host-based_Authentication> because
normal users have no business on those servers!
Brian Andrus
On 6/17/2020 1:26 AM, Ole Holm Nielsen wrote:
On 6/9/20 5:45 PM, Michael Jennings wrote:
On Tuesday, 09 June 2020, at 12:43:34
them outside the cluster.
Brian Andrus
On 6/19/2020 5:04 AM, David Baker wrote:
Hello,
We are currently helping a research group to set up their own Slurm
cluster. They have asked a very interesting question about Slurm and
file systems. That is, they are posing the question -- do you need
/configless_slurm.html
Brian Andrus
Sounds like a race condition where slurmd is starting before the node is
truly ready.
You can try adding dependencies for slurmd so it will not start until
some other needed service is running.
The benefits of systemd :)
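For example, a drop-in like this (a sketch; adjust the listed units to whatever your node actually needs), followed by 'systemctl daemon-reload':
# /etc/systemd/system/slurmd.service.d/deps.conf
[Unit]
After=network-online.target remote-fs.target munge.service
Wants=network-online.target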
Brian Andrus
On 6/9/2020 10:53 AM, Dumont, Joey wrote:
Hi,
I
are
running.
slurmd should be running as root. It needs to be able to do a few things
including run the job as the user that submitted it. Things that only
root should be doing.
Brian Andrus
On 6/2/2020 2:00 PM, Ferran Planas Padros wrote:
Hi Ole,
I run the same version of slurm in all
root could be quite useful. Especially for service accounts.
Yes, there can be a workaround using sudo, but it seems better if we
could track things in slurm to know a job was run 'on behalf of' another
user.
Thoughts, suggestions, current approaches?
Thanks,
Brian Andrus
Is there a reason to run them as a single job?
It may be easier to just have 2 separate jobs of 16 cores each.
If there are dependency requirements, that is addressed by adding any
dependencies to the job submission.
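For example (script names are placeholders):
jobid=$(sbatch --parsable part1.sh)
sbatch --dependency=afterok:${jobid} part2.sh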
Brian Andrus
On 7/25/2020 2:50 AM, Даниил Вахрамеев wrote:
Hi everyone
calls as too many of them can tip a system over.
Brian Andrus
lua, if I may ask?
Brian Andrus
On 7/27/2020 9:52 AM, Baer, Troy wrote:
There's an outstanding feature request for that:
https://bugs.schedmd.com/show_bug.cgi?id=8383
While waiting on that, we've taken to injecting it into the job's environment
ourselves in the Lua submit filter.
--Troy
You are trying to use sbatch with the "--uid" option which is only
allowed by root.
Either run sbatch as the user doing the request (which should be the
same user that is running rstudio) or use 'sudo -u <user>' to run sbatch.
Brian Andrus
On 7/20/2020 7:50 AM, Sidhu, Khushwant wrote:
Ah,
They are assuming you are running the web interface as root.
If your environment is secure enough, you can do that. Or, grant your
web server user privileges in slurm to be allowed to use the "--uid" option.
Brian Andrus
On 7/20/2020 8:39 AM, Sidhu, Khushwant wrote:
H
That package looks to be built for a system with an nvidia gpu installed.
Look for (or build) different packages if you are not going to use a
gpu-based node.
Brian Andrus
On 12/4/2020 11:32 AM, Mullen, Drew wrote:
Howdy
I'm getting this error installing slurm 20.02.4:
Error: Package
in a completed state for a period of time,
but they are not showing up at all on our cluster.
How does one have jobs show up that are completed?
Brian Andrus
over a direct-connect or VPN.
Brian Andrus
On 12/15/2020 12:02 PM, Sajesh Singh wrote:
We are currently investigating the use of the cloud scheduling
features within an on-site Slurm installation and was wondering if
anyone had any experiences that they wish to share of trying to use
Check your hosts file and ensure 'localhost' does not have an IPV6
address associated with it.
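A quick check (the ::1 line shown is just what a typical offender looks like):
grep -w localhost /etc/hosts
# if "::1" lists localhost, remove localhost from that line, e.g.:
# ::1   ip6-localhost ip6-loopback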
Brian Andrus
On 12/14/2020 4:19 PM, Alpha Experiment wrote:
Hi,
I am trying to run slurm on Fedora 33. Upon boot the slurmd daemon is
running correctly; however the slurmctld daemon always errors
to more
fetches, wasting effort.
This is a VERY simplistic description, but the point is that
hyperthreading is not a silver bullet that will improve HPC performance
if you are maximizing your resource utilization.
Ok, I will get off my soapbox :)
Brian Andrus
On 11/4/2020 7:30 AM, Jean
We would need more information.
At a minimum, what client is it? As this is not a slurm issue, you would
need to dig into what is causing that behavior with your storage system.
Brian Andrus
On 1/20/2021 10:53 AM, John McCulloch wrote:
Our shared storage client daemon is utilizing 100
customers for Tim to keep things running as well
as he has. I'm pretty sure most folks that have used slurm for any period of
time have received more value than a small support contract would cost.
Brian Andrus
On 1/25/2021 7:35 AM, Jeffrey T Frey wrote:
...I would say having SLURM rpms in EPEL could be very
You would need to have a direct connect/vpn so the cloud nodes can
connect to your head node.
Brian Andrus
On 1/22/2021 10:37 AM, Sajesh Singh wrote:
We are looking at rolling out cloud bursting to our on-prem Slurm
cluster and I am wondering how to deal with the slurm.conf variable
mean their child can :)
Brian Andrus
On 1/15/2021 6:38 AM, Durai Arasan wrote:
Hi,
As you know for each partition you can specify
AllowAccounts=account1,account2...
I have a parent account say "parent1" with two child accounts "child1"
and "child2"
I expected that
have been able to deploy completely to cloud using only
slurm. It has the ability to integrate into any cloud cli, so nothing
else has been needed. Just for the heck of it, I am thinking of
integrating it into Terraform, although not necessary.
Brian Andrus
On 1/26/2021 11:48 AM, Robert Kudyba
Ahh.
On one of the new nodes, do:
slurmd -C
The output of that will tell you what those settings should be. I
suspect they are off, which forces them into drain mode.
Brian Andrus
On 1/28/2021 12:25 PM, Chandler wrote:
Andy Riebs wrote on 1/28/21 07:53:
If the only changes to your system
it, slurm assumes all memory on the node for the job. So, even if you
are only using 1 cpu, all the memory is allocated, leaving none for any
other job to run on the unallocated cpus.
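One way to avoid that is to give jobs a per-CPU default instead of the whole node, e.g. in slurm.conf (the value is a placeholder):
DefMemPerCPU=4000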
Brian Andrus
On 1/28/2021 2:15 PM, Chandler wrote:
Brian Andrus wrote on 1/28/21 13:59:
What
You are getting close :)
You can see why n010 is able to have multiple jobs. It shows more
resources available.
What are the specific requests for resources from a job?
Nodes, Cores, Memory, threads, etc?
Brian Andrus
On 1/28/2021 12:52 PM, Chandler wrote:
OK I'm getting this same output
The net effect is that the environment gets setup the same as if the
user had opened a shell console.
Brian Andrus
On 1/26/2021 2:13 AM, Gestió Servidors wrote:
Hi,
My environment is this:
* Users are using “bash” as the default shell
* A sample of one of my environment modules
Heh. Your nodes are drained.
do:
scontrol update state=resume nodename=n[011-013]
If they go back into a drained state, you need to look into why. That
will be in the slurmctld log. You can also see it with 'sinfo -R'
Brian Andrus
On 1/27/2021 10:18 PM, Chandler wrote:
Made a little bit
they can do a thing doesn't
mean they should do a thing.
There are many ways to achieve what is desired, most of which do not
require anyone other than the system admin.
If your issue can be solved without affecting others, leave them alone
and fix your issue.
Brian Andrus
Using v20.11.7
I have 8081 because that is the port I am running slurmrestd on.
How are you starting slurmrestd? If you are using systemd and have the
service file, look inside it.
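For example:
systemctl cat slurmrestd
That shows the unit file plus any drop-ins, including the ExecStart line (and thus whatever listen address it was started with).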
Brian Andrus
On 6/14/2021 9:48 AM, Heitor wrote:
On Mon, 14 Jun 2021 08:30:51 -0700
Brian Andrus wrote
You don't use the prefix.
This works for me on the node running slurmrestd on port 8081:
user=someuser
curl --header "X-SLURM-USER-NAME: ${user}" --header "X-SLURM-USER-TOKEN:
$(sudo scontrol token username=${user} | cut -d'=' -f2-)"
http://localhost:8081/slurm/v0.0.36
No problem.
You may want to set your variables in your /etc/sysconfig/slurmrestd file.
That is where you can set that variable along with others
(SLURMRESTD_LISTEN, SLURMRESTD_DEBUG, SLURMRESTD_OPTIONS) and your
service file will pick them up.
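For example (values are placeholders):
# /etc/sysconfig/slurmrestd
SLURMRESTD_LISTEN=0.0.0.0:8081
SLURMRESTD_DEBUG=3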
Brian Andrus
On 6/14/2021 12:05 PM, Heitor
Ah.
You should put files first. Otherwise, if it finds an entry in SSS, that
takes precedence and the local groups/users will not be seen.
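That is, in /etc/nsswitch.conf:
passwd: files sss
group:  files sss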
Brian Andrus
On 5/10/2021 1:09 PM, Russell Jones wrote:
Thanks!
No, we are not. The compute nodes are also properly configured in
/etc/nsswitch.conf
As a solution, I recommend you leverage the ".forward" file
You can put anything you want in there and that is where it will go if
the user doesn't specify an email.
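For example, as the user (the address is a placeholder):
echo 'real.address@example.com' > ~/.forward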
Brian Andrus
On 5/10/2021 1:38 PM, Luke Yeager wrote:
Contributions are usually handled through Bugzilla. He
er logged out."
exit
Simplified, but works well. We can do additional tasks once they start
the vncserver (eg stage data) and once they log out (clean up files).
Brian Andrus
On 5/15/2021 5:02 AM, Jeremy Fix wrote:
Hello !
I'm facing a weird issue. With one user, call it gpupro_u
purged, but
the account records stay and build up over time.
Brian Andrus
limit what is allowed to be requested in the partition
definition and/or a QOS (if you are using accounting).
Brian Andrus
On 5/7/2021 8:11 PM, Cristóbal Navarro wrote:
Hi community,
I am unable to tell if SLURM is handling the following situation
efficiently in terms of CPU affinities at each
tml> for more information.
That could explain it.
Brian Andrus
On 5/10/2021 7:57 AM, Russell Jones wrote:
Hello,
We have a few users we are needing to add to the local "video" group
of a specific set of compute nodes. When submitting a job, slurm
appears to not be populating
for the operating system and so things
don't choke if it comes up a bit lower because some driver took more
memory when it loaded.
Brian Andrus
On 5/19/2021 9:15 PM, Herc Silverstein wrote:
Hi,
We have a cluster (in Google gcp) which has a few partitions set up to
auto-scale, but one partition is set
of slurm), but was
wondering if I had misunderstood the slurm docs and there was a
simpler way.
Best,
Mark
On Mon, 24 May 2021, Brian Andrus wrote:
Not sure I can understand how it can only be detected from inside the
job environment for a failed node.
That description is more of "
be executed. We have HPC_Setup.sh in there where we
create ssh keys, setup their .forward file and other setup tasks.
Brian Andrus
On 5/25/2021 5:09 AM, Loris Bennett wrote:
Hi everyone,
Thanks for all the replies.
I think my main problem is that I expect logging in to a node with a job
Umm.. Your keys are password protected. If they were not, you would be
getting what you expect:
Enter passphrase for key '/home/loris/.ssh/id_rsa':
Brian Andrus
On 5/21/2021 5:53 AM, Loris Bennett wrote:
Hi,
We have set up pam_slurm_adopt using the official Slurm documentation
and Ole's
Oh, you could also use the ssh-agent to manage the keys, then use
'ssh-add ~/.ssh/id_rsa' to type the passphrase once for your whole
session (from that system).
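For example:
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa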
Brian Andrus
On 5/21/2021 5:53 AM, Loris Bennett wrote:
Hi,
We have set up pam_slurm_adopt using the official Slurm documentation
Yep, job 28 is already running.
If you want it to be on hold to start, use 'sbatch -h test.sh' and it
will start out in a hold state.
Brian Andrus
On 5/22/2021 11:36 PM, Chris Samuel wrote:
On Saturday, 22 May 2021 11:05:54 PM PDT Zainul Abiddin wrote:
i am trying to hold the job from
Array jobs are individual jobs that have been grouped. Underneath, they
each have their own jobid besides the grouped array jobid.
Not sure there is an easy way to pull what you are looking to do.
Brian Andrus
On 6/3/2021 8:12 AM, Shaohao Chen wrote:
Hi,
We use Slurm on our cluster and set
Sounds like a firewall issue.
When you log on to the 'down' node, can you run 'sinfo' or 'squeue' there?
Also, verify munge is configured/running properly on the node.
Brian Andrus
On 6/4/2021 9:31 AM, Herc Silverstein wrote:
Hi,
The slurmctld.log shows (for this node):
...
[2021-05-25T00
Oh, also ensure the dns is working properly on the node. It could be
that it isn't able to map the name to ip of the master.
Brian Andrus
On 6/4/2021 9:31 AM, Herc Silverstein wrote:
Hi,
The slurmctld.log shows (for this node):
...
[2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
Not sure I can understand how it can only be detected from inside the
job environment for a failed node.
That description is more of "our application is behaving badly, but not
so badly that the node quits responding." For that situation, your app or job
should have something that it is doing to
and then request that feature when submitting your job.
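For example (the feature name is a placeholder):
# slurm.conf
NodeName=node[01-04] Features=bigmem
# job script or command line:
sbatch --constraint=bigmem job.sh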
Brian Andrus
On 6/1/2021 4:15 AM, Diego Zuccato wrote:
Hello all.
I just found that if a user tries to specify a nodelist (say
including 2 nodes) and --nodes=1, the job gets rejected with
sbatch: error: invalid number of nodes (-N 2-1
Ok.
You may want to check your slurmdbd host(s) and ensure the users are
known there. If it does not know who a user is, it will not allow access
to the data.
If you are running sssd, clear the cache and such too.
Brian Andrus
On 7/1/2021 1:12 AM, taleinterve...@sjtu.edu.cn wrote:
I can
a feature constraint, but that seems to only apply to those that
want the feature. Since we have so many other users, it isn't feasible
to have them modify their scripts, so having it avoid by default would work.
Any ideas how to do that? Submit LUA perhaps?
Brian Andrus
cally set in the lua script. I
do have it in place already to ensure time and account are set, but that
is about it.
Brian Andrus
On 7/1/2021 9:39 AM, Lyn Gerner wrote:
Hey, Brian,
Neither I nor you are going to like what I'm about to say (but I think
it's where you're headed). :)
We have an
will be
such that each part can be run independently of the others. This allows
the resources for that part to be released when that part is complete.
Bottom line:
Resources are not released when they are not being used, they are
released when the job is done.
Brian Andrus
On 6/26/2021 11:59 PM
try:
export SLURM_OVERLAP=1
export SLURM_WHOLE=1
before your salloc and see if that helps. I have seen some mpi issues
that were resolved with that.
You can also try it using just the regular mpirun on the nodes
allocated. That will help with a datapoint as well.
Brian Andrus
On 2/4/2021
Did you compile slurm with mpi support?
Your mpi libraries should be the same as that version and they should be
available in the same locations for all nodes.
Also, ensure they are accessible (PATH, LD_LIBRARY_PATH, etc are set)
Brian Andrus
On 2/4/2021 1:20 PM, Andrej Prsa wrote:
Gentle
, then cancel to be resumed on another node).
Brian Andrus
On 3/24/2021 7:31 AM, Gestió Servidors wrote:
Hi,
I have got this new question for you:
In my cluster there is a running job. Then, I change a partition state
from “up” to “down”. Then, that job continues “running” because it was
already
Run 'sinfo -R' to see if any of your nodes are out of the mix.
If so, resume them and see if things work.
Brian Andrus
On 4/1/2021 1:53 AM, Steve Brasier wrote:
Hi all, anyone have suggestions for debugging cloud nodes not
resuming? I've had this working before but I'm now using "confi
For this one, you want to look closely at the job. Is it targeting a
specific partition/nodelist?
See what resources it is looking for (scontrol show job <jobid>)
Also look at the partition limits as well as any QOS items (if you are
using them).
Brian Andrus
On 4/1/2021 10:00 AM, Sajesh Singh
How are you taking them offline? I would expect a SuspendProgram script
that is running the command that shuts them down. Also, one of your
SlurmctldParameters should be "idle_on_node_suspend"
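For example, in slurm.conf (the script path is a placeholder):
SuspendProgram=/usr/local/sbin/node_suspend.sh
SlurmctldParameters=idle_on_node_suspend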
Brian Andrus
On 4/1/2021 12:25 PM, Sajesh Singh wrote:
Brian,
Targeting the correct
jobs that request resources you do not want used in a
particular queue.
Both would take some research to find the best approach, but I think
those are the two options available that may do what you are looking for.
Brian Andrus
On 3/31/2021 8:21 AM, Cristóbal Navarro wrote:
Hi Community,
I
Do 'sinfo -R' and see if you have any down or drained nodes.
Brian Andrus
On 3/24/2021 6:31 PM, Sajesh Singh wrote:
Slurm 20.02
CentOS 8
I just recently noticed a strange behavior when using the powersave
plugin for bursting to AWS. I have a queue configured with 60 nodes,
but if I submit
.
Brian Andrus
On 3/10/2021 5:05 AM, Marcus Boden wrote:
Yeah, I wondered something like that too, as it makes some of my
scripts quite fragile. I just tried your name on a test system and now
calling squeue paints my cli yellow :D
You could write a job_submit plugin to catch 'malicious' input
=/dev/shm
Then have them use $SCRATCH after something like SCRATCH=$FAST_SCRATCH
Just set SCRATCH to the one you want to use.
Brian Andrus
On 3/21/2021 11:32 PM, Loris Bennett wrote:
Brian Andrus writes:
The method I use for jobs is to make /scratch a symlink to wherever it may be
best
The method I use for jobs is to make /scratch a symlink to wherever it
may be best suited. Then all users just use /scratch
eg: /scratch -> /dev/shm for a ramdisk or /scratch -> /mnt/ssd for local
ssd, etc
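The one-time setup on each node is just, e.g.:
ln -sfn /dev/shm /scratch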
Brian Andrus
On 3/19/2021 6:25 AM, Paul Edmon wrote:
I was about to ask this a
That is looking like your /run folder does not have world execute
permissions, making it impossible for anything to access sub-directories.
Brian Andrus
On 3/17/2021 1:05 PM, Sven Duscha wrote:
Hi,
On 17.03.21 19:54, Brian Andrus wrote:
Be that as it may, you can see it is a permissions
ctld user to one that can write there or
change the permissions on the directory to allow the slurmctld user
write access.
Brian Andrus
On 3/17/2021 11:16 AM, Sven Duscha wrote:
Hi,
I experience with SLURM slurmctld an error on Ubuntu20.04, when starting
the service (through systemctl):
No issue.
In fact that is the default/normal.
The 'slurm' user gets created with a shell when you install the rpms.
Brian Andrus
On 3/9/2021 6:24 AM, Sajesh Singh wrote:
I am looking to enable the cloud scheduling feature of Slurm and was
wondering if there are any issues with changing
man sacct shows us:
-e, --helpformat
Print a list of fields that can be specified with the --format option.
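For example:
sacct -e
sacct --format=JobID,JobName,State,Elapsed,MaxRSS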
Brian Andrus
On 3/3/2021 5:42 PM, xiaojingh...@163.com wrote:
Hello, guys,
I am doing a parsing job on the output of the sacct command and I know that the
—format option can
Looks like the job ran. You should look at the output logs.
My guess:
The node the job ran on does not have access to that path.
Log on to that node and check it out.
Brian Andrus
On 3/3/2021 1:21 AM, Adrian Sevcenco wrote:
Hi! I just encountered the situation that i cannot submit jobs from
--export=ALL,MYVAR=othervalue
do 'man srun' and look at the --export option
Brian Andrus
On 3/3/2021 9:28 PM, Chin,David wrote:
ahmet.mer...@uhem.itu.edu.tr wrote:
> Prolog and TaskProlog are different parameters and scripts. You should
> use the TaskProlog script to set env. variables
runaway:
sacctmgr show RunawayJobs
From: slurm-users on behalf of Brian Andrus
Sent: Monday, March 1, 2021 11:14 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] fix missing accounting entries
All
All,
IIRC, there was a command that would repair the accounting tables when a
job had no endtime.
I can't seem to find the info for that. Does anyone recall such a thing?
Brian Andrus
caused by that.
Brian Andrus
On 4/19/2021 4:41 AM, Bruno Gomes Pessanha wrote:
That is showing that I'm in different groups depending on how I run
the command id.
PS: I'm running the controller and workers in docker containers using
privileged mode.
Bruno
On Mon, 19 Apr 2021 at 13:24
Your prolog script is run by/as the same user as slurmd, so any
environment variables you set there will not be available to the job
being run.
See: https://slurm.schedmd.com/prolog_epilog.html for info.
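The TaskProlog is the one that can touch the job's environment; a minimal sketch (MY_SCRATCH is a made-up variable):
#!/bin/bash
# TaskProlog: anything printed as "export NAME=value" gets injected
# into the task's environment by slurmd
echo "export MY_SCRATCH=/tmp/job_${SLURM_JOB_ID}"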
Brian Andrus
On 2/12/2021 1:27 PM, mercan wrote:
Hi;
Prolog and TaskProlog
IIRC, Preemption is determined by partition first, not node.
Since your pending job is in the 'day' partition, it will not preempt
something in the 'night' partition (even if the node is in both).
Brian Andrus
On 8/19/2021 2:49 PM, Russell Jones wrote:
Hi all,
I could use some help
I suspect you may have set some "frontendname" or "frontendaddr" in your
slurm.conf that triggered that.
A FrontEnd is a node that is used to execute batch scripts rather than
compute nodes (Cray ALPS systems). If that is not you, you should not
set it.
Brian Andrus
a cluster. This may be a good option for you.
Brian Andrus
On 9/13/2021 7:14 AM, Ozeryan, Vladimir wrote:
max_script_size=#
Specify the maximum size of a batch script, in bytes. The default
value is 4 megabytes. Larger values may adversely impact system
performance.
I have us
Modify it and raise the priority to something very, very high.
scontrol update job=JOBID priority=999
Brian Andrus
On 9/16/2021 8:39 AM, 顏文 wrote:
Dear users
Thanks for the immediate replies. I currently have one important job
running. How do I prevent the running job from being preempted
Yep. I do it all the time when I forget to add a parent. Also when a
project/account changes who owns it.
sacctmgr will also tell you what it is going to change and gives you 30
seconds to say yes, else it doesn't make the change.
Brian Andrus
On 9/8/2021 3:41 AM, byron wrote:
Hi
I've
That 'not responding' is the issue and usually means 1 of 2 things:
1) slurmd is not running on the node
2) something on the network is stopping the communication between the
node and the master (firewall, selinux, congestion, bad nic, routes, etc)
Brian Andrus
On 7/30/2021 3:51 PM, Soichi
they are also updated, but that is
expected and not to be worried about. It will go away once you also
update your compute nodes.
Brian Andrus
On 8/2/2021 12:34 PM, Adrian Sevcenco wrote:
Hi! can a 19.05 cluster be directly upgraded to 20.11?
Thank you!
Adrian
You may also want to look at node weights. By setting them at different
levels for each node, you can give a preference to one over the other.
That may be a way to do a "try this node first" method of job placement.
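For example, in slurm.conf (a sketch; lower weight is allocated first, and the node names are placeholders):
NodeName=node[01-04] Weight=10    # preferred nodes
NodeName=node[05-08] Weight=100   # used when the others are busy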
Brian Andrus
On 8/10/2021 9:19 AM, Jack Chen wrote:
Thanks for
Certainly, set:
* SuspendExcNodes: List of nodes to never place in power saving
mode. Use Slurm's hostlist expression format. By default, no nodes
are excluded.
Then do 'scontrol reconfigure'
Repeat when you want them to be included
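For example, in slurm.conf (node names are placeholders):
SuspendExcNodes=node[01-02]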
Brian Andrus
On 8/10/2021 5:46 AM, Josef Dvoracek
You may want to look at your resources. If the memory allocation adds up
such that there isn't enough left for any job to run, it won't matter
that there are still GPUs available.
Similar for any other resource (CPUs, cores, etc)
Brian Andrus
On 8/10/2021 8:07 AM, Jack Chen wrote:
Does
Something is very odd when you have the node reporting:
RealMemory=1 AllocMem=0 FreeMem=47563 Sockets=2 Boards=1
What do you get when you run 'slurmd -C' on the node?
Brian Andrus
From: Adam Xu
Sent: Tuesday, October 12, 2021 6:07 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] job
.
Also helps with OOM killer situations.
Brian Andrus
On 10/1/2021 1:22 AM, Diego Zuccato wrote:
Hello all.
I just upgraded to Debian 11 that brings Slurm 21.08 and the newer
nodes upgraded w/o too many issues (just minor config changes, one
being RealMemory value in slurm.conf, since
Those would be considered separate for each job.
You may want to have your prolog check to see if there is an epilogue
running and wait for the epilogue to be done before starting its prolog
work.
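One way to do that wait is a node-local lock taken by both scripts; a sketch (the lock path is made up):
#!/bin/bash
# Used at the top of both prolog and epilog: whichever holds the lock
# forces the other to wait until it finishes
exec 9>/var/run/slurm_pro_epi.lock
flock 9
# ... actual prolog/epilog work here ...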
Brian Andrus
On 9/27/2021 9:15 AM, Joe Teumer wrote:
Should the Prologslurmctld script only
Which version of Mariadb are you using?
Brian Andrus
On 12/3/2021 4:20 PM, Giuseppe G. A. Celano wrote:
After installation of libmariadb-dev, I have reinstalled the entire
slurm with ./configure + options, make, and make install. Still,
accounting_storage_mysql.so is missing.
On Sat, Dec
:41.022] fatal: You are running with a database but
for some reason we have no TRES from it. This should only happen if
the database is down and you don't have any state files.
On Thu, Dec 2, 2021 at 10:36 PM Brian Andrus wrote:
Your slurm needs to be built with the support. If you have mysql