[slurm-users] Re: slurmdbd not connecting to mysql (mariadb)

2024-05-30 Thread Brian Andrus via slurm-users
That SIGTERM message means something is telling slurmdbd to quit. Check your cron jobs, maintenance scripts, etc. Slurmdbd is being told to shutdown. If you are running in the foreground, a ^C does that. If you run a kill or killall on it, you will get that same message. Brian Andrus On 5

[slurm-users] Re: slurmdbd archive format

2024-05-28 Thread Brian Andrus via slurm-users
Oh, to address the passed train: Restore the archive data with "sacctmgr archive load", then you can do as you need. From man sacctmgr: archive {dump|load}: Write database information to a flat file or load information that has previously been written to a file. Brian Andr

[slurm-users] Re: slurmdbd archive format

2024-05-28 Thread Brian Andrus via slurm-users
Instead of using the archive files, couldn't you query the db directly for the info you need? I would recommend sacct/sreport if those can get the info you need. Brian Andrus On 5/28/2024 9:59 AM, O'Neal, Doug (NIH/NCI) [C] via slurm-users wrote: My organization needs to access historic job

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-23 Thread Brian Andrus via slurm-users
I would guess either you install GPU drivers on the non-GPU nodes or build slurm without GPU support for that to work due to package dependencies. Both viable options. I have done installs where we just don't compile GPU support in and that is left to the users to manage. Brian Andrus On 5

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-22 Thread Brian Andrus via slurm-users
as versions are compatible, they can work together. You will need to be aware of differences for jobs and configs, but it is possible. Brian Andrus On 5/22/2024 12:45 AM, Arnuld via slurm-users wrote: We have several nodes, most of which have different Linux distributions (distro for short

[slurm-users] Re: Submitting from an untrusted node

2024-05-14 Thread Brian Andrus via slurm-users
Rike, Assuming the data, scripts and other dependencies are already on the cluster, you could just ssh and execute the sbatch command in a single shot: ssh submitnode sbatch some_script.sh It will ask for a password if appropriate and could use ssh keys to bypass that need. Brian Andrus

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
or added an override file, that will affect things. Brian Andrus On 4/19/2024 10:15 AM, Jeffrey Layton wrote: I like it, however, it was working before without a slurm.conf in /etc/slurm. Plus the environment variable SLURM_CONF is pointing to the correct slurm.conf file (the one in /cm

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
/slurm/ on the node(s). Brian Andrus On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote: Good afternoon, I'm working on a cluster of NVIDIA DGX A100's that is using BCM 10 (Base Command Manager which is based on Bright Cluster Manager). I ran into an error and only just learned that Slurm

[slurm-users] Re: Slurm.conf and workers

2024-04-15 Thread Brian Andrus via slurm-users
will want to sync the config across all nodes and then run 'scontrol reconfigure'. You may want to look into configless if you can set DNS entries and your config is basically monolithic or all parts are in /etc/slurm/. Brian Andrus On 4/15/2024 2:55 AM, Xaver Stiensmeier via slurm-users wrote: Dear

[slurm-users] Re: Upgrading nodes

2024-04-10 Thread Brian Andrus via slurm-users
Yes. You can build the 8 rpms on 9. Look at 'mock' to do so. I did something similar when I still had to support EL7. Fairly generic plan; the devil is in the details and verifying each step, but those are the basic bases you need to touch. Brian Andrus On 4/10/2024 1:48 PM, Steve Berg via slurm

[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-08 Thread Brian Andrus via slurm-users
path to look at. Brian Andrus On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote: Dear slurm user list, we make use of elastic cloud computing i.e. node instances are created on demand and are destroyed when they are not used for a certain amount of time. Created instances are set up

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
by the name it has in the slurm.conf file. Also, a quick way to do the failover check is to run (from the backup controller): scontrol takeover Brian Andrus On 3/25/2024 1:39 PM, Miriam Olmi wrote: Hi Brian, Thanks for replying. In my first message I forgot to specify that the primary

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Quick correction, it is StateSaveLocation not SlurmSaveState. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am having trouble finalizing the configuration of the backup controller for my slurm cluster. In principle, if no job is running everything seems

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Miriam, You need to ensure the SlurmSaveState directory is the same for both. And by 'the same', I mean all contents are exactly the same. This is usually achieved by using a shared drive or replication. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am
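A minimal sketch of a first sanity check for the advice above, assuming the parameter in question is StateSaveLocation in slurm.conf and that a copy of each controller's config file is at hand. The function names and file paths are hypothetical. It only confirms that both configs name the same directory; whether the contents really are identical still depends on the shared drive or replication behind that path.

```shell
# Print the value of StateSaveLocation from a slurm.conf-style file.
# Keyword matching is case-insensitive; commented-out lines are ignored.
state_save_location() {
    awk -F= 'tolower($1) ~ /^[ \t]*statesavelocation[ \t]*$/ {
        sub(/[ \t]*#.*/, "", $2)          # drop a trailing comment
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim surrounding whitespace
        print $2
        exit
    }' "$1"
}

# Succeed only if both config files name the same, non-empty directory.
same_state_dir() {
    primary_dir=$(state_save_location "$1")
    backup_dir=$(state_save_location "$2")
    [ -n "$primary_dir" ] && [ "$primary_dir" = "$backup_dir" ]
}

# Hypothetical usage, with one config copied from each controller:
#   same_state_dir primary/slurm.conf backup/slurm.conf && echo "paths match"
```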

[slurm-users] Re: We're Live! Check out the new SchedMD.com now!

2024-03-13 Thread Brian Andrus via slurm-users
Wow, snazzy! Looks very good. My compliments. Brian Andrus On 3/12/2024 11:24 AM, Victoria Hobson via slurm-users wrote: Our website has gone through some much needed change and we'd love for you to explore it! The new SchedMD.com is equipped with the latest information about Slurm, your

[slurm-users] Re: Slurm billback and sreport

2024-03-04 Thread Brian Andrus via slurm-users
Chip, I use 'sacct' rather than sreport and get individual job data. That is ingested into a db and PowerBI, which can then aggregate as needed. sreport is pretty general and likely not the best for accurate chargeback data. Brian Andrus On 3/4/2024 6:09 AM, Chip Seraphine via slurm-users

[slurm-users] Re: Is SWAP memory mandatory for SLURM

2024-03-04 Thread Brian Andrus via slurm-users
Joseph, You will likely get many perspectives on this. I disable swap completely on our compute nodes. I can be draconian that way. For the workflow supported, this works and is a good thing. Other workflows may benefit from swap. Brian Andrus On 3/3/2024 11:04 PM, John Joseph via slurm

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
oxy> Brian Andrus On 2/28/2024 12:54 PM, Dan Healy wrote: Are most of us using HAProxy or something else? On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users wrote: Magnus, That is a feature of the load balancer. Most of them have that these days. Brian

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
Magnus, That is a feature of the load balancer. Most of them have that these days. Brian Andrus On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote: On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote: for us, we put a load balancer in front

[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-27 Thread Brian Andrus via slurm-users
disconnection for any reason even for X-based apps. Personally, I don't care much for interactive sessions in HPC, but there is a large body that only knows how to do things that way, so it is there. Brian Andrus On 2/26/2024 12:27 AM, Josef Dvoracek via slurm-users wrote: What is the recommended way

[slurm-users] Re: [INTERNET] Re: question on sbatch --prefer

2024-02-10 Thread Brian Andrus via slurm-users
I imagine you could create a reservation for the node and then when you are completely done, remove the reservation. Each helper could then target the reservation for the job. Brian Andrus On 2/9/2024 5:52 PM, Alan Stange via slurm-users wrote: Chip, Thank you for your prompt response.  We

Re: [slurm-users] sinfo: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host

2024-01-26 Thread Brian Andrus
elsewhere. Brian Andrus On 1/26/2024 6:38 AM, Michael Lewis wrote: Hi All, I’m trying to get slurm-23.11.3 running on Ubuntu 20.04 and running on a stand alone system.  I’m running into an issue I can not find the answer to.  After compiling and installing when I fire up slurmctld

Re: [slurm-users] Suspend/Resume request limit

2024-01-17 Thread Brian Andrus
While I am not sure of your specifics, you could easily add lines to your suspend/resume scripts to check/wait/etc if there are tasks waiting. Brian Andrus On 1/15/2024 12:22 AM, 김종록 wrote: Hello. I'm going to use Slurm's cloud feature in private cloud. The problem is that the scale out

Re: [slurm-users] install new slurm, no slurmctld found

2023-12-16 Thread Brian Andrus
a submit/login node. Brian Andrus On 12/15/2023 2:00 AM, Felix wrote: Hello we are installing a new server with slurm on ALMA Linux 9.2 we did the following: dnf install slurm The result is rpm -qa | grep slurm slurm-libs-22.05.9-1.el9.x86_64 slurm-22.05.9-1.el9.x86_64 Now when trying

Re: [slurm-users] SlurmdSpoolDir full

2023-12-09 Thread Brian Andrus
filled on the node. You can run 'df -h' and see some info that would get you started. Brian Andrus On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote: Dear slurm-user list, during a larger cluster run (the same I mentioned earlier 242 nodes), I got the error "SlurmdSpoolDir full". The Slur

Re: [slurm-users] slurm power save question

2023-11-29 Thread Brian Andrus
y of being able to have "keep at least X nodes up and idle" would be nice, that is not how I see this documented or working. Brian Andrus On 11/23/2023 5:12 AM, Davide DelVento wrote: Thanks for confirming, Brian. That was my understanding as well. Do you have it working that way o

Re: [slurm-users] slurm power save question

2023-11-22 Thread Brian Andrus
As I understand it, that setting means "Always have at least X nodes up", which includes running jobs. So it stops any wait time for the first X jobs being submitted, but any jobs after that will need to wait for the power_up sequence. Brian Andrus On 11/22/2023 6:58 AM, David

Re: [slurm-users] partition qos without managing users

2023-11-22 Thread Brian Andrus
Eg, Could you be more specific as to what you want? Is there a specific user you want to control, or no user should get more than x cpus in the partition? Or no single job should get more than x cpus? The details matter to determine the right approach and right settings. Brian Andrus On 11

Re: [slurm-users] partition qos without managing users

2023-11-20 Thread Brian Andrus
slurm users belong to and add them to slurmdbd. Once they are in there, you can set defaults with exceptions for specific users. If you are only looking to have settings apply to all users, you don't have to import the users. Set the QoS for the partition. Brian Andrus On 11/20/2023 1:45 PM, ego

Re: [slurm-users] slurm job_container/tmpfs

2023-11-20 Thread Brian Andrus
How do you 'manually create a directory'? That would be when the ownership of root would be occurring. After creating it, you can chown/chmod it as well. Brian Andrus On 11/18/2023 7:35 AM, Arsene Marian Alain wrote: Dear slurm community, I run slurm 21.08.1 under Rocky Linux 8.5 on my

Re: [slurm-users] Slurm Rest API error

2023-06-28 Thread Brian Andrus
Vlad, Actually, it looks like it is working. You are using v0.39 for the parser, which is trying to use OpenAPI calls. Unless you compiled with OpenAPI, that won't work. Try using the 0.37 version and you may see a simpler result that is successful. Brian Andrus On 6/28/2023 11:05 AM

Re: [slurm-users] Backfill Scheduling

2023-06-26 Thread Brian Andrus
in before the additional node for Job B is expected to be available, so it runs on the idle node. Brian Andrus On 6/26/2023 3:48 PM, Reed Dier wrote: Hoping this will be an easy one for the community. The priority schema was recently reworked for our cluster, with only PriorityWeightQOS

Re: [slurm-users] federation vs multi-cluster

2023-06-26 Thread Brian Andrus
do what (a node-locked license, for example). Then you can send a job to a specific subset of nodes. Quite a few other ways to design the ability you describe, but separate clusters is not one of them. Brian Andrus On 6/26/2023 6:11 AM, mohammed shambakey wrote: Hi Just out of interest, I

Re: [slurm-users] monitoring and accounting

2023-06-12 Thread Brian Andrus
Second that. Prometheus+slurm exporter+grafana works great. Brian Andrus On 6/12/2023 8:20 AM, Josef Dvoracek wrote: > But I'd be interested to see what other places do. we installed this: https://github.com/vpenso/prometheus-slurm-exporter and scrape this exporter with "inputs.pr

Re: [slurm-users] Can't get --reboot to work at all with slurm-23.02?

2023-06-07 Thread Brian Andrus
Make sure you have configured the RebootProgram in slurm.conf, that it exists on the nodes and is executable by the user. This is usually /sbin/reboot Brian Andrus On 6/7/2023 7:50 AM, Heinz, Michael wrote: Hey, all. So I added slurmdbd to our slurm-23.02 install and made my account

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Brian Andrus
That output of slurmd -C is your answer. Slurmd only sees 6GB of memory and you are claiming it has 10GB. I would run some memtests, look at meminfo on the node, etc. Maybe even check that the type/size of memory in there is what you think it is. Brian Andrus On 5/25/2023 7:30 AM, Roger
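The comparison described above can be sketched as a small helper that pulls RealMemory out of 'slurmd -C' style text and compares it with the configured value. This is an illustration only: the function names are hypothetical, and the 10240 figure in the usage comment simply stands in for "10GB" expressed in MB.

```shell
# Extract the RealMemory value (in MB) from `slurmd -C` style text on stdin.
real_memory_mb() {
    tr ' ' '\n' | sed -n 's/^RealMemory=\([0-9][0-9]*\)$/\1/p' | head -n 1
}

# Compare detected memory with the configured value and print a verdict.
# Returns non-zero when the node reports less memory than the config claims.
check_memory() {
    detected=$1
    configured=$2
    if [ "$detected" -lt "$configured" ]; then
        echo "MISMATCH: slurmd sees ${detected} MB, config claims ${configured} MB"
        return 1
    fi
    echo "OK: slurmd sees ${detected} MB, config claims ${configured} MB"
}

# Hypothetical usage on a compute node:
#   check_memory "$(slurmd -C | real_memory_mb)" 10240
```

If the check reports a mismatch, either the hardware needs a closer look (as the reply suggests) or RealMemory in the node definition has to be lowered to what the node actually reports.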

Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Brian Andrus
are running on. Brian Andrus On 5/17/2023 10:49 AM, Groner, Rob wrote: I'm not sure what you mean by "if they have the permissions". I'm talking about someone who is specifically designated as "coordinator" of an account in slurm.  With that designation, and no other

Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Brian Andrus
running jobs, that would take a bit more effort to set up, but is an alternative. Brian Andrus On 5/17/2023 6:40 AM, Groner, Rob wrote: I was asked to see if coordinators could do anything in this scenario: * Within the account that they coordinated, User A submitted 1000s of jobs and left

Re: [slurm-users] Prevent CLOUD node from being shutdown after startup

2023-05-12 Thread Brian Andrus
unts in a comma separated list (e.g "nid[10-20]:4,nid[80-90]:2"). By default no nodes are excluded. This value may be updated with scontrol. See ReconfigFlags=KeepPowerSaveSettings for setting persistence. Brian Andrus On 5/12/2023 2:35 AM, Xaver Stiensmeier wrote: D

Re: [slurm-users] monitoring and accounting

2023-05-05 Thread Brian Andrus
Something I have been impressed with is Netdata. It is in the standard repositories and will auto-detect quite a few things on a node. It is great for real-time monitoring of a node/job. I also use Prometheus and Grafana for historic data (anything over 5 minutes). Brian Andrus On 5/5

Re: [slurm-users] Several slurmdbds against one mysql server?

2023-04-30 Thread Brian Andrus
: Hello, Brian Andrus writes: Ole is spot on with his federated suggestion. That is exactly what fits the bill for you, given your requirements. You can have everything you want, but you don't get to have it how you want (separate databases). When/If you looked deeper into it, you will find

Re: [slurm-users] Several slurmdbds against one mysql server?

2023-04-30 Thread Brian Andrus
erent part of the world and trying to federate them in a performant manner was prohibitively expensive. Brian Andrus On 4/29/2023 10:53 PM, Angel de Vicente wrote: Hi Ole, Ole Holm Nielsen writes: Maybe you want to use Slurm federated clusters with a single database thanks for

Re: [slurm-users] Slurmdbd High Availability

2023-04-13 Thread Brian Andrus
at the HA database. One would be primary and the other a failover (AccountingStorageBackupHost). Although, technically, they would both be able to be active at the same time. Brian Andrus On 4/13/2023 2:49 AM, Shaghuf Rahman wrote: Hi, I am setting up Slurmdb in my system and I need some inputs

Re: [slurm-users] Odd prolog Error?

2023-04-11 Thread Brian Andrus
the user exists on the node, however you are propagating the uids. Brian Andrus On 4/11/2023 9:48 AM, Jason Simms wrote: Hello all, Regularly I'm seeing array jobs fail, and the only log info from the compute node is this: [2023-04-11T11:41:12.336] error: /opt/slurm/prolog.sh: exited

Re: [slurm-users] Slurmd enabled crash with CgroupV2

2023-03-10 Thread Brian Andrus
: [Unit] After=autofs.service getty.target sssd.service That makes it wait for all of those before trying to start. Brian Andrus On 3/10/2023 7:41 AM, Tristan LEFEBVRE wrote: Hello to all, I'm trying to do an installation of Slurm with cgroupv2 activated. But I'm facing an odd thing : when

Re: [slurm-users] Power saving and node weight

2023-03-01 Thread Brian Andrus
to do as well. I would be interested in any alternatives. Could you point me to some doc? Best wishes Gizo Brian Andrus On 2/28/2023 7:44 AM, Gizo Nanava wrote: Hello, it seems that if slurm power saving is enabled then the parameter "Weight" seems to be ignored

Re: [slurm-users] Chaining srun commands

2023-02-28 Thread Brian Andrus
to get (resource-wise) and how do you want to use them? Brian Andrus On 2/28/2023 9:49 AM, Jake Jellinek wrote: Hi all I come from a SGE/UGE background and am used to the convention that I can qrsh to a node and, from there, start a new qrsh to a different node with different parameters. I've

Re: [slurm-users] Power saving and node weight

2023-02-28 Thread Brian Andrus
You may be able to use the alternate approach that I was able to do as well. Brian Andrus On 2/28/2023 7:44 AM, Gizo Nanava wrote: Hello, it seems that if slurm power saving is enabled then the parameter "Weight" seems to be ignored for nodes that are in a power down state.

Re: [slurm-users] GPUs not available after making use of all threads?

2023-02-14 Thread Brian Andrus
ow any of the processes work, which is why some of us do so many experimental runs of jobs and gather timings. We have yet to see a 100% efficient process, but folks are improving things all the time. Brian Andrus On 2/13/2023 9:56 PM, Diego Zuccato wrote: I think that's incorrect: > Th

Re: [slurm-users] GPUs not available after making use of all threads?

2023-02-13 Thread Brian Andrus
HPC jobs. The goal is that every process is utilizing the CPU as close to 100% as possible, which would render hyper-threading moot. Brian Andrus On 2/13/2023 12:15 AM, Hermann Schwärzler wrote: Hi Sebastian, I am glad I could help (although not exactly as expected :-). With your node

Re: [slurm-users] slurm and singularity

2023-02-08 Thread Brian Andrus
commands are xterm, a shell script containing srun commands, and srun (see the EXAMPLES section). *If no command is specified, then salloc runs the user's default shell.* Brian Andrus On 2/8/2023 7:01 AM, Jeffrey T Frey wrote: You may need srun to allocate a pty for the command

Re: [slurm-users] slurm and singularity

2023-02-07 Thread Brian Andrus
Then cluster_run.sh would call sbatch along with the appropriate commands. Brian Andrus On 2/7/2023 9:31 AM, Groner, Rob wrote: I'm trying to setup the capability where a user can execute: $: sbatch script_to_run.sh and the end result is that a job is created on a node, and that job will execute

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-17 Thread Brian Andrus
with the new (known good) config. Brian Andrus On 1/17/2023 12:36 PM, Groner, Rob wrote: So, you have two equal sized clusters, one for test and one for production?  Our test cluster is a small handful of machines compared to our production. We have a test slurm control node on a test cluster

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-04 Thread Brian Andrus
. Brian Andrus On 1/4/2023 9:22 AM, Groner, Rob wrote: We currently have a test cluster and a production cluster, all on the same network.  We try things on the test cluster, and then we gather those changes and make a change to the production cluster.  We're doing that through two different repos

Re: [slurm-users] slurmrestd service broken by 22.05.07 update

2022-12-29 Thread Brian Andrus
.conf"*/ You can change those as needed. This made it listen on port 8081 only (no socket and not 6820) I was then able to just use curl on port 8081 to test things. Hope that helps. Brian Andrus On 12/29/2022 6:49 AM, Chris Stackpole wrote: Greetings, Thanks for responding! On 12/2

Re: [slurm-users] slurmrestd service broken by 22.05.07 update

2022-12-28 Thread Brian Andrus
I suspect if you delete /var/lib/slurmrestd.socket and then start slurmrestd, it will create it as the user you need it to be. Or just change the owner of it to the slurmrestd owner. I have been running slurmrestd as a separate user for some time. Brian Andrus On 12/28/2022 3:20 PM, Chris

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Brian Andrus
Seems like the time may have been off on the db server at the insert/update. You may want to dump the database, find what table/records need updated and try updating them. If anything went south, you could restore from the dump. Brian Andrus On 12/20/2022 11:51 AM, Reed Dier wrote: Just

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Brian Andrus
Try:     sacctmgr list runawayjobs Brian Andrus On 12/20/2022 7:54 AM, Reed Dier wrote: Hoping this is a fairly simple one. This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may

Re: [slurm-users] I can't seem to use all the CPUs in my Cluster?

2022-12-13 Thread Brian Andrus
as the many articles, wikis and videos out there. TLDR; If you are going to be running efficient HPC jobs, you are indeed better off with HT turned off. Brian Andrus On 12/13/2022 8:03 AM, Gary Mansell wrote: Hi, thanks for getting back to me. I have been doing some more experimenting, and I

Re: [slurm-users] I can't seem to use all the CPUs in my Cluster?

2022-12-13 Thread Brian Andrus
to it. Also check the state of the nodes with 'sinfo'. It would also be good to ensure the node settings are right. Run 'slurmd -C' on a node and see if the output matches what is in the config. Brian Andrus On 12/13/2022 1:38 AM, Gary Mansell wrote: Dear Slurm Users, perhaps you can help me

Re: [slurm-users] Job allocation from a heterogenous pool of nodes

2022-12-07 Thread Brian Andrus
You may want to look here: https://slurm.schedmd.com/heterogeneous_jobs.html Brian Andrus On 12/7/2022 12:42 AM, Le, Viet Duc wrote: Dear slurm community, I am encountering a unique situation where I need to allocate jobs to nodes with different numbers of CPU cores. For instance

Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread Brian Andrus
I successfully built it for Rocky straight from the tgz file as usual with rpmbuild -ta Brian Andrus On 12/2/2022 9:21 AM, David Thompson wrote: Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8 Slurm cluster. We would like to be able to use the sbatch --prefer option

Re: [slurm-users] Licenses: Remote vs Reservation

2022-11-30 Thread Brian Andrus
to submit at all? The reservation method can cause an sbatch command to be rejected, if that is what you are looking for. Brian Andrus On 11/30/2022 6:29 AM, Richard Ems wrote: Hi all, I have to change our set up to be able to update the total number of available licenses due to users checking

Re: [slurm-users] How to launch slurm services after installation

2022-11-27 Thread Brian Andrus
Steve, I suspect you did not install the packages. You need to install slurm-slurmctld to get the slurmctld systemd files: # rpm -qlp slurm-slurmctld-20.11.9-1.el7.x86_64.rpm /run/slurm/slurmctld.pid /usr/lib/systemd/system/slurmctld.service /usr/sbin/slurmctld

Re: [slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system

2022-11-23 Thread Brian Andrus
and reset/recreate it. That addresses even a miffed software change. Brian Andrus On 11/23/2022 5:11 AM, Xaver Stiensmeier wrote: Hello slurm-users, The question can be found in a similar fashion here: https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up

Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Brian Andrus
data. There are many ways to do that, but those designs fall under MariaDB and not Slurm. Brian Andrus On 11/1/2022 6:49 PM, Richard Chang wrote: Does it mean it is best to use a single slurmdbd host in my case? My primary slurmctld is the backup slurmdbd host, and my worry is if the primary

Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Brian Andrus
Ole, Fair enough, it is actually slurmctld that does the caching. Technical typo on my part there. Just trying to let the user know, there is a window that they have to ensure no information is lost during a database outage. Brian Andrus On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote: Hi

Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-10-31 Thread Brian Andrus
It caches up to a point. As I understand it, that is about an hour (depending on size and how busy the cluster is, as well as available memory, etc). Brian Andrus On 10/31/2022 9:20 PM, Richard Chang wrote: Hi, Just for my info, I would like to know what happens when SlurmDBD loses

Re: [slurm-users] Ideal NFS exported StateSaveLocation size.

2022-10-24 Thread Brian Andrus
, but if you aren't having excessive traffic to the share, you should be good. I have yet to discover what would be excessive enough to impact things. The only use I have had for the HA is being able to keep the cluster running/happy during maintenance. Brian Andrus On 10/24/2022 1:14 AM, Ole

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Brian Andrus
is the preferred method? Rob *From:* slurm-users on behalf of Brian Andrus *Sent:* Friday, September 23, 2022 10:24 AM *To:* slurm-users@lists.schedmd.com *Subject:* Re: [slurm-users] slurmd and dynamic nodes

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Brian Andrus
of the mix. Brian Andrus On 9/23/2022 7:09 AM, Groner, Rob wrote: I'm working through how to use the new dynamic node features in order to take down a particular node, reconfigure it (using nvidia MIG to change the number of graphic cores available) and give it back to slurm. I'm at the point

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Brian Andrus
Paul, You are likely spot on with the inactiveLimit change. It may also be an environment variable of TMOUT (under bash) set. Brian Andrus On 9/19/2022 5:46 AM, Paul Raines wrote: I have had two nights where right at 3:35am a bunch of jobs were killed early with TIMEOUT way before

Re: [slurm-users] remote license

2022-09-16 Thread Brian Andrus
on a large cluster sometime in your career, I would not recommend using it there. Brian Andrus On 9/16/2022 3:06 PM, Davide DelVento wrote: Hi Brian, From your response, I speculate that my wording sounded harsh or disrespectful. That was not my intention and therefore I sincerely apologize

Re: [slurm-users] remote license

2022-09-16 Thread Brian Andrus
free to do that. It is not something that scales well, but it looks like you have a rather beginner cluster that would never be impacted by such choices. Brian Andrus On 9/16/2022 10:00 AM, Davide DelVento wrote: Thanks Brian. I am still perplexed. What is a database to install, administer

Re: [slurm-users] remote license

2022-09-16 Thread Brian Andrus
) Update the database (sacctmgr command) As you can see, that 1st step would be highly dependent on you and your environment. The 2nd step would be dependent on what things you are tracking within that. Brian Andrus On 9/16/2022 5:01 AM, Davide DelVento wrote: So if I understand correctly

Re: [slurm-users] Can I set dynamic weighting for nodes?

2022-09-15 Thread Brian Andrus
You can dynamically modify the weight of nodes with:     scontrol update nodename= weight= So, in theory, you could do that periodically to adjust the weights you may want. Brian Andrus On 9/15/2022 4:27 PM, Russell Smithies wrote: Can I set dynamic or calculated  “weights” for nodes
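A minimal sketch of the periodic adjustment idea, assuming some site-specific process produces "node weight" pairs (that input is hypothetical, as is the function name). The helper only prints the scontrol commands so they can be reviewed before anything is changed.

```shell
# Read "node weight" pairs from stdin and print the matching scontrol commands.
# Printing instead of executing lets the commands be checked first.
weight_commands() {
    while read -r node weight; do
        # Skip blank or incomplete lines.
        [ -n "$node" ] && [ -n "$weight" ] || continue
        echo "scontrol update NodeName=$node Weight=$weight"
    done
}

# Hypothetical usage from a cron job, once the output has been reviewed:
#   compute_weights | weight_commands | sh
# where compute_weights is whatever site-specific logic decides the weights.
```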

Re: [slurm-users] remote license

2022-09-15 Thread Brian Andrus
ense count that is in the database updated to match the number free from flexlm to stop license starvation due to users outside slurm using them up so they really aren't available to slurm. Brian Andrus On 9/15/2022 3:34 PM, Davide DelVento wrote: I am a bit confused by remote licenses. https://lists.s

Re: [slurm-users] How to debug a prolog script?

2022-09-15 Thread Brian Andrus
configured? Brian Andrus On 9/15/2022 2:49 PM, Davide DelVento wrote: I have a super simple prolog script, as follows (very similar to the example one) #!/bin/bash if [[ $VAR == 1 ]]; then echo "True" fi exit 0 This fails (and obviously causes great disruption to my production j

Re: [slurm-users] can a job run across partition in slurm

2022-09-12 Thread Brian Andrus
I had completely forgotten about HETJOB supporting multiple partitions. Thanks for reminding me. Brian Andrus On 9/12/2022 6:06 AM, Marcus Wagner wrote: yes, that is possible by submitting a hetjob. Best Marcus Am 08.09.2022 um 20:38 schrieb Purvesh Parmar: We require more nodes to run

Re: [slurm-users] can a job run across partition in slurm

2022-09-08 Thread Brian Andrus
No, however a node can reside in multiple partitions. So if you add those nodes to the partition they are running in, they will be available to them. Brian Andrus On 9/8/2022 11:38 AM, Purvesh Parmar wrote: We require more nodes to run a single job which requires more nodes than present

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-09-07 Thread Brian Andrus
Possibly way off base, but did you happen to do any of the editing in Windows? Maybe running into the cr/lf issue for how windows saves text files? Brian Andrus On 9/7/2022 5:21 AM, Davide DelVento wrote: Thanks Ole, your wiki page sheds some light on this mystery. Very frustrating that even
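One way to test for the cr/lf problem raised above is to look for carriage-return bytes in the script. The sketch below is illustrative; the function names and the prolog paths in the usage comment are hypothetical.

```shell
# Succeed if the file contains at least one carriage-return (CR) byte,
# which is what Windows-style line endings leave behind.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Write a copy of the file with every CR byte removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Hypothetical usage:
#   has_crlf /etc/slurm/prolog.sh && strip_cr /etc/slurm/prolog.sh /tmp/prolog.fixed
```

A script saved with CR/LF endings typically fails at the interpreter line, because the kernel looks for an interpreter whose name ends in a carriage return.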

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-09-01 Thread Brian Andrus
Try setting logging to debug mode, then you can get some info from the logs. Brian Andrus On 9/1/2022 8:15 PM, Davide DelVento wrote: Thanks. I did try a lua script as soon as I got your first email, but that never worked (yes, I enabled it in slurm.conf and ran "scontrol reconfigure"

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-09-01 Thread Brian Andrus
) Usually it would be found at /usr/lib64/slurm/job_submit_lua.so If that is there, you should be good with trying out a job_submit lua script. Brian Andrus On 9/1/2022 1:24 PM, Davide DelVento wrote: Thanks again, Brian, indeed that grep returns many hits, but none of them includes lua, i.e

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-09-01 Thread Brian Andrus
I would be surprised if it were compiled without the support. However, you could check and run something like: strings /sbin/slurmctld | grep job_submit (or where ever your slurmctld binary is). There should be quite a few lines with that in it. Brian Andrus On 9/1/2022 10:54 AM, Davide

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-08-30 Thread Brian Andrus
Not sure if you can do all the things you intend, but the job_submit script is precisely where you want to check submission options. https://slurm.schedmd.com/job_submit_plugins.html Brian Andrus On 8/30/2022 12:58 PM, Davide DelVento wrote: Hi, I would like to soft-enforce license

Re: [slurm-users] Node status (without repeats)

2022-08-08 Thread Brian Andrus
It looks to me like you have the same node in multiple partitions. If the output you are getting is basically what you want just pipe it to 'sort -u' or 'uniq' Brian Andrus On 8/8/2022 10:14 AM, Borchert, Christopher B ERDC-RDE-ITL-MS CIV wrote: Hello. How can I simply show the status
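The de-duplication suggested above can be sketched as a small filter. The wrapper name is hypothetical, and the format string in the usage comment (`%n` for hostname, `%T` for state) is only one assumed choice of fields.

```shell
#!/usr/bin/env bash
# unique_node_states: read "node state" lines on stdin and print each
# distinct line once, so a node listed under several partitions
# appears a single time.
unique_node_states() {
    sort -u
}

# Usage (format string is an assumption about the fields wanted):
#   sinfo -h -o "%n %T" | unique_node_states
```

`sort -u` is used rather than a bare `uniq` because `uniq` only collapses adjacent duplicates, and the repeated lines are not guaranteed to be next to each other.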

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-04 Thread Brian Andrus
This is actually brilliant! Brian Andrus On 8/3/2022 10:20 PM, Gerhard Strangar wrote: Phil Chiu wrote: - Individual slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node. But I'm not sure how to limit this so at most N jobs are running

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Brian Andrus
So an example of using slurm to reboot all nodes 3 at a time:     sinfo -h -o %n|xargs --max-procs=3 -I{} scontrol reboot {} If you want to get fancy, make a script that does the reboot and waits for the node to be back up before exiting and use that instead of the 'scontrol reboot' part. Brian
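The "fancy" variant could look roughly like the sketch below. It is not from the thread: the function names, the polling defaults, and the list of states treated as "back up" (`idle`, `mixed`, `allocated`) are all assumptions. A production script would also want to confirm that the node actually went down before accepting a healthy state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reboot one node through Slurm and wait for it to return.

# node_state NODE: print the node's state as reported by sinfo (extended form).
node_state() {
    sinfo -h -n "$1" -o %T | sort -u | head -n 1
}

# reboot_and_wait NODE [POLL_SECONDS] [MAX_POLLS]
# Returns 0 once the node reports a usable state, 1 on failure or timeout.
reboot_and_wait() {
    local node=$1 poll=${2:-30} max=${3:-120} i=0
    scontrol reboot "$node" || return 1
    while [ "$i" -lt "$max" ]; do
        sleep "$poll"
        # Assumed "healthy" states; adjust to what your site considers usable.
        case "$(node_state "$node")" in
            idle|mixed|allocated) return 0 ;;
        esac
        i=$((i + 1))
    done
    return 1
}

# To run it from xargs, save this as a script (name is hypothetical) whose
# last line is:  reboot_and_wait "$@"
#   sinfo -h -o %n | sort -u | xargs --max-procs=3 -I{} ./reboot_and_wait.sh {}
```

Because each invocation blocks until its node is back or the timeout expires, `--max-procs=3` then limits how many nodes are down at once rather than just how many `scontrol` commands run in parallel.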

Re: [slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.

2022-08-02 Thread Brian Andrus
the compute nodes do. Brian Andrus On 8/2/2022 6:45 AM, Paul Edmon wrote: No, the node running the slurmctld does not need access to any of the customer facing filesystems or home directories.  While all the login and client nodes do, the slurmctld does not. -Paul Edmon- On 8/2/2022 9:30 AM

Re: [slurm-users] unable to ssh onto compute nodes on which I have running jobs

2022-07-27 Thread Brian Andrus
Lloyd, You could check out the order of entries in your pam.d/ssh (and related/included) files. See where the slurm_pam_adopt is, how it is being called and if there are settings that are interfering. Does this occur only on a single node, or all of them? Brian Andrus On 7/27/2022 9:29

Re: [slurm-users] unable to ssh onto compute nodes on which I have running jobs

2022-07-27 Thread Brian Andrus
Verify that their uid on the node is the same as the uid your master sees. Brian Andrus On 7/27/2022 8:53 AM, byron wrote: Hi When a user tries to log in to a compute node on which they have a running job they get the error Access denied: user blahblah (uid=) has no active jobs

Re: [slurm-users] Problems building RPMs

2022-07-21 Thread Brian Andrus
Hmm. That would imply you could still use the tar file with something like: rpmbuild -v -ta --define "_lto_cflags %{nil}" slurm-22.05.2.tar.bz2 Note, I have not tried this (no immediate access to RHEL9 derivative), so YMMV. Brian Andrus On 7/21/2022 10:15 AM, Kilian Cavalotti

Re: [slurm-users] Is there split-brain danger when using backup slurmdbd?

2022-06-27 Thread Brian Andrus
that ensures both are getting accurate and current information. Brian Andrus On 6/27/2022 9:15 AM, taleinterve...@sjtu.edu.cn wrote: Hi, all: We noticed that slurmdbd provides the conf option *DbdBackupHost* for users to set a secondary slurmdbd node. Since slurmdbd is closely related to database

Re: [slurm-users] detailed worker state with sinfo

2022-06-26 Thread Brian Andrus
respectively. *NOTE*: The suffix "*" identifies nodes that are presently not responding. Brian Andrus On 6/26/2022 5:39 AM, z1...@arcor.de wrote: Hello, if I call "sinfo -o %all", the worker state includes only a single state word like "DRNG". It is clearer in

Re: [slurm-users] Persistent Interactive Jobs

2022-06-09 Thread Brian Andrus
/groups. Brian Andrus On 6/9/2022 5:19 PM, Willy Markuske wrote: Hello All, I have a request from users for the ability to have persistent interactive jobs. Currently some users are using srun to allocate an interactive job and run their scripts but sshd will close connections after 2 hours

Re: [slurm-users] what is the possible reason for secondary slurmctld node not allocate job after takeover?

2022-06-03 Thread Brian Andrus
Offhand, I would suggest double check munge and versions of slurmd/slurmctld. Brian Andrus On 6/3/2022 3:17 AM, taleinterve...@sjtu.edu.cn wrote: Hi, all: Our cluster set up 2 slurm control node and scontrol show config as below: > scontrol show config … SlurmctldHost[0] = slu

Re: [slurm-users] How to Make AvailableFeatures Persist after Slurmctld Restart

2022-06-02 Thread Brian Andrus
Add it to your slurm.conf Then it is always there after a restart. Brian Andrus On 6/2/2022 12:05 PM, Hanby, Mike wrote: Howdy, I can’t seem to find a solution in ‘man slurm.conf’ for this. How can I make the following persist a slurmctld restart: scontrol update NodeName="

Re: [slurm-users] container on slurm cluster

2022-05-18 Thread Brian Andrus
you give the permission to that they will not abuse it. Brian Andrus On 5/18/2022 12:22 AM, GHui wrote: Hi, Brian Andrus I think the main problem is that container can cheat Slurm. On 5/17/22 06:58:20, Brian Andrus wrote: You are starting to understand a major issue with most containers. I

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Brian Andrus
watch the logs to see when it is happy). Don't start slurmctld until that is done. Waiting makes things easier. Brian Andrus On 5/17/2022 9:29 AM, Paul Edmon wrote: I think it should be, but you should be able to run a test and find out. -Paul Edmon- On 5/17/22 12:13 PM, byron wrote: Sorry, I

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Brian Andrus
You need to step upgrade through major versions (not minor). So 19.05=>20.x I would highly recommend going to 21.08 while you are at it. I just did the same migration (although they started at 18.x) with no issues. Running jobs were not impacted and users didn't even notice. Brian And
