[slurm-users] Software and Config for Job submission host only
Hi, I am new to SLURM and I am still trying to understand how it fits together. There is ample documentation available that teaches you how to set it up quickly. Pardon me if this was asked before; I was not able to find anything pointing to this. I am trying to figure out if there is something like PBS-execution only for SLURM, such that I can install it on the login nodes and those nodes will only be responsible for job submission and not job execution. Is there any particular package to install, and is there a different config that needs to be put on the job-submission-only nodes? Basically, I want the job submission nodes to have all the commands and everything else that lets them get the reports and logs an admin or a user will need, just not execution of the jobs. Thanks in advance for your help. RC.
[slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.
Hi, I am new to SLURM, so please bear with me. I need to understand whether the server/node running the slurmctld daemon needs access to the parallel file system, and whether it needs all the software runtime libraries installed, as on the compute nodes. The users will log in to the login/submission nodes with their home directories mounted from, say, PFS1, change directory to the PFS2 mount point, and then submit/run their jobs. Does that mean the server/node running the slurmctld daemon will also need access to both the PFS1 and PFS2 mount points? I am not sure. The server running the slurmctld daemon will be used exclusively for that and is not a login node. Thanks & regards, Richard.
Re: [slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.
Hi Paul, Thank you for confirming this. Best regards, Richard. On 8/2/2022 7:15 PM, Paul Edmon wrote: No, the node running the slurmctld does not need access to any of the customer-facing filesystems or home directories. While all the login and client nodes do, the slurmctld does not. -Paul Edmon- On 8/2/2022 9:30 AM, Richard Chang wrote: Hi, I am new to SLURM, so please bear with me. [...]
[slurm-users] Ideal NFS exported StateSaveLocation size.
Hi, Is there a rule of thumb for the size of the NFS-exported directory that is to be used as StateSaveLocation? I have a two-node slurmctld setup and both nodes will mount an NFS-exported directory as the state save location. Let me know your thoughts. Thanks & regards, RC
[slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node
Hi, I have observed that when I specify a switch type in the slurm.conf file and that particular switch type is not present on the slurmctld node, slurmctld panics and shuts down. Is this expected? My slurmctld node doesn't have the switch type, but the compute nodes do. How can I set it up so that it can utilise the feature without breaking Slurm? Thanks & regards, RC.
Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node
Yes, the system is an HPE Cray EX, and I am trying to use switch/hpe_slingshot. RC On 10/28/2022 11:21 AM, Ole Holm Nielsen wrote: On 10/28/22 07:35, Richard Chang wrote: I have observed that when I specify a switch type in the slurm.conf file and that particular switch type is not present in the slurmctld node, slurmctld panics and shuts down. [...] What is your line in slurm.conf? The manual page seems to describe what you have observed: SwitchType Identifies the type of switch or interconnect used for application communications. Acceptable values include "switch/cray_aries" for Cray systems, "switch/none" for switches not requiring special processing for job launch or termination (Ethernet, and InfiniBand) and The default value is "switch/none". All Slurm daemons, commands and running jobs must be restarted for a change in SwitchType to take effect. If running jobs exist at the time slurmctld is restarted with a new value of SwitchType, records of all jobs in any state may be lost. Why do you want to use this configuration? Is your system a Cray? /Ole
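A quick way to answer Ole's question ("What is your line in slurm.conf?") on each node is to read the SwitchType value and see whether a matching plugin file is installed there. The sketch below is only an illustration: it assumes that the shutdown is related to a missing plugin on the slurmctld node (the thread does not confirm this), that plugin files follow the `switch_<name>.so` naming, and that the plugin directory is `/usr/lib64/slurm`, which is a guess and differs between installs. The sample config and all function names are hypothetical.

```python
from pathlib import Path


def switch_type(conf_text: str) -> str:
    """Return the SwitchType set in slurm.conf text, or the documented default."""
    value = "switch/none"  # default value quoted from the man page in this thread
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, val = line.partition("=")
        if sep and key.strip().lower() == "switchtype":
            value = val.strip()
    return value


def plugin_file(switch: str) -> str:
    """Map 'switch/hpe_slingshot' to the assumed file name 'switch_hpe_slingshot.so'."""
    return switch.replace("/", "_") + ".so"


def plugin_present(conf_text: str, plugin_dir: str) -> bool:
    """True if the plugin file for the configured SwitchType exists in plugin_dir."""
    return (Path(plugin_dir) / plugin_file(switch_type(conf_text))).is_file()


# Hypothetical slurm.conf fragment, for illustration only.
SAMPLE_CONF = """\
# hypothetical slurm.conf fragment
ClusterName=demo
SwitchType=switch/hpe_slingshot
"""

if __name__ == "__main__":
    wanted = switch_type(SAMPLE_CONF)
    print(wanted, plugin_file(wanted))
    # /usr/lib64/slurm is only a guess at PluginDir; use the value from your install.
    print(plugin_present(SAMPLE_CONF, "/usr/lib64/slurm"))
```

Running the same check on the slurmctld node and on a compute node would show whether the two installs differ, which is one possible explanation for the behaviour described above.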
[slurm-users] What happens if slurmdbd loses connection to mysql
Hi, I have two dedicated nodes for Slurm, node1 and node2. I have created the following layout.

Role      SlurmCTLD   SlurmDBD   MariaDB server for accounting storage
Primary   Node1       Node2      Node2
Backup    Node2       Node1      -

Shared NFS storage from an NFS server is used for StateSaveLocation. I want to know what happens if Node2 goes down. I have read in the documentation that if slurmdbd goes down, slurmctld can still hold the accounting info, and when slurmdbd is back up, it gets passed on and written to the backend database (not the exact words, but in that vein). I just want to know what happens if node2 goes down and the backup slurmdbd on node1 takes over. Will it fail immediately, or keep the data in its memory and write it to the DB when the DB is back up? I hope I have explained what I mean. Thanks & regards, Richard.
Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node
This is 21.08. Thank you, RC On 10/31/2022 11:05 AM, Chris Samuel wrote: On 27/10/22 11:30 pm, Richard Chang wrote: Yes, the system is an HPE Cray EX, and I am trying to use switch/hpe_slingshot. Which version of Slurm are you using, Richard? All the best, Chris
[slurm-users] SlurmDBD losing connection to the backend MariaDB
Hi, Just for my info, I would like to know what happens when SlurmDBD loses its connection to the backend database, for example MariaDB. Does it cache the accounting info and keep it until the DB comes back up, or does it panic and shut down? Thank you, RC.
Re: [slurm-users] [EXTERNAL] SlurmDBD losing connection to the backend MariaDB
Hello Greg, I have a two-node setup. node1 is the primary slurmctld + backup slurmdbd, and node2 is the primary slurmdbd + backup slurmctld and the MySQL database host. My concern is: if node2 goes down, the backup slurmdbd will take over, and then what will happen? I have read that slurmctld can cache data, but what about slurmdbd? I am not sure. I intentionally put slurmdbd + MariaDB on the second node because I didn't want to overload the primary slurmctld. I hope you all are getting the picture of how my setup is. Thanks, RC On 11/1/2022 10:40 AM, Greg Wickham wrote: Hi Richard, Slurmctld caches the updates until slurmdbd comes back online. You can see how many records are pending for the database by using the “sdiag” command and looking for “DBD Agent queue size”. If this number grows significantly it means that slurmdbd isn’t available. -Greg On 01/11/2022, 07:23, "slurm-users" wrote: Hi, Just for my info, I would like to know what happens when SlurmDBD loses connection to the backend Database, for ex, MariaDB. [...]
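Greg's check (run sdiag and look for "DBD Agent queue size") can be scripted. The following is a minimal sketch, assuming sdiag prints a line of the form `DBD Agent queue size: <number>`; the sample text, the function names and the growth threshold are made up for illustration and are not part of Slurm.

```python
import re
import subprocess


def dbd_agent_queue_size(sdiag_text: str):
    """Return the DBD Agent queue size found in sdiag output, or None if absent."""
    match = re.search(r"DBD Agent queue size:\s*(\d+)", sdiag_text)
    return int(match.group(1)) if match else None


def grew_by(samples, min_growth: int) -> bool:
    """True if the queue grew by at least min_growth between first and last sample."""
    return len(samples) >= 2 and samples[-1] - samples[0] >= min_growth


def read_sdiag() -> str:
    """Run sdiag on a host where the Slurm client commands are installed."""
    return subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    ).stdout


# Hypothetical excerpt of sdiag output, for illustration only.
SAMPLE = """\
Server thread count:  3
Agent queue size:     0
DBD Agent queue size: 42
"""

if __name__ == "__main__":
    print(dbd_agent_queue_size(SAMPLE))
    # min_growth=1000 is an arbitrary threshold; pick one that suits your cluster.
    print(grew_by([42, 900, 5000], min_growth=1000))
```

Sampling the value every few minutes and alerting on sustained growth follows Greg's advice that a significantly growing number means slurmdbd is not reachable.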
Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB
Does it mean it is best to use a single slurmdbd host in my case? My primary slurmctld is the backup slurmdbd host, and my worry is: if the primary slurmdbd host (which is also the MariaDB server) goes down, will the backup slurmdbd be able to cache data and wait till MariaDB catches up? Thanks, RC On 11/2/2022 2:00 AM, Brian Andrus wrote: Ole, Fair enough, it is actually slurmctld that does the caching. Technical typo on my part there. Just trying to let the user know, there is a window that they have to ensure no information is lost during a database outage. Brian Andrus On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote: Hi Brian, On 11/1/22 05:28, Brian Andrus wrote: It caches up to a point. As I understand it, that is about an hour (depending on size and how busy the cluster is, as well as available memory, etc). Have you found any documentation of slurmdbd caching? It's well-known that slurmctld caches information while slurmdbd is down, see for example page 30 in the talk "Field Notes Mark 2: Random Musings From Under A New Hat"[1] by Tim Wickberg, SchedMD: For slurmdbd, the critical element in the failure domain is MySQL, not slurmdbd. slurmdbd itself is stateless. ● slurmctld will cache accounting records (up to a limit) if slurmdbd is unavailable. This can be hours+ to days+ depending on your system without data loss. The statelessness of slurmdbd makes me think that it can't cache any data. Thanks, Ole [1] https://slurm.schedmd.com/publications.html On 10/31/2022 9:20 PM, Richard Chang wrote: Hi, Just for my info, I would like to know what happens when SlurmDBD loses connection to the backend Database, for ex, MariaDB. [...]
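The "up to a limit" in the quoted slide can be turned into a rough estimate of how long an outage slurmctld could ride out. The sketch below assumes the limit is the MaxDBDMsgs setting in slurm.conf and that its default is the larger of 10000 and MaxJobCount * 2 + node count * 4; that is my reading of the slurm.conf man page, not something stated in this thread, so check the man page for your version. The message rate is a made-up input that you would have to measure on your own cluster.

```python
def default_max_dbd_msgs(max_job_count: int, node_count: int) -> int:
    """Assumed default queue limit: max(10000, 2 * MaxJobCount + 4 * nodes)."""
    return max(10000, 2 * max_job_count + 4 * node_count)


def hours_until_full(limit: int, msgs_per_hour: float) -> float:
    """Hours an empty queue takes to reach the limit at a steady message rate."""
    if msgs_per_hour <= 0:
        raise ValueError("msgs_per_hour must be positive")
    return limit / msgs_per_hour


if __name__ == "__main__":
    # Hypothetical cluster: MaxJobCount of 10000 and 1000 nodes.
    limit = default_max_dbd_msgs(max_job_count=10000, node_count=1000)
    print(limit)
    # 6000 accounting messages per hour is an invented rate for illustration.
    print(hours_until_full(limit, msgs_per_hour=6000))
```

A single job can generate several accounting messages (job and step records), so the real window depends heavily on workload, which matches the "hours+ to days+ depending on your system" wording on the slide.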
Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB
Hello Brian, Thank you for the reply and for sharing your design. Can you please share the details of your MariaDB server HA setup? (This can be offline, in a DM to me.) I would like to understand it so that I can replicate it here. Thanks & regards, Richard. On 11/2/2022 8:09 AM, Brian Andrus wrote: RC, In that scenario, the backup slurmdbd would take over, but then its database would not necessarily be in sync with the 'main' database (hence the warnings/info about it in the documentation). For my setup, I have 2 slurmdbd hosts, but they both connect to the same, separate, MariaDB server, which is HA. Now, I can take down the primary slurmdbd system and the other will take over, so I can bring them up/down as needed for updates, etc. If your two slurmdbd servers use different databases, you would need a way to keep them in sync, regardless of which slurmdbd was processing data. There are many ways to do that, but those designs fall under MariaDB and not Slurm. Brian Andrus On 11/1/2022 6:49 PM, Richard Chang wrote: Does it mean it is best to use a single slurmdbd host in my case? [...]
[slurm-users] SLURM configuration for LDAP users
Hi, I am a little new to this, so please pardon my ignorance. I have configured Slurm on my cluster and it works fine with local users, but I am not able to get it working with LDAP/SSSD authentication. User logins using ssh are working fine. An LDAP user can log in to the login, slurmctld and compute nodes, but when they try to submit jobs, slurmctld logs an error about an invalid account or partition for the user. Someone said we need to add the user manually into the database using the sacctmgr command, but I am not sure we need to do this for each and every LDAP user. Yes, it does work if we add the LDAP user manually using sacctmgr, but I am not convinced this manual way is the right way to do it. The documentation is not very clear about using LDAP accounts. I saw somewhere on the list a suggestion to use UsePAM=1 and to copy or create a softlink for the Slurm PAM module under /etc/pam.d, but it didn't work for me. I saw somewhere else that we need to specify LaunchParameters=enable_nss_slurm in the slurm.conf file and put the slurm keyword in the passwd/group entries in the /etc/nsswitch.conf file. I did these, but they didn't help either. I am bereft of ideas at present. If anyone has real-world experience and can advise, I will be grateful. Thank you, Richard -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: SLURM configuration for LDAP users
Job submission works for local users. I was not aware we need to manually add the LDAP users to the SlurmDB. Does it mean we need to add each and every user in LDAP to the Slurm database? On 2/4/2024 9:04 PM, Renfro, Michael wrote: “An LDAP user can login to the login, slurmctld and compute nodes, but when they try to submit jobs, slurmctld logs an error about invalid account or partition for user.” Since I don’t think it was mentioned below, does a non-LDAP user get the same error, or does it work by default? We don’t use LDAP explicitly, but we’ve used sssd with Slurm and Active Directory for 6.5 years without issue. We’ve always added users to sacctmgr so that we could track usage by research group or class, so we never used a default account for all users. From: Richard Chang via slurm-users Date: Saturday, February 3, 2024 at 11:41 PM To: slurm-us...@schedmd.com Subject: [slurm-users] SLURM configuration for LDAP users Hi, I am a little new to this, so please pardon my ignorance. [...]
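If users do have to be added with sacctmgr, as Michael describes, the per-user step can at least be generated from a user list rather than typed by hand. The sketch below is one possible approach, not a recommended procedure from the thread: it reads passwd-format text (for example saved `getent passwd` output), keeps accounts at or above a UID threshold, and builds `sacctmgr -i add user name=<user> account=<account>` command lines without running them. The sample users, the account name `research` and the UID cutoff are hypothetical, and the account is assumed to exist already.

```python
import shlex


def ldap_users(passwd_text: str, min_uid: int = 1000):
    """Return user names from passwd-format text whose UID is at least min_uid."""
    users = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) != 7:  # skip blank or malformed lines
            continue
        name, uid = fields[0], fields[2]
        if uid.isdigit() and int(uid) >= min_uid:
            users.append(name)
    return users


def sacctmgr_commands(users, account: str):
    """Build one 'sacctmgr -i add user' argument list per user (nothing is executed)."""
    return [
        ["sacctmgr", "-i", "add", "user", f"name={user}", f"account={account}"]
        for user in users
    ]


# Hypothetical passwd-format sample, for illustration only.
SAMPLE_PASSWD = """\
root:x:0:0:root:/root:/bin/bash
alice:x:20001:20001:Alice:/home/alice:/bin/bash
bob:x:20002:20001:Bob:/home/bob:/bin/bash
"""

if __name__ == "__main__":
    # 'research' is a made-up account name; 10000 is an arbitrary UID cutoff.
    for cmd in sacctmgr_commands(ldap_users(SAMPLE_PASSWD, min_uid=10000), "research"):
        print(shlex.join(cmd))
```

One caveat: with SSSD, enumeration is off by default, so a bare `getent passwd` may not list directory users; in that case the user list would have to come from an LDAP query or a group lookup instead.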