[slurm-users] Software and Config for Job submission host only

2022-05-12 Thread Richard Chang

Hi,

I am new to SLURM and I am still trying to understand stuff. There is 
ample documentation available that teaches you how to set it up quickly.


Pardon me if this was asked before; I was not able to find anything 
pointing to this.


I am trying to figure out if there is something like PBS-execution only 
for SLURM, such that I can install it on the login nodes and those nodes 
will only be responsible for job submission and not job execution.


Is there a particular package to install, and is there a different 
config that needs to be put on the submission-only nodes?


Basically, I want the job submission nodes to have all the commands and 
everything else that will enable them to get reports, logs, and whatever an 
admin and a user will need. Just not execution of the jobs.


Thanks in advance for your help.

RC.





[slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.

2022-08-02 Thread Richard Chang

Hi,

I am new to SLURM, so please bear with me.

I need to understand whether the server/node running the slurmctld 
daemon will need access to the parallel file system, and whether it will need 
all the software runtime libraries installed, as on the compute nodes.


The users will log in to the login/submission nodes with their home 
mounted from, say, PFS1, change directory to the PFS2 mount point, and 
then submit/run their jobs.


Does this mean the server/node running the slurmctld daemon will also need 
access to both the PFS1 and PFS2 mount points? I am not sure.


The server running the slurmctld daemon will be exclusively for that and 
is not a login node.


Thanks & regards,

Richard.




Re: [slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.

2022-08-02 Thread Richard Chang

Hi Paul,

Thank you for confirming this.

Best regards,

Richard.

On 8/2/2022 7:15 PM, Paul Edmon wrote:
No, the node running the slurmctld does not need access to any of the 
customer facing filesystems or home directories.  While all the login 
and client nodes do, the slurmctld does not.


-Paul Edmon-





[slurm-users] Ideal NFS exported StateSaveLocation size.

2022-10-23 Thread Richard Chang

Hi,

Is there a rule of thumb for the size of the directory that is NFS 
exported and used as the StateSaveLocation?


I have a two-node slurmctld setup, and both nodes will mount an NFS-exported 
directory as the state save location.


Let me know your thoughts.

Thanks & regards,

RC
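
On the sizing question above: no figure is given in this thread. One hedged way to arrive at a number is to measure what an existing StateSaveLocation holds on a running controller and add headroom on top. The sketch below only does the measuring. The function name directory_size_bytes is made up for illustration, and the path you pass in is whatever StateSaveLocation is set to in your own slurm.conf.

```python
import os


def directory_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                # Skip symlinks so a linked file is not counted twice.
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # State files are rewritten often; ignore ones that vanish mid-walk.
                continue
    return total
```

For example, directory_size_bytes("/var/spool/slurmctld") would report current usage if that happens to be your StateSaveLocation; that path is an assumption, not something stated in the thread.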






[slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmcrld node

2022-10-27 Thread Richard Chang

Hi,

I have observed that when I specify a switch type in the slurm.conf file 
and that particular switch type is not present on the slurmctld node, 
slurmctld panics and shuts down. Is this expected? My slurmctld node doesn't 
have the switch type, but the compute nodes do. How can I 
set it up so that Slurm can utilise the feature without breaking slurmctld?


Thanks & regards,

RC.




Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmcrld node

2022-10-27 Thread Richard Chang
Yes, the system is an HPE Cray EX, and I am trying to use 
switch/hpe_slingshot.


RC


On 10/28/2022 11:21 AM, Ole Holm Nielsen wrote:

What is your line in slurm.conf?  The manual page seems to describe 
what you have observed:


SwitchType
  Identifies the type of switch or interconnect used for application
  communications.  Acceptable values include "switch/cray_aries" for
  Cray systems, "switch/none" for switches not requiring special
  processing for job launch or termination (Ethernet, and InfiniBand)
  and The default value is "switch/none".  All Slurm daemons, commands
  and running jobs must be restarted for a change in SwitchType to take
  effect.  If running jobs exist at the time slurmctld is restarted
  with a new value of SwitchType, records of all jobs in any state may
  be lost.

Why do you want to use this configuration?  Is your system a Cray?

/Ole
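
One thing that can be checked on the slurmctld node before restarting is whether a plugin file for the configured SwitchType is installed there at all. The sketch below is a rough illustration, not an official tool. It assumes the usual Slurm plugin file naming, where "switch/hpe_slingshot" corresponds to a file called switch_hpe_slingshot.so inside the plugin directory, and it leaves the plugin directory to the caller because that location depends on how Slurm was built. All function names are hypothetical.

```python
import os


def read_switch_type(conf_text):
    """Return the SwitchType value from slurm.conf text.

    Falls back to "switch/none", the default named in the man page excerpt.
    """
    value = "switch/none"
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, val = line.split("=", 1)
        if key.strip().lower() == "switchtype":
            value = val.strip()
    return value


def plugin_filename(plugin_type):
    """Assumed naming convention: 'switch/none' -> 'switch_none.so'."""
    return plugin_type.strip().replace("/", "_") + ".so"


def plugin_installed(plugin_type, plugin_dir):
    """True if the assumed plugin file exists in plugin_dir."""
    return os.path.isfile(os.path.join(plugin_dir, plugin_filename(plugin_type)))
```

Running plugin_installed(read_switch_type(open("/etc/slurm/slurm.conf").read()), "/usr/lib64/slurm") on each node would show which hosts are missing the file; both paths are assumptions and will differ between installations.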





[slurm-users] What happens if slurmdbd loses connection to mysql

2022-10-30 Thread Richard Chang

Hi,

I have two dedicated nodes for slurm, node1 and node2.

I have created the following.

Role      SlurmCTLD   SlurmDBD   MariaDB server for accounting storage
Primary   Node1       Node2      Node2
Backup    Node2       Node1      -

Shared NFS Storage from an NFS Server, for StateSaveLocation.

I want to know what happens if Node2 goes down. I have read in the documentation 
that if slurmdbd goes down, slurmctld can still hold back the accounting 
info and, when slurmdbd is back up, pass it on to be 
written to the backend database (not the exact words, but in that vein).


I just want to know what happens if node2 goes down and the backup slurmdbd on 
node1 takes over. Will it fail instantaneously, or keep logging the data 
in its memory and write it back to the DB when the DB is back up?


I hope I have explained what I mean.

Thanks & regards,

Richard.


Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmcrld node

2022-10-31 Thread Richard Chang

This is 21.08

Thank you,

RC

On 10/31/2022 11:05 AM, Chris Samuel wrote:

On 27/10/22 11:30 pm, Richard Chang wrote:

Yes, the system is a HPE Cray EX, and I am trying to use 
switch/hpe_slingshot.


Which version of Slurm are you using Richard?

All the best,
Chris




[slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-10-31 Thread Richard Chang

Hi,

Just for my info, I would like to know what happens when SlurmDBD loses 
its connection to the backend database, for example MariaDB.


Does it cache the accounting info and keep it till the DB comes back 
up, or does it panic and shut down?


Thank you,

RC.




Re: [slurm-users] [EXTERNAL] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Richard Chang

Hello Greg,

I have a two-node setup. node1 is the primary slurmctld + backup slurmdbd, 
and node2 is the primary slurmdbd + backup slurmctld and the mysql database host.


My concern is that if node2 goes down, the backup slurmdbd will take 
over. What will happen then?


I have read that slurmctld can cache data, but what about slurmdbd? I am not 
sure.


I have intentionally put slurmdbd + mariadb on the second node 
because I didn't want to overload the primary slurmctld.


I hope this gives you all a picture of my setup.

Thanks,

RC


On 11/1/2022 10:40 AM, Greg Wickham wrote:


Hi Richard,

Slurmctld caches the updates until slurmdbd comes back online.

You can see how many records are pending for the database by using the 
“sdiag” command and looking for “DBD Agent queue size”.


If this number grows significantly it means that slurmdbd isn’t available.

-Greg
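
The check Greg describes can be scripted. The following is a minimal sketch that pulls the "DBD Agent queue size" value out of sdiag output, assuming the line is printed as "DBD Agent queue size: N". The function names are invented for illustration, and the second function only works on a host where the sdiag command is installed and can reach slurmctld.

```python
import re
import subprocess

# Matches the label quoted above, followed by an integer.
_QUEUE_RE = re.compile(r"DBD Agent queue size:\s*(\d+)")


def dbd_agent_queue_size(sdiag_output):
    """Return the DBD Agent queue size from sdiag text, or None if absent."""
    match = _QUEUE_RE.search(sdiag_output)
    return int(match.group(1)) if match else None


def current_dbd_agent_queue_size():
    """Run sdiag and parse its output (needs the Slurm client commands)."""
    result = subprocess.run(["sdiag"], capture_output=True, text=True, check=True)
    return dbd_agent_queue_size(result.stdout)
```

Polling current_dbd_agent_queue_size() from a monitoring job and alerting when the value keeps growing would be one way to notice that slurmdbd is unreachable; what counts as "significant" growth is site specific.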



Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Richard Chang

Does it mean it is best to use a single slurmdbd host in my case?

My primary slurmctld is the backup slurmdbd host, and my worry is: if the 
primary slurmdbd host (which is also the mariadb server) goes down, 
will the backup slurmdbd be able to cache data and wait till the mariadb 
catches up?


Thanks,

RC

On 11/2/2022 2:00 AM, Brian Andrus wrote:

Ole,

Fair enough, it is actually slurmctld that does the caching. Technical 
typo on my part there.


Just trying to let the user know, there is a window that they have to 
ensure no information is lost during a database outage.


Brian Andrus

On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote:

Hi Brian,

On 11/1/22 05:28, Brian Andrus wrote:
It caches up to a point. As I understand it, that is about an hour 
(depending on size and how busy the cluster is, as well as available 
memory, etc).


Have you found any documentation of slurmdbd caching?  It's 
well-known that slurmctld caches information while slurmdbd is down, 
see for example page 30 in the talk "Field Notes Mark 2: Random 
Musings From Under A New Hat"[1] by Tim Wickberg, SchedMD:



● For slurmdbd, the critical element in the failure domain is
  MySQL, not slurmdbd. slurmdbd itself is stateless.
● slurmctld will cache accounting records (up to a limit) if
  slurmdbd is unavailable. This can be hours+ to days+
  depending on your system without data loss.


The statelessness of slurmdbd makes me think that it can't cache any 
data.


Thanks,
Ole

[1] https://slurm.schedmd.com/publications.html






Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-11-02 Thread Richard Chang

Hello Brian,

Thank you for the reply and for sharing your design. Can you please share 
your MariaDB server HA details? (This can be offline, as a DM to me.)


I would like to understand it so that I can replicate it here.

Thanks & regards,

Richard.

On 11/2/2022 8:09 AM, Brian Andrus wrote:

RC,

In that scenario, the backup slurmdbd would take over, but then its 
database would not necessarily be in sync with the 'main' database 
(hence the warnings/info about it in the documentation).


For my setup, I have 2 slurmdbd hosts, but they both connect to the 
same, separate, MariaDB server, which is HA. Now, I can take down the 
primary slurmdbd system and the other will take over, so I can bring 
them up/down as needed for updates, etc.


If your two slurmdbd servers use different databases, you would need a 
way to keep them in sync, regardless of which slurmdbd was processing 
data. There are many ways to do that, but those designs fall under 
MariaDB and not Slurm.


Brian Andrus





[slurm-users] SLURM configuration for LDAP users

2024-02-03 Thread Richard Chang via slurm-users

Hi,

I am a little new to this, so please pardon my ignorance.

I have configured slurm in my cluster and it works fine with local 
users. But I am not able to get it working with LDAP/SSSD authentication.


User logins using ssh are working fine. An LDAP user can login to the 
login, slurmctld and compute nodes, but when they try to submit jobs, 
slurmctld logs an error about invalid account or partition for user.


Someone said we need to add the user manually to the database using 
the sacctmgr command. It does work if we add the LDAP user manually 
using sacctmgr, but I am not sure we need to do this for each and 
every LDAP user, and I am not convinced this manual way is the way to do it.


The documentation is not very clear about using LDAP accounts.

I saw a suggestion somewhere on the list to set UsePAM=1 and copy or create a 
softlink for the slurm PAM module under /etc/pam.d. But it didn't work for me.


I saw somewhere else that we need to specify 
LaunchParameters=enable_nss_slurm in the slurm.conf file and put the slurm 
keyword in the passwd/group entries in the /etc/nsswitch.conf file. I did these, 
but they didn't help either.


I am bereft of ideas at present. If anyone has real world experience and 
can advise, I will be grateful.


Thank you,

Richard

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: SLURM configuration for LDAP users

2024-02-05 Thread Richard Chang via slurm-users
Job submission works for local users. I was not aware we need to 
manually add the LDAP users to the SlurmDB. Does it mean we need to add 
each and every user in LDAP to the Slurm database?


On 2/4/2024 9:04 PM, Renfro, Michael wrote:


“An LDAP user can login to the login, slurmctld and compute nodes, but 
when they try to submit jobs, slurmctld logs an error about invalid 
account or partition for user.”


Since I don’t think it was mentioned below, does a non-LDAP user get 
the same error, or does it work by default?


We don’t use LDAP explicitly, but we’ve used sssd with Slurm and 
Active Directory for 6.5 years without issue. We’ve always added users 
to sacctmgr so that we could track usage by research group or class, 
so we never used a default account for all users.
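
If every LDAP user does need a sacctmgr entry, adding them does not have to be done by hand one at a time. The sketch below is only an illustration of that idea: it builds one "sacctmgr -i add user" command per username for a given account and returns the commands as strings so they can be reviewed before anything is run. The function name is hypothetical, the list of usernames is assumed to come from your own directory tooling, and the account is assumed to exist already in the Slurm database.

```python
import shlex


def sacctmgr_add_user_commands(usernames, account):
    """Build one 'sacctmgr -i add user' command line per username.

    -i (immediate) skips the interactive confirmation prompt.
    """
    commands = []
    for name in usernames:
        args = ["sacctmgr", "-i", "add", "user", f"name={name}", f"account={account}"]
        # Quote each argument so unusual usernames cannot break the shell line.
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands
```

Printing the result of sacctmgr_add_user_commands(["alice", "bob"], "research") gives two command lines that can be pasted into a shell or fed to a script; the usernames and the account name here are placeholders.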

