[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Fabio Ranalli via slurm-users

You can try DRBD
https://linbit.com/drbd/

or a shared-disk (clustered) file system such as GFS2 or OCFS2:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/configuring_gfs2_file_systems/index
https://docs.oracle.com/en/operating-systems/oracle-linux/9/shareadmin/shareadmin-ManagingtheOracleClusterFileSystemVersion2inOracleLinux.html

--

*Fabio Ranalli* | Principal Systems Administrator

Schrödinger, Inc. 

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] "token expired" errors with auth/slurm

2024-05-07 Thread Fabio Ranalli via slurm-users

Hi there,

We've updated to 23.11.6 and replaced MUNGE with SACK.

Performance and stability have both been pretty good, but we're
occasionally seeing this in slurmctld.log:


[2024-05-07T03:50:16.638] error: decode_jwt: token expired at 1715053769
[2024-05-07T03:50:16.638] error: cred_p_unpack: decode_jwt() failed
[2024-05-07T03:50:16.638] error: Malformed RPC of type REQUEST_BATCH_JOB_LAUNCH(4005) received
[2024-05-07T03:50:16.641] error: slurm_receive_msg_and_forward: [[headnode.internal]:58286] failed: Header lengths are longer than data received
[2024-05-07T03:50:16.648] error: service_connection: slurm_receive_msg: Header lengths are longer than data received


It seems to affect only a subset of nodes: jobs get killed and no new ones
are allocated.
Full functionality can be restored by restarting slurmctld first,
and then slurmd.


Is the token expected to actually expire? I didn't see this possibility 
mentioned in the docs.
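
For anyone comparing timestamps: the number in the first log line looks like a Unix epoch value, which is how a JWT "exp" claim is normally encoded. Below is a minimal sketch for measuring how long after expiry the message was logged. It assumes the slurmctld.log timestamps are in UTC, which may not match your headnode; pass a different tzinfo if it does not. The names LOG_LINE, PATTERN and expiry_lag_seconds are made up for this example and are not part of Slurm.

```python
import re
from datetime import datetime, timezone

# First log line from the excerpt above, copied verbatim.
LOG_LINE = "[2024-05-07T03:50:16.638] error: decode_jwt: token expired at 1715053769"

# Hypothetical pattern for this one message format only.
PATTERN = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: decode_jwt: token expired at (?P<exp>\d+)$"
)


def expiry_lag_seconds(line, log_tz=timezone.utc):
    """Return seconds between token expiry and the log entry, or None.

    ASSUMPTION: the bracketed log timestamp is expressed in `log_tz`
    (UTC by default) and the trailing number is seconds since the epoch.
    """
    match = PATTERN.match(line.strip())
    if match is None:
        return None
    logged = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
    logged = logged.replace(tzinfo=log_tz)
    expired = datetime.fromtimestamp(int(match.group("exp")), tz=timezone.utc)
    return (logged - expired).total_seconds()


if __name__ == "__main__":
    lag = expiry_lag_seconds(LOG_LINE)
    # Under the UTC assumption the token expired at 03:49:29,
    # roughly 47.6 seconds before the error was logged.
    print(f"logged {lag:.3f} s after expiry")
```

If the lag turns out to be consistently small like this, that at least suggests the token had only just expired when the RPC was decoded; it does not say why.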


The problem occurs on an R cloud cluster based on EL9, with a pretty 
"flat" setup.

_headnode_: configless slurmctld, slurmdbd, mariadb, nfsd
_elastic compute nodes_: autofs, slurmd

*/etc/slurm/slurm.conf*
AuthType=auth/slurm
AuthInfo=use_client_ids
CredType=cred/slurm

*/etc/slurm/slurmdbd.conf*
AuthType=auth/slurm
AuthInfo=use_client_ids


Has anyone else encountered the same error?

Thanks,
Fabio

