Hey Ole,

thank you so much for your detailed documentation, which leaves me with both answers and new questions. Apparently, the aforementioned error had nothing to do with munge but with some issue around reloading slurmd that I can't really reproduce. I think I somehow had two slurmd instances running and only killed one, but this is difficult to tell, because once I redid the entire setup, half of the issue disappeared.
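
In case it happens again, I will first check for stray daemons before touching anything else. Nothing Slurm-specific here, just the standard tools I plan to run on the worker:

   ubuntu@worker:~$ pgrep -a slurmd                            # list every running slurmd with its command line
   ubuntu@worker:~$ systemctl show -p MainPID slurmd.service   # the PID systemd considers the main process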

The remaining issue is that slurmd can't be started via systemctl, because slurmd never notifies systemd that it is ready. I was able to work around this by setting:

   [Service]
   Type=simple

which allows the unit to start; Slurm is then able to reach the node, the config files are pulled as expected, and I can schedule jobs on the node.
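
For completeness, the drop-in shown in the status output below (/etc/systemd/system/slurmd.service.d/override.conf) now looks roughly like this; Type=simple is the only change compared to the override I posted earlier:

   [Service]
   Type=simple
   # same ExecStart override as before, only the service type changed
   ExecStart=
   ExecStart=/usr/sbin/slurmd --conf-server=master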

While this leaves me with a running system, I still get:

   ubuntu@worker:~$ systemctl status slurmd.service
   ○ slurmd.service - Slurm node daemon
        Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: enabled)
       Drop-In: /etc/systemd/system/slurmd.service.d
                └─override.conf
        Active: inactive (dead) since Mon 2026-01-19 13:31:28 UTC; 8min ago
      Duration: 7ms
       Process: 19712 ExecStart=/usr/sbin/slurmd --conf-server=master (code=exited, status=0/SUCCESS)
      Main PID: 19712 (code=exited, status=0/SUCCESS)
         Tasks: 11 (limit: 19147)
        Memory: 4.2M (peak: 6.4M)
           CPU: 110ms
        CGroup: /system.slice/slurmd.service
                └─19714 /usr/sbin/slurmd --conf-server=master

   Jan 19 13:31:28 worker systemd[1]: Started slurmd.service - Slurm node daemon.
   Jan 19 13:31:28 worker systemd[1]: slurmd.service: Deactivated successfully.
   Jan 19 13:31:28 worker systemd[1]: slurmd.service: Unit process 19713 (slurmd) remains running after unit stopped.
   Jan 19 13:31:28 worker systemd[1]: slurmd.service: Unit process 19714 (slurmd) remains running after unit stopped.
   Jan 19 13:31:28 worker slurmd[19716]: error: _fetch_child: failed to fetch remote configs: Protocol authentication error
   Jan 19 13:31:28 worker slurmd[19714]: error: _establish_configuration: failed to load configs. Retrying in 10 seconds.

This leaves me with the guess that the initial fetch failure, which then succeeds on retry, might cause systemd to consider the unit finished early. Note that we set up our Slurm cluster via Ansible scripts, so there might also be a race condition I am overlooking that leaves parts of the authentication not ready in time; however, this was not an issue before we tried configless.
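
If it really is an ordering or timing problem, one thing I will probably try next is making the unit wait for munge explicitly and letting systemd retry failed starts. This is just an untested sketch of extra lines for the same override.conf, assuming the munge daemon runs as munge.service on Ubuntu:

   [Unit]
   # don't start slurmd before the local munged is up
   After=munge.service
   Wants=munge.service

   [Service]
   # if the first config fetch still fails, retry instead of staying dead
   Restart=on-failure
   RestartSec=5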

Best,
Xaver

On 1/16/26 12:11, Ole Holm Nielsen via slurm-users wrote:
Hi Xaver,

We have been running Configless Slurm for a number of years, and we're very happy with this setup.  I have documented all the detailed configurations we made in this Wiki page, so maybe you want to consult this page:

https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#configless-slurm-setup

IHTH,
Ole

On 1/16/26 11:11, Xaver Stiensmeier via slurm-users wrote:
Hey everyone,

in the past we set up clusters with configs on each node. Now we want to explore configless. Without changing anything else, we therefore followed https://slurm.schedmd.com/configless_slurm.html and added 'enable_configless' to the config on the master:

SlurmctldParameters=cloud_dns,idle_on_node_suspend,enable_configless,reconfig_on_restart

and start each worker's slurmd with the conf-server parameter:

    # Override systemd service to set conditional path
    [Service]
    ExecStart=
    ExecStart=/usr/sbin/slurmd --conf-server=master

However, this leads to:

    slurmd: error: _fetch_child: failed to fetch remote configs: Protocol authentication error
    slurmd: error: _establish_configuration: failed to load configs. Retrying in 10 seconds.

on the workers, while on the master (/var/log/slurm/slurmctld) I see:

    [2026-01-16T10:00:06.681] error: Munge decode failed: Invalid credential
    [2026-01-16T10:00:06.681] auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
    [2026-01-16T10:00:06.681] auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
    [2026-01-16T10:00:06.681] error: slurm_unpack_received_msg: [[worker]:24295] auth_g_verify: REQUEST_CONFIG has authentication error: Unspecified error
    [2026-01-16T10:00:06.681] error: slurm_unpack_received_msg: [[worker]:24295] Protocol authentication error
The munge key setup is the same as before, so I don't think there is anything wrong with it unless something changes with configless (slurm.conf):

    AuthType=auth/munge
    CryptoType=crypto/munge
    AuthAltTypes=auth/jwt
    AuthAltParameters=jwt_key=/etc/slurm/jwt-secret.key

I found https://groups.google.com/g/slurm-users/c/Q7FVkhx-bOs but this seems unrelated, as both machines can talk to each other fine:

    worker:~$ nc -zv master 6817
    Connection to master (192.168.20.169) 6817 port [tcp/*] succeeded!

I tried adding more "-v" flags to the slurmd start, but that did not give more information. I am unsure how to debug this further. Somehow I think it must be a munge issue, but I am confused, as this part hasn't changed.

Best regards,
Xaver

-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
