Also check the NodeAddr settings in your slurm.conf.
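As a sketch of what that can look like (the address below is invented for illustration; n26 and the hardware attributes come from the node definition quoted later in this thread):

    NodeName=n26 NodeAddr=10.0.0.26 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN

When NodeAddr is not set it defaults to the NodeName, which srun then has to resolve, so an explicit address sidesteps lookup failures like the one reported below.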
On 3/22/21, 2:48 PM, "slurm-users on behalf of Michael Robbert" <slurm-users-boun...@lists.schedmd.com on behalf of mrobb...@mines.edu> wrote:

I haven't tried a configless setup yet, but the problem you're hitting looks like it could be a DNS issue. Can you do a DNS lookup of n26 from the login node? (A sketch of such a check is at the end of this thread.) The way that non-interactive batch jobs are started may not require that, but I believe it is required for interactive jobs.

Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobb...@mines.edu
Our values: Trust | Integrity | Respect | Responsibility

On 3/22/21, 11:24, "slurm-users on behalf of Josef Dvoracek" <slurm-users-boun...@lists.schedmd.com on behalf of j...@fzu.cz> wrote:

Hi @list,

I was able to configure a "configless" Slurm cluster with a fairly minimal slurm.conf everywhere except, of course, on the slurmctld server. All nodes run slurmd to pull the config, including the front-end/login nodes.

Submitting jobs using sbatch scripts works fine, but interactive jobs using srun fail with:

    $ srun --verbose -w n26 --pty /bin/bash
    ...
    srun: error: fwd_tree_thread: can't find address for host n26, check slurm.conf
    srun: error: Task launch for 200137.0 failed on node n26: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    ...

Does this mean that one has to manually specify all relevant NodeNames on the submit hosts? I thought that running slurmd there would pull the configuration from the slurm server. (I can see the file is actually successfully pulled into /run/slurm/conf/slurm.conf.)

So far I have found two workarounds.

Workaround 1: specify the node names in slurm.conf on the login/front-end nodes:

    NodeName=n[(...)n26(...)] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN

Then srun works as expected.

Workaround 2: point the SLURM_CONF environment variable at the slurm.conf pulled by slurmd:

    export SLURM_CONF=/run/slurm/conf/slurm.conf

Then, again, srun works as expected.

Is this expected behavior? I expected that srun on a configless login/front-end node with a running slurmd would recognize the pulled configuration, but apparently that's not the case.

cheers

josef

Setup on the front-end and compute nodes:

    [root@FRONTEND ~]# slurmd --version
    slurm 20.02.5
    [root@FRONTEND ~]#
    [root@FRONTEND ~]# cat /etc/sysconfig/slurmd
    SLURMD_OPTIONS="--conf-server slurmserver2.DOMAIN"
    [root@FRONTEND ~]#
    [root@FRONTEND ~]# cat /etc/slurm/slurm.conf
    ClusterName=CLUSTERNAME
    ControlMachine=slurmserver2.DOMAIN
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=slurmserver2.DOMAIN
    AccountingStoragePort=7031
    SlurmctldParameters=enable_configless
    [root@FRONTEND ~]#

--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 558 | https://telegram.me/jose_d | FZU phone nr.: 2669
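To run the lookup Mike suggested, something along these lines on the login node should be enough (host and getent are standard tools, nothing Slurm-specific; getent also consults /etc/hosts, which matters if the cluster relies on hosts files rather than DNS):

    $ host n26
    $ getent hosts n26

If neither command returns an address, srun's "Can't find an address" error is unsurprising, and fixing name resolution (or setting an explicit NodeAddr, as suggested at the top of the thread) should help. If both succeed, the more likely culprit is srun reading the sparse /etc/slurm/slurm.conf instead of the copy pulled by slurmd, which is exactly what workaround 2 addresses.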