Hello, I have a small cluster of 4 nodes. I'm seeing jobs fail on two nodes with this written to slurm-*.out:

less 1x1x1_220524_121358/slurm-1368_1.out
srun: error: Unable to resolve "node012": Unknown server error
srun: error: fwd_tree_thread: can't find address for host node012, check slurm.conf
srun: error: Task launch for 1368.0 failed on node node012: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

The same job runs correctly on either of two other nodes.

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
macpro*      up   infinite      1   idle node012
macpro*      up   infinite      3   down node[001-002,004]

I can ssh into node012 and the above sinfo suggests no communication problems. I have not modified slurm.conf recently.

I would appreciate any suggestions on what might be causing this problem or what I can do to diagnose it.

Thanks,
Roger