I have an 8-node cluster running CentOS 6.6 and SLURM 15.08. I start slurmd on the 8 compute nodes, then start slurmctld, but for 7 of the 8 nodes, I get an error like this:
slurmctld: agent/is_node_resp: node:node02 RPC:REQUEST_NODE_REGISTRATION_STATUS: Communication connection failure

One of the nodes (node01) does not have the above problem (I don't know why). When I run a simple job (srun hostname), I get this error from node01 and I can't run jobs again:

slurmctld: agent/is_node_resp: node:node01 RPC:REQUEST_TERMINATE_JOB: Communication connection failure

So it seems to be the same basic problem. Any ideas on how to diagnose this?
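
Since both messages are connection failures from slurmctld to a node, one possible first check is whether the controller host can open a TCP connection to each node's slurmd port at all. The following is only a minimal sketch under stated assumptions: it assumes Python 3 is available on the controller, that slurmd listens on Slurm's default SlurmdPort of 6818 (check slurm.conf if it was changed), and that the nodes are named node01 through node08 (only node01 and node02 appear in the logs above; the other names are guesses).

```python
import socket

# Assumption: Slurm's default SlurmdPort. Adjust if slurm.conf sets another value.
SLURMD_PORT = 6818


def can_connect(host, port=SLURMD_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_nodes(nodes, port=SLURMD_PORT, timeout=3.0):
    """Map each node name to whether its slurmd port is reachable."""
    return {node: can_connect(node, port, timeout) for node in nodes}


if __name__ == "__main__":
    # Hypothetical node names following the node01/node02 pattern in the logs.
    nodes = ["node%02d" % i for i in range(1, 9)]
    for node, ok in check_nodes(nodes).items():
        print(node, "reachable" if ok else "NOT reachable")
```

This would be run on the host where slurmctld runs. If a node shows as not reachable, the cause may lie outside Slurm itself, for example hostname resolution or a firewall between the controller and that node. A successful connection only shows that the port is open, not that slurmd is answering RPCs correctly.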