I have an 8-node cluster running CentOS 6.6 and SLURM 15.08.  I start slurmd on 
the 8 compute nodes, then start slurmctld, but for 7 of the 8 nodes, I get an 
error like this:

    slurmctld: agent/is_node_resp: node:node02 RPC:REQUEST_NODE_REGISTRATION_STATUS: Communication connection failure

One of the nodes (node01) does not have the above problem (I don't know why).
When I run a simple job (srun hostname), I get this error for node01, and after
that I can't run any more jobs:

    slurmctld: agent/is_node_resp: node:node01 RPC:REQUEST_TERMINATE_JOB: Communication connection failure

So it seems to be the same basic problem.  Any ideas on how to diagnose this?
