Hi,

I think you shouldn't run slurmd on your ControlMachine node (run only slurmctld and slurmdbd there), because your configuration has no NodeName line for slurm_master. So either add slurm_master to a NodeName line in your slurm.conf, or don't start slurmd on slurm_master.
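
A minimal sketch of the first option, assuming the hardware values that slurmd reported in your log (the "(hw)" numbers) are correct; running `slurmd -C` on slurm_master should print the detected values to compare against. The State value and the omission of any PartitionName change are my assumptions, not something taken from your configuration:

```
# slurm.conf (fragment, must be identical on all nodes)
# Assumed values, copied from the "(hw)" figures in the slurmd output:
# 2 sockets x 8 cores x 2 threads = 32 CPUs
NodeName=slurm_master CPUs=32 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
```

As far as I know, a node defined this way only receives jobs if it is also listed in a PartitionName line, so leaving it out of your partitions should keep user jobs off the controller.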

Cheers,
Jacek

On 10.08.2017 at 14:36, LAHAYE Olivier wrote:
Hi,

I've upgraded slurm 15.08.3 (built from rpmbuild -tb <tarball>) to 17.02.6 on 
centos-7-x86_64.

Since then, slurmd refuses to start on the ControlMachine and on the 
BackupController (it starts fine on compute nodes).

The error is: slurmd: fatal: Unable to determine this slurmd's NodeName

If I try to specify the nodename it fails with a different error message:

[root@slurm_master] # slurmd -D -N $(hostname -s)
slurmd: Node configuration differs from hardware: CPUs=0:32(hw) Boards=0:1(hw) 
SocketsPerBoard=0:2(hw) CoresPerSocket=0:8(hw) ThreadsPerCore=0:2(hw)
slurmd: Message aggregation disabled
slurmd: error: find_node_record: lookup failure for slurm_master
slurmd: fatal: ROUTE -- slurm_master not found in node_record_table
[root@slurm_master]# hostname -s
slurm_master

Trying to debug seems to show that the hostname is not in the node hash table.

slurmdbd and slurmctld start fine.
I've googled around, but I only find problems related to compute nodes, not 
Controller or Backup.

Any ideas?

--
Jacek Budzowski
System administrator
ACC Cyfronet AGH
