Hi Olivier,
You might also want to consult my HowTo wiki for Slurm on CentOS 7:
https://wiki.fysik.dtu.dk/niflheim/SLURM
Lots of little details are discussed in this wiki.
/Ole
On 08/10/2017 03:04 PM, LAHAYE Olivier wrote:
How stupid of me,
you're perfectly right!
How on earth was I unable to see that before I upgraded? I really need
a holiday. Sorry for the inconvenience.
Maybe the error message could be enhanced like this:
This is the slurm controller host; slurmd doesn't need to run on the controller
host unless you also list it as a compute node (not recommended).
--
Olivier LAHAYE
CEA DRT/LIST/DIR
________________________________________
From: Jacek Budzowski [j.budzow...@cyfronet.pl]
Sent: Thursday, 10 August 2017 14:56
To: slurm-dev
Subject: [slurm-dev] Re: Slurmd v15 to v17 stopped working (slurmd: fatal:
Unable to determine this slurmd's NodeName) on ControlMachine
Hi,
I think you shouldn't run slurmd on your ControlMachine node (only
slurmctld and slurmdbd), since in your configuration I don't see a
NodeName line for slurm_master.
So you should either add slurm_master to a NodeName line in your
slurm.conf or not start slurmd on slurm_master.
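
As a rough illustration of that check, here is a minimal Python sketch that
looks for a host among the NodeName entries of a slurm.conf. It is not part of
Slurm: the helper names (expand_hostlist, configured_nodes) and the sample
configuration are hypothetical, and only simple host lists with one bracket
group per name (for example node[01-03]) are handled. If the host is missing,
slurmd on that machine has no node record to match.

```python
import re


def expand_hostlist(expr):
    """Expand a simple host list such as 'node[01-03,07],login1'.

    Assumption: at most one bracket group per name; nested or multiple
    bracket groups are not supported by this sketch.
    """
    names = []
    # Split on commas that sit outside a bracket group.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?[^,]*", expr):
        m = re.fullmatch(r"([^\[]*)\[([^\]]*)\](.*)", part)
        if not m:
            names.append(part)
            continue
        prefix, ranges, suffix = m.groups()
        for item in ranges.split(","):
            if "-" in item:
                lo, hi = item.split("-")
                width = len(lo)  # keep zero padding, e.g. 01, 02, 03
                for i in range(int(lo), int(hi) + 1):
                    names.append(f"{prefix}{i:0{width}d}{suffix}")
            else:
                names.append(f"{prefix}{item}{suffix}")
    return names


def configured_nodes(conf_text):
    """Return the set of node names declared on NodeName= lines."""
    nodes = set()
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        m = re.match(r"NodeName=(\S+)", line, flags=re.IGNORECASE)
        if m:
            nodes.update(expand_hostlist(m.group(1)))
    nodes.discard("DEFAULT")  # NodeName=DEFAULT only sets default values
    return nodes


# Hypothetical configuration resembling the situation in this thread:
# the controller is named, but it has no NodeName entry of its own.
SAMPLE_CONF = """\
ControlMachine=slurm_master
NodeName=DEFAULT CPUs=32
NodeName=node[01-03] State=UNKNOWN
"""

nodes = configured_nodes(SAMPLE_CONF)
print("slurm_master" in nodes)
```

To run it against a real file, one could pass the contents of the site's
slurm.conf and the output of hostname -s instead of the sample values.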
Cheers,
Jacek
On 10.08.2017 at 14:36, LAHAYE Olivier wrote:
Hi,
I've upgraded slurm 15.08.3 (built with rpmbuild -tb <tarball>) to 17.02.6 on
centos-7-x86_64.
Since then, slurmd refuses to start on the ControlMachine and on the
BackupController. (It starts fine on compute nodes.)
The error is: slurmd: fatal: Unable to determine this slurmd's NodeName
If I try to specify the nodename it fails with a different error message:
[root@slurm_master] # slurmd -D -N $(hostname -s)
slurmd: Node configuration differs from hardware: CPUs=0:32(hw) Boards=0:1(hw)
SocketsPerBoard=0:2(hw) CoresPerSocket=0:8(hw) ThreadsPerCore=0:2(hw)
slurmd: Message aggregation disabled
slurmd: error: find_node_record: lookup failure for slurm_master
slurmd: fatal: ROUTE -- slurm_master not found in node_record_table
[root@slurm_master]# hostname -s
slurm_master
Trying to debug seems to show that the hostname is not in the node hash table.
slurmdbd and slurmctld start fine.
I've googled around, but I only find problems related to compute nodes, not
Controller or Backup.
Any ideas?