My server was having issues yesterday, so I rebooted it last night, but
Slurm has not been working properly since the reboot.  I rebooted other
machines at the same time and they work fine, but this one in particular
cannot run any srun/sbatch commands: every attempt fails with an "Invalid
node name specified" error.  I don't see anything wrong with what I'm
doing, and DNS appears to be working fine.

# on slurmd node
[bwong1@mk-gpu-2 ~]$ srun /bin/hostname
srun: error: Unable to allocate resources: Invalid node name specified

# from slurmctld
[root@mk-slurm slurm]# ping mk-gpu-2
PING ( 56(84) bytes of data.
64 bytes from (

# in slurmctld.log (19015 is my UID)
slurmctld: error: slurm_auth_get_host: Lookup failed: Unknown host
slurmctld: error: REQUEST_RESOURCE_ALLOCATE lacks alloc_node from uid=19015
slurmctld: _slurm_rpc_allocate_resources: Invalid node name specified
slurmctld: error: slurm_auth_get_host: Lookup failed: Unknown host
slurmctld: error: REQUEST_RESOURCE_ALLOCATE lacks alloc_node from uid=19015
slurmctld: _slurm_rpc_allocate_resources: Invalid node name specified

# relevant portions of slurm.conf
NodeName=mk-gpu-2 NodeAddr= RealMemory=750000 Gres=gpu:8 \
    Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=all.q Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Any ideas about what's causing this "Unknown host" error?  The hostname
and IP address in slurm.conf are correct, so I'm not sure what else is
going on.
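
Since ping only exercises the forward (name to address) lookup, one more
check I can run on both mk-slurm and mk-gpu-2 is whether the reverse
(address to name) lookup also works.  Below is a minimal sketch, on the
assumption (which I have not confirmed in the Slurm source) that the
"Lookup failed" message comes from resolving an address back to a name.
The helper name check_resolution is my own and is not part of Slurm.

```python
import socket
import sys


def check_resolution(name, forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Return (ip, reverse_name) for a host name.

    ip is None if the forward lookup fails; reverse_name is None if the
    reverse lookup fails.  The resolver functions are parameters only so
    the helper can be exercised without touching real DNS.
    """
    try:
        ip = forward(name)
    except OSError:  # socket.gaierror and socket.herror derive from OSError
        return None, None
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        reverse_name = reverse(ip)[0]
    except OSError:
        return ip, None
    return ip, reverse_name


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 check_dns.py mk-gpu-2 mk-slurm
    for host in sys.argv[1:]:
        ip, reverse_name = check_resolution(host)
        print(f"{host}: forward={ip} reverse={reverse_name}")
```

If the forward lookup succeeds but the reverse column comes back as None
on either machine, or the address is not the one I expect, that would at
least point at the resolver setup rather than at slurm.conf.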

Benjamin Wong
