Thank you, Ryan. I read through the "NodeName" section of https://slurm.schedmd.com/slurm.conf.html but found no guidance on specifying a host list such as "node[001-512].yourdomain.com". As cited below, SLURM seems to support only "node[001-512]", which gives rise to lookup failures if the FQDN (node300.yourdomain.com) is used on the compute node.
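
One possible workaround, given here as an untested sketch rather than anything the manual confirms for this case: since the cited text allows NodeHostname to be a fully qualified name, the range could be expanded outside of SLURM and written as one node record per line, keeping the short form as NodeName. The helper names (expand_hostlist, node_lines) are hypothetical, the parser handles only a single trailing "prefix[start-end]" range and not SLURM's full hostlist grammar, and whether this silences the find_node_record error is an assumption.

```python
import re


def expand_hostlist(expr):
    """Expand a simple expression such as 'node[001-003]' into a list of names.

    Only one trailing 'prefix[start-end]' range is supported; this is a
    simplification, not SLURM's full hostlist syntax.
    """
    match = re.fullmatch(r"([^\[\]]+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported hostlist expression: {expr!r}")
    prefix, start, end = match.groups()
    width = len(start)  # keep the zero padding of the range start
    return [f"{prefix}{i:0{width}d}" for i in range(int(start), int(end) + 1)]


def node_lines(expr, domain, options):
    """Build one slurm.conf node record per host: short NodeName, FQDN NodeHostname."""
    return [
        f"NodeName={name} NodeHostname={name}.{domain} {options}"
        for name in expand_hostlist(expr)
    ]


if __name__ == "__main__":
    # Node options copied from the slurm.conf line quoted later in this thread.
    opts = ("CPUs=16 SocketsPerBoard=2 CoresPerSocket=8 "
            "ThreadsPerCore=1 RealMemory=64100")
    for line in node_lines("node[001-332]", "yourdomain.com", opts):
        print(line)
```

The printed records would replace the single NodeName=node[001-332] line in slurm.conf. On a machine with SLURM installed, "scontrol show hostnames node[001-332]" can do the range expansion instead of the small parser above.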

> NodeName
> Name that Slurm uses to refer to a node (or base partition for BlueGene
> systems). Typically this would be the string that "/bin/hostname -s" returns.
> It may also be the fully qualified domain name as returned by "/bin/hostname
> -f" (e.g. "foo1.bar.com"), or any valid domain name associated with the host
> through the host database (/etc/hosts) or DNS, depending on the resolver
> settings. Note that if the short form of the hostname is not used, it may
> prevent use of hostlist expressions (the numeric portion in brackets must be
> at the end of the string). Only short hostname forms are compatible with the
> switch/nrt plugin at this time. It may also be an arbitrary string if
> NodeHostname is specified. If the NodeName is "DEFAULT", the values specified
> with that record will apply to subsequent node specifications unless
> explicitly set to other values in that node record or replaced with a
> different set of default values. Each line where NodeName is "DEFAULT" will
> replace or add to previous default values and not a reinitialize the default
> values. For architectures in which the node order is significant, nodes will
> be considered consecutive in the order defined. For example, if the
> configuration for "NodeName=charlie" immediately follows the configuration
> for "NodeName=baker" they will be considered adjacent in the computer.
>
> NodeHostname
> Typically this would be the string that "/bin/hostname -s" returns. It may
> also be the fully qualified domain name as returned by "/bin/hostname -f"
> (e.g. "foo1.bar.com"), or any valid domain name associated with the host
> through the host database (/etc/hosts) or DNS, depending on the resolver
> settings. Note that if the short form of the hostname is not used, it may
> prevent use of hostlist expressions (the numeric portion in brackets must be
> at the end of the string). Only short hostname forms are compatible with the
> switch/nrt plugin at this time.
> A node range expression can be used to specify a set of nodes. If an
> expression is used, the number of nodes identified by NodeHostname on a line
> in the configuration file must be identical to the number of nodes identified
> by NodeName. By default, the NodeHostname will be identical in value to
> NodeName.
>
> NodeAddr
> Name that a node should be referred to in establishing a communications path.
> This name will be used as an argument to the gethostbyname() function for
> identification. If a node range expression is used to designate multiple
> nodes, they must exactly match the entries in the NodeName (e.g.
> "NodeName=lx[0-7] NodeAddr=elx[0-7]"). NodeAddr may also contain IP
> addresses. By default, the NodeAddr will be identical in value to
> NodeHostname.

Best,

Jianwen

> On 16 Apr 2017, at 00:51, Ryan Novosielski <novos...@rutgers.edu> wrote:
>
> Read this slurm.conf manual, under the parameters that start with Node. They
> discuss this situation.
>
> --
>  ____
>  || \\UTGERS,     |---------------------------*O*---------------------------
>  ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>  || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>  || \\ of NJ      | Office of Advanced Research Computing - MSB C630, Newark
>       `'
>
> On Apr 15, 2017, at 11:47, Jianwen Wei <wei.jian...@gmail.com> wrote:
>
>> Hi,
>>
>> I used *short* hostnames (say node306) in all my compute node and SLURM
>> settings before. It worked well. However, error messages appear in
>> /var/log/slurmctld.log when I set the FQDN for the compute nodes.
>>
>> [2017-04-15T22:50:06.149] error: find_node_record: lookup failure for node306.yourdomain.com
>>
>> On node306:
>>
>> $ hostname node306.yourdomain.com
>> $ hostname -s
>> node306
>> $ hostname -f
>> node306.yourdomain.com
>>
>> In /etc/slurm/slurm.conf, short names are used since the FQDN prevents use
>> of hostlist expressions. That is, "node[001-332].yourdomain.com" is invalid.
>>
>> NodeName=node[001-332] CPUs=16 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64100
>>
>> So far, SLURM works fine despite the error message appearing in the log
>> every 10 minutes. I appreciate any suggestion on this issue.
>>
>> Best,
>>
>> Jianwen