Thank you, Ryan. I read through the "NodeName" section of 
https://slurm.schedmd.com/slurm.conf.html but found no guidance on 
specifying a host list such as "node[001-512].yourdomain.com". As quoted 
below, SLURM seems to support only "node[001-512]", which leads to lookup 
failures when an FQDN (node300.yourdomain.com) is set as the hostname on the 
compute node.

> NodeName
> Name that Slurm uses to refer to a node (or base partition for BlueGene 
> systems). Typically this would be the string that "/bin/hostname -s" returns. 
> It may also be the fully qualified domain name as returned by "/bin/hostname 
> -f" (e.g. "foo1.bar.com"), or any valid domain name associated with the host 
> through the host database (/etc/hosts) or DNS, depending on the resolver 
> settings. Note that if the short form of the hostname is not used, it may 
> prevent use of hostlist expressions (the numeric portion in brackets must be 
> at the end of the string). Only short hostname forms are compatible with the 
> switch/nrt plugin at this time. It may also be an arbitrary string if 
> NodeHostname is specified. If the NodeName is "DEFAULT", the values specified 
> with that record will apply to subsequent node specifications unless 
> explicitly set to other values in that node record or replaced with a 
> different set of default values. Each line where NodeName is "DEFAULT" will 
> replace or add to previous default values and not a reinitialize the default 
> values. For architectures in which the node order is significant, nodes will 
> be considered consecutive in the order defined. For example, if the 
> configuration for "NodeName=charlie" immediately follows the configuration 
> for "NodeName=baker" they will be considered adjacent in the computer.
> NodeHostname
> Typically this would be the string that "/bin/hostname -s" returns. It may 
> also be the fully qualified domain name as returned by "/bin/hostname -f" 
> (e.g. "foo1.bar.com"), or any valid domain name associated with the host 
> through the host database (/etc/hosts) or DNS, depending on the resolver 
> settings. Note that if the short form of the hostname is not used, it may 
> prevent use of hostlist expressions (the numeric portion in brackets must be 
> at the end of the string). Only short hostname forms are compatible with the 
> switch/nrt plugin at this time. A node range expression can be used to 
> specify a set of nodes. If an expression is used, the number of nodes 
> identified by NodeHostname on a line in the configuration file must be 
> identical to the number of nodes identified by NodeName. By default, the 
> NodeHostname will be identical in value to NodeName.
> NodeAddr
> Name that a node should be referred to in establishing a communications path. 
> This name will be used as an argument to the gethostbyname() function for 
> identification. If a node range expression is used to designate multiple 
> nodes, they must exactly match the entries in the NodeName (e.g. 
> "NodeName=lx[0-7] NodeAddr=elx[0-7]"). NodeAddr may also contain IP 
> addresses. By default, the NodeAddr will be identical in value to 
> NodeHostname.
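
To illustrate the mismatch as I understand it, here is a minimal Python sketch. 
It is not Slurm's actual code: expand_hostlist is a hypothetical helper that 
only handles a single zero-padded range at the end of the string, and I am 
assuming the controller compares the name reported by the node against the 
expanded NodeName list.

```python
import re


def expand_hostlist(expr):
    """Expand a simple 'prefix[lo-hi]' expression into a list of names.

    Hypothetical helper: it only accepts one numeric range in brackets at
    the very end of the string, the form the manual describes.
    """
    m = re.fullmatch(r"([^\[\]]+)\[(\d+)-(\d+)\]", expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)  # keep the zero padding of the lower bound
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]


# Names as they would be configured with NodeName=node[001-512]
configured = set(expand_hostlist("node[001-512]"))

fqdn = "node300.yourdomain.com"
short = fqdn.split(".")[0]  # roughly what "hostname -s" returns

print(short in configured)  # the short name is among the configured names
print(fqdn in configured)   # the FQDN is not, hence the failed lookup
```

With this simplified model, a node that reports its FQDN never matches an 
entry generated from the short-name range, which is consistent with the 
"lookup failure" message in slurmctld.log.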



Best,

Jianwen

> On 16 Apr 2017, at 00:51, Ryan Novosielski <novos...@rutgers.edu> wrote:
> 
> Read this slurm.conf manual, under the parameters that start with Node. They 
> discuss this situation. 
> 
> --
> ____
> || \\UTGERS,       |---------------------------*O*---------------------------
> ||_// the State     |         Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
>     `'
> 
> On Apr 15, 2017, at 11:47, Jianwen Wei <wei.jian...@gmail.com> wrote:
> 
>> Hi,
>> 
>> I previously used *short* hostnames (say node306) in all my compute node and 
>> SLURM settings, and that worked well. However, error messages appear in 
>> /var/log/slurmctld.log now that I have set FQDNs for the compute nodes.
>> 
>> [2017-04-15T22:50:06.149] error: find_node_record: lookup failure for 
>> node306.yourdomain.com
>> 
>> On node306:
>> 
>> $ hostname node306.yourdomain.com
>> $ hostname -s
>> node306
>> $ hostname -f
>> node306.yourdomain.com
>> 
>> In /etc/slurm/slurm.conf, short names are used since an FQDN prevents use of 
>> hostlist expressions. That is, "node[001-332].yourdomain.com" is invalid.
>> 
>> NodeName=node[001-332] CPUs=16 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64100
>> 
>> So far, SLURM works fine despite the error message appearing in the log 
>> every 10 minutes. I would appreciate any suggestions on this issue.
>> 
>> Best,
>> 
>> Jianwen
>> 
