On 17/02/17 05:36, Allan Streib wrote:

> t-019 is one of my nodes that's frequently "down" according to slurm but
> really isn't. What is that "Can't find an address" about? DNS lookups
> seem to be working fine in a shell on the same machine.

This looks like an issue that arises when Slurm wants to forward messages
and looks up the target hosts in slurm.conf:

src/common/forward.c - _forward_thread():

    /* repeat until we are sure the message was sent */
    while ((name = hostlist_shift(hl))) {
        if (slurm_conf_get_addr(name, &addr) == SLURM_ERROR) {
            error("forward_thread: can't find address for host "
                  "%s, check slurm.conf", name);
            slurm_mutex_lock(&fwd_struct->forward_mutex);
            mark_as_failed_forward(&fwd_struct->ret_list, name,
                                   SLURM_UNKNOWN_FORWARD_ADDR);
            free(name);
            if (hostlist_count(hl) > 0) {
                slurm_mutex_unlock(&fwd_struct->forward_mutex);
                continue;
            }
            goto cleanup;
        }

It would be interesting to know if increasing your TreeWidth to 256
would help (basically turning off forwarding, if I'm reading it right).

TreeWidth
    Slurmd daemons use a virtual tree network for communications.
    TreeWidth specifies the width of the tree (i.e. the fanout). On
    architectures with a front end node running the slurmd daemon, the
    value must always be equal to or greater than the number of front
    end nodes which eliminates the need for message forwarding between
    the slurmd daemons. On other architectures the default value is 50,
    meaning each slurmd daemon can communicate with up to 50 other
    slurmd daemons and over 2500 nodes can be contacted with two message
    hops. The default value will work well for most clusters. Optimal
    system performance can typically be achieved if TreeWidth is set to
    the square root of the number of nodes in the cluster for systems
    having no more than 2500 nodes or the cube root for larger systems.
    The value may not exceed 65533.

If that does help, then I suspect the cause is a transient DNS failure.

All the best,
Chris
-- 
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci
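
As a rough illustration of the sizing rule in the TreeWidth description above, here is a minimal C sketch that turns a node count into the "square root up to 2500 nodes, cube root beyond that" suggestion. None of these names exist in Slurm: suggested_tree_width(), ceil_root() and TREE_WIDTH_MAX are hypothetical helpers, and rounding the root up to the next integer is my assumption, since the description does not say how to round. Note that this is the performance rule of thumb, not the experiment suggested above: to rule out forwarding you would instead set TreeWidth at least as large as the number of nodes.

```c
#include <assert.h>
#include <stdint.h>

/* Upper limit stated in the TreeWidth description ("may not exceed 65533"). */
#define TREE_WIDTH_MAX 65533u

/*
 * Hypothetical helper: smallest width w such that w^hops >= nodes,
 * i.e. the integer ceiling of the hops-th root. Pure integer arithmetic
 * avoids floating-point rounding surprises from sqrt()/cbrt().
 */
static unsigned int ceil_root(uint64_t nodes, unsigned int hops)
{
    assert(hops >= 1);
    unsigned int w = 1;
    for (;;) {
        uint64_t reach = 1;
        for (unsigned int i = 0; i < hops; i++)
            reach *= w;              /* nodes reachable within `hops` hops */
        if (reach >= nodes || w >= TREE_WIDTH_MAX)
            return w;
        w++;
    }
}

/*
 * Hypothetical helper: square root for clusters of up to 2500 nodes,
 * cube root for larger ones, rounded up (rounding is an assumption).
 */
unsigned int suggested_tree_width(uint64_t nodes)
{
    if (nodes == 0)
        return 1;                    /* degenerate input: smallest sane width */
    return ceil_root(nodes, nodes <= 2500 ? 2 : 3);
}
```

With this sketch, a 2500-node cluster comes out at 50, which matches the default quoted above, while a 100-node cluster comes out at 10.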