On 17/02/17 05:36, Allan Streib wrote:

> t-019 is one of my nodes that's frequently "down" according to slurm but
> really isn't. What is that "Can't find an address" about? DNS lookups
> seem to be working fine in a shell on the same machine.

This looks to come from the point where Slurm wants to forward a message
and fails to look up the address of a host listed in slurm.conf:

src/common/forward.c - _forward_thread():

        /* repeat until we are sure the message was sent */
        while ((name = hostlist_shift(hl))) {
                if (slurm_conf_get_addr(name, &addr) == SLURM_ERROR) {
                        error("forward_thread: can't find address for host "
                              "%s, check slurm.conf", name);
                        slurm_mutex_lock(&fwd_struct->forward_mutex);
                        mark_as_failed_forward(&fwd_struct->ret_list, name,
                                               SLURM_UNKNOWN_FORWARD_ADDR);
                        free(name);
                        if (hostlist_count(hl) > 0) {
                                slurm_mutex_unlock(&fwd_struct->forward_mutex);
                                continue;
                        }
                        goto cleanup;
                }


It would be interesting to know whether increasing your TreeWidth to 256
helps. If I'm reading it right, a fanout wider than your node count
basically turns off forwarding, as every slurmd is then contacted directly.

       TreeWidth
              Slurmd  daemons  use  a virtual tree network for communications.
              TreeWidth specifies the width of the tree (i.e. the fanout).  On
              architectures  with  a front end node running the slurmd daemon,
              the value must always be equal to or greater than the number  of
              front end nodes which eliminates the need for message forwarding
              between the slurmd daemons.  On other architectures the  default
              value  is 50, meaning each slurmd daemon can communicate with up
              to 50 other slurmd daemons and over 2500 nodes can be  contacted
              with  two  message  hops.   The default value will work well for
              most clusters.  Optimal  system  performance  can  typically  be
              achieved if TreeWidth is set to the square root of the number of
              nodes in the cluster for systems having no more than 2500  nodes
              or  the  cube  root for larger systems. The value may not exceed
              65533.
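
As a rough sketch of the sizing rule in that man page text (not anything
taken from the Slurm source), the arithmetic could look like the following.
The function names are made up for illustration, and rounding the root up
is my own assumption, since the man page doesn't say how to round.

```c
/* Hypothetical helper: smallest w such that w^k >= n (k is 2 or 3). */
static unsigned ceil_root(unsigned n, int k)
{
    unsigned w = 1;

    for (;;) {
        unsigned long long p = 1;
        for (int i = 0; i < k; i++)
            p *= w;
        if (p >= n)
            return w;
        w++;
    }
}

/*
 * Hypothetical helper: TreeWidth suggested by the man page heuristic,
 * i.e. the square root of the node count for up to 2500 nodes and the
 * cube root for larger systems, never above the documented cap of 65533.
 */
unsigned suggest_tree_width(unsigned nodes)
{
    unsigned w;

    if (nodes == 0)
        return 1;
    w = (nodes <= 2500) ? ceil_root(nodes, 2) : ceil_root(nodes, 3);
    return (w > 65533) ? 65533 : w;
}
```

With this, suggest_tree_width(2500) gives 50, which matches the default
mentioned above. Note that this is the tuning heuristic only; the 256 I
suggested is deliberately wider than that, as a diagnostic.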

If raising TreeWidth does help, then I suspect the cause is a transient
DNS failure.
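
One way to test the transient DNS theory would be to resolve the node name
repeatedly and count the failures. The following is only a sketch: it goes
through the system resolver with getaddrinfo(), which I'm assuming is close
enough to what Slurm ends up doing, and the function name is made up for
illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: resolve `host` `attempts` times, sleeping
 * `delay_secs` seconds between tries. Returns the number of failed
 * lookups, or -1 if the arguments are invalid.
 */
int count_resolve_failures(const char *host, int attempts, unsigned delay_secs)
{
    struct addrinfo hints;
    int failures = 0;

    if (host == NULL || attempts < 0)
        return -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* accept IPv4 or IPv6 answers */
    hints.ai_socktype = SOCK_STREAM;

    for (int i = 0; i < attempts; i++) {
        struct addrinfo *res = NULL;
        int rc = getaddrinfo(host, NULL, &hints, &res);

        if (rc != 0) {
            failures++;
            fprintf(stderr, "attempt %d: %s: %s\n",
                    i + 1, host, gai_strerror(rc));
        } else {
            freeaddrinfo(res);
        }
        if (delay_secs > 0 && i + 1 < attempts)
            sleep(delay_secs);
    }
    return failures;
}
```

Calling something like count_resolve_failures("t-019", 1000, 1) from a
small main() on a node that forwards messages should show whether lookups
fail now and then. A local cache such as nscd could hide failures, so a
clean run wouldn't rule the theory out.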

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
