They are all running ntpd and clocks are in sync.

This slurmctld manages a total of 226 nodes, in several different
partitions. The cluster of 64 is the only one where I see this
happening. Unless that number of nodes is pushing the limit for a single
slurmctld (which I doubt), I'd be inclined to think it's more likely a
network issue, but in that case I'd expect wireshark to show an attempt
by slurmctld to contact the node and then no response. What I'm actually
seeing is no traffic in either direction for these nodes, over the same
interval in which the others are contacted.
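
In case it helps anyone reproduce the capture, an equivalent tcpdump on
the slurmctld host would be roughly the following (interface and node
name here are just placeholders; the periodic pings from slurmctld show
up as connections to the slurmd port, 6818 by default):

    tcpdump -i eth0 -nn host node01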

Allan

Lachlan Musicman <data...@gmail.com> writes:

> Check they are all set to the same time, or running ntpd against the same
> server. I found that the nodes that kept going down had their time out of
> sync.
>
> Cheers
> L.
>
> ------
> The most dangerous phrase in the language is, "We've always done it this way."
>
> - Grace Hopper
>
> On 25 January 2017 at 05:49, Allan Streib <astr...@indiana.edu> wrote:
>
>     I have a cluster of 64 nodes, and nodes 1 and 19 keep getting
>     marked as down with a reason of "not responding". They are up,
>     pingable, slurmd is running, etc.; everything looks normal.
>    
>     Using wireshark on the slurmctld, I looked at the traffic for node 1
>     and node 2. I can see traffic between the slurmctld node and node 2
>     at intervals of about 300 seconds, but for node 1 the interval is
>     sometimes as long as 1800 seconds.
>    
>     Any reason why these nodes might be getting "pinged" less often
>     than the others? The slurm.conf is identical, and contains these
>     timer settings (which I think are all defaults):
>    
>     # TIMERS
>     InactiveLimit=0
>     KillWait=30
>     MinJobAge=300
>     SlurmctldTimeout=120
>     SlurmdTimeout=300
>     Waittime=0
>    
>     Slurm version 14.11.7.
>    
>     Allan

-- 
Allan Streib
Indiana University School of Informatics and Computing
Digital Science Center :: Community Grids Lab :: FutureSystems
