On Monday, 3 June 2019 7:53:39 AM PDT Alexander Åhman wrote:
> That was my first thought too, but... no. Both /etc/hosts (not used) and
> slurm.conf are identical on all nodes, both working and non-working nodes.
I think Slurm caches things like that, so it might be worth restarting slurmctld.
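For what it's worth, a quick way to do that (assuming the daemons are managed by systemd; the host names in the prompts are only placeholders):

[root@controlhost ~]# scontrol reconfigure          # ask slurmctld/slurmd to re-read slurm.conf
[root@controlhost ~]# systemctl restart slurmctld   # or restart the controller outright
[root@cn7 ~]# systemctl restart slurmd              # and slurmd on the affected node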
That was my first thought too, but... no. Both /etc/hosts (not used) and
slurm.conf are identical on all nodes, both working and non-working nodes.
_From login machine:_
[alex@li1 ~]$ srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118071 queued and waiting for resources
srun: job 1118071 has been allocated resources
I think this error usually means that node cn7 has either the wrong
/etc/hosts or the wrong /etc/slurm/slurm.conf
E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'
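One quick way to rule that out (a sketch; it assumes passwordless ssh to the nodes, the default config path, and "cn6" stands in for whichever second node is affected) is to compare checksums from the login machine and ask the running daemons which file they actually loaded:

[alex@li1 ~]$ for n in cn6 cn7; do ssh $n md5sum /etc/hosts /etc/slurm/slurm.conf; done
[alex@li1 ~]$ scontrol show config | grep -i slurm_conf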
On Wed, May 29, 2019 at 6:00 AM Alexander Åhman wrote:
> Hi,
> Have a very strange problem. The cluster has been working just fine [...]
I have tried to find a network error but can't see anything. Every node
I've tested has the same (and correct) view of things.
_On node cn7:_ (the problematic one)
em1: link/ether 50:9a:4c:79:31:4d inet 10.28.3.137/24
_On login machine:_
[alex@li1 ~]$ host cn7
cn7.ydesign.se has address 10.28.3.137
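It might also be worth checking what address Slurm itself has recorded for the node, and whether the login machine has a healthy neighbour entry for it (just a sketch; the grep patterns are only illustrative):

[alex@li1 ~]$ scontrol show node cn7 | grep -i addr
[alex@li1 ~]$ ip neigh | grep 10.28.3.137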
Hi Alexander,
The error "can't find address for host cn7" would indicate a DNS
problem. What is the output of "host cn7" from the srun host li1?
How many network devices are in your subnet? It may be that the Linux
kernel is doing "ARP cache thrashing" if the number of devices approaches
the kernel's neighbour-table limits (the net.ipv4.neigh.default.gc_thresh* sysctls).
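If that turns out to be the problem, the usual remedy is to raise those thresholds, for example (the file name and the values below are only illustrative; size them to your subnet):

[root@li1 ~]# cat /etc/sysctl.d/99-arp-cache.conf
net.ipv4.neigh.default.gc_thresh1 = 2048
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
[root@li1 ~]# sysctl --system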
Hi,
Have a very strange problem. The cluster has been working just fine
until one node died and now I can't submit jobs to 2 of the nodes using
srun from the login machine. Using sbatch works just fine, and so does srun if I
run it from the same host as slurmctld.
All the other nodes work just fine
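Since the failure only shows up when srun runs on the login machine, one quick sanity check (a sketch; it assumes the node list reported by sinfo and that getent is available on li1) is to try to resolve every node name from there:

[alex@li1 ~]$ for n in $(scontrol show hostnames $(sinfo -h -o '%N')); do getent hosts $n >/dev/null || echo "cannot resolve $n"; done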