Re: [slurm-users] Submit job using srun fails but sbatch works

2019-06-06 Thread Chris Samuel
On Monday, 3 June 2019 7:53:39 AM PDT Alexander Åhman wrote:
> That was my first thought too, but... no. Both /etc/hosts (not used) and
> slurm.conf are identical on all nodes, both working and non-working nodes.
I think Slurm caches things like that, so it might be worth restarting slurmctld
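A minimal sketch of forcing the controller to re-read its configuration, assuming a systemd-managed Slurm installation (the commands are illustrative and not taken from the original message):

    # On the slurmctld host: restart the controller so it re-reads slurm.conf
    systemctl restart slurmctld
    # Or, without a full restart, ask the daemons to re-read the configuration
    scontrol reconfigure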

Re: [slurm-users] Submit job using srun fails but sbatch works

2019-06-03 Thread Alexander Åhman
That was my first thought too, but... no. Both /etc/hosts (not used) and slurm.conf are identical on all nodes, both working and non-working nodes.
_From login machine:_
[alex@li1 ~]$ srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118071 queued and waiting for resources
srun: job 1118071 has been
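One way to double-check that slurm.conf really is identical everywhere is to compare checksums across nodes; a rough sketch, assuming pdsh is available and the compute nodes are named cn1-cn8 (both assumptions, not stated in the thread):

    # Compare slurm.conf checksums on the compute nodes and the login node
    pdsh -w cn[1-8],li1 md5sum /etc/slurm/slurm.conf | dshbak -c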

Re: [slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Alex Chekholko
I think this error usually means that your node cn7 has either the wrong /etc/hosts or the wrong /etc/slurm/slurm.conf. E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'
On Wed, May 29, 2019 at 6:00 AM Alexander Åhman wrote:
> Hi,
> Have a very strange problem. The cluster has been working
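A hedged example of checking name resolution from cn7 itself, since the "can't find address" class of error comes from the node running the job step (the host names are taken from the thread, the rest is illustrative):

    # On cn7: check how the node resolves itself and the login host
    getent hosts cn7 li1
    # Check what address Slurm has recorded for cn7
    scontrol show node cn7 | grep -i NodeAddr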

Re: [slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Alexander Åhman
I have tried to find a network error but can't see anything. Every node I've tested has the same (and correct) view of things.
_On node cn7:_ (the problematic one)
em1: link/ether 50:9a:4c:79:31:4d
     inet 10.28.3.137/24
_On login machine:_
[alex@li1 ~]$ host cn7
cn7.ydesign.se has address
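For reference, a sketch of reproducing these checks from both sides; the commands and the address 10.28.3.137 come from the thread, the rest is illustrative:

    # On cn7: confirm the interface address
    ip addr show em1
    # On the login machine li1: confirm forward and reverse DNS for cn7
    host cn7
    host 10.28.3.137
    # And confirm basic reachability
    ping -c 1 cn7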

Re: [slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Ole Holm Nielsen
Hi Alexander,
The error "can't find address for host cn7" would indicate a DNS problem. What is the output of "host cn7" from the srun host li1? How many network devices are in your subnet? It may be that the Linux kernel is doing "ARP cache thrashing" if the number of devices approaches
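If ARP cache thrashing is suspected, the kernel's neighbour-table garbage-collection thresholds can be inspected and raised; a sketch only, and the value below is an illustrative example, not from the thread:

    # Inspect the current neighbour (ARP) cache thresholds
    sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
    # Raise the upper limit if the subnet holds more devices than the defaults allow (example value)
    sysctl -w net.ipv4.neigh.default.gc_thresh3=8192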

[slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Alexander Åhman
Hi,
Have a very strange problem. The cluster has been working just fine until one node died and now I can't submit jobs to 2 of the nodes using srun from the login machine. Using sbatch works just fine and also if I use srun from the same host as slurmctld. All the other nodes work just fine
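For reference, a minimal way to reproduce the asymmetry described here from the login machine; the node name cn7 comes from later messages in the thread, and the commands are illustrative rather than quoted from the report:

    # Fails from the login machine in this report
    srun --nodelist=cn7 hostname
    # Works according to this report
    sbatch --nodelist=cn7 --wrap="hostname"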