Ben, a stupid question, however: have you installed and configured Munge authentication on the slave node?
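If MUNGE is running on both machines, a quick way to check is to create a credential on the master and decode it on the slave (I'm assuming here the slave resolves as 92.cirrascale.sci.openai.org, based on your node list; substitute whatever name is in your NodeName entries):

    $ munge -n | unmunge                                    # local sanity check on the master
    $ munge -n | ssh 92.cirrascale.sci.openai.org unmunge   # credential from master decoded on slave

If the second command reports anything other than STATUS: Success (e.g. an expired/rewound credential from clock skew, or a decode failure from mismatched munge.key files), slurmctld won't be able to authenticate the task-launch request to the slave's slurmd, which would match the "started 1 of 2 tasks" timeout you're seeing.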
On 17 May 2017 at 02:59, Ben Mann <b...@openai.com> wrote:
> Hello Slurm dev,
>
> I just set up a small test cluster on two Ubuntu 14.04 machines, installed
> SLURM 17.02 from source. I started slurmctld, slurmdbd and slurmd on a
> master and just slurmd on a slave. When I run a job on two nodes, it
> completes instantly on master, but never on slave.
>
> Here are my .conf files, which are on a NAS and symlinked from
> /usr/local/etc/, as well as log files for the srun below:
> https://gist.github.com/8enmann/0637ee2cbb6e6f5aaedef6b3c3f24a1d
>
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      2   idle [91-92]
>
> $ srun -l hostname
> 0: 91.cirrascale.sci.openai.org
>
> $ srun -l -N2 hostname
> 0: 91.cirrascale.sci.openai.org
> $ srun -N2 -l hostname
> 0: 91.cirrascale.sci.openai.org
> srun: error: timeout waiting for task launch, started 1 of 2 tasks
> srun: Job step 36.0 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> $ squeue
>  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
>     36     debug hostname      ben  R   8:42      2 [91-92]
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      2  alloc [91-92]
>
> I'm guessing I misconfigured something, but I don't see anything in the
> logs suggesting what it might be. I've also tried cranking up verbosity and
> didn't see anything. I know it's not recommended to use root to run
> everything, but doesn't at least slurmd need root to manage cgroups?
>
> Thanks in advance!!
> Ben
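A couple of other debugging steps that might narrow it down (again assuming the slave's node name from your sinfo output; adjust to your configuration):

    # On the master: target only the slave, so just its slurmd is involved
    $ srun -w 92.cirrascale.sci.openai.org -l hostname

    # On the slave: stop the daemonized slurmd, then run it in the foreground with verbose logging
    $ slurmd -D -vvvv

If the launch request never even appears in the foreground slurmd output while the srun hangs, I'd look at authentication (Munge, as above) or at the network path between the nodes (firewalls blocking SlurmdPort or the ephemeral ports srun listens on), rather than at the cgroup/root question. And yes, slurmd itself does normally run as root so it can create cgroups and user processes; it's slurmctld and slurmdbd that are usually run as a dedicated SlurmUser.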