Ben, a stupid question, however: have you installed and configured Munge authentication on the slave node?
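If MUNGE is running on both machines, a quick way to check is to create a credential on the master and decode it on the slave (I'm assuming here the slave resolves as 92.cirrascale.sci.openai.org, based on your node list; substitute whatever name is in your NodeName entries):

    $ munge -n | unmunge                                    # local sanity check on the master
    $ munge -n | ssh 92.cirrascale.sci.openai.org unmunge   # credential from master decoded on slave

If the second command reports anything other than STATUS: Success (e.g. an expired/rewound credential from clock skew, or a decode failure from mismatched munge.key files), slurmctld won't be able to authenticate the task-launch request to the slave's slurmd, which would match the "started 1 of 2 tasks" timeout you're seeing.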
On 17 May 2017 at 02:59, Ben Mann <b...@openai.com> wrote:
> Hello Slurm dev,
>
> I just set up a small test cluster on two Ubuntu 14.04 machines, installed
> SLURM 17.02 from source. I started slurmctld, slurmdbd and slurmd on a
> master and just slurmd on a slave. When I run a job on two nodes, it
> completes instantly on master, but never on slave.
>
> Here are my .conf files, which are on a NAS and symlinked from
> /usr/local/etc/, as well as log files for the srun below:
> https://gist.github.com/8enmann/0637ee2cbb6e6f5aaedef6b3c3f24a1d
>
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      2   idle [91-92]
>
> $ srun -l hostname
> 0: 91.cirrascale.sci.openai.org
>
> $ srun -l -N2 hostname
> 0: 91.cirrascale.sci.openai.org
> $ srun -N2 -l hostname
> 0: 91.cirrascale.sci.openai.org
> srun: error: timeout waiting for task launch, started 1 of 2 tasks
> srun: Job step 36.0 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> $ squeue
>  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
>     36     debug hostname      ben  R   8:42      2 [91-92]
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      2  alloc [91-92]
>
> I'm guessing I misconfigured something, but I don't see anything in the
> logs suggesting what it might be. I've also tried cranking up verbosity and
> didn't see anything. I know it's not recommended to use root to run
> everything, but doesn't at least slurmd need root to manage cgroups?
>
> Thanks in advance!!
> Ben
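A couple of other debugging steps that might narrow it down (again assuming the slave's node name from your sinfo output; adjust to your configuration):

    # On the master: target only the slave, so just its slurmd is involved
    $ srun -w 92.cirrascale.sci.openai.org -l hostname

    # On the slave: stop the daemonized slurmd, then run it in the foreground with verbose logging
    $ slurmd -D -vvvv

If the launch request never even appears in the foreground slurmd output while the srun hangs, I'd look at authentication (Munge, as above) or at the network path between the nodes (firewalls blocking SlurmdPort or the ephemeral ports srun listens on), rather than at the cgroup/root question. And yes, slurmd itself does normally run as root so it can create cgroups and user processes; it's slurmctld and slurmdbd that are usually run as a dedicated SlurmUser.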