Yep, and I also did the cross-node munge test suggested in the munge setup documentation. In the output of slurmd -D -vvvv on the remote node, the credential appears to check out ("Checking credential with 468 bytes of sig data"), but the job still doesn't execute.
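For reference, the cross-node check from the munge docs is along these lines (the slave hostname here is just illustrative), and it decoded cleanly in both directions:

$ munge -n | ssh 92.cirrascale.sci.openai.org unmunge

Here is the slave's slurmd output for the launch attempt: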
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task 57.0 request from 4078.4078@172.18.2.44 (port 59078)
slurmd: debug3: state for jobid 21: ctime:1494917363 revoked:0 expires:0
slurmd: debug3: state for jobid 22: ctime:1494918387 revoked:0 expires:0
slurmd: debug3: state for jobid 23: ctime:1494919023 revoked:0 expires:0
slurmd: debug3: state for jobid 25: ctime:1494979611 revoked:0 expires:0
slurmd: debug3: state for jobid 29: ctime:1494980262 revoked:0 expires:0
slurmd: debug3: state for jobid 36: ctime:1494981612 revoked:0 expires:0
slurmd: debug3: state for jobid 37: ctime:1494982415 revoked:0 expires:0
slurmd: debug3: state for jobid 51: ctime:1494984067 revoked:0 expires:0
slurmd: debug: Checking credential with 468 bytes of sig data
slurmd: debug: Reading cgroup.conf file /usr/local/etc/cgroup.conf
slurmd: _run_prolog: run job script took usec=6
slurmd: _run_prolog: prolog with lock for job 57 ran for 0 seconds
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 1 (92), parent rank 0 (91), children 0, depth 1, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug2: Cached group access list for ben/4078
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd

On Wed, May 17, 2017 at 12:24 AM John Hearns <hear...@googlemail.com> wrote:

> Ben, a stupid question, however - have you installed and configured Munge
> authentication on the slave node?
>
> On 17 May 2017 at 02:59, Ben Mann <b...@openai.com> wrote:
>
>> Hello Slurm dev,
>>
>> I just set up a small test cluster on two Ubuntu 14.04 machines and
>> installed SLURM 17.02 from source. I started slurmctld, slurmdbd, and
>> slurmd on a master, and just slurmd on a slave. When I run a job on two
>> nodes, it completes instantly on the master but never on the slave.
>>
>> Here are my .conf files, which are on a NAS and symlinked from
>> /usr/local/etc/, as well as log files for the srun commands below:
>> https://gist.github.com/8enmann/0637ee2cbb6e6f5aaedef6b3c3f24a1d
>>
>> $ sinfo
>> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>> debug*    up    infinite  2     idle  [91-92]
>>
>> $ srun -l hostname
>> 0: 91.cirrascale.sci.openai.org
>>
>> $ srun -l -N2 hostname
>> 0: 91.cirrascale.sci.openai.org
>>
>> $ srun -N2 -l hostname
>> 0: 91.cirrascale.sci.openai.org
>> srun: error: timeout waiting for task launch, started 1 of 2 tasks
>> srun: Job step 36.0 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> $ squeue
>> JOBID PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
>> 36    debug     hostname ben  R  8:42 2     [91-92]
>>
>> $ sinfo
>> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>> debug*    up    infinite  2     alloc [91-92]
>>
>> I'm guessing I misconfigured something, but I don't see anything in the
>> logs suggesting what it might be. I've also tried cranking up verbosity
>> and didn't see anything. I know it's not recommended to run everything as
>> root, but doesn't at least slurmd need root to manage cgroups?
>>
>> Thanks in advance!!
>> Ben
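P.S. One more data point: the launch RPC clearly reaches the slave's slurmd and slurmstepd gets forked, so my current guess (unconfirmed) is that the step can't connect back to srun on the submit host. As a rough reachability probe from the slave, I tried something like the following; the IP and port are just the ones from the log above, since the actual port srun listens on varies per step:

$ nc -zv 172.18.2.44 59078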