Turns out it was my firewall settings! When I disabled the firewall, it started working. Will investigate more.
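For anyone hitting the same symptom: slurmd on the remote node likely has to connect back to the submit host for the step launch (the slurmd log below shows the launch request coming from the submit host on an ephemeral port, 59078), so the submit host's firewall needs to allow that traffic. A rough sketch with ufw, assuming the default SlurmctldPort=6817 and SlurmdPort=6818; the SrunPortRange value is only an illustrative example and only matters if it's actually set in slurm.conf:

$ sudo ufw allow 6817/tcp          # slurmctld on the controller
$ sudo ufw allow 6818/tcp          # slurmd on the compute nodes
$ sudo ufw allow 60001:63000/tcp   # only if slurm.conf sets SrunPortRange=60001-63000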
On Wed, May 17, 2017 at 10:54 AM Ben Mann <b...@openai.com> wrote:

> Yep, and I also did the cross-node munge test suggested in the munge setup
> documentation. In the remote slurmd -D -vvvv output it appears to check
> out (Checking credential with 468 bytes of sig data), but the job still
> doesn't execute.
>
> slurmd: debug2: got this type of message 6001
> slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
> slurmd: launch task 57.0 request from 4078.4078@172.18.2.44 (port 59078)
> slurmd: debug3: state for jobid 21: ctime:1494917363 revoked:0 expires:0
> slurmd: debug3: state for jobid 22: ctime:1494918387 revoked:0 expires:0
> slurmd: debug3: state for jobid 23: ctime:1494919023 revoked:0 expires:0
> slurmd: debug3: state for jobid 25: ctime:1494979611 revoked:0 expires:0
> slurmd: debug3: state for jobid 29: ctime:1494980262 revoked:0 expires:0
> slurmd: debug3: state for jobid 36: ctime:1494981612 revoked:0 expires:0
> slurmd: debug3: state for jobid 37: ctime:1494982415 revoked:0 expires:0
> slurmd: debug3: state for jobid 51: ctime:1494984067 revoked:0 expires:0
> slurmd: debug:  Checking credential with 468 bytes of sig data
> slurmd: debug:  Reading cgroup.conf file /usr/local/etc/cgroup.conf
> slurmd: _run_prolog: run job script took usec=6
> slurmd: _run_prolog: prolog with lock for job 57 ran for 0 seconds
> slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
> slurmd: debug3: slurmstepd rank 1 (92), parent rank 0 (91), children 0,
> depth 1, max_depth 1
> slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
> slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
> slurmd: debug2: Cached group access list for ben/4078
> slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
>
> On Wed, May 17, 2017 at 12:24 AM John Hearns <hear...@googlemail.com>
> wrote:
>
>> Ben, a stupid question, however - have you installed and configured Munge
>> authentication on the slave node?
>>
>> On 17 May 2017 at 02:59, Ben Mann <b...@openai.com> wrote:
>>
>>> Hello Slurm dev,
>>>
>>> I just set up a small test cluster on two Ubuntu 14.04 machines and
>>> installed SLURM 17.02 from source. I started slurmctld, slurmdbd, and
>>> slurmd on a master and just slurmd on a slave. When I run a job on two
>>> nodes, it completes instantly on the master but never on the slave.
>>>
>>> Here are my .conf files, which are on a NAS and symlinked from
>>> /usr/local/etc/, as well as log files for the srun below:
>>> https://gist.github.com/8enmann/0637ee2cbb6e6f5aaedef6b3c3f24a1d
>>>
>>> $ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
>>> debug*       up   infinite      2   idle  [91-92]
>>>
>>> $ srun -l hostname
>>> 0: 91.cirrascale.sci.openai.org
>>>
>>> $ srun -l -N2 hostname
>>> 0: 91.cirrascale.sci.openai.org
>>>
>>> $ srun -N2 -l hostname
>>> 0: 91.cirrascale.sci.openai.org
>>> srun: error: timeout waiting for task launch, started 1 of 2 tasks
>>> srun: Job step 36.0 aborted before step completely launched.
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> srun: error: Timed out waiting for job step to complete
>>>
>>> $ squeue
>>>  JOBID PARTITION      NAME  USER ST   TIME  NODES  NODELIST(REASON)
>>>     36     debug  hostname   ben  R   8:42      2  [91-92]
>>>
>>> $ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
>>> debug*       up   infinite      2  alloc  [91-92]
>>>
>>> I'm guessing I misconfigured something, but I don't see anything in the
>>> logs suggesting what it might be. I've also tried cranking up verbosity
>>> and didn't see anything. I know it's not recommended to use root to run
>>> everything, but doesn't at least slurmd need root to manage cgroups?
>>>
>>> Thanks in advance!!
>>> Ben
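(For reference, the cross-node munge check mentioned at the top of the quoted thread is essentially generating a credential on one host and decoding it on the other; <slave-host> below is just a placeholder for the slave node's hostname, and this assumes munge/unmunge are installed on both machines:

$ munge -n | ssh <slave-host> unmunge
$ ssh <slave-host> munge -n | unmunge

If either direction reports a decode error, it most likely means the munge keys don't match or the clocks are out of sync.)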