Yep, and I also did the cross-node munge test suggested in the munge setup
documentation. In the remote slurmd -D -vvvv output the credential appears to
check out ("Checking credential with 468 bytes of sig data"), but the job
still doesn't execute.
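
For reference, the cross-node check from the munge documentation is roughly the
following (just a sketch; <slave> is a placeholder for the slave's hostname, and
it assumes ssh access from the master):

$ munge -n | unmunge               # encode and decode locally on the master
$ munge -n | ssh <slave> unmunge   # encode on the master, decode on the slave

As far as I could tell both decoded cleanly, which is why I think munge itself
is OK.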

slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task 57.0 request from 4078.4078@172.18.2.44 (port 59078)
slurmd: debug3: state for jobid 21: ctime:1494917363 revoked:0 expires:0
slurmd: debug3: state for jobid 22: ctime:1494918387 revoked:0 expires:0
slurmd: debug3: state for jobid 23: ctime:1494919023 revoked:0 expires:0
slurmd: debug3: state for jobid 25: ctime:1494979611 revoked:0 expires:0
slurmd: debug3: state for jobid 29: ctime:1494980262 revoked:0 expires:0
slurmd: debug3: state for jobid 36: ctime:1494981612 revoked:0 expires:0
slurmd: debug3: state for jobid 37: ctime:1494982415 revoked:0 expires:0
slurmd: debug3: state for jobid 51: ctime:1494984067 revoked:0 expires:0
slurmd: debug:  Checking credential with 468 bytes of sig data
slurmd: debug:  Reading cgroup.conf file /usr/local/etc/cgroup.conf
slurmd: _run_prolog: run job script took usec=6
slurmd: _run_prolog: prolog with lock for job 57 ran for 0 seconds
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 1 (92), parent rank 0 (91), children 0, depth 1, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug2: Cached group access list for ben/4078
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
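
Since the launch gets as far as forking slurmstepd, the next thing I had in
mind is a plain connectivity check between the nodes. Rough sketch only:
6817/6818 are the Slurm defaults, the hostname is the master node, <slave> is a
placeholder, and nc is just an example tool:

$ scontrol show config | grep -i port        # SlurmctldPort, SlurmdPort, SrunPortRange
$ nc -zv 91.cirrascale.sci.openai.org 6817   # from the slave to slurmctld on the master
$ nc -zv <slave> 6818                        # from the master to slurmd on the slave

srun also listens on ephemeral ports for the step launch callback, so a
firewall between the nodes could explain the timeout even when the credential
checks out.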

On Wed, May 17, 2017 at 12:24 AM John Hearns <hear...@googlemail.com> wrote:

> Ben, a stupid question, however - have you installed and configured Munge
> authentication on the slave node?
>
> On 17 May 2017 at 02:59, Ben Mann <b...@openai.com> wrote:
>
>> Hello Slurm dev,
>>
>> I just set up a small test cluster on two Ubuntu 14.04 machines and
>> installed SLURM 17.02 from source. I started slurmctld, slurmdbd and slurmd
>> on the master and just slurmd on the slave. When I run a job on two nodes,
>> it completes instantly on the master, but never on the slave.
>>
>> Here are my .conf files (on a NAS and symlinked from /usr/local/etc/), as
>> well as log files for the srun below:
>> https://gist.github.com/8enmann/0637ee2cbb6e6f5aaedef6b3c3f24a1d
>>
>> $ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      2   idle [91-92]
>>
>> $ srun -l hostname
>> 0: 91.cirrascale.sci.openai.org
>>
>> $ srun -l -N2 hostname
>> 0: 91.cirrascale.sci.openai.org
>> $ srun -N2 -l hostname
>> 0: 91.cirrascale.sci.openai.org
>> srun: error: timeout waiting for task launch, started 1 of 2 tasks
>> srun: Job step 36.0 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> $ squeue
>>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>      36     debug hostname      ben  R       8:42      2 [91-92]
>> $ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      2  alloc [91-92]
>>
>> I'm guessing I misconfigured something, but I don't see anything in the
>> logs suggesting what it might be. I've also tried cranking up verbosity and
>> didn't see anything. I know it's not recommended to use root to run
>> everything, but doesn't at least slurmd need root to manage cgroups?
>>
>> Thanks in advance!!
>> Ben
>>
>
>
