Turns out it was my firewall settings! When I disabled the firewall, the job
started working. Will investigate more.
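
If it turns out to be the usual suspects, the real fix is probably opening the
SLURM ports rather than leaving the firewall off. A rough sketch, assuming ufw
(the Ubuntu default), the stock ports (6817 for slurmctld, 6818 for slurmd),
the 172.18.0.0/16 range our nodes sit on, and an SrunPortRange pinned in
slurm.conf so srun's callback ports are predictable:

$ sudo ufw allow proto tcp from 172.18.0.0/16 to any port 6817:6818
$ sudo ufw allow proto tcp from 172.18.0.0/16 to any port 60001:63000
# 60001:63000 is just an example range and must match SrunPortRange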

On Wed, May 17, 2017 at 10:54 AM Ben Mann <b...@openai.com> wrote:

> Yep, and I also did the cross-node munge test suggested in the munge setup
> documentation. In the remote slurmd -D -vvvv output it appears to check
> out (Checking credential with 468 bytes of sig data), but the job still
> doesn't execute.
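>
> The check I ran is along the lines of the following, with 92 standing in for
> the slave's hostname; a clean decode on the remote side means the shared key
> is good:
>
> $ munge -n | ssh 92 unmunge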
>
> slurmd: debug2: got this type of message 6001
> slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
> slurmd: launch task 57.0 request from 4078.4078@172.18.2.44 (port 59078)
> slurmd: debug3: state for jobid 21: ctime:1494917363 revoked:0 expires:0
> slurmd: debug3: state for jobid 22: ctime:1494918387 revoked:0 expires:0
> slurmd: debug3: state for jobid 23: ctime:1494919023 revoked:0 expires:0
> slurmd: debug3: state for jobid 25: ctime:1494979611 revoked:0 expires:0
> slurmd: debug3: state for jobid 29: ctime:1494980262 revoked:0 expires:0
> slurmd: debug3: state for jobid 36: ctime:1494981612 revoked:0 expires:0
> slurmd: debug3: state for jobid 37: ctime:1494982415 revoked:0 expires:0
> slurmd: debug3: state for jobid 51: ctime:1494984067 revoked:0 expires:0
> slurmd: debug:  Checking credential with 468 bytes of sig data
> slurmd: debug:  Reading cgroup.conf file /usr/local/etc/cgroup.conf
> slurmd: _run_prolog: run job script took usec=6
> slurmd: _run_prolog: prolog with lock for job 57 ran for 0 seconds
> slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
> slurmd: debug3: slurmstepd rank 1 (92), parent rank 0 (91), children 0,
> depth 1, max_depth 1
> slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
> slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
> slurmd: debug2: Cached group access list for ben/4078
> slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
>
> On Wed, May 17, 2017 at 12:24 AM John Hearns <hear...@googlemail.com>
> wrote:
>
>> Ben, a stupid question, however - have you installed and configured Munge
>> authentication on the slave node?
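>>
>> A quick sanity check, assuming the packaged install paths (a source build may
>> put munge.key under a different prefix): confirm munged is actually running on
>> the slave and that the key has the same checksum on both nodes.
>>
>> $ service munge status
>> $ md5sum /etc/munge/munge.key   # should be identical on both nodes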
>>
>> On 17 May 2017 at 02:59, Ben Mann <b...@openai.com> wrote:
>>
>>> Hello Slurm dev,
>>>
>>> I just set up a small test cluster on two Ubuntu 14.04 machines with
>>> SLURM 17.02 installed from source. I started slurmctld, slurmdbd, and slurmd
>>> on the master and just slurmd on the slave. When I run a job across both
>>> nodes, it completes instantly on the master but never on the slave.
>>>
>>> Here are my .conf files, which live on a NAS and are symlinked from
>>> /usr/local/etc/, along with log files for the srun runs below:
>>> https://gist.github.com/8enmann/0637ee2cbb6e6f5aaedef6b3c3f24a1d
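>>>
>>> Concretely, the links on each node look something like this, where /nas/slurm
>>> stands in for the actual NAS mount:
>>>
>>> $ ln -s /nas/slurm/slurm.conf /usr/local/etc/slurm.conf
>>> $ ln -s /nas/slurm/cgroup.conf /usr/local/etc/cgroup.conf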
>>>
>>> $ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>> debug*       up   infinite      2   idle [91-92]
>>>
>>> $ srun -l hostname
>>> 0: 91.cirrascale.sci.openai.org
>>>
>>> $ srun -l -N2 hostname
>>> 0: 91.cirrascale.sci.openai.org
>>> $ srun -N2 -l hostname
>>> 0: 91.cirrascale.sci.openai.org
>>> srun: error: timeout waiting for task launch, started 1 of 2 tasks
>>> srun: Job step 36.0 aborted before step completely launched.
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> srun: error: Timed out waiting for job step to complete
>>>
>>> $ squeue
>>>   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
>>>      36     debug hostname  ben  R  8:42     2 [91-92]
>>> $ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>> debug*       up   infinite      2  alloc [91-92]
>>>
>>> I'm guessing I misconfigured something, but I don't see anything in the
>>> logs suggesting what it might be. I've also tried cranking up verbosity and
>>> didn't see anything. I know it's not recommended to use root to run
>>> everything, but doesn't at least slurmd need root to manage cgroups?
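>>>
>>> My understanding is that slurmctld can run as an unprivileged account via
>>> SlurmUser while slurmd stays root (SlurmdUser already defaults to root), so
>>> cgroup management keeps working. Roughly, in slurm.conf:
>>>
>>> SlurmUser=slurm   # slurmctld runs as this unprivileged account
>>> SlurmdUser=root   # the default; slurmd keeps root for cgroups
>>>
>>> where "slurm" is whatever unprivileged account gets created for it.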
>>>
>>> Thanks in advance!!
>>> Ben
>>>
>>
>>
