Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-21 Thread Dean Schulze
Thank you, thank you, thank you.  It was the firewall on CentOS 7.  Once I
disabled that, it worked.

For anyone else who runs into this issue, here is how to disable the
firewall on CentOS 7:

https://linuxize.com/post/how-to-stop-and-disable-firewalld-on-centos-7/
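
For the archives, the gist of that page is just (assuming firewalld is what is
running, as it is by default on CentOS 7):

sudo systemctl stop firewalld
sudo systemctl disable firewalld

A less drastic alternative would be to leave firewalld up and only open the
Slurm ports (6817 for slurmctld and 6818 for slurmd by default):

sudo firewall-cmd --permanent --add-port=6817-6818/tcp
sudo firewall-cmd --reload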



On Tue, Jan 21, 2020 at 7:24 AM Brian Johanson  wrote:

>
> On 1/21/2020 12:32 AM, Chris Samuel wrote:
> > On 20/1/20 3:00 pm, Dean Schulze wrote:
> >
> >> There's either a problem with the source code I cloned from github,
> >> or there is a problem when the controller runs on Ubuntu 19 and the
> >> node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to
> >> see if that solves the problem.
> >
> > I've run the master branch on a Cray XC without issues, and I concur
> > with what the others have said and suggest it's worth checking the
> > slurmd and slurmctld logs to find out why communications is not right
> > between them.
> >
> and if the logs do not have enough information, run the daemon in the
> foreground with increased verbosity
>
> slurmd -D -v -v -v
>
> As another said, check whether the connections are available with telnet
> from the server to the compute node, 'telnet node1 6818' (6818 is the
> default slurmd port), and the same from the compute node back to the server.
>
> Are these new host builds?  Is there a firewall enabled?  Kinda sounds
> like a firewall on the compute node that allows outbound (the initial
> connection to the slurmctld) but not new inbound (the slurmctld ping)
> connections.
>
> -b
>
>
>


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-21 Thread Brian Johanson



On 1/21/2020 12:32 AM, Chris Samuel wrote:

On 20/1/20 3:00 pm, Dean Schulze wrote:

There's either a problem with the source code I cloned from github, 
or there is a problem when the controller runs on Ubuntu 19 and the 
node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to 
see if that solves the problem.


I've run the master branch on a Cray XC without issues, and I concur 
with what the others have said and suggest it's worth checking the 
slurmd and slurmctld logs to find out why communications is not right 
between them.


and if the logs do not have enough information, run the daemon in the 
foreground with increased verbosity


slurmd -D -v -v -v

As another said, check whether the connections are available with telnet
from the server to the compute node, 'telnet node1 6818' (6818 is the
default slurmd port), and the same from the compute node back to the server.


Are these new host builds?  Is there a firewall enabled?  Kinda sounds
like a firewall on the compute node that allows outbound (the initial
connection to the slurmctld) but not new inbound (the slurmctld ping)
connections.


-b




Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Chris Samuel

On 20/1/20 3:00 pm, Dean Schulze wrote:

There's either a problem with the source code I cloned from github, or 
there is a problem when the controller runs on Ubuntu 19 and the node 
runs on CentOS 7.7.  I'm downgrading to a stable 19.05 build to see if 
that solves the problem.


I've run the master branch on a Cray XC without issues, and I concur 
with what the others have said and suggest it's worth checking the 
slurmd and slurmctld logs to find out why communications is not right 
between them.


Good luck,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Ryan Novosielski
The node is not getting the status from itself; it's querying the slurmctld to
ask for its status.
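
(A rough way to tell the two directions apart, from the compute node:

scontrol ping    # asks the slurmctld whether it is up, i.e. node -> controller

If that succeeds but sinfo still shows the asterisk, the failing direction is
the controller reaching slurmd on the node, port 6818 by default.)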

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jan 20, 2020, at 3:56 PM, Dean Schulze  wrote:
> 
> If I run sinfo on the node itself it shows an asterisk.  How can the node be 
> unreachable from itself?
> 
> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy  wrote:
> Hi,
> 
> The * next to the idle status in sinfo means that the node is unreachable/not 
> responding. Check the status of the slurmd on the node and check the 
> connectivity from the slurmctld host to the compute node (telnet may be 
> enough). You can also check the slurmctld logs for more information. 
> 
> Regards,
> Carlos
> 



Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
There's either a problem with the source code I cloned from github, or
there is a problem when the controller runs on Ubuntu 19 and the node runs
on CentOS 7.7.  I'm downgrading to a stable 19.05 build to see if that
solves the problem.

On Mon, Jan 20, 2020 at 3:41 PM Carlos Fenoy  wrote:

> It seems to me that the problem is between the slurmctld and slurmd. When
> slurmd starts, it sends a message to the slurmctld, which is why it appears
> idle. Every now and then the slurmctld will try to ping the slurmd to check
> if it's still alive. This ping doesn't seem to be working, so as I
> mentioned previously, check the slurmctld log and the connectivity between
> the slurmctld node and the slurmd node.
>


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Carlos Fenoy
It seems to me that the problem is between the slurmctld and slurmd. When
slurmd starts, it sends a message to the slurmctld, which is why it appears
idle. Every now and then the slurmctld will try to ping the slurmd to check
if it's still alive. This ping doesn't seem to be working, so as I
mentioned previously, check the slurmctld log and the connectivity between
the slurmctld node and the slurmd node.
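
For example, on the controller (the log path is whatever SlurmctldLogFile in
slurm.conf points at; /var/log/slurm/slurmctld.log is just a common choice):

grep liqidos-dean-node1 /var/log/slurm/slurmctld.log

If the periodic ping is failing, the node will typically show up there as not
responding.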

On Mon, 20 Jan 2020, 22:43 Brian Andrus,  wrote:

> Check the slurmd log file on the node.
>
> Ensure slurmd is still running. Sounds possible that OOM Killer or such
> may be killing slurmd
>
> Brian Andrus


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Brian Andrus

Check the slurmd log file on the node.

Ensure slurmd is still running. Sounds possible that OOM Killer or such 
may be killing slurmd.
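
A quick way to check both of those on the node (the slurmd log path depends on
SlurmdLogFile in slurm.conf; use 'pgrep -a slurmd' instead of systemctl if
slurmd isn't set up as a service):

systemctl status slurmd                          # is the daemon still running?
dmesg -T | grep -iE 'out of memory|oom-killer'   # any sign of the OOM killer?
tail -n 50 /var/log/slurm/slurmd.log             # adjust to your SlurmdLogFile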


Brian Andrus

On 1/20/2020 1:12 PM, Dean Schulze wrote:
If I restart slurmd the asterisk goes away.  Then I can run the job 
once and the asterisk is back, and the node remains in comp*:


[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  comp* liqidos-dean-node1

I can get it back to idle* with scontrol:

[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update 
NodeName=liqidos-dean-node1 State=down Reason=none
[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update 
NodeName=liqidos-dean-node1 State=resume

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1

I'm beginning to wonder if I got some bad code from github.




Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
If I restart slurmd the asterisk goes away.  Then I can run the job once
and the asterisk is back, and the node remains in comp*:

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1   idle liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  comp* liqidos-dean-node1

I can get it back to idle* with scontrol:

[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
NodeName=liqidos-dean-node1 State=down Reason=none
[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
NodeName=liqidos-dean-node1 State=resume
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  idle* liqidos-dean-node1

I'm beginning to wonder if I got some bad code from github.
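
(In case it helps anyone reading along: the controller's view of the node,
including its State and Reason fields, can be dumped with

scontrol show node liqidos-dean-node1

which may show why it is considered unreachable or stuck completing.)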


On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy  wrote:

> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the node and
> check the connectivity from the slurmctld host to the compute node (telnet
> may be enough). You can also check the slurmctld logs for more information.
>
> Regards,
> Carlos
>


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
If I run sinfo on the node itself it shows an asterisk.  How can the node
be unreachable from itself?

On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy  wrote:

> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the node and
> check the connectivity from the slurmctld host to the compute node (telnet
> may be enough). You can also check the slurmctld logs for more information.
>
> Regards,
> Carlos
>


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Carlos Fenoy
Hi,

The * next to the idle status in sinfo means that the node is
unreachable/not responding. Check the status of the slurmd on the node and
check the connectivity from the slurmctld host to the compute node (telnet
may be enough). You can also check the slurmctld logs for more information.
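
Concretely, something along these lines (6818 is the default SlurmdPort; the
node name is taken from the rest of the thread):

# on the compute node
systemctl status slurmd
# from the slurmctld host
telnet liqidos-dean-node1 6818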

Regards,
Carlos

-- 
--
Carles Fenoy


[slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
I've got a node running on CentOS 7.7, built from the recent 20.02.0pre1
code base.  Its behavior is strange, to say the least.

The controller was built from the same code base, but on Ubuntu 19.10.  The
controller reports the node's state with sinfo, but can't run a simple job
with srun because it thinks the node isn't available, even when it is
idle.  (And squeue shows an empty queue.)

On the controller:
$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 30 queued and waiting for resources
^Csrun: Job allocation 30 has been revoked
srun: Force Terminated job 30
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  idle* liqidos-dean-node1
$ squeue
  JOBID  PARTITION  USER  ST  TIME   NODES  NODELIST(REASON)


When I try to run the simple job on the node I get:

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  idle* liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 27 queued and waiting for resources
^Csrun: Job allocation 27 has been revoked
[liqid@liqidos-dean-node1 ~]$ squeue
  JOBID  PARTITION  USER  ST  TIME   NODES  NODELIST(REASON)
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 28 queued and waiting for resources
^Csrun: Job allocation 28 has been revoked
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  idle* liqidos-dean-node1

Apparently slurm thinks there are a bunch of jobs queued, but shows an
empty queue.  How do I get rid of these?

If these zombie jobs aren't the problem what else could be keeping this
from running?

Thanks.