Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Marcus Wagner

Hi Juergen,


THAT's a cool feature, never knew about that. But this leads to other 
questions.

Short excerpt:
2020-01-19T00:00:18  Modify Clusters   slurmadm name='rcc'     control_host='134.6+
2020-01-19T03:24:16  Modify Clusters   slurmadm name='rcc'     control_host='134.6+
2020-01-19T03:34:05  Modify Clusters   slurmadm name='rcc'     control_host='134.6+
2020-01-19T19:43:52        Add Users   slurmadm dj770060       admin_level=1
2020-01-19T19:43:52 Add Associations   slurmadm id_assoc=5181  mod_time=1579459432+
2020-01-19T19:43:53 Add Associations   slurmadm id_assoc=5182  mod_time=1579459433+
2020-01-19T19:44:25        Add Users   slurmadm ff912680       admin_level=1
2020-01-19T19:44:25 Add Associations   slurmadm id_assoc=5183  mod_time=1579459465+
2020-01-19T21:01:16     Add Accounts   slurmadm thes0725       description='thes07+
2020-01-19T21:01:16 Add Associations   slurmadm id_assoc=5184  mod_time=1579464076+
2020-01-19T21:01:17 Add Associations   slurmadm id_assoc=5185  mod_time=1579464077+
2020-01-19T22:52:04        Add Users   slurmadm tb258730       admin_level=1
2020-01-19T22:52:04 Add Associations   slurmadm id_assoc=5186  mod_time=1579470724+
2020-01-20T00:00:13  Modify Clusters   slurmadm name='rcc'     control_host='134.6+


I was astonished about the "Modify Clusters" transactions, so I looked a 
bit further:

$> sacctmgr list transactions Action="Modify Clusters" -p
2020-01-15T00:00:12|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|
2020-01-16T00:01:29|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.20', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|
2020-01-16T00:01:30|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|
2020-01-16T00:01:41|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|
2020-01-17T00:00:14|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|
2020-01-18T00:00:18|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|
2020-01-19T00:00:18|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|
2020-01-19T03:24:16|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|
2020-01-19T03:34:05|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|
2020-01-20T00:00:13|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19', control_port=6750, last_port=6750, rpc_version=8448, dimensions=1, plugin_id_select=101, flags=0|


It seems that nothing is actually modified. Do you have any idea why 
these entries are generated?


Best
Marcus




On 1/19/20 2:38 PM, Juergen Salk wrote:

* Ole Holm Nielsen  [200118 12:06]:


When we have created a new Slurm user with "sacctmgr create user name=xxx",
I would like to inquire at a later date about the timestamp for the user
creation.  As far as I can tell, the sacctmgr command cannot show such
timestamps.

Hi Ole,

for me (currently running Slurm version 19.05.2) the command

  sacctmgr list transactions Action="Add Users"

also shows timestamps. Isn't this what you are looking for?

Best regards
Jürgen



--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de




Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Juergen Salk
* Marcus Wagner  [200120 09:17]:

> I was astonished about the "Modify Clusters" transactions, so I looked a bit
> further:
> $> sacctmgr list transactions Action="Modify Clusters" -p
> 2020-01-15T00:00:12|Modify
> Clusters|slurmadm|name='rcc'|control_host='134.61.193.19',
> control_port=6750, last_port=6750, rpc_version=8448, dimensions=1,
> plugin_id_select=101, flags=0|
> 2020-01-16T00:01:29|Modify
> Clusters|slurmadm|name='rcc'|control_host='134.61.193.20',
> control_port=6750, last_port=6750, rpc_version=8448, dimensions=1,
> [...] 
> 
> It seems that nothing is actually modified. Do you have any idea why these
> entries are generated?

Hi Marcus,

I always get these "Modify Clusters" entries whenever slurmctld is 
restarted. However, I am not sure if this is the only event that 
triggers this entry.
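
If you want to check whether that is the case on your cluster too, one
rough way (just a sketch; the unit name and log path depend on your
setup) is to compare the transaction times with the controller's restart
times:

  sacctmgr -nP list transactions Action="Modify Clusters" format=time
  journalctl -u slurmctld | grep -i start
  grep -i start /var/log/slurm/slurmctld.log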

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471





Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen

Hi Jürgen,

On 1/19/20 2:38 PM, Juergen Salk wrote:

* Ole Holm Nielsen  [200118 12:06]:


When we have created a new Slurm user with "sacctmgr create user name=xxx",
I would like to inquire at a later date about the timestamp for the user
creation.  As far as I can tell, the sacctmgr command cannot show such
timestamps.


Hi Ole,

for me (currently running Slurm version 19.05.2) the command

  sacctmgr list transactions Action="Add Users"

also shows timestamps. Isn't this what you are looking for?


This is indeed a hidden feature of Slurm!  The sacctmgr man-page has a 
single example:


sacctmgr list transactions StartTime=11/03\-10:30:00 
format=Timestamp,Action,Actor


This example is actually incorrect: StartTime should be Start and the 
backslash \ should be removed.  Here is a correct example:


sacctmgr list transactions Action="Add Users" Start=01/01 format=where,time

This "sacctmgr list transactions" feature should be usable for my task at 
hand.  If you want to know which users were added within the last 30 days, 
you "just" have to calculate the appropriate Start time = (Now - 30 days) 
for the above command.
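
For example (a minimal sketch, assuming GNU date as in the examples
later in this thread):

  sacctmgr list transactions Action="Add Users" \
      Start=$(date -d "-30 days" +%m/%d/%y) format=where,time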


One caveat is that users may have been added and then removed, but one can 
list removed users by:


sacctmgr list transactions Action="Remove Users" Start=01/01 format=where,time

There is a different way of reading the entire list of users directly from 
the Slurm database with a MySQL command, and I will post this method in a 
separate mail.


Thanks,
Ole



Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen
Thanks to input from Niels Carl W. Hansen  I have been 
able to write a new slurmusertable tool available from

https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmaccounts

This slurmusertable tool reads your current Slurm database user_table and 
prints out a list of usernames, creation time, and modification time.


Read access to the Slurm MySQL database is required, so the appropriate 
MySQL user and hostname must be configured in the MySQL server, see 
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#set-up-mariadb-database
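
For reference, the query is conceptually along these lines; this is only
a rough sketch that assumes the default database name slurm_acct_db and
the user_table columns of recent Slurm versions, and the actual tool may
do it differently:

  mysql -u slurm -p slurm_acct_db -e "
    SELECT name,
           FROM_UNIXTIME(creation_time) AS created,
           FROM_UNIXTIME(mod_time)      AS modified
    FROM user_table
    WHERE deleted = 0
    ORDER BY creation_time;"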


Usage:

slurmusertable [username(s)]

Comments and suggestions for improvement are welcome!

/Ole

On 1/18/20 12:06 PM, Ole Holm Nielsen wrote:
When we have created a new Slurm user with "sacctmgr create user 
name=xxx", I would like inquire at a later date about the timestamp for 
the user creation.  As far as I can tell, the sacctmgr command cannot show 
such timestamps.


I assume that the Slurm database contains the desired timestamp(?), so 
does anyone know how to print it for a named user?


The reason why I am interested in the user timestamp is that we repeatedly 
see novice users (students etc.) submitting lots of jobs which are broken 
or do not use resources correctly.  Since new users' jobs will have a high 
Slurm priority due to our Fairshare configuration, they quickly waste a 
lot of resources which could be used much more productively by experienced 
users.


My idea is to give new users some rather low resource limits (fairshare 
GrpTRES GrpTRESMins MaxTRES MaxTRESPerNode MaxTRESMins GrpTRESRunMins QOS 
DefaultQOS MaxJobsAccrue GrpJobsAccrue) initially.  After a test period of 
a few weeks or months, I could increase their limits to our normal default 
values.  If I had a timestamp for the user creation, I could use my 
scripts to automatically update the user limits after the test period had 
expired.
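
Purely as an illustration of that idea (not a finished script: the
90-day period and the limits below are made-up placeholders, and it
assumes the "Add Users" transaction time is a good proxy for the
creation time):

  #!/bin/bash
  # Sketch: raise limits for users whose "Add Users" transaction is
  # older than $DAYS days.
  DAYS=90
  CUTOFF=$(date -d "-${DAYS} days" +%s)

  sacctmgr -nP list transactions Action="Add Users" format=time,where |
  while IFS='|' read -r ts user; do
      [ -n "$user" ] || continue
      [ "$(date -d "$ts" +%s)" -le "$CUTOFF" ] || continue
      # -i commits without prompting; the limits here are examples only
      sacctmgr -i modify user where name="$user" set MaxJobsAccrue=20 QOS=normal
  done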


I have thought of recording user creation timestamps outside of Slurm by 
creating a stop-file in the filesystem whenever running "sacctmgr create 
user name=xxx".  This would probably do the work, but I prefer to use only 
the Slurm database information whenever possible.


Does anyone have other ideas about how to accomplish such a setup?




Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen

Hi,

I have been exploring how to list Slurm "Add Users" transactions for a 
specified number of days/weeks/months into the past.  The "date" command 
is very flexible in printing days in the past.


Here are some examples:

# sacctmgr list transactions Action="Add Users" Start=`date -d "-1 month" 
+%m/%d/%y`
               Time           Action      Actor        Where                 Info
------------------- ---------------- ---------- ------------ --------------------
2020-01-07T08:50:46        Add Users       root          aaa        admin_level=1
2020-01-07T15:14:37        Add Users       root          bbb        admin_level=1
2020-01-09T11:15:17        Add Users       root          ccc        admin_level=1
2020-01-13T12:59:01        Add Users       root          ddd        admin_level=1
2020-01-15T10:00:01        Add Users       root          eee        admin_level=1


and so on with varying periods:

# sacctmgr list transactions Action="Add Users" Start=`date -d "-2 month" 
+%m/%d/%y`


# sacctmgr list transactions Action="Add Users" Start=`date -d "-45 days" 
+%m/%d/%y`


I think this pretty nicely gives us the flexibility for listing 
transactions during some period into the past.
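
One could even wrap this in a tiny shell helper (a sketch using the same
commands as above):

  recent_users() {
      # usage: recent_users "30 days"   or   recent_users "2 months"
      sacctmgr list transactions Action="Add Users" \
          Start=$(date -d "-$1" +%m/%d/%y) format=time,where
  }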


/Ole

On 1/20/20 11:29 AM, Ole Holm Nielsen wrote:

Hi Jürgen,

On 1/19/20 2:38 PM, Juergen Salk wrote:

* Ole Holm Nielsen  [200118 12:06]:

When we have created a new Slurm user with "sacctmgr create user 
name=xxx",

I would like to inquire at a later date about the timestamp for the user
creation.  As far as I can tell, the sacctmgr command cannot show such
timestamps.


Hi Ole,

for me (currently running Slurm version 19.05.2) the command

  sacctmgr list transactions Action="Add Users"

also shows timestamps. Isn't this what you are looking for?


This is indeed a hidden feature of Slurm!  The sacctmgr man-page has a 
single example:


sacctmgr list transactions StartTime=11/03\-10:30:00 
format=Timestamp,Action,Actor


This example is actually incorrect: StartTime should be Start and the 
backslash \ should be removed.  Here is a correct example:


sacctmgr list transactions Action="Add Users" Start=01/01 format=where,time

This "sacctmgr list transactions" feature should be usable for my task at 
hand.  If you want to know which users were added within the last 30 days, 
you "just" have to calculate the appropriate Start time = (Now - 30 days) 
for the above command.


One caveat is that users may have been added and then removed, but one can 
list removed users by:


sacctmgr list transactions Action="Remove Users" Start=01/01 
format=where,time


There is a different way of reading the entire list of users directly from 
the Slurm database with a MySQL command, and I will post this method in a 
separate mail.




Re: [slurm-users] Job completed but child process still running

2020-01-20 Thread Youssef Eldakar
Thanks for the pointer to proctrack/cgroup!

On Mon, Jan 13, 2020 at 6:46 PM Juergen Salk 
wrote:

>
> Are you saying that there is absolutely no need to take care
> of potential leftover/stray processes in the epilog script any
> more with proctrack/cgroup enabled?
>

I tried it, and proctrack/cgroup indeed makes sure no process remains after
the job is terminated without the need for additional handling in the
epilog.

I am now working on the job script to have it wait for the Java process to
exit.
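
In case it is useful to anyone else, the pattern I am trying is roughly
this (a minimal sketch; the jar name is just a placeholder):

  #!/bin/bash
  #SBATCH --job-name=java-job

  # start the Java process in the background and block until it exits,
  # so the batch script (and the job) does not finish before it does
  java -jar myapp.jar &
  wait $!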

Thanks again for the help.

Youssef Eldakar
Bibliotheca Alexandrina


[slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Robert Kudyba
I've posted about this previously here and here, so I'm trying to get to
the bottom of this once and for all, and even got this comment previously:

our problem here is that the configuration for the nodes in question have
> an incorrect amount of memory set for them. Looks like you have it set in
> bytes instead of megabytes
> In your slurm.conf you should look at the RealMemory setting:
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The default
> value is 1.
> I would suggest RealMemory=191879 , where I suspect you have
> RealMemory=196489092


Now the slurmctld logs show this:

[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size
(191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node002:
Invalid argument
[2020-01-20T13:22:48.256] error: Node node001 has low real_memory size
(191846 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node001:
Invalid argument
[2020-01-20T13:22:48.256] error: Node node003 has low real_memory size
(191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node003:
Invalid argument

Here's the setting in slurm.conf:
/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
Gres=gpu:1
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 PreemptM$

sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain

sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain

[2020-01-20T12:50:51.034] error: Node node003 has low real_memory size
(191840 < 196489092)
[2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration node=node003:
Invalid argument

/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
Gres=gpu:1
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 PreemptM$

pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node001: Thread(s) per core:    1
node001: Core(s) per socket:    12
node001: Socket(s):             2
node002: Thread(s) per core:    1
node002: Core(s) per socket:    12
node002: Socket(s):             2
node003: Thread(s) per core:    2
node003: Core(s) per socket:    12
node003: Socket(s):             2

module load cmsh
[root@ciscluster kudyba]# cmsh
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type         Name                     Nodes
------------ ------------------------ ----------------
Slurm        defq                     node001..node003
Slurm        gpuq

use defq
[ciscluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:1
   NodeAddr=node001 NodeHostName=node001 Version=17.11
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
   Partitions=defq
   BootTime=2019-07-18T12:08:42 SlurmdStartTime=2020-01-17T21:34:15
   CfgTRES=cpu=24,mem=196489092M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2020-01-20T13:22:48]

sinfo -R
REASON   USER  TIMESTAMP   NODELIST
Low RealMemory   sl

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Brian Andrus

Try using "nodename=node003" in the slurm.conf on your nodes.

Also, make sure the slurm.conf on the nodes is the same as on the head.

Somewhere in there, you have "node=node003" (as well as the other node 
names).


That may even do it, as they may be trying to register generically, so 
their configs are not getting matched to the specific info in your main 
config.
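
A quick way to verify the configs really do match (just a sketch, using
pdsh as elsewhere in this thread):

  md5sum /etc/slurm/slurm.conf
  pdsh -w node00[1-3] md5sum /etc/slurm/slurm.conf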


Brian Andrus


On 1/20/2020 10:37 AM, Robert Kudyba wrote:
I've posted about this previously here and here, so I'm trying to get 
to the bottom of this once and for all, and even got this comment 
previously:


our problem here is that the configuration for the nodes in
question have an incorrect amount of memory set for them. Looks
like you have it set in bytes instead of megabytes
In your slurm.conf you should look at the RealMemory setting:
RealMemory
Size of real memory on the node in megabytes (e.g. "2048"). The
default value is 1.
I would suggest RealMemory=191879 , where I suspect you have
RealMemory=196489092


Now the slurmctld logs show this:

[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size 
(191840 < 196489092)

[2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration 
node=node002: Invalid argument
[2020-01-20T13:22:48.256] error: Node node001 has low real_memory size 
(191846 < 196489092)

[2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration 
node=node001: Invalid argument
[2020-01-20T13:22:48.256] error: Node node003 has low real_memory size 
(191840 < 196489092)

[2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration 
node=node003: Invalid argument


Here's the setting in slurm.conf:
/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 
Sockets=2 Gres=gpu:1

# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO 
Hidden=NO Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO 
Hidden=NO Shared=NO GraceTime=0 PreemptM$


sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain

sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain

[2020-01-20T12:50:51.034] error: Node node003 has low real_memory size 
(191840 < 196489092)
[2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration 
node=node003: Invalid argument


/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 
Sockets=2 Gres=gpu:1

# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO 
Hidden=NO Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO 
Hidden=NO Shared=NO GraceTime=0 PreemptM$


pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node001: Thread(s) per core:    1
node001: Core(s) per socket:    12
node001: Socket(s):             2
node002: Thread(s) per core:    1
node002: Core(s) per socket:    12
node002: Socket(s):             2
node003: Thread(s) per core:    2
node003: Core(s) per socket:    12
node003: Socket(s):             2

module load cmsh
[root@ciscluster kudyba]# cmsh
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type         Name                     Nodes
  


Slurm        defq                     node001..node003
Slurm        gpuq

use defq
[ciscluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:1
   NodeAddr=node001 NodeHostName=node001 Version=17.11
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
   State

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Robert Kudyba
We are on a Bright Cluster and their support says the head node controls
this. Here you can see the sym links:

[root@node001 ~]# file /etc/slurm/slurm.conf
/etc/slurm/slurm.conf: symbolic link to
`/cm/shared/apps/slurm/var/etc/slurm.conf'

[root@ourcluster myuser]# file /etc/slurm/slurm.conf
/etc/slurm/slurm.conf: symbolic link to
`/cm/shared/apps/slurm/var/etc/slurm.conf'

 ls -l  /etc/slurm/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30  2018 /etc/slurm/slurm.conf ->
/cm/shared/apps/slurm/var/etc/slurm.conf
[root@ourcluster myuser]# ssh node001
Last login: Mon Jan 20 14:02:00 2020
[root@node001 ~]# ls -l  /etc/slurm/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30  2018 /etc/slurm/slurm.conf ->
/cm/shared/apps/slurm/var/etc/slurm.conf

On Mon, Jan 20, 2020 at 1:52 PM Brian Andrus  wrote:

> Try using "nodename=node003" in the slurm.conf on your nodes.
>
> Also, make sure the slurm.conf on the nodes is the same as on the head.
>
> Somewhere in there, you have "node=node003" (as well as the other nodes
> names).
>
> That may even do it, as they may be trying to register generically, so
> their configs are not getting matched to the specific info in your main
> config
>
> Brian Andrus
>
>
> On 1/20/2020 10:37 AM, Robert Kudyba wrote:
>
> I've posted about this previously here and here, so I'm trying to get to
> the bottom of this once and for all, and even got this comment previously:
>
> our problem here is that the configuration for the nodes in question have
>> an incorrect amount of memory set for them. Looks like you have it set in
>> bytes instead of megabytes
>> In your slurm.conf you should look at the RealMemory setting:
>> RealMemory
>> Size of real memory on the node in megabytes (e.g. "2048"). The default
>> value is 1.
>> I would suggest RealMemory=191879 , where I suspect you have
>> RealMemory=196489092
>
>
> Now the slurmctld logs show this:
>
> [2020-01-20T13:22:48.256] error: Node node002 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node002: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node001 has low real_memory size
> (191846 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node001: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
>
> Here's the setting in slurm.conf:
> /etc/slurm/slurm.conf
> # Nodes
> NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
> Gres=gpu:1
> # Partitions
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
> Shared=NO GraceTime=0 Preempt$
> PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
> Shared=NO GraceTime=0 PreemptM$
>
> sinfo -N
> NODELIST   NODES PARTITION STATE
> node001        1     defq* drain
> node002        1     defq* drain
> node003        1     defq* drain
>
> sinfo -N
> NODELIST   NODES PARTITION STATE
> node001        1     defq* drain
> node002        1     defq* drain
> node003        1     defq* drain
>
> [2020-01-20T12:50:51.034] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
>
> /etc/slurm/slurm.c

[slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
I've got a node running on CentOS 7.7, built from the recent 20.02.0pre1
code base.  Its behavior is strange, to say the least.

The controller was built from the same code base, but on Ubuntu 19.10.  The
controller reports the node's state with sinfo, but can't run a simple job
with srun because it thinks the node isn't available, even when it is
idle.  (And squeue shows an empty queue.)

On the controller:
$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 30 queued and waiting for resources
^Csrun: Job allocation 30 has been revoked
srun: Force Terminated job 30
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  idle* liqidos-dean-node1
$ squeue
             JOBID  PARTITION      USER  ST  TIME   NODES
NODELIST(REASON)


When I try to run the simple job on the node I get:

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  idle* liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 27 queued and waiting for resources
^Csrun: Job allocation 27 has been revoked
[liqid@liqidos-dean-node1 ~]$ squeue
             JOBID  PARTITION      USER  ST  TIME   NODES
NODELIST(REASON)
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 28 queued and waiting for resources
^Csrun: Job allocation 28 has been revoked
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  idle* liqidos-dean-node1

Apparently slurm thinks there are a bunch of jobs queued, but shows an
empty queue.  How do I get rid of these?

If these zombie jobs aren't the problem what else could be keeping this
from running?

Thanks.


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Carlos Fenoy
Hi,

The * next to the idle status in sinfo means that the node is
unreachable/not responding. Check the status of the slurmd on the node and
check the connectivity from the slurmctld host to the compute node (telnet
may be enough). You can also check the slurmctld logs for more information.

Regards,
Carlos

On Mon, 20 Jan 2020 at 21:04, Dean Schulze  wrote:

> I've got a node running on CentOS 7.7 build from the recent 20.02.0pre1
> code base.  It's behavior is strange to say the least.
>
> The controller was built from the same code base, but on Ubuntu 19.10.
> The controller reports the nodes state with sinfo, but can't run a simple
> job with srun because it thinks the node isn't available, even when it is
> idle.  (And squeue shows an empty queue.)
>
> On the controller:
> $ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 30 queued and waiting for resources
> ^Csrun: Job allocation 30 has been revoked
> srun: Force Terminated job 30
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*   up   infinite  1  idle* liqidos-dean-node1
> $ squeue
>  JOBID  PARTITION  USER  STTIME   NODES
> NODELIST(REASON)
>
>
> When I try to run the simple job on the node I get:
>
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*   up   infinite  1  idle* liqidos-dean-node1
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 27 queued and waiting for resources
> ^Csrun: Job allocation 27 has been revoked
> [liqid@liqidos-dean-node1 ~]$ squeue
>  JOBID  PARTITION  USER  STTIME   NODES
> NODELIST(REASON)
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 28 queued and waiting for resources
> ^Csrun: Job allocation 28 has been revoked
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*   up   infinite  1  idle* liqidos-dean-node1
>
> Apparently slurm thinks there are a bunch of jobs queued, but shows an
> empty queue.  How do I get rid of these?
>
> If these zombie jobs aren't the problem what else could be keeping this
> from running?
>
> Thanks.
>
-- 
--
Carles Fenoy


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
If I run sinfo on the node itself it shows an asterisk.  How can the node
be unreachable from itself?

On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy  wrote:

> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the node and
> check the connectivity from the slurmctld host to the compute node (telnet
> may be enough). You can also check the slurmctld logs for more information.
>
> Regards,
> Carlos
>
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze 
> wrote:
>
>> I've got a node running on CentOS 7.7 build from the recent 20.02.0pre1
>> code base.  It's behavior is strange to say the least.
>>
>> The controller was built from the same code base, but on Ubuntu 19.10.
>> The controller reports the nodes state with sinfo, but can't run a simple
>> job with srun because it thinks the node isn't available, even when it is
>> idle.  (And squeue shows an empty queue.)
>>
>> On the controller:
>> $ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 30 queued and waiting for resources
>> ^Csrun: Job allocation 30 has been revoked
>> srun: Force Terminated job 30
>> $ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*   up   infinite  1  idle* liqidos-dean-node1
>> $ squeue
>>  JOBID  PARTITION  USER  STTIME   NODES
>> NODELIST(REASON)
>>
>>
>> When I try to run the simple job on the node I get:
>>
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*   up   infinite  1  idle* liqidos-dean-node1
>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 27 queued and waiting for resources
>> ^Csrun: Job allocation 27 has been revoked
>> [liqid@liqidos-dean-node1 ~]$ squeue
>>  JOBID  PARTITION  USER  STTIME   NODES
>> NODELIST(REASON)
>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 28 queued and waiting for resources
>> ^Csrun: Job allocation 28 has been revoked
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*   up   infinite  1  idle* liqidos-dean-node1
>>
>> Apparently slurm thinks there are a bunch of jobs queued, but shows an
>> empty queue.  How do I get rid of these?
>>
>> If these zombie jobs aren't the problem what else could be keeping this
>> from running?
>>
>> Thanks.
>>
> --
> --
> Carles Fenoy
>


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
If I restart slurmd the asterisk goes away.  Then I can run the job once
and the asterisk is back, and the node remains in comp*:

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1   idle liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  comp* liqidos-dean-node1

I can get it back to idle* with scontrol:

[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
NodeName=liqidos-dean-node1 State=down Reason=none
[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
NodeName=liqidos-dean-node1 State=resume
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  idle* liqidos-dean-node1

I'm beginning to wonder if I got some bad code from github.


On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy  wrote:

> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the node and
> check the connectivity from the slurmctld host to the compute node (telnet
> may be enough). You can also check the slurmctld logs for more information.
>
> Regards,
> Carlos
>
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze 
> wrote:
>
>> I've got a node running on CentOS 7.7 build from the recent 20.02.0pre1
>> code base.  It's behavior is strange to say the least.
>>
>> The controller was built from the same code base, but on Ubuntu 19.10.
>> The controller reports the nodes state with sinfo, but can't run a simple
>> job with srun because it thinks the node isn't available, even when it is
>> idle.  (And squeue shows an empty queue.)
>>
>> On the controller:
>> $ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 30 queued and waiting for resources
>> ^Csrun: Job allocation 30 has been revoked
>> srun: Force Terminated job 30
>> $ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*   up   infinite  1  idle* liqidos-dean-node1
>> $ squeue
>>  JOBID  PARTITION  USER  STTIME   NODES
>> NODELIST(REASON)
>>
>>
>> When I try to run the simple job on the node I get:
>>
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*   up   infinite  1  idle* liqidos-dean-node1
>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 27 queued and waiting for resources
>> ^Csrun: Job allocation 27 has been revoked
>> [liqid@liqidos-dean-node1 ~]$ squeue
>>  JOBID  PARTITION  USER  STTIME   NODES
>> NODELIST(REASON)
>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 28 queued and waiting for resources
>> ^Csrun: Job allocation 28 has been revoked
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*   up   infinite  1  idle* liqidos-dean-node1
>>
>> Apparently slurm thinks there are a bunch of jobs queued, but shows an
>> empty queue.  How do I get rid of these?
>>
>> If these zombie jobs aren't the problem what else could be keeping this
>> from running?
>>
>> Thanks.
>>
> --
> --
> Carles Fenoy
>


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Brian Andrus

Check the slurmd log file on the node.

Ensure slurmd is still running. Sounds possible that OOM Killer or such 
may be killing slurmd


Brian Andrus

On 1/20/2020 1:12 PM, Dean Schulze wrote:
If I restart slurmd the asterisk goes away.  Then I can run the job 
once and the asterisk is back, and the node remains in comp*:


[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  comp* liqidos-dean-node1

I can get it back to idle* with scontrol:

[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update 
NodeName=liqidos-dean-node1 State=down Reason=none
[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update 
NodeName=liqidos-dean-node1 State=resume

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1

I'm beginning to wonder if I got some bad code from github.


On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy > wrote:


Hi,

The * next to the idle status in sinfo means that the node is
unreachable/not responding. Check the status of the slurmd on the
node and check the connectivity from the slurmctld host to the
compute node (telnet may be enough). You can also check the
slurmctld logs for more information.

Regards,
Carlos

On Mon, 20 Jan 2020 at 21:04, Dean Schulze
mailto:dean.w.schu...@gmail.com>> wrote:

I've got a node running on CentOS 7.7 build from the recent
20.02.0pre1 code base.  It's behavior is strange to say the
least.

The controller was built from the same code base, but on
Ubuntu 19.10.  The controller reports the nodes state with
sinfo, but can't run a simple job with srun because it thinks
the node isn't available, even when it is idle.  (And squeue
shows an empty queue.)

On the controller:
$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 30 queued and waiting for resources
^Csrun: Job allocation 30 has been revoked
srun: Force Terminated job 30
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1
$ squeue
             JOBID  PARTITION      USER  ST  TIME   NODES
NODELIST(REASON)


When I try to run the simple job on the node I get:

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 27 queued and waiting for resources
^Csrun: Job allocation 27 has been revoked
[liqid@liqidos-dean-node1 ~]$ squeue
             JOBID  PARTITION      USER  ST  TIME   NODES
NODELIST(REASON)
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 28 queued and waiting for resources
^Csrun: Job allocation 28 has been revoked
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1

Apparently slurm thinks there are a bunch of jobs queued, but
shows an empty queue.  How do I get rid of these?

If these zombie jobs aren't the problem what else could be
keeping this from running?

Thanks.

-- 
--

Carles Fenoy



Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Carlos Fenoy
It seems to me that the problem is between the slurmctld and slurmd. When
slurmd starts, it sends a message to the slurmctld; that's why the node
appears idle. Every now and then the slurmctld will try to ping the slurmd
to check whether it's still alive. This ping doesn't seem to be working,
so, as I mentioned previously, check the slurmctld log and the connectivity
between the slurmctld node and the slurmd node.
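
For example, something along these lines (a sketch; 6818 is only the
default SlurmdPort, and the log path depends on SlurmdLogFile in your
slurm.conf):

  # from the slurmctld host
  scontrol show config | grep -i slurmdport
  nc -vz liqidos-dean-node1 6818      # or: telnet liqidos-dean-node1 6818

  # on the compute node
  systemctl status slurmd
  tail -n 50 /var/log/slurm/slurmd.log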

On Mon, 20 Jan 2020, 22:43 Brian Andrus,  wrote:

> Check the slurmd log file on the node.
>
> Ensure slurmd is still running. Sounds possible that OOM Killer or such
> may be killing slurmd
>
> Brian Andrus
> On 1/20/2020 1:12 PM, Dean Schulze wrote:
>
> If I restart slurmd the asterisk goes away.  Then I can run the job once
> and the asterisk is back, and the node remains in comp*:
>
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*   up   infinite  1   idle liqidos-dean-node1
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> liqidos-dean-node1
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*   up   infinite  1  comp* liqidos-dean-node1
>
> I can get it back to idle* with scontrol:
>
> [liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
> NodeName=liqidos-dean-node1 State=down Reason=none
> [liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
> NodeName=liqidos-dean-node1 State=resume
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*   up   infinite  1  idle* liqidos-dean-node1
>
> I'm beginning to wonder if I got some bad code from github.
>
>
> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy  wrote:
>
>> Hi,
>>
>> The * next to the idle status in sinfo means that the node is
>> unreachable/not responding. Check the status of the slurmd on the node and
>> check the connectivity from the slurmctld host to the compute node (telnet
>> may be enough). You can also check the slurmctld logs for more information.
>>
>> Regards,
>> Carlos
>>
>> On Mon, 20 Jan 2020 at 21:04, Dean Schulze 
>> wrote:
>>
>>> I've got a node running on CentOS 7.7 build from the recent 20.02.0pre1
>>> code base.  It's behavior is strange to say the least.
>>>
>>> The controller was built from the same code base, but on Ubuntu 19.10.
>>> The controller reports the nodes state with sinfo, but can't run a simple
>>> job with srun because it thinks the node isn't available, even when it is
>>> idle.  (And squeue shows an empty queue.)
>>>
>>> On the controller:
>>> $ srun -N 1 hostname
>>> srun: Required node not available (down, drained or reserved)
>>> srun: job 30 queued and waiting for resources
>>> ^Csrun: Job allocation 30 has been revoked
>>> srun: Force Terminated job 30
>>> $ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>> debug*   up   infinite  1  idle* liqidos-dean-node1
>>> $ squeue
>>>  JOBID  PARTITION  USER  STTIME   NODES
>>> NODELIST(REASON)
>>>
>>>
>>> When I try to run the simple job on the node I get:
>>>
>>> [liqid@liqidos-dean-node1 ~]$ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>> debug*   up   infinite  1  idle* liqidos-dean-node1
>>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>>> srun: Required node not available (down, drained or reserved)
>>> srun: job 27 queued and waiting for resources
>>> ^Csrun: Job allocation 27 has been revoked
>>> [liqid@liqidos-dean-node1 ~]$ squeue
>>>  JOBID  PARTITION  USER  STTIME   NODES
>>> NODELIST(REASON)
>>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>>> srun: Required node not available (down, drained or reserved)
>>> srun: job 28 queued and waiting for resources
>>> ^Csrun: Job allocation 28 has been revoked
>>> [liqid@liqidos-dean-node1 ~]$ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>> debug*   up   infinite  1  idle* liqidos-dean-node1
>>>
>>> Apparently slurm thinks there are a bunch of jobs queued, but shows an
>>> empty queue.  How do I get rid of these?
>>>
>>> If these zombie jobs aren't the problem what else could be keeping this
>>> from running?
>>>
>>> Thanks.
>>>
>> --
>> --
>> Carles Fenoy
>>
>


[slurm-users] Downgraded to slurm 19.05.4 and now slurmctld won't start because of incompatible state

2020-01-20 Thread Dean Schulze
This is what I get from systemctl status slurmctld:

fatal: Can not recover last_tres state, incompatible version, got 8960 need
>= 8192 <= 8704, start with '-i' to ignore this

Starting it with the -i option doesn't do anything.

Where does slurm store this state so I can get rid of it?

Thanks.


Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
There's either a problem with the source code I cloned from github, or
there is a problem when the controller runs on Ubuntu 19 and the node runs
on CentOS 7.7.  I'm downgrading to a stable 19.05 build to see if that
solves the problem.

On Mon, Jan 20, 2020 at 3:41 PM Carlos Fenoy  wrote:

> It seems to me that the problem is between the slurmctld and slurmd. When
> slurmd starts it sends a message to the slurmctld, that's why it appears
> idle. Every now and then the slurmctld will try to ping the slurmd to check
> if it's still alive. This ping doesn't seem to be working, so as I
> mentioned previously, check the slurmctld log and the connectivity between
> the slurmctld node and the slurmd node.
>
> On Mon, 20 Jan 2020, 22:43 Brian Andrus,  wrote:
>
>> Check the slurmd log file on the node.
>>
>> Ensure slurmd is still running. Sounds possible that OOM Killer or such
>> may be killing slurmd
>>
>> Brian Andrus
>> On 1/20/2020 1:12 PM, Dean Schulze wrote:
>>
>> If I restart slurmd the asterisk goes away.  Then I can run the job once
>> and the asterisk is back, and the node remains in comp*:
>>
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*   up   infinite  1   idle liqidos-dean-node1
>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>> liqidos-dean-node1
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*   up   infinite  1  comp* liqidos-dean-node1
>>
>> I can get it back to idle* with scontrol:
>>
>> [liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
>> NodeName=liqidos-dean-node1 State=down Reason=none
>> [liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
>> NodeName=liqidos-dean-node1 State=resume
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*   up   infinite  1  idle* liqidos-dean-node1
>>
>> I'm beginning to wonder if I got some bad code from github.
>>
>>
>> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy  wrote:
>>
>>> Hi,
>>>
>>> The * next to the idle status in sinfo means that the node is
>>> unreachable/not responding. Check the status of the slurmd on the node and
>>> check the connectivity from the slurmctld host to the compute node (telnet
>>> may be enough). You can also check the slurmctld logs for more information.
>>>
>>> Regards,
>>> Carlos
>>>
>>> On Mon, 20 Jan 2020 at 21:04, Dean Schulze 
>>> wrote:
>>>
 I've got a node running on CentOS 7.7 build from the recent 20.02.0pre1
 code base.  It's behavior is strange to say the least.

 The controller was built from the same code base, but on Ubuntu 19.10.
 The controller reports the nodes state with sinfo, but can't run a simple
 job with srun because it thinks the node isn't available, even when it is
 idle.  (And squeue shows an empty queue.)

 On the controller:
 $ srun -N 1 hostname
 srun: Required node not available (down, drained or reserved)
 srun: job 30 queued and waiting for resources
 ^Csrun: Job allocation 30 has been revoked
 srun: Force Terminated job 30
 $ sinfo
 PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 debug*   up   infinite  1  idle* liqidos-dean-node1
 $ squeue
  JOBID  PARTITION  USER  STTIME   NODES
 NODELIST(REASON)


 When I try to run the simple job on the node I get:

 [liqid@liqidos-dean-node1 ~]$ sinfo
 PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 debug*   up   infinite  1  idle* liqidos-dean-node1
 [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
 srun: Required node not available (down, drained or reserved)
 srun: job 27 queued and waiting for resources
 ^Csrun: Job allocation 27 has been revoked
 [liqid@liqidos-dean-node1 ~]$ squeue
  JOBID  PARTITION  USER  STTIME   NODES
 NODELIST(REASON)
 [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
 srun: Required node not available (down, drained or reserved)
 srun: job 28 queued and waiting for resources
 ^Csrun: Job allocation 28 has been revoked
 [liqid@liqidos-dean-node1 ~]$ sinfo
 PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 debug*   up   infinite  1  idle* liqidos-dean-node1

 Apparently slurm thinks there are a bunch of jobs queued, but shows an
 empty queue.  How do I get rid of these?

 If these zombie jobs aren't the problem what else could be keeping this
 from running?

 Thanks.

>>> --
>>> --
>>> Carles Fenoy
>>>
>>


Re: [slurm-users] Downgraded to slurm 19.05.4 and now slurmctld won't start because of incompatible state

2020-01-20 Thread Ryan Novosielski
Check slurm.conf for StateSaveLocation.

https://slurm.schedmd.com/slurm.conf.html
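
For example (a sketch; the paths are only illustrations, and moving the
state directory aside throws away the saved job and node state):

  grep -i statesavelocation /etc/slurm/slurm.conf
  systemctl stop slurmctld
  mv /var/spool/slurmctld /var/spool/slurmctld.bak   # your StateSaveLocation here
  systemctl start slurmctld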

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jan 20, 2020, at 5:58 PM, Dean Schulze  wrote:
> 
> 
> This is what I get from systemctl status slurmctld:
> 
> fatal: Can not recover last_tres state, incompatible version, got 8960 need 
> >= 8192 <= 8704, start with '-i' to ignore this
> 
> Starting it with the -i option doesn't do anything.
> 
> Where does slurm store this state so I can get rid of it?
> 
> Thanks.
> 




Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Ryan Novosielski
The node is not getting the status from itself, it’s querying the slurmctld to 
ask for its status.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jan 20, 2020, at 3:56 PM, Dean Schulze  wrote:
> 
> If I run sinfo on the node itself it shows an asterisk.  How can the node be 
> unreachable from itself?
> 
> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy  wrote:
> Hi,
> 
> The * next to the idle status in sinfo means that the node is unreachable/not 
> responding. Check the status of the slurmd on the node and check the 
> connectivity from the slurmctld host to the compute node (telnet may be 
> enough). You can also check the slurmctld logs for more information. 
> 
> Regards,
> Carlos
> 
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze  wrote:
> I've got a node running on CentOS 7.7 build from the recent 20.02.0pre1 code 
> base.  It's behavior is strange to say the least.
> 
> The controller was built from the same code base, but on Ubuntu 19.10.  The 
> controller reports the nodes state with sinfo, but can't run a simple job 
> with srun because it thinks the node isn't available, even when it is idle.  
> (And squeue shows an empty queue.)
> 
> On the controller:
> $ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 30 queued and waiting for resources
> ^Csrun: Job allocation 30 has been revoked
> srun: Force Terminated job 30
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
> debug*   up   infinite  1  idle* liqidos-dean-node1 
> $ squeue
>  JOBID  PARTITION  USER  STTIME   NODES 
> NODELIST(REASON) 
> 
> 
> When I try to run the simple job on the node I get:
> 
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
> debug*   up   infinite  1  idle* liqidos-dean-node1 
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 27 queued and waiting for resources
> ^Csrun: Job allocation 27 has been revoked
> [liqid@liqidos-dean-node1 ~]$ squeue
>  JOBID  PARTITION  USER  STTIME   NODES 
> NODELIST(REASON) 
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 28 queued and waiting for resources
> ^Csrun: Job allocation 28 has been revoked
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
> debug*   up   infinite  1  idle* liqidos-dean-node1 
> 
> Apparently slurm thinks there are a bunch of jobs queued, but shows an empty 
> queue.  How do I get rid of these?
> 
> If these zombie jobs aren't the problem what else could be keeping this from 
> running?
> 
> Thanks.
> -- 
> --
> Carles Fenoy



Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Chris Samuel

On 20/1/20 3:00 pm, Dean Schulze wrote:

There's either a problem with the source code I cloned from github, or 
there is a problem when the controller runs on Ubuntu 19 and the node 
runs on CentOS 7.7.  I'm downgrading to a stable 19.05 build to see if 
that solves the problem.


I've run the master branch on a Cray XC without issues, and I concur 
with what the others have said: it's worth checking the slurmd and 
slurmctld logs to find out why communication between them is not 
working.


Good luck,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Marcus Wagner

Dear Robert,

On 1/20/20 7:37 PM, Robert Kudyba wrote:
I've posted about this previously here and here, so I'm trying to get 
to the bottom of this once and for all, and even got this comment 
previously:


our problem here is that the configuration for the nodes in
question have an incorrect amount of memory set for them. Looks
like you have it set in bytes instead of megabytes
In your slurm.conf you should look at the RealMemory setting:
RealMemory
Size of real memory on the node in megabytes (e.g. "2048"). The
default value is 1.
I would suggest RealMemory=191879 , where I suspect you have
RealMemory=196489092



Are you sure your 24-core nodes have 187 TERABYTES of memory?

As you yourself cited:

Size of real memory on the node in megabytes

The settings in your slurm.conf:
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 
Sockets=2 Gres=gpu:1
So your machines would have to have 196489092 megabytes of memory, which 
is ~191884 gigabytes or ~187 terabytes.


Slurm believes these machines do NOT have that much memory:
[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size 
(191840 < 196489092)
It sees only 191840 megabytes, which is still less than 191884 (the value 
you get if you divide 196489092 by 1024). Since the available memory changes 
slightly from OS version to OS version, I would suggest setting RealMemory 
to less than 191840, e.g. 191800.

But Brian already told you to reduce the RealMemory:
I would suggest RealMemory=191879 , where I suspect you have 
RealMemory=196489092


If SLURM sees less memory than the configured RealMemory on a node, it 
drains the node, because a defective DIMM is assumed.
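
If it helps, one way to double-check which value to use (a sketch;
numbers differ per node):

  slurmd -C     # run on a compute node; prints the detected RealMemory in MB
  free -m

Then set RealMemory in slurm.conf slightly below the detected value,
e.g. NodeName=node[001-003] CoresPerSocket=12 RealMemory=191800 Sockets=2
Gres=gpu:1, and afterwards:

  scontrol reconfigure
  scontrol update NodeName=node[001-003] State=resume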


Best
Marcus



Now the slurmctld logs show this:

[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size 
(191840 < 196489092)

[2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration 
node=node002: Invalid argument
[2020-01-20T13:22:48.256] error: Node node001 has low real_memory size 
(191846 < 196489092)

[2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration 
node=node001: Invalid argument
[2020-01-20T13:22:48.256] error: Node node003 has low real_memory size 
(191840 < 196489092)

[2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration 
node=node003: Invalid argument


Here's the setting in slurm.conf:
/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 
Sockets=2 Gres=gpu:1

# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO 
Hidden=NO Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO 
Hidden=NO Shared=NO GraceTime=0 PreemptM$


sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain

sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain

[2020-01-20T12:50:51.034] error: Node node003 has low real_memory size 
(191840 < 196489092)
[2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration 
node=node003: Invalid argument


/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 
Sockets=2 Gres=gpu:1

# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO 
Hidden=NO Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO 
Hidden=NO Shared=NO GraceTime=0 PreemptM$


pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node001: Thread(s) per core:    1
node001: Core(s) per socket:    12
node001: Socket(s):             2
node002: Thread(s) per core:    1
node002: Core(s) per socket:    12
node002: Socket(s):             2
node003: Thread(s) per core:    2
node003: Core(s) per socket:    12
node003: Socket(s):             2

module load cmsh
[root@ciscluster kudyba]# cmsh
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type         Name                     Nodes
  
--