Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-12 Thread Taras Shapovalov
Hey Robert,

Ask Bright support; they will help you figure out what is going on there.

Best regards,
Taras

On Tue, Feb 11, 2020 at 8:26 PM Robert Kudyba  wrote:

> This is still happening. Nodes are being drained after a "Kill task failed" error.
> Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?
>
> [2020-02-11T12:21:26.005] update_node: node node001 reason set to: Kill
> task failed
> [2020-02-11T12:21:26.006] update_node: node node001 state set to DRAINING
> [2020-02-11T12:21:26.006] got (nil)
> [2020-02-11T12:21:26.015] error: slurmd error running JobId=1514 on
> node(s)=node001: Kill task failed
> [2020-02-11T12:21:26.015] _job_complete: JobID=1514 State=0x1 NodeCnt=1
> WEXITSTATUS 1
> [2020-02-11T12:21:26.015] email msg to sli...@fordham.edu: SLURM
> Job_id=1514 Name=run.sh Failed, Run time 00:02:21, NODE_FAIL, ExitCode 0
> [2020-02-11T12:21:26.016] _job_complete: requeue JobID=1514 State=0x8000
> NodeCnt=1 per user/system request
> [2020-02-11T12:21:26.016] _job_complete: JobID=1514 State=0x8000 NodeCnt=1
> done
> [2020-02-11T12:21:26.057] Requeuing JobID=1514 State=0x0 NodeCnt=0
> [2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x1 NodeCnt=1
> WEXITSTATUS 0
> [2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x8003 NodeCnt=1
> done
> [2020-02-11T12:21:52.111] _job_complete: JobID=1512 State=0x1 NodeCnt=1
> WEXITSTATUS 0
> [2020-02-11T12:21:52.112] _job_complete: JobID=1512 State=0x8003 NodeCnt=1
> done
> [2020-02-11T12:21:52.214] sched: Allocate JobID=1516 NodeList=node002
> #CPUs=1 Partition=defq
> [2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x1 NodeCnt=1
> WEXITSTATUS 0
> [2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x8003 NodeCnt=1
> done
>
> On Tue, Feb 11, 2020 at 11:54 AM Robert Kudyba 
> wrote:
>
>> Usually means you updated the slurm.conf but have not done "scontrol
>>> reconfigure" yet.
>>>
>> Well, it turns out it was something else related to a Bright Computing
>> setting. In case anyone finds this thread in the future:
>>
>> ourcluster->category[gpucategory]->roles]% use slurmclient
>> [ourcluster->category[gpucategory]->roles[slurmclient]]% show
>> ...
>> RealMemory 196489092
>> ...
>> [ ciscluster->category[gpucategory]->roles[slurmclient]]%
>>
>> Values are specified in MB, so this line says our node has roughly 196 TB
>> of RAM.
>>
>> I set this using cmsh:
>>
>> # cmsh
>> % category
>> % use gpucategory
>> % roles
>> % use slurmclient
>> % set realmemory 191846
>> % commit
>>
>> The value in /etc/slurm/slurm.conf was conflicting with this, especially
>> when restarting slurmctld.
>>
>> On 2/10/2020 8:55 AM, Robert Kudyba wrote:
>>>
>>> We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.
>>>
>>> We're getting the below errors when I restart the slurmctld service. The
>>> file appears to be the same on the head node and compute nodes:
>>> [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> -rw-r--r-- 1 root root 3477 Feb 10 11:05
>>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> [root@ourcluster ~]# ls -l  /cm/shared/apps/slurm/var/etc/slurm.conf
>>> /etc/slurm/slurm.conf
>>>
>>> -rw-r--r-- 1 root root 3477 Feb 10 11:05
>>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf ->
>>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> So what else could be causing this?
>>> [2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
>>> [2020-02-10T10:31:12.009] error: Node node001 appears to have a
>>> different slurm.conf than the slurmctld.  This could cause issues with
>>> communication and functionality.  Please review both files and make  sure
>>> they are the same.  If this is expected ignore, and set
>>> DebugFlags=NO_CONF_HASH in your slurm.conf.
>>> [2020-02-10T10:31:12.009] error: Node node001 has low real_memory size
>>> (191846 < 196489092)
>>> [2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration
>>> node=node001: Invalid argument
>>> [2020-02-10T10:31:12.011] error: Node node002 appears to have a
>>> different slurm.conf than the slurmctld.  This could cause issues with
>>> communication and functionality.  Please review both files and make sure
>>> they are the same.  If this is expected ignore, and set
>>> DebugFlags=NO_CONF_HASH in your slurm.conf.
>>> [2020-02-10T10:31:12.011] error: Node node002 has low real_memory size
>>> (191840 < 196489092)
>>> [2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration
>>> node=node002: Invalid argument
>>> [2020-02-10T10:31:12.047] error: Node node003 appears to have a
>>> different slurm.conf than the slurmctld.  This could cause issues with
>>> communication and functionality.  Please review both files and make sure
>>> they are the same.  If this is expected ignore, and set
>>> DebugFlags=NO_CONF_HASH in your slurm.conf.
>>> [2020-02-10T10:31:12.047] error: Node node003 has low real_memory size
>>> (191840 < 196489092)
>>> 

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-11 Thread Robert Kudyba
This is still happening. Nodes are being drained after a "Kill task failed" error.
Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?

[2020-02-11T12:21:26.005] update_node: node node001 reason set to: Kill
task failed
[2020-02-11T12:21:26.006] update_node: node node001 state set to DRAINING
[2020-02-11T12:21:26.006] got (nil)
[2020-02-11T12:21:26.015] error: slurmd error running JobId=1514 on
node(s)=node001: Kill task failed
[2020-02-11T12:21:26.015] _job_complete: JobID=1514 State=0x1 NodeCnt=1
WEXITSTATUS 1
[2020-02-11T12:21:26.015] email msg to sli...@fordham.edu: SLURM
Job_id=1514 Name=run.sh Failed, Run time 00:02:21, NODE_FAIL, ExitCode 0
[2020-02-11T12:21:26.016] _job_complete: requeue JobID=1514 State=0x8000
NodeCnt=1 per user/system request
[2020-02-11T12:21:26.016] _job_complete: JobID=1514 State=0x8000 NodeCnt=1
done
[2020-02-11T12:21:26.057] Requeuing JobID=1514 State=0x0 NodeCnt=0
[2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x1 NodeCnt=1
WEXITSTATUS 0
[2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x8003 NodeCnt=1
done
[2020-02-11T12:21:52.111] _job_complete: JobID=1512 State=0x1 NodeCnt=1
WEXITSTATUS 0
[2020-02-11T12:21:52.112] _job_complete: JobID=1512 State=0x8003 NodeCnt=1
done
[2020-02-11T12:21:52.214] sched: Allocate JobID=1516 NodeList=node002
#CPUs=1 Partition=defq
[2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x1 NodeCnt=1
WEXITSTATUS 0
[2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x8003 NodeCnt=1
done
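
For anyone who hits the same drain: the node stays in DRAINING until it is
cleared by hand. A minimal sketch, assuming the stuck processes on node001
are already gone:

# see why the node was drained
scontrol show node node001 | grep -i reason
# return it to service once it is healthy
scontrol update NodeName=node001 State=RESUME

If the tasks are merely slow to exit (NFS or GPU cleanup, for example),
raising UnkillableStepTimeout in slurm.conf above its 60-second default is
a commonly suggested workaround for this symptom, followed by another
"scontrol reconfigure".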

On Tue, Feb 11, 2020 at 11:54 AM Robert Kudyba  wrote:

> Usually means you updated the slurm.conf but have not done "scontrol
>> reconfigure" yet.
>>
> Well, it turns out it was something else related to a Bright Computing
> setting. In case anyone finds this thread in the future:
>
> ourcluster->category[gpucategory]->roles]% use slurmclient
> [ourcluster->category[gpucategory]->roles[slurmclient]]% show
> ...
> RealMemory 196489092
> ...
> [ ciscluster->category[gpucategory]->roles[slurmclient]]%
>
> Values are specified in MB, so this line says our node has roughly 196 TB
> of RAM.
>
> I set this using cmsh:
>
> # cmsh
> % category
> % use gpucategory
> % roles
> % use slurmclient
> % set realmemory 191846
> % commit
>
> The value in /etc/slurm/slurm.conf was conflicting with this, especially
> when restarting slurmctld.
>
> On 2/10/2020 8:55 AM, Robert Kudyba wrote:
>>
>> We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.
>>
>> We're getting the below errors when I restart the slurmctld service. The
>> file appears to be the same on the head node and compute nodes:
>> [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
>>
>> -rw-r--r-- 1 root root 3477 Feb 10 11:05
>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>
>> [root@ourcluster ~]# ls -l  /cm/shared/apps/slurm/var/etc/slurm.conf
>> /etc/slurm/slurm.conf
>>
>> -rw-r--r-- 1 root root 3477 Feb 10 11:05
>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>
>> lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf ->
>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>
>> So what else could be causing this?
>> [2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
>> [2020-02-10T10:31:12.009] error: Node node001 appears to have a different
>> slurm.conf than the slurmctld.  This could cause issues with communication
>> and functionality.  Please review both files and make  sure they are the
>> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
>> slurm.conf.
>> [2020-02-10T10:31:12.009] error: Node node001 has low real_memory size
>> (191846 < 196489092)
>> [2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration
>> node=node001: Invalid argument
>> [2020-02-10T10:31:12.011] error: Node node002 appears to have a different
>> slurm.conf than the slurmctld.  This could cause issues with communication
>> and functionality.  Please review both files and make sure they are the
>> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
>> slurm.conf.
>> [2020-02-10T10:31:12.011] error: Node node002 has low real_memory size
>> (191840 < 196489092)
>> [2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration
>> node=node002: Invalid argument
>> [2020-02-10T10:31:12.047] error: Node node003 appears to have a different
>> slurm.conf than the slurmctld.  This could cause issues with communication
>> and functionality.  Please review both files and make sure they are the
>> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
>> slurm.conf.
>> [2020-02-10T10:31:12.047] error: Node node003 has low real_memory size
>> (191840 < 196489092)
>> [2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
>> [2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
>> [2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration
>> node=node003: Invalid argument
>> [2020-02-10T10:32:08.026]
>> 

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-11 Thread Robert Kudyba
>
> Usually means you updated the slurm.conf but have not done "scontrol
> reconfigure" yet.
>
Well, it turns out it was something else related to a Bright Computing
setting. In case anyone finds this thread in the future:

ourcluster->category[gpucategory]->roles]% use slurmclient
[ourcluster->category[gpucategory]->roles[slurmclient]]% show
...
RealMemory 196489092
...
[ ciscluster->category[gpucategory]->roles[slurmclient]]%

Values are specified in MB, so this line says our node has roughly 196 TB
of RAM.

I set this using cmsh:

# cmsh
% category
% use gpucategory
% roles
% use slurmclient
% set realmemory 191846
% commit

The value in /etc/slurm/slurm.conf was conflicting with this, especially
when restarting slurmctld.
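
A quick way to cross-check the numbers end to end, as a sketch (the node
and file names are just the ones from this thread):

# on the compute node: what slurmd actually detects (RealMemory is in MB)
slurmd -C
# on the head node: what the controller currently believes
scontrol show node node001 | grep RealMemory
# and what slurm.conf declares, e.g. a line along the lines of
#   NodeName=node001 ... RealMemory=191846
grep -i realmemory /etc/slurm/slurm.conf

The RealMemory in slurm.conf must not exceed what slurmd -C reports on the
node, otherwise the "low real_memory size" registration error comes back.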

On 2/10/2020 8:55 AM, Robert Kudyba wrote:
>
> We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.
>
> We're getting the below errors when I restart the slurmctld service. The
> file appears to be the same on the head node and compute nodes:
> [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
>
> -rw-r--r-- 1 root root 3477 Feb 10 11:05
> /cm/shared/apps/slurm/var/etc/slurm.conf
>
> [root@ourcluster ~]# ls -l  /cm/shared/apps/slurm/var/etc/slurm.conf
> /etc/slurm/slurm.conf
>
> -rw-r--r-- 1 root root 3477 Feb 10 11:05
> /cm/shared/apps/slurm/var/etc/slurm.conf
>
> lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf ->
> /cm/shared/apps/slurm/var/etc/slurm.conf
>
> So what else could be causing this?
> [2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
> [2020-02-10T10:31:12.009] error: Node node001 appears to have a different
> slurm.conf than the slurmctld.  This could cause issues with communication
> and functionality.  Please review both files and make  sure they are the
> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
> [2020-02-10T10:31:12.009] error: Node node001 has low real_memory size
> (191846 < 196489092)
> [2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration
> node=node001: Invalid argument
> [2020-02-10T10:31:12.011] error: Node node002 appears to have a different
> slurm.conf than the slurmctld.  This could cause issues with communication
> and functionality.  Please review both files and make sure they are the
> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
> [2020-02-10T10:31:12.011] error: Node node002 has low real_memory size
> (191840 < 196489092)
> [2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration
> node=node002: Invalid argument
> [2020-02-10T10:31:12.047] error: Node node003 appears to have a different
> slurm.conf than the slurmctld.  This could cause issues with communication
> and functionality.  Please review both files and make sure they are the
> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
> [2020-02-10T10:31:12.047] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
> [2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
> [2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
> [2020-02-10T10:32:08.026]
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE from uid=0
> [2020-02-10T10:56:08.992] layouts: no layout to initialize
> [2020-02-10T10:56:08.992] restoring original state of nodes
> [2020-02-10T10:56:08.992] cons_res: select_p_node_init
> [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
> [2020-02-10T10:56:08.992] _preserve_plugins: backup_controller not
> specified
> [2020-02-10T10:56:08.992] cons_res: select_p_reconfigure
> [2020-02-10T10:56:08.992] cons_res: select_p_node_init
> [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
> [2020-02-10T10:56:08.992] No parameter for mcs plugin, default values set
> [2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand set.
> [2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller: completed
> usec=4369
> [2020-02-10T10:56:11.253]
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2020-02-10T10:56:18.645] update_node: node node001 reason set to: hung
> [2020-02-10T10:56:18.645] update_node: node node001 state set to DOWN
> [2020-02-10T10:56:18.645] got (nil)
> [2020-02-10T10:56:18.679] update_node: node node001 state set to IDLE
> [2020-02-10T10:56:18.679] got (nil)
> [2020-02-10T10:56:18.693] update_node: node node002 reason set to: hung
> [2020-02-10T10:56:18.693] update_node: node node002 state set to DOWN
> [2020-02-10T10:56:18.693] got (nil)
> [2020-02-10T10:56:18.711] update_node: node node002 state set to IDLE
> [2020-02-10T10:56:18.711] got (nil)
>
> And not 

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-10 Thread Brian Andrus
Usually means you updated the slurm.conf but have not done "scontrol 
reconfigure" yet.
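
A quick way to confirm the controller and the nodes really see the same
file, as a sketch using the hostnames from this thread:

md5sum /etc/slurm/slurm.conf              # on the head node
ssh node001 md5sum /etc/slurm/slurm.conf  # repeat for each compute node
scontrol reconfigure                      # tell the daemons to re-read it

If the hashes already match and the error persists, restarting slurmd on
the compute nodes forces them to pick up the file again.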



Brian Andrus

On 2/10/2020 8:55 AM, Robert Kudyba wrote:

We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.

We're getting the below errors when I restart the slurmctld service. 
The file appears to be the same on the head node and compute nodes:

[root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf

-rw-r--r-- 1 root root 3477 Feb 10 11:05 
/cm/shared/apps/slurm/var/etc/slurm.conf


[root@ourcluster ~]# ls -l  /cm/shared/apps/slurm/var/etc/slurm.conf 
/etc/slurm/slurm.conf


-rw-r--r-- 1 root root 3477 Feb 10 11:05 
/cm/shared/apps/slurm/var/etc/slurm.conf


lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf -> 
/cm/shared/apps/slurm/var/etc/slurm.conf


So what else could be causing this?
[2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:31:12.009] error: Node node001 appears to have a 
different slurm.conf than the slurmctld.  This could cause issues with 
communication and functionality.  Please review both files and make 
 sure they are the same.  If this is expected ignore, and set 
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-02-10T10:31:12.009] error: Node node001 has low real_memory size 
(191846 < 196489092)
[2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration 
node=node001: Invalid argument
[2020-02-10T10:31:12.011] error: Node node002 appears to have a 
different slurm.conf than the slurmctld.  This could cause issues with 
communication and functionality.  Please review both files and 
make sure they are the same.  If this is expected ignore, and set 
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-02-10T10:31:12.011] error: Node node002 has low real_memory size 
(191840 < 196489092)
[2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration 
node=node002: Invalid argument
[2020-02-10T10:31:12.047] error: Node node003 appears to have a 
different slurm.conf than the slurmctld.  This could cause issues with 
communication and functionality.  Please review both files and 
make sure they are the same.  If this is expected ignore, and set 
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-02-10T10:31:12.047] error: Node node003 has low real_memory size 
(191840 < 196489092)

[2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
[2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
[2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration 
node=node003: Invalid argument
[2020-02-10T10:32:08.026] 
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

[2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2020-02-10T10:56:08.992] layouts: no layout to initialize
[2020-02-10T10:56:08.992] restoring original state of nodes
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] _preserve_plugins: backup_controller not 
specified

[2020-02-10T10:56:08.992] cons_res: select_p_reconfigure
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] No parameter for mcs plugin, default values set
[2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller: completed 
usec=4369
[2020-02-10T10:56:11.253] 
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

[2020-02-10T10:56:18.645] update_node: node node001 reason set to: hung
[2020-02-10T10:56:18.645] update_node: node node001 state set to DOWN
[2020-02-10T10:56:18.645] got (nil)
[2020-02-10T10:56:18.679] update_node: node node001 state set to IDLE
[2020-02-10T10:56:18.679] got (nil)
[2020-02-10T10:56:18.693] update_node: node node002 reason set to: hung
[2020-02-10T10:56:18.693] update_node: node node002 state set to DOWN
[2020-02-10T10:56:18.693] got (nil)
[2020-02-10T10:56:18.711] update_node: node node002 state set to IDLE
[2020-02-10T10:56:18.711] got (nil)

And I'm not sure if this is related, but we're getting this "Kill task
failed" error and a node gets drained.


[2020-02-09T14:42:06.006] error: slurmd error running JobId=1465 on 
node(s)=node001: Kill task failed
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x1 
NodeCnt=1 WEXITSTATUS 1
[2020-02-09T14:42:06.006] email msg to ouru...@ourdomain.edu 
: SLURM Job_id=1465 Name=run.sh Failed, 
Run time 00:02:23, NODE_FAIL, ExitCode 0
[2020-02-09T14:42:06.006] _job_complete: requeue JobID=1465 
State=0x8000 NodeCnt=1 per user/system request
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x8000 
NodeCnt=1 done

[2020-02-09T14:42:06.017] Requeuing JobID=1465 State=0x0 NodeCnt=0
[2020-02-09T14:43:16.308]