Re: [slurm-users] How should I configure a node with Autodetect=nvml?

2020-02-10 Thread Chris Samuel
On Monday, 10 February 2020 12:11:30 PM PST Dean Schulze wrote:

> With this configuration I get this message every second in my slurmctld.log
> file:
> 
> error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument

What other errors are in the logs?

Could you check that you've got identical slurm.conf and gres.conf files 
everywhere?
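
For reference, a minimal sketch of a matching pair of files (the GPU count of 2 is an assumption, and AutoDetect=nvml requires slurmd to be built against the NVIDIA NVML library):

gres.conf (identical on every node):
   AutoDetect=nvml

slurm.conf (identical everywhere; other node attributes omitted):
   GresTypes=gpu
   NodeName=slurmnode1 Gres=gpu:2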

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-10 Thread Brian Andrus
That usually means you updated slurm.conf but have not yet run "scontrol 
reconfigure".



Brian Andrus


[slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-10 Thread Robert Kudyba
We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.

We're getting the errors below when I restart the slurmctld service. The
slurm.conf file appears to be the same on the head node and the compute nodes:
[root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
-rw-r--r-- 1 root root 3477 Feb 10 11:05 /cm/shared/apps/slurm/var/etc/slurm.conf

[root@ourcluster ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf /etc/slurm/slurm.conf
-rw-r--r-- 1 root root 3477 Feb 10 11:05 /cm/shared/apps/slurm/var/etc/slurm.conf
lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf -> /cm/shared/apps/slurm/var/etc/slurm.conf

So what else could be causing this?
[2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:31:12.009] error: Node node001 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make  sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2020-02-10T10:31:12.009] error: Node node001 has low real_memory size
(191846 < 196489092)
[2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration node=node001:
Invalid argument
[2020-02-10T10:31:12.011] error: Node node002 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2020-02-10T10:31:12.011] error: Node node002 has low real_memory size
(191840 < 196489092)
[2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration node=node002:
Invalid argument
[2020-02-10T10:31:12.047] error: Node node003 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2020-02-10T10:31:12.047] error: Node node003 has low real_memory size
(191840 < 196489092)
[2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
[2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
[2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration node=node003:
Invalid argument
[2020-02-10T10:32:08.026]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2020-02-10T10:56:08.992] layouts: no layout to initialize
[2020-02-10T10:56:08.992] restoring original state of nodes
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] _preserve_plugins: backup_controller not specified
[2020-02-10T10:56:08.992] cons_res: select_p_reconfigure
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] No parameter for mcs plugin, default values set
[2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller: completed
usec=4369
[2020-02-10T10:56:11.253]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-02-10T10:56:18.645] update_node: node node001 reason set to: hung
[2020-02-10T10:56:18.645] update_node: node node001 state set to DOWN
[2020-02-10T10:56:18.645] got (nil)
[2020-02-10T10:56:18.679] update_node: node node001 state set to IDLE
[2020-02-10T10:56:18.679] got (nil)
[2020-02-10T10:56:18.693] update_node: node node002 reason set to: hung
[2020-02-10T10:56:18.693] update_node: node node002 state set to DOWN
[2020-02-10T10:56:18.693] got (nil)
[2020-02-10T10:56:18.711] update_node: node node002 state set to IDLE
[2020-02-10T10:56:18.711] got (nil)
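
One note on the "low real_memory size" errors above: RealMemory in slurm.conf is given in megabytes, and the configured value of 196489092 is roughly 1024 times the ~191846 MB the nodes actually report, so it looks as if a kilobyte figure may have ended up in the node definitions. A hypothetical corrected line for comparison (other node attributes omitted; this is an assumption, not taken from the cluster's real config):

   NodeName=node[001-003] RealMemory=191840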

And I'm not sure if this is related, but we're also getting this "Kill task failed"
error and a node gets drained (see the note after the log excerpt below).

[2020-02-09T14:42:06.006] error: slurmd error running JobId=1465 on
node(s)=node001: Kill task failed
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x1 NodeCnt=1
WEXITSTATUS 1
[2020-02-09T14:42:06.006] email msg to ouru...@ourdomain.edu: SLURM
Job_id=1465 Name=run.sh Failed, Run time 00:02:23, NODE_FAIL, ExitCode 0
[2020-02-09T14:42:06.006] _job_complete: requeue JobID=1465 State=0x8000
NodeCnt=1 per user/system request
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x8000 NodeCnt=1
done
[2020-02-09T14:42:06.017] Requeuing JobID=1465 State=0x0 NodeCnt=0
[2020-02-09T14:43:16.308] backfill: Started JobID=1466 in defq on node003
[2020-02-09T14:43:17.054] prolog_running_decr: Configuration for JobID=1466
is complete
[2020-02-09T14:44:16.309] email msg to ouru...@ourdomain.edu:: SLURM
Job_id=1461 Name=run.sh 
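
For the "Kill task failed" drains, one setting that is often tuned for this symptom (not discussed in this thread, so treat it as a suggestion only) is UnkillableStepTimeout in slurm.conf, which controls how long slurmd waits for a job step's processes to die before declaring the kill failed; the value below is just an example:

   UnkillableStepTimeout=120   # seconds; the default is 60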

Re: [slurm-users] Which ports does slurm use?

2020-02-10 Thread Ole Holm Nielsen

Hi Dean,

Blocking ports with the Linux firewall and/or your network firewall 
(wired/Wi-Fi) would have the same effect:  Slurm won't work unless you 
open ports as specified in 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
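
On a node running firewalld, that typically looks something like the following (a sketch; it assumes the default Slurm ports 6817 for slurmctld and 6818 for slurmd, the same ports mentioned in the quoted message below):

   firewall-cmd --permanent --add-port=6817/tcp   # on the controller, for slurmctld
   firewall-cmd --permanent --add-port=6818/tcp   # on each compute node, for slurmd
   firewall-cmd --reload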


/Ole

On 2/8/20 1:26 AM, dean.w.schu...@gmail.com wrote:

The firewalls are disabled on all nodes on my cluster so I don't think it is a 
firewall issue.  It's probably our network security between the wired part of 
our network and the wireless side.  When I put the nodes back on a wired 
controller they work again.
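
A quick way to confirm it is the network path rather than the hosts themselves is to test the Slurm ports directly (a sketch; hostnames are placeholders):

   nc -zv slurmnode1 6818    # from the controller to a compute node's slurmd
   nc -zv controller 6817    # from a compute node back to slurmctld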


-----Original Message-----
From: slurm-users  On Behalf Of Ole Holm 
Nielsen
Sent: Friday, February 7, 2020 2:34 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Which ports does slurm use?

On 06-02-2020 22:40, Dean Schulze wrote:

I've moved two nodes to a different controller.  The nodes are wired and the
controller is networked via wifi.  I had to open up ports 6817 and 6818
between the wired and wireless sides of our network to get any connectivity.

Now when I do

srun -N2 hostname

the jobs show connection timeouts on the nodes:

[2020-02-06T14:24:37.183] launch task 60.0 request from UID:1000 GID:1000 HOST:10.204.18.232 PORT:19602
[2020-02-06T14:24:37.183] lllp_distribution jobid [60] implicit auto binding: cores, dist 8192
[2020-02-06T14:24:37.183] _task_layout_lllp_cyclic
[2020-02-06T14:24:37.183] _lllp_generate_cpu_bind jobid [60]: mask_cpu, 0x0101
[2020-02-06T14:24:37.184] _run_prolog: run job script took usec=6
[2020-02-06T14:24:37.184] _run_prolog: prolog with lock for job 60 ran for 0 seconds
[2020-02-06T14:24:45.224] [60.0] error: connect io: Connection timed out
[2020-02-06T14:24:45.224] [60.0] error: IO setup failed: Connection timed out
[2020-02-06T14:24:45.225] [60.0] error: job_manager exiting abnormally, rc = 4021
[2020-02-06T14:24:59.538] [60.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection timed out
[2020-02-06T14:24:59.551] [60.0] done with job

That node used port 19602 and the other node was using port 12496.  When I
did the srun again the jobs showed two different ports on the nodes
(58040 and 32392).

How can I configure a network if srun is going to grab different ports
each time?


Perhaps the information about firewall setup in my Wiki page can be of use:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
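
One related slurm.conf knob, not mentioned explicitly in this thread, is SrunPortRange, which pins the otherwise ephemeral ports that srun listens on, so that a fixed range can be opened in the firewall between the two network segments (the range below is only an example):

   SrunPortRange=60001-63000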

/Ole