Re: [slurm-users] unable to start slurmd process.

Riebs, Andy Thu, 11 Jun 2020 08:53:02 -0700

Short of getting on the system and kicking the tires myself, I’m fresh out of 
ideas. Does “sinfo -R” offer any hints?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 11:31 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] unable to start slurmd process.

I am able to get the output of scontrol show node oled3, and oled3 responds 
to ping.

The scontrol ping output shows:

Slurmctld(primary/backup) at deda1x1466/(NULL) are UP/DOWN

So it all looks OK to me.

Regards
Navin.

On Thu, Jun 11, 2020 at 8:38 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
So there seems to be a failure to communicate between slurmctld and the oled3 
slurmd.

From oled3, try “scontrol ping” to confirm that it can see the slurmctld daemon.

From the head node, try “scontrol show node oled3”, and then ping the address 
that is shown for “NodeAddr=”
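
A minimal sketch of the head-node check described above, assuming the usual key=value layout of the "scontrol show node" output. The helper name node_addr is hypothetical, not part of Slurm:

```shell
#!/bin/sh
# Hypothetical helper: read `scontrol show node <name>` output on stdin
# and print the value of the first NodeAddr= field.
node_addr() {
    sed -n 's/.*NodeAddr=\([^ ]*\).*/\1/p' | head -n 1
}

# Usage on the head node (assumes scontrol and ping are on PATH):
#   addr=$(scontrol show node oled3 | node_addr)
#   ping -c 3 "$addr"
#
# And on the compute node itself:
#   scontrol ping
```

If the ping to the extracted address fails while a ping to the node's hostname succeeds, the NodeAddr setting is the first thing to look at.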

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 10:40 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] unable to start slurmd process.

I collected the log from slurmctld, and it shows the following:

[2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:22:38.624] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:26:38.902] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:30:38.230] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:34:38.594] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:38:38.986] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:42:38.402] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:46:38.764] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T20:50:38.094] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:26:38.839] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:30:38.225] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:34:38.582] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:38:38.914] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:42:38.292] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:46:38.542] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:50:38.869] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:54:38.227] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-10T21:58:38.628] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T06:54:39.012] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T06:58:39.411] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:02:39.106] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:06:39.495] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:10:39.814] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3
[2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3 RPC:REQUEST_TERMINATE_JOB : Communication connection failure
[2020-06-11T07:14:50.210] error: Nodes oled3 not responding
[2020-06-11T07:15:54.313] error: Nodes oled3 not responding
[2020-06-11T07:17:34.407] error: Nodes oled3 not responding
[2020-06-11T07:19:14.637] error: Nodes oled3 not responding
[2020-06-11T07:19:54.313] update_node: node oled3 reason set to: reboot-required
[2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*
[2020-06-11T07:20:43.788] requeue job 1316970 due to failure of node oled3
[2020-06-11T07:20:43.788] requeue job 1349322 due to failure of node oled3
[2020-06-11T07:20:43.789] error: Nodes oled3 not responding, setting DOWN
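
A long excerpt like this is easier to read as a per-node summary. The following is only a sketch, assuming each entry sits on a single line in the real slurmctld log file and matches the message wording shown above; the function name summarize_node is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read a slurmctld log on stdin and count, for one node,
# the "Resending TERMINATE_JOB" requests and the "not responding" errors.
summarize_node() {
    awk -v node="$1" '
        # Resend lines end with the field Nodelist=<node>
        index($0, "Resending TERMINATE_JOB") && $NF == ("Nodelist=" node) { resend++ }
        # Matches both "not responding" and "not responding, setting DOWN"
        index($0, "Nodes " node " not responding") { down++ }
        END { printf "resends=%d not_responding=%d\n", resend, down }
    '
}

# Usage (the log path is an assumption; check SlurmctldLogFile for the real one):
#   summarize_node oled3 < /var/log/slurmctld.log
```

The comparison is exact for a single node name; a compressed node list such as oled[1-3] in the log would not be matched by this sketch.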

sinfo reports:

OLED*           up   infinite      1 drain* oled3

But when I check the node itself, it looks healthy.

Regards
Navin

On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to 
interpret it not reporting anything but the “log file” and “munge” messages. 
When you have it running attached to your window, is there any chance that 
sinfo or scontrol suggest that the node is actually all right? Perhaps 
something in /etc/sysconfig/slurm or the like is messed up?

If that’s not the case, I think my next step would be to follow up on someone 
else’s suggestion, and scan the slurmctld.log file for the problem node name.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 9:26 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] unable to start slurmd process.

Sorry Andy, I forgot to add this.
First I tried slurmd -Dvvv, and it printed nothing beyond:
slurmd: debug:  Log file re-opened
slurmd: debug:  Munge authentication plugin loaded

After that I waited 10-20 minutes, but there was no further output, and 
finally I pressed Ctrl-C.

My question is about these lines in slurm.conf:

ControlMachine=deda1x1466
ControlAddr=192.168.150.253

The hostname deda1x1466 belongs to a different interface with a different IP, 
which the compute node is unable to ping, although the ControlAddr IP is 
pingable. Could that be one of the reasons?

But other nodes have the same config, and there I am able to start slurmd, 
so I am a bit confused.
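
One way to compare the two settings from the compute node is sketched below. The helper name conf_value and the slurm.conf path are assumptions, and the match is case-sensitive even though Slurm itself accepts parameter names in any case:

```shell
#!/bin/sh
# Hypothetical helper: read a slurm.conf-style file on stdin and print the
# value of the first "Key=value" line for the given key.
conf_value() {
    sed -n "s/^[[:space:]]*$1=\([^[:space:]#]*\).*/\1/p" | head -n 1
}

# Usage on the compute node (adjust the path to your installation):
#   host=$(conf_value ControlMachine < /etc/slurm/slurm.conf)
#   addr=$(conf_value ControlAddr    < /etc/slurm/slurm.conf)
#   getent hosts "$host"    # the address this node resolves the name to
#   ping -c 3 "$addr"       # the address slurmd should actually contact
```

If getent returns an address on the unreachable interface while ControlAddr is reachable, the name and the address point at different interfaces, which is the situation described above.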

Regards
Navin.

On Thu, Jun 11, 2020 at 6:44 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
If you omitted the “-D” that I suggested, then the daemon would have detached 
and logged nothing on the screen. In this case, you can still go to the slurmd 
log (use “scontrol show config | grep -i log” if you’re not sure where the logs 
are stored).
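
As a rough sketch of that lookup, assuming the "Key = value" layout that scontrol show config prints; the helper name log_path is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `scontrol show config` output on stdin and print
# the value for one key, e.g. SlurmdLogFile or SlurmctldLogFile.
log_path() {
    awk -v key="$1" '$1 == key && $2 == "=" { print $3; exit }'
}

# Usage (assumes scontrol is on PATH and slurmctld is reachable):
#   logfile=$(scontrol show config | log_path SlurmdLogFile)
#   tail -n 50 "$logfile"
```

Note that scontrol show config queries slurmctld, so it has to be run somewhere that can reach the controller.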

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 9:01 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] unable to start slurmd process.

I tried running it in debug mode, but it does not write anything there either.

I waited for about 5-10 minutes.

deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v

No output on terminal.

The OS is SLES12-SP4. All firewall services are disabled.

The only recent change is to the hostnames: earlier we used local hostnames 
(node1, node2, etc.), but we have moved to DNS-based hostnames (the deda names):

NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10 State=UNKNOWN

Other than this, nothing else has changed. Since that change I have started the 
slurmd process on the node several times and it worked fine, but today I am 
seeing this issue.

Regards
Navin.

On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
Navin,

As you can see, systemd provides very little service-specific information. For 
slurm, you really need to go to the slurm logs to find out what happened.

Hint: A quick way to identify problems like this with slurmd and slurmctld is 
to run them with the “-Dvvv” option, causing them to log to your window, and 
usually causing the problem to become immediately obvious.

For example,

# /usr/local/slurm/sbin/slurmd -Dvvvv

Just hit ^C when you’re done, if necessary. Of course, if it doesn’t fail when 
you run it this way, it’s time to look elsewhere.
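
If the daemon simply hangs in the foreground, a time-capped run that saves the output can be handy. This is only a sketch, assuming GNU coreutils timeout is available; the function name run_capped and the file paths are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: run a command in the foreground for at most N seconds,
# saving stdout and stderr to a file, then report how the run ended.
run_capped() {
    secs="$1"; out="$2"; shift 2
    status=0
    timeout "$secs" "$@" > "$out" 2>&1 || status=$?
    # GNU timeout exits with 124 when the time limit was reached
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${secs}s; see $out"
    else
        echo "exited with status $status; see $out"
    fi
}

# Usage (adjust the slurmd path to your installation):
#   run_capped 120 /tmp/slurmd-debug.log /usr/local/slurm/sbin/slurmd -Dvvv
```

The saved file can then be attached to a list post instead of a description of what appeared on screen.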

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 8:25 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] unable to start slurmd process.

Hi Team,

When I try to start the slurmd process, I get the error below.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)

The Slurm version is 17.11.8.

The server and Slurm have been running for a long time and we have not made any 
changes, but today when I start slurmd it gives this error message.
Any idea what could be wrong here?

Regards
Navin.
