This is definitely an NVML thing crashing slurmstepd.  Here is what I see
in an strace of the slurmstepd: [3681401.0] process at the point the crash happens:

[pid 1132920] fcntl(10, F_SETFD, FD_CLOEXEC) = 0
[pid 1132920] read(10, "1132950 (bash) S 1132919 1132950"..., 511) = 339
[pid 1132920] openat(AT_FDCWD, "/proc/1132950/status", O_RDONLY) = 12
[pid 1132920] read(12, "Name:\tbash\nUmask:\t0002\nState:\tS "..., 4095) = 1431
[pid 1132920] close(12)                 = 0
[pid 1132920] close(10)                 = 0
[pid 1132920] getpid()                  = 1132919
[pid 1132920] getpid()                  = 1132919
[pid 1132920] getpid()                  = 1132919
[pid 1132920] getpid()                  = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] getpid()                  = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] writev(2, [{iov_base="free(): invalid next size (fast)", iov_len=32}, {iov_base="\n", iov_len=1}], 2) = 33
[pid 1132924] <... poll resumed>)       = 1 ([{fd=13, revents=POLLIN}])
[pid 1132920] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 1132924] read(13, "free(): invalid next size (fast)"..., 3850) = 34
[pid 1132920] <... mmap resumed>)       = 0x14dc6da6d000
[pid 1132920] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
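
(For reference, a trace like the above can be captured by attaching strace
to the running step process with something like

   strace -f -tt -p $(pgrep -f 'slurmstepd: \[3681401') -o /tmp/stepd.trace

where -f follows the threads and the pgrep pattern is just illustrative
for this job/step.)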

From "ls -l /proc/1132919/fd" before the crash happens I can tell
that FileDescriptor 16 is /dev/nvidiactl.  SO it is doing an ioctl
to /dev/nvidiactl write before this crash.  In some cases the error
is a "free(): invalid next size (fast)" and sometimes it is malloc()

Job submitted as:

srun -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G --time=1-10:00:00 --cpus-per-task=2 --gpus=8 --nodelist=rtx-02 --pty /bin/bash

Command run in interactive shell is:  gpuburn 30

In this case I am not getting the defunct slurmd process left behind,
but there is a strange 'sleep 100000000' left behind that I have to kill:

[root@rtx-02 ~]# find $(find /sys/fs/cgroup/ -name job_3681401 ) -name cgroup.procs -exec cat {} \; | sort | uniq
1132916
[root@rtx-02 ~]# ps -f -p 1132916
UID          PID    PPID  C STIME TTY          TIME CMD
root     1132916       1  0 11:13 ?        00:00:00 sleep 100000000
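
To clean these leftovers up in one go, a loop like the following (reusing
the same find as above) kills every pid still listed in the job's cgroups:

   for p in $(find $(find /sys/fs/cgroup/ -name job_3681401) \
       -name cgroup.procs -exec cat {} \; | sort -u); do kill -9 "$p"; done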


The NVML library I have installed is

$ rpm -qa | grep NVML
nvidia-driver-NVML-535.54.03-1.el8.x86_64

on both the box where the SLURM binaries were built and on this box
where slurmstepd is crashing.  /usr/local/cuda is a CUDA 11.6 install.

Okay, I just upgraded the NVIDIA driver on rtx-02 with

  dnf --enablerepo=cuda module reset nvidia-driver
  dnf --enablerepo=cuda module install nvidia-driver:535-dkms

Restarted everything, and with my initial couple of tests it appears
the problem has gone away.  Going to need to have real users test
with real jobs.
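
For the record, the driver version actually running after the upgrade
can be double-checked with:

   nvidia-smi --query-gpu=driver_version --format=csv,noheader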

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 30 Jan 2024 9:01am, Paul Raines wrote:

I built 23.02.7 and tried that and had the same problems.

BTW, I am using the slurm.spec rpm build method (built on Rocky 8 boxes with the NVIDIA 535.54.03 proprietary drivers installed).

The behavior I was seeing was that a user would start a GPU job. It was fine at first, but at some point the slurmstepd processes for it would crash/die. Sometimes the user processes would die too, sometimes not. In an interactive job you would sometimes see some final line about "malloc" and "invalid value", and the terminal would be hung until the job was 'scancel'ed. 'ps' would show a 'slurmd <defunct>' process that was unkillable (killing the main slurmd process would get rid of it).
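
To actually dig into the heap corruption one would want a core dump from slurmstepd.  An untested sketch for a systemd box with systemd-coredump (the drop-in path is illustrative; slurmstepd inherits its limits from slurmd):

   mkdir -p /etc/systemd/system/slurmd.service.d
   printf '[Service]\nLimitCORE=infinity\n' \
       > /etc/systemd/system/slurmd.service.d/core.conf
   systemctl daemon-reload && systemctl restart slurmd
   # after the next crash:
   coredumpctl list slurmstepd
   coredumpctl gdb slurmstepd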

How the slurm controller saw the job seemed random. Sometimes it saw it as a crashed job, and it was reported like that in the system. Sometimes it was stuck as a permanently CG (completing) job. Sometimes it did not notice anything wrong and the job just stayed as a seemingly perfect running job according to slurmctld (I never waited for the TimeLimit to hit to see what happened then, but did scancel).

If I scancelled a "failed" job in the CG or R state, it would not
actually kill the user processes on the node, but it would clear the
job from squeue.

Jobs on my non-GPU Rocky 8 nodes and on my Ubuntu GPU nodes (slurm 23.11.3 built separately on those boxes) have all been working fine so far. Another difference is that the Ubuntu GPU boxes are still using NVIDIA 470 drivers.

I tried downgrading a Rocky 8 GPU box to NVIDIA 470 and rebuilding
slurm 23.11.3 there and installing it to see if that worked
to fix things.  It did not.

I then tried installing old 22.05.6 RPMs I had built on my Rocky 8
box back in Nov 2022 on all Rocky 8 GPU boxes.  This seems to
have fixed the problem and jobs are no longer showing the issue.
Not an ideal solution but good enough for now.

Both those 22.05.6 RPMs and the 23.11.3 RPMs are built on the
same Rocky 8 GPU box.  So the differences are:

   1) slurm version obviously
   2) built using different gcc/lib versions, most likely
      due to OS updates between Nov 2022 and now
   3) built with a different NVIDIA driver/cuda installed
      between then and now but I am not sure what I had
      in Nov 2022

I highly suspect #2 or #3 as the underlying issue here, and I wonder if the NVML library present at build time is the key (though like I said I tried rebuilding with NVIDIA 470 and that still had the issue).
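
One way to check which NVML library the GPU plugin actually resolves at run time (assuming the default RPM plugin directory) is:

   ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml

and then compare that against the libnvidia-ml.so the driver package ships.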


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 30 Jan 2024 3:36am, Fokke Dijkstra wrote:

 We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in
 a completing state and slurmd daemons can't be killed because they are
 left in a CLOSE-WAIT state. See my previous mail to the mailing list for
 the details, and also https://bugs.schedmd.com/show_bug.cgi?id=18561 for
 another site having issues.
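
 (The stuck sockets can be spotted with something like

    ss -tnp state close-wait '( sport = :6818 )'

 assuming slurmd is listening on the default port, 6818.)
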
 We've now downgraded the clients (slurmd and login nodes) to 23.02.7 which
 gets rid of most issues. If possible, I would try to also downgrade
 slurmctld to an earlier release, but this requires getting rid of all
 running and queued jobs.

 Kind regards,

 Fokke

 On Mon 29 Jan 2024 at 01:00, Paul Raines
 <rai...@nmr.mgh.harvard.edu> wrote:

 Some more info on what I am seeing after the 23.11.3 upgrade.

 Here is a case where a job is cancelled but seems permanently
 stuck in the 'CG' state in squeue:

 [2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
 [2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06 #CPUs=4 Partition=rtx8000
 [2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file `/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
 [2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=3679903 uid 5875902
 [2024-01-28T17:42:27.725] debug:  email msg to sg1526: Slurm Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED, ExitCode 0
 [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: JobId=3679903 action:normal
 [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: removed JobId=3679903 from part rtx8000 row 0
 [2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903 successful 0x8004
 [2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
 [2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
 [2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
 [2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
 [2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06


 So at 17:42 the user must have done an scancel.  In the slurmd log on the
 node I see:

 [2024-01-28T17:42:27.727] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
 [2024-01-28T17:42:27.728] debug:  credential for job 3679903 revoked
 [2024-01-28T17:42:27.728] debug:  _rpc_terminate_job: sent SUCCESS for 3679903, waiting for prolog to finish
 [2024-01-28T17:42:27.728] debug:  Waiting for job 3679903's prolog to complete
 [2024-01-28T17:43:19.002] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
 [2024-01-28T17:44:20.001] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
 [2024-01-28T17:45:20.002] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
 [2024-01-28T17:46:20.001] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
 [2024-01-28T17:47:20.002] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903

 Strange that a prolog is being called on job cancel

 slurmd seems to be getting the repeated calls from slurmctld to terminate
 the job, but the termination is not happening.  Also the process table has:

 [root@rtx-06 ~]# ps auxw | grep slurmd
 root      161784  0.0  0.0 436748 21720 ?        Ssl  13:44   0:00 /usr/sbin/slurmd --systemd
 root      190494  0.0  0.0      0     0 ?        Zs   17:34   0:00 [slurmd] <defunct>

 where there is now a zombie slurmd process I cannot kill, even with
 kill -9.

 If I do a 'systemctl stop slurmd' it takes a long time, but it eventually
 stops slurmd and gets rid of the zombie process.  Unfortunately it kills
 the "good" running jobs too, with NODE_FAIL.

 Another case is where a job will be cancelled and SLURM acts like it
 is cancelled, with it no longer showing up in squeue, but the processes
 keep running on the box.

 # pstree -u sg1526 -p | grep ^slurm
 slurm_script(185763)---python(185796)-+-{python}(185797)
 # strings /proc/185763/environ | grep JOB_ID
 SLURM_JOB_ID=3679888
 # squeue -j 3679888
 slurm_load_jobs error: Invalid job id specified

 sacct shows that job as cancelled.  In the slurmd log we see:

 [2024-01-28T17:33:58.757] debug:  _rpc_terminate_job: uid = 1150 JobId=3679888
 [2024-01-28T17:33:58.757] debug:  credential for job 3679888 revoked
 [2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
 [2024-01-28T17:33:58.757] debug:  Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967292
 [2024-01-28T17:33:58.757] debug:  signal for nonexistent StepId=3679888.extern stepd_connect failed: Connection refused
 [2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
 [2024-01-28T17:33:58.757] debug:  Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967291
 [2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679888/slurm_script
 [2024-01-28T17:33:58.757] debug:  signal for nonexistent StepId=3679888.batch stepd_connect failed: Connection refused
 [2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 were able to be signaled with 18
 [2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 to send signal 15
 [2024-01-28T17:33:58.757] debug2: set revoke expiration for jobid 3679888 to 1706481358 UTS
 [2024-01-28T17:33:58.757] debug:  Waiting for job 3679888's prolog to complete
 [2024-01-28T17:33:58.757] debug:  Finished wait for job 3679888's prolog to complete
 [2024-01-28T17:33:58.771] debug:  completed epilog for jobid 3679888
 [2024-01-28T17:33:58.774] debug:  JobId=3679888: sent epilog complete msg: rc = 0


 -- Paul Raines (http://help.nmr.mgh.harvard.edu)







 --
 Fokke Dijkstra <f.dijks...@rug.nl>
 Team High Performance Computing
 Center for Information Technology, University of Groningen
 Postbus 11044, 9700 CA  Groningen, The Netherlands



