[slurm-users] Re: Mailing list upgrade - slurm-users list paused

2024-01-30 Thread Tim Wickberg via slurm-users
Welcome to the updated list. Posting is re-enabled now.

- Tim

On 1/30/24 11:56, Tim Wickberg wrote:
> Hey folks - The mailing list will be offline for about an hour as we upgrade the host, upgrade the mailing list software, and change the mail configuration around. As part of these changes,

[slurm-users] Mailing list upgrade - slurm-users list paused

2024-01-30 Thread Tim Wickberg
Hey folks - The mailing list will be offline for about an hour as we upgrade the host, upgrade the mailing list software, and change the mail configuration around. As part of these changes, the "From: " field will no longer be the original sender, but will instead use the mailing list ID itself.

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines
This is definitely an NVML thing crashing slurmstepd. Here is what I find doing an strace of the slurmstepd: [3681401.0] process at the point the crash happens:

[pid 1132920] fcntl(10, F_SETFD, FD_CLOEXEC) = 0
[pid 1132920] read(10, "1132950 (bash) S 1132919 1132950"..., 511) = 339
[pid
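
For anyone wanting to capture a similar trace on their own nodes, a minimal sketch of attaching strace to a running step (the step ID "3681401.0" is taken from the trace above and will differ per job):

    # Attach to the running slurmstepd for this step, follow forks,
    # timestamp each call, and write the trace to a file.
    strace -f -tt -o /tmp/slurmstepd.strace \
        -p $(pgrep -f 'slurmstepd: \[3681401')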

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Heckes, Frank
This is scary news. I just updated to 23.11.1, but couldn't confirm the problems described so far. I'll do some more extensive and intensive tests. In case of disaster: does anyone know how to roll back the DB, as some new DB 'object' attributes are introduced in 23.11.1? I never had the
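
On the rollback question: slurmdbd schema conversions are generally one-way, so the usual approach is to restore a dump taken before the upgrade. A minimal sketch, assuming a MariaDB/MySQL backend and the default accounting database name slurm_acct_db:

    # Before upgrading: stop slurmdbd, then dump the accounting database.
    mysqldump --single-transaction slurm_acct_db > slurm_acct_db-pre-upgrade.sql
    # To roll back: reinstall the previous slurmdbd, then restore the dump.
    mysql slurm_acct_db < slurm_acct_db-pre-upgrade.sql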

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines
I built 23.02.7 and tried that, and had the same problems. BTW, I am using the slurm.spec RPM build method (built on Rocky 8 boxes with the NVIDIA 535.54.03 proprietary drivers installed). The behavior I was seeing: one would start a GPU job. It was fine at first, but at some point the
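
A quick check of which driver version NVML actually reports on a node (535.54.03 is the version cited above):

    # Print the driver version NVML reports for each GPU.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader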

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Ole Holm Nielsen
On 1/30/24 09:36, Fokke Dijkstra wrote:
> We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in a completing state and slurmd daemons can't be killed because they are left in a CLOSE-WAIT state. See my previous mail to the mailing list for the details. And also

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Fokke Dijkstra
We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in a completing state and slurmd daemons can't be killed because they are left in a CLOSE-WAIT state. See my previous mail to the mailing list for the details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for another
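
For anyone checking their own nodes for this symptom, a minimal sketch, assuming slurmd is listening on the default port 6818:

    # List TCP sockets stuck in CLOSE-WAIT on slurmd's port, with owning processes.
    ss -tnp state close-wait '( sport = :6818 )'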