We are testing Slurm 22.05 and we noticed a behaviour change for the
prolog/epilog scripts. We use NHC in the prolog/epilog to check if a
node is healthy. With previous versions (21.08.X and earlier) we had no
problems.
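
For context: slurm.conf points Prolog= and Epilog= at a small wrapper
that runs NHC. Below is a simplified sketch of that wrapper; the script
path is illustrative, only /usr/sbin/nhc and the FORCE_SETSID=0 setting
match what actually runs on the nodes (see the ps output further down):
```
#!/bin/bash
# Simplified prolog/epilog wrapper (illustrative path: /etc/slurm/prolog.sh).
# Run NHC; a non-zero exit fails the prolog so the node is not handed
# to the job.
/usr/sbin/nhc -f FORCE_SETSID=0
exit $?
```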
Now when we do an srun:
* srun -t 1 hostname
```
srun: job 3975 queued and waiting for resources
srun: job 3975 has been allocated resources
srun: error: Nodes r16n19 are still not ready
srun: error: Something is wrong with the boot of the nodes.
```
11:57 r16n19:/tmp
root# ps -eaf | grep nhc
root     22228 22185  0 Aug03 pts/3    00:00:00 tail -f nhc.log
root     50250 20274  0 11:57 ?        00:00:00 [nhc] <defunct>
root     50259     1  0 11:57 ?        00:00:00 /bin/bash /usr/sbin/nhc -f FORCE_SETSID=0
root     50268     1  0 11:57 ?        00:00:00 /bin/bash /usr/sbin/nhc -f FORCE_SETSID=0
root     50331 48699  0 11:57 pts/5    00:00:00 grep --color=auto nhc

11:57 r16n19:/tmp
root# ps -eaf | grep 20274
root     20274     1  0 Aug03 ?        00:00:01 /opt/slurm/sw/current/sbin/slurmd -D
root     50250 20274  0 11:57 ?        00:00:00 [nhc] <defunct>
root     50339 48699  0 11:57 pts/5    00:00:00 grep --color=auto 20274
Have other sites also seen this problem? Did I miss an option?
Regards
--
Bas van der Vlies
| High Performance Computing & Visualization | SURF | Science Park 140 |
1098 XG Amsterdam
| T +31 (0) 20 800 1300 | bas.vandervl...@surf.nl | www.surf.nl |