We are testing Slurm 22.05 and we noticed a behaviour change for the prolog/epilog scripts. We use NHC in the prolog/epilog to check whether a node is healthy. With 21.08.X and earlier we had no problems.
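
For context, a minimal sketch of how NHC is typically wired into slurm.conf on the nodes; the wrapper paths and the PrologFlags value below are assumptions for illustration and may differ per site:
```
# Minimal slurm.conf sketch (assumption: NHC is invoked from site
# prolog/epilog wrapper scripts; paths and flags are illustrative only).
Prolog=/etc/slurm/prolog.sh        # wrapper that runs /usr/sbin/nhc
Epilog=/etc/slurm/epilog.sh        # wrapper that runs /usr/sbin/nhc
PrologFlags=Alloc                  # run the prolog when the allocation is
                                   # granted, so srun waits for it to finish
```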

Now when we run an srun:
 *  srun -t 1 hostname
```
srun: job 3975 queued and waiting for resources
srun: job 3975 has been allocated resources
srun: error: Nodes r16n19 are still not ready
srun: error: Something is wrong with the boot of the nodes.

```


11:57 r16n19:/tmp
root# ps -eaf | grep nhc
root       22228   22185  0 Aug03 pts/3    00:00:00 tail -f nhc.log
root       50250   20274  0 11:57 ?        00:00:00 [nhc] <defunct>
root       50259       1  0 11:57 ?        00:00:00 /bin/bash /usr/sbin/nhc -f FORCE_SETSID=0
root       50268       1  0 11:57 ?        00:00:00 /bin/bash /usr/sbin/nhc -f FORCE_SETSID=0
root       50331   48699  0 11:57 pts/5    00:00:00 grep --color=auto nhc

11:57 r16n19:/tmp
root# ps -eaf | grep 20274
root       20274       1  0 Aug03 ?        00:00:01 /opt/slurm/sw/current/sbin/slurmd -D
root       50250   20274  0 11:57 ?        00:00:00 [nhc] <defunct>
root       50339   48699  0 11:57 pts/5    00:00:00 grep --color=auto 20274


Have other sites also seen this problem? Did I miss an option?

Regards


--
Bas van der Vlies
| High Performance Computing & Visualization | SURF| Science Park 140 | 1098 XG Amsterdam
| T +31 (0) 20 800 1300  | bas.vandervl...@surf.nl | www.surf.nl |
