We are testing Slurm 22.05 and we noticed a behaviour change for the
prolog/epilog scripts. We use NHC in the prolog/epilog to check if a
node is healthy. With previous versions (21.08.X and earlier) we had no
problems.
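
For context: slurm.conf points Prolog= and Epilog= at a small wrapper
that runs NHC. Below is a simplified sketch of that wrapper; the script
path is illustrative, only /usr/sbin/nhc and the FORCE_SETSID=0 setting
match what actually runs on the nodes (see the ps output further down):
```
#!/bin/bash
# Simplified prolog/epilog wrapper (illustrative path: /etc/slurm/prolog.sh).
# Run NHC; a non-zero exit fails the prolog so the node is not handed
# to the job.
/usr/sbin/nhc -f FORCE_SETSID=0
exit $?
```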
Now when we do an srun:
* srun -t 1 hostname
```
srun: job 3975 queued and waiting for resources
srun: job 3975 has been allocated resources
srun: error: Nodes r16n19 are still not ready
srun: error: Something is wrong with the boot of the nodes.
```
11:57 r16n19:/tmp
root# ps -eaf | grep nhc
root     22228 22185  0 Aug03 pts/3    00:00:00 tail -f nhc.log
root     50250 20274  0 11:57 ?        00:00:00 [nhc] <defunct>
root     50259     1  0 11:57 ?        00:00:00 /bin/bash /usr/sbin/nhc -f FORCE_SETSID=0
root     50268     1  0 11:57 ?        00:00:00 /bin/bash /usr/sbin/nhc -f FORCE_SETSID=0
root     50331 48699  0 11:57 pts/5    00:00:00 grep --color=auto nhc

11:57 r16n19:/tmp
root# ps -eaf | grep 20274
root     20274     1  0 Aug03 ?        00:00:01 /opt/slurm/sw/current/sbin/slurmd -D
root     50250 20274  0 11:57 ?        00:00:00 [nhc] <defunct>
root     50339 48699  0 11:57 pts/5    00:00:00 grep --color=auto 20274
Have other sites also seen this problem? Did I miss an option?
Regards
--
Bas van der Vlies
| High Performance Computing & Visualization | SURF | Science Park 140 |
1098 XG Amsterdam
| T +31 (0) 20 800 1300 | bas.vandervl...@surf.nl | www.surf.nl |