Reuti <re...@staff.uni-marburg.de> writes:

>> These are diskless and stateless compute nodes. They have a private
>> RAM-disk-based local copy of sge_root and /var (along with everything
>> else except /home and /scratch). The spool directory is definitely
>> cleared on reboot.
[I'm surprised it's all in RAM, but good if there's enough!]  For what
it's worth, the output I referred to is from stateless nodes that share
nearly everything, including the GE spool, and the symptoms seem
similar.

> Then SGE doesn't discover that the node crashed and lost its jobs. If
> the spool directory were there, the job would be removed from the
> list automatically.
>
> But sure: this behavior could be improved by comparing what the
> qmaster thinks should be there and what the node just discovered. For
> now the rebooted node is doing it on its own, discovering that the
> jobs mentioned in its spool directory aren't there (if the spool
> directory survives) and informing the qmaster about it.

What seems to be happening here (on the basis of the logs) is that
execd zaps the spool when it starts up (before or after reporting to
qmaster?), qmaster then realizes it needs to kill the job and tells the
rebooted node, which keeps retrying unsuccessfully because the job
directory has been clobbered.
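For what it's worth, the comparison Reuti describes might look roughly
like the following (a Python sketch of my own; the names and data
structures are invented for illustration, not SGE's actual code, which
is C):

def reconcile(qmaster_jobs: set, execd_reported: set) -> None:
    """Compare qmaster's view of a node with what its execd reports."""
    lost = qmaster_jobs - execd_reported   # qmaster expects these; node lost them
    stale = execd_reported - qmaster_jobs  # node's spool has entries qmaster forgot

    for job_id in lost:
        # The node rebooted and lost its spool: the job cannot be
        # running, so drop it from the list instead of sending kill
        # requests the execd can never satisfy.
        print("job %s: lost on reboot, removing from qmaster" % job_id)

    for job_id in stale:
        # Leftover spool entry on the node: report it so qmaster cleans up.
        print("job %s: stale spool entry, informing qmaster" % job_id)

# E.g. qmaster expected jobs 101 and 102; the rebooted node reports none:
reconcile({101, 102}, set())

With something like that on the qmaster side, a node that comes back
empty wouldn't leave phantom jobs behind.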
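And here is a toy model of the sequence I think the logs show (again
just an illustration under that assumption, not SGE code): because the
spool is already gone when the kill request arrives, the kill can never
find the job directory and is retried indefinitely.

import os
import shutil
import tempfile

spool = tempfile.mkdtemp()
job_dir = os.path.join(spool, "active_jobs", "101.1")
os.makedirs(job_dir)      # job was running before the "reboot"

shutil.rmtree(spool)      # execd startup zaps the spool

for attempt in range(3):  # qmaster's kill request keeps retrying
    if os.path.isdir(job_dir):
        print("killing job 101.1")
        break
    print("attempt %d: job directory missing, will retry" % attempt)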