Hi,

We're using SGE 8.1.9 on CentOS 6.9.

"All of a sudden" we've noticed that when we reboot an execution host, any jobs sent to it within the first 10-15 minutes following boot-up get stuck in the 't' state until deleted (sometimes that has to be done forcibly). After roughly 10 minutes, however, the execution host starts accepting jobs.

In the qmaster's messages file, I see the following entries:

06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique error (endpoint "sim4.work.com/execd/1" is already connected)
06/25/2018 10:38:36| timer|sim1|W|failed to deliver job 54312.1 to queue "shor...@sim4.work.com"
06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target "execd" on host "sim4.work.com", can't deliver job "54312"

Our IT person says he can connect to the SGE ports on both the qmaster and exec hosts without issue.

I need some help figuring out exactly why the SGE qmaster is not happy so that we can deploy a fix. I am _assuming_ some kind of DNS/network issue on our end. This phenomenon is repeatable on all of our execution hosts (although our server count is small at this point). IT tells me that nothing has changed regarding DNS between when the execution hosts worked "correctly" following a reboot and now.

Regards,

--
Mun