On Tue, 26 Jun 2018 at 9:12am, Mun Johl wrote:

We're using SGE 8.1.9 on CentOS 6.9

All of a sudden, we've noticed that when we reboot an execution host,
any jobs sent to it within the first 10-15 minutes after boot-up
get stuck in the 't' state until deleted (sometimes that has to be done
forcibly).  After roughly 10 minutes, however, the execution host starts
accepting jobs.

In the qmaster's messages file, I see the following entries:

06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique error (endpoint "sim4.work.com/execd/1" is already connected)
06/25/2018 10:38:36| timer|sim1|W|failed to deliver job 54312.1 to queue "shor...@sim4.work.com"
06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target "execd" on host "sim4.work.com", can't deliver job "54312"

One possibility occurs to me. SoGE 8.1.9 has a bug where "qconf -s" commands fail on non-admin hosts (see <https://arc.liv.ac.uk/trac/SGE/ticket/1576>). One side effect of this is that the init script fails to shut down the execd properly. I wonder whether that's what is causing your problem. I don't see this behavior myself, but I'm running CentOS 7, which may behave differently.
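
If it helps to check whether the stuck window lines up with a stale execd connection, here is a rough Python sketch that measures the gap between the "endpoint is not unique" error and the later "max. unheard timeout" errors for one host. The pipe-separated layout (timestamp|thread|host|level|text) and the month/day/year date order are assumptions taken from the excerpt you posted, and the function names are made up for illustration; this is not an SGE tool.

```python
from datetime import datetime

# Two lines from the qmaster messages excerpt above, with the wrapped
# halves rejoined. On a real system you would read the messages file instead.
SAMPLE = [
    '06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique '
    'error (endpoint "sim4.work.com/execd/1" is already connected)',
    '06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target '
    '"execd" on host "sim4.work.com", can\'t deliver job "54312"',
]


def parse_line(line):
    """Split one messages line into (timestamp, thread, host, level, text).

    Assumes the five pipe-separated fields seen in the excerpt; raises
    ValueError for anything else (e.g. a wrapped continuation line).
    """
    stamp, thread, host, level, text = line.split("|", 4)
    when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    return when, thread.strip(), host, level, text


def stale_endpoint_gaps(lines, exec_host):
    """Return seconds from the first duplicate-endpoint error for exec_host
    to each later 'max. unheard timeout' error for the same host."""
    first_duplicate = None
    gaps = []
    for line in lines:
        try:
            when, _thread, _host, _level, text = parse_line(line)
        except ValueError:
            continue  # skip lines that do not match the assumed layout
        if exec_host not in text:
            continue
        if "endpoint is not unique" in text and first_duplicate is None:
            first_duplicate = when
        elif "max. unheard timeout" in text and first_duplicate is not None:
            gaps.append((when - first_duplicate).total_seconds())
    return gaps


if __name__ == "__main__":
    for gap in stale_endpoint_gaps(SAMPLE, "sim4.work.com"):
        print(f"{gap:.0f} s between duplicate endpoint and unheard timeout")
```

For the two lines quoted, the gap comes out to 621 seconds, which is close to the 10-ish minute window you describe. That is only a correlation and does not by itself confirm the init script theory.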

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF