Re: [gridengine users] Debugging a commlib error following reboot of exec host

Joshua Baker-LePain Tue, 03 Jul 2018 11:32:06 -0700

On Tue, 26 Jun 2018 at 9:12am, Mun Johl wrote:

We're using SGE 8.1.9 on CentOS 6.9


"All of the sudden" we've noticed that when we reboot an execution host,
any jobs sent to it within the first 10-15 min following boot-up will
get stuck in the 't' state until deleted (sometimes that has to be done
forcibly).  However, after 10-ish minutes, the execution host will start
accepting jobs.

In the qmaster's messages file, I see the following entries:

06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique error (endpoint 
"sim4.work.com/execd/1" is already connected)
06/25/2018 10:38:36| timer|sim1|W|failed to deliver job 54312.1 to queue 
"[email protected]"
06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target "execd" on host 
"sim4.work.com", can't deliver job "54312"

One possibility occurs to me.  SoGE 8.1.9 has a bug where "qconf -s"
commands fail on non-admin hosts (see
<https://arc.liv.ac.uk/trac/SGE/ticket/1576>).  One side effect of this
is that the init script fails to properly shut down the execd.  I'm
wondering if that's leading to your problem.  I don't see this, but I'm
running on CentOS-7, which may lead to some different behavior.
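
To see whether the "endpoint is not unique" errors consistently line up
with reboots, it may help to pull the relevant entries out of the
qmaster's messages file.  Below is a minimal Python sketch.  It assumes
each entry sits on a single line in the real file (the entries quoted
above are wrapped by email) and follows the pipe-delimited layout seen
there: timestamp|thread|host|level|message.  The names parse_line and
find_execd_problems are hypothetical helpers, not part of SGE.

```python
from typing import Iterable, List, NamedTuple, Optional


class Entry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one messages-file line into its five pipe-delimited fields.

    Returns None for lines that do not match the assumed layout.
    """
    # Limit the split to 4 so any "|" inside the message text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return Entry(*(p.strip() for p in parts))


# Message fragments taken from the entries quoted in this thread.
PATTERNS = ("endpoint is not unique", "max. unheard timeout")


def find_execd_problems(lines: Iterable[str], exec_host: str) -> List[Entry]:
    """Return entries that mention exec_host and one of the known patterns."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if exec_host in entry.message and any(p in entry.message for p in PATTERNS):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Sample input: the first quoted entry, joined back onto one line.
    sample = [
        '06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not '
        'unique error (endpoint "sim4.work.com/execd/1" is already connected)',
    ]
    for e in find_execd_problems(sample, "sim4.work.com"):
        print(e.timestamp, e.level, e.message)
```

Comparing the timestamps of the matching entries against the exec
host's boot time should show whether the stale connection always dates
from just before the reboot, which is what an execd that was never shut
down cleanly would suggest.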


--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
