On Wed, 10 Jun 2015 12:42:11 +0000 Dan Hyatt <[email protected]> wrote:
> Ahh the reverse
>
> before I hosed and rebuilt the system, if a node was not responding
> correctly (a server), then it would be marked as an error and removed
> from the queue
> Sometimes this did not happen.
>
> Now, a good job is sent to a bad node, can't even talk to LDAP but
> responds to ping and will let me ssh in using keys... the job fails
> because of a sick server,
> The grid keeps sending jobs to the sick server (execute node).
>

It sounds like your nodes are dying in a subtly different way from
before, leaving execd just alive enough to cause problems.

You could try adding a load sensor that runs on each node, checks
whether the node can talk to LDAP, and reports the result. Set a
matching load threshold on the queues and that should prevent jobs
from being run on the node.

You could also repeat the check in the prolog and exit 99 if you
encounter a problem. Be careful which user you run the prolog as,
though.

-- 
William Hay <[email protected]>
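
A minimal sketch of the load sensor idea above, in Python. It follows
the usual Grid Engine load sensor convention: wait for a line on
stdin, print a begin/host:name:value/end report, and stop on "quit".
The LDAP host name, the complex name ldap_ok and the plain TCP connect
used as the health check are all assumptions for illustration, not
anything from the original setup. A real check might do an actual
LDAP lookup instead.

```python
#!/usr/bin/env python3
"""Sketch of a Grid Engine load sensor that reports LDAP reachability."""
import socket
import sys

LDAP_HOST = "ldap.example.com"  # hypothetical server name, replace with your own
LDAP_PORT = 389                 # standard LDAP port; adjust if you use LDAPS
COMPLEX_NAME = "ldap_ok"        # hypothetical complex; must exist in your complex config


def ldap_reachable(host=LDAP_HOST, port=LDAP_PORT, timeout=5.0):
    """Crude check: can we open a TCP connection to the LDAP server?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def format_report(hostname, ok):
    """Build one load report block: 1 means healthy, 0 means broken."""
    return "begin\n{}:{}:{}\nend\n".format(hostname, COMPLEX_NAME, 1 if ok else 0)


def run_sensor(stdin=sys.stdin, stdout=sys.stdout, check=ldap_reachable):
    """Emit one report per input line until 'quit' or end of input."""
    # Assumption: gethostname() matches the name the qmaster knows this host by.
    hostname = socket.gethostname()
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(format_report(hostname, check()))
        stdout.flush()  # execd reads from a pipe, so flush every report


# When installing this as a load sensor, end the script with a call to
# run_sensor() so it starts answering requests from execd.
```

The same ldap_reachable() check could be reused in a prolog that exits
99 on failure, as suggested above. The check is passed in as a
parameter so it can be swapped for a stricter test without touching
the reporting loop.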
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
