On Wed, 10 Jun 2015 12:42:11 +0000
Dan Hyatt <[email protected]> wrote:

> Ahh the reverse
> 
> Before I hosed and rebuilt the system, if a node (a server) was not
> responding correctly, it would be marked as an error and removed
> from the queue.
> Sometimes this did not happen.
> 
> Now a good job is sent to a bad node that can't even talk to LDAP but
> responds to ping and will let me ssh in using keys. The job fails
> because of the sick server, and the grid keeps sending jobs to that
> sick server (execute node).
>
It sounds like your nodes are now dying in a subtly different way than
before, leaving execd just alive enough to cause problems.
You could try adding a load sensor that runs on each node, checks
whether it can talk to LDAP, and reports the result.  Set a matching load
threshold on the queues and that should prevent jobs being sent to
the node.  You could also repeat the check in the prolog and exit 99
(which should get the job rescheduled) if you encounter a problem.
Be careful which user the prolog runs as, though.
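
As a rough illustration of the load sensor and prolog idea, here is a
minimal Python sketch, not a tested production script.  It follows the
usual load sensor protocol: execd writes a line to the sensor's stdin
whenever it wants a report, the sensor answers with a begin/end block of
host:name:value lines, and it stops when it reads "quit".  The complex
name ldap_down, the probe account someuser, and the use of getent as the
LDAP check are all assumptions made for the example.  Note that getent
may be answered from an nscd/sssd cache, and that the reported host name
has to match the name Grid Engine knows the node by.

```python
#!/usr/bin/env python3
"""Sketch: LDAP health check usable as a Grid Engine load sensor or prolog."""
import socket
import subprocess
import sys

COMPLEX_NAME = "ldap_down"  # hypothetical complex; define it yourself (qconf -mc)
PROBE_USER = "someuser"     # hypothetical account that exists only in LDAP


def ldap_down(user=PROBE_USER, timeout=10):
    """Return 1 if a name-service lookup for `user` fails or hangs, else 0."""
    try:
        result = subprocess.run(
            ["getent", "passwd", user],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return 1
    return 0 if result.returncode == 0 else 1


def run_sensor(infile, outfile, check=ldap_down, host=None):
    """Load sensor loop: one begin/end report per line read, stop on 'quit'."""
    host = host or socket.gethostname()
    for line in infile:
        if line.strip() == "quit":
            break
        outfile.write(f"begin\n{host}:{COMPLEX_NAME}:{check()}\nend\n")
        outfile.flush()


def prolog_exit_code(check=ldap_down):
    """Exit status for prolog use: 99 if the check fails, 0 otherwise."""
    return 99 if check() else 0


if __name__ == "__main__":
    if sys.argv[1:] == ["--prolog"]:
        sys.exit(prolog_exit_code())
    run_sensor(sys.stdin, sys.stdout)
```

Run without arguments it behaves as a load sensor; with the (made-up)
--prolog flag it does a single check and exits 99 on failure.  With the
value defined this way, a load threshold along the lines of ldap_down=1
on the queue should put the queue instance into alarm state while the
lookup is failing.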


 
-- 
William Hay <[email protected]>


