Hi,

On 13.12.2014 at 22:16, Sean Smith wrote:

> our old qmaster died, and after replacing it we are now faced with these 
> issues.

By "replacing", do you mean the hardware? Did you install the same version of 
SGE that was in use before?


> These don't happen 100% of the time but most of the time.
> 
> From the user's perspective the job ran all the way to completion, but SGE 
> is unhappy and starts putting the queue into the E state...
> 
> spool/host/messages contains:
> 
> 12/13/2014 13:03:29|  main|dvgrid14|E|shepherd of job 267412.1 exited with exit status = 7
> 12/13/2014 13:03:29|  main|dvgrid14|E|abnormal termination of shepherd for job 267412.1: no "exit_status" file
> 12/13/2014 13:03:29|  main|dvgrid14|E|can't open file active_jobs/267412.1/error: No such file or directory
> 12/13/2014 13:03:29|  main|dvgrid14|E|can't open pid file "active_jobs/267412.1/pid" for job 267412.1
> 
> I then did a qconf -mconf and added:
> 
> execd_params                 KEEP_ACTIVE=true
> 
> as suggested by several Google searches.
> 
> When the job fails I see only these three files. It seems like something is 
> missing?

This sounds like an NFS issue. Do you have a shared spool directory for the 
exechosts or is it local on each of them?
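
One quick way to answer this on the exechost is to look at the filesystem type behind the spool directory. The following is only a minimal sketch: it assumes GNU coreutils `stat` (Linux), the helper names `spool_fs_type` and `is_nfs` are made up for illustration, and the default path is simply the one from the listing in your mail.

```shell
#!/bin/sh
# Sketch only: report whether a directory lives on an NFS mount.
# Assumes GNU stat; spool_fs_type and is_nfs are hypothetical helper names.

spool_fs_type() {
    # -f queries the filesystem instead of the file, %T prints its type name
    stat -f -c %T "$1"
}

is_nfs() {
    # Succeeds if the reported filesystem type starts with "nfs"
    case "$(spool_fs_type "$1")" in
        nfs*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Default path taken from the directory listing in the original mail
dir="${1:-/sge/ge-2011.11p1/colo/spool/dvgrid14}"

if [ ! -d "$dir" ]; then
    echo "$dir not found on this host"
elif is_nfs "$dir"; then
    echo "$dir is on NFS"
else
    echo "$dir is local ($(spool_fs_type "$dir"))"
fi
```

Run it on dvgrid14 itself, since the same path may be mounted differently on the qmaster. The configured location can be cross-checked with the `execd_spool_dir` entry shown by `qconf -sconf`.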

-- Reuti


> ✔ /sge/ge-2011.11p1/colo/spool/dvgrid14/active_jobs/267412.1 
> 13:09 $ ll
> total 28
> -rw-r--r-- 1 sgeadmin it-group  2204 Dec 13 13:03 config
> -rw-r--r-- 1 sgeadmin it-group 16806 Dec 13 13:03 environment
> -rw-r--r-- 1 sgeadmin it-group    59 Dec 13 13:03 pe_hostfile
> 
> Why is there no error or pid file?
> 
> Any suggestions? I'm stuck.
> 
> Sean
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

