Hi All,

Our old qmaster died, and after replacing it we are now faced with the 
following issues.

They don't happen 100% of the time, but they do most of the time.

From the user's perspective the job ran all the way to completion, but SGE is 
unhappy and starts putting the queue into the E (error) state.

spool/host/messages contains:

12/13/2014 13:03:29|  main|dvgrid14|E|shepherd of job 267412.1 exited with exit status = 7
12/13/2014 13:03:29|  main|dvgrid14|E|abnormal termination of shepherd for job 267412.1: no "exit_status" file
12/13/2014 13:03:29|  main|dvgrid14|E|can't open file active_jobs/267412.1/error: No such file or directory
12/13/2014 13:03:29|  main|dvgrid14|E|can't open pid file "active_jobs/267412.1/pid" for job 267412.1
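
In case it helps anyone pull the same entries out of their own messages file, here is a minimal sketch of how I filter them per job. It assumes each entry sits on one line with the pipe-separated layout seen above (date time|thread|host|level|message). job_errors is just a hypothetical helper name, not an SGE tool, and the match is a plain substring match on the job id, so it may also catch longer ids that contain it.

```shell
#!/bin/sh
# Sketch: print error-level ("E") entries that mention one job.
# Assumed field layout, inferred from the log lines above:
#   date time|thread|host|level|message
# job_errors is a hypothetical helper, not part of SGE.
job_errors() {
    job_id="$1"   # e.g. 267412.1
    file="$2"     # path to the execd messages file
    # $4 is the level field; index() does a literal substring match,
    # so the dot in the job id is not treated as a regex wildcard.
    awk -F'|' -v job="$job_id" '$4 == "E" && index($0, job) > 0' "$file"
}
```

Usage would be something like: job_errors 267412.1 spool/host/messages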

I then ran qconf -mconf and added:

execd_params                 KEEP_ACTIVE=true

as suggested by several Google searches.

When a job fails, I see only these three files in its active_jobs directory, 
so it seems like something is missing:

✔ /sge/ge-2011.11p1/colo/spool/dvgrid14/active_jobs/267412.1
13:09 $ ll
total 28
-rw-r--r-- 1 sgeadmin it-group  2204 Dec 13 13:03 config
-rw-r--r-- 1 sgeadmin it-group 16806 Dec 13 13:03 environment
-rw-r--r-- 1 sgeadmin it-group    59 Dec 13 13:03 pe_hostfile
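
To check other failed jobs quickly, a rough sketch along these lines can report which of the files named in the execd errors (pid, error, exit_status) are absent from a job directory. missing_job_files is a hypothetical helper name of my own, and the list of files is only what the error messages mention, not a complete list of what the shepherd writes.

```shell
#!/bin/sh
# Sketch: list which of the files mentioned in the execd errors are
# missing from a job's active_jobs directory.
# missing_job_files is a hypothetical helper, not part of SGE.
missing_job_files() {
    dir="$1"   # e.g. .../spool/dvgrid14/active_jobs/267412.1
    for f in pid error exit_status; do
        # Print the name only when the file does not exist.
        [ -e "$dir/$f" ] || echo "$f"
    done
}
```

Run against the directory listed above, it would print all three names, since only config, environment and pe_hostfile are present.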

Why is there no error or pid file?

Any suggestions? I'm stuck.

Sean
