Hi All, our old qmaster died, and after replacing it we are now facing these issues.
These don't happen 100% of the time, but they do happen most of the time. From the user's perspective the job ran all the way to completion, but SGE is unhappy and starts marking the queue in the E state...

spool/host/messages contains:

    12/13/2014 13:03:29| main|dvgrid14|E|shepherd of job 267412.1 exited with exit status = 7
    12/13/2014 13:03:29| main|dvgrid14|E|abnormal termination of shepherd for job 267412.1: no "exit_status" file
    12/13/2014 13:03:29| main|dvgrid14|E|can't open file active_jobs/267412.1/error: No such file or directory
    12/13/2014 13:03:29| main|dvgrid14|E|can't open pid file "active_jobs/267412.1/pid" for job 267412.1

I then did a qconf -mconf and added:

    execd_params KEEP_ACTIVE=true

per several Google searches. When the job fails I just see these three files. It seems like something is missing??

    ✔ /sge/ge-2011.11p1/colo/spool/dvgrid14/active_jobs/267412.1 13:09 $ ll
    total 28
    -rw-r--r-- 1 sgeadmin it-group  2204 Dec 13 13:03 config
    -rw-r--r-- 1 sgeadmin it-group 16806 Dec 13 13:03 environment
    -rw-r--r-- 1 sgeadmin it-group    59 Dec 13 13:03 pe_hostfile

Why is there no error or pid file? Any suggestions? I'm stuck....

Sean
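P.S. In case it helps anyone compare notes, here is a rough sketch of how the error lines in an execd messages file could be grouped by job ID, to see which jobs hit the shepherd failure. It assumes the pipe-separated layout shown in the log excerpt (timestamp, thread, host, level, text). The helper names `parse_line` and `errors_by_job` and the job-ID regular expression are made up for illustration and are not part of SGE.

```python
import re
from collections import defaultdict

# Assumed pattern: a job.task ID that follows "job " or "active_jobs/".
JOB_RE = re.compile(r"(?:job |active_jobs/)(\d+\.\d+)")


def parse_line(line):
    """Split one messages line into (timestamp, thread, host, level, text).

    Returns None when the line does not have the five expected fields.
    """
    parts = [p.strip() for p in line.rstrip("\n").split("|", 4)]
    if len(parts) != 5:
        return None
    return tuple(parts)


def errors_by_job(lines):
    """Group error-level ('E') message texts by the job ID they mention."""
    grouped = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed[3] != "E":
            continue
        match = JOB_RE.search(parsed[4])
        if match:
            grouped[match.group(1)].append(parsed[4])
    return dict(grouped)


# The four lines quoted from spool/host/messages in this post.
SAMPLE = [
    "12/13/2014 13:03:29| main|dvgrid14|E|shepherd of job 267412.1 exited with exit status = 7",
    '12/13/2014 13:03:29| main|dvgrid14|E|abnormal termination of shepherd for job 267412.1: no "exit_status" file',
    "12/13/2014 13:03:29| main|dvgrid14|E|can't open file active_jobs/267412.1/error: No such file or directory",
    "12/13/2014 13:03:29| main|dvgrid14|E|can't open pid file \"active_jobs/267412.1/pid\" for job 267412.1",
]

if __name__ == "__main__":
    for job_id, messages in errors_by_job(SAMPLE).items():
        print(job_id, len(messages))
        for text in messages:
            print("   ", text)
```

To run it against a real file, something like `errors_by_job(open(path))` should work, where `path` points at the execd messages file for the host in question.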
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
