________________________________________
From: Reuti [[email protected]]
Sent: Sunday, December 14, 2014 4:36 AM
To: Sean Smith
Cc: [email protected]
Subject: Re: [gridengine users] shepherd of job 267412.1 exited with exit status = 7

Hi,

On 13.12.2014 at 22:16, Sean Smith wrote:

> our old qmaster died and after replacing it we are now faced with these
> issues.

by "replacing" you mean the hardware? You installed the same version of SGE which was in use before?

Yes, the HW died so we rehosted it onto another host. Same version of SW and OS.

> These don't happen 100% of the time but most of the time.
>
> from the user perspective job ran all the way to completion but SGE is
> unhappy and starts marking the Q in the E State...
>
> spool/host/messages contains:
>
> 12/13/2014 13:03:29| main|dvgrid14|E|shepherd of job 267412.1 exited with exit status = 7
> 12/13/2014 13:03:29| main|dvgrid14|E|abnormal termination of shepherd for job 267412.1: no "exit_status" file
> 12/13/2014 13:03:29| main|dvgrid14|E|can't open file active_jobs/267412.1/error: No such file or directory
> 12/13/2014 13:03:29| main|dvgrid14|E|can't open pid file "active_jobs/267412.1/pid" for job 267412.1
>
> I then did a qconf -mconf and added:
>
> execd_params KEEP_ACTIVE=true
>
> per several google searches:
>
> when the job fails I just see these three files: It seems like something is
> missing??

This sounds like an NFS issue. Do you have a shared spool directory for the exechosts or is it local on each of them?

-- Reuti

I have a shared spool directory that is exported from the qmaster. I set the permissions to 777, and it appears that sgeadmin and root can both create files in this hierarchy.

Any suggestions?

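One way to test the "can create files" claim from the exec host itself, rather than from the qmaster, is a small write/read-back probe like the sketch below. It is only an illustration under assumptions: the function name `probe_directory` and the `sge_probe_` file prefix are made up here, it is not part of SGE, and the directory to test has to be given on the command line because the real spool path is not stated in this thread. It makes sense to run it on dvgrid14 once as root and once as sgeadmin, since NFS exports commonly squash root to an unprivileged user, so root on a client may behave differently from root on the server.

```python
import os
import sys
import tempfile


def probe_directory(path):
    """Create, read back, and delete a small file in `path`.

    Returns a dict describing what worked. `probe_directory` is a
    hypothetical helper for this thread, not an SGE tool.
    """
    result = {"path": path, "writable": False, "read_back": False, "error": None}
    payload = b"probe\n"
    try:
        # mkstemp creates the file with a unique name inside `path`.
        fd, name = tempfile.mkstemp(prefix="sge_probe_", dir=path)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # push the data out to the (possibly remote) filesystem
        finally:
            os.close(fd)
        result["writable"] = True
        # Re-open by name to confirm the file is visible and intact.
        with open(name, "rb") as fh:
            result["read_back"] = fh.read() == payload
        os.remove(name)
    except OSError as exc:
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


if __name__ == "__main__":
    # Pass the directory to test, e.g. the exec host's active_jobs directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(probe_directory(target))
```

A clean result here only shows that a single create/read/delete cycle works for the user who ran it; it does not rule out intermittent NFS problems, which would fit the "most of the time" pattern described above.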
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
