________________________________________
From: Reuti [[email protected]]
Sent: Sunday, December 14, 2014 4:36 AM
To: Sean Smith
Cc: [email protected]
Subject: Re: [gridengine users] shepherd of job 267412.1 exited with exit 
status = 7

Hi,

On 13.12.2014 at 22:16, Sean Smith wrote:

> Our old qmaster died, and after replacing it we are now faced with these 
> issues.

By "replacing" do you mean the hardware? Did you install the same version of 
SGE that was in use before?

Yes, the hardware died, so we rehosted the qmaster on another host.  Same 
version of software and OS.


> These don't happen 100% of the time but most of the time.
>
> From the user's perspective the job ran all the way to completion, but SGE 
> is unhappy and starts putting the queue into the E (error) state...
>
> spool/host/messages contains:
>
> 12/13/2014 13:03:29| main|dvgrid14|E|shepherd of job 267412.1 exited with 
> exit status = 7
> 12/13/2014 13:03:29| main|dvgrid14|E|abnormal termination of shepherd for job 
> 267412.1: no "exit_status" file
> 12/13/2014 13:03:29| main|dvgrid14|E|can't open file 
> active_jobs/267412.1/error: No such file or directory
> 12/13/2014 13:03:29| main|dvgrid14|E|can't open pid file 
> "active_jobs/267412.1/pid" for job 267412.1
>
> I then did a qconf -mconf and added:
>
> execd_params KEEP_ACTIVE=true
>
> per several Google searches.
>
> When the job fails I just see these three files. It seems like something is 
> missing?

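The quoted messages lines can be grouped by job to see whether every failure shows the same pattern. Below is a minimal sketch in Python. The field layout (date time|thread|host|level|text) is inferred only from the four lines quoted above, and the names LINE_RE, JOB_RE, SAMPLE and errors_by_job are hypothetical helpers, not part of SGE.

```python
import re
from collections import defaultdict

# Layout inferred from the quoted lines: date time|thread|host|level|text
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"\s*(?P<thread>[^|]*)\|(?P<host>[^|]*)\|(?P<level>[A-Z])\|(?P<text>.*)$"
)
# A job id appears either as "job 267412.1" or inside "active_jobs/267412.1/..."
JOB_RE = re.compile(r"(?:job |active_jobs/)(\d+\.\d+)")

# The four lines from the thread, with the mail line wrapping undone.
SAMPLE = [
    "12/13/2014 13:03:29| main|dvgrid14|E|shepherd of job 267412.1 exited with exit status = 7",
    '12/13/2014 13:03:29| main|dvgrid14|E|abnormal termination of shepherd for job 267412.1: no "exit_status" file',
    "12/13/2014 13:03:29| main|dvgrid14|E|can't open file active_jobs/267412.1/error: No such file or directory",
    '12/13/2014 13:03:29| main|dvgrid14|E|can\'t open pid file "active_jobs/267412.1/pid" for job 267412.1',
]

def errors_by_job(lines):
    """Group error-level (E) messages by the job id they mention."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("level") != "E":
            continue  # skip non-matching lines and non-error levels
        text = match.group("text")
        job = JOB_RE.search(text)
        key = job.group(1) if job else None  # None: no job id in the text
        grouped[key].append((match.group("host"), text))
    return dict(grouped)

if __name__ == "__main__":
    for job, entries in errors_by_job(SAMPLE).items():
        print(job, len(entries), "error lines")
```

To run this on a real file, the wrapped lines would have to be read unwrapped from the execd messages file itself; its location depends on the local spool setup.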
This sounds like an NFS issue. Do you have a shared spool directory for the 
exechosts or is it local on each of them?

-- Reuti


I have a shared spool directory that is exported from the qmaster.  I set the 
permissions to 777, and it appears that sgeadmin and root can both create 
files in this hierarchy.
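
The "can create files" check can be made repeatable from an execution host. The sketch below is a hypothetical helper (check_spool_writes is not an SGE tool) that creates, syncs, re-reads and deletes a scratch file in a given directory and reports each step. The directory to test is site-specific and is an assumption here; it would be the host's spool directory holding active_jobs.

```python
import os
import tempfile

def check_spool_writes(directory):
    """Create, sync, re-read and delete a scratch file in `directory`.

    Returns a dict mapping each step to True, or to the OSError text
    if that step failed (read_back is False if the content differs).
    """
    results = {}
    try:
        fd, path = tempfile.mkstemp(prefix="spoolcheck.", dir=directory)
        results["create"] = True
    except OSError as exc:
        results["create"] = str(exc)
        return results  # nothing else can be tested without a file
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("7\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the data out to the server
        results["write"] = True
    except OSError as exc:
        results["write"] = str(exc)
    try:
        with open(path) as handle:
            results["read_back"] = handle.read() == "7\n"
    except OSError as exc:
        results["read_back"] = str(exc)
    try:
        os.remove(path)
        results["delete"] = True
    except OSError as exc:
        results["delete"] = str(exc)
    return results

if __name__ == "__main__":
    import sys
    # Usage: python spoolcheck.py /path/to/spool/dir  (path is site-specific)
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for step, outcome in check_spool_writes(target).items():
        print(f"{step}: {outcome}")
```

Running it once as root and once as sgeadmin on the execution host, rather than on the qmaster, would show whether both users behave the same over the NFS mount; exports that use root squashing can treat root differently from what local tests on the server suggest.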

Any suggestions?



_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
