On Fri, Jun 15, 2012 at 1:46 PM, Michael Coffman
<[email protected]> wrote:
> Also might be of interest:

Thanks. Are there also any related messages in the execd "messages" file?
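
For scanning that file, here is a minimal Python sketch. It assumes the execd "messages" file uses the same pipe-delimited layout as the qmaster line quoted further down in this thread (timestamp|thread|host|level|text). The field names and function names are illustrative only and are not part of Grid Engine.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Sequence


class LogEntry(NamedTuple):
    timestamp: str
    thread: str
    host: str      # assumption: third field is the host name
    level: str     # single letter, e.g. "W" in the quoted line
    message: str


def parse_entries(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield entries from pipe-delimited lines; skip lines that do not fit."""
    for line in lines:
        # Split on the first four pipes only, so the message text stays whole.
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) == 5:
            yield LogEntry(*(p.strip() for p in parts))


def entries_mentioning(
    lines: Iterable[str],
    needle: str,
    levels: Optional[Sequence[str]] = None,
) -> List[LogEntry]:
    """Return entries whose message contains `needle`.

    If `levels` is given, keep only entries with one of those level letters.
    """
    return [
        e for e in parse_entries(lines)
        if needle in e.message and (levels is None or e.level in levels)
    ]
```

Usage would be something like `entries_mentioning(open(path), "job_pid")`, where `path` is wherever the messages file lives under the execd spool directory of the affected host.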

Rayson




>
> ==============================================================
> qname        all.q
> hostname     cs431.ftc.avagotech.net
> group        fidlib
> owner        bgp
> project      NONE
> department   priority
> jobname      qsubcmd.21231
> jobnumber    17593
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Wed Dec 31 17:00:00 1969
> start_time   -/-
> end_time     -/-
> granted_pe   NONE
> slots        0
> failed       11  : before job
> exit_status  0
> ru_wallclock 0
> ru_utime     0.000
> ru_stime     0.000
> ru_maxrss    0
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    0
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   0
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     0
> ru_nivcsw    0
> cpu          0.000
> mem          0.000
> io           0.000
> iow          0.000
> maxvmem      0.000
> arid         undefined
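
To pick such records out of a larger accounting dump, a rough Python sketch like the one below could turn each record into a dict and split the "failed" field into a numeric code and a reason. It assumes only the "key value" layout visible in the record above; the function names are made up for illustration.

```python
from typing import Dict, Tuple


def parse_qacct_record(text: str) -> Dict[str, str]:
    """Parse one 'key   value' record, tolerating '>' quote prefixes."""
    record: Dict[str, str] = {}
    for raw in text.splitlines():
        # Drop email quote markers and surrounding whitespace.
        line = raw.lstrip("> ").rstrip()
        # Skip empty lines and the '=====' separator line.
        if not line or set(line) == {"="}:
            continue
        key, _, value = line.partition(" ")
        record[key] = value.strip()
    return record


def failure_of(record: Dict[str, str]) -> Tuple[int, str]:
    """Split a 'failed' value such as '11  : before job' into (11, 'before job')."""
    code, _, reason = record.get("failed", "0").partition(":")
    return int(code.strip()), reason.strip()
```

For the record above, `failure_of(...)` gives `(11, "before job")`. Note also that the qsub_time shown is the Unix epoch (time 0) rendered in a UTC-7 time zone, which suggests the submission time was never filled in for this record.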
>
>
>
> On Fri, Jun 15, 2012 at 11:27 AM, Michael Coffman
> <[email protected]> wrote:
>>
>> On Fri, Jun 15, 2012 at 11:11 AM, Rayson Ho <[email protected]> wrote:
>>>
>>> Can you set KEEP_ACTIVE=true in "execd_params" for this host? (See the
>>> manpage at this URL:
>>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html )
>>>
>>> Request the job to run in this queue/host again, and see why the
>>> shepherd can't open the job_pid.
>>>
>>> (And remember to unset the execd_params or else you will fill up your
>>> local spool dir eventually with job information.)
>>>
>>
>> I can't do this on my production grid, and I don't currently know how to
>> replicate the problem. I will set things up on a test setup and try to
>> reproduce the issue with KEEP_ACTIVE turned on.
>>
>> Is it possible to set KEEP_ACTIVE per host? I only see it in the output
>> of qconf -sconf.
>>
>>>
>>> Rayson
>>>
>>>
>>>
>>> On Fri, Jun 15, 2012 at 12:58 PM, Michael Coffman
>>> <[email protected]> wrote:
>>> > On Fri, Jun 15, 2012 at 10:11 AM, Rayson Ho <[email protected]> wrote:
>>> >>
>>> >> On Fri, Jun 15, 2012 at 12:01 PM, Michael Coffman
>>> >> <[email protected]> wrote:
>>> >> > From the qmaster messages file:
>>> >> > 06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host cs428.ftc.avagotech.net general before job because: 06/14/2012 21:29:37 [20339:8436]: can't open file job_pid: Permission denied
>>> >> >
>>> >> > I checked a job_pid file for a currently running job on one of the
>>> >> > systems that had the above errors. Permissions down the entire tree
>>> >> > seem fine, and here is the job_pid file:
>>> >> >
>>> >> > -rw-r--r-- 1 grid  grid       6 Jun 14 17:40 job_pid
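
Checking permissions "down the entire tree" by hand is easy to get wrong, so here is a small Python sketch that prints the mode and owner of a file and of every parent directory, and lists the directories a given uid could not traverse. It looks only at the classic owner/group/other execute bits and ignores ACLs, root privileges, and mount options. The function names are hypothetical, and nothing here reflects how sge_shepherd itself opens the file.

```python
import os
import stat
from pathlib import Path
from typing import Collection, List, Tuple


def describe_path_chain(target) -> List[Tuple[str, str, int, int]]:
    """Return (path, mode string, uid, gid) for target and each parent up to the root."""
    path = Path(target).resolve()
    rows = []
    for p in [path, *path.parents]:
        st = os.stat(p)
        rows.append((str(p), stat.filemode(st.st_mode), st.st_uid, st.st_gid))
    return rows


def unsearchable_dirs(target, uid: int, gids: Collection[int]) -> List[str]:
    """List parent directories lacking the execute (search) bit for uid/gids.

    Simplified model: owner bits if uid owns the directory, group bits if the
    directory's group is in gids, otherwise the 'other' bits.
    """
    blocked = []
    for p in Path(target).resolve().parents:
        st = os.stat(p)
        if uid == st.st_uid:
            ok = st.st_mode & stat.S_IXUSR
        elif st.st_gid in gids:
            ok = st.st_mode & stat.S_IXGRP
        else:
            ok = st.st_mode & stat.S_IXOTH
        if not ok:
            blocked.append(str(p))
    return blocked
```

Running `describe_path_chain` on a job_pid path, and `unsearchable_dirs` with the uid and group ids of the account that gets "Permission denied", should show whether any directory above the file is the actual obstacle.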
>>> >>
>>> >> Is your execd spool dir on NFS or local??
>>> >>
>>> > Local.
>>> >
>>> >>
>>> >> Also, does it happen to all nodes or just a node or queue?
>>> >>
>>> >
>>> > It happened on 2 different nodes. Not all jobs triggered it.
>>> >
>>> >>
>>> >> Rayson
>>> >>
>>> >>
>>> >>
>>> >> >
>>> >> > Any clues? Is the path perhaps hard-coded into sge_shepherd for this
>>> >> > file?
>>> >> >
>>> >> > Thanks.
>>> >> > --
>>> >> > -MichaelC
>>> >> >
>>> >> > _______________________________________________
>>> >> > users mailing list
>>> >> > [email protected]
>>> >> > https://gridengine.org/mailman/listinfo/users
>>> >> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > -MichaelC
>>
>>
>>
>>
>> --
>> -MichaelC
>
>
>
>
> --
> -MichaelC

