On 01.04.2011, at 12:54, lars van der bijl wrote:

> In this case, yes.
> 
> However, on the jobs running on our farm we have put no memory limits as of yet;
> we just request the number of procs.
> 
> Is it the usual behaviour that if a job fails with this code the subsequent
> dependencies start regardless?

Yes, for SGE the -hold_jid option will only check whether the predecessor has left 
the system. Its exit state isn't checked or honored. That check needs to be done in 
the follow-up job script itself, which can then send itself into the error state so 
that the failure isn't lost.

If you know the list of job names you want to handle in advance, you could submit 
all but the first job with -h (user hold), and each finished job then has to release 
its follow-up job, either in the job script or in a queue epilog.
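
A minimal sketch of that pattern (using plain job ids and an environment variable 
instead of job names, purely for illustration):

$ JOB2=$(qsub -terse -h -N step2 step2.sh | cut -d. -f1)   # successor submitted with a user hold
$ qsub -N step1 -v SUCCESSOR=$JOB2 step1.sh                # predecessor learns whom to release

and at the end of step1.sh (or in a queue epilog), after it has checked its own result:

qrls $SUCCESSOR    # lift the user hold so the follow-up job becomes eligible to run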

A place to specify the names of the follow-up jobs could be the job context: this 
meta information is just a comment as far as SGE is concerned, but you can read it 
from within the job and act upon it.
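
Just to illustrate the job context idea (the variable name FOLLOWUP is made up here):

$ qsub -N step1 -ac FOLLOWUP=step2 step1.sh

and inside the job script ($JOB_ID is set by SGE for the running job):

CONTEXT=$(qstat -j $JOB_ID | sed -n 's/^context: *//p')
# CONTEXT now holds e.g. "FOLLOWUP=step2" and can be parsed to find the job(s) to act upon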

-- Reuti


> Lars
> 
> 
> On 1 April 2011 11:41, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
> 
> On 01.04.2011, at 12:33, lars van der bijl wrote:
> 
> > Hey everyone.
> >
> > We're having some issues with jobs being killed with exit status 137.
> 
> 137 = 128 + 9
> 
> $ kill -l
>  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
>  5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
>  9) SIGKILL     ...
> 
> So, the job was killed. Did you request too small a value for h_vmem or h_rt?
> 
> -- Reuti
> 
> 
> > This causes the task to finish and start its dependent task, which is causing 
> > all kinds of havoc.
> >
> > Submitting a job with a very small max memory limit gives me the following as an 
> > example.
> >
> > $ qacct -j 21141
> > ==============================================================
> > qname        test.q
> > hostname     atom12.**
> > group        **
> > owner        lars
> > project      NONE
> > department   defaultdepartment
> > jobname      stest__out__geometry2
> > jobnumber    21141
> > taskid       101
> > account      sge
> > priority     0
> > qsub_time    Fri Apr  1 11:22:30 2011
> > start_time   Fri Apr  1 11:22:31 2011
> > end_time     Fri Apr  1 11:22:39 2011
> > granted_pe   smp
> > slots        4
> > failed       100 : assumedly after job
> > exit_status  137
> > ru_wallclock 8
> > ru_utime     0.281
> > ru_stime     0.167
> > ru_maxrss    3744
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    70739
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   8
> > ru_oublock   224
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     1072
> > ru_nivcsw    439
> > cpu          2.240
> > mem          0.573
> > io           0.145
> > iow          0.000
> > maxvmem      405.820M
> > arid         undefined
> >
> > Does anyone know of a reason why the task would be killed with this exit status, 
> > or how to catch it?
> >
> > Lars
> >
> 
> 

