Also, is there any way of catching this and raising 100? Once the job is
finished and its dependencies start, it's causing major havoc on our system,
looking for files that aren't there.

Are there other things the grid uses SIGKILL for, not just memory
limits?

Lars
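
A minimal sketch of one way to "raise 100", assuming the real work is started
as a child of the job script and only that child dies from the signal. SIGKILL
itself cannot be trapped, so if the whole job (wrapper included) is killed,
this will not help. The function names map_exit_status and run_and_map are
hypothetical. The sketch relies on the Grid Engine convention that a job script
exiting with status 100 is put into the error state, which is assumed here to
keep dependent jobs from starting.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Grid Engine.

# Map "killed by a signal" (shell exit status > 128) to 100;
# pass every other exit status through unchanged.
map_exit_status() {
    if [ "$1" -gt 128 ]; then
        echo 100
    else
        echo "$1"
    fi
}

# Run the given command as a child process and return the mapped status.
run_and_map() {
    "$@"
    status=$?
    return "$(map_exit_status "$status")"
}
```

In a job script this would be used as the last step, for example
`run_and_map ./render_task.sh; exit $?`, where render_task.sh is a placeholder
for the real command.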

On 1 April 2011 11:54, lars van der bijl <l...@realisestudio.com> wrote:

> in this case yes.
>
> However, on the jobs running on our farm we put no memory limits as of yet,
> we just request a number of procs.
>
> Is it the usual behaviour that if it fails with this code, the
> subsequent dependencies start regardless?
>
> Lars
>
>
>
> On 1 April 2011 11:41, Reuti <re...@staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> On 01.04.2011 at 12:33, lars van der bijl wrote:
>>
>> > Hey everyone.
>> >
>> > We're having some issues with jobs being killed with exit status 137.
>>
>> 137 = 128 + 9
>>
>> $ kill -l
>>  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
>>  5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
>>  9) SIGKILL     ...
>>
>> So, the job was killed. Did you request too small a value for h_vmem or
>> h_rt?
>>
>> -- Reuti
>>
>>
>> > This causes the task to finish and start its dependent task, which is
>> > causing all kinds of havoc.
>> >
>> > Submitting a job with a very small max memory limit gives me this as an
>> > example.
>> >
>> > $ qacct -j 21141
>> > ==============================================================
>> > qname        test.q
>> > hostname     atom12.**
>> > group        **
>> > owner        lars
>> > project      NONE
>> > department   defaultdepartment
>> > jobname      stest__out__geometry2
>> > jobnumber    21141
>> > taskid       101
>> > account      sge
>> > priority     0
>> > qsub_time    Fri Apr  1 11:22:30 2011
>> > start_time   Fri Apr  1 11:22:31 2011
>> > end_time     Fri Apr  1 11:22:39 2011
>> > granted_pe   smp
>> > slots        4
>> > failed       100 : assumedly after job
>> > exit_status  137
>> > ru_wallclock 8
>> > ru_utime     0.281
>> > ru_stime     0.167
>> > ru_maxrss    3744
>> > ru_ixrss     0
>> > ru_ismrss    0
>> > ru_idrss     0
>> > ru_isrss     0
>> > ru_minflt    70739
>> > ru_majflt    0
>> > ru_nswap     0
>> > ru_inblock   8
>> > ru_oublock   224
>> > ru_msgsnd    0
>> > ru_msgrcv    0
>> > ru_nsignals  0
>> > ru_nvcsw     1072
>> > ru_nivcsw    439
>> > cpu          2.240
>> > mem          0.573
>> > io           0.145
>> > iow          0.000
>> > maxvmem      405.820M
>> > arid         undefined
>> >
>> > Does anyone know of a reason why the task would be killed with this error
>> > state, or how to catch it?
>> >
>> > Lars
>> >
>> > _______________________________________________
>> > users mailing list
>> > users@gridengine.org
>> > https://gridengine.org/mailman/listinfo/users
>>
>>
>
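
The "137 = 128 + 9" arithmetic from the quoted reply can be sketched as a small
shell helper. This is only an illustration: the function names
exit_status_from_qacct and decode_exit_status are hypothetical, and the parsing
assumes qacct prints the field as "exit_status  <number>", as in the output
quoted above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Grid Engine.

# Read qacct-style "name  value" lines on stdin and print the exit_status value.
exit_status_from_qacct() {
    awk '$1 == "exit_status" { print $2 }'
}

# Describe an exit status: values above 128 mean "killed by signal (status - 128)".
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        signum=$((status - 128))
        # "kill -l <number>" prints the signal name without the SIG prefix.
        echo "signal $signum (SIG$(kill -l "$signum"))"
    else
        echo "exit code $status"
    fi
}
```

For example, `decode_exit_status "$(qacct -j 21141 | exit_status_from_qacct)"`
would report signal 9 for the job shown in the quoted output.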