On 01.04.2011 at 16:57, lars van der bijl wrote:

> core file size          (blocks, -c) 0
> <snip>
> file locks                      (-x) unlimited

Fine.

> I think it might be the machine killing them, because we're not putting any 
> other limits anywhere, unless it's the application we're running. The tasks 
> usually take up a lot of RAM, and if more than one hits a machine it can be 
> swapping like crazy.

This should show up in /var/log/messages, logged by the kernel.
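
For example, something along these lines should reveal oom-killer activity 
(the exact log file and message wording depend on distribution and kernel version):

  $ grep -iE 'out of memory|oom-killer|killed process' /var/log/messages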


> It would be good to still be able to catch this before it gets the signal.

Hehe - not with the oom-killer.

- Limit the slot count per machine.
- Limit and request memory in SGE; this way only the jobs that fit will be 
scheduled to a machine until its memory is exhausted (a rough sketch follows 
below this list).
- Install more memory.
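
A rough sketch of the memory-request approach, assuming h_vmem was made a 
consumable in `qconf -mc`; the host name and all values are examples only:

  # per exec host: declare how much memory the scheduler may hand out
  $ qconf -me atom12          # e.g. set: complex_values    h_vmem=16G

  # at submission: request memory per slot (4 slots x 4G = 16G here)
  $ qsub -pe smp 4 -l h_vmem=4G jobscript.sh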

-- Reuti


> 
> On 1 April 2011 15:50, Reuti <re...@staff.uni-marburg.de> wrote:
> Add on:
> 
> You can check the messages file of the execd on the nodes to see whether 
> anything about the reason was recorded there.
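>
> With a local spool directory and the default cell, that file is usually 
> something like this (the path will differ depending on your installation):
>
>   $ less $SGE_ROOT/default/spool/<nodename>/messages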
> 
> -- Reuti
> 
> 
> On 01.04.2011 at 16:39, lars van der bijl wrote:
> 
> > The problem is that I don't have any such limits enforced currently on 
> > submission. The submissions to qsub are hidden from the user, so I know they're 
> > not adding them. The only thing we have is a load/suspend threshold in 
> > the grid itself (np_load_avg, 1.75). If I give it the memory limit I get 
> > the same result, but the other jobs have been getting this same signal and 
> > were submitted without any limits.
> >
> >
> > atom10b = 215 times
> > huey = 856 times
> > atom24 = 356 times
> > atom23 = 345 times
> > atom05 = 669 times
> > atom15 = 796 times
> > atom12 = 432 times
> > atom22 = 250 times
> > centi = 152 times
> > sage = 186 times
> > atom08 = 588 times
> > fluffy = 101 times
> > atom20 = 561 times
> > atom10 = 570 times
> > neon = 129 times
> > atom17 = 358 times
> > atom14 = 188 times
> > atom13 = 414 times
> > atom21 = 406 times
> > atom11 = 182 times
> > dewey = 658 times
> > atom16 = 423 times
> > atom06 = 500 times
> > atom01 = 802 times
> > atom18 = 567 times
> > atom09 = 539 times
> > milly = 113 times
> > louie = 249 times
> > atom03 = 793 times
> > topsy = 69 times
> > atom02 = 834 times
> > atom04 = 359 times
> > atom07 = 791 times
> > atom19 = 488 times
> >
> >
> > That seems like a bit more than users killing it on their local machines. 
> > Could that load average be doing this? Or are there any other settings in qmon 
> > I might be overlooking?
> >
> > Lars
> >
> >
> > On 1 April 2011 15:14, Reuti <re...@staff.uni-marburg.de> wrote:
> > On 01.04.2011 at 15:55, lars van der bijl wrote:
> >
> > > <snip>
> > >
> > > Trying to raise a 100 error in the epilog didn't work. If the task failed 
> > > with 137 it will not accept anything else, it seems. It works fine for 
> > > other errors, but not for a kill command.
> >
> > Indeed. Nevertheless it's noted in `qacct` as "failed       100 : assumedly 
> > after job". Looks like it's too late to prevent kicking it out of the system.
> >
> > What about using s_vmem then? The job will get a SIGXCPU signal and can act 
> > upon it, e.g. with a trap in the jobscript. The binary on its own will be 
> > terminated unless you change the default behavior there too.
> >
> > (soft limits in SGE are not like soft ulimits [which introduce only a 
> > second lower limit on user request])
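> >
> > A minimal jobscript sketch of that idea; the limit values, the binary name 
> > and the cleanup step are placeholders only:
> >
> >   #!/bin/sh
> >   #$ -l s_vmem=4G,h_vmem=4.5G
> >   # SGE delivers SIGXCPU when the soft limit is reached
> >   trap 'echo "soft vmem limit hit" >&2; rm -f partial_output; exit 100' XCPU
> >   # run the payload in the background so the shell can handle the signal promptly
> >   ./my_binary &
> >   wait $!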
> >
> > -- Reuti
> >
> >
> > > > either inside of the grid code itself, or at another location, 
> > > > like the epilog, that gets run regardless of the task being killed?
> > > >
> > > > Also, is it normal that the task being killed will not run its epilog?
> > >
> > > No. The epilog will be executed, but not the remainder of the job script. 
> > > Otherwise necessary cleanups couldn't be executed.
> > >
> > > I see this. I'm printing an exit status of 100, but still no bananas.
> > >
> > >
> > > -- Reuti
> > >
> > >
> > > >
> > > > Lars
> > > >
> > > >
> > > >
> > > > On 1 April 2011 12:08, lars van der bijl <l...@realisestudio.com> wrote:
> > > > You're a beacon of knowledge, Reuti! Thank you!
> > > >
> > > > I think I'll have enough to have another stab at my problem.
> > > >
> > > > Lars
> > > >
> > > >
> > > > On 1 April 2011 12:05, Reuti <re...@staff.uni-marburg.de> wrote:
> > > > On 01.04.2011 at 13:01, lars van der bijl wrote:
> > > >
> > > > > Also, is there any way of catching this and raising 100? Once the job 
> > > > > is finished and its dependencies start, it's causing major havoc on 
> > > > > our system, looking for files that aren't there.
> > > > >
> > > > > Are there other things the grid uses SIGKILL for? Not just memory 
> > > > > limits?
> > > >
> > > > h_rt and h_cpu too: man queue_conf
> > > >
> > > > Or any ulimits in the cluster, which you set by other means.
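> > > >
> > > > To see what is actually configured, something like this may help (test.q 
> > > > is the queue from your qacct output):
> > > >
> > > >   $ qconf -sq test.q | grep -E 'h_rt|h_cpu|h_vmem|s_vmem'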
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > Lars
> > > > >
> > > > > On 1 April 2011 11:54, lars van der bijl <l...@realisestudio.com> 
> > > > > wrote:
> > > > > In this case, yes.
> > > > >
> > > > > However, for the jobs running on our farm we put no memory limits as of 
> > > > > yet; we just request the number of procs.
> > > > >
> > > > > Is it usual behaviour that if it fails with this code, the subsequent 
> > > > > dependencies start regardless?
> > > > >
> > > > > Lars
> > > > >
> > > > >
> > > > >
> > > > > On 1 April 2011 11:41, Reuti <re...@staff.uni-marburg.de> wrote:
> > > > > Hi,
> > > > >
> > > > > On 01.04.2011 at 12:33, lars van der bijl wrote:
> > > > >
> > > > > > Hey everyone.
> > > > > >
> > > > > > We're having some issues with jobs being killed with exit status 
> > > > > > 137.
> > > > >
> > > > > 137 = 128 + 9
> > > > >
> > > > > $ kill -l
> > > > >  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
> > > > >  5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
> > > > >  9) SIGKILL     ...
> > > > >
> > > > > So, the job was killed. Did you request too small a value for h_vmem 
> > > > > or h_rt?
> > > > >
> > > > > -- Reuti
> > > > >
> > > > >
> > > > > > This causes the task to finish and start its dependent tasks, which is 
> > > > > > causing all kinds of havoc.
> > > > > >
> > > > > > Submitting a job with a very small max memory limit gives me this 
> > > > > > as an example.
> > > > > >
> > > > > > $ qacct -j 21141
> > > > > > ==============================================================
> > > > > > qname        test.q
> > > > > > hostname     atom12.**
> > > > > > group        **
> > > > > > owner        lars
> > > > > > project      NONE
> > > > > > department   defaultdepartment
> > > > > > jobname      stest__out__geometry2
> > > > > > jobnumber    21141
> > > > > > taskid       101
> > > > > > account      sge
> > > > > > priority     0
> > > > > > qsub_time    Fri Apr  1 11:22:30 2011
> > > > > > start_time   Fri Apr  1 11:22:31 2011
> > > > > > end_time     Fri Apr  1 11:22:39 2011
> > > > > > granted_pe   smp
> > > > > > slots        4
> > > > > > failed       100 : assumedly after job
> > > > > > exit_status  137
> > > > > > ru_wallclock 8
> > > > > > ru_utime     0.281
> > > > > > ru_stime     0.167
> > > > > > ru_maxrss    3744
> > > > > > ru_ixrss     0
> > > > > > ru_ismrss    0
> > > > > > ru_idrss     0
> > > > > > ru_isrss     0
> > > > > > ru_minflt    70739
> > > > > > ru_majflt    0
> > > > > > ru_nswap     0
> > > > > > ru_inblock   8
> > > > > > ru_oublock   224
> > > > > > ru_msgsnd    0
> > > > > > ru_msgrcv    0
> > > > > > ru_nsignals  0
> > > > > > ru_nvcsw     1072
> > > > > > ru_nivcsw    439
> > > > > > cpu          2.240
> > > > > > mem          0.573
> > > > > > io           0.145
> > > > > > iow          0.000
> > > > > > maxvmem      405.820M
> > > > > > arid         undefined
> > > > > >
> > > > > > Anyone know of a reason why the task would be killed with this 
> > > > > > error state? Or how to catch it?
> > > > > >
> > > > > > Lars
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> 
> 


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
