core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 193056
max locked memory       (kbytes, -l) 256
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 193056
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 193056
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 193056
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I think it might be the machine killing them, because we're not putting any
other limits anywhere - unless it's the application we're running. The tasks
usually take up a lot of RAM, and if more than one hits a machine it can be
swapping like crazy.
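
If it is the kernel's OOM killer, it should have left a trace on the node;
something like this (just a sketch, the log file path differs per distro)
should show it:

$ dmesg | grep -i "out of memory"
$ grep -iE "oom-killer|killed process" /var/log/messages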

It would be good to still be able to catch this before the job gets the signal.


On 1 April 2011 15:50, Reuti <re...@staff.uni-marburg.de> wrote:

> Add on:
>
> you can check the messages file of the execd on the nodes to see whether
> anything about the reason was recorded there.
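> (With a default installation and no local spooling that would be something
> like $SGE_ROOT/default/spool/<nodename>/messages; the exact path depends on
> your spool configuration.)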
>
> -- Reuti
>
>
> Am 01.04.2011 um 16:39 schrieb lars van der bijl:
>
> > The problem is that I don't have any such limits enforced on submission at
> the moment. The submissions to qsub are hidden from the users, so I know
> they're not adding them. The only thing we have is a load/suspend threshold
> in the grid itself (np_load_avg, 1.75). If I give a job the memory limit I
> get the same result, but the other jobs have been getting this same signal
> and were submitted without any limits.
> >
> >
> > atom10b = 215 times
> > huey = 856 times
> > atom24 = 356 times
> > atom23 = 345 times
> > atom05 = 669 times
> > atom15 = 796 times
> > atom12 = 432 times
> > atom22 = 250 times
> > centi = 152 times
> > sage = 186 times
> > atom08 = 588 times
> > fluffy = 101 times
> > atom20 = 561 times
> > atom10 = 570 times
> > neon = 129 times
> > atom17 = 358 times
> > atom14 = 188 times
> > atom13 = 414 times
> > atom21 = 406 times
> > atom11 = 182 times
> > dewey = 658 times
> > atom16 = 423 times
> > atom06 = 500 times
> > atom01 = 802 times
> > atom18 = 567 times
> > atom09 = 539 times
> > milly = 113 times
> > louie = 249 times
> > atom03 = 793 times
> > topsy = 69 times
> > atom02 = 834 times
> > atom04 = 359 times
> > atom07 = 791 times
> > atom19 = 488 times
> >
> >
> > Seems like a little more than users killing it on their local machines.
> Could that load average be doing this? Or any other settings in qmon I might
> be overlooking?
> >
> > Lars
> >
> >
> > On 1 April 2011 15:14, Reuti <re...@staff.uni-marburg.de> wrote:
> > Am 01.04.2011 um 15:55 schrieb lars van der bijl:
> >
> > > <snip>
> > >
> > > Trying to raise a 100 error in the epilog didn't work. If the task
> failed with 137 it will not accept anything else, it seems. It works fine
> for other errors, but not for a kill command.
> >
> > Indeed. Nevertheless it's noted in `qacct` as "failed       100 :
> assumedly after job". Looks like it's too late to prevent kicking it out of
> the system.
> >
> > What about using s_vmem then? The job will get a SIGXCPU signal and can
> act upon it, e.g. by using a trap in the jobscript. The binary on its own
> will be terminated unless you change the default behavior there too.
> >
> > (soft limits in SGE are not like soft ulimits [which introduce only a
> second lower limit on user request])
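> >
> > A minimal sketch of such a trap in a bash jobscript (the limit values,
> > exit code and command are only placeholders, and it assumes s_vmem/h_vmem
> > are requestable in your setup):
> >
> > #!/bin/bash
> > # placeholder limits: soft limit a bit below the hard one
> > #$ -l s_vmem=1800M,h_vmem=2G
> >
> > # the shell catches SIGXCPU and can clean up / flag the failure; the
> > # child below still receives the signal and dies with the default action
> > trap 'echo "s_vmem limit reached, flagging failure" >&2; exit 100' XCPU
> >
> > ./your_binary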
> >
> > -- Reuti
> >
> >
> > >  > either inside of the grid code itself, or at another location like
> the epilog that gets run regardless of the task being killed?
> > > >
> > > > Also, is it normal that the task being killed will not run its
> epilog?
> > >
> > > No. The epilog will be executed, but not the remainder of the job
> script. Otherwise necessary cleanups couldn't be executed.
> > >
> > > I see this. I'm returning an exit status of 100, but still no bananas.
> > >
> > >
> > > -- Reuti
> > >
> > >
> > > >
> > > > Lars
> > > >
> > > >
> > > >
> > > > On 1 April 2011 12:08, lars van der bijl <l...@realisestudio.com>
> wrote:
> > > > You're a beacon of knowledge, Reuti! Thank you!
> > > >
> > > > I think I'll have enough to have another stab at my problem.
> > > >
> > > > Lars
> > > >
> > > >
> > > > On 1 April 2011 12:05, Reuti <re...@staff.uni-marburg.de> wrote:
> > > > Am 01.04.2011 um 13:01 schrieb lars van der bijl:
> > > >
> > > > > Also, is there any way of catching this and raising 100? Once the
> job is finished and its dependencies start, it causes major havoc on our
> system, looking for files that aren't there.
> > > > >
> > > > > Are there other things the grid uses SIGKILL for? Not just
> memory limits?
> > > >
> > > > h_rt and h_cpu too: man queue_conf
> > > >
> > > > Or any ulimits in the cluster, which you set by other means.
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > Lars
> > > > >
> > > > > On 1 April 2011 11:54, lars van der bijl <l...@realisestudio.com>
> wrote:
> > > > > In this case, yes.
> > > > >
> > > > > However, on the jobs running on our farm we put no memory limits as
> of yet; we just request the number of procs.
> > > > >
> > > > > Is it usual behaviour that if a job fails with this code the
> subsequent dependencies start regardless?
> > > > >
> > > > > Lars
> > > > >
> > > > >
> > > > >
> > > > > On 1 April 2011 11:41, Reuti <re...@staff.uni-marburg.de> wrote:
> > > > > Hi,
> > > > >
> > > > > Am 01.04.2011 um 12:33 schrieb lars van der bijl:
> > > > >
> > > > > > Hey everyone.
> > > > > >
> > > > > > We're having some issues with jobs being killed with exit status
> 137.
> > > > >
> > > > > 137 = 128 + 9
> > > > >
> > > > > $ kill -l
> > > > >  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
> > > > >  5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
> > > > >  9) SIGKILL     ...
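> > > > >
> > > > > (bash can do the decoding directly: `kill -l 137` prints KILL)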
> > > > >
> > > > > So, the job was killed. Did you request too small a value for
> h_vmem or h_rt?
> > > > >
> > > > > -- Reuti
> > > > >
> > > > >
> > > > > > This causes the task to finish and start its dependent task, which
> is causing all kinds of havoc.
> > > > > >
> > > > > > Submitting a job with a very small max memory limit gives me this
> as an example.
> > > > > >
> > > > > > $ qacct -j 21141
> > > > > > ==============================================================
> > > > > > qname        test.q
> > > > > > hostname     atom12.**
> > > > > > group        **
> > > > > > owner        lars
> > > > > > project      NONE
> > > > > > department   defaultdepartment
> > > > > > jobname      stest__out__geometry2
> > > > > > jobnumber    21141
> > > > > > taskid       101
> > > > > > account      sge
> > > > > > priority     0
> > > > > > qsub_time    Fri Apr  1 11:22:30 2011
> > > > > > start_time   Fri Apr  1 11:22:31 2011
> > > > > > end_time     Fri Apr  1 11:22:39 2011
> > > > > > granted_pe   smp
> > > > > > slots        4
> > > > > > failed       100 : assumedly after job
> > > > > > exit_status  137
> > > > > > ru_wallclock 8
> > > > > > ru_utime     0.281
> > > > > > ru_stime     0.167
> > > > > > ru_maxrss    3744
> > > > > > ru_ixrss     0
> > > > > > ru_ismrss    0
> > > > > > ru_idrss     0
> > > > > > ru_isrss     0
> > > > > > ru_minflt    70739
> > > > > > ru_majflt    0
> > > > > > ru_nswap     0
> > > > > > ru_inblock   8
> > > > > > ru_oublock   224
> > > > > > ru_msgsnd    0
> > > > > > ru_msgrcv    0
> > > > > > ru_nsignals  0
> > > > > > ru_nvcsw     1072
> > > > > > ru_nivcsw    439
> > > > > > cpu          2.240
> > > > > > mem          0.573
> > > > > > io           0.145
> > > > > > iow          0.000
> > > > > > maxvmem      405.820M
> > > > > > arid         undefined
> > > > > >
> > > > > > Anyone know of a reason why the task would be killed with this
> error state? Or how to catch it?
> > > > > >
> > > > > > Lars
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
