Add on:

you can check the messages file of the execd on the nodes to see whether
anything about the reason was recorded there.
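
For example (just a sketch; the path assumes a default cell and the usual
spool layout, so adjust to your installation):

$ grep 21141 $SGE_ROOT/default/spool/atom12/messages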

-- Reuti


On 01.04.2011 at 16:39, lars van der bijl wrote:

> The problem is that I don't have any such limits enforced currently on
> submission. The submissions to qsub are hidden from the user, so I know
> they're not adding them. The only thing we have is a load/suspend threshold
> in the grid itself (np_load_avg, 1.75), and if I give it the memory limit I
> get the same result, but the other jobs have been getting this same signal
> and were submitted without any limits.
> 
> 
> atom10b = 215 times
> huey = 856 times
> atom24 = 356 times
> atom23 = 345 times
> atom05 = 669 times
> atom15 = 796 times
> atom12 = 432 times
> atom22 = 250 times
> centi = 152 times
> sage = 186 times
> atom08 = 588 times
> fluffy = 101 times
> atom20 = 561 times
> atom10 = 570 times
> neon = 129 times
> atom17 = 358 times
> atom14 = 188 times
> atom13 = 414 times
> atom21 = 406 times
> atom11 = 182 times
> dewey = 658 times
> atom16 = 423 times
> atom06 = 500 times
> atom01 = 802 times
> atom18 = 567 times
> atom09 = 539 times
> milly = 113 times
> louie = 249 times
> atom03 = 793 times
> topsy = 69 times
> atom02 = 834 times
> atom04 = 359 times
> atom07 = 791 times
> atom19 = 488 times
> 
> 
> Seems like a little more than just users killing it on their local machines.
> Could that load average be doing this? Or are there any other settings in
> qmon I might be overlooking?
> 
> Lars
> 
> 
> On 1 April 2011 15:14, Reuti <re...@staff.uni-marburg.de> wrote:
> On 01.04.2011 at 15:55, lars van der bijl wrote:
> 
> > <snip>
> >
> > Trying to raise a 100 error in the epilog didn't work. If the task failed
> > with 137, it will not accept anything else, it seems. It works fine for
> > other errors but not for a kill command.
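> >
> > (Roughly like this, as a sketch; how the epilog decides that the task
> > failed is site-specific and only a placeholder here:)
> >
> > #!/bin/sh
> > # epilog: exit code 100 should put the job into error state so that
> > # dependent jobs are not released
> > if our_site_specific_failure_check; then   # placeholder
> >     exit 100
> > fi
> > exit 0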
> 
> Indeed. Nevertheless it's noted in `qacct` as "failed       100 : assumedly
> after job". Looks like it's too late to prevent kicking it out of the system.
> 
> What about using s_vmem then? The job will get a SIGXCPU signal and can act
> upon it, e.g. by using a trap in the job script. The binary on its own will
> be terminated unless you change the default behavior there too.
> 
> (soft limits in SGE are not like soft ulimits [which introduce only a second 
> lower limit on user request])
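>
> Something along these lines in the job script (a sketch only; the limit
> value and the cleanup are placeholders):
>
> #!/bin/bash
> #$ -l s_vmem=2G
>
> # SGE delivers SIGXCPU when the soft limit is exceeded; trap it so the
> # script can clean up (and e.g. exit 100) before anything harder hits
> trap 'echo "s_vmem exceeded, cleaning up" >&2; exit 100' XCPU
>
> # the binary itself is still terminated by SIGXCPU unless it handles
> # the signal on its own
> ./my_binary   # placeholder for the real work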
> 
> -- Reuti
> 
> 
> > > either inside of the grid code itself, or at another location like the
> > > epilog that gets run regardless of the task being killed?
> > >
> > > Also, is it normal that the task being killed will not run its epilog?
> >
> > No. The epilog will be executed, but not the remainder of the job script. 
> > Otherwise necessary cleanups couldn't be executed.
> >
> > I see this. I'm printing the exit status of 100 but still no bananas.
> >
> >
> > -- Reuti
> >
> >
> > >
> > > Lars
> > >
> > >
> > >
> > > On 1 April 2011 12:08, lars van der bijl <l...@realisestudio.com> wrote:
> > > You're a beacon of knowledge, Reuti! Thank you!
> > >
> > > I think I'll have enough to have another stab at my problem.
> > >
> > > Lars
> > >
> > >
> > > On 1 April 2011 12:05, Reuti <re...@staff.uni-marburg.de> wrote:
> > > On 01.04.2011 at 13:01, lars van der bijl wrote:
> > >
> > > > Also, is there any way of catching this and raising 100? Once the job is
> > > > finished and its dependencies start, it's causing major havoc on our
> > > > system looking for files that aren't there.
> > > >
> > > > Are there other things the grid uses SIGKILL for, not just memory
> > > > limits?
> > >
> > > h_rt and h_cpu too: man queue_conf
> > >
> > > Or any ulimits in the cluster, which you set by other means.
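> > >
> > > E.g., to see what limits are configured on the queue (a sketch; test.q
> > > is taken from the qacct output below):
> > >
> > > qconf -sq test.q | grep -E '^(s|h)_'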
> > >
> > > -- Reuti
> > >
> > >
> > > > Lars
> > > >
> > > > On 1 April 2011 11:54, lars van der bijl <l...@realisestudio.com> wrote:
> > > > in this case yes.
> > > >
> > > > However, on the jobs running on our farm we put no memory limits as of
> > > > yet, just a requested number of procs.
> > > >
> > > > Is it usual behaviour that, if it fails with this code, the subsequent
> > > > dependencies start regardless?
> > > >
> > > > Lars
> > > >
> > > >
> > > >
> > > > On 1 April 2011 11:41, Reuti <re...@staff.uni-marburg.de> wrote:
> > > > Hi,
> > > >
> > > > On 01.04.2011 at 12:33, lars van der bijl wrote:
> > > >
> > > > > Hey everyone.
> > > > >
> > > > > We're having some issues with jobs being killed with exit status 137.
> > > >
> > > > 137 = 128 + 9
> > > >
> > > > $ kill -l
> > > >  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
> > > >  5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
> > > >  9) SIGKILL     ...
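> > > >
> > > > (or decode it directly in the shell:)
> > > >
> > > > $ kill -l $((137 - 128))
> > > > KILL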
> > > >
> > > > So, the job was killed. Did you request too small a value for h_vmem or
> > > > h_rt?
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > This causes the task to finish and start its dependent tasks, which is
> > > > > causing all kinds of havoc.
> > > > >
> > > > > Submitting a job with a very small max memory limit gives me this as
> > > > > an example.
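> > > > >
> > > > > (submitted roughly like this; the script name and the limit value here
> > > > > are placeholders:)
> > > > >
> > > > > $ qsub -q test.q -pe smp 4 -l h_vmem=100M stest.sh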
> > > > >
> > > > > $ qacct -j 21141
> > > > > ==============================================================
> > > > > qname        test.q
> > > > > hostname     atom12.**
> > > > > group        **
> > > > > owner        lars
> > > > > project      NONE
> > > > > department   defaultdepartment
> > > > > jobname      stest__out__geometry2
> > > > > jobnumber    21141
> > > > > taskid       101
> > > > > account      sge
> > > > > priority     0
> > > > > qsub_time    Fri Apr  1 11:22:30 2011
> > > > > start_time   Fri Apr  1 11:22:31 2011
> > > > > end_time     Fri Apr  1 11:22:39 2011
> > > > > granted_pe   smp
> > > > > slots        4
> > > > > failed       100 : assumedly after job
> > > > > exit_status  137
> > > > > ru_wallclock 8
> > > > > ru_utime     0.281
> > > > > ru_stime     0.167
> > > > > ru_maxrss    3744
> > > > > ru_ixrss     0
> > > > > ru_ismrss    0
> > > > > ru_idrss     0
> > > > > ru_isrss     0
> > > > > ru_minflt    70739
> > > > > ru_majflt    0
> > > > > ru_nswap     0
> > > > > ru_inblock   8
> > > > > ru_oublock   224
> > > > > ru_msgsnd    0
> > > > > ru_msgrcv    0
> > > > > ru_nsignals  0
> > > > > ru_nvcsw     1072
> > > > > ru_nivcsw    439
> > > > > cpu          2.240
> > > > > mem          0.573
> > > > > io           0.145
> > > > > iow          0.000
> > > > > maxvmem      405.820M
> > > > > arid         undefined
> > > > >
> > > > > Anyone know of a reason why the task would be killed with this error
> > > > > state? Or how to catch it?
> > > > >
> > > > > Lars
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> 
> 


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
