Add-on: you can check the messages file of the execd on the nodes to see whether anything about the reason was recorded there.
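For example, for job 21141 on atom12 from the `qacct` output further down, something like this (a sketch only; the exact spool path is an assumption and depends on how the execd spool directory is configured in your installation):

$ grep "21141" $SGE_ROOT/default/spool/atom12/messages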
-- Reuti

On 01.04.2011 at 16:39, lars van der bijl wrote:

> the problem is that I don't have any such limits enforced currently on
> submission. the submissions to qsub are hidden from the users, so I know
> they're not adding them. the only thing we have is a load/suspend threshold
> in the grid itself (np_load_avg, 1.75), and if I give it the memory limit I
> get the same result, but the other jobs have been getting this same signal
> and were submitted without any limits.
>
> atom10b = 215 times
> huey = 856 times
> atom24 = 356 times
> atom23 = 345 times
> atom05 = 669 times
> atom15 = 796 times
> atom12 = 432 times
> atom22 = 250 times
> centi = 152 times
> sage = 186 times
> atom08 = 588 times
> fluffy = 101 times
> atom20 = 561 times
> atom10 = 570 times
> neon = 129 times
> atom17 = 358 times
> atom14 = 188 times
> atom13 = 414 times
> atom21 = 406 times
> atom11 = 182 times
> dewey = 658 times
> atom16 = 423 times
> atom06 = 500 times
> atom01 = 802 times
> atom18 = 567 times
> atom09 = 539 times
> milly = 113 times
> louie = 249 times
> atom03 = 793 times
> topsy = 69 times
> atom02 = 834 times
> atom04 = 359 times
> atom07 = 791 times
> atom19 = 488 times
>
> that seems a little more than users killing it on their local machines.
> could that load average be doing this? or is there any other setting in
> qmon I might be overlooking?
>
> Lars
>
>
> On 1 April 2011 15:14, Reuti <re...@staff.uni-marburg.de> wrote:
> On 01.04.2011 at 15:55, lars van der bijl wrote:
>
> > <snip>
> >
> > trying to raise a 100 error in the epilog didn't work. if the task failed
> > with 137 it will not accept anything else, it seems. it works fine for
> > other errors but not for a kill command.
>
> Indeed. Nevertheless it's noted in `qacct` as "failed 100 : assumedly after
> job". Looks like it's too late to prevent kicking it out of the system.
>
> What about using s_vmem then? The job will get a SIGXCPU signal and can act
> upon it, e.g. by using a trap in the jobscript. The binary on its own will
> be terminated unless you change the default behavior there too.
>
> (soft limits in SGE are not like soft ulimits [which introduce only a
> second, lower limit on user request])
>
> -- Reuti
>
> > either inside the grid code itself, or at another location like the
> > epilog that gets run regardless of the task being killed?
> >
> > also: is it normal that a task being killed will not run its epilog?
> >
> > > No. The epilog will be executed, but not the remainder of the job
> > > script. Otherwise necessary cleanups couldn't be executed.
> >
> > I see this. I'm printing the exit status of 100 but still no bananas.
> >
> > > -- Reuti
> >
> > Lars
> >
> >
> > On 1 April 2011 12:08, lars van der bijl <l...@realisestudio.com> wrote:
> > > you're a beacon of knowledge, Reuti! Thank you!
> > >
> > > I think I'll have enough to have another stab at my problem.
> > >
> > > Lars
> > >
> > >
> > > On 1 April 2011 12:05, Reuti <re...@staff.uni-marburg.de> wrote:
> > > On 01.04.2011 at 13:01, lars van der bijl wrote:
> > >
> > > > also, is there any way of catching this and raising 100? once the
> > > > job is finished and its dependencies start, it's causing major havoc
> > > > on our system, looking for files that aren't there.
> > > >
> > > > are there other things the grid uses SIGKILL for? not just memory
> > > > limits?
> > >
> > > h_rt and h_cpu too: man queue_conf
> > >
> > > Or any ulimits in the cluster, which you set by other means.
> > >
> > > -- Reuti
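As a sketch of the s_vmem/trap approach suggested above: a minimal bash jobscript that catches SIGXCPU (sent when a soft limit such as s_vmem is exceeded) and exits with 100, which should leave the job in the error state so that jobs depending on it via -hold_jid are not released. The payload name and limit value are hypothetical; verify the exit-100 behaviour on your installation:

#!/bin/bash
#$ -S /bin/bash
# submitted with a soft limit, e.g.: qsub -l s_vmem=512M jobscript.sh

on_limit() {
    echo "SIGXCPU caught: soft limit exceeded, aborting" >&2
    # exit 100 should put the job into the error state, so dependent
    # jobs are held instead of starting against missing files
    exit 100
}
trap on_limit SIGXCPU

./render_frame    # hypothetical payload; it receives SIGXCPU as well and
                  # is terminated unless its default signal handling is changed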
> > > > Lars
> > > >
> > > >
> > > > On 1 April 2011 11:54, lars van der bijl <l...@realisestudio.com> wrote:
> > > > in this case, yes.
> > > >
> > > > however, on the jobs running on our farm we put no memory limits as
> > > > of yet, just the requested number of procs.
> > > >
> > > > is it usual behaviour that if a job fails with this code, the
> > > > subsequent dependencies start regardless?
> > > >
> > > > Lars
> > > >
> > > >
> > > > On 1 April 2011 11:41, Reuti <re...@staff.uni-marburg.de> wrote:
> > > > Hi,
> > > >
> > > > On 01.04.2011 at 12:33, lars van der bijl wrote:
> > > >
> > > > > Hey everyone.
> > > > >
> > > > > We're having some issues with jobs being killed with exit status 137.
> > > >
> > > > 137 = 128 + 9
> > > >
> > > > $ kill -l
> > > >  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
> > > >  5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
> > > >  9) SIGKILL     ...
> > > >
> > > > So, the job was killed. Did you request too small a value for h_vmem
> > > > or h_rt?
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > This causes the task to finish and start its dependent task, which
> > > > > is causing all kinds of havoc.
> > > > >
> > > > > submitting a job with a very small max memory limit gives me this
> > > > > as an example:
> > > > >
> > > > > $ qacct -j 21141
> > > > > ==============================================================
> > > > > qname        test.q
> > > > > hostname     atom12.**
> > > > > group        **
> > > > > owner        lars
> > > > > project      NONE
> > > > > department   defaultdepartment
> > > > > jobname      stest__out__geometry2
> > > > > jobnumber    21141
> > > > > taskid       101
> > > > > account      sge
> > > > > priority     0
> > > > > qsub_time    Fri Apr 1 11:22:30 2011
> > > > > start_time   Fri Apr 1 11:22:31 2011
> > > > > end_time     Fri Apr 1 11:22:39 2011
> > > > > granted_pe   smp
> > > > > slots        4
> > > > > failed       100 : assumedly after job
> > > > > exit_status  137
> > > > > ru_wallclock 8
> > > > > ru_utime     0.281
> > > > > ru_stime     0.167
> > > > > ru_maxrss    3744
> > > > > ru_ixrss     0
> > > > > ru_ismrss    0
> > > > > ru_idrss     0
> > > > > ru_isrss     0
> > > > > ru_minflt    70739
> > > > > ru_majflt    0
> > > > > ru_nswap     0
> > > > > ru_inblock   8
> > > > > ru_oublock   224
> > > > > ru_msgsnd    0
> > > > > ru_msgrcv    0
> > > > > ru_nsignals  0
> > > > > ru_nvcsw     1072
> > > > > ru_nivcsw    439
> > > > > cpu          2.240
> > > > > mem          0.573
> > > > > io           0.145
> > > > > iow          0.000
> > > > > maxvmem      405.820M
> > > > > arid         undefined
> > > > >
> > > > > anyone know of a reason why the task would be killed with this
> > > > > error state? or how to catch it?
> > > > >
> > > > > Lars

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users