core file size          (blocks, -c)     0
data seg size           (kbytes, -d)     unlimited
scheduling priority             (-e)     0
file size               (blocks, -f)     unlimited
pending signals                 (-i)     193056
max locked memory       (kbytes, -l)     256
max memory size         (kbytes, -m)     unlimited
open files                      (-n)     8192
pipe size            (512 bytes, -p)     8
POSIX message queues     (bytes, -q)     819200
real-time priority              (-r)     0
stack size              (kbytes, -s)     unlimited
cpu time               (seconds, -t)     unlimited
max user processes              (-u)     193056
virtual memory          (kbytes, -v)     unlimited
file locks                      (-x)     unlimited

core file size          (blocks, -c)     0
data seg size           (kbytes, -d)     unlimited
scheduling priority             (-e)     0
file size               (blocks, -f)     unlimited
pending signals                 (-i)     193056
max locked memory       (kbytes, -l)     64
max memory size         (kbytes, -m)     unlimited
open files                      (-n)     1024
pipe size            (512 bytes, -p)     8
POSIX message queues     (bytes, -q)     819200
real-time priority              (-r)     0
stack size              (kbytes, -s)     unlimited
cpu time               (seconds, -t)     unlimited
max user processes              (-u)     193056
virtual memory          (kbytes, -v)     unlimited
file locks                      (-x)     unlimited

I think it might be the machine killing them, because we're not putting any other limits anywhere, unless it's the application we're running. The tasks usually take up a lot of RAM, and if more than one hits a machine it can be swapping like crazy. It would be good to still be able to catch this before it gets the signal.
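For what it's worth, this is roughly what I mean by catching it before the dependencies kick off: an untested epilog sketch that just records whether the kernel OOM killer fired on the node. The dmesg pattern and the execd spool path are assumptions about our setup, not something I've verified.

    #!/bin/sh
    # epilog sketch (untested): record a possible OOM kill so there is a trace
    # beyond the bare exit_status 137 in qacct
    if dmesg | grep -qi 'out of memory\|killed process'; then
        echo "possible OOM kill on $(hostname) for job $JOB_ID task $SGE_TASK_ID" >&2
        dmesg | grep -i 'out of memory\|killed process' | tail -n 5 >&2
    fi
    # the execd messages file Reuti mentions (quoted below) should be somewhere like
    #   $SGE_ROOT/default/spool/$(hostname)/messages
    exit 0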
On 1 April 2011 15:50, Reuti <re...@staff.uni-marburg.de> wrote:
> Add on:
>
> you can check the messages file of the execd on the nodes, whether anything about the reason was recorded there.
>
> -- Reuti
>
> Am 01.04.2011 um 16:39 schrieb lars van der bijl:
>
> > the problem is that I don't have any such limits enforced currently on submission. the submissions to qsub are hidden from the user, so I know they're not adding them. the only thing we have is a load/suspend threshold in the grid itself (np_load_avg, 1.75). if I give it the memory limit I get the same result, but the other jobs have been getting this same signal and were submitted without any limits.
> >
> > atom10b = 215 times
> > huey = 856 times
> > atom24 = 356 times
> > atom23 = 345 times
> > atom05 = 669 times
> > atom15 = 796 times
> > atom12 = 432 times
> > atom22 = 250 times
> > centi = 152 times
> > sage = 186 times
> > atom08 = 588 times
> > fluffy = 101 times
> > atom20 = 561 times
> > atom10 = 570 times
> > neon = 129 times
> > atom17 = 358 times
> > atom14 = 188 times
> > atom13 = 414 times
> > atom21 = 406 times
> > atom11 = 182 times
> > dewey = 658 times
> > atom16 = 423 times
> > atom06 = 500 times
> > atom01 = 802 times
> > atom18 = 567 times
> > atom09 = 539 times
> > milly = 113 times
> > louie = 249 times
> > atom03 = 793 times
> > topsy = 69 times
> > atom02 = 834 times
> > atom04 = 359 times
> > atom07 = 791 times
> > atom19 = 488 times
> >
> > seems a little more than just users killing it on their local machines. could that load average be doing this? or any other setting in qmon I might be overlooking?
> >
> > Lars
> >
> > On 1 April 2011 15:14, Reuti <re...@staff.uni-marburg.de> wrote:
> > Am 01.04.2011 um 15:55 schrieb lars van der bijl:
> >
> > > <snip>
> > >
> > > trying to raise a 100 error in the epilog didn't work. if the task failed with 137 it will not accept anything else, it seems. It works fine for other errors but not for a kill command.
> >
> > Indeed. Nevertheless it's noted in `qacct` as "failed 100 : assumedly after job". Looks like it's too late to prevent kicking it out of the system.
> >
> > What about using s_vmem then? The job will get a signal SIGXCPU and can act upon it, e.g. using a trap in the jobscript. The binary on its own will be terminated unless you change the default behavior there too.
> >
> > (soft limits in SGE are not like soft ulimits [which introduce only a second, lower limit on user request])
> >
> > -- Reuti
> >
> > > either inside the grid code itself, or at another location like the epilog that gets run regardless of the task being killed?
> > >
> > > also, is it normal that the task being killed will not run its epilog?
> > >
> > > No. The epilog will be executed, but not the remainder of the job script. Otherwise necessary cleanups couldn't be executed.
> > >
> > > I see this. I'm printing the exit status of 100 but still no bananas.
> > >
> > > -- Reuti
> > >
> > > Lars
> > >
> > > > On 1 April 2011 12:08, lars van der bijl <l...@realisestudio.com> wrote:
> > > > you're a beacon of knowledge Reuti! Thank you!
> > > >
> > > > I think I'll have enough to have another stab at my problem.
> > > >
> > > > Lars
> > > >
> > > > On 1 April 2011 12:05, Reuti <re...@staff.uni-marburg.de> wrote:
> > > > Am 01.04.2011 um 13:01 schrieb lars van der bijl:
> > > >
> > > > > also, is there any way of catching this and raising 100? once the job is finished and its dependencies start, it's causing major havoc on our system, looking for files that aren't there.
> > > > >
> > > > > are there other things the grid uses SIGKILL for? not just memory limits?
> > > >
> > > > h_rt and h_cpu too: man queue_conf
> > > >
> > > > Or any ulimits in the cluster, which you set by other means.
> > > >
> > > > -- Reuti
> > > >
> > > > > Lars
> > > > >
> > > > > On 1 April 2011 11:54, lars van der bijl <l...@realisestudio.com> wrote:
> > > > > in this case yes.
> > > > >
> > > > > however, on the jobs running on our farm we put no memory limits as of yet, just request the number of procs.
> > > > >
> > > > > is it the usual behaviour that if it fails with this code the subsequent dependencies start regardless?
> > > > >
> > > > > Lars
> > > > >
> > > > > On 1 April 2011 11:41, Reuti <re...@staff.uni-marburg.de> wrote:
> > > > > Hi,
> > > > >
> > > > > Am 01.04.2011 um 12:33 schrieb lars van der bijl:
> > > > >
> > > > > > Hey everyone.
> > > > > >
> > > > > > We're having some issues with jobs being killed with exit status 137.
> > > > >
> > > > > 137 = 128 + 9
> > > > >
> > > > > $ kill -l
> > > > >  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
> > > > >  5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
> > > > >  9) SIGKILL     ...
> > > > >
> > > > > So, the job was killed. Did you request a too small value for h_vmem or h_rt?
> > > > >
> > > > > -- Reuti
> > > > >
> > > > > > This causes the task to finish and start its dependent task, which is causing all kinds of havoc.
> > > > > >
> > > > > > submitting a job with a very small max memory limit gives me this as an example.
> > > > > >
> > > > > > $ qacct -j 21141
> > > > > > ==============================================================
> > > > > > qname        test.q
> > > > > > hostname     atom12.**
> > > > > > group        **
> > > > > > owner        lars
> > > > > > project      NONE
> > > > > > department   defaultdepartment
> > > > > > jobname      stest__out__geometry2
> > > > > > jobnumber    21141
> > > > > > taskid       101
> > > > > > account      sge
> > > > > > priority     0
> > > > > > qsub_time    Fri Apr 1 11:22:30 2011
> > > > > > start_time   Fri Apr 1 11:22:31 2011
> > > > > > end_time     Fri Apr 1 11:22:39 2011
> > > > > > granted_pe   smp
> > > > > > slots        4
> > > > > > failed       100 : assumedly after job
> > > > > > exit_status  137
> > > > > > ru_wallclock 8
> > > > > > ru_utime     0.281
> > > > > > ru_stime     0.167
> > > > > > ru_maxrss    3744
> > > > > > ru_ixrss     0
> > > > > > ru_ismrss    0
> > > > > > ru_idrss     0
> > > > > > ru_isrss     0
> > > > > > ru_minflt    70739
> > > > > > ru_majflt    0
> > > > > > ru_nswap     0
> > > > > > ru_inblock   8
> > > > > > ru_oublock   224
> > > > > > ru_msgsnd    0
> > > > > > ru_msgrcv    0
> > > > > > ru_nsignals  0
> > > > > > ru_nvcsw     1072
> > > > > > ru_nivcsw    439
> > > > > > cpu          2.240
> > > > > > mem          0.573
> > > > > > io           0.145
> > > > > > iow          0.000
> > > > > > maxvmem      405.820M
> > > > > > arid         undefined
> > > > > >
> > > > > > anyone know of a reason why the task would be killed with this error state? or how to catch it?
> > > > > >
> > > > > > Lars
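Regarding the s_vmem suggestion above, this is the sort of jobscript change I'm going to try: a soft limit under the hard one so the script gets SIGXCPU first and can flag the failure itself before the SIGKILL arrives. The limit values, the exit code and render_task are placeholders, and I haven't tested whether exiting 100 here actually stops the dependent jobs from starting.

    #!/bin/sh
    #$ -S /bin/sh
    # placeholder limits: soft below hard so SIGXCPU comes before SIGKILL
    #$ -l s_vmem=900M
    #$ -l h_vmem=1G

    # if the soft limit is hit the script is signalled; the binary itself
    # will still be terminated by SIGXCPU unless its handler is changed
    trap 'echo "s_vmem soft limit reached on $(hostname)" >&2; exit 100' XCPU

    ./render_task "$@"    # placeholder for the real command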
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users