On 01.04.2011 at 12:54, lars van der bijl wrote:

> In this case, yes.
>
> However, on the jobs running on our farm we put no memory limits as of yet,
> we just request an amount of procs.
>
> Is it the usual behaviour that if it fails with this code, the subsequent
> dependencies start regardless?

Yes, for SGE the -hold_jid option only checks whether the predecessor has left the system. Its exit state isn't checked or honored. This needs to be done in the follow-up job script, which could then put itself into error state so that the failure isn't lost.

If you have a list of the job names you want to handle in advance, you could submit all but the first job with -h, and each finished job then has to release its follow-up job, either in the job script or in a queue epilog. A place to specify the names of the follow-up jobs could be the job context: its meta information is just a comment for SGE, but you can access the information and act upon it.

-- Reuti

> Lars
>
>
> On 1 April 2011 11:41, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> On 01.04.2011 at 12:33, lars van der bijl wrote:
>
> > Hey everyone.
> >
> > We're having some issues with jobs being killed with exit status 137.
>
> 137 = 128 + 9
>
> $ kill -l
>  1) SIGHUP   2) SIGINT   3) SIGQUIT  4) SIGILL
>  5) SIGTRAP  6) SIGABRT  7) SIGBUS   8) SIGFPE
>  9) SIGKILL  ...
>
> So, the job was killed. Did you request too small a value for h_vmem or h_rt?
>
> -- Reuti
>
>
> > This causes the task to finish and start its dependent task, which is causing
> > all kinds of havoc.
> >
> > Submitting a job with a very small max memory limit gives me this as an
> > example.
> >
> > $ qacct -j 21141
> > ==============================================================
> > qname        test.q
> > hostname     atom12.**
> > group        **
> > owner        lars
> > project      NONE
> > department   defaultdepartment
> > jobname      stest__out__geometry2
> > jobnumber    21141
> > taskid       101
> > account      sge
> > priority     0
> > qsub_time    Fri Apr 1 11:22:30 2011
> > start_time   Fri Apr 1 11:22:31 2011
> > end_time     Fri Apr 1 11:22:39 2011
> > granted_pe   smp
> > slots        4
> > failed       100 : assumedly after job
> > exit_status  137
> > ru_wallclock 8
> > ru_utime     0.281
> > ru_stime     0.167
> > ru_maxrss    3744
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    70739
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   8
> > ru_oublock   224
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     1072
> > ru_nivcsw    439
> > cpu          2.240
> > mem          0.573
> > io           0.145
> > iow          0.000
> > maxvmem      405.820M
> > arid         undefined
> >
> > Does anyone know of a reason why the task would be killed with this error state?
> > Or how to catch it?
> >
> > Lars
> >
> > _______________________________________________
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
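
As a rough illustration of the approach Reuti describes (submit the follow-up job with -h and let the finished job release it), the tail of a job script might look something like the sketch below. It is only a sketch under assumptions: the names signal_of, finish_job, NEXT_JOB and RELEASE_CMD are made up for this example and are not part of SGE; the only SGE pieces relied on are qrls for releasing a user hold and exit status 100 as the way a job script asks to be put into error state.

```sh
#!/bin/sh
# Sketch of a job-script tail: release the held follow-up job only on success.
# signal_of, finish_job, NEXT_JOB and RELEASE_CMD are hypothetical names
# chosen for this example; they do not come from SGE itself.

# Print the signal number encoded in an exit status above 128
# (e.g. 137 = 128 + 9), or print nothing for an ordinary exit status.
signal_of() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    fi
}

# Decide what to do after the payload finished with exit status $1.
# On success, release the follow-up job $2 and return 0.
# On failure, report the reason and return 100, so the caller can exit
# with 100 and leave this job in error state rather than losing the failure.
finish_job() {
    status=$1
    next_job=$2
    if [ "$status" -eq 0 ]; then
        # RELEASE_CMD defaults to qrls; it is overridable for dry runs.
        ${RELEASE_CMD:-qrls} "$next_job"
        return 0
    fi
    sig=$(signal_of "$status")
    if [ -n "$sig" ]; then
        echo "payload killed by signal $sig (exit status $status)" >&2
    else
        echo "payload failed with exit status $status" >&2
    fi
    return 100
}

# Intended use at the end of a job script (left commented out here):
#   ./payload
#   finish_job $? "$NEXT_JOB"
#   exit $?
```

On the submit side this assumes something like `qsub -h step2.sh` for the held job and `qsub -v NEXT_JOB=$STEP2_ID step1.sh` for the first one, where STEP2_ID is a placeholder for the job ID of the held job. One consequence of this design: if the whole job script is killed outright (as with a SIGKILL after exceeding h_vmem), the tail never runs, nothing is released, and the follow-up job simply stays on hold instead of starting regardless.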