[gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
Hey everyone. We're having some issues with jobs being killed with exit status 137. This causes the task to finish and start its dependent task, which is causing all kinds of havoc. Submitting a job with a very small max memory limit gives me this as an example. $ qacct -j 21141
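For readers unfamiliar with the number: by the usual shell convention, an exit status above 128 means the process died from a signal, and the signal number is the status minus 128. A minimal sketch of that decoding follows; the function name `describe_exit_status` is made up for illustration and is not part of Grid Engine.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode an exit status using the shell convention 128 + signal number.

    Hypothetical helper for illustration only, not a Grid Engine API.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return f"killed by {name} (signal {signum})"
    return f"exited normally with code {status}"


print(describe_exit_status(137))
```

On Linux, 137 decodes to signal 9, which is SIGKILL. The status alone does not say who sent the signal.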

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
Also, is there any way of catching this and raising 100? Once the job is finished and its dependencies start, it's causing major havoc on our system, looking for files that aren't there. Are there other things the grid uses SIGKILL for, not just memory limits? Lars On 1 April 2011 11:54, lars
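One possible way to "catch this and raise 100" is a small wrapper that runs the real command as a child process and translates a signal death into exit status 100, which Grid Engine treats as a request to put the job into the error state. The sketch below is an assumption about how such a wrapper could look, not something proposed in the thread; `translate_returncode`, `run_wrapped` and `main` are hypothetical names. It only helps when the child alone is killed: if the wrapper itself receives SIGKILL, it cannot react.

```python
import subprocess
import sys

# Grid Engine interprets exit status 100 as "set the job to error state".
SGE_ERROR_STATE = 100


def translate_returncode(returncode: int) -> int:
    """Map a death by signal to 100 and pass every other status through."""
    if returncode < 0:
        # subprocess reports a child killed by signal N as -N.
        return SGE_ERROR_STATE
    if returncode > 128:
        # An intermediate shell reports the same event as 128 + N.
        return SGE_ERROR_STATE
    return returncode


def run_wrapped(argv):
    """Run the real job command and return the status the wrapper should exit with."""
    completed = subprocess.run(argv)
    return translate_returncode(completed.returncode)


def main() -> None:
    # Intended use (hypothetical): wrapper.py <real command> <args...>
    sys.exit(run_wrapped(sys.argv[1:]))
```

The job script would then submit the wrapper instead of the bare command, so a killed task ends in the error state rather than finishing with a status that lets its dependents start.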

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread Reuti
On 01.04.2011 at 12:54, lars van der bijl wrote: In this case yes. However, on the jobs running on our farm we put no memory limits as of yet, just the requested number of procs. Is it the usual behaviour that if it fails with this code the subsequent dependencies start regardless? Yes,

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread Reuti
Add-on: you can check the messages file of the execd on the nodes to see whether anything about the reason was recorded there. -- Reuti On 01.04.2011 at 16:39, lars van der bijl wrote: The problem is that I don't have any such limits enforced currently on submission. the submission to qsub

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 193056
max locked memory (kbytes, -l) 256
max memory size (kbytes, -m)
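The limits above are what the shell reports. To compare them with what a process actually sees inside a running job, something like the following could be run from the job itself. This is a sketch using Python's standard `resource` module (Unix only); `current_limits` is a hypothetical helper name, and the selection of limits shown is an arbitrary subset.

```python
import resource

# Subset of limits to report; the names are attributes of the resource module.
LIMIT_NAMES = ["RLIMIT_CORE", "RLIMIT_DATA", "RLIMIT_FSIZE", "RLIMIT_AS", "RLIMIT_STACK"]


def _show(value: int):
    """Render RLIM_INFINITY the way ulimit does."""
    return "unlimited" if value == resource.RLIM_INFINITY else value


def current_limits() -> dict:
    """Return {limit name: (soft, hard)} for the current process."""
    limits = {}
    for name in LIMIT_NAMES:
        soft, hard = resource.getrlimit(getattr(resource, name))
        limits[name] = (_show(soft), _show(hard))
    return limits


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
```

A difference between the interactive values and those printed from inside a job would suggest that a limit is being applied at submission or by the queue configuration.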

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread Reuti
On 01.04.2011 at 16:57, lars van der bijl wrote: core file size (blocks, -c) 0 snip file locks (-x) unlimited Fine. I think it might be the machine killing them, because we're not putting any other limits anywhere. Unless it's the application we're running.