Hey everyone.
We're having some issues with jobs being killed with exit status 137. This
causes the task to finish and start its dependent task, which is causing all
kinds of havoc.
Submitting a job with a very small max memory limit gives me this as an
example:
$ qacct -j 21141
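Not from the original mail, but a quick way to pull just the failure-related fields out of the accounting record (job id 21141 is the one above; `exit_status`, `failed`, and `maxvmem` are standard qacct output fields):

```shell
# Filter the accounting record down to the fields that explain the kill.
# An exit_status of 137 is 128 + 9, i.e. the job died from SIGKILL.
qacct -j 21141 | awk '/^(exit_status|failed|maxvmem)/ {print}'
```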
Also, is there any way of catching this and raising 100? Once the job is
finished and its dependencies start, it causes major havoc on our system,
looking for files that aren't there.
Are there other things the grid uses SIGKILL for, not just memory
limits?
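One possible approach to the "raising 100" idea above (a sketch, not from the thread): wrap the real payload in a small script that maps the SIGKILL exit status to 100. Grid Engine reportedly puts a job that exits with status 100 into an error state, which should keep `-hold_jid` dependents from starting; verify that behaviour on your installation before relying on it. The wrapper name is hypothetical.

```shell
#!/bin/sh
# run_guarded.sh -- hypothetical wrapper; usage: run_guarded.sh <command> [args...]
"$@"
status=$?
# 137 = 128 + 9 (SIGKILL): the kernel OOM killer, a hard memory limit,
# or an explicit kill can all produce this exit status.
if [ "$status" -eq 137 ]; then
    # Exit 100 reportedly puts the job into an error state in Grid Engine,
    # holding back dependent jobs -- check your version's docs.
    exit 100
fi
exit "$status"
```

Submit it as `qsub run_guarded.sh real_job args...` so the mapping happens inside the job itself.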
Lars
Am 01.04.2011 um 12:54 schrieb lars van der bijl:
In this case, yes.
However, we haven't put any memory limits on the jobs running on our farm
yet; we just request a number of procs.
Is it usual behaviour that, if a job fails with this code, the subsequent
dependencies start regardless?
Yes,
Add on:
you can check the messages file of the execd on the nodes to see whether
anything about the reason was recorded there.
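Following Reuti's suggestion, something like this could be run on the execution node (the spool path below is a common default layout, not confirmed for this installation, so adjust it for your cell):

```shell
# Look for mentions of the job in the execd messages files.
grep -i '21141' "$SGE_ROOT"/default/spool/*/messages

# If nothing is there, check the kernel log on the node: the OOM killer
# logs a "Killed process" line when it SIGKILLs a process for memory pressure.
dmesg | grep -i 'killed process'
```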
-- Reuti
Am 01.04.2011 um 16:39 schrieb lars van der bijl:
the problem is that I don't have any such limits enforced currently on
submission. The shell I submit from with qsub shows:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 193056
max locked memory (kbytes, -l) 256
max memory size (kbytes, -m)
Am 01.04.2011 um 16:57 schrieb lars van der bijl:
core file size (blocks, -c) 0
snip
file locks (-x) unlimited
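The `ulimit -a` output above is from the login shell; the limits inside a running job can differ, since the execd applies the queue's own memory settings. A sketch for checking both sides (the queue name `all.q` is an assumption):

```shell
# Show the limits as a job actually sees them, not as the login shell does.
echo 'ulimit -a' | qsub -N checklimits -j y -o limits.out

# Inspect the memory-related limits configured on the queue itself.
qconf -sq all.q | grep -E 'h_vmem|s_vmem|h_rss|s_rss'
```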
Fine.
I think it might be the machine killing them, because we're not putting any
other limits anywhere, unless it's the application we're running.