On 29.07.2017 17:31, Florian Pommerening wrote:
On 29.07.2017 10:04, Benjamin Redling wrote:
On 29 July 2017 08:07:44 CEST, Florian Pommerening <florian.pommeren...@unibas.ch> wrote:

Hi everyone,

is there a way to find out why a job was canceled by slurm? I would like to distinguish the cases where a resource limit was hit from all other reasons (like a manual cancellation). In case a resource limit was hit, I also would like to know which one.

Thank you
Florian

Hello,

@ https://slurm.schedmd.com/squeue.html
search for: JOB STATE CODES
compare CANCELLED, FAILED, and TIMEOUT

Regards,
Benjamin

Hello Benjamin,

thank you for your response. I am aware of that page, but I may have misinterpreted the description of the CA (CANCELLED) state so far. For tasks that presumably ran out of memory I got a state of CANCELLED, even though neither I nor our system admin explicitly canceled them. I therefore assumed that running out of memory leads to a signal being sent to the task, which in turn causes the task to be canceled. Some of the jobs also ended up in a FAILED state. I'm not sure which of these is the expected behavior for running out of memory.

If the FAILED state is the expected outcome when running out of resources other than time, I think it is not a good indicator, because it lumps actual issues in the executed program (like segfaults) together with running out of resources.

If the CANCELLED state is the expected outcome, my original question still stands.

Either way, I wonder why I got a mix of both outcomes. This might be related to the following issue: https://bugs.schedmd.com/show_bug.cgi?id=3999

I also don't know how I can distinguish different resource limits. I assume the TIMEOUT state refers to hitting the wall-clock time limit; is that correct? In our case, we do not have a wall-clock time limit, but a CPU time limit and a memory limit (enforced with the cgroup plugin). Is there any way to recognize which of these was hit?
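
As a workaround, I have been thinking about pulling the state, exit code and consumed resources of each job out of the accounting database and comparing them against the limits we requested. A rough, untested Python sketch of that idea (the field names are taken from the sacct man page; the job id is made up):

    # Rough sketch (untested): fetch state, exit code and consumed resources
    # for a finished job from sacct and print them next to the requested
    # limits, so that one can at least eyeball which limit was hit.
    # Assumes sacct is in PATH and the job is already in the accounting DB.
    import subprocess

    FIELDS = ["JobID", "State", "ExitCode", "Elapsed", "Timelimit",
              "TotalCPU", "MaxRSS", "ReqMem"]

    def job_records(job_id):
        """Return one dict per sacct line (job allocation plus each step)."""
        out = subprocess.check_output(
            ["sacct", "-j", str(job_id), "--noheader", "--parsable2",
             "--format=" + ",".join(FIELDS)],
            universal_newlines=True)
        return [dict(zip(FIELDS, line.split("|")))
                for line in out.splitlines() if line.strip()]

    for record in job_records(123456):  # 123456 is a made-up job id
        # MaxRSS and TotalCPU show up on the step lines, Timelimit and
        # ReqMem on the job line, so print everything for now.
        print(record)

This obviously only tells me how close a job got to its limits, not why slurm ended it, so it is more of a heuristic than an answer.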

My overall goal is to run a program on a bunch of different inputs with some tight resource bounds. For some of the inputs the resources will not suffice, so for a given input the program can either complete successfully, run out of resources, or crash. A crash means there is a bug in the program that requires further investigation, whereas running out of resources is expected for some inputs. So I want to know which tasks ran out of resources and which ones had an actual problem.
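
To make this concrete, the classification I would like to end up with looks roughly like the (untested) sketch below. The part I cannot fill in is the middle branch, because with the states I currently see, "out of resources" and "crash" are not cleanly separated:

    # Sketch of the classification I am after, continuing the sacct idea
    # above. state and exit_code are the strings reported by sacct.
    def classify(state, exit_code):
        """Map a sacct State/ExitCode pair to one of three buckets."""
        state = state.split()[0]  # e.g. "CANCELLED by 1000" -> "CANCELLED"
        if state == "COMPLETED" and exit_code == "0:0":
            return "success"
        if state == "TIMEOUT":
            return "out of resources"  # wall-clock limit, which we do not use
        if state in ("CANCELLED", "FAILED"):
            return "unclear: memory/CPU-time limit, or a real crash?"
        return "needs investigation"

    print(classify("CANCELLED by 0", "0:15"))  # made-up example values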

Cheers
Florian

Hi everyone,

I still hope to find a way to distinguish between the different reasons why a job stopped. I tried looking in the slurm source code but didn't know where to start. I found some code that polls the task and can eventually detect that the task is above its limit. In that case, this would lead to a return value of SIG_OOM (0:125).
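
If that is indeed what happens on our installation, I guess I could simply look for that exit code in the accounting data, along these lines (untested, and assuming sacct really reports the code literally as "0:125"):

    # Untested: flag jobs whose sacct ExitCode is 0:125, which (if I read
    # the source correctly) is how SIG_OOM would show up.
    def hit_memory_limit(exit_code):
        return exit_code.strip() == "0:125"

That would only cover the memory limit, though, not the CPU time limit.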

I don't quite understand how this relates to the task/cgroup plugin. If the task runs in a cgroup with a memory limit, the kernel will never let the process exceed that limit in the first place, so I'm not sure how slurm can ever detect an out-of-memory situation itself.
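
For what it's worth, the kernel keeps some OOM bookkeeping per cgroup, so I imagine the detection could also work by reading those files. A very rough sketch, assuming cgroup v1 mounted at /sys/fs/cgroup/memory; the slurm-specific path in the comment is a guess on my part and will depend on the local cgroup configuration:

    # Very rough sketch: read the kernel's OOM bookkeeping for one cgroup.
    # Assumes cgroup v1; the memory controller exposes these files per cgroup.
    import os

    def read_value(cgroup_dir, name):
        with open(os.path.join(cgroup_dir, name)) as f:
            return f.read().strip()

    def oom_info(cgroup_dir):
        return {
            # number of times the memory limit was hit
            "failcnt": read_value(cgroup_dir, "memory.failcnt"),
            # contains oom_kill_disable and under_oom (newer kernels also
            # report an oom_kill counter here)
            "oom_control": read_value(cgroup_dir, "memory.oom_control"),
            "limit": read_value(cgroup_dir, "memory.limit_in_bytes"),
        }

    # Hypothetical path, depends on cgroup.conf and the local setup:
    # print(oom_info("/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_0"))

Maybe that is roughly what slurm does internally, but I have not found the corresponding code yet, hence my question.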

Can anyone give me some pointers on where to look next?

Cheers
Florian
