On 29.07.2017 17:31, Florian Pommerening wrote:
On 29.07.2017 10:04, Benjamin Redling wrote:
On 29 July 2017 08:07:44 CEST, Florian Pommerening <florian.pommeren...@unibas.ch> wrote:

Hi everyone,

is there a way to find out why a job was canceled by slurm? I would like to distinguish the cases where a resource limit was hit from all other reasons (like a manual cancellation). In case a resource limit was hit, I also would like to know which one.

Thank you
Florian

Hello,

@ https://slurm.schedmd.com/squeue.html
search for: JOB STATE CODES
compare CANCELLED, FAILED, and TIMEOUT

Regards,
Benjamin

Hello Benjamin,

thank you for your response. I am aware of that page, but I may have misinterpreted the description of the CA (CANCELLED) state so far. For tasks that presumably ran out of memory I got a state of CANCELLED, even though neither I nor our system admin explicitly canceled them. I therefore assumed that running out of memory leads to a signal being sent to the task, which in turn causes the task to be canceled. Some of the jobs also ended up in a FAILED state. I'm not sure which of these is the expected behavior for running out of memory.

If the FAILED state is the expected outcome when running out of resources other than time, I think it is not a good indicator, because it lumps actual issues in the executed program (like segfaults) together with running out of resources.

If the CANCELLED state is the expected outcome, my original question still stands.

Either way, I wonder why I got a mix of both outcomes. This might be related to the following issue: https://bugs.schedmd.com/show_bug.cgi?id=3999

I also don't know how I can distinguish different resource limits. I assume the TIMEOUT state refers to hitting the wall-clock time limit; is that correct? In our case, we do not have a wall-clock time limit, but a CPU time limit and a memory limit (enforced with the cgroup plugin). Is there any way to recognize which of these was hit?
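
As a workaround, I have been thinking about pulling the state, exit code and consumed resources of each job out of the accounting database and comparing them against the limits we requested. A rough, untested Python sketch of that idea (the field names are taken from the sacct man page; the job id is made up):

    # Rough sketch (untested): fetch state, exit code and consumed resources
    # for a finished job from sacct and print them next to the requested
    # limits, so that one can at least eyeball which limit was hit.
    # Assumes sacct is in PATH and the job is already in the accounting DB.
    import subprocess

    FIELDS = ["JobID", "State", "ExitCode", "Elapsed", "Timelimit",
              "TotalCPU", "MaxRSS", "ReqMem"]

    def job_records(job_id):
        """Return one dict per sacct line (job allocation plus each step)."""
        out = subprocess.check_output(
            ["sacct", "-j", str(job_id), "--noheader", "--parsable2",
             "--format=" + ",".join(FIELDS)],
            universal_newlines=True)
        return [dict(zip(FIELDS, line.split("|")))
                for line in out.splitlines() if line.strip()]

    for record in job_records(123456):  # 123456 is a made-up job id
        # MaxRSS and TotalCPU show up on the step lines, Timelimit and
        # ReqMem on the job line, so print everything for now.
        print(record)

This obviously only tells me how close a job got to its limits, not why slurm ended it, so it is more of a heuristic than an answer.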

My overall goal is to run a program on a bunch of different inputs with some tight resource bounds. For some of the inputs the resources will not suffice, so for a given input the program can either complete successfully, run out of resources, or crash. A crash means there is a bug in the program that requires further investigation, whereas running out of resources is expected for some inputs. So I want to know which tasks ran out of resources and which ones had an actual problem.
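
To make this concrete, the classification I would like to end up with looks roughly like the (untested) sketch below. The part I cannot fill in is the middle branch, because with the states I currently see, "out of resources" and "crash" are not cleanly separated:

    # Sketch of the classification I am after, continuing the sacct idea
    # above. state and exit_code are the strings reported by sacct.
    def classify(state, exit_code):
        """Map a sacct State/ExitCode pair to one of three buckets."""
        state = state.split()[0]  # e.g. "CANCELLED by 1000" -> "CANCELLED"
        if state == "COMPLETED" and exit_code == "0:0":
            return "success"
        if state == "TIMEOUT":
            return "out of resources"  # wall-clock limit, which we do not use
        if state in ("CANCELLED", "FAILED"):
            return "unclear: memory/CPU-time limit, or a real crash?"
        return "needs investigation"

    print(classify("CANCELLED by 0", "0:15"))  # made-up example values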

Cheers
Florian

Hi everyone,

I still hope to find a way to distinguish between the different reasons why a job stopped. I tried looking in the slurm source code but didn't know where to start. I found some code that polls the task and can eventually detect that the task is above its limit. In that case, this would lead to a return value of SIG_OOM (0:125).
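
If that is indeed what happens on our installation, I guess I could simply look for that exit code in the accounting data, along these lines (untested, and assuming sacct really reports the code literally as "0:125"):

    # Untested: flag jobs whose sacct ExitCode is 0:125, which (if I read
    # the source correctly) is how SIG_OOM would show up.
    def hit_memory_limit(exit_code):
        return exit_code.strip() == "0:125"

That would only cover the memory limit, though, not the CPU time limit.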

I don't quite understand how this relates to the task/cgroup plugin. If the task runs in a cgroup with a memory limit, the kernel will never let the process exceed that limit in the first place, so I'm not sure how slurm can ever detect an out-of-memory situation itself.
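
For what it's worth, the kernel keeps some OOM bookkeeping per cgroup, so I imagine the detection could also work by reading those files. A very rough sketch, assuming cgroup v1 mounted at /sys/fs/cgroup/memory; the slurm-specific path in the comment is a guess on my part and will depend on the local cgroup configuration:

    # Very rough sketch: read the kernel's OOM bookkeeping for one cgroup.
    # Assumes cgroup v1; the memory controller exposes these files per cgroup.
    import os

    def read_value(cgroup_dir, name):
        with open(os.path.join(cgroup_dir, name)) as f:
            return f.read().strip()

    def oom_info(cgroup_dir):
        return {
            # number of times the memory limit was hit
            "failcnt": read_value(cgroup_dir, "memory.failcnt"),
            # contains oom_kill_disable and under_oom (newer kernels also
            # report an oom_kill counter here)
            "oom_control": read_value(cgroup_dir, "memory.oom_control"),
            "limit": read_value(cgroup_dir, "memory.limit_in_bytes"),
        }

    # Hypothetical path, depends on cgroup.conf and the local setup:
    # print(oom_info("/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_0"))

Maybe that is roughly what slurm does internally, but I have not found the corresponding code yet, hence my question.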

Can anyone give me some pointers on where to look next?

Cheers
Florian
