Hi,

We have a user who reports that his job was not requeued when the node failed.

Slurmctld detects the missing job when the node is rebooted and slurmd sends
its registration message.

In these cases, slurmctld just calls job_complete with requeue=0 and
node_fail=1. I wonder why it is not possible to requeue a job when this
happens. Maybe there is a complex interaction that I cannot see.

Also, slurmctld logs the message "Job 10529777 cancelled from
interactive user", which is not what happened, but the code triggers here:
 

(line 3463 at job_mgr.c)

    if ((job_return_code == NO_VAL) &&
        (IS_JOB_RUNNING(job_ptr) || IS_JOB_PENDING(job_ptr))) {
            info("Job %u cancelled from interactive user", job_ptr->job_id);
    }


An extra check for node_fail should probably be added to this condition.
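
To make the suggestion concrete, here is a rough standalone sketch of the decision I have in mind. It is not a patch against job_mgr.c: the struct, the enums and sketch_pick_log_msg() are hypothetical stand-ins so that it compiles on its own, and it assumes the node-failure path reaches this block with job_return_code == NO_VAL, as the log message suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as NO_VAL in slurm.h, redefined so the sketch compiles alone. */
#define NO_VAL 0xfffffffeU

/* Hypothetical stand-in for what IS_JOB_RUNNING()/IS_JOB_PENDING() test. */
enum sketch_job_state { SKETCH_PENDING, SKETCH_RUNNING, SKETCH_COMPLETE };

/* Hypothetical, reduced job record (the real one is much larger). */
struct sketch_job {
	uint32_t job_id;
	enum sketch_job_state state;
};

/* Hypothetical result: which message the completion path would log. */
enum sketch_log_msg {
	LOG_NONE,
	LOG_CANCELLED_INTERACTIVE,
	LOG_NODE_FAILURE
};

enum sketch_log_msg sketch_pick_log_msg(const struct sketch_job *job,
					bool node_fail,
					uint32_t job_return_code)
{
	bool active;

	assert(job != NULL);
	active = (job->state == SKETCH_RUNNING) ||
		 (job->state == SKETCH_PENDING);

	/* Same condition as the existing block: nothing to log otherwise. */
	if ((job_return_code != NO_VAL) || !active)
		return LOG_NONE;

	/* Proposed extra check: a node failure is not a user cancel. */
	if (node_fail)
		return LOG_NODE_FAILURE;

	return LOG_CANCELLED_INTERACTIVE;
}
```

With this, a running job such as 10529777 completed with node_fail=1 would be reported as a node failure, and the "cancelled from interactive user" message would be kept for the case where node_fail=0.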
