Re: [slurm-users] Change ExcNodeList on a running job

2020-06-10 Thread Ransom, Geoffrey M.
I'm just curious as to what causes a user to decide that a given node has an issue? If a node is healthy in all respects, why would a user decide not to use the node? Not enough free TMPDIR space, a GPU starts having memory errors, or a machine with a temporary issue that slurm

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-05 Thread Ole Holm Nielsen
: Thursday, June 4, 2020 4:16 PM To: Slurm User Community List Subject: [EXT] Re: [slurm-users] Change ExcNodeList on a running job APL external email warning: Verify sender slurm-users-boun...@lists.schedmd.com before clicking

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Steven Senator (slurm-dev-list)
xcNodeList=$NewExcNodeList; scontrol requeue ${SLURM_JOB_ID}; sleep 10; fi From: slurm-users On Behalf Of Rodrigo Santibáñez Sent: Thursday, J

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Rodrigo Santibáñez
ol update job ${SLURM_JOB_ID} ExcNodeList=$NewExcNodeList; scontrol requeue ${SLURM_JOB_ID}; sleep 10; fi From: slurm-users On Behalf Of Rodrigo Santibáñez Sent: Thursday, June 4, 2020 4:16 PM To: Slurm User C

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Ransom, Geoffrey M.
=$NewExcNodeList; scontrol requeue ${SLURM_JOB_ID}; sleep 10; fi From: slurm-users On Behalf Of Rodrigo Santibáñez Sent: Thursday, June 4, 2020 4:16 PM To: Slurm User Community List Subject: [EXT] Re: [slurm-users] Change ExcNodeList on a running job APL external email warning: Verify sender slurm

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Riebs, Andy
-- but the situations where this would be a good solution are rare!) Andy From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Rodrigo Santibáñez Sent: Thursday, June 4, 2020 4:16 PM To: Slurm User Community List Subject: Re: [slurm-users] Change ExcNodeList on a running job Hello

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Rodrigo Santibáñez
Hello, Jobs can be requeued if something goes wrong, and the failed node excluded by the controller. *--requeue* Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher

[slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Ransom, Geoffrey M.
Hello, We are moving from Univa (SGE) to Slurm, and one of our users has jobs that, if they detect a failure on the current machine, add that machine to their exclude list and requeue themselves. The user wants to emulate that behavior in Slurm. It seems like "scontrol update job
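
For readers following the thread, the "add this machine to the exclude list and requeue" pattern described above might be sketched roughly as below. This is only an illustration under stated assumptions, not the script posted in the thread: the function names append_excluded_node and requeue_excluding_this_node are hypothetical, the job is assumed to run as a batch job with SLURM_JOB_ID and SLURMD_NODENAME set, and the exclude list is treated as plain comma-separated node names (a bracketed range such as node[01-03] is passed through but not checked for duplicates).

```shell
#!/bin/bash

# Hypothetical helper: append a node to a comma-separated exclude list,
# leaving the list unchanged if the node is already in it.
append_excluded_node() {
    local current="$1" node="$2"
    case ",${current}," in
        *",${node},"*) printf '%s\n' "${current}" ;;          # already excluded
        ",,")          printf '%s\n' "${node}" ;;             # list was empty
        *)             printf '%s\n' "${current},${node}" ;;  # append
    esac
}

# Hypothetical wrapper: exclude the node this job is running on, then requeue.
# Assumes it is called from inside a batch job after the job's own health
# check has failed; the health check itself is not shown here.
requeue_excluding_this_node() {
    local current new
    # "scontrol show job" prints ExcNodeList=(null) when nothing is excluded.
    current=$(scontrol show job "${SLURM_JOB_ID}" \
        | grep -o 'ExcNodeList=[^ ]*' | cut -d= -f2)
    [ "${current}" = "(null)" ] && current=""
    new=$(append_excluded_node "${current}" "${SLURMD_NODENAME}")
    # JobId= is the documented spelling of the job selector for scontrol update.
    scontrol update JobId="${SLURM_JOB_ID}" ExcNodeList="${new}"
    scontrol requeue "${SLURM_JOB_ID}"
    sleep 10   # as in the thread: wait for the requeue to take effect
}
```

A job script would call requeue_excluding_this_node only after detecting a local problem (for example, too little free TMPDIR space). Whether an ordinary user is allowed to update or requeue their own job this way depends on the site's Slurm configuration.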