Re: [slurm-users] Spurious OOM-kills with cgroups on 20.11.8?

2021-08-10 Thread Roger Moye
Do you know if the job is actually being killed? We had an issue on an older version of Slurm whereby we got OOM errors but the tasks actually completed. The OOM came when the job exited and was a false error. Also, there are several bug reports open right now about an issue similar to what

Re: [slurm-users] Reset TMPDIR for All Jobs

2020-05-12 Thread Roger Moye
We had issues getting TMPDIR to work as well. We finally did this in our prolog: export SLURM_TMPDIR="/tmp/slurm/${SLURM_JOB_ID}" This works. -Roger From: slurm-users On Behalf Of Ellestad, Erik Sent: Tuesday, May 12, 2020 10:40 AM To: slurm-users@lists.schedmd.com Subject: [slurm-users] R
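The message above does not say which kind of prolog was used. As a minimal sketch, assuming the script is configured as a TaskProlog in slurm.conf: Slurm applies lines that a TaskProlog prints to stdout in the form "export NAME=value" to the task's environment, so the variable can be handed to the job that way. The function names job_tmpdir and emit_export are hypothetical helpers, and the /tmp/slurm base path is taken from the message. Creating and cleaning up the directory is assumed to happen elsewhere (for example in a root-run Prolog and Epilog).

```shell
#!/bin/bash
# Hypothetical TaskProlog sketch (assumes slurm.conf sets TaskProlog to this script).

# Build the per-job scratch path; the /tmp/slurm base comes from the thread.
job_tmpdir() {
    printf '/tmp/slurm/%s' "$1"
}

# Print the line that Slurm turns into an environment variable for the task.
emit_export() {
    echo "export SLURM_TMPDIR=$(job_tmpdir "$1")"
}

# SLURM_JOB_ID is set by Slurm for the prolog; do nothing outside a job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    emit_export "${SLURM_JOB_ID}"
fi
```

Inside the job, `echo "$SLURM_TMPDIR"` should then show the per-job path if the TaskProlog assumption holds for the site.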

Re: [slurm-users] SLURM_TMPDIR

2019-12-05 Thread Roger Moye
Our prolog script just does this: export SLURM_TMPDIR="/tmp/slurm/${SLURM_JOB_ID}" This has worked for us. -Roger -Original Message- From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Angelines Sent: Thursday, December 5, 2019 9:58 AM To: slurm-users@lists.sc

[slurm-users] Slurm cancelling jobs even when dependencies are successful

2019-10-29 Thread Roger Moye
8.08.4. Thanks in advance! -Roger Roger Moye HPC Engineer 713.425.6236 Office 713.898.0021 Mobile QUANTLAB Financial, LLC 3 Greenway Plaza Suite 200 Houston, Texas 77046 www.quantlab.com<https:

[slurm-users] How to automatically release jobs that failed with "launch failed requeued held"

2019-01-22 Thread Roger Moye
hat it can run? There were plenty of healthy nodes for this job, so I'd prefer that the job not remain held indefinitely. Thanks! -Roger

[slurm-users] How to implement job arrays with distribution=cyclic

2018-12-12 Thread Roger Moye
accomplish this? Without this, node 1 fills up first before any cores on node 2 are assigned. Thanks in advance! -Roger

Re: [slurm-users] $TMPDIR does not honor "TmpFS"

2018-11-21 Thread Roger Moye
We are having the exact same problem with $TMPDIR. I wonder if a bug has crept in? I spoke to the SchedMD guys at SC18 last week and they were not aware of a bug, but since more than one person is having this difficulty, something must be wrong somewhere. -Roger From: slurm-users [mailto:sl

[slurm-users] How to get information about job steps

2018-05-09 Thread Roger Moye
he step is running or pending. Once it is finished, the information disappears. Is there a way to see all job steps associated with a job regardless of the state of the step? Thanks so much! -Roger Moye