Hi Jason,
On 4/20/23 20:11, Jason Simms wrote:
> Hello Ole and Hoot,
>
> First, Hoot, thank you for your question. I've managed Slurm for a few
> years now and still feel like I don't have a great understanding about
> managing or limiting resources.
>
> Ole, thanks for your continued support of the user community with your
> documentation. I do wish not only that more of your information were
> contained within the official docs, but also that there were even clearer
> discussions around certain topics.
>
> As an example, you write that "It is important to configure slurm.conf so
> that the locked memory limit isn’t propagated to the batch jobs" by
> setting PropagateResourceLimitsExcept=MEMLOCK. It's unclear to me whether
> you are suggesting that literally everyone should have that set, or
> whether it only applies to certain configurations. We don't have it set,
> for instance, but we've not run into trouble with jobs failing due to
> locked memory errors.
The link mentioned in the page hopefully explains it:
https://slurm.schedmd.com/faq.html#memlock
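As a rough illustration of what propagation means in practice, a job can print the locked-memory limit it actually runs under. This is only a sketch using Python's standard resource module on a Linux node; the helper names format_limit and memlock_limits are made up for this example and are not part of Slurm.

```python
import resource


def format_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process as text."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"RLIMIT_MEMLOCK soft={soft} hard={hard}")
```

Running it once in a login shell and once inside a batch job should show whether the job inherited the login session's limit. With PropagateResourceLimitsExcept=MEMLOCK set, I would expect the job to report the limit in effect for slurmd on the compute node instead, but please verify that on your own cluster.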
> Then, in the official docs, to which you link, it says that "it may also
> be desirable to lock the slurmd daemon's memory to help ensure that it
> keeps responding if memory swapping begins" by creating
> /etc/sysconfig/slurm containing the line SLURMD_OPTIONS="-M". Would there
> ever be a reason *not* to include that? That is, I can't think it would
> ever be desirable for slurmd to stop responding. So is that another
> "universal" recommendation, I wonder?
I'm not an expert on locking slurmd pages! The -M option is documented in
the slurmd manual page, and I probably read a thread about it on the
slurm-users mailing list long ago. You could try it out in your
environment and see if all is well.
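One way to try it out is to check whether slurmd really has pages locked after it has been restarted with -M. The sketch below assumes a Linux node, where /proc/<pid>/status reports locked memory in the VmLck field; the function names and the way the PID is passed on the command line are just illustrative choices.

```python
import re
import sys


def parse_vmlck_kb(status_text):
    """Return the VmLck value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmLck:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def locked_kb_for_pid(pid):
    """Read /proc/<pid>/status and return the locked memory in kB, or None."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmlck_kb(status_file.read())


if __name__ == "__main__":
    # Expects the PID of slurmd as the only argument; does nothing otherwise.
    if len(sys.argv) == 2:
        print(f"VmLck for PID {sys.argv[1]}: {locked_kb_for_pid(sys.argv[1])} kB")
```

Pass it the PID of the running slurmd. A non-zero VmLck value would suggest that memory locking took effect, while 0 kB would suggest that it did not.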
> It may be me talking as a new-ish user, but I would find it helpful to
> have a concise document laying out common or useful configuration options
> to consider when setting up or reconfiguring Slurm. I'm certain I have
> inefficient settings, or am missing options that I should have.
IMHO, most sites have their own requirements and preferences, so I don't
think there is a one-size-fits-all Slurm installation solution.
Since requirements can be so different, and because Slurm is a fantastic
piece of software that can be configured for many different scenarios, IMHO
a support contract with SchedMD is the best way to get consulting services,
get general help, and report bugs. We have had excellent experiences with
SchedMD support (https://www.schedmd.com/support.php).
Best regards,
Ole
> On Thu, Apr 20, 2023 at 2:11 AM Ole Holm Nielsen
> <ole.h.niel...@fysik.dtu.dk> wrote:
>> Hi Hoot,
>>
>> On 4/20/23 00:15, Hoot Thompson wrote:
>>> Is there a ‘how to’ or recipe document for setting up and enforcing
>>> resource limits? I can establish accounts, users, and set limits but
>>> 'current value' is not incrementing after running jobs.
>>
>> I have written about resource limits in this Wiki page:
>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits