We are pleased to announce the availability of Slurm version 24.11.1.
This release fixes a few possible crashes of the slurmctld and slurmrestd; a
regression in 24.11 which caused file transfers to a job with sbcast to
not join the job container namespace; MPI apps using Intel OPA, PSM2 and
OMPI 5.x when run through srun; and various minor to moderate bugs.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
* Changes in Slurm 24.11.1
==========================
-- With client commands MIN_MEMORY will show mem_per_tres if specified.
-- Fix errno message about bad constraint
-- slurmctld - Fix crash and possible split brain issue if the
backup controller handles an scontrol reconfigure while in control
before the primary resumes operation.
-- Fix stepmgr not getting dynamic node addrs from the controller
-- stepmgr - avoid "Unexpected missing socket" errors.
-- Fix `scontrol show steps` with dynamic stepmgr
-- Deny jobs using the "R:" option of --signal if PreemptMode=OFF
globally.
-- Force jobs using the "R:" option of --signal to be preemptable
by requeue or cancel only. If PreemptMode on the partition or QOS is off
or suspend, the job will default to using PreemptMode=cancel.
-- If --mem-per-cpu exceeds MaxMemPerCPU, the number of cpus per
task will always be increased even if --cpus-per-task was specified. This
is needed to ensure each task gets the expected amount of memory.
-- Fix compilation issue on OpenSUSE Leap 15
-- Fix jobs using more nodes than needed when not using -N
-- Fix issue with allocation being allocated fewer resources
than needed when using --gres-flags=enforce-binding.
-- select/cons_tres - Fix errors with the MaxCpusPerSocket partition
limit. Used cpus/cores were not counted properly, and free ones were not
limited to those available, when the socket was partially allocated or
the job request went beyond this limit.
-- Fix issue when jobs were preempted for licenses even if there
were enough licenses available.
-- Fix srun ntasks calculation inside an allocation when nodes are
requested using a min-max range.
-- Print correct number of digits for TmpDisk in sdiag.
-- Fix a regression in 24.11 which caused file transfers to a job
with sbcast to not join the job container namespace.
-- data_parser/v0.0.40 - Prevent a segfault in the slurmrestd when
dumping data with v0.0.40+complex data parser.
-- Remove logic to force lowercase GRES names.
-- data_parser/v0.0.42 - Prevent the association id from always
being dumped as NULL when parsing in complex mode. Instead it will now
dump the id. This affects the following endpoints:
GET slurmdb/v0.0.42/association
GET slurmdb/v0.0.42/associations
GET slurmdb/v0.0.42/config
-- Fix a job requeuing issue that merged job entries into the
same SLUID when all nodes in a job failed simultaneously.
-- When a job completes, try to give idle nodes to reservations with
the REPLACE flag before allowing them to be allocated to jobs.
-- Avoid expensive lookup of all associations when dumping or
parsing for v0.0.42 endpoints.
-- Avoid expensive lookup of all associations when dumping or
parsing for v0.0.41 endpoints.
-- Avoid expensive lookup of all associations when dumping or
parsing for v0.0.40 endpoints.
-- Fix segfault when testing jobs against nodes with invalid gres.
-- Fix performance regression while packing larger RPCs.
-- Document the new mcs/label plugin.
-- job_container/tmpfs - Fix Xauthority file being created
outside the container when EntireStepInNS is enabled.
-- job_container/tmpfs - Fix spank_task_post_fork not always
running in the container when EntireStepInNS is enabled.
-- Fix a job potentially getting stuck in CG on permissions
errors while setting up X11 forwarding.
-- Fix error on X11 shutdown if Xauthority file was not created.
-- slurmctld - Fix memory or fd leak if an RPC is received that
is not registered for processing.
-- Inject OMPI_MCA_orte_precondition_transports when using PMIx. This fixes
MPI apps using Intel OPA, PSM2 and OMPI 5.x when run through srun.
-- Don't skip the first partition_job_depth jobs per partition.
-- Fix gres allocation issue after controller restart.
-- Fix issue where jobs requesting cpus-per-gpu hang in queue.
-- switch/hpe_slingshot - Treat HTTP status forbidden the same as
unauthorized, allowing for a graceful retry attempt.
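
To illustrate the --mem-per-cpu / MaxMemPerCPU entry above, here is a minimal
sketch of the arithmetic involved. It is not code from slurmctld. The function
name and the rounding rule (scaling cpus per task by the rounded-up ratio of
the request to the limit) are assumptions made for illustration, and the real
implementation may round differently.

```python
def adjust_cpus_per_task(mem_per_cpu_mb, cpus_per_task, max_mem_per_cpu_mb):
    """Return (cpus_per_task, mem_per_cpu_mb) after applying the limit.

    Hypothetical helper: illustrates the idea only, not Slurm's own logic.
    """
    # A limit of 0 means unlimited; requests within the limit are unchanged.
    if max_mem_per_cpu_mb <= 0 or mem_per_cpu_mb <= max_mem_per_cpu_mb:
        return cpus_per_task, mem_per_cpu_mb

    # Assumed rule: integer ceiling of request / limit.
    ratio = (mem_per_cpu_mb + max_mem_per_cpu_mb - 1) // max_mem_per_cpu_mb

    # More cpus per task, less memory per cpu, so that each task still
    # gets at least the memory it originally asked for.
    new_cpus = cpus_per_task * ratio
    new_mem_per_cpu = (mem_per_cpu_mb + ratio - 1) // ratio
    return new_cpus, new_mem_per_cpu


# --mem-per-cpu=8192 --cpus-per-task=2 with MaxMemPerCPU=4096
print(adjust_cpus_per_task(8192, 2, 4096))  # (4, 4096)
```

In this sketch the task asked for 2 x 8192 MB and ends up with 4 x 4096 MB,
so the memory per task is preserved even though --cpus-per-task was given
explicitly.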