We are pleased to announce the availability of Slurm version 23.11.4.
The 23.11.4 release includes a number of stability fixes and various other
bug fixes. Notable changes: VSZ is no longer reported when using cgroup/v2
(the kernel does not provide this metric), a warning has been added when
select/linear and topology/tree are used together, as this combination will
not be supported in the next major release, and a backwards compatibility
issue that caused jobs using --gpus to be rejected when submitted from
23.02 or 22.05 has been fixed.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
-Tim
* Changes in Slurm 23.11.4
==========================
-- Fix a memory leak when updating partition nodes.
-- Don't leave a partition around if it fails to create with scontrol.
-- Fix segfault when creating partition with bad node list from scontrol.
-- Fix preserving partition nodes on bad node list update from scontrol.
-- Fix assertion in developer mode on a failed message unpack.
-- Fix repeat POWER_DOWN requests making the nodes available for ping.
-- Fix rebuilding job alias_list on restart when nodes are still powering up.
-- Fix INVALID nodes running health check.
-- Fix cloud/future nodes not setting addresses on invalid registration.
-- scrun - Remove the requirement to set the SCRUN_WORKING_DIR environment
variable. This was a regression in 23.11.
-- Add warning for using select/linear with topology/tree.
This combination will not be supported in the next major version.
-- Fix health check program not being run after first pass of all nodes when
using MaxNodeCount.
-- sacct - Set process exit code to one for all errors.
-- Add SlurmctldParameters=disable_triggers option.
-- Fix issue running steps when the allocation requested an exclusive
allocation along with shards.
-- Fix cleaning up the sleep process and the cgroup of the extern step if
slurm_spank_task_post_fork returns an error.
-- slurm_completion - Add missing --gres-flags= options
multiple-tasks-per-sharing and one-task-per-sharing.
-- scrun - Avoid race condition that could cause outbound network
communications to be incorrectly rejected with an incomplete packet error.
-- scrun - Gracefully handle the kernel reporting an invalid expected number
of incoming bytes for a connection, which caused incoming packet corruption
and resulted in the connection being closed.
-- srun - Return 1 when a step launch fails.
-- scrun - Avoid race condition that could cause deadlock during shutdown.
-- Fix scontrol listpids to work under dynamic node scenarios.
-- Add --tres-bind to --help and --usage output.
-- Add --gres-flags=allow-task-sharing to allow GPUs to still be accessible
among all tasks when binding GPUs to specific tasks.
-- Fix issue with CUDA_VISIBLE_DEVICES showing the same MIG device for all
tasks when using MIGs with --tres-per-task or --gpus-per-task.
-- slurmctld - Prevent a potential hang during shutdown/reconfigure if the
association cache thread was previously shut down.
-- scrun - Avoid race condition that could cause scrun to hang during
shutdown when connections have pending events.
-- scrun - Avoid excessive polling of connections during shutdown that could
needlessly cause 100% CPU usage on a thread.
-- sbcast - Use user identity from broadcast credential instead of looking it
up locally on the node.
-- scontrol - Remove "abort" option handling.
-- Fix an error message referring to the wrong RPC.
-- Fix memory leak on error when creating dynamic nodes.
-- Fix a slurmctld segfault when a cloud/dynamic node changes hostname on
registration.
-- Prevent a slurmctld deadlock if the gpu plugin fails to load when
creating a node.
-- Change a slurmctld fatal() to an error() when attempting to create a
dynamic node with a global autodetect set in gres.conf.
-- Fix leaving node records on error when creating nodes with scontrol.
-- scrun/sackd - Avoid race condition where shutdown could deadlock.
-- Fix a regression in 23.02.5 that caused pam_slurm_adopt to fail when
the user has multiple jobs on a node.
-- Add GLOB_SILENCE flag that silences the error message which will display if
an include directive attempts to use the "*" wildcard.
-- Fix jobs getting rejected when submitting with --gpus option from older
versions of job submission commands (23.02 and older).
-- cgroup/v2 - Return 0 for VSZ. Kernel cgroups do not provide this metric.
-- scrun - Avoid race condition where outbound RPCs could be corrupted.
-- scrun - Avoid race condition that could cause a crash while compiled in
debug mode.
-- gpu/rsmi - Disable gpu usage statistics when not using ROCM 6.0.0+.
-- Fix stuck processes and incorrect environment when using --get-user-env.
-- Avoid segfault in the slurmdbd when TrackWCKey=no but you are still using
WCKeys.
-- Fix ctld segfault with TopologyParam=RoutePart and no partition defined.
-- slurmctld - Fix missing --deadline handling for jobs not evaluated by the
schedulers (i.e. non-runnable, skipped for other reasons, etc.).
-- Demote some eio-related logs from error to verbose in user commands. These
are not generally actionable by the user and are easily generated by port
scanning a machine running srun.
-- Make sprio correctly print array tasks that have not yet been split out.
-- topology/block - Restrict the number of last-level blocks in any allocation.
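
Two of the changes above affect exit codes: sacct now exits with one for
all errors, and srun returns 1 when a step launch fails. Wrapper scripts can
rely on a plain non-zero check as a result. A minimal sketch, assuming a
POSIX shell; the helper name run_checked and the job ID in the usage comment
are hypothetical and not part of Slurm:

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): run a command, report any
# non-zero exit status on stderr, and pass that status back to the caller.
run_checked() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with exit code $status: $*" >&2
    fi
    return "$status"
}

# Example usage (requires a Slurm installation; 12345 is a placeholder job ID):
#   run_checked sacct -j 12345 --format=JobID,State,ExitCode || exit 1
#   run_checked srun --ntasks=1 hostname || exit 1
```

The helper only inspects the exit status, so it does not depend on parsing
error text from either command.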
--
Tim McMullan
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support