Slurm versions 24.05.2, 23.11.9, and 23.02.8 are now available and include a fix for a recently discovered security issue with the switch plugins.
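
As a rough illustration of how the three fixed releases map onto an installed version, the Python sketch below compares a version string against the fixed release of its series. The function name, the input format, and the handling of series not named above are assumptions made for this example; none of this is part of Slurm itself.

```python
# Fixed releases named in this announcement, keyed by (major, minor) series.
FIXED_RELEASES = {
    (24, 5): 2,   # 24.05.2
    (23, 11): 9,  # 23.11.9
    (23, 2): 8,   # 23.02.8
}


def has_switch_fix(version):
    """Return True or False for a listed series, or None for an unlisted one.

    `version` is assumed to be a plain "major.minor.micro" string such as
    "23.11.7". Suffixes such as "-1" or "rc1" are not handled in this sketch.
    """
    major, minor, micro = (int(part) for part in version.split("."))
    fixed_micro = FIXED_RELEASES.get((major, minor))
    if fixed_micro is None:
        return None  # series not covered by this announcement
    return micro >= fixed_micro
```

A result of None only means the series is not one of the three listed here; it says nothing about whether that series is affected.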

SchedMD customers were informed on July 17th and provided a patch on request; this process is documented in our security policy. [1]

For the switch/hpe_slingshot and switch/nvidia_imex plugins, a user could override the isolation between Slingshot VNIs or IMEX channels.

If you do not have one of these switch plugins configured, then you are not impacted by this issue.
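
One possible way to check this is to look at the SwitchType setting, either in slurm.conf or in the output of 'scontrol show config'. The Python sketch below parses such text and reports whether one of the two affected plugins is configured. The helper names are hypothetical, and the parsing assumes simple "SwitchType = value" lines.

```python
import re

# The two plugins named in this announcement.
AFFECTED_PLUGINS = {"switch/hpe_slingshot", "switch/nvidia_imex"}


def configured_switch_type(config_text):
    """Return the SwitchType value found in the text, or None if absent.

    Accepts slurm.conf-style text or `scontrol show config`-style output.
    """
    for line in config_text.splitlines():
        line = line.split("#", 1)[0]  # ignore slurm.conf comments
        match = re.match(r"\s*SwitchType\s*=\s*(\S+)", line, re.IGNORECASE)
        if match:
            return match.group(1)
    return None


def uses_affected_plugin(config_text):
    """True only if SwitchType names one of the affected plugins."""
    return configured_switch_type(config_text) in AFFECTED_PLUGINS
```

For example, passing the captured output of 'scontrol show config' to uses_affected_plugin() returns False when SwitchType is unset or names any other plugin.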

It is unclear what, if any, information could be accessed with access to an unauthorized channel. This disclosure is being made out of an abundance of caution.

If you do have one of these plugins enabled, the slurmctld must be restarted before the slurmd daemons to avoid disruption.
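
The required ordering can be sketched as a small planning helper. This Python example only builds an ordered list of commands and does not execute anything. The systemd unit names (slurmctld, slurmd) and the host names are assumptions and may differ on your site.

```python
def restart_plan(controller_hosts, compute_hosts):
    """Return (host, command) pairs in the order they should be run.

    Every slurmctld restart is placed before any slurmd restart.
    The unit names are assumed; adjust them to match the local install.
    """
    plan = [(host, "systemctl restart slurmctld") for host in controller_hosts]
    plan += [(host, "systemctl restart slurmd") for host in compute_hosts]
    return plan
```

Calling restart_plan(["ctl1"], ["node001", "node002"]) with these hypothetical host names yields the controller restart first, followed by the two slurmd restarts.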

Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security-policy/

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

* Changes in Slurm 24.05.2
==========================
 -- Fix energy gathering rpc counter underflow in _rpc_acct_gather_energy when
    more than 10 threads try to get energy at the same time. This prevented
    any step from getting energy from slurmd until slurmd was restarted,
    losing energy accounting metrics on the node.
 -- accounting_storage/mysql - Fix issue where new user with wckey did not
    have a default wckey sent to the slurmctld.
 -- slurmrestd - Prevent slurmrestd segfault when handling the following
    endpoints when none of the optional parameters are specified:
      'DELETE /slurm/v0.0.40/jobs'
      'DELETE /slurm/v0.0.41/jobs'
      'GET /slurm/v0.0.40/shares'
      'GET /slurm/v0.0.41/shares'
      'GET /slurmdb/v0.0.40/instance'
      'GET /slurmdb/v0.0.41/instance'
      'GET /slurmdb/v0.0.40/instances'
      'GET /slurmdb/v0.0.41/instances'
      'POST /slurm/v0.0.40/job/{job_id}'
      'POST /slurm/v0.0.41/job/{job_id}'
 -- Fix IPMI energy gathering when no IPMIPowerSensors are specified in
    acct_gather.conf. This situation resulted in an accounted energy of 0
    for job steps.
 -- Fix a minor memory leak in slurmctld when updating a job dependency.
 -- scontrol,squeue - Fix regression that caused incorrect values for
    multisocket nodes at '.jobs[].job_resources.nodes.allocation' for
    'scontrol show jobs --(json|yaml)' and 'squeue --(json|yaml)'.
 -- slurmrestd - Fix regression that caused incorrect values for
    multisocket nodes at '.jobs[].job_resources.nodes.allocation' to be dumped
    with endpoints:
      'GET /slurm/v0.0.41/job/{job_id}'
      'GET /slurm/v0.0.41/jobs'
 -- jobcomp/filetxt - Fix truncation of job record lines > 1024 characters.
 -- Fixed regression that prevented compilation on FreeBSD hosts.
 -- switch/hpe_slingshot - Drain node on failure to delete CXI services.
 -- Fix a performance regression from 23.11.0 in cpu frequency handling when no
    CpuFreqDef is defined.
 -- Fix one-task-per-sharing not working across multiple nodes.
 -- Fix inconsistent number of cpus when creating a reservation using the
    TRESPerNode option.
 -- data_parser/v0.0.40+ - Fix job state parsing which could break filtering.
 -- Prevent cpus-per-task from being modified in jobs where a -c value has
    been explicitly specified and the requested memory constraints implicitly
    increase the number of CPUs to allocate.
 -- slurmrestd - Fix regression where args '-s v0.0.39,dbv0.0.39' and
    '-d v0.0.39' would result in 'GET /openapi/v3' not registering as a valid
    possible query resulting in 404 errors.
 -- slurmrestd - Fix memory leak for dbv0.0.39 jobs query which occurred if the
    query parameters specified account, association, cluster, constraints,
    format, groups, job_name, partition, qos, reason, reservation, state, users,
    or wckey. This affects the following endpoints:
      'GET /slurmdb/v0.0.39/jobs'
 -- slurmrestd - In the case the slurmdbd does not respond to a persistent
    connection init message, prevent the closed fd from being used, and instead
    emit an error or warning depending on if the connection was required.
 -- Fix 24.05.0 regression that caused the slurmdbd not to send back an error
    message if there is an error initializing a persistent connection.
 -- Reduce latency of forwarded x11 packets.
 -- Add "curr_dependency" (representing the current dependency of the job)
    and "orig_dependency" (representing the original requested dependency of
    the job) fields to the job record in job_submit.lua (for job update) and
    jobcomp.lua.
 -- Fix potential segfault of slurmctld on reconfigure when configured with
    SlurmctldParameters=enable_rpc_queue.
 -- Fix potential segfault of slurmctld on its shutdown when rate limiting
    is enabled.
 -- slurmrestd - Fix missing job environment for SLURM_JOB_NAME,
    SLURM_OPEN_MODE, SLURM_JOB_DEPENDENCY, SLURM_PROFILE, SLURM_ACCTG_FREQ,
    SLURM_NETWORK and SLURM_CPU_FREQ_REQ to match sbatch.
 -- Add missing bash-completions dependency to slurm-smd-client debian package.
 -- Fix bash-completions installation in debian packages.
 -- Fix GRES environment variable indices being incorrect when only using a
    subset of all GPUs on a node and the --gres-flags=allow-task-sharing option
    is used.
 -- Add missing mariadb/mysql client package dependency to debian package.
 -- Fail the debian package build early if mysql cannot be found.
 -- Prevent scontrol from segfaulting when requesting scontrol show reservation
    --json or --yaml if there is an error retrieving reservations from the
    slurmctld.
 -- switch/hpe_slingshot - Fix security issue around managing VNI access.
 -- switch/nvidia_imex - Fix security issue managing IMEX channel access.
 -- switch/nvidia_imex - Allow for compatibility with job_container/tmpfs.

* Changes in Slurm 23.11.9
==========================
 -- Fix many commands possibly reporting an "Unexpected Message Received" when
    in reality the connection timed out.
 -- Fix heterogeneous job components not being signaled with scancel --ctld and
    'DELETE slurm/v0.0.40/jobs' if the job ids are not explicitly given,
    the heterogeneous job components match the given filters, and the
    heterogeneous job leader does not match the given filters.
 -- Fix regression from 23.02 impeding job licenses from being cleared.
 -- Move the _get_joules_task error, which was logged to the user when too
    many rpcs were queued in slurmd for gathering energy, to a log_flag.
 -- slurmrestd - Prevent a slurmrestd segfault when modifying an association
    without specifying max TRES limits in the request if those TRES
    limits are currently defined in the association. This affects the following
    fields of endpoint 'POST /slurmdb/v0.0.38/associations/':
      'associations/max/tres/per/job'
      'associations/max/tres/per/node'
      'associations/max/tres/total'
      'associations/max/tres/minutes/per/job'
      'associations/max/tres/minutes/total'
 -- Fix power_save operation after recovering from a failed reconfigure.
 -- scrun - Delay shutdown until after start requested. Previously, scrun
    would never start or shut down and would hang forever when using --tty.
 -- Fix backup slurmctld potentially not running the agent when taking over as
    the primary controller.
 -- Fix primary controller not running the agent when a reconfigure of the
    slurmctld fails.
 -- jobcomp/{elasticsearch,kafka} - Avoid sending fields with invalid date/time.
 -- Fix energy gathering rpc counter underflow in _rpc_acct_gather_energy when
    more than 10 threads try to get energy at the same time. This prevented
    any step from getting energy from slurmd until slurmd was restarted,
    losing energy accounting metrics on the node.
 -- slurmrestd - Fix memory leak for dbv0.0.39 jobs query which occurred if the
    query parameters specified account, association, cluster, constraints,
    format, groups, job_name, partition, qos, reason, reservation, state, users,
    or wckey. This affects the following endpoints:
      'GET /slurmdb/v0.0.39/jobs'
 -- switch/hpe_slingshot - Fix security issue around managing VNI access.

* Changes in Slurm 23.02.8
==========================
 -- Fix rare deadlock when a dynamic node registers at the same time that a
    once per minute background task occurs.
 -- Fix assertion in developer mode on a failed message unpack.
 -- switch/hpe_slingshot - Fix security issue around managing VNI access.

