We are pleased to announce the availability of Slurm versions 16.05.9
and 17.02.0-0rc1 (release candidate 1).
16.05.9 contains around 25 rather minor bug fixes. Please upgrade at
your leisure.
The rc release contains all of the features intended for release 17.02.
Development has ended for this release and we are continuing with our
testing phase, which will most likely result in another rc before we tag
17.02.0 near the middle of February. A description of what this release
contains is in the RELEASE_NOTES file available in the source. Your help
in hardening this version is greatly appreciated. You are invited to
download this version and assist in testing. As with all rc releases, you
should be able to install it and not worry about protocol/state changes
going forward with the version.
Slurm downloads are available from https://schedmd.com/downloads.php.
Reading from NEWS for 16.05.9...
* Changes in Slurm 16.05.9
==========================
-- Fix parsing of SBCAST_COMPRESS environment variable in sbcast.
-- Change some debug messages to errors in task/cgroup plugin.
-- backfill scheduler: Stop trying to determine expected start time for a job
after 2 seconds of wall time. This can happen if there are many running jobs
and a pending job can not be started soon.
-- Improve performance of cr_sort_part_rows() in cons_res plugin.
-- CRAY - Fix deadlock issue when updating accounting in the slurmctld and
scheduling a Datawarp job.
-- Correct the job state accounting information for jobs requeued due to burst
buffer errors.
-- burst_buffer/cray - Avoid "pre_run" operation if not using buffer (i.e.
just creating or deleting a persistent burst buffer).
-- Fix slurm.spec file support for BlueGene builds.
-- Fix missing TRES read lock in acct_policy_job_runnable_pre_select() code.
-- Fix debug2 message printing value using wrong array index in
_qos_job_runnable_post_select().
-- Prevent job timeout on node power up.
-- MYSQL - Fix minor memory leak when querying steps and the sql fails.
-- Make it so sacctmgr accepts column headers like MaxTRESPU and not
MaxTRESP.
-- Only look at SLURM_STEP_KILLED_MSG_NODE_ID on startup, to avoid race
condition later when looking at a step's env.
-- Make backfill scheduler behave like regular scheduler with respect to
'assoc_limit_stop'.
-- Allow a lower version client command to talk to a higher version controller
using the multi-cluster options (e.g. squeue -M<cluster>).
-- slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld
daemon is running or node boot in progress.
-- MYSQL - Fix a few other minor memory leaks when uncommon failures occur.
-- burst_buffer/cray - Fix race condition that could cause multiple batch job
launch requests resulting in drained nodes.
-- Correct logic to purge old reservations.
-- Fix DBD cache restore from previous versions.
-- Fix to logic for getting expected start time of existing job ID with
explicit begin time that is in the past.
-- Clear a job's reason of "BeginTime" in a more timely fashion and/or prevent
the job from being stuck in a PENDING state.
-- Make sure acct policy limits imposed on a job are correct after requeue.
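The backfill item above (give up on estimating a job's expected start time
after 2 seconds of wall time) describes a time-budgeted search. The following
is only a minimal sketch of that general idea, not Slurm's code: the backfill
plugin itself is written in C, and the names estimate_start_time,
candidate_times, can_start_at and budget_seconds are all hypothetical, chosen
for illustration.

```python
import time


def estimate_start_time(candidate_times, can_start_at,
                        budget_seconds=2.0, clock=time.monotonic):
    """Return the first candidate time at which the job could start.

    Hypothetical helper: returns None either when no candidate works or
    when the wall-time budget runs out before an answer is found.
    """
    deadline = clock() + budget_seconds
    for when in candidate_times:
        # Check the budget before each (potentially expensive) test.
        if clock() >= deadline:
            return None
        if can_start_at(when):
            return when
    return None
```

Passing the clock as a parameter is a design choice made here so the budget
logic can be exercised without actually waiting; a caller would normally rely
on the default monotonic clock.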
Reading from NEWS for 17.02.0-0rc1...
* Changes in Slurm 17.02.0rc1
==============================
-- Add port info to 'sinfo' and 'scontrol show node'.
-- Fix errant definition of USE_64BIT_BITSTR which can lead to core dumps.
-- Move BatchScript to end of each job's information when using
"scontrol -dd show job" to make it more readable.
-- Add SchedulerParameters configuration parameter of "default_gbytes", which
treats numeric only (no suffix) value for memory and tmp disk space as being
in units of Gigabytes. Mostly for compatibility with LSF.
-- Fix race condition in srun/sattach logic which would prevent srun from
terminating.
-- Bitstring operations are now 64bit instead of 32bit.
-- Replace hweight() function in bitstring with faster version.
-- scancel would treat a non-numeric argument as the name of jobs to be
cancelled (a non-documented feature). Cancelling jobs by name now requires
the "--jobname=" command line argument.
-- scancel modified to note that no jobs satisfy the filter options when the
--verbose option is used along with one or more job filters (e.g. "--qos=").
-- Change _pack_cred to use pack_bit_str_hex instead of pack_bit_fmt for
better scalability and performance.
-- Add BootTime configuration parameter to knl.conf file to optimize resource
allocations with respect to required node reboots.
-- Add node_features_p_boot_time() to node_features plugin to optimize
scheduling with respect to node reboots.
-- Avoid allocating resources to a job in the event that its run time plus boot
time (if needed) extend into an advanced reservation.
-- burst_buffer/cray - Avoid stage-out operation if job never started.
-- node_features/knl_cray - Add capability to detect Uncorrectable Memory
Errors (UME) and if detected then log the event in all job and step stderr
with a message of the form:
error: *** STEP 1.2 ON tux1 UNCORRECTABLE MEMORY ERROR AT 2016-12-14T09:09:37 ***
Similar logic added to node_features/knl_generic in version 17.02.0pre4.
-- If job is allocated nodes which are powered down, then reset job start time
when the nodes are ready and do not charge the job for power up time.
-- Add the ability to purge transactions from the database.
-- Add support for requeue'ing of federated jobs (BETA).
-- Add support for interactive federated jobs (BETA).
-- Add the ability to purge rolled up usage from the database.
-- CRAY systems only: TaskPlugins must list task/cgroup before task/cray in
order for the cgroup files to be created before task/cray runs.
-- Properly set SLURM_JOB_GPUS environment variable for Prolog.
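To make the "default_gbytes" item above concrete, here is a small sketch of
how a suffix-less memory value might be interpreted with and without that
option. It is an illustration only, not Slurm's parser: the function name
parse_mem_mb and the suffix table are hypothetical, and the set of accepted
suffixes (M, G, T) is an assumption made for the example.

```python
# Hypothetical suffix table: megabytes per unit (assumed set of suffixes).
MB_PER_UNIT = {"M": 1, "G": 1024, "T": 1024 * 1024}


def parse_mem_mb(value, default_gbytes=False):
    """Convert a memory string such as "4", "4G" or "512M" to megabytes.

    A bare number is read as megabytes unless default_gbytes is set, in
    which case it is read as gigabytes. An explicit suffix always wins.
    """
    text = value.strip().upper()
    if not text:
        raise ValueError("empty memory value")
    if text[-1] in MB_PER_UNIT:
        number, unit = text[:-1], text[-1]
    else:
        number, unit = text, ("G" if default_gbytes else "M")
    return int(number) * MB_PER_UNIT[unit]


print(parse_mem_mb("4"))                        # bare number, megabytes
print(parse_mem_mb("4", default_gbytes=True))   # bare number, gigabytes
print(parse_mem_mb("512M", default_gbytes=True))  # explicit suffix unchanged
```

Under this reading, only values without a suffix change meaning when the
option is enabled, so job scripts that already spell out units would be
unaffected.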