[slurm-dev] SLURM versions 2.3.4 and 2.4.0-pre4 are now available

Moe Jette Mon, 19 Mar 2012 17:51:06 -0700

SLURM versions 2.3.4 and 2.4.0-pre4 are now available from  
http://www.schedmd.com/#repos
A description of the changes is appended.


* Changes in SLURM 2.3.4
========================
  -- Set DEFAULT flag in partition structure when slurmctld reads the
     configuration file. Patch from Rémi Palancher.
  -- Fix for possible deadlock in accounting logic: Avoid calling
     jobacct_gather_g_getinfo() until there is data to read from the socket.
  -- Fix typo in accounting when using reservations. Patch from Alejandro
     Lucero Palau.
  -- Fix to the multifactor priority plugin to calculate effective usage earlier
     to give a correct priority on the first decay cycle after a restart of the
     slurmctld. Patch from Martin Perry, Bull.
  -- Permit user root to run a job step for any job as any user. Patch from
     Didier Gazen, Laboratoire d'Aerologie.
  -- BLUEGENE - fix for not allowing jobs if all midplanes are drained and all
     blocks are in an error state.
  -- Avoid slurmctld abort due to bad pointer when setting an advanced
     reservation MAINT flag if it contains no nodes (only licenses).
  -- Fix bug when requeued batch job is scheduled to run on a different node
     zero, but attempts job launch on old node zero.
  -- Fix bug in step task distribution when nodes are not configured in numeric
     order. Patch from Hongjia Cao, NUDT.
  -- Fix for srun running within an existing allocation with the --exclude
     option and a --nnodes count small enough to remove more nodes. Patch from
     Phil Eckert, LLNL.
  -- Work around to handle certain combinations of glibc/kernel
     (i.e. glibc-2.14/Linux-3.1) to correctly open the pty of the slurmstepd
     as the job user. Patch from Mark Grondona, LLNL.
  -- Modify linking to include "-ldl" only when needed. Patch from Aleksej
     Saushev.
  -- Fix smap regression to display nodes that are drained or down correctly.
  -- Several bug fixes and performance improvements related to batch
     scripts containing very large numbers of arguments. Patches from Par
     Andersson, NSC.
  -- Fixed extremely hard to reproduce threading issue in assoc_mgr.
  -- Correct "scontrol show daemons" output if there is more than one
     ControlMachine configured.
  -- Add node read lock where needed in slurmctld/agent code.
  -- Added test for LUA library named "liblua5.1.so.0" in addition to
     "liblua5.1.so" as needed by Debian. Patch by Remi Palancher.
  -- Added partition default_time field to job_submit LUA plugin. Patch by
     Remi Palancher.
  -- Fix bug in cray/srun wrapper stdin/out/err file handling.
  -- In cray/srun wrapper, only include aprun "-q" option when srun "--quiet"
     option is used.
  -- BLUEGENE - fix issue where if a small block was in error it could hold up
     the queue when trying to place a larger than midplane job.
  -- CRAY - ignore all interactive nodes and jobs on interactive nodes.
  -- Add new job state reason of "FrontEndDown" which applies only to Cray and
     IBM BlueGene systems.
  -- Cray - Enable configure option of "--enable-salloc-background" to permit
     the srun and salloc commands to be executed in the background. This does
     NOT remove the ALPS limitation that only one job reservation can be created
     for each Linux session ID.
  -- Cray - For srun wrapper when creating a job allocation, set the default job
     name to the executable file's name.
  -- Add support for Cray ALPS 5.0.0
  -- FRONTEND - if a front end unexpectedly reboots kill all jobs but don't
     mark front end node down.
  -- FRONTEND - don't down a front end node if you have an epilog error.
  -- Cray - fix for if a frontend slurmd was started after the slurmctld had
     already pinged it on startup the unresponding flag would be removed from
     the frontend node.
  -- Cray - Fix issue on smap not displaying grid correctly.
  -- Fixed minor memory leak in sview.

* Changes in SLURM 2.4.0-pre4
=============================
  -- Add logic to cache GPU file information (bitmap index mapping to device
     file number) in the slurmd daemon and transfer that information to the
     slurmstepd whenever a job step is initiated. This is needed to set the
     appropriate CUDA_VISIBLE_DEVICES environment variable value when the
     devices are not in strict numeric order (e.g. some GPUs are skipped).
     Based upon work by Nicolas Bigaouette.
  -- BGQ - Remove ability to make a sub-block with a geometry with one or more
     of its dimensions of length 3.  There is a limitation in the IBM I/O
     subsystem that is problematic with multiple sub-blocks with a dimension
     of length 3, so we will disallow their creation.  This means that if you
     ask the system for an allocation of 12 c-nodes you will be given 16.  If
     this is ever fixed in BGQ you can remove this patch.
  -- BLUEGENE - Better handling of blocks that go into error state or
     deallocate while jobs are running on them.
  -- BGQ - fix for handling mix of steps running at same time some of which
     are full allocation jobs, and others that are smaller.
  -- BGQ - fix for core dump after running multiple sub-block jobs on static
     blocks.
  -- BGQ - fixed sync issue where if a job finishes in SLURM but not in mmcs
     for a long time after the SLURM job has been flushed from the system
     we don't have to worry about rebooting the block to sync the system.
  -- BGQ - In scontrol/sview node counts are now displayed with
     CnodeCount/CnodeErrCount to point out that there are cnodes in an error
     state on the block.  Draining the block and having it reboot when all jobs
     are gone will clear up the cnodes in Software Failure.
  -- Change default SchedulerParameters max_switch_wait field value from 60 to
     300 seconds.
  -- BGQ - catch errors from the kill option of the runjob client.
  -- BLUEGENE - make it so the epilog runs until slurmctld tells it the job is
     gone.  Previously it had a timelimit which has proven to not be the right
     thing.
  -- FRONTEND - fix issue where if a compute node was in a down state and
     an admin updates the node to idle/resume the compute nodes will go
     instantly to idle instead of idle* which means no response.
  -- Fix regression in 2.4.0.pre3 where number of submitted jobs limit wasn't
     being honored for QOS.
  -- Cray - Enable logging of BASIL communications with environment variables.
     Set XML_LOG to enable logging. Set XML_LOG_LOC to specify path to log file
     or "SLURM" to write to SlurmctldLogFile or unset for "slurm_basil_xml.log".
     Patch from Steve Tronfinoff, CSCS.
  -- FRONTEND - if a front end unexpectedly reboots kill all jobs but don't
     mark front end node down.
  -- FRONTEND - don't down a front end node if you have an epilog error
  -- BLUEGENE - if a job has an epilog error don't down the midplane it was
     running on.
  -- BGQ - added new DebugFlag (NoRealTime) for only printing debug from
     state change while the realtime server is running.
  -- Fix multi-cluster mode with sview starting on a non-bluegene cluster going
     to a bluegene cluster.
  -- BLUEGENE - ability to show Rack Midplane name of midplanes in sview and
     scontrol.
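
As an illustration of the GPU file caching item in the 2.4.0-pre4 list, the
following is a minimal Python sketch of the idea only, not the SLURM
implementation (which is C code in slurmd and slurmstepd). The function name,
the dictionary layout, and the sample node with a skipped device are all
hypothetical. It shows why a cached mapping from bitmap index to device file
number is needed when device numbers are not contiguous.

```python
def cuda_visible_devices(allocated_indexes, index_to_device):
    """Translate allocated bitmap indexes into device file numbers.

    allocated_indexes: bitmap positions allocated to the job step.
    index_to_device:   cached mapping of bitmap index -> device file number
                       (hypothetical layout, assumed for this sketch).
    """
    return ",".join(str(index_to_device[i]) for i in sorted(allocated_indexes))


# Hypothetical node where device file number 1 is skipped, so bitmap
# indexes 0, 1, 2 correspond to device file numbers 0, 2, 3.
index_to_device = {0: 0, 1: 2, 2: 3}

# A step allocated bitmap indexes 1 and 2 must see devices 2 and 3,
# not 1 and 2.
env_value = cuda_visible_devices({1, 2}, index_to_device)
print("CUDA_VISIBLE_DEVICES=" + env_value)
```

Using the bitmap indexes directly would have pointed this step at a device
number that does not exist on the node, which is the situation the change
describes.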
