While testing GFS2 as a storage repository for virtual machines we
discovered a number of scenarios where the performance was being
pathologically poor.

The scenarios can be simplified to the following -

  * On a single host in the cluster, grow a number of files to a
    significant proportion of the filesystem's LUN size, exceeding the
    host's preferred resource group allocation. This can be replicated
    with fio, writing to 20 different files using a job file like

[test-files]
directory=gfs2/a:gfs2/b:gfs2/c:gfs2/d:gfs2/e:gfs2/f:gfs2/g:gfs2/h:gfs2/i:gfs2/j:gfs2/k:gfs2/l:gfs2/m:gfs2/n:gfs2/o:gfs2/p:gfs2/q:gfs2/r:gfs2/s:gfs2/t
nrfiles=1
size=20G
bs=512k
rw=write
buffered=0
ioengine=libaio
fallocate=none
numjobs=20

    After starting off at network wire speed, throughput rapidly
    degrades, with the fio processes reporting a large amount of sys
    time.

    This was diagnosed as all of the processes having selected the same
    resource group and then contending on its glock in
    gfs2_inplace_reserve. Patch 1 addresses this with an optional module
    parameter which enables behaviour to "randomly" skip the selected
    resource group in the first two passes of gfs2_inplace_reserve in
    order to spread the processes out (a rough sketch of the idea
    follows at the end of this item).

    It is worth noting that this would probably also be addressed if the
    comment in Documentation/gfs2-glocks.txt about eventually making the
    glock EX state locally shared were acted upon. However, that looks
    like it would require quite a bit of coordination and design, so
    this stop-gap helps in the meantime.

  * With two or more hosts growing files at high data rates, the
    throughput drops to a small proportion of the maximum storage
    I/O. This is the scenario where several VMs are all writing to the
    filesystem. Sometimes this test would run through cleanly at 80-90%
    of storage wire speed, but at other times the performance on one or
    more hosts would drop to a small number of KiB/s.

    This was diagnosed as the hosts repeatedly bouncing resource group
    glocks between them as different hosts selected the same resource
    group (having exhausted their preferred groups).

    Patch 2 addresses this by -
      * adding a hold delay to the resource group glock if there are
        local waiters, following the pattern already in place for
        inodes; this should also give gfs2_rgrp_congested more data to
        work with
      * remembering when we were last asked to demote the lock on a
        resource group
      * in the first two passes of gfs2_inplace_reserve, avoiding
        resource groups where we have been asked to demote the glock
        within the last second (a sketch of this check follows the
        list)
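
    Again as a user-space sketch rather than the real patch: the idea is
    to stamp the resource group when a demote request arrives and have
    the early allocation passes steer clear of anything stamped within
    the last second. Field and function names below are illustrative
    assumptions; the kernel code would presumably track this with
    jiffies on the rgrp/glock structures.

	#include <stdio.h>
	#include <stdbool.h>
	#include <time.h>

	#define NR_RGRPS        4
	#define RECENT_DEMOTE_S 1  /* "recently demoted" window, seconds */

	struct rgrp {
		int    id;
		time_t last_demote;    /* 0 = never asked to demote */
	};

	/* Called when another node asks us to drop this rgrp's glock. */
	static void note_demote_request(struct rgrp *rgd)
	{
		rgd->last_demote = time(NULL);
	}

	/* Early passes avoid rgrps another node wanted very recently, so
	 * the glock is not bounced straight back; later passes ignore the
	 * hint rather than fail the allocation. */
	static bool recently_demoted(const struct rgrp *rgd, int pass)
	{
		if (pass > 1 || rgd->last_demote == 0)
			return false;
		return (time(NULL) - rgd->last_demote) <= RECENT_DEMOTE_S;
	}

	static const struct rgrp *pick_rgrp(const struct rgrp *rgrps, int n)
	{
		for (int pass = 0; pass < 3; pass++)
			for (int i = 0; i < n; i++)
				if (!recently_demoted(&rgrps[i], pass))
					return &rgrps[i];
		return NULL;
	}

	int main(void)
	{
		struct rgrp rgrps[NR_RGRPS] = { {0}, {1}, {2}, {3} };

		/* Pretend another node just asked us to demote rgrp 0. */
		note_demote_request(&rgrps[0]);

		const struct rgrp *rgd = pick_rgrp(rgrps, NR_RGRPS);
		printf("allocating from rgrp %d\n", rgd ? rgd->id : -1);
		return 0;
	}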

Mark Syms (1):
  GFS2: Avoid recently demoted rgrps.

Tim Smith (1):
  Add some randomisation to the GFS2 resource group allocator

 fs/gfs2/glock.c      |  7 +++++--
 fs/gfs2/incore.h     |  2 ++
 fs/gfs2/main.c       |  1 +
 fs/gfs2/rgrp.c       | 49 +++++++++++++++++++++++++++++++++++++++++++++----
 fs/gfs2/trace_gfs2.h | 12 +++++++++---
 5 files changed, 62 insertions(+), 9 deletions(-)

-- 
1.8.3.1
