While testing GFS2 as a storage repository for virtual machines we discovered a number of scenarios where performance was pathologically poor.
The scenarios are simplified to the following -

* On a single host in the cluster, grow a number of files to a
  significant proportion of the filesystem's LUN size, exceeding the
  host's preferred resource group allocation. This can be replicated
  by using fio and writing to 20 different files with a job file like

    [test-files]
    directory=gfs2/a:gfs2/b:gfs2/c:gfs2/d:gfs2/e:gfs2/f:gfs2/g:gfs2/h:gfs2/i:gfs2/j:gfs2/k:gfs2/l:gfs2/m:gfs2/n:gfs2/o:gfs2/p:gfs2/q:gfs2/r:gfs2/s:gfs2/t
    nrfiles=1
    size=20G
    bs=512k
    rw=write
    buffered=0
    ioengine=libaio
    fallocate=none
    numjobs=20

  After starting off at network wire speed this will rapidly degrade,
  with the fio processes reporting large sys time. This was diagnosed
  to all the processes contending on the glock in
  gfs2_inplace_reserve, having all selected the same resource group.

  Patch 1 addresses this with an optional module parameter which
  enables behaviour to "randomly" skip a selected resource group in
  the first two passes in gfs2_inplace_reserve in order to spread the
  processes out (a sketch of the idea appears after the diffstat
  below).

  It is worth noting that this would probably also be addressed if the
  comment in Documentation/gfs2-glocks.txt about eventually making
  glock EX locally shared were made to happen. However, that looks
  like it would require quite a bit of coordination and design, so
  this stop-gap helps in the meantime.

* With two or more hosts growing files at high data rates, the
  throughput drops to a small proportion of the maximum storage I/O.
  This is the "several VMs all writing to the filesystem" scenario.
  Sometimes this test would run through cleanly at 80-90% of storage
  wire speed, but at other times the performance would drop on one or
  more hosts to a small number of KiB/s. This was diagnosed to the
  hosts repeatedly bouncing resource group glocks between them, as
  different hosts selected the same resource group (having exhausted
  their preferred groups).

  Patch 2 addresses this by -

  * adding a hold delay to the resource group glock if there are
    local waiters, following the pattern already in place for inodes;
    this should also provide more data for gfs2_rgrp_congested to
    work on
  * remembering when we were last asked to demote the lock on a
    resource group
  * in the first two passes in gfs2_inplace_reserve, avoiding
    resource groups where we have been asked to demote the glock
    within the last second

  (A sketch of these changes also appears after the diffstat below.)

Mark Syms (1):
  GFS2: Avoid recently demoted rgrps.

Tim Smith (1):
  Add some randomisation to the GFS2 resource group allocator

 fs/gfs2/glock.c      |  7 +++++--
 fs/gfs2/incore.h     |  2 ++
 fs/gfs2/main.c       |  1 +
 fs/gfs2/rgrp.c       | 49 +++++++++++++++++++++++++++++++++++++++++++++----
 fs/gfs2/trace_gfs2.h | 12 +++++++++---
 5 files changed, 62 insertions(+), 9 deletions(-)

--
1.8.3.1
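
As a postscript for reviewers, a minimal sketch of the patch 1 idea
follows. This is illustrative only, not the patch itself: the
parameter and helper names (rgrp_randomize, should_skip_rgrp) are
placeholders, and the skip probability shown is arbitrary.

    #include <linux/random.h>
    #include <linux/moduleparam.h>

    /* Illustrative module parameter; when disabled, behaviour is
     * the current allocator's. */
    static bool rgrp_randomize = true;
    module_param(rgrp_randomize, bool, 0644);

    /*
     * Decide whether to pass over an otherwise-acceptable resource
     * group. Only the first two passes (loops 0 and 1) are
     * perturbed; the final pass must still consider every rgrp so
     * that allocation cannot fail spuriously. Skipping roughly half
     * the time lets writers that all start from the same goal rgrp
     * fan out across neighbouring rgrps instead of serialising on
     * one glock.
     */
    static bool should_skip_rgrp(int loops)
    {
            if (!rgrp_randomize || loops >= 2)
                    return false;
            return prandom_u32() & 1;
    }

In the selection loop of gfs2_inplace_reserve() this would sit before
the resource group glock is acquired, stepping on to the next rgrp
(e.g. via gfs2_rgrpd_get_next()) whenever it fires.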
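
Likewise, a sketch of the shape of patch 2. This assumes the existing
gl_demote_time stamp in struct gfs2_glock (set by handle_callback()
when a demote request first arrives) is what gets consulted; the
hold-delay condition shown for gfs2_glock_cb() is simplified and the
real patch differs in detail.

    #include <linux/jiffies.h>

    /*
     * Fragment 1, in gfs2_glock_cb(): give rgrp glocks the same
     * minimum hold treatment that inode glocks already get, but
     * only while there are local holders/waiters, so a remote
     * demote request cannot bounce the glock away mid-allocation.
     */
            if (gl->gl_name.ln_type == LM_TYPE_INODE ||
                (gl->gl_name.ln_type == LM_TYPE_RGRP &&
                 !list_empty(&gl->gl_holders))) {
                    if (time_before(now, holdtime))
                            delay = holdtime - now;
            }

    /*
     * Fragment 2: true if another node asked us to demote this
     * rgrp's glock within the last second; it is almost certainly
     * still hot there, so the cheap passes should look elsewhere.
     */
    static bool rgrp_recently_demoted(const struct gfs2_rgrpd *rgd)
    {
            return time_before(jiffies,
                               rgd->rd_gl->gl_demote_time + HZ);
    }

    /* In the first two passes of gfs2_inplace_reserve(): */
            if (loops < 2 && rgrp_recently_demoted(rgd))
                    goto next_rgrp;

Between them these keep a contended rgrp on the node that is actively
allocating from it for at least one hold period, and steer other
nodes' allocators away from it while it is hot.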