From: Tim Smith <tim.sm...@citrix.com>

When growing a number of files on the same cluster node from different
threads (e.g. fio with 20 or so jobs), all those threads pile into
gfs2_inplace_reserve() independently looking to claim a new resource
group, and after a while they all synchronise, getting through the
gfs2_rgrp_used_recently()/gfs2_rgrp_congested() check together.

When this happens, write performance drops to about 1/5 on a single
node cluster, and on multi-node clusters it drops to near zero on some
nodes. The output from "glocktop -r -H -d 1" when this happens begins
to show many processes stuck in gfs2_inplace_reserve(), waiting on a
resource group lock.

This commit introduces a module parameter which, when set to a value
of 1, will introduce some random jitter into the first two passes of
gfs2_inplace_reserve() when trying to lock a new resource group,
skipping to the next one 1/2 the time at first, with progressively
lower probability after each skip.

Signed-off-by: Tim Smith <tim.sm...@citrix.com>
---
 fs/gfs2/rgrp.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 1ad3256..994eb7f 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -19,6 +19,7 @@
 #include <linux/blkdev.h>
 #include <linux/rbtree.h>
 #include <linux/random.h>
+#include <linux/module.h>
 
 #include "gfs2.h"
 #include "incore.h"
@@ -49,6 +50,11 @@
 #define LBITSKIP00 (0x0000000000000000UL)
 #endif
 
+static int gfs2_skippy_rgrp_alloc;
+
+module_param_named(skippy_rgrp_alloc, gfs2_skippy_rgrp_alloc, int, 0644);
+MODULE_PARM_DESC(skippy_rgrp_alloc, "Set skippiness of resource group allocator, 0|1. Where 1 will cause resource groups to be randomly skipped with the likelihood of skipping progressively decreasing after a skip has occurred.");
+
 /*
  * These routines are used by the resource group routines (rgrp.c)
  * to keep track of block allocation. Each block is represented by two
@@ -2016,6 +2022,11 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 	u64 last_unlinked = NO_BLOCK;
 	int loops = 0;
 	u32 free_blocks, skip = 0;
+	/*
+	 * gfs2_skippy_rgrp_alloc provides our initial skippiness.
+	 * randskip will thus be 2-255 if we want it to do anything.
+	 */
+	u8 randskip = gfs2_skippy_rgrp_alloc + 1;
 
 	if (sdp->sd_args.ar_rgrplvb)
 		flags |= GL_SKIP;
@@ -2046,10 +2057,30 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 				if (loops == 0 &&
 				    !fast_to_acquire(rs->rs_rbm.rgd))
 					goto next_rgrp;
-				if ((loops < 2) &&
-				    gfs2_rgrp_used_recently(rs, 1000) &&
-				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-					goto next_rgrp;
+				if (loops < 2) {
+					/*
+					 * If resource group allocation is requested to be skippy,
+					 * roll a hypothetical die of <randskip> sides and skip
+					 * straight to the next resource group anyway if it comes
+					 * up 1.
+					 */
+					if (gfs2_skippy_rgrp_alloc) {
+						u8 jitter;
+
+						prandom_bytes(&jitter, sizeof(jitter));
+						if ((jitter % randskip) == 0) {
+							/*
+							 * If we are choosing to skip, bump randskip to make it
+							 * successively less likely that we will skip again
+							 */
+							randskip++;
+							goto next_rgrp;
+						}
+					}
+					if (gfs2_rgrp_used_recently(rs, 1000) &&
+					    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+						goto next_rgrp;
+				}
 			}
 			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
 						   LM_ST_EXCLUSIVE, flags,
-- 
1.8.3.1
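
For anyone wanting to check the "1/2 the time, then progressively less"
behaviour without a cluster, the skip rule can be modelled in user space.
The following is only a sketch under assumptions: skip_hits() and
should_skip() are hypothetical helper names that do not exist in the
patch or in the kernel, uint8_t stands in for u8, and the jitter byte is
passed in by the caller instead of coming from prandom_bytes(). The model
assumes skippy_rgrp_alloc=1, so randskip starts at 2.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): count how many of the 256
 * possible one-byte jitter values satisfy (jitter % randskip) == 0,
 * i.e. how many outcomes of the roll lead to a skip.
 */
unsigned int skip_hits(uint8_t randskip)
{
	unsigned int jitter, hits = 0;

	assert(randskip != 0);	/* modulo by zero is undefined */
	for (jitter = 0; jitter < 256; jitter++)
		if ((jitter % randskip) == 0)
			hits++;
	return hits;
}

/*
 * Hypothetical model of one decision from the patch: returns 1 and
 * bumps *randskip when the jitter value selects a skip, otherwise
 * returns 0 and leaves *randskip unchanged.
 */
int should_skip(uint8_t jitter, uint8_t *randskip)
{
	if ((jitter % *randskip) == 0) {
		(*randskip)++;
		return 1;
	}
	return 0;
}
```

Under this model skip_hits(2) is 128 of 256 (exactly 1/2), skip_hits(3)
is 86 of 256 (slightly over 1/3, because 256 is not a multiple of 3) and
skip_hits(4) is 64 of 256 (1/4). Note that randskip only grows after a
skip has actually been taken, not on every attempt.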