[Cluster-devel] Recommendations for securing DLM traffic?
What are the general recommendations for securing traffic for DLM on port 21064? It appears that this traffic is not signed or encrypted in any way, so whilst there might not be any privacy issues from information disclosure, it's not clear that the messages could not be replayed or otherwise spoofed.

Thanks, Mark.
[Cluster-devel] Sending patches for GFS2
We have a couple of patches for GFS2 which address some performance issues we've observed in our testing. What branch would you like the patches to be based on top of? (We have a custom patched build, so they'll need to be realigned first.)

Thanks, Mark.
[Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator
From: Tim Smith

When growing a number of files on the same cluster node from different threads (e.g. fio with 20 or so jobs), all those threads pile into gfs2_inplace_reserve() independently, each looking to claim a new resource group, and after a while they all synchronise, getting through the gfs2_rgrp_used_recently()/gfs2_rgrp_congested() check together. When this happens, write performance drops to about 1/5 on a single-node cluster, and on multi-node clusters it drops to near zero on some nodes. The output from "glocktop -r -H -d 1" when this happens begins to show many processes stuck in gfs2_inplace_reserve(), waiting on a resource group lock.

This commit introduces a module parameter which, when set to a value of 1, will introduce some random jitter into the first two passes of gfs2_inplace_reserve() when trying to lock a new resource group, skipping to the next one 1/2 the time with progressively lower probability on each attempt.

Signed-off-by: Tim Smith
---
 fs/gfs2/rgrp.c | 39 +++
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 1ad3256..994eb7f 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 #include "gfs2.h"
 #include "incore.h"
@@ -49,6 +50,11 @@
 #define LBITSKIP00 (0x0000000000000000UL)
 #endif

+static int gfs2_skippy_rgrp_alloc;
+
+module_param_named(skippy_rgrp_alloc, gfs2_skippy_rgrp_alloc, int, 0644);
+MODULE_PARM_DESC(skippy_rgrp_alloc, "Set skippiness of resource group allocator, 0|1. Where 1 will cause resource groups to be randomly skipped with the likelihood of skipping progressively decreasing after a skip has occurred.");
+
 /*
  * These routines are used by the resource group routines (rgrp.c)
  * to keep track of block allocation. Each block is represented by two
@@ -2016,6 +2022,11 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 	u64 last_unlinked = NO_BLOCK;
 	int loops = 0;
 	u32 free_blocks, skip = 0;
+	/*
+	 * gfs2_skippy_rgrp_alloc provides our initial skippiness.
+	 * randskip will thus be 2-255 if we want it to do anything.
+	 */
+	u8 randskip = gfs2_skippy_rgrp_alloc + 1;

 	if (sdp->sd_args.ar_rgrplvb)
 		flags |= GL_SKIP;
@@ -2046,10 +2057,30 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 			if (loops == 0 &&
 			    !fast_to_acquire(rs->rs_rbm.rgd))
 				goto next_rgrp;
-			if ((loops < 2) &&
-			    gfs2_rgrp_used_recently(rs, 1000) &&
-			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-				goto next_rgrp;
+			if (loops < 2) {
+				/*
+				 * If resource group allocation is requested to be skippy,
+				 * roll a hypothetical dice of <randskip> sides and skip
+				 * straight to the next resource group anyway if it comes
+				 * up 1.
+				 */
+				if (gfs2_skippy_rgrp_alloc) {
+					u8 jitter;
+
+					prandom_bytes(&jitter, sizeof(jitter));
+					if ((jitter % randskip) == 0) {
+						/*
+						 * If we are choosing to skip, bump randskip
+						 * to make it successively less likely that
+						 * we will skip again
+						 */
+						randskip++;
+						goto next_rgrp;
+					}
+				}
+				if (gfs2_rgrp_used_recently(rs, 1000) &&
+				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+					goto next_rgrp;
+			}
 		}
 		error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
 					   LM_ST_EXCLUSIVE, flags,
--
1.8.3.1
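The effect of the jitter logic is a geometric back-off: with skippy_rgrp_alloc set to 1, the first roll skips with probability 1/2, and every skip adds a side to the die, so long skip runs become rapidly less likely. A minimal userspace sketch of the same arithmetic, with rand() standing in for the kernel's prandom_bytes() (not part of the patch, for illustration only):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Userspace simulation of the patch's jitter logic: roll a randskip-sided
 * die; on a roll of 0, skip the resource group and grow the die by one
 * side, making the next skip less likely. Prints how many rgrps a single
 * reservation pass typically steps over (about 0.72 on average, i.e. e-2).
 */
int main(void)
{
	unsigned long skips_total = 0;
	const unsigned long trials = 100000;

	srand(time(NULL));
	for (unsigned long t = 0; t < trials; t++) {
		unsigned randskip = 2;	/* skippy_rgrp_alloc=1 => 1 + 1 */

		/* Keep rolling until we accept a resource group. */
		while ((rand() % randskip) == 0) {
			randskip++;	/* successively less likely to skip */
			skips_total++;
		}
	}
	printf("average rgrps skipped per reservation: %.3f\n",
	       (double)skips_total / trials);
	return 0;
}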
[Cluster-devel] [PATCH 2/2] GFS2: Avoid recently demoted rgrps.
When under heavy I/O load from two or more hosts, resource group allocation can result in glocks being bounced around between hosts. Follow the example of inodes: if we have local waiters when asked to demote the glock on a resource group, add a delay. Additionally, track when we were last asked to demote a lock and, when assessing resource groups in the allocator, prefer in the first two loop iterations not to use resource groups where we've been asked to demote the glock within the last second.

Signed-off-by: Mark Syms
---
 fs/gfs2/glock.c      | 7 +--
 fs/gfs2/incore.h     | 2 ++
 fs/gfs2/main.c       | 1 +
 fs/gfs2/rgrp.c       | 10 ++
 fs/gfs2/trace_gfs2.h | 12 +---
 5 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 4614ee2..94ef947 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -973,7 +973,9 @@ static void handle_callback(struct gfs2_glock *gl, unsigned int state,
 	}
 	if (gl->gl_ops->go_callback)
 		gl->gl_ops->go_callback(gl, remote);
-	trace_gfs2_demote_rq(gl, remote);
+	trace_gfs2_demote_rq(gl, remote, delay);
+	if (remote && !delay)
+		gl->gl_last_demote = jiffies;
 }

 void gfs2_print_dbg(struct seq_file *seq, const char *fmt, ...)
@@ -1339,7 +1341,8 @@ void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
 	gfs2_glock_hold(gl);
 	holdtime = gl->gl_tchange + gl->gl_hold_time;
 	if (test_bit(GLF_QUEUED, &gl->gl_flags) &&
-	    gl->gl_name.ln_type == LM_TYPE_INODE) {
+	    (gl->gl_name.ln_type == LM_TYPE_INODE ||
+	     gl->gl_name.ln_type == LM_TYPE_RGRP)) {
 		if (time_before(now, holdtime))
 			delay = holdtime - now;
 		if (test_bit(GLF_REPLY_PENDING, &gl->gl_flags))
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b96d39c..e3d5b10 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -366,6 +366,8 @@ struct gfs2_glock {
 		     gl_reply:8;	/* Last reply from the dlm */

 	unsigned long gl_demote_time; /* time of first demote request */
+	unsigned long gl_last_demote; /* jiffies at last demote transition */
+
 	long gl_hold_time;
 	struct list_head gl_holders;
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 2d55e2c..2183c73 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -58,6 +58,7 @@ static void gfs2_init_glock_once(void *foo)
 	INIT_LIST_HEAD(&gl->gl_ail_list);
 	atomic_set(&gl->gl_ail_count, 0);
 	atomic_set(&gl->gl_revokes, 0);
+	gl->gl_last_demote = jiffies - (2 * HZ);
 }

 static void gfs2_init_gl_aspace_once(void *foo)
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 994eb7f..7b77bb2 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1955,6 +1955,12 @@ static bool gfs2_rgrp_used_recently(const struct gfs2_blkreserv *rs,
 	return tdiff > (msecs * 1000 * 1000);
 }

+static bool gfs2_rgrp_demoted_recently(const struct gfs2_blkreserv *rs,
+				       u32 max_age_jiffies, u32 loop)
+{
+	return time_before(jiffies, rs->rs_rbm.rgd->rd_gl->gl_last_demote + max_age_jiffies);
+}
+
 static u32 gfs2_orlov_skip(const struct gfs2_inode *ip)
 {
 	const struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
@@ -2077,6 +2083,10 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 					goto next_rgrp;
 				}
 			}
+
+			if (gfs2_rgrp_demoted_recently(rs, HZ, loops))
+				goto next_rgrp;
+
 			if (gfs2_rgrp_used_recently(rs, 1000) &&
 			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
 				goto next_rgrp;
diff --git a/fs/gfs2/trace_gfs2.h b/fs/gfs2/trace_gfs2.h
index e002525..79935dc 100644
--- a/fs/gfs2/trace_gfs2.h
+++ b/fs/gfs2/trace_gfs2.h
@@ -161,9 +161,9 @@ static inline u8 glock_trace_state(unsigned int state)
 /* Callback (local or remote) requesting lock demotion */
 TRACE_EVENT(gfs2_demote_rq,

-	TP_PROTO(const struct gfs2_glock *gl, bool remote),
+	TP_PROTO(const struct gfs2_glock *gl, bool remote, unsigned long delay),

-	TP_ARGS(gl, remote),
+	TP_ARGS(gl, remote, delay),

 	TP_STRUCT__entry(
 		__field(dev_t,	dev	)
@@ -173,6 +173,8 @@ static inline u8 glock_trace_state(unsigned int state)
 		__field(u8,	dmt_state	)
 		__field(unsigned long,	flags	)
 		__field(bool,	remote	)
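The initialisation of gl_last_demote to jiffies - (2 * HZ) works because gfs2_rgrp_demoted_recently() compares timestamps with the kernel's wraparound-safe time_before(). As a hedged illustration, here is a simplified userspace version of that macro (the real kernel macro also type-checks its arguments), showing where a naive comparison goes wrong across a counter wrap:

#include <stdio.h>

/*
 * Simplified version of the kernel's time_before(): the subtraction is
 * done modulo the word size and interpreted as signed, so the comparison
 * stays correct even after the jiffies counter wraps around zero.
 */
#define time_before(a, b) ((long)((a) - (b)) < 0)

int main(void)
{
	unsigned long HZ = 250;             /* hypothetical tick rate */
	unsigned long jiffies = 5;          /* counter wrapped 5 ticks ago */
	unsigned long last_demote = -301UL; /* stamped 306 ticks ago, pre-wrap */

	/* 306 ticks > HZ have elapsed, so the rgrp should NOT be skipped. */
	printf("naive:       %d\n", jiffies < last_demote + HZ);             /* 1: wrong */
	printf("time_before: %d\n", time_before(jiffies, last_demote + HZ)); /* 0: right */
	return 0;
}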
[Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
While testing GFS2 as a storage repository for virtual machines we discovered a number of scenarios where the performance was pathologically poor. The scenarios simplify to the following:

* On a single host in the cluster, grow a number of files to a significant proportion of the filesystem's LUN size, exceeding the host's preferred resource group allocation. This can be replicated by using fio and writing to 20 different files with a script like

[test-files]
directory=gfs2/a:gfs2/b:gfs2/c:gfs2/d:gfs2/e:gfs2/f:gfs2/g:gfs2/h:gfs2/i:gfs2/j:gfs2/k:gfs2/l:gfs2/m:gfs2/n:gfs2/o:gfs2/p:gfs2/q:gfs2/r:gfs2/s:gfs2/t
nrfiles=1
size=20G
bs=512k
rw=write
buffered=0
ioengine=libaio
fallocate=none
numjobs=20

After starting off at network wire speed this will rapidly degrade, with the fio process reporting large sys time. This was diagnosed to all the processes contending on the glock in gfs2_inplace_reserve, having all selected the same resource group. Patch 1 addresses this with an optional module parameter which enables behaviour to "randomly" skip a selected resource group in the first two passes in gfs2_inplace_reserve in order to spread the processes out. Worth noting that this would probably also be addressed if the comment in Documentation/gfs2-glocks.txt about eventually making glock EX locally shared was made to happen. However, this looks like it would require quite a bit of coordination and design, so this stop-gap helps in the meantime.

* With two or more hosts growing files at high data rates, the throughput drops to a small proportion of the maximum storage I/O. This is the several-VMs-all-writing-to-the-filesystem scenario. Sometimes this test would run through clean at 80-90% of storage wire speed, but at other times the performance would drop on one or more hosts to a small number of KiB/s. This was diagnosed to the different hosts repeatedly bouncing resource group glocks between them as different hosts selected the same resource group (having exhausted the preferred groups). Patch 2 addresses this by:

  * adding a hold delay to the resource group glock if there are local waiters, following the pattern already in place for inodes; this should also provide more data for gfs2_rgrp_congested to work on
  * remembering when we were last asked to demote the lock on a resource group
  * in the first two passes in gfs2_inplace_reserve, avoiding resource groups where we have been asked to demote the glock within the last second

Mark Syms (1):
  GFS2: Avoid recently demoted rgrps.

Tim Smith (1):
  Add some randomisation to the GFS2 resource group allocator

 fs/gfs2/glock.c      | 7 +--
 fs/gfs2/incore.h     | 2 ++
 fs/gfs2/main.c       | 1 +
 fs/gfs2/rgrp.c       | 49 +
 fs/gfs2/trace_gfs2.h | 12 +---
 5 files changed, 62 insertions(+), 9 deletions(-)

--
1.8.3.1
Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
Thanks for that Bob, we've been watching with interest the changes going in upstream, but at the moment we're not really in a position to take advantage of them.

Due to hardware vendor support certification requirements, XenServer can only very occasionally make big kernel bumps that would affect the ABI that the driver would see, as that would require our hardware partners to recertify. So, we're currently on a 4.4.52 base, but the gfs2 driver is somewhat newer as it is essentially self-contained and therefore we can backport changes more easily. We currently have most of the GFS2 and DLM changes that are in 4.15 backported into the XenServer 7.6 kernel, but we can't take the ones related to iomap as they are more invasive, and it looks like a number of the more recent performance-targeting changes are also predicated on the iomap framework.

As I mentioned in the covering letter, the intra-host problem would largely be a non-issue if EX glocks were actually a host-wide thing with local mutexes used to share them within the host. I don't know if this is what your patch set is trying to achieve or not. It's not so much that the selection of resource group is "random", just that there is a random chance that we won't select the first RG that we test; it probably does work out much the same though.

The inter-host problem addressed by the second patch seems to be less amenable to avoidance, as the hosts don't seem to have a synchronous view of the state of the resource group locks (for understandable reasons, as I'd expect this to be very expensive to keep sync'd). So it seemed reasonable to try to make it "expensive" to request a resource that someone else is using, and also to avoid immediately grabbing it back if we've been asked to relinquish it. It does seem to give a fairer balance to the usage without being massively invasive.

We thought we should share these with the community anyway, even if they only serve as inspiration for more detailed changes, and also to describe the scenarios where we're seeing issues now that we have completed implementing the XenServer support for GFS2 that we discussed back in Nuremberg last year. In our testing they certainly make things better. They probably aren't fully optimal, as we can't maintain 10g wire speed consistently across the full LUN, but we're getting about 75%, which is certainly better than we were seeing before we started looking at this.

Thanks, Mark.

-----Original Message-----
From: Bob Peterson
Sent: 20 September 2018 18:18
To: Mark Syms
Cc: cluster-devel@redhat.com; Ross Lagerwall ; Tim Smith
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> While testing GFS2 as a storage repository for virtual machines we
> discovered a number of scenarios where the performance was being
> pathologically poor.
>
> The scenarios are simplified to the following -
>
> * On a single host in the cluster grow a number of files to a
> significant proportion of the filesystem's LUN size, exceeding the
> host's preferred resource group allocation. This can be replicated
> by using fio and writing to 20 different files with a script like

Hi Mark, Tim and all,

The performance problems with rgrp contention are well known, and have been for a very long time. In rhel6 it's not as big of a problem because rhel6 gfs2 uses "try locks" which distribute different processes to unique rgrps, thus keeping them from contending. However, it results in file system fragmentation that tends to catch up with you later.

I posted a different patch set to solve the problem a different way, by trying to keep track of both inter-node and intra-node contention, and redistributing rgrps accordingly. It was similar to your first patch, but used a more predictable distribution, whereas yours is random. It worked very well, but it ultimately got rejected by Steve Whitehouse in favor of a better approach: our current plan is to allow rgrps to be shared among many processes on a single node. This alleviates the contention, improves throughput and performance, and fixes the "favoritism" problems gfs2 has today. In other words, it's better than just redistributing the rgrps. I did a proof-of-concept set of patches and saw pretty good performance numbers and "fairness" among simultaneous writers. I posted that a few months ago.

Your patch would certainly work, and random distribution of rgrps would definitely gain performance, just as the Orlov algorithm does; however, I still want to pursue what Steve suggested. My patch set for this still needs some work because I found some bugs with how things are done, so it'll take time to get working properly.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
Hi Bob,

No, we haven't, but it wouldn't be hard for us to replace our patches in our internal patchqueue with these and try them. Will let you know what we find.

We have also seen what we think is an unrelated issue, where we get the following backtrace in kern.log and our system stalls:

Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
Sep 21 21:19:09 cl15-05 kernel: [21389.462749] Tainted: G O 4.4.0+10 #1
Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python D 88019628bc90 0 15480 1 0x
Sep 21 21:19:09 cl15-05 kernel: [21389.462790] 88019628bc90 880198f11c00 88005a509c00 88019628c000
Sep 21 21:19:09 cl15-05 kernel: [21389.462795] c90040226000 88019628bd80 fe58 8801818da418
Sep 21 21:19:09 cl15-05 kernel: [21389.462799] 88019628bca8 815a1cd4 8801818da5c0 88019628bd68
Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
Sep 21 21:19:09 cl15-05 kernel: [21389.462815] [] schedule+0x64/0x80
Sep 21 21:19:09 cl15-05 kernel: [21389.462877] [] find_insert_glock+0x4a4/0x530 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462891] [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462903] [] gfs2_glock_get+0x3d/0x330 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462928] [] do_flock+0xf2/0x210 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462933] [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462938] [] ? cp_new_stat+0x10b/0x120
Sep 21 21:19:09 cl15-05 kernel: [21389.462943] [] gfs2_flock+0x78/0xa0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462946] [] SyS_flock+0x129/0x170
Sep 21 21:19:09 cl15-05 kernel: [21389.462948] [] entry_SYSCALL_64_fastpath+0x12/0x71

We think there is a possibility, given that this code path only gets entered if a glock is being destroyed, that there is a time-of-check, time-of-use issue here whereby, by the time that schedule gets called, the thing which we expect to be waking us up has completed dying and therefore won't trigger a wakeup for us. We have only seen this a couple of times, in fairly intensive VM stress tests where a lot of flocks get used on a small number of lock files (we use them to ensure consistent behaviour of disk activation/deactivation and also access to the database with the system state), but it's concerning nonetheless. We're looking at replacing the call to schedule with schedule_timeout with a timeout of maybe HZ, to ensure that we will always get out of the schedule operation and retry. Is this something you think you may have seen or have any ideas on?

Thanks, Mark.

-----Original Message-----
From: Bob Peterson
Sent: 28 September 2018 13:24
To: Mark Syms
Cc: cluster-devel@redhat.com; Ross Lagerwall ; Tim Smith
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> Thanks for that Bob, we've been watching with interest the changes
> going in upstream but at the moment we're not really in a position to
> take advantage of them.
>
> Due to hardware vendor support certification requirements XenServer
> can only very occasionally make big kernel bumps that would affect the
> ABI that the driver would see as that would require our hardware partners to
> recertify.
> So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat
> newer as it is essentially self-contained and therefore we can
> backport changes more easily.
> We currently have most of the GFS2 and DLM changes that are in
> 4.15 backported into the XenServer 7.6 kernel, but we can't take the
> ones related to iomap as they are more invasive and it looks like a
> number of the more recent performance targeting changes are also
> predicated on the iomap framework.
>
> As I mentioned in the covering letter, the intra host problem would
> largely be a non-issue if EX glocks were actually a host wide thing
> with local mutexes used to share them within the host. I don't know if
> this is what your patch set is trying to achieve or not. It's not so
> much that the selection of resource group is "random", just that
> there is a random chance that we won't select the first RG that we
> test, it probably does work out much the same though.
>
> The inter host problem addressed by the second patch seems to be less
> amenable to avoidance as the hosts don't seem to have a synchronous
> view of the state of the resource group locks (for understandable
> reasons as I'd expect this to be very expensive to keep sync'd). So it
> seemed reasonable to t
Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
Hi Bob,

The patches look quite good and would seem to help in the intra-node congestion case, which our first patch was trying to address. We haven't tried them yet, but I'll pull a build together and try to run it over the weekend.

We don't, however, see that they would help in the situation we saw for the second patch, where rgrp glocks would get bounced around between hosts at high speed and cause lots of state flushing to occur in the process, as the stats don't take account of anything other than network latency, whereas there is more involved with a rgrp glock when state needs to be flushed. Any thoughts on this?

Thanks, Mark.

-----Original Message-----
From: Mark Syms
Sent: 28 September 2018 13:37
To: 'Bob Peterson'
Cc: cluster-devel@redhat.com; Tim Smith ; Ross Lagerwall
Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

Hi Bob,

No, we haven't, but it wouldn't be hard for us to replace our patches in our internal patchqueue with these and try them. Will let you know what we find.

We have also seen what we think is an unrelated issue, where we get the following backtrace in kern.log and our system stalls:

Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
Sep 21 21:19:09 cl15-05 kernel: [21389.462749] Tainted: G O 4.4.0+10 #1
Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python D 88019628bc90 0 15480 1 0x
Sep 21 21:19:09 cl15-05 kernel: [21389.462790] 88019628bc90 880198f11c00 88005a509c00 88019628c000
Sep 21 21:19:09 cl15-05 kernel: [21389.462795] c90040226000 88019628bd80 fe58 8801818da418
Sep 21 21:19:09 cl15-05 kernel: [21389.462799] 88019628bca8 815a1cd4 8801818da5c0 88019628bd68
Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
Sep 21 21:19:09 cl15-05 kernel: [21389.462815] [] schedule+0x64/0x80
Sep 21 21:19:09 cl15-05 kernel: [21389.462877] [] find_insert_glock+0x4a4/0x530 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462891] [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462903] [] gfs2_glock_get+0x3d/0x330 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462928] [] do_flock+0xf2/0x210 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462933] [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462938] [] ? cp_new_stat+0x10b/0x120
Sep 21 21:19:09 cl15-05 kernel: [21389.462943] [] gfs2_flock+0x78/0xa0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462946] [] SyS_flock+0x129/0x170
Sep 21 21:19:09 cl15-05 kernel: [21389.462948] [] entry_SYSCALL_64_fastpath+0x12/0x71

We think there is a possibility, given that this code path only gets entered if a glock is being destroyed, that there is a time-of-check, time-of-use issue here whereby, by the time that schedule gets called, the thing which we expect to be waking us up has completed dying and therefore won't trigger a wakeup for us. We have only seen this a couple of times, in fairly intensive VM stress tests where a lot of flocks get used on a small number of lock files (we use them to ensure consistent behaviour of disk activation/deactivation and also access to the database with the system state), but it's concerning nonetheless. We're looking at replacing the call to schedule with schedule_timeout with a timeout of maybe HZ to ensure that we will always get out of the schedule operation and retry. Is this something you think you may have seen or have any ideas on?

Thanks, Mark.

-----Original Message-----
From: Bob Peterson
Sent: 28 September 2018 13:24
To: Mark Syms
Cc: cluster-devel@redhat.com; Ross Lagerwall ; Tim Smith
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> Thanks for that Bob, we've been watching with interest the changes
> going in upstream but at the moment we're not really in a position to
> take advantage of them.
>
> Due to hardware vendor support certification requirements XenServer
> can only very occasionally make big kernel bumps that would affect the
> ABI that the driver would see as that would require our hardware partners to
> recertify.
> So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat
> newer as it is essentially self-contained and therefore we can
> backport changes more easily. We currently have most of the GFS2 and
> DLM changes that are in
> 4.15 backported into the XenServer 7.6 kernel, but we can't take the
> ones related to iomap as they are more invasive and it looks like a
> number of the more recent performance targeting changes ar
Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
The hosts stayed up, and in the first occurrence of this (that we caught in flight, as opposed to only seeing the aftermath in automation logs) the locks all actually unwedged after about 2 hours.

Mark.

-----Original Message-----
From: Bob Peterson
Sent: 28 September 2018 13:56
To: Mark Syms
Cc: cluster-devel@redhat.com; Tim Smith ; Ross Lagerwall ; Andreas Gruenbacher
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> Hi Bob,
>
> No, we haven't but it wouldn't be hard for us to replace our patches
> in our internal patchqueue with these and try them. Will let you know what we
> find.
>
> We have also seen, what we think is an unrelated issue where we get
> the following backtrace in kern.log and our system stalls
>
> Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462749] Tainted: G O 4.4.0+10 #1
> Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python D 88019628bc90 0 15480 1 0x
> Sep 21 21:19:09 cl15-05 kernel: [21389.462790] 88019628bc90 880198f11c00 88005a509c00 88019628c000
> Sep 21 21:19:09 cl15-05 kernel: [21389.462795] c90040226000 88019628bd80 fe58 8801818da418
> Sep 21 21:19:09 cl15-05 kernel: [21389.462799] 88019628bca8 815a1cd4 8801818da5c0 88019628bd68
> Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
> Sep 21 21:19:09 cl15-05 kernel: [21389.462815] [] schedule+0x64/0x80
> Sep 21 21:19:09 cl15-05 kernel: [21389.462877] [] find_insert_glock+0x4a4/0x530 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462891] [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462903] [] gfs2_glock_get+0x3d/0x330 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462928] [] do_flock+0xf2/0x210 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462933] [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462938] [] ? cp_new_stat+0x10b/0x120
> Sep 21 21:19:09 cl15-05 kernel: [21389.462943] [] gfs2_flock+0x78/0xa0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462946] [] SyS_flock+0x129/0x170
> Sep 21 21:19:09 cl15-05 kernel: [21389.462948] [] entry_SYSCALL_64_fastpath+0x12/0x71
>
> We think there is a possibility, given that this code path only gets
> entered if a glock is being destroyed, that there is a time of check,
> time of use issue here where by the time that schedule gets called the
> thing which we expect to be waking us up has completed dying and
> therefore won't trigger a wakeup for us. We only seen this a couple of
> times in fairly intensive VM stress tests where a lot of flocks get
> used on a small number of lock files (we use them to ensure consistent
> behaviour of disk activation/deactivation and also access to the
> database with the system state) but it's concerning nonetheless. We're
> looking at replacing the call to schedule with schedule_timeout with a
> timeout of maybe HZ to ensure that we will always get out of the
> schedule operation and retry. Is this something you think you may have seen
> or have any ideas on?
>
> Thanks,
>
> Mark.

Hi Mark,

It's very common to get call traces like that when one of the nodes in a cluster goes down and the other nodes all wait for the failed node to be fenced, etc. The node failure causes dlm to temporarily stop granting locks until the issue is resolved. This is expected behavior, and dlm recovery should eventually grant the lock once the node is properly removed from the cluster. I haven't seen it on an flock glock, because I personally don't often run with flocks, but it often happens to me with other glocks when I do recovery testing (which I've been doing a lot of lately).

So is there a node failure in your case? If there's a node failure, dlm should recover the locks and allow the waiter to continue normally. If it's not a node failure, it's hard to say... I know Andreas fixed some problems with the rcu locking we do to protect the glock rhashtable. Perhaps the kernel you're using is missing one of his patches? Or maybe it's a new bug. Adding Andreas to the cc.

Regards,

Bob Peterson
Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
To give some context here, the environment we were testing this in looks like this:

* 2 x XenServer hosts: Dell R430s with Xeon E5-2630 v3 CPUs and Intel X520 10g NICs dedicated to the iSCSI traffic for GFS2 (only using one per host)
* Dedicated Linux filer packed with SSDs and 128GB of RAM. The native storage can sustainably support > 5GB/s write throughput, and the host (currently) has a bonded pair of X710 10g NICs to serve the hosts.

So basically the storage is significantly faster than the network and will not be the bottleneck in these tests. Whether what we observe here will change when we update the filer to have 6 10g NICs (planned in the next few weeks) remains to be seen; obviously we'll need to add some more hosts to the cluster, but we have another 10 in the rack so that isn't an issue.

Mark.

-----Original Message-----
From: Bob Peterson
Sent: 28 September 2018 15:00
To: Tim Smith
Cc: Steven Whitehouse ; Mark Syms ; cluster-devel@redhat.com; Ross Lagerwall
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> I think what's happening for us is that the work that needs to be done
> to release an rgrp lock is happening pretty fast and is about the same
> in all cases, so the stats are not providing a meaningful distinction.
> We see the same lock (or small number of locks) bouncing back and
> forth between nodes with neither node seeming to consider them
> congested enough to avoid, even though the FS is <50% full and there must be
> plenty of other non-full rgrps.
>
> --
> Tim Smith

Hi Tim,

Interesting. I've done experiments in the past where I allowed resource group glocks to take advantage of the "minimum hold time" which is today only used for inode glocks. In my experiments it's made no appreciable difference that I can recall, but it might be an interesting experiment for you to try.

Steve's right that we need to be careful not to improve one aspect of performance while causing another aspect's downfall, like improving intra-node congestion problems at the expense of inter-node congestion.

Regards,

Bob Peterson
Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
Just to follow up on this further. We made a change to swap the call to schedule for schedule_timeout, and added a printk in the case where schedule_timeout actually timed out rather than being woken. Running our multi-VM, multi-host stress test, we saw one instance of the printk message and no stuck tasks at any point within a 24-hour stress test, so it looks like this is beneficial, at least in our kernel. I'll clean the change up and send it on, but it's likely that it's only required due to a race elsewhere, and possibly one that has already been fixed in your latest codebase. Our thoughts revolve around the code performing cleanup on the hashtable possibly checking to see if there are any waiters that will need to be notified on completion of the cleanup immediately before we start to wait, and then not notifying as it's not aware of the waiter; but we have no concrete proof of that, just the outcome that we get into schedule and stick there.

Mark.

-----Original Message-----
From: Mark Syms
Sent: 28 September 2018 14:57
To: 'Bob Peterson'
Cc: cluster-devel@redhat.com; Tim Smith ; Ross Lagerwall ; Andreas Gruenbacher
Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

The hosts stayed up and in the first occurrence of this (that we caught in flight as opposed to only seeing the aftermath in automation logs) the locks all actually unwedged after about 2 hours.

Mark.

-----Original Message-----
From: Bob Peterson
Sent: 28 September 2018 13:56
To: Mark Syms
Cc: cluster-devel@redhat.com; Tim Smith ; Ross Lagerwall ; Andreas Gruenbacher
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> Hi Bob,
>
> No, we haven't but it wouldn't be hard for us to replace our patches
> in our internal patchqueue with these and try them. Will let you know what we
> find.
>
> We have also seen, what we think is an unrelated issue where we get
> the following backtrace in kern.log and our system stalls
>
> Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462749] Tainted: G O 4.4.0+10 #1
> Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python D 88019628bc90 0 15480 1 0x
> Sep 21 21:19:09 cl15-05 kernel: [21389.462790] 88019628bc90 880198f11c00 88005a509c00 88019628c000
> Sep 21 21:19:09 cl15-05 kernel: [21389.462795] c90040226000 88019628bd80 fe58 8801818da418
> Sep 21 21:19:09 cl15-05 kernel: [21389.462799] 88019628bca8 815a1cd4 8801818da5c0 88019628bd68
> Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
> Sep 21 21:19:09 cl15-05 kernel: [21389.462815] [] schedule+0x64/0x80
> Sep 21 21:19:09 cl15-05 kernel: [21389.462877] [] find_insert_glock+0x4a4/0x530 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462891] [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462903] [] gfs2_glock_get+0x3d/0x330 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462928] [] do_flock+0xf2/0x210 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462933] [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462938] [] ? cp_new_stat+0x10b/0x120
> Sep 21 21:19:09 cl15-05 kernel: [21389.462943] [] gfs2_flock+0x78/0xa0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462946] [] SyS_flock+0x129/0x170
> Sep 21 21:19:09 cl15-05 kernel: [21389.462948] [] entry_SYSCALL_64_fastpath+0x12/0x71
>
> We think there is a possibility, given that this code path only gets
> entered if a glock is being destroyed, that there is a time of check,
> time of use issue here where by the time that schedule gets called the
> thing which we expect to be waking us up has completed dying and
> therefore won't trigger a wakeup for us. We only seen this a couple of
> times in fairly intensive VM stress tests where a lot of flocks get
> used on a small number of lock files (we use them to ensure consistent
> behaviour of disk activation/deactivation and also access to the
> database with the system state) but it's concerning nonetheless. We're
> looking at replacing the call to schedule with schedule_timeout with a
> timeout of maybe HZ to ensure that we will always get out of the
> schedule operation and retry. Is this something you think you may have seen
> or have any ideas on?
>
> Thanks,
>
> Mark.

Hi Mark,

It's very common to get call
[Cluster-devel] Any ideas on this?
Hi,

We've seen a couple of times in our automated tests that unmounting the GFS2 filesystem gets stuck and doesn't complete. It's just happened again, and the stack for the umount process looks like:

[] flush_workqueue+0x1c8/0x520
[] gfs2_make_fs_ro+0x69/0x160 [gfs2]
[] gfs2_put_super+0xa9/0x1c0 [gfs2]
[] generic_shutdown_super+0x6f/0x100
[] kill_block_super+0x27/0x70
[] gfs2_kill_sb+0x71/0x80 [gfs2]
[] deactivate_locked_super+0x3b/0x70
[] deactivate_super+0x59/0x60
[] cleanup_mnt+0x58/0x80
[] __cleanup_mnt+0x12/0x20
[] task_work_run+0x7d/0xa0
[] exit_to_usermode_loop+0x73/0x98
[] syscall_return_slowpath+0x41/0x50
[] int_ret_from_sys_call+0x25/0x8f
[] 0x

Querying mount or /proc/mounts no longer shows the gfs2 filesystem, so it's got part way through the process. The test has just finished deleting a series of VMs which will have had data written to them by multiple hosts, and the time gap between the last delete completing and the umount is small, so this might play a part in things.

"glocktop -d 1 -r -H" on the stuck host shows it trying to get locks from a different host in the cluster where the referenced pids are no longer present. Unmounting the fs on the host where the locks are supposedly held was successful, but then the stuck host started reporting that it was stuck waiting for locks held locally by kworker threads.

Anything we should be looking at especially, possibly a fix that we don't currently have in our kernel?

Thanks, Mark.
[Cluster-devel] [PATCH 0/2] GFS2 locking patches
A pair of locking issues in GFS2, observed when running VM storage stress tests.

0001-GFS2-use-schedule-timeout-in-find-insert-glock.patch covers a case where an application-level flock would wedge. The VM control plane makes extensive use of flocks to control access to VM virtual disks and databases, and we encountered several failed tests where the flocks did not get acquired even when no one was holding them. Investigation indicates that there is a race in find_insert_glock where the call to schedule can be made when the expected waker has already completed its work. Replace schedule with schedule_timeout and log.

0002-GFS2-Flush-the-GFS2-delete-workqueue-before-stopping.patch covers a case where umount would wedge unrecoverably. The completion of the stress test involves the deletion of the test machines and virtual disks, followed by the filesystem being unmounted on all hosts before the hosts are returned to the lab pool. umount was found to wedge, and this has been traced to gfs2_log_reserve being called in the flush_workqueue but after the associated kthread processes had been stopped. Thus there was nobody to handle the log reserve request and the code wedged.

Mark Syms (1):
  GFS2: use schedule timeout in find insert glock

Tim Smith (1):
  GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads

 fs/gfs2/glock.c | 3 ++-
 fs/gfs2/super.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

--
1.8.3.1
[Cluster-devel] [PATCH 2/2] GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads
From: Tim Smith

Flushing the workqueue can cause operations to happen which might call gfs2_log_reserve(), or get stuck waiting for locks taken by such operations. gfs2_log_reserve() can io_schedule(). If this happens, it will never wake because the only thing which can wake it is gfs2_logd(), which was already stopped.

This causes umount of a gfs2 filesystem to wedge permanently if, for example, the umount immediately follows a large delete operation. When this occurred, the following stack trace was obtained from the umount command:

[] flush_workqueue+0x1c8/0x520
[] gfs2_make_fs_ro+0x69/0x160 [gfs2]
[] gfs2_put_super+0xa9/0x1c0 [gfs2]
[] generic_shutdown_super+0x6f/0x100
[] kill_block_super+0x27/0x70
[] gfs2_kill_sb+0x71/0x80 [gfs2]
[] deactivate_locked_super+0x3b/0x70
[] deactivate_super+0x59/0x60
[] cleanup_mnt+0x58/0x80
[] __cleanup_mnt+0x12/0x20
[] task_work_run+0x7d/0xa0
[] exit_to_usermode_loop+0x73/0x98
[] syscall_return_slowpath+0x41/0x50
[] int_ret_from_sys_call+0x25/0x8f
[] 0x

Signed-off-by: Tim Smith
Signed-off-by: Mark Syms
---
 fs/gfs2/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index c212893..a971862 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -854,10 +854,10 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
 	if (error && !test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
 		return error;

+	flush_workqueue(gfs2_delete_workqueue);
 	kthread_stop(sdp->sd_quotad_process);
 	kthread_stop(sdp->sd_logd_process);

-	flush_workqueue(gfs2_delete_workqueue);
 	gfs2_quota_sync(sdp->sd_vfs, 0);
 	gfs2_statfs_sync(sdp->sd_vfs, 0);
--
1.8.3.1
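The underlying deadlock is a generic shutdown-ordering problem: flush the producers of work before stopping the thread that services them. A hedged userspace analogue follows; the names logd and delete_work are illustrative stand-ins, not the kernel code:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Teardown-ordering sketch: a "delete work" producer needs a "logd"
 * consumer to satisfy its log reservation. If logd were stopped before
 * the delete work was flushed, the producer would block forever;
 * flushing first (as in the patch) avoids the deadlock.
 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool reservation_pending, logd_running = true;

static void *logd(void *arg)
{
	pthread_mutex_lock(&lock);
	while (logd_running) {
		if (reservation_pending) {
			reservation_pending = false;	/* grant reservation */
			pthread_cond_broadcast(&cond);
		}
		pthread_cond_wait(&cond, &lock);
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void *delete_work(void *arg)
{
	pthread_mutex_lock(&lock);
	reservation_pending = true;		/* gfs2_log_reserve() analogue */
	pthread_cond_broadcast(&cond);
	while (reservation_pending)		/* io_schedule() analogue */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t tl, tw;

	pthread_create(&tl, NULL, logd, NULL);
	pthread_create(&tw, NULL, delete_work, NULL);

	pthread_join(tw, NULL);		/* flush_workqueue() first ... */

	pthread_mutex_lock(&lock);	/* ... then kthread_stop(logd) */
	logd_running = false;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	pthread_join(tl, NULL);
	printf("unmount completed cleanly\n");
	return 0;
}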
[Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock
During a VM stress test we encountered a system lockup, and kern.log contained:

kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
kernel: [21389.462749] Tainted: G O 4.4.0+10 #1
kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: [21389.462783] python D 88019628bc90 0 15480 1 0x
kernel: [21389.462790] 88019628bc90 880198f11c00 88005a509c00 88019628c000
kernel: [21389.462795] c90040226000 88019628bd80 fe58 8801818da418
kernel: [21389.462799] 88019628bca8 815a1cd4 8801818da5c0 88019628bd68
kernel: [21389.462803] Call Trace:
kernel: [21389.462815] [] schedule+0x64/0x80
kernel: [21389.462877] [] find_insert_glock+0x4a4/0x530 [gfs2]
kernel: [21389.462891] [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
kernel: [21389.462903] [] gfs2_glock_get+0x3d/0x330 [gfs2]
kernel: [21389.462928] [] do_flock+0xf2/0x210 [gfs2]
kernel: [21389.462933] [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
kernel: [21389.462938] [] ? cp_new_stat+0x10b/0x120
kernel: [21389.462943] [] gfs2_flock+0x78/0xa0 [gfs2]
kernel: [21389.462946] [] SyS_flock+0x129/0x170
kernel: [21389.462948] [] entry_SYSCALL_64_fastpath+0x12/0x71

On examination of the code it was determined that this code path is only taken if the selected glock is marked as dead; the supposition therefore is that by the time schedule was called the glock had been cleaned up and therefore nothing woke the schedule. Instead of calling schedule, call schedule_timeout(HZ) so at least we get a chance to re-evaluate.

On repeating the stress test, the printk message was seen once in the logs across four servers, but there were no further occurrences and no stuck task log entries. This indicates that when the timeout occurred, the code repeated the lookup and did not find the same glock entry; since we had not been woken by then, we would never have been woken.

Signed-off-by: Mark Syms
Signed-off-by: Tim Smith
---
 fs/gfs2/glock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 4614ee2..0a59a01 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -758,7 +758,8 @@ static struct gfs2_glock *find_insert_glock(struct lm_lockname *name,
 	}
 	if (gl && !lockref_get_not_dead(&gl->gl_lockref)) {
 		rcu_read_unlock();
-		schedule();
+		if (schedule_timeout(HZ) == 0)
+			printk(KERN_INFO "find_insert_glock schedule timed out\n");
 		goto again;
 	}
 out:
--
1.8.3.1
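The suspected race is the classic lost-wakeup pattern: the waker fires between the waiter's last check of the condition and its call to schedule(), and nothing ever wakes the waiter afterwards. A hedged userspace analogue using POSIX condition variables, with a timed wait playing the role of the patch's schedule_timeout(HZ):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

/*
 * Lost-wakeup sketch: the waker signals *before* the waiter blocks, and
 * the waiter sleeps with no predicate to re-check -- the analogue of
 * schedule() in find_insert_glock() after the dying glock has already
 * been freed. A timed wait turns the permanent hang into a one-second
 * stall followed by a retry of the lookup.
 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void *waker(void *arg)
{
	pthread_mutex_lock(&lock);
	pthread_cond_broadcast(&cond);	/* wake_up_glock() analogue */
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;
	struct timespec deadline;

	pthread_create(&t, NULL, waker, NULL);
	pthread_join(t, NULL);		/* the wakeup is now irrevocably lost */

	pthread_mutex_lock(&lock);
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += 1;		/* HZ analogue */

	/* A plain pthread_cond_wait(&cond, &lock) here would block forever. */
	if (pthread_cond_timedwait(&cond, &lock, &deadline) != 0)
		printf("timed out -- retry the lookup instead of hanging\n");
	pthread_mutex_unlock(&lock);
	return 0;
}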
Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock
That sounds entirely reasonable, so long as you are absolutely sure that nothing is ever going to mess with that glock; we erred on the side of more caution, not knowing whether it would be guaranteed safe or not.

Thanks, Mark

-----Original Message-----
From: Steven Whitehouse
Sent: 08 October 2018 13:56
To: Mark Syms ; cluster-devel@redhat.com
Cc: Ross Lagerwall ; Tim Smith
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

Hi,

On 08/10/18 13:36, Mark Syms wrote:
> During a VM stress test we encountered a system lockup, and kern.log
> contained:
>
> kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
> kernel: [21389.462749] Tainted: G O 4.4.0+10 #1
> kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kernel: [21389.462783] python D 88019628bc90 0 15480 1 0x
> kernel: [21389.462790] 88019628bc90 880198f11c00 88005a509c00 88019628c000
> kernel: [21389.462795] c90040226000 88019628bd80 fe58 8801818da418
> kernel: [21389.462799] 88019628bca8 815a1cd4 8801818da5c0 88019628bd68
> kernel: [21389.462803] Call Trace:
> kernel: [21389.462815] [] schedule+0x64/0x80
> kernel: [21389.462877] [] find_insert_glock+0x4a4/0x530 [gfs2]
> kernel: [21389.462891] [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
> kernel: [21389.462903] [] gfs2_glock_get+0x3d/0x330 [gfs2]
> kernel: [21389.462928] [] do_flock+0xf2/0x210 [gfs2]
> kernel: [21389.462933] [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
> kernel: [21389.462938] [] ? cp_new_stat+0x10b/0x120
> kernel: [21389.462943] [] gfs2_flock+0x78/0xa0 [gfs2]
> kernel: [21389.462946] [] SyS_flock+0x129/0x170
> kernel: [21389.462948] [] entry_SYSCALL_64_fastpath+0x12/0x71
>
> On examination of the code it was determined that this code path is
> only taken if the selected glock is marked as dead, the supposition
> therefore is that by the time schedule was called the glock had been
> cleaned up and therefore nothing woke the schedule. Instead of calling
> schedule, call schedule_timeout(HZ) so at least we get a chance to
> re-evaluate.
>
> On repeating the stress test, the printk message was seen once in the
> logs across four servers but no further occurrences nor were there any
> stuck task log entries. This indicates that when the timeout occurred
> the code repeated the lookup and did not find the same glock entry but
> as we hadn't been woken this means that we would never have been woken.
>
> Signed-off-by: Mark Syms
> Signed-off-by: Tim Smith
> ---
>  fs/gfs2/glock.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index 4614ee2..0a59a01
> 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -758,7 +758,8 @@ static struct gfs2_glock *find_insert_glock(struct
> lm_lockname *name,
>  	}
>  	if (gl && !lockref_get_not_dead(&gl->gl_lockref)) {
>  		rcu_read_unlock();
> -		schedule();
> +		if (schedule_timeout(HZ) == 0)
> +			printk(KERN_INFO "find_insert_glock schedule timed out\n");
>  		goto again;
>  	}
> out:

That is a bit odd. In fact that whole function looks odd. I wonder why it needs to wait in the first place. It should be a simple comparison here. If the glock is dead then nothing else should touch it, so we are safe to add a new one into the hash table. The wait is almost certainly a bug,

Steve.
Re: [Cluster-devel] [PATCH 2/2] GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads
Thanks Bob,

We're having a look at the bit that Tim sent earlier, to see if switching the call order makes the schedule_timeout never occur. As Steven said, it would be nice if we didn't have to worry about waiting for "dead" glocks to be cleaned up, but that would require some considerable care to ensure that we don't just get the opposite race and get a newly created glock freed and deleted.

Mark

From: Bob Peterson
Sent: Monday, 8 October 2018 21:10
To: Mark Syms
CC: cluster-devel@redhat.com, Ross Lagerwall , Tim Smith
Subject: Re: [Cluster-devel] [PATCH 2/2] GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads

----- Original Message -----
> From: Tim Smith
>
> Flushing the workqueue can cause operations to happen which might
> call gfs2_log_reserve(), or get stuck waiting for locks taken by such
> operations. gfs2_log_reserve() can io_schedule(). If this happens, it
> will never wake because the only thing which can wake it is gfs2_logd()
> which was already stopped.
>
> This causes umount of a gfs2 filesystem to wedge permanently if, for
> example, the umount immediately follows a large delete operation.
>
> When this occurred, the following stack trace was obtained from the
> umount command
>
> [] flush_workqueue+0x1c8/0x520
> [] gfs2_make_fs_ro+0x69/0x160 [gfs2]
> [] gfs2_put_super+0xa9/0x1c0 [gfs2]
> [] generic_shutdown_super+0x6f/0x100
> [] kill_block_super+0x27/0x70
> [] gfs2_kill_sb+0x71/0x80 [gfs2]
> [] deactivate_locked_super+0x3b/0x70
> [] deactivate_super+0x59/0x60
> [] cleanup_mnt+0x58/0x80
> [] __cleanup_mnt+0x12/0x20
> [] task_work_run+0x7d/0xa0
> [] exit_to_usermode_loop+0x73/0x98
> [] syscall_return_slowpath+0x41/0x50
> [] int_ret_from_sys_call+0x25/0x8f
> [] 0x
>
> Signed-off-by: Tim Smith
> Signed-off-by: Mark Syms
> ---

Hi Mark, Tim, and all,

I pushed patch 2/2 upstream. For now I'll hold off on 1/2 but keep it on my queue, pending our investigation.

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=b7f5a2cd27b76e96fdc6d77b060dfdd877c9d0a9

Regards,

Bob Peterson
Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock
Having swapped the lines below around, we still see the timeout on schedule fire, but only once in a fairly mega stress test. This is why we weren't worried about the timeout being HZ: the situation is hardly ever hit, as having to wait is rare, and normally we are woken from schedule; without a timeout on schedule we never wake up, so a rare occurrence of waiting a second really doesn't seem too bad.

Mark.

-----Original Message-----
From: Tim Smith
Sent: 08 October 2018 14:27
To: Steven Whitehouse
Cc: Mark Syms ; cluster-devel@redhat.com; Ross Lagerwall
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

On Monday, 8 October 2018 14:13:10 BST Steven Whitehouse wrote:
> Hi,
>
> On 08/10/18 14:10, Tim Smith wrote:
> > On Monday, 8 October 2018 14:03:24 BST Steven Whitehouse wrote:
> >> On 08/10/18 13:59, Mark Syms wrote:
> >>> That sounds entirely reasonable so long as you are absolutely sure
> >>> that nothing is ever going to mess with that glock, we erred on
> >>> the side of more caution not knowing whether it would be guaranteed safe
> >>> or not.
> >>>
> >>> Thanks,
> >>>
> >>> Mark
> >>
> >> We should have a look at the history to see how that wait got added.
> >> However the "dead" flag here means "don't touch this glock" and is
> >> there so that we can separate the marking dead from the actual
> >> removal from the list (which simplifies the locking during the
> >> scanning procedures)
> >
> > You beat me to it :-)
> >
> > I think there might be a bit of a problem inserting a new entry with
> > the same name before the old entry has been fully destroyed (or at
> > least removed), which would be why the schedule() is there.
>
> If the old entry is marked dead, all future lookups should ignore it.
> We should only have a single non-dead entry at a time, but that
> doesn't seem like it should need us to wait for it.

On the second call we do have the new glock to insert as arg2, so we could try to swap them cleanly, yeah.

> If we do discover that the wait is really required, then it sounds
> like as you mentioned above there is a lost wakeup, and that must
> presumably be on a code path that sets the dead flag and then fails to
> send a wake up later on. If we can drop the wait in the first place,
> that seems like a better plan,

Ooooh, I wonder if these two lines:

	wake_up_glock(gl);
	call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);

in gfs2_glock_free() are the wrong way round?

--
Tim Smith
Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock
We think we have; just making a build to test. Will follow up later.

Mark.

-----Original Message-----
From: Steven Whitehouse
Sent: 09 October 2018 09:41
To: Mark Syms ; Tim Smith
Cc: cluster-devel@redhat.com; Ross Lagerwall
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

On 09/10/18 09:13, Mark Syms wrote:
> Having swapped the lines below around we still see the timeout on
> schedule fire, but only once in a fairly mega stress test. This is why
> we weren't worried about the timeout being HZ, the situation is hardly
> ever hit as having to wait is rare and normally we are woken from
> schedule and without a timeout on schedule we never wake up so a rare
> occurrence of waiting a second really doesn't seem too bad.
>
> Mark.

We should still get to the bottom of why the wake up is missing though, since without that fix we won't know if there is something else wrong somewhere,

Steve.

> -----Original Message-----
> From: Tim Smith
> Sent: 08 October 2018 14:27
> To: Steven Whitehouse
> Cc: Mark Syms ; cluster-devel@redhat.com; Ross Lagerwall
> Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in
> find insert glock
>
> On Monday, 8 October 2018 14:13:10 BST Steven Whitehouse wrote:
>> Hi,
>>
>> On 08/10/18 14:10, Tim Smith wrote:
>>> On Monday, 8 October 2018 14:03:24 BST Steven Whitehouse wrote:
>>>> On 08/10/18 13:59, Mark Syms wrote:
>>>>> That sounds entirely reasonable so long as you are absolutely sure
>>>>> that nothing is ever going to mess with that glock, we erred on
>>>>> the side of more caution not knowing whether it would be guaranteed safe
>>>>> or not.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
>>>> We should have a look at the history to see how that wait got added.
>>>> However the "dead" flag here means "don't touch this glock" and is
>>>> there so that we can separate the marking dead from the actual
>>>> removal from the list (which simplifies the locking during the
>>>> scanning procedures)
>>> You beat me to it :-)
>>>
>>> I think there might be a bit of a problem inserting a new entry with
>>> the same name before the old entry has been fully destroyed (or at
>>> least removed), which would be why the schedule() is there.
>> If the old entry is marked dead, all future lookups should ignore it.
>> We should only have a single non-dead entry at a time, but that
>> doesn't seem like it should need us to wait for it.
> On the second call we do have the new glock to insert as arg2, so we could
> try to swap them cleanly, yeah.
>
>> If we do discover that the wait is really required, then it sounds
>> like as you mentioned above there is a lost wakeup, and that must
>> presumably be on a code path that sets the dead flag and then fails
>> to send a wake up later on. If we can drop the wait in the first
>> place, that seems like a better plan,
> Ooooh, I wonder if these two lines:
>
> 	wake_up_glock(gl);
> 	call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);
>
> in gfs2_glock_free() are the wrong way round?
>
> --
> Tim Smith
Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock
And, sadly, the change to wait all also doesn't prevent the schedule_timeout from occurring. Something more subtle is going on, obviously.

Mark.

-----Original Message-----
From: Tim Smith
Sent: 10 October 2018 09:23
To: cluster-devel@redhat.com
Cc: Andreas Gruenbacher ; Ross Lagerwall ; Mark Syms
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

On Tuesday, 9 October 2018 16:08:10 BST Tim Smith wrote:
> On Tuesday, 9 October 2018 14:00:34 BST Andreas Gruenbacher wrote:
> > On Tue, 9 Oct 2018 at 14:46, Tim Smith wrote:
> > > On Tuesday, 9 October 2018 13:34:47 BST you wrote:
> > > > There must be another reason for the missed wake-up. I'll have
> > > > to study the code some more.
> > >
> > > I think it needs to go into gfs2_glock_dealloc(), in such a way as
> > > to avoid that problem. Currently working out a patch to do that.
> >
> > That doesn't sound right: find_insert_glock is waiting for the glock
> > to be removed from the rhashtable. In gfs2_glock_free, we remove the
> > glock from the rhashtable and then we do the wake-up. Delaying the
> > wakeup further isn't going to help (but it might hide the problem).
>
> The only way we can get the problem we're seeing is if we get an
> effective order of
>
> T0: wake_up_glock()
> T1: prepare_to_wait()
> T1: schedule()
>
> so clearly there's a way for that to happen. Any other order and either
> schedule() doesn't sleep or it gets woken.
>
> The only way I can see at the moment to ensure that wake_up_glock()
> *cannot* get called until after prepare_to_wait() is to delay it until
> the read-side critical sections are done, and the first place that's
> got that property is the start of gfs2_glock_dealloc(), unless we want
> to add synchronise_rcu() to gfs2_glock_free() and I'm guessing there's
> a reason it's using call_rcu() instead.
>
> I'll keep thinking about it.

OK, we have a result that this definitely *isn't* the right answer (still getting the wakeup happening). This lends more weight to the idea that there are multiple waiters, so we'll try your patch.

--
Tim Smith
Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock
Also, the prepare_to_wait call does so in non-exclusive mode, so switching the parameter on the wakeup would have no effect, as non-exclusive waiters are always woken anyway. So there is some other reason why we decide to yield but never get woken.

Mark.

From: Mark Syms
Sent: Thursday, 11 October 2018 09:14
To: Tim Smith , cluster-devel@redhat.com
CC: Andreas Gruenbacher , Ross Lagerwall
Subject: RE: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

And, sadly, the change to wait all also doesn't prevent the schedule_timeout from occurring. Something more subtle is going on, obviously.

Mark.

-----Original Message-----
From: Tim Smith
Sent: 10 October 2018 09:23
To: cluster-devel@redhat.com
Cc: Andreas Gruenbacher ; Ross Lagerwall ; Mark Syms
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

On Tuesday, 9 October 2018 16:08:10 BST Tim Smith wrote:
> On Tuesday, 9 October 2018 14:00:34 BST Andreas Gruenbacher wrote:
> > On Tue, 9 Oct 2018 at 14:46, Tim Smith wrote:
> > > On Tuesday, 9 October 2018 13:34:47 BST you wrote:
> > > > There must be another reason for the missed wake-up. I'll have
> > > > to study the code some more.
> > >
> > > I think it needs to go into gfs2_glock_dealloc(), in such a way as
> > > to avoid that problem. Currently working out a patch to do that.
> >
> > That doesn't sound right: find_insert_glock is waiting for the glock
> > to be removed from the rhashtable. In gfs2_glock_free, we remove the
> > glock from the rhashtable and then we do the wake-up. Delaying the
> > wakeup further isn't going to help (but it might hide the problem).
>
> The only way we can get the problem we're seeing is if we get an
> effective order of
>
> T0: wake_up_glock()
> T1: prepare_to_wait()
> T1: schedule()
>
> so clearly there's a way for that to happen. Any other order and either
> schedule() doesn't sleep or it gets woken.
>
> The only way I can see at the moment to ensure that wake_up_glock()
> *cannot* get called until after prepare_to_wait() is to delay it until
> the read-side critical sections are done, and the first place that's
> got that property is the start of gfs2_glock_dealloc(), unless we want
> to add synchronise_rcu() to gfs2_glock_free() and I'm guessing there's
> a reason it's using call_rcu() instead.
>
> I'll keep thinking about it.

OK, we have a result that this definitely *isn't* the right answer (still getting the wakeup happening). This lends more weight to the idea that there are multiple waiters, so we'll try your patch.

--
Tim Smith
Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing
On Mon, Dec 17, 2018 at 09:58:47AM -0500, Bob Peterson wrote: > Dave Teigland recommended. Unless I'm mistaken, Dave has said that > GFS2 should never withdraw; it should always just kernel panic (Dave, > correct me if I'm wrong). At least this patch confines that behavior > to a small subset of withdraws. The basic idea is that you want to get a malfunctioning node out of the way as quickly as possible so others can recover and carry on. Escalating a partial failure into a total node failure is the best way to do that in this case. Specialized recovery paths run from a partially failed node won't be as reliable, and are prone to blocking all the nodes. I think a reasonable alternative to this is to just sit in an infinite retry loop until the i/o succeeds. Dave [Mark Syms] I would hope that this code would only trigger after some effort has been put into retrying, as panicking the host on the first I/O failure seems like a sure-fire way to get unhappy users (and in our case paying customers). As Edvin points out, there may be other filesystems that may be able to cleanly unmount and thus avoid having to check everything on restart.
Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing
Hi Bob, I agree, it's a hard problem. I'm just trying to understand that we've done the absolute best we can and that if this condition is hit then the best solution really is to just kill the node. I guess it's also a question of how common this actually ends up being. We have now got customers starting to use GFS2 for VM storage on XenServer, so I guess we'll just have to see how many support calls we get on it. Thanks, Mark. -Original Message- From: Bob Peterson Sent: 17 December 2018 20:20 To: Mark Syms Cc: cluster-devel@redhat.com Subject: Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing - Original Message - > I think a reasonable alternative to this is to just sit in an infinite > retry loop until the i/o succeeds. > > Dave > [Mark Syms] I would hope that this code would only trigger after some > effort has been put into retrying, as panicking the host on the first > I/O failure seems like a sure-fire way to get unhappy users (and in > our case paying customers). As Edvin points out, there may be other > filesystems that may be able to cleanly unmount and thus avoid having > to check everything on restart. Hi Mark, Perhaps. I'm not a block layer or iSCSI expert, but afaik, it's not the file system's job to retry IO, and never has been, right? There are already iscsi tuning parameters, vfs tuning parameters, etc. So if an IO error is sent to GFS2 for a write operation, it means the retry algorithms and operation timeout algorithms built into the layers below us (the iscsi layer, scsi layer, block layer, tcp/ip layer etc.) have all failed and given up on the IO operation. We can't really justify adding yet another layer of retries on top of all that, can we? I see your point, and perhaps the system should stay up to continue other mission-critical operations that may not require the faulted hardware. But what's a viable alternative? As Dave T. suggested, we can keep resubmitting the IO until it completes, but then the journal never gets replayed and nobody can have those locks ever again, and that would cause a potential hang of the entire cluster, especially in cases where there's only one device that's failed and the whole cluster is using it. In GFS2, we've got a concept of marking a resource group "in error" so the other nodes won't try to use it, but the same corruption that affects resource groups could be extrapolated to "hot" dinodes as well. For example, suppose the root (mount-point) dinode was in the journal. Now the whole cluster is hung rather than just the one node. Regards, Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing
Thanks Bob, We believe we have seen these issues from time to time in our automated testing but I suspect that they're indicating a configuration problem with the backing storage. For flexibility a proportion of our purely functional testing will use storage provided by a VM running a software iSCSI target and these tests seem to be somewhat susceptible to getting I/O errors, some of which will inevitably end up being in the journal. If we start to see a lot we'll need to look at the config of the VMs first I think. Mark. -Original Message- From: Bob Peterson Sent: 18 December 2018 15:52 To: Mark Syms Cc: cluster-devel@redhat.com Subject: Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing - Original Message - > Hi Bob, > > I agree, it's a hard problem. I'm just trying to understand that we've > done the absolute best we can and that if this condition is hit then > the best solution really is to just kill the node. I guess it's also a > question of how common this actually ends up being. We have now got > customers starting to use GFS2 for VM storage on XenServer so I guess > we'll just have to see how many support calls we get in on it. > > Thanks, > > Mark. Hi Mark, I don't expect the problem to be very common in the real world. The user has to get IO errors while writing to the GFS2 journal, which is not very common. The patch is basically reacting to a phenomenon we recently started noticing in which the HBA (qla2xxx) driver shuts down and stops accepting requests when you do abnormal reboots (which we sometimes do to test node recovery). In these cases, the node doesn't go down right away. It stays up just long enough to cause IO errors with subsequent withdraws, which, we discovered, results in file system corruption. Normal reboots, "/sbin/reboot -fin", and "echo b > /proc/sysrq-trigger" should not have this problem, nor should node fencing, etc. And like I said, I'm open to suggestions on how to fix it. I wish there was a better solution. As it is, I'd kind of like to get something into this merge window for the upstream kernel, but I'll need to submit the pull request for that probably tomorrow or Thursday. If we find a better solution, we can always revert these changes and implement a new one. Regards, Bob Peterson
[Cluster-devel] kernel BUG at fs/gfs2/inode.h:64
Hi, We've seen this in testing with 4.19. Full trace is at the bottom. Looking at the code though it looks like it will assert if the value of change is equal to the number of blocks currently allocated to the inode. Is this expected or should the assert be using >= instead of > ? Thanks, Mark -- 2018-12-24 00:07:03 UTC] [ 3366.209273] kernel BUG at fs/gfs2/inode.h:64! [2018-12-24 00:07:03 UTC] [ 3366.209305] invalid opcode: [#1] SMP NOPTI [2018-12-24 00:07:03 UTC] [ 3366.209327] CPU: 10 PID: 30664 Comm: kworker/10:1 Tainted: G O 4.19.0+0 #1 [2018-12-24 00:07:03 UTC] [ 3366.209352] Hardware name: HP ProLiant BL420c Gen8, BIOS I30 11/03/2014 [2018-12-24 00:07:03 UTC] [ 3366.209397] Workqueue: delete_workqueue delete_work_func [gfs2] [2018-12-24 00:07:03 UTC] [ 3366.209437] RIP: e030:gfs2_add_inode_blocks.part.33+0x10/0x12 [gfs2] [2018-12-24 00:07:03 UTC] [ 3366.209460] Code: 00 89 c2 48 c7 c7 80 bc 4c c0 31 c0 e8 ec 44 c1 c0 e9 68 fc ff ff 0f 0b 0f 0b 48 8b 47 28 48 8b b8 08 04 00 00 e8 bf dd ff ff <0f> 0b 0f 0b 0f 0b 66 66 66 66 90 0f 0b 48 8b 47 28 48 8b b8 08 04 [2018-12-24 00:07:03 UTC] [ 3366.209512] RSP: e02b:c9004a53fbf0 EFLAGS: 00010246 [2018-12-24 00:07:03 UTC] [ 3366.209532] RAX: 0043 RBX: 8881709a4000 RCX: [2018-12-24 00:07:03 UTC] [ 3366.209556] RDX: RSI: 88818d496428 RDI: 88818d496428 [2018-12-24 00:07:03 UTC] [ 3366.209581] RBP: c9004a53fdb0 R08: R09: 0553 [2018-12-24 00:07:03 UTC] [ 3366.209605] R10: R11: c9004a53f968 R12: 888156a303d8 [2018-12-24 00:07:03 UTC] [ 3366.209629] R13: 00f47524 R14: 8881709a3ac0 R15: 888188d31110 [2018-12-24 00:07:03 UTC] [ 3366.209683] FS: 7fa73976e700() GS:88818d48() knlGS: [2018-12-24 00:07:03 UTC] [ 3366.209709] CS: e033 DS: ES: CR0: 80050033 [2018-12-24 00:07:03 UTC] [ 3366.209730] CR2: 7fca0def3000 CR3: 000187fee000 CR4: 00042660 [2018-12-24 00:07:03 UTC] [ 3366.209802] Call Trace: [2018-12-24 00:07:03 UTC] [ 3366.209835] punch_hole+0xf4f/0x10d0 [gfs2] [2018-12-24 00:07:03 UTC] [ 3366.209866] ? __switch_to_asm+0x40/0x70 [2018-12-24 00:07:03 UTC] [ 3366.209907] ? __switch_to_asm+0x40/0x70 [2018-12-24 00:07:03 UTC] [ 3366.209924] ? __switch_to_asm+0x40/0x70 [2018-12-24 00:07:03 UTC] [ 3366.209949] ? punch_hole+0xb3e/0x10d0 [gfs2] [2018-12-24 00:07:03 UTC] [ 3366.209983] gfs2_evict_inode+0x57b/0x680 [gfs2] [2018-12-24 00:07:03 UTC] [ 3366.210036] ? __inode_wait_for_writeback+0x75/0xe0 [2018-12-24 00:07:03 UTC] [ 3366.210066] ? gfs2_evict_inode+0x1c7/0x680 [gfs2] [2018-12-24 00:07:03 UTC] [ 3366.210090] evict+0xc6/0x1a0 [2018-12-24 00:07:03 UTC] [ 3366.210151] delete_work_func+0x60/0x70 [gfs2] [2018-12-24 00:07:03 UTC] [ 3366.210177] process_one_work+0x165/0x370 [2018-12-24 00:07:03 UTC] [ 3366.210197] worker_thread+0x49/0x3e0 [2018-12-24 00:07:03 UTC] [ 3366.210216] kthread+0xf8/0x130 [2018-12-24 00:07:03 UTC] [ 3366.210235] ? rescuer_thread+0x310/0x310 [2018-12-24 00:07:03 UTC] [ 3366.210284] ? 
kthread_bind+0x10/0x10 [2018-12-24 00:07:04 UTC] [ 3366.210301] ret_from_fork+0x35/0x40 [2018-12-24 00:07:04 UTC] [ 3366.210319] Modules linked in: tun nfsv3 nfs_acl nfs lockd grace fscache gfs2 dlm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp mrp stp llc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter dm_multipath sunrpc sb_edac intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper dm_mod sg hpilo lpc_ich intel_rapl_perf psmouse ipmi_si ipmi_devintf ipmi_msghandler nls_utf8 isofs acpi_power_meter loop xen_wdt ip_tables x_tables sd_mod uhci_hcd serio_raw ehci_pci ehci_hcd hpsa igb(O) scsi_transport_sas scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt [2018-12-24 00:07:04 UTC] [ 3366.210663] ---[ end trace d36fcbebd28b36e9 ]---
Re: [Cluster-devel] kernel BUG at fs/gfs2/inode.h:64
We don't yet know how the assert got triggered as we've only seen it once, and in the original form it looks like it would be very hard to trigger in any normal case (given that in default usage i_blocks should be at least 8 times what any putative value for change could be). So, for the assert to have triggered we must have been asked to remove at least 8 times the number of blocks currently allocated to the inode. Possible causes could be a double release or some other higher-level bug that will require further investigation to uncover. Mark. -Original Message- From: Andreas Gruenbacher Sent: 09 January 2019 13:31 To: Tim Smith Cc: cluster-devel ; Igor Druzhinin ; Mark Syms Subject: Re: [Cluster-devel] kernel BUG at fs/gfs2/inode.h:64 Mark and Tim, On Wed, 9 Jan 2019 at 13:05, Tim Smith wrote: > On Tuesday, 8 January 2019 13:32:20 GMT Mark Syms wrote: > > Hi, > > > > We've seen this in testing with 4.19. > > > > Full trace is at the bottom. > > > > Looking at the code though it looks like it will assert if the value > > of change is equal to the number of blocks currently allocated to the inode. > > Is this expected or should the assert be using >= instead of > ? > > It's weirder. Looking at > > static inline void gfs2_add_inode_blocks(struct inode *inode, s64 > change) { > gfs2_assert(GFS2_SB(inode), (change >= 0 || inode->i_blocks > > -change)); > change *= (GFS2_SB(inode)->sd_sb.sb_bsize/GFS2_BASIC_BLOCK); > inode->i_blocks += change; > } > > where the BUG is happening, "inode->i_blocks" is counting 512-byte > blocks and "change" is in units of whatever-the-superblock-said, which > will probably be counting 4096-byte blocks, so the comparison makes no sense. Indeed. I wonder why this hasn't been discovered long ago. > It should probably read > > static inline void gfs2_add_inode_blocks(struct inode *inode, s64 > change) { > change *= (GFS2_SB(inode)->sd_sb.sb_bsize/GFS2_BASIC_BLOCK); > gfs2_assert(GFS2_SB(inode), (change >= 0 || inode->i_blocks >= > -change)); > inode->i_blocks += change; > } > > so I'll make a patch for that unless someone wants to correct me. You can just shift by (inode->i_blkbits - 9). Can you still trigger the assert with this fix? Thanks, Andreas
Re: [Cluster-devel] kernel BUG at fs/gfs2/inode.h:64
So, actually the original comparison in the assert (dodgy scaling factor notwithstanding) was probably correct, in that we don't ever want to remove all blocks from the inode because one of them is used for the inode itself? Or do we still think it should allow for change to be the negative of the current block count? Mark From: Andreas Gruenbacher Sent: Wednesday, 9 January 2019 17:24 To: Tim Smith CC: Mark Syms ,cluster-devel ,Igor Druzhinin Subject: Re: [Cluster-devel] kernel BUG at fs/gfs2/inode.h:64 On Wed, 9 Jan 2019 at 18:14, Tim Smith wrote: > > On Wednesday, 9 January 2019 15:35:05 GMT Andreas Gruenbacher wrote: > > On Wed, 9 Jan 2019 at 14:43, Mark Syms wrote: > > > We don't yet know how the assert got triggered as we've only seen it once > > > and in the original form it looks like it would be very hard to trigger in > > > any normal case (given that in default usage i_blocks should be at least 8 > > > times what any putative value for change could be). So, for the assert to > > > have triggered we've been asked to remove at least 8 times the number of > > > blocks currently allocated to the inode. Possible causes could be a double > > > release or some other higher level bug that will require further > > > investigation to uncover. > > > > The following change has at least survived xfstests: > > > > --- a/fs/gfs2/inode.h > > +++ b/fs/gfs2/inode.h > > @@ -61,8 +61,8 @@ static inline u64 gfs2_get_inode_blocks(const struct > > inode *inode) > > > > static inline void gfs2_add_inode_blocks(struct inode *inode, s64 change) > > { > > -gfs2_assert(GFS2_SB(inode), (change >= 0 || inode->i_blocks > > > -change)); > > -change *= (GFS2_SB(inode)->sd_sb.sb_bsize/GFS2_BASIC_BLOCK); > > +change <<= inode->i_blkbits - 9; > > +gfs2_assert(GFS2_SB(inode), change >= 0 || inode->i_blocks >= -change); > > inode->i_blocks += change; > > } > > > > Andreas > > I'll use > > change <<= (GFS2_SB(inode)->sd_sb.sb_bsize_shift - GFS2_BASIC_BLOCK_SHIFT); > > for consistency with the gfs2_get/set_inode_blocks(). I'll send the patch in a > bit. > > Given what it was like before, either i_blocks was already 0 or -change > somehow became stupidly huge. Anything else looks like it would be hard to > reproduce without mkfs.gfs2 -b 512 (so that GFS2_SB(inode)->sd_sb.sb_bsize == > GFS2_BASIC_BLOCK) which we don't do. > > I'll try to work out what could have caused it and see if we can provoke it > again. > > Out of curiosity I did a few tests where I created a file on GFS2, copied /dev/null on top of it, and then ran stat on the file. It seems like GFS2 never frees the last allocation on truncate; stat always reports 1, 2, 4 or 8 blocks > in use for a zero-length file depending on the underlying filesystem block > size, unlike (say) ext3/4 where it reports 0. I presume this is intentional, so > maybe some corner case where it *is* trying to do that is the root of the > problem. Yes, that's intentional. An empty file without extended attributes consumes one block. Extended attributes consume at least one additional block. Andreas
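To make the unit mismatch concrete, here is a small illustrative helper (a sketch only, mirroring Andreas's fix above; the name is invented, and the open question about whether i_blocks may legitimately reach zero is noted in a comment rather than answered):

/*
 * Illustrative arithmetic for the unit mismatch (not the final patch).
 * inode->i_blocks counts 512-byte basic blocks; "change" arrives in
 * filesystem blocks. With 4096-byte fs blocks (i_blkbits == 12), one
 * fs block is 8 basic blocks, so scaling must precede the comparison:
 * freeing 3 fs blocks means change = -3, i.e. -24 basic blocks.
 */
#include <linux/fs.h>

static inline void add_inode_blocks_sketch(struct inode *inode, s64 change)
{
	change <<= inode->i_blkbits - 9;	/* fs blocks -> 512B units */

	/*
	 * The pre-fix assert compared the unscaled value, so on a 4096B
	 * filesystem it could only fire when asked to free at least 8x
	 * the fs blocks actually accounted, as noted above. Whether ">="
	 * (i_blocks may reach 0) or ">" (the inode's own block always
	 * remains) is right is the open question in this thread.
	 */
	WARN_ON(change < 0 && inode->i_blocks < (u64)-change);
	inode->i_blocks += change;
}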
[Cluster-devel] [PATCH] Retry wait_event_interruptible in event of ERESTARTSYS
We saw an issue in a production server on a customer deployment where DLM 4.0.7 gets "stuck" and is unable to join new lockspaces. See - https://lists.clusterlabs.org/pipermail/users/2019-January/016054.html This was forwarded off-list to David Teigland, who responded thusly. " Hi, thanks for the debugging info. You've spent more time looking at this than I have, but from a first glance it seems to me that the initial problem (there may be multiple) is that in the kernel, lockspace.c do_event() does not sensibly handle the ERESTARTSYS error from wait_event_interruptible(). I think do_event() should continue waiting for a uevent result from userspace until it gets one, because the kernel can't do anything sensible until it gets that. Dave " This change does that. We have it running in automation with no problems so far, but comments welcome. Mark Syms (1): Retry wait_event_interruptible in event of ERESTARTSYS fs/dlm/lockspace.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) -- 1.8.3.1
[Cluster-devel] [PATCH] Retry wait_event_interruptible in event of ERESTARTSYS
Signed-off-by: Mark Syms
---
 fs/dlm/lockspace.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index db43b98..7aae23a 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -210,8 +210,10 @@ static int do_uevent(struct dlm_ls *ls, int in)
 	/* dlm_controld will see the uevent, do the necessary group
 	   management and then write to sysfs to wake us */
 
-	error = wait_event_interruptible(ls->ls_uevent_wait,
-		test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
+	do {
+		error = wait_event_interruptible(ls->ls_uevent_wait,
+			test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
+	} while (error == -ERESTARTSYS);
 
 	log_rinfo(ls, "group event done %d %d", error, ls->ls_uevent_result);
--
1.8.3.1
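One point worth spelling out about the loop above: wait_event_interruptible() reports a pending signal as -ERESTARTSYS, a negative errno, so that is the value the while condition has to test. If signals genuinely cannot be serviced at this point, an alternative (a sketch under that assumption, not the posted patch) is an uninterruptible wait, which trades the retry loop for an unkillable sleep should dlm_controld never respond:

/*
 * Hedged alternative sketch, not the posted patch: wait without
 * allowing signal interruption, so -ERESTARTSYS can never occur and
 * no retry loop is needed. Assumes the dlm-internal definitions from
 * fs/dlm/dlm_internal.h (struct dlm_ls, LSFL_UEVENT_WAIT) used by the
 * patch above.
 */
#include <linux/wait.h>
#include <linux/bitops.h>

static int do_uevent_wait_sketch(struct dlm_ls *ls)
{
	/* sleeps until dlm_controld writes the sysfs result */
	wait_event(ls->ls_uevent_wait,
		   test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
	return ls->ls_uevent_result;
}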
Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
Sadly, this doesn't help and seems to make the situation worse. Our automated tests were previously seeing about a 5% failure rate, and with this patch it's 20%. We still need to verify that they're all down to the same failure, but I thought it better to give some early feedback. Surprised that the script didn't run for you, but the limit on dirty bytes was just to get it to occur more frequently. You may have success with larger values. Mark
Message: 1 Date: Fri, 15 Mar 2019 21:58:12 +0100 From: Andreas Gruenbacher <agrue...@redhat.com> To: Ross Lagerwall <ross.lagerw...@citrix.com> Cc: igor.druzhi...@citrix.com, Sergey Dyasli <sergey.dya...@citrix.com>, Mark Syms <mark.s...@citrix.com>, cluster-devel@redhat.com Subject: Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter Message-ID: <20190315205812.22727-1-agrue...@redhat.com> Content-Type: text/plain; charset=UTF-8
Hi Ross, On Thu, 14 Mar 2019 at 12:18, Ross Lagerwall <ross.lagerw...@citrix.com> wrote: > On 3/13/19 5:13 PM, Andreas Gruenbacher wrote: > > Hi Edwin, > > > > On Wed, 6 Mar 2019 at 12:08, Edwin Török <edvin.to...@citrix.com> > > wrote: > >> Hello, > >> > >> I've been trying to debug a GFS2 deadlock that we see in our lab > >> quite frequently with a 4.19 kernel. With 4.4 and older kernels we > >> were not able to reproduce this. > >> See below for lockdep dumps and stacktraces. > > > > Thanks for the thorough bug report. Does the below fix work for > > you? > > > Hi Andreas, > > I've tested the patch and it doesn't fix the issue. As far as I can see, current->backing_dev_info is not used by any of the code called from balance_dirty_pages_ratelimited() so I don't see how it could work. Yes, I see now. > I found a way of consistently reproducing the issue almost immediately (tested with the latest master commit):
>
> # cat a.py
> import os
>
> fd = os.open("f", os.O_CREAT|os.O_TRUNC|os.O_WRONLY)
>
> for i in range(1000):
>     os.mkdir("xxx" + str(i), 0777)
>
> buf = 'x' * 4096
>
> while True:
>     count = os.write(fd, buf)
>     if count <= 0:
>         break
>
> # cat b.py
> import os
> while True:
>     os.mkdir("x", 0777)
>     os.rmdir("x")
>
> # echo 8192 > /proc/sys/vm/dirty_bytes
> # cd /gfs2mnt
> # (mkdir tmp1; cd tmp1; python2 ~/a.py) &
> # (mkdir tmp2; cd tmp2; python2 ~/a.py) &
> # (mkdir tmp3; cd tmp3; python2 ~/b.py) &
>
> This should deadlock almost immediately. One of the processes will be waiting in balance_dirty_pages() and holding sd_log_flush_lock and several others will be waiting for sd_log_flush_lock. This doesn't work for me: the python processes don't even start properly when dirty_bytes is set so low. > I came up with the following patch which seems to resolve the issue by failing to write the inode if it can't take the lock, but it seems like a dirty workaround rather than a proper fix: > [...] Looking at ext4_dirty_inode, it seems that we should just be able to bail out of gfs2_write_inode and return 0 when PF_MEMALLOC is set in current->flags. Also, we should probably add the current->flags checks from xfs_do_writepage to gfs2_writepage_common. So what do you get with the below patch?
Thanks, Andreas
---
 fs/gfs2/aops.c  | 7 +++++++
 fs/gfs2/super.c | 4 ++++
 2 files changed, 11 insertions(+)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f..694ff91 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -102,6 +102,13 @@ static int gfs2_writepage_common(struct page *page,
 	pgoff_t end_index = i_size >> PAGE_SHIFT;
 	unsigned offset;
 
+	/* (see xfs_do_writepage) */
+	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+			 PF_MEMALLOC))
+		goto redirty;
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
+		goto redirty;
+
 	if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
 		goto out;
 	if (current->journal_info)
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index ca71163..540535c 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -756,6 +756,10 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 	int ret = 0;
 	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
 
+	/* (see ext4_dirty_inode) */
+
Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
-Original Message- From: Andreas Gruenbacher Sent: 17 March 2019 20:06 To: Mark Syms Cc: cluster-devel@redhat.com; Sergey Dyasli ; Igor Druzhinin ; Edvin Török Subject: Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter On Sun, 17 Mar 2019 at 00:59, Mark Syms wrote: > > Sadly, this doesn't help and seems to make the situation worse. Our automated > tests were previously seeing about a 5% failure rate and with this patch it's > 20%. So the PF_MEMALLOC flag doesn't get set on that code path. In that case, we should probably check wbc->sync_mode instead. How about the following?

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 540535c..36de2f7 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -757,7 +757,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
 
 	/* (see ext4_dirty_inode) */
-	if (current->flags & PF_MEMALLOC)
+	if (wbc->sync_mode != WB_SYNC_ALL)
 		return 0;
 
 	if (flush_all)

It may help to wrap that condition in WARN_ON_ONCE for debugging. [Mark Syms] That works better. We were wondering whether it might be a bit too aggressive, though, in that it skips writing the inode entirely unless we have WB_SYNC_ALL, whereas the patch that Ross Lagerwall posted originally would use a trylock in the case where we don't have WB_SYNC_ALL and then fail out (see the sketch below). Whether this should just silently return 0 or return -EAGAIN as Ross's patch does is a matter of style, I guess. What are your thoughts on this? Thanks, Mark.
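For comparison, a sketch of the trylock variant mentioned above, reconstructed from the description in this thread rather than from Ross's actual patch; it assumes the contended lock is gfs2's sd_log_flush_lock rw_semaphore and that the writeback caller copes with -EAGAIN:

/*
 * Hedged sketch of the trylock variant (reconstructed from the thread,
 * not Ross's actual patch). Background writeback (!WB_SYNC_ALL) backs
 * off with -EAGAIN instead of blocking behind a journal flush; only
 * data-integrity writeback waits. Assumes gfs2 internals: GFS2_SB()
 * and the sd_log_flush_lock rw_semaphore in struct gfs2_sbd.
 */
#include <linux/writeback.h>
#include <linux/rwsem.h>

static int gfs2_write_inode_sketch(struct inode *inode,
				   struct writeback_control *wbc)
{
	struct gfs2_sbd *sdp = GFS2_SB(inode);
	int ret = 0;

	if (wbc->sync_mode != WB_SYNC_ALL) {
		if (!down_read_trylock(&sdp->sd_log_flush_lock))
			return -EAGAIN;	/* back off; don't block writeback */
	} else {
		down_read(&sdp->sd_log_flush_lock);
	}
	/* ... normal inode write-out would go here ... */
	up_read(&sdp->sd_log_flush_lock);
	return ret;
}

Whether the writeback machinery treats -EAGAIN specially is left aside here; the point of the variant is the non-blocking backoff rather than the silent return 0.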
[Cluster-devel] GFS2 rm can be very slow
One of our users has just been in touch and reported slow file deletion (in this case of the virtual disk for a VM), which was particularly impactful for the Citrix Hypervisor control code because we hold a number of locks while deleting VM virtual disks, and in this case the file delete took ~40 seconds to complete. Now, we can, and will, work around this in the hypervisor control code by dropping the database entries under the locks and then leaving the actual file deletion to occur under the control of our background garbage collection process (a sketch of that workaround follows below), but it led me to wonder whether the userspace rm operation couldn't do something relatively simple to the file's inode data and then leave the actual resource group purging to happen in the background. This is obviously more complex to handle: if an rm occurs and there is immediately a demand for blocks, and the only blocks available were those assigned to the rm'd file (i.e. the fs was full and the file was rm'd to make space), the block allocator would need to wait for the cleanup to occur. Would this be something worth considering as a future improvement, or is it just too complicated to envisage? Mark.
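For the record, the hypervisor-side workaround can be sketched as follows (illustrative userspace C with invented paths and names; it assumes a pre-created .gc directory on the same GFS2 mount so that rename() is cheap and atomic):

/*
 * Hypothetical sketch of the deferred-delete workaround described
 * above: the control path renames the file into a garbage directory
 * (cheap, atomic, same filesystem) and returns; a background worker
 * later pays for the slow unlink, so no cluster-wide locks are held
 * for the ~40 seconds the block freeing can take. Paths are invented.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int defer_delete(const char *path)
{
	char tomb[4096];

	/* unique-enough name inside a pre-created .gc dir on the same mount */
	snprintf(tomb, sizeof(tomb), "/gfs2mnt/.gc/%ld.%d",
		 (long)time(NULL), (int)getpid());
	return rename(path, tomb);	/* cheap: no block freeing yet */
}

/* background garbage collector: the slow part happens here */
void gc_unlink(const char *tomb_path)
{
	if (unlink(tomb_path) != 0)
		perror("unlink");
}

The rename makes the delete appear instantaneous to the lock-holding control path; the expensive resource group purging still happens, but inside gc_unlink() on the garbage collector's schedule.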
Re: [Cluster-devel] GFS2 rm can be very slow
Thanks Bob, I only have the aftermath logs that show the delete being long. My suspicion would be that the file in question was a reasonable size and fragmented (possibly a consequence of some of the patches we carry to avoid rgrp contention on allocation), and that clearing it needed to acquire a significant number of rgrp glocks. Mark From: Bob Peterson Sent: Friday, 26 April 2019 16:45 To: Mark Syms CC: cluster-devel@redhat.com Subject: Re: [Cluster-devel] GFS2 rm can be very slow - Original Message - > One of our users has just been in touch and reported slow file deletion (in > this case of the virtual disk for a VM), which was particularly impactful for > the Citrix Hypervisor control code because we hold a number of locks > while deleting VM virtual disks, and in this case the file delete took ~40 > seconds to complete. > > Now, we can, and will, work around this in the hypervisor control code by > dropping the database entries under the locks and then leaving the actual file > deletion to occur under the control of our background garbage > collection process, but it led me to wonder whether the userspace rm > operation couldn't do something relatively simple to the file's inode data > and then leave the actual resource group purging to happen in the > background. This is obviously more complex to handle: if an rm occurs > and there is immediately a demand for blocks, and the only blocks > available were those assigned to the rm'd file (i.e. the fs > was full and the file was rm'd to make space), the block allocator would need > to wait for the cleanup to occur. Would this be something worth considering > as a future improvement, or is it just too complicated to envisage? > > Mark. > Hi Mark, I'll let Andreas comment on this, since he was the last to work on that part of gfs2, but he's taking the day off today (back Monday). Maybe he'll comment on Monday. We should try to find out what part of the delete process is taking all the time here. After all, the unlink part of it should be relatively fast because the freeing of the blocks is done later. If the files are really big or really fragmented, we can sometimes spend a lot of time waiting for many rgrp glocks, at least in the older versions of the code. The newer versions, where we got rid of the recursive delete, should only need to lock one rgrp at a time, so that should not be an issue. The actual truncating of the file might take time, flushing transactions and such, especially since a delete forces us to read all the indirect metadata blocks in before freeing them. I think some versions had a broken read-ahead for that part of the code, but Andreas would remember. Or maybe we're waiting to grab the glock of the directory we're deleting from? For example, maybe there's a "hot" directory that's used in read mode by lots of processes across several nodes and we need it in rw mode to remove the dirent. I suppose a finely crafted systemtap script would help figure this all out. Also, what version of gfs2 is running slow? Regards, Bob Peterson