[Cluster-devel] Recommendations for securing DLM traffic?

2018-01-11 Thread Mark Syms
What are the general recommendations for securing traffic for DLM on port 21064?

It appears that this traffic is not signed or encrypted in any way, so while
there may not be any privacy issue from information disclosure, it is not
clear that the messages could not be replayed or otherwise spoofed.

Thanks,

Mark.



[Cluster-devel] Sending patches for GFS2

2018-09-20 Thread Mark Syms
We have a couple of patches for GFS2 which address some performance issues
we've observed in our testing. What branch would you like the patches to be
based on top of? (We have a custom patched build, so they'll need to be
realigned first.)

Thanks,

Mark.


[Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator

2018-09-20 Thread Mark Syms
From: Tim Smith 

When growing a number of files on the same cluster node from different
threads (e.g. fio with 20 or so jobs), all those threads pile into
gfs2_inplace_reserve() independently looking to claim a new resource
group and after a while they all synchronise, getting through the
gfs2_rgrp_used_recently()/gfs2_rgrp_congested() check together.

When this happens, write performance drops to about 1/5 on a single
node cluster, and on multi-node clusters it drops to near zero on
some nodes. The output from "glocktop -r -H -d 1" when this happens
begins to show many processes stuck in gfs2_inplace_reserve(), waiting
on a resource group lock.

This commit adds a module parameter which, when set to a value of 1,
introduces some random jitter into the first two passes of
gfs2_inplace_reserve() when trying to lock a new resource group, skipping
to the next one roughly 1/2 the time and with progressively lower
probability after each skip.

Signed-off-by: Tim Smith 
---
 fs/gfs2/rgrp.c | 39 +++
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 1ad3256..994eb7f 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "gfs2.h"
 #include "incore.h"
@@ -49,6 +50,11 @@
 #define LBITSKIP00 (0xUL)
 #endif
 
+static int gfs2_skippy_rgrp_alloc;
+
+module_param_named(skippy_rgrp_alloc, gfs2_skippy_rgrp_alloc, int, 0644);
+MODULE_PARM_DESC(skippy_rgrp_alloc, "Set skippiness of resource group allocator, 0|1. Where 1 will cause resource groups to be randomly skipped with the likelihood of skipping progressively decreasing after a skip has occurred.");
+
 /*
  * These routines are used by the resource group routines (rgrp.c)
  * to keep track of block allocation.  Each block is represented by two
@@ -2016,6 +2022,11 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 	u64 last_unlinked = NO_BLOCK;
 	int loops = 0;
 	u32 free_blocks, skip = 0;
+	/*
+	 * gfs2_skippy_rgrp_alloc provides our initial skippiness.
+	 * randskip will thus be 2-255 if we want it to do anything.
+	 */
+	u8 randskip = gfs2_skippy_rgrp_alloc + 1;
 
 	if (sdp->sd_args.ar_rgrplvb)
 		flags |= GL_SKIP;
@@ -2046,10 +2057,30 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 			if (loops == 0 &&
 			    !fast_to_acquire(rs->rs_rbm.rgd))
 				goto next_rgrp;
-			if ((loops < 2) &&
-			    gfs2_rgrp_used_recently(rs, 1000) &&
-			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-				goto next_rgrp;
+			if (loops < 2) {
+				/*
+				 * If resource group allocation is requested to be skippy,
+				 * roll a hypothetical dice of randskip sides and skip
+				 * straight to the next resource group anyway if it comes
+				 * up 1.
+				 */
+				if (gfs2_skippy_rgrp_alloc) {
+					u8 jitter;
+
+					prandom_bytes(&jitter, sizeof(jitter));
+					if ((jitter % randskip) == 0) {
+						/*
+						 * If we are choosing to skip, bump randskip to make it
+						 * successively less likely that we will skip again
+						 */
+						randskip++;
+						goto next_rgrp;
+					}
+				}
+				if (gfs2_rgrp_used_recently(rs, 1000) &&
+				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+					goto next_rgrp;
+			}
 		}
 		error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
 					   LM_ST_EXCLUSIVE, flags,
-- 
1.8.3.1



[Cluster-devel] [PATCH 2/2] GFS2: Avoid recently demoted rgrps.

2018-09-20 Thread Mark Syms
When under heavy I/O load from two or more hosts, resource group
allocation can result in glocks being bounced around between hosts.
Follow the example of inodes: if we have local waiters when asked to
demote the glock on a resource group, add a delay. Additionally, track
when we were last asked to demote a lock and, when assessing resource
groups in the allocator, prefer in the first two loop iterations not to
use resource groups where we've been asked to demote the glock within
the last second.

Signed-off-by: Mark Syms 
---
 fs/gfs2/glock.c  |  7 +--
 fs/gfs2/incore.h |  2 ++
 fs/gfs2/main.c   |  1 +
 fs/gfs2/rgrp.c   | 10 ++
 fs/gfs2/trace_gfs2.h | 12 +---
 5 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 4614ee2..94ef947 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -973,7 +973,9 @@ static void handle_callback(struct gfs2_glock *gl, unsigned int state,
}
if (gl->gl_ops->go_callback)
gl->gl_ops->go_callback(gl, remote);
-   trace_gfs2_demote_rq(gl, remote);
+   trace_gfs2_demote_rq(gl, remote, delay);
+   if (remote && !delay)
+   gl->gl_last_demote = jiffies;
 }
 
 void gfs2_print_dbg(struct seq_file *seq, const char *fmt, ...)
@@ -1339,7 +1341,8 @@ void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
gfs2_glock_hold(gl);
holdtime = gl->gl_tchange + gl->gl_hold_time;
if (test_bit(GLF_QUEUED, &gl->gl_flags) &&
-   gl->gl_name.ln_type == LM_TYPE_INODE) {
+   (gl->gl_name.ln_type == LM_TYPE_INODE ||
+gl->gl_name.ln_type == LM_TYPE_RGRP)) {
if (time_before(now, holdtime))
delay = holdtime - now;
if (test_bit(GLF_REPLY_PENDING, &gl->gl_flags))
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b96d39c..e3d5b10 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -366,6 +366,8 @@ struct gfs2_glock {
 gl_reply:8;/* Last reply from the dlm */
 
unsigned long gl_demote_time; /* time of first demote request */
+   unsigned long gl_last_demote; /* jiffies at last demote transition */
+
long gl_hold_time;
struct list_head gl_holders;
 
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 2d55e2c..2183c73 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -58,6 +58,7 @@ static void gfs2_init_glock_once(void *foo)
INIT_LIST_HEAD(&gl->gl_ail_list);
atomic_set(&gl->gl_ail_count, 0);
atomic_set(&gl->gl_revokes, 0);
+   gl->gl_last_demote = jiffies - (2 * HZ);
 }
 
 static void gfs2_init_gl_aspace_once(void *foo)
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 994eb7f..7b77bb2 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1955,6 +1955,12 @@ static bool gfs2_rgrp_used_recently(const struct gfs2_blkreserv *rs,
 	return tdiff > (msecs * 1000 * 1000);
 }
 
+static bool gfs2_rgrp_demoted_recently(const struct gfs2_blkreserv *rs,
+				       u32 max_age_jiffies, u32 loop)
+{
+	return time_before(jiffies, rs->rs_rbm.rgd->rd_gl->gl_last_demote + max_age_jiffies);
+}
+
 static u32 gfs2_orlov_skip(const struct gfs2_inode *ip)
 {
const struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
@@ -2077,6 +2083,10 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 						goto next_rgrp;
 					}
 				}
+
+			if (gfs2_rgrp_demoted_recently(rs, HZ, loops))
+				goto next_rgrp;
+
 			if (gfs2_rgrp_used_recently(rs, 1000) &&
 			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
 				goto next_rgrp;
diff --git a/fs/gfs2/trace_gfs2.h b/fs/gfs2/trace_gfs2.h
index e002525..79935dc 100644
--- a/fs/gfs2/trace_gfs2.h
+++ b/fs/gfs2/trace_gfs2.h
@@ -161,9 +161,9 @@ static inline u8 glock_trace_state(unsigned int state)
 /* Callback (local or remote) requesting lock demotion */
 TRACE_EVENT(gfs2_demote_rq,
 
-   TP_PROTO(const struct gfs2_glock *gl, bool remote),
+   TP_PROTO(const struct gfs2_glock *gl, bool remote, unsigned long delay),
 
-   TP_ARGS(gl, remote),
+   TP_ARGS(gl, remote, delay),
 
TP_STRUCT__entry(
__field(dev_t,  dev )
@@ -173,6 +173,8 @@ static inline u8 glock_trace_state(unsigned int state)
__field(u8, dmt_state   )
__field(unsigned long,  flags   )
__field(bool,   remote  )

[Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-09-20 Thread Mark Syms
While testing GFS2 as a storage repository for virtual machines we
discovered a number of scenarios where the performance was being
pathologically poor.

The scenarios are simplified to the following -

  * On a single host in the cluster, grow a number of files to a
significant proportion of the filesystem's LUN size, exceeding the
host's preferred resource group allocation. This can be replicated
by using fio and writing to 20 different files with a script like

[test-files]
directory=gfs2/a:gfs2/b:gfs2/c:gfs2/d:gfs2/e:gfs2/f:gfs2/g:gfs2/h:gfs2/i:gfs2/j:gfs2/k:gfs2/l:gfs2/m:gfs2/n:gfs2/o:gfs2/p:gfs2/q:gfs2/r:gfs2/s:gfs2/t
nrfiles=1
size=20G
bs=512k
rw=write
buffered=0
ioengine=libaio
fallocate=none
numjobs=20

After starting off at network wire speed, this will rapidly degrade,
with the fio process reporting a large sys time.

This was diagnosed as all the processes contending on the glock in
gfs2_inplace_reserve(), having all selected the same resource
group. Patch 1 addresses this with an optional module parameter
which enables behaviour to "randomly" skip a selected resource
group in the first two passes of gfs2_inplace_reserve() in order to
spread the processes out.

It is worth noting that this would probably also be addressed if the
comment in Documentation/gfs2-glocks.txt about eventually making
glock EX locally shared were made to happen. However, that looks
like it would require quite a bit of coordination and design, so
this stop-gap helps in the meantime.

  * With two or more hosts growing files at high data rates, the
throughput drops to a small proportion of the maximum storage
I/O. This is the "several VMs all writing to the filesystem"
scenario. Sometimes this test would run through clean at 80-90% of
storage wire speed, but at other times the performance would drop
on one or more hosts to a small number of KiB/s.

This was diagnosed as the different hosts repeatedly bouncing
resource group glocks between them as they selected the same
resource group (having exhausted the preferred groups).

Patch 2 addresses this by -
  * adding a hold delay to the resource group glock if there are
local waiters, following the pattern already in place for
inodes, this should also provide more data for
gfs2_rgrp_congested to work on.
  * remembering when we were last asked to demote the lock on a
resource group
  * in the first two passes in gfs2_inplace_reserve avoiding
resource groups where we have been asked to demote the glock
within the last second

Mark Syms (1):
  GFS2: Avoid recently demoted rgrps.

Tim Smith (1):
  Add some randomisation to the GFS2 resource group allocator

 fs/gfs2/glock.c  |  7 +--
 fs/gfs2/incore.h |  2 ++
 fs/gfs2/main.c   |  1 +
 fs/gfs2/rgrp.c   | 49 +
 fs/gfs2/trace_gfs2.h | 12 +---
 5 files changed, 62 insertions(+), 9 deletions(-)

-- 
1.8.3.1



Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-09-20 Thread Mark Syms
Thanks for that, Bob. We've been watching the changes going in upstream with
interest, but at the moment we're not really in a position to take advantage
of them.

Due to hardware vendor support certification requirements, XenServer can only
very occasionally make big kernel bumps that would affect the ABI the drivers
see, as that would require our hardware partners to recertify. So we're
currently on a 4.4.52 base, but the gfs2 driver is somewhat newer as it is
essentially self-contained and we can therefore backport changes more easily.
We currently have most of the GFS2 and DLM changes that are in 4.15 backported
into the XenServer 7.6 kernel, but we can't take the ones related to iomap as
they are more invasive, and it looks like a number of the more recent
performance-targeting changes are also predicated on the iomap framework.

As I mentioned in the covering letter, the intra-host problem would largely be
a non-issue if EX glocks were actually a host-wide thing, with local mutexes
used to share them within the host. I don't know if this is what your patch set
is trying to achieve or not. It's not so much that the selection of a resource
group is "random", just that there is a random chance that we won't select the
first RG that we test; it probably does work out much the same though.
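
To make that concrete, the skip schedule can be modelled outside the kernel.
The harness below is a purely hypothetical userspace sketch (the candidate
loop and the output are invented for illustration), but the jitter % randskip
test and the post-skip increment mirror what the first patch does, so the
chance of skipping falls from roughly 1/2 to 1/3, 1/4 and so on:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical userspace model of the skip logic in patch 1: a candidate
 * resource group is skipped when a random byte modulo randskip is zero,
 * and randskip is bumped after every skip so that each further skip
 * becomes progressively less likely (about 1/2, then 1/3, 1/4, ...).
 */
int main(void)
{
	unsigned int randskip = 2;	/* gfs2_skippy_rgrp_alloc (1) + 1 */
	unsigned int candidate;

	srand(time(NULL));
	for (candidate = 0; candidate < 20; candidate++) {
		unsigned int jitter = rand() & 0xff;	/* stands in for prandom_bytes() on a u8 */

		if ((jitter % randskip) == 0) {
			randskip++;	/* make the next skip less likely */
			printf("candidate %2u: skipped (randskip now %u)\n",
			       candidate, randskip);
			continue;
		}
		printf("candidate %2u: selected\n", candidate);
		break;
	}
	return 0;
}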

The inter-host problem addressed by the second patch seems to be less amenable
to avoidance, as the hosts don't seem to have a synchronous view of the state
of the resource group locks (for understandable reasons, as I'd expect this to
be very expensive to keep synced). So it seemed reasonable to try to make it
"expensive" to request a resource that someone else is using, and also to
avoid immediately grabbing it back if we've been asked to relinquish it. It
does seem to give a fairer balance to the usage without being massively
invasive.
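
For reference, the core of that second mechanism is just a one-second
cooling-off window keyed off a per-glock timestamp. The sketch below is a
hypothetical userspace model rather than an excerpt from the patch; the names
(model_rgrp, demoted_recently) are invented, but the time comparison
corresponds to gfs2_rgrp_demoted_recently() and gl_last_demote in patch 2:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical userspace model of the "avoid recently demoted rgrps" idea:
 * record when we were last asked to give a resource group up, and for the
 * first two allocator passes refuse to come straight back to it within a
 * one-second window.
 */
struct model_rgrp {
	struct timespec last_demote;	/* stands in for gl_last_demote (jiffies) */
};

static bool demoted_recently(const struct model_rgrp *rgd, double max_age_secs)
{
	struct timespec now;
	double age;

	clock_gettime(CLOCK_MONOTONIC, &now);
	age = (now.tv_sec - rgd->last_demote.tv_sec) +
	      (now.tv_nsec - rgd->last_demote.tv_nsec) / 1e9;
	return age < max_age_secs;
}

int main(void)
{
	struct model_rgrp rgd;

	/* Pretend a remote node asked us to demote this rgrp just now. */
	clock_gettime(CLOCK_MONOTONIC, &rgd.last_demote);
	printf("first two passes would %s this rgrp\n",
	       demoted_recently(&rgd, 1.0) ? "skip" : "use");
	return 0;
}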

We thought we should share these with the community anyway, even if they only
serve as inspiration for more detailed changes, and also to describe the
scenarios where we're seeing issues now that we have completed implementing
the XenServer support for GFS2 that we discussed back in Nuremberg last year.
In our testing they certainly make things better. They probably aren't fully
optimal, as we can't maintain 10g wire speed consistently across the full LUN,
but we're getting about 75%, which is certainly better than we were seeing
before we started looking at this.

Thanks,

    Mark.

-Original Message-
From: Bob Peterson  
Sent: 20 September 2018 18:18
To: Mark Syms 
Cc: cluster-devel@redhat.com; Ross Lagerwall ; Tim 
Smith 
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

- Original Message -
> While testing GFS2 as a storage repository for virtual machines we 
> discovered a number of scenarios where the performance was being 
> pathologically poor.
> 
> The scenarios are simplfied to the following -
> 
>   * On a single host in the cluster grow a number of files to a
> significant proportion of the filesystems LUN size, exceeding the
> hosts preferred resource group allocation. This can be replicated
> by using fio and writing to 20 different files with a script like

Hi Mark, Tim and all,

The performance problems with rgrp contention are well known, and have been for 
a very long time.

In rhel6 it's not as big of a problem because rhel6 gfs2 uses "try locks"
which distributes different processes to unique rgrps, thus keeping them from 
contending. However, it results in file system fragmentation that tends to 
catch up with you later.

I posted a different patch set to solve the problem a different way, by trying
to keep track of both inter-node and intra-node contention and redistributing
rgrps accordingly. It was similar to your first patch, but used a more
predictable distribution, whereas yours is random.
It worked very well, but it ultimately got rejected by Steve Whitehouse in 
favor of a better approach:

Our current plan is to allow rgrps to be shared among many processes on a 
single node. This alleviates the contention, improves throughput and 
performance, and fixes the "favoritism" problems gfs2 has today.
In other words, it's better than just redistributing the rgrps.

I did a proof-of-concept set of patches and saw pretty good performance numbers 
and "fairness" among simultaneous writers. I posted that a few months ago.

Your patch would certainly work, and random distribution of rgrps would
definitely gain performance, just as the Orlov algorithm does. However, I
still want to pursue what Steve suggested.

My patch set for this still needs some work because I found some bugs with how 
things are done, so it'll take time to get working properly.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-09-28 Thread Mark Syms
Hi Bob,

No, we haven't but it wouldn't be hard for us to replace our patches in our 
internal patchqueue with these and try them. Will let you know what we find.

We have also seen what we think is an unrelated issue, where we get the
following backtrace in kern.log and our system stalls:

Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked 
for more than 120 seconds.
Sep 21 21:19:09 cl15-05 kernel: [21389.462749]   Tainted: G   O
4.4.0+10 #1
Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python  D 
88019628bc90 0 15480  1 0x
Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  88019628bc90 
880198f11c00 88005a509c00 88019628c000
Sep 21 21:19:09 cl15-05 kernel: [21389.462795]  c90040226000 
88019628bd80 fe58 8801818da418
Sep 21 21:19:09 cl15-05 kernel: [21389.462799]  88019628bca8 
815a1cd4 8801818da5c0 88019628bd68
Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [] 
schedule+0x64/0x80
Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [] 
find_insert_glock+0x4a4/0x530 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [] ? 
gfs2_holder_wake+0x20/0x20 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [] 
gfs2_glock_get+0x3d/0x330 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [] 
do_flock+0xf2/0x210 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [] ? 
gfs2_getattr+0xe0/0xf0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [] ? 
cp_new_stat+0x10b/0x120
Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [] 
gfs2_flock+0x78/0xa0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [] 
SyS_flock+0x129/0x170
Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [] 
entry_SYSCALL_64_fastpath+0x12/0x71

We think there is a possibility, given that this code path only gets entered
if a glock is being destroyed, that there is a time-of-check/time-of-use issue
here, whereby by the time schedule gets called the thing which we expect to be
waking us up has completed dying and therefore won't trigger a wakeup for us.
We have only seen this a couple of times, in fairly intensive VM stress tests
where a lot of flocks get used on a small number of lock files (we use them to
ensure consistent behaviour of disk activation/deactivation and also access to
the database with the system state), but it's concerning nonetheless. We're
looking at replacing the call to schedule with schedule_timeout, with a
timeout of maybe HZ, to ensure that we will always get out of the schedule
operation and retry. Is this something you think you may have seen, or have
any ideas on?

Thanks,

Mark.

-Original Message-
From: Bob Peterson  
Sent: 28 September 2018 13:24
To: Mark Syms 
Cc: cluster-devel@redhat.com; Ross Lagerwall ; Tim 
Smith 
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

- Original Message -
> Thanks for that Bob, we've been watching with interest the changes 
> going in upstream but at the moment we're not really in a position to 
> take advantage of them.
> 
> Due to hardware vendor support certification requirements XenServer 
> can only very occasionally make big kernel bumps that would affect the 
> ABI that the driver would see as that would require our hardware partners to 
> recertify.
> So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat 
> newer as it is essentially self-contained and therefore we can 
> backport change more easily. We currently have most of the GFS2 and 
> DLM changes that are in
> 4.15 backported into the XenServer 7.6 kernel, but we can't take the 
> ones related to iomap as they are more invasive and it looks like a 
> number of the more recent performance targeting changes are also 
> predicated on the iomap framework.
> 
> As I mentioned in the covering letter, the intra host problem would 
> largely be a non-issue if EX glocks were actually a host wide thing 
> with local mutexes used to share them within the host. I don't know if 
> this is what your patch set is trying to achieve or not. It's not so 
> much that that selection of resource group is "random", just that 
> there is a random chance that we won't select the first RG that we 
> test, it probably does work out much the same though.
> 
> The inter host problem addressed by the second patch seems to be less 
> amenable to avoidance as the hosts don't seem to have a synchronous 
> view of the state of the resource group locks (for understandable 
> reasons as I'd expect thisto be very expensive to keep sync'd). So it 
> seemed reasonable to t

Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-09-28 Thread Mark Syms
Hi Bob,

The patches look quite good and would seem to help in the intra-node
congestion case, which is what our first patch was trying to address. We
haven't tried them yet, but I'll pull a build together and try to run it over
the weekend.

We don't, however, see that they would help in the situation we saw for the
second patch, where rgrp glocks would get bounced around between hosts at high
speed and cause lots of state flushing in the process, as the stats don't take
account of anything other than network latency, whereas there is more involved
with an rgrp glock when its state needs to be flushed.

Any thoughts on this?

Thanks,

Mark.

-Original Message-
From: Mark Syms 
Sent: 28 September 2018 13:37
To: 'Bob Peterson' 
Cc: cluster-devel@redhat.com; Tim Smith ; Ross Lagerwall 

Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

Hi Bob,

No, we haven't but it wouldn't be hard for us to replace our patches in our 
internal patchqueue with these and try them. Will let you know what we find.

We have also seen, what we think is an unrelated issue where we get the 
following backtrace in kern.log and our system stalls

Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
Sep 21 21:19:09 cl15-05 kernel: [21389.462749]   Tainted: G   O 4.4.0+10 #1
Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python  D 88019628bc90 0 15480  1 0x
Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  88019628bc90 880198f11c00 88005a509c00 88019628c000
Sep 21 21:19:09 cl15-05 kernel: [21389.462795]  c90040226000 88019628bd80 fe58 8801818da418
Sep 21 21:19:09 cl15-05 kernel: [21389.462799]  88019628bca8 815a1cd4 8801818da5c0 88019628bd68
Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [] schedule+0x64/0x80
Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [] find_insert_glock+0x4a4/0x530 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [] gfs2_glock_get+0x3d/0x330 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [] do_flock+0xf2/0x210 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [] ? cp_new_stat+0x10b/0x120
Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [] gfs2_flock+0x78/0xa0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [] SyS_flock+0x129/0x170
Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [] entry_SYSCALL_64_fastpath+0x12/0x71

We think there is a possibility, given that this code path only gets entered if 
a glock is being destroyed, that there is a time of check, time of use issue 
here where by the time that schedule gets called the thing which we expect to 
be waking us up has completed dying and therefore won't trigger a wakeup for 
us. We only seen this a couple of times in fairly intensive VM stress tests 
where a lot of flocks get used on a small number of lock files (we use them to 
ensure consistent behaviour of disk activation/deactivation and also access to 
the database with the system state) but it's concerning nonetheless. We're 
looking at replacing the call to schedule with schedule_timeout with a timeout 
of maybe HZ to ensure that we will always get out of the schedule operation and 
retry. Is this something you think you may have seen or have any ideas on?

Thanks,

Mark.

-Original Message-
From: Bob Peterson 
Sent: 28 September 2018 13:24
To: Mark Syms 
Cc: cluster-devel@redhat.com; Ross Lagerwall ; Tim 
Smith 
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

- Original Message -
> Thanks for that Bob, we've been watching with interest the changes 
> going in upstream but at the moment we're not really in a position to 
> take advantage of them.
> 
> Due to hardware vendor support certification requirements XenServer 
> can only very occasionally make big kernel bumps that would affect the 
> ABI that the driver would see as that would require our hardware partners to 
> recertify.
> So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat 
> newer as it is essentially self-contained and therefore we can 
> backport change more easily. We currently have most of the GFS2 and 
> DLM changes that are in
> 4.15 backported into the XenServer 7.6 kernel, but we can't take the 
> ones related to iomap as they are more invasive and it looks like a 
> number of the more recent performance targeting changes ar

Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-09-28 Thread Mark Syms
The hosts stayed up, and in the first occurrence of this (which we caught in
flight, as opposed to only seeing the aftermath in automation logs) the locks
all actually unwedged after about 2 hours.

Mark.

-Original Message-
From: Bob Peterson  
Sent: 28 September 2018 13:56
To: Mark Syms 
Cc: cluster-devel@redhat.com; Tim Smith ; Ross Lagerwall 
; Andreas Gruenbacher 
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

- Original Message -
> Hi Bob,
> 
> No, we haven't but it wouldn't be hard for us to replace our patches 
> in our internal patchqueue with these and try them. Will let you know what we 
> find.
> 
> We have also seen, what we think is an unrelated issue where we get 
> the following backtrace in kern.log and our system stalls
> 
> Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 
> blocked for more than 120 seconds.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462749]   Tainted: G   O
> 4.4.0+10 #1
> Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python  D
> 88019628bc90 0 15480  1 0x
> Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  88019628bc90
> 880198f11c00 88005a509c00 88019628c000 Sep 21 21:19:09 
> cl15-05 kernel: [21389.462795]  c90040226000
> 88019628bd80 fe58 8801818da418 Sep 21 21:19:09 
> cl15-05 kernel: [21389.462799]  88019628bca8
> 815a1cd4 8801818da5c0 88019628bd68 Sep 21 21:19:09 
> cl15-05 kernel: [21389.462803] Call Trace:
> Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  []
> schedule+0x64/0x80
> Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  []
> find_insert_glock+0x4a4/0x530 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [] ?
> gfs2_holder_wake+0x20/0x20 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  []
> gfs2_glock_get+0x3d/0x330 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  []
> do_flock+0xf2/0x210 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [] ?
> gfs2_getattr+0xe0/0xf0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [] ?
> cp_new_stat+0x10b/0x120
> Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  []
> gfs2_flock+0x78/0xa0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  []
> SyS_flock+0x129/0x170
> Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  []
> entry_SYSCALL_64_fastpath+0x12/0x71
> 
> We think there is a possibility, given that this code path only gets 
> entered if a glock is being destroyed, that there is a time of check, 
> time of use issue here where by the time that schedule gets called the 
> thing which we expect to be waking us up has completed dying and 
> therefore won't trigger a wakeup for us. We only seen this a couple of 
> times in fairly intensive VM stress tests where a lot of flocks get 
> used on a small number of lock files (we use them to ensure consistent 
> behaviour of disk activation/deactivation and also access to the 
> database with the system state) but it's concerning nonetheless. We're 
> looking at replacing the call to schedule with schedule_timeout with a 
> timeout of maybe HZ to ensure that we will always get out of the 
> schedule operation and retry. Is this something you think you may have seen 
> or have any ideas on?
> 
> Thanks,
> 
>   Mark.

Hi Mark,

It's very common to get call traces like that when one of the nodes in a 
cluster goes down and the other nodes all wait for the failed node to be 
fenced, etc. The node failure causes dlm to temporarily stop granting locks 
until the issue is resolved. This is expected behavior, and dlm recovery should 
eventually grant the lock once the node is properly removed from the cluster. I 
haven't seen it on an flock glock, because I personally don't often run with 
flocks, but it often happens to me with other glocks when I do recovery testing 
(which I've been doing a lot of lately).

So is there a node failure in your case? If there's a node failure, dlm should 
recover the locks and allow the waiter to continue normally. If it's not a node 
failure, it's hard to say... I know Andreas fixed some problems with the rcu 
locking we do to protect the glock rhashtable.
Perhaps the kernel you're using is missing one of his patches? Or maybe it's a 
new bug. Adding Andreas to the cc.

Regards,

Bob Peterson



Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-09-28 Thread Mark Syms
To give some context here, the environment we were testing this in looks like
this:

* 2 x XenServer hosts, Dell R430s with Xeon E5-2630 v3 CPUs and Intel X520 10g
NICs dedicated to the iSCSI traffic for GFS2 (only using one per host)
* Dedicated Linux filer packed with SSDs and 128GB of RAM. The native storage
can sustainably support > 5GB/s write throughput and the host (currently) has a
bonded pair of X710 10g NICs to serve the hosts.

So basically the storage is significantly faster than the network and will not 
be the bottleneck in these tests.

Whether what we observe here will change when we update the filer to have 6
10g NICs (planned in the next few weeks) remains to be seen; obviously we'll
need to add some more hosts to the cluster, but we have another 10 in the rack
so that isn't an issue.

Mark.

-Original Message-
From: Bob Peterson  
Sent: 28 September 2018 15:00
To: Tim Smith 
Cc: Steven Whitehouse ; Mark Syms ; 
cluster-devel@redhat.com; Ross Lagerwall 
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

- Original Message -
> I think what's happening for us is that the work that needs to be done 
> to release an rgrp lock is happening pretty fast and is about the same 
> in all cases, so the stats are not providing a meaningful distinction. 
> We see the same lock (or small number of locks) bouncing back and 
> forth between nodes with neither node seeming to consider them 
> congested enough to avoid, even though the FS is <50% full and there must be 
> plenty of other non-full rgrps.
> 
> --
> Tim Smith 

Hi Tim,

Interesting.
I've done experiments in the past where I allowed resource group glocks to take 
advantage of the "minimum hold time" which is today only used for inode glocks. 
In my experiments it's made no appreciable difference that I can recall, but it 
might be an interesting experiment for you to try.

Steve's right that we need to be careful not to improve one aspect of 
performance while causing another aspect's downfall, like improving intra-node 
congestion problems at the expense of inter-node congestion.

Regards,

Bob Peterson



Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-10-02 Thread Mark Syms
Just to follow up on this further: we made a change to swap the call to
schedule for schedule_timeout, and added a kprintf for the case where
schedule_timeout actually timed out rather than being woken. Running our
multi-VM, multi-host stress test we saw one instance of the kprintf message
and no stuck tasks at any point within a 24-hour stress test, so it looks like
this is beneficial, at least in our kernel.
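
As a rough userspace analogy of why the bounded wait helps (this is not the
kernel change itself, just an illustrative pthreads sketch with invented
names), a timed wait means a lost wakeup costs at most one timeout period
before the condition is re-checked, instead of sleeping forever:

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/*
 * Illustrative analogy of the schedule -> schedule_timeout(HZ) change:
 * wait with a bounded timeout and re-check the condition, so a lost
 * wakeup costs one timeout period rather than hanging the waiter.
 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int gone;	/* stands in for "dead glock removed from the hash table" */

static void *cleanup_thread(void *arg)
{
	sleep(2);	/* finish the cleanup late, after one timeout has fired */
	pthread_mutex_lock(&lock);
	gone = 1;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	return arg;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, cleanup_thread, NULL);
	pthread_mutex_lock(&lock);
	while (!gone) {
		struct timespec deadline;

		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_sec += 1;	/* roughly the HZ timeout in our change */
		if (pthread_cond_timedwait(&cond, &lock, &deadline) != 0)
			printf("wait timed out, re-checking\n");	/* cf. the kprintf */
	}
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	printf("proceeding\n");
	return 0;
}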

I'll clean the change up and send it on, but it's likely that it's only
required due to a race elsewhere, possibly one that has already been fixed in
your latest codebase. Our thoughts revolve around the code performing cleanup
on the hashtable possibly checking to see if there are any waiters that will
need to be notified on completion of the cleanup immediately before we start
to wait, and then not notifying as it's not aware of the waiter; but we have
no concrete proof of that, just the outcome that we get into schedule and
stick there.

Mark.   

-Original Message-
From: Mark Syms 
Sent: 28 September 2018 14:57
To: 'Bob Peterson' 
Cc: cluster-devel@redhat.com; Tim Smith ; Ross Lagerwall 
; Andreas Gruenbacher 
Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

The hosts stayed up and in the first occurrence of this (that we caught 
inflight as opposed to only seeing the aftermath in automation logs) the locks 
all actually unwedged after about 2 hours.

Mark.

-Original Message-
From: Bob Peterson 
Sent: 28 September 2018 13:56
To: Mark Syms 
Cc: cluster-devel@redhat.com; Tim Smith ; Ross Lagerwall 
; Andreas Gruenbacher 
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

- Original Message -
> Hi Bob,
> 
> No, we haven't but it wouldn't be hard for us to replace our patches 
> in our internal patchqueue with these and try them. Will let you know what we 
> find.
> 
> We have also seen, what we think is an unrelated issue where we get 
> the following backtrace in kern.log and our system stalls
> 
> Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 
> blocked for more than 120 seconds.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462749]   Tainted: G   O
> 4.4.0+10 #1
> Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python  D
> 88019628bc90 0 15480  1 0x
> Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  88019628bc90
> 880198f11c00 88005a509c00 88019628c000 Sep 21 21:19:09
> cl15-05 kernel: [21389.462795]  c90040226000
> 88019628bd80 fe58 8801818da418 Sep 21 21:19:09
> cl15-05 kernel: [21389.462799]  88019628bca8
> 815a1cd4 8801818da5c0 88019628bd68 Sep 21 21:19:09
> cl15-05 kernel: [21389.462803] Call Trace:
> Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  []
> schedule+0x64/0x80
> Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  []
> find_insert_glock+0x4a4/0x530 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [] ?
> gfs2_holder_wake+0x20/0x20 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  []
> gfs2_glock_get+0x3d/0x330 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  []
> do_flock+0xf2/0x210 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [] ?
> gfs2_getattr+0xe0/0xf0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [] ?
> cp_new_stat+0x10b/0x120
> Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  []
> gfs2_flock+0x78/0xa0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  []
> SyS_flock+0x129/0x170
> Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  []
> entry_SYSCALL_64_fastpath+0x12/0x71
> 
> We think there is a possibility, given that this code path only gets 
> entered if a glock is being destroyed, that there is a time of check, 
> time of use issue here where by the time that schedule gets called the 
> thing which we expect to be waking us up has completed dying and 
> therefore won't trigger a wakeup for us. We only seen this a couple of 
> times in fairly intensive VM stress tests where a lot of flocks get 
> used on a small number of lock files (we use them to ensure consistent 
> behaviour of disk activation/deactivation and also access to the 
> database with the system state) but it's concerning nonetheless. We're 
> looking at replacing the call to schedule with schedule_timeout with a 
> timeout of maybe HZ to ensure that we will always get out of the 
> schedule operation and retry. Is this something you think you may have seen 
> or have any ideas on?
> 
> Thanks,
> 
>   Mark.

Hi Mark,

It's very common to get call 

[Cluster-devel] Any ideas on this?

2018-10-02 Thread Mark Syms
Hi,

We've seen a couple of times in our automated tests that unmounting the GFS2
filesystem gets stuck and doesn't complete. It's just happened again, and the
stack for the umount process looks like:

[] flush_workqueue+0x1c8/0x520
[] gfs2_make_fs_ro+0x69/0x160 [gfs2]
[] gfs2_put_super+0xa9/0x1c0 [gfs2]
[] generic_shutdown_super+0x6f/0x100
[] kill_block_super+0x27/0x70
[] gfs2_kill_sb+0x71/0x80 [gfs2]
[] deactivate_locked_super+0x3b/0x70
[] deactivate_super+0x59/0x60
[] cleanup_mnt+0x58/0x80
[] __cleanup_mnt+0x12/0x20
[] task_work_run+0x7d/0xa0
[] exit_to_usermode_loop+0x73/0x98
[] syscall_return_slowpath+0x41/0x50
[] int_ret_from_sys_call+0x25/0x8f
[] 0x

Querying mount or /proc/mounts no longer shows the gfs2 filesystem, so it has
got part way through the process.

The test has just finished deleting a series of VMs which will have had data
written to them by multiple hosts, and the time gap between the last delete
completing and the umount is small, so this might play a part in things.

"glocktop -d 1 -r -H" on the stuck host shows it trying to get locks from a 
different host in the cluster where the referenced pids are no longer present.

Unmounting the fs on the host where the locks are supposedly held was 
successful but then the stuck host started reporting that it was stuck waiting 
for locks held locally by kworker threads.

Is there anything we should be looking at especially, possibly a fix that we
don't currently have in our kernel?

Thanks,

Mark.



[Cluster-devel] [PATCH 0/2] GFS2 locking patches

2018-10-08 Thread Mark Syms
A pair of locking issues in GFS2 observed when running VM storage stress
tests.

0001-GFS2-use-schedule-timeout-in-find-insert-glock.patch covers a case
where an application-level flock would wedge. The VM control plane makes
extensive use of flocks to control access to VM virtual disks and databases,
and we encountered several failed tests where the flocks did not get acquired
even when no one was holding them. Investigation indicates that there is a
race in find_insert_glock where schedule can be called after the thing we
expect to wake us has already completed its work. Replace schedule with
schedule_timeout and log when the timeout fires.

0002-GFS2-Flush-the-GFS2-delete-workqueue-before-stopping.patch covers a
case where umount would wedge unrecoverably. The completion of the stress
test involves the deletion of the test machines and virtual disks, followed
by the filesystem being unmounted on all hosts before the hosts are returned
to the lab pool. umount was found to wedge, and this has been traced to
gfs2_log_reserve being called during the flush_workqueue call but after the
associated kthread processes had been stopped. Thus there was nobody to handle
the log reserve request and the code wedged.

Mark Syms (1):
  GFS2: use schedule timeout in find insert glock

Tim Smith (1):
  GFS2: Flush the GFS2 delete workqueue before stopping the kernel
threads

 fs/gfs2/glock.c | 3 ++-
 fs/gfs2/super.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

-- 
1.8.3.1



[Cluster-devel] [PATCH 2/2] GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads

2018-10-08 Thread Mark Syms
From: Tim Smith 

Flushing the workqueue can cause operations to happen which might
call gfs2_log_reserve(), or get stuck waiting for locks taken by such
operations.  gfs2_log_reserve() can io_schedule(). If this happens, it
will never wake because the only thing which can wake it is gfs2_logd()
which was already stopped.

This causes umount of a gfs2 filesystem to wedge permanently if, for
example, the umount immediately follows a large delete operation.

When this occurred, the following stack trace was obtained from the
umount command:

[] flush_workqueue+0x1c8/0x520
[] gfs2_make_fs_ro+0x69/0x160 [gfs2]
[] gfs2_put_super+0xa9/0x1c0 [gfs2]
[] generic_shutdown_super+0x6f/0x100
[] kill_block_super+0x27/0x70
[] gfs2_kill_sb+0x71/0x80 [gfs2]
[] deactivate_locked_super+0x3b/0x70
[] deactivate_super+0x59/0x60
[] cleanup_mnt+0x58/0x80
[] __cleanup_mnt+0x12/0x20
[] task_work_run+0x7d/0xa0
[] exit_to_usermode_loop+0x73/0x98
[] syscall_return_slowpath+0x41/0x50
[] int_ret_from_sys_call+0x25/0x8f
[] 0x

Signed-off-by: Tim Smith 
Signed-off-by: Mark Syms 
---
 fs/gfs2/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index c212893..a971862 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -854,10 +854,10 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
if (error && !test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
return error;
 
+   flush_workqueue(gfs2_delete_workqueue);
kthread_stop(sdp->sd_quotad_process);
kthread_stop(sdp->sd_logd_process);
 
-   flush_workqueue(gfs2_delete_workqueue);
gfs2_quota_sync(sdp->sd_vfs, 0);
gfs2_statfs_sync(sdp->sd_vfs, 0);
 
-- 
1.8.3.1



[Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

2018-10-08 Thread Mark Syms
During a VM stress test we encountered a system lockup and kern.log contained

kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 
seconds.
kernel: [21389.462749] Tainted: G O 4.4.0+10 #1
kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
kernel: [21389.462783] python D 88019628bc90 0 15480 1 0x
kernel: [21389.462790] 88019628bc90 880198f11c00 88005a509c00 
88019628c000
kernel: [21389.462795] c90040226000 88019628bd80 fe58 
8801818da418
kernel: [21389.462799] 88019628bca8 815a1cd4 8801818da5c0 
88019628bd68
kernel: [21389.462803] Call Trace:
kernel: [21389.462815] [] schedule+0x64/0x80
kernel: [21389.462877] [] find_insert_glock+0x4a4/0x530 [gfs2]
kernel: [21389.462891] [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
kernel: [21389.462903] [] gfs2_glock_get+0x3d/0x330 [gfs2]
kernel: [21389.462928] [] do_flock+0xf2/0x210 [gfs2]
kernel: [21389.462933] [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
kernel: [21389.462938] [] ? cp_new_stat+0x10b/0x120
kernel: [21389.462943] [] gfs2_flock+0x78/0xa0 [gfs2]
kernel: [21389.462946] [] SyS_flock+0x129/0x170
kernel: [21389.462948] [] entry_SYSCALL_64_fastpath+0x12/0x71

On examination of the code it was determined that this code path
is only taken if the selected glock is marked as dead; the supposition
therefore is that by the time schedule was called the glock had been
cleaned up, and therefore nothing woke the schedule. Instead of calling
schedule, call schedule_timeout(HZ) so that at least we get a chance to
re-evaluate.

On repeating the stress test, the printk message was seen once in the logs
across four servers, with no further occurrences and no stuck task log
entries. This indicates that when the timeout occurred the code repeated
the lookup and did not find the same glock entry; since we had not been
woken, without the timeout we would never have been woken.

Signed-off-by: Mark Syms 
Signed-off-by: Tim Smith 
---
 fs/gfs2/glock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 4614ee2..0a59a01 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -758,7 +758,8 @@ static struct gfs2_glock *find_insert_glock(struct lm_lockname *name,
 	}
 	if (gl && !lockref_get_not_dead(&gl->gl_lockref)) {
 		rcu_read_unlock();
-		schedule();
+		if (schedule_timeout(HZ) == 0)
+			printk(KERN_INFO "find_insert_glock schedule timed out\n");
 		goto again;
 	}
 out:
-- 
1.8.3.1



Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

2018-10-08 Thread Mark Syms
That sounds entirely reasonable, so long as you are absolutely sure that
nothing is ever going to mess with that glock; we erred on the side of more
caution, not knowing whether it would be guaranteed safe or not.

Thanks,

Mark

-Original Message-
From: Steven Whitehouse  
Sent: 08 October 2018 13:56
To: Mark Syms ; cluster-devel@redhat.com
Cc: Ross Lagerwall ; Tim Smith 
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find 
insert glock

Hi,


On 08/10/18 13:36, Mark Syms wrote:
> During a VM stress test we encountered a system lockup and kern.log 
> contained
>
> kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 
> seconds.
> kernel: [21389.462749] Tainted: G O 4.4.0+10 #1
> kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> kernel: [21389.462783] python D 88019628bc90 0 15480 1 0x
> kernel: [21389.462790] 88019628bc90 880198f11c00 
> 88005a509c00 88019628c000
> kernel: [21389.462795] c90040226000 88019628bd80 
> fe58 8801818da418
> kernel: [21389.462799] 88019628bca8 815a1cd4 
> 8801818da5c0 88019628bd68
> kernel: [21389.462803] Call Trace:
> kernel: [21389.462815] [] schedule+0x64/0x80
> kernel: [21389.462877] [] 
> find_insert_glock+0x4a4/0x530 [gfs2]
> kernel: [21389.462891] [] ? 
> gfs2_holder_wake+0x20/0x20 [gfs2]
> kernel: [21389.462903] [] gfs2_glock_get+0x3d/0x330 
> [gfs2]
> kernel: [21389.462928] [] do_flock+0xf2/0x210 [gfs2]
> kernel: [21389.462933] [] ? gfs2_getattr+0xe0/0xf0 
> [gfs2]
> kernel: [21389.462938] [] ? cp_new_stat+0x10b/0x120
> kernel: [21389.462943] [] gfs2_flock+0x78/0xa0 
> [gfs2]
> kernel: [21389.462946] [] SyS_flock+0x129/0x170
> kernel: [21389.462948] [] 
> entry_SYSCALL_64_fastpath+0x12/0x71
>
> On examination of the code it was determined that this code path is 
> only taken if the selected glock is marked as dead, the supposition 
> therefore is that by the time schedule as called the glock had been 
> cleaned up and therefore nothing woke the schedule. Instead of calling 
> schedule, call schedule_timeout(HZ) so at least we get a chance to 
> re-evaluate.
>
> On repeating the stress test, the printk message was seen once in the 
> logs across four servers but no further occurences nor were there any 
> stuck task log entries. This indicates that when the timeout occured 
> the code repeated the lookup and did not find the same glock entry but 
> as we hadn't been woken this means that we would never have been woken.
>
> Signed-off-by: Mark Syms 
> Signed-off-by: Tim Smith 
> ---
>   fs/gfs2/glock.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index 4614ee2..0a59a01 
> 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -758,7 +758,8 @@ static struct gfs2_glock *find_insert_glock(struct 
> lm_lockname *name,
>   }
>   if (gl && !lockref_get_not_dead(&gl->gl_lockref)) {
>   rcu_read_unlock();
> - schedule();
> + if (schedule_timeout(HZ) == 0)
> + printk(KERN_INFO "find_insert_glock schedule timed 
> out\n");
>   goto again;
>   }
>   out:

That is a bit odd. In fact that whole function looks odd. I wonder why it needs 
to wait in the first place. It should be a simple comparison here. If the glock 
is dead then nothing else should touch it, so we are safe to add a new one into 
the hash table. The wait is almost certainly a bug,

Steve.




Re: [Cluster-devel] [PATCH 2/2] GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads

2018-10-08 Thread Mark Syms
Thanks Bob,

We're having a look at the bit that Tim sent earlier to see if switching the 
call order makes the schedule_timeout never occur.

As Steven said, it would be nice if we didn't have to worry about waiting for 
"dead" glocks to be cleaned up but that would require some considerable care to 
ensure that we don't just get the opposite race and get a newly created glock 
freed and deleted.

Mark

From: Bob Peterson 
Sent: Monday, 8 October 2018 21:10
To: Mark Syms 
CC: cluster-devel@redhat.com,Ross Lagerwall ,Tim 
Smith 
Subject: Re: [Cluster-devel] [PATCH 2/2] GFS2: Flush the GFS2 delete workqueue 
before stopping the kernel threads


- Original Message -
> From: Tim Smith 
>
> Flushing the workqueue can cause operations to happen which might
> call gfs2_log_reserve(), or get stuck waiting for locks taken by such
> operations.  gfs2_log_reserve() can io_schedule(). If this happens, it
> will never wake because the only thing which can wake it is gfs2_logd()
> which was already stopped.
>
> This causes umount of a gfs2 filesystem to wedge permanently if, for
> example, the umount immediately follows a large delete operation.
>
> When this occured, the following stack trace was obtained from the
> umount command
>
> [] flush_workqueue+0x1c8/0x520
> [] gfs2_make_fs_ro+0x69/0x160 [gfs2]
> [] gfs2_put_super+0xa9/0x1c0 [gfs2]
> [] generic_shutdown_super+0x6f/0x100
> [] kill_block_super+0x27/0x70
> [] gfs2_kill_sb+0x71/0x80 [gfs2]
> [] deactivate_locked_super+0x3b/0x70
> [] deactivate_super+0x59/0x60
> [] cleanup_mnt+0x58/0x80
> [] __cleanup_mnt+0x12/0x20
> [] task_work_run+0x7d/0xa0
> [] exit_to_usermode_loop+0x73/0x98
> [] syscall_return_slowpath+0x41/0x50
> [] int_ret_from_sys_call+0x25/0x8f
> [] 0x
>
> Signed-off-by: Tim Smith 
> Signed-off-by: Mark Syms 
> ---
Hi Mark, Tim, and all,

I pushed patch 2/2 upstream. For now I'll hold off on 1/2 but keep it
on my queue, pending our investigation.
https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=b7f5a2cd27b76e96fdc6d77b060dfdd877c9d0a9

Regards,

Bob Peterson


Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

2018-10-09 Thread Mark Syms
Having swapped the lines below around, we still see the timeout on schedule
fire, but only once in a fairly mega stress test. This is why we weren't
worried about the timeout being HZ: the situation is hardly ever hit, as
having to wait is rare and normally we are woken from schedule, and without a
timeout on schedule we would never wake up, so a rare occurrence of waiting a
second really doesn't seem too bad.

Mark.

-Original Message-
From: Tim Smith  
Sent: 08 October 2018 14:27
To: Steven Whitehouse 
Cc: Mark Syms ; cluster-devel@redhat.com; Ross Lagerwall 

Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find 
insert glock

On Monday, 8 October 2018 14:13:10 BST Steven Whitehouse wrote:
> Hi,
> 
> On 08/10/18 14:10, Tim Smith wrote:
> > On Monday, 8 October 2018 14:03:24 BST Steven Whitehouse wrote:
> >> On 08/10/18 13:59, Mark Syms wrote:
> >>> That sounds entirely reasonable so long as you are absolutely sure 
> >>> that nothing is ever going to mess with that glock, we erred on 
> >>> the side of more caution not knowing whether it would be guaranteed safe 
> >>> or not.
> >>> 
> >>> Thanks,
> >>> 
> >>>   Mark
> >> 
> >> We should have a look at the history to see how that wait got added.
> >> However the "dead" flag here means "don't touch this glock" and is 
> >> there so that we can separate the marking dead from the actual 
> >> removal from the list (which simplifies the locking during the 
> >> scanning procedures)
> > 
> > You beat me to it :-)
> > 
> > I think there might be a bit of a problem inserting a new entry with 
> > the same name before the old entry has been fully destroyed (or at 
> > least removed), which would be why the schedule() is there.
> 
> If the old entry is marked dead, all future lookups should ignore it. 
> We should only have a single non-dead entry at a time, but that 
> doesn't seem like it should need us to wait for it.

On the second call we do have the new glock to insert as arg2, so we could try 
to swap them cleanly, yeah.

> If we do discover that the wait is really required, then it sounds 
> like as you mentioned above there is a lost wakeup, and that must 
> presumably be on a code path that sets the dead flag and then fails to 
> send a wake up later on. If we can drop the wait in the first place, 
> that seems like a better plan,

Ooooh, I wonder if these two lines:

wake_up_glock(gl);
call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);

in gfs2_glock_free() are the wrong way round?

--
Tim Smith 





Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

2018-10-09 Thread Mark Syms
We think we have, just making a build to test.

Will follow up later.

Mark.

-Original Message-
From: Steven Whitehouse  
Sent: 09 October 2018 09:41
To: Mark Syms ; Tim Smith 
Cc: cluster-devel@redhat.com; Ross Lagerwall 
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find 
insert glock



On 09/10/18 09:13, Mark Syms wrote:
> Having swapped the line below around we still see the timeout on 
> schedule fire, but only once in a fairly mega stress test. This is why 
> we weren't worried about the timeout being HZ, the situation is hardly 
> ever hit as having to wait is rare and normally we are woken from 
> schedule and without a timeout on schedule we never wake up so a rare 
> occurrence of waiting a second really doesn't seem too bad.
>
> Mark.
We should still get to the bottom of why the wake up is missing though, since 
without that fix we won't know if there is something else wrong somewhere,

Steve.

>
> -Original Message-
> From: Tim Smith 
> Sent: 08 October 2018 14:27
> To: Steven Whitehouse 
> Cc: Mark Syms ; cluster-devel@redhat.com; Ross 
> Lagerwall 
> Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in 
> find insert glock
>
> On Monday, 8 October 2018 14:13:10 BST Steven Whitehouse wrote:
>> Hi,
>>
>> On 08/10/18 14:10, Tim Smith wrote:
>>> On Monday, 8 October 2018 14:03:24 BST Steven Whitehouse wrote:
>>>> On 08/10/18 13:59, Mark Syms wrote:
>>>>> That sounds entirely reasonable so long as you are absolutely sure 
>>>>> that nothing is ever going to mess with that glock, we erred on 
>>>>> the side of more caution not knowing whether it would be guaranteed safe 
>>>>> or not.
>>>>>
>>>>> Thanks,
>>>>>
>>>>>   Mark
>>>> We should have a look at the history to see how that wait got added.
>>>> However the "dead" flag here means "don't touch this glock" and is 
>>>> there so that we can separate the marking dead from the actual 
>>>> removal from the list (which simplifies the locking during the 
>>>> scanning procedures)
>>> You beat me to it :-)
>>>
>>> I think there might be a bit of a problem inserting a new entry with 
>>> the same name before the old entry has been fully destroyed (or at 
>>> least removed), which would be why the schedule() is there.
>> If the old entry is marked dead, all future lookups should ignore it.
>> We should only have a single non-dead entry at a time, but that 
>> doesn't seem like it should need us to wait for it.
> On the second call we do have the new glock to insert as arg2, so we could 
> try to swap them cleanly, yeah.
>
>> If we do discover that the wait is really required, then it sounds 
>> like as you mentioned above there is a lost wakeup, and that must 
>> presumably be on a code path that sets the dead flag and then fails 
>> to send a wake up later on. If we can drop the wait in the first 
>> place, that seems like a better plan,
> Ooooh, I wonder if these two lines:
>
>   wake_up_glock(gl);
>   call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);
>
> in gfs2_glock_free() are the wrong way round?
>
> --
> Tim Smith 
>
>




Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

2018-10-11 Thread Mark Syms
And, sadly, the change to wait all also doesn't prevent the schedule_timeout
from occurring. Something more subtle is going on, obviously.

Mark.

-Original Message-
From: Tim Smith  
Sent: 10 October 2018 09:23
To: cluster-devel@redhat.com
Cc: Andreas Gruenbacher ; Ross Lagerwall 
; Mark Syms 
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find 
insert glock

On Tuesday, 9 October 2018 16:08:10 BST Tim Smith wrote:
> On Tuesday, 9 October 2018 14:00:34 BST Andreas Gruenbacher wrote:
> > On Tue, 9 Oct 2018 at 14:46, Tim Smith  wrote:
> > > On Tuesday, 9 October 2018 13:34:47 BST you wrote:
> > > > There must be another reason for the missed wake-up. I'll have 
> > > > to study the code some more.
> > > 
> > > I think it needs to go into gfs2_glock_dealloc(), in such a way as 
> > > to avoid that problem. Currently working out a patch to do that.
> > 
> > That doesn't sound right: find_insert_glock is waiting for the glock 
> > to be removed from the rhashtable. In gfs2_glock_free, we remove the 
> > glock from the rhashtable and then we do the wake-up. Delaying the 
> > wakeup further isn't going to help (but it might hide the problem).
> 
> The only way we can get the problem we're seeing is if we get an 
> effective order of
> 
> T0: wake_up_glock()
> T1: prepare_to_wait()
> T1: schedule()
> 
> so clearly there's a way for that to happen. Any other order and 
> either
> schedule() doesn't sleep or it gets woken.
> 
> The only way I can see at the moment to ensure that wake_up_glock() 
> *cannot* get called until after prepare_to_wait() is to delay it until 
> the read-side critical sections are done, and the first place that's
> got that property is the start of gfs2_glock_dealloc(), unless we want 
> to add synchronize_rcu() to gfs2_glock_free() and I'm guessing there's
> a reason it's using
> call_rcu() instead.
> 
> I'll keep thinking about it.

OK, we have a result that this definitely *isn't* the right answer (still 
getting the wakeup happening). This lends more weight to the idea that there
are multiple waiters, so we'll try your patch.

--
Tim Smith 
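
The ordering in question is easier to see in a plain userspace analogy. This is
purely illustrative; it assumes nothing about the actual glock code and all of
the names below are made up. A wakeup can only be lost if the waiter samples its
condition before it has registered itself as a waiter; with the check made after
registration (here, under the mutex that protects the condition), the T0/T1
interleaving quoted above cannot leave the waiter asleep.

/*
 * Userspace sketch of the lost-wakeup pattern discussed above.
 * "gone" stands in for "the old glock has left the hash table"; the
 * names are hypothetical and not taken from the kernel code.
 * Build with: cc -pthread lost_wakeup.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool gone;

static void *waker(void *arg)          /* plays the part of T0 */
{
	(void)arg;
	pthread_mutex_lock(&lock);
	gone = true;                   /* "remove from the hash table" */
	pthread_cond_broadcast(&cond); /* "wake_up_glock()" */
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void *waiter(void *arg)         /* plays the part of T1 */
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!gone)                  /* condition re-checked after registering */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, waker, NULL);
	pthread_create(&t1, NULL, waiter, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	puts("waiter woke up; no wakeup was lost");
	return 0;
}

Whichever thread wins the race, the program always terminates, because the
condition is never sampled before the waiter is registered; the puzzle with the
glock wait is that the kernel pattern appears to be meant to give the same
guarantee.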





Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

2018-10-11 Thread Mark Syms
Also, the prepare_to_wait() call does so in non-exclusive mode, so switching the
parameter on the wakeup would have no effect, as non-exclusive waiters are
always woken anyway.

So, there is some other reason why we decide to yield but never get woken.

Mark.





Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing

2018-12-17 Thread Mark Syms
On Mon, Dec 17, 2018 at 09:58:47AM -0500, Bob Peterson wrote:
> Dave Teigland recommended. Unless I'm mistaken, Dave has said that 
> GFS2 should never withdraw; it should always just kernel panic (Dave, 
> correct me if I'm wrong). At least this patch confines that behavior 
> to a small subset of withdraws.

The basic idea is that you want to get a malfunctioning node out of the way as 
quickly as possible so others can recover and carry on.  Escalating a partial 
failure into a total node failure is the best way to do that in this case.  
Specialized recovery paths run from a partially failed node won't be as 
reliable, and are prone to blocking all the nodes.

I think a reasonable alternative to this is to just sit in an infinite retry 
loop until the i/o succeeds.

Dave
[Mark Syms] I would hope that this code would only trigger after some effort
has been put into retrying, as panicking the host on the first I/O failure seems
like a sure-fire way to get unhappy users (and in our case paying customers).
As Edvin points out, there may be other filesystems that would be able to cleanly
unmount and thus avoid having to check everything on restart.




Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing

2018-12-18 Thread Mark Syms
Hi Bob,

I agree, it's a hard problem. I'm just trying to understand that we've done the 
absolute best we can and that if this condition is hit then the best solution 
really is to just kill the node. I guess it's also a question of how common 
this actually ends up being. We have now got customers starting to use GFS2 for 
VM storage on XenServer so I guess we'll just have to see how many support 
calls we get in on it.

Thanks,

Mark.

-Original Message-
From: Bob Peterson  
Sent: 17 December 2018 20:20
To: Mark Syms 
Cc: cluster-devel@redhat.com
Subject: Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs 
writing

- Original Message -
> I think a reasonable alternative to this is to just sit in an infinite 
> retry loop until the i/o succeeds.
> 
> Dave
> [Mark Syms] I would hope that this code would only trigger after some
> effort has been put into retrying, as panicking the host on the first
> I/O failure seems like a sure-fire way to get unhappy users (and in
> our case paying customers). As Edvin points out, there may be other
> filesystems that would be able to cleanly unmount and thus avoid having
> to check everything on restart.

Hi Mark,

Perhaps. I'm not a block layer or iscsi expert, but afaik, it's not the file
system's job to retry IO, and never has been, right?

There are already iscsi tuning parameters, vfs tuning parameters, etc. So if an 
IO error is sent to GFS2 for a write operation, it means the retry algorithms 
and operation timeout algorithms built into the layers below us (the iscsi 
layer, scsi layer, block layer, tcp/ip layer etc.) have all failed and given up 
on the IO operation. We can't really justify adding yet another layer of 
retries on top of all that, can we?

I see your point, and perhaps the system should stay up to continue other 
mission-critical operations that may not require the faulted hardware.
But what's a viable alternative? As Dave T. suggested, we can keep resubmitting 
the IO until it completes, but then the journal never gets replayed and nobody 
can have those locks ever again, and that would cause a potential hang of the 
entire cluster, especially in cases where there's only one device that's failed 
and the whole cluster is using it.

In GFS2, we've got a concept of marking a resource group "in error" so the 
other nodes won't try to use it, but the same corruption that affects resource 
groups could be extrapolated to "hot" dinodes as well. For example, suppose the 
root (mount-point) dinode was in the journal. Now the whole cluster is hung 
rather than just the one node.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing

2018-12-18 Thread Mark Syms
Thanks Bob,

We believe we have seen these issues from time to time in our automated testing,
but I suspect that they indicate a configuration problem with the backing
storage. For flexibility, a proportion of our purely functional testing will use
storage provided by a VM running a software iSCSI target, and these tests seem
to be somewhat susceptible to I/O errors, some of which will inevitably
end up being in the journal. If we start to see a lot, we'll need to look at the
config of the VMs first, I think.

Mark.

-Original Message-
From: Bob Peterson  
Sent: 18 December 2018 15:52
To: Mark Syms 
Cc: cluster-devel@redhat.com
Subject: Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs 
writing

- Original Message -
> Hi Bob,
> 
> I agree, it's a hard problem. I'm just trying to understand that we've 
> done the absolute best we can and that if this condition is hit then 
> the best solution really is to just kill the node. I guess it's also a 
> question of how common this actually ends up being. We have now got 
> customers starting to use GFS2 for VM storage on XenServer so I guess 
> we'll just have to see how many support calls we get in on it.
> 
> Thanks,
> 
> Mark.

Hi Mark,

I don't expect the problem to be very common in the real world. 
The user has to get IO errors while writing to the GFS2 journal, which is not 
very common. The patch is basically reacting to a phenomenon we recently 
started noticing in which the HBA (qla2xxx) driver shuts down and stops 
accepting requests when you do abnormal reboots (which we sometimes do to test 
node recovery). In these cases, the node doesn't go down right away.
It stays up just long enough to cause IO errors with subsequent withdraws, 
which, we discovered, results in file system corruption.
Normal reboots, "/sbin/reboot -fin", and "echo b > /proc/sysrq-trigger" should 
not have this problem, nor should node fencing, etc.

And like I said, I'm open to suggestions on how to fix it. I wish there was a 
better solution.

As it is, I'd kind of like to get something into this merge window for the 
upstream kernel, but I'll need to submit the pull request for that probably 
tomorrow or Thursday. If we find a better solution, we can always revert these 
changes and implement a new one.

Regards,

Bob Peterson



[Cluster-devel] kernel BUG at fs/gfs2/inode.h:64

2019-01-08 Thread Mark Syms
Hi,

We've seen this in testing with 4.19.

Full trace is at the bottom.

Looking at the code, though, it looks like it will assert if the value of change
is equal to the number of blocks currently allocated to the inode. Is this
expected, or should the assert be using >= instead of >?

Thanks,

Mark

--

[2018-12-24 00:07:03 UTC] [ 3366.209273] kernel BUG at fs/gfs2/inode.h:64!

[2018-12-24 00:07:03 UTC] [ 3366.209305] invalid opcode:  [#1] SMP NOPTI

[2018-12-24 00:07:03 UTC] [ 3366.209327] CPU: 10 PID: 30664 Comm: kworker/10:1 
Tainted: G   O  4.19.0+0 #1

[2018-12-24 00:07:03 UTC] [ 3366.209352] Hardware name: HP ProLiant BL420c 
Gen8, BIOS I30 11/03/2014

[2018-12-24 00:07:03 UTC] [ 3366.209397] Workqueue: delete_workqueue 
delete_work_func [gfs2]

[2018-12-24 00:07:03 UTC] [ 3366.209437] RIP: 
e030:gfs2_add_inode_blocks.part.33+0x10/0x12 [gfs2]

[2018-12-24 00:07:03 UTC] [ 3366.209460] Code: 00 89 c2 48 c7 c7 80 bc 4c c0 31 
c0 e8 ec 44 c1 c0 e9 68 fc ff ff 0f 0b 0f 0b 48 8b 47 28 48 8b b8 08 04 00 00 
e8 bf dd ff ff <0f> 0b 0f 0b 0f 0b 66 66 66 66 90 0f 0b 48 8b 47 28 48 8b b8 08 
04

[2018-12-24 00:07:03 UTC] [ 3366.209512] RSP: e02b:c9004a53fbf0 EFLAGS: 
00010246

[2018-12-24 00:07:03 UTC] [ 3366.209532] RAX: 0043 RBX: 
8881709a4000 RCX: 

[2018-12-24 00:07:03 UTC] [ 3366.209556] RDX:  RSI: 
88818d496428 RDI: 88818d496428

[2018-12-24 00:07:03 UTC] [ 3366.209581] RBP: c9004a53fdb0 R08: 
 R09: 0553

[2018-12-24 00:07:03 UTC] [ 3366.209605] R10:  R11: 
c9004a53f968 R12: 888156a303d8

[2018-12-24 00:07:03 UTC] [ 3366.209629] R13: 00f47524 R14: 
8881709a3ac0 R15: 888188d31110

[2018-12-24 00:07:03 UTC] [ 3366.209683] FS:  7fa73976e700() 
GS:88818d48() knlGS:

[2018-12-24 00:07:03 UTC] [ 3366.209709] CS:  e033 DS:  ES:  CR0: 
80050033

[2018-12-24 00:07:03 UTC] [ 3366.209730] CR2: 7fca0def3000 CR3: 
000187fee000 CR4: 00042660

[2018-12-24 00:07:03 UTC] [ 3366.209802] Call Trace:

[2018-12-24 00:07:03 UTC] [ 3366.209835]  punch_hole+0xf4f/0x10d0 [gfs2]

[2018-12-24 00:07:03 UTC] [ 3366.209866]  ? __switch_to_asm+0x40/0x70

[2018-12-24 00:07:03 UTC] [ 3366.209907]  ? __switch_to_asm+0x40/0x70

[2018-12-24 00:07:03 UTC] [ 3366.209924]  ? __switch_to_asm+0x40/0x70

[2018-12-24 00:07:03 UTC] [ 3366.209949]  ? punch_hole+0xb3e/0x10d0 [gfs2]

[2018-12-24 00:07:03 UTC] [ 3366.209983]  gfs2_evict_inode+0x57b/0x680 [gfs2]

[2018-12-24 00:07:03 UTC] [ 3366.210036]  ? __inode_wait_for_writeback+0x75/0xe0

[2018-12-24 00:07:03 UTC] [ 3366.210066]  ? gfs2_evict_inode+0x1c7/0x680 [gfs2]

[2018-12-24 00:07:03 UTC] [ 3366.210090]  evict+0xc6/0x1a0

[2018-12-24 00:07:03 UTC] [ 3366.210151]  delete_work_func+0x60/0x70 [gfs2]

[2018-12-24 00:07:03 UTC] [ 3366.210177]  process_one_work+0x165/0x370

[2018-12-24 00:07:03 UTC] [ 3366.210197]  worker_thread+0x49/0x3e0

[2018-12-24 00:07:03 UTC] [ 3366.210216]  kthread+0xf8/0x130

[2018-12-24 00:07:03 UTC] [ 3366.210235]  ? rescuer_thread+0x310/0x310

[2018-12-24 00:07:03 UTC] [ 3366.210284]  ? kthread_bind+0x10/0x10

[2018-12-24 00:07:04 UTC] [ 3366.210301]  ret_from_fork+0x35/0x40

[2018-12-24 00:07:04 UTC] [ 3366.210319] Modules linked in: tun nfsv3 nfs_acl 
nfs lockd grace fscache gfs2 dlm iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi 8021q garp mrp stp llc openvswitch nsh nf_nat_ipv6 
nf_nat_ipv4 nf_conncount nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp 
xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c 
iptable_filter dm_multipath sunrpc sb_edac intel_powerclamp crct10dif_pclmul 
crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd 
glue_helper dm_mod sg hpilo lpc_ich intel_rapl_perf psmouse ipmi_si 
ipmi_devintf ipmi_msghandler nls_utf8 isofs acpi_power_meter loop xen_wdt 
ip_tables x_tables sd_mod uhci_hcd serio_raw ehci_pci ehci_hcd hpsa igb(O) 
scsi_transport_sas scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod 
ipv6 crc_ccitt

[2018-12-24 00:07:04 UTC] [ 3366.210663] ---[ end trace d36fcbebd28b36e9 ]---



Re: [Cluster-devel] kernel BUG at fs/gfs2/inode.h:64

2019-01-09 Thread Mark Syms
We don't yet know how the assert got triggered, as we've only seen it once, and
in its original form it looks like it would be very hard to trigger in any
normal case (given that in default usage i_blocks should be at least 8 times
what any putative value for change could be). So, for the assert to have
triggered, we must have been asked to remove at least 8 times the number of
blocks currently allocated to the inode. Possible causes could be a double
release or some other higher-level bug that will require further investigation
to uncover.

Mark.

-Original Message-
From: Andreas Gruenbacher  
Sent: 09 January 2019 13:31
To: Tim Smith 
Cc: cluster-devel ; Igor Druzhinin 
; Mark Syms 
Subject: Re: [Cluster-devel] kernel BUG at fs/gfs2/inode.h:64

Mark and Tim,

On Wed, 9 Jan 2019 at 13:05, Tim Smith  wrote:
> On Tuesday, 8 January 2019 13:32:20 GMT Mark Syms wrote:
> > Hi,
> >
> > We've seen this in testing with 4.19.
> >
> > Full trace is at the bottom.
> >
> > Looking at the code though it looks like it will assert if the value 
> > of change is equal to the number of blocks currently allocated to the inode.
> > Is this expected or should the assert be using >= instead of > ?
>
> It's weirder. Looking at
>
> static inline void gfs2_add_inode_blocks(struct inode *inode, s64 change)
> {
> 	gfs2_assert(GFS2_SB(inode), (change >= 0 || inode->i_blocks > -change));
> 	change *= (GFS2_SB(inode)->sd_sb.sb_bsize/GFS2_BASIC_BLOCK);
> 	inode->i_blocks += change;
> }
>
> where the BUG is happening, "inode->i_blocks" is a count of 512b
> blocks and "change" is in units of whatever-the-superblock-said, which
> will probably be 4096b blocks, so the comparison makes no sense.

indeed. I wonder why this hasn't been discovered long ago.

> It should probably read
>
> static inline void gfs2_add_inode_blocks(struct inode *inode, s64 change)
> {
> 	change *= (GFS2_SB(inode)->sd_sb.sb_bsize/GFS2_BASIC_BLOCK);
> 	gfs2_assert(GFS2_SB(inode), (change >= 0 || inode->i_blocks >= -change));
> 	inode->i_blocks += change;
> }
>
> so I'll make a patch for that unless someone wants to correct me.

You can just shift by (inode->i_blkbits - 9).

Can you still trigger the assert with this fix?

Thanks,
Andreas
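
To make the unit mismatch concrete, here is a quick standalone arithmetic check
(plain userspace C; the sizes are assumptions for the example, not values read
from a real superblock). With 4096-byte filesystem blocks, i_blocks counts
512-byte units, so the unscaled comparison pits sectors against filesystem
blocks: it stays quiet even when a caller frees more blocks than the inode
owns, and only trips once the request reaches about eight times the allocated
count, which matches the observation earlier in the thread.

/*
 * Illustration only: mimics the compare-then-scale order of the assert
 * discussed above, with assumed block sizes.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const int64_t sb_bsize = 4096;    /* assumed filesystem block size */
	const int64_t basic_block = 512;  /* 512-byte basic blocks */
	int64_t i_blocks = 8;             /* one 4K block, counted in 512B units */
	int64_t change = -2;              /* caller tries to drop two fs blocks */

	/* Order as originally written: compare first, scale afterwards. */
	printf("unscaled: i_blocks > -change  -> %s\n",
	       (i_blocks > -change) ? "passes" : "asserts");

	/* Order as proposed: scale change into 512-byte units, then compare. */
	int64_t scaled = change * (sb_bsize / basic_block);
	printf("scaled:   i_blocks >= -scaled -> %s\n",
	       (i_blocks >= -scaled) ? "passes" : "asserts");
	return 0;
}

The first check passes even though the caller is freeing twice what the inode
holds; the scaled check catches it.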



Re: [Cluster-devel] kernel BUG at fs/gfs2/inode.h:64

2019-01-09 Thread Mark Syms
So, actually, the original comparison in the assert (dodgy scaling factor
notwithstanding) was probably correct, in that we don't ever want to remove all
blocks from the inode, as one of them is used for the inode itself? Or do we
still think it should allow change to be the negative of the current block
count?

Mark

From: Andreas Gruenbacher 
Sent: Wednesday, 9 January 2019 17:24
To: Tim Smith 
CC: Mark Syms ,cluster-devel 
,Igor Druzhinin 
Subject: Re: [Cluster-devel] kernel BUG at fs/gfs2/inode.h:64


On Wed, 9 Jan 2019 at 18:14, Tim Smith  wrote:
>
> On Wednesday, 9 January 2019 15:35:05 GMT Andreas Gruenbacher wrote:
> > On Wed, 9 Jan 2019 at 14:43, Mark Syms  wrote:
> > > We don't yet know how the assert got triggered as we've only seen it once
> > > and in the original form it looks like it would be very hard to trigger in
> > > any normal case (given that in default usage i_blocks should be at least 8
> > > times what any putative value for change could be). So, for the assert to
> > > have triggered we've been asked to remove at least 8 times the number of
> > > blocks currently allocated to the inode. Possible causes could be a double
> > > release or some other higher level bug that will require further
> > > investigation to uncover.
> >
> > The following change has at least survived xfstests:
> >
> > --- a/fs/gfs2/inode.h
> > +++ b/fs/gfs2/inode.h
> > @@ -61,8 +61,8 @@ static inline u64 gfs2_get_inode_blocks(const struct inode *inode)
> >
> >  static inline void gfs2_add_inode_blocks(struct inode *inode, s64 change)
> >  {
> > -	gfs2_assert(GFS2_SB(inode), (change >= 0 || inode->i_blocks > -change));
> > -	change *= (GFS2_SB(inode)->sd_sb.sb_bsize/GFS2_BASIC_BLOCK);
> > +	change <<= inode->i_blkbits - 9;
> > +	gfs2_assert(GFS2_SB(inode), change >= 0 || inode->i_blocks >= -change);
> >  	inode->i_blocks += change;
> >  }
> >
> > Andreas
>
> I'll use
>
> change <<= (GFS2_SB(inode)->sd_sb.sb_bsize_shift - GFS2_BASIC_BLOCK_SHIFT);
>
> for consistency with the gfs2_get/set_inode_blocks(). I'll send the patch in a
> bit.
>
> Given what it was like before, either i_blocks was already 0 or -change
> somehow became stupidly huge. Anything else looks like it would be hard to
> reproduce without mkfs.gfs2 -b 512 (so that GFS2_SB(inode)->sd_sb.sb_bsize ==
> GFS2_BASIC_BLOCK) which we don't do.
>
> I'll try to work out what could have caused it and see if we can provoke it
> again.
>
> Out of curiosity I did a few tests where I created a file on GFS2, copied /
> dev/null on top of it, and then ran stat on the file. It seems like GFS2 never
> frees the last allocation on truncate; stat always reports 1, 2, 4 or 8 blocks
> in use for a zero-length file depending on the underlying filesystem block
> size, unlike (say) ext3/4 where it reports 0. I presume this is intentional so
> maybe some corner case where it *is* trying to do that is the root of the
> problem.

Yes, that's intentional. An empty file without extended attributes
consumes one block. Extended attributes consume at least one
additional block.

Andreas


[Cluster-devel] [PATCH] Retry wait_event_interruptible in event of ERESTARTSYS

2019-02-01 Thread Mark Syms
We saw an issue in a production server on a customer deployment where
DLM 4.0.7 gets "stuck" and unable to join new lockspaces.

See - https://lists.clusterlabs.org/pipermail/users/2019-January/016054.html

This was forwarded off list to David Teigland who responded thusly.

"
Hi, thanks for the debugging info.  You've spent more time looking at
this than I have, but from a first glance it seems to me that the
initial problem (there may be multiple) is that in the kernel,
lockspace.c do_event() does not sensibly handle the ERESTARTSYS error
from wait_event_interruptible().  I think do_event() should continue
waiting for a uevent result from userspace until it gets one, because
the kernel can't do anything sensible until it gets that.

Dave
"

This change does that. We have it running in automation with no problems
so far but comments welcome.

Mark Syms (1):
  Retry wait_event_interruptible in event of ERESTARTSYS

 fs/dlm/lockspace.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

-- 
1.8.3.1



[Cluster-devel] [PATCH] Retry wait_event_interruptible in event of ERESTARTSYS

2019-02-01 Thread Mark Syms
Signed-off-by: Mark Syms 
---
 fs/dlm/lockspace.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index db43b98..7aae23a 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -210,8 +210,10 @@ static int do_uevent(struct dlm_ls *ls, int in)
/* dlm_controld will see the uevent, do the necessary group management
   and then write to sysfs to wake us */
 
-	error = wait_event_interruptible(ls->ls_uevent_wait,
-			test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
+	do {
+		error = wait_event_interruptible(ls->ls_uevent_wait,
+				test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
+	} while (error == ERESTARTSYS);
 
log_rinfo(ls, "group event done %d %d", error, ls->ls_uevent_result);
 
-- 
1.8.3.1
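
The retry has the same shape as the usual userspace idiom of restarting an
interruptible wait when a signal cuts it short. A minimal, self-contained
analogy (nothing below is DLM code; the semaphore and the alarm signal are
stand-ins) looks like this:

/*
 * Userspace analogy of the do/while retry in the patch above: keep
 * waiting until the wait finishes for a reason other than an
 * interrupting signal.  Build with: cc -pthread retry_wait.c
 */
#include <errno.h>
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static sem_t done;

static void on_alarm(int sig)
{
	(void)sig;
	sem_post(&done);                /* sem_post() is async-signal-safe */
}

int main(void)
{
	struct sigaction sa;
	int ret;

	sem_init(&done, 0, 0);
	memset(&sa, 0, sizeof(sa));
	sigemptyset(&sa.sa_mask);
	sa.sa_handler = on_alarm;       /* no SA_RESTART, so the wait can see EINTR */
	sigaction(SIGALRM, &sa, NULL);
	alarm(1);

	do {
		ret = sem_wait(&done);  /* may be interrupted by SIGALRM */
	} while (ret == -1 && errno == EINTR);

	printf("wait completed, ret=%d\n", ret);
	return 0;
}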



Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in  gfs2_file_write_iter

2019-03-16 Thread Mark Syms
Sadly, this doesn't help and seems to make the situation worse. Our automated 
tests were previously seeing about a 5% failure rate and with this patch it's 20%.
We still need to verify that they're all down to the same failure, but I thought
it better to give some early feedback.

Surprised that the script didn't run for you but the limit on dirty bytes was 
just to get it to occur more frequently. You may have success with larger 
values.

Mark

Message: 1
Date: Fri, 15 Mar 2019 21:58:12 +0100
From: Andreas Gruenbacher <agrue...@redhat.com>
To: Ross Lagerwall <ross.lagerw...@citrix.com>
Cc: igor.druzhi...@citrix.com, Sergey Dyasli <sergey.dya...@citrix.com>,
Mark Syms <mark.s...@citrix.com>, cluster-devel@redhat.com
Subject: Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in
gfs2_file_write_iter
Message-ID: <20190315205812.22727-1-agrue...@redhat.com>
Content-Type: text/plain; charset=UTF-8

Hi Ross,

On Thu, 14 Mar 2019 at 12:18, Ross Lagerwall <ross.lagerw...@citrix.com> wrote:
> On 3/13/19 5:13 PM, Andreas Gruenbacher wrote:
> > Hi Edwin,
> >
> > On Wed, 6 Mar 2019 at 12:08, Edwin Török <edvin.to...@citrix.com> wrote:
> >> Hello,
> >>
> >> I've been trying to debug a GFS2 deadlock that we see in our lab
> >> quite frequently with a 4.19 kernel. With 4.4 and older kernels we
> >> were not able to reproduce this.
> >> See below for lockdep dumps and stacktraces.
> >
> > thanks for the thorough bug report.  Does the below fix work for
> > you?
> >
> Hi Andreas,
>
> I've tested the patch and it doesn't fix the issue. As far as I can see,
> current->backing_dev_info is not used by any of the code called from
> balance_dirty_pages_ratelimited() so I don't see how it could work.

yes, I see now.

> I found a way of consistently reproducing the issue almost immediately
> (tested with the latest master commit):
>
> # cat a.py
> import os
>
> fd = os.open("f", os.O_CREAT|os.O_TRUNC|os.O_WRONLY)
>
> for i in range(1000):
>     os.mkdir("xxx" + str(i), 0777)
>
> buf = 'x' * 4096
>
> while True:
>     count = os.write(fd, buf)
>     if count <= 0:
>         break
>
> # cat b.py
> import os
> while True:
>os.mkdir("x", 0777)
>os.rmdir("x")
>
> # echo 8192 > /proc/sys/vm/dirty_bytes
> # cd /gfs2mnt
> # (mkdir tmp1; cd tmp1; python2 ~/a.py) &
> # (mkdir tmp2; cd tmp2; python2 ~/a.py) &
> # (mkdir tmp3; cd tmp3; python2 ~/b.py) &
>
> This should deadlock almost immediately. One of the processes will be
> waiting in balance_dirty_pages() and holding sd_log_flush_lock and
> several others will be waiting for sd_log_flush_lock.

This doesn't work for me: the python processes don't even start properly
when dirty_bytes is set so low.

> I came up with the following patch which seems to resolve the issue by
> failing to write the inode if it can't take the lock, but it seems
> like a dirty workaround rather than a proper fix:
>
> [...]

Looking at ext4_dirty_inode, it seems that we should just be able to
bail out of gfs2_write_inode and return 0 when PF_MEMALLOC is set in
current->flags.

Also, we should probably add the current->flags checks from
xfs_do_writepage to gfs2_writepage_common.

So what do you get with the below patch?

Thanks,
Andreas

---
 fs/gfs2/aops.c  | 7 +++
 fs/gfs2/super.c | 4 
 2 files changed, 11 insertions(+)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f..694ff91 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -102,6 +102,13 @@ static int gfs2_writepage_common(struct page *page,
 	pgoff_t end_index = i_size >> PAGE_SHIFT;
 	unsigned offset;
 
+	/* (see xfs_do_writepage) */
+	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+			 PF_MEMALLOC))
+		goto redirty;
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
+		goto redirty;
+
 	if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
 		goto out;
 	if (current->journal_info)
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index ca71163..540535c 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -756,6 +756,10 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 	int ret = 0;
 	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
 
+	/* (see ext4_dirty_inode) */
+	if (current->flags & PF_MEMALLOC)
+		return 0;
+

Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter

2019-03-18 Thread Mark Syms
-Original Message-
From: Andreas Gruenbacher  
Sent: 17 March 2019 20:06
To: Mark Syms 
Cc: cluster-devel@redhat.com; Sergey Dyasli ; Igor 
Druzhinin ; Edvin Torok 
Subject: Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in 
gfs2_file_write_iter

On Sun, 17 Mar 2019 at 00:59, Mark Syms  wrote:
>
> Sadly, this doesn't help and seems to make the situation worse. Our automated 
> tests were previously seeing about a 5% failure rate and with this patch it's
> 20%.

So the PF_MEMALLOC flag doesn't get set on that code path. In that case, we 
should probably check wbc->sync_mode instead. How about the following?

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 540535c..36de2f7 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -757,7 +757,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
 
 	/* (see ext4_dirty_inode) */
-	if (current->flags & PF_MEMALLOC)
+	if (wbc->sync_mode != WB_SYNC_ALL)
 		return 0;
 
 	if (flush_all)

It may help to wrap that condition in WARN_ON_ONCE for debugging.

[Mark Syms] That works better. We were wondering whether it might be a bit too
aggressive, though, in that it skips writing the inode entirely unless we have
WB_SYNC_ALL, whereas the patch that Ross Lagerwall posted originally would use a
trylock in the case where we don't have WB_SYNC_ALL and then fail out. Whether
this should just silently return 0 or -EAGAIN as Ross' patch does is a matter of
style, I guess. What are your thoughts on this?

Thanks,

Mark.
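
For comparison, the two behaviours being weighed here look roughly like this in
generic form. This is a sketch of the pattern only, written against a plain
pthread rwlock; none of the names come from the gfs2 code, and the real patches
differ in detail.

/*
 * Variant A skips the write entirely unless this is a data-integrity
 * sync; variant B tries the lock and reports -EAGAIN so the caller
 * knows the write was skipped.  Both are hypothetical sketches.
 */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t flush_lock = PTHREAD_RWLOCK_INITIALIZER;

static int write_inode_skip(bool sync_all)
{
	if (!sync_all)
		return 0;               /* silently defer to a later sync */
	pthread_rwlock_rdlock(&flush_lock);
	/* ... write the inode ... */
	pthread_rwlock_unlock(&flush_lock);
	return 0;
}

static int write_inode_trylock(bool sync_all)
{
	if (!sync_all) {
		if (pthread_rwlock_tryrdlock(&flush_lock) != 0)
			return -EAGAIN; /* tell writeback to come back later */
	} else {
		pthread_rwlock_rdlock(&flush_lock);
	}
	/* ... write the inode ... */
	pthread_rwlock_unlock(&flush_lock);
	return 0;
}

int main(void)
{
	printf("skip:    %d\n", write_inode_skip(false));
	printf("trylock: %d\n", write_inode_trylock(false));
	return 0;
}

Variant A quietly defers the write until the next WB_SYNC_ALL pass, while
variant B lets the caller see that the write was skipped; which of those is
preferable is exactly the style question above.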




[Cluster-devel] GFS2 rm can be very slow

2019-04-26 Thread Mark Syms
One of our users has just been in touch and reported slow file deletion (in
this case of the virtual disk for a VM), which was particularly impactful for
the Citrix Hypervisor control code, as we hold a number of locks while deleting
VM virtual disks; in this case the file delete took ~40 seconds to complete.

Now, we can, and will, work around this in the hypervisor control code by
dropping the database entries under the locks and then letting the actual file
deletion happen under the control of our background garbage collection
process, but it led me to wonder whether the userspace rm operation couldn't do
something relatively simple to the file's inode data and then leave the actual
resource group purging to happen in the background. This is obviously more
complex to handle: in the case where an rm occurs and there is immediately a
demand for blocks, and the only blocks possibly available were assigned to the
rm'd file (i.e. the fs was full and the file was rm'd to make space), the block
allocator would need to wait for the cleanup to occur. Would this be something
worth considering as a future improvement, or is it just too complicated to
envisage?

Mark.


Re: [Cluster-devel] GFS2 rm can be very slow

2019-04-26 Thread Mark Syms
Thanks Bob,

I only have the aftermath logs that show the delete being long. My suspicion
would be that the file in question was a reasonable size and fragmented
(possibly a consequence of some of the patches we have to avoid rgrp contention
on allocation), and that this needed to acquire a significant number of rgrp
glocks in order to clear them.

Mark

From: Bob Peterson 
Sent: Friday, 26 April 2019 16:45
To: Mark Syms 
CC: cluster-devel@redhat.com
Subject: Re: [Cluster-devel] GFS2 rm can be very slow


- Original Message -
> One of our users has just been in touch and reported slow file deletion (in
> this case the virtual disk for a VM) which was particularly impactful in the
> case of the Citrix Hypervisor control code as we hold a number of locks
> while deleting VM virtual disks and in this case the file delete took ~40
> seconds to complete.
>
> Now, we can, and will, work around this in the hypervisor control code by
> dropping the database entries under the locks and then leave the actual file
> deletion process occur under the control of our background garbage
> collection process but it led me to wonder whether the userspace rm
> operation couldn't do something relatively simple to the file's inode data
> and then leave the actual resource group purging to happen in the
> background? This is obviously more complex to handle and in the case where
> an rm occurs and then there is immediately a demand for blocks where the
> only blocks possibly available were assigned to the rm'd file (i.e. the fs
> was full and the file was rm'd to make space) the block allocator would need
> to wait for the cleanup to occur. Would this be something worth considering
> as a future improvement or is it just too complicated to envisage?
>
> Mark.
>
Hi Mark,

I'd let Andreas comment on this, since he was last to work on that part of gfs2,
but he's taking the day off today (back Monday). Maybe he'll comment on Monday.

We should try to find out what part of the delete process is taking all the 
time here.

After all, the unlink part of it should be relatively fast because the freeing
of the blocks is done later. If the files are really big or really fragmented,
we can sometimes spend a lot of time waiting for many rgrp glocks, at least in
the older versions of the code. The newer versions, where we got rid of the
recursive delete, should only need to lock one rgrp at a time, so that should
not be an issue.

The actual truncating of the file might take time, flushing transactions and 
such,
especially since a delete forces us to read all the indirect metadata blocks in
before freeing them. I think some versions had a broken read-ahead for that
part of the code, but Andreas would remember.

Or maybe we're waiting to grab the glock of the directory we're deleting from?
For example, maybe there's a "hot" directory that's used in read mode by lots
of processes across several nodes and we need it in rw mode to remove the 
dirent.

I suppose a finely crafted systemtap script would help figure this all out.

Also, what version of gfs2 is running slow?

Regards,

Bob Peterson