Re: [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Submit all ordered writes in bulk, wait after that

2018-05-15 Thread Andreas Gruenbacher
Hi Bob,

On 14 May 2018 at 16:18, Bob Peterson  wrote:
> Before this patch, the ordered_write function would submit all
> the ordered writes with filemap_fdatawrite, which waited for each
> one to complete before moving on to the next. This patch allows it
> to submit them all, then wait for them after they're submitted.

this looks reasonable.  I think it's not very helpful to reuse
gfs2_ordered_wait here though: the following incremental patch should
work just as well or better.

Andreas

---
 fs/gfs2/log.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index b05d0fbc3d05..2a9b1c008325 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -577,12 +577,15 @@ static void gfs2_ordered_write(struct gfs2_sbd *sdp)
continue;
}
 		list_move_tail(&ip->i_ordered, &written);
-		spin_unlock(&sdp->sd_ordered_lock);
-		mapping->a_ops->writepages(mapping, &wbc);
-		spin_lock(&sdp->sd_ordered_lock);
 	}
 	spin_unlock(&sdp->sd_ordered_lock);
-	gfs2_ordered_wait(sdp, &written, &sdp->sd_log_le_ordered);
+	list_for_each_entry(ip, &written, i_ordered)
+		mapping->a_ops->writepages(ip->i_inode.i_mapping, &wbc);
+	list_for_each_entry(ip, &written, i_ordered)
+		filemap_fdatawait(ip->i_inode.i_mapping);
+	spin_lock(&sdp->sd_ordered_lock);
+	list_splice_tail(&written, &sdp->sd_log_le_ordered);
+	spin_unlock(&sdp->sd_ordered_lock);
 }
 
 void gfs2_ordered_del_inode(struct gfs2_inode *ip)
-- 
2.17.0
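
For reference, here is roughly how gfs2_ordered_write ends up looking
with the incremental above applied. This is a sketch reconstructed from
the hunks; the name of the local list (written), the writeback_control
setup, and the parts of the loop outside the hunk context are assumed
rather than quoted:

static void gfs2_ordered_write(struct gfs2_sbd *sdp)
{
	struct gfs2_inode *ip;
	struct address_space *mapping;
	LIST_HEAD(written);
	struct writeback_control wbc = {	/* assumed setup */
		.sync_mode = WB_SYNC_ALL,
		.nr_to_write = LONG_MAX,
	};

	spin_lock(&sdp->sd_ordered_lock);
	while (!list_empty(&sdp->sd_log_le_ordered)) {
		ip = list_entry(sdp->sd_log_le_ordered.next,
				struct gfs2_inode, i_ordered);
		if (ip->i_inode.i_mapping->nrpages == 0) {
			/* no dirty pages: drop the inode from the list */
			test_and_clear_bit(GIF_ORDERED, &ip->i_flags);
			list_del(&ip->i_ordered);
			continue;
		}
		list_move_tail(&ip->i_ordered, &written);
	}
	spin_unlock(&sdp->sd_ordered_lock);

	/* Pass 1: submit all the ordered writes without waiting. */
	list_for_each_entry(ip, &written, i_ordered) {
		mapping = ip->i_inode.i_mapping;
		mapping->a_ops->writepages(mapping, &wbc);
	}

	/* Pass 2: wait for everything submitted above to complete. */
	list_for_each_entry(ip, &written, i_ordered)
		filemap_fdatawait(ip->i_inode.i_mapping);

	/* Hand the inodes back to the ordered list for the log code. */
	spin_lock(&sdp->sd_ordered_lock);
	list_splice_tail(&written, &sdp->sd_log_le_ordered);
	spin_unlock(&sdp->sd_ordered_lock);
}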



Re: [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Introduce rgrp sharing

2018-05-15 Thread Andreas Gruenbacher
Hi Bob,

On 15 May 2018 at 20:29, Bob Peterson  wrote:
> This patch takes advantage of the new glock holder sharing
> feature for rgrps. New functions rgrp_lock and rgrp_unlock
> use a new semaphore to gain temporary exclusive access to the
> rgrps. This is needed whenever a multi-block reservation is
> made, and every time a function needs to add an rgrp to a
> transaction.
>
> Since multiple rgrp glocks held by gfs2_rlist_alloc all require
> the same access, the state parameter has been removed, and the
> new holder flag is specified.
>
> Also, function gfs2_check_blk_type, which had previously been
> using LM_ST_SHARED for access to the rgrp, has been switched
> to LM_ST_EXCLUSIVE with the new holder flag. This means
> multiple nodes no longer hold the rgrp in SH at the same time.
> However, with the new flag, local holders will still share the
> rgrp, but we avoid the glock state transition from EX to SH
> and back (which may involve both DLM and the workqueue), which
> is another performance boost for the block allocator.

this is looking much better than the previous version.

> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/bmap.c   |  2 +-
>  fs/gfs2/dir.c    |  2 +-
>  fs/gfs2/incore.h |  2 ++
>  fs/gfs2/inode.c  |  7 ++--
>  fs/gfs2/rgrp.c   | 86 
>  fs/gfs2/rgrp.h   |  2 +-
>  fs/gfs2/super.c  |  8 +++--
>  fs/gfs2/xattr.c  |  8 +++--
>  8 files changed, 91 insertions(+), 26 deletions(-)
>
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 0590e93494f7..ac484e6623c9 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -1108,7 +1108,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
> goto out;
> }
> ret = gfs2_glock_nq_init(rgd->rd_gl, LM_ST_EXCLUSIVE,
> -0, rd_gh);
> +LM_FLAG_NODE_EX, rd_gh);
> if (ret)
> goto out;
>
> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> index d9fb0ad6cc30..7327a9d43692 100644
> --- a/fs/gfs2/dir.c
> +++ b/fs/gfs2/dir.c
> @@ -2020,7 +2020,7 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
> l_blocks++;
> }
>
> -   gfs2_rlist_alloc(&rlist, LM_ST_EXCLUSIVE);
> +   gfs2_rlist_alloc(&rlist);
>
> for (x = 0; x < rlist.rl_rgrps; x++) {
> struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(rlist.rl_ghs[x].gh_gl);
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 0bbbaa9b05cb..03ddd01d58d4 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/semaphore.h>
>
>  #define DIO_WAIT   0x00000010
>  #define DIO_METADATA   0x00000020
> @@ -100,6 +101,7 @@ struct gfs2_rgrpd {
>  #define GFS2_RDF_PREFERRED 0x80000000 /* This rgrp is preferred */
>  #define GFS2_RDF_MASK  0xf0000000 /* mask for internal flags */
> spinlock_t rd_rsspin;   /* protects reservation related vars */
> +   struct semaphore rd_sem;/* ensures local rgrp exclusivity */
> struct rb_root rd_rstree;   /* multi-block reservation tree */
>  };
>
> diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> index 8700eb815638..3176cadfc309 100644
> --- a/fs/gfs2/inode.c
> +++ b/fs/gfs2/inode.c
> @@ -1121,8 +1121,8 @@ static int gfs2_unlink(struct inode *dir, struct dentry *dentry)
> if (!rgd)
> goto out_inodes;
>
> -   gfs2_holder_init(rgd->rd_gl, LM_ST_EXCLUSIVE, 0, ghs + 2);
> -
> +   gfs2_holder_init(rgd->rd_gl, LM_ST_EXCLUSIVE, LM_FLAG_NODE_EX,
> +ghs + 2);
>
> error = gfs2_glock_nq(ghs); /* parent */
> if (error)
> @@ -1409,7 +1409,8 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
>  */
> nrgd = gfs2_blk2rgrpd(sdp, nip->i_no_addr, 1);
> if (nrgd)
> -   gfs2_holder_init(nrgd->rd_gl, LM_ST_EXCLUSIVE, 0, ghs + num_gh++);
> +   gfs2_holder_init(nrgd->rd_gl, LM_ST_EXCLUSIVE,
> +LM_FLAG_NODE_EX, ghs + num_gh++);
> }
>
> for (x = 0; x < num_gh; x++) {
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 8b683917a27e..7891229b836d 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -904,6 +904,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
> rgd->rd_data = be32_to_cpu(buf.ri_data);
> rgd->rd_bitbytes = be32_to_cpu(buf.ri_bitbytes);
> spin_lock_init(&rgd->rd_rsspin);
> +   sema_init(&rgd->rd_sem, 1);
>
> error = compute_bitstructs(rgd);
> if (error)
> @@ -1344,6 +1345,25 @@ int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
> return -EIO;
>  }
>
> +/**
> + * rgrp_lock 

Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Introduce LM_FLAG_NODE_EX holder bit: Node Exclusive sharing

2018-05-15 Thread Andreas Gruenbacher
On 15 May 2018 at 20:29, Bob Peterson  wrote:
> This patch is a first step in rgrp sharing. It allows for glocks
> locked in EX mode to be shared amongst processes on that node.
>
> Like a SH (shared) glock, multiple processes may hold the lock in
> EX mode at the same time, provided they're all on the same
> node. This is a holder flag, meaning "Even though I've locked the
> glock in EX, this process will share the lock and allow multiple
> holders for other local processes that use the same flag."
>
> For now, there are no users of the new flag. A future patch
> will use it to improve performance with rgrp sharing.
>
> Signed-off-by: Bob Peterson 

Reviewed-by: Andreas Gruenbacher 

> ---
>  fs/gfs2/glock.c | 23 +++
>  fs/gfs2/glock.h |  6 ++
>  2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index 097bd3c0f270..bd45df7c0ef4 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -279,10 +279,20 @@ void gfs2_glock_put(struct gfs2_glock *gl)
>
>  static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holder *gh)
>  {
> -   const struct gfs2_holder *gh_head = list_entry(gl->gl_holders.next, const struct gfs2_holder, gh_list);
> -   if ((gh->gh_state == LM_ST_EXCLUSIVE ||
> -gh_head->gh_state == LM_ST_EXCLUSIVE) && gh != gh_head)
> -   return 0;
> +   const struct gfs2_holder *gh_head = list_entry(gl->gl_holders.next,
> +  const struct gfs2_holder,
> +  gh_list);
> +
> +   if (gh != gh_head) {
> +   if (gh_head->gh_state == LM_ST_EXCLUSIVE &&
> +   (gh_head->gh_flags & LM_FLAG_NODE_EX) &&
> +   gh->gh_state == LM_ST_EXCLUSIVE &&
> +   (gh->gh_flags & LM_FLAG_NODE_EX))
> +   return 1;
> +   if ((gh->gh_state == LM_ST_EXCLUSIVE ||
> +gh_head->gh_state == LM_ST_EXCLUSIVE))
> +   return 0;
> +   }
> if (gl->gl_state == gh->gh_state)
> return 1;
> if (gh->gh_flags & GL_EXACT)
> @@ -292,6 +302,9 @@ static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holde
> return 1;
> if (gh->gh_state == LM_ST_DEFERRED && gh_head->gh_state == LM_ST_DEFERRED)
> return 1;
> +   if (gh_head->gh_flags & LM_FLAG_NODE_EX &&
> +   gh->gh_flags & LM_FLAG_NODE_EX)
> +   return 1;
> }
> if (gl->gl_state != LM_ST_UNLOCKED && (gh->gh_flags & LM_FLAG_ANY))
> return 1;
> @@ -1680,6 +1693,8 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
> *p++ = 'A';
> if (flags & LM_FLAG_PRIORITY)
> *p++ = 'p';
> +   if (flags & LM_FLAG_NODE_EX)
> +   *p++ = 'n';
> if (flags & GL_ASYNC)
> *p++ = 'a';
> if (flags & GL_EXACT)
> diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
> index 5e12220cc0c2..d3e1bdeefcc1 100644
> --- a/fs/gfs2/glock.h
> +++ b/fs/gfs2/glock.h
> @@ -78,6 +78,11 @@ enum {
>   * request and directly join the other shared lock.  A shared lock request
>   * without the priority flag might be forced to wait until the deferred
>   * requested had acquired and released the lock.
> + *
> + * LM_FLAG_NODE_EX
> + * This holder agrees to share the lock within this node only. In other 
> words,
> + * the glock is held in EX mode according to DLM, but local holders on the
> + * same node can share it as if it was held in SH.
>   */
>
>  #define LM_FLAG_TRY0x0001
> @@ -85,6 +90,7 @@ enum {
>  #define LM_FLAG_NOEXP  0x0004
>  #define LM_FLAG_ANY0x0008
>  #define LM_FLAG_PRIORITY   0x0010
> +#define LM_FLAG_NODE_EX0x0020
>  #define GL_ASYNC   0x0040
>  #define GL_EXACT   0x0080
>  #define GL_SKIP0x0100
> --
> 2.17.0
>

Thanks,
Andreas



[Cluster-devel] [GFS2 PATCH 1/2] GFS2: Introduce LM_FLAG_NODE_EX holder bit: Node Exclusive sharing

2018-05-15 Thread Bob Peterson
This patch is a first step in rgrp sharing. It allows for glocks
locked in EX mode to be shared amongst processes on that node.

Like a SH (shared) glock, multiple processes may hold the lock in
EX mode at the same time, provided they're all on the same
node. This is a holder flag, meaning "Even though I've locked the
glock in EX, this process will share the lock and allow multiple
holders for other local processes that use the same flag."

For now, there are no users of the new flag. A future patch
will use it to improve performance with rgrp sharing.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 23 +++
 fs/gfs2/glock.h |  6 ++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 097bd3c0f270..bd45df7c0ef4 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -279,10 +279,20 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 
 static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holder *gh)
 {
-   const struct gfs2_holder *gh_head = list_entry(gl->gl_holders.next, const struct gfs2_holder, gh_list);
-   if ((gh->gh_state == LM_ST_EXCLUSIVE ||
-gh_head->gh_state == LM_ST_EXCLUSIVE) && gh != gh_head)
-   return 0;
+   const struct gfs2_holder *gh_head = list_entry(gl->gl_holders.next,
+  const struct gfs2_holder,
+  gh_list);
+
+   if (gh != gh_head) {
+   if (gh_head->gh_state == LM_ST_EXCLUSIVE &&
+   (gh_head->gh_flags & LM_FLAG_NODE_EX) &&
+   gh->gh_state == LM_ST_EXCLUSIVE &&
+   (gh->gh_flags & LM_FLAG_NODE_EX))
+   return 1;
+   if ((gh->gh_state == LM_ST_EXCLUSIVE ||
+gh_head->gh_state == LM_ST_EXCLUSIVE))
+   return 0;
+   }
if (gl->gl_state == gh->gh_state)
return 1;
if (gh->gh_flags & GL_EXACT)
@@ -292,6 +302,9 @@ static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holde
return 1;
	if (gh->gh_state == LM_ST_DEFERRED && gh_head->gh_state == LM_ST_DEFERRED)
return 1;
+   if (gh_head->gh_flags & LM_FLAG_NODE_EX &&
+   gh->gh_flags & LM_FLAG_NODE_EX)
+   return 1;
}
if (gl->gl_state != LM_ST_UNLOCKED && (gh->gh_flags & LM_FLAG_ANY))
return 1;
@@ -1680,6 +1693,8 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
*p++ = 'A';
if (flags & LM_FLAG_PRIORITY)
*p++ = 'p';
+   if (flags & LM_FLAG_NODE_EX)
+   *p++ = 'n';
if (flags & GL_ASYNC)
*p++ = 'a';
if (flags & GL_EXACT)
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 5e12220cc0c2..d3e1bdeefcc1 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -78,6 +78,11 @@ enum {
  * request and directly join the other shared lock.  A shared lock request
  * without the priority flag might be forced to wait until the deferred
  * requested had acquired and released the lock.
+ *
+ * LM_FLAG_NODE_EX
+ * This holder agrees to share the lock within this node only. In other words,
+ * the glock is held in EX mode according to DLM, but local holders on the
+ * same node can share it as if it was held in SH.
  */
 
 #define LM_FLAG_TRY0x0001
@@ -85,6 +90,7 @@ enum {
 #define LM_FLAG_NOEXP  0x0004
 #define LM_FLAG_ANY0x0008
 #define LM_FLAG_PRIORITY   0x0010
+#define LM_FLAG_NODE_EX0x0020
 #define GL_ASYNC   0x0040
 #define GL_EXACT   0x0080
 #define GL_SKIP0x0100
-- 
2.17.0
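
To make the new semantics concrete, here is a hedged sketch (not part
of the patch) of two tasks on one node sharing some glock gl in EX;
gfs2_holder_init, gfs2_glock_nq and gfs2_glock_dq_uninit are the
existing glock API, and error handling is omitted:

	struct gfs2_holder gh_a, gh_b;

	/* Task A */
	gfs2_holder_init(gl, LM_ST_EXCLUSIVE, LM_FLAG_NODE_EX, &gh_a);
	gfs2_glock_nq(&gh_a);	/* granted; DLM sees the glock in EX */

	/* Task B, concurrently, on the same node */
	gfs2_holder_init(gl, LM_ST_EXCLUSIVE, LM_FLAG_NODE_EX, &gh_b);
	gfs2_glock_nq(&gh_b);	/* also granted: the head holder and this
				   one both carry LM_FLAG_NODE_EX, so
				   may_grant() returns 1 */

	/* A plain LM_ST_EXCLUSIVE request without LM_FLAG_NODE_EX, or
	   any request from another node, must wait until both holders
	   have been dequeued. */

	gfs2_glock_dq_uninit(&gh_b);
	gfs2_glock_dq_uninit(&gh_a);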



[Cluster-devel] [GFS2 PATCH 2/2] GFS2: Introduce rgrp sharing

2018-05-15 Thread Bob Peterson
This patch takes advantage of the new glock holder sharing
feature for rgrps. New functions rgrp_lock and rgrp_unlock
use a new semaphore to gain temporary exclusive access to the
rgrps. This is needed whenever a multi-block reservation is
made, and every time a function needs to add an rgrp to a
transaction.

Since multiple rgrp glocks held by gfs2_rlist_alloc all require
the same access, the state parameter has been removed, and the
new holder flag is specified.

Also, function gfs2_check_blk_type, which had previously been
using LM_ST_SHARED for access to the rgrp, has been switched
to LM_ST_EXCLUSIVE with the new holder flag. This means
multiple nodes no longer hold the rgrp in SH at the same time.
However, with the new flag, local holders will still share the
rgrp, but we avoid the glock state transition from EX to SH
and back (which may involve both DLM and the workqueue), which
is another performance boost for the block allocator.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/bmap.c   |  2 +-
 fs/gfs2/dir.c    |  2 +-
 fs/gfs2/incore.h |  2 ++
 fs/gfs2/inode.c  |  7 ++--
 fs/gfs2/rgrp.c   | 86 
 fs/gfs2/rgrp.h   |  2 +-
 fs/gfs2/super.c  |  8 +++--
 fs/gfs2/xattr.c  |  8 +++--
 8 files changed, 91 insertions(+), 26 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 0590e93494f7..ac484e6623c9 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1108,7 +1108,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
goto out;
}
ret = gfs2_glock_nq_init(rgd->rd_gl, LM_ST_EXCLUSIVE,
-0, rd_gh);
+LM_FLAG_NODE_EX, rd_gh);
if (ret)
goto out;
 
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index d9fb0ad6cc30..7327a9d43692 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -2020,7 +2020,7 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
l_blocks++;
}
 
-   gfs2_rlist_alloc(&rlist, LM_ST_EXCLUSIVE);
+   gfs2_rlist_alloc(&rlist);
 
for (x = 0; x < rlist.rl_rgrps; x++) {
struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(rlist.rl_ghs[x].gh_gl);
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 0bbbaa9b05cb..03ddd01d58d4 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <linux/semaphore.h>
 
 #define DIO_WAIT   0x00000010
 #define DIO_METADATA   0x00000020
@@ -100,6 +101,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_PREFERRED 0x80000000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK  0xf0000000 /* mask for internal flags */
spinlock_t rd_rsspin;   /* protects reservation related vars */
+   struct semaphore rd_sem;/* ensures local rgrp exclusivity */
struct rb_root rd_rstree;   /* multi-block reservation tree */
 };
 
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 8700eb815638..3176cadfc309 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -1121,8 +1121,8 @@ static int gfs2_unlink(struct inode *dir, struct dentry *dentry)
if (!rgd)
goto out_inodes;
 
-   gfs2_holder_init(rgd->rd_gl, LM_ST_EXCLUSIVE, 0, ghs + 2);
-
+   gfs2_holder_init(rgd->rd_gl, LM_ST_EXCLUSIVE, LM_FLAG_NODE_EX,
+ghs + 2);
 
error = gfs2_glock_nq(ghs); /* parent */
if (error)
@@ -1409,7 +1409,8 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
 */
nrgd = gfs2_blk2rgrpd(sdp, nip->i_no_addr, 1);
if (nrgd)
-   gfs2_holder_init(nrgd->rd_gl, LM_ST_EXCLUSIVE, 0, ghs + num_gh++);
+   gfs2_holder_init(nrgd->rd_gl, LM_ST_EXCLUSIVE,
+LM_FLAG_NODE_EX, ghs + num_gh++);
}
 
for (x = 0; x < num_gh; x++) {
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 8b683917a27e..7891229b836d 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -904,6 +904,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
rgd->rd_data = be32_to_cpu(buf.ri_data);
rgd->rd_bitbytes = be32_to_cpu(buf.ri_bitbytes);
spin_lock_init(&rgd->rd_rsspin);
+   sema_init(&rgd->rd_sem, 1);
 
error = compute_bitstructs(rgd);
if (error)
@@ -1344,6 +1345,25 @@ int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
return -EIO;
 }
 
+/**
+ * rgrp_lock - gain exclusive access to an rgrp
+ * @rgd: the rgrp
+ *
+ * This function waits for exclusive access to an rgrp whose glock is held in
+ * EX, but shared via the LM_FLAG_NODE_EX holder flag.
+ */
+static void rgrp_lock(struct gfs2_rgrpd *rgd)
+{
+   BUG_ON(!gfs2_glock_is_held_excl(rgd->rd_gl));
+   down(&rgd->rd_sem);
+}
+
+static void rgrp_unlock(struct 
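
A minimal sketch of the unlock counterpart, assuming it simply
releases the semaphore taken in rgrp_lock:

static void rgrp_unlock(struct gfs2_rgrpd *rgd)
{
	up(&rgd->rd_sem);
}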

[Cluster-devel] [GFS2 PATCH 0/2] Improve throughput through rgrp sharing (v3)

2018-05-15 Thread Bob Peterson
Hi,

This is version 3 of my patch set to allow rgrp glock sharing,
based on feedback I received from Steve Whitehouse and Andreas
Gruenbacher. It improves upon version 2 in the following ways:

1. Comments and descriptions are fixed to match the actual intent
   of the code. They were previously outdated and incorrect.
2. The rwsem has been replaced by a simple semaphore, since there
   is no non-exclusive access to the locks yet.
3. The newer versions of the patches don't take the rgrp exclusive
   lock multiple times for a transaction, so there is no longer a
   need to track who is holding the semaphore. Therefore all that
   has been eliminated.
4. Some variable and function names have been renamed to make more sense.

The first patch introduces the new glock holder flag that allows
multiple tasks on a single node to share a glock held in EX mode.

The second patch puts the new holder flag into use for rgrp sharing.
Exclusive access to the rgrp is implemented through a new rgrp
semaphore and supporting functions.

Performance tests done on an earlier prototype indicate the overall
throughput is 6X the original code when multiple processes are writing,
with vastly improved sharing of the block allocator.
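
In rough outline, the allocator path under this scheme becomes the
following (a hedged sketch of the call pattern, not code from the
patches):

	struct gfs2_holder gh;

	/* Cluster-wide: EX toward DLM, but shared among local holders. */
	gfs2_holder_init(rgd->rd_gl, LM_ST_EXCLUSIVE, LM_FLAG_NODE_EX, &gh);
	gfs2_glock_nq(&gh);

	rgrp_lock(rgd);		/* down(&rgd->rd_sem): brief local exclusion */
	/* ... reserve blocks, or add the rgrp to the transaction ... */
	rgrp_unlock(rgd);	/* up(&rgd->rd_sem) */

	gfs2_glock_dq_uninit(&gh);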

Bob Peterson (2):
  GFS2: Introduce LM_FLAG_NODE_EX holder bit: Node Exclusive sharing
  GFS2: Introduce rgrp sharing

 fs/gfs2/bmap.c   |  2 +-
 fs/gfs2/dir.c    |  2 +-
 fs/gfs2/glock.c  | 23 ++---
 fs/gfs2/glock.h  |  6 
 fs/gfs2/incore.h |  2 ++
 fs/gfs2/inode.c  |  7 ++--
 fs/gfs2/rgrp.c   | 86 
 fs/gfs2/rgrp.h   |  2 +-
 fs/gfs2/super.c  |  8 +++--
 fs/gfs2/xattr.c  |  8 +++--
 10 files changed, 116 insertions(+), 30 deletions(-)

-- 
2.17.0



Re: [Cluster-devel] [GFS2 PATCH] GFS2: Add to tail, not head, of transaction

2018-05-15 Thread Andreas Gruenbacher
On 9 May 2018 at 15:21, Bob Peterson  wrote:
> Hi,
>
> Before this patch, function gfs2_trans_add_meta called list_add
> to add a buffer to a transaction, tr_buf. Later, in the before_commit
> functions, it traversed the list in sequential order, which meant
> that they were processed in a sub-optimal order. For example, blocks
> could go out in 54321 order rather than 12345, causing media heads
> to bounce unnecessarily.

Hmm, the patch itself looks reasonable, but given that
gfs2_before_commit sorts the list by block number before iterating it,
the above explanation seems to be at least partially wrong. Could you
please clarify what's really going on?

> This makes no difference for small IO operations, but large writes
> benefit greatly when they add lots of indirect blocks to a
> transaction, and those blocks are allocated in ascending order,
> as the block allocator tries to do.
>
> This patch changes it to list_add_tail so they are traversed (and
> therefore written back) in the same order as they are added to
> the transaction.
>
> In one of the more extreme examples, I did a test where I had
> 10 simultaneous instances of iozone, each writing 25GB of data,
> using one of our performance testing machines. The results are:
>
>  Without the patch   With the patch
>  -   --
> Children see throughput  1395068.00 kB/sec   1527073.61 kB/sec
> Parent sees throughput   1348544.00 kB/sec   1485594.66 kB/sec
>
> These numbers are artificially inflated because I was also running
> with Andreas Gruenbacher's iomap-write patch set plus my rgrp
> sharing patch set in both these cases. Still, it shows a 9 percent
> performance boost for both children and parent throughput.
>
> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/trans.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
> index c75cacaa349b..c4c00b90935c 100644
> --- a/fs/gfs2/trans.c
> +++ b/fs/gfs2/trans.c
> @@ -247,7 +247,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
> gfs2_pin(sdp, bd->bd_bh);
> mh->__pad0 = cpu_to_be64(0);
> mh->mh_jid = cpu_to_be32(sdp->sd_jdesc->jd_jid);
> -   list_add(>bd_list, >tr_buf);
> +   list_add_tail(>bd_list, >tr_buf);
> tr->tr_num_buf_new++;
>  out_unlock:
> gfs2_log_unlock(sdp);
>

Thanks,
Andreas
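
For reference, the list API semantics at issue, as a self-contained
sketch (not gfs2 code):

	struct demo_buf {
		struct list_head bd_list;
		u64 blkno;
	};

	LIST_HEAD(tr_buf);
	struct demo_buf bufs[3] = {
		{ .blkno = 1 }, { .blkno = 2 }, { .blkno = 3 }
	};
	struct demo_buf *bd;
	int i;

	for (i = 0; i < 3; i++)
		list_add(&bufs[i].bd_list, &tr_buf);	/* head insert */
	list_for_each_entry(bd, &tr_buf, bd_list)
		pr_info("%llu\n", (unsigned long long)bd->blkno);
	/* prints 3, 2, 1: each list_add() pushes the newest entry to the
	   front.  With list_add_tail() instead, the traversal prints
	   1, 2, 3, matching the ascending order in which the allocator
	   handed out the blocks. */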



Re: [Cluster-devel] [PATCH v4 06/11] iomap: Add write_{begin, end} iomap operations

2018-05-15 Thread Andreas Gruenbacher
On 15 May 2018 at 09:22, Christoph Hellwig  wrote:
> On Tue, May 15, 2018 at 11:11:47AM +1000, Dave Chinner wrote:
>> I'm not sure adding a per-page filesystem callout abstraction is
>> what we want in the iomap code.  The commit message does not explain
>> why gfs2 needs per-page callouts, nor why the per-page work cannot
>> be done/avoided by running code in the ->iomap_begin callout
>> (e.g. by unstuffing the inode so nothing needs to be done per-page).
>
> Agreed.
>
>> My main concern is that the callouts to the filesystem in iomap are
>> - by design - per IO, not per page. The whole point of using iomap
>> was to get away from doing per-page filesystem operations on
>> multi-page IOs - it's highly inefficient when we only need to call
>> into the filesystem on a per-extent basis, and we simplified the
>> code a *lot* by avoiding such behaviours.
>
> Yes.  And from a quick look we can easily handle "stuffed" aka inline
> data with the existing architecture.  We just need a new IOMAP_INLINE
> flag for the extent type. Details will need to be worked out how to
> pass that data, either a struct page or virtual address should probably
> do it.

This is about data journaling; it has nothing to do with "stuffed"
files per se. With journaled data writes, we need to do some special
processing for each page so that the page ends up in the journal
before it gets written back to its proper on-disk location. It just
happens that since we need these per-page operations anyway, the
"unstuffing" and "re-stuffing" nicely fits in those operations as
well.
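
Conceptually, the per-page step for journaled data boils down to
something like this sketch; the hook name and placement are
hypothetical, while gfs2_trans_add_data is the existing helper that
pins a buffer into the running transaction:

	/* Hypothetical per-page hook run at write_end time for a
	   journaled-data write. */
	static void gfs2_page_add_to_journal(struct inode *inode,
					     struct page *page)
	{
		struct gfs2_inode *ip = GFS2_I(inode);
		struct buffer_head *head = page_buffers(page);
		struct buffer_head *bh = head;

		do {
			/* Pin the buffer into the transaction so the data
			   reaches the journal before its on-disk location. */
			gfs2_trans_add_data(ip->i_gl, bh);
			bh = bh->b_this_page;
		} while (bh != head);
	}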

> The other thing gfs2 seems to be doing is handling of journaled
> data.  I'm not quite sure how to fit that into the grand overall scheme
> of things, but at least that would only be after the write.

Not sure what you mean by "but at least that would only be after the write".

>> That is, Christoph's patchset pushes us further away from
>> filesystems doing their own per-page processing of page state. It
>> converts the iomap IO path to store its own per-page state tracking
>> information in page->private, greatly reducing memory overhead (16
>> bytes on 4k page instead of 104 bytes per bufferhead) and
>
> Or in fact zero bytes for block size == PAGE_SIZE
>
>> streamlining the code.  This, however, essentially makes the
>> buffered IO path through the iomap infrastructure incompatible with
>> existing bufferhead based filesystems.
>>
>> Despite this, we can still provide limited support for buffered
>> writes on bufferhead based filesystems through the iomap
>> infrastructure, but I only see that as a mechanism for filesystems
>> moving away from dependencies on bufferheads. It would require a
>> different refactoring of the iomap code - I'd much prefer to keep
>> bufferhead dependent IO paths completely separate (e.g. via a new
>> actor function) to minimise impact and dependencies on the internal
>> bufferhead free iomap IO path
>>
>> Let's see what Christoph thinks first, though
>
> gfs2 seems to indeed depend pretty heavily on buffer_heads.  This is
> a bit of a bummer, but I'd need to take a deeper look how to offer
> iomap functionality to them without them moving away from buffer_heads
> which isn't really going to be a quick project.  In a way the
> write_begin/write_end ops from Andreas facilitate that, as the
> rest of iomap_file_buffered_write and iomap_write_actor are untouched.
>
> So at least temporarily his callouts are what we need, even if the
> actual inline data functionality should be moved out of them for sure,

Been there. An earlier version of the patches did have a separate
stuffed write path and the unstuffing and re-stuffing didn't happen in
those per-page operations. The result was much messier overall.

> but only with a long term plan to move gfs2 away from buffer heads.

I hope we'll get rid of buffer heads in additional parts of gfs2, but
that's not going to happen soon. The main point of those per-page
operations is still to support data journaling though.

> Btw, does gfs2 support blocksizes smaller than PAGE_SIZE?

Yes.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v4 09/11] gfs2: iomap direct I/O support

2018-05-15 Thread Christoph Hellwig
On Mon, May 14, 2018 at 05:36:22PM +0200, Andreas Gruenbacher wrote:
> With that, the direct_IO address space operation can be all but
> eliminated: only a dummy remains which indicates that the filesystem
> supports direct I/O.

And that dummy can be replaced with the generic noop_direct_IO helper.
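
That is, something like this sketch (the other gfs2 aops members are
elided):

	static const struct address_space_operations gfs2_aops = {
		/* ... */
		/* Never actually called on the iomap path; its presence
		   just tells the VFS that O_DIRECT is supported. */
		.direct_IO = noop_direct_IO,
	};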

> + ret = filemap_write_and_wait_range(mapping, lstart, end);
> + if (ret)
> + goto out;

We already do this call in the common code.

> + if (iov_iter_rw(from) == WRITE)
> + truncate_inode_pages_range(mapping, lstart, end);

Why do you need the truncate_inode_pages_range call here instead of
invalidate_inode_pages2_range that everyone else uses and that
the iomap code already does for you?



Re: [Cluster-devel] [PATCH v4 11/11] iomap: Complete partial direct I/O writes synchronously

2018-05-15 Thread Christoph Hellwig
On Mon, May 14, 2018 at 05:36:24PM +0200, Andreas Gruenbacher wrote:
> According to xfstest generic/240, applications seem to expect direct I/O
> writes to either complete as a whole or to fail; short direct I/O writes
> are apparently not appreciated.  This means that when only part of an
> asynchronous direct I/O write succeeds, we can either fail the entire
> write, or we can wait wait for the partial write to complete and retry
> the remaining write using buffered I/O.  The old __blockdev_direct_IO
> helper has code for waiting for partial writes to complete; the new
> iomap_dio_rw iomap helper does not.
> 
> The above mentioned fallback mode is used by gfs2, which doesn't allow
> block allocations under direct I/O to avoid taking cluster-wide
> exclusive locks.  As a consequence, an asynchronous direct I/O write to
> a file range that ends in a hole will result in a short write.  When
> that happens, we want to retry the remaining write using buffered I/O.
> 
> To allow that, change iomap_dio_rw to wait for short direct I/O writes
> like __blockdev_direct_IO does instead of returning -EIOCBQUEUED.
> 
> This fixes xfstest generic/240 on gfs2.

The code looks pretty racy to me.  Why would gfs2 cause a short direct
I/O write to start with?  I suspect that is where the problem that needs
fixing is buried.
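
For context, the fallback described in the commit message looks roughly
like this in the caller (a hedged sketch, not the actual gfs2 code):

	static ssize_t dio_write_with_buffered_fallback(struct kiocb *iocb,
							struct iov_iter *from)
	{
		ssize_t ret, ret2;

		ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL);
		if (ret < 0 || !iov_iter_count(from))
			return ret;	/* error, or the whole write made it */

		/* The direct write stopped short, e.g. at a hole that gfs2
		   refuses to allocate under O_DIRECT.  Retry the remainder
		   through the page cache; that is only safe once the partial
		   direct write has fully completed, which is what this patch
		   makes iomap_dio_rw guarantee. */
		ret2 = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
		return ret2 < 0 ? ret : ret + ret2;
	}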



Re: [Cluster-devel] [PATCH v4 06/11] iomap: Add write_{begin, end} iomap operations

2018-05-15 Thread Christoph Hellwig
On Tue, May 15, 2018 at 11:11:47AM +1000, Dave Chinner wrote:
> I'm not sure adding a per-page filesystem callout abstraction is
> what we want in the iomap code.  The commit message does not explain
> why gfs2 needs per-page callouts, nor why the per-page work cannot
> be done/avoided by running code in the ->iomap_begin callout
> (e.g. by unstuffing the inode so nothing needs to be done per-page).

Agreed.
> 
> My main concern is that the callouts to the filesystem in iomap are
> - by design - per IO, not per page. The whole point of using iomap
> was to get away from doing per-page filesystem operations on
> multi-page IOs - it's highly inefficient when we only need to call
> into the filesystem on a per-extent basis, and we simplified the
> code a *lot* by avoiding such behaviours.

Yes.  And from a quick look we can easily handle "stuffed" aka inline
data with the existing architecture.  We just need a new IOMAP_INLINE
flag for the extent type.  Details will need to be worked out how to
pass that data, either a struct page or virtual address should probably
do it.  The other thing gfs2 seems to be doing is handling of journaled
data.  I'm not quite sure how to fit that into the grand overall scheme
of things, but at least that would only be after the write.
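
A sketch of what such an inline mapping could look like coming out of
->iomap_begin; IOMAP_INLINE and the inline_data field are the proposal
here, not existing API, and dibh stands for the inode's buffer head:

	if (gfs2_is_stuffed(ip)) {
		iomap->type = IOMAP_INLINE;	/* proposed extent type */
		iomap->offset = 0;
		iomap->length = i_size_read(inode);
		/* point at the data stuffed into the on-disk inode */
		iomap->inline_data = dibh->b_data + sizeof(struct gfs2_dinode);
		return 0;
	}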

> That is, Christoph's patchset pushes us further away from
> filesystems doing their own per-page processing of page state. It
> converts the iomap IO path to store its own per-page state tracking
> information in page->private, greatly reducing memory overhead (16
> bytes on 4k page instead of 104 bytes per bufferhead) and

Or in fact zero bytes for block size == PAGE_SIZE

> streamlining the code.  This, however, essentially makes the
> buffered IO path through the iomap infrastructure incompatible with
> existing bufferhead based filesystems.
> 
> Despite this, we can still provide limited support for buffered
> writes on bufferhead based filesystems through the iomap
> infrastructure, but I only see that as a mechanism for filesystems
> moving away from dependencies on bufferheads. It would require a
> different refactoring of the iomap code - I'd much prefer to keep
> bufferhead dependent IO paths completely separate (e.g. via a new
> actor function) to minimise impact and dependencies on the internal
> bufferhead free iomap IO path
> 
> Let's see what Christoph thinks first, though

gfs2 seems to indeed depend pretty heavily on buffer_heads.  This is 
a bit of a bummer, but I'd need to take a deeper look how to offer
iomap functionality to them without them moving away from buffer_heads
which isn't really going to be a quick project.  In a way the
write_begin/write_end ops from Andreas facilitate that, as the
rest of iomap_file_buffered_write and iomap_write_actor are untouched.

So at least temporarily his callouts are what we need, even if the
actual inline data functionality should be moved out of them for sure,
but only with a long term plan to move gfs2 away from buffer heads.

Btw, does gfs2 support blocksizes smaller than PAGE_SIZE?

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
---end quoted text---