[Cluster-devel] [GFS2 PATCH] GFS2: gfs2_free_extlen can return an extent that is too long

2018-06-01 Thread Bob Peterson
Hi,

Function gfs2_free_extlen calculates the length of an extent of
free blocks that may be reserved. The end pointer was calculated
as end = start + bh->b_size, but b_size is the wrong bound because
the bitmap data usually ends before the end of the buffer on the
last bitmap of the rgrp.

What this means is that when you do a write, you can reserve a
chunk of blocks that runs off the end of the last bitmap. For
example, I've got a file system where there is only one bitmap
for each rgrp, so ri_length==1. I saw cases in which iozone
tried to do a big write, grabbed a large block reservation,
chose rgrp 5464152, which has ri_data0 5464153 and ri_data 8188.
So 5464153 + 8188 = 5472341 which is the end of the rgrp.

When it grabbed a reservation it got back: 5470936, length 7229.
But 5470936 + 7229 = 5478165. So the reservation starts inside
the rgrp but runs 5824 blocks past the end of the bitmap.

This patch fixes the calculation so it won't exceed the last
bitmap. It also adds a BUG_ON to guard against overflows in the
future.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/incore.h | 1 +
 fs/gfs2/rgrp.c   | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 0bbbaa9b05cb..d2ad817e089f 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -116,6 +116,7 @@ static inline struct gfs2_bitmap *rbm_bi(const struct gfs2_rbm *rbm)
 
 static inline u64 gfs2_rbm_to_block(const struct gfs2_rbm *rbm)
 {
+   BUG_ON(rbm->offset >= rbm->rgd->rd_data);
return rbm->rgd->rd_data0 + (rbm_bi(rbm)->bi_start * GFS2_NBBY) +
rbm->offset;
 }
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index d7a8ed8bb30c..2141373300e0 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -372,8 +372,8 @@ static u32 gfs2_free_extlen(const struct gfs2_rbm *rrbm, u32 len)
start = bi->bi_bh->b_data;
if (bi->bi_clone)
start = bi->bi_clone;
-   end = start + bi->bi_bh->b_size;
start += bi->bi_offset;
+   end = start + bi->bi_len;
BUG_ON(rbm.offset & 3);
start += (rbm.offset / GFS2_NBBY);
bytes = min_t(u32, len / GFS2_NBBY, (end - start));
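
As an aside, here is a small userspace sketch of the bound the hunk above
corrects: the usable bitmap bytes run from bi_offset to bi_offset + bi_len
within the buffer, while b_size covers the whole buffer and therefore
overshoots on the last bitmap. The struct and the numbers below are
illustrative stand-ins, not the kernel's gfs2_bitmap or buffer_head:

#include <stddef.h>
#include <stdio.h>

struct bitmap_desc {			/* illustrative stand-in, not struct gfs2_bitmap */
	size_t b_size;			/* like bi_bh->b_size: size of the whole buffer */
	size_t bi_offset;		/* offset of the bitmap data within the buffer */
	size_t bi_len;			/* bytes of bitmap data actually in use */
};

int main(void)
{
	/* On the last bitmap of an rgrp the used data ends before the buffer does;
	 * these numbers are made up purely to show the shape of the calculation. */
	struct bitmap_desc bi = { .b_size = 4096, .bi_offset = 128, .bi_len = 2047 };

	size_t end_old = bi.b_size;			/* old bound: end of the buffer */
	size_t end_new = bi.bi_offset + bi.bi_len;	/* new bound: end of the bitmap data */

	/* Each bitmap byte covers GFS2_NBBY = 4 blocks, so this excess is what
	 * lets a reservation run past the end of the rgrp. */
	printf("old end %zu, new end %zu, excess %zu bytes (%zu blocks)\n",
	       end_old, end_new, end_old - end_new, (end_old - end_new) * 4);
	return 0;
}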



Re: [Cluster-devel] [GFS2 PATCH] gfs2: revert af21ca8ed50f01c5278c5ded6dad6f05e8a5d2e4

2018-06-01 Thread Steven Whitehouse

Hi,


On 30/05/18 20:18, Bob Peterson wrote:

Hi,

Patch af21ca8ed5 is defective and needs to be reverted. The idea
was to only reserve single blocks for directories, with the thought
that they should be packed together. That's all well and good for
creating new directories, but directories have other reasons for
getting reservations: for example, when files are added to them
and they need their directory hash tables expanded. For
those purposes, one block may not be enough and the reservation
can run out of blocks, forcing us to do block allocations without
a reservation. We can also get into situations where we cannot
modify a directory when we should be able to.

This patch reverts the patch.

Signed-off-by: Bob Peterson 

Doing allocation without a reservation is ok though, since the
reservations are only hints and what really protects us against running
out of blocks is the block counts on the rgrp. I forget now why we
couldn't add reservations of more than one block for directories, but I
had a feeling there was a reason that we did that at the time.


That said, having reservations result in directories which are less
fragmented is definitely a good thing. One issue is whether it makes
sense in the case of deep directory trees in which each directory itself
contains only other directories, and indeed not a huge number of them
(so that the directories are still stuffed files). In that case we'll
land up with a lot of fragmentation if we are not careful and simply
apply a reservation without some knowledge of the directory size/layout.


It would be good if we could add a feature to the allocator to allocate 
contiguous extents even if it means searching a bit harder, and then to 
use that for directory hash tables so that we can at least keep those to 
a single extent where possible. The downside is that we then land up 
leaving holes where the old hash table was, which are then not easy to 
reuse...


Steve.


---
  fs/gfs2/rgrp.c | 9 ++---
  1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 8be47b81011a..d7a8ed8bb30c 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1529,14 +1529,9 @@ static void rg_mblk_search(struct gfs2_rgrpd *rgd, struct gfs2_inode *ip,
u32 extlen;
u32 free_blocks = rgd_free(rgd, rs);
int ret;
-	struct inode *inode = &ip->i_inode;
  
-	if (S_ISDIR(inode->i_mode))
-		extlen = 1;
-	else {
-		extlen = max_t(u32, atomic_read(&rs->rs_sizehint), ap->target);
-		extlen = clamp(extlen, RGRP_RSRV_MINBLKS, free_blocks);
-	}
+	extlen = max_t(u32, atomic_read(&rs->rs_sizehint), ap->target);
+	extlen = clamp(extlen, RGRP_RSRV_MINBLKS, free_blocks);
if ((rgd->rd_free_clone < rgd->rd_reserved) || (free_blocks < extlen))
return;
  







Re: [Cluster-devel] [GFS2 PATCH] GFS2: Add to tail, not head, of transaction

2018-06-01 Thread Steven Whitehouse

Hi,


On 30/05/18 20:24, Bob Peterson wrote:

Hi,

Before this patch, function gfs2_trans_add_meta called list_add
to add a buffer to the transaction's tr_buf list. Later, the
before_commit functions traversed that list in order, but since
list_add prepends, the buffers were processed in the reverse of
the order in which they were added. For example, blocks could go
out in 54321 order rather than 12345, causing media heads to
bounce unnecessarily.

This makes no difference for small IO operations, but large writes
benefit greatly when they add lots of indirect blocks to a
transaction, and those blocks are allocated in ascending order,
as the block allocator tries to do.

This patch changes it to list_add_tail so they are traversed (and
therefore written back) in the same order as they are added to
the transaction.

In one of the more extreme examples, I did a test where I had
10 simultaneous instances of iozone, each writing 25GB of data,
using one of our performance testing machines. The results are:

                          Without the patch   With the patch
                          -----------------   --------------
Children see throughput   1395068.00 kB/sec   1527073.61 kB/sec
Parent sees throughput    1348544.00 kB/sec   1485594.66 kB/sec

These numbers are artificially inflated because I was also running
with Andreas Gruenbacher's iomap-write patch set plus my rgrp
sharing patch set in both these cases. Still, it shows a 9 percent
performance boost for both children and parent throughput.

Signed-off-by: Bob Peterson 

That makes sense assuming that a transaction will always add the blocks
in ascending order. Writing into the journal will not be affected by
this, since that will always be sequential anyway, so I assume that you
are referring to AIL writeback here?


It might make more sense just to sort the blocks into order at journal
flush time, since then we'll always be in order, no matter what order
the blocks are added in during the transactions.
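
As an illustration only, here is a sketch of that sort-at-flush-time idea
using the kernel's list_sort() (from <linux/list_sort.h>) and the
bd_list/bd_bh fields of struct gfs2_bufdata; the comparator and the call
site are hypothetical, not existing GFS2 code:

/* Hypothetical: order the transaction's buffers by on-disk block number at
 * journal flush time, so writeback is ascending regardless of add order. */
static int bd_blocknr_cmp(void *priv, struct list_head *a, struct list_head *b)
{
	struct gfs2_bufdata *bda = list_entry(a, struct gfs2_bufdata, bd_list);
	struct gfs2_bufdata *bdb = list_entry(b, struct gfs2_bufdata, bd_list);

	if (bda->bd_bh->b_blocknr < bdb->bd_bh->b_blocknr)
		return -1;
	return bda->bd_bh->b_blocknr > bdb->bd_bh->b_blocknr;
}

	/* ...somewhere in the log flush path, before the buffers go out... */
	list_sort(NULL, &tr->tr_buf, bd_blocknr_cmp);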


Steve.


---
  fs/gfs2/trans.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index c75cacaa349b..c4c00b90935c 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -247,7 +247,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
gfs2_pin(sdp, bd->bd_bh);
mh->__pad0 = cpu_to_be64(0);
mh->mh_jid = cpu_to_be32(sdp->sd_jdesc->jd_jid);
-	list_add(&bd->bd_list, &tr->tr_buf);
+	list_add_tail(&bd->bd_list, &tr->tr_buf);
tr->tr_num_buf_new++;
  out_unlock:
gfs2_log_unlock(sdp);
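
To see why the one-word change matters, here is a small userspace sketch of
the ordering difference between list_add and list_add_tail; the list
implementation below mimics the kernel's circular doubly linked list but is
a local stand-in, not <linux/list.h>:

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void __list_insert(struct list_head *n, struct list_head *prev,
			  struct list_head *next)
{
	next->prev = n;
	n->next = next;
	n->prev = prev;
	prev->next = n;
}

/* Like the kernel's list_add(): insert right after the head, so a forward
 * traversal sees the newest entry first (LIFO). */
static void list_add(struct list_head *n, struct list_head *head)
{
	__list_insert(n, head, head->next);
}

/* Like the kernel's list_add_tail(): insert right before the head, so a
 * forward traversal sees entries in the order they were added (FIFO). */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
	__list_insert(n, head->prev, head);
}

struct buf { long blocknr; struct list_head list; };

int main(void)
{
	struct list_head tr_buf = { &tr_buf, &tr_buf };
	struct buf bufs[5];
	struct list_head *p;
	int i;

	for (i = 0; i < 5; i++) {		/* blocks allocated in ascending order */
		bufs[i].blocknr = 100 + i;
		list_add_tail(&bufs[i].list, &tr_buf);	/* swap in list_add() to get 104..100 */
	}
	for (p = tr_buf.next; p != &tr_buf; p = p->next) {
		struct buf *b = (struct buf *)((char *)p - offsetof(struct buf, list));
		printf("%ld ", b->blocknr);	/* prints: 100 101 102 103 104 */
	}
	printf("\n");
	return 0;
}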