[Cluster-devel] [GFS2 PATCH] GFS2: gfs2_free_extlen can return an extent that is too long
Hi,

Function gfs2_free_extlen calculates the length of an extent of free
blocks that may be reserved. The end pointer was calculated as
end = start + bh->b_size, but b_size is incorrect because on the last
bitmap, the bitmap usually stops before the end of the buffer data.
This means that when you do a write, you can reserve a chunk of blocks
that runs off the end of the last bitmap.

For example, I've got a file system where there is only one bitmap for
each rgrp, so ri_length == 1. I saw cases in which iozone tried to do a
big write, grabbed a large block reservation, and chose rgrp 5464152,
which has ri_data0 5464153 and ri_data 8188. So 5464153 + 8188 =
5472341, which is the end of the rgrp. The reservation it got back was
5470936, length 7229. But 5470936 + 7229 = 5478165, so the reservation
starts inside the rgrp but runs 5824 blocks past the end of the bitmap.

This patch fixes the calculation so it won't exceed the last bitmap. It
also adds a BUG_ON to guard against overflows in the future.
Signed-off-by: Bob Peterson
---
 fs/gfs2/incore.h | 1 +
 fs/gfs2/rgrp.c   | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 0bbbaa9b05cb..d2ad817e089f 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -116,6 +116,7 @@ static inline struct gfs2_bitmap *rbm_bi(const struct gfs2_rbm *rbm)
 
 static inline u64 gfs2_rbm_to_block(const struct gfs2_rbm *rbm)
 {
+	BUG_ON(rbm->offset >= rbm->rgd->rd_data);
 	return rbm->rgd->rd_data0 + (rbm_bi(rbm)->bi_start * GFS2_NBBY) +
 		rbm->offset;
 }
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index d7a8ed8bb30c..2141373300e0 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -372,8 +372,8 @@ static u32 gfs2_free_extlen(const struct gfs2_rbm *rrbm, u32 len)
 		start = bi->bi_bh->b_data;
 		if (bi->bi_clone)
 			start = bi->bi_clone;
-		end = start + bi->bi_bh->b_size;
 		start += bi->bi_offset;
+		end = start + bi->bi_len;
 		BUG_ON(rbm.offset & 3);
 		start += (rbm.offset / GFS2_NBBY);
 		bytes = min_t(u32, len / GFS2_NBBY, (end - start));
Re: [Cluster-devel] [GFS2 PATCH] gfs2: revert af21ca8ed50f01c5278c5ded6dad6f05e8a5d2e4
Hi,

On 30/05/18 20:18, Bob Peterson wrote:
> Hi,
>
> Patch af21ca8ed5 is defective and needs to be reverted. The idea was
> to only reserve single blocks for directories, with the thought that
> they should be packed together. That's all well and good for creating
> new directories, but directories have other reasons for getting
> reservations: for example, when files are added to them and they need
> their directory hash tables expanded. For those purposes, one block
> may not be enough and the reservation can run out of blocks, forcing
> us to do block allocations without a reservation. We can also get
> into situations where we cannot modify a directory when we should be
> able to.
>
> This patch reverts the patch.
>
> Signed-off-by: Bob Peterson

Doing allocation without a reservation is ok though, since the
reservations are only hints and what really protects us against running
out of blocks is the block counts on the rgrp.

I forget now what the reason was that we couldn't add reservations of
more than one block for directories, but I had a feeling there was a
reason we did that at the time. That said, having reservations result
in directories which are less fragmented is definitely a good thing.

One issue is whether it makes sense in the case of deep directory trees
in which each directory contains only other directories, and not a huge
number of them (so that the directories are still stuffed files). In
that case we'll land up with a lot of fragmentation if we are not
careful and simply apply a reservation without some knowledge of the
directory size/layout.

It would be good if we could add a feature to the allocator to allocate
contiguous extents even if it means searching a bit harder, and then to
use that for directory hash tables so that we can at least keep those
to a single extent where possible. The downside is that we then land up
leaving holes where the old hash table was, which are then not easy to
reuse...

Steve.
---
 fs/gfs2/rgrp.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 8be47b81011a..d7a8ed8bb30c 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1529,14 +1529,9 @@ static void rg_mblk_search(struct gfs2_rgrpd *rgd, struct gfs2_inode *ip,
 	u32 extlen;
 	u32 free_blocks = rgd_free(rgd, rs);
 	int ret;
-	struct inode *inode = &ip->i_inode;
 
-	if (S_ISDIR(inode->i_mode))
-		extlen = 1;
-	else {
-		extlen = max_t(u32, atomic_read(&rs->rs_sizehint), ap->target);
-		extlen = clamp(extlen, RGRP_RSRV_MINBLKS, free_blocks);
-	}
+	extlen = max_t(u32, atomic_read(&rs->rs_sizehint), ap->target);
+	extlen = clamp(extlen, RGRP_RSRV_MINBLKS, free_blocks);
 
 	if ((rgd->rd_free_clone < rgd->rd_reserved) || (free_blocks < extlen))
 		return;
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Add to tail, not head, of transaction
Hi,

On 30/05/18 20:24, Bob Peterson wrote:
> Hi,
>
> Before this patch, function gfs2_trans_add_meta called list_add to
> add a buffer to a transaction, tr_buf. Later, the before_commit
> functions traversed the list in sequential order, which meant the
> buffers were processed in a sub-optimal order. For example, blocks
> could go out in 54321 order rather than 12345, causing media heads to
> bounce unnecessarily. This makes no difference for small IO
> operations, but large writes benefit greatly when they add lots of
> indirect blocks to a transaction and those blocks are allocated in
> ascending order, as the block allocator tries to do.
>
> This patch changes it to list_add_tail so the buffers are traversed
> (and therefore written back) in the same order as they are added to
> the transaction.
>
> In one of the more extreme examples, I ran a test with 10
> simultaneous instances of iozone, each writing 25GB of data, on one
> of our performance testing machines. The results are:
>
>                            Without the patch    With the patch
>                            -----------------    -----------------
> Children see throughput    1395068.00 kB/sec    1527073.61 kB/sec
> Parent sees throughput     1348544.00 kB/sec    1485594.66 kB/sec
>
> These numbers are artificially inflated because I was also running
> with Andreas Gruenbacher's iomap-write patch set plus my rgrp sharing
> patch set in both cases. Still, it shows a 9 percent performance
> boost for both children and parent throughput.
>
> Signed-off-by: Bob Peterson

That makes sense assuming that a transaction will always add the blocks
in ascending order. Writing into the journal will not be affected by
this, since that will always be sequential anyway, so I assume that you
are referring to AIL writeback here?

It might make more sense just to sort the blocks into order at journal
flush time, since then we'll always be in order no matter what order
the blocks are added during the transactions,

Steve.
---
 fs/gfs2/trans.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index c75cacaa349b..c4c00b90935c 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -247,7 +247,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
 		gfs2_pin(sdp, bd->bd_bh);
 		mh->__pad0 = cpu_to_be64(0);
 		mh->mh_jid = cpu_to_be32(sdp->sd_jdesc->jd_jid);
-		list_add(&bd->bd_list, &tr->tr_buf);
+		list_add_tail(&bd->bd_list, &tr->tr_buf);
 		tr->tr_num_buf_new++;
 out_unlock:
 	gfs2_log_unlock(sdp);