On Tue, Feb 14, 2017 at 11:32 AM, Steven Whitehouse <swhit...@redhat.com> wrote: > Hi, > > > On 13/02/17 17:59, Andrew Price wrote: >> >> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The >> rg_skip field has the following meaning: >> >> - If rg_skip is zero, it is considered unset and not useful. >> - If rg_skip is non-zero, its value will be the number of blocks between >> this rgrp's address and the next rgrp's address. This can be used as a >> hint by fsck.gfs2 when rebuilding a bad rindex, for example. >> >> When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip >> to be 0 it will attempt to set it to the difference between its rd_addr >> and the rd_addr of the next resource group. >> >> The only special case is the final rgrp, which always has a rg_skip of >> 0. It is not set to a special value (like -1) because, when the >> filesystem is grown, the rgrp will no longer be the final one and it >> will then need to have its rg_skip field set. The overhead of this >> special case is a gfs2_rgrpd_get_next() call each time >> gfs2_rgrp_bh_get() is called for the final resource group. >> >> For the other resource groups, if the rg_skip field is 0, it is set >> appropriately and then the only overhead becomes the rgd->rg_skip == 0 >> comparison in gfs2_rgrp_bh_get(). >> >> Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so >> the rg_skip field can get set back to 0 in cases where nodes with and >> without this patch are mixed in a cluster. In some cases, the field may >> bounce between being set by one node and then zeroed by another which >> may harm performance slightly, e.g. when two nodes create many small >> files. In testing this situation is rare but it becomes more likely as >> the filesystem fills up and there are fewer resource groups to choose >> from. The problem goes away when all nodes are running with this patch. >> Dipping into the space currently occupied by the rg_reserved field would >> have resulted in the same problem as it is also explicitly zeroed, so >> unfortunately there is no other way around it. >> >> Signed-off-by: Andrew Price <anpr...@redhat.com> > > We will also need patches for gfs2_convert, mkfs.gfs2, fsck.gfs2, gfs2_edit > and gfs2_grow too most likely. Ideally we want to be in a situation where we > can default to using only the info in the rgrp headers, and fall back to > looking at the rindex only in case that information is not there/broken in > the rgrp headers. > > > >> --- >> fs/gfs2/incore.h | 1 + >> fs/gfs2/rgrp.c | 27 ++++++++++++++++++++++++++- >> include/uapi/linux/gfs2_ondisk.h | 5 ++++- >> 3 files changed, 31 insertions(+), 2 deletions(-) >> >> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h >> index a6a3389..2c03287 100644 >> --- a/fs/gfs2/incore.h >> +++ b/fs/gfs2/incore.h >> @@ -88,6 +88,7 @@ struct gfs2_rgrpd { >> u32 rd_reserved; /* number of blocks reserved */ >> u32 rd_free_clone; >> u32 rd_dinodes; >> + u32 rd_skip; /* Distance to the next rgrp in fs >> blocks */ >> u64 rd_igeneration; >> struct gfs2_bitmap *rd_bits; >> struct gfs2_sbd *rd_sbd; >> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c >> index 86ccc015..aaf435d 100644 >> --- a/fs/gfs2/rgrp.c >> +++ b/fs/gfs2/rgrp.c >> @@ -1049,6 +1049,7 @@ static void gfs2_rgrp_in(struct gfs2_rgrpd *rgd, >> const void *buf) >> rgd->rd_flags |= rg_flags; >> rgd->rd_free = be32_to_cpu(str->rg_free); >> rgd->rd_dinodes = be32_to_cpu(str->rg_dinodes); >> + rgd->rd_skip = be32_to_cpu(str->rg_skip); >> rgd->rd_igeneration = be64_to_cpu(str->rg_igeneration); >> } >> @@ -1059,7 +1060,7 @@ static void gfs2_rgrp_out(struct gfs2_rgrpd *rgd, >> void *buf) >> str->rg_flags = cpu_to_be32(rgd->rd_flags & ~GFS2_RDF_MASK); >> str->rg_free = cpu_to_be32(rgd->rd_free); >> str->rg_dinodes = cpu_to_be32(rgd->rd_dinodes); >> - str->__pad = cpu_to_be32(0); >> + str->rg_skip = cpu_to_be32(rgd->rd_skip); >> str->rg_igeneration = cpu_to_be64(rgd->rd_igeneration); >> memset(&str->rg_reserved, 0, sizeof(str->rg_reserved)); >> } >> @@ -1119,6 +1120,28 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd) >> return count; >> } >> +/** >> + * Set the rg_next field if this isn't the final rgrp. >> + */ >> +static void gfs2_rgrp_set_skip(struct gfs2_rgrpd *rgd) >> +{ >> + struct gfs2_sbd *sdp = rgd->rd_sbd; >> + struct buffer_head *bh = rgd->rd_bits[0].bi_bh; >> + struct gfs2_rgrpd *next = gfs2_rgrpd_get_next(rgd); >> + >> + if (next == NULL || next->rd_addr <= rgd->rd_addr) >> + return; >> + >> + if (gfs2_trans_begin(sdp, RES_RG_HDR, 0) != 0) >> + return; >> + >> + rgd->rd_skip = next->rd_addr - rgd->rd_addr; >> + gfs2_trans_add_meta(rgd->rd_gl, bh); >> + gfs2_rgrp_out(rgd, bh->b_data); >> + gfs2_rgrp_ondisk2lvb(rgd->rd_rgl, bh->b_data); >> + gfs2_trans_end(sdp); >> + return; >> +} >> /** >> * gfs2_rgrp_bh_get - Read in a RG's header and bitmaps >> @@ -1184,6 +1207,8 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd) >> if (rgd->rd_rgl->rl_unlinked == 0) >> rgd->rd_flags &= ~GFS2_RDF_CHECK; >> } >> + if (rgd->rd_skip == 0 && !(sdp->sd_vfs->s_flags & MS_RDONLY)) >> + gfs2_rgrp_set_skip(rgd); >> return 0; > > This looks like it can be called under a read lock. It might be better to > move the setting of the field so that we only set it when we are going to > write the rgrp anyway. That will remove the additional overhead, and also > avoid the complication of trying to figure out when it is ok to set it.
I second that. The skip fields will be initialized eventually either way; I don't see a need to speed that up. > You > could also avoid storing it directly by calculating it on write, and using > the existing rindex fields in struct gfs2_rgrpd - no need to duplicate the > information there. > > >> fail: >> diff --git a/include/uapi/linux/gfs2_ondisk.h >> b/include/uapi/linux/gfs2_ondisk.h >> index 7c4be77..0064381f 100644 >> --- a/include/uapi/linux/gfs2_ondisk.h >> +++ b/include/uapi/linux/gfs2_ondisk.h >> @@ -186,7 +186,10 @@ struct gfs2_rgrp { >> __be32 rg_flags; >> __be32 rg_free; >> __be32 rg_dinodes; >> - __be32 __pad; >> + union { >> + __be32 __pad; >> + __be32 rg_skip; /* Distance to the next rgrp in fs blocks >> */ >> + }; >> __be64 rg_igeneration; >> __u8 rg_reserved[80]; /* Several fields from gfs1 now reserved */ > > That looks like a good place to fit rg_skip in, but gfs2_rindex contains > ri_data0, ri_data and ri_bitbytes and to allow the eventual removal of the > rindex, those need also to be duplicated into the rgrp. There is plenty of > space there to fit those fields in, so we should do that at the same time. > We could also take the opportunity to add a checksum too, Are there plans for adding metadata checksums throughout all structures? rgrp checksums don't seem very valuable if we don't at least also checksum the super-block. > since rg_skip > being non-zero would tell us whether we should be checking the checksum - I > don't think that should be too tricky to do, but we could always add it > later too, if required. > > Other thoughts... if we also added a field to give us the size of the > following rgrp in disk blocks, then we could do proper readahead on them. So > as well as rg_skip we would need rg_next_len too, which would be the same as > ri_length of the following rgrp header, > > Steve. Thanks, Andreas