Re: [Cluster-devel] gfs2 iomap deadlock, IOMAP_F_UNBALANCED
On Fri, 22 Mar 2019 at 00:01, Andreas Gruenbacher wrote:
> On Thu, 21 Mar 2019 at 22:43, Dave Chinner wrote:
> > The problem is calling balance_dirty_pages() inside the
> > ->iomap_begin/->iomap_end calls and not that it is called by the
> > iomap infrastructure itself, right?
> >
> > If so, I'd prefer to see this in iomap_apply() after the call to
> > ops->iomap_end because iomap_file_buffered_write() can iterate and
> > call iomap_apply() multiple times. This would keep the balancing to
> > a per-iomap granularity, rather than a per-syscall granularity.
> >
> > i.e. if we do write(2GB), we want more than one balancing call
> > during that syscall, so it would be up to the filesystem to a) limit
> > the size of write mappings to something smaller (e.g. 1024 pages)
> > so that there are still frequent balancing calls for large writes.
>
> Hmm. The looping across multiple mappings isn't done in iomap_apply
> but in iomap_file_buffered_write, so the balancing could go into
> iomap_apply or iomap_file_buffered_write, but can't go further up the
> stack. Given that, iomap_file_buffered_write seems the better place,
> but this is still quite horrible.

Here's a more reasonable version of my first patch, with a cleaned-up and
hopefully fixed gfs2 part. In addition, this checks for IOMAP_F_UNBALANCED
in iomap_dirty_actor, the actor for iomap_file_dirty. We don't use
iomap_file_dirty in gfs2, but we should probably allow skipping the dirty
page balancing there as well.
Thanks,
Andreas

---
 fs/gfs2/bmap.c        | 64 +--
 fs/iomap.c            |  6 ++--
 include/linux/iomap.h |  1 +
 3 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 02b2646d84b3a..628d66d07fc6c 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -974,6 +974,19 @@ static void gfs2_iomap_journaled_page_done(struct inode *inode, loff_t pos,
 	gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
 }
 
+static void gfs2_write_end(struct inode *inode, struct buffer_head *dibh)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_trans *tr = current->journal_info;
+
+	gfs2_ordered_add_inode(ip);
+
+	if (tr->tr_num_buf_new)
+		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	else
+		gfs2_trans_add_meta(ip->i_gl, dibh);
+}
+
 static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 				  loff_t length, unsigned flags,
 				  struct iomap *iomap,
@@ -996,6 +1009,25 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	if (ret)
 		goto out_unlock;
 
+	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip) || inode == sdp->sd_rindex) {
+		int max_pages;
+		u64 max_length;
+
+		iomap->flags |= IOMAP_F_UNBALANCED;
+
+		/*
+		 * Limit the write size: this ensures that write throttling
+		 * will kick in fast enough even when we don't call
+		 * balance_dirty_pages_ratelimited for each page written.
+		 */
+		max_pages = current->nr_dirtied_pause - current->nr_dirtied;
+		if (max_pages < 8)
+			max_pages = 8;
+		max_length = (u64)max_pages << PAGE_SHIFT;
+		if (iomap->length > max_length)
+			iomap->length = max_length;
+	}
+
 	alloc_required = unstuff || iomap->type == IOMAP_HOLE;
 
 	if (alloc_required || gfs2_is_jdata(ip))
@@ -1052,6 +1084,11 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	}
 	if (!gfs2_is_stuffed(ip) && gfs2_is_jdata(ip))
 		iomap->page_done = gfs2_iomap_journaled_page_done;
+
+	if (!(iomap->flags & IOMAP_F_UNBALANCED)) {
+		gfs2_write_end(inode, mp->mp_bh[0]);
+		gfs2_trans_end(sdp);
+	}
 	return 0;
 
 out_trans_end:
@@ -1103,30 +1140,29 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	struct gfs2_trans *tr = current->journal_info;
 	struct buffer_head *dibh = iomap->private;
 
 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != IOMAP_WRITE)
 		goto out;
 
-	if (iomap->type != IOMAP_INLINE) {
-		gfs2_ordered_add_inode(ip);
+	if (current->journal_info) {
+		struct gfs2_sbd *sdp = GFS2_SB(inode);
 
-		if (tr->tr_num_buf_new)
-			__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
-		else
-			gfs2_trans_add_meta(ip->i_gl, dibh);
-	}
+		if (iomap->type != IOMAP_INLINE)
+			gfs2_write_end(inode, dibh);
 
-
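The write-size clamp in the patch above can be illustrated in isolation. The sketch below is a hypothetical standalone rendering (the function name `clamp_write_length` is made up, and `PAGE_SHIFT` of 12 assumes 4 KiB pages); it is not the kernel code itself:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumes 4 KiB pages */

/*
 * Standalone sketch of the clamp from gfs2_iomap_begin_write: cap the
 * mapping length at the number of pages the task may still dirty before
 * it would be throttled, with a floor of 8 pages so small headroom never
 * stalls forward progress.
 */
static uint64_t clamp_write_length(uint64_t length,
                                   int nr_dirtied_pause, int nr_dirtied)
{
    int max_pages = nr_dirtied_pause - nr_dirtied;
    uint64_t max_length;

    if (max_pages < 8)
        max_pages = 8;
    max_length = (uint64_t)max_pages << PAGE_SHIFT;
    if (length > max_length)
        length = max_length;
    return length;
}
```

For example, with 32 pages of dirty headroom a 1 MiB mapping is cut down to 128 KiB, so the write loop returns to its caller after at most 32 pages even though per-page balancing is skipped.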
Re: [Cluster-devel] gfs2 iomap deadlock, IOMAP_F_UNBALANCED
On Thu, 21 Mar 2019 at 22:43, Dave Chinner wrote:
> The problem is calling balance_dirty_pages() inside the
> ->iomap_begin/->iomap_end calls and not that it is called by the
> iomap infrastructure itself, right?
>
> If so, I'd prefer to see this in iomap_apply() after the call to
> ops->iomap_end because iomap_file_buffered_write() can iterate and
> call iomap_apply() multiple times. This would keep the balancing to
> a per-iomap granularity, rather than a per-syscall granularity.
>
> i.e. if we do write(2GB), we want more than one balancing call
> during that syscall, so it would be up to the filesystem to a) limit
> the size of write mappings to something smaller (e.g. 1024 pages)
> so that there are still frequent balancing calls for large writes.

Hmm. The looping across multiple mappings isn't done in iomap_apply
but in iomap_file_buffered_write, so the balancing could go into
iomap_apply or iomap_file_buffered_write, but can't go further up the
stack. Given that, iomap_file_buffered_write seems the better place,
but this is still quite horrible.

Thanks,
Andreas
Re: [Cluster-devel] gfs2 iomap deadlock, IOMAP_F_UNBALANCED
On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
>
> we need your help fixing a gfs2 deadlock involving iomap. What's going
> on is the following:
>
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end. It currently always does that
>   even though there is no point other than for journaled data writes.
>
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.
>
> We can stop gfs2_iomap_begin from keeping the log flush lock held for
> non-journaled data writes, but that still leaves us with the deadlock
> for journaled data writes: for them, we need to add the data pages to
> the running transaction, so dropping the log flush lock wouldn't work.
>
> I've tried to work around the unwanted balance_dirty_pages_ratelimited
> in gfs2_write_inode as originally suggested by Ross. That works to a
> certain degree, but only if we always skip inodes in the WB_SYNC_NONE
> case, and that means that gfs2_write_inode becomes quite ineffective.
>
> So it seems that it would be more reasonable to avoid the dirty page
> balancing under the log flush lock in the first place.
>
> The attached patch changes gfs2_iomap_begin to only hold on to the log
> flush lock for journaled data writes. In that case, we also make sure
> to limit the write size to not overrun dirty limits using a similar
> logic as in balance_dirty_pages_ratelimited; there is precedent for that
> approach in btrfs_buffered_write. Then, we prevent iomap from balancing
> dirty pages via the new IOMAP_F_UNBALANCED flag.
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 58a768e59712e..542bd149c4aa3 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iocb->ki_pos += ret;
>  	}
>
> +	balance_dirty_pages_ratelimited(file->f_mapping);
> +
>  out2:
>  	current->backing_dev_info = NULL;
>  out:

The problem is calling balance_dirty_pages() inside the
->iomap_begin/->iomap_end calls and not that it is called by the
iomap infrastructure itself, right?

If so, I'd prefer to see this in iomap_apply() after the call to
ops->iomap_end because iomap_file_buffered_write() can iterate and
call iomap_apply() multiple times. This would keep the balancing to
a per-iomap granularity, rather than a per-syscall granularity.

i.e. if we do write(2GB), we want more than one balancing call
during that syscall, so it would be up to the filesystem to a) limit
the size of write mappings to something smaller (e.g. 1024 pages)
so that there are still frequent balancing calls for large writes.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
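Dave's write(2GB) example can be checked mechanically: if the filesystem caps each write mapping at 1024 pages (assuming 4 KiB pages), the 2 GiB write spans many iomap_apply() iterations, each of which would get its own balancing call. The helper name below is illustrative, not from the thread:

```c
#include <assert.h>
#include <stdint.h>

/*
 * How many per-iomap balancing calls a buffered write of write_bytes
 * would see, if each mapping is capped at cap_pages pages.
 */
static uint64_t balancing_calls(uint64_t write_bytes,
                                uint64_t cap_pages, uint64_t page_size)
{
    return write_bytes / (cap_pages * page_size);
}
```

With a 1024-page cap and 4 KiB pages, each mapping covers 4 MiB, so a 2 GiB write gets 512 balancing calls instead of one per syscall.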
Re: [Cluster-devel] [PATCH] vfs: Allow selection of fs root independent of sb
On Thu, Mar 21, 2019 at 05:26:44PM +, Andrew Price wrote:
> Take a function pointer 'select_root' in vfs_get_block_super() to allow
> callers to select the root dentry based on the context.

Can't this be a fs_context_operations pointer?
Re: [Cluster-devel] [PATCH] vfs: Allow selection of fs root independent of sb
That's not what I meant. There's a kerneldoc comment on
vfs_get_block_super() that you need to modify.

I would in any case roll your patch into mine.

> This is intended for gfs2 to select between the normal root and the
> 'meta' root, which must be done whether a root already exists for a
> superblock or not, i.e. it cannot be decided in fill_super().

Can this be done in gfs2_get_tree, where it sets fc->root?

David
[Cluster-devel] [PATCH] vfs: Allow selection of fs root independent of sb
Take a function pointer 'select_root' in vfs_get_block_super() to allow
callers to select the root dentry based on the context.

This is intended for gfs2 to select between the normal root and the
'meta' root, which must be done whether a root already exists for a
superblock or not, i.e. it cannot be decided in fill_super().

This will allow gfs2 to avoid duplicating a function from the vfs just
to special-case the root dentry selection.

Signed-off-by: Andrew Price
---
Depends on "vfs: Create fs_context-aware mount_bdev() replacement"

 fs/super.c                 | 11 ---
 include/linux/fs_context.h |  4 ++--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 6d8dbf309241..0cdeaf28126f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1232,11 +1232,12 @@ static int test_bdev_super_fc(struct super_block *s, struct fs_context *fc)
  * @fill_super: Helper to initialise a new superblock
  */
 int vfs_get_block_super(struct fs_context *fc,
-			int (*fill_super)(struct super_block *,
-					  struct fs_context *))
+			int (*fill_super)(struct super_block *, struct fs_context *),
+			struct dentry* (*select_root)(struct fs_context *, struct super_block *))
 {
 	struct block_device *bdev;
 	struct super_block *s;
+	struct dentry *root;
 	int error = 0;
 
 	fc->bdev_mode = FMODE_READ | FMODE_EXCL;
@@ -1302,7 +1303,11 @@ int vfs_get_block_super(struct fs_context *fc,
 	}
 
 	BUG_ON(fc->root);
-	fc->root = dget(s->s_root);
+	if (select_root)
+		root = select_root(fc, s);
+	else
+		root = s->s_root;
+	fc->root = dget(root);
 	return 0;
 
 error_sb:
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 8233c873af73..c2e9dc5826ce 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -145,8 +145,8 @@ extern int vfs_get_super(struct fs_context *fc,
 					   struct fs_context *fc));
 extern int vfs_get_block_super(struct fs_context *fc,
-		int (*fill_super)(struct super_block *sb,
-				  struct fs_context *fc));
+		int (*fill_super)(struct super_block *, struct fs_context *),
+		struct dentry* (*select_root)(struct fs_context *, struct super_block *));
 
 extern const struct file_operations fscontext_fops;
-- 
2.20.1
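A callback gfs2 might pass as 'select_root' is easy to sketch. The stand-in struct definitions below are minimal stubs for illustration, and gfs2_select_root itself is hypothetical (the thread never shows the gfs2 side), though the field names sd_master_dir, sd_root_dir, and ar_meta follow the discussion:

```c
#include <assert.h>

/* Minimal stand-ins for the kernel types -- illustration only. */
struct dentry { const char *name; };
struct super_block { void *s_fs_info; };
struct fs_context { void *fs_private; };

struct gfs2_sbd { struct dentry *sd_root_dir, *sd_master_dir; };
struct gfs2_args { int ar_meta; };

/*
 * Hypothetical select_root callback for vfs_get_block_super(): pick the
 * 'meta' root when the mount context asks for it, the normal root
 * otherwise.  This works whether the superblock is new or reused,
 * which is exactly what fill_super() cannot decide.
 */
static struct dentry *gfs2_select_root(struct fs_context *fc,
                                       struct super_block *s)
{
    struct gfs2_sbd *sdp = s->s_fs_info;
    struct gfs2_args *args = fc->fs_private;

    return args->ar_meta ? sdp->sd_master_dir : sdp->sd_root_dir;
}
```

The design point is that the choice depends only on the mount context and the (possibly pre-existing) superblock, so it belongs after superblock lookup rather than inside fill_super().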
Re: [Cluster-devel] [PATCH v2 1/2] gfs2: Convert gfs2 to fs_context
Andrew Price wrote:
> Would this fly?

Can you update the comment to document what it's for?

David
[Cluster-devel] gfs2 iomap deadlock, IOMAP_F_UNBALANCED
Hi Christoph,

we need your help fixing a gfs2 deadlock involving iomap. What's going
on is the following:

 * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
   lock and keeps it until gfs2_iomap_end. It currently always does that
   even though there is no point other than for journaled data writes.

 * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
   log flush lock again and deadlock.

We can stop gfs2_iomap_begin from keeping the log flush lock held for
non-journaled data writes, but that still leaves us with the deadlock
for journaled data writes: for them, we need to add the data pages to
the running transaction, so dropping the log flush lock wouldn't work.

I've tried to work around the unwanted balance_dirty_pages_ratelimited
in gfs2_write_inode as originally suggested by Ross. That works to a
certain degree, but only if we always skip inodes in the WB_SYNC_NONE
case, and that means that gfs2_write_inode becomes quite ineffective.

So it seems that it would be more reasonable to avoid the dirty page
balancing under the log flush lock in the first place.

The attached patch changes gfs2_iomap_begin to only hold on to the log
flush lock for journaled data writes. In that case, we also make sure
to limit the write size to not overrun dirty limits using a similar
logic as in balance_dirty_pages_ratelimited; there is precedent for that
approach in btrfs_buffered_write. Then, we prevent iomap from balancing
dirty pages via the new IOMAP_F_UNBALANCED flag.

Does that seem like an acceptable approach?
Thanks,
Andreas

---
 fs/gfs2/bmap.c        | 62 +--
 fs/gfs2/file.c        |  2 ++
 fs/iomap.c            |  3 ++-
 include/linux/iomap.h |  1 +
 4 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 02b2646d84b3a..dad7d3810533e 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -862,6 +862,24 @@ static int gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
 	iomap->offset = lblock << inode->i_blkbits;
 	lblock_stop = (pos + length - 1) >> inode->i_blkbits;
 	len = lblock_stop - lblock + 1;
+
+	if ((flags & IOMAP_WRITE) && gfs2_is_jdata(ip)) {
+		int max_pages;
+		u64 max_blocks;
+
+		/*
+		 * Limit the write size: this ensures that write throttling
+		 * will kick in fast enough even when we don't call
+		 * balance_dirty_pages_ratelimited for each page written.
+		 */
+		max_pages = min((int)DIV_ROUND_UP(length, PAGE_SIZE),
+				current->nr_dirtied_pause - current->nr_dirtied);
+		max_pages = max(max_pages, 8);
+		max_blocks = (u64)max_pages << (PAGE_SHIFT - inode->i_blkbits);
+		if (len > max_blocks)
+			len = max_blocks;
+	}
+
 	iomap->length = len << inode->i_blkbits;
 
 	height = ip->i_height;
@@ -974,6 +992,19 @@ static void gfs2_iomap_journaled_page_done(struct inode *inode, loff_t pos,
 	gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
 }
 
+static void gfs2_write_end(struct inode *inode, struct buffer_head *dibh)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_trans *tr = current->journal_info;
+
+	gfs2_ordered_add_inode(ip);
+
+	if (tr->tr_num_buf_new)
+		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	else
+		gfs2_trans_add_meta(ip->i_gl, dibh);
+}
+
 static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 				  loff_t length, unsigned flags,
 				  struct iomap *iomap,
@@ -1052,6 +1083,13 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	}
 	if (!gfs2_is_stuffed(ip) && gfs2_is_jdata(ip))
 		iomap->page_done = gfs2_iomap_journaled_page_done;
+
+	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip) || inode == sdp->sd_rindex) {
+		iomap->flags |= IOMAP_F_UNBALANCED;
+	} else {
+		gfs2_write_end(inode, mp->mp_bh[0]);
+		gfs2_trans_end(sdp);
+	}
 	return 0;
 
 out_trans_end:
@@ -1103,28 +1141,24 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	struct gfs2_trans *tr = current->journal_info;
 	struct buffer_head *dibh = iomap->private;
 
 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != IOMAP_WRITE)
 		goto out;
 
-	if (iomap->type != IOMAP_INLINE) {
-		gfs2_ordered_add_inode(ip);
+	if (current->journal_info) {
+		struct gfs2_sbd *sdp
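The consumer side of IOMAP_F_UNBALANCED (the iomap.c hunk is not shown above) amounts to a single flag test before balancing. The sketch below is a toy model of that check, not the actual patch: the flag value is hypothetical and a counter stands in for balance_dirty_pages_ratelimited():

```c
#include <assert.h>

#define IOMAP_F_UNBALANCED 0x10  /* hypothetical value, for illustration only */

static int balance_calls;  /* stands in for balance_dirty_pages_ratelimited() */

/*
 * Toy model of the write actor's balancing decision: balance dirty
 * pages after copying data unless the filesystem flagged the mapping
 * as unbalanced because it still holds locks (for gfs2, the log flush
 * lock) that the writeback path may need -- the deadlock this thread
 * is about.
 */
static void maybe_balance(unsigned int iomap_flags)
{
    if (!(iomap_flags & IOMAP_F_UNBALANCED))
        balance_calls++;  /* balance_dirty_pages_ratelimited(mapping); */
}
```

Paired with the write-size clamp in gfs2_iomap_get, skipping the call here stays safe: the mapping is short enough that throttling still happens soon after the locks are dropped.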
Re: [Cluster-devel] [PATCH v2 1/2] gfs2: Convert gfs2 to fs_context
On 19/03/2019 17:53, Andrew Price wrote:
> On 19/03/2019 17:05, David Howells wrote:
> > Can you use vfs_get_block_super()?
>
> It might be possible if we can rearrange things so that this can be
> done outside of the function:
>
>     if (args->ar_meta)
>         fc->root = dget(sdp->sd_master_dir);
>     else
>         fc->root = dget(sdp->sd_root_dir);
>
> but we can't do that in our fill_super() because it needs to be
> selected whether we have an existing mount or not.

Would this fly?

diff --git a/fs/super.c b/fs/super.c
index 6d8dbf309241..0cdeaf28126f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1232,11 +1232,12 @@ static int test_bdev_super_fc(struct super_block *s, struct fs_context *fc)
  * @fill_super: Helper to initialise a new superblock
  */
 int vfs_get_block_super(struct fs_context *fc,
-			int (*fill_super)(struct super_block *,
-					  struct fs_context *))
+			int (*fill_super)(struct super_block *, struct fs_context *),
+			struct dentry* (*select_root)(struct fs_context *, struct super_block *))
 {
 	struct block_device *bdev;
 	struct super_block *s;
+	struct dentry *root;
 	int error = 0;
 
 	fc->bdev_mode = FMODE_READ | FMODE_EXCL;
@@ -1302,7 +1303,11 @@ int vfs_get_block_super(struct fs_context *fc,
 	}
 
 	BUG_ON(fc->root);
-	fc->root = dget(s->s_root);
+	if (select_root)
+		root = select_root(fc, s);
+	else
+		root = s->s_root;
+	fc->root = dget(root);
 	return 0;
 
 error_sb:
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 8233c873af73..c2e9dc5826ce 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -145,8 +145,8 @@ extern int vfs_get_super(struct fs_context *fc,
 					   struct fs_context *fc));
 extern int vfs_get_block_super(struct fs_context *fc,
-		int (*fill_super)(struct super_block *sb,
-				  struct fs_context *fc));
+		int (*fill_super)(struct super_block *, struct fs_context *),
+		struct dentry* (*select_root)(struct fs_context *, struct super_block *));
 
 extern const struct file_operations fscontext_fops;