Re: [f2fs-dev] [syzbot] [f2fs?] WARNING in rcu_sync_dtor
On Mon, Jul 29, 2024 at 03:27:21PM +0200, Jan Kara wrote: > Also as the "filesystem shutdown" is spreading across multiple filesystems, > I'm playing with the idea that maybe we could lift a flag like this to VFS > so that we can check it in VFS paths and abort some operations early. I've been thinking the same thing since I saw what CIFS was doing a couple of days ago with shutdowns. It's basically just stopping all new incoming modification operations if the flag is set. i.e. it's just a check in each filesystem method, and I suspect that many other filesystems that support shutdown do the same thing. It looks like exactly the same implementation as CIFS is about to be added to exfat - stop all the incoming methods and check in the writeback method - so having a generic superblock flag and generic checks before calling into filesystem methods would make it real easy for all filesystems to have basic ->shutdown support for when block devices go away suddenly. I also think that we should be lifting *IOC_SHUTDOWN to the VFS - the same ioctl is now implemented in 4-5 filesystems and they largely do the same thing - just set a bit in the internal filesystem superblock flags. Yes, filesystems like XFS and ext4 do special stuff with journals, but the generic VFS implementation could call the filesystem ->shutdown method to do that. > But > so far I'm not convinced the gain is worth the need to iron out various > subtle semantical differences of "shutdown" among filesystems. I don't think we need to change how any filesystem behaves when it is shut down. As long as filesystems follow at least the "no new modifications when shutdown" behaviour, filesystems can implement internal shutdown however they want... -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 1/5] zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call
On Tue, Jan 23, 2024 at 09:39:02PM +0100, Mikulas Patocka wrote: > > > On Tue, 23 Jan 2024, Johannes Thumshirn wrote: > > > Pass GFP_KERNEL instead of GFP_NOFS to the blkdev_zone_mgmt() call in > > zonefs_zone_mgmt(). > > > > As as zonefs_zone_mgmt() and zonefs_inode_zone_mgmt() are never called > > from a place that can recurse back into the filesystem on memory reclaim, > > it is save to call blkdev_zone_mgmt() with GFP_KERNEL. > > > > Signed-off-by: Johannes Thumshirn > > --- > > fs/zonefs/super.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c > > index 93971742613a..63fbac018c04 100644 > > --- a/fs/zonefs/super.c > > +++ b/fs/zonefs/super.c > > @@ -113,7 +113,7 @@ static int zonefs_zone_mgmt(struct super_block *sb, > > > > trace_zonefs_zone_mgmt(sb, z, op); > > ret = blkdev_zone_mgmt(sb->s_bdev, op, z->z_sector, > > - z->z_size >> SECTOR_SHIFT, GFP_NOFS); > > + z->z_size >> SECTOR_SHIFT, GFP_KERNEL); > > if (ret) { > > zonefs_err(sb, > >"Zone management operation %s at %llu failed %d\n", > > > > -- > > 2.43.0 > > zonefs_inode_zone_mgmt calls > lockdep_assert_held(&ZONEFS_I(inode)->i_truncate_mutex); - so, this > function is called with the mutex held - could it happen that the > GFP_KERNEL allocation recurses into the filesystem and attempts to take > i_truncate_mutex as well? > > i.e. GFP_KERNEL -> iomap_do_writepage -> zonefs_write_map_blocks -> > zonefs_write_iomap_begin -> mutex_lock(&zi->i_truncate_mutex) zonefs doesn't have a ->writepage method, so writeback can't be called from memory reclaim like this. -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 07/11] vfs: add nowait parameter for file_accessed()
On Fri, Sep 08, 2023 at 01:29:55AM +0100, Pavel Begunkov wrote: > On 9/3/23 23:30, Dave Chinner wrote: > > On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote: > > > On 8/29/23 19:53, Matthew Wilcox wrote: > > > > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote: > > > > > On 8/28/23 05:32, Matthew Wilcox wrote: > > > > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote: > > > > > > > From: Hao Xu > > > > > > > > > > > > > > Add a boolean parameter for file_accessed() to support nowait > > > > > > > semantics. > > > > > > > Currently it is true only with io_uring as its initial caller. > > > > > > > > > > > > So why do we need to do this as part of this series? Apparently it > > > > > > hasn't caused any problems for filemap_read(). > > > > > > > > > > > > > > > > We need this parameter to indicate if nowait semantics should be > > > > > enforced in > > > > > touch_atime(), There are locks and maybe IOs in it. > > > > > > > > That's not my point. We currently call file_accessed() and > > > > touch_atime() for nowait reads and nowait writes. You haven't done > > > > anything to fix those. > > > > > > > > I suspect you can trim this patchset down significantly by avoiding > > > > fixing the file_accessed() problem. And then come back with a later > > > > patchset that fixes it for all nowait i/o. Or do a separate prep series > > > > > > I'm ok to do that. > > > > > > > first that fixes it for the existing nowait users, and then a second > > > > series to do all the directory stuff. > > > > > > > > I'd do the first thing. Just ignore the problem. Directory atime > > > > updates cause I/O so rarely that you can afford to ignore it. Almost > > > > everyone uses relatime or nodiratime. 
> > > > > > Hi Matthew, > > > The previous discussion shows this does cause issues in real > > > producations: > > > https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write > > > > > > Then separate it out into its own patch set so we can have a > > discussion on the merits of requiring using noatime, relatime or > > lazytime for really latency sensitive IO applications. Changing code > > is not always the right solution... > > Separation sounds reasonable, but it can hardly be said that only > latency sensitive apps would care about >1s nowait/async submission > delays. Presumably, btrfs can improve on that, but it still looks > like it's perfectly legit for filesystems do heavy stuff in > timestamping like waiting for IO. Right? Yes, it is, no-one is denying that. And some filesystems are worse than others, but none of that means it has to be fixed so getdents can be converted to NOWAIT semantics. i.e. this patchset is about the getdents NOWAIT machinery, and fiddling around with timestamps has much, much wider scope than just NOWAIT getdents machinery. We'll have this discussion about NOWAIT timestamp updates when an RFC is proposed to address the wider problem of how timestamp updates should behave in NOWAIT context. -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 02/11] xfs: add NOWAIT semantics for readdir
On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote: > From: Hao Xu > > Implement NOWAIT semantics for readdir. Return EAGAIN error to the > caller if it would block, like failing to get locks, or going to > do IO. > > Co-developed-by: Dave Chinner Not really. "Co-developed" implies equal development input between all the parties, which is not the case here - this patch is based on a prototype I wrote, whilst you're doing the refining, testing and correctness work. In these cases with XFS code, we add a line in the commit message to say: "This is based on a patch originally written by Dave Chinner." > Signed-off-by: Dave Chinner > Signed-off-by: Hao Xu > [fixes deadlock issue, tweak code style] With a signoff chain like you already have. In the end you'll also get an RVB from me, which seems rather wrong to me if I've apparently been "co-developing" the code. > @@ -156,7 +157,9 @@ xfs_dir2_block_getdents( > if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk) > return 0; > > - error = xfs_dir3_block_read(args->trans, dp, &bp); > + if (ctx->flags & DIR_CONTEXT_F_NOWAIT) > + flags |= XFS_DABUF_NOWAIT; > + error = xfs_dir3_block_read(args->trans, dp, flags, &bp); > if (error) > return error; > Given we do this same check in both block and leaf formats to set XFS_DABUF_NOWAIT, and we do the DIR_CONTEXT_F_NOWAIT check in xfs_readdir() as well, we should probably do this check once at the higher level and pass flags down from there with XFS_DABUF_NOWAIT already set. 
> @@ -240,6 +243,7 @@ xfs_dir2_block_getdents( > STATIC int > xfs_dir2_leaf_readbuf( > struct xfs_da_args *args, > + struct dir_context *ctx, > size_t bufsize, > xfs_dir2_off_t *cur_off, > xfs_dablk_t *ra_blk, > @@ -258,10 +262,15 @@ xfs_dir2_leaf_readbuf( > struct xfs_iext_cursor icur; > int ra_want; > int error = 0; > - > - error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK); > - if (error) > - goto out; > + unsigned int flags = 0; > + > + if (ctx->flags & DIR_CONTEXT_F_NOWAIT) { > + flags |= XFS_DABUF_NOWAIT; > + } else { > + error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK); > + if (error) > + goto out; > + } Especially as, in hindsight, this doesn't make a whole lot of sense. If XFS_DABUF_NOWAIT is set, we keep going until xfs_ilock_data_map_shared_nowait() where we call xfs_need_iread_extents() to see if we need to read the extents in and abort at that point. So, really, we shouldn't get this far with nowait semantics if we haven't read the extents in yet - we're supposed to already have the inode locked here and so we should have already checked this condition before we bother locking the inode... i.e. all we should be doing here is this: if (!(flags & XFS_DABUF_NOWAIT)) { error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK); if (error) goto out; } And then we don't need to pass the VFS dir_context down into low level XFS functions unnecessarily. > > /* > * Look for mapped directory blocks at or above the current offset. 
> @@ -280,7 +289,7 @@ xfs_dir2_leaf_readbuf( > new_off = xfs_dir2_da_to_byte(geo, map.br_startoff); > if (new_off > *cur_off) > *cur_off = new_off; > - error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp); > + error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, flags, > &bp); > if (error) > goto out; > > @@ -360,6 +369,7 @@ xfs_dir2_leaf_getdents( > int byteoff; /* offset in current block */ > unsigned int offset = 0; > int error = 0; /* error return value */ > + int written = 0; > > /* > * If the offset is at or past the largest allowed value, > @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents( > bp = NULL; > } > > - if (*lock_mode == 0) > - *lock_mode = xfs_ilock_data_map_shared(dp); > - error = xfs_dir2_leaf_readbuf(args, bufsize, &curoff, > - &rablk, &bp); > +
Re: [f2fs-dev] [PATCH 07/11] vfs: add nowait parameter for file_accessed()
On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote: > On 8/29/23 19:53, Matthew Wilcox wrote: > > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote: > > > On 8/28/23 05:32, Matthew Wilcox wrote: > > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote: > > > > > From: Hao Xu > > > > > > > > > > Add a boolean parameter for file_accessed() to support nowait > > > > > semantics. > > > > > Currently it is true only with io_uring as its initial caller. > > > > > > > > So why do we need to do this as part of this series? Apparently it > > > > hasn't caused any problems for filemap_read(). > > > > > > > > > > We need this parameter to indicate if nowait semantics should be enforced > > > in > > > touch_atime(), There are locks and maybe IOs in it. > > > > That's not my point. We currently call file_accessed() and > > touch_atime() for nowait reads and nowait writes. You haven't done > > anything to fix those. > > > > I suspect you can trim this patchset down significantly by avoiding > > fixing the file_accessed() problem. And then come back with a later > > patchset that fixes it for all nowait i/o. Or do a separate prep series > > I'm ok to do that. > > > first that fixes it for the existing nowait users, and then a second > > series to do all the directory stuff. > > > > I'd do the first thing. Just ignore the problem. Directory atime > > updates cause I/O so rarely that you can afford to ignore it. Almost > > everyone uses relatime or nodiratime. > > Hi Matthew, > The previous discussion shows this does cause issues in real > producations: > https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write > Then separate it out into its own patch set so we can have a discussion on the merits of requiring using noatime, relatime or lazytime for really latency sensitive IO applications. Changing code is not always the right solution... 
-Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH RFC v5 00/29] io_uring getdents
On Fri, Aug 25, 2023 at 09:54:02PM +0800, Hao Xu wrote: > From: Hao Xu > > This series introduce getdents64 to io_uring, the code logic is similar > with the snychronized version's. It first try nowait issue, and offload > it to io-wq threads if the first try fails. > > Patch1 and Patch2 are some preparation > Patch3 supports nowait for xfs getdents code > Patch4-11 are vfs change, include adding helpers and trylock for locks > Patch12-29 supports nowait for involved xfs journal stuff > note, Patch24 and 27 are actually two questions, might be removed later. > an xfs test may come later. You need to drop all the XFS journal stuff. It's fundamentally broken as it stands, and we cannot support non-blocking transactional changes without first putting a massive investment in transaction and intent chain rollback to allow correctly undoing partially complete modifications. Regardless, non-blocking transactions are completely unnecessary for a non-blocking readdir implementation. readdir should only be touching atime, and with relatime it should only occur once every 24 hours per inode. If that's a problem, then we have noatime mount options. Hence I just don't see any point in worrying about having a timestamp update block occasionally... I also don't really see why you need to fiddle with xfs buffer cache semantics - it already has the functionality "nowait" buffer reads require (i.e. XBF_INCORE|XBF_TRYLOCK). However, the readahead IO that the xfs readdir code issues cannot use your defined NOWAIT semantics - it must be able to allocate memory and issue IO. Readahead already avoids blocking on memory allocation and blocking on IO via the XBF_READ_AHEAD flag. This sets __GFP_NORETRY for buffer allocation and REQ_RAHEAD for IO. 
Hence readahead only needs the existing XBF_TRYLOCK flag to be set to be compatible with the required NOWAIT semantics. As for the NOIO memory allocation restrictions io_uring requires, that should be enforced at the io_uring layer before calling into the VFS using memalloc_noio_save/restore. At that point no memory allocation will trigger IO and none of the code running under NOWAIT conditions even needs to be aware that io_uring has a GFP_NOIO restriction on memory allocation. Please go back to the simple "do non-blocking buffer IO" implementation we started with and don't try to solve every little blocking problem that might exist in the VFS and filesystems... -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 25/29] xfs: support nowait for xfs_buf_item_init()
On Fri, Aug 25, 2023 at 09:54:27PM +0800, Hao Xu wrote: > From: Hao Xu > > support nowait for xfs_buf_item_init() and error out -EAGAIN to > _xfs_trans_bjoin() when it would block. > > Signed-off-by: Hao Xu > --- > fs/xfs/xfs_buf_item.c | 9 +++-- > fs/xfs/xfs_buf_item.h | 2 +- > fs/xfs/xfs_buf_item_recover.c | 2 +- > fs/xfs/xfs_trans_buf.c| 16 +--- > 4 files changed, 22 insertions(+), 7 deletions(-) > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c > index 023d4e0385dd..b1e63137d65b 100644 > --- a/fs/xfs/xfs_buf_item.c > +++ b/fs/xfs/xfs_buf_item.c > @@ -827,7 +827,8 @@ xfs_buf_item_free_format( > int > xfs_buf_item_init( > struct xfs_buf *bp, > - struct xfs_mount *mp) > + struct xfs_mount *mp, > + bool nowait) > { > struct xfs_buf_log_item *bip = bp->b_log_item; > int chunks; > @@ -847,7 +848,11 @@ xfs_buf_item_init( > return 0; > } > > - bip = kmem_cache_zalloc(xfs_buf_item_cache, GFP_KERNEL | __GFP_NOFAIL); > + bip = kmem_cache_zalloc(xfs_buf_item_cache, > + GFP_KERNEL | (nowait ? 0 : __GFP_NOFAIL)); > + if (!bip) > + return -EAGAIN; > + > xfs_log_item_init(mp, &bip->bli_item, XFS_LI_BUF, &xfs_buf_item_ops); > bip->bli_buf = bp; I see filesystem shutdowns > diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c > index 016371f58f26..a1e4f2e8629a 100644 > --- a/fs/xfs/xfs_trans_buf.c > +++ b/fs/xfs/xfs_trans_buf.c > @@ -57,13 +57,14 @@ xfs_trans_buf_item_match( > * If the buffer does not yet have a buf log item associated with it, > * then allocate one for it. Then add the buf item to the transaction. > */ > -STATIC void > +STATIC int > _xfs_trans_bjoin( > struct xfs_trans*tp, > struct xfs_buf *bp, > int reset_recur) > { > struct xfs_buf_log_item *bip; > + int ret; > > ASSERT(bp->b_transp == NULL); > > @@ -72,7 +73,11 @@ _xfs_trans_bjoin( >* it doesn't have one yet, then allocate one and initialize it. >* The checks to see if one is there are in xfs_buf_item_init(). 
>*/ > - xfs_buf_item_init(bp, tp->t_mountp); > + ret = xfs_buf_item_init(bp, tp->t_mountp, > + tp->t_flags & XFS_TRANS_NOWAIT); > + if (ret < 0) > + return ret; > + > bip = bp->b_log_item; > ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); > ASSERT(!(bip->__bli_format.blf_flags & XFS_BLF_CANCEL)); > @@ -92,6 +97,7 @@ _xfs_trans_bjoin( > xfs_trans_add_item(tp, &bip->bli_item); > bp->b_transp = tp; > > + return 0; > } > > void > @@ -309,7 +315,11 @@ xfs_trans_read_buf_map( > } > > if (tp) { > - _xfs_trans_bjoin(tp, bp, 1); > + error = _xfs_trans_bjoin(tp, bp, 1); > + if (error) { > + xfs_buf_relse(bp); > + return error; > + } > trace_xfs_trans_read_buf(bp->b_log_item); So what happens at the callers when we have a dirty transaction and joining a buffer fails with -EAGAIN? Apart from the fact this may well propagate -EAGAIN up to userspace, cancelling a dirty transaction at this point will result in a filesystem shutdown Indeed, this can happen in the "simple" timestamp update case that this "nowait" semantic is being aimed at. We log the inode in the timestamp update, which dirties the log item and registers a precommit operation to be run. We commit the transaction, which then runs xfs_inode_item_precommit() and that may need to attach the inode to the inode cluster buffer. This results in: xfs_inode_item_precommit xfs_imap_to_bp xfs_trans_read_buf_map _xfs_trans_bjoin xfs_buf_item_init(XFS_TRANS_NOWAIT) kmem_cache_zalloc(GFP_NOFS) gets -EAGAIN error propagates -EAGAIN fails due to -EAGAIN And now xfs_trans_commit() fails with a dirty transaction and the filesystem shuts down. IOWs, XFS_TRANS_NOWAIT as it stands is fundamentally broken. Once we dirty an item in a transaction, we *cannot* back out of the transaction. 
We *must block* in every place that could fail - locking, memory allocation and/or IO - until the transaction completes because we cannot undo the changes we've already made to the dirty items in the transaction. It's even worse than that - once we have committed intents, the whole chain of intent processing must be run to completion. Hence we can't tolerate backing out of
Re: [f2fs-dev] [PATCH 24/29] xfs: support nowait for xfs_buf_read_map()
On Fri, Aug 25, 2023 at 09:54:26PM +0800, Hao Xu wrote: > From: Hao Xu > > This causes xfstests generic/232 hung in umount process, waiting for ail > push, so I comment it for now, need some hints from xfs folks. > Not a real patch. > > Signed-off-by: Hao Xu > --- > fs/xfs/xfs_buf.c | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index cdad80e1ae25..284962a9f31a 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -828,6 +828,13 @@ xfs_buf_read_map( > trace_xfs_buf_read(bp, flags, _RET_IP_); > > if (!(bp->b_flags & XBF_DONE)) { > +// /* > +//* Let's bypass the _xfs_buf_read() for now > +//*/ > +// if (flags & XBF_NOWAIT) { > +// xfs_buf_relse(bp); > +// return -EAGAIN; > +// } This is *fundamentally broken*, and apart from anything else breaks readahead. If we asked for a read, we cannot instantiate the buffer and then *not issue any IO on it* and release it. That leaves an uninitialised buffer in memory, and there's every chance that something then trips over it and bad things happen. A buffer like this *must* be errored out and marked stale so that the next access to it will then re-initialise the buffer state and trigger any preparatory work that needs to be done for the new operation. This comes back to my first comments that XBF_TRYLOCK cannot simply be replaced with XBF_NOWAIT semantics... -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 28/29] xfs: support nowait semantics for xc_ctx_lock in xlog_cil_commit()
On Fri, Aug 25, 2023 at 09:54:30PM +0800, Hao Xu wrote: > From: Hao Xu > > Apply trylock logic for xc_ctx_lock in xlog_cil_commit() in nowait > case and error out -EAGAIN for xlog_cil_commit(). Again, fundamentally broken. Any error from xlog_cil_commit() will result in a filesystem shutdown as we cannot back out from failure with dirty log items gracefully at this point. -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 26/29] xfs: return -EAGAIN when nowait meets sync in transaction commit
On Fri, Aug 25, 2023 at 09:54:28PM +0800, Hao Xu wrote: > From: Hao Xu > > if the log transaction is a sync one, let's fail the nowait try and > return -EAGAIN directly since sync transaction means blocked by IO. > > Signed-off-by: Hao Xu > --- > fs/xfs/xfs_trans.c | 14 +- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 7988b4c7f36e..f1f84a3dd456 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -968,12 +968,24 @@ __xfs_trans_commit( > xfs_csn_t commit_seq = 0; > int error = 0; > int sync = tp->t_flags & XFS_TRANS_SYNC; > + bool nowait = tp->t_flags & XFS_TRANS_NOWAIT; > + bool perm_log = tp->t_flags & XFS_TRANS_PERM_LOG_RES; > > trace_xfs_trans_commit(tp, _RET_IP_); > > + if (nowait && sync) { > + /* > + * Currently nowait is only from xfs_vn_update_time() > + * so perm_log is always false here, but let's make > + * code general. > + */ > + if (perm_log) > + xfs_defer_cancel(tp); > + goto out_unreserve; > + } This is fundamentally broken. We cannot abort a transaction commit with dirty items at this point without shutting down the filesystem. This points to XFS_TRANS_NOWAIT being completely broken, too, because we don't call xfs_trans_set_sync() until just before we commit the transaction. At this point, it is -too late- for nowait+sync to be handled gracefully, and it will *always* go bad. IOWs, the whole transaction "nowait" semantics as designed and implemented is not a workable solution. -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 02/29] xfs: rename XBF_TRYLOCK to XBF_NOWAIT
42208..8cd307626939 100644 > --- a/fs/xfs/xfs_buf.h > +++ b/fs/xfs/xfs_buf.h > @@ -45,7 +45,7 @@ struct xfs_buf; > > /* flags used only as arguments to access routines */ > #define XBF_INCORE(1u << 29)/* lookup only, return if found in cache */ > -#define XBF_TRYLOCK (1u << 30)/* lock requested, but do not wait */ > +#define XBF_NOWAIT(1u << 30)/* mem/lock requested, but do not wait */ That's now a really poor comment. It doesn't describe the semantics or constraints that NOWAIT might imply. -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v4 46/48] mm: shrinker: make memcg slab shrink lockless
clear_bit(offset, unit->map); > + if (unlikely(!shrinker || !shrinker_try_get(shrinker))) > { > + clear_bit(offset, unit->map); > + rcu_read_unlock(); > continue; > } > + rcu_read_unlock(); > > /* Call non-slab shrinkers even though kmem is disabled > */ > if (!memcg_kmem_online() && > @@ -523,15 +546,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, > int nid, > set_shrinker_bit(memcg, nid, > shrinker_id); > } > freed += ret; > - > - if (rwsem_is_contended(&shrinker_rwsem)) { > - freed = freed ? : 1; > - goto unlock; > - } > + shrinker_put(shrinker); Ok, so why is this safe to call without holding the rcu read lock? The global shrinker has to hold the rcu_read_lock() whilst calling shrinker_put() to guarantee the validity of the list next pointer, but we don't hold off RCU here so what guarantees a racing global shrinker walk doesn't trip over this shrinker_put() call dropping the refcount to zero and freeing occurring in a different context... > + /* > + * We have already exited the read-side of rcu critical section > + * before calling do_shrink_slab(), the shrinker_info may be > + * released in expand_one_shrinker_info(), so reacquire the > + * shrinker_info. > + */ > + index++; > + goto again; With that, what makes the use of shrinker_info in xchg_nr_deferred_memcg() in do_shrink_slab() coherent and valid? -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote: > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index eb342994675a..f06225f18531 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -4,6 +4,8 @@ > > #include > #include > +#include > +#include > > #define SHRINKER_UNIT_BITS BITS_PER_LONG > > @@ -87,6 +89,10 @@ struct shrinker { > int seeks; /* seeks to recreate an obj */ > unsigned flags; > > + refcount_t refcount; > + struct completion done; > + struct rcu_head rcu; Documentation, please. What does the refcount protect, what does the completion provide, etc. > + > void *private_data; > > /* These are for internal use */ > @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, > const char *fmt, ...); > void shrinker_register(struct shrinker *shrinker); > void shrinker_free(struct shrinker *shrinker); > > +static inline bool shrinker_try_get(struct shrinker *shrinker) > +{ > + return refcount_inc_not_zero(&shrinker->refcount); > +} > + > +static inline void shrinker_put(struct shrinker *shrinker) > +{ > + if (refcount_dec_and_test(&shrinker->refcount)) > + complete(&shrinker->done); > +} > + > #ifdef CONFIG_SHRINKER_DEBUG > extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker, > const char *fmt, ...); > diff --git a/mm/shrinker.c b/mm/shrinker.c > index 1911c06b8af5..d318f5621862 100644 > --- a/mm/shrinker.c > +++ b/mm/shrinker.c > @@ -2,6 +2,7 @@ > #include > #include > #include > +#include > #include > > #include "internal.h" > @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, > struct mem_cgroup *memcg, > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) > return shrink_slab_memcg(gfp_mask, nid, memcg, priority); > > - if (!down_read_trylock(&shrinker_rwsem)) > - goto out; > - > - list_for_each_entry(shrinker, &shrinker_list, list) { > + rcu_read_lock(); > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > struct 
shrink_control sc = { > .gfp_mask = gfp_mask, > .nid = nid, > .memcg = memcg, > }; > > + if (!shrinker_try_get(shrinker)) > + continue; > + > + /* > + * We can safely unlock the RCU lock here since we already > + * hold the refcount of the shrinker. > + */ > + rcu_read_unlock(); > + > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) > ret = 0; > freed += ret; > + > /* > - * Bail out if someone want to register a new shrinker to > - * prevent the registration from being stalled for long periods > - * by parallel ongoing shrinking. > + * This shrinker may be deleted from shrinker_list and freed > + * after the shrinker_put() below, but this shrinker is still > + * used for the next traversal. So it is necessary to hold the > + * RCU lock first to prevent this shrinker from being freed, > + * which also ensures that the next shrinker that is traversed > + * will not be freed (even if it is deleted from shrinker_list > + * at the same time). >*/ This needs to be moved to the head of the function, and document the whole list walk, get, put and completion parts of the algorithm that make it safe. There's more to this than "we hold a reference count", especially the tricky "we might see the shrinker before it is fully initialised" case . > void shrinker_free(struct shrinker *shrinker) > { > struct dentry *debugfs_entry = NULL; > @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker) > if (!shrinker) > return; > > + if (shrinker->flags & SHRINKER_REGISTERED) { > + shrinker_put(shrinker); > + wait_for_completion(&shrinker->done); > + } Needs a comment explaining why we need to wait here... > + > down_write(&shrinker_rwsem); > if (shrinker->flags & SHRINKER_REGISTERED) { > - list_del(&shrinker->list); > + /* > + * Lookups on the shrinker are over and will f
Re: [f2fs-dev] [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}
atomic_long_add(nr, > &parent_unit->nr_deferred[offset]); > + } > } > } > up_read(&shrinker_rwsem); > @@ -407,7 +458,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, > int nid, > { > struct shrinker_info *info; > unsigned long ret, freed = 0; > - int i; > + int offset, index = 0; > > if (!mem_cgroup_online(memcg)) > return 0; > @@ -419,56 +470,63 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, > int nid, > if (unlikely(!info)) > goto unlock; > > - for_each_set_bit(i, info->map, info->map_nr_max) { > - struct shrink_control sc = { > - .gfp_mask = gfp_mask, > - .nid = nid, > - .memcg = memcg, > - }; > - struct shrinker *shrinker; > + for (; index < shriner_id_to_index(info->map_nr_max); index++) { > + struct shrinker_info_unit *unit; This adds another layer of indent to shrink_slab_memcg(). Please factor it first so that the code ends up being readable. Doing that first as a separate patch will also make the actual algorithm changes in this patch be much more obvious - this huge hunk of diff is pretty much impossible to review... -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote: > The shrinker_rwsem is a global read-write lock in shrinkers subsystem, > which protects most operations such as slab shrink, registration and > unregistration of shrinkers, etc. This can easily cause problems in the > following cases. > This commit uses the refcount+RCU method [5] proposed by Dave Chinner > to re-implement the lockless global slab shrink. The memcg slab shrink is > handled in the subsequent patch. > --- > include/linux/shrinker.h | 17 ++ > mm/shrinker.c| 70 +--- > 2 files changed, 68 insertions(+), 19 deletions(-) There's no documentation in the code explaining how the lockless shrinker algorithm works. It's left to the reader to work out how this all goes together > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index eb342994675a..f06225f18531 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -4,6 +4,8 @@ > > #include > #include > +#include > +#include > > #define SHRINKER_UNIT_BITS BITS_PER_LONG > > @@ -87,6 +89,10 @@ struct shrinker { > int seeks; /* seeks to recreate an obj */ > unsigned flags; > > + refcount_t refcount; > + struct completion done; > + struct rcu_head rcu; What does the refcount protect, why do we need the completion, etc? 
> + > void *private_data; > > /* These are for internal use */ > @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, > const char *fmt, ...); > void shrinker_register(struct shrinker *shrinker); > void shrinker_free(struct shrinker *shrinker); > > +static inline bool shrinker_try_get(struct shrinker *shrinker) > +{ > + return refcount_inc_not_zero(&shrinker->refcount); > +} > + > +static inline void shrinker_put(struct shrinker *shrinker) > +{ > + if (refcount_dec_and_test(&shrinker->refcount)) > + complete(&shrinker->done); > +} > + > #ifdef CONFIG_SHRINKER_DEBUG > extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker, > const char *fmt, ...); > diff --git a/mm/shrinker.c b/mm/shrinker.c > index 1911c06b8af5..d318f5621862 100644 > --- a/mm/shrinker.c > +++ b/mm/shrinker.c > @@ -2,6 +2,7 @@ > #include > #include > #include > +#include > #include > > #include "internal.h" > @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, > struct mem_cgroup *memcg, > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) > return shrink_slab_memcg(gfp_mask, nid, memcg, priority); > > - if (!down_read_trylock(&shrinker_rwsem)) > - goto out; > - > - list_for_each_entry(shrinker, &shrinker_list, list) { > + rcu_read_lock(); > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > struct shrink_control sc = { > .gfp_mask = gfp_mask, > .nid = nid, > .memcg = memcg, > }; > > + if (!shrinker_try_get(shrinker)) > + continue; > + > + /* > + * We can safely unlock the RCU lock here since we already > + * hold the refcount of the shrinker. > + */ > + rcu_read_unlock(); > + > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) > ret = 0; > freed += ret; > + > /* > - * Bail out if someone want to register a new shrinker to > - * prevent the registration from being stalled for long periods > - * by parallel ongoing shrinking. 
> + * This shrinker may be deleted from shrinker_list and freed > + * after the shrinker_put() below, but this shrinker is still > + * used for the next traversal. So it is necessary to hold the > + * RCU lock first to prevent this shrinker from being freed, > + * which also ensures that the next shrinker that is traversed > + * will not be freed (even if it is deleted from shrinker_list > + * at the same time). >*/ This comment really should be at the head of the function, describing the algorithm used within the function itself. i.e. how reference counts are used w.r.t. the rcu_read_lock() usage to guarantee existence of the shrinker and the validity of the list walk. I'm not going
Re: [f2fs-dev] [PATCH v3 28/49] dm zoned: dynamically allocate the dm-zoned-meta shrinker
On Thu, Jul 27, 2023 at 07:20:46PM +0900, Damien Le Moal wrote: > On 7/27/23 17:55, Qi Zheng wrote: > >>> goto err; > >>> } > >>> + zmd->mblk_shrinker->count_objects = dmz_mblock_shrinker_count; > >>> + zmd->mblk_shrinker->scan_objects = dmz_mblock_shrinker_scan; > >>> + zmd->mblk_shrinker->seeks = DEFAULT_SEEKS; > >>> + zmd->mblk_shrinker->private_data = zmd; > >>> + > >>> + shrinker_register(zmd->mblk_shrinker); > >> > >> I fail to see how this new shrinker API is better... Why isn't there a > >> shrinker_alloc_and_register() function ? That would avoid adding all this > >> code > >> all over the place as the new API call would be very similar to the current > >> shrinker_register() call with static allocation. > > > > In some registration scenarios, memory needs to be allocated in advance. > > So we continue to use the previous prealloc/register_prepared() > > algorithm. The shrinker_alloc_and_register() is just a helper function > > that combines the two, and this increases the number of APIs that > > shrinker exposes to the outside, so I choose not to add this helper. > > And that results in more code in many places instead of less code + a simple > inline helper in the shrinker header file... It's not just a "simple helper" - it's a function that has to take 6 or 7 parameters with a return value that must be checked and handled. This was done in the first versions of the patch set - the amount of code in each caller does not go down and, IMO, was much harder to read and determine "this is obviously correct" than what we have now. > So not adding that super simple > helper is not exactly the best choice in my opinion. Each to their own - I much prefer the existing style/API over having to go look up a helper function every time I want to check some random shrinker has been set up correctly. -Dave.
-- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless
On Wed, Jul 26, 2023 at 05:14:09PM +0800, Qi Zheng wrote: > On 2023/7/26 16:08, Dave Chinner wrote: > > On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote: > > > @@ -122,6 +126,13 @@ void shrinker_free_non_registered(struct shrinker > > > *shrinker); > > > void shrinker_register(struct shrinker *shrinker); > > > void shrinker_unregister(struct shrinker *shrinker); > > > +static inline bool shrinker_try_get(struct shrinker *shrinker) > > > +{ > > > + return READ_ONCE(shrinker->registered) && > > > +refcount_inc_not_zero(&shrinker->refcount); > > > +} > > > > Why do we care about shrinker->registered here? If we don't set > > the refcount to 1 until we have fully initialised everything, then > > the shrinker code can key entirely off the reference count and > > none of the lookup code needs to care about whether the shrinker is > > registered or not. > > The purpose of checking shrinker->registered here is to stop running > shrinker after calling shrinker_free(), which can prevent the following > situations from happening: > > CPU 0 CPU 1 > > shrinker_try_get() > >shrinker_try_get() > > shrinker_put() > shrinker_try_get() >shrinker_put() I don't see any race here? What is wrong with having multiple active users at once? > > > > This should use a completion, then it is always safe under > > rcu_read_lock(). This also gets rid of the shrinker_lock spin lock, > > which only exists because we can't take a blocking lock under > > rcu_read_lock(). i.e: > > > > > > void shrinker_put(struct shrinker *shrinker) > > { > > if (refcount_dec_and_test(&shrinker->refcount)) > > complete(&shrinker->done); > > } > > > > void shrinker_free() > > { > > . > > refcount_dec(&shrinker->refcount); > > I guess what you mean is shrinker_put(), because here may be the last > refcount. Yes, I did. > > wait_for_completion(&shrinker->done); > > /* > > * lookups on the shrinker will now all fail as refcount has > > * fallen to zero. We can now remove it from the lists and > > * free it. 
> > */ > > down_write(shrinker_rwsem); > > list_del_rcu(&shrinker->list); > > up_write(&shrinker_rwsem); > > call_rcu(shrinker->rcu_head, shrinker_free_rcu_cb); > > } > > > > > > > > > @@ -686,11 +711,14 @@ EXPORT_SYMBOL(shrinker_free_non_registered); > > > void shrinker_register(struct shrinker *shrinker) > > > { > > > - down_write(&shrinker_rwsem); > > > - list_add_tail(&shrinker->list, &shrinker_list); > > > - shrinker->flags |= SHRINKER_REGISTERED; > > > + refcount_set(&shrinker->refcount, 1); > > > + > > > + spin_lock(&shrinker_lock); > > > + list_add_tail_rcu(&shrinker->list, &shrinker_list); > > > + spin_unlock(&shrinker_lock); > > > + > > > shrinker_debugfs_add(shrinker); > > > - up_write(&shrinker_rwsem); > > > + WRITE_ONCE(shrinker->registered, true); > > > } > > > EXPORT_SYMBOL(shrinker_register); > > > > This just looks wrong - you are trying to use WRITE_ONCE() as a > > release barrier to indicate that the shrinker is now set up fully. > > That's not necessary - the refcount is an atomic and along with the > > rcu locks they should provides all the barriers we need. i.e. > > The reason I used WRITE_ONCE() here is because the shrinker->registered > will be read and written concurrently (read in shrinker_try_get() and > written in shrinker_free()), which is why I added shrinker::registered > field instead of using SHRINKER_REGISTERED flag (this can reduce the > addition of WRITE_ONCE()/READ_ONCE()). Using WRITE_ONCE/READ_ONCE doesn't provide memory barriers needed to use the field like this. You need release/acquire memory ordering here. i.e. smp_store_release()/smp_load_acquire(). As it is, the refcount_inc_not_zero() provides a control dependency, as documented in include/linux/refcount.h, refcount_dec_and_test() provides release memory ordering. The only thing I think we may need is a write barrier before refcount_set(), such that if refcount_inc_not_zero() sees a non-zero value, it is guaranteed to see an initialised structure... i.e. 
refcounts provide all the existence and initialisation guarantees. Hence I don't see the need to use shr
Re: [f2fs-dev] [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless
On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote: > The shrinker_rwsem is a global read-write lock in shrinkers subsystem, > which protects most operations such as slab shrink, registration and > unregistration of shrinkers, etc. This can easily cause problems in the > following cases. > > 1) When the memory pressure is high and there are many filesystems >mounted or unmounted at the same time, slab shrink will be affected >(down_read_trylock() failed). > >Such as the real workload mentioned by Kirill Tkhai: > >``` >One of the real workloads from my experience is start >of an overcommitted node containing many starting >containers after node crash (or many resuming containers >after reboot for kernel update). In these cases memory >pressure is huge, and the node goes round in long reclaim. >``` > > 2) If a shrinker is blocked (such as the case mentioned >in [1]) and a writer comes in (such as mount a fs), >then this writer will be blocked and cause all >subsequent shrinker-related operations to be blocked. > > Even if there is no competitor when shrinking slab, there may still be a > problem. The down_read_trylock() may become a perf hotspot with frequent > calls to shrink_slab(). Because of the poor multicore scalability of > atomic operations, this can lead to a significant drop in IPC > (instructions per cycle). > > We used to implement the lockless slab shrink with SRCU [2], but then > kernel test robot reported -88.8% regression in > stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4]. > > This commit uses the refcount+RCU method [5] proposed by Dave Chinner > to re-implement the lockless global slab shrink. The memcg slab shrink is > handled in the subsequent patch. > > For now, all shrinker instances are converted to dynamically allocated and > will be freed by kfree_rcu(). So we can use rcu_read_{lock,unlock}() to > ensure that the shrinker instance is valid. > > And the shrinker instance will not be run again after unregistration. 
So > the structure that records the pointer of shrinker instance can be safely > freed without waiting for the RCU read-side critical section. > > In this way, while we implement the lockless slab shrink, we don't need to > be blocked in unregister_shrinker(). > > The following are the test results: > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & > > 1) Before applying this patchset: > > setting to a 60 second run per stressor > dispatching hogs: 9 ramfs > stressor bogo ops real time usr time sys time bogo ops/s bogo > ops/s > (secs)(secs)(secs) (real time) (usr+sys > time) > ramfs735238 60.00 12.37363.70 12253.05 > 1955.08 > for a 60.01s run time: >1440.27s available CPU time > 12.36s user time ( 0.86%) > 363.70s system time ( 25.25%) > 376.06s total time ( 26.11%) > load average: 10.79 4.47 1.69 > passed: 9: ramfs (9) > failed: 0 > skipped: 0 > successful run completed in 60.01s (1 min, 0.01 secs) > > 2) After applying this patchset: > > setting to a 60 second run per stressor > dispatching hogs: 9 ramfs > stressor bogo ops real time usr time sys time bogo ops/s bogo > ops/s > (secs)(secs)(secs) (real time) (usr+sys > time) > ramfs746677 60.00 12.22367.75 12443.70 > 1965.13 > for a 60.01s run time: >1440.26s available CPU time > 12.21s user time ( 0.85%) > 367.75s system time ( 25.53%) > 379.96s total time ( 26.38%) > load average: 8.37 2.48 0.86 > passed: 9: ramfs (9) > failed: 0 > skipped: 0 > successful run completed in 60.01s (1 min, 0.01 secs) > > We can see that the ops/s has hardly changed. > > [1]. > https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomi...@virtuozzo.com/ > [2]. > https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/ > [3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/ > [4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/ > [5]. 
https://lore.kernel.org/lkml/zijhou1d55d4h...@dread.disaster.area/ > > Signed-off-by: Qi Zheng > --- > include/linux/shrinker.h | 19 +++--- > mm/shrinker.c| 75 ++-- > mm/shrinker_debug.c | 52 +--- > 3 files changed, 104 insertions(+), 42 deletions(-) > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index 36977a70bebb..335da93cccee 100644 > ---
Re: [f2fs-dev] [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote: > Currently, the shrinker instances can be divided into the following three > types: > > a) global shrinker instance statically defined in the kernel, such as >workingset_shadow_shrinker. > > b) global shrinker instance statically defined in the kernel modules, such >as mmu_shrinker in x86. > > c) shrinker instance embedded in other structures. > > For case a, the memory of shrinker instance is never freed. For case b, > the memory of shrinker instance will be freed after synchronize_rcu() when > the module is unloaded. For case c, the memory of shrinker instance will > be freed along with the structure it is embedded in. > > In preparation for implementing lockless slab shrink, we need to > dynamically allocate those shrinker instances in case c, then the memory > can be dynamically freed alone by calling kfree_rcu(). > > So this commit adds the following new APIs for dynamically allocating > shrinker, and add a private_data field to struct shrinker to record and > get the original embedded structure. > > 1. shrinker_alloc() > > Used to allocate shrinker instance itself and related memory, it will > return a pointer to the shrinker instance on success and NULL on failure. > > 2. shrinker_free_non_registered() > > Used to destroy the non-registered shrinker instance. This is a bit nasty > > 3. shrinker_register() > > Used to register the shrinker instance, which is same as the current > register_shrinker_prepared(). > > 4. shrinker_unregister() rename this "shrinker_free()" and key the two different freeing cases on the SHRINKER_REGISTERED bit rather than mostly duplicating the two. 
void shrinker_free(struct shrinker *shrinker)
{
	struct dentry *debugfs_entry = NULL;
	int debugfs_id;

	if (!shrinker)
		return;

	down_write(&shrinker_rwsem);
	if (shrinker->flags & SHRINKER_REGISTERED) {
		list_del(&shrinker->list);
		debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
	} else if (IS_ENABLED(CONFIG_SHRINKER_DEBUG)) {
		kfree_const(shrinker->name);
	}

	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
		unregister_memcg_shrinker(shrinker);
	up_write(&shrinker_rwsem);

	if (debugfs_entry)
		shrinker_debugfs_remove(debugfs_entry, debugfs_id);

	kfree(shrinker->nr_deferred);
	kfree(shrinker);
}
EXPORT_SYMBOL_GPL(shrinker_free);

-- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v5 6/8] xfs: switch to multigrain timestamps
On Thu, Jul 13, 2023 at 07:00:55PM -0400, Jeff Layton wrote: > Enable multigrain timestamps, which should ensure that there is an > apparent change to the timestamp whenever it has been written after > being actively observed via getattr. > > Also, anytime the mtime changes, the ctime must also change, and those > are now the only two options for xfs_trans_ichgtime. Have that function > unconditionally bump the ctime, and warn if XFS_ICHGTIME_CHG is ever not > set. > > Signed-off-by: Jeff Layton > --- > fs/xfs/libxfs/xfs_trans_inode.c | 6 +++--- > fs/xfs/xfs_iops.c | 4 ++-- > fs/xfs/xfs_super.c | 2 +- > 3 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > index 0c9df8df6d4a..86f5ffce2d89 100644 > --- a/fs/xfs/libxfs/xfs_trans_inode.c > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > @@ -62,12 +62,12 @@ xfs_trans_ichgtime( > ASSERT(tp); > ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > > - tv = current_time(inode); > + /* If the mtime changes, then ctime must also change */ > + WARN_ON_ONCE(!(flags & XFS_ICHGTIME_CHG)); Make that an ASSERT(flags & XFS_ICHGTIME_CHG), please. There's no need to verify this at runtime on production kernels. -Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH 16/17] block: use iomap for writes to block devices
On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote: > On 4/24/23 07:49, Christoph Hellwig wrote: > > Use iomap in buffer_head compat mode to write to block devices. > > > > Signed-off-by: Christoph Hellwig > > --- > > block/Kconfig | 1 + > > block/fops.c | 33 + > > 2 files changed, 30 insertions(+), 4 deletions(-) > > > > diff --git a/block/Kconfig b/block/Kconfig > > index 941b2dca70db73..672b08f0096ab4 100644 > > --- a/block/Kconfig > > +++ b/block/Kconfig > > @@ -5,6 +5,7 @@ > > menuconfig BLOCK > > bool "Enable the block layer" if EXPERT > > default y > > + select IOMAP > > select SBITMAP > > help > > Provide block layer support for the kernel. > > diff --git a/block/fops.c b/block/fops.c > > index 318247832a7bcf..7910636f8df33b 100644 > > --- a/block/fops.c > > +++ b/block/fops.c > > @@ -15,6 +15,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include "blk.h" > > @@ -386,6 +387,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, > > struct iov_iter *iter) > > return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages)); > > } > > +static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t > > length, > > + unsigned int flags, struct iomap *iomap, struct iomap *srcmap) > > +{ > > + struct block_device *bdev = I_BDEV(inode); > > + loff_t isize = i_size_read(inode); > > + > > + iomap->bdev = bdev; > > + iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev)); > > + if (WARN_ON_ONCE(iomap->offset >= isize)) > > + return -EIO; > > I'm hitting this during booting: > [5.016324] > [5.030256] iomap_iter+0x11a/0x350 > [5.030264] iomap_readahead+0x1eb/0x2c0 > [5.030272] read_pages+0x5d/0x220 > [5.030279] page_cache_ra_unbounded+0x131/0x180 > [5.030284] filemap_get_pages+0xff/0x5a0 Why is filemap_get_pages() using unbounded readahead? Surely readahead should be limited to reading within EOF > [5.030292] filemap_read+0xca/0x320 > [5.030296] ? aa_file_perm+0x126/0x500 > [5.040216] ? 
touch_atime+0xc8/0x150 > [5.040224] blkdev_read_iter+0xb0/0x150 > [5.040228] vfs_read+0x226/0x2d0 > [5.040234] ksys_read+0xa5/0xe0 > [5.040238] do_syscall_64+0x5b/0x80 > > Maybe we should consider this patch: > > diff --git a/block/fops.c b/block/fops.c > index 524b8a828aad..d202fb663f25 100644 > --- a/block/fops.c > +++ b/block/fops.c > @@ -386,10 +386,13 @@ static int blkdev_iomap_begin(struct inode *inode, > loff_t offset, loff_t length, > > iomap->bdev = bdev; > iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev)); > - if (WARN_ON_ONCE(iomap->offset >= isize)) > - return -EIO; > - iomap->type = IOMAP_MAPPED; > - iomap->addr = iomap->offset; > + if (WARN_ON_ONCE(iomap->offset >= isize)) { > + iomap->type = IOMAP_HOLE; > + iomap->addr = IOMAP_NULL_ADDR; > + } else { > + iomap->type = IOMAP_MAPPED; > + iomap->addr = iomap->offset; > + } I think Christoph's code is correct. IMO, any attempt to read beyond the end of the device should throw out a warning and return an error, not silently return zeros. If readahead is trying to read beyond the end of the device, then it really seems to me like the problem here is readahead, not the iomap code detecting the OOB read request. Cheers, Dave.
Re: [f2fs-dev] [PATCH v2 00/23] fs-verity support for XFS
On Tue, Apr 11, 2023 at 07:33:19PM -0700, Eric Biggers wrote: > On Mon, Apr 10, 2023 at 10:19:46PM -0700, Christoph Hellwig wrote: > > Dave is going to hate me for this, but.. > > > > I've been looking over some of the interfaces here, and I'm starting > > to very seriously questioning the design decisions of storing the > > fsverity hashes in xattrs. > > > > Yes, storing them beyond i_size in the file is a bit of a hack, but > > it allows to reuse a lot of the existing infrastructure, and much > > of fsverity is based around it. So storing them in an xattrs causes > > a lot of churn in the interface. And the XFS side with special > > casing xattr indices also seems not exactly nice. > > It seems it's really just the Merkle tree caching interface that is causing > problems, as it's currently too closely tied to the page cache? That is just > an > implementation detail that could be reworked along the lines of what is being > discussed. > > But anyway, it is up to the XFS folks. Keep in mind there is also the option > of > doing what btrfs is doing, where it stores the Merkle tree separately from the > file data stream, but caches it past i_size in the page cache at runtime. Right. It's not entirely simple to store metadata on disk beyond EOF in XFS because of all the assumptions throughout the IO path and allocator interfaces that it can allocate space beyond EOF at will and something else will clean it up later if it is not needed. This impacts on truncate, delayed allocation, writeback, IO completion, EOF block removal on file close, background garbage collection, ENOSPC/EDQUOT driven space freeing, etc. Some of these things cross over into iomap infrastructure, too. AFAIC, it's far more intricate, complex and risky to try to store merkle tree data beyond EOF than it is to put it in an xattr namespace because IO path EOF handling bugs result in user data corruption. 
This happens over and over again, no matter how careful we are about these aspects of user data handling. OTOH, putting the merkle tree data in a different namespace avoids these issues completely. Yes, we now have to solve an API mismatch, but we aren't risking the addition of IO path data corruption bugs to every non-fsverity filesystem in production... Hence I think copying the btrfs approach (i.e. only caching the merkle tree data in the page cache beyond EOF) would be as far as I think we'd want to go. Realistically, there would be little practical difference between btrfs storing the merkle tree blocks in a separate internal btree and XFS storing them in an internal private xattr btree namespace. I would, however, prefer not to have to do this at all if we could simply map the blocks directly out of the xattr buffers as we already do internally for all the XFS code... > I guess there is also the issue of encryption, which hasn't come up yet since > we're talking about fsverity support only. The Merkle tree (including the > fsverity_descriptor) is supposed to be encrypted, just like the file contents > are. Having it be stored after the file contents accomplishes that easily... > Of course, it doesn't have to be that way; a separate key could be derived, or > the Merkle tree blocks could be encrypted with the file contents key using > indices past i_size, without them physically being stored in the data stream. I'm expecting that fscrypt for XFS will include encryption of the xattr names and values (just like we will need to do for directory names) except for the xattrs that hold the encryption keys themselves. That means the merkle tree blocks should get encrypted without any extra work needing to be done anywhere. This will simply require the fscrypt keys to be held in a private internal xattr namespace that isn't encrypted, but that's relatively trivial to do... Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
On Wed, Apr 05, 2023 at 10:54:06PM +, Eric Biggers wrote: > On Thu, Apr 06, 2023 at 08:26:46AM +1000, Dave Chinner wrote: > > > We could certainly think about moving to a design where fs/verity/ asks > > > the > > > filesystem to just *read* a Merkle tree block, without adding it to a > > > cache, and > > > then fs/verity/ implements the caching itself. That would require some > > > large > > > changes to each filesystem, though, unless we were to double-cache the > > > Merkle > > > tree blocks which would be inefficient. > > > > No, that's unnecessary. > > > > All we need if for fsverity to require filesystems to pass it byte > > addressable data buffers that are externally reference counted. The > > filesystem can take a page reference before mapping the page and > > passing the kaddr to fsverity, then unmap and drop the reference > > when the merkle tree walk is done as per Andrey's new drop callout. > > > > fsverity doesn't need to care what the buffer is made from, how it > > is cached, what it's life cycle is, etc. The caching mechanism and > > reference counting is entirely controlled by the filesystem callout > > implementations, and fsverity only needs to deal with memory buffers > > that are guaranteed to live for the entire walk of the merkle > > tree > > Sure. Just a couple notes: > > First, fs/verity/ does still need to be able to tell whether the buffer is > newly > instantiated or not. Boolean flag from the caller. > Second, fs/verity/ uses the ahash API to do the hashing. ahash is a > scatterlist-based API. Virtual addresses can still be used (see > sg_set_buf()), > but the memory cannot be vmalloc'ed memory, since virt_to_page() needs to > work. > Does XFS use vmalloc'ed memory for these buffers? Not vmalloc'ed, but vmapped. we allocate the pages individually, but then call vm_map_page() to present the higher level code with a single contiguous memory range if it is a multi-page buffer. 
We do have the backing info held in the buffer, and that's what we use for IO. If fsverity needs a page based scatter/gather list for hardware offload, it could ask the filesystem to provide it for that given buffer... > BTW, converting fs/verity/ from ahash to shash is an option; I've really never > been a fan of the scatterlist-based crypto APIs! The disadvantage of doing > this, though, would be that it would remove support for all the hardware > crypto > drivers. > > That *might* actually be okay, as that approach to crypto acceleration > has mostly fallen out of favor, in favor of CPU-based acceleration. But I do > worry about e.g. someone coming out of the woodwork and saying they need to > use > fsverity on a low-powered ARM board that has a crypto accelerator like CAAM, > and > they MUST use their crypto accelerator to get acceptable performance. True, but we are very unlikely to be using XFS on such small systems and I don't think we really care about XFS performance on android sized systems, either. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
> > every > > time the cache is accessed. The problem with what you have now is that > > every > > time a single 32-byte hash is accessed, a full page (potentially 64KB!) > > will be > > allocated and filled. That's not very efficient. The need to allocate the > > temporary page can also cause ENOMEM (which will get reported as EIO). > > > > Did you consider alternatives that would work more efficiently? I think it > > would be worth designing something that works properly with how XFS is > > planned > > to cache the Merkle tree, instead of designing a workaround. > > ->read_merkle_tree_page was not really designed for what you are doing here. > > > > How about replacing ->read_merkle_tree_page with a function that takes in a > > Merkle tree block index (not a page index!) and hands back a (page, offset) > > pair > > that identifies where the Merkle tree block's data is located? Or (folio, > > offset), I suppose. {kaddr, len}, please. > > > > With that, would it be possible to directly return the XFS cache? > > > > - Eric > > > > Yeah, I also don't like it, I didn't want to change fs-verity much > so went with this workaround. But as it's ok, I will look into trying > to pass xfs buffers to fs-verity without direct use of > ->read_merkle_tree_page(). I think it's possible with (folio, > offset), the xfs buffers aren't xattr value align so the 4k merkle > tree block is stored in two pages. I don't think this is necessary to actually merge the code. We want it to work correctly as the primary concern, performance is a secondary concern. Regardless, as you mention, the xfs_buf is not made up of contiguous pages so the merkle tree block data will be split across two (or more) pages. AFAICT, the fsverity code doesn't work with data structures that span multiple disjoint pages... Another problem is that the xfs-buf might be backed by heap memory (e.g.
4kB fs block size on 64kB PAGE_SIZE) and so it cannot be treated like a page cache page by the fsverity merkle tree code. We most definitely do not want to be passing pages containing heap memory to functions expecting to be passed lru-resident page cache pages. That said, xfs-bufs do have a stable method of addressing the data in the buffers, and all the XFS code uses this to access and manipulate data directly in the buffers. That is, xfs_buf_offset() returns a mapped kaddr that points to the contiguous memory region containing the metadata in the buffer. If the xfs_buf spans multiple pages, it will return a kaddr pointing into the contiguous vmapped memory address that maps the entire buffer data range. If it is heap memory, it simply returns a pointer into that heap memory. If it's a single page, then it returns the kaddr for the data within the page. If you move all the assumptions about how the merkle tree data is managed out of fsverity and require the filesystems to do the mapping to kaddrs and reference counting to guarantee life times, then the need for multiple different methods for reading merkle tree data goes away... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files
On Wed, Apr 05, 2023 at 06:02:47PM +, Eric Biggers wrote: > And I really hope that you don't want to do DIO to the *Merkle tree*, as that Not for XFS - the merkle tree is not held as file data. That said, the merkle tree in XFS is not page cache or block aligned metadata either, so we really want the same memory buffer based interface for the merkle tree reading so that the merkle tree code can read directly from the xfs-buf rather than requiring us to copy it out of the xfsbuf into temporary pages... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
On Wed, Apr 05, 2023 at 06:16:00PM +, Eric Biggers wrote: > On Wed, Apr 05, 2023 at 09:38:47AM -0700, Darrick J. Wong wrote: > > > The merkle tree pages are dropped after verification. When page is > > > dropped xfs_buf is marked as verified. If fs-verity wants to > > > verify again it will get the same verified buffer. If buffer is > > > evicted it won't have verified state. > > > > > > So, with enough memory pressure buffers will be dropped and need to > > > be reverified. > > > > Please excuse me if this was discussed and rejected long ago, but > > perhaps fsverity should try to hang on to the merkle tree pages that > > this function returns for as long as possible until reclaim comes for > > them? > > > > With the merkle tree page lifetimes extended, you then don't need to > > attach the xfs_buf to page->private, nor does xfs have to extend the > > buffer cache to stash XBF_VERITY_CHECKED. > > Well, all the other filesystems that support fsverity (ext4, f2fs, and btrfs) > just cache the Merkle tree pages in the inode's page cache. It's an approach > that I know some people aren't a fan of, but it's efficient and it works. Which puts pages beyond EOF in the page cache. Given that XFS also allows persistent block allocation beyond EOF, having both data in the page cache and blocks beyond EOF that contain unrelated information is a Real Bad Idea. Just because putting metadata in the file data address space works for one filesystem, it doesn't mean it's a good idea or that it works for every filesystem. > We could certainly think about moving to a design where fs/verity/ asks the > filesystem to just *read* a Merkle tree block, without adding it to a cache, > and > then fs/verity/ implements the caching itself. That would require some large > changes to each filesystem, though, unless we were to double-cache the Merkle > tree blocks which would be inefficient. No, that's unnecessary.
All we need is for fsverity to require filesystems to pass it byte addressable data buffers that are externally reference counted. The filesystem can take a page reference before mapping the page and passing the kaddr to fsverity, then unmap and drop the reference when the merkle tree walk is done as per Andrey's new drop callout. fsverity doesn't need to care what the buffer is made from, how it is cached, what its life cycle is, etc. The caching mechanism and reference counting is entirely controlled by the filesystem callout implementations, and fsverity only needs to deal with memory buffers that are guaranteed to live for the entire walk of the merkle tree. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files
On Wed, Apr 05, 2023 at 08:09:27AM -0700, Darrick J. Wong wrote: > On Wed, Apr 05, 2023 at 05:01:42PM +0200, Andrey Albershteyn wrote: > > On Tue, Apr 04, 2023 at 09:10:47AM -0700, Darrick J. Wong wrote: > > > On Tue, Apr 04, 2023 at 04:53:15PM +0200, Andrey Albershteyn wrote: > > > > The direct path is not supported on verity files. Attempts to use direct > > > > I/O path on such files should fall back to buffered I/O path. > > > > > > > > Signed-off-by: Andrey Albershteyn > > > > --- > > > > fs/xfs/xfs_file.c | 14 +++--- > > > > 1 file changed, 11 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > > > index 947b5c436172..9e072e82f6c1 100644 > > > > --- a/fs/xfs/xfs_file.c > > > > +++ b/fs/xfs/xfs_file.c > > > > @@ -244,7 +244,8 @@ xfs_file_dax_read( > > > > struct kiocb*iocb, > > > > struct iov_iter *to) > > > > { > > > > - struct xfs_inode*ip = > > > > XFS_I(iocb->ki_filp->f_mapping->host); > > > > + struct inode*inode = iocb->ki_filp->f_mapping->host; > > > > + struct xfs_inode*ip = XFS_I(inode); > > > > ssize_t ret = 0; > > > > > > > > trace_xfs_file_dax_read(iocb, to); > > > > @@ -297,10 +298,17 @@ xfs_file_read_iter( > > > > > > > > if (IS_DAX(inode)) > > > > ret = xfs_file_dax_read(iocb, to); > > > > - else if (iocb->ki_flags & IOCB_DIRECT) > > > > + else if (iocb->ki_flags & IOCB_DIRECT && > > > > !fsverity_active(inode)) > > > > ret = xfs_file_dio_read(iocb, to); > > > > - else > > > > + else { > > > > + /* > > > > +* In case fs-verity is enabled, we also fallback to the > > > > +* buffered read from the direct read path. Therefore, > > > > +* IOCB_DIRECT is set and need to be cleared > > > > +*/ > > > > + iocb->ki_flags &= ~IOCB_DIRECT; > > > > ret = xfs_file_buffered_read(iocb, to); > > > > > > XFS doesn't usually allow directio fallback to the pagecache. Why > > > would fsverity be any different? > > > > Didn't know that, this is what happens on ext4 so I did the same. 
> > Then it probably make sense to just error on DIRECT on verity > > sealed file. > > Thinking about this a little more -- I suppose we shouldn't just go > breaking directio reads from a verity file if we can help it. Is there > a way to ask fsverity to perform its validation against some arbitrary > memory buffer that happens to be fs-block aligned? The memory buffer doesn't even need to be fs-block aligned - it just needs to be a pointer to memory the kernel can read... We also need fsverity to be able to handle being passed mapped kernel memory rather than pages/folios for the merkle tree interfaces. That way we can just pass it the mapped buffer memory straight from the xfs-buf and we don't have to do the whacky "copy from xattr xfs_bufs into pages so fsverity can take temporary reference counts on what it thinks are page cache pages" as it walks the merkle tree. -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v2 16/23] xfs: add inode on-disk VERITY flag
On Tue, Apr 04, 2023 at 03:41:23PM -0700, Eric Biggers wrote: > Hi Andrey, > > On Tue, Apr 04, 2023 at 04:53:12PM +0200, Andrey Albershteyn wrote: > > Add flag to mark inodes which have fs-verity enabled on them (i.e. > > descriptor exist and tree is built). > > > > Signed-off-by: Andrey Albershteyn > > --- > > fs/ioctl.c | 4 > > fs/xfs/libxfs/xfs_format.h | 4 +++- > > fs/xfs/xfs_inode.c | 2 ++ > > fs/xfs/xfs_iops.c | 2 ++ > > include/uapi/linux/fs.h| 1 + > > 5 files changed, 12 insertions(+), 1 deletion(-) > [...] > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > > index b7b56871029c..5172a2eb902c 100644 > > --- a/include/uapi/linux/fs.h > > +++ b/include/uapi/linux/fs.h > > @@ -140,6 +140,7 @@ struct fsxattr { > > #define FS_XFLAG_FILESTREAM0x4000 /* use filestream > > allocator */ > > #define FS_XFLAG_DAX 0x8000 /* use DAX for IO */ > > #define FS_XFLAG_COWEXTSIZE0x0001 /* CoW extent size > > allocator hint */ > > +#define FS_XFLAG_VERITY0x0002 /* fs-verity sealed > > inode */ > > #define FS_XFLAG_HASATTR 0x8000 /* no DIFLAG for this */ > > > > I don't think "xfs: add inode on-disk VERITY flag" is an accurate description > of > a patch that involves adding something to the UAPI. Well it does that, but it also adds the UAPI for querying the on-disk flag via the FS_IOC_FSGETXATTR interface as well. It probably should be split up into two patches. > Should the other filesystems support this new flag too? I think they should get it automatically now that it has been defined for FS_IOC_FSGETXATTR and added to the generic fileattr flag fill functions in fs/ioctl.c. > I'd also like all ways of getting the verity flag to continue to be mentioned > in > Documentation/filesystems/fsverity.rst. The existing methods (FS_IOC_GETFLAGS > and statx) are already mentioned there. *nod* -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v2 06/23] fsverity: add drop_page() callout
On Tue, Apr 04, 2023 at 04:53:02PM +0200, Andrey Albershteyn wrote: > Allow filesystem to make additional processing on verified pages > instead of just dropping a reference. This will be used by XFS for > internal buffer cache manipulation in further patches. The btrfs, > ext4, and f2fs just drop the reference. > > Signed-off-by: Andrey Albershteyn > --- > fs/btrfs/verity.c | 12 > fs/ext4/verity.c | 6 ++ > fs/f2fs/verity.c | 6 ++ > fs/verity/read_metadata.c | 4 ++-- > fs/verity/verify.c| 6 +++--- > include/linux/fsverity.h | 10 ++ > 6 files changed, 39 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c > index c5ff16f9e9fa..4c2c09204bb4 100644 > --- a/fs/btrfs/verity.c > +++ b/fs/btrfs/verity.c > @@ -804,10 +804,22 @@ static int btrfs_write_merkle_tree_block(struct inode > *inode, const void *buf, > pos, buf, size); > } > > +/* > + * fsverity op that releases the reference obtained by > ->read_merkle_tree_page() > + * > + * @page: reference to the page which can be released > + * > + */ > +static void btrfs_drop_page(struct page *page) > +{ > + put_page(page); > +} > + > const struct fsverity_operations btrfs_verityops = { > .begin_enable_verity = btrfs_begin_enable_verity, > .end_enable_verity = btrfs_end_enable_verity, > .get_verity_descriptor = btrfs_get_verity_descriptor, > .read_merkle_tree_page = btrfs_read_merkle_tree_page, > .write_merkle_tree_block = btrfs_write_merkle_tree_block, > + .drop_page = &btrfs_drop_page, > }; Ok, that's a generic put_page() call. 
> diff --git a/fs/verity/verify.c b/fs/verity/verify.c > index f50e3b5b52c9..c2fc4c86af34 100644 > --- a/fs/verity/verify.c > +++ b/fs/verity/verify.c > @@ -210,7 +210,7 @@ verify_data_block(struct inode *inode, struct > fsverity_info *vi, > if (is_hash_block_verified(vi, hpage, hblock_idx)) { > memcpy_from_page(_want_hash, hpage, hoffset, hsize); > want_hash = _want_hash; > - put_page(hpage); > + inode->i_sb->s_vop->drop_page(hpage); > goto descend; fsverity_drop_page(inode, hpage); static inline void fsverity_drop_page(struct inode *inode, struct page *page) { if (inode->i_sb->s_vop->drop_page) inode->i_sb->s_vop->drop_page(page); else put_page(page); } And then you don't need to add the functions to each of the filesystems nor make an indirect call just to run put_page(). Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 0/4] fsverity cleanups
On Wed, Dec 14, 2022 at 02:43:00PM -0800, Eric Biggers wrote: > This series implements a few cleanups that have been suggested > in the thread "[RFC PATCH 00/11] fs-verity support for XFS" > (https://lore.kernel.org/linux-fsdevel/20221213172935.680971-1-aalbe...@redhat.com/T/#u). > > This applies to mainline (commit 93761c93e9da). > > Eric Biggers (4): > fsverity: optimize fsverity_file_open() on non-verity files > fsverity: optimize fsverity_prepare_setattr() on non-verity files > fsverity: optimize fsverity_cleanup_inode() on non-verity files > fsverity: pass pos and size to ->write_merkle_tree_block > > fs/btrfs/verity.c| 19 --- > fs/ext4/verity.c | 6 ++-- > fs/f2fs/verity.c | 6 ++-- > fs/verity/enable.c | 4 +-- > fs/verity/open.c | 46 - > include/linux/fsverity.h | 74 +--- > 6 files changed, 84 insertions(+), 71 deletions(-) The whole series looks fairly sane to me. Acked-by: Dave Chinner -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()
On Thu, Nov 03, 2022 at 07:45:01PM -0700, Darrick J. Wong wrote: > On Fri, Nov 04, 2022 at 11:32:35AM +1100, Dave Chinner wrote: > > On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote: > > > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote: > > > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote: > > > > > - BUG_ON(PageWriteback(page)); > > > > > - if (!clear_page_dirty_for_io(page)) > > > > > + BUG_ON(folio_test_writeback(folio)); > > > > > + if (!folio_clear_dirty_for_io(folio)) > > > > > goto continue_unlock; > > > > > > > > > > trace_wbc_writepage(wbc, > > > > > inode_to_bdi(mapping->host)); > > > > > - error = (*writepage)(page, wbc, data); > > > > > + error = writepage(&folio->page, wbc, data); > > > > > > > > Yet, IIUC, this treats all folios as if they are single page folios. > > > > i.e. it passes the head page of a multi-page folio to a callback > > > > that will treat it as a single PAGE_SIZE page, because that's all > > > > the writepage callbacks are currently expected to be passed... > > > > > > > > So won't this break writeback of dirty multipage folios? > > > > > > Yes, it appears it would. But it wouldn't because its already 'broken'. > > > > It is? Then why isn't XFS broken on existing kernels? Oh, we don't > > know because it hasn't been tested? > > > > Seriously - if this really is broken, and this patchset further > > propagating the brokeness, then somebody needs to explain to me why > > this is not corrupting data in XFS. > > It looks like iomap_do_writepage finds the folio size correctly > > end_pos = folio_pos(folio) + folio_size(folio); > > and iomap_writpage_map will map out the correct number of blocks > > unsigned nblocks = i_blocks_per_folio(inode, folio); > > for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) { > > right? Yup, that's how I read it, too. 
But my recent experience with folios involved being repeatedly burnt by edge case corruptions due to multipage folios showing up when and where I least expected them. Hence doing a 1:1 conversion of page based code to folio based code and just assuming large folios will work without any testing seems akin to playing russian roulette with loose cannons that have been doused with napalm and then set on fire by an air-dropped barrel bomb... Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()
On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote: > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote: > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote: > > > Converted function to use folios throughout. This is in preparation for > > > the removal of find_get_pages_range_tag(). > > > > > > Signed-off-by: Vishal Moola (Oracle) > > > --- > > > mm/page-writeback.c | 44 +++- > > > 1 file changed, 23 insertions(+), 21 deletions(-) > > > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > > index 032a7bf8d259..087165357a5a 100644 > > > --- a/mm/page-writeback.c > > > +++ b/mm/page-writeback.c > > > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space > > > *mapping, > > > int ret = 0; > > > int done = 0; > > > int error; > > > - struct pagevec pvec; > > > - int nr_pages; > > > + struct folio_batch fbatch; > > > + int nr_folios; > > > pgoff_t index; > > > pgoff_t end;/* Inclusive */ > > > pgoff_t done_index; > > > int range_whole = 0; > > > xa_mark_t tag; > > > > > > - pagevec_init(&pvec); > > > + folio_batch_init(&fbatch); > > > if (wbc->range_cyclic) { > > > index = mapping->writeback_index; /* prev offset */ > > > end = -1; > > > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space > > > *mapping, > > > while (!done && (index <= end)) { > > > int i; > > > > > > - nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, > > > - tag); > > > - if (nr_pages == 0) > > > + nr_folios = filemap_get_folios_tag(mapping, &index, end, > > > + tag, &fbatch); > > > > This can find and return dirty multi-page folios if the filesystem > > enables them in the mapping at instantiation time, right? > > Yup, it will. 
> > > > + > > > + if (nr_folios == 0) > > > break; > > > > > > - for (i = 0; i < nr_pages; i++) { > > > - struct page *page = pvec.pages[i]; > > > + for (i = 0; i < nr_folios; i++) { > > > + struct folio *folio = fbatch.folios[i]; > > > > > > - done_index = page->index; > > > + done_index = folio->index; > > > > > > - lock_page(page); > > > + folio_lock(folio); > > > > > > /* > > >* Page truncated or invalidated. We can freely skip it > > > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space > > > *mapping, > > >* even if there is now a new, dirty page at the same > > >* pagecache address. > > >*/ > > > - if (unlikely(page->mapping != mapping)) { > > > + if (unlikely(folio->mapping != mapping)) { > > > continue_unlock: > > > - unlock_page(page); > > > + folio_unlock(folio); > > > continue; > > > } > > > > > > - if (!PageDirty(page)) { > > > + if (!folio_test_dirty(folio)) { > > > /* someone wrote it for us */ > > > goto continue_unlock; > > > } > > > > > > - if (PageWriteback(page)) { > > > + if (folio_test_writeback(folio)) { > > > if (wbc->sync_mode != WB_SYNC_NONE) > > > - wait_on_page_writeback(page); > > > + folio_wait_writeback(folio); > > > else > > > goto continue_unlock; > > > } > > > > > > - BUG_ON(PageWriteback(page)); > > > - if (!clear_page_dirty_for_io(page)) > > > + BUG_ON(folio_test_writeback(folio)); > > > + if (!folio_clear_dirty_for_io(folio)) > > >
Re: [f2fs-dev] [PATCH v4 00/23] Convert to filemap_get_folios_tag()
On Thu, Nov 03, 2022 at 09:38:48AM -0700, Vishal Moola wrote: > On Thu, Nov 3, 2022 at 12:08 AM Dave Chinner wrote: > > > > On Wed, Nov 02, 2022 at 09:10:08AM -0700, Vishal Moola (Oracle) wrote: > > > This patch series replaces find_get_pages_range_tag() with > > > filemap_get_folios_tag(). This also allows the removal of multiple > > > calls to compound_head() throughout. > > > It also makes a good chunk of the straightforward conversions to folios, > > > and takes the opportunity to introduce a function that grabs a folio > > > from the pagecache. > > > > > > F2fs and Ceph have quite a lot of work to be done regarding folios, so > > > for now those patches only have the changes necessary for the removal of > > > find_get_pages_range_tag(), and only support folios of size 1 (which is > > > all they use right now anyways). > > > > > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may > > > be > > > beneficial. The page-writeback and filemap changes implicitly work. > > > Testing > > > and review of the other changes (afs, ceph, cifs, gfs2) would be > > > appreciated. > > > > Same question as last time: have you tested this with multipage > > folios enabled? If you haven't tested XFS, then I'm guessing the > > answer is no, and you haven't fixed the bug I pointed out in > > the write_cache_pages() implementation > > > > I haven't tested the series with multipage folios or XFS. > > I don't seem to have gotten your earlier comments, and I > can't seem to find them on the mailing lists. Could you > please send them again so I can take a look? They are in the lore -fsdevel archive - no idea why you couldn't find them https://lore.kernel.org/linux-fsdevel/20221018210152.gh2703...@dread.disaster.area/ https://lore.kernel.org/linux-fsdevel/20221018214544.gi2703...@dread.disaster.area/ -Dave. 
-- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v4 00/23] Convert to filemap_get_folios_tag()
On Wed, Nov 02, 2022 at 09:10:08AM -0700, Vishal Moola (Oracle) wrote: > This patch series replaces find_get_pages_range_tag() with > filemap_get_folios_tag(). This also allows the removal of multiple > calls to compound_head() throughout. > It also makes a good chunk of the straightforward conversions to folios, > and takes the opportunity to introduce a function that grabs a folio > from the pagecache. > > F2fs and Ceph have quite a lot of work to be done regarding folios, so > for now those patches only have the changes necessary for the removal of > find_get_pages_range_tag(), and only support folios of size 1 (which is > all they use right now anyways). > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may be > beneficial. The page-writeback and filemap changes implicitly work. Testing > and review of the other changes (afs, ceph, cifs, gfs2) would be appreciated. Same question as last time: have you tested this with multipage folios enabled? If you haven't tested XFS, then I'm guessing the answer is no, and you haven't fixed the bug I pointed out in the write_cache_pages() implementation -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 00/23] Convert to filemap_get_folios_tag()
On Thu, Sep 01, 2022 at 03:01:15PM -0700, Vishal Moola (Oracle) wrote: > This patch series replaces find_get_pages_range_tag() with > filemap_get_folios_tag(). This also allows the removal of multiple > calls to compound_head() throughout. > It also makes a good chunk of the straightforward conversions to folios, > and takes the opportunity to introduce a function that grabs a folio > from the pagecache. > > F2fs and Ceph have quite alot of work to be done regarding folios, so > for now those patches only have the changes necessary for the removal of > find_get_pages_range_tag(), and only support folios of size 1 (which is > all they use right now anyways). > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may be > beneficial. Well, that answers my question about how filesystems that enable multi-page folios were tested: they weren't. I'd suggest that anyone working on further extending the filemap/folio infrastructure really needs to be testing XFS as a first priority, and then other filesystems as a secondary concern. That's because XFS (via the fs/iomap infrastructure) is one of only 3 filesystems in the kernel (AFS and tmpfs are the others) that interact with the page cache and page cache "pages" solely via folio interfaces. As such they are able to support multi-page folios in the page cache. All of the tested filesystems still use the fixed PAGE_SIZE page interfaces to interact with the page cache, so they don't actually exercise interactions with multi-page folios at all. Hence if you are converting generic infrastructure that looks up pages in the page cache to look up folios in the page cache, the code that processes the returned folios also needs to be updated and validated to ensure that it correctly handles multi-page folios. And the only way you can do that fully at this point in time is via testing XFS or AFS... Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()
On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote: > Converted function to use folios throughout. This is in preparation for > the removal of find_get_pages_range_tag(). > > Signed-off-by: Vishal Moola (Oracle) > --- > mm/page-writeback.c | 44 +++- > 1 file changed, 23 insertions(+), 21 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 032a7bf8d259..087165357a5a 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space *mapping, > int ret = 0; > int done = 0; > int error; > - struct pagevec pvec; > - int nr_pages; > + struct folio_batch fbatch; > + int nr_folios; > pgoff_t index; > pgoff_t end;/* Inclusive */ > pgoff_t done_index; > int range_whole = 0; > xa_mark_t tag; > > - pagevec_init(&pvec); > + folio_batch_init(&fbatch); > if (wbc->range_cyclic) { > index = mapping->writeback_index; /* prev offset */ > end = -1; > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space *mapping, > while (!done && (index <= end)) { > int i; > > - nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, > - tag); > - if (nr_pages == 0) > + nr_folios = filemap_get_folios_tag(mapping, &index, end, > + tag, &fbatch); This can find and return dirty multi-page folios if the filesystem enables them in the mapping at instantiation time, right? > + > + if (nr_folios == 0) > break; > > - for (i = 0; i < nr_pages; i++) { > - struct page *page = pvec.pages[i]; > + for (i = 0; i < nr_folios; i++) { > + struct folio *folio = fbatch.folios[i]; > > - done_index = page->index; > + done_index = folio->index; > > - lock_page(page); > + folio_lock(folio); > > /* >* Page truncated or invalidated. We can freely skip it > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space *mapping, >* even if there is now a new, dirty page at the same >* pagecache address. 
>*/ > - if (unlikely(page->mapping != mapping)) { > + if (unlikely(folio->mapping != mapping)) { > continue_unlock: > - unlock_page(page); > + folio_unlock(folio); > continue; > } > > - if (!PageDirty(page)) { > + if (!folio_test_dirty(folio)) { > /* someone wrote it for us */ > goto continue_unlock; > } > > - if (PageWriteback(page)) { > + if (folio_test_writeback(folio)) { > if (wbc->sync_mode != WB_SYNC_NONE) > - wait_on_page_writeback(page); > + folio_wait_writeback(folio); > else > goto continue_unlock; > } > > - BUG_ON(PageWriteback(page)); > - if (!clear_page_dirty_for_io(page)) > + BUG_ON(folio_test_writeback(folio)); > + if (!folio_clear_dirty_for_io(folio)) > goto continue_unlock; > > trace_wbc_writepage(wbc, inode_to_bdi(mapping->host)); > - error = (*writepage)(page, wbc, data); > + error = writepage(&folio->page, wbc, data); Yet, IIUC, this treats all folios as if they are single page folios. i.e. it passes the head page of a multi-page folio to a callback that will treat it as a single PAGE_SIZE page, because that's all the writepage callbacks are currently expected to be passed... So won't this break writeback of dirty multipage folios? -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v4 6/9] f2fs: don't allow DIO reads but not DIO writes
On Mon, Aug 15, 2022 at 05:55:45PM -0700, Eric Biggers wrote: > On Sat, Jul 30, 2022 at 08:08:26PM -0700, Jaegeuk Kim wrote: > > On 07/25, Eric Biggers wrote: > > > On Sat, Jul 23, 2022 at 07:01:59PM -0700, Jaegeuk Kim wrote: > > > > On 07/22, Eric Biggers wrote: > > > > > From: Eric Biggers > > > > > > > > > > Currently, if an f2fs filesystem is mounted with the mode=lfs and > > > > > io_bits mount options, DIO reads are allowed but DIO writes are not. > > > > > Allowing DIO reads but not DIO writes is an unusual restriction, which > > > > > is likely to be surprising to applications, namely any application > > > > > that > > > > > both reads and writes from a file (using O_DIRECT). This behavior is > > > > > also incompatible with the proposed STATX_DIOALIGN extension to statx. > > > > > Given this, let's drop the support for DIO reads in this > > > > > configuration. > > > > > > > > IIRC, we allowed DIO reads since applications complained a lower > > > > performance. > > > > So, I'm afraid this change will make another confusion to users. Could > > > > you please apply the new bahavior only for STATX_DIOALIGN? > > > > > > > > > > Well, the issue is that the proposed STATX_DIOALIGN fields cannot > > > represent this > > > weird case where DIO reads are allowed but not DIO writes. So the > > > question is > > > whether this case actually matters, in which case we should make > > > STATX_DIOALIGN > > > distinguish between DIO reads and DIO writes, or whether it's some odd > > > edge case > > > that doesn't really matter, in which case we could just fix it or make > > > STATX_DIOALIGN report that DIO is unsupported. I was hoping that you had > > > some > > > insight here. What sort of applications want DIO reads but not DIO > > > writes? > > > Is this common at all? > > > > I think there's no specific application to use the LFS mode at this > > moment, but I'd like to allow DIO read for zoned device which will be > > used for Android devices. 
> > So if the zoned device feature becomes widely adopted, then STATX_DIOALIGN > will > be useless on all Android devices? That sounds undesirable. Are you sure > that > supporting DIO reads but not DIO writes actually works? Does it not cause > problems for existing applications? What purpose does DIO in only one direction actually serve? All it means is that we're forcibly mixing buffered and direct IO to the same file and that simply never ends well from a data coherency POV. Hence I'd suggest that mixing DIO reads and buffered writes like this ends up exposing users to the worst of both worlds - all of the problems with none of the benefits... > What we need to do is make a decision about whether this means we should build > in a stx_dio_direction field (indicating no support / readonly support / > writeonly support / readwrite support) into the API from the beginning. If we > don't do that, then I don't think we could simply add such a field later, as > the > statx_dio_*_align fields will have already been assigned their meaning. I > think > we'd instead have to "duplicate" the API, with STATX_DIOROALIGN and > statx_dio_ro_*_align fields. That seems uglier than building a directional > indicator into the API from the beginning. On the other hand, requiring all > programs to check stx_dio_direction would add complexity to using the API. > > Any thoughts on this? Decide whether partial, single direction DIO serves a useful purpose before trying to work out what is needed in the API to indicate that this sort of crazy will be supported. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH] f2fs: add sysfs entry to avoid FUA
On Fri, May 27, 2022 at 05:26:32PM -0700, Jaegeuk Kim wrote: > On 05/28, Dave Chinner wrote: > > On Fri, May 27, 2022 at 09:33:55PM +, Eric Biggers wrote: > > > [+Cc linux-block for FUA, and linux-xfs for iomap] > > > > linux-fsdevel should really be used for iomap stuff... > > > > > > > > On Fri, May 27, 2022 at 01:59:55PM -0700, Jaegeuk Kim wrote: > > > > Some UFS storage gives slower performance on FUA than write+cache_flush. > > > > Let's give a way to manage it. > > > > > > > > Signed-off-by: Jaegeuk Kim > > > > > > Should the driver even be saying that it has FUA support in this case? > > > If the > > > driver didn't claim FUA support, that would also solve this problem. > > > > Agreed, this is a hardware problem that needs to be addressed with a > > driver quirk to stop it advertising FUA support. The high level > > fs/iomap code should always issue FUA writes where possible and > > the lower layers tell the block layer whether to issue the FUA as > > a FUA or write+cache flush pair. > > I was thinking to turn off FUA in driver side quickly tho, one concern > was the bandwidth vs. latency. What if the device can support FUA having > short latency while giving low bandwidth? Seriously, how is a user supposed to know this sort of thing about the hardware they are using? They don't, and to expect them to not only know about the existence of a weird sysfs knob, let alone how it applies to their hardware and their workload, is totally unreasonable. If the hardware has non-deterministic FUA write performance, or requires very careful switch over between cache flushes and FUA to get the most out of the hardware, then that's not something we can tune or optimise for - that's just broken hardware and the driver should quirk the brokenness away so nobody has to care about it. Tell the hardware manufacturer to fix their hardware, don't try to hack around it in software and then expect the user to know how to tune for that broken hardware. 
> In that case, we still have > a room to utilize FUA for small-sized writes such as filesystem metadata > writes, but avoid DIO w/ FUA for sequential write stream. Strawman. We don't use FUA for normal DIO writes - they only get used for O_DSYNC writes, in which case we either use FUA if the device supports it, or we do a normal write followed by a cache flush. If there are metadata updates that the O_DSYNC needs to also flush, we don't use FUA but let the filesystem issue a cache flush in the most optimal way possible after the write completes. Either way, using O_DSYNC DIO writes for streaming, sequential data is a really poor choice for an application to make. Normal DIO writes followed by fdatasync() to flush the metadata and caches once will be much faster and far more efficient than a metadata and cache flush after every single data write, FUA or not. -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH] f2fs: add sysfs entry to avoid FUA
On Fri, May 27, 2022 at 09:33:55PM +, Eric Biggers wrote: > [+Cc linux-block for FUA, and linux-xfs for iomap] linux-fsdevel should really be used for iomap stuff... > > On Fri, May 27, 2022 at 01:59:55PM -0700, Jaegeuk Kim wrote: > > Some UFS storage gives slower performance on FUA than write+cache_flush. > > Let's give a way to manage it. > > > > Signed-off-by: Jaegeuk Kim > > Should the driver even be saying that it has FUA support in this case? If the > driver didn't claim FUA support, that would also solve this problem. Agreed, this is a hardware problem that needs to be addressed with a driver quirk to stop it advertising FUA support. The high level fs/iomap code should always issue FUA writes where possible and the lower layers tell the block layer whether to issue the FUA as a FUA or write+cache flush pair. And, quite frankly, exposing this sort of "hardware needs help" knob as a sysfs variable is exactly the sort of thing we should never do. Users have no idea how to tune stuff like this correctly (even if they knew it existed!), yet we know exactly what hardware has this problem and the kernel already has mechanisms that would allow it to just Do The Right Thing. IOWs, we can fix this without the user even having to know that they have garbage hardware that needs special help. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [RFC PATCH v2 1/7] statx: add I/O alignment information
On Thu, May 19, 2022 at 04:06:05PM -0700, Darrick J. Wong wrote: > On Wed, May 18, 2022 at 04:50:05PM -0700, Eric Biggers wrote: > > From: Eric Biggers > > > > Traditionally, the conditions for when DIO (direct I/O) is supported > > were fairly simple: filesystems either supported DIO aligned to the > > block device's logical block size, or didn't support DIO at all. > > > > However, due to filesystem features that have been added over time (e.g, > > data journalling, inline data, encryption, verity, compression, > > checkpoint disabling, log-structured mode), the conditions for when DIO > > is allowed on a file have gotten increasingly complex. Whether a > > particular file supports DIO, and with what alignment, can depend on > > various file attributes and filesystem mount options, as well as which > > block device(s) the file's data is located on. > > > > XFS has an ioctl XFS_IOC_DIOINFO which exposes this information to > > applications. However, as discussed > > (https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebigg...@kernel.org/T/#u), > > this ioctl is rarely used and not known to be used outside of > > XFS-specific code. It also was never intended to indicate when a file > > doesn't support DIO at all, and it only exposes the minimum I/O > > alignment, not the optimal I/O alignment which has been requested too. > > > > Therefore, let's expose this information via statx(). Add the > > STATX_IOALIGN flag and three fields associated with it: > > > > * stx_mem_align_dio: the alignment (in bytes) required for user memory > > buffers for DIO, or 0 if DIO is not supported on the file. > > > > * stx_offset_align_dio: the alignment (in bytes) required for file > > offsets and I/O segment lengths for DIO, or 0 if DIO is not supported > > on the file. This will only be nonzero if stx_mem_align_dio is > > nonzero, and vice versa. 
> > > > * stx_offset_align_optimal: the alignment (in bytes) suggested for file > > offsets and I/O segment lengths to get optimal performance. This > > applies to both DIO and buffered I/O. It differs from stx_blocksize > > in that stx_offset_align_optimal will contain the real optimum I/O > > size, which may be a large value. In contrast, for compatibility > > reasons stx_blocksize is the minimum size needed to avoid page cache > > read/write/modify cycles, which may be much smaller than the optimum > > I/O size. For more details about the motivation for this field, see > > https://lore.kernel.org/r/20220210040304.gm59...@dread.disaster.area > Hmm. So I guess this is supposed to be the filesystem's best guess at > the IO size that will minimize RMW cycles in the entire stack? i.e. if > the user does not want RMW of pagecache pages, of file allocation units > (if COW is enabled), of RAID stripes, or in the storage itself, then it > should ensure that all IOs are aligned to this value? > > I guess that means for XFS it's effectively max(pagesize, i_blocksize, > bdev io_opt, sb_width, and (pretend XFS can reflink the realtime volume) > the rt extent size)? I didn't see a manpage update for statx(2) but > that's mostly what I'm interested in. :) Yup, xfs_stat_blksize() should give a good idea of what we should do. It will end up being pretty much that, except without the need for a mount option to turn on the sunit/swidth return, and always taking into consideration extent size hints rather than just doing that for RT inodes... Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v4 1/8] fs: move sgid strip operation from inode_init_owner into inode_sgid_strip
On Wed, Apr 20, 2022 at 01:27:39AM +, xuyang2018...@fujitsu.com wrote: > on 2022/4/19 22:05, Christian Brauner wrote: > > On Tue, Apr 19, 2022 at 07:47:07PM +0800, Yang Xu wrote: > >> This has no functional change. Just create and export inode_sgid_strip api > >> for > >> the subsequent patch. This function is used to strip S_ISGID mode when init > >> a new inode. > >> > >> Acked-by: Christian Brauner (Microsoft) > >> Signed-off-by: Yang Xu > >> --- > >> fs/inode.c | 22 ++ > >> include/linux/fs.h | 3 ++- > >> 2 files changed, 20 insertions(+), 5 deletions(-) > >> > >> diff --git a/fs/inode.c b/fs/inode.c > >> index 9d9b422504d1..3215e61a0021 100644 > >> --- a/fs/inode.c > >> +++ b/fs/inode.c > >> @@ -2246,10 +2246,8 @@ void inode_init_owner(struct user_namespace > >> *mnt_userns, struct inode *inode, > >>/* Directories are special, and always inherit S_ISGID */ > >>if (S_ISDIR(mode)) > >>mode |= S_ISGID; > >> - else if ((mode& (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)&& > >> - !in_group_p(i_gid_into_mnt(mnt_userns, dir))&& > >> - !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID)) > >> - mode&= ~S_ISGID; > >> + else > >> + inode_sgid_strip(mnt_userns, dir,&mode); > >>} else > >>inode_fsgid_set(inode, mnt_userns); > >>inode->i_mode = mode; > >> @@ -2405,3 +2403,19 @@ struct timespec64 current_time(struct inode *inode) > >>return timestamp_truncate(now, inode); > >> } > >> EXPORT_SYMBOL(current_time); > >> + > >> +void inode_sgid_strip(struct user_namespace *mnt_userns, > >> +const struct inode *dir, umode_t *mode) > >> +{ > > > > I think with Willy agreeing in an earlier version with me and you > > needing to resend anyway I'd say have this return umode_t instead of > > passing a pointer. > > IMO, I am fine with your and Willy way. But I need a reason otherwise > I can't convince myself why not use mode pointer directly. You should listen to experienced developers like Willy and Christian when they say "follow existing coding conventions". 
Indeed, Darrick has also mentioned he'd prefer it to return the new mode, and I'd also prefer that it returns the new mode. > I have asked you and Willy before why return umode_t value is better, > why not modify mode pointer directly? Since we have use mode as > argument, why not modify mode pointer directly in function? If the function had multiple return statuses (e.g. an error or a mode) the convention is to pass the mode output variable by reference and return the error status. But there is only one return value from this function - the mode - and hence it should be returned in the return value, not passed by reference. Passing by reference unnecessarily makes the code more complex and less maintainable. Code that returns a single value is easy to understand, is more flexible in the way callers can use it and it's simpler to maintain. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH] generic/066: attr1 is still there after log replay on f2fs
On Wed, Mar 09, 2022 at 03:34:27PM +0800, Chao Yu wrote: > On 2022/3/9 14:22, Dave Chinner wrote: > > On Wed, Mar 09, 2022 at 12:31:17PM +0800, Chao Yu wrote: > > > On 2022/2/28 11:57, Sun Ke via Linux-f2fs-devel wrote: > > > > The test fail on f2fs: > > > >xattr names and values after second fsync log replay: > > > ># file: SCRATCH_MNT/foobar > > > > +user.attr1="val1" > > > >user.attr3="val3" > > > > > > > > attr1 is still there after log replay. > > > > I guess it is f2fs's special feature to improve the performance. > > > > > > > > Signed-off-by: Sun Ke > > > > --- > > > > > > > > Is it a BUG on f2fs? > > > > > > I don't think so, it fails due to f2fs doesn't follow recovery rule which > > > btrfs/ext4/xfs does, but it doesn't mean f2fs has break posix semantics of > > > fsync(). > > > > I disagree. A failure in this test is indicative of non-conformance > > with the Linux definition of fsync() behaviour. > > > > You are right in that it does not break POSIX fsync semantics, but > > POSIX allows "do nothing" as a valid implementation. However, > > because of this loophole, the POSIX definition is complete garbage > > and we do not use it. > > > > That behaviour that Linux filesytsems are supposed to implement is > > defined in the Linux fsync() man page, and it goes way beyond what > > POSIX requires: > > > > $ man fsync > > > > DESCRIPTION > > fsync() transfers ("flushes") all modified in-core data of > > (i.e., modified buffer cache pages for) the file referred to by > > the file descriptor fd to the disk device (or other permanent > > storage device) so that all changed information can be retrieved > > even if the system crashes or is rebooted. This includes > > writing through or flushing a disk cache if present. The call > > blocks until the device reports that the transfer has > > completed. > > > > As well as flushing the file data, fsync() also flushes the > > metadata information associated with the file (see inode(7)). 
> > > > > > IOWs, fsync() on Linux is supposed to persist all data and > > metadata associated with the inode to stable storage such that it > > can be retrieved after a crash or reboot. "metadata information" > > includes xattrs attached to the inode that is being fsync()d. > > Quoted from g/066: > > echo "hello world!" >> $SCRATCH_MNT/foobar > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar > $SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar > ln $SCRATCH_MNT/foobar $SCRATCH_MNT/foobar_link > touch $SCRATCH_MNT/qwerty > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/qwerty > > IIUC, to match what Linux fsync() manual restricts, if we want to persist the > xattr removal, we should call fsync() on $SCRATCH_MNT/foobar after > "$SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar"? rather than calling > fsync() > on unrelated $SCRATCH_MNT/qwerty. It might look that way, but it's not that straightforward: there's a carefully constructed object dependency chain in this test that defines what the correct behaviour should be here. What's the link count of $SCRATCH_MNT/foobar when $SCRATCH_MNT/qwerty is present after recovery? Is it 1 or 2? Does $SCRATCH_MNT/foobar_link exist? And if $SCRATCH_MNT/foobar_link exists, is the link count 2? The test doesn't even look at these things, but if user.attr1 is not present, it means that foobar_link and qwerty are present, $SCRATCH_MNT has a link count of 5 and foobar has a link count of 2 because that's the dependency chain that leads to the user.attr1 removal being recovered correctly. So what does SCRATCH_MNT actually contain when f2fs fails this test? These dependencies exist because we can't just randomly re-order recovery of modifications to individual inodes and certain operations create atomic change dependencies between inodes. It's those atomic change dependencies that this test is actually exercising. i.e. 
the link count changes tie directory modifications to inode modifications and this creates cross-object ordering dependencies down the line that fsync then exposes. That's what the second part of this test is actually exercising. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH] generic/066: attr1 is still there after log replay on f2fs
On Wed, Mar 09, 2022 at 12:31:17PM +0800, Chao Yu wrote: > On 2022/2/28 11:57, Sun Ke via Linux-f2fs-devel wrote: > > The test fail on f2fs: > > xattr names and values after second fsync log replay: > > # file: SCRATCH_MNT/foobar > > +user.attr1="val1" > > user.attr3="val3" > > > > attr1 is still there after log replay. > > I guess it is f2fs's special feature to improve the performance. > > > > Signed-off-by: Sun Ke > > --- > > > > Is it a BUG on f2fs? > > I don't think so, it fails due to f2fs doesn't follow recovery rule which > btrfs/ext4/xfs does, but it doesn't mean f2fs has break posix semantics of > fsync(). I disagree. A failure in this test is indicative of non-conformance with the Linux definition of fsync() behaviour. You are right in that it does not break POSIX fsync semantics, but POSIX allows "do nothing" as a valid implementation. However, because of this loophole, the POSIX definition is complete garbage and we do not use it. The behaviour that Linux filesystems are supposed to implement is defined in the Linux fsync() man page, and it goes way beyond what POSIX requires: $ man fsync DESCRIPTION fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7)). IOWs, fsync() on Linux is supposed to persist all data and metadata associated with the inode to stable storage such that it can be retrieved after a crash or reboot. "metadata information" includes xattrs attached to the inode that is being fsync()d. 
*fdatasync()* does not require xattrs to be persisted unless they are needed to retrieve data, but that's not what g/066 is exercising. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto
On Tue, Feb 08, 2022 at 05:10:03PM -0800, Eric Biggers wrote: > On Mon, Jan 24, 2022 at 10:03:32AM +1100, Dave Chinner wrote: > > > > > > /* 0xa0 */ > > > > > > /* File range alignment needed for best performance, in bytes. */ > > > __u32 stx_dio_fpos_align_opt; > > > > This is a common property of both DIO and buffered IO, so no need > > for it to be a dio-only property. > > > > __u32 stx_offset_align_optimal; > > > > Looking at this more closely: will stx_offset_align_optimal actually be > useful, > given that st[x]_blksize already exists? Yes, because > From the stat(2) and statx(2) man pages: > > st_blksize > This field gives the "preferred" block size for efficient > filesystem I/O. > > stx_blksize > The "preferred" block size for efficient filesystem I/O. (Writing > to a file in smaller chunks may cause an inefficient read- > modify-rewrite.) ... historically speaking, this is intended to avoid RMW cycles for sub-block and/or sub-PAGE_SIZE write() IOs. i.e. the practical definition of st_blksize is the *minimum* IO size needed to avoid page cache RMW cycles. However, XFS has a "-o largeio" mount option that sets this value to internal optimal filesystem alignment values such as stripe unit or even stripe width (-o largeio,swalloc). This means it can be up to 2GB (maybe larger?) in size. The problem with this is that many applications are not prepared to see a value of, say, 16MB in st_blksize rather than 4096 bytes. An example of such problems are applications sizing their IO buffers as a multiple of st_blksize - we've had applications fail because they try to use multi-GB sized IO buffers as a result of setting st_blksize to the filesystem/storage idea of optimal IO size rather than PAGE_SIZE. Hence, we can't really change the value of st_blksize without risking random breakage in userspace. Hence the practical definition of st_blksize is the *minimum* IO size that avoids RMW cycles for an individual write() syscall, not the most efficient IO size. 
> File offsets aren't explicitly mentioned, but I think it's implied they should > be a multiple of st[x]_blksize, just like the I/O size. Otherwise, the I/O > would obviously require reading/writing partial blocks. Of course it implies aligned file offsets - block aligned IO is absolutely necessary for efficient filesystem IO, and has been for pretty much the entirety of unix history... > So, the proposed stx_offset_align_optimal field sounds like the same thing to > me. Is there anything I'm misunderstanding? > > Putting stx_offset_align_optimal behind the STATX_DIRECTIO flag would also be > confusing if it would apply to both direct and buffered I/O. So just name the flag STATX_IOALIGN so that it can cover generic, buffered specific and DIO specific parameters in one hit. Simple, yes? Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto
On Thu, Jan 20, 2022 at 06:36:03PM -0800, Darrick J. Wong wrote: > On Fri, Jan 21, 2022 at 10:57:55AM +1100, Dave Chinner wrote: > Sure. How's this? I couldn't think of a real case of directio > requiring different alignments for pos and bytecount, so the only real > addition here is the alignment requirements for best performance. > > struct statx { > ... > /* 0x90 */ > __u64 stx_mnt_id; > > /* Memory buffer alignment required for directio, in bytes. */ > __u32 stx_dio_mem_align; __u32 stx_mem_align_dio; (for consistency with suggestions below) > > /* File range alignment required for directio, in bytes. */ > __u32 stx_dio_fpos_align_min; "fpos" is not really a user term - "offset" is the userspace term for file position, and it's much less of a random letter salad if it's named that way. Also, we don't need "min" in the name; the description of the field in the man page can give all the gory details about it being the minimum required alignment. __u32 stx_offset_align_dio; > > /* 0xa0 */ > > /* File range alignment needed for best performance, in bytes. */ > __u32 stx_dio_fpos_align_opt; This is a common property of both DIO and buffered IO, so no need for it to be a dio-only property. __u32 stx_offset_align_optimal; > > /* Maximum size of a directio request, in bytes. */ > __u32 stx_dio_max_iosize; Unnecessary, it will always be the syscall max IO size, because the internal DIO code will slice and dice it down to the max sizes the hardware supports. > #define STATX_DIRECTIO 0x00001000U /* Want/got directio geometry */ > > How about that? Mostly seems reasonable at a first look. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto
On Thu, Jan 20, 2022 at 02:48:52PM -0800, Eric Biggers wrote: > On Fri, Jan 21, 2022 at 09:04:14AM +1100, Dave Chinner wrote: > > On Thu, Jan 20, 2022 at 01:00:27PM -0800, Darrick J. Wong wrote: > > > On Thu, Jan 20, 2022 at 12:39:14PM -0800, Eric Biggers wrote: > > > > On Thu, Jan 20, 2022 at 09:10:27AM -0800, Darrick J. Wong wrote: > > > > > On Thu, Jan 20, 2022 at 12:30:23AM -0800, Christoph Hellwig wrote: > > > > > > On Wed, Jan 19, 2022 at 11:12:10PM -0800, Eric Biggers wrote: > > > > > > > > > > > > > > Given the above, as far as I know the only remaining objection to > > > > > > > this > > > > > > > patchset would be that DIO constraints aren't sufficiently > > > > > > > discoverable > > > > > > > by userspace. Now, to put this in context, this is a > > > > > > > longstanding issue > > > > > > > with all Linux filesystems, except XFS which has XFS_IOC_DIOINFO. > > > > > > > It's > > > > > > > not specific to this feature, and it doesn't actually seem to be > > > > > > > too > > > > > > > important in practice; many other filesystem features place > > > > > > > constraints > > > > > > > on DIO, and f2fs even *only* allows fully FS block size aligned > > > > > > > DIO. > > > > > > > (And for better or worse, many systems using fscrypt already have > > > > > > > out-of-tree patches that enable DIO support, and people don't > > > > > > > seem to > > > > > > > have trouble with the FS block size alignment requirement.) > > > > > > > > > > > > It might make sense to use this as an opportunity to implement > > > > > > XFS_IOC_DIOINFO for ext4 and f2fs. > > > > > > > > > > Hmm. 
A potential problem with DIOINFO is that it doesn't explicitly > > > > > list the /file/ position alignment requirement: > > > > > > > > > > struct dioattr { > > > > > __u32 d_mem; /* data buffer memory alignment > > > > > */ > > > > > __u32 d_miniosz; /* min xfer size > > > > > */ > > > > > __u32 d_maxiosz; /* max xfer size > > > > > */ > > > > > }; > > > > > > > > Well, the comment above struct dioattr says: > > > > > > > > /* > > > > * Direct I/O attribute record used with XFS_IOC_DIOINFO > > > > * d_miniosz is the min xfer size, xfer size multiple and file > > > > seek offset > > > > * alignment. > > > > */ > > > > > > > > So d_miniosz serves that purpose already. > > > > > > > > > > > > > > Since I /think/ fscrypt requires that directio writes be aligned to > > > > > file > > > > > block size, right? > > > > > > > > The file position must be a multiple of the filesystem block size, yes. > > > > Likewise for the "minimum xfer size" and "xfer size multiple", and the > > > > "data > > > > buffer memory alignment" for that matter. So I think XFS_IOC_DIOINFO > > > > would be > > > > good enough for the fscrypt direct I/O case. > > > > > > Oh, ok then. In that case, just hoist XFS_IOC_DIOINFO to the VFS and > > > add a couple of implementations for ext4 and f2fs, and I think that'll > > > be enough to get the fscrypt patchset moving again. > > > > On the contrary, I'd much prefer to see this information added to > > statx(). The file offset alignment info is a property of the current > > file (e.g. XFS can have different per-file requirements depending on > > whether the file data is hosted on the data or RT device, etc) and > > so it's not a fixed property of the filesystem. > > > > statx() was designed to be extended with per-file property > > information, and we already have stuff like filesystem block size in > > that syscall. 
Hence I would much prefer that we extend it with the > > DIO properties we need to support rather than "create" a new VFS > > ioctl to extract this information. We already have statx()
Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto
eph 16.2 (seastar database support?) -diod contains a copy of fsstress -e2fsprogs contains a copy of fsstress -openmpi (under ifdef SGIMPI) -partclone - actually, that has a complete copy of the xfsprogs libxfs/ and include/ directory in it, so it's using the old libxfs_device_alignment() call that uses XFS_IOC_DIOINFO, and only when building the xfsclone binary. Yup, I can count them on one 6 fingered hand, and their only use is when XFS filesystems are specifically discovered. :) Hence I think it would be much more useful to application developers to include the IO alignment information in statx(), not to lift an ioctl that is pretty much unused and unknown outside the core XFS development environment. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH] MM: introduce memalloc_retry_wait()
On Wed, Nov 17, 2021 at 03:28:10PM +1100, NeilBrown wrote: > > Various places in the kernel - largely in filesystems - respond to a > memory allocation failure by looping around and re-trying. . > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h > index aca874d33fe6..f2f2a5b28808 100644 > --- a/include/linux/sched/mm.h > +++ b/include/linux/sched/mm.h > @@ -214,6 +214,27 @@ static inline void fs_reclaim_acquire(gfp_t gfp_mask) { } > static inline void fs_reclaim_release(gfp_t gfp_mask) { } > #endif > > +/* Any memory-allocation retry loop should use > + * memalloc_retry_wait(), and pass the flags for the most > + * constrained allocation attempt that might have failed. > + * This provides useful documentation of where loops are, > + * and a central place to fine tune the waiting as the MM > + * implementation changes. > + */ > +static inline void memalloc_retry_wait(gfp_t gfp_flags) > +{ > + gfp_flags = current_gfp_context(gfp_flags); > + if ((gfp_flags & __GFP_DIRECT_RECLAIM) && > + !(gfp_flags & __GFP_NORETRY)) > + /* Probably waited already, no need for much more */ > + schedule_timeout_uninterruptible(1); > + else > + /* Probably didn't wait, and has now released a lock, > + * so now is a good time to wait > + */ > + schedule_timeout_uninterruptible(HZ/50); > +} The existing congestion_wait() calls io_schedule_timeout() under TASK_UNINTERRUPTIBLE conditions. Does changing all these calls just to a plain schedule_timeout_uninterruptible() make any difference to behaviour? At least process accounting will appear different (uninterruptible sleep instead of IO wait), and I suspect that the block plug flushing in io_schedule() might be a good idea to retain for all the filesystems that call this function from IO-related routines. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v4] fs: forbid invalid project ID
On Sat, Jul 10, 2021 at 10:39:59PM +0800, Wang Shilong wrote: > From: Wang Shilong > > fileattr_set_prepare() should check if project ID > is valid, otherwise dqget() will return NULL for > such project ID quota. > > Signed-off-by: Wang Shilong > --- > v3->v3: > only check project Id if caller is allowed > to change and being changed. > > v2->v3: move check before @fsx_projid is accessed > and use make_kprojid() helper. > > v1->v2: try to fix in the VFS > fs/ioctl.c | 8 > 1 file changed, 8 insertions(+) > > diff --git a/fs/ioctl.c b/fs/ioctl.c > index 1e2204fa9963..d4fabb5421cd 100644 > --- a/fs/ioctl.c > +++ b/fs/ioctl.c > @@ -817,6 +817,14 @@ static int fileattr_set_prepare(struct inode *inode, > if ((old_ma->fsx_xflags ^ fa->fsx_xflags) & > FS_XFLAG_PROJINHERIT) > return -EINVAL; > + } else { > + /* > + * Caller is allowed to change the project ID. If it is being > + * changed, make sure that the new value is valid. > + */ > + if (old_ma->fsx_projid != fa->fsx_projid && > + !projid_valid(make_kprojid(&init_user_ns, fa->fsx_projid))) > + return -EINVAL; > } > > /* Check extent size hints. */ Looks good. Thanks! Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v3] fs: forbid invalid project ID
On Fri, Jul 02, 2021 at 10:02:43AM -0400, Wang Shilong wrote: > fileattr_set_prepare() should check if project ID > is valid, otherwise dqget() will return NULL for > such project ID quota. > > Signed-off-by: Wang Shilong > --- > v2->v3: move check before @fsx_projid is accessed > and use make_kprojid() helper. > > v1->v2: try to fix in the VFS > --- > fs/ioctl.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/fs/ioctl.c b/fs/ioctl.c > index 1e2204fa9963..d7edc92df473 100644 > --- a/fs/ioctl.c > +++ b/fs/ioctl.c > @@ -806,6 +806,8 @@ static int fileattr_set_prepare(struct inode *inode, > if (err) > return err; > > + if (!projid_valid(make_kprojid(&init_user_ns, fa->fsx_projid))) > + return -EINVAL; > /* >* Project Quota ID state is only allowed to change from within the init >* namespace. Enforce that restriction only if we are trying to change I still don't think this is correct - read the comment directly below where you put this. That was the code block I was referring to earlier: /* * Project Quota ID state is only allowed to change from within the init * namespace. Enforce that restriction only if we are trying to change * the quota ID state. Everything else is allowed in user namespaces. */ if (current_user_ns() != &init_user_ns) { if (old_ma->fsx_projid != fa->fsx_projid) return -EINVAL; if ((old_ma->fsx_xflags ^ fa->fsx_xflags) & FS_XFLAG_PROJINHERIT) return -EINVAL; } IOWs: if we are not changing the projid, then we should not be checking it for validity because of the way the whole get/set interface works. The reason for this is that this interface is a get/set pair, where you first have to get the values from the filesystem, then modify them and send the set back to the filesystem. If the filesystem sends out fsx_projid = -1 (for whatever reason), the caller must send that same value back into the filesystem if they are not modifying the project ID. 
Hence we have to accept any projid from the caller that matches the current filesystem value, regardless of whether it is an invalid value or not. Therefore, we should only be checking if fa->fsx_projid is valid if it is different to the current filesystem value *and* we are allowed to change it. So: /* * Project Quota ID state is only allowed to change from within the init * namespace. Enforce that restriction only if we are trying to change * the quota ID state. Everything else is allowed in user namespaces. */ if (current_user_ns() != &init_user_ns) { if (old_ma->fsx_projid != fa->fsx_projid) return -EINVAL; if ((old_ma->fsx_xflags ^ fa->fsx_xflags) & FS_XFLAG_PROJINHERIT) return -EINVAL; } else { /* * Caller is allowed to change the project ID. If it is being * changed, make sure that the new value is valid. */ if (old_ma->fsx_projid != fa->fsx_projid && !projid_valid(make_kprojid(&init_user_ns, fa->fsx_projid))) return -EINVAL; } Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v2] fs: forbid invalid project ID
On Mon, Jun 28, 2021 at 08:38:01AM -0400, Wang Shilong wrote: > fileattr_set_prepare() should check if project ID > is valid, otherwise dqget() will return NULL for > such project ID quota. > > Signed-off-by: Wang Shilong > --- > v1->v2: try to fix in the VFS > --- > fs/ioctl.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ioctl.c b/fs/ioctl.c > index 1e2204fa9963..5db5b218637b 100644 > --- a/fs/ioctl.c > +++ b/fs/ioctl.c > @@ -845,6 +845,9 @@ static int fileattr_set_prepare(struct inode *inode, > if (fa->fsx_cowextsize == 0) > fa->fsx_xflags &= ~FS_XFLAG_COWEXTSIZE; > > + if (!projid_valid(KPROJIDT_INIT(fa->fsx_projid))) > + return -EINVAL; This needs to go further up in this function in the section where project IDs passed into this function are validated. Projids are only allowed to be changed when current_user_ns() == &init_user_ns, so this needs to be associated with that verification context. This check should also use make_kprojid(), please, not open code KPROJIDT_INIT. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH 07/14] xfs: Refactor xfs_isilocked()
On Mon, Jun 07, 2021 at 04:52:17PM +0200, Jan Kara wrote: > From: Pavel Reichl > > Refactor xfs_isilocked() to use newly introduced __xfs_rwsem_islocked(). > __xfs_rwsem_islocked() is a helper function which encapsulates checking > state of rw_semaphores hold by inode. > > Signed-off-by: Pavel Reichl > Suggested-by: Dave Chinner > Suggested-by: Eric Sandeen > Suggested-by: Darrick J. Wong > Reviewed-by: Darrick J. Wong > Reviewed-by: Christoph Hellwig > Signed-off-by: Jan Kara > --- > fs/xfs/xfs_inode.c | 39 +++ > fs/xfs/xfs_inode.h | 21 ++--- > 2 files changed, 45 insertions(+), 15 deletions(-) As a standalone patch, this is overly elaborate and way more complex than it needs to be. It's not really just a refactor, either, because of the unnecessary shifting games it adds. > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index 0369eb22c1bb..6247977870bd 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -342,9 +342,34 @@ xfs_ilock_demote( > } > > #if defined(DEBUG) || defined(XFS_WARN) > -int > +static inline bool > +__xfs_rwsem_islocked( > + struct rw_semaphore *rwsem, > + int lock_flags, > + int shift) > +{ > + lock_flags >>= shift; > + > + if (!debug_locks) > + return rwsem_is_locked(rwsem); > + /* > + * If the shared flag is not set, pass 0 to explicitly check for > + * exclusive access to the lock. If the shared flag is set, we typically > + * want to make sure the lock is at least held in shared mode > + * (i.e., shared | excl) but we don't necessarily care that it might > + * actually be held exclusive. Therefore, pass -1 to check whether the > + * lock is held in any mode rather than one of the explicit shared mode > + * values (1 or 2)." > + */ > + if (lock_flags & (1 << XFS_SHARED_LOCK_SHIFT)) { > + return lockdep_is_held_type(rwsem, -1); > + } > + return lockdep_is_held_type(rwsem, 0); > +} Pass in a boolean value for shared/exclusive and you can get rid of passing in the lock flags as well. 
static bool __xfs_rwsem_islocked( struct rw_semaphore *rwsem, bool shared) { if (!debug_locks) return rwsem_is_locked(rwsem); if (!shared) return lockdep_is_held_type(rwsem, 0); /* * We are checking that the lock is held at least in shared * mode but don't care that it might be held exclusively * (i.e. shared | excl). Hence we check if the lock is held * in any mode rather than an explicit shared mode. */ return lockdep_is_held_type(rwsem, -1); } > + > +bool > xfs_isilocked( > - xfs_inode_t *ip, > + struct xfs_inode*ip, > uintlock_flags) > { > if (lock_flags & (XFS_ILOCK_EXCL|XFS_ILOCK_SHARED)) { > @@ -359,15 +384,13 @@ xfs_isilocked( > return rwsem_is_locked(&ip->i_mmaplock.mr_lock); > } > > - if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) { > - if (!(lock_flags & XFS_IOLOCK_SHARED)) > - return !debug_locks || > - lockdep_is_held_type(&VFS_I(ip)->i_rwsem, 0); > - return rwsem_is_locked(&VFS_I(ip)->i_rwsem); > + if (lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) { > + return __xfs_rwsem_islocked(&VFS_I(ip)->i_rwsem, lock_flags, > + XFS_IOLOCK_FLAG_SHIFT); Then this is simply: return __xfs_rwsem_islocked(&VFS_I(ip)->i_rwsem, (lock_flags & XFS_IOLOCK_SHARED)); And the conversion for the MMAPLOCK in the next patch is equally simple. > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index ca826cfba91c..1c0e15c480bc 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -262,12 +262,19 @@ static inline bool xfs_inode_has_bigtime(struct > xfs_inode *ip) > * Bit ranges: 1<<1 - 1<<16-1 -- iolock/ilock modes (bitfield) > * 1<<16 - 1<<32-1 -- lockdep annotation (integers) > */ > -#define XFS_IOLOCK_EXCL (1<<0) > -#define XFS_IOLOCK_SHARED (1<<1) > -#define XFS_ILOCK_EXCL (1<<2) > -#define XFS_ILOCK_SHARED(1<<3) > -#define XFS_MMAPLOCK_EXCL (1<<4) > -#define XFS_MMAPLOCK_SHARED (1<<5) > + > +#define XFS_IOLOCK_FLAG_SHIFT0 > +#def
Re: [f2fs-dev] [PATCH 07/13] xfs: Convert to use invalidate_lock
On Tue, May 25, 2021 at 03:50:44PM +0200, Jan Kara wrote: > Use invalidate_lock instead of XFS internal i_mmap_lock. The intended > purpose of invalidate_lock is exactly the same. Note that the locking in > __xfs_filemap_fault() slightly changes as filemap_fault() already takes > invalidate_lock. > > Reviewed-by: Christoph Hellwig > CC: > CC: "Darrick J. Wong" > Signed-off-by: Jan Kara > --- > fs/xfs/xfs_file.c | 12 ++- > fs/xfs/xfs_inode.c | 52 ++ > fs/xfs/xfs_inode.h | 1 - > fs/xfs/xfs_super.c | 2 -- > 4 files changed, 36 insertions(+), 31 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 396ef36dcd0a..dc9cb5c20549 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -1282,7 +1282,7 @@ xfs_file_llseek( > * > * mmap_lock (MM) > * sb_start_pagefault(vfs, freeze) > - * i_mmaplock (XFS - truncate serialisation) > + * invalidate_lock (vfs/XFS_MMAPLOCK - truncate serialisation) > * page_lock (MM) > * i_lock (XFS - extent map serialisation) > */ > @@ -1303,24 +1303,26 @@ __xfs_filemap_fault( > file_update_time(vmf->vma->vm_file); > } > > - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > if (IS_DAX(inode)) { > pfn_t pfn; > > + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, > (write_fault && !vmf->cow_page) ? >&xfs_direct_write_iomap_ops : >&xfs_read_iomap_ops); > if (ret & VM_FAULT_NEEDDSYNC) > ret = dax_finish_sync_fault(vmf, pe_size, pfn); > + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > } else { > - if (write_fault) > + if (write_fault) { > + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > ret = iomap_page_mkwrite(vmf, > &xfs_buffered_write_iomap_ops); > - else > + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > + } else > ret = filemap_fault(vmf); > } > - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); This seems kinda messy. 
filemap_fault() basically takes the invalidate lock around the entire operation it runs, so maybe it would be cleaner to implement it as: filemap_fault_locked(vmf) { /* does the filemap fault work */ } filemap_fault(vmf) { filemap_invalidate_down_read(...) ret = filemap_fault_locked(vmf) filemap_invalidate_up_read(...) return ret; } And that means XFS could just call filemap_fault_locked() and not have to do all this messy locking just to avoid holding the lock that filemap_fault has now internalised. > @@ -355,8 +358,11 @@ xfs_isilocked( > > if (lock_flags & (XFS_MMAPLOCK_EXCL|XFS_MMAPLOCK_SHARED)) { > if (!(lock_flags & XFS_MMAPLOCK_SHARED)) > - return !!ip->i_mmaplock.mr_writer; > - return rwsem_is_locked(&ip->i_mmaplock.mr_lock); > + return !debug_locks || > + lockdep_is_held_type( > + &VFS_I(ip)->i_mapping->invalidate_lock, > + 0); > + return rwsem_is_locked(&VFS_I(ip)->i_mapping->invalidate_lock); > } And so here we are again, losing more of our read vs write debug checks on debug kernels when lockdep is not enabled. Can we please add rwsem_is_locked_read() and rwsem_is_locked_write() wrappers that just look at the rwsem counter value to determine how the lock is held? Then the mrlock_t can go away entirely. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock
On Fri, May 14, 2021 at 09:17:30AM -0700, Darrick J. Wong wrote: > On Fri, May 14, 2021 at 09:19:45AM +1000, Dave Chinner wrote: > > On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote: > > > On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote: > > > > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote: > > > > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote: > > > > > > +->fallocate implementation must be really careful to maintain page > > > > > > cache > > > > > > +consistency when punching holes or performing other operations > > > > > > that invalidate > > > > > > +page cache contents. Usually the filesystem needs to call > > > > > > +truncate_inode_pages_range() to invalidate relevant range of the > > > > > > page cache. > > > > > > +However the filesystem usually also needs to update its internal > > > > > > (and on disk) > > > > > > +view of file offset -> disk block mapping. Until this update is > > > > > > finished, the > > > > > > +filesystem needs to block page faults and reads from reloading > > > > > > now-stale page > > > > > > +cache contents from the disk. VFS provides > > > > > > mapping->invalidate_lock for this > > > > > > +and acquires it in shared mode in paths loading pages from disk > > > > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem > > > > > > is > > > > > > +responsible for taking this lock in its fallocate implementation > > > > > > and generally > > > > > > +whenever the page cache contents needs to be invalidated because a > > > > > > block is > > > > > > +moving from under a page. > > > > > > + > > > > > > +->copy_file_range and ->remap_file_range implementations need to > > > > > > serialize > > > > > > +against modifications of file data while the operation is running. > > > > > > For blocking > > > > > > +changes through write(2) and similar operations inode->i_rwsem can > > > > > > be used. 
For > > > > > > +blocking changes through memory mapping, the filesystem can use > > > > > > +mapping->invalidate_lock provided it also acquires it in its > > > > > > ->page_mkwrite > > > > > > +implementation. > > > > > > > > > > Question: What is the locking order when acquiring the invalidate_lock > > > > > of two different files? Is it the same as i_rwsem (increasing order > > > > > of > > > > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that > > > > > is > > > > > being hoisted here (increasing order of i_ino)? > > > > > > > > > > The reason I ask is that remap_file_range has to do that, but I don't > > > > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL) > > > > > calls in xfs_ilock2_io_mmap in this series. > > > > > > > > Good question. Technically, I don't think there's real need to > > > > establish a > > > > single ordering because locks among different filesystems are never > > > > going > > > > to be acquired together (effectively each lock type is local per sb and > > > > we > > > > are free to define an ordering for each lock type differently). But to > > > > maintain some sanity I guess having the same locking order for > > > > doublelock > > > > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS > > > > uses > > > > by-ino ordering? So that we don't have to consider two different orders > > > > in > > > > xfs_lock_two_inodes()... > > > > > > I imagine Dave will chime in on this, but I suspect the reason is > > > hysterical raisins^Wreasons. > > > > It's the locking rules that XFS has used pretty much forever. > > Locking by inode number always guarantees the same locking order of > > two inodes in the same filesystem, regardless of the specific > > in-memory instances of the two inodes. > > > > e.g. if we lock based on the inode structure address, in one > > instancex, we could get A -> B, then B gets recycled and >
Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock
On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote: > On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote: > > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote: > > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote: > > > > +->fallocate implementation must be really careful to maintain page > > > > cache > > > > +consistency when punching holes or performing other operations that > > > > invalidate > > > > +page cache contents. Usually the filesystem needs to call > > > > +truncate_inode_pages_range() to invalidate relevant range of the page > > > > cache. > > > > +However the filesystem usually also needs to update its internal (and > > > > on disk) > > > > +view of file offset -> disk block mapping. Until this update is > > > > finished, the > > > > +filesystem needs to block page faults and reads from reloading > > > > now-stale page > > > > +cache contents from the disk. VFS provides mapping->invalidate_lock > > > > for this > > > > +and acquires it in shared mode in paths loading pages from disk > > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is > > > > +responsible for taking this lock in its fallocate implementation and > > > > generally > > > > +whenever the page cache contents needs to be invalidated because a > > > > block is > > > > +moving from under a page. > > > > + > > > > +->copy_file_range and ->remap_file_range implementations need to > > > > serialize > > > > +against modifications of file data while the operation is running. For > > > > blocking > > > > +changes through write(2) and similar operations inode->i_rwsem can be > > > > used. For > > > > +blocking changes through memory mapping, the filesystem can use > > > > +mapping->invalidate_lock provided it also acquires it in its > > > > ->page_mkwrite > > > > +implementation. > > > > > > Question: What is the locking order when acquiring the invalidate_lock > > > of two different files? 
Is it the same as i_rwsem (increasing order of > > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is > > > being hoisted here (increasing order of i_ino)? > > > > > > The reason I ask is that remap_file_range has to do that, but I don't > > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL) > > > calls in xfs_ilock2_io_mmap in this series. > > > > Good question. Technically, I don't think there's real need to establish a > > single ordering because locks among different filesystems are never going > > to be acquired together (effectively each lock type is local per sb and we > > are free to define an ordering for each lock type differently). But to > > maintain some sanity I guess having the same locking order for doublelock > > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses > > by-ino ordering? So that we don't have to consider two different orders in > > xfs_lock_two_inodes()... > > I imagine Dave will chime in on this, but I suspect the reason is > hysterical raisins^Wreasons. It's the locking rules that XFS has used pretty much forever. Locking by inode number always guarantees the same locking order of two inodes in the same filesystem, regardless of the specific in-memory instances of the two inodes. e.g. if we lock based on the inode structure address, in one instance, we could get A -> B, then B gets recycled and reallocated, then we get B -> A as the locking order for the same two inodes. That, IMNSHO, is utterly crazy because with non-deterministic inode lock ordering like this you can't make consistent locking rules for locking the physical inode cluster buffers underlying the inodes in the situation where they also need to be locked. We've been down this path before more than a decade ago when the powers that be decreed that inode locking order is to be "by structure address" rather than inode number, because "inode number is not unique across multiple superblocks".
I'm not sure that there is anywhere that locks multiple inodes across different superblocks, but here we are again. > It might simply be time to convert all > three XFS inode locks to use the same ordering rules. Careful, there lie dragons along that path because of things like how the inode cluster buffer operations work - they all assume ascending inode number traversal within and across inode cluster buffers and hence we do have locking order constraints based on inode number... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
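The stability argument for by-inode-number ordering can be shown with a tiny model: the lock order depends only on the inode numbers, never on where the in-memory structures happen to live, so it cannot flip when an inode is recycled and reallocated. This is an illustrative sketch, not XFS code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of an inode for the ordering argument. */
struct demo_inode {
	uint64_t ino;
};

/*
 * Order two inodes for locking by ascending inode number. The result
 * is a pure function of the inode numbers: it is identical no matter
 * which in-memory instances are passed or what their addresses are.
 * Ordering by structure address has neither property, which is how
 * the A -> B / B -> A flip described above can happen.
 */
void order_two_inodes(struct demo_inode *a, struct demo_inode *b,
		      struct demo_inode **first, struct demo_inode **second)
{
	if (a->ino <= b->ino) {
		*first = a;
		*second = b;
	} else {
		*first = b;
		*second = a;
	}
}
```

Locking *first then *second in that order is deadlock-free within one filesystem because every thread derives the same total order from the same inode numbers.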
Re: [f2fs-dev] [PATCH 0/12 v4] fs: Hole punch vs page cache filling races
new pages in the page cache and looking up their corresponding > block mapping. We also replace EXT4_I(inode)->i_mmap_sem with this new rwsem > which provides necessary serialization with hole punching for ext4. > > Honza > > [1] > https://lore.kernel.org/linux-fsdevel/caoq4uxjqnmxqmta_vbyw0su9rkrk2zobjmahcyeaevofkvq...@mail.gmail.com/ > > Previous versions: > Link: > https://lore.kernel.org/linux-fsdevel/20210208163918.7871-1-j...@suse.cz/ > Link: http://lore.kernel.org/r/20210413105205.3093-1-j...@suse.cz > > CC: ceph-de...@vger.kernel.org > CC: Chao Yu > CC: Damien Le Moal > CC: "Darrick J. Wong" > CC: Hugh Dickins > CC: Jaegeuk Kim > CC: Jeff Layton > CC: Johannes Thumshirn > CC: linux-c...@vger.kernel.org > CC: > CC: linux-f2fs-devel@lists.sourceforge.net > CC: > CC: > CC: > CC: Miklos Szeredi > CC: Steve French > CC: Ted Tso > -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH 02/12] mm: Protect operations adding pages to page cache with invalidate_lock
On Fri, Apr 23, 2021 at 07:29:31PM +0200, Jan Kara wrote: > Currently, serializing operations such as page fault, read, or readahead > against hole punching is rather difficult. The basic race scheme is > like: > > fallocate(FALLOC_FL_PUNCH_HOLE) read / fault / .. > truncate_inode_pages_range() > cache here> > > > Now the problem is in this way read / page fault / readahead can > instantiate pages in page cache with potentially stale data (if blocks > get quickly reused). Avoiding this race is not simple - page locks do > not work because we want to make sure there are *no* pages in given > range. inode->i_rwsem does not work because page fault happens under > mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes > the performance for mixed read-write workloads suffer. > > So create a new rw_semaphore in the address_space - invalidate_lock - > that protects adding of pages to page cache for page faults / reads / > readahead. . > diff --git a/fs/inode.c b/fs/inode.c > index a047ab306f9a..43596dd8b61e 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -191,6 +191,9 @@ int inode_init_always(struct super_block *sb, struct > inode *inode) > mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); > mapping->private_data = NULL; > mapping->writeback_index = 0; > + init_rwsem(&mapping->invalidate_lock); > + lockdep_set_class(&mapping->invalidate_lock, > + &sb->s_type->invalidate_lock_key); > inode->i_private = NULL; > inode->i_mapping = mapping; > INIT_HLIST_HEAD(&inode->i_dentry); /* buggered by rcu freeing */ Oh, lockdep. That might be a problem here. The XFS_MMAPLOCK has non-trivial lockdep annotations so that it is tracked as nesting properly against the IOLOCK and the ILOCK. When you end up using xfs_ilock(XFS_MMAPLOCK..) to lock this, XFS will add subclass annotations to the lock and they are going to be different to the locking that the VFS does. We'll see this from xfs_lock_two_inodes() (e.g. 
in xfs_swap_extents()) and xfs_ilock2_io_mmap() during reflink operations. Oh. The page cache copy done when breaking a shared extent needs to lock out page faults on both the source and destination, but it still needs to be able to populate the page cache of both the source and destination file. And vfs_dedupe_file_range_compare() has to be able to read pages from both the source and destination file to determine that the contents are identical and that's done while we hold the XFS_MMAPLOCK exclusively so the compare is atomic w.r.t. all other user data modification operations being run. I now have many doubts that this "serialise page faults by locking out page cache instantiation" method actually works as a generic mechanism. It's not just page cache invalidation that relies on being able to lock out page faults: copy-on-write and deduplication both require the ability to populate the page cache with source data while page faults are locked out so the data can be compared/copied atomically with the extent level manipulations and so user data modifications cannot occur until the physical extent manipulation operation has completed. Having only just realised this is a problem, no solution has immediately popped into my mind. I'll chew on it over the weekend, but I'm not hopeful at this point... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v2 04/12] fat: only specify I_DIRTY_TIME when needed in fat_update_time()
On Mon, Jan 11, 2021 at 11:50:27AM -0800, Eric Biggers wrote: > On Mon, Jan 11, 2021 at 11:52:01AM +0100, Christoph Hellwig wrote: > > On Fri, Jan 08, 2021 at 11:58:55PM -0800, Eric Biggers wrote: > > > + if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false)) > > > + dirty_flags |= I_DIRTY_SYNC; > > > > fat does not support i_version updates, so this bit can be skipped. > > Is that really the case? Any filesystem (including fat) can be mounted with > "iversion", which causes SB_I_VERSION to be set. That's a bug. Filesystems that don't support persistent i_version on disk need to clear SB_I_VERSION in their mount and remount paths because the VFS iversion mount option was a complete screwup from the start. > A lot of filesystems (including fat) don't store i_version to disk, but it looks > like it will still get updated in-memory. Could anything be relying on that? If they do, then they are broken by definition. i_version as reported to observers is defined as monotonically increasing with every change to the inode. i.e. it never goes backwards. Which, of course, it will do if you crash or even just unmount/mount a filesystem that doesn't persist it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
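The monotonicity argument can be made concrete with a toy model of an i_version counter that is only sometimes persisted. This is an illustrative sketch, not kernel code: an observer that samples the counter before an unmount/mount cycle sees it go backwards afterwards unless the value was written to disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of the monotonicity rule. "mem" is the value reported to
 * observers while mounted; "on_disk" is what actually survives an
 * unmount. A non-persistent counter restarts from the stale on-disk
 * value after remount, violating "never goes backwards".
 */
struct iversion_model {
	uint64_t mem;		/* value reported to observers */
	uint64_t on_disk;	/* value that survives unmount */
	bool persistent;
};

void change_inode(struct iversion_model *v)
{
	v->mem++;
	if (v->persistent)
		v->on_disk = v->mem;
}

void remount(struct iversion_model *v)
{
	v->mem = v->on_disk;	/* in-memory state is lost */
}
```

This is why advertising SB_I_VERSION while only bumping an in-memory counter is broken by definition: the observable sequence is not monotonic across a crash or remount.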
Re: [f2fs-dev] [PATCH v6 1/7] fscrypt: Add functions for direct I/O support
On Sun, Jul 26, 2020 at 07:59:46PM -0700, Eric Biggers wrote: > On Mon, Jul 27, 2020 at 10:58:48AM +1000, Dave Chinner wrote: > > On Sat, Jul 25, 2020 at 07:49:20PM -0700, Eric Biggers wrote: > > > On Sat, Jul 25, 2020 at 10:14:41AM +1000, Dave Chinner wrote: > > > > > +bool fscrypt_dio_supported(struct kiocb *iocb, struct iov_iter *iter) > > > > > +{ > > > > > + const struct inode *inode = file_inode(iocb->ki_filp); > > > > > + const unsigned int blocksize = i_blocksize(inode); > > > > > + > > > > > + /* If the file is unencrypted, no veto from us. */ > > > > > + if (!fscrypt_needs_contents_encryption(inode)) > > > > > + return true; > > > > > + > > > > > + /* We only support direct I/O with inline crypto, not fs-layer > > > > > crypto */ > > > > > + if (!fscrypt_inode_uses_inline_crypto(inode)) > > > > > + return false; > > > > > + > > > > > + /* > > > > > + * Since the granularity of encryption is filesystem blocks, > > > > > the I/O > > > > > + * must be block aligned -- not just disk sector aligned. > > > > > + */ > > > > > + if (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter), > > > > > blocksize)) > > > > > + return false; > > > > > > > > Doesn't this force user buffers to be filesystem block size aligned, > > > > instead of 512 byte aligned as is typical for direct IO? > > > > > > > > That's going to cause applications that work fine on normal > > > > filesystems becaues the memalign() buffers to 512 bytes or logical > > > > block device sector sizes (as per the open(2) man page) to fail on > > > > encrypted volumes, and it's not going to be obvious to users as to > > > > why this happens. > > > > > > The status quo is that direct I/O on encrypted files falls back to > > > buffered I/O. > > > > Largely irrelevant. > > > > You claimed in another thread that performance is a key feature that > > inline encryption + DIO provides. Now you're implying that failing > > to provide that performance doesn't really matter at all. 
> > > So this patch is strictly an improvement; it's making direct I/O work in a case > > > where previously it didn't work. > > Improvements still need to follow longstanding conventions. And, > > IMO, it's not an improvement if the feature results in > > unpredictable performance for userspace applications. . > The open() man page does mention that O_DIRECT I/O typically needs to be aligned > to logical_block_size; however it also says "In Linux alignment restrictions > vary by filesystem and kernel version and might be absent entirely." Now you are just language-lawyering. I'll quote from the next paragraph in the man page: "Since Linux 2.6.0, alignment to the logical block size of the underlying storage (typically 512 bytes) suffices. The logical block size can be determined using the ioctl(2) BLKSSZGET operation [...]" There's the longstanding convention I've been talking about, clearly spelt out. Applications that follow this convention (and there are lots) should just work. Having code that works correctly on one file but not on another -on the same filesystem- despite doing all the right things is not at all user friendly. What I really care about is that new functionality works without requiring applications to be rewritten to cater specifically for some whacky foible in a new feature. fscrypt is generic functionality and hardware acceleration of crypt functions are only going to get more common in future. Hence the combination of the two needs to *play nicely* with the vast library of existing userspace software that is already out there. We need to get this stuff correct right from the start, otherwise we're just leaving a huge mountain of technical debt for our future selves to have to clean up. > The other examples of falling back to buffered I/O are relevant, since they show > that similar issues are already being dealt with in the (rare) use cases of > O_DIRECT. So I don't think the convention is as strong as you think it is...
The convention is there so that applications that *expect* to be using direct IO can do so -reliably-. Breaking conventions that people have become accustomed to just because it is convenient for
Re: [f2fs-dev] [PATCH v6 1/7] fscrypt: Add functions for direct I/O support
On Sat, Jul 25, 2020 at 07:49:20PM -0700, Eric Biggers wrote: > On Sat, Jul 25, 2020 at 10:14:41AM +1000, Dave Chinner wrote: > > > +bool fscrypt_dio_supported(struct kiocb *iocb, struct iov_iter *iter) > > > +{ > > > + const struct inode *inode = file_inode(iocb->ki_filp); > > > + const unsigned int blocksize = i_blocksize(inode); > > > + > > > + /* If the file is unencrypted, no veto from us. */ > > > + if (!fscrypt_needs_contents_encryption(inode)) > > > + return true; > > > + > > > + /* We only support direct I/O with inline crypto, not fs-layer crypto */ > > > + if (!fscrypt_inode_uses_inline_crypto(inode)) > > > + return false; > > > + > > > + /* > > > + * Since the granularity of encryption is filesystem blocks, the I/O > > > + * must be block aligned -- not just disk sector aligned. > > > + */ > > > + if (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter), blocksize)) > > > + return false; > > > > Doesn't this force user buffers to be filesystem block size aligned, > > instead of 512 byte aligned as is typical for direct IO? > > > > That's going to cause applications that work fine on normal > > filesystems becaues the memalign() buffers to 512 bytes or logical > > block device sector sizes (as per the open(2) man page) to fail on > > encrypted volumes, and it's not going to be obvious to users as to > > why this happens. > > The status quo is that direct I/O on encrypted files falls back to buffered > I/O. Largely irrelevant. You claimed in another thread that performance is a key feature that inline encryption + DIO provides. Now you're implying that failing to provide that performance doesn't really matter at all. > So this patch is strictly an improvement; it's making direct I/O work in a > case > where previously it didn't work. Improvements still need to follow longstanding conventions. And, IMO, it's not an improvement if the feature results in unpredictable performance for userspace applications. i.e. 
there is no point in enabling direct IO if it is unpredictably going to fall back to the buffered IO path when applications are coded to the guidelines the man page said they should use. Such problems are an utter PITA to diagnose in the field, and on those grounds alone the current implementation gets a NACK. > Note that there are lots of other cases where ext4 and f2fs fall back to > buffered I/O; see ext4_dio_supported() and f2fs_force_buffered_io(). So this > isn't a new problem. No shit, sherlock. But that's also irrelevant to the discussion at hand - claiming "we can fall back to buffered IO" doesn't address the problem I've raised. It's just an excuse for not fixing it. Indeed, the problem is easy to fix - fscrypt only cares that the user IO offset and length is DUN aligned. fscrypt does not care that the user memory buffer is filesystem block aligned - user memory buffer alignment is an underlying hardware DMA constraint - and so fscrypt_dio_supported() needs to relax or remove the user memory buffer alignment constraint so that it follows existing conventions. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
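The distinction drawn above, between file-range alignment (which fscrypt needs for its per-block encryption granularity) and user-buffer alignment (a hardware DMA convention), can be sketched as two predicates. This is an illustrative model with made-up helper names, not the fscrypt code; all sizes are assumed to be powers of two:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Power-of-two alignment check, like the kernel's IS_ALIGNED(). */
#define IS_ALIGNED_TO(x, a)	(((x) & ((a) - 1)) == 0)

/*
 * Behaviour of the posted check: the file range *and* the user memory
 * buffer must all be filesystem-block aligned.
 */
bool dio_ok_strict(uint64_t pos, uint64_t len, uint64_t buf, uint64_t fs_block)
{
	return IS_ALIGNED_TO(pos | len | buf, fs_block);
}

/*
 * The relaxation argued for above: only the file range (offset and
 * length) must be fs-block aligned for the crypto; the memory buffer
 * follows the usual logical-sector-size DMA convention from open(2).
 */
bool dio_ok_relaxed(uint64_t pos, uint64_t len, uint64_t buf,
		    uint64_t fs_block, uint64_t lbs)
{
	return IS_ALIGNED_TO(pos | len, fs_block) && IS_ALIGNED_TO(buf, lbs);
}
```

With a 4096-byte filesystem block and 512-byte logical sectors, a 512-byte-aligned user buffer fails the strict check but passes the relaxed one, which is exactly the class of existing O_DIRECT application the strict check silently pushes back to buffered IO.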
Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto
On Sun, Jul 26, 2020 at 09:47:51AM +1000, Dave Chinner wrote: > On Fri, Jul 24, 2020 at 10:41:32AM -0700, Eric Biggers wrote: > > But again, as far as I can tell, fs/iomap/direct-io.c currently *does* guarantee > > that *if* the input is fully filesystem-block-aligned and if blocksize <= > > PAGE_SIZE, then the issued I/O is also filesystem-block-aligned. > > Please listen to what I'm saying, Eric. > > The -current iomap implementation- may provide that behaviour. That > doesn't mean we guarantee that behaviour. i.e. the iomap -design- > does not guarantee that behaviour, and we don't guarantee such > behaviour into the future. And we won't guarantee this behaviour - > even though the current implementation may provide it - because the > rest of the IO stack below iomap does not provide iomap with that > guarantee. > > Hence if iomap cannot get a guarantee that IO it issues won't get > split at some arbitrary boundary, it cannot provide filesystems with > that guarantee. BTW, if you want iomap_dio_rw() to provide an arbitrary bio alignment guarantee at the iomap layer, then it should be returned in the iomap along with the extent mapping. That could then be used instead of the bdev logical block size. That won't guarantee the behaviour of the rest of the stack, but it would provide a defined IO submission behaviour that iomap would have to guarantee into the future... That would also remove the need to duplicate the alignment checks in the filesystem for fscrypt DIO... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto
On Fri, Jul 24, 2020 at 10:41:32AM -0700, Eric Biggers wrote: > On Fri, Jul 24, 2020 at 03:31:30PM +1000, Dave Chinner wrote: > > On Thu, Jul 23, 2020 at 08:46:28PM -0700, Eric Biggers wrote: > > > On Fri, Jul 24, 2020 at 11:39:10AM +1000, Dave Chinner wrote: > > > > fscrypt_inode_uses_inline_crypto() ends up being: > > > > > > > > if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode) && > > > > inode->i_crypt_info->ci_inlinecrypt) > > > > > > > > I note there are no checks for inode->i_crypt_info being non-null, > > > > and I note that S_ENCRYPTED is set on the inode when the on-disk > > > > encrypted flag is encountered, not when inode->i_crypt_info is set. > > > > > > > > > > ->i_crypt_info is set when the file is opened, so it's guaranteed to be > > > set for > > > any I/O. So the case you're concerned about just doesn't happen. > > > > Ok. The connection is not obvious to someone who doesn't know the > > fscrypt code inside out. > > > > > > > Note that currently, I don't think iomap_dio_bio_actor() would handle > > > > > an > > > > > encrypted file with blocksize > PAGE_SIZE correctly, as the I/O could > > > > > be split > > > > > in the middle of a filesystem block (even after the filesystem > > > > > ensures that > > > > > direct I/O on encrypted files is fully filesystem-block-aligned, > > > > > which we do --- > > > > > see the rest of this patchset), which isn't allowed on encrypted > > > > > files. > > > > > > > > That can already happen unless you've specifically restricted DIO > > > > alignments in the filesystem code. i.e. Direct IO already supports > > > > sub-block ranges and alignment, and we can already do user DIO on > > > > sub-block, sector aligned ranges just fine. And the filesystem can > > > > already split the iomap on sub-block alignments and ranges if it > > > > needs to because the iomap uses byte range addressing, not sector or > > > > block based addressing. 
> > > > > > > > So either you already have a situation where the 2^32 offset can > > > > land *inside* a filesystem block, or the offset is guaranteed to be > > > > filesystem block aligned and so you'll never get this "break an IO > > > > on sub-block alignment" problem regardless of the filesystem block > > > > size... > > > > > > > > Either way, it's not an iomap problem - it's a filesystem mapping > > > > problem... > > > > > > > > > > I think you're missing the point here. Currently, the granularity of > > > encryption > > > (a.k.a. "data unit size") is always filesystem blocks, so that's the > > > minimum we > > > can directly read or write to an encrypted file. This has nothing to do > > > with > > > the IV wraparound case also being discussed. > > > > So when you change the subject, please make it *really obvious* so > > that people don't think you are still talking about the same issue. > > > > > For example, changing a single bit in the plaintext of a filesystem block > > > may > > > result in the entire block's ciphertext changing. (The exact behavior > > > depends > > > on the cryptographic algorithm that is used.) > > > > > > That's why this patchset makes ext4 only allow direct I/O on encrypted > > > files if > > > the I/O is fully filesystem-block-aligned. Note that this might be a more > > > strict alignment requirement than the bdev_logical_block_size(). > > > > > > As long as the iomap code only issues filesystem-block-aligned bios, > > > *given > > > fully filesystem-block-aligned inputs*, we're fine. That appears to be > > > the case > > > currently. > > > > The actual size and shape of the bios issued by direct IO (both old > > code and newer iomap code) is determined by the user supplied iov, > > the size of the biovec array allocated in the bio, and the IO > > constraints of the underlying hardware. 
Hence direct IO does not > > guarantee alignment to anything larger than the underlying block > > device logical sector size because there's no guarantee when or > > where a bio will fill up.
Re: [f2fs-dev] [PATCH v6 1/7] fscrypt: Add functions for direct I/O support
On Fri, Jul 24, 2020 at 06:44:55PM +, Satya Tangirala wrote: > From: Eric Biggers > > Introduce fscrypt_dio_supported() to check whether a direct I/O request > is unsupported due to encryption constraints. > > Also introduce fscrypt_limit_io_blocks() to limit how many blocks can be > added to a bio being prepared for direct I/O. This is needed for > filesystems that use the iomap direct I/O implementation to avoid DUN > wraparound in the middle of a bio (which is possible with the > IV_INO_LBLK_32 IV generation method). Elsewhere fscrypt_mergeable_bio() > is used for this, but iomap operates on logical ranges directly, so > filesystems using iomap won't have a chance to call fscrypt_mergeable_bio() > on every block added to a bio. So we need this function which limits a > logical range in one go. > > Signed-off-by: Eric Biggers > Co-developed-by: Satya Tangirala > Signed-off-by: Satya Tangirala > --- > fs/crypto/crypto.c | 8 + > fs/crypto/inline_crypt.c | 74 > include/linux/fscrypt.h | 18 ++ > 3 files changed, 100 insertions(+) > > diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c > index 9212325763b0..f72f22a718b2 100644 > --- a/fs/crypto/crypto.c > +++ b/fs/crypto/crypto.c > @@ -69,6 +69,14 @@ void fscrypt_free_bounce_page(struct page *bounce_page) > } > EXPORT_SYMBOL(fscrypt_free_bounce_page); > > +/* > + * Generate the IV for the given logical block number within the given file. > + * For filenames encryption, lblk_num == 0. > + * > + * Keep this in sync with fscrypt_limit_io_blocks(). > fscrypt_limit_io_blocks() > + * needs to know about any IV generation methods where the low bits of IV > don't > + * simply contain the lblk_num (e.g., IV_INO_LBLK_32). 
> + */ > void fscrypt_generate_iv(union fscrypt_iv *iv, u64 lblk_num, >const struct fscrypt_info *ci) > { > diff --git a/fs/crypto/inline_crypt.c b/fs/crypto/inline_crypt.c > index d7aecadf33c1..4cdf807b89b9 100644 > --- a/fs/crypto/inline_crypt.c > +++ b/fs/crypto/inline_crypt.c > @@ -16,6 +16,7 @@ > #include > #include > #include > +#include > > #include "fscrypt_private.h" > > @@ -362,3 +363,76 @@ bool fscrypt_mergeable_bio_bh(struct bio *bio, > return fscrypt_mergeable_bio(bio, inode, next_lblk); > } > EXPORT_SYMBOL_GPL(fscrypt_mergeable_bio_bh); > + > +/** > + * fscrypt_dio_supported() - check whether a direct I/O request is > unsupported > + *due to encryption constraints > + * @iocb: the file and position the I/O is targeting > + * @iter: the I/O data segment(s) > + * > + * Return: true if direct I/O is supported > + */ > +bool fscrypt_dio_supported(struct kiocb *iocb, struct iov_iter *iter) > +{ > + const struct inode *inode = file_inode(iocb->ki_filp); > + const unsigned int blocksize = i_blocksize(inode); > + > + /* If the file is unencrypted, no veto from us. */ > + if (!fscrypt_needs_contents_encryption(inode)) > + return true; > + > + /* We only support direct I/O with inline crypto, not fs-layer crypto */ > + if (!fscrypt_inode_uses_inline_crypto(inode)) > + return false; > + > + /* > + * Since the granularity of encryption is filesystem blocks, the I/O > + * must be block aligned -- not just disk sector aligned. > + */ > + if (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter), blocksize)) > + return false; Doesn't this force user buffers to be filesystem block size aligned, instead of 512 byte aligned as is typical for direct IO? That's going to cause applications that work fine on normal filesystems because they memalign() buffers to 512 bytes or logical block device sector sizes (as per the open(2) man page) to fail on encrypted volumes, and it's not going to be obvious to users as to why this happens. 
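The application pattern referred to here - allocating DIO buffers aligned per the open(2) guidance - looks roughly like this minimal userspace sketch (the helper name is illustrative; 512 stands in for the logical block size):

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Conventional O_DIRECT buffer allocation per the open(2) guidance:
 * align the user buffer to the logical block size (typically 512
 * bytes).  Such a buffer is 512-byte aligned but not necessarily
 * 4096-byte aligned, so it would fail the fs-block-size alignment
 * check quoted above.  The helper name is hypothetical.
 */
void *alloc_dio_buffer(size_t len, size_t logical_block_size)
{
	void *buf = NULL;

	if (posix_memalign(&buf, logical_block_size, len) != 0)
		return NULL;
	return buf;
}
```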
XFS has XFS_IOC_DIOINFO to expose exactly this information to userspace on a per-file basis. Other filesystem and VFS developers have said for the past 15 years "we don't need no stinking DIOINFO". The same people shot down adding optional IO alignment constraint fields to statx() a few years ago, too. Yet here we are again, with alignment of DIO buffers being an issue that userspace needs to know about. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
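For reference, XFS_IOC_DIOINFO fills a struct dioattr with the per-file DIO constraints (d_mem: required memory buffer alignment, d_miniosz: minimum I/O size and size granule, d_maxiosz: maximum I/O size). A sketch of how an application could validate a request against those constraints - the struct is mirrored here rather than pulled from xfs_fs.h, and the validation helper is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mirror of the struct dioattr layout filled by XFS_IOC_DIOINFO;
 * field meanings per xfs_fs.h.  The validation helper is illustrative,
 * not part of any real API.
 */
struct dioattr_sketch {
	uint32_t d_mem;     /* required user buffer memory alignment */
	uint32_t d_miniosz; /* minimum I/O size and size granule */
	uint32_t d_maxiosz; /* maximum I/O size */
};

/* check a candidate direct I/O request against the reported limits */
bool dio_request_ok(const struct dioattr_sketch *da,
		    uint64_t buf_addr, uint64_t len)
{
	if (buf_addr % da->d_mem)
		return false;
	if (len == 0 || len % da->d_miniosz)
		return false;
	return len <= da->d_maxiosz;
}
```

On a real XFS file the struct would be filled by ioctl(fd, XFS_IOC_DIOINFO, &da); a portable program could fall back to the logical block size when the ioctl is unsupported.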
Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto
On Thu, Jul 23, 2020 at 08:46:28PM -0700, Eric Biggers wrote: > On Fri, Jul 24, 2020 at 11:39:10AM +1000, Dave Chinner wrote: > > fscrypt_inode_uses_inline_crypto() ends up being: > > > > if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode) && > > inode->i_crypt_info->ci_inlinecrypt) > > > > I note there are no checks for inode->i_crypt_info being non-null, > > and I note that S_ENCRYPTED is set on the inode when the on-disk > > encrypted flag is encountered, not when inode->i_crypt_info is set. > > > > ->i_crypt_info is set when the file is opened, so it's guaranteed to be set > for > any I/O. So the case you're concerned about just doesn't happen. Ok. The connection is not obvious to someone who doesn't know the fscrypt code inside out. > > > Note that currently, I don't think iomap_dio_bio_actor() would handle an > > > encrypted file with blocksize > PAGE_SIZE correctly, as the I/O could be > > > split > > > in the middle of a filesystem block (even after the filesystem ensures > > > that > > > direct I/O on encrypted files is fully filesystem-block-aligned, which we > > > do --- > > > see the rest of this patchset), which isn't allowed on encrypted files. > > > > That can already happen unless you've specifically restricted DIO > > alignments in the filesystem code. i.e. Direct IO already supports > > sub-block ranges and alignment, and we can already do user DIO on > > sub-block, sector aligned ranges just fine. And the filesystem can > > already split the iomap on sub-block alignments and ranges if it > > needs to because the iomap uses byte range addressing, not sector or > > block based addressing. > > > > So either you already have a situation where the 2^32 offset can > > land *inside* a filesystem block, or the offset is guaranteed to be > > filesystem block aligned and so you'll never get this "break an IO > > on sub-block alignment" problem regardless of the filesystem block > > size... 
> > > > Either way, it's not an iomap problem - it's a filesystem mapping > > problem... > > > > I think you're missing the point here. Currently, the granularity of > encryption > (a.k.a. "data unit size") is always filesystem blocks, so that's the minimum > we > can directly read or write to an encrypted file. This has nothing to do with > the IV wraparound case also being discussed. So when you change the subject, please make it *really obvious* so that people don't think you are still talking about the same issue. > For example, changing a single bit in the plaintext of a filesystem block may > result in the entire block's ciphertext changing. (The exact behavior depends > on the cryptographic algorithm that is used.) > > That's why this patchset makes ext4 only allow direct I/O on encrypted files > if > the I/O is fully filesystem-block-aligned. Note that this might be a more > strict alignment requirement than the bdev_logical_block_size(). > > As long as the iomap code only issues filesystem-block-aligned bios, *given > fully filesystem-block-aligned inputs*, we're fine. That appears to be the > case > currently. The actual size and shape of the bios issued by direct IO (both old code and newer iomap code) is determined by the user supplied iov, the size of the biovec array allocated in the bio, and the IO constraints of the underlying hardware. Hence direct IO does not guarantee alignment to anything larger than the underlying block device logical sector size because there's no guarantee when or where a bio will fill up. To guarantee alignment of what ends up at the hardware, you have to set the block device parameters (e.g. logical sector size) appropriately all the way down the stack. You also need to ensure that the filesystem is correctly aligned on the block device so that filesystem blocks don't overlap things like RAID stripe boundaries, linear concat boundaries, etc.
IOWs, to constrain alignment in the IO path, you need to configure your system correctly so that the information provided to iomap for IO alignment matches your requirements. This is not something iomap can do itself; everything from above needs to be constrained by the filesystem using iomap, everything sent below by iomap is constrained by the block device config. > (It's possible that in the future we'll support other encryption data unit > sizes, perhaps powers of 2 from 512 to filesystem block size. But for now the > filesystem block size has been good enough for everyone, Not the case. fscrypt use in enterprise environments needs support for block size < page size so that it can be deployed on 64kB page size machines without requiring 64kB filesystem block sizes. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto
On Thu, Jul 23, 2020 at 04:03:45PM -0700, Eric Biggers wrote: > Hi Dave, > > On Fri, Jul 24, 2020 at 08:07:52AM +1000, Dave Chinner wrote: > > > > > @@ -183,11 +184,16 @@ static void > > > > > iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t > > > > > pos, > > > > > unsigned len) > > > > > { > > > > > + struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > > struct page *page = ZERO_PAGE(0); > > > > > int flags = REQ_SYNC | REQ_IDLE; > > > > > struct bio *bio; > > > > > > > > > > bio = bio_alloc(GFP_KERNEL, 1); > > > > > + > > > > > + /* encrypted direct I/O is guaranteed to be fs-block aligned */ > > > > > + WARN_ON_ONCE(fscrypt_needs_contents_encryption(inode)); > > > > > > > > Which means you are now placing a new constraint on this code in > > > > that we cannot ever, in future, zero entire blocks here. > > > > > > > > This code can issue arbitrary sized zeroing bios - multiple entire fs > > > > blocks > > > > blocks if necessary - so I think constraining it to only support > > > > partial block zeroing by adding a warning like this is no correct. > > > > > > In v3 and earlier this instead had the code to set an encryption context: > > > > > > fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits, > > > GFP_KERNEL); > > > > > > Would you prefer that, even though the call to > > > fscrypt_set_bio_crypt_ctx() would > > > > Actually, I have no idea what that function does. It's not in a > > 5.8-rc6 kernel, and it's not in this patchset > > The cover letter mentions that this patchset is based on fscrypt/master. Which the reader is left to work out where it exists. If you are going to say "based on", then a pointer to the actual tree like this: > That is, "master" of https://git.kernel.org/pub/scm/fs/fscrypt/fscrypt.git is somewhat helpful. > fscrypt_set_bio_crypt_ctx() was introduced by > "fscrypt: add inline encryption support" on that branch. Way too much static inline function abstraction. 
fscrypt_inode_uses_inline_crypto() ends up being: if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode) && inode->i_crypt_info->ci_inlinecrypt) I note there are no checks for inode->i_crypt_info being non-null, and I note that S_ENCRYPTED is set on the inode when the on-disk encrypted flag is encountered, not when inode->i_crypt_info is set. Further, I note that the local variable for ci is fetched before fscrypt_inode_uses_inline_crypto() is run, so leaving a landmine for people who try to access ci before checking if inline crypto is enabled. Or, indeed, if S_ENCRYPTED is set and the crypt_info has not been set up, fscrypt_set_bio_crypt_ctx() will oops in the DIO path. > > > always be a no-op currently (since for now, iomap_dio_zero() will never be > > > called with an encrypted file) and thus wouldn't be properly tested? > > > > Same can be said for this WARN_ON_ONCE() code :) > > > > But, in the interests of not leaving landmines, if a fscrypt context > > is needed to be attached to the bio for data IO in direct IO, it > > should be attached to all bios that are allocated in the dio path > > rather than leave a landmine for people in future to trip over. > > My concern is that if we were to pass the wrong 'lblk' to > fscrypt_set_bio_crypt_ctx(), we wouldn't catch it because it's not tested. > Passing the wrong 'lblk' would cause the data to be encrypted/decrypted > incorrectly. When you implement sub-block DIO alignment for fscrypt enabled filesystems, fsx will tell you pretty quickly if you screwed up > Note that currently, I don't think iomap_dio_bio_actor() would handle an > encrypted file with blocksize > PAGE_SIZE correctly, as the I/O could be split > in the middle of a filesystem block (even after the filesystem ensures that > direct I/O on encrypted files is fully filesystem-block-aligned, which we do > --- > see the rest of this patchset), which isn't allowed on encrypted files. 
That can already happen unless you've specifically restricted DIO alignments in the filesystem code. i.e. Direct IO already supports sub-block ranges and alignment, and we can already do user DIO on sub-block, sector aligned ranges just fine. And the filesystem can already split the iomap on sub-block alignments and ranges if it needs to because the iomap uses byte range addressing, not sector or block based addressing.
Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto
On Wed, Jul 22, 2020 at 03:34:04PM -0700, Eric Biggers wrote: > On Thu, Jul 23, 2020 at 07:16:29AM +1000, Dave Chinner wrote: > > On Mon, Jul 20, 2020 at 11:37:35PM +, Satya Tangirala wrote: > > > From: Eric Biggers > > > > > > Wire up iomap direct I/O with the fscrypt additions for direct I/O. > > > This allows ext4 to support direct I/O on encrypted files when inline > > > encryption is enabled. > > > > > > This change consists of two parts: > > > > > > - Set a bio_crypt_ctx on bios for encrypted files, so that the file > > > contents get encrypted (or decrypted). > > > > > > - Ensure that encryption data unit numbers (DUNs) are contiguous within > > > each bio. Use the new function fscrypt_limit_io_pages() for this, > > > since the iomap code works directly with logical ranges and thus > > > doesn't have a chance to call fscrypt_mergeable_bio() on each page. > > > > > > Note that fscrypt_limit_io_pages() is normally a no-op, as normally the > > > DUNs simply increment along with the logical blocks. But it's needed to > > > handle an edge case in one of the fscrypt IV generation methods. 
> > > > > > Signed-off-by: Eric Biggers > > > Co-developed-by: Satya Tangirala > > > Signed-off-by: Satya Tangirala > > > --- > > > fs/iomap/direct-io.c | 12 +++- > > > 1 file changed, 11 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > > > index ec7b78e6feca..12064daa3e3d 100644 > > > --- a/fs/iomap/direct-io.c > > > +++ b/fs/iomap/direct-io.c > > > @@ -6,6 +6,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include > > > @@ -183,11 +184,16 @@ static void > > > iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, > > > unsigned len) > > > { > > > + struct inode *inode = file_inode(dio->iocb->ki_filp); > > > struct page *page = ZERO_PAGE(0); > > > int flags = REQ_SYNC | REQ_IDLE; > > > struct bio *bio; > > > > > > bio = bio_alloc(GFP_KERNEL, 1); > > > + > > > + /* encrypted direct I/O is guaranteed to be fs-block aligned */ > > > + WARN_ON_ONCE(fscrypt_needs_contents_encryption(inode)); > > > > Which means you are now placing a new constraint on this code in > > that we cannot ever, in future, zero entire blocks here. > > > > This code can issue arbitrary sized zeroing bios - multiple entire fs blocks > > blocks if necessary - so I think constraining it to only support > > partial block zeroing by adding a warning like this is no correct. > > In v3 and earlier this instead had the code to set an encryption context: > > fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits, > GFP_KERNEL); > > Would you prefer that, even though the call to fscrypt_set_bio_crypt_ctx() > would Actually, I have no idea what that function does. It's not in a 5.8-rc6 kernel, and it's not in this patchset > always be a no-op currently (since for now, iomap_dio_zero() will never be > called with an encrypted file) and thus wouldn't be properly tested? 
Same can be said for this WARN_ON_ONCE() code :) But, in the interests of not leaving landmines, if a fscrypt context is needed to be attached to the bio for data IO in direct IO, it should be attached to all bios that are allocated in the dio path rather than leave a landmine for people in future to trip over. > BTW, iomap_dio_zero() is actually limited to one page, so it's not quite > "arbitrary sizes". Yup, but that's an implementation detail, not a design constraint. i.e. I typically review/talk about how stuff functions at a design/architecture level, not how it's been implemented in the code. e.g. block size > page size patches in progress make use of the "arbitrary length" capability of the design: https://lore.kernel.org/linux-xfs/20181107063127.3902-7-da...@fromorbit.com/ > iomap is used for other filesystem operations too, so we need to consider when > to actually do the limiting. I don't think we should break up the extents > returned FS_IOC_FIEMAP, for example. FIEMAP already has a defined behavior. > Also, it would be weird for the list of extents that FIEMAP returns to change > depending on whether the filesystem is mounted with '-o inlinecrypt' or not. We don't need to care
Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto
On Mon, Jul 20, 2020 at 11:37:35PM +, Satya Tangirala wrote: > From: Eric Biggers > > Wire up iomap direct I/O with the fscrypt additions for direct I/O. > This allows ext4 to support direct I/O on encrypted files when inline > encryption is enabled. > > This change consists of two parts: > > - Set a bio_crypt_ctx on bios for encrypted files, so that the file > contents get encrypted (or decrypted). > > - Ensure that encryption data unit numbers (DUNs) are contiguous within > each bio. Use the new function fscrypt_limit_io_pages() for this, > since the iomap code works directly with logical ranges and thus > doesn't have a chance to call fscrypt_mergeable_bio() on each page. > > Note that fscrypt_limit_io_pages() is normally a no-op, as normally the > DUNs simply increment along with the logical blocks. But it's needed to > handle an edge case in one of the fscrypt IV generation methods. > > Signed-off-by: Eric Biggers > Co-developed-by: Satya Tangirala > Signed-off-by: Satya Tangirala > --- > fs/iomap/direct-io.c | 12 +++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index ec7b78e6feca..12064daa3e3d 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -6,6 +6,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -183,11 +184,16 @@ static void > iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, > unsigned len) > { > + struct inode *inode = file_inode(dio->iocb->ki_filp); > struct page *page = ZERO_PAGE(0); > int flags = REQ_SYNC | REQ_IDLE; > struct bio *bio; > > bio = bio_alloc(GFP_KERNEL, 1); > + > + /* encrypted direct I/O is guaranteed to be fs-block aligned */ > + WARN_ON_ONCE(fscrypt_needs_contents_encryption(inode)); Which means you are now placing a new constraint on this code in that we cannot ever, in future, zero entire blocks here. 
This code can issue arbitrary sized zeroing bios - multiple entire fs blocks if necessary - so I think constraining it to only support partial block zeroing by adding a warning like this is not correct. > @@ -253,6 +259,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, > loff_t length, > ret = nr_pages; > goto out; > } > + nr_pages = fscrypt_limit_io_pages(inode, pos, nr_pages); So if "pos" overlaps a 2^32 offset when a certain mode is used, we have to break up the IO? If true, I'm not sure that this belongs here. Limiting the size of the IOs because of filesystem constraints belongs in the filesystem extent mapping code. That's the point where the IO is broken up into maximally sized chunks the filesystem can issue as a contiguous range. If the fscrypt code is breaking that "contiguous IO range" because of the mode being used, the fs mapping code should break the mapping at the boundary where the IO needs to be broken. Hence the dio mapping code here will never build IOs that cross this -filesystem- encryption limitation, and we don't need this fscrypt code in the direct IO path at all. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
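For context on the 2^32 edge case being debated: with the IV_INO_LBLK_32 IV generation method the DUN is a 32-bit value derived from a per-inode hash plus the logical block number, mod 2^32, so DUN contiguity within a bio breaks where that 32-bit value wraps. A helper in the spirit of fscrypt_limit_io_pages() just caps the block count so a single bio never crosses the wrap point - sketched here with illustrative names:

```c
#include <stdint.h>

/*
 * Illustrative sketch of the IV_INO_LBLK_32 wraparound limit: given
 * the 32-bit DUN of the first block and the number of blocks the
 * caller wants to map, return how many blocks can go into one bio
 * without the DUN sequence wrapping past 2^32.  Names are hypothetical,
 * not the kernel's fscrypt_limit_io_blocks() itself.
 */
uint64_t limit_io_blocks_32(uint32_t dun_of_first_block, uint64_t nr_blocks)
{
	/* blocks remaining before the 32-bit DUN wraps back to 0 */
	uint64_t until_wrap = (uint64_t)UINT32_MAX - dun_of_first_block + 1;

	return nr_blocks < until_wrap ? nr_blocks : until_wrap;
}
```

In the common case until_wrap is enormous and the function is a no-op, which matches the commit message's "normally a no-op" description.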
Re: [f2fs-dev] [PATCH 1/4] fs: introduce SB_INLINECRYPT
On Mon, Jun 22, 2020 at 06:50:17PM -0700, Eric Biggers wrote: > On Tue, Jun 23, 2020 at 10:46:36AM +1000, Dave Chinner wrote: > > On Wed, Jun 17, 2020 at 08:19:35PM -0700, Eric Biggers wrote: > > > Are you objecting to the use of a SB_* flag, or just to showing the flag > > > in > > > show_sb_opts() instead of in the individual filesystems? Note that the > > > SB_* > > > flag was requested by Christoph > > > (https://lkml.kernel.org/r/20191031183217.gf23...@infradead.org/, > > > https://lkml.kernel.org/r/20191031212103.ga6...@infradead.org/). We > > > originally > > > used a function fscrypt_operations::inline_crypt_enabled() instead. > > > > I'm objecting to the layering violations of having the filesystem > > control the mount option parsing and superblock feature flags, but > > then having no control over whether features that the filesystem has > > indicated to the VFS it is using get emitted as a mount option or > > not, and then having the VFS code unconditionally override the > > functionality that the filesystem uses because it thinks it's a > > mount option the filesystem supports > > > > For example, the current mess that has just come to light: > > filesystems like btrfs and XFS v5 which set SB_IVERSION > > unconditionally (i.e. it's not a mount option!) end up having that > > functionality turned off on remount because the VFS conflates > > MS_IVERSION with SB_IVERSION and so unconditionally clears > > SB_IVERSION because MS_IVERSION is not set on remount by userspace. > > Which userspace will never set because the filesystems don't put > > "iversion" in their mount option strings because -it's not a mount > > option- for those filesystems. > > > > See the problem? MS_IVERSION should be passed to the filesystem to > > deal with as a mount option, not treated as a flag to directly > > change SB_IVERSION in the superblock. 
> > > > We really need to stop with the "global mount options for everyone > > at the VFS" and instead pass everything down to the filesystems to > > parse appropriately. Yes, provide generic helper functions to deal > > with the common flags that everything supports, but the filesystems > > should be masking off mount options they don't support changing > > before changing their superblock feature support mask > > I think the MS_* flags are best saved for mount options that are applicable to > many/most filesystems and are mostly/entirely implementable at the VFS level. That's the theory, but so far it's caused nothing but pain. In reality, I think the only sane way forward is to stop mount option parsing in userspace (i.e. no new MS_* flags) for any new functionality as it only leads to future pain. i.e. all new mount options should be parsed entirely in the kernel by the filesystem parsing code > I don't think "inlinecrypt" qualifies, since while it will be shared by the > block device-based filesystems that support fscrypt, that is only 2 > filesystems > currently; and while some of the code needed to implement it is shared in > fs/crypto/, there are still substantial filesystem-specific hooks needed. Right. I wasn't suggesting this patchset should use an MS_ flag - it was pointing out the problem with the VFS code using SB_ flags to indicate enabled filesystem functionality unconditionally as a mount option that can be changed by userspace. > Hence this patchset intentionally does *not* allocate an MS_INLINECRYPT flag. > > I believe that already addresses half of your concern, as it means > SB_INLINECRYPT can only be set/cleared by the filesystem itself, not by the > VFS. > (But the commit message could use an explanation of this.) > > The other half would be addressed by the following change, right? Yes, it does. Thanks, Eric! Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH 1/4] fs: introduce SB_INLINECRYPT
On Wed, Jun 17, 2020 at 08:19:35PM -0700, Eric Biggers wrote: > On Thu, Jun 18, 2020 at 11:19:12AM +1000, Dave Chinner wrote: > > On Wed, Jun 17, 2020 at 07:57:29AM +, Satya Tangirala wrote: > > > Introduce SB_INLINECRYPT, which is set by filesystems that wish to use > > > blk-crypto for file content en/decryption. This flag maps to the > > > '-o inlinecrypt' mount option which multiple filesystems will implement, > > > and code in fs/crypto/ needs to be able to check for this mount option > > > in a filesystem-independent way. > > > > > > Signed-off-by: Satya Tangirala > > > --- > > > fs/proc_namespace.c | 1 + > > > include/linux/fs.h | 1 + > > > 2 files changed, 2 insertions(+) > > > > > > diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c > > > index 3059a9394c2d..e0ff1f6ac8f1 100644 > > > --- a/fs/proc_namespace.c > > > +++ b/fs/proc_namespace.c > > > @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct > > > super_block *sb) > > > { SB_DIRSYNC, ",dirsync" }, > > > { SB_MANDLOCK, ",mand" }, > > > { SB_LAZYTIME, ",lazytime" }, > > > + { SB_INLINECRYPT, ",inlinecrypt" }, > > > { 0, NULL } > > > }; > > > const struct proc_fs_opts *fs_infop; > > > > NACK. > > > > SB_* flgs are for functionality enabled on the superblock, not for > > indicating mount options that have been set by the user. > > That's an interesting claim, given that most SB_* flags are for mount options. > E.g.: > > ro => SB_RDONLY > nosuid => SB_NOSUID > nodev => SB_NODEV > noexec => SB_NOEXEC > sync => SB_SYNCHRONOUS > mand => SB_MANDLOCK > noatime => SB_NOATIME > nodiratime => SB_NODIRATIME > lazytime => SB_LAZYTIME Yes, they *reflect* options set by mount options, but this is all so screwed up because the split of superblock functionality from the mount option API (i.e. 
the MS_* flag introduction to avoid having the superblock feature flags being directly defined by the userspace mount API) was never followed through to properly separate the implementation of *active superblock feature flags* from the *user specified mount API flags*. Yes, the UAPI definitions were separated, but the rest of the interface wasn't and only works because of the "MS* flag exactly equal to the SB* flag" hack that was used. So now people have no idea when to use one or the other and we're repeatedly ending up with broken mount option parsing because SB flags are used where MS flags should be used and vice versa. We've made a damn mess of mount options, and the fscontext stuff hasn't fixed any of this ... mess. It's just stirred it around and so nobody really knows what they are supposed to do with mount options right now. > > If the mount options are directly parsed by the filesystem option > > parser (as is done later in this patchset), then the mount option > > setting should be emitted by the filesystem's ->show_options > > function, not a generic function. > > > > The option string must match what the filesystem defines, not > > require separate per-filesystem and VFS definitions of the same > > option that people could potentially get wrong (*cough* i_version vs > > iversion *cough*) > > Are you objecting to the use of a SB_* flag, or just to showing the flag in > show_sb_opts() instead of in the individual filesystems? Note that the SB_* > flag was requested by Christoph > (https://lkml.kernel.org/r/20191031183217.gf23...@infradead.org/, > https://lkml.kernel.org/r/20191031212103.ga6...@infradead.org/). We > originally > used a function fscrypt_operations::inline_crypt_enabled() instead. 
I'm objecting to the layering violations of having the filesystem control the mount option parsing and superblock feature flags, but then having no control over whether features that the filesystem has indicated to the VFS it is using get emitted as a mount option or not, and then having the VFS code unconditionally override the functionality that the filesystem uses because it thinks it's a mount option the filesystem supports. For example, the current mess that has just come to light: filesystems like btrfs and XFS v5 which set SB_IVERSION unconditionally (i.e. it's not a mount option!) end up having that functionality turned off on remount because the VFS conflates MS_IVERSION with SB_IVERSION and so unconditionally clears SB_IVERSION because MS_IVERSION is not set on remount by userspace. Which user
Re: [f2fs-dev] [PATCH 1/4] fs: introduce SB_INLINECRYPT
On Wed, Jun 17, 2020 at 07:57:29AM +, Satya Tangirala wrote: > Introduce SB_INLINECRYPT, which is set by filesystems that wish to use > blk-crypto for file content en/decryption. This flag maps to the > '-o inlinecrypt' mount option which multiple filesystems will implement, > and code in fs/crypto/ needs to be able to check for this mount option > in a filesystem-independent way. > > Signed-off-by: Satya Tangirala > --- > fs/proc_namespace.c | 1 + > include/linux/fs.h | 1 + > 2 files changed, 2 insertions(+) > > diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c > index 3059a9394c2d..e0ff1f6ac8f1 100644 > --- a/fs/proc_namespace.c > +++ b/fs/proc_namespace.c > @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct > super_block *sb) > { SB_DIRSYNC, ",dirsync" }, > { SB_MANDLOCK, ",mand" }, > { SB_LAZYTIME, ",lazytime" }, > + { SB_INLINECRYPT, ",inlinecrypt" }, > { 0, NULL } > }; > const struct proc_fs_opts *fs_infop; NACK. SB_* flags are for functionality enabled on the superblock, not for indicating mount options that have been set by the user. If the mount options are directly parsed by the filesystem option parser (as is done later in this patchset), then the mount option setting should be emitted by the filesystem's ->show_options function, not a generic function. The option string must match what the filesystem defines, not require separate per-filesystem and VFS definitions of the same option that people could potentially get wrong (*cough* i_version vs iversion *cough*). Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH 2/4] fs: avoid double-writing the inode on a lazytime expiration
On Wed, Mar 25, 2020 at 01:28:23PM +0100, Christoph Hellwig wrote: > In the case that an inode has a dirty timestamp for longer than the > lazytime expiration timeout (or if all such inodes are being flushed > out due to a sync or syncfs system call), we need to inform the file > system that the inode is dirty so that the inode's timestamps can be > copied out to the on-disk data structures. That's because if the file > system supports lazytime, it will have ignored the dirty_inode(inode, > I_DIRTY_TIME) notification when the timestamp was modified in memory. > Previously, this was accomplished by calling mark_inode_dirty_sync(), > but that has the unfortunate side effect of also putting the inode on the > writeback list, and that's not necessary in this case, since we will > immediately call write_inode() afterwards. Replace the call to > mark_inode_dirty_sync() with a new lazytime_expired method to clearly > separate out this case. Hmmm. Doesn't this cause issues with both iput() and vfs_fsync_range()? They call mark_inode_dirty_sync() on I_DIRTY_TIME inodes to move them onto the writeback list so they are appropriately expired when the inode is written back. 
i.e.: > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index 2094386af8ac..e5aafd40dd0f 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -612,19 +612,13 @@ xfs_fs_destroy_inode( > } > > static void > -xfs_fs_dirty_inode( > - struct inode *inode, > - int flag) > +xfs_fs_lazytime_expired( > + struct inode *inode) > { > struct xfs_inode *ip = XFS_I(inode); > struct xfs_mount *mp = ip->i_mount; > struct xfs_trans *tp; > > - if (!(inode->i_sb->s_flags & SB_LAZYTIME)) > - return; > - if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME)) > - return; > - > if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp)) > return; > xfs_ilock(ip, XFS_ILOCK_EXCL); > @@ -1053,7 +1047,7 @@ xfs_fs_free_cached_objects( > static const struct super_operations xfs_super_operations = { > .alloc_inode = xfs_fs_alloc_inode, > .destroy_inode = xfs_fs_destroy_inode, > - .dirty_inode = xfs_fs_dirty_inode, > + .lazytime_expired = xfs_fs_lazytime_expired, > .drop_inode = xfs_fs_drop_inode, > .put_super = xfs_fs_put_super, > .sync_fs = xfs_fs_sync_fs, This means XFS no longer updates/logs the current timestamp, because ->dirty_inode(I_DIRTY_SYNC) is no longer called for XFS before ->fsync flushes the inode data and metadata changes to the journal. Hence the current in-memory timestamps are not present in the log before the fsync is run, and so we violate the fsync guarantees lazytime gives for timestamp updates. I haven't quite got it straight in my head if the iput() case has similar problems, but the fsync case definitely looks broken. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH] writeback: avoid double-writing the inode on a lazytime expiration
On Thu, Mar 12, 2020 at 07:34:45AM -0700, Christoph Hellwig wrote: > On Thu, Mar 12, 2020 at 11:07:17AM +1100, Dave Chinner wrote: > > > That's true, but when the timestamps were originally modified, > > > dirty_inode() will be called with flag == I_DIRTY_TIME, which will > > > *not* be a no-op; which is to say, XFS will force the timestamps to be > > > updated on disk when the timestamps are first dirtied, because it > > > doesn't support I_DIRTY_TIME. > > > > We log the initial timestamp change, and then ignore timestamp > > updates until the dirty time expires and the inode is set > > I_DIRTY_SYNC via __mark_inode_dirty_sync(). IOWs, on expiry, we have > > time stamps that may be 24 hours out of date in memory, and they > > still need to be flushed to the journal. > > > > However, your change does not mark the inode dirty on expiry > > anymore, so... > > > > > So I think we're fine. > > > > ... we're not fine. This breaks XFS and any other filesystem that > > relies on an I_DIRTY_SYNC notification to handle dirty time expiry > > correctly. > > I haven't seen the original mail this replies to, The original problem was that calling mark_inode_dirty_sync() on expiry during inode writeback was causing the inode to be put back on the dirty inode list, and so ext4 was flushing it twice - once on expiry and once 5 seconds later on the next background writeback pass. This is a problem that XFS does not have because it does not implement ->write_inode... > but if we could > get the lazytime expiry by some other means (e.g. an explicit > callback), XFS could opt out of all the VFS inode tracking again, > which would simplify a few things. Yes, that would definitely make things simpler for XFS, and it would also solve the problem that the generic lazytime expiry code has. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH] writeback: avoid double-writing the inode on a lazytime expiration
On Wed, Mar 11, 2020 at 08:57:49AM -0400, Theodore Y. Ts'o wrote: > On Tue, Mar 10, 2020 at 08:20:09PM -0700, Eric Biggers wrote: > > Thanks Ted! This fixes the fscrypt test failure. > > > > However, are you sure this works correctly on all filesystems? I'm not sure > > about XFS. XFS only implements ->dirty_inode(), not ->write_inode(), and > > in its > > ->dirty_inode() it does: > ... > > if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME)) > > return; > > That's true, but when the timestamps were originally modified, > dirty_inode() will be called with flag == I_DIRTY_TIME, which will > *not* be a no-op; which is to say, XFS will force the timestamps to be > updated on disk when the timestamps are first dirtied, because it > doesn't support I_DIRTY_TIME. We log the initial timestamp change, and then ignore timestamp updates until the dirty time expires and the inode is set I_DIRTY_SYNC via __mark_inode_dirty_sync(). IOWs, on expiry, we have time stamps that may be 24 hours out of date in memory, and they still need to be flushed to the journal. However, your change does not mark the inode dirty on expiry anymore, so... > So I think we're fine. ... we're not fine. This breaks XFS and any other filesystem that relies on an I_DIRTY_SYNC notification to handle dirty time expiry correctly. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH] writeback: avoid double-writing the inode on a lazytime expiration
On Tue, Mar 10, 2020 at 08:20:09PM -0700, Eric Biggers wrote: > On Fri, Mar 06, 2020 at 09:00:43PM -0500, Theodore Ts'o wrote: > > In the case that an inode has dirty timestamp for longer than the > > lazytime expiration timeout (or if all such inodes are being flushed > > out due to a sync or syncfs system call), we need to inform the file > > system that the inode is dirty so that the inode's timestamps can be > > copied out to the on-disk data structures. That's because if the file > > system supports lazytime, it will have ignored the dirty_inode(inode, > > I_DIRTY_TIME) notification when the timestamp was modified in memory.q > > > > Previously, this was accomplished by calling mark_inode_dirty_sync(), > > but that has the unfortunate side effect of also putting the inode the > > writeback list, and that's not necessary in this case, since we will > > immediately call write_inode() afterwards. > > > > Eric Biggers noticed that this was causing problems for fscrypt after > > the key was removed[1]. > > > > [1] https://lore.kernel.org/r/20200306004555.gb225...@gmail.com > > > > Reported-by: Eric Biggers > > Signed-off-by: Theodore Ts'o > > --- > > fs/fs-writeback.c | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > > index 76ac9c7d32ec..32101349ba97 100644 > > --- a/fs/fs-writeback.c > > +++ b/fs/fs-writeback.c > > @@ -1504,8 +1504,9 @@ __writeback_single_inode(struct inode *inode, struct > > writeback_control *wbc) > > > > spin_unlock(&inode->i_lock); > > > > - if (dirty & I_DIRTY_TIME) > > - mark_inode_dirty_sync(inode); > > + /* This was a lazytime expiration; we need to tell the file system */ > > + if (dirty & I_DIRTY_TIME_EXPIRED && inode->i_sb->s_op->dirty_inode) > > + inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_TIME_EXPIRED); > > /* Don't write the inode if only I_DIRTY_PAGES was set */ > > if (dirty & ~I_DIRTY_PAGES) { > > int err = write_inode(inode, wbc); > > -- > > Thanks Ted! 
This fixes the fscrypt test failure. > > However, are you sure this works correctly on all filesystems? I'm not sure > about XFS. XFS only implements ->dirty_inode(), not ->write_inode(), and in > its > ->dirty_inode() it does: > > static void > xfs_fs_dirty_inode( > struct inode *inode, > int flag) > { > struct xfs_inode *ip = XFS_I(inode); > struct xfs_mount *mp = ip->i_mount; > struct xfs_trans *tp; > > if (!(inode->i_sb->s_flags & SB_LAZYTIME)) > return; > if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME)) > return; > > if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp)) > return; > xfs_ilock(ip, XFS_ILOCK_EXCL); > xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP); > xfs_trans_commit(tp); > } > > > So flag=I_DIRTY_TIME_EXPIRED will be a no-op. > > Maybe you should be using I_DIRTY_SYNC instead? Or perhaps XFS should be > checking for either I_DIRTY_TIME_EXPIRED or I_DIRTY_SYNC? Right, XFS does not use the VFS inode writeback code at all - we track all metadata changes internally via journalling. The VFS uses I_DIRTY_SYNC to indicate an inode is metadata dirty and a writeback candidate. Hence if we need to mark an inode dirty for integrity purposes for _any reason_, then I_DIRTY_SYNC is the correct flag to be passing to ->dirty_inode. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
On Tue, Feb 18, 2020 at 10:04:15PM -0800, Matthew Wilcox wrote: > On Wed, Feb 19, 2020 at 02:29:00PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote: > > > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t > > > pos, loff_t length, > > > } > > > ret = iomap_readpage_actor(inode, pos + done, length - done, > > > ctx, iomap, srcmap); > > > + if (WARN_ON(ret == 0)) > > > + break; > > > > This error case now leaks ctx->cur_page. > > Yes ... and I see the consequence. I mean, this is a "shouldn't happen", > so do we want to put effort into cleanup here ... Well, the normal thing for XFS is that a production kernel cleans up and handles the error gracefully with a WARN_ON_ONCE, while a debug kernel build will chuck a tanty and burn the house down to make the developers aware that there is a "should not happen" situation occurring. > > > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, > > > struct list_head *pages, > > > done: > > > if (ctx.bio) > > > submit_bio(ctx.bio); > > > - if (ctx.cur_page) { > > > - if (!ctx.cur_page_in_bio) > > > - unlock_page(ctx.cur_page); > > > - put_page(ctx.cur_page); > > > - } > > > + BUG_ON(ctx.cur_page); > > > > And so will now trigger both a warn and a bug. > > ... or do we just want to run slap bang into this bug? > > Option 1: Remove the check for 'ret == 0' altogether, as we had it before. > That puts us into endless loop territory for a failure mode, and it's not > parallel with iomap_readpage(). > > Option 2: Remove the WARN_ON from the check. Then we just hit the BUG_ON, > but we don't know why we did it. > > Option 3: Set cur_page to NULL. We'll hit the WARN_ON, avoid the BUG_ON, > might end up with a page in the page cache which is never unlocked. None of these are appealing. > Option 4: Do the unlock/put page dance before setting the cur_page to NULL. > We might double-unlock the page. Why would we double-unlock the page? 
Oh, the readahead cursor doesn't handle the case of partial page submission, which would result in IO completion unlocking the page. Ok, that's what the ctx.cur_page_in_bio check is used to detect, i.e. if we've got a page that the readahead cursor points at, and we haven't actually added it to a bio, then we can leave it to read_pages() to unlock and clean up. If it's in a bio, then IO completion will unlock it, and so we only have to drop the submission reference and move the readahead cursor forwards so read_pages() doesn't try to unlock this page. i.e.: /* clean up partial page submission failures */ if (ctx.cur_page && ctx.cur_page_in_bio) { put_page(ctx.cur_page); readahead_next(rac); } looks to me like it will handle the case of "ret == 0" in the actor function just fine. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 00/19] Change readahead API
On Tue, Feb 18, 2020 at 07:48:32PM -0800, Matthew Wilcox wrote: > On Wed, Feb 19, 2020 at 02:45:25PM +1100, Dave Chinner wrote: > > On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote: > > > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > > > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > > > > Latest version in your git tree: > > > > > > > > > > $ ▶ glo -n 5 willy/readahead > > > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > > > > ff63497fcb98 iomap: Convert from readpages to readahead > > > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > > > > 8115bcca7312 fuse: Convert from readpages to readahead > > > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > > > > $ > > > > > > > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > > > > > > > [2.423116] [ cut here ] > > > > > [2.424957] list_add double add: new=ea000efff4c8, > > > > > prev=8883bfffee60, next=ea000efff4c8. > > > > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > > > > __list_add_valid+0x67/0x70 > > > > > [2.457484] Call Trace: > > > > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > > > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > > > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > > > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > > > > [2.462787] lru_add_drain+0x18/0x20 > > > > > > > > Are you sure that was 4be497096c04 ? I ask because there was a > > > > > > Yes, because it's the only version I've actually merged into my > > > working tree, compiled and tried to run. :P > > > > > > > version pushed to that git tree that did contain a list double-add > > > > (due to a mismerge when shuffling patches). I noticed it and fixed > > > > it, and 4be497096c04 doesn't have that problem. 
I also test with > > > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > > > > probabilistic because it'll depend on the timing between whatever other > > > > list is being used and the page actually being added to the LRU. > > > > > > I'll see if I can reproduce it. > > > > Just updated to a current TOT Linus kernel and your latest branch, > > and so far this is 100% reproducable. > > > > Not sure how I'm going to debug it yet, because it's init that is > > triggering it > > Eric found it ... Yeah, just saw that and am applying his patch to test it... > still not sure why I don't see it. No readahead configured on your device? Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 00/19] Change readahead API
On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote: > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > > Latest version in your git tree: > > > > > > $ ▶ glo -n 5 willy/readahead > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > > ff63497fcb98 iomap: Convert from readpages to readahead > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > > 8115bcca7312 fuse: Convert from readpages to readahead > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > > $ > > > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > > > [2.423116] [ cut here ] > > > [2.424957] list_add double add: new=ea000efff4c8, > > > prev=8883bfffee60, next=ea000efff4c8. > > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > > __list_add_valid+0x67/0x70 > > > [2.457484] Call Trace: > > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > > [2.462787] lru_add_drain+0x18/0x20 > > > > Are you sure that was 4be497096c04 ? I ask because there was a > > Yes, because it's the only version I've actually merged into my > working tree, compiled and tried to run. :P > > > version pushed to that git tree that did contain a list double-add > > (due to a mismerge when shuffling patches). I noticed it and fixed > > it, and 4be497096c04 doesn't have that problem. I also test with > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > > probabilistic because it'll depend on the timing between whatever other > > list is being used and the page actually being added to the LRU. > > I'll see if I can reproduce it. Just updated to a current TOT Linus kernel and your latest branch, and so far this is 100% reproducable. 
Not sure how I'm going to debug it yet, because it's init that is triggering it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path
On Mon, Feb 17, 2020 at 10:46:13AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Ensure that memory allocations in the readahead path do not attempt to > reclaim file-backed pages, which could lead to a deadlock. It is > possible, though unlikely this is the root cause of a problem observed > by Cong Wang. > > Signed-off-by: Matthew Wilcox (Oracle) > Reported-by: Cong Wang > Suggested-by: Michal Hocko > --- > mm/readahead.c | 14 ++ > 1 file changed, 14 insertions(+) > > diff --git a/mm/readahead.c b/mm/readahead.c > index 94d499cfb657..8f9c0dba24e7 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -22,6 +22,7 @@ > #include > #include > #include > +#include > > #include "internal.h" > > @@ -174,6 +175,18 @@ void page_cache_readahead_limit(struct address_space > *mapping, > ._nr_pages = 0, > }; > > + /* > + * Partway through the readahead operation, we will have added > + * locked pages to the page cache, but will not yet have submitted > + * them for I/O. Adding another page may need to allocate memory, > + * which can trigger memory reclaim. Telling the VM we're in > + * the middle of a filesystem operation will cause it to not > + * touch file-backed pages, preventing a deadlock. Most (all?) > + * filesystems already specify __GFP_NOFS in their mapping's > + * gfp_mask, but let's be explicit here. > + */ > + unsigned int nofs = memalloc_nofs_save(); > + So doesn't this largely remove the need for all the gfp flag futzing in the readahead path? i.e. almost all readahead allocations are now going to be GFP_NOFS | GFP_NORETRY | GFP_NOWARN ? If so, shouldn't we just strip all the gfp flags and masking out of the readahead path altogether? Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 18/19] iomap: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:12AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in iomap. Convert XFS and ZoneFS to > use it. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/iomap/buffered-io.c | 91 +++--- > fs/iomap/trace.h | 2 +- > fs/xfs/xfs_aops.c | 13 +++--- > fs/zonefs/super.c | 7 ++-- > include/linux/iomap.h | 3 +- > 5 files changed, 42 insertions(+), 74 deletions(-) All pretty straight forward... > @@ -416,6 +384,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, > loff_t length, > break; > done += ret; > if (offset_in_page(pos + done) == 0) { > + readahead_next(ctx->rac); > if (!ctx->cur_page_in_bio) > unlock_page(ctx->cur_page); > put_page(ctx->cur_page); Though now I look at the addition here, this might be better restructured to mention how we handle partial page submission such as: done += ret; /* * Keep working on a partially complete page, otherwise ready * the ctx for the next page to be acted on. */ if (offset_in_page(pos + done)) continue; if (!ctx->cur_page_in_bio) unlock_page(ctx->cur_page); put_page(ctx->cur_page); ctx->cur_page = NULL; readahead_next(ctx->rac); } Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > By putting the 'have we reached the end of the page' condition at the end > of the loop instead of the beginning, we can remove the 'submit the last > page' code from iomap_readpages(). Also check that iomap_readpage_actor() > didn't return 0, which would lead to an endless loop. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/iomap/buffered-io.c | 25 - > 1 file changed, 12 insertions(+), 13 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index cb3511eb152a..44303f370b2d 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, > loff_t length, > void *data, struct iomap *iomap, struct iomap *srcmap) > { > struct iomap_readpage_ctx *ctx = data; > - loff_t done, ret; > + loff_t ret, done = 0; > > - for (done = 0; done < length; done += ret) { > - if (ctx->cur_page && offset_in_page(pos + done) == 0) { > - if (!ctx->cur_page_in_bio) > - unlock_page(ctx->cur_page); > - put_page(ctx->cur_page); > - ctx->cur_page = NULL; > - } > + while (done < length) { > if (!ctx->cur_page) { > ctx->cur_page = iomap_next_page(inode, ctx->pages, > pos, length, &done); > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, > loff_t length, > } > ret = iomap_readpage_actor(inode, pos + done, length - done, > ctx, iomap, srcmap); > + if (WARN_ON(ret == 0)) > + break; This error case now leaks ctx->cur_page > + done += ret; > + if (offset_in_page(pos + done) == 0) { > + if (!ctx->cur_page_in_bio) > + unlock_page(ctx->cur_page); > + put_page(ctx->cur_page); > + ctx->cur_page = NULL; > + } > } > > return done; > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct > list_head *pages, > done: > if (ctx.bio) > submit_bio(ctx.bio); > - if (ctx.cur_page) { > - if (!ctx.cur_page_in_bio) > - unlock_page(ctx.cur_page); > - 
put_page(ctx.cur_page); > - } > + BUG_ON(ctx.cur_page); And so will now trigger both a warn and a bug. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 16/19] fuse: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:09AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in fuse. Switching away from the > read_cache_pages() helper gets rid of an implicit call to put_page(), > so we can get rid of the get_page() call in fuse_readpages_fill(). > > Signed-off-by: Matthew Wilcox (Oracle) Looks OK. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 14/19] ext4: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:05AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in ext4 > > Signed-off-by: Matthew Wilcox (Oracle) There's nothing I can see in this that would cause that list corruption I saw with ext4. I'll re-introduce the patch and see if it falls over again. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 13/19] erofs: Convert compressed files from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:03AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in erofs. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/erofs/zdata.c | 29 + > 1 file changed, 9 insertions(+), 20 deletions(-) Looks fine. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 12/19] erofs: Convert uncompressed files from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:01AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in erofs > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/erofs/data.c | 39 +--- > fs/erofs/zdata.c | 2 +- > include/trace/events/erofs.h | 6 +++--- > 3 files changed, 18 insertions(+), 29 deletions(-) Looks fine from the perspective of page iteration and error handling. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 11/19] btrfs: Convert from readpages to readahead
On Tue, Feb 18, 2020 at 01:12:28PM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:57:58PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:59AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > - if (nr) { > > > - u64 contig_start = page_offset(pagepool[0]); > > > + ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end); > > > > Ok, yes it does. :) > > > > I don't see how readahead_for_each_batch() guarantees that, though. > > I ... don't see how it doesn't? We start at rac->_start and iterate > through the consecutive pages in the page cache. readahead_for_each_batch() > does assume that __do_page_cache_readahead() has its current behaviour > of putting the pages in the page cache in order, and kicks off a new > call to ->readahead() every time it has to skip an index for whatever > reason (eg page already in page cache). And there is the comment I was looking for while reading readahead_for_each_batch() :) > > > > - if (bio) > > > - return submit_one_bio(bio, 0, bio_flags); > > > - return 0; > > > + if (bio) { > > > + if (submit_one_bio(bio, 0, bio_flags)) > > > + return; > > > + } > > > } > > > > Shouldn't that just be > > > > if (bio) > > submit_one_bio(bio, 0, bio_flags); > > It should, but some overzealous person decided to mark submit_one_bio() > as __must_check, so I have to work around that. /me looks at code Ngggh. I rather dislike functions that are named in a way that look like they belong to core kernel APIs but in reality are local static functions. I'd ask for this to be fixed if it was generic code, but it's btrfs specific code so they can deal with the ugliness of their own creation. :/ > > Confusing when put alongside rac->_batch_count counting the number > > of pages in the batch, and "batch" being the index into the page > > array, and they aren't the same counts > > Yes. Renamed to 'i'. 
> > > > + XA_STATE(xas, &rac->mapping->i_pages, rac->_start); > > > + struct page *page; > > > + > > > + rac->_batch_count = 0; > > > + xas_for_each(&xas, page, rac->_start + rac->_nr_pages - 1) { > > > > That just iterates pages in the start,end doesn't it? What > > guarantees that this fills the array with a contiguous page range? > > The behaviour of __do_page_cache_readahead(). Dave Howells also has a > usecase for xas_for_each_contig(), so I'm going to add that soon. > > > > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > > + VM_BUG_ON_PAGE(PageTail(page), page); > > > + array[batch++] = page; > > > + rac->_batch_count += hpage_nr_pages(page); > > > + if (PageHead(page)) > > > + xas_set(&xas, rac->_start + rac->_batch_count); > > > > What on earth does this do? Comments please! > > /* >* The page cache isn't using multi-index entries yet, >* so xas_for_each() won't do the right thing for >* large pages. This can be removed once the page cache >* is converted. >*/ Oh, it's changing the internal xarray lookup cursor position to point at the correct next page index? Perhaps it's better to say that instead of "won't do the right thing"? > > > +#define readahead_for_each_batch(rac, array, size, nr) > > > \ > > > + for (; (nr = readahead_page_batch(rac, array, size)); \ > > > + readahead_next(rac)) > > > > I had to go look at the caller to work out what "size" refered to > > here. > > > > This is complex enough that it needs proper API documentation. > > How about just: > > -#define readahead_for_each_batch(rac, array, size, nr) \ > - for (; (nr = readahead_page_batch(rac, array, size)); \ > +#define readahead_for_each_batch(rac, array, array_sz, nr) \ > + for (; (nr = readahead_page_batch(rac, array, array_sz)); \ Yup, that's fine - now the macro documents itself. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 09/19] mm: Add page_cache_readahead_limit
On Tue, Feb 18, 2020 at 11:54:04AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:31:10PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > ext4 and f2fs have duplicated the guts of the readahead code so > > > they can read past i_size. Instead, separate out the guts of the > > > readahead code so they can call it directly. > > > > Gross and nasty (hosting non-stale data beyond EOF in the page > > cache, that is). > > I thought you meant sneaking changes into the VFS (that were rejected) by > copying VFS code and modifying it ... Well, now that you mention it... :P > > > +/** > > > + * page_cache_readahead_limit - Start readahead beyond a file's i_size. > > > + * @mapping: File address space. > > > + * @file: This instance of the open file; used for authentication. > > > + * @offset: First page index to read. > > > + * @end_index: The maximum page index to read. > > > + * @nr_to_read: The number of pages to read. > > > + * @lookahead_size: Where to start the next readahead. > > > + * > > > + * This function is for filesystems to call when they want to start > > > + * readahead potentially beyond a file's stated i_size. If you want > > > + * to start readahead on a normal file, you probably want to call > > > + * page_cache_async_readahead() or page_cache_sync_readahead() instead. > > > + * > > > + * Context: File is referenced by caller. Mutexes may be held by caller. > > > + * May sleep, but will not reenter filesystem to reclaim memory. > > > */ > > > -void __do_page_cache_readahead(struct address_space *mapping, > > > - struct file *filp, pgoff_t offset, unsigned long nr_to_read, > > > - unsigned long lookahead_size) > > > +void page_cache_readahead_limit(struct address_space *mapping, > > > > ... I don't think the function name conveys it's purpose. It's > > really a ranged readahead that ignores where i_size lies. 
i.e > > > > page_cache_readahead_range(mapping, start, end, nr_to_read) > > > > seems like a better API to me, and then you can drop the "start > > readahead beyond i_size" comments and replace it with "Range is not > > limited by the inode's i_size and hence can be used to read data > > stored beyond EOF into the page cache." > > I'm concerned that calling it 'range' implies "I want to read between > start and end" rather than "I want to read nr_to_read at start, oh but > don't go past end". > > Maybe the right way to do this is have the three callers cap nr_to_read. > Well, the one caller ... after all, f2fs and ext4 have no desire to > cap the length. Then we can call it page_cache_readahead_exceed() or > page_cache_readahead_dangerous() or something else like that to make it > clear that you shouldn't be calling it. Fair point. And in reading this, it occurred to me that what we are enabling is an "out of bounds" readahead function. so page_cache_readahead_OOB() or *_unbounded() might be a better name > * Like add_to_page_cache_locked, but used to add newly allocated pages: > diff --git a/mm/readahead.c b/mm/readahead.c > index 9dd431fa16c9..cad26287ad8b 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -142,45 +142,43 @@ static void read_pages(struct readahead_control *rac, > struct list_head *pages) > blk_finish_plug(&plug); > } > > -/* > - * __do_page_cache_readahead() actually reads a chunk of disk. It allocates > - * the pages first, then submits them for I/O. This avoids the very bad > - * behaviour which would occur if page allocations are causing VM writeback. > - * We really don't want to intermingle reads and writes like that. > +/** > + * page_cache_readahead_exceed - Start unchecked readahead. > + * @mapping: File address space. > + * @file: This instance of the open file; used for authentication. > + * @index: First page index to read. > + * @nr_to_read: The number of pages to read. > + * @lookahead_size: Where to start the next readahead. 
> + * > + * This function is for filesystems to call when they want to start > + * readahead beyond a file's stated i_size. This is almost certainly > + * not the function you want to call. Use page_cache_async_readahead() > + * or page_cache_sync_readahead() instead. > + * > + * Context: File is referenced by caller. Mutexes may be held by caller. > + * May sleep, but will not reenter filesystem to reclaim memory. Yup, looks much better. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 08/19] mm: Add readahead address space operation
On Tue, Feb 18, 2020 at 08:10:04AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:21:47PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > This replaces ->readpages with a saner interface: > > > - Return void instead of an ignored error code. > > > - Pages are already in the page cache when ->readahead is called. > > > > Might read better as: > > > > - Page cache is already populates with locked pages when > >->readahead is called. > > Will do. > > > > - Implementation looks up the pages in the page cache instead of > > >having them passed in a linked list. > > > > Add: > > > > - cleanup of unused readahead handled by ->readahead caller, not > >the method implementation. > > The readpages caller does that cleanup too, so it's not an advantage > to the readahead interface. Right. I kinda of read the list as "the reasons the new API is sane" not as "how readahead is different to readpages" > > > ``readpages`` > > > called by the VM to read pages associated with the address_space > > > object. This is essentially just a vector version of readpage. > > > Instead of just one page, several pages are requested. > > > readpages is only used for read-ahead, so read errors are > > > ignored. If anything goes wrong, feel free to give up. > > > + This interface is deprecated; implement readahead instead. > > > > What is the removal schedule for the deprecated interface? > > I mentioned that in the cover letter; once Dave Howells has the fscache > branch merged, I'll do the remaining filesystems. Should be within the > next couple of merge windows. Sure, but I like to see actual release tags with the deprecation notice so that it's obvious to the reader as to whether this is something new and/or when they can expect it to go away. 
> > > +/* The byte offset into the file of this readahead block */ > > > +static inline loff_t readahead_offset(struct readahead_control *rac) > > > +{ > > > + return (loff_t)rac->_start * PAGE_SIZE; > > > +} > > > > Urk. Didn't an early page use "offset" for the page index? That > > was was "mm: Remove 'page_offset' from readahead loop" did, right? > > > > That's just going to cause confusion to have different units for > > readahead "offsets" > > We are ... not consistent anywhere in the VM/VFS with our naming. > Unfortunately. > > $ grep -n offset mm/filemap.c > 391: * @start:offset in bytes where the range starts > ... > 815: pgoff_t offset = old->index; > ... > 2020: unsigned long offset; /* offset into pagecache page */ > ... > 2257: *ppos = ((loff_t)index << PAGE_SHIFT) + offset; > > That last one's my favourite. Not to mention the fine distinction you > and I discussed recently between offset_in_page() and page_offset(). > > Best of all, even our types encode the ambiguity of an 'offset'. We have > pgoff_t and loff_t (replacing the earlier off_t). > > So, new rule. 'pos' is the number of bytes into a file. 'index' is the > number of PAGE_SIZE pages into a file. We don't use the word 'offset' > at all. 'length' as a byte count and 'count' as a page count seem like > fine names to me. That sounds very reasonable to me. Another patchset in the making? :) Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 07/19] mm: Put readahead pages in cache earlier
On Tue, Feb 18, 2020 at 07:42:22AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:14:59PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:52AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > At allocation time, put the pages in the cache unless we're using > > > ->readpages. Add the readahead_for_each() iterator for the benefit of > > > the ->readpage fallback. This iterator supports huge pages, even though > > > none of the filesystems to be converted do yet. > > > > This could be better written - took me some time to get my head > > around it and the code. > > > > "When populating the page cache for readahead, mappings that don't > > use ->readpages need to have their pages added to the page cache > > before ->readpage is called. Do this insertion earlier so that the > > pages can be looked up immediately prior to ->readpage calls rather > > than passing them on a linked list. This early insert functionality > > is also required by the upcoming ->readahead method that will > > replace ->readpages. > > > > Optimise and simplify the readpage loop by adding a > > readahead_for_each() iterator to provide the pages we need to read. > > This iterator also supports huge pages, even though none of the > > filesystems have been converted to use them yet." > > Thanks, I'll use that. > > > > +static inline struct page *readahead_page(struct readahead_control *rac) > > > +{ > > > + struct page *page; > > > + > > > + if (!rac->_nr_pages) > > > + return NULL; > > > > H. > > > > > + > > > + page = xa_load(&rac->mapping->i_pages, rac->_start); > > > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > > + rac->_batch_count = hpage_nr_pages(page); > > > > So we could have rac->_nr_pages = 2, and then we get an order 2 > > large page returned, and so rac->_batch_count = 4. > > Well, no, we couldn't. rac->_nr_pages is incremented by 4 when we add > an order-2 page to the readahead. I don't see any code that does that. :) i.e. 
we aren't actually putting high order pages into the page cache here (page_alloc() allocates order-0 pages) - so there's nothing in the patch that tells me how rac->_nr_pages behaves when allocating large pages... IOWs, we have an undocumented assumption in the implementation...

> I can put a
> 	BUG_ON(rac->_batch_count > rac->_nr_pages)
> in here to be sure to catch any logic error like that.

Definitely necessary given that we don't insert large pages for readahead yet. A comment explaining the assumptions that the code makes for large pages is probably in order, too.

> > > -		page->index = offset;
> > > -		list_add(&page->lru, &page_pool);
> > > +	if (use_list) {
> > > +		page->index = offset;
> > > +		list_add(&page->lru, &page_pool);
> > > +	} else if (add_to_page_cache_lru(page, mapping, offset,
> > > +				gfp_mask) < 0) {
> > > +		put_page(page);
> > > +		goto read;
> > > +	}
> >
> > Ok, so that's why you put read code at the end of the loop. To turn
> > the code into spaghetti :/
> >
> > How much does this simplify down when we get rid of ->readpages and
> > can restructure the loop? This really seems like you're trying to
> > flatten two nested loops into one by the use of goto
>
> I see it as having two failure cases in this loop. One for "page is
> already present" (which already existed) and one for "allocated a page,
> but failed to add it to the page cache" (which used to be done later).
> I didn't want to duplicate the "call read_pages()" code. So I reshuffled
> the code rather than add a nested loop. I don't think the nested loop
> is easier to read (we'll be at 5 levels of indentation for some statements).
> Could do it this way ...

Can we move the update of @rac inside read_pages()? The next start offset^Windex we start at is rac._start + rac._nr_pages, right?
so read_pages() could do:

	{
		if (readahead_count(rac)) {
			/* do readahead */
		}

		/* advance the readahead cursor */
		rac->_start += rac->_nr_pages;
		rac->_nr_pages = 0;
	}

and then we only need to call read_pages() in these cases, and the need to avoid duplicating code goes away...

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 04/19] mm: Rearrange readahead loop
On Tue, Feb 18, 2020 at 05:57:36AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 04:08:24PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > Move the declaration of 'page' to inside the loop and move the 'kick > > > off a fresh batch' code to the end of the function for easier use in > > > subsequent patches. > > > > Stale? the "kick off" code is moved to the tail of the loop, not the > > end of the function. > > Braino; I meant to write end of the loop. > > > > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space > > > *mapping, > > > page = xa_load(&mapping->i_pages, page_offset); > > > if (page && !xa_is_value(page)) { > > > /* > > > - * Page already present? Kick off the current batch of > > > - * contiguous pages before continuing with the next > > > - * batch. > > > + * Page already present? Kick off the current batch > > > + * of contiguous pages before continuing with the > > > + * next batch. This page may be the one we would > > > + * have intended to mark as Readahead, but we don't > > > + * have a stable reference to this page, and it's > > > + * not worth getting one just for that. > > >*/ > > > - if (readahead_count(&rac)) > > > - read_pages(&rac, &page_pool, gfp_mask); > > > - rac._nr_pages = 0; > > > - continue; > > > + goto read; > > > } > > > > > > page = __page_cache_alloc(gfp_mask); > > > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space > > > *mapping, > > > if (page_idx == nr_to_read - lookahead_size) > > > SetPageReadahead(page); > > > rac._nr_pages++; > > > + continue; > > > +read: > > > + if (readahead_count(&rac)) > > > + read_pages(&rac, &page_pool, gfp_mask); > > > + rac._nr_pages = 0; > > > } > > > > Also, why? This adds a goto from branched code that continues, then > > adds a continue so the unbranched code doesn't execute the code the > > goto jumps to. 
In absence of any explanation, this isn't an > > improvement and doesn't make any sense... > > I thought I was explaining it ... "for easier use in subsequent patches". Sorry, my braino there. :) I commented on the problem with the first part of the sentence, then the rest of the sentence completely failed to sink in. -Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 03/19] mm: Use readahead_control to pass arguments
On Tue, Feb 18, 2020 at 05:56:18AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 04:03:00PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote: > > > +static void read_pages(struct readahead_control *rac, struct list_head > > > *pages, > > > + gfp_t gfp) > > > { > > > + const struct address_space_operations *aops = rac->mapping->a_ops; > > > struct blk_plug plug; > > > unsigned page_idx; > > > > Splitting out the aops rather than the mapping here just looks > > weird, especially as you need the mapping later in the function. > > Using aops doesn't even reduce the code side > > It does in subsequent patches ... I agree it looks a little weird here, > but I think in the final form, it makes sense: Ok. Perhaps just an additional commit comment to say "read_pages() is changed to be aops centric as @rac abstracts away all other implementation details by the end of the patchset." > > > + if (readahead_count(&rac)) > > > + read_pages(&rac, &page_pool, gfp_mask); > > > + rac._nr_pages = 0; > > > > Hmmm. Wondering ig it make sense to move the gfp_mask to the readahead > > control structure - if we have to pass the gfp_mask down all the > > way along side the rac, then I think it makes sense to do that... > > So we end up removing it later on in this series, but I do wonder if > it would make sense anyway. By the end of the series, we still have > this in iomap: > > if (ctx->rac) /* same as readahead_gfp_mask */ > gfp |= __GFP_NORETRY | __GFP_NOWARN; > > and we could get rid of that by passing gfp flags down in the rac. On the > other hand, I don't know why it doesn't just use readahead_gfp_mask() > here anyway ... Christoph? mapping->gfp_mask is awful. Is it a mask, or is it a valid set of allocation flags? Or both? 
Some callers to mapping_gfp_constraint() use it as a mask, some callers to mapping_gfp_constraint() use it as base flags that context specific flags get masked out of, and readahead_gfp_mask() callers use it as the entire set of gfp flags for allocation.

That whole API sucks - undocumented as to what it's supposed to do and how it's supposed to be used. Hence it's difficult to use correctly or understand whether it's being used correctly. And reading callers only leads to more confusion and crazy code like in do_mpage_readpage() where readahead returns a mask that is used as base flags and normal reads return a masked set of base flags...

The iomap code is obviously correct when it comes to gfp flag manipulation. We start with GFP_KERNEL context, then constrain it via the mask held in mapping->gfp_mask, then if it's readahead we allow the allocation to silently fail. Simple to read and understand code, versus having weird code that requires the reader to decipher an undocumented and inconsistent API to understand how the gfp flags have been calculated and are valid.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 00/19] Change readahead API
On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > Latest version in your git tree: > > > > $ ▶ glo -n 5 willy/readahead > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > ff63497fcb98 iomap: Convert from readpages to readahead > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > 8115bcca7312 fuse: Convert from readpages to readahead > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > $ > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > [2.423116] [ cut here ] > > [2.424957] list_add double add: new=ea000efff4c8, > > prev=8883bfffee60, next=ea000efff4c8. > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > __list_add_valid+0x67/0x70 > > [2.457484] Call Trace: > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > [2.462787] lru_add_drain+0x18/0x20 > > Are you sure that was 4be497096c04 ? I ask because there was a Yes, because it's the only version I've actually merged into my working tree, compiled and tried to run. :P > version pushed to that git tree that did contain a list double-add > (due to a mismerge when shuffling patches). I noticed it and fixed > it, and 4be497096c04 doesn't have that problem. I also test with > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > probabilistic because it'll depend on the timing between whatever other > list is being used and the page actually being added to the LRU. I'll see if I can reproduce it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 11/19] btrfs: Convert from readpages to readahead
the start,end doesn't it? What guarantees that this fills the array with a contiguous page range?

> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	VM_BUG_ON_PAGE(PageTail(page), page);
> +	array[batch++] = page;
> +	rac->_batch_count += hpage_nr_pages(page);
> +	if (PageHead(page))
> +		xas_set(&xas, rac->_start + rac->_batch_count);

What on earth does this do? Comments please!

> +
> +	if (batch == size)
> +		break;
> +	}
> +
> +	return batch;
> +}

Seems a bit big for an inline function.

> +
> +#define readahead_for_each_batch(rac, array, size, nr)	\
> +	for (; (nr = readahead_page_batch(rac, array, size)); \
> +	     readahead_next(rac))

I had to go look at the caller to work out what "size" referred to here.

This is complex enough that it needs proper API documentation.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Implement the new readahead aop and convert all callers (block_dev, > exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6, > reiserfs & udf). The callers are all trivial except for GFS2 & OCFS2. > > Signed-off-by: Matthew Wilcox (Oracle) > Reviewed-by: Junxiao Bi # ocfs2 > --- > drivers/staging/exfat/exfat_super.c | 7 +++--- > fs/block_dev.c | 7 +++--- > fs/ext2/inode.c | 10 +++- > fs/fat/inode.c | 7 +++--- > fs/gfs2/aops.c | 23 ++--- > fs/hpfs/file.c | 7 +++--- > fs/iomap/buffered-io.c | 2 +- > fs/isofs/inode.c| 7 +++--- > fs/jfs/inode.c | 7 +++--- > fs/mpage.c | 38 + > fs/nilfs2/inode.c | 15 +++- > fs/ocfs2/aops.c | 34 ++ > fs/omfs/file.c | 7 +++--- > fs/qnx6/inode.c | 7 +++--- > fs/reiserfs/inode.c | 8 +++--- > fs/udf/inode.c | 7 +++--- > include/linux/mpage.h | 4 +-- > mm/migrate.c| 2 +- > 18 files changed, 73 insertions(+), 126 deletions(-) That's actually pretty simple changeover. Nothing really scary there. :) Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 09/19] mm: Add page_cache_readahead_limit
On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> ext4 and f2fs have duplicated the guts of the readahead code so
> they can read past i_size. Instead, separate out the guts of the
> readahead code so they can call it directly.

Gross and nasty (hosting non-stale data beyond EOF in the page cache, that is). Code is pretty simple, but...

> }
>
> -/*
> - * __do_page_cache_readahead() actually reads a chunk of disk. It allocates
> - * the pages first, then submits them for I/O. This avoids the very bad
> - * behaviour which would occur if page allocations are causing VM writeback.
> - * We really don't want to intermingle reads and writes like that.
> +/**
> + * page_cache_readahead_limit - Start readahead beyond a file's i_size.
> + * @mapping: File address space.
> + * @file: This instance of the open file; used for authentication.
> + * @offset: First page index to read.
> + * @end_index: The maximum page index to read.
> + * @nr_to_read: The number of pages to read.
> + * @lookahead_size: Where to start the next readahead.
> + *
> + * This function is for filesystems to call when they want to start
> + * readahead potentially beyond a file's stated i_size. If you want
> + * to start readahead on a normal file, you probably want to call
> + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
> + *
> + * Context: File is referenced by caller. Mutexes may be held by caller.
> + * May sleep, but will not reenter filesystem to reclaim memory.
> */
> -void __do_page_cache_readahead(struct address_space *mapping,
> -		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> -		unsigned long lookahead_size)
> +void page_cache_readahead_limit(struct address_space *mapping,

... I don't think the function name conveys its purpose. It's really a ranged readahead that ignores where i_size lies.
i.e.

	page_cache_readahead_range(mapping, start, end, nr_to_read)

seems like a better API to me, and then you can drop the "start readahead beyond i_size" comments and replace them with "Range is not limited by the inode's i_size and hence can be used to read data stored beyond EOF into the page cache."

Also: "This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead."

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 08/19] mm: Add readahead address space operation
agemap.h > index 3613154e79e4..bd4291f78f41 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -665,6 +665,24 @@ static inline void readahead_next(struct > readahead_control *rac) > #define readahead_for_each(rac, page) > \ > for (; (page = readahead_page(rac)); readahead_next(rac)) > > +/* The byte offset into the file of this readahead block */ > +static inline loff_t readahead_offset(struct readahead_control *rac) > +{ > + return (loff_t)rac->_start * PAGE_SIZE; > +} Urk. Didn't an early page use "offset" for the page index? That was was "mm: Remove 'page_offset' from readahead loop" did, right? That's just going to cause confusion to have different units for readahead "offsets" > + > +/* The number of bytes in this readahead block */ > +static inline loff_t readahead_length(struct readahead_control *rac) > +{ > + return (loff_t)rac->_nr_pages * PAGE_SIZE; > +} > + > +/* The index of the first page in this readahead block */ > +static inline unsigned int readahead_index(struct readahead_control *rac) > +{ > + return rac->_start; > +} Based on this, I suspect the earlier patch should use "index" rather than "offset" when walking the page cache indexes... > + > /* The number of pages in this readahead block */ > static inline unsigned int readahead_count(struct readahead_control *rac) > { > diff --git a/mm/readahead.c b/mm/readahead.c > index 9e430daae42f..975ff5e387be 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -121,7 +121,13 @@ static void read_pages(struct readahead_control *rac, > struct list_head *pages) > > blk_start_plug(&plug); > > - if (aops->readpages) { > + if (aops->readahead) { > + aops->readahead(rac); > + readahead_for_each(rac, page) { > + unlock_page(page); > + put_page(page); > + } This needs a comment to explain the unwinding that needs to be done here. I'm not going to remember in a year's time that this is just for the pages that weren't submitted by ->readahead Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 07/19] mm: Put readahead pages in cache earlier
break; > - page->index = offset; > - list_add(&page->lru, &page_pool); > + if (use_list) { > + page->index = offset; > + list_add(&page->lru, &page_pool); > + } else if (add_to_page_cache_lru(page, mapping, offset, > + gfp_mask) < 0) { > + put_page(page); > + goto read; > + } Ok, so that's why you put read code at the end of the loop. To turn the code into spaghetti :/ How much does this simplify down when we get rid of ->readpages and can restructure the loop? This really seems like you're trying to flatten two nested loops into one by the use of goto Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 06/19] mm: rename readahead loop variable to 'i'
On Mon, Feb 17, 2020 at 10:45:50AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Change the type of page_idx to unsigned long, and rename it -- it's > just a loop counter, not a page index. > > Suggested-by: John Hubbard > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/readahead.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) Looks fine. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop
On Mon, Feb 17, 2020 at 10:45:48AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Eliminate the page_offset variable which was confusing with the > 'offset' parameter and record the start of each consecutive run of > pages in the readahead_control. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/readahead.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) Looks ok, but having the readahead dispatch out of line from the case that triggers it makes it hard to follow. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 04/19] mm: Rearrange readahead loop
On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Move the declaration of 'page' to inside the loop and move the 'kick > off a fresh batch' code to the end of the function for easier use in > subsequent patches. Stale? the "kick off" code is moved to the tail of the loop, not the end of the function. > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space > *mapping, > page = xa_load(&mapping->i_pages, page_offset); > if (page && !xa_is_value(page)) { > /* > - * Page already present? Kick off the current batch of > - * contiguous pages before continuing with the next > - * batch. > + * Page already present? Kick off the current batch > + * of contiguous pages before continuing with the > + * next batch. This page may be the one we would > + * have intended to mark as Readahead, but we don't > + * have a stable reference to this page, and it's > + * not worth getting one just for that. >*/ > - if (readahead_count(&rac)) > - read_pages(&rac, &page_pool, gfp_mask); > - rac._nr_pages = 0; > - continue; > + goto read; > } > > page = __page_cache_alloc(gfp_mask); > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space > *mapping, > if (page_idx == nr_to_read - lookahead_size) > SetPageReadahead(page); > rac._nr_pages++; > + continue; > +read: > + if (readahead_count(&rac)) > + read_pages(&rac, &page_pool, gfp_mask); > + rac._nr_pages = 0; > } Also, why? This adds a goto from branched code that continues, then adds a continue so the unbranched code doesn't execute the code the goto jumps to. In absence of any explanation, this isn't an improvement and doesn't make any sense... -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH v6 03/19] mm: Use readahead_control to pass arguments
On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > In this patch, only between __do_page_cache_readahead() and > read_pages(), but it will be extended in upcoming patches. Also add > the readahead_count() accessor. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > include/linux/pagemap.h | 17 + > mm/readahead.c | 36 +--- > 2 files changed, 38 insertions(+), 15 deletions(-) > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index ccb14b6a16b5..982ecda2d4a2 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -630,6 +630,23 @@ static inline int add_to_page_cache(struct page *page, > return error; > } > > +/* > + * Readahead is of a block of consecutive pages. > + */ > +struct readahead_control { > + struct file *file; > + struct address_space *mapping; > +/* private: use the readahead_* accessors instead */ > + pgoff_t _start; > + unsigned int _nr_pages; > +}; > + > +/* The number of pages in this readahead block */ > +static inline unsigned int readahead_count(struct readahead_control *rac) > +{ > + return rac->_nr_pages; > +} > + > static inline unsigned long dir_pages(struct inode *inode) > { > return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >> > diff --git a/mm/readahead.c b/mm/readahead.c > index 12d13b7792da..15329309231f 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -113,26 +113,29 @@ int read_cache_pages(struct address_space *mapping, > struct list_head *pages, > > EXPORT_SYMBOL(read_cache_pages); > > -static void read_pages(struct address_space *mapping, struct file *filp, > - struct list_head *pages, unsigned int nr_pages, gfp_t gfp) > +static void read_pages(struct readahead_control *rac, struct list_head > *pages, > + gfp_t gfp) > { > + const struct address_space_operations *aops = rac->mapping->a_ops; > struct blk_plug plug; > unsigned page_idx; Splitting out the aops rather than the mapping here just looks weird, especially as you need the 
mapping later in the function. Using aops doesn't even reduce the code size

>
> 	blk_start_plug(&plug);
>
> -	if (mapping->a_ops->readpages) {
> -		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
> +	if (aops->readpages) {
> +		aops->readpages(rac->file, rac->mapping, pages,
> +				readahead_count(rac));
> 		/* Clean up the remaining pages */
> 		put_pages_list(pages);
> 		goto out;
> 	}
>
> -	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
> +	for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
> 		struct page *page = lru_to_page(pages);
> 		list_del(&page->lru);
> -		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
> -			mapping->a_ops->readpage(filp, page);
> +		if (!add_to_page_cache_lru(page, rac->mapping, page->index,
> +				gfp))
> +			aops->readpage(rac->file, page);

... it just makes this less readable by splitting the if() over two lines...

> 		put_page(page);
> 	}
>
> @@ -155,9 +158,13 @@ void __do_page_cache_readahead(struct address_space
> *mapping,
> 	unsigned long end_index;	/* The last page we want to read */
> 	LIST_HEAD(page_pool);
> 	int page_idx;
> -	unsigned int nr_pages = 0;
> 	loff_t isize = i_size_read(inode);
> 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> +	struct readahead_control rac = {
> +		.mapping = mapping,
> +		.file = filp,
> +		._nr_pages = 0,
> +	};

No need to initialise _nr_pages to zero, leaving it out will do the same thing.

>
> 	if (isize == 0)
> 		return;
> @@ -180,10 +187,9 @@ void __do_page_cache_readahead(struct address_space
> *mapping,
> 		 * contiguous pages before continuing with the next
> 		 * batch.
> 		 */
> -	if (nr_pages)
> -		read_pages(mapping, filp, &page_pool, nr_pages,
> -				gfp_mask);
> -	nr_pages = 0;
> +	if (readahead_count(&rac))
> +		read_pages(&rac, &page_pool, gfp_mask);
> +	rac._