Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote: - convert ext3/4 to use -update_time instead of the -dirty_time callout so it gets and exact notifications (preferably the few remaining filesystems as well, although that shouldn't really be a blocker) We could do that, although ext3/4's -update_time() would be exactly the same as the generic update_time() function, so there would be code duplication. If the goal is to get rid of the magic in --dirty_inode() being used to work around how the VFS makes changes to fields that end up in the on-disk inode, we would need to audit a lot of extra code paths; at the very least, in how the generic quota code handles updates to i_size and i_blocks (for example). And BTW, we don't actually have a dirty_time() function any more in the current patch series. update_time() is currently looking like this: Sorry, I actually meant -dirty_inode, which is where ext4 currently hooks in for time updates. -update_time was introduced to a) more specificly catch the kind of update b) allow the filesystem to take locks or a start a transaction before the inode fields are updated to provide proper atomicy. It seems like the quota code has the same problem, but given that neither XFS nor btrfs use it it seems like no one cared enough to sort it out properly.. static int update_time(struct inode *inode, struct timespec *time, int flags) { if (inode-i_op-update_time) return inode-i_op-update_time(inode, time, flags); if (flags S_ATIME) inode-i_atime = *time; if (flags S_VERSION) inode_inc_iversion(inode); if (flags S_CTIME) inode-i_ctime = *time; if (flags S_MTIME) inode-i_mtime = *time; if ((inode-i_sb-s_flags MS_LAZYTIME) !(flags S_VERSION) !(inode-i_state I_DIRTY)) __mark_inode_dirty(inode, I_DIRTY_TIME); else __mark_inode_dirty(inode, I_DIRTY_SYNC); return 0; Why do you need the additional I_DIRTY flag? A lesser __mark_inode_dirty should never override a stronger one. Otherwise this looks fine to me, except that I would split the default implementation into a new generic_update_time helper. XFS doesn't have a -dirty_time yet, but that way XFS would be able to use the I_DIRTY_TIME flag to log the journal timestamps if it so desires, and perhaps drop the need for it to use update_time(). We will probably always need a -update_time to proide proper locking around the timestamp updates. (And with XFS doing logical journalling, it may be that you might want to include the timestamp update in the journal if you have a journal transaction open already, so the disk is spun up or likely to be spin up anyway, right?) XFS transactions are explicitly opened and closed, so during the atime updates we'll never have one open. What we could try is to have CIL items that are on indefinit hold before they are batched into a checkpoint. We'd still commit them to an in-memory transaction in -upate_time for that. All this requires a lot of through and will take some time, though. In the current from the generic lazytime might even be a loss for XFS as we're already really good at batching updates from multiple inodes in the same cluster for the in-place writeback, so I really don't want to just enable it without those optimizations without a lot of testing. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
On Tue, Dec 02, 2014 at 01:20:33AM -0800, Christoph Hellwig wrote: Why do you need the additional I_DIRTY flag? A lesser __mark_inode_dirty should never override a stronger one. Agreed, will fix. Otherwise this looks fine to me, except that I would split the default implementation into a new generic_update_time helper. Sure, I can do that. XFS doesn't have a -dirty_time yet, but that way XFS would be able to use the I_DIRTY_TIME flag to log the journal timestamps if it so desires, and perhaps drop the need for it to use update_time(). We will probably always need a -update_time to proide proper locking around the timestamp updates. Couldn't you let the VFS set the inode timesstamps and then have xfs's -dirty_time(inode, I_DIRTY_TIME) copy the timestamps to the on-disk inode structure under the appropriate lock, or am I missing something? In the current from the generic lazytime might even be a loss for XFS as we're already really good at batching updates from multiple inodes in the same cluster for the in-place writeback, so I really don't want to just enable it without those optimizations without a lot of testing. Fair enough; it's not surprising that this might be much more effective as an optimization for ext4, for no other reason that timestamp updates are so much heavyweight for us. I suspect that it should be a win for btrfs, though, and it should definitely be a win for those file systems that don't use journalling at all. - Ted -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
On Thu, Nov 27, 2014 at 03:27:31PM -0500, Theodore Ts'o wrote: I can do that, but part of the reason why we were doing this rather involved set of changes was to allow other file systems to be able to take advantage of lazytime. I suppose there is value in allowing other file systems, such as jfs, f2fs, etc., to use it, but still, it's a bit of a shame to drop btrfs and xfs support for this feature. I want to see xfs and btrfs support, but I think we're running in some conceptual problems here. I don't have the time right now to fully review the XFS changes for correctness and test them, and I'd rather keep things as-is for a while and then add properly designed and fully teste support in rather than something possible broken. I'll note by the way that ext3 and ext4 doesn't really use VFS dirty tracking either --- see my other comments about the naming of mark_inode_dirty being a bit misleading, at least for all/most of the major file systems. The problem seems to be that replacement schemes that we've all using are slightly different. :-/ Indeed. It seems all existing -dirty_inode instances basically just try to work around the problem that the VFS simply updates timestamps by writing into the inode without involving the filesystem. There are all kinds of bugs in different instances, as well as comments mentioning an assumption that this only happens for atime although the VFS also dos this trick for c/mtime, including a caller from the page fault code that the filesystems can't even avoid by providing non-default methods everywhere. I suppose should let the btrfs folks decide whether they want to add is_readonly() and write_time() function --- or maybe help with the cleanup work so that mark_inode_dirty() can reflect an error to its callers. Chris, David, what do you think? The -is_readonly method seems like a clear winner to me, I'm all for adding it, and thus suggested moving it first in the series. I've read a bit more through the series and would like to suggest the following approach for the rest: - convert ext3/4 to use -update_time instead of the -dirty_time callout so it gets and exact notifications (preferably the few remaining filesystems as well, although that shouldn't really be a blocker) - defer timestamp updates for any filesystems not defining -update_time (or -dirty_time for now), and allow filesystems using -update_time to defer the update as well by calling mark_inode_dirty with the I_DIRTY_TIME flag so that XFS and btrfs don't have to opt-in without testing. - Convert xfs, btrfs and the remaining filesystes using -dirty_inode incrementally. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote: The -is_readonly method seems like a clear winner to me, I'm all for adding it, and thus suggested moving it first in the series. It's a real winner for me as well, but the reason why I dropped it is because if btrfs() has to keep its -update_time function, we wouldn't actually have a user for is_readonly(). I suppose we could have update_time() call -is_readonly() and then -update_time() if they exist, but it only seemed to add an extra call and a bit of extra overhead without really simplifying things for btrfs. If there were other users of -is_readonly, then it would make sense, but it seemed better to move into a separate code refactoring series. I've read a bit more through the series and would like to suggest the following approach for the rest: - convert ext3/4 to use -update_time instead of the -dirty_time callout so it gets and exact notifications (preferably the few remaining filesystems as well, although that shouldn't really be a blocker) We could do that, although ext3/4's -update_time() would be exactly the same as the generic update_time() function, so there would be code duplication. If the goal is to get rid of the magic in --dirty_inode() being used to work around how the VFS makes changes to fields that end up in the on-disk inode, we would need to audit a lot of extra code paths; at the very least, in how the generic quota code handles updates to i_size and i_blocks (for example). And BTW, we don't actually have a dirty_time() function any more in the current patch series. update_time() is currently looking like this: static int update_time(struct inode *inode, struct timespec *time, int flags) { if (inode-i_op-update_time) return inode-i_op-update_time(inode, time, flags); if (flags S_ATIME) inode-i_atime = *time; if (flags S_VERSION) inode_inc_iversion(inode); if (flags S_CTIME) inode-i_ctime = *time; if (flags S_MTIME) inode-i_mtime = *time; if ((inode-i_sb-s_flags MS_LAZYTIME) !(flags S_VERSION) !(inode-i_state I_DIRTY)) __mark_inode_dirty(inode, I_DIRTY_TIME); else __mark_inode_dirty(inode, I_DIRTY_SYNC); return 0; } - Convert xfs, btrfs and the remaining filesystes using -dirty_inode incrementally. Right, so xfs and btrfs (which are the two file systems that have update_time at the moment) can just drop update_time() and then check the -dirty_time() for (flags I_DIRTY_TIME). Hmm, I suspect this might be better for xfs, yes? if ((inode-i_sb-s_flags MS_LAZYTIME) !(flags S_VERSION) !(inode-i_state I_DIRTY)) __mark_inode_dirty(inode, I_DIRTY_TIME); else __mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_TIME); XFS doesn't have a -dirty_time yet, but that way XFS would be able to use the I_DIRTY_TIME flag to log the journal timestamps if it so desires, and perhaps drop the need for it to use update_time(). (And with XFS doing logical journalling, it may be that you might want to include the timestamp update in the journal if you have a journal transaction open already, so the disk is spun up or likely to be spin up anyway, right?) - Ted -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote: On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote: The -is_readonly method seems like a clear winner to me, I'm all for adding it, and thus suggested moving it first in the series. It's a real winner for me as well, but the reason why I dropped it is because if btrfs() has to keep its -update_time function, we wouldn't actually have a user for is_readonly(). I suppose we could have update_time() call -is_readonly() and then -update_time() if they exist, but it only seemed to add an extra call and a bit of extra overhead without really simplifying things for btrfs. We would use is_readonly in order to remove some extra checks from btrfs (setxattr, removexattr, possibly setsize). If there were other users of -is_readonly, then it would make sense, but it seemed better to move into a separate code refactoring series. Yeah it would be better addressed separately as it's not the point of lazytime patchset and only turned out to be a good idea during the iterations. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
On Wed 26-11-14 11:23:28, Christoph Hellwig wrote: As mentioned last round please move the addition of the is_readonly operation to the first thing in the series, so that the ordering makes more sense. Second I think this patch is incorrect for XFS - XFS uses -update_time to set the time stampst in the dinode. These two need to be coherent as we can write out a dirty inode any time, so it needs to have the timestamp uptodate. But Ted changed XFS to copy timestamps to on-disk structure from the in-memory inode fields after VFS updated the timestamps. So the stamps should be coherent AFAICT, shouldn't they? Third update_time now calls mark_inode_dirty unconditionally, while previously it wasn't called when -update_time was set. At least for XFS that's a major change in behavior as XFS never used VFS dirty tracking for metadata updates. We don't call mark_inode_dirty() when -write_time is set (note the return, I missed it on the first reading) which looks sensible to me. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
On Wed, Nov 26, 2014 at 11:23:28AM -0800, Christoph Hellwig wrote: As mentioned last round please move the addition of the is_readonly operation to the first thing in the series, so that the ordering makes more sense. OK, will fix. Second I think this patch is incorrect for XFS - XFS uses -update_time to set the time stampst in the dinode. These two need to be coherent as we can write out a dirty inode any time, so it needs to have the timestamp uptodate. Third update_time now calls mark_inode_dirty unconditionally, while previously it wasn't called when -update_time was set. At least for XFS that's a major change in behavior as XFS never used VFS dirty tracking for metadata updates. Thanks, I'll fix both of the above. - Ted -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
On Thu, Nov 27, 2014 at 01:34:29PM +0100, Jan Kara wrote: But Ted changed XFS to copy timestamps to on-disk structure from the in-memory inode fields after VFS updated the timestamps. So the stamps should be coherent AFAICT, shouldn't they? Not coherent enough. We need the XFS ilock to synchronize reading from and writing to the time stamps. update_time() only has i_mutex, which we can't take for the transaction commit path. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
FYI, I suspect for now the best might be to let filesystems that define -update_times work as-is and not tie them into the infrastructure. At least for XFS I suspect the lazy updates might better be handled internally, although I'm not entirely sure yet. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
Christoph, can you take a quick look at this? I'm not sure I got the xfs inode transaction logging correct. Thanks!! - Ted commit cd58addfa340c9cf88b1f9b2d31a42e2e65c7252 Author: Theodore Ts'o ty...@mit.edu Date: Thu Nov 27 10:14:27 2014 -0500 vfs: split update_time() into update_time() and write_time() In preparation for adding support for the lazytime mount option, we need to be able to separate out the update_time() and write_time() inode operations. Previously, only btrfs and xfs uses update_time(). With this patch, btrfs only needs write_time(), and xfs uses update_time() to synchronize its on-disk and in-memory timestamps. Signed-off-by: Theodore Ts'o ty...@mit.edu Cc: x...@oss.sgi.com Cc: linux-btrfs@vger.kernel.org Acked-by: David Sterba dste...@suse.cz diff --git a/include/linux/fs.h b/include/linux/fs.h index f4b0ecd..befd5d2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1545,7 +1545,8 @@ struct inode_operations { int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); int (*is_readonly)(struct inode *); - int (*update_time)(struct inode *, struct timespec *, int); + void (*update_time)(struct inode *); + int (*write_time)(struct inode *); int (*atomic_open)(struct inode *, struct dentry *, struct file *, unsigned open_flag, umode_t create_mode, int *opened); diff --git a/fs/inode.c b/fs/inode.c index 53f0173..94bc908 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1506,9 +1506,6 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) if (ret) return ret; } - if (inode-i_op-update_time) - return inode-i_op-update_time(inode, time, flags); - if (flags S_ATIME) inode-i_atime = *time; if (flags S_VERSION) @@ -1517,6 +1514,10 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) inode-i_ctime = *time; if (flags S_MTIME) inode-i_mtime = *time; + if (inode-i_op-update_time) + inode-i_op-update_time(inode); + if (inode-i_op-write_time) + return inode-i_op-write_time(inode); mark_inode_dirty_sync(inode); return 0; } diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index ec6dcdc..b69493d 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -983,42 +983,42 @@ xfs_vn_setattr( return error; } -STATIC int +STATIC void xfs_vn_update_time( - struct inode*inode, - struct timespec *now, - int flags) + struct inode*inode) +{ + struct xfs_inode*ip = XFS_I(inode); + + trace_xfs_update_time(ip); + xfs_ilock(ip, XFS_ILOCK_EXCL); + ip-i_d.di_ctime.t_sec = (__int32_t) inode-i_ctime.tv_sec; + ip-i_d.di_ctime.t_nsec = (__int32_t) inode-i_ctime.tv_nsec; + + ip-i_d.di_mtime.t_sec = (__int32_t)inode-i_mtime.tv_sec; + ip-i_d.di_mtime.t_nsec = (__int32_t) inode-i_mtime.tv_nsec; + + ip-i_d.di_atime.t_sec = (__int32_t) inode-i_atime.tv_sec; + ip-i_d.di_atime.t_nsec = (__int32_t) inode-i_atime.tv_nsec; + xfs_iunlock(ip, XFS_ILOCK_EXCL); +} + +STATIC int +xfs_vn_write_time( + struct inode*inode) { struct xfs_inode*ip = XFS_I(inode); struct xfs_mount*mp = ip-i_mount; struct xfs_trans*tp; int error; - trace_xfs_update_time(ip); - + trace_xfs_write_time(ip); tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS); error = xfs_trans_reserve(tp, M_RES(mp)-tr_fsyncts, 0, 0); if (error) { xfs_trans_cancel(tp, 0); return error; } - xfs_ilock(ip, XFS_ILOCK_EXCL); - if (flags S_CTIME) { - inode-i_ctime = *now; - ip-i_d.di_ctime.t_sec = (__int32_t)now-tv_sec; - ip-i_d.di_ctime.t_nsec = (__int32_t)now-tv_nsec; - } - if (flags S_MTIME) { - inode-i_mtime = *now; - ip-i_d.di_mtime.t_sec = (__int32_t)now-tv_sec; - ip-i_d.di_mtime.t_nsec = (__int32_t)now-tv_nsec; - } - if (flags S_ATIME) { - inode-i_atime = *now; - ip-i_d.di_atime.t_sec = (__int32_t)now-tv_sec; - ip-i_d.di_atime.t_nsec = (__int32_t)now-tv_nsec; - } xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP); return xfs_trans_commit(tp, 0); @@ -1130,6 +1130,7 @@ static const struct inode_operations xfs_inode_operations = { .listxattr = xfs_vn_listxattr, .fiemap = xfs_vn_fiemap,
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
I don't think this scheme works well. As mentioned earlier XFS doesn't even use vfs dirty tracking at the moment, so introducing this in a hidden way sounds like a bad idea. Probably the same for btrfs. I'd rather keep update_time as-is for now, don't add -write_time and let btrfs and XFS figure out how to implement the semantics on their own. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
On Thu, Nov 27, 2014 at 08:49:52AM -0800, Christoph Hellwig wrote: I don't think this scheme works well. As mentioned earlier XFS doesn't even use vfs dirty tracking at the moment, so introducing this in a hidden way sounds like a bad idea. Probably the same for btrfs. I'd rather keep update_time as-is for now, don't add -write_time and let btrfs and XFS figure out how to implement the semantics on their own. I can do that, but part of the reason why we were doing this rather involved set of changes was to allow other file systems to be able to take advantage of lazytime. I suppose there is value in allowing other file systems, such as jfs, f2fs, etc., to use it, but still, it's a bit of a shame to drop btrfs and xfs support for this feature. I'll note by the way that ext3 and ext4 doesn't really use VFS dirty tracking either --- see my other comments about the naming of mark_inode_dirty being a bit misleading, at least for all/most of the major file systems. The problem seems to be that replacement schemes that we've all using are slightly different. :-/ I suppose should let the btrfs folks decide whether they want to add is_readonly() and write_time() function --- or maybe help with the cleanup work so that mark_inode_dirty() can reflect an error to its callers. Chris, David, what do you think? - Ted -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
As mentioned last round please move the addition of the is_readonly operation to the first thing in the series, so that the ordering makes more sense. Second I think this patch is incorrect for XFS - XFS uses -update_time to set the time stampst in the dinode. These two need to be coherent as we can write out a dirty inode any time, so it needs to have the timestamp uptodate. Third update_time now calls mark_inode_dirty unconditionally, while previously it wasn't called when -update_time was set. At least for XFS that's a major change in behavior as XFS never used VFS dirty tracking for metadata updates. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html