Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-02 Thread Christoph Hellwig
On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote:
   - convert ext3/4 to use -update_time instead of the -dirty_time
 callout so it gets and exact notifications (preferably the few
 remaining filesystems as well, although that shouldn't really be a
 blocker)
 
 We could do that, although ext3/4's -update_time() would be exactly
 the same as the generic update_time() function, so there would be code
 duplication.  If the goal is to get rid of the magic in
 --dirty_inode() being used to work around how the VFS makes changes
 to fields that end up in the on-disk inode, we would need to audit a
 lot of extra code paths; at the very least, in how the generic quota
 code handles updates to i_size and i_blocks (for example).
 
 And BTW, we don't actually have a dirty_time() function any more in
 the current patch series.  update_time() is currently looking like
 this:

Sorry, I actually meant -dirty_inode, which is where ext4 currently
hooks in for time updates.  -update_time was introduced to

 a) more specificly catch the kind of update
 b) allow the filesystem to take locks or a start a transaction
before the inode fields are updated to provide proper atomicy.

It seems like the quota code has the same problem, but given that
neither XFS nor btrfs use it it seems like no one cared enough to sort
it out properly..

 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
   if (inode-i_op-update_time)
   return inode-i_op-update_time(inode, time, flags);
 
   if (flags  S_ATIME)
   inode-i_atime = *time;
   if (flags  S_VERSION)
   inode_inc_iversion(inode);
   if (flags  S_CTIME)
   inode-i_ctime = *time;
   if (flags  S_MTIME)
   inode-i_mtime = *time;
 
   if ((inode-i_sb-s_flags  MS_LAZYTIME)  !(flags  S_VERSION) 
   !(inode-i_state  I_DIRTY))
   __mark_inode_dirty(inode, I_DIRTY_TIME);
   else
   __mark_inode_dirty(inode, I_DIRTY_SYNC);
   return 0;

Why do you need the additional I_DIRTY flag?  A lesser
__mark_inode_dirty should never override a stronger one.

Otherwise this looks fine to me, except that I would split the default
implementation into a new generic_update_time helper.

 XFS doesn't have a -dirty_time yet, but that way XFS would be able to
 use the I_DIRTY_TIME flag to log the journal timestamps if it so
 desires, and perhaps drop the need for it to use update_time().

We will probably always need a -update_time to proide proper locking
around the timestamp updates.

 (And
 with XFS doing logical journalling, it may be that you might want to
 include the timestamp update in the journal if you have a journal
 transaction open already, so the disk is spun up or likely to be spin
 up anyway, right?)

XFS transactions are explicitly opened and closed, so during the atime
updates we'll never have one open.

What we could try is to have CIL items that are on indefinit hold
before they are batched into a checkpoint.  We'd still commit them to
an in-memory transaction in -upate_time for that.  All this requires
a lot of through and will take some time, though.

In the current from the generic lazytime might even be a loss for XFS as
we're already really good at batching updates from multiple inodes in
the same cluster for the in-place writeback, so I really don't want
to just enable it without those optimizations without a lot of testing.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-02 Thread Theodore Ts'o
On Tue, Dec 02, 2014 at 01:20:33AM -0800, Christoph Hellwig wrote:
 Why do you need the additional I_DIRTY flag?  A lesser
 __mark_inode_dirty should never override a stronger one.

Agreed, will fix.

 Otherwise this looks fine to me, except that I would split the default
 implementation into a new generic_update_time helper.

Sure, I can do that.

  XFS doesn't have a -dirty_time yet, but that way XFS would be able to
  use the I_DIRTY_TIME flag to log the journal timestamps if it so
  desires, and perhaps drop the need for it to use update_time().
 
 We will probably always need a -update_time to proide proper locking
 around the timestamp updates.

Couldn't you let the VFS set the inode timesstamps and then have xfs's
-dirty_time(inode, I_DIRTY_TIME) copy the timestamps to the on-disk
inode structure under the appropriate lock, or am I missing something?

 In the current from the generic lazytime might even be a loss for XFS as
 we're already really good at batching updates from multiple inodes in
 the same cluster for the in-place writeback, so I really don't want
 to just enable it without those optimizations without a lot of testing.

Fair enough; it's not surprising that this might be much more
effective as an optimization for ext4, for no other reason that
timestamp updates are so much heavyweight for us.  I suspect that it
should be a win for btrfs, though, and it should definitely be a win
for those file systems that don't use journalling at all.

 - Ted
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-01 Thread Christoph Hellwig
On Thu, Nov 27, 2014 at 03:27:31PM -0500, Theodore Ts'o wrote:
 I can do that, but part of the reason why we were doing this rather
 involved set of changes was to allow other file systems to be able to
 take advantage of lazytime.  I suppose there is value in allowing
 other file systems, such as jfs, f2fs, etc., to use it, but still,
 it's a bit of a shame to drop btrfs and xfs support for this feature.

I want to see xfs and btrfs support, but I think we're running in some
conceptual problems here.  I don't have the time right now to fully
review the XFS changes for correctness and test them, and I'd rather
keep things as-is for a while and then add properly designed and fully
teste support in rather than something possible broken.

 I'll note by the way that ext3 and ext4 doesn't really use VFS dirty
 tracking either --- see my other comments about the naming of
 mark_inode_dirty being a bit misleading, at least for all/most of
 the major file systems.  The problem seems to be that replacement
 schemes that we've all using are slightly different.  :-/

Indeed.  It seems all existing -dirty_inode instances basically
just try to work around the problem that the VFS simply updates
timestamps by writing into the inode without involving the filesystem.
There are all kinds of bugs in different instances, as well as comments
mentioning an assumption that this only happens for atime although
the VFS also dos this trick for c/mtime, including a caller from
the page fault code that the filesystems can't even avoid by providing
non-default methods everywhere.

 I suppose should let the btrfs folks decide whether they want to add
 is_readonly() and write_time() function --- or maybe help with the
 cleanup work so that mark_inode_dirty() can reflect an error to its
 callers.   Chris, David, what do you think?


The -is_readonly method seems like a clear winner to me, I'm all for
adding it, and thus suggested moving it first in the series.

I've read a bit more through the series and would like to suggest
the following approach for the rest:

 - convert ext3/4 to use -update_time instead of the -dirty_time
   callout so it gets and exact notifications (preferably the few
   remaining filesystems as well, although that shouldn't really be a
   blocker)
 - defer timestamp updates for any filesystems not defining
   -update_time (or -dirty_time for now), and allow filesystems
   using -update_time to defer the update as well by calling
   mark_inode_dirty with the I_DIRTY_TIME flag so that XFS and btrfs
   don't have to opt-in without testing.
 - Convert xfs, btrfs and the remaining filesystes using -dirty_inode
   incrementally.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-01 Thread Theodore Ts'o
On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote:
 
 The -is_readonly method seems like a clear winner to me, I'm all for
 adding it, and thus suggested moving it first in the series.

It's a real winner for me as well, but the reason why I dropped it is
because if btrfs() has to keep its -update_time function, we wouldn't
actually have a user for is_readonly().  I suppose we could have
update_time() call -is_readonly() and then -update_time() if they
exist, but it only seemed to add an extra call and a bit of extra
overhead without really simplifying things for btrfs.

If there were other users of -is_readonly, then it would make sense,
but it seemed better to move into a separate code refactoring series.

 I've read a bit more through the series and would like to suggest
 the following approach for the rest:
 
  - convert ext3/4 to use -update_time instead of the -dirty_time
callout so it gets and exact notifications (preferably the few
remaining filesystems as well, although that shouldn't really be a
blocker)

We could do that, although ext3/4's -update_time() would be exactly
the same as the generic update_time() function, so there would be code
duplication.  If the goal is to get rid of the magic in
--dirty_inode() being used to work around how the VFS makes changes
to fields that end up in the on-disk inode, we would need to audit a
lot of extra code paths; at the very least, in how the generic quota
code handles updates to i_size and i_blocks (for example).

And BTW, we don't actually have a dirty_time() function any more in
the current patch series.  update_time() is currently looking like
this:

static int update_time(struct inode *inode, struct timespec *time, int flags)
{
if (inode-i_op-update_time)
return inode-i_op-update_time(inode, time, flags);

if (flags  S_ATIME)
inode-i_atime = *time;
if (flags  S_VERSION)
inode_inc_iversion(inode);
if (flags  S_CTIME)
inode-i_ctime = *time;
if (flags  S_MTIME)
inode-i_mtime = *time;

if ((inode-i_sb-s_flags  MS_LAZYTIME)  !(flags  S_VERSION) 
!(inode-i_state  I_DIRTY))
__mark_inode_dirty(inode, I_DIRTY_TIME);
else
__mark_inode_dirty(inode, I_DIRTY_SYNC);
return 0;
}

  - Convert xfs, btrfs and the remaining filesystes using -dirty_inode
incrementally.

Right, so xfs and btrfs (which are the two file systems that have
update_time at the moment) can just drop update_time() and then check
the -dirty_time() for (flags  I_DIRTY_TIME).  Hmm, I suspect this
might be better for xfs, yes?

if ((inode-i_sb-s_flags  MS_LAZYTIME)  !(flags  S_VERSION) 
!(inode-i_state  I_DIRTY))
__mark_inode_dirty(inode, I_DIRTY_TIME);
else
__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_TIME);

XFS doesn't have a -dirty_time yet, but that way XFS would be able to
use the I_DIRTY_TIME flag to log the journal timestamps if it so
desires, and perhaps drop the need for it to use update_time().  (And
with XFS doing logical journalling, it may be that you might want to
include the timestamp update in the journal if you have a journal
transaction open already, so the disk is spun up or likely to be spin
up anyway, right?)

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-01 Thread David Sterba
On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote:
 On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote:
  
  The -is_readonly method seems like a clear winner to me, I'm all for
  adding it, and thus suggested moving it first in the series.
 
 It's a real winner for me as well, but the reason why I dropped it is
 because if btrfs() has to keep its -update_time function, we wouldn't
 actually have a user for is_readonly().  I suppose we could have
 update_time() call -is_readonly() and then -update_time() if they
 exist, but it only seemed to add an extra call and a bit of extra
 overhead without really simplifying things for btrfs.

We would use is_readonly in order to remove some extra checks from btrfs
(setxattr, removexattr, possibly setsize).

 If there were other users of -is_readonly, then it would make sense,
 but it seemed better to move into a separate code refactoring series.

Yeah it would be better addressed separately as it's not the point of
lazytime patchset and only turned out to be a good idea during the
iterations.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Jan Kara
On Wed 26-11-14 11:23:28, Christoph Hellwig wrote:
 As mentioned last round please move the addition of the is_readonly
 operation to the first thing in the series, so that the ordering makes
 more sense.
 
 Second I think this patch is incorrect for XFS - XFS uses -update_time
 to set the time stampst in the dinode.  These two need to be coherent
 as we can write out a dirty inode any time, so it needs to have the
 timestamp uptodate.
  But Ted changed XFS to copy timestamps to on-disk structure from the
in-memory inode fields after VFS updated the timestamps. So the stamps
should be coherent AFAICT, shouldn't they?

 Third update_time now calls mark_inode_dirty unconditionally, while
 previously it wasn't called when -update_time was set.  At least
 for XFS that's a major change in behavior as XFS never used VFS dirty
 tracking for metadata updates.
  We don't call mark_inode_dirty() when -write_time is set (note the
return, I missed it on the first reading) which looks sensible to me.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Theodore Ts'o
On Wed, Nov 26, 2014 at 11:23:28AM -0800, Christoph Hellwig wrote:
 As mentioned last round please move the addition of the is_readonly
 operation to the first thing in the series, so that the ordering makes
 more sense.

OK, will fix.

 Second I think this patch is incorrect for XFS - XFS uses -update_time
 to set the time stampst in the dinode.  These two need to be coherent
 as we can write out a dirty inode any time, so it needs to have the
 timestamp uptodate.
 
 Third update_time now calls mark_inode_dirty unconditionally, while
 previously it wasn't called when -update_time was set.  At least
 for XFS that's a major change in behavior as XFS never used VFS dirty
 tracking for metadata updates.

Thanks, I'll fix both of the above.

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Christoph Hellwig
On Thu, Nov 27, 2014 at 01:34:29PM +0100, Jan Kara wrote:
   But Ted changed XFS to copy timestamps to on-disk structure from the
 in-memory inode fields after VFS updated the timestamps. So the stamps
 should be coherent AFAICT, shouldn't they?

Not coherent enough.  We need the XFS ilock to synchronize reading from
and writing to the time stamps.  update_time() only has i_mutex, which
we can't take for the transaction commit path.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Christoph Hellwig
FYI, I suspect for now the best might be to let filesystems that define
-update_times work as-is and not tie them into the infrastructure.  At
least for XFS I suspect the lazy updates might better be handled
internally, although I'm not entirely sure yet.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Theodore Ts'o
Christoph, can you take a quick look at this?  I'm not sure I got the
xfs inode transaction logging correct.

Thanks!!

- Ted

commit cd58addfa340c9cf88b1f9b2d31a42e2e65c7252
Author: Theodore Ts'o ty...@mit.edu
Date:   Thu Nov 27 10:14:27 2014 -0500

vfs: split update_time() into update_time() and write_time()

In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Previously, only btrfs and xfs uses update_time().

With this patch, btrfs only needs write_time(), and xfs uses
update_time() to synchronize its on-disk and in-memory timestamps.

Signed-off-by: Theodore Ts'o ty...@mit.edu
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba dste...@suse.cz

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f4b0ecd..befd5d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1545,7 +1545,8 @@ struct inode_operations {
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
  u64 len);
int (*is_readonly)(struct inode *);
-   int (*update_time)(struct inode *, struct timespec *, int);
+   void (*update_time)(struct inode *);
+   int (*write_time)(struct inode *);
int (*atomic_open)(struct inode *, struct dentry *,
   struct file *, unsigned open_flag,
   umode_t create_mode, int *opened);
diff --git a/fs/inode.c b/fs/inode.c
index 53f0173..94bc908 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1506,9 +1506,6 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (ret)
return ret;
}
-   if (inode-i_op-update_time)
-   return inode-i_op-update_time(inode, time, flags);
-
if (flags  S_ATIME)
inode-i_atime = *time;
if (flags  S_VERSION)
@@ -1517,6 +1514,10 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
inode-i_ctime = *time;
if (flags  S_MTIME)
inode-i_mtime = *time;
+   if (inode-i_op-update_time)
+   inode-i_op-update_time(inode);
+   if (inode-i_op-write_time)
+   return inode-i_op-write_time(inode);
mark_inode_dirty_sync(inode);
return 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..b69493d 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -983,42 +983,42 @@ xfs_vn_setattr(
return error;
 }
 
-STATIC int
+STATIC void
 xfs_vn_update_time(
-   struct inode*inode,
-   struct timespec *now,
-   int flags)
+   struct inode*inode)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+
+   trace_xfs_update_time(ip);
+   xfs_ilock(ip, XFS_ILOCK_EXCL);
+   ip-i_d.di_ctime.t_sec = (__int32_t) inode-i_ctime.tv_sec;
+   ip-i_d.di_ctime.t_nsec = (__int32_t) inode-i_ctime.tv_nsec;
+
+   ip-i_d.di_mtime.t_sec = (__int32_t)inode-i_mtime.tv_sec;
+   ip-i_d.di_mtime.t_nsec = (__int32_t) inode-i_mtime.tv_nsec;
+
+   ip-i_d.di_atime.t_sec = (__int32_t) inode-i_atime.tv_sec;
+   ip-i_d.di_atime.t_nsec = (__int32_t) inode-i_atime.tv_nsec;
+   xfs_iunlock(ip, XFS_ILOCK_EXCL);
+}
+
+STATIC int
+xfs_vn_write_time(
+   struct inode*inode)
 {
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip-i_mount;
struct xfs_trans*tp;
int error;
 
-   trace_xfs_update_time(ip);
-
+   trace_xfs_write_time(ip);
tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
error = xfs_trans_reserve(tp, M_RES(mp)-tr_fsyncts, 0, 0);
if (error) {
xfs_trans_cancel(tp, 0);
return error;
}
-
xfs_ilock(ip, XFS_ILOCK_EXCL);
-   if (flags  S_CTIME) {
-   inode-i_ctime = *now;
-   ip-i_d.di_ctime.t_sec = (__int32_t)now-tv_sec;
-   ip-i_d.di_ctime.t_nsec = (__int32_t)now-tv_nsec;
-   }
-   if (flags  S_MTIME) {
-   inode-i_mtime = *now;
-   ip-i_d.di_mtime.t_sec = (__int32_t)now-tv_sec;
-   ip-i_d.di_mtime.t_nsec = (__int32_t)now-tv_nsec;
-   }
-   if (flags  S_ATIME) {
-   inode-i_atime = *now;
-   ip-i_d.di_atime.t_sec = (__int32_t)now-tv_sec;
-   ip-i_d.di_atime.t_nsec = (__int32_t)now-tv_nsec;
-   }
xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
return xfs_trans_commit(tp, 0);
@@ -1130,6 +1130,7 @@ static const struct inode_operations xfs_inode_operations 
= {
.listxattr  = xfs_vn_listxattr,
.fiemap = xfs_vn_fiemap,

Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Christoph Hellwig
I don't think this scheme works well.  As mentioned earlier XFS doesn't
even use vfs dirty tracking at the moment, so introducing this in a
hidden way sounds like a bad idea.  Probably the same for btrfs.

I'd rather keep update_time as-is for now, don't add -write_time and
let btrfs and XFS figure out how to implement the semantics on their
own.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Theodore Ts'o
On Thu, Nov 27, 2014 at 08:49:52AM -0800, Christoph Hellwig wrote:
 I don't think this scheme works well.  As mentioned earlier XFS doesn't
 even use vfs dirty tracking at the moment, so introducing this in a
 hidden way sounds like a bad idea.  Probably the same for btrfs.
 
 I'd rather keep update_time as-is for now, don't add -write_time and
 let btrfs and XFS figure out how to implement the semantics on their
 own.

I can do that, but part of the reason why we were doing this rather
involved set of changes was to allow other file systems to be able to
take advantage of lazytime.  I suppose there is value in allowing
other file systems, such as jfs, f2fs, etc., to use it, but still,
it's a bit of a shame to drop btrfs and xfs support for this feature.

I'll note by the way that ext3 and ext4 doesn't really use VFS dirty
tracking either --- see my other comments about the naming of
mark_inode_dirty being a bit misleading, at least for all/most of
the major file systems.  The problem seems to be that replacement
schemes that we've all using are slightly different.  :-/

I suppose should let the btrfs folks decide whether they want to add
is_readonly() and write_time() function --- or maybe help with the
cleanup work so that mark_inode_dirty() can reflect an error to its
callers.   Chris, David, what do you think?

 - Ted

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-26 Thread Christoph Hellwig
As mentioned last round please move the addition of the is_readonly
operation to the first thing in the series, so that the ordering makes
more sense.

Second I think this patch is incorrect for XFS - XFS uses -update_time
to set the time stampst in the dinode.  These two need to be coherent
as we can write out a dirty inode any time, so it needs to have the
timestamp uptodate.

Third update_time now calls mark_inode_dirty unconditionally, while
previously it wasn't called when -update_time was set.  At least
for XFS that's a major change in behavior as XFS never used VFS dirty
tracking for metadata updates.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html