On Fri, Nov 21, 2014 at 02:59:22PM -0500, Theodore Ts'o wrote:
> Add a new mount option which enables a new "lazytime" mode.  This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode.  The on-disk times will only get
> updated when (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
> 
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests via a fsync(2) call.
> 
> For workloads which feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table.  The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives.  Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
> 
> Google-Bug-Id: 18297052
> 
> Signed-off-by: Theodore Ts'o <ty...@mit.edu>
> ---
>  fs/fs-writeback.c       | 38 +++++++++++++++++++++++++++++++++++++-
>  fs/inode.c              | 18 ++++++++++++++++++
>  fs/proc_namespace.c     |  1 +
>  fs/sync.c               |  7 +++++++
>  include/linux/fs.h      |  1 +
>  include/uapi/linux/fs.h |  1 +
>  6 files changed, 65 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ef9bef1..ce7de22 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct 
> writeback_control *wbc)
>       if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
>               inode->i_state &= ~I_DIRTY_PAGES;
>       dirty = inode->i_state & I_DIRTY;
> -     inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> +     inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
>       spin_unlock(&inode->i_lock);
>       /* Don't write the inode if only I_DIRTY_PAGES was set */
>       if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> @@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
>       iput(old_inode);
>  }
>  
> +/*
> + * This works like wait_sb_inodes(), but it is called *before* we kick
> + * the bdi so the inodes can get written out.
> + */
> +static void flush_sb_dirty_time(struct super_block *sb)
> +{
> +     struct inode *inode, *old_inode = NULL;
> +
> +     WARN_ON(!rwsem_is_locked(&sb->s_umount));
> +     spin_lock(&inode_sb_list_lock);
> +     list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +             int dirty_time;
> +
> +             spin_lock(&inode->i_lock);
> +             if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +                     spin_unlock(&inode->i_lock);
> +                     continue;
> +             }
> +             dirty_time = inode->i_state & I_DIRTY_TIME;
> +             __iget(inode);
> +             spin_unlock(&inode->i_lock);
> +             spin_unlock(&inode_sb_list_lock);
> +
> +             iput(old_inode);
> +             old_inode = inode;
> +
> +             if (dirty_time)
> +                     mark_inode_dirty(inode);
> +             cond_resched();
> +             spin_lock(&inode_sb_list_lock);
> +     }
> +     spin_unlock(&inode_sb_list_lock);
> +     iput(old_inode);
> +}

This just seems wrong to me, not to mention extremely expensive when we have
millions of cached inodes on the superblock.

Why can't we just add a function like mark_inode_dirty_time() which
puts the inode on the dirty inode list with a writeback time 24
hours in the future rather than 30s in the future?



> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -534,6 +534,18 @@ static void evict(struct inode *inode)
>       BUG_ON(!(inode->i_state & I_FREEING));
>       BUG_ON(!list_empty(&inode->i_lru));
>  
> +     if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
> +             if (inode->i_op->write_time)
> +                     inode->i_op->write_time(inode);
> +             else if (inode->i_sb->s_op->write_inode) {
> +                     struct writeback_control wbc = {
> +                             .sync_mode = WB_SYNC_NONE,
> +                     };
> +                     mark_inode_dirty(inode);
> +                     inode->i_sb->s_op->write_inode(inode, &wbc);
> +             }
> +     }
> +

Eviction is too late for this. I'm pretty sure that it won't get
this far as iput_final() should catch the I_DIRTY_TIME in the !drop
case via write_inode_now().


>  int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int 
> datasync)
>  {
> +     struct inode *inode = file->f_mapping->host;
> +
>       if (!file->f_op->fsync)
>               return -EINVAL;
> +     if (!datasync && inode->i_state & I_DIRTY_TIME) {

FWIW, I'm surprised gcc isn't throwing warnings about that. Please
make sure there isn't any ambiguity in the code by writing those
checks like this:

        if (!datasync && (inode->i_state & I_DIRTY_TIME)) {

> +             spin_lock(&inode->i_lock);
> +             inode->i_state |= I_DIRTY_SYNC;
> +             spin_unlock(&inode->i_lock);
> +     }
>       return file->f_op->fsync(file, start, end, datasync);

When we mark the inode I_DIRTY_TIME, we should also be marking it
I_DIRTY_SYNC so that all the sync operations know that they should
be writing this inode. That's partly why I also think these inodes
should be tracked on the dirty inode list....

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3633239..489b2f2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1721,6 +1721,7 @@ struct super_operations {
>  #define __I_DIO_WAKEUP               9
>  #define I_DIO_WAKEUP         (1 << I_DIO_WAKEUP)
>  #define I_LINKABLE           (1 << 10)
> +#define I_DIRTY_TIME         (1 << 11)
>  
>  #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

Shouldn't I_DIRTY also include I_DIRTY_TIME now?

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to