On Fri 28-06-13 13:58:25, Dave Chinner wrote:
> On Fri, Jun 28, 2013 at 11:13:01AM +1000, Dave Chinner wrote:
> > On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote:
> > > On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote:
> > >  
> > >  
> > >  > > Yup, that's about three orders of magnitude faster on this
> > >  > > workload....
> > >  > > 
> > >  > > Lightly smoke tested patch below - it passed the first round of
> > >  > > XFS data integrity tests in xfstests, so it's not completely
> > >  > > busted...
> > >  > 
> > >  > And now with even less smoke than the first version. This one
> > >  > gets through a full xfstests run...
> > > 
> > > :sadface:
> > > 
> > > [  567.680836] ======================================================
> > > [  567.681582] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
> > > [  567.682389] 3.10.0-rc7+ #9 Not tainted
> > > [  567.682862] ------------------------------------------------------
> > > [  567.683607] trinity-child2/8665 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > > [  567.684464]  (&sb->s_type->i_lock_key#3){+.+...}, at: [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
> > > [  567.685632] and this task is already holding:
> > > [  567.686334]  (&(&wb->wb_list_lock)->rlock){..-...}, at: [<ffffffff811d7451>] sync_inodes_sb+0x191/0x3b0
> > > [  567.687506] which would create a new lock dependency:
> > >  (&(&wb->wb_list_lock)->rlock){..-...} -> (&sb->s_type->i_lock_key#3){+.+...}
> > 
> > .....
> > 
> > > other info that might help us debug this:
> > > 
> > > [  567.750396]  Possible interrupt unsafe locking scenario:
> > > 
> > > [  567.752062]        CPU0                    CPU1
> > > [  567.753025]        ----                    ----
> > > [  567.753981]   lock(&sb->s_type->i_lock_key#3);
> > > [  567.754969]                                local_irq_disable();
> > > [  567.756085]                                lock(&(&wb->wb_list_lock)->rlock);
> > > [  567.757368]                                lock(&sb->s_type->i_lock_key#3);
> > > [  567.758642]   <Interrupt>
> > > [  567.759370]     lock(&(&wb->wb_list_lock)->rlock);
> > 
> > Oh, that's easy enough to fix. It's just changing the wait_sb_inodes
> > loop to use a spin_trylock(&inode->i_lock), moving the inode to
> > the end of the sync list, dropping all locks and starting again...
> 
> New version below, went through xfstests with lockdep enabled this
> time....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> writeback: store inodes under writeback on a separate list
> 
> From: Dave Chinner <dchin...@redhat.com>
> 
> When there are lots of cached inodes, a sync(2) operation walks all
> of them to try to find which ones are under writeback and wait for
> IO completion on them. Run enough load, and this causes catastrophic
> lock contention on the inode_sb_list_lock.
> 
> Try to fix this problem by tracking inodes under data IO and wait
> specifically for only those inodes that haven't completed their data
> IO in wait_sb_inodes().
> 
> This is a bit hacky and messy, but demonstrates one method of
> solving this problem....
> 
> XXX: This will catch data IO - do we need to catch actual inode
> writeback (i.e. the metadata) here? I'm pretty sure we don't need to
> because the existing code just calls filemap_fdatawait() and that
> doesn't wait for the inode metadata writeback to occur....
> 
> [v3 - avoid deadlock due to interrupt while holding inode->i_lock]
> 
> [v2 - needs spin_lock_irq variants in wait_sb_inodes.
>     - move freeing inodes back to primary list, we don't wait for
>       them
>     - take mapping lock in wait_sb_inodes when requeuing.]
> 
> Signed-off-by: Dave Chinner <dchin...@redhat.com>
> ---
>  fs/fs-writeback.c           |   70 +++++++++++++++++++++++++++++++++++++------
>  fs/inode.c                  |    1 +
>  include/linux/backing-dev.h |    3 ++
>  include/linux/fs.h          |    3 +-
>  mm/backing-dev.c            |    2 ++
>  mm/page-writeback.c         |   22 ++++++++++++++
>  6 files changed, 91 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 3be5718..589c40b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1208,7 +1208,10 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>  
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -     struct inode *inode, *old_inode = NULL;
> +     struct backing_dev_info *bdi = sb->s_bdi;
> +     struct inode *old_inode = NULL;
> +     unsigned long flags;
> +     LIST_HEAD(sync_list);
>  
>       /*
>        * We need to be protected against the filesystem going from
> @@ -1216,7 +1219,6 @@ static void wait_sb_inodes(struct super_block *sb)
>        */
>       WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -     spin_lock(&inode_sb_list_lock);
>  
>       /*
>        * Data integrity sync. Must wait for all pages under writeback,
> @@ -1224,19 +1226,58 @@ static void wait_sb_inodes(struct super_block *sb)
>        * call, but which had writeout started before we write it out.
>        * In which case, the inode may not be on the dirty list, but
>        * we still have to wait for that writeout.
> +      *
> +      * To avoid syncing inodes put under IO after we have started here,
> +      * splice the io list to a temporary list head and walk that. Newly
> +      * dirtied inodes will go onto the primary list so we won't wait for
> +      * them.
> +      *
> +      * Inodes that have pages under writeback after we've finished the wait
> +      * may or may not be on the primary list. They had pages put under IO
> +      * after we started our wait, so we need to make sure the next sync or
> +      * IO completion treats them correctly. Move them back to the primary
> +      * list and restart the walk.
> +      *
> +      * Inodes that are clean after we have waited for them don't belong
> +      * on any list, and the cleaning of them should have removed them from
> +      * the temporary list. Check this is true, and restart the walk.
>        */
> -     list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +     spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> +     list_splice_init(&bdi->wb.b_wb, &sync_list);
> +
> +     while (!list_empty(&sync_list)) {
> +             struct inode *inode = list_first_entry(&sync_list, struct inode,
> +                                                    i_io_list);
>               struct address_space *mapping = inode->i_mapping;
>  
> -             spin_lock(&inode->i_lock);
> -             if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -                 (mapping->nrpages == 0)) {
> +             /*
> +              * we are nesting the inode->i_lock inside an IRQ-disabled
> +              * section here, so there's the possibility that we could have
> +              * a lock inversion due to an interrupt while holding the
> +              * inode->i_lock elsewhere. This is the only place we take the
> +              * inode->i_lock inside the wb_list_lock, so we need to use a
> +              * trylock to avoid a deadlock. If we fail to get the lock,
> +              * the only way to make progress is to also drop the
> +              * wb_list_lock so the interrupt trying to get it can make
> +              * progress.
> +              */
> +             if (!spin_trylock(&inode->i_lock)) {
> +                     list_move(&inode->i_io_list, &bdi->wb.b_wb);
> +                     spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
> +                     cpu_relax();
> +                     spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> +                     continue;
> +             }
> +
> +             if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +                     list_move(&inode->i_io_list, &bdi->wb.b_wb);
>                       spin_unlock(&inode->i_lock);
>                       continue;
>               }
  Ugh, the locking looks ugly. Plus the list handling is buggy because the
first wait_sb_inodes() invocation will move all inodes to its private
sync_list so if there's another wait_sb_inodes() invocation racing with it,
it won't wait properly for all the inodes it should.

Wouldn't it be easier to remove inodes from the b_wb list (btw, I'd
slightly prefer the name b_writeback) lazily, instead of from
test_clear_page_writeback()? I mean we would remove inodes from the b_wb
list only in wait_sb_inodes() or when inodes get reclaimed from memory.
That way we save some work in test_clear_page_writeback(), which is a fast
path, and defer it to sync, which isn't that performance critical. Also we
would avoid those ugly games with irq-safe spinlocks.

>               __iget(inode);
>               spin_unlock(&inode->i_lock);
> -             spin_unlock(&inode_sb_list_lock);
> +             spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
>  
>               /*
>                * We hold a reference to 'inode' so it couldn't have been
> @@ -1253,9 +1294,20 @@ static void wait_sb_inodes(struct super_block *sb)
>  
>               cond_resched();
>  
> -             spin_lock(&inode_sb_list_lock);
> +             /*
> +              * the inode has been written back now, so check whether we
> +              * still have pages under IO and move it back to the primary
> +              * list if necessary
> +              */
> +             spin_lock_irqsave(&mapping->tree_lock, flags);
> +             spin_lock(&bdi->wb.wb_list_lock);
> +             if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +                     WARN_ON(list_empty(&inode->i_io_list));
> +                     list_move(&inode->i_io_list, &bdi->wb.b_wb);
> +             }
> +             spin_unlock(&mapping->tree_lock);
>       }
> -     spin_unlock(&inode_sb_list_lock);
> +     spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
>       iput(old_inode);
>  }
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index 00d5fc3..f25c1ca 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -364,6 +364,7 @@ void inode_init_once(struct inode *inode)
>       INIT_HLIST_NODE(&inode->i_hash);
>       INIT_LIST_HEAD(&inode->i_devices);
>       INIT_LIST_HEAD(&inode->i_wb_list);
> +     INIT_LIST_HEAD(&inode->i_io_list);
>       INIT_LIST_HEAD(&inode->i_lru);
>       address_space_init_once(&inode->i_data);
>       i_size_ordered_init(inode);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..4a6283c 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -59,6 +59,9 @@ struct bdi_writeback {
>       struct list_head b_io;          /* parked for writeback */
>       struct list_head b_more_io;     /* parked for more writeback */
>       spinlock_t list_lock;           /* protects the b_* lists */
> +
> +     spinlock_t wb_list_lock;        /* writeback list lock */
> +     struct list_head b_wb;          /* under writeback, for wait_sb_inodes */
>  };
>  
>  struct backing_dev_info {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 63cac31..7861017 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -573,7 +573,8 @@ struct inode {
>       unsigned long           dirtied_when;   /* jiffies of first dirtying */
>  
>       struct hlist_node       i_hash;
> -     struct list_head        i_wb_list;      /* backing dev IO list */
> +     struct list_head        i_wb_list;      /* backing dev WB list */
> +     struct list_head        i_io_list;      /* backing dev IO list */
>       struct list_head        i_lru;          /* inode LRU list */
>       struct list_head        i_sb_list;
>       union {
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5025174..896b8f5 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -421,7 +421,9 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
>       INIT_LIST_HEAD(&wb->b_dirty);
>       INIT_LIST_HEAD(&wb->b_io);
>       INIT_LIST_HEAD(&wb->b_more_io);
> +     INIT_LIST_HEAD(&wb->b_wb);
>       spin_lock_init(&wb->list_lock);
> +     spin_lock_init(&wb->wb_list_lock);
>       INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>  }
>  
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4514ad7..4c411fe 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2238,6 +2238,15 @@ int test_clear_page_writeback(struct page *page)
>                               __dec_bdi_stat(bdi, BDI_WRITEBACK);
>                               __bdi_writeout_inc(bdi);
>                       }
> +                     if (mapping->host &&
> +                         !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +                             struct inode *inode = mapping->host;
> +
> +                             WARN_ON(list_empty(&inode->i_io_list));
> +                             spin_lock(&bdi->wb.wb_list_lock);
> +                             list_del_init(&inode->i_io_list);
> +                             spin_unlock(&bdi->wb.wb_list_lock);
> +                     }
>               }
>               spin_unlock_irqrestore(&mapping->tree_lock, flags);
>       } else {
> @@ -2262,11 +2271,24 @@ int test_set_page_writeback(struct page *page)
>               spin_lock_irqsave(&mapping->tree_lock, flags);
>               ret = TestSetPageWriteback(page);
>               if (!ret) {
> +                     bool on_wblist;
> +
> +                     /* __swap_writepage comes through here */
> +                     on_wblist = mapping_tagged(mapping,
> +                                                PAGECACHE_TAG_WRITEBACK);
>                       radix_tree_tag_set(&mapping->page_tree,
>                                               page_index(page),
>                                               PAGECACHE_TAG_WRITEBACK);
>                       if (bdi_cap_account_writeback(bdi))
>                               __inc_bdi_stat(bdi, BDI_WRITEBACK);
> +                     if (!on_wblist && mapping->host) {
> +                             struct inode *inode = mapping->host;
> +
> +                             WARN_ON(!list_empty(&inode->i_io_list));
> +                             spin_lock(&bdi->wb.wb_list_lock);
> +                             list_add_tail(&inode->i_io_list, &bdi->wb.b_wb);
> +                             spin_unlock(&bdi->wb.wb_list_lock);
> +                     }
>               }
>               if (!PageDirty(page))
>                       radix_tree_tag_clear(&mapping->page_tree,
I'm somewhat uneasy about this. Writeback code generally uses the
inode_to_bdi() function to get from the mapping to the backing_dev_info
(which uses inode->i_sb->s_bdi except for inodes on blockdev_superblock).
That isn't always the same as the inode->i_mapping->backing_dev_info used
here, although I now fail to remember a case where
inode->i_mapping->backing_dev_info would be the wrong bdi to use for sync
purposes.

                                                                Honza
-- 
Jan Kara <j...@suse.cz>
SUSE Labs, CR