On Fri, Oct 12, 2007 at 07:53:53AM +1000, David Chinner wrote:
> On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote:
> > On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:
> > 
> > > So it's almost certainly pointing at an elevator or driver change, not an
> > > XFS change.
> > 
> > heh, git bisect begs to differ :)
> > 
> > 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit
> > 4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner <[EMAIL 
> > PROTECTED]>
> > Date:   Sat Nov 11 18:05:00 2006 +1100
> > 
> >     [XFS] Prevent a deadlock when xfslogd unpins inodes.
> 
> Oh, of course - I failed to notice the significance of
> this loop in your test:
> 
>       while [foo]; do
>               touch fred
>               rm fred
>       done
> 
> The inode allocator keeps reusing the same inode.  If the
> transaction that did the unlink has not hit the disk before we
> allocate the inode again, we have to force the log to get the unlink
> transaction to disk to get the xfs inode unpinned (i.e. able to be
> modified in memory again).
> 
> It's the log force I/O that's introducing the latency.
> 
> If we don't force the log, then we have a possible use-after free
> of the linux inode because of a fundamental mismatch between
> the XFS inode life cycle and the linux inode life cycle. The
> use-after free only occurs on large machines under heavy, heavy
> metadata load to many disks and filesystems (requires enough
> traffic to overload an xfslogd) and is very difficult to
> reproduce (large machine, lots of disks and 20-30 hours MTTF).
> 
> I'll have a look at other ways to solve this problem, but it
> took 6 months to find a solution to the race in the first place
> so don't hold your breath.

You can breath again. Here's a test patch (warning - may harm
kittens - not fully tested or verified) that solves both
the use-after-free issue (by avoiding it altogether) as well the
unlink/create latency because the log force is no longer there.

(yay! xfsqa test 016 passes again ;)

It does have other possible side effects triggering extra
log forces elsewhere on inode writeback and affects sync behaviour
so it's only a proof of concept at this point.

The latency output looks like:

<idle>

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000281>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000184>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000188>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000182>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000225>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000182>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000185>

<start dd>

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000405>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000224>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000199>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000163>

<writeback starts>

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000145>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.817704>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000148>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000143>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000154>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000147>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000158>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000144>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000379>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000151>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000190>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000191>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000150>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.797099>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000139>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000163>

<writeback ends>

So we still see some operations show high latency - that will most likely
be due to the allocation transaction filling a log buffer and pushing it
to disk, then having to wait because I/O is in progress on all other log
buffers.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


---
 fs/xfs/linux-2.6/xfs_iops.c |   16 ++++++++++++++++
 fs/xfs/xfs_iget.c           |   18 ------------------
 fs/xfs/xfs_inode.c          |   34 +---------------------------------
 fs/xfs/xfs_inode.h          |    1 +
 fs/xfs/xfs_inode_item.c     |    5 +++++
 5 files changed, 23 insertions(+), 51 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c      2007-10-02 
16:01:47.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c   2007-10-12 09:54:12.466194814 
+1000
@@ -71,6 +71,22 @@ xfs_synchronize_atime(
 }
 
 /*
+ * If the linux inode exists, mark it dirty.
+ * Used when commiting a dirty inode into a transaction so that
+ * the inode will get written back by the linux code
+ */
+void
+xfs_mark_inode_dirty_sync(
+       xfs_inode_t     *ip)
+{
+       bhv_vnode_t     *vp;
+
+       vp = XFS_ITOV_NULL(ip);
+       if (vp)
+               mark_inode_dirty_sync(vn_to_inode(vp));
+}
+
+/*
  * Change the requested timestamp in the given inode.
  * We don't lock across timestamp updates, and we don't log them but
  * we do record the fact that there is dirty information in core.
Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c        2007-10-02 16:01:48.000000000 
+1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c     2007-10-12 09:58:38.263910771 +1000
@@ -140,27 +140,9 @@ again:
                                return ENOENT;
                        }
 
-                       /*
-                        * There may be transactions sitting in the
-                        * incore log buffers or being flushed to disk
-                        * at this time.  We can't clear the
-                        * XFS_IRECLAIMABLE flag until these
-                        * transactions have hit the disk, otherwise we
-                        * will void the guarantee the flag provides
-                        * xfs_iunpin()
-                        */
-                       if (xfs_ipincount(ip)) {
-                               read_unlock(&pag->pag_ici_lock);
-                               xfs_log_force(mp, 0,
-                                       XFS_LOG_FORCE|XFS_LOG_SYNC);
-                               XFS_STATS_INC(xs_ig_frecycle);
-                               goto again;
-                       }
-
                        xfs_itrace_exit_tag(ip, "xfs_iget.alloc");
 
                        XFS_STATS_INC(xs_ig_found);
-
                        xfs_iflags_clear(ip, XFS_IRECLAIMABLE);
                        read_unlock(&pag->pag_ici_lock);
 
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c       2007-10-11 09:53:53.000000000 
+1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c    2007-10-12 09:56:49.721912637 +1000
@@ -2807,40 +2807,8 @@ xfs_iunpin(
 {
        ASSERT(atomic_read(&ip->i_pincount) > 0);
 
-       if (atomic_dec_and_lock(&ip->i_pincount, &ip->i_flags_lock)) {
-
-               /*
-                * If the inode is currently being reclaimed, the link between
-                * the bhv_vnode and the xfs_inode will be broken after the
-                * XFS_IRECLAIM* flag is set. Hence, if these flags are not
-                * set, then we can move forward and mark the linux inode dirty
-                * knowing that it is still valid as it won't freed until after
-                * the bhv_vnode<->xfs_inode link is broken in xfs_reclaim. The
-                * i_flags_lock is used to synchronise the setting of the
-                * XFS_IRECLAIM* flags and the breaking of the link, and so we
-                * can execute atomically w.r.t to reclaim by holding this lock
-                * here.
-                *
-                * However, we still need to issue the unpin wakeup call as the
-                * inode reclaim may be blocked waiting for the inode to become
-                * unpinned.
-                */
-
-               if (!__xfs_iflags_test(ip, XFS_IRECLAIM|XFS_IRECLAIMABLE)) {
-                       bhv_vnode_t     *vp = XFS_ITOV_NULL(ip);
-                       struct inode *inode = NULL;
-
-                       BUG_ON(vp == NULL);
-                       inode = vn_to_inode(vp);
-                       BUG_ON(inode->i_state & I_CLEAR);
-
-                       /* make sync come back and flush this inode */
-                       if (!(inode->i_state & (I_NEW|I_FREEING)))
-                               mark_inode_dirty_sync(inode);
-               }
-               spin_unlock(&ip->i_flags_lock);
+       if (atomic_dec_and_test(&ip->i_pincount))
                wake_up(&ip->i_ipin_wait);
-       }
 }
 
 /*
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.h       2007-10-02 16:01:48.000000000 
+1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.h    2007-10-12 10:00:10.979948827 +1000
@@ -532,6 +532,7 @@ xfs_fsize_t xfs_file_last_byte(xfs_inode
 void           xfs_lock_inodes(xfs_inode_t **, int, int, uint);
 
 void           xfs_synchronize_atime(xfs_inode_t *);
+void           xfs_mark_inode_dirty_sync(xfs_inode_t *);
 
 xfs_bmbt_rec_host_t *xfs_iext_get_ext(xfs_ifork_t *, xfs_extnum_t);
 void           xfs_iext_insert(xfs_ifork_t *, xfs_extnum_t, xfs_extnum_t,
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode_item.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode_item.c  2007-10-02 16:01:48.000000000 
+1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode_item.c       2007-10-12 09:54:18.053474268 
+1000
@@ -274,6 +274,11 @@ xfs_inode_item_format(
         */
        xfs_synchronize_atime(ip);
 
+       /*
+        * make sure the linux inode is dirty
+        */
+       xfs_mark_inode_dirty_sync(ip);
+
        vecp->i_addr = (xfs_caddr_t)&ip->i_d;
        vecp->i_len  = sizeof(xfs_dinode_core_t);
        XLOG_VEC_SET_TYPE(vecp, XLOG_REG_TYPE_ICORE);
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to