Re: [PATCH 4/8] fs: kill i_alloc_sem

2011-06-30 Thread Joel Becker
On Mon, Jun 20, 2011 at 06:18:57PM -0400, Christoph Hellwig wrote:
> On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote:
> > Are we guaranteed that all allocation changes are locked out by
> > i_dio_count>0?  I don't think we are.  The ocfs2 code very strongly
> > assumes the state of a file's allocation when it holds i_alloc_sem.  I
> > feel like we lose that here. 
> 
> You aren't, neither with the old i_alloc_sem code, nor with the 1:1
> replacement using i_dio_count.
> 
> Do a quick grep who gets i_alloc_sem exclusively (down_write): it's
> really just the truncate code, and it's cut & paste duplicates in ntfs
> and reiserfs.

Sorry, I confused this with our ip_alloc_sem.  I was tired.

Joel

-- 

Life's Little Instruction Book #24

"Drink champagne for no reason at all."

http://www.jlbec.org/
jl...@evilplan.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] fs: kill i_alloc_sem

2011-06-21 Thread Christoph Hellwig
On Tue, Jun 21, 2011 at 03:40:56PM +1000, Dave Chinner wrote:
> Modification of inode->i_state is not safe outside the
> inode->i_lock.

We never actually set the new bit in i_state, we just use it as a key
for the hashed lookups.  Or rather we try to, as I misunderstood how
wait_on_bit works, so currently we busywait for i_dio_count to reach
zero.  I'll respin a version that actually works as expected.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] fs: kill i_alloc_sem

2011-06-20 Thread Dave Chinner
On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig wrote:
> i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
> be released by a non-owner, and it's write side is always mirrored by
> real exclusion.  It's intended use it to wait for all pending direct I/O
> requests to finish before starting a truncate.
> 
> Replace it with a hand-grown construct:
> 
>  - exclusion for truncates is already guaranteed by i_mutex, so it can
>simply fall way
>  - the reader side is replaced by an i_dio_count member in struct inode
>that counts the number of pending direct I/O requests.  Truncate can't
>proceed as long as it's non-zero
>  - when i_dio_count reaches non-zero we wake up a pending truncate using
>wake_up_bit on a new bit in i_flags
>  - new references to i_dio_count can't appear while we are waiting for
>it to read zero because the direct I/O count always needs i_mutex
>(or an equivalent like XFS's i_iolock) for starting a new operation.
> 
> This scheme is much simpler, and saves the space of a spinlock_t and a
> struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
> system).
> 
> Signed-off-by: Christoph Hellwig 
> 
> Index: linux-2.6/fs/direct-io.c
> ===
> --- linux-2.6.orig/fs/direct-io.c 2011-06-20 14:55:31.0 +0200
> +++ linux-2.6/fs/direct-io.c  2011-06-20 14:55:34.602490284 +0200
> @@ -136,6 +136,27 @@ struct dio {
>  };
>  
>  /*
> + * Wait for outstanding DIO requests to finish.  Must be locked against
> + * increments of i_dio_count by i_mutex.
> + */
> +void inode_dio_wait(struct inode *inode)
> +{
> + might_sleep();
> + while (atomic_read(&inode->i_dio_count)) {
> + wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait,
> + TASK_UNINTERRUPTIBLE);
> + }
> +}
> +EXPORT_SYMBOL_GPL(inode_dio_wait);
> +
> +void inode_dio_wake(struct inode *inode)
> +{
> + if (atomic_dec_and_test(&inode->i_dio_count))
> + wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
> +}
> +EXPORT_SYMBOL_GPL(inode_dio_wake);

Modification of inode->i_state is not safe outside the
inode->i_lock.

This probably needs to be implemented similar to the
__I_NEW/__wait_on_freeing_inode() and
__I_SYNC/inode_wait_for_writeback() pattern...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] fs: kill i_alloc_sem

2011-06-20 Thread Christoph Hellwig
On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote:
>   Are we guaranteed that all allocation changes are locked out by
> i_dio_count>0?  I don't think we are.  The ocfs2 code very strongly
> assumes the state of a file's allocation when it holds i_alloc_sem.  I
> feel like we lose that here. 

You aren't, neither with the old i_alloc_sem code, nor with the 1:1
replacement using i_dio_count.

Do a quick grep who gets i_alloc_sem exclusively (down_write): it's
really just the truncate code, and it's cut & paste duplicates in ntfs
and reiserfs.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] fs: kill i_alloc_sem

2011-06-20 Thread Joel Becker
On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig wrote:
> i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
> be released by a non-owner, and it's write side is always mirrored by
> real exclusion.  It's intended use it to wait for all pending direct I/O
> requests to finish before starting a truncate.
> 
> Replace it with a hand-grown construct:
> 
>  - exclusion for truncates is already guaranteed by i_mutex, so it can
>simply fall way
>  - the reader side is replaced by an i_dio_count member in struct inode
>that counts the number of pending direct I/O requests.  Truncate can't
>proceed as long as it's non-zero
>  - when i_dio_count reaches non-zero we wake up a pending truncate using
>wake_up_bit on a new bit in i_flags
>  - new references to i_dio_count can't appear while we are waiting for
>it to read zero because the direct I/O count always needs i_mutex
>(or an equivalent like XFS's i_iolock) for starting a new operation.
> 
> This scheme is much simpler, and saves the space of a spinlock_t and a
> struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
> system).

Are we guaranteed that all allocation changes are locked out by
i_dio_count>0?  I don't think we are.  The ocfs2 code very strongly
assumes the state of a file's allocation when it holds i_alloc_sem.  I
feel like we lose that here. 

Joel

-- 

"I don't even butter my bread; I consider that cooking."
 - Katherine Cebrian

http://www.jlbec.org/
jl...@evilplan.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/8] fs: kill i_alloc_sem

2011-06-20 Thread Christoph Hellwig
i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion.  It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.

Replace it with a hand-grown construct:

 - exclusion for truncates is already guaranteed by i_mutex, so it can
   simply fall way
 - the reader side is replaced by an i_dio_count member in struct inode
   that counts the number of pending direct I/O requests.  Truncate can't
   proceed as long as it's non-zero
 - when i_dio_count reaches non-zero we wake up a pending truncate using
   wake_up_bit on a new bit in i_flags
 - new references to i_dio_count can't appear while we are waiting for
   it to read zero because the direct I/O count always needs i_mutex
   (or an equivalent like XFS's i_iolock) for starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
system).

Signed-off-by: Christoph Hellwig 

Index: linux-2.6/fs/direct-io.c
===
--- linux-2.6.orig/fs/direct-io.c   2011-06-20 14:55:31.0 +0200
+++ linux-2.6/fs/direct-io.c2011-06-20 14:55:34.602490284 +0200
@@ -136,6 +136,27 @@ struct dio {
 };
 
 /*
+ * Wait for outstanding DIO requests to finish.  Must be locked against
+ * increments of i_dio_count by i_mutex.
+ */
+void inode_dio_wait(struct inode *inode)
+{
+   might_sleep();
+   while (atomic_read(&inode->i_dio_count)) {
+   wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait,
+   TASK_UNINTERRUPTIBLE);
+   }
+}
+EXPORT_SYMBOL_GPL(inode_dio_wait);
+
+void inode_dio_wake(struct inode *inode)
+{
+   if (atomic_dec_and_test(&inode->i_dio_count))
+   wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
+}
+EXPORT_SYMBOL_GPL(inode_dio_wake);
+
+/*
  * How many pages are in the queue?
  */
 static inline unsigned dio_pages_present(struct dio *dio)
@@ -254,9 +275,7 @@ static ssize_t dio_complete(struct dio *
}
 
if (dio->flags & DIO_LOCKING)
-   /* lockdep: non-owner release */
-   up_read_non_owner(&dio->inode->i_alloc_sem);
-
+   inode_dio_wake(dio->inode);
return ret;
 }
 
@@ -980,9 +999,6 @@ out:
return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
@@ -1146,15 +1162,14 @@ direct_io_worker(int rw, struct kiocb *i
  *For writes this function is called under i_mutex and returns with
  *i_mutex held, for reads, i_mutex is not held on entry, but it is
  *taken and dropped again before returning.
- *For reads and writes i_alloc_sem is taken in shared mode and released
- *on I/O completion (which may happen asynchronously after returning to
- *the caller).
+ *The i_dio_count counter keeps track of the number of outstanding
+ *direct I/O requests, and truncate waits for it to reach zero.
+ *New references to i_dio_count must only be grabbed with i_mutex
+ *held.
  *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *internal locking but rather rely on the filesystem to synchronize
  *direct I/O reads/writes versus each other and truncate.
- *For reads and writes both i_mutex and i_alloc_sem are not held on
- *entry and are never taken.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1234,10 +1249,9 @@ __blockdev_direct_IO(int rw, struct kioc
}
 
/*
-* Will be released at I/O completion, possibly in a
-* different thread.
+* Will be decremented at I/O completion time.
 */
-   down_read_non_owner(&inode->i_alloc_sem);
+   atomic_inc(&inode->i_dio_count);
}
 
/*
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c 2011-06-20 14:19:27.019266696 +0200
+++ linux-2.6/mm/filemap.c  2011-06-20 14:55:34.605823617 +0200
@@ -78,9 +78,6 @@
  *  ->i_mutex  (generic_file_buffered_write)
  *->mmap_sem   (fault_in_pages_readable->do_page_fault)
  *
- *  ->i_mutex
- *->i_alloc_sem (various)
- *
  *  inode_wb_list_lock
  *sb_lock  (fs/fs-writeback.c)
  *->mapping->tree_lock (__sync_single_inode)
Index: linux-2.6/mm/rmap.c
===
--- linux-2.6.orig/mm/rmap.c2011-06-20 14:19:27.0 +0200
+++ linux-2.6/mm/rmap.c 2011-06-20 14:55:34.605823617 +0200
@@ -21,7 +21,6 @@
  * Lock ordering in mm:
  *
  * inode