Re: [PATCH 4/8] fs: kill i_alloc_sem
On Mon, Jun 20, 2011 at 06:18:57PM -0400, Christoph Hellwig wrote: > On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote: > > Are we guaranteed that all allocation changes are locked out by > > i_dio_count>0? I don't think we are. The ocfs2 code very strongly > > assumes the state of a file's allocation when it holds i_alloc_sem. I > > feel like we lose that here. > > You aren't, neither with the old i_alloc_sem code, nor with the 1:1 > replacement using i_dio_count. > > Do a quick grep who gets i_alloc_sem exclusively (down_write): it's > really just the truncate code, and it's cut & paste duplicates in ntfs > and reiserfs. Sorry, I confused this with our ip_alloc_sem. I was tired. Joel -- Life's Little Instruction Book #24 "Drink champagne for no reason at all." http://www.jlbec.org/ jl...@evilplan.org -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] fs: kill i_alloc_sem
On Tue, Jun 21, 2011 at 03:40:56PM +1000, Dave Chinner wrote: > Modification of inode->i_state is not safe outside the > inode->i_lock. We never actually set the new bit in i_state, we just use it as a key for the hashed lookups. Or rather we try to, as I misunderstood how wait_on_bit works, so currently we busywait for i_dio_count to reach zero. I'll respin a version that actually works as expected. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] fs: kill i_alloc_sem
On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig wrote: > i_alloc_sem is a rather special rw_semaphore. It's the last one that may > be released by a non-owner, and it's write side is always mirrored by > real exclusion. It's intended use it to wait for all pending direct I/O > requests to finish before starting a truncate. > > Replace it with a hand-grown construct: > > - exclusion for truncates is already guaranteed by i_mutex, so it can >simply fall way > - the reader side is replaced by an i_dio_count member in struct inode >that counts the number of pending direct I/O requests. Truncate can't >proceed as long as it's non-zero > - when i_dio_count reaches non-zero we wake up a pending truncate using >wake_up_bit on a new bit in i_flags > - new references to i_dio_count can't appear while we are waiting for >it to read zero because the direct I/O count always needs i_mutex >(or an equivalent like XFS's i_iolock) for starting a new operation. > > This scheme is much simpler, and saves the space of a spinlock_t and a > struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit > system). > > Signed-off-by: Christoph Hellwig > > Index: linux-2.6/fs/direct-io.c > === > --- linux-2.6.orig/fs/direct-io.c 2011-06-20 14:55:31.0 +0200 > +++ linux-2.6/fs/direct-io.c 2011-06-20 14:55:34.602490284 +0200 > @@ -136,6 +136,27 @@ struct dio { > }; > > /* > + * Wait for outstanding DIO requests to finish. Must be locked against > + * increments of i_dio_count by i_mutex. > + */ > +void inode_dio_wait(struct inode *inode) > +{ > + might_sleep(); > + while (atomic_read(&inode->i_dio_count)) { > + wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait, > + TASK_UNINTERRUPTIBLE); > + } > +} > +EXPORT_SYMBOL_GPL(inode_dio_wait); > + > +void inode_dio_wake(struct inode *inode) > +{ > + if (atomic_dec_and_test(&inode->i_dio_count)) > + wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); > +} > +EXPORT_SYMBOL_GPL(inode_dio_wake); Modification of inode->i_state is not safe outside the inode->i_lock. This probably needs to be implemented similar to the __I_NEW/__wait_on_freeing_inode() and __I_SYNC/inode_wait_for_writeback() pattern... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] fs: kill i_alloc_sem
On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote: > Are we guaranteed that all allocation changes are locked out by > i_dio_count>0? I don't think we are. The ocfs2 code very strongly > assumes the state of a file's allocation when it holds i_alloc_sem. I > feel like we lose that here. You aren't, neither with the old i_alloc_sem code, nor with the 1:1 replacement using i_dio_count. Do a quick grep who gets i_alloc_sem exclusively (down_write): it's really just the truncate code, and it's cut & paste duplicates in ntfs and reiserfs. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] fs: kill i_alloc_sem
On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig wrote: > i_alloc_sem is a rather special rw_semaphore. It's the last one that may > be released by a non-owner, and it's write side is always mirrored by > real exclusion. It's intended use it to wait for all pending direct I/O > requests to finish before starting a truncate. > > Replace it with a hand-grown construct: > > - exclusion for truncates is already guaranteed by i_mutex, so it can >simply fall way > - the reader side is replaced by an i_dio_count member in struct inode >that counts the number of pending direct I/O requests. Truncate can't >proceed as long as it's non-zero > - when i_dio_count reaches non-zero we wake up a pending truncate using >wake_up_bit on a new bit in i_flags > - new references to i_dio_count can't appear while we are waiting for >it to read zero because the direct I/O count always needs i_mutex >(or an equivalent like XFS's i_iolock) for starting a new operation. > > This scheme is much simpler, and saves the space of a spinlock_t and a > struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit > system). Are we guaranteed that all allocation changes are locked out by i_dio_count>0? I don't think we are. The ocfs2 code very strongly assumes the state of a file's allocation when it holds i_alloc_sem. I feel like we lose that here. Joel -- "I don't even butter my bread; I consider that cooking." - Katherine Cebrian http://www.jlbec.org/ jl...@evilplan.org -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] fs: kill i_alloc_sem
i_alloc_sem is a rather special rw_semaphore. It's the last one that may be released by a non-owner, and it's write side is always mirrored by real exclusion. It's intended use it to wait for all pending direct I/O requests to finish before starting a truncate. Replace it with a hand-grown construct: - exclusion for truncates is already guaranteed by i_mutex, so it can simply fall way - the reader side is replaced by an i_dio_count member in struct inode that counts the number of pending direct I/O requests. Truncate can't proceed as long as it's non-zero - when i_dio_count reaches non-zero we wake up a pending truncate using wake_up_bit on a new bit in i_flags - new references to i_dio_count can't appear while we are waiting for it to read zero because the direct I/O count always needs i_mutex (or an equivalent like XFS's i_iolock) for starting a new operation. This scheme is much simpler, and saves the space of a spinlock_t and a struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit system). Signed-off-by: Christoph Hellwig Index: linux-2.6/fs/direct-io.c === --- linux-2.6.orig/fs/direct-io.c 2011-06-20 14:55:31.0 +0200 +++ linux-2.6/fs/direct-io.c2011-06-20 14:55:34.602490284 +0200 @@ -136,6 +136,27 @@ struct dio { }; /* + * Wait for outstanding DIO requests to finish. Must be locked against + * increments of i_dio_count by i_mutex. + */ +void inode_dio_wait(struct inode *inode) +{ + might_sleep(); + while (atomic_read(&inode->i_dio_count)) { + wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait, + TASK_UNINTERRUPTIBLE); + } +} +EXPORT_SYMBOL_GPL(inode_dio_wait); + +void inode_dio_wake(struct inode *inode) +{ + if (atomic_dec_and_test(&inode->i_dio_count)) + wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); +} +EXPORT_SYMBOL_GPL(inode_dio_wake); + +/* * How many pages are in the queue? */ static inline unsigned dio_pages_present(struct dio *dio) @@ -254,9 +275,7 @@ static ssize_t dio_complete(struct dio * } if (dio->flags & DIO_LOCKING) - /* lockdep: non-owner release */ - up_read_non_owner(&dio->inode->i_alloc_sem); - + inode_dio_wake(dio->inode); return ret; } @@ -980,9 +999,6 @@ out: return ret; } -/* - * Releases both i_mutex and i_alloc_sem - */ static ssize_t direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, const struct iovec *iov, loff_t offset, unsigned long nr_segs, @@ -1146,15 +1162,14 @@ direct_io_worker(int rw, struct kiocb *i *For writes this function is called under i_mutex and returns with *i_mutex held, for reads, i_mutex is not held on entry, but it is *taken and dropped again before returning. - *For reads and writes i_alloc_sem is taken in shared mode and released - *on I/O completion (which may happen asynchronously after returning to - *the caller). + *The i_dio_count counter keeps track of the number of outstanding + *direct I/O requests, and truncate waits for it to reach zero. + *New references to i_dio_count must only be grabbed with i_mutex + *held. * * - if the flags value does NOT contain DIO_LOCKING we don't use any *internal locking but rather rely on the filesystem to synchronize *direct I/O reads/writes versus each other and truncate. - *For reads and writes both i_mutex and i_alloc_sem are not held on - *entry and are never taken. */ ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, @@ -1234,10 +1249,9 @@ __blockdev_direct_IO(int rw, struct kioc } /* -* Will be released at I/O completion, possibly in a -* different thread. +* Will be decremented at I/O completion time. */ - down_read_non_owner(&inode->i_alloc_sem); + atomic_inc(&inode->i_dio_count); } /* Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2011-06-20 14:19:27.019266696 +0200 +++ linux-2.6/mm/filemap.c 2011-06-20 14:55:34.605823617 +0200 @@ -78,9 +78,6 @@ * ->i_mutex (generic_file_buffered_write) *->mmap_sem (fault_in_pages_readable->do_page_fault) * - * ->i_mutex - *->i_alloc_sem (various) - * * inode_wb_list_lock *sb_lock (fs/fs-writeback.c) *->mapping->tree_lock (__sync_single_inode) Index: linux-2.6/mm/rmap.c === --- linux-2.6.orig/mm/rmap.c2011-06-20 14:19:27.0 +0200 +++ linux-2.6/mm/rmap.c 2011-06-20 14:55:34.605823617 +0200 @@ -21,7 +21,6 @@ * Lock ordering in mm: * * inode