Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Mon, Mar 18, 2013 at 6:17 PM, Dave Chinner wrote:
> On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote:
>> On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
>> > We don't care about the ordering between multiple concurrent metadata modifications - what matters is whether the ongoing data IO around them is ordered correctly.
>>
>> Dave,
>>
>> The point that Michel is making is that there never was any ordering guarantee by rwsem. It's an illusion.
>
> Weasel words.

Whoaaa, calm down. You initially made one false statement (that the change meant a stream of readers would starve a writer forever) and one imprecise statement (that rwsem used to guarantee that readers don't skip ahead of writers - this may be true in practice for your use case because the latencies involved are very large compared to scheduling latencies, but that's a very important qualification that needs to be added here). That confused me enough that I initially couldn't tell what your actual concern was, so I pointed out the source of my confusion and asked you to clarify. It seems unfair to characterize that as "weasel words" - I'm not trying to be a smartass here, but only to actually understand your concern.

>> The reason is simple: to even get to the lock the cpu has to be sleep-able. So every submission that you believe is ordered is, by its very nature, __not ordered__, even when used by kernel code.
>>
>> Why? Because any thread on its way to claim the lock (reader or writer) could be pre-empted for some other task, thus delaying the submission of whatever i/o you believed to be ordered.
>
> You think I don't know this? You're arguing fine grained, low level behaviour between tasks is unpredictable. I get that. I understand that.
> But I'm not arguing about fine-grained, low level, microsecond semantics of the locking order.
>
> What you (and Michel) appear to be failing to see is what happens on a macro level when you have read locks being held for periods measured in *seconds* (e.g. direct IO gets queued behind a few thousand other IOs in the elevator waiting for a request slot), and the subsequent effect of inserting an operation that requires a write lock into that IO stream.
>
> IOWs, it simply doesn't matter if there's a micro-level race between the write lock and a couple of the readers. That's the level you guys are arguing at, but it simply does not matter in the cases I'm describing. I'm talking about high level serialisation behaviours that might take *seconds* to play out and the ordering behaviours observed at that scale.
>
> That is, I don't care if a couple of threads out of a few thousand race with the write lock over a few tens to hundreds of microseconds, but I most definitely care if a few thousand IOs issued seconds after the write lock is queued jump over the write lock. That is a gross behavioural change at the macro-level.

Understood. I accepted your concern and made sure my v2 proposal doesn't do such macro-level reordering.

>> So just to reiterate: there is no 'queue' and no 'barrier'. The guarantees that rwsem makes are:
>> 1. Multiple readers can own the lock.
>> 2. Only a single writer can own the lock.
>> 3. Readers will not starve writers.
>
> You've conveniently ignored the fact that the current implementation also provides the following guarantee:
>
> 4. new readers will block behind existing writers

In your use case, with large enough queue latencies, yes. Please don't make it sound like this applies in every use case - it has never applied for short
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote:
> On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
> > We don't care about the ordering between multiple concurrent metadata modifications - what matters is whether the ongoing data IO around them is ordered correctly.
>
> Dave,
>
> The point that Michel is making is that there never was any ordering guarantee by rwsem. It's an illusion.

Weasel words.

> The reason is simple: to even get to the lock the cpu has to be sleep-able. So every submission that you believe is ordered is, by its very nature, __not ordered__, even when used by kernel code.
>
> Why? Because any thread on its way to claim the lock (reader or writer) could be pre-empted for some other task, thus delaying the submission of whatever i/o you believed to be ordered.

You think I don't know this? You're arguing fine grained, low level behaviour between tasks is unpredictable. I get that. I understand that. But I'm not arguing about fine-grained, low level, microsecond semantics of the locking order.

What you (and Michel) appear to be failing to see is what happens on a macro level when you have read locks being held for periods measured in *seconds* (e.g. direct IO gets queued behind a few thousand other IOs in the elevator waiting for a request slot), and the subsequent effect of inserting an operation that requires a write lock into that IO stream.

IOWs, it simply doesn't matter if there's a micro-level race between the write lock and a couple of the readers. That's the level you guys are arguing at, but it simply does not matter in the cases I'm describing. I'm talking about high level serialisation behaviours that might take *seconds* to play out and the ordering behaviours observed at that scale.
That is, I don't care if a couple of threads out of a few thousand race with the write lock over a few tens to hundreds of microseconds, but I most definitely care if a few thousand IOs issued seconds after the write lock is queued jump over the write lock. That is a gross behavioural change at the macro-level.

> So just to reiterate: there is no 'queue' and no 'barrier'. The guarantees that rwsem makes are:
> 1. Multiple readers can own the lock.
> 2. Only a single writer can own the lock.
> 3. Readers will not starve writers.

You've conveniently ignored the fact that the current implementation also provides the following guarantee:

4. new readers will block behind existing writers

And that's the behaviour we currently depend on, whether you like it or not.

> Where lock policy can have a significant impact is on performance. But predicting that impact is difficult -- it's better just to measure.

Predicting the impact in this case is trivial - it's obvious that ordering of operations will change and break high level assumptions that userspace currently makes about various IO operations on XFS filesystems.

> It's not my intention to convince you (or anyone else) that there should only be One True Rwsem, because I don't believe that. But I didn't want the impression to persist that rwsem does anything more than implement a fair reader/writer semaphore.

I'm sorry, but redefining "fair" to suit your own needs doesn't convince me of anything. rwsem behaviour has been unchanged for at least 10 years and hence the current implementation defines what is "fair", not what you say is fair.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Thu, Mar 14, 2013 at 4:39 AM, Peter Hurley wrote:
> On Thu, 2013-03-14 at 00:03 -0700, Michel Lespinasse wrote:
>> > CPU 0                  | CPU 1
>> >                        |
>> >                        | down_write()
>> >
>> > ... CPU 1 has the write lock for the semaphore. Meanwhile, 1 or more down_read(s) are attempted and fail; these are put on the wait list.
>>
>> I'm not sure of the relevance of these other down_read() calls - please note that as these extra readers are put on the wait list, their +read_bias adjustments are canceled so that the count value ends up at write_bias + waiting_bias (for a total active count of 1)
>
> The relevance of the waiting readers is that when CPU 1 drops the writer ___and grabs the spin lock___, it then wakes up these already-waiting readers (CPU 0 is still parked in down_read_failed() waiting to acquire the spin lock).
>
> When CPU 1 wakes up these readers, the sem count goes > 0 and the waiting list is emptied. CPU 1 then drops the spin lock and leaves up_write().

Ah, correct. So the race you noticed is real, and the one I found is just a more complicated way to hit it. I'll add a fix in v2 of this patch series (probably later today).

BTW, you seem to have been paying pretty close attention, would you mind giving appropriate Reviewed-by tags when I send v2?
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Thu, 2013-03-14 at 00:03 -0700, Michel Lespinasse wrote:
> On Mon, Mar 11, 2013 at 04:36:47PM -0400, Peter Hurley wrote:
> > On Wed, 2013-03-06 at 15:21 -0800, Michel Lespinasse wrote:
> > > +  retry_reader_grants:
> > > +	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > > +	if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > > +		/* A writer stole the lock. Undo our reader grants. */
> > > +		if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
> > > +			goto out;
> > > +		/* The writer left. Retry waking readers. */
> > > +		goto retry_reader_grants;
> > > +	}
> >
> > This can be reduced to a single looping cmpxchg in the grant reversal path; then if reversing the grant fails, the count can simply be re-tested for grant success, rather than trying to atomically re-grant. For example, with a helper function, rwsem_cmpxchg():
> >
> > static inline int rwsem_cmpxchg(long *old, long new, struct rw_semaphore *sem)
> > {
> > 	long tmp = *old;
> > 	*old = cmpxchg(&sem->count, *old, new);
> > 	return tmp == *old;
> > }
> >
> > ... then the above becomes ...
> >
> > count = rwsem_atomic_update(adjustment, sem);
> > do {
> > 	if (count - adjustment >= RWSEM_WAITING_BIAS)
> > 		break;
> > 	if (rwsem_cmpxchg(&count, count - adjustment, sem))
> > 		goto out; /* or simply return sem */
> > } while (1);
> >
> > < wake up readers >
>
> This would work, but I don't see a benefit as we still end up with a retry loop. Also, the loop would have to retry whenever the counter value changed instead of only when writer(s) appear or are removed.

It's a minor optimization in that, while the count is unstable, this CPU will stay here, and while it does so, it's only inflicting one bus lock rather than two. Feel free to ignore it.

> > Also, this series and the original rwsem can mistakenly sleep reader(s) when the lock is transitioned from writer-owned to waiting readers-owned with no waiting writers.
> > For example,
> >
> > CPU 0                            | CPU 1
> >                                  |
> >                                  | down_write()
> >
> > ... CPU 1 has the write lock for the semaphore. Meanwhile, 1 or more down_read(s) are attempted and fail; these are put on the wait list.
>
> I'm not sure of the relevance of these other down_read() calls - please note that as these extra readers are put on the wait list, their +read_bias adjustments are canceled so that the count value ends up at write_bias + waiting_bias (for a total active count of 1)

The relevance of the waiting readers is that when CPU 1 drops the writer ___and grabs the spin lock___, it then wakes up these already-waiting readers (CPU 0 is still parked in down_read_failed() waiting to acquire the spin lock).

When CPU 1 wakes up these readers, the sem count goes > 0 and the waiting list is emptied. CPU 1 then drops the spin lock and leaves up_write().

CPU 0 has been spinning on the wait_lock during this time. It now acquires the lock, discovers the wait list is empty, and ups the adjustment (+wait bias). The count is already > 0 (+ # of waiting readers) so it never does the wake up.

> > Then ...
> >
> > down_read()                          | up_write()
> >   local = atomic_update(+read_bias)  |
> >   local <= 0?                        |   local = atomic_update(-write_bias)
> >   if (true)                          |   local < 0?
> >     down_read_failed()               |   if (true)
> >                                      |     wake()
> >                                      |       grab wait_lock
> >   wait for wait_lock                 |       wake all readers
> >                                      |       release wait_lock
> >
> > ... At this point, sem->count > 0 and the wait list is empty, but down_read_failed() will sleep the reader.
> >
> > In this case, CPU 0 has observed the sem count with the write lock (and the other waiters) and so is detoured to down_read_failed(). But if CPU 0 can't grab the wait_lock before the up_write() does (via rwsem_wake()), then down_read_failed() will wake no one and sleep the reader.
>
> Actually - down_read_failed() will insert the reader on the wait_list and do an atomic update(-read_bias).
> If there are no active lockers at the time of that atomic update, it calls __rwsem_do_wake() to unqueue the first queued waiter - which may well be itself.
>
> In your specific example, __rwsem_do_wake() will wake the previously queued readers as well as current.

Hopefully my comments above make it clear that there are no queued readers once CPU 0 can advance with the spin lock in down_read_failed().

> > Unfortunately, this means readers and writers which observe the sem count after the adjustment is committed by CPU 0 in down_read_failed() will sleep as well, until the sem count returns to 0.
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Mon, Mar 11, 2013 at 04:36:47PM -0400, Peter Hurley wrote:
> On Wed, 2013-03-06 at 15:21 -0800, Michel Lespinasse wrote:
> > +  retry_reader_grants:
> > +	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > +	if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > +		/* A writer stole the lock. Undo our reader grants. */
> > +		if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
> > +			goto out;
> > +		/* The writer left. Retry waking readers. */
> > +		goto retry_reader_grants;
> > +	}
>
> This can be reduced to a single looping cmpxchg in the grant reversal path; then if reversing the grant fails, the count can simply be re-tested for grant success, rather than trying to atomically re-grant. For example, with a helper function, rwsem_cmpxchg():
>
> static inline int rwsem_cmpxchg(long *old, long new, struct rw_semaphore *sem)
> {
> 	long tmp = *old;
> 	*old = cmpxchg(&sem->count, *old, new);
> 	return tmp == *old;
> }
>
> ... then the above becomes ...
>
> count = rwsem_atomic_update(adjustment, sem);
> do {
> 	if (count - adjustment >= RWSEM_WAITING_BIAS)
> 		break;
> 	if (rwsem_cmpxchg(&count, count - adjustment, sem))
> 		goto out; /* or simply return sem */
> } while (1);
>
> < wake up readers >

This would work, but I don't see a benefit as we still end up with a retry loop. Also, the loop would have to retry whenever the counter value changed instead of only when writer(s) appear or are removed.

> Also, this series and the original rwsem can mistakenly sleep reader(s) when the lock is transitioned from writer-owned to waiting readers-owned with no waiting writers. For example,
>
> CPU 0                            | CPU 1
>                                  |
>                                  | down_write()
>
> ... CPU 1 has the write lock for the semaphore. Meanwhile, 1 or more down_read(s) are attempted and fail; these are put on the wait list.
I'm not sure of the relevance of these other down_read() calls - please note that as these extra readers are put on the wait list, their +read_bias adjustments are canceled so that the count value ends up at write_bias + waiting_bias (for a total active count of 1)

> Then ...
>
> down_read()                          | up_write()
>   local = atomic_update(+read_bias)  |
>   local <= 0?                        |   local = atomic_update(-write_bias)
>   if (true)                          |   local < 0?
>     down_read_failed()               |   if (true)
>                                      |     wake()
>                                      |       grab wait_lock
>   wait for wait_lock                 |       wake all readers
>                                      |       release wait_lock
>
> ... At this point, sem->count > 0 and the wait list is empty, but down_read_failed() will sleep the reader.
>
> In this case, CPU 0 has observed the sem count with the write lock (and the other waiters) and so is detoured to down_read_failed(). But if CPU 0 can't grab the wait_lock before the up_write() does (via rwsem_wake()), then down_read_failed() will wake no one and sleep the reader.

Actually - down_read_failed() will insert the reader on the wait_list and do an atomic update(-read_bias). If there are no active lockers at the time of that atomic update, it calls __rwsem_do_wake() to unqueue the first queued waiter - which may well be itself.

In your specific example, __rwsem_do_wake() will wake the previously queued readers as well as current.

> Unfortunately, this means readers and writers which observe the sem count after the adjustment is committed by CPU 0 in down_read_failed() will sleep as well, until the sem count returns to 0.

Note that the active count does go back to 0 in your example.
However, thinking about it made me consider the following case:

CPU 0               CPU 1                                CPU 2

down_write()
                    down_read()
                      local = atomic_update(+read_bias)
up_write()
                                                         down_read()
                    if (local <= 0)
                      down_read_failed()

At this point CPU 0 doesn't hold the write lock anymore; CPU 2 grabbed a read lock; CPU 1 should grab an additional read lock but it won't - instead, it will queue itself (and get woken by CPU 2 when that read lock is released).

This is indeed slightly suboptimal. Could be fixed in an additional patch. As you mentioned, this is not a new issue caused by this patch series though - it's been there forever as far as I know.

Thanks,

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
> On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> > Hi Dave,
> >
> > On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner wrote:
> > > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> > >> - since all readers are woken at once, you might see bursts of direct IO operations followed by bursts of truncate operations, instead of having them interleaved in smaller groups as before.
> > >
> > > And this will result in the same application IO pattern giving vastly different results. e.g. a read IO promoted ahead of a truncate will now return data instead of -1 (beyond EOF).
> >
> > I will reply to this part first, as I think this gives a more concrete anchor to the discussion.
> >
> > The crucial point is that the truncate system call hasn't completed yet - that thread is still queued in the rwsem.
> >
> > You still have the guarantee that the truncate will complete eventually (even if there is a never-ending stream of IOs), and that all IO system calls that start after the truncate system call completes will see the file as truncated.
>
> Sure, but the problem is not about when the syscall completes - the problem is that you are changing where in the pipeline of IO the truncate is initially executed. i.e. ordering is not defined by when an operation completes, but by the order in which the queue is processed after a blocking operation completes. That is not when the syscall completes, but when the filesystem drops the exclusive lock.
>
> From a single threaded userspace application perspective or multithreaded apps with their own userspace locking, truncates complete when either the syscall completes or some time after when the application drops its lock. Either way, we're looking at completion time serialisation, and in that case what XFS or the kernel does simply doesn't matter.
> However, if we are talking about userspace applications that use lockless IO algorithms or kernel side applications (like knfsd) that are exposed directly to the filesystem locking semantics, serialisation behaviour is actually defined by the filesystem's submission side locking behaviour. There is no external serialisation of IO completion, so serialisation and ordering is defined (for XFS) solely by the rwsem behaviour.
>
> > You don't have guarantees about which system call will "appear to run before the other" if the execution times of the two system calls overlap each other, but you actually never had such a guarantee from a correctness perspective (i.e. you could never guarantee which of the two would get queued on the rwsem ahead of the other).
>
> Sure, but as long as the submission side ordering is deterministic, it doesn't matter.
>
> > > Ok, so I can see where your latency figure comes from, but it's still a change of ordering in that W2 is no longer a barrier to the reads queued after it.
> >
> > My claim is that it's not a visible change from a correctness perspective
>
> I am not arguing that it is incorrect. I'm arguing that the change of ordering semantics breaks assumptions a lot of code makes about how these locks work.
>
> > > similar to this:
> > >
> > > W1 R1 W2 R2 W3 R3 ... Wn Rm
> > >
> > > By my reading of the algorithm you are proposing, after W1 is released, we end up with the queue being treated like this:
> > >
> > > R1 R2 R3 ... Rm W2 W3 ... Wn
> > >
> > > Right?
> >
> > Yes, if what you are representing is the state of the queue at a given point of time (which implies R1..Rm must be distinct threads, not just the same thread doing repeated requests).
>
> Yup, that's very typical.
> > As new requests come in over time, one important point to remember is that each writer will see at most one batch of readers wake up ahead of it (though this batch may include readers that were originally queued both before and after it).
>
> And that's *exactly* the problem with the changes you are proposing - rwsems will no longer provide strongly ordered write vs read barrier semantics.
>
> > I find the name 'barrier' actually confusing when used to describe synchronous operations. To me a barrier is usually between asynchronous operations, and it is well defined which operations are ahead or behind of the barrier (either because they were queued by the same thread, or they were queued by different threads which may have synchronized together some other way).
>
> When you have hundreds or thousands of threads doing IO to the one file, it doesn't matter if the IO is issued synchronously or asynchronously by the threads - you simply have a huge amount of data IO concurrency and very, very deep pipelines.
>
> Inserting a metadata modification (truncate, preallocation, holepunch, etc) into that pipeline currently causes all new submissions to queue behind the metadata modification, waits for everything submitted before the metadata modification to complete and then runs the metadata modification.
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Tue, Mar 12, 2013 at 8:23 PM, Dave Chinner wrote:
> On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
>> I find the name 'barrier' actually confusing when used to describe synchronous operations. To me a barrier is usually between asynchronous operations, and it is well defined which operations are ahead or behind of the barrier (either because they were queued by the same thread, or they were queued by different threads which may have synchronized together some other way).
>
> When you have hundreds or thousands of threads doing IO to the one file, it doesn't matter if the IO is issued synchronously or asynchronously by the threads - you simply have a huge amount of data IO concurrency and very, very deep pipelines.
>
> Inserting a metadata modification (truncate, preallocation, holepunch, etc) into that pipeline currently causes all new submissions to queue behind the metadata modification, waits for everything submitted before the metadata modification to complete and then runs the metadata modification. Once it completes, it then allows everything queued up to the next metadata modification to run.
>
> That matches your definition of a barrier pretty well, I think.

The difference I see is that with async operations one can easily tell when the desired operations made it into the queue:

t1()
  async_submit(request X)
t2()
..
..
.. (eventually) request X completes

After t2 we know the operation is queued, and thus we can legitimately expect that new operations submitted after t2 should logically run after request X.

With sync operations we can't make that observation, as the caller thread is blocked:

t1()
  begin system call
  ..
  ..    request X gets queued
  ..
  ..    request X completes
  ..
  system call completes
t2()

Userspace can't pinpoint exactly when the request got onto the queue.
If another request Y gets submitted between t1 and t2, userspace can never know for sure whether Y was before or after X in the queue, so the proposed reordering should be (my claim) not observable by userspace, in the sense that it never produces a result that wasn't already possible before.

Anyway, I didn't intend for the proposed change to warrant a thread of this size. I will abandon patch 10 and propose a different (slightly longer, but without reordering) change to replace patch 11.

Thanks,

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> Hi Dave,
>
> On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner wrote:
> > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> >> - since all readers are woken at once, you might see bursts of direct IO operations followed by bursts of truncate operations, instead of having them interleaved in smaller groups as before.
> >
> > And this will result in the same application IO pattern giving vastly different results. e.g. a read IO promoted ahead of a truncate will now return data instead of -1 (beyond EOF).
>
> I will reply to this part first, as I think this gives a more concrete anchor to the discussion.
>
> The crucial point is that the truncate system call hasn't completed yet - that thread is still queued in the rwsem.
>
> You still have the guarantee that the truncate will complete eventually (even if there is a never-ending stream of IOs), and that all IO system calls that start after the truncate system call completes will see the file as truncated.

Sure, but the problem is not about when the syscall completes - the problem is that you are changing where in the pipeline of IO the truncate is initially executed. i.e. ordering is not defined by when an operation completes, but by the order in which the queue is processed after a blocking operation completes. That is not when the syscall completes, but when the filesystem drops the exclusive lock.

From a single threaded userspace application perspective or multithreaded apps with their own userspace locking, truncates complete when either the syscall completes or some time after when the application drops its lock. Either way, we're looking at completion time serialisation, and in that case what XFS or the kernel does simply doesn't matter.
However, if we are talking about userspace applications that use lockless IO algorithms or kernel side applications (like knfsd) that are exposed directly to the filesystem locking semantics, serialisation behaviour is actually defined by the filesystem's submission side locking behaviour. There is no external serialisation of IO completion, so serialisation and ordering is defined (for XFS) solely by the rwsem behaviour.

> You don't have guarantees about which system call will "appear to run before the other" if the execution times of the two system calls overlap each other, but you actually never had such a guarantee from a correctness perspective (i.e. you could never guarantee which of the two would get queued on the rwsem ahead of the other).

Sure, but as long as the submission side ordering is deterministic, it doesn't matter.

> > Ok, so I can see where your latency figure comes from, but it's still a change of ordering in that W2 is no longer a barrier to the reads queued after it.
>
> My claim is that it's not a visible change from a correctness perspective

I am not arguing that it is incorrect. I'm arguing that the change of ordering semantics breaks assumptions a lot of code makes about how these locks work.

> > similar to this:
> >
> > W1 R1 W2 R2 W3 R3 ... Wn Rm
> >
> > By my reading of the algorithm you are proposing, after W1 is released, we end up with the queue being treated like this:
> >
> > R1 R2 R3 ... Rm W2 W3 ... Wn
> >
> > Right?
>
> Yes, if what you are representing is the state of the queue at a given point of time (which implies R1..Rm must be distinct threads, not just the same thread doing repeated requests).

Yup, that's very typical.
> As new requests come in over time, one important point to remember is that each writer will see at most one batch of readers wake up ahead of it (though this batch may include readers that were originally queued both before and after it).

And that's *exactly* the problem with the changes you are proposing - rwsems will no longer provide strongly ordered write vs read barrier semantics.

> I find the name 'barrier' actually confusing when used to describe synchronous operations. To me a barrier is usually between asynchronous operations, and it is well defined which operations are ahead or behind of the barrier (either because they were queued by the same thread, or they were queued by different threads which may have synchronized together some other way).

When you have hundreds or thousands of threads doing IO to the one file, it doesn't matter if the IO is issued synchronously or asynchronously by the threads - you simply have a huge amount of data IO concurrency and very, very deep pipelines.

Inserting a metadata modification (truncate, preallocation, holepunch, etc) into that pipeline currently causes all new submissions to queue behind the metadata modification, waits for everything submitted before the metadata modification to complete and then runs the metadata modification. Once it completes, it then allows everything queued up to the next metadata modification to run.

That matches your definition of a barrier pretty well, I think.
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
Hi Dave,

On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner wrote:
> On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
>> - since all readers are woken at once, you might see bursts of direct IO operations followed by bursts of truncate operations, instead of having them interleaved in smaller groups as before.
>
> And this will result in the same application IO pattern giving vastly different results. e.g. a read IO promoted ahead of a truncate will now return data instead of -1 (beyond EOF).

I will reply to this part first, as I think this gives a more concrete anchor to the discussion.

The crucial point is that the truncate system call hasn't completed yet - that thread is still queued in the rwsem.

You still have the guarantee that the truncate will complete eventually (even if there is a never-ending stream of IOs), and that all IO system calls that start after the truncate system call completes will see the file as truncated.

You don't have guarantees about which system call will "appear to run before the other" if the execution times of the two system calls overlap each other, but you actually never had such a guarantee from a correctness perspective (i.e. you could never guarantee which of the two would get queued on the rwsem ahead of the other).

> So:
>
> R W1 W2
>
>> As our N-1 readers complete their IO operations, they might get queued again after W2,
>
> So:
>
> W1 W2 R
>
>> and then skip ahead of W2 when RN reaches the front of the queue. So, W2 is still guaranteed to run eventually, but with a worst case latency that is ~2x worse than before
>
> So, when W1 is released, the queue is treated as though it was:
>
> R W2
>
> yes?

Yes.

> Ok, so I can see where your latency figure comes from, but it's still a change of ordering in that W2 is no longer a barrier to the reads queued after it.
My claim is that it's not a visible change from a correctness perspective - if the extra readers R are submitted before W2 completes, then they can't assume their execution order vs W2. From a correctness perspective, the outcome won't look any different than if W2's thread had been delayed right before entering the truncate system call.

Now from a performance point of view, I concede there IS a difference, as this will happen more often than before. But from a correctness point of view, this is not a new possible outcome - it was already possible before.

> And if we extend that to WN, we have an interleaved queue
> similar to this:
>
>     W1 R1 W2 R2 W3 R3 ... Wn Rm
>
> By my reading of the algorithm you are proposing, after W1 is
> released, we end up with the queue being treated like this:
>
>     R1 R2 R3 ... Rm W2 W3 ... Wn
>
> Right?

Yes, if what you are representing is the state of the queue at a given point in time (which implies R1..Rm must be distinct threads, not just the same thread doing repeated requests).

As new requests come in over time, one important point to remember is that each writer will see at most one batch of readers wake up ahead of it (though this batch may include readers that were originally queued both before and after it). This guarantees that the writer won't get starved / will get its turn running eventually.

> If so, that is definitely a change of lock ordering semantics -
> writes do not have barrier properties anymore.

I find the name 'barrier' actually confusing when used to describe synchronous operations. To me a barrier is usually between asynchronous operations, and it is well defined which operations are ahead or behind of the barrier (either because they were queued by the same thread, or they were queued by different threads which may have synchronized together some other way). But in this case, we have two synchronous operations with overlapping execution times - they don't have a way to know which one is 'supposed to' be ahead of the other.
The two threads can't observe their relative ordering since they are blocked during the operation...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> On Sun, Mar 10, 2013 at 5:16 PM, Dave Chinner wrote:
> > On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
> >> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner wrote:
> >> > Isn't this a significant change of semantics for the rwsem? i.e.
> >> > that read lock requests that come after a write lock request now
> >> > jump ahead of the write lock request? i.e. the write lock request is
> >> > no longer a barrier in the queue?
> >>
> >> Yes, I am allowing readers to skip ahead of writers in the queue (but
> >> only if they can run with another reader that was already ahead).
> >>
> >> I don't see that this is a change of observable semantics for correct
> >> programs. If a reader and a writer both block on the rwsem, how do you
> >> know for sure which one got queued first? The rwsem API doesn't give you
> >> any easy way to know whether a thread is currently queued on the rwsem
> >> (it could also be descheduled before it gets onto the rwsem queue).
> >
> > There are algorithms that rely on write locks to act as read-side
> > barriers to prevent write side starvation. i.e. if you keep queuing
> > up read locks, the writer never gets the lock, thereby starving the
> > writer.
>
> What I propose actually isn't as bad as a reader preference rwlock
> like rwlock_t. When a reader makes it to the head of the queue, all
> readers get woken. At that point there are only writers left in the
> queue, and new readers will get queued after them and won't be able to
> skip over them (since they have no earlier reader to join up with).
> So, I am not throwing reader/writer fairness out the window here.

OK, but

> > My point isn't that XFS gets the serialisation wrong - my point is
> > that the change of queuing order will change the userspace visible
> > behaviour of the filesystem *significantly*.
> >
> > For example:
> >
> > - direct IO only takes read locks, while truncate takes a write lock.
> > Right now we can drop a truncate, preallocation or
> > hole punch operation into a stream of concurrent direct IO and have
> > it execute almost immediately - the barrier mechanism in the rwsem
> > ensures that it will be executed ahead of any future IO that is
> > issued by the application. With your proposed change, that operation
> > won't take place until all current *and all future* IOs stop and the
> > read lock is no longer held by anyone.
>
> You actually wouldn't get such starvation with my proposal.
>
> What you might see would be:
>
> - Assuming you have up to N concurrent reads in progress, a writer
> might see up to 2N-1 readers proceed in front of it. This limit is
> reached if you first had N-1 readers grabbing the rwsem with an empty
> queue, then another writer W1 got queued,

So something like:

    R W1

> then a reader RN (note that
> it will get queued after W1 and not run concurrently with the existing
> N-1 readers), then our writer W2 gets queued.

So:

    R W1 W2

> As our N-1 readers
> complete their IO operations, they might get queued again after W2,

So:

    W1 W2 R

> and then skip ahead of W2 when RN reaches the front of the queue. So,
> W2 is still guaranteed to run eventually, but with a worst case
> latency that is ~2x worse than before

So, when W1 is released, the queue is treated as though it was:

    R W2

yes?

Ok, so I can see where your latency figure comes from, but it's still a change of ordering in that W2 is no longer a barrier to the reads queued after it.

And if we extend that to WN, we have an interleaved queue similar to this:

    W1 R1 W2 R2 W3 R3 ... Wn Rm

By my reading of the algorithm you are proposing, after W1 is released, we end up with the queue being treated like this:

    R1 R2 R3 ... Rm W2 W3 ... Wn

Right?

If so, that is definitely a change of lock ordering semantics - writes do not have barrier properties anymore.
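The queue transformation being debated here can be sketched as a toy model (hypothetical helper name, not the kernel implementation): when the lock is released and the head waiter is a reader, every queued reader is woken in one batch, and only writers remain queued.

```python
def wake_first_batch(queue):
    """Toy model of the proposed wake-up rule: if the waiter at the head
    of the queue is a reader, wake every queued reader at once; if it is
    a writer, wake just that writer. Waiters are strings like 'R1'/'W2'.
    This is an illustrative sketch, not the kernel algorithm itself."""
    if not queue:
        return [], []
    if queue[0].startswith('W'):
        return [queue[0]], queue[1:]
    woken = [w for w in queue if w.startswith('R')]
    remaining = [w for w in queue if w.startswith('W')]
    return woken, remaining

# The interleaved queue from the mail above: once W1 is released, all
# queued readers jump ahead, leaving a writers-only queue behind.
queue = ['R1', 'W2', 'R2', 'W3', 'R3']
woken, remaining = wake_first_batch(queue)
print(woken)      # ['R1', 'R2', 'R3']
print(remaining)  # ['W2', 'W3']
```

Note that once the batch has been woken, the queue contains only writers, so a newly arriving reader has no earlier reader to join and must queue at the tail - which is the fairness property Michel appeals to.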
> - since all readers are woken at once, you might see bursts of direct
> IO operations followed by bursts of truncate operations, instead of
> having them interleaved in smaller groups as before.

And this will result in the same application IO pattern giving vastly different results. e.g. a read IO promoted ahead of a truncate will now return data instead of -1 (beyond EOF).

> I'm not sure if these characteristics are good enough for the XFS use
> case. It seems tougher than many rwsem use cases, since the benefits
> of increased reader parallelism are not as clear here (i.e. the IO

Well defined, predictable, deterministic ordering of operations takes far more precedence over parallelism when it comes to filesystem behaviour. The rwsems already have enough parallelism to allow us to drive more than 2 million 4k IOPS to a single file via multi-threaded direct IO(*), so we aren't limited by parallelism and concurrency in rwsems.

> So if
> this explanation still didn't make you
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Wed, 2013-03-06 at 15:21 -0800, Michel Lespinasse wrote:
> + retry_reader_grants:
> +	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> +	if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> +		/* A writer stole the lock. Undo our reader grants. */
> +		if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
> +			goto out;
> +		/* The writer left. Retry waking readers. */
> +		goto retry_reader_grants;
> +	}

This can be reduced to a single looping cmpxchg in the grant reversal path; then if reversing the grant fails, the count can simply be re-tested for grant success, rather than trying to atomically re-grant. For example, with a helper function, rwsem_cmpxchg():

static inline int rwsem_cmpxchg(long *old, long new, struct rw_semaphore *sem)
{
	long tmp = *old;
	*old = cmpxchg(&sem->count, *old, new);
	return tmp == *old;
}

... then the above becomes ...

	count = rwsem_atomic_update(adjustment, sem);
	do {
		if (count - adjustment >= RWSEM_WAITING_BIAS)
			break;
		if (rwsem_cmpxchg(&count, count - adjustment, sem))
			goto out; /* or simply return sem */
	} while (1);

	< wake up readers >

Also, this series and the original rwsem can mistakenly sleep reader(s) when the lock is transitioned from writer-owned to waiting readers-owned with no waiting writers. For example:

	CPU 0                               | CPU 1
	                                    |
	                                    | down_write()

... CPU 1 has the write lock for the semaphore. Meanwhile, 1 or more down_read(s) are attempted and fail; these are put on the wait list. Then ...

	down_read()                         | up_write()
	  local = atomic_update(+read_bias) |
	  local <= 0?                       |   local = atomic_update(-write_bias)
	  if (true)                         |   local < 0?
	    down_read_failed()              |   if (true)
	                                    |     wake()
	                                    |       grab wait_lock
	    wait for wait_lock              |       wake all readers
	                                    |     release wait_lock

... At this point, sem->count > 0 and the wait list is empty, but down_read_failed() will sleep the reader.

In this case, CPU 0 has observed the sem count with the write lock (and the other waiters) and so is detoured to down_read_failed().
But if CPU 0 can't grab the wait_lock before the up_write() does (via rwsem_wake()), then down_read_failed() will wake no one and sleep the reader. Unfortunately, this means readers and writers which observe the sem count after the adjustment is committed by CPU 0 in down_read_failed() will sleep as well, until the sem count returns to 0.

I think the best solution would be to preserve the observed count when down_read() fails and pass it to rwsem_down_read_failed() -- of course, this is also the most disruptive approach as it changes the per-arch interface (the attached patch does not include the arch changes). The other alternative is to go through the __rwsem_do_wake() path.

Regards,
Peter Hurley

--- >% ---
Subject: [PATCH] rwsem: Early-out tardy readers

Signed-off-by: Peter Hurley
---
 lib/rwsem.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index f9a5705..8eb2cdf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -118,12 +118,11 @@ __rwsem_do_wake(struct rw_semaphore *sem, bool wakewrite)
 /*
  * wait for the read lock to be granted
  */
-struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
+struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem, long count)
 {
 	signed long adjustment = -RWSEM_ACTIVE_READ_BIAS;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
-	signed long count;

 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
@@ -131,6 +130,20 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	get_task_struct(tsk);

 	raw_spin_lock_irq(&sem->wait_lock);
+
+	/* Try to reverse the lock attempt but if the count has changed
+	 * so that reversing fails, check if there are no waiters,
+	 * and early-out if not */
+	do {
+		if (rwsem_cmpxchg(&count, count + adjustment, sem))
+			break;
+		if (count > 0) {
+			raw_spin_unlock_irq(&sem->wait_lock);
+			put_task_struct(tsk);
+			return sem;
+		}
+	} while (1);
+
 	sem->wait_readers++;
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-- 
1.8.1.2
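Peter's single-cmpxchg reversal loop can be modelled outside the kernel. The sketch below (illustrative bias value and names; the real code uses RWSEM_WAITING_BIAS and per-arch atomics) mirrors the rwsem_cmpxchg() helper and the retry loop: readers are granted speculatively, and if a writer stole the lock in the meantime, the grant is undone with a single compare-and-swap.

```python
class Atomic:
    """Stand-in for sem->count with the two primitives used above."""
    def __init__(self, value):
        self.value = value

    def add_return(self, delta):
        """Models rwsem_atomic_update(): add and return the new value."""
        self.value += delta
        return self.value

    def cmpxchg(self, old, new):
        """Models cmpxchg(): store 'new' only if the current value equals
        'old'; always return the value observed before the attempt."""
        prev = self.value
        if prev == old:
            self.value = new
        return prev

def rwsem_cmpxchg(snapshot, new, count):
    """Python rendering of the rwsem_cmpxchg() helper above: 'snapshot'
    is a one-element list standing in for the C 'long *old' parameter,
    updated to the freshly observed count on failure."""
    tmp = snapshot[0]
    snapshot[0] = count.cmpxchg(snapshot[0], new)
    return tmp == snapshot[0]

WAITING_BIAS = -100   # illustrative stand-in for RWSEM_WAITING_BIAS

def try_reader_grant(count, adjustment):
    """The proposed loop: speculatively grant 'adjustment' reader slots,
    then back the grant out with one cmpxchg if a writer stole the lock."""
    snapshot = [count.add_return(adjustment)]
    while True:
        if snapshot[0] - adjustment >= WAITING_BIAS:
            return True                      # grant stands; wake readers
        if rwsem_cmpxchg(snapshot, snapshot[0] - adjustment, count):
            return False                     # grant reverted; bail out

sem = Atomic(WAITING_BIAS)          # waiters queued, lock not stolen
print(try_reader_grant(sem, 3))     # True: the reader grants stand

stolen = Atomic(WAITING_BIAS - 1)   # a writer got in first
print(try_reader_grant(stolen, 3))  # False: grant undone in one cmpxchg
print(stolen.value)                 # -101, back to the pre-grant count
```

If the cmpxchg fails because the count moved between the read and the swap, the loop simply re-tests the freshly observed count, which is exactly the reduction Peter describes.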
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
* Dave Chinner wrote:

> So, that's my concern - we've got 20 years of algorithms in XFS designed
> around rwsem write locks providing read-side barriers, and it's been
> down this road once before (over 15 years ago). [...]

We are certainly not going to break reader/writer fairness - that's a must for any generic, long-held reader-writer locking mechanism.

Thanks,

	Ingo
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Sun, Mar 10, 2013 at 5:16 PM, Dave Chinner wrote:
> On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
>> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner wrote:
>> > Isn't this a significant change of semantics for the rwsem? i.e.
>> > that read lock requests that come after a write lock request now
>> > jump ahead of the write lock request? i.e. the write lock request is
>> > no longer a barrier in the queue?
>>
>> Yes, I am allowing readers to skip ahead of writers in the queue (but
>> only if they can run with another reader that was already ahead).
>>
>> I don't see that this is a change of observable semantics for correct
>> programs. If a reader and a writer both block on the rwsem, how do you
>> know for sure which one got queued first? The rwsem API doesn't give you
>> any easy way to know whether a thread is currently queued on the rwsem
>> (it could also be descheduled before it gets onto the rwsem queue).
>
> There are algorithms that rely on write locks to act as read-side
> barriers to prevent write side starvation. i.e. if you keep queuing
> up read locks, the writer never gets the lock, thereby starving the
> writer.

What I propose actually isn't as bad as a reader preference rwlock like rwlock_t. When a reader makes it to the head of the queue, all readers get woken. At that point there are only writers left in the queue, and new readers will get queued after them and won't be able to skip over them (since they have no earlier reader to join up with). So, I am not throwing reader/writer fairness out the window here.

>> > XFS has long assumed that a rwsem write lock is a barrier that
>> > stops new read locks from being taken, and this change will break
>> > that assumption. Given that this barrier assumption is used as the
>> > basis for serialisation of operations like IO vs truncate, there's a
>> > bit more at stake than just improving parallelism here. i.e.
>> > IO issued after truncate/preallocate/hole punch could now be issued
>> > ahead of the pending metadata operation, whereas currently the IO
>> > issued after the pending metadata operation (which is waiting for the
>> > write lock) will only be processed -after- the metadata modification
>> > operation completes...
>> >
>> > That is a recipe for weird data corruption problems because
>> > applications are likely to have implicit dependencies on the barrier
>> > effect of metadata operations on data IO...
>>
>> I am confused as to exactly what XFS is doing, could you point me to
>> the code / indicate a scenario where this would go wrong? If you
>> really rely on this for correctness you'd have to do something already
>> to guarantee that your original queueing order is as desired, and I
>> just don't see how it'd be done...
>
> My point isn't that XFS gets the serialisation wrong - my point is
> that the change of queuing order will change the userspace visible
> behaviour of the filesystem *significantly*.
>
> For example:
>
> - direct IO only takes read locks, while truncate takes a write lock.
> Right now we can drop a truncate, preallocation or hole punch operation
> into a stream of concurrent direct IO and have it execute almost
> immediately - the barrier mechanism in the rwsem ensures that it will
> be executed ahead of any future IO that is issued by the application.
> With your proposed change, that operation won't take place until all
> current *and all future* IOs stop and the read lock is no longer held
> by anyone.

You actually wouldn't get such starvation with my proposal.

What you might see would be:

- Assuming you have up to N concurrent reads in progress, a writer might see up to 2N-1 readers proceed in front of it.
This limit is reached if you first had N-1 readers grabbing the rwsem with an empty queue, then another writer W1 got queued, then a reader RN (note that it will get queued after W1 and not run concurrently with the existing N-1 readers), then our writer W2 gets queued. As our N-1 readers complete their IO operations, they might get queued again after W2, and then skip ahead of W2 when RN reaches the front of the queue. So, W2 is still guaranteed to run eventually, but with a worst case latency that is ~2x worse than before.

- since all readers are woken at once, you might see bursts of direct IO operations followed by bursts of truncate operations, instead of having them interleaved in smaller groups as before.

I'm not sure if these characteristics are good enough for the XFS use case. It seems tougher than many rwsem use cases, since the benefits of increased reader parallelism are not as clear here (i.e. the IO operations still have to hit disk so you don't get much further benefit if you already had more IO parallelism than platters). So if this explanation still didn't make you comfortable with the proposal, I will rework it to avoid the queue reordering.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
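The 2N-1 worst case above can be checked with a small discrete simulation of the proposed discipline (a toy model with made-up class and method names, not the kernel algorithm), driving exactly the sequence described, for N = 3:

```python
class Rwsem:
    """Toy model of the proposed discipline: a waiter acquires
    immediately only when nothing is queued and the lock state allows
    it; on release, if the head waiter is a reader, ALL queued readers
    are woken at once; if it is a writer, only that writer is woken."""
    def __init__(self):
        self.active = []            # names currently holding the lock
        self.writer_active = False
        self.queue = []             # FIFO of (kind, name)

    def lock(self, kind, name):
        free = not self.queue and not self.writer_active
        if free and (kind == 'R' or not self.active):
            self.active.append(name)
            self.writer_active = (kind == 'W')
        else:
            self.queue.append((kind, name))

    def unlock(self, name, done):
        self.active.remove(name)
        done.append(name)           # this critical section finished
        if self.active:
            return
        self.writer_active = False
        if not self.queue:
            return
        if self.queue[0][0] == 'W':
            _, w = self.queue.pop(0)
            self.active, self.writer_active = [w], True
        else:                       # head is a reader: wake the batch
            self.active = [n for k, n in self.queue if k == 'R']
            self.queue = [(k, n) for k, n in self.queue if k == 'W']

N, sem, done = 3, Rwsem(), []
for r in ('R1', 'R2'):
    sem.lock('R', r)                # N-1 readers hold the lock
sem.lock('W', 'W1')                 # W1 queues behind them
sem.lock('R', 'R3')                 # R3 queues behind W1 (no reader ahead)
sem.lock('W', 'W2')                 # W2 queues; count readers passing it
sem.unlock('R1', done); sem.lock('R', "R1'")  # R1 finishes, re-queues
sem.unlock('R2', done); sem.lock('R', "R2'")  # last holder out -> W1 runs
sem.unlock('W1', done)              # head now R3: batch-wake R3, R1', R2'
for r in ('R3', "R1'", "R2'"):
    sem.unlock(r, done)             # batch drains -> W2 finally runs
sem.unlock('W2', done)

print(done)  # ['R1', 'R2', 'W1', 'R3', "R1'", "R2'", 'W2']
ahead = [t for t in done[:done.index('W2')] if t.startswith('R')]
print(len(ahead) == 2 * N - 1)      # True: 5 readers ran after W2 queued
```

The simulation also shows the fairness bound: after the single batch wake, only W2 remains queued, so any further reader would have to queue behind it.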
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner wrote:
> > On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
> >> When the first queued waiter is a reader, wake all readers instead of
> >> just those that are at the front of the queue. There are really two
> >> motivations for this change:
> >
> > Isn't this a significant change of semantics for the rwsem? i.e.
> > that read lock requests that come after a write lock request now
> > jump ahead of the write lock request? i.e. the write lock request is
> > no longer a barrier in the queue?
>
> Yes, I am allowing readers to skip ahead of writers in the queue (but
> only if they can run with another reader that was already ahead).
>
> I don't see that this is a change of observable semantics for correct
> programs. If a reader and a writer both block on the rwsem, how do you
> know for sure which one got queued first? The rwsem API doesn't give you
> any easy way to know whether a thread is currently queued on the rwsem
> (it could also be descheduled before it gets onto the rwsem queue).

There are algorithms that rely on write locks to act as read-side barriers to prevent write side starvation. i.e. if you keep queuing up read locks, the writer never gets the lock, thereby starving the writer.

> But yes, if you're making assumptions about queuing order the change
> makes it more likely that they'll be observably wrong.
>
> > XFS has long assumed that a rwsem write lock is a barrier that
> > stops new read locks from being taken, and this change will break
> > that assumption. Given that this barrier assumption is used as the
> > basis for serialisation of operations like IO vs truncate, there's a
> > bit more at stake than just improving parallelism here. i.e.
> > IO issued after truncate/preallocate/hole punch could now be issued
> > ahead of the pending metadata operation, whereas currently the IO
> > issued after the pending metadata operation (which is waiting for the
> > write lock) will only be processed -after- the metadata modification
> > operation completes...
> >
> > That is a recipe for weird data corruption problems because
> > applications are likely to have implicit dependencies on the barrier
> > effect of metadata operations on data IO...
>
> I am confused as to exactly what XFS is doing, could you point me to
> the code / indicate a scenario where this would go wrong? If you
> really rely on this for correctness you'd have to do something already
> to guarantee that your original queueing order is as desired, and I
> just don't see how it'd be done...

My point isn't that XFS gets the serialisation wrong - my point is that the change of queuing order will change the userspace visible behaviour of the filesystem *significantly*.

For example:

- direct IO only takes read locks, while truncate takes a write lock. Right now we can drop a truncate, preallocation or hole punch operation into a stream of concurrent direct IO and have it execute almost immediately - the barrier mechanism in the rwsem ensures that it will be executed ahead of any future IO that is issued by the application. With your proposed change, that operation won't take place until all current *and all future* IOs stop and the read lock is no longer held by anyone.

To put this in context, here's the Irix XFS iolock initialisation code from mid 1997, where the barrier semantics are first explicitly documented:

	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", (long)vp->v_number);

It was only coded like this in 1997, because the Irix multi-reader lock only grew read-side queue jumping and priority inheritance at this time.
This changed the default behaviour from write lock barrier semantics to something similar to what you are proposing for rwsems right now. All the locks that relied on barrier semantics of write locks had to be configured explicitly to have that behaviour, rather than implicitly relying on the fact that mrlocks had provided write barrier semantics.

So, that's my concern - we've got 20 years of algorithms in XFS designed around rwsem write locks providing read-side barriers, and it's been down this road once before (over 15 years ago). Changing the semantics of rwsems is going to break stuff in subtle and unpredictable ways.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner wrote:
> On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
>> When the first queued waiter is a reader, wake all readers instead of
>> just those that are at the front of the queue. There are really two
>> motivations for this change:
>
> Isn't this a significant change of semantics for the rwsem? i.e.
> that read lock requests that come after a write lock request now
> jump ahead of the write lock request? i.e. the write lock request is
> no longer a barrier in the queue?

Yes, I am allowing readers to skip ahead of writers in the queue (but only if they can run with another reader that was already ahead).

I don't see that this is a change of observable semantics for correct programs. If a reader and a writer both block on the rwsem, how do you know for sure which one got queued first? The rwsem API doesn't give you any easy way to know whether a thread is currently queued on the rwsem (it could also be descheduled before it gets onto the rwsem queue).

But yes, if you're making assumptions about queuing order, the change makes it more likely that they'll be observably wrong.

> XFS has long assumed that a rwsem write lock is a barrier that
> stops new read locks from being taken, and this change will break
> that assumption. Given that this barrier assumption is used as the
> basis for serialisation of operations like IO vs truncate, there's a
> bit more at stake than just improving parallelism here. i.e. IO
> issued after truncate/preallocate/hole punch could now be issued
> ahead of the pending metadata operation, whereas currently the IO
> issued after the pending metadata operation (which is waiting for the
> write lock) will only be processed -after- the metadata modification
> operation completes...
>
> That is a recipe for weird data corruption problems because
> applications are likely to have implicit dependencies on the barrier
> effect of metadata operations on data IO...
I am confused as to exactly what XFS is doing, could you point me to the code / indicate a scenario where this would go wrong? If you really rely on this for correctness you'd have to do something already to guarantee that your original queueing order is as desired, and I just don't see how it'd be done...

That said, it is doable to add support for write lock stealing in the rwsem write path while still preserving the queueing order of readers vs writers; I'm just not sure that I fully understand the correctness concern at this point.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
> When the first queued waiter is a reader, wake all readers instead of
> just those that are at the front of the queue. There are really two
> motivations for this change:

Isn't this a significant change of semantics for the rwsem? i.e. that read lock requests that come after a write lock request now jump ahead of the write lock request? i.e. the write lock request is no longer a barrier in the queue?

XFS has long assumed that a rwsem write lock is a barrier that stops new read locks from being taken, and this change will break that assumption. Given that this barrier assumption is used as the basis for serialisation of operations like IO vs truncate, there's a bit more at stake than just improving parallelism here. i.e. IO issued after truncate/preallocate/hole punch could now be issued ahead of the pending metadata operation, whereas currently the IO issued after the pending metadata operation (which is waiting for the write lock) will only be processed -after- the metadata modification operation completes...

That is a recipe for weird data corruption problems because applications are likely to have implicit dependencies on the barrier effect of metadata operations on data IO...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
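For contrast, the pre-patch behaviour Dave is relying on can be sketched the same way (a toy model with a hypothetical helper name, not the kernel code): on release, only the head waiter is woken - either one writer, or the contiguous run of readers before the first queued writer - so readers queued behind a writer stay blocked, which is the barrier effect described above.

```python
def wake_front_readers(queue):
    """Toy model of the CURRENT wake-up rule: wake the head writer
    alone, or the contiguous readers at the head of the queue up to the
    first queued writer. Waiters are strings like 'R1'/'W1'."""
    woken = []
    for waiter in queue:
        if waiter.startswith('W'):
            if not woken:
                woken.append(waiter)   # head is a writer: wake it alone
            break
        woken.append(waiter)           # extend the head run of readers
    return woken, queue[len(woken):]

# R3 queued behind the writer W1 is NOT woken: W1 acts as a barrier
# between R1/R2 and R3, which is the ordering XFS depends on.
print(wake_front_readers(['R1', 'R2', 'W1', 'R3']))
# (['R1', 'R2'], ['W1', 'R3'])
```

Comparing this rule with the proposed wake-all-readers rule on the same queue makes the semantic change under discussion concrete: here R3 waits behind W1, whereas under the proposal R3 would be woken together with R1 and R2.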