Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-19 Thread Michel Lespinasse
On Mon, Mar 18, 2013 at 6:17 PM, Dave Chinner  wrote:
> On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote:
>> On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
>> > We don't care about the ordering between multiple concurrent
>> > metadata modifications - what matters is whether the ongoing data IO
>> > around them is ordered correctly.
>>
>> Dave,
>>
>> The point that Michel is making is that there never was any ordering
>> guarantee by rwsem. It's an illusion.
>
> Weasel words.

Whoaaa, calm down.

You initially made one false statement (that the change meant a stream
of readers would starve a writer forever) and one imprecise statement
(that rwsem used to guarantee that readers don't skip ahead of writers
- this may be true in practice for your use case because the latencies
involved are very large compared to scheduling latencies, but that's a
very important qualification that needs to be added here). That
confused me enough that I initially couldn't tell what your actual
concern was, so I pointed out the source of my confusion and asked you
to clarify. It seems unfair to characterize that as "weasel words" -
I'm not trying to be a smartass here, but only to actually understand
your concern.

>> The reason is simple: to even get to the lock the cpu has to be
> sleep-able. So every submission that you believe is ordered is by
> its very nature __not ordered__, even when used by kernel code.
>>
>> Why? Because any thread on its way to claim the lock (reader or writer)
>> could be pre-empted for some other task, thus delaying the submission of
>> whatever i/o you believed to be ordered.
>
> You think I don't know this?  You're arguing fine grained, low level
> behaviour between tasks is unpredictable. I get that. I understand
> that. But I'm not arguing about fine-grained, low level, microsecond
> semantics of the locking order.
>
> What you (and Michel) appear to be failing to see is what happens
> on a macro level when you have read locks being held for periods
> measured in *seconds* (e.g. direct IO gets queued behind a few
> thousand other IOs in the elevator waiting for a request slot),
> and the subsequent effect of inserting an operation that requires a
> write lock into that IO stream.
>
> IOWs, it simply doesn't matter if there's a micro-level race between
> the write lock and a couple of the readers. That's the level you
> guys are arguing at but it simply does not matter in the cases I'm
> describing. I'm talking about high level serialisation behaviours
> that might take *seconds* to play out and the ordering behaviours
> observed at that scale.
>
> That is, I don't care if a couple of threads out of a few thousand
> race with the write lock over a few tens to hundreds of microseconds,
> but I most definitely care if a few thousand IOs issued seconds
> after the write lock is queued jump over the write lock. That is a
> gross behavioural change at the macro-level.

Understood. I accepted your concern and made sure my v2 proposal
doesn't do such macro level reordering.

>> So just to reiterate: there is no 'queue' and no 'barrier'. The
>> guarantees that rwsem makes are;
>> 1. Multiple readers can own the lock.
>> 2. Only a single writer can own the lock.
>> 3. Readers will not starve writers.
>
> You've conveniently ignored the fact that the current implementation
> also provides the following guarantee:
>
> 4. new readers will block behind existing writers

In your use case, with large enough queue latencies, yes.

Please don't make it sound like this applies in every use case - it
has never applied for short (relative to scheduling latencies) hold times.


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-18 Thread Dave Chinner
On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote:
> On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
> > We don't care about the ordering between multiple concurrent
> > metadata modifications - what matters is whether the ongoing data IO
> > around them is ordered correctly.
> 
> Dave,
> 
> The point that Michel is making is that there never was any ordering
> guarantee by rwsem. It's an illusion.

Weasel words.

> The reason is simple: to even get to the lock the cpu has to be
> sleep-able. So every submission that you believe is ordered is by
> its very nature __not ordered__, even when used by kernel code.
>
> Why? Because any thread on its way to claim the lock (reader or writer)
> could be pre-empted for some other task, thus delaying the submission of
> whatever i/o you believed to be ordered.

You think I don't know this?  You're arguing fine grained, low level
behaviour between tasks is unpredictable. I get that. I understand
that. But I'm not arguing about fine-grained, low level, microsecond
semantics of the locking order.

What you (and Michel) appear to be failing to see is what happens
on a macro level when you have read locks being held for periods
measured in *seconds* (e.g. direct IO gets queued behind a few
thousand other IOs in the elevator waiting for a request slot),
and the subsequent effect of inserting an operation that requires a
write lock into that IO stream.

IOWs, it simply doesn't matter if there's a micro-level race between
the write lock and a couple of the readers. That's the level you
guys are arguing at but it simply does not matter in the cases I'm
describing. I'm talking about high level serialisation behaviours
that might take *seconds* to play out and the ordering behaviours
observed at that scale.

That is, I don't care if a couple of threads out of a few thousand
race with the write lock over a few tens to hundreds of microseconds,
but I most definitely care if a few thousand IOs issued seconds
after the write lock is queued jump over the write lock. That is a
gross behavioural change at the macro-level.

> So just to reiterate: there is no 'queue' and no 'barrier'. The
> guarantees that rwsem makes are;
> 1. Multiple readers can own the lock.
> 2. Only a single writer can own the lock.
> 3. Readers will not starve writers.

You've conveniently ignored the fact that the current implementation
also provides the following guarantee:

4. new readers will block behind existing writers

And that's the behaviour we currently depend on, whether you like it
or not.

> Where lock policy can have a significant impact is on performance. But
> predicting that impact is difficult -- it's better just to measure.

Predicting the impact in this case is trivial - it's obvious that
ordering of operations will change and break high level assumptions
that userspace currently makes about various IO operations on XFS
filesystems.

> It's not my intention to convince you (or anyone else) that there should
> only be One True Rwsem, because I don't believe that. But I didn't want
> the impression to persist that rwsem does anything more than implement a
> fair reader/writer semaphore.

I'm sorry, but redefining "fair" to suit your own needs doesn't
convince me of anything. rwsem behaviour has been unchanged for at
least 10 years and hence the current implementation defines what is
"fair", not what you say is fair

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-14 Thread Michel Lespinasse
On Thu, Mar 14, 2013 at 4:39 AM, Peter Hurley  wrote:
> On Thu, 2013-03-14 at 00:03 -0700, Michel Lespinasse wrote:
>> >   CPU 0   |  CPU 1
>> >   |
>> >   | down_write()
>> >
>> > ... CPU 1 has the write lock for the semaphore.
>> > Meanwhile, 1 or more down_read(s) are attempted and fail;
>> > these are put on the wait list.
>>
>> I'm not sure of the relevance of these other down_read() calls -
>> please note that as these extra readers are put on the wait list,
>> their +read_bias adjustments are canceled so that the count value
>> ends up at write_bias + waiting_bias (for a total active count of 1)
>
> The relevance of the waiting readers is that when CPU 1 drops the writer
> ___and grabs the spin lock___, it then wakes up these already-waiting
> readers (CPU 0 is still parked in down_read_failed() waiting to acquire
> the spin lock).
>
> When CPU 1 wakes up these readers, the sem count goes > 0 and the
> waiting list is emptied. CPU 1 then drops the spin lock and leaves
> up_write().

Ah, correct. So the race you noticed is real, and the one I found is
just a more complicated way to hit it.

I'll add a fix in v2 of this patch series (probably later today).

BTW, you seem to have been paying pretty close attention, would you
mind giving appropriate Reviewed-by tags when I send v2 ?


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-14 Thread Peter Hurley
On Thu, 2013-03-14 at 00:03 -0700, Michel Lespinasse wrote:
> On Mon, Mar 11, 2013 at 04:36:47PM -0400, Peter Hurley wrote:
> > 
> > On Wed, 2013-03-06 at 15:21 -0800, Michel Lespinasse wrote:
> > > + retry_reader_grants:
> > > +        oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > > +        if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > > +                /* A writer stole the lock.  Undo our reader grants. */
> > > +                if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
> > > +                        goto out;
> > > +                /* The writer left.  Retry waking readers. */
> > > +                goto retry_reader_grants;
> > > +        }
> > 
> > This can be reduced to a single looping cmpxchg in the grant reversal
> > path; then if reversing the grant fails, the count can simply be
> > re-tested for grant success, rather than trying to atomically re-grant.
> > For example, with a helper function, rwsem_cmpxchg():
> > 
> > static inline int rwsem_cmpxchg(long *old, long new, struct rw_semaphore *sem)
> > {
> >         long tmp = *old;
> >         *old = cmpxchg(&sem->count, *old, new);
> >         return tmp == *old;
> > }
> > 
> > ... then above becomes ...
> > 
> >         count = rwsem_atomic_update(adjustment, sem);
> >         do {
> >                 if (count - adjustment >= RWSEM_WAITING_BIAS)
> >                         break;
> >                 if (rwsem_cmpxchg(&count, count - adjustment, sem))
> >                         goto out;   /* or simply return sem */
> >         } while (1);
> > 
> > < wake up readers >
> 
> This would work, but I don't see a benefit as we still end up with a
> retry loop. Also, the loop would have to retry whenever the counter
> value changed instead of only when writer(s) appear or are removed.

It's a minor optimization in that, while the count is unstable, this CPU
will stay here and while it does so, it's only inflicting one bus lock
rather than two. Feel free to ignore it.

> > Also, this series and the original rwsem can mistakenly sleep reader(s)
> > when the lock is transitioned from writer-owned to waiting readers-owned
> > with no waiting writers. For example,
> > 
> > 
> >   CPU 0   |  CPU 1
> >   |
> >   | down_write()
> > 
> > ... CPU 1 has the write lock for the semaphore.
> > Meanwhile, 1 or more down_read(s) are attempted and fail;
> > these are put on the wait list.
> 
> I'm not sure of the relevance of these other down_read() calls -
> please note that as these extra readers are put on the wait list,
> their +read_bias adjustments are canceled so that the count value
> ends up at write_bias + waiting_bias (for a total active count of 1)

The relevance of the waiting readers is that when CPU 1 drops the writer
___and grabs the spin lock___, it then wakes up these already-waiting
readers (CPU 0 is still parked in down_read_failed() waiting to acquire
the spin lock).

When CPU 1 wakes up these readers, the sem count goes > 0 and the
waiting list is emptied. CPU 1 then drops the spin lock and leaves
up_write().

CPU 0 has been spinning on the wait_lock during this time. It now
acquires the lock, discovers the wait list is empty, and ups the
adjustment (+wait bias). The count is already (+ # of waiting readers) so
it never does the wake up.

> > Then ...
> > 
> > down_read()   | up_write()
> >   local = atomic_update(+read_bias)   |
> >   local <= 0? |   local = atomic_update(-write_bias)
> >   if (true)   |   local < 0?
> >  down_read_failed()   |   if (true)
> >   |  wake()
> >   | grab wait_lock
> > wait for wait_lock| wake all readers
> >   | release wait_lock
> > 
> > ... At this point, sem->count > 0 and the wait list is empty,
> > but down_read_failed() will sleep the reader.
> > 
> > In this case, CPU 0 has observed the sem count with the write lock (and
> > the other waiters) and so is detoured to down_read_failed(). But if 
> > CPU 0 can't grab the wait_lock before the up_write() does (via
> > rwsem_wake()), then down_read_failed() will wake no one and sleep the
> > reader.
> 
> Actually - down_read_failed() will insert the reader on the wait_list
> and do an atomic update(-read_bias). If there are no active lockers at
> the time of that atomic update, it calls __rwsem_do_wake() to unqueue
> the first queued waiter - which may well be itself.
> 
> In your specific example, __rwsem_do_wake() will wake the previously queued
> readers as well as current.

Hopefully my comments above make it clear that there are no queued
readers once CPU 0 can advance with the spin lock in down_read_failed().

> > Unfortunately, this means readers and writers which observe the sem
> > count after the adjustment is committed by CPU 0 in down_read_failed()
> > will sleep as well, until the sem count returns to 0.

Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-14 Thread Michel Lespinasse
On Mon, Mar 11, 2013 at 04:36:47PM -0400, Peter Hurley wrote:
> 
> On Wed, 2013-03-06 at 15:21 -0800, Michel Lespinasse wrote:
> > + retry_reader_grants:
> > +   oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > +   if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > +   /* A writer stole the lock.  Undo our reader grants. */
> > +   if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
> > +   goto out;
> > +   /* The writer left.  Retry waking readers. */
> > +   goto retry_reader_grants;
> > +   }
> 
> This can be reduced to single looping cmpxchg in the grant reversal
> path; then if reversing the grant fails, the count can simply be
> re-tested for grant success, rather than trying to atomically re-grant.
> For example, with a helper function, rwsem_cmpxchg():
> 
> static inline int rwsem_cmpxchg(long *old, long new, struct rw_semaphore *sem)
> {
>   long tmp = *old;
>   *old = cmpxchg(&sem->count, *old, new);
>   return tmp == *old;
> }
> 
> ... then above becomes ...
> 
>   count = rwsem_atomic_update(adjustment, sem);
>   do {
>   if (count - adjustment >= RWSEM_WAITING_BIAS)
>   break;
>   if (rwsem_cmpxchg(&count, count - adjustment, sem))
>   goto out;   /* or simply return sem */
>   } while (1);
> 
>   < wake up readers >

This would work, but I don't see a benefit as we still end up with a
retry loop. Also, the loop would have to retry whenever the counter
value changed instead of only when writer(s) appear or are removed.

> Also, this series and the original rwsem can mistakenly sleep reader(s)
> when the lock is transitioned from writer-owned to waiting readers-owned
> with no waiting writers. For example,
> 
> 
>   CPU 0   |  CPU 1
>   |
>   | down_write()
> 
> ... CPU 1 has the write lock for the semaphore.
> Meanwhile, 1 or more down_read(s) are attempted and fail;
> these are put on the wait list.

I'm not sure of the relevance of these other down_read() calls -
please note that as these extra readers are put on the wait list,
their +read_bias adjustments are canceled so that the count value
ends up at write_bias + waiting_bias (for a total active count of 1)
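
(To make the bias arithmetic concrete, here is a minimal compilable
userspace sketch of that bookkeeping. The constants are the 64-bit x86
values from asm/rwsem.h - they are per-architecture, and this is only
an illustration, not the kernel code:)

#include <stdio.h>

#define RWSEM_ACTIVE_MASK       0xffffffffL
#define RWSEM_ACTIVE_BIAS       0x00000001L
#define RWSEM_WAITING_BIAS      (-RWSEM_ACTIVE_MASK - 1)
#define RWSEM_ACTIVE_READ_BIAS  RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

int main(void)
{
        long count = 0;

        count += RWSEM_ACTIVE_WRITE_BIAS;   /* CPU 1: down_write() */

        /* First reader fails: its +read_bias is taken back, and the
         * waiting bias is added because the wait list was empty. */
        count += RWSEM_ACTIVE_READ_BIAS;
        count += -RWSEM_ACTIVE_READ_BIAS + RWSEM_WAITING_BIAS;

        /* Each further queued reader cancels its own +read_bias: */
        count += RWSEM_ACTIVE_READ_BIAS;
        count += -RWSEM_ACTIVE_READ_BIAS;

        /* count == write_bias + waiting_bias */
        printf("active lockers = %ld\n", count & RWSEM_ACTIVE_MASK);
        return 0;
}

This prints "active lockers = 1": only the writer is counted as active,
no matter how many readers are parked on the wait list.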

> Then ...
> 
> down_read()   | up_write()
>   local = atomic_update(+read_bias)   |
>   local <= 0? |   local = atomic_update(-write_bias)
>   if (true)   |   local < 0?
>  down_read_failed()   |   if (true)
>   |  wake()
>   | grab wait_lock
> wait for wait_lock| wake all readers
>   | release wait_lock
> 
> ... At this point, sem->count > 0 and the wait list is empty,
> but down_read_failed() will sleep the reader.
> 
> In this case, CPU 0 has observed the sem count with the write lock (and
> the other waiters) and so is detoured to down_read_failed(). But if 
> CPU 0 can't grab the wait_lock before the up_write() does (via
> rwsem_wake()), then down_read_failed() will wake no one and sleep the
> reader.

Actually - down_read_failed() will insert the reader on the wait_list
and do an atomic update(-read_bias). If there are no active lockers at
the time of that atomic update, it calls __rwsem_do_wake() to unqueue
the first queued waiter - which may well be itself.

In your specific example, __rwsem_do_wake() will wake the previously queued
readers as well as current.

> Unfortunately, this means readers and writers which observe the sem
> count after the adjustment is committed by CPU 0 in down_read_failed()
> will sleep as well, until the sem count returns to 0.

Note that the active count does go back to 0 in your example.

However, thinking about it made me consider the following case:

CPU 0                   CPU 1                                   CPU 2

down_write()

                        down_read()
                          local = atomic_update(+read_bias)

up_write()

                                                                down_read()

                        if (local <= 0)
                          down_read_failed()

At this point CPU 0 doesn't hold the write lock anymore;
CPU 2 grabbed a read lock;
CPU 1 should grab an additional read lock but it won't - instead, it will
queue itself (and get woken by CPU 2 when that read lock is released).

This is indeed slightly suboptimal. Could be fixed in an additional patch.
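
(Tracing the count through that interleaving as a compilable sketch,
with the same illustrative 64-bit x86 bias constants - not kernel code:)

#include <stdio.h>

#define RWSEM_ACTIVE_MASK       0xffffffffL
#define RWSEM_WAITING_BIAS      (-RWSEM_ACTIVE_MASK - 1)
#define RWSEM_ACTIVE_READ_BIAS  1L
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + 1L)

int main(void)
{
        long count = 0, cpu1_local;

        count += RWSEM_ACTIVE_WRITE_BIAS;              /* CPU 0: down_write() */
        cpu1_local = count += RWSEM_ACTIVE_READ_BIAS;  /* CPU 1's snapshot    */
        count -= RWSEM_ACTIVE_WRITE_BIAS;              /* CPU 0: up_write()   */
        count += RWSEM_ACTIVE_READ_BIAS;               /* CPU 2: down_read()  */

        /* CPU 1 acts on its stale snapshot and queues itself, although
         * the lock is by now held by readers only: */
        printf("cpu1 saw %ld (<= 0), count is now %ld\n", cpu1_local, count);
        return 0;
}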

As you mentioned, this is not any new issue caused by this patch series
though - it's been there forever as far as I know.

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-13 Thread Peter Hurley
On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
> On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> > Hi Dave,
> > 
> > On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner  wrote:
> > > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> > >> - since all readers are woken at once, you might see burst of direct
> > >> IO operations followed by bursts of truncate operations, instead of
> > >> having them interleaved in smaller groups as before.
> > >
> > > And this will result in the same application IO pattern giving
> > > vastly different results. e.g. a read IO promoted ahead of a
> > > truncate will now return data instead of -1 (beyond EOF).
> > 
> > I will reply to this part first, as I think this gives a more concrete
> > anchor to the discussion.
> > 
> > The crucial point is that the truncate system call hasn't completed
> > yet - that thread is still queued in the rwsem.
> > 
> > You still have the guarantee that the truncate will complete
> > eventually (even if there is a never-ending stream of IOs), and that
> > all IO system calls that start after the truncate system call
> > completes will see the file as truncated.
> 
> Sure, but the problem is not about when the syscall completes - the
> problem is that you are changing where in the pipeline of IO the
> truncate is initially executed.  i.e. ordering is not defined by
> when an operation completes, but by the order in which the queue is
> processed after a blocking operation completes. That is not when the
> syscall completes, but when the filesystem drops the exclusive lock.
> 
> From a single threaded userspace application perspective or
> multithreaded apps with their own userspace locking, truncates
> complete when either the syscall completes or some time after when
> the application drops its lock. Either way, we're looking at
> completion time serialisation, and in that case what XFS or the
> kernel does simply doesn't matter.
> 
> However, if we are talking about userspace applications that use
> lockless IO algorithms or kernel side applications (like knfsd) that
> are exposed directly to the filesystem locking semantics,
> serialisation behaviour is actually defined by filesystem's
> submission side locking behaviour. There is no external
> serialisation of IO completion, so serialisation and ordering is
> defined (for XFS) solely by the rwsem behaviour.
> 
> > You don't have guarantees about which system call will "appear to run
> > before the other" if the execution times of the two system calls
> > overlap each other, but you actually never had such a guarantee from a
> > correctness perspective (i.e. you could never guarantee which of the
> > two would get queued on the rwsem ahead of the other).
> 
> Sure, but as long as the submission side ordering is deterministic,
> it doesn't matter.
> 
> > > Ok, so I can see where your latency figure comes from, but it's
> > > still a change of ordering in that W2 is no longer a barrier to the
> > > reads queued after it.
> > 
> > My claim is that it's not a visible change from a correctness
> > perspective
> 
> I am not arguing that it is incorrect. I'm arguing that the change
> of ordering semantics breaks assumptions a lot of code makes about
> how these locks work.
> 
> > > similar to this:
> > >
> > > W1 R1 W2 R2 W3 R3 ... Wn Rm
> > >
> > > By my reading of the algorithm you are proposing, after W1 is
> > > released, we end up with the queue being treated like this:
> > >
> > > R1 R2 R3 ... Rm W2 W3 ... Wn
> > >
> > > Right?
> > 
> > Yes, if what you are representing is the state of the queue at a given
> > point of time (which implies R1..Rm must be distinct threads, not just
> > the same thread doing repeated requests).
> 
> Yup, that's very typical.
> 
> > As new requests come in over time, one important point to remember is
> > that each writer will see at most one batch of readers wake up ahead
> > of it (though this batch may include readers that were originally
> > queued both before and after it).
> 
> And that's *exactly* the problem with the changes you are proposing
> - rwsems will no longer provide strongly ordered write vs read
> barrier semantics.
> 
> > I find the name 'barrier' actually confusing when used to describe
> > synchronous operations.  To me a barrier is usually between
> > asynchronous operations, and it is well defined which operations
> > are ahead or behind of the barrier (either because they were
> > queued by the same thread, or they were queued by different
> > threads which may have synchronized together some other way).
> 
> When you have hundreds or thousands of threads doing IO to the one
> file, it doesn't matter if the IO is issued synchronously or
> asynchronously by the threads - you simply have a huge amount of
> data IO concurrency and very, very deep pipelines.
> 
> Inserting a metadata modification (truncate, preallocation,
> holepunch, etc) into that pipeline currently causes all new
> submissions to queue behind the metadata modification, waits for
> everything submitted before the metadata modification to complete
> and then runs the metadata modification. Once it completes, it then
> allows everything queued up to the next metadata modification to
> run...

Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-13 Thread Michel Lespinasse
On Tue, Mar 12, 2013 at 8:23 PM, Dave Chinner  wrote:
> On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
>> I find the name 'barrier' actually confusing when used to describe
>> synchronous operations.  To me a barrier is usually between
>> asynchronous operations, and it is well defined which operations
>> are ahead or behind of the barrier (either because they were
>> queued by the same thread, or they were queued by different
>> threads which may have synchronized together some other way).
>
> When you have hundreds or thousands of threads doing IO to the one
> file, it doesn't matter if the IO is issued synchronously or
> asynchronously by the threads - you simply have a huge amount of
> data IO concurrency and very, very deep pipelines.
>
> Inserting a metadata modification (truncate, preallocation,
> holepunch, etc) into that pipeline currently causes all new
> submissions to queue behind the metadata modification, waits for
> everything submitted before the metadata modification to complete
> and then runs the metadata modification. Once it completes, it then
> allows everything queued up to the next metadata modification to
> run...
>
> That matches your definition of a barrier pretty well, I think.

The difference I see is that with async operations one can easily tell
when the desired operations made it into the queue:

t1()
async_submit(request X)
t2()
..
..
..
(eventually) request X completes

After t2 we know the operation is queued, and thus we can legitimately
expect that new operations submitted after t2 should logically run
after request X.


With sync operations we can't make that observation as the caller
thread is blocked:

t1()
begin system call
..
..
request X gets queued
..
..
request X completes
..
..
system call completes
t2()

Userspace can't pinpoint exactly when the request got onto the queue.

If another request Y gets submitted between t1 and t2, userspace can
never know for sure whether Y was before or after X in the queue, so the
proposed reordering should (my claim) not be observable by userspace,
in the sense that it never produces a result that wasn't already
possible before.




Anyway, I didn't intend for the proposed change to warrant a thread of
this size. I will abandon patch 10 and propose a different (slightly
longer, but without reordering) change to replace patch 11.

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-12 Thread Dave Chinner
On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> Hi Dave,
> 
> On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner  wrote:
> > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> >> - since all readers are woken at once, you might see burst of direct
> >> IO operations followed by bursts of truncate operations, instead of
> >> having them interleaved in smaller groups as before.
> >
> > And this will result in the same application IO pattern giving
> > vastly different results. e.g. a read IO promoted ahead of a
> > truncate will now return data instead of -1 (beyond EOF).
> 
> I will reply to this part first, as I think this gives a more concrete
> anchor to the discussion.
> 
> The crucial point is that the truncate system call hasn't completed
> yet - that thread is still queued in the rwsem.
> 
> You still have the guarantee that the truncate will complete
> eventually (even if there is a never-ending stream of IOs), and that
> all IO system calls that start after the truncate system call
> completes will see the file as truncated.

Sure, but the problem is not about when the syscall completes - the
problem is that you are changing where in the pipeline of IO the
truncate is initially executed.  i.e. ordering is not defined by
when an operation completes, but by the order in which the queue is
processed after a blocking operation completes. That is not when the
syscall completes, but when the filesystem drops the exclusive lock.

From a single threaded userspace application perspective or
multithreaded apps with their own userspace locking, truncates
complete when either the syscall completes or some time after when
the application drops its lock. Either way, we're looking at
completion time serialisation, and in that case what XFS or the
kernel does simply doesn't matter.

However, if we are talking about userspace applications that use
lockless IO algorithms or kernel side applications (like knfsd) that
are exposed directly to the filesystem locking semantics,
serialisation behaviour is actually defined by filesystem's
submission side locking behaviour. There is no external
serialisation of IO completion, so serialisation and ordering is
defined (for XFS) solely by the rwsem behaviour.

> You don't have guarantees about which system call will "appear to run
> before the other" if the execution times of the two system calls
> overlap each other, but you actually never had such a guarantee from a
> correctness perspective (i.e. you could never guarantee which of the
> two would get queued on the rwsem ahead of the other).

Sure, but as long as the submission side ordering is deterministic,
it doesn't matter.

> > Ok, so I can see where your latency figure comes from, but it's
> > still a change of ordering in that W2 is no longer a barrier to the
> > reads queued after it.
> 
> My claim is that it's not a visible change from a correctness
> perspective

I am not arguing that it is incorrect. I'm arguing that the change
of ordering semantics breaks assumptions a lot of code makes about
how these locks work.

> > similar to this:
> >
> > W1 R1 W2 R2 W3 R3 ... Wn Rm
> >
> > By my reading of the algorithm you are proposing, after W1 is
> > released, we end up with the queue being treated like this:
> >
> > R1 R2 R3 ... Rm W2 W3 ... Wn
> >
> > Right?
> 
> Yes, if what you are representing is the state of the queue at a given
> point of time (which implies R1..Rm must be distinct threads, not just
> the same thread doing repeated requests).

Yup, that's very typical.

> As new requests come in over time, one important point to remember is
> that each writer will see at most one batch of readers wake up ahead
> of it (though this batch may include readers that were originally
> queued both before and after it).

And that's *exactly* the problem with the changes you are proposing
- rwsems will no longer provide strongly ordered write vs read
barrier semantics.

> I find the name 'barrier' actually confusing when used to describe
> synchronous operations.  To me a barrier is usually between
> asynchronous operations, and it is well defined which operations
> are ahead or behind of the barrier (either because they were
> queued by the same thread, or they were queued by different
> threads which may have synchronized together some other way).

When you have hundreds or thousands of threads doing IO to the one
file, it doesn't matter if the IO is issued synchronously or
asynchronously by the threads - you simply have a huge amount of
data IO concurrency and very, very deep pipelines.

Inserting a metadata modification (truncate, preallocation,
holepunch, etc) into that pipeline currently causes all new
submissions to queue behind the metadata modification, waits for
everything submitted before the metadata modification to complete
and then runs the metadata modification. Once it completes, it then
allows everything queued up to the next metadata modification to
run...

That matches your definition of a barrier pretty well, I think.

Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-11 Thread Michel Lespinasse
Hi Dave,

On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner  wrote:
> On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
>> - since all readers are woken at once, you might see burst of direct
>> IO operations followed by bursts of truncate operations, instead of
>> having them interleaved in smaller groups as before.
>
> And this will result in the same application IO pattern giving
> vastly different results. e.g. a read IO promoted ahead of a
> truncate will now return data instead of -1 (beyond EOF).

I will reply to this part first, as I think this gives a more concrete
anchor to the discussion.

The crucial point is that the truncate system call hasn't completed
yet - that thread is still queued in the rwsem.

You still have the guarantee that the truncate will complete
eventually (even if there is a never-ending stream of IOs), and that
all IO system calls that start after the truncate system call
completes will see the file as truncated.

You don't have guarantees about which system call will "appear to run
before the other" if the execution times of the two system calls
overlap each other, but you actually never had such a guarantee from a
correctness perspective (i.e. you could never guarantee which of the
two would get queued on the rwsem ahead of the other).

> So:
>
> R1 ... RN-1 W1 RN W2
>>
>> As our N-1 readers
>> complete their IO operations, they might get queued again after W2,
>
> So:
> W1 RN W2 R1 ... RN-1
>
>> and then skip ahead of W2 when RN reaches the front of the queue. So,
>> W2 is still guaranteed to run eventually, but with a worst case
>> latency that is ~2x worse than before.
>
> So, when W1 is released, the queue is treated as though it was:
>
> RN R1 ... RN-1 W2
>
> yes?

Yes.

> Ok, so I can see where your latency figure comes from, but it's
> still a change of ordering in that W2 is no longer a barrier to the
> reads queued after it.

My claim is that it's not a visible change from a correctness
perspective - if the extra readers R are submitted before W2 completes
then they can't assume their execution order vs W2. From a correctness
perspective, the outcome won't look any different as if W2's thread
had been delayed right before entering the truncate system call.

Now from a performance point of view, I concede there IS a difference
as this will happen more often than before. But from a correctness
point of view, this is not a new possible outcome - it was already
possible before.

> And if we extend that to WN, we have an interleaved queue
> similar to this:
>
> W1 R1 W2 R2 W3 R3 ... Wn Rm
>
> By my reading of the algorithm you are proposing, after W1 is
> released, we end up with the queue being treated like this:
>
> R1 R2 R3 ... Rm W2 W3 ... Wn
>
> Right?

Yes, if what you are representing is the state of the queue at a given
point of time (which implies R1..Rm must be distinct threads, not just
the same thread doing repeated requests).

As new requests come in over time, one important point to remember is
that each writer will see at most one batch of readers wake up ahead
of it (though this batch may include readers that were originally
queued both before and after it). This guarantees that the writer
won't get starved / will get its turn running eventually.
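
(A toy model of the wakeup policy may help here - this is my reading of
the patch, written as a userspace sketch rather than the kernel code: a
head writer is woken alone, while a head reader takes every queued
reader with it.)

#include <stdio.h>

/* Queue is a string of 'R'/'W' waiters, head first. */
static void wake_head(const char *q)
{
        char woken[64], left[64];
        int nw = 0, nl = 0;

        if (q[0] == 'W') {
                printf("wake: W    remaining: %s\n", q + 1);
                return;
        }
        for (int i = 0; q[i]; i++) {
                if (q[i] == 'R')
                        woken[nw++] = 'R';  /* reader joins the batch    */
                else
                        left[nl++] = 'W';   /* writers keep their order  */
        }
        woken[nw] = left[nl] = '\0';
        printf("wake: %s  remaining: %s\n", woken, left);
}

int main(void)
{
        wake_head("WRWRWR");    /* W1 R1 W2 R2 W3 R3: W1 runs alone */
        wake_head("RWRWR");     /* then R1 R2 R3 go as one batch    */
        return 0;
}

Readers arriving after that batch has been woken queue behind the
remaining writers and can't join it, which is what bounds each writer
to seeing at most one batch pass it.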

> If so, that is definitely a change of lock ordering semantics -
> writes do not have barrier properties anymore.

I find the name 'barrier' actually confusing when used to describe
synchronous operations. To me a barrier is usually between asynchronous
operations, and it is well defined which operations are ahead or
behind of the barrier (either because they were queued by the same
thread, or they were queued by different threads which may have
synchronized together some other way). But in this case, we have two
synchronous operations with overlapping execution times - they don't
have a way to know which one is 'supposed to' be ahead of the other.
The two threads can't observe their relative ordering since they are
blocked during the operation...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-11 Thread Dave Chinner
On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> On Sun, Mar 10, 2013 at 5:16 PM, Dave Chinner  wrote:
> > On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
> >> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner  wrote:
> >> > Isn't this a significant change of semantics for the rwsem? i.e.
> >> > that read lock requests that come after a write lock request now
> >> > jump ahead of the write lock request? i.e.the write lock request is
> >> > no longer a barrier in the queue?
> >>
> >> Yes, I am allowing readers to skip ahead of writers in the queue (but
> >> only if they can run with another reader that was already ahead).
> >>
> >> I don't see that this is a change of observable semantics for correct
> >> programs. If a reader and a writer both block on the rwsem, how do you
> >> known for sure which one got queued first ? rwsem API doesn't give you
> >> any easy way to know whether a thread is currently queued on the rwsem
> >> (it could also be descheduled before it gets onto the rwsem queue).
> >
> > There are algorithms that rely on write locks to act as read-side
> > barriers to prevent write side starvation. i.e. if you keep queuing
> > up read locks, the writer never gets the lock, thereby starving the
> > writer.
> 
> What I propose actually isn't as bad as a reader preference rwlock
> like rwlock_t. When a reader makes it to the head of the queue, all
> readers gets woken. At that point there are only writers left in the
> queue, and new readers will get queued after them and won't be able to
> skip over them (since they have no earlier reader to join up with).
> So, I am not throwing reader/writer fairness out the window here.

OK, but...

> > My point isn't that XFS gets the serialisation wrong - my point is
> > that the change of queuing order will change the userspace visible
> > behaviour of the filesystem *significantly*.
> >
> > For example: - direct IO only takes read locks, while truncate takes
> > a write lock.  Right now we can drop a truncate, preallocation or
> > hole punch operation into a stream of concurrent direct IO and have
> > it execute almost immediately - the barrier mechanism in the rwsem
> > ensures that it will be executed ahead of any future IO that is
> > issued by the application. With your proposed change, that operation
> > won't take place until all current *and all future* IOs stop and the
> > read lock is no longer held by anyone.
> 
> You actually wouldn't get such starvation with my proposal.
> 
> What you might see would be:
> 
> - Assuming you have up to N concurrent reads in progress, a writer
> might see up to 2N-1 readers proceed in front of it. This limit is
> reached if you first had N-1 readers grabbing the rwsem with an empty
> queue, then another writer W1 got queued,

So something like

R1 ... RN-1 W1

> then a reader RN (note that
> it will get queued after W1 and not run concurrently with the existing
> N-1 readers), then our writer W2 gets queued.

So:

R1 ... RN-1 W1 RN W2
>
> As our N-1 readers
> complete their IO operations, they might get queued again after W2,

So:
W1 RN W2 R1 ... RN-1

> and then skip ahead of W2 when RN reaches the front of the queue. So,
> W2 is still guaranteed to run eventually, but with a worst case
> latency that is ~2x worse than before.

So, when W1 is released, the queue is treated as though it was:

RN R1 ... RN-1 W2

yes? Ok, so I can see where your latency figure comes from, but it's
still a change of ordering in that W2 is no longer a barrier to the
reads queued after it.

And if we extend that to WN, we have an interleaved queue
similar to this:

W1 R1 W2 R2 W3 R3 ... Wn Rm

By my reading of the algorithm you are proposing, after W1 is
released, we end up with the queue being treated like this:

R1 R2 R3 ... Rm W2 W3 ... Wn

Right?

If so, that is definitely a change of lock ordering semantics -
writes do not have barrier properties anymore.

> - since all readers are woken at once, you might see burst of direct
> IO operations followed by bursts of truncate operations, instead of
> having them interleaved in smaller groups as before.

And this will result in the same application IO pattern giving
vastly different results. e.g. a read IO promoted ahead of a
truncate will now return data instead of -1 (beyond EOF).

> I'm not sure if these characteristics are good enough for the XFS use
> case. It seems tougher than many rwsem use cases, since the benefits
> of increased reader parallelism are not as clear here (i.e. the IO
> operations still have to hit disk so you don't get much further
> benefit if you already had more IO parallelism than platters).

Well defined, predictable, deterministic ordering of operations takes
far more precedence over parallelism when it comes to filesystem
behaviour. The rwsems already have enough parallelism to allow us to
drive more than 2 million 4k IOPS to a single file via
multi-threaded direct IO(*), so we aren't limited by parallelism and
concurrency in rwsems.

> So if
> this explanation still didn't make you comfortable with the proposal,
> I will rework it to avoid the queue reordering.

Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-11 Thread Peter Hurley

On Wed, 2013-03-06 at 15:21 -0800, Michel Lespinasse wrote:
> + retry_reader_grants:
> +        oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> +        if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> +                /* A writer stole the lock.  Undo our reader grants. */
> +                if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
> +                        goto out;
> +                /* The writer left.  Retry waking readers. */
> +                goto retry_reader_grants;
> +        }

This can be reduced to a single looping cmpxchg in the grant reversal
path; then if reversing the grant fails, the count can simply be
re-tested for grant success, rather than trying to atomically re-grant.
For example, with a helper function, rwsem_cmpxchg():

static inline int rwsem_cmpxchg(long *old, long new, struct rw_semaphore *sem)
{
        long tmp = *old;
        *old = cmpxchg(&sem->count, *old, new);
        return tmp == *old;
}

... then above becomes ...

        count = rwsem_atomic_update(adjustment, sem);
        do {
                if (count - adjustment >= RWSEM_WAITING_BIAS)
                        break;
                if (rwsem_cmpxchg(&count, count - adjustment, sem))
                        goto out;   /* or simply return sem */
        } while (1);

< wake up readers >


Also, this series and the original rwsem can mistakenly sleep reader(s)
when the lock is transitioned from writer-owned to waiting readers-owned
with no waiting writers. For example,


  CPU 0   |  CPU 1
  |
  | down_write()

... CPU 1 has the write lock for the semaphore.
Meanwhile, 1 or more down_read(s) are attempted and fail;
these are put on the wait list. Then ...

down_read()   | up_write()
  local = atomic_update(+read_bias)   |
  local <= 0? |   local = atomic_update(-write_bias)
  if (true)   |   local < 0?
 down_read_failed()   |   if (true)
  |  wake()
  | grab wait_lock
wait for wait_lock| wake all readers
  | release wait_lock

... At this point, sem->count > 0 and the wait list is empty,
but down_read_failed() will sleep the reader.


In this case, CPU 0 has observed the sem count with the write lock (and
the other waiters) and so is detoured to down_read_failed(). But if 
CPU 0 can't grab the wait_lock before the up_write() does (via
rwsem_wake()), then down_read_failed() will wake no one and sleep the
reader.

Unfortunately, this means readers and writers which observe the sem
count after the adjustment is committed by CPU 0 in down_read_failed()
will sleep as well, until the sem count returns to 0.
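
(To put numbers on this, here is a compilable sketch of the count
through the race, using the 64-bit x86 bias constants purely for
illustration and assuming k readers are already parked on the wait
list:)

#include <stdio.h>

#define RWSEM_ACTIVE_MASK       0xffffffffL
#define RWSEM_WAITING_BIAS      (-RWSEM_ACTIVE_MASK - 1)
#define RWSEM_ACTIVE_READ_BIAS  1L
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + 1L)

int main(void)
{
        long count = 0, k = 3;                  /* k queued readers      */

        count += RWSEM_ACTIVE_WRITE_BIAS;       /* CPU 1: down_write()   */
        count += RWSEM_WAITING_BIAS;            /* readers park; their   */
                                                /* read biases cancel    */
        count += RWSEM_ACTIVE_READ_BIAS;        /* CPU 0: down_read()    */
                                                /* sees <= 0, detours    */
        count -= RWSEM_ACTIVE_WRITE_BIAS;       /* CPU 1: up_write()...  */
        count += k * RWSEM_ACTIVE_READ_BIAS - RWSEM_WAITING_BIAS;
                                                /* ...wakes k readers    */
        printf("count after up_write = %ld\n", count);  /* k + 1 > 0     */

        /* CPU 0 finally gets wait_lock, finds the list empty, queues: */
        count += -RWSEM_ACTIVE_READ_BIAS + RWSEM_WAITING_BIAS;
        printf("active lockers = %ld -> no wakeup, CPU 0 sleeps\n",
               count & RWSEM_ACTIVE_MASK);      /* k, not 0 */
        return 0;
}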

I think the best solution would be to preserve the observed count when
down_read() fails and pass it to rwsem_down_read_failed() -- of course,
this is also the most disruptive approach as it changes the per-arch
interface (the attached patch does not include the arch changes). The
other alternative is to go through the __rwsem_do_wake() path.

Regards,
Peter Hurley

--- >% ---
Subject: [PATCH] rwsem: Early-out tardy readers


Signed-off-by: Peter Hurley 
---
 lib/rwsem.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index f9a5705..8eb2cdf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -118,12 +118,11 @@ __rwsem_do_wake(struct rw_semaphore *sem, bool wakewrite)
 /*
  * wait for the read lock to be granted
  */
-struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
+struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem, long count)
 {
signed long adjustment = -RWSEM_ACTIVE_READ_BIAS;
struct rwsem_waiter waiter;
struct task_struct *tsk = current;
-   signed long count;
 
/* set up my own style of waitqueue */
waiter.task = tsk;
@@ -131,6 +130,20 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
get_task_struct(tsk);
 
raw_spin_lock_irq(&sem->wait_lock);
+
+       /* Try to reverse the lock attempt, but if the count has changed
+        * so that reversing fails, check whether the lock has become
+        * reader-owned, and early-out if so */
+       do {
+               if (rwsem_cmpxchg(&count, count + adjustment, sem))
+                       break;
+               if (count > 0) {
+                       raw_spin_unlock_irq(&sem->wait_lock);
+                       put_task_struct(tsk);
+                       return sem;
+               }
+       } while (1);
+
sem->wait_readers++;
if (list_empty(&sem->wait_list))
adjustment += RWSEM_WAITING_BIAS;
-- 
1.8.1.2





Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-11 Thread Ingo Molnar

* Dave Chinner  wrote:

> So, that's my concern - we've got 20 years of algorithms in XFS designed 
> around rwsem write locks providing read-side barriers, and it's been 
> down this road once before (over 15 years ago). [...]

We are certainly not going to break reader/writer fairness - that's a must 
for any generic, long-held reader-writer locking mechanism.

Thanks,

Ingo


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-10 Thread Michel Lespinasse
On Sun, Mar 10, 2013 at 5:16 PM, Dave Chinner  wrote:
> On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
>> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner  wrote:
>> > Isn't this a significant change of semantics for the rwsem? i.e.
>> > that read lock requests that come after a write lock request now
>> > jump ahead of the write lock request? i.e.the write lock request is
>> > no longer a barrier in the queue?
>>
>> Yes, I am allowing readers to skip ahead of writers in the queue (but
>> only if they can run with another reader that was already ahead).
>>
>> I don't see that this is a change of observable semantics for correct
>> programs. If a reader and a writer both block on the rwsem, how do you
>> known for sure which one got queued first ? rwsem API doesn't give you
>> any easy way to know whether a thread is currently queued on the rwsem
>> (it could also be descheduled before it gets onto the rwsem queue).
>
> There are algorithms that rely on write locks to act as read-side
> barriers to prevent write side starvation. i.e. if you keep queuing
> up read locks, the writer never gets the lock, thereby starving the
> writer.

What I propose actually isn't as bad as a reader preference rwlock
like rwlock_t. When a reader makes it to the head of the queue, all
readers gets woken. At that point there are only writers left in the
queue, and new readers will get queued after them and won't be able to
skip over them (since they have no earlier reader to join up with).
So, I am not throwing reader/writer fairness out the window here.

>> > XFS has long assumed that a rwsem write lock is a barrier that
>> > stops new read locks from being taken, and this change will break
>> > that assumption. Given that this barrier assumption is used as the
>> > basis for serialisation of operations like IO vs truncate, there's a
>> > bit more at stake than just improving parallelism here.  i.e. IO
>> > issued after truncate/preallocate/hole punch could now be issued
>> > ahead of the pending metadata operation, whereas currently the IO
>> > issued after the pending metadata operation is waiting for the write
>> > lock will be only be processed -after- the metadata modification
>> > operation completes...
>> >
>> > That is a recipe for weird data corruption problems because
>> > applications are likely to have implicit dependencies on the barrier
>> > effect of metadata operations on data IO...
>>
>> I am confused as to exactly what XFS is doing, could you point me to
>> the code / indicate a scenario where this would go wrong ? If you
>> really rely on this for correctness you'd have to do something already
>> to guarantee that your original queueing order is as desired, and I
>> just don't see how it'd be done...
>
> My point isn't that XFS gets the serialisation wrong - my point is
> that the change of queuing order will change the userspace visible
> behaviour of the filesystem *significantly*.
>
> For example: - direct IO only takes read locks, while truncate takes
> a write lock.  Right now we can drop a truncate, preallocation or
> hole punch operation into a stream of concurrent direct IO and have
> it execute almost immediately - the barrier mechanism in the rwsem
> ensures that it will be executed ahead of any future IO that is
> issued by the application. With your proposed change, that operation
> won't take place until all current *and all future* IOs stop and the
> read lock is no longer held by anyone.

You actually wouldn't get such starvation with my proposal.

What you might see would be:

- Assuming you have up to N concurrent reads in progress, a writer
might see up to 2N-1 readers proceed in front of it. This limit is
reached if you first had N-1 readers grabbing the rwsem with an empty
queue, then another writer W1 got queued, then a reader RN (note that
it will get queued after W1 and not run concurrently with the existing
N-1 readers), then our writer W2 gets queued. As our N-1 readers
complete their IO operations, they might get queued again after W2,
and then skip ahead of W2 when RN reaches the front of the queue. So,
W2 is still guaranteed to run eventually, but with a worst case
latency that is ~2x worse than before.

- since all readers are woken at once, you might see burst of direct
IO operations followed by bursts of truncate operations, instead of
having them interleaved in smaller groups as before.

I'm not sure if these characteristics are good enough for the XFS use
case. It seems tougher than many rwsem use cases, since the benefits
of increased reader parallelism are not as clear here (i.e. the IO
operations still have to hit disk so you don't get much further
benefit if you already had more IO parallelism than platters). So if
this explanation still didn't make you comfortable with the proposal,
I will rework it to avoid the queue reordering.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-10 Thread Dave Chinner
On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner  wrote:
> > On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
> >> When the first queued waiter is a reader, wake all readers instead of
> >> just those that are at the front of the queue. There are really two
> >> motivations for this change:
> >
> > Isn't this a significant change of semantics for the rwsem? i.e.
> > that read lock requests that come after a write lock request now
> > jump ahead of the write lock request? i.e.the write lock request is
> > no longer a barrier in the queue?
> 
> Yes, I am allowing readers to skip ahead of writers in the queue (but
> only if they can run with another reader that was already ahead).
> 
> I don't see that this is a change of observable semantics for correct
> programs. If a reader and a writer both block on the rwsem, how do you
> known for sure which one got queued first ? rwsem API doesn't give you
> any easy way to know whether a thread is currently queued on the rwsem
> (it could also be descheduled before it gets onto the rwsem queue).

There are algorithms that rely on write locks to act as read-side
barriers to prevent write side starvation. i.e. if you keep queuing
up read locks, the writer never gets the lock, thereby starving the
writer.

> But yes, if you're making assumptions about queuing order the change
> makes it more likely that they'll be observably wrong.
> 
> > XFS has long assumed that a rwsem write lock is a barrier that
> > stops new read locks from being taken, and this change will break
> > that assumption. Given that this barrier assumption is used as the
> > basis for serialisation of operations like IO vs truncate, there's a
> > bit more at stake than just improving parallelism here.  i.e. IO
> > issued after truncate/preallocate/hole punch could now be issued
> > ahead of the pending metadata operation, whereas currently the IO
> > issued after the pending metadata operation is waiting for the write
> > lock will be only be processed -after- the metadata modification
> > operation completes...
> >
> > That is a recipe for weird data corruption problems because
> > applications are likely to have implicit dependencies on the barrier
> > effect of metadata operations on data IO...
> 
> I am confused as to exactly what XFS is doing, could you point me to
> the code / indicate a scenario where this would go wrong ? If you
> really rely on this for correctness you'd have to do something already
> to guarantee that your original queueing order is as desired, and I
> just don't see how it'd be done...

My point isn't that XFS gets the serialisation wrong - my point is
that the change of queuing order will change the userspace visible
behaviour of the filesystem *significantly*.

For example: - direct IO only takes read locks, while truncate takes
a write lock.  Right now we can drop a truncate, preallocation or
hole punch operation into a stream of concurrent direct IO and have
it execute almost immediately - the barrier mechanism in the rwsem
ensures that it will be executed ahead of any future IO that is
issued by the application. With your proposed change, that operation
won't take place until all current *and all future* IOs stop and the
read lock is no longer held by anyone.
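
(Schematically, the usage pattern I'm describing looks like the sketch
below - simplified, not the actual XFS code paths, which go through
xfs_ilock() and the inode's i_iolock; foo_inode is a hypothetical
stand-in:)

#include <linux/rwsem.h>

struct foo_inode {
        struct rw_semaphore iolock;
};

static void foo_dio(struct foo_inode *ip)
{
        down_read(&ip->iolock); /* many run concurrently, for seconds */
        /* ... issue direct IO and wait for it ... */
        up_read(&ip->iolock);
}

static void foo_truncate(struct foo_inode *ip)
{
        down_write(&ip->iolock); /* relied on to barrier new readers */
        /* ... wait out in-flight IO, modify metadata ... */
        up_write(&ip->iolock);
}

With the current rwsem behaviour, any down_read() issued after
foo_truncate() has queued waits behind the writer; with the proposed
change, queued readers can be woken ahead of it.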

To put this in context, here's the Irix XFS iolock initialisation
code from mid 1997, where the barrier semantics are first explicitly
documented:

mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", (long)vp->v_number);

It was only coded like this in 1997, because the Irix multi-reader
lock only grew read-side queue jumping and priority inheritance at
this time. This changed the default behaviour from write lock
barrier semantics to something similar to what you are proposing for
rwsems right now. All the locks that relied on barrier semantics of
write locks had to be configured explicitly to have that behaviour
rather than implicitly relying on the fact that mrlocks had provided
write barrier semantics.

So, that's my concern - we've got 20 years of algorithms in XFS
designed around rwsem write locks providing read-side barriers, and
it's been down this road once before (over 15 years ago).  Changing
the semantics of rwsems is going to break stuff in subtle and
unpredictable ways...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-08 Thread Michel Lespinasse
On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner  wrote:
> On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
>> When the first queued waiter is a reader, wake all readers instead of
>> just those that are at the front of the queue. There are really two
>> motivations for this change:
>
> Isn't this a significant change of semantics for the rwsem? i.e.
> that read lock requests that come after a write lock request now
> jump ahead of the write lock request? i.e.the write lock request is
> no longer a barrier in the queue?

Yes, I am allowing readers to skip ahead of writers in the queue (but
only if they can run with another reader that was already ahead).

I don't see that this is a change of observable semantics for correct
programs. If a reader and a writer both block on the rwsem, how do you
known for sure which one got queued first ? rwsem API doesn't give you
any easy way to know whether a thread is currently queued on the rwsem
(it could also be descheduled before it gets onto the rwsem queue).

But yes, if you're making assumptions about queuing order the change
makes it more likely that they'll be observably wrong.

> XFS has long assumed that a rwsem write lock is a barrier that
> stops new read locks from being taken, and this change will break
> that assumption. Given that this barrier assumption is used as the
> basis for serialisation of operations like IO vs truncate, there's a
> bit more at stake than just improving parallelism here.  i.e. IO
> issued after truncate/preallocate/hole punch could now be issued
> ahead of the pending metadata operation, whereas currently the IO
> issued after the pending metadata operation is waiting for the write
> lock will be only be processed -after- the metadata modification
> operation completes...
>
> That is a recipe for weird data corruption problems because
> applications are likely to have implicit dependencies on the barrier
> effect of metadata operations on data IO...

I am confused as to exactly what XFS is doing, could you point me to
the code / indicate a scenario where this would go wrong ? If you
really rely on this for correctness you'd have to do something already
to guarantee that your original queueing order is as desired, and I
just don't see how it'd be done...

That said, it is doable to add support for write lock stealing in the
rwsem write path while still preserving the queueing order of readers
vs writers; I'm just not sure that I fully understand the correctness
concern at this point.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-08 Thread Dave Chinner
On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
> When the first queued waiter is a reader, wake all readers instead of
> just those that are at the front of the queue. There are really two
> motivations for this change:

Isn't this a significant change of semantics for the rwsem? i.e.
that read lock requests that come after a write lock request now
jump ahead of the write lock request? i.e.the write lock request is
no longer a barrier in the queue?

XFS has long assumed that a rwsem write lock is a barrier that
stops new read locks from being taken, and this change will break
that assumption. Given that this barrier assumption is used as the
basis for serialisation of operations like IO vs truncate, there's a
bit more at stake than just improving parallelism here.  i.e. IO
issued after truncate/preallocate/hole punch could now be issued
ahead of the pending metadata operation, whereas currently the IO
issued after the pending metadata operation is waiting for the write
lock will be only be processed -after- the metadata modification
operation completes...

That is a recipe for weird data corruption problems because
applications are likely to have implicit dependencies on the barrier
effect of metadata operations on data IO...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com