Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-12 Thread Oleg Nesterov
On 02/11, Jeremy Fitzhardinge wrote:
>
> On 02/11/2015 09:24 AM, Oleg Nesterov wrote:
> > I agree, and I have to admit I am not sure I fully understand why
> > unlock uses the locked add. Except we need a barrier to avoid the race
> > with the enter_slowpath() users, of course. Perhaps this is the only
> > reason?
>
> Right now it needs to be a locked operation to prevent read-reordering.
> x86 memory ordering rules state that all writes are seen in a globally
> consistent order, and are globally ordered wrt reads *on the same
> addresses*, but reads to different addresses can be reordered wrt to writes.
>
> So, if the unlocking add were not a locked operation:
>
> __add(&lock->tickets.head, TICKET_LOCK_INC);  /* not locked */
>
> if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
> __ticket_unlock_slowpath(lock, prev);
>
> Then the read of lock->tickets.tail can be reordered before the unlock,
> which introduces a race:

Yes, yes, thanks, but this is what I meant. We need a barrier. Even if
"Every store is a release" as Linus mentioned.

> This *might* be OK, but I think it's on dubious ground:
>
> __add(&lock->tickets.head, TICKET_LOCK_INC);  /* not locked */
>
>   /* read overlaps write, and so is ordered */
> if (unlikely(lock->head_tail & (TICKET_SLOWPATH_FLAG << TICKET_SHIFT))
> __ticket_unlock_slowpath(lock, prev);
>
> because I think Intel and AMD differed in interpretation about how
> overlapping but different-sized reads & writes are ordered (or it simply
> isn't architecturally defined).

can't comment, I simply so not know how the hardware works.

> If the slowpath flag is moved to head, then it would always have to be
> locked anyway, because it needs to be atomic against other CPU's RMW
> operations setting the flag.

Yes, this is true.

But again, if we want to avoid the read-after-unlock, we need to update
this lock and read SLOWPATH atomically, it seems that we can't avoid the
locked insn.

Oleg.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Jeremy Fitzhardinge

On 02/11/2015 03:28 PM, Linus Torvalds wrote:
>
>
> On Feb 11, 2015 3:15 PM, "Jeremy Fitzhardinge"  > wrote:
> >
> > Right now it needs to be a locked operation to prevent read-reordering.
> > x86 memory ordering rules state that all writes are seen in a globally
> > consistent order, and are globally ordered wrt reads *on the same
> > addresses*, but reads to different addresses can be reordered wrt to
> writes.
>
> The modern x86 rules are actually much tighter than that.
>
> Every store is a release, and every load is an acquire. So a
> non-atomic store is actually a perfectly fine unlock. All preceding
> stores will be seen by other cpu's before the unlock, and while reads
> can pass stores, they only pass *earlier* stores.
>

Right, so in this particular instance, the read of the SLOWPATH flag
*can't* pass the previous unlock store, hence the need for an atomic
unlock or some other mechanism to prevent the read from being reordered.

J

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Linus Torvalds
On Feb 11, 2015 3:15 PM, "Jeremy Fitzhardinge"  wrote:
>
> Right now it needs to be a locked operation to prevent read-reordering.
> x86 memory ordering rules state that all writes are seen in a globally
> consistent order, and are globally ordered wrt reads *on the same
> addresses*, but reads to different addresses can be reordered wrt to
writes.

The modern x86 rules are actually much tighter than that.

Every store is a release, and every load is an acquire. So a non-atomic
store is actually a perfectly fine unlock. All preceding stores will be
seen by other cpu's before the unlock, and while reads can pass stores,
they only pass *earlier* stores.

For *taking* a lock you need an atomic access, because otherwise loads
inside the locked region could bleed out to before the store that takes the
lock.

 Linus
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Jeremy Fitzhardinge

On 02/11/2015 09:24 AM, Oleg Nesterov wrote:
> I agree, and I have to admit I am not sure I fully understand why
> unlock uses the locked add. Except we need a barrier to avoid the race
> with the enter_slowpath() users, of course. Perhaps this is the only
> reason?

Right now it needs to be a locked operation to prevent read-reordering.
x86 memory ordering rules state that all writes are seen in a globally
consistent order, and are globally ordered wrt reads *on the same
addresses*, but reads to different addresses can be reordered wrt to writes.

So, if the unlocking add were not a locked operation:

__add(&lock->tickets.head, TICKET_LOCK_INC);/* not locked */

if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
__ticket_unlock_slowpath(lock, prev);

Then the read of lock->tickets.tail can be reordered before the unlock,
which introduces a race:

/* read reordered here */
if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) /* false */
/* ... */;

/* other CPU sets SLOWPATH and blocks */

__add(&lock->tickets.head, TICKET_LOCK_INC);/* not locked */

/* other CPU hung */

So it doesn't *have* to be a locked operation. This should also work:

__add(&lock->tickets.head, TICKET_LOCK_INC);/* not locked */

lfence();   /* prevent read 
reordering */
if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
__ticket_unlock_slowpath(lock, prev);

but in practice a locked add is cheaper than an lfence (or at least was).

This *might* be OK, but I think it's on dubious ground:

__add(&lock->tickets.head, TICKET_LOCK_INC);/* not locked */

/* read overlaps write, and so is ordered */
if (unlikely(lock->head_tail & (TICKET_SLOWPATH_FLAG << TICKET_SHIFT))
__ticket_unlock_slowpath(lock, prev);

because I think Intel and AMD differed in interpretation about how
overlapping but different-sized reads & writes are ordered (or it simply
isn't architecturally defined).

If the slowpath flag is moved to head, then it would always have to be
locked anyway, because it needs to be atomic against other CPU's RMW
operations setting the flag.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Raghavendra K T

On 02/11/2015 11:08 PM, Oleg Nesterov wrote:

On 02/11, Raghavendra K T wrote:


On 02/10/2015 06:56 PM, Oleg Nesterov wrote:


In this case __ticket_check_and_clear_slowpath() really needs to cmpxchg
the whole .head_tail. Plus obviously more boring changes. This needs a
separate patch even _if_ this can work.


Correct, but apart from this, before doing xadd in unlock,
we would have to make sure lsb bit is cleared so that we can live with 1
bit overflow to tail which is unused. now either or both of head,tail
lsb bit may be set after unlock.


Sorry, can't understand... could you spell?

If TICKET_SLOWPATH_FLAG lives in .head arch_spin_unlock() could simply do

head = xadd(&lock->tickets.head, TICKET_LOCK_INC);

if (head & TICKET_SLOWPATH_FLAG)
__ticket_unlock_kick(head);

so it can't overflow to .tail?



You are right.
I totally forgot we can get rid of tail operations :)



And we we do this, probably it makes sense to add something like

bool tickets_equal(__ticket_t one, __ticket_t two)
{
return (one ^ two) & ~TICKET_SLOWPATH_FLAG;
}



Very nice idea. I was tired of ~TICKET_SLOWPATH_FLAG usage all over in
the current (complex :)) implementation. These two suggestions helps
alot.


and change kvm_lock_spinning() to use tickets_equal(tickets.head, want), plus
it can have more users in asm/spinlock.h.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Oleg Nesterov
On 02/11, Raghavendra K T wrote:
>
> On 02/10/2015 06:56 PM, Oleg Nesterov wrote:
>
>> In this case __ticket_check_and_clear_slowpath() really needs to cmpxchg
>> the whole .head_tail. Plus obviously more boring changes. This needs a
>> separate patch even _if_ this can work.
>
> Correct, but apart from this, before doing xadd in unlock,
> we would have to make sure lsb bit is cleared so that we can live with 1
> bit overflow to tail which is unused. now either or both of head,tail
> lsb bit may be set after unlock.

Sorry, can't understand... could you spell?

If TICKET_SLOWPATH_FLAG lives in .head arch_spin_unlock() could simply do

head = xadd(&lock->tickets.head, TICKET_LOCK_INC);

if (head & TICKET_SLOWPATH_FLAG)
__ticket_unlock_kick(head);

so it can't overflow to .tail?

But probably I missed your concern.



And we we do this, probably it makes sense to add something like

bool tickets_equal(__ticket_t one, __ticket_t two)
{
return (one ^ two) & ~TICKET_SLOWPATH_FLAG;
}

and change kvm_lock_spinning() to use tickets_equal(tickets.head, want), plus
it can have more users in asm/spinlock.h.

Oleg.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Oleg Nesterov
On 02/10, Jeremy Fitzhardinge wrote:
>
> On 02/10/2015 05:26 AM, Oleg Nesterov wrote:
> > On 02/10, Raghavendra K T wrote:
> >> Unfortunately xadd could result in head overflow as tail is high.
> >>
> >> The other option was repeated cmpxchg which is bad I believe.
> >> Any suggestions?
> > Stupid question... what if we simply move SLOWPATH from .tail to .head?
> > In this case arch_spin_unlock() could do xadd(tickets.head) and check
> > the result
>
> Well, right now, "tail" is manipulated by locked instructions by CPUs
> who are contending for the ticketlock, but head can be manipulated
> unlocked by the CPU which currently owns the ticketlock. If SLOWPATH
> moved into head, then non-owner CPUs would be touching head, requiring
> everyone to use locked instructions on it.
>
> That's the theory, but I don't see much (any?) code which depends on that.
>
> Ideally we could find a way so that pv ticketlocks could use a plain
> unlocked add for the unlock like the non-pv case, but I just don't see a
> way to do it.

I agree, and I have to admit I am not sure I fully understand why unlock
uses the locked add. Except we need a barrier to avoid the race with the
enter_slowpath() users, of course. Perhaps this is the only reason?

Anyway, I suggested this to avoid the overflow if we use xadd(), and I
guess we need the locked insn anyway if we want to eliminate the unsafe
read-after-unlock...

> > BTW. If we move "clear slowpath" into "lock" path, then probably trylock
> > should be changed too? Something like below, we just need to clear SLOWPATH
> > before cmpxchg.
>
> How important / widely used is trylock these days?

I am not saying this is that important. Just this looks more consistent imo
and we can do this for free.

Oleg.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Raghavendra K T

On 02/10/2015 06:56 PM, Oleg Nesterov wrote:

On 02/10, Raghavendra K T wrote:


On 02/10/2015 06:23 AM, Linus Torvalds wrote:


  add_smp(&lock->tickets.head, TICKET_LOCK_INC);
  if (READ_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG) ..

into something like

  val = xadd((&lock->ticket.head_tail, TICKET_LOCK_INC << TICKET_SHIFT);
  if (unlikely(val & TICKET_SLOWPATH_FLAG)) ...

would be the right thing to do. Somebody should just check that I got
that shift right, and that the tail is in the high bytes (head really
needs to be high to work, if it's in the low byte(s) the xadd would
overflow from head into tail which would be wrong).


Unfortunately xadd could result in head overflow as tail is high.

The other option was repeated cmpxchg which is bad I believe.
Any suggestions?


Stupid question... what if we simply move SLOWPATH from .tail to .head?
In this case arch_spin_unlock() could do xadd(tickets.head) and check
the result


It is a good idea. Trying this now.


In this case __ticket_check_and_clear_slowpath() really needs to cmpxchg
the whole .head_tail. Plus obviously more boring changes. This needs a
separate patch even _if_ this can work.


Correct, but apart from this, before doing xadd in unlock,
we would have to make sure lsb bit is cleared so that we can live with 1 
bit overflow to tail which is unused. now either or both of head,tail

lsb bit may be set after unlock.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-10 Thread Jeremy Fitzhardinge

On 02/10/2015 05:26 AM, Oleg Nesterov wrote:
> On 02/10, Raghavendra K T wrote:
>> On 02/10/2015 06:23 AM, Linus Torvalds wrote:
>>
>>>  add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>>>  if (READ_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG) ..
>>>
>>> into something like
>>>
>>>  val = xadd((&lock->ticket.head_tail, TICKET_LOCK_INC << 
>>> TICKET_SHIFT);
>>>  if (unlikely(val & TICKET_SLOWPATH_FLAG)) ...
>>>
>>> would be the right thing to do. Somebody should just check that I got
>>> that shift right, and that the tail is in the high bytes (head really
>>> needs to be high to work, if it's in the low byte(s) the xadd would
>>> overflow from head into tail which would be wrong).
>> Unfortunately xadd could result in head overflow as tail is high.
>>
>> The other option was repeated cmpxchg which is bad I believe.
>> Any suggestions?
> Stupid question... what if we simply move SLOWPATH from .tail to .head?
> In this case arch_spin_unlock() could do xadd(tickets.head) and check
> the result

Well, right now, "tail" is manipulated by locked instructions by CPUs
who are contending for the ticketlock, but head can be manipulated
unlocked by the CPU which currently owns the ticketlock. If SLOWPATH
moved into head, then non-owner CPUs would be touching head, requiring
everyone to use locked instructions on it.

That's the theory, but I don't see much (any?) code which depends on that.

Ideally we could find a way so that pv ticketlocks could use a plain
unlocked add for the unlock like the non-pv case, but I just don't see a
way to do it.

> In this case __ticket_check_and_clear_slowpath() really needs to cmpxchg
> the whole .head_tail. Plus obviously more boring changes. This needs a
> separate patch even _if_ this can work.

Definitely.

> BTW. If we move "clear slowpath" into "lock" path, then probably trylock
> should be changed too? Something like below, we just need to clear SLOWPATH
> before cmpxchg.

How important / widely used is trylock these days?

J

>
> Oleg.
>
> --- x/arch/x86/include/asm/spinlock.h
> +++ x/arch/x86/include/asm/spinlock.h
> @@ -109,7 +109,8 @@ static __always_inline int arch_spin_try
>   if (old.tickets.head != (old.tickets.tail & ~TICKET_SLOWPATH_FLAG))
>   return 0;
>  
> - new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
> + new.tickets.head = old.tickets.head;
> + new.tickets.tail = (old.tickets.tail & ~TICKET_SLOWPATH_FLAG) + 
> TICKET_LOCK_INC;
>  
>   /* cmpxchg is a full barrier, so nothing can move before it */
>   return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> old.head_tail;
>

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-10 Thread Oleg Nesterov
On 02/10, Denys Vlasenko wrote:
>
> # define HEAD_MASK (TICKET_SLOWPATH_FLAG-1)
>
> ...
> unlock_again:
>
> val = xadd((&lock->ticket.head_tail, TICKET_LOCK_INC);
> if (unlikely(!(val & HEAD_MASK))) {
> /* overflow. we inadvertently incremented the tail word.
>  * tail's lsb is TICKET_SLOWPATH_FLAG.
>  * Increment inverted this bit, fix it up.
>  * (inc _may_ have messed up tail counter too,
>  * will deal with it after kick.)
> */
> val ^= TICKET_SLOWPATH_FLAG;
> }
>
> if (unlikely(val & TICKET_SLOWPATH_FLAG)) {
> ...kick the waiting task...
>
>val -= TICKET_SLOWPATH_FLAG;
>if (unlikely(!(val & HEAD_MASK))) {
> /* overflow. we inadvertently incremented tail word, *and*
>  * TICKET_SLOWPATH_FLAG was set, increment overflowed
>  * that bit too and incremented tail counter.
>  * This means we (inadvertently) taking the lock again!
>  * Oh well. Take it, and unlock it again...
>  */
> while (1) {
> if (READ_ONCE(lock->tickets.head) != TICKET_TAIL(val))
> cpu_relax();
> }
> goto unlock_again;
> }
>
>
> Granted, this looks ugly.

complicated ;)

But "Take it, and unlock it again" simply can't work, this can deadlock.
Note that unlock() can be called after successful try_lock(). And other
problems with lock-ordering, like

lock(X);
lock(Y);

unlock(X);

Oleg.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-10 Thread Oleg Nesterov
On 02/10, Raghavendra K T wrote:
>
> On 02/10/2015 06:23 AM, Linus Torvalds wrote:
>
>>  add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>>  if (READ_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG) ..
>>
>> into something like
>>
>>  val = xadd((&lock->ticket.head_tail, TICKET_LOCK_INC << 
>> TICKET_SHIFT);
>>  if (unlikely(val & TICKET_SLOWPATH_FLAG)) ...
>>
>> would be the right thing to do. Somebody should just check that I got
>> that shift right, and that the tail is in the high bytes (head really
>> needs to be high to work, if it's in the low byte(s) the xadd would
>> overflow from head into tail which would be wrong).
>
> Unfortunately xadd could result in head overflow as tail is high.
>
> The other option was repeated cmpxchg which is bad I believe.
> Any suggestions?

Stupid question... what if we simply move SLOWPATH from .tail to .head?
In this case arch_spin_unlock() could do xadd(tickets.head) and check
the result

In this case __ticket_check_and_clear_slowpath() really needs to cmpxchg
the whole .head_tail. Plus obviously more boring changes. This needs a
separate patch even _if_ this can work.



BTW. If we move "clear slowpath" into "lock" path, then probably trylock
should be changed too? Something like below, we just need to clear SLOWPATH
before cmpxchg.

Oleg.

--- x/arch/x86/include/asm/spinlock.h
+++ x/arch/x86/include/asm/spinlock.h
@@ -109,7 +109,8 @@ static __always_inline int arch_spin_try
if (old.tickets.head != (old.tickets.tail & ~TICKET_SLOWPATH_FLAG))
return 0;
 
-   new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
+   new.tickets.head = old.tickets.head;
+   new.tickets.tail = (old.tickets.tail & ~TICKET_SLOWPATH_FLAG) + 
TICKET_LOCK_INC;
 
/* cmpxchg is a full barrier, so nothing can move before it */
return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
old.head_tail;

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-10 Thread Sasha Levin
On 02/10/2015 04:30 AM, Raghavendra K T wrote:
>>
>> So I think Raghavendra's last version (which hopefully fixes the
>> lockup problem that Sasha reported) together with changing that
> 
> V2 did pass the stress, but getting confirmation Sasha would help.

I've been running it for the last two days, and didn't see any lockups
or other strange behaviour aside from some invalid reads which we
expected.


Thanks,
Sasha
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-10 Thread Denys Vlasenko
On Tue, Feb 10, 2015 at 2:18 PM, Denys Vlasenko
 wrote:
> while (1) {
> if (READ_ONCE(lock->tickets.head) != TICKET_TAIL(val))
> cpu_relax();
> }

Doh should be

 while (READ_ONCE(lock->tickets.head) != TICKET_TAIL(val)
 cpu_relax();
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization



Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-10 Thread Denys Vlasenko
On Tue, Feb 10, 2015 at 10:30 AM, Raghavendra K T
 wrote:
> On 02/10/2015 06:23 AM, Linus Torvalds wrote:
>>  add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>>  if (READ_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG) ..
>>
>> into something like
>>
>>  val = xadd((&lock->ticket.head_tail, TICKET_LOCK_INC <<
>> TICKET_SHIFT);
>>  if (unlikely(val & TICKET_SLOWPATH_FLAG)) ...
>>
>> would be the right thing to do. Somebody should just check that I got
>> that shift right, and that the tail is in the high bytes (head really
>> needs to be high to work, if it's in the low byte(s) the xadd would
>> overflow from head into tail which would be wrong).
>
> Unfortunately xadd could result in head overflow as tail is high.

xadd can overflow, but is this really a problem?

# define HEAD_MASK (TICKET_SLOWPATH_FLAG-1)

...
unlock_again:

val = xadd((&lock->ticket.head_tail, TICKET_LOCK_INC);
if (unlikely(!(val & HEAD_MASK))) {
/* overflow. we inadvertently incremented the tail word.
 * tail's lsb is TICKET_SLOWPATH_FLAG.
 * Increment inverted this bit, fix it up.
 * (inc _may_ have messed up tail counter too,
 * will deal with it after kick.)
*/
val ^= TICKET_SLOWPATH_FLAG;
}

if (unlikely(val & TICKET_SLOWPATH_FLAG)) {
...kick the waiting task...

   val -= TICKET_SLOWPATH_FLAG;
   if (unlikely(!(val & HEAD_MASK))) {
/* overflow. we inadvertently incremented tail word, *and*
 * TICKET_SLOWPATH_FLAG was set, increment overflowed
 * that bit too and incremented tail counter.
 * This means we (inadvertently) taking the lock again!
 * Oh well. Take it, and unlock it again...
 */
while (1) {
if (READ_ONCE(lock->tickets.head) != TICKET_TAIL(val))
cpu_relax();
}
goto unlock_again;
}


Granted, this looks ugly.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-10 Thread Raghavendra K T

On 02/10/2015 06:23 AM, Linus Torvalds wrote:

On Mon, Feb 9, 2015 at 4:02 AM, Peter Zijlstra  wrote:

On Mon, Feb 09, 2015 at 03:04:22PM +0530, Raghavendra K T wrote:

So we have 3 choices,
1. xadd
2. continue with current approach.
3. a read before unlock and also after that.


For the truly paranoid we have probe_kernel_address(), suppose the lock
was in module space and the module just got unloaded under us.


That's much too expensive.

The xadd shouldn't be noticeably more expensive than the current
"add_smp()". Yes, "lock xadd" used to be several cycles slower than
just "lock add" on some early cores, but I think these days it's down
to a single-cycle difference, which is not really different from doing
a separate load after the add.

The real problem with xadd used to be that we always had to do magic
special-casing for i386, but that's one of the reasons we dropped
support for original 80386.

So I think Raghavendra's last version (which hopefully fixes the
lockup problem that Sasha reported) together with changing that


V2 did pass the stress, but getting confirmation Sasha would help.



 add_smp(&lock->tickets.head, TICKET_LOCK_INC);
 if (READ_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG) ..

into something like

 val = xadd((&lock->ticket.head_tail, TICKET_LOCK_INC << TICKET_SHIFT);
 if (unlikely(val & TICKET_SLOWPATH_FLAG)) ...

would be the right thing to do. Somebody should just check that I got
that shift right, and that the tail is in the high bytes (head really
needs to be high to work, if it's in the low byte(s) the xadd would
overflow from head into tail which would be wrong).


Unfortunately xadd could result in head overflow as tail is high.

The other option was repeated cmpxchg which is bad I believe.
Any suggestions?

( I have the V3 with Oleg's suggestion and performance numbers but
without this getting resolved, It will be one unnecessary iteration).

How about getting rid off SLOW_PATH_FLAG in spinlock (i.e. use it only
 as hint for paravirt), but do unlock_kick whenever we see that
(tail-head) > TICKET_LOCK_INC?. (but this also may need cmpxchg in loop
in unlock but we will be able to get rid of clear slowpath logic)

Only problem is we may do unnecessary kicks even in 1x load.








___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-09 Thread Linus Torvalds
On Mon, Feb 9, 2015 at 4:02 AM, Peter Zijlstra  wrote:
> On Mon, Feb 09, 2015 at 03:04:22PM +0530, Raghavendra K T wrote:
>> So we have 3 choices,
>> 1. xadd
>> 2. continue with current approach.
>> 3. a read before unlock and also after that.
>
> For the truly paranoid we have probe_kernel_address(), suppose the lock
> was in module space and the module just got unloaded under us.

That's much too expensive.

The xadd shouldn't be noticeably more expensive than the current
"add_smp()". Yes, "lock xadd" used to be several cycles slower than
just "lock add" on some early cores, but I think these days it's down
to a single-cycle difference, which is not really different from doing
a separate load after the add.

The real problem with xadd used to be that we always had to do magic
special-casing for i386, but that's one of the reasons we dropped
support for original 80386.

So I think Raghavendra's last version (which hopefully fixes the
lockup problem that Sasha reported) together with changing that

add_smp(&lock->tickets.head, TICKET_LOCK_INC);
if (READ_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG) ..

into something like

val = xadd((&lock->ticket.head_tail, TICKET_LOCK_INC << TICKET_SHIFT);
if (unlikely(val & TICKET_SLOWPATH_FLAG)) ...

would be the right thing to do. Somebody should just check that I got
that shift right, and that the tail is in the high bytes (head really
needs to be high to work, if it's in the low byte(s) the xadd would
overflow from head into tail which would be wrong).

 Linus
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-09 Thread Raghavendra K T

On 02/09/2015 05:32 PM, Peter Zijlstra wrote:

On Mon, Feb 09, 2015 at 03:04:22PM +0530, Raghavendra K T wrote:

So we have 3 choices,
1. xadd
2. continue with current approach.
3. a read before unlock and also after that.


For the truly paranoid we have probe_kernel_address(), suppose the lock
was in module space and the module just got unloaded under us.



Thanks.. Good idea, How costly is it?
atleast we could do probe_kernel_address() and check the value of
slowpath flag if people as us to address invalid read problem.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-09 Thread Peter Zijlstra
On Mon, Feb 09, 2015 at 03:04:22PM +0530, Raghavendra K T wrote:
> So we have 3 choices,
> 1. xadd
> 2. continue with current approach.
> 3. a read before unlock and also after that.

For the truly paranoid we have probe_kernel_address(), suppose the lock
was in module space and the module just got unloaded under us.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-09 Thread Raghavendra K T

On 02/09/2015 02:44 AM, Jeremy Fitzhardinge wrote:

On 02/06/2015 06:49 AM, Raghavendra K T wrote:

[...]



Linus suggested that we should not do any writes to lock after unlock(),
and we can move slowpath clearing to fastpath lock.


Yep, that seems like a sound approach.


Current approach seem to be working now. (though we could not avoid read).
Related question: Do you think we could avoid SLOWPATH_FLAG itself by
checking head and tail difference. or is it costly because it may
result in unnecessary unlock_kicks?


However it brings additional case to be handled, viz., slowpath still
could be set when somebody does arch_trylock. Handle that too by ignoring
slowpath flag during lock availability check.

Reported-by: Sasha Levin 
Suggested-by: Linus Torvalds 
Signed-off-by: Raghavendra K T 
---
  arch/x86/include/asm/spinlock.h | 70 -
  1 file changed, 34 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 625660f..0829f86 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -49,6 +49,23 @@ static inline void __ticket_enter_slowpath(arch_spinlock_t 
*lock)
set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
  }

+static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock)
+{
+   arch_spinlock_t old, new;
+   __ticket_t diff;
+
+   old.tickets = READ_ONCE(lock->tickets);


Couldn't the caller pass in the lock state that it read rather than
re-reading it?



Yes we could. do you mean we could pass additional read value apart from 
lock, (because lock will be anyway needed for cmpxchg).




+static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock)
+{
+}
+
  #endif /* CONFIG_PARAVIRT_SPINLOCKS */

  static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
@@ -84,7 +105,7 @@ static __always_inline void arch_spin_lock(arch_spinlock_t 
*lock)
register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };

inc = xadd(&lock->tickets, inc);
-   if (likely(inc.head == inc.tail))
+   if (likely(inc.head == (inc.tail & ~TICKET_SLOWPATH_FLAG)))




good point, we can get rid of this as well.


The intent of this conditional was to be the quickest possible path when
taking a fastpath lock, with the code below being used for all slowpath
locks (free or taken). So I don't think masking out SLOWPATH_FLAG is
necessary here.


goto out;

inc.tail &= ~TICKET_SLOWPATH_FLAG;
@@ -98,7 +119,10 @@ static __always_inline void arch_spin_lock(arch_spinlock_t 
*lock)
} while (--count);
__ticket_lock_spinning(lock, inc.tail);
}
-out:   barrier();  /* make sure nothing creeps before the lock is taken */
+out:
+   __ticket_check_and_clear_slowpath(lock);
+
+   barrier();  /* make sure nothing creeps before the lock is taken */


Which means that if "goto out" path is only ever used for fastpath
locks, you can limit calling __ticket_check_and_clear_slowpath() to the
slowpath case.



Yes, I ll move that call up.


  }

  static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
@@ -115,47 +139,21 @@ static __always_inline int 
arch_spin_trylock(arch_spinlock_t *lock)
return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
old.head_tail;
  }

-static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
-   arch_spinlock_t old)
-{
-   arch_spinlock_t new;
-
-   BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
-
-   /* Perform the unlock on the "before" copy */
-   old.tickets.head += TICKET_LOCK_INC;


NB (see below)


Thanks for pointing, this solved the hang issue. I
missed this exact addition.




-
-   /* Clear the slowpath flag */
-   new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT);
-
-   /*
-* If the lock is uncontended, clear the flag - use cmpxchg in
-* case it changes behind our back though.
-*/
-   if (new.tickets.head != new.tickets.tail ||
-   cmpxchg(&lock->head_tail, old.head_tail,
-   new.head_tail) != old.head_tail) {
-   /*
-* Lock still has someone queued for it, so wake up an
-* appropriate waiter.
-*/
-   __ticket_unlock_kick(lock, old.tickets.head);
-   }
-}
-
  static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
  {
if (TICKET_SLOWPATH_FLAG &&
-   static_key_false(¶virt_ticketlocks_enabled)) {
-   arch_spinlock_t prev;
+   static_key_false(¶virt_ticketlocks_enabled)) {
+   __ticket_t prev_head;

-   prev = *lock;
+   prev_head = lock->tickets.head;
add_smp(&lock->tickets.head, TICKET_LOCK_INC);

/* add_smp()

Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-08 Thread Jeremy Fitzhardinge
On 02/06/2015 06:49 AM, Raghavendra K T wrote:
> Paravirt spinlock clears slowpath flag after doing unlock.
> As explained by Linus currently it does:
> prev = *lock;
> add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>
> /* add_smp() is a full mb() */
>
> if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
> __ticket_unlock_slowpath(lock, prev);
>
>
> which is *exactly* the kind of things you cannot do with spinlocks,
> because after you've done the "add_smp()" and released the spinlock
> for the fast-path, you can't access the spinlock any more.  Exactly
> because a fast-path lock might come in, and release the whole data
> structure.

Yeah, that's an embarrasingly obvious bug in retrospect.

> Linus suggested that we should not do any writes to lock after unlock(),
> and we can move slowpath clearing to fastpath lock.

Yep, that seems like a sound approach.

> However it brings additional case to be handled, viz., slowpath still
> could be set when somebody does arch_trylock. Handle that too by ignoring
> slowpath flag during lock availability check.
>
> Reported-by: Sasha Levin 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Raghavendra K T 
> ---
>  arch/x86/include/asm/spinlock.h | 70 
> -
>  1 file changed, 34 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 625660f..0829f86 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -49,6 +49,23 @@ static inline void __ticket_enter_slowpath(arch_spinlock_t 
> *lock)
>   set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
>  }
>  
> +static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock)
> +{
> + arch_spinlock_t old, new;
> + __ticket_t diff;
> +
> + old.tickets = READ_ONCE(lock->tickets);

Couldn't the caller pass in the lock state that it read rather than
re-reading it?

> + diff = (old.tickets.tail & ~TICKET_SLOWPATH_FLAG) - old.tickets.head;
> +
> + /* try to clear slowpath flag when there are no contenders */
> + if ((old.tickets.tail & TICKET_SLOWPATH_FLAG) &&
> + (diff == TICKET_LOCK_INC)) {
> + new = old;
> + new.tickets.tail &= ~TICKET_SLOWPATH_FLAG;
> + cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
> + }
> +}
> +
>  #else  /* !CONFIG_PARAVIRT_SPINLOCKS */
>  static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
>   __ticket_t ticket)
> @@ -59,6 +76,10 @@ static inline void __ticket_unlock_kick(arch_spinlock_t 
> *lock,
>  {
>  }
>  
> +static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock)
> +{
> +}
> +
>  #endif /* CONFIG_PARAVIRT_SPINLOCKS */
>  
>  static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> @@ -84,7 +105,7 @@ static __always_inline void arch_spin_lock(arch_spinlock_t 
> *lock)
>   register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
>  
>   inc = xadd(&lock->tickets, inc);
> - if (likely(inc.head == inc.tail))
> + if (likely(inc.head == (inc.tail & ~TICKET_SLOWPATH_FLAG)))

The intent of this conditional was to be the quickest possible path when
taking a fastpath lock, with the code below being used for all slowpath
locks (free or taken). So I don't think masking out SLOWPATH_FLAG is
necessary here.

>   goto out;
>  
>   inc.tail &= ~TICKET_SLOWPATH_FLAG;
> @@ -98,7 +119,10 @@ static __always_inline void 
> arch_spin_lock(arch_spinlock_t *lock)
>   } while (--count);
>   __ticket_lock_spinning(lock, inc.tail);
>   }
> -out: barrier();  /* make sure nothing creeps before the lock is taken */
> +out:
> + __ticket_check_and_clear_slowpath(lock);
> +
> + barrier();  /* make sure nothing creeps before the lock is taken */

Which means that if "goto out" path is only ever used for fastpath
locks, you can limit calling __ticket_check_and_clear_slowpath() to the
slowpath case.

>  }
>  
>  static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
> @@ -115,47 +139,21 @@ static __always_inline int 
> arch_spin_trylock(arch_spinlock_t *lock)
>   return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> old.head_tail;
>  }
>  
> -static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
> - arch_spinlock_t old)
> -{
> - arch_spinlock_t new;
> -
> - BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
> -
> - /* Perform the unlock on the "before" copy */
> - old.tickets.head += TICKET_LOCK_INC;

NB (see below)

> -
> - /* Clear the slowpath flag */
> - new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT);
> -
> - /*
> -  * If the lock is uncontended, clear the flag - us

Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-08 Thread Raghavendra K T

On 02/07/2015 12:27 AM, Sasha Levin wrote:

On 02/06/2015 09:49 AM, Raghavendra K T wrote:

Paravirt spinlock clears slowpath flag after doing unlock.
As explained by Linus currently it does:
 prev = *lock;
 add_smp(&lock->tickets.head, TICKET_LOCK_INC);

 /* add_smp() is a full mb() */

 if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
 __ticket_unlock_slowpath(lock, prev);


which is *exactly* the kind of things you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more.  Exactly
because a fast-path lock might come in, and release the whole data
structure.

Linus suggested that we should not do any writes to lock after unlock(),
and we can move slowpath clearing to fastpath lock.

However it brings additional case to be handled, viz., slowpath still
could be set when somebody does arch_trylock. Handle that too by ignoring
slowpath flag during lock availability check.

Reported-by: Sasha Levin 
Suggested-by: Linus Torvalds 
Signed-off-by: Raghavendra K T 


With this patch, my VMs lock up quickly after boot with:


Tried to reproduce the hang myself, and there seems to be still a
barrier (or logic I miss).

Looking closely below, unlock_kick got missed though we see
that SLOWPATH_FLAG is still set:

/me goes back to look closely

(gdb) bt
#0  native_halt () at ./arch/x86/include/asm/irqflags.h:55
#1  0x81037c27 in halt () at ./arch/x86/include/asm/paravirt.h:116
#2  kvm_lock_spinning (lock=0x88023ffe8240, want=52504) at 
arch/x86/kernel/kvm.c:786

#3  0x81037251 in __raw_callee_save_kvm_lock_spinning ()
#4  0x88023fc0edb0 in ?? ()
#5  0x in ?? ()

(gdb) p *(arch_spinlock_t *)0x88023ffe8240
$1 = {{head_tail = 3441806612, tickets = {head = 52500, tail = 52517}}}
(gdb) t 2
[Switching to thread 2 (Thread 2)]
#0  native_halt () at ./arch/x86/include/asm/irqflags.h:55
55  }
(gdb) bt
#0  native_halt () at ./arch/x86/include/asm/irqflags.h:55
#1  0x81037c27 in halt () at ./arch/x86/include/asm/paravirt.h:116
#2  kvm_lock_spinning (lock=0x88023ffe8240, want=52502) at 
arch/x86/kernel/kvm.c:786

#3  0x81037251 in __raw_callee_save_kvm_lock_spinning ()
#4  0x0246 in irq_stack_union ()
#5  0x00080750 in ?? ()
#6  0x0002 in ?? ()
#7  0x0004 in irq_stack_union ()
#8  0xcd16 in nmi_print_seq ()
Cannot access memory at address 0xbfc0
(gdb) t 3
[Switching to thread 3 (Thread 3)]
#0  native_halt () at ./arch/x86/include/asm/irqflags.h:55
55  }
(gdb) bt
#0  native_halt () at ./arch/x86/include/asm/irqflags.h:55
#1  0x81037c27 in halt () at ./arch/x86/include/asm/paravirt.h:116
#2  kvm_lock_spinning (lock=0x88023ffe8240, want=52512) at 
arch/x86/kernel/kvm.c:786

#3  0x81037251 in __raw_callee_save_kvm_lock_spinning ()
#4  0x88023fc8edb0 in ?? ()
#5  0x in ?? ()

[...] //other threads with similar output

(gdb) t 8
[Switching to thread 8 (Thread 8)]
#0  native_halt () at ./arch/x86/include/asm/irqflags.h:55
55  }
(gdb) bt
#0  native_halt () at ./arch/x86/include/asm/irqflags.h:55
#1  0x81037c27 in halt () at ./arch/x86/include/asm/paravirt.h:116
#2  kvm_lock_spinning (lock=0x88023ffe8240, want=52500) at 
arch/x86/kernel/kvm.c:786

#3  0x81037251 in __raw_callee_save_kvm_lock_spinning ()
#4  0x88023fdcedb0 in ?? ()
#5  0x in ?? ()

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-08 Thread Raghavendra K T

On 02/06/2015 09:55 PM, Linus Torvalds wrote:

On Fri, Feb 6, 2015 at 6:49 AM, Raghavendra K T
 wrote:

Paravirt spinlock clears slowpath flag after doing unlock.

[ fix edited out ]

So I'm not going to be applying this for 3.19, because it's much too
late and the patch is too scary. Plus the bug probably effectively
never shows up in real life (it is probably easy to trigger the
speculative *read* but probably never the actual speculative write
after dropping the lock last).



Understood and agreed.


This will need a lot of testing by the paravirt people - both
performance and correctness. So *maybe* for 3.20, but maybe for even
later, and then marked for stable, of course.

Are there any good paravirt stress-tests that people could run for
extended times?



I have been running several benchmarks (kern, sys, hack, ebizzy etc in
in 1x,2x scenarios. I run them for performance test as well.
(In the current patch I did not get kvm hang in normal run, But
overcommit reproduced it).

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-08 Thread Oleg Nesterov
On 02/06, Sasha Levin wrote:
>
> Can we modify it slightly to avoid potentially accessing invalid memory:
> 
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 5315887..cd22d73 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -144,13 +144,13 @@ static __always_inline void 
> arch_spin_unlock(arch_spinlock_t *lock
> if (TICKET_SLOWPATH_FLAG &&
> static_key_false(¶virt_ticketlocks_enabled)) {
> __ticket_t prev_head;
> -
> +   bool needs_kick = lock->tickets.tail & TICKET_SLOWPATH_FLAG;
> prev_head = lock->tickets.head;
> add_smp(&lock->tickets.head, TICKET_LOCK_INC);
> 
> /* add_smp() is a full mb() */
> 
> -   if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) {
> +   if (unlikely(needs_kick)) {

This doesn't look right too...

We need to guarantee that either unlock() sees TICKET_SLOWPATH_FLAG, or
the calller of __ticket_enter_slowpath() sees the result of add_smp().

Suppose that kvm_lock_spinning() is called right before add_smp() and it
sets SLOWPATH. It will block then because .head != want, and it needs
__ticket_unlock_kick().

Oleg.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-06 Thread Davidlohr Bueso
On Fri, 2015-02-06 at 16:15 -0500, Sasha Levin wrote:
> On 02/06/2015 02:42 PM, Davidlohr Bueso wrote:
> > On Fri, 2015-02-06 at 08:25 -0800, Linus Torvalds wrote:
> >> On Fri, Feb 6, 2015 at 6:49 AM, Raghavendra K T
> >>  wrote:
> >>> Paravirt spinlock clears slowpath flag after doing unlock.
> >> [ fix edited out ]
> >>
> >> So I'm not going to be applying this for 3.19, because it's much too
> >> late and the patch is too scary. Plus the bug probably effectively
> >> never shows up in real life (it is probably easy to trigger the
> >> speculative *read* but probably never the actual speculative write
> >> after dropping the lock last).
> >>
> >> This will need a lot of testing by the paravirt people - both
> >> performance and correctness. So *maybe* for 3.20, but maybe for even
> >> later, and then marked for stable, of course.
> >>
> >> Are there any good paravirt stress-tests that people could run for
> >> extended times?
> > 
> > locktorture inside a VM should give a proper pounding.
> 
> Would it catch lifetime issues too? I thought it just tests out correctness.

Given enough contention it should, yeah. The spinlock can be acquired by
another thread right after releasing the lock in the unlock's slowpath.
And all locktorture does is pound on locks with random hold times.

Thanks,
Davidlohr

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-06 Thread Sasha Levin
On 02/06/2015 02:42 PM, Davidlohr Bueso wrote:
> On Fri, 2015-02-06 at 08:25 -0800, Linus Torvalds wrote:
>> On Fri, Feb 6, 2015 at 6:49 AM, Raghavendra K T
>>  wrote:
>>> Paravirt spinlock clears slowpath flag after doing unlock.
>> [ fix edited out ]
>>
>> So I'm not going to be applying this for 3.19, because it's much too
>> late and the patch is too scary. Plus the bug probably effectively
>> never shows up in real life (it is probably easy to trigger the
>> speculative *read* but probably never the actual speculative write
>> after dropping the lock last).
>>
>> This will need a lot of testing by the paravirt people - both
>> performance and correctness. So *maybe* for 3.20, but maybe for even
>> later, and then marked for stable, of course.
>>
>> Are there any good paravirt stress-tests that people could run for
>> extended times?
> 
> locktorture inside a VM should give a proper pounding.

Would it catch lifetime issues too? I thought it just tests out correctness.

I tried it and other unrelated stuff broke. I'll send separate mails for that...


Thanks,
Sasha
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-06 Thread Davidlohr Bueso
On Fri, 2015-02-06 at 08:25 -0800, Linus Torvalds wrote:
> On Fri, Feb 6, 2015 at 6:49 AM, Raghavendra K T
>  wrote:
> > Paravirt spinlock clears slowpath flag after doing unlock.
> [ fix edited out ]
> 
> So I'm not going to be applying this for 3.19, because it's much too
> late and the patch is too scary. Plus the bug probably effectively
> never shows up in real life (it is probably easy to trigger the
> speculative *read* but probably never the actual speculative write
> after dropping the lock last).
> 
> This will need a lot of testing by the paravirt people - both
> performance and correctness. So *maybe* for 3.20, but maybe for even
> later, and then marked for stable, of course.
> 
> Are there any good paravirt stress-tests that people could run for
> extended times?

locktorture inside a VM should give a proper pounding.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-06 Thread Sasha Levin
On 02/06/2015 09:49 AM, Raghavendra K T wrote:
> Paravirt spinlock clears slowpath flag after doing unlock.
> As explained by Linus currently it does:
> prev = *lock;
> add_smp(&lock->tickets.head, TICKET_LOCK_INC);
> 
> /* add_smp() is a full mb() */
> 
> if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
> __ticket_unlock_slowpath(lock, prev);
> 
> 
> which is *exactly* the kind of things you cannot do with spinlocks,
> because after you've done the "add_smp()" and released the spinlock
> for the fast-path, you can't access the spinlock any more.  Exactly
> because a fast-path lock might come in, and release the whole data
> structure.
> 
> Linus suggested that we should not do any writes to lock after unlock(),
> and we can move slowpath clearing to fastpath lock.
> 
> However it brings additional case to be handled, viz., slowpath still
> could be set when somebody does arch_trylock. Handle that too by ignoring
> slowpath flag during lock availability check.
> 
> Reported-by: Sasha Levin 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Raghavendra K T 

With this patch, my VMs lock up quickly after boot with:

[  161.613469] BUG: spinlock lockup suspected on CPU#31, kworker/31:1/5213
[  161.613469]  lock: purge_lock.28981+0x0/0x40, .magic: dead4ead, .owner: 
kworker/7:1/6400, .owner_cpu: 7
[  161.613469] CPU: 31 PID: 5213 Comm: kworker/31:1 Not tainted 
3.19.0-rc7-next-20150204-sasha-00048-gee3a350 #1869
[  161.613469] Workqueue: events bpf_prog_free_deferred
[  161.613469]   f03dd27f 88056b227a88 
a1702276
[  161.613469]   88017cf7 88056b227aa8 
9e1d009c
[  161.613469]  a3edae60 86c3f830 88056b227ad8 
9e1d01d7
[  161.613469] Call Trace:
[  161.613469] dump_stack (lib/dump_stack.c:52)
[  161.613469] spin_dump (kernel/locking/spinlock_debug.c:68 (discriminator 8))
[  161.613469] do_raw_spin_lock (include/linux/nmi.h:48 
kernel/locking/spinlock_debug.c:119 kernel/locking/spinlock_debug.c:137)
[  161.613469] _raw_spin_lock (kernel/locking/spinlock.c:152)
[  161.613469] ? __purge_vmap_area_lazy (mm/vmalloc.c:615)
[  161.613469] __purge_vmap_area_lazy (mm/vmalloc.c:615)
[  161.613469] ? vm_unmap_aliases (mm/vmalloc.c:1021)
[  161.613469] vm_unmap_aliases (mm/vmalloc.c:1052)
[  161.613469] ? vm_unmap_aliases (mm/vmalloc.c:1021)
[  161.613469] ? __lock_acquire (kernel/locking/lockdep.c:2019 
kernel/locking/lockdep.c:3184)
[  161.613469] change_page_attr_set_clr (arch/x86/mm/pageattr.c:1394)
[  161.613469] ? debug_object_deactivate (lib/debugobjects.c:463)
[  161.613469] set_memory_rw (arch/x86/mm/pageattr.c:1662)
[  161.613469] ? __lock_is_held (kernel/locking/lockdep.c:3518)
[  161.613469] bpf_jit_free (include/linux/filter.h:377 
arch/x86/net/bpf_jit_comp.c:991)
[  161.613469] bpf_prog_free_deferred (kernel/bpf/core.c:646)
[  161.613469] process_one_work (kernel/workqueue.c:2014 
include/linux/jump_label.h:114 include/trace/events/workqueue.h:111 
kernel/workqueue.c:2019)
[  161.613469] ? process_one_work (./arch/x86/include/asm/atomic64_64.h:33 
include/asm-generic/atomic-long.h:38 kernel/workqueue.c:598 
kernel/workqueue.c:625 kernel/workqueue.c:2007)
[  161.613469] worker_thread (include/linux/list.h:189 kernel/workqueue.c:2147)
[  161.613469] ? process_one_work (kernel/workqueue.c:2091)
[  161.613469] kthread (kernel/kthread.c:207)
[  161.613469] ? finish_task_switch (./arch/x86/include/asm/current.h:14 
kernel/sched/sched.h:1058 kernel/sched/core.c:2258)
[  161.613469] ? flush_kthread_work (kernel/kthread.c:176)
[  161.613469] ret_from_fork (arch/x86/kernel/entry_64.S:283)
[  161.613469] ? flush_kthread_work (kernel/kthread.c:176)

And a few soft lockup messages inside the scheduler after that.


Thanks,
Sasha


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-06 Thread Andrey Ryabinin
On 02/06/2015 07:15 PM, Linus Torvalds wrote:
> On Fri, Feb 6, 2015 at 7:20 AM, Sasha Levin  wrote:
>>
>> Can we modify it slightly to avoid potentially accessing invalid memory:
> 
> So I think there's a race with that.
> 
> And I'll warn you: the kernel does do speculative reads of memory that
> might be invalid, not just in places like this. See the comment in
> get_user_huge_page() for example, where we knowingly do speculative
> reads, but hide it if DEBUG_PAGEALLOC is set.
> 
> More commonly, CONFIG_DCACHE_WORD_ACCESS is very much about doing
> speculative reads. Now, that access is hidden inside an asm, so KASan
> won't see it, but there might well be others.
> 
> You probably don't see them very much just because they are so rarely
> a problem, and most of the time it's not to other processes stack but
> to allocated structures where freeing takes long enough to basically
> hide any small race..
> 
> In other words: I suspect it would be good to instead just teach KASan
> about "this is a speculative read" and just suppress the warning for
> those instead.
> 

We can suppress warnings by wrapping such speculative reads with
kasan_disable_current()/kasan_enable_current() calls.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-06 Thread Linus Torvalds
On Fri, Feb 6, 2015 at 6:49 AM, Raghavendra K T
 wrote:
> Paravirt spinlock clears slowpath flag after doing unlock.
[ fix edited out ]

So I'm not going to be applying this for 3.19, because it's much too
late and the patch is too scary. Plus the bug probably effectively
never shows up in real life (it is probably easy to trigger the
speculative *read* but probably never the actual speculative write
after dropping the lock last).

This will need a lot of testing by the paravirt people - both
performance and correctness. So *maybe* for 3.20, but maybe for even
later, and then marked for stable, of course.

Are there any good paravirt stress-tests that people could run for
extended times?

Linus
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-06 Thread Linus Torvalds
On Fri, Feb 6, 2015 at 7:20 AM, Sasha Levin  wrote:
>
> Can we modify it slightly to avoid potentially accessing invalid memory:

So I think there's a race with that.

And I'll warn you: the kernel does do speculative reads of memory that
might be invalid, not just in places like this. See the comment in
get_user_huge_page() for example, where we knowingly do speculative
reads, but hide it if DEBUG_PAGEALLOC is set.

More commonly, CONFIG_DCACHE_WORD_ACCESS is very much about doing
speculative reads. Now, that access is hidden inside an asm, so KASan
won't see it, but there might well be others.

You probably don't see them very much just because they are so rarely
a problem, and most of the time it's not to other processes stack but
to allocated structures where freeing takes long enough to basically
hide any small race..

In other words: I suspect it would be good to instead just teach KASan
about "this is a speculative read" and just suppress the warning for
those instead.

 Linus
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-06 Thread Sasha Levin
On 02/06/2015 09:49 AM, Raghavendra K T wrote:
>  static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
>   if (TICKET_SLOWPATH_FLAG &&
> - static_key_false(¶virt_ticketlocks_enabled)) {
> - arch_spinlock_t prev;
> + static_key_false(¶virt_ticketlocks_enabled)) {
> + __ticket_t prev_head;
>  
> - prev = *lock;
> + prev_head = lock->tickets.head;
>   add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>  
>   /* add_smp() is a full mb() */
>  
> - if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
> - __ticket_unlock_slowpath(lock, prev);
> + if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) {
> + BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
> + __ticket_unlock_kick(lock, prev_head);

Can we modify it slightly to avoid potentially accessing invalid memory:

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 5315887..cd22d73 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -144,13 +144,13 @@ static __always_inline void 
arch_spin_unlock(arch_spinlock_t *lock
if (TICKET_SLOWPATH_FLAG &&
static_key_false(¶virt_ticketlocks_enabled)) {
__ticket_t prev_head;
-
+   bool needs_kick = lock->tickets.tail & TICKET_SLOWPATH_FLAG;
prev_head = lock->tickets.head;
add_smp(&lock->tickets.head, TICKET_LOCK_INC);

/* add_smp() is a full mb() */

-   if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) {
+   if (unlikely(needs_kick)) {
BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
__ticket_unlock_kick(lock, prev_head);
}


Thanks,
Sasha
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-06 Thread Raghavendra K T
Paravirt spinlock clears slowpath flag after doing unlock.
As explained by Linus currently it does:
prev = *lock;
add_smp(&lock->tickets.head, TICKET_LOCK_INC);

/* add_smp() is a full mb() */

if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
__ticket_unlock_slowpath(lock, prev);


which is *exactly* the kind of things you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more.  Exactly
because a fast-path lock might come in, and release the whole data
structure.

Linus suggested that we should not do any writes to lock after unlock(),
and we can move slowpath clearing to fastpath lock.

However it brings additional case to be handled, viz., slowpath still
could be set when somebody does arch_trylock. Handle that too by ignoring
slowpath flag during lock availability check.

Reported-by: Sasha Levin 
Suggested-by: Linus Torvalds 
Signed-off-by: Raghavendra K T 
---
 arch/x86/include/asm/spinlock.h | 70 -
 1 file changed, 34 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 625660f..0829f86 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -49,6 +49,23 @@ static inline void __ticket_enter_slowpath(arch_spinlock_t 
*lock)
set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
 }
 
+static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock)
+{
+   arch_spinlock_t old, new;
+   __ticket_t diff;
+
+   old.tickets = READ_ONCE(lock->tickets);
+   diff = (old.tickets.tail & ~TICKET_SLOWPATH_FLAG) - old.tickets.head;
+
+   /* try to clear slowpath flag when there are no contenders */
+   if ((old.tickets.tail & TICKET_SLOWPATH_FLAG) &&
+   (diff == TICKET_LOCK_INC)) {
+   new = old;
+   new.tickets.tail &= ~TICKET_SLOWPATH_FLAG;
+   cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
+   }
+}
+
 #else  /* !CONFIG_PARAVIRT_SPINLOCKS */
 static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
__ticket_t ticket)
@@ -59,6 +76,10 @@ static inline void __ticket_unlock_kick(arch_spinlock_t 
*lock,
 {
 }
 
+static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock)
+{
+}
+
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
 
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
@@ -84,7 +105,7 @@ static __always_inline void arch_spin_lock(arch_spinlock_t 
*lock)
register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
 
inc = xadd(&lock->tickets, inc);
-   if (likely(inc.head == inc.tail))
+   if (likely(inc.head == (inc.tail & ~TICKET_SLOWPATH_FLAG)))
goto out;
 
inc.tail &= ~TICKET_SLOWPATH_FLAG;
@@ -98,7 +119,10 @@ static __always_inline void arch_spin_lock(arch_spinlock_t 
*lock)
} while (--count);
__ticket_lock_spinning(lock, inc.tail);
}
-out:   barrier();  /* make sure nothing creeps before the lock is taken */
+out:
+   __ticket_check_and_clear_slowpath(lock);
+
+   barrier();  /* make sure nothing creeps before the lock is taken */
 }
 
 static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
@@ -115,47 +139,21 @@ static __always_inline int 
arch_spin_trylock(arch_spinlock_t *lock)
return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
old.head_tail;
 }
 
-static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
-   arch_spinlock_t old)
-{
-   arch_spinlock_t new;
-
-   BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
-
-   /* Perform the unlock on the "before" copy */
-   old.tickets.head += TICKET_LOCK_INC;
-
-   /* Clear the slowpath flag */
-   new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT);
-
-   /*
-* If the lock is uncontended, clear the flag - use cmpxchg in
-* case it changes behind our back though.
-*/
-   if (new.tickets.head != new.tickets.tail ||
-   cmpxchg(&lock->head_tail, old.head_tail,
-   new.head_tail) != old.head_tail) {
-   /*
-* Lock still has someone queued for it, so wake up an
-* appropriate waiter.
-*/
-   __ticket_unlock_kick(lock, old.tickets.head);
-   }
-}
-
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
if (TICKET_SLOWPATH_FLAG &&
-   static_key_false(¶virt_ticketlocks_enabled)) {
-   arch_spinlock_t prev;
+   static_key_false(¶virt_ticketlocks_enabled)) {
+   __ticket_t prev_head;