Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-16 Thread Peter Zijlstra
On Mon, May 16, 2016 at 12:55:57PM +0200, Peter Zijlstra wrote:
> > So the barriers already forbid the compiler from reordering the code. How on
> > earth is the compiler supposed to optimize out the dec/inc?
> 
> Order things like:
> 
> > #define sched_preempt_enable_no_resched() \
> > do { \
> > barrier(); \
> > preempt_count_dec(); \
> > } while (0)
> 
> > #define preempt_disable() \
> > do { \
> > preempt_count_inc(); \
> > barrier(); \
> > } while (0)
> 
> And there is no barrier between the dec and inc, and a smarty pants
> compiler could just decide to forgo the update, since in program order
> there is no observable difference either way.
> 
> Making the thing volatile tells the compiler there can be external
> observations of the memory and it cannot assume things like that and
> must emit the operations.

...

> So I'll go write a proper changelog for the volatile thing and get it
> merged with a Cc to stable.


---
Subject: sched,preempt: Fix preempt_count manipulations
From: Peter Zijlstra 
Date: Mon May 16 15:01:11 CEST 2016

Vikram reported that his ARM64 compiler managed to 'optimize' away the
preempt_count manipulations in code like:

preempt_enable_no_resched();
put_user();
preempt_disable();

Irrespective of the fact that that is horrible code that should be
fixed for many reasons, it does highlight a deficiency in the generic
preempt_count manipulators, as it is never right to combine/elide
preempt_count manipulations like this.

Therefore sprinkle some volatile in the two generic accessors to
ensure the compiler is aware of the fact that the preempt_count is
observed outside of the regular program-order view and thus cannot be
optimized away like this.

x86, the only arch not using the generic code, is not affected, as we
do all this in asm in order to use the segment base per-cpu stuff.

Cc: stable@vger.kernel.org
Cc: Thomas Gleixner 
Fixes: a787870924db ("sched, arch: Create asm/preempt.h")
Reported-by: Vikram Mulukutla 
Tested-by: Vikram Mulukutla 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/asm-generic/preempt.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@
 
 static __always_inline int preempt_count(void)
 {
-   return current_thread_info()->preempt_count;
+   return READ_ONCE(current_thread_info()->preempt_count);
 }
 
-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
 {
	return &current_thread_info()->preempt_count;
 }


Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-16 Thread Peter Zijlstra
On Mon, May 16, 2016 at 12:55:57PM +0200, Peter Zijlstra wrote:
> You're right in that the 'proper' sequence:
> 
> > #define preempt_enable() \
> > do { \
> > barrier(); \
> > if (unlikely(preempt_count_dec_and_test())) \
> > __preempt_schedule(); \
> > } while (0)
> 
> > #define preempt_disable() \
> > do { \
> > preempt_count_inc(); \
> > barrier(); \
> > } while (0)
> 
> has a higher chance of actually getting the operations emitted to memory; but
> an even smarter pants compiler might figure that doing something like:
> 
>   if (preempt_count() == 1)
>   __preempt_schedule();
> 
> is equivalent and emits that instead, not bothering to modify the actual
> variable at all -- the program as specified cannot tell the difference
> etc.

For this to work, the call to __preempt_schedule() must be obfuscated;
but I think the thing we do for x86, which turns it into:

asm("call ___preempt_schedule;");

might just be enough for the compiler to get confused about that.

A normal function call would act as a compiler barrier and force the
compiler to emit the memory ops.
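
As a rough standalone sketch of that difference (illustrative only: ext_call
and some_func are made-up names, and a real asm() call would also have to
worry about clobbered registers):

extern int pc;
extern void ext_call(void);

void with_extern_call(void)
{
	pc--;
	ext_call();	/* opaque call: may read/write pc, so both updates must hit memory */
	pc++;
}

void with_asm_call(void)
{
	pc--;
	asm volatile("call some_func");	/* no "memory" clobber ... */
	pc++;				/* ... so the compiler may still cancel the pair */
}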

Then again, it could maybe do:

if (preempt_count() == 1) {
preempt_count_dec();
__preempt_schedule();
preempt_count_inc();
}

and think it did us a service by 'optimizing' away that memory reference
in the 'fast' path.

Who knows what these compilers get up to ;-)


Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-16 Thread Peter Zijlstra
On Sat, May 14, 2016 at 05:39:37PM +0200, Thomas Gleixner wrote:
> I have a hard time understanding why the compiler optimizes this stuff out
> without that patch.
> 
> We already have:
> 
> #define preempt_disable() \
> do { \
> preempt_count_inc(); \
> barrier(); \
> } while (0)
> 
> #define sched_preempt_enable_no_resched() \
> do { \
> barrier(); \
> preempt_count_dec(); \
> } while (0)
> 
> #define preempt_enable() \
> do { \
> barrier(); \
> if (unlikely(preempt_count_dec_and_test())) \
> __preempt_schedule(); \
> } while (0)
> 
> So the barriers already forbid the compiler from reordering the code. How on
> earth is the compiler supposed to optimize out the dec/inc?

Order things like:

> #define sched_preempt_enable_no_resched() \
> do { \
> barrier(); \
> preempt_count_dec(); \
> } while (0)

> #define preempt_disable() \
> do { \
> preempt_count_inc(); \
> barrier(); \
> } while (0)

And there is no barrier between the dec and inc, and a smarty pants
compiler could just decide to forgo the update, since in program order
there is no observable difference either way.

Making the thing volatile tells the compiler there can be external
observations of the memory and it cannot assume things like that and
must emit the operations.
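
As a minimal standalone illustration (plain C, not kernel code): through a
plain int pointer the adjacent dec/inc pair may be cancelled, while through a
volatile pointer both accesses must be emitted:

static int count;
static int *pc = &count;		/* plain: the pair may be elided */
static volatile int *vpc = &count;	/* volatile: both accesses must be emitted */

void elidable(void)
{
	(*pc)--;
	(*pc)++;	/* nothing observes the count in between */
}

void kept(void)
{
	(*vpc)--;
	(*vpc)++;
}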

You're right in that the 'proper' sequence:

> #define preempt_enable() \
> do { \
> barrier(); \
> if (unlikely(preempt_count_dec_and_test())) \
> __preempt_schedule(); \
> } while (0)

> #define preempt_disable() \
> do { \
> preempt_count_inc(); \
> barrier(); \
> } while (0)

has a higher chance of actually getting the operations emitted to memory; but
an even smarter pants compiler might figure that doing something like:

if (preempt_count() == 1)
__preempt_schedule();

is equivalent and emits that instead, not bothering to modify the actual
variable at all -- the program as specified cannot tell the difference
etc.

Also, in the case of !PREEMPT && PREEMPT_COUNT, the normal:

preempt_disable();
preempt_enable();

sequence turns into the first case again.

So I'll go write a proper changelog for the volatile thing and get it
merged with a Cc to stable.


Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-14 Thread Vikram Mulukutla

On 5/14/2016 8:39 AM, Thomas Gleixner wrote:

On Fri, 13 May 2016, Vikram Mulukutla wrote:

On 5/13/2016 7:58 AM, Peter Zijlstra wrote:

diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 5d8ffa3e6f8c..c1cde3577551 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@

   static __always_inline int preempt_count(void)
   {
-   return current_thread_info()->preempt_count;
+   return READ_ONCE(current_thread_info()->preempt_count);
   }

-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
   {
	return &current_thread_info()->preempt_count;
   }



Thanks Peter, this patch worked for me. The compiler no longer optimizes out
the increment/decrement of the preempt_count.


I have a hard time understanding why the compiler optimizes this stuff out
without that patch.

We already have:

#define preempt_disable() \
do { \
 preempt_count_inc(); \
 barrier(); \
} while (0)

#define sched_preempt_enable_no_resched() \
do { \
 barrier(); \
 preempt_count_dec(); \
} while (0)

#define preempt_enable() \
do { \
 barrier(); \
 if (unlikely(preempt_count_dec_and_test())) \
 __preempt_schedule(); \
} while (0)

So the barriers already forbid the compiler from reordering the code. How on
earth is the compiler supposed to optimize out the dec/inc?



While I cannot claim more than a very rudimentary knowledge of 
compilers, this was the initial reaction that I had as well. But then 
the barriers are in the wrong spots for the way the code was used in the 
driver in question. preempt_disable has the barrier() _after_ the 
increment, and sched_preempt_enable_no_resched has it _before_ the 
decrement. Therefore, if one were to use preempt_enable_no_resched 
followed by preempt_disable, there is no compiler barrier between the 
decrement and the increment statements. Whether this should translate to 
such a seemingly aggressive optimization is something I'm not completely 
sure of.
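
Schematically, the driver's sequence therefore expands to the following, with
both barriers outside the dec/inc pair:

	barrier();
	preempt_count_dec();	/* sched_preempt_enable_no_resched() */
	/* put_user(): a few inline instructions, no compiler barrier */
	preempt_count_inc();	/* preempt_disable() */
	barrier();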



There is more code than the preempt stuff that depends on barrier() and expects
the compiler not to optimize things out at will. Are we supposed to hunt down
all occurrences and amend them with READ_ONCE or whatever, one by one?



I think the barrier is working as intended for the sequence of 
preempt_disable followed by preempt_enable_no_resched.



Thanks,

tglx





As a simple experiment I used gcc on x86 on the following C program. 
This was really to convince myself rather than you and Peter!


#include <stdio.h>

#define barrier() __asm__ __volatile__("": : :"memory")

struct thread_info {
int pc;
};

#define preempt_count() \
ti_ptr->pc

#define preempt_disable() \
do \
{ \
preempt_count() += 1; \
barrier(); \
} \
while (0)

#define preempt_enable() \
do \
{ \
barrier(); \
preempt_count() -=  1; \
} \
while (0)

struct thread_info *ti_ptr;

int main(void)
{
struct thread_info ti;
ti_ptr = &ti;
int b = 0;

preempt_enable();
b = b + 500;
preempt_disable();

printf("b = %d\n", b);

return 0;
}
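
(The disassembly below should be reproducible with something like
gcc -O2 test.c -o test && objdump -d test, assuming the file is saved as
test.c.)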

With gcc -O2 I get this (the ARM compiler behaves similarly):

00400470 <main>:
  400470:   48 83 ec 18             sub    $0x18,%rsp
  400474:   48 89 25 cd 0b 20 00    mov    %rsp,0x200bcd(%rip)
  40047b:   ba f4 01 00 00          mov    $0x1f4,%edx
  400480:   be 14 06 40 00          mov    $0x400614,%esi
  400485:   bf 01 00 00 00          mov    $0x1,%edi
  40048a:   31 c0                   xor    %eax,%eax
  40048c:   e8 cf ff ff ff          callq  400460 <__printf_chk@plt>
  400491:   31 c0                   xor    %eax,%eax
  400493:   48 83 c4 18             add    $0x18,%rsp
  400497:   c3                      retq

If I swap preempt_enable and preempt_disable I get this:

00400470 <main>:
  400470:   48 83 ec 18             sub    $0x18,%rsp
  400474:   48 89 25 cd 0b 20 00    mov    %rsp,0x200bcd(%rip)
  40047b:   83 04 24 01             addl   $0x1,(%rsp)
  40047f:   48 8b 05 c2 0b 20 00    mov    0x200bc2(%rip),%rax
  400486:   ba f4 01 00 00          mov    $0x1f4,%edx
  40048b:   be 14 06 40 00          mov    $0x400614,%esi
  400490:   bf 01 00 00 00          mov    $0x1,%edi
  400495:   83 28 01                subl   $0x1,(%rax)
  400498:   31 c0                   xor    %eax,%eax
  40049a:   e8 c1 ff ff ff          callq  400460 <__printf_chk@plt>
  40049f:   31 c0                   xor    %eax,%eax
  4004a1:   48 83 c4 18             add    $0x18,%rsp
  4004a5:   c3                      retq

Note: If I place ti_ptr on the stack, the ordering/barriers no longer
matter; the inc/dec is always optimized out. I suppose the compiler does
treat current_thread_info as coming from a different memory location 
rather than the current stack frame. In any 

Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-14 Thread Thomas Gleixner
On Fri, 13 May 2016, Vikram Mulukutla wrote:
> On 5/13/2016 7:58 AM, Peter Zijlstra wrote:
> > diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
> > index 5d8ffa3e6f8c..c1cde3577551 100644
> > --- a/include/asm-generic/preempt.h
> > +++ b/include/asm-generic/preempt.h
> > @@ -7,10 +7,10 @@
> > 
> >   static __always_inline int preempt_count(void)
> >   {
> > -   return current_thread_info()->preempt_count;
> > +   return READ_ONCE(current_thread_info()->preempt_count);
> >   }
> > 
> > -static __always_inline int *preempt_count_ptr(void)
> > +static __always_inline volatile int *preempt_count_ptr(void)
> >   {
> > 	return &current_thread_info()->preempt_count;
> >   }
> > 
> 
> Thanks Peter, this patch worked for me. The compiler no longer optimizes out
> the increment/decrement of the preempt_count.

I have a hard time understanding why the compiler optimizes this stuff out
without that patch.

We already have:

#define preempt_disable() \
do { \
preempt_count_inc(); \
barrier(); \
} while (0)

#define sched_preempt_enable_no_resched() \
do { \
barrier(); \
preempt_count_dec(); \
} while (0)

#define preempt_enable() \
do { \
barrier(); \
if (unlikely(preempt_count_dec_and_test())) \
__preempt_schedule(); \
} while (0)

So the barriers already forbid the compiler from reordering the code. How on
earth is the compiler supposed to optimize out the dec/inc?

There is more code than the preempt stuff that depends on barrier() and expects
the compiler not to optimize things out at will. Are we supposed to hunt down
all occurrences and amend them with READ_ONCE or whatever, one by one?

Thanks,

tglx





Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-13 Thread Vikram Mulukutla

On 5/13/2016 7:58 AM, Peter Zijlstra wrote:

On Thu, May 12, 2016 at 11:39:47PM -0700, Vikram Mulukutla wrote:

Hi,

I came across a piece of engineering code that looked like:

preempt_disable();
/* --cut, lots of code-- */
preempt_enable_no_resched();
put_user()
preempt_disable();

(If you wish to seriously question the usage of the preempt API in this
manner, I unfortunately have no comment since I didn't write the code.)


I'm with Thomas here, that's broken and should not be done.


Ok. I did in fact zero in on this code by replacing each instance of 
preempt_enable_no_resched with preempt_enable one by one (there were 
several uses in the driver). I will ask the original developer to 
consider using preempt_enable.





This particular block of code was causing lockups and crashes on a certain
ARM64 device. The generated assembly revealed that the compiler was simply
optimizing out the increment and decrement of the preempt count, allowing
put_user to run without preemption enabled, causing all sorts of badness.
Since put_user doesn't actually access the preempt count and translates to
just a few instructions without any branching, I suppose that the compiler
figured it was OK to optimize.

The immediate solution is to add a compiler barrier to the code above, but
should sched_preempt_enable_no_resched have an additional compiler barrier
after (it already has one before) the preempt-count decrement to prevent this
sort of thing?


I think the below would be sufficient; IIRC the compiler may not combine
or elide volatile operations.

---
  include/asm-generic/preempt.h | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 5d8ffa3e6f8c..c1cde3577551 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@

  static __always_inline int preempt_count(void)
  {
-   return current_thread_info()->preempt_count;
+   return READ_ONCE(current_thread_info()->preempt_count);
  }

-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
  {
	return &current_thread_info()->preempt_count;
  }



Thanks Peter, this patch worked for me. The compiler no longer optimizes 
out the increment/decrement of the preempt_count.


Thanks,
Vikram



Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-13 Thread Peter Zijlstra
On Thu, May 12, 2016 at 11:39:47PM -0700, Vikram Mulukutla wrote:
> Hi,
> 
> I came across a piece of engineering code that looked like:
> 
> preempt_disable();
> /* --cut, lots of code-- */
> preempt_enable_no_resched();
> put_user()
> preempt_disable();
> 
> (If you wish to seriously question the usage of the preempt API in this
> manner, I unfortunately have no comment since I didn't write the code.)

I'm with Thomas here, that's broken and should not be done.

> This particular block of code was causing lockups and crashes on a certain
> ARM64 device. The generated assembly revealed that the compiler was simply
> optimizing out the increment and decrement of the preempt count, allowing
> put_user to run without preemption enabled, causing all sorts of badness.
> Since put_user doesn't actually access the preempt count and translates to
> just a few instructions without any branching, I suppose that the compiler
> figured it was OK to optimize.
> 
> The immediate solution is to add a compiler barrier to the code above, but
> should sched_preempt_enable_no_resched have an additional compiler barrier
> after (it already has one before) the preempt-count decrement to prevent this
> sort of thing?

I think the below would be sufficient; IIRC the compiler may not combine
or elide volatile operations.

---
 include/asm-generic/preempt.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 5d8ffa3e6f8c..c1cde3577551 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@
 
 static __always_inline int preempt_count(void)
 {
-   return current_thread_info()->preempt_count;
+   return READ_ONCE(current_thread_info()->preempt_count);
 }
 
-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
 {
	return &current_thread_info()->preempt_count;
 }
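
For reference, READ_ONCE() gets this effect by performing the load through a
volatile-qualified lvalue. A simplified sketch of the idea (the real macro in
include/linux/compiler.h dispatches on the access size and does more type
checking):

#define READ_ONCE_SKETCH(x)	(*(const volatile typeof(x) *)&(x))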


Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-13 Thread Thomas Gleixner
On Thu, 12 May 2016, Vikram Mulukutla wrote:
> preempt_disable();
> /* --cut, lots of code-- */
> preempt_enable_no_resched();
> put_user()
> preempt_disable();
> 
> (If you wish to seriously question the usage of the preempt API in this
> manner, I unfortunately have no comment since I didn't write the code.)
> 
> This particular block of code was causing lockups and crashes on a certain
> ARM64 device. The generated assembly revealed that the compiler was simply
> optimizing out the increment and decrement of the preempt count, allowing
> put_user to run without preemption enabled, causing all sorts of badness.
> Since put_user doesn't actually access the preempt count and translates to
> just a few instructions without any branching, I suppose that the compiler
> figured it was OK to optimize.
> 
> The immediate solution is to add a compiler barrier to the code above, but
> should sched_preempt_enable_no_resched have an additional compiler barrier

preempt_enable_no_resched() should not be used at all. Use preempt_enable().
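
Applied to the reported sequence, the fix is simply (sketch):

preempt_disable();
/* --cut, lots of code-- */
preempt_enable();	/* instead of preempt_enable_no_resched() */
put_user();
preempt_disable();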

Thanks,

tglx

