Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On Mon, May 16, 2016 at 12:55:57PM +0200, Peter Zijlstra wrote:

> > So the barriers already forbid that the compiler reorders code. How on earth
> > is the compiler supposed to optimize the dec/inc out?
>
> Order things like:
>
> > #define sched_preempt_enable_no_resched() \
> > do { \
> > 	barrier(); \
> > 	preempt_count_dec(); \
> > } while (0)
>
> > #define preempt_disable() \
> > do { \
> > 	preempt_count_inc(); \
> > 	barrier(); \
> > } while (0)
>
> And there is no barrier between the dec and inc, and a smarty pants
> compiler could just decide to forgo the update, since in program order
> there is no observable difference either way.
>
> Making the thing volatile tells the compiler there can be external
> observations of the memory and it cannot assume things like that and
> must emit the operations.

...

> So I'll go write a proper changelog for the volatile thing and get it
> merged with a Cc to stable.

---
Subject: sched,preempt: Fix preempt_count manipulations
From: Peter Zijlstra
Date: Mon May 16 15:01:11 CEST 2016

Vikram reported that his ARM64 compiler managed to 'optimize' away the
preempt_count manipulations in code like:

	preempt_enable_no_resched();
	put_user();
	preempt_disable();

Irrespective of the fact that this is horrible code that should be fixed
for many reasons, it does highlight a deficiency in the generic
preempt_count manipulators, as it is never right to combine/elide
preempt_count manipulations like this.

Therefore sprinkle some volatile in the two generic accessors to ensure
the compiler is aware of the fact that the preempt_count is observed
outside of the regular program-order view and thus cannot be optimized
away like this.

x86, the only arch not using the generic code, is not affected, as we do
all this in asm in order to use the segment base per-cpu stuff.

Cc: sta...@vger.kernel.org
Cc: Thomas Gleixner
Fixes: a787870924db ("sched, arch: Create asm/preempt.h")
Reported-by: Vikram Mulukutla
Tested-by: Vikram Mulukutla
Signed-off-by: Peter Zijlstra (Intel)
---
 include/asm-generic/preempt.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@

 static __always_inline int preempt_count(void)
 {
-	return current_thread_info()->preempt_count;
+	return READ_ONCE(current_thread_info()->preempt_count);
 }

-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
 {
 	return &current_thread_info()->preempt_count;
 }
Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On Mon, May 16, 2016 at 12:55:57PM +0200, Peter Zijlstra wrote:

> You're right in that the 'proper' sequence:
>
> > #define preempt_enable() \
> > do { \
> > 	barrier(); \
> > 	if (unlikely(preempt_count_dec_and_test())) \
> > 		__preempt_schedule(); \
> > } while (0)
>
> > #define preempt_disable() \
> > do { \
> > 	preempt_count_inc(); \
> > 	barrier(); \
> > } while (0)
>
> Has a higher chance of succeeding to emit the operations to memory; but
> an even smarter pants compiler might figure doing something like:
>
> 	if (preempt_count() == 1)
> 		__preempt_schedule();
>
> is equivalent and emits that instead, not bothering to modify the actual
> variable at all -- the program as specified cannot tell the difference
> etc..

For this to work the call to __preempt_schedule() must be obfuscated, though; but I think the thing we do for x86, which turns it into:

	asm("call ___preempt_schedule;");

might just be enough for the compiler to get confused about that. A normal function call would act as a compiler barrier and force the compiler to emit the memory ops.

Then again, it could maybe do:

	if (preempt_count() == 1) {
		preempt_count_dec();
		__preempt_schedule();
		preempt_count_inc();
	}

and think it did us a service by 'optimizing' away that memory reference in the 'fast' path. Who knows what these compilers get up to ;-)
Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On Sat, May 14, 2016 at 05:39:37PM +0200, Thomas Gleixner wrote:

> I have a hard time to understand why the compiler optimizes out stuff w/o that
> patch.
>
> We already have:
>
> #define preempt_disable() \
> do { \
> 	preempt_count_inc(); \
> 	barrier(); \
> } while (0)
>
> #define sched_preempt_enable_no_resched() \
> do { \
> 	barrier(); \
> 	preempt_count_dec(); \
> } while (0)
>
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) \
> 		__preempt_schedule(); \
> } while (0)
>
> So the barriers already forbid that the compiler reorders code. How on earth
> is the compiler supposed to optimize the dec/inc out?

Order things like:

> #define sched_preempt_enable_no_resched() \
> do { \
> 	barrier(); \
> 	preempt_count_dec(); \
> } while (0)

> #define preempt_disable() \
> do { \
> 	preempt_count_inc(); \
> 	barrier(); \
> } while (0)

And there is no barrier between the dec and inc, and a smarty pants compiler could just decide to forgo the update, since in program order there is no observable difference either way.

Making the thing volatile tells the compiler there can be external observations of the memory and it cannot assume things like that and must emit the operations.

You're right in that the 'proper' sequence:

> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) \
> 		__preempt_schedule(); \
> } while (0)

> #define preempt_disable() \
> do { \
> 	preempt_count_inc(); \
> 	barrier(); \
> } while (0)

Has a higher chance of succeeding to emit the operations to memory; but an even smarter pants compiler might figure doing something like:

	if (preempt_count() == 1)
		__preempt_schedule();

is equivalent and emits that instead, not bothering to modify the actual variable at all -- the program as specified cannot tell the difference etc..

Also, in the case of !PREEMPT && PREEMPT_COUNT, the normal:

	preempt_disable();
	preempt_enable();

sequence turns into the first case again.
So I'll go write a proper changelog for the volatile thing and get it merged with a Cc to stable.
Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On 5/14/2016 8:39 AM, Thomas Gleixner wrote:
> On Fri, 13 May 2016, Vikram Mulukutla wrote:
> > On 5/13/2016 7:58 AM, Peter Zijlstra wrote:
> > > diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
> > > index 5d8ffa3e6f8c..c1cde3577551 100644
> > > --- a/include/asm-generic/preempt.h
> > > +++ b/include/asm-generic/preempt.h
> > > @@ -7,10 +7,10 @@
> > >
> > > static __always_inline int preempt_count(void)
> > > {
> > > -	return current_thread_info()->preempt_count;
> > > +	return READ_ONCE(current_thread_info()->preempt_count);
> > > }
> > >
> > > -static __always_inline int *preempt_count_ptr(void)
> > > +static __always_inline volatile int *preempt_count_ptr(void)
> > > {
> > > 	return &current_thread_info()->preempt_count;
> > > }
> >
> > Thanks Peter, this patch worked for me. The compiler no longer optimizes out
> > the increment/decrement of the preempt_count.
>
> I have a hard time to understand why the compiler optimizes out stuff w/o that
> patch.
>
> We already have:
>
> #define preempt_disable() \
> do { \
> 	preempt_count_inc(); \
> 	barrier(); \
> } while (0)
>
> #define sched_preempt_enable_no_resched() \
> do { \
> 	barrier(); \
> 	preempt_count_dec(); \
> } while (0)
>
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) \
> 		__preempt_schedule(); \
> } while (0)
>
> So the barriers already forbid that the compiler reorders code. How on earth
> is the compiler supposed to optimize the dec/inc out?

While I cannot claim more than a very rudimentary knowledge of compilers, this was my initial reaction as well. But the barriers are in the wrong spots for the way the code was used in the driver in question. preempt_disable() has the barrier() _after_ the increment, and sched_preempt_enable_no_resched() has it _before_ the decrement. Therefore, if one were to use preempt_enable_no_resched() followed by preempt_disable(), there is no compiler barrier between the decrement and the increment statements. Whether this should translate to such a seemingly aggressive optimization is something I'm not completely sure of.
> There is more code than the preempt stuff depending on barrier() and
> expecting that the compiler does not optimize out things at will. Are we
> supposed to hunt all occurrences and amend them with READ_ONCE or whatever
> one by one?

I think the barrier is working as intended for the sequence of preempt_disable() followed by preempt_enable_no_resched().

> Thanks,
> tglx

As a simple experiment I used gcc on x86 on the following C program. This was really to convince myself rather than you and Peter!

#include <stdio.h>

#define barrier() __asm__ __volatile__("": : :"memory")

struct thread_info {
	int pc;
};

#define preempt_count() \
	ti_ptr->pc

#define preempt_disable() \
do \
{ \
	preempt_count() += 1; \
	barrier(); \
} \
while (0)

#define preempt_enable() \
do \
{ \
	barrier(); \
	preempt_count() -= 1; \
} \
while (0)

struct thread_info *ti_ptr;

int main(void)
{
	struct thread_info ti;
	ti_ptr = &ti;
	int b;

	preempt_enable();
	b = b + 500;
	preempt_disable();

	printf("b = %d\n", b);

	return 0;
}

With gcc -O2 I get this (the ARM compiler behaves similarly):

00400470 <main>:
  400470:	48 83 ec 18          	sub    $0x18,%rsp
  400474:	48 89 25 cd 0b 20 00 	mov    %rsp,0x200bcd(%rip)
  40047b:	ba f4 01 00 00       	mov    $0x1f4,%edx
  400480:	be 14 06 40 00       	mov    $0x400614,%esi
  400485:	bf 01 00 00 00       	mov    $0x1,%edi
  40048a:	31 c0                	xor    %eax,%eax
  40048c:	e8 cf ff ff ff       	callq  400460 <__printf_chk@plt>
  400491:	31 c0                	xor    %eax,%eax
  400493:	48 83 c4 18          	add    $0x18,%rsp
  400497:	c3                   	retq

If I swap preempt_enable and preempt_disable I get this:

00400470 <main>:
  400470:	48 83 ec 18          	sub    $0x18,%rsp
  400474:	48 89 25 cd 0b 20 00 	mov    %rsp,0x200bcd(%rip)
  40047b:	83 04 24 01          	addl   $0x1,(%rsp)
  40047f:	48 8b 05 c2 0b 20 00 	mov    0x200bc2(%rip),%rax
  400486:	ba f4 01 00 00       	mov    $0x1f4,%edx
  40048b:	be 14 06 40 00       	mov    $0x400614,%esi
  400490:	bf 01 00 00 00       	mov    $0x1,%edi
  400495:	83 28 01             	subl   $0x1,(%rax)
  400498:	31 c0                	xor    %eax,%eax
  40049a:	e8 c1 ff ff ff       	callq  400460 <__printf_chk@plt>
  40049f:	31 c0                	xor    %eax,%eax
  4004a1:	48 83 c4 18          	add    $0x18,%rsp
  4004a5:	c3                   	retq

Note: If I place ti_ptr on the stack, the ordering/barriers no longer matter; the inc/dec is always optimized out. I suppose the compiler does treat current_thread_info as coming from a different memory location rather than the current stack frame. In any
Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On Fri, 13 May 2016, Vikram Mulukutla wrote:
> On 5/13/2016 7:58 AM, Peter Zijlstra wrote:
> > diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
> > index 5d8ffa3e6f8c..c1cde3577551 100644
> > --- a/include/asm-generic/preempt.h
> > +++ b/include/asm-generic/preempt.h
> > @@ -7,10 +7,10 @@
> >
> > static __always_inline int preempt_count(void)
> > {
> > -	return current_thread_info()->preempt_count;
> > +	return READ_ONCE(current_thread_info()->preempt_count);
> > }
> >
> > -static __always_inline int *preempt_count_ptr(void)
> > +static __always_inline volatile int *preempt_count_ptr(void)
> > {
> > 	return &current_thread_info()->preempt_count;
> > }
>
> Thanks Peter, this patch worked for me. The compiler no longer optimizes out
> the increment/decrement of the preempt_count.

I have a hard time to understand why the compiler optimizes out stuff w/o that patch.

We already have:

#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)

#define sched_preempt_enable_no_resched() \
do { \
	barrier(); \
	preempt_count_dec(); \
} while (0)

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)

So the barriers already forbid that the compiler reorders code. How on earth is the compiler supposed to optimize the dec/inc out?

There is more code than the preempt stuff depending on barrier() and expecting that the compiler does not optimize out things at will. Are we supposed to hunt all occurrences and amend them with READ_ONCE or whatever one by one?

Thanks,
tglx
Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On 5/13/2016 7:58 AM, Peter Zijlstra wrote:
> On Thu, May 12, 2016 at 11:39:47PM -0700, Vikram Mulukutla wrote:
> > Hi,
> >
> > I came across a piece of engineering code that looked like:
> >
> > 	preempt_disable();
> > 	/* --cut, lots of code-- */
> > 	preempt_enable_no_resched();
> > 	put_user()
> > 	preempt_disable();
> >
> > (If you wish to seriously question the usage of the preempt API in this
> > manner, I unfortunately have no comment since I didn't write the code.)
>
> I'm with Thomas here, that's broken and should not be done.

Ok. I did in fact zero in on this code by replacing each instance of preempt_enable_no_resched with preempt_enable one by one (there were several uses in the driver). I will ask the original developer to consider using preempt_enable.

> > This particular block of code was causing lockups and crashes on a certain
> > ARM64 device. The generated assembly revealed that the compiler was simply
> > optimizing out the increment and decrement of the preempt count, allowing
> > put_user to run without preemption enabled, causing all sorts of badness.
> > Since put_user doesn't actually access the preempt count and translates to
> > just a few instructions without any branching, I suppose that the compiler
> > figured it was OK to optimize.
> >
> > The immediate solution is to add a compiler barrier to the code above, but
> > should sched_preempt_enable_no_resched have an additional compiler barrier
> > after (has one before already) the preempt-count decrement to prevent this
> > sort of thing?
>
> I think the below would be sufficient; IIRC the compiler may not combine
> or elide volatile operations.
>
> ---
>  include/asm-generic/preempt.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
> index 5d8ffa3e6f8c..c1cde3577551 100644
> --- a/include/asm-generic/preempt.h
> +++ b/include/asm-generic/preempt.h
> @@ -7,10 +7,10 @@
>
>  static __always_inline int preempt_count(void)
>  {
> -	return current_thread_info()->preempt_count;
> +	return READ_ONCE(current_thread_info()->preempt_count);
>  }
>
> -static __always_inline int *preempt_count_ptr(void)
> +static __always_inline volatile int *preempt_count_ptr(void)
>  {
>  	return &current_thread_info()->preempt_count;
>  }

Thanks Peter, this patch worked for me. The compiler no longer optimizes out the increment/decrement of the preempt_count.

Thanks,
Vikram
Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On Thu, May 12, 2016 at 11:39:47PM -0700, Vikram Mulukutla wrote:
> Hi,
>
> I came across a piece of engineering code that looked like:
>
> 	preempt_disable();
> 	/* --cut, lots of code-- */
> 	preempt_enable_no_resched();
> 	put_user()
> 	preempt_disable();
>
> (If you wish to seriously question the usage of the preempt API in this
> manner, I unfortunately have no comment since I didn't write the code.)

I'm with Thomas here, that's broken and should not be done.

> This particular block of code was causing lockups and crashes on a certain
> ARM64 device. The generated assembly revealed that the compiler was simply
> optimizing out the increment and decrement of the preempt count, allowing
> put_user to run without preemption enabled, causing all sorts of badness.
> Since put_user doesn't actually access the preempt count and translates to
> just a few instructions without any branching, I suppose that the compiler
> figured it was OK to optimize.
>
> The immediate solution is to add a compiler barrier to the code above, but
> should sched_preempt_enable_no_resched have an additional compiler barrier
> after (has one before already) the preempt-count decrement to prevent this
> sort of thing?

I think the below would be sufficient; IIRC the compiler may not combine or elide volatile operations.

---
 include/asm-generic/preempt.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 5d8ffa3e6f8c..c1cde3577551 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@

 static __always_inline int preempt_count(void)
 {
-	return current_thread_info()->preempt_count;
+	return READ_ONCE(current_thread_info()->preempt_count);
 }

-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
 {
 	return &current_thread_info()->preempt_count;
 }
Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On Thu, 12 May 2016, Vikram Mulukutla wrote:
> 	preempt_disable();
> 	/* --cut, lots of code-- */
> 	preempt_enable_no_resched();
> 	put_user()
> 	preempt_disable();
>
> (If you wish to seriously question the usage of the preempt API in this
> manner, I unfortunately have no comment since I didn't write the code.)
>
> This particular block of code was causing lockups and crashes on a certain
> ARM64 device. The generated assembly revealed that the compiler was simply
> optimizing out the increment and decrement of the preempt count, allowing
> put_user to run without preemption enabled, causing all sorts of badness.
> Since put_user doesn't actually access the preempt count and translates to
> just a few instructions without any branching, I suppose that the compiler
> figured it was OK to optimize.
>
> The immediate solution is to add a compiler barrier to the code above, but
> should sched_preempt_enable_no_resched have an additional compiler barrier

preempt_enable_no_resched() should not be used at all. Use preempt_enable().

Thanks,
tglx