Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-03 Thread Will Schmidt

dvh...@linux.vnet.ibm.com wrote on 09/02/2010 01:04:28 AM:

 Subject

 Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with
 CONFIG_PREEMPT on pseries

 With this in place, we no longer see the preempt_count dropping below
 zero. However, if I offline/online a CPU about 246 times I hit the
 opposite problem, a preempt_count() overflow. There appears to be a
 missing preempt_enable() somewhere in the offline/online paths.

This (preempt_count overflow) also occurred in mainline (with
CONFIG_PREEMPT=y) in 2.6.35, but not in 2.6.36-rc3.  A bisect seems to
indicate it was fixed with a7c2bb8279d20d853e43c34584eaf2b039de8026
("powerpc: Re-enable preemption before cpu_die()").

Which may look familiar.  :-)

It looks like this patch went to mainline (likely via the powerpc tree), but
may not have gotten back into the -rt branch.

-Will


 --
 Darren Hart
 IBM Linux Technology Center
 Real-Time Linux Team

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-02 Thread Darren Hart
On 09/01/2010 09:06 PM, Steven Rostedt wrote:
 On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote:
 
 We need to call smp_startup_cpu on boot when the cpus are still in
 FW.  smp_startup_cpu does this for us on boot.

 I'm wondering if we just need to move the test down a bit to make sure
 the preempt_count is set.  I've not been following this thread, but
 maybe this might work?
 
 Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren
 even said that adding the exit prevented the bug (although now he's
 hitting a hard lockup someplace else). The original code he was using
 did not have the condition to return for kexec. It was just a
 coincidence that this code helped in bringing a CPU back online.
 

 Untested patch below...

 Mikey

 diff --git a/arch/powerpc/platforms/pseries/smp.c 
 b/arch/powerpc/platforms/pseries/smp.c
 index 0317cce..3afaba4 100644
 --- a/arch/powerpc/platforms/pseries/smp.c
 +++ b/arch/powerpc/platforms/pseries/smp.c
 @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned 
 int lcpu)
  
  pcpu = get_hard_smp_processor_id(lcpu);
  
 -/* Check to see if the CPU out of FW already for kexec */
 -if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
 -cpumask_set_cpu(lcpu, of_spin_mask);
 -return 1;
 -}
 -
  /* Fixup atomic count: it exited inside IRQ handler. */
  task_thread_info(paca[lcpu].__current)->preempt_count = 0;
 
 We DON'T want to do the above. It's nasty! This is one CPU's task
 touching an intimate part of another CPU's task. It's the equivalent of me
 putting my hand down your wife's blouse. It's offensive, and rude.
 
 OK, if the CPU was never online, then you can do what you want. But what
 we see is that this fails on CPU hotplug.  You stop a CPU, and it goes
 into this cede_processor() call. When you wake it up, suddenly the task
 on that woken CPU has its preempt count fscked up.  This was really
 really hard to debug. We thought it was stack corruption or something.
 But it ended up being that this code has one CPU touching the breasts of
 another CPU. This code is a pervert!
 
 What the trace clearly showed, was that we take down a CPU, and in doing
 so, the code on that CPU set the preempt count to 1, and it expected to
 have it as 1 when it returned. But the code that kick-started this CPU
 back to life (bring the CPU back online), set the preempt count on the
 task of that CPU to 0, and screwed everything up.


Right, what Steve said.

This patch resolved the problem, but it did so inadvertently while
trying to fix kexec. I'm wondering if the offline/online code needs to
be updated to never call smp_startup_cpu(). Or perhaps this is the right
fix, and the comment needs to be cleaned up so that it isn't only for
kexec.

With this in place, we no longer see the preempt_count dropping below
zero. However, if I offline/online a CPU about 246 times I hit the
opposite problem, a preempt_count() overflow. There appears to be a
missing preempt_enable() somewhere in the offline/online paths.

On 2.6.33.7 with CONFIG_PREEMPT=y I run the following test:

# cat offon_loop.sh 
#!/bin/sh
for i in `seq 100`; do
echo Iteration $i
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
done

And see the overflow:

Iteration 238
Iteration 239
Iteration 240

Message from sysl...@igoort1 at Sep  1 17:36:45 ...
 kernel:[ cut here ]
Iteration 241
Iteration 242
Iteration 243
Iteration 244
Iteration 245

With the following on the console. This is the:
/*
 * Spinlock count overflowing soon?
 */
DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
PREEMPT_MASK - 10);
test. Max preempt count is 256.
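
For illustration, a small userspace model (assumption: the PREEMPT_BITS /
PREEMPT_MASK values mirror the 2.6.33-era include/linux/hardirq.h layout) of
why one leaked preempt_disable() per offline/online cycle trips that warning
after roughly 245 iterations, which matches the ~246 cycles observed above:

#include <stdio.h>

#define PREEMPT_BITS  8
#define PREEMPT_MASK  ((1u << PREEMPT_BITS) - 1)   /* 0xff: max depth 256 */

int main(void)
{
        unsigned int preempt_count = 0;
        unsigned int iteration;

        for (iteration = 1; iteration <= 300; iteration++) {
                /* the missing preempt_enable() leaks one level per cycle */
                preempt_count++;
                if ((preempt_count & PREEMPT_MASK) >= PREEMPT_MASK - 10) {
                        printf("iteration %u: spinlock count overflowing soon\n",
                               iteration);
                        break;
                }
        }
        return 0;
}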

Somewhere we are now MISSING a sub_preempt_count() or preempt_enable().

Badness at kernel/sched.c:5372
NIP: c0694f14 LR: c0694ef8 CTR: c005b0a0
REGS: c0008e776ae0 TRAP: 0700   Not tainted  (2.6.33.7-dirty)
MSR: 80021032 ME,CE,IR,DR  CR: 2882  XER: 
TASK = c0010e6fc040[0] 'swapper' THREAD: c0008e774000 CPU: 1
GPR00:  c0008e776d60 c0af2ab8 0001
GPR04: c0008e776fb8 0004 0001 ff676980
GPR08: 0400 c0bcd930 0001 c0b2d360
GPR12: 0002 c0b0f680  0f394acc
GPR16:    c0008e777880
GPR20: c0e05160 c0008e777870 7fff c0b0f480
GPR24: c09da400 0002  0001
GPR28: c0008e776fb8 0001 c0a75bf0 c0008e776d60
NIP [c0694f14] .add_preempt_count+0xc0/0xe0
LR [c0694ef8] .add_preempt_count+0xa4/0xe0
Call Trace:
[c0008e776d60] [c0008e776e00] 0xc0008e776e00 (unreliable)
[c0008e776df0] 

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-02 Thread Michael Neuling


In message 1283400367.2356.69.ca...@gandalf.stny.rr.com you wrote:
 On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote:
 
  We need to call smp_startup_cpu on boot when the cpus are still in
  FW.  smp_startup_cpu does this for us on boot.
  
  I'm wondering if we just need to move the test down a bit to make sure
  the preempt_count is set.  I've not been following this thread, but
  maybe this might work?
 
 Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren
 even said that adding the exit prevented the bug (although now he's
 hitting a hard lockup someplace else). The original code he was using
 did not have the condition to return for kexec. It was just a
 coincidence that this code helped in bringing a CPU back online.
 
  
  Untested patch below...
  
  Mikey
  
  diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/
pseries/smp.c
  index 0317cce..3afaba4 100644
  --- a/arch/powerpc/platforms/pseries/smp.c
  +++ b/arch/powerpc/platforms/pseries/smp.c
  @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned 
int lcpu)
   
  pcpu = get_hard_smp_processor_id(lcpu);
   
  -   /* Check to see if the CPU out of FW already for kexec */
  -   if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
  -   cpumask_set_cpu(lcpu, of_spin_mask);
  -   return 1;
  -   }
  -
  /* Fixup atomic count: it exited inside IRQ handler. */
  task_thread_info(paca[lcpu].__current)->preempt_count = 0;
 
 We DON'T want to do the above. It's nasty! This is one CPU's task
 touching an intimate part of another CPU's task. It's the equivalent of me
 putting my hand down your wife's blouse. It's offensive, and rude.
 
 OK, if the CPU was never online, then you can do what you want. But what
 we see is that this fails on CPU hotplug.  You stop a CPU, and it goes
 into this cede_processor() call. When you wake it up, suddenly the task
 on that woken CPU has its preempt count fscked up.  This was really
 really hard to debug. We thought it was stack corruption or something.
 But it ended up being that this code has one CPU touching the breasts of
 another CPU. This code is a pervert!
 
 What the trace clearly showed, was that we take down a CPU, and in doing
 so, the code on that CPU set the preempt count to 1, and it expected to
 have it as 1 when it returned. But the code that kick-started this CPU
 back to life (bring the CPU back online), set the preempt count on the
 task of that CPU to 0, and screwed everything up.

/me goes to check where this came from...

It's been in the kernel since hotplug CPU support was added to ppc64
back in 2004, so I guess we are all at fault for letting this pervert
get away with this stuff for so long in plain sight. :-)

So I guess we should remove this but we need to audit all the different
paths that go through here to make sure they are OK with preempt.
Normal boot, kexec boot, hotplug with FW stop and hotplug with
extended_cede all hit this.

Mikey


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-02 Thread Darren Hart
On 09/02/2010 04:04 PM, Michael Neuling wrote:
 In message 1283400367.2356.69.ca...@gandalf.stny.rr.com you wrote:
 On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote:

 We need to call smp_startup_cpu on boot when the cpus are still in
 FW.  smp_startup_cpu does this for us on boot.

 I'm wondering if we just need to move the test down a bit to make sure
 the preempt_count is set.  I've not been following this thread, but
 maybe this might work?

 Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren
 even said that adding the exit prevented the bug (although now he's
 hitting a hard lockup someplace else). The original code he was using
 did not have the condition to return for kexec. It was just a
 coincidence that this code helped in bringing a CPU back online.


 Untested patch below...

 Mikey

 diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/
 pseries/smp.c
 index 0317cce..3afaba4 100644
 --- a/arch/powerpc/platforms/pseries/smp.c
 +++ b/arch/powerpc/platforms/pseries/smp.c
 @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned 
 int lcpu)
  
 pcpu = get_hard_smp_processor_id(lcpu);
  
 -   /* Check to see if the CPU out of FW already for kexec */
 -   if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
 -   cpumask_set_cpu(lcpu, of_spin_mask);
 -   return 1;
 -   }
 -
 /* Fixup atomic count: it exited inside IRQ handler. */
 task_thread_info(paca[lcpu].__current)->preempt_count = 0;

 We DON'T want to do the above. It's nasty! This is one CPU's task
 touching an intimate part of another CPU's task. It's the equivalent of me
 putting my hand down your wife's blouse. It's offensive, and rude.

 OK, if the CPU was never online, then you can do what you want. But what
 we see is that this fails on CPU hotplug.  You stop a CPU, and it goes
 into this cede_processor() call. When you wake it up, suddenly the task
 on that woken CPU has its preempt count fscked up.  This was really
 really hard to debug. We thought it was stack corruption or something.
 But it ended up being that this code has one CPU touching the breasts of
 another CPU. This code is a pervert!

 What the trace clearly showed, was that we take down a CPU, and in doing
 so, the code on that CPU set the preempt count to 1, and it expected to
 have it as 1 when it returned. But the code that kick-started this CPU
 back to life (bring the CPU back online), set the preempt count on the
 task of that CPU to 0, and screwed everything up.
 
 /me goes to check where this came from...
 
 It's been in the kernel since hotplug CPU support was added to ppc64
 back in 2004, so I guess we are all at fault for letting this pervert
 get away with this stuff for so long in plain sight. :-)
 
 So I guess we should remove this but we need to audit all the different
 paths that go through here to make sure they are OK with preempt.
 Normal boot, kexec boot, hotplug with FW stop and hotplug with
 extended_cede all hit this.
 
 Mikey

CC'ing my alter ego.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-01 Thread Darren Hart
On 08/31/2010 10:54 PM, Michael Ellerman wrote:
 On Tue, 2010-08-31 at 00:12 -0700, Darren Hart wrote:
 ..

 When running with the function plugin I had to stop the trace
 immediately before entering start_secondary after an online or my traces
 would not include the pseries_mach_cpu_die function, nor the tracing I
 added there (possibly buffer size, I am using 2048). The following trace
 was collected using trace-cmd record -p function -e irq -e sched and
 has been filtered to only show CPU [001] (the CPU undergoing the
 offline/online test, and the one seeing preempt_count (pcnt) go to
  after cede. The function tracer does not indicate anything
 running on the CPU other than the HCALL - unless the __trace_hcall*
 commands might be to blame. 
 
 It's not impossible. Though normally they're disabled right, so the only
 reason they're running is because you're tracing. So if they are causing
 the bug then that doesn't explain why you see it normally.
 
 Still, might be worth disabling just the hcall tracepoints just to be
 100% sure.

A couple of updates. I was working from tip/rt/head, which has been
stale for some months now. I switched to tip/rt/2.6.33 and the
preempt_count() change over cede went away. I now see the live hang that
Will has reported.

Before I dive into the live hang, I want to understand what fixed the
preempt_count() change. That might start pointing us in the right
direction for the live hang.

I did an inverted git bisect between tip/rt/head and tip/rt/2.6.33 to
try and locate the fix. I've narrowed it down to the 2.6.33.6 merge:

# git show 7e1af1172bbd4109d09ac515c5d376f633da7cff
commit 7e1af1172bbd4109d09ac515c5d376f633da7cff
Merge: d8e94db 9666790
Author: Thomas Gleixner t...@linutronix.de
Date:   Tue Jul 13 16:01:16 2010 +0200

Merge
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.y

Conflicts:
Makefile

Signed-off-by: Thomas Gleixner t...@linutronix.de


Visual inspection yields two patches of interest:

f8b67691828321f5c85bb853283aa101ae673130
powerpc/pseries: Make query-cpu-stopped callable outside hotplug cpu

aef40e87d866355ffd279ab21021de733242d0d5
powerpc/pseries: Only call start-cpu when a CPU is stopped

I'm going to try reverting these today and see if they addressed the
issue indirectly.


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-01 Thread Darren Hart
On 09/01/2010 08:10 AM, Darren Hart wrote:
 On 08/31/2010 10:54 PM, Michael Ellerman wrote:
 On Tue, 2010-08-31 at 00:12 -0700, Darren Hart wrote:
 ..

 When running with the function plugin I had to stop the trace
 immediately before entering start_secondary after an online or my traces
 would not include the pseries_mach_cpu_die function, nor the tracing I
 added there (possibly buffer size, I am using 2048). The following trace
 was collected using trace-cmd record -p function -e irq -e sched and
 has been filtered to only show CPU [001] (the CPU undergoing the
 offline/online test, and the one seeing preempt_count (pcnt) go to
  after cede. The function tracer does not indicate anything
 running on the CPU other than the HCALL - unless the __trace_hcall*
 commands might be to blame. 

 It's not impossible. Though normally they're disabled right, so the only
 reason they're running is because you're tracing. So if they are causing
 the bug then that doesn't explain why you see it normally.

 Still, might be worth disabling just the hcall tracepoints just to be
 100% sure.
 
 A couple of updates. I was working from tip/rt/head, which has been
 stale for some months now. I switched to tip/rt/2.6.33 and the
 preempt_count() change over cede went away. I now see the live hang that
 Will has reported.
 
 Before I dive into the live hang, I want to understand what fixed the
 preempt_count() change. That might start pointing us in the right
 direction for the live hang.
 
 I did an inverted git bisect between tip/rt/head and tip/rt/2.6.33 to
 try and locate the fix. I've narrowed it down to the 2.6.33.6 merge:
 
 # git show 7e1af1172bbd4109d09ac515c5d376f633da7cff
 commit 7e1af1172bbd4109d09ac515c5d376f633da7cff
 Merge: d8e94db 9666790
 Author: Thomas Gleixner t...@linutronix.de
 Date:   Tue Jul 13 16:01:16 2010 +0200
 
 Merge
 git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.y
 
 Conflicts:
 Makefile
 
 Signed-off-by: Thomas Gleixner t...@linutronix.de
 
 
 Visual inspection yields two patches of interest:
 
 f8b67691828321f5c85bb853283aa101ae673130
 powerpc/pseries: Make query-cpu-stopped callable outside hotplug cpu
 
 aef40e87d866355ffd279ab21021de733242d0d5
 powerpc/pseries: Only call start-cpu when a CPU is stopped
 
 I'm going to try reverting these today and see if they addressed the
 issue indirectly.

Removing:

aef40e87d866355ffd279ab21021de733242d0d5
powerpc/pseries: Only call start-cpu when a CPU is stopped

--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -82,6 +82,12 @@ static inline int __devinit smp_startup_cpu(unsigned int 
lcpu)
 
pcpu = get_hard_smp_processor_id(lcpu);
 
+   /* Check to see if the CPU out of FW already for kexec */

This comment is really confusing to me. I _think_ it is saying that this test
determines if the CPU is done executing firmware code and has begun executing
OS code... Is that right?

+   if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
+   cpu_set(lcpu, of_spin_map);
+   return 1;
+   }
+
/* Fixup atomic count: it exited inside IRQ handler. */
task_thread_info(paca[lcpu].__current)->preempt_count = 0;

from tip/rt/2.6.33 causes the preempt_count() to change across the cede
call.  This patch appears to prevent the proxy preempt_count assignment
from happening. This non-local-cpu assignment to 0 would cause an
underrun of preempt_count() if the local-cpu had disabled preemption
prior to the assignment and then later tried to enable it. This appears
to be the case with the stack of __trace_hcall* calls preceding the
return from extended_cede_processor() in the latency format trace-cmd
report:

  idle-0   1d   201.252737: function: .cpu_die
  idle-0   1d   201.252738: function:
.pseries_mach_cpu_die
  idle-0   1d   201.252740: function:   
.idle_task_exit
  idle-0   1d   201.252741: function:  .switch_slb
  idle-0   1d   201.252742: function:   
.xics_teardown_cpu
  idle-0   1d   201.252743: function:  
.xics_set_cpu_priority
  idle-0   1d   201.252744: function: .__trace_hcall_entry
  idle-0   1d..1.   201.252745: function:
.probe_hcall_entry
  idle-0   1d..1.   201.252746: function: .__trace_hcall_exit
  idle-0   1d..2.   201.252747: function:.probe_hcall_exit
  idle-0   1d   201.252748: function: .__trace_hcall_entry
  idle-0   1d..1.   201.252748: function:
.probe_hcall_entry
  idle-0   1d..1.   201.252750: function: .__trace_hcall_exit
  idle-0   1d..2.   201.252751: function:.probe_hcall_exit
  idle-0   1d   201.252752: function: .__trace_hcall_entry
  idle-0   1d..1.   201.252753: 

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-01 Thread Steven Rostedt
On Wed, 2010-09-01 at 11:47 -0700, Darren Hart wrote:

 from tip/rt/2.6.33 causes the preempt_count() to change across the cede
 call.  This patch appears to prevent the proxy preempt_count assignment
 from happening. This non-local-cpu assignment to 0 would cause an
 underrun of preempt_count() if the local-cpu had disabled preemption
 prior to the assignment and then later tried to enable it. This appears
 to be the case with the stack of __trace_hcall* calls preceding the
 return from extended_cede_processor() in the latency format trace-cmd
 report:
 
   idle-0   1d   201.252737: function: .cpu_die

Note, the above 1d is a series of values. The first being the CPU,
the next if interrupts are disabled, the next if the NEED_RESCHED flag
is set, the next is softirqs enabled or disabled, next the
preempt_count, and finally the lockdepth count.

Here we only care about the preempt_count, which is zero when '.' and a
number if it is something else. It is the second to last field in that
list.
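
Decoding one of those lines with that field order (this is just a
restatement of the list above):

  idle-0   1d..2.   201.252747: function: .probe_hcall_exit
           ||||||
           |||||+- lockdep depth  ('.' = 0)
           ||||+-- preempt_count  (here 2)
           |||+--- softirq state  ('.' = none)
           ||+---- NEED_RESCHED   ('.' = not set)
           |+----- irqs disabled  ('d')
           +------ CPU number     (1)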


   idle-0   1d   201.252738: function:
 .pseries_mach_cpu_die
   idle-0   1d   201.252740: function:   
 .idle_task_exit
   idle-0   1d   201.252741: function:  
 .switch_slb
   idle-0   1d   201.252742: function:   
 .xics_teardown_cpu
   idle-0   1d   201.252743: function:  
 .xics_set_cpu_priority
   idle-0   1d   201.252744: function: 
 .__trace_hcall_entry
   idle-0   1d..1.   201.252745: function:
 .probe_hcall_entry

   ^
preempt_count set to 1

   idle-0   1d..1.   201.252746: function: 
 .__trace_hcall_exit
   idle-0   1d..2.   201.252747: function:
 .probe_hcall_exit
   idle-0   1d   201.252748: function: 
 .__trace_hcall_entry
   idle-0   1d..1.   201.252748: function:
 .probe_hcall_entry
   idle-0   1d..1.   201.252750: function: 
 .__trace_hcall_exit
   idle-0   1d..2.   201.252751: function:
 .probe_hcall_exit
   idle-0   1d   201.252752: function: 
 .__trace_hcall_entry
   idle-0   1d..1.   201.252753: function:
 .probe_hcall_entry
   ^   ^
  CPU  preempt_count

Entering the function probe_hcall_entry() the preempt_count is 1 (see
below). But probe_hcall_entry does:

h = get_cpu_var(hcall_stats)[opcode / 4];

Without doing the put (which it does in probe_hcall_exit())

So exiting the probe_hcall_entry() the preempt_count is 2.
The trace_hcall_entry() will do a preempt_enable() making it leave as 1.


   offon.sh-3684  6.   201.466488: bprint:   
 .smp_pSeries_kick_cpu : resetting pcnt to 0 for cpu 1

This is CPU 6, changing the preempt count from 1 to 0.

 
 preempt_count() is reset from 1 to 0 by smp_startup_cpu() without the
 QCSS_NOT_STOPPED check from the patch above.
 
   idle-0   1d   201.466503: function: 
 .__trace_hcall_exit

Note: __trace_hcall_exit() and __trace_hcall_entry() basically do:

 preempt_disable();
 call probe();
 preempt_enable();


   idle-0   1d..1.   201.466505: function:
 .probe_hcall_exit

The preempt_count of 1 entering the probe_hcall_exit() is because of the
preempt_disable() shown above. It should have been entered as a 2.

But then it does:


put_cpu_var(hcall_stats);

making preempt_count 0.

But the preempt_enable() in the trace_hcall_exit() causes this to be -1.


   idle-0   1d.Hff.   201.466507: bprint:   
 .pseries_mach_cpu_die : after cede: 
 
 With the preempt_count() being one less than it should be, the final
 preempt_enable() in the trace_hcall path drops preempt_count to
 0xffffffff, which of course is an illegal value and leads to a crash.
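
To make that accounting concrete, here is a toy userspace model (an
illustration only, not kernel code) of the sequence described above:
get_cpu_var()/put_cpu_var() are modelled as an extra disable/enable pair
split across the two probes, and the bare reset of pcnt stands in for the
smp_startup_cpu() write from the other CPU:

#include <stdio.h>

static int pcnt;                      /* the dying CPU's idle-task preempt_count */
static void preempt_disable(void) { pcnt++; }
static void preempt_enable(void)  { pcnt--; }

static void trace_hcall_entry(void)
{
        preempt_disable();            /* wrapper:       0 -> 1 */
        preempt_disable();            /* get_cpu_var(): 1 -> 2, no matching put here */
        preempt_enable();             /* wrapper:       2 -> 1, held across H_CEDE */
}

static void trace_hcall_exit(void)
{
        preempt_disable();            /* wrapper:       0 -> 1 (should have been 1 -> 2) */
        preempt_enable();             /* put_cpu_var(): 1 -> 0 */
        preempt_enable();             /* wrapper:       0 -> -1, the underflow */
}

int main(void)
{
        trace_hcall_entry();          /* before the cede */
        printf("held across cede: pcnt=%d\n", pcnt);    /* 1 */
        pcnt = 0;                     /* smp_startup_cpu() on another CPU resets it */
        trace_hcall_exit();           /* after the CPU comes back */
        printf("after exit probe: pcnt=%d\n", pcnt);    /* -1, i.e. 0xffffffff */
        return 0;
}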

I'm confused as to how this works in mainline?

-- Steve

 
 I don't know if the patch above is the right fix for this or not as I
 don't yet understand why this occurs only with PREEMPT_RT and not in
 mainline. Once we sort that out, we may need a more specific fix.





Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-01 Thread Darren Hart
On 09/01/2010 12:59 PM, Steven Rostedt wrote:
 On Wed, 2010-09-01 at 11:47 -0700, Darren Hart wrote:
 
 from tip/rt/2.6.33 causes the preempt_count() to change across the cede
 call.  This patch appears to prevent the proxy preempt_count assignment
 from happening. This non-local-cpu assignment to 0 would cause an
 underrun of preempt_count() if the local-cpu had disabled preemption
 prior to the assignment and then later tried to enable it. This appears
 to be the case with the stack of __trace_hcall* calls preceding the
 return from extended_cede_processor() in the latency format trace-cmd
 report:

   idle-0   1d   201.252737: function: .cpu_die
 
 Note, the above 1d is a series of values. The first being the CPU,
 the next if interrupts are disabled, the next if the NEED_RESCHED flag
 is set, the next is softirqs enabled or disabled, next the
 preempt_count, and finally the lockdepth count.
 
 Here we only care about the preempt_count, which is zero when '.' and a
 number if it is something else. It is the second to last field in that
 list.
 
 
   idle-0   1d   201.252738: function:
 .pseries_mach_cpu_die
   idle-0   1d   201.252740: function:   
 .idle_task_exit
   idle-0   1d   201.252741: function:  
 .switch_slb
   idle-0   1d   201.252742: function:   
 .xics_teardown_cpu
   idle-0   1d   201.252743: function:  
 .xics_set_cpu_priority
   idle-0   1d   201.252744: function: 
 .__trace_hcall_entry
   idle-0   1d..1.   201.252745: function:
 .probe_hcall_entry
 
^
 preempt_count set to 1
 
   idle-0   1d..1.   201.252746: function: 
 .__trace_hcall_exit
   idle-0   1d..2.   201.252747: function:
 .probe_hcall_exit
   idle-0   1d   201.252748: function: 
 .__trace_hcall_entry
   idle-0   1d..1.   201.252748: function:
 .probe_hcall_entry
   idle-0   1d..1.   201.252750: function: 
 .__trace_hcall_exit
   idle-0   1d..2.   201.252751: function:
 .probe_hcall_exit
   idle-0   1d   201.252752: function: 
 .__trace_hcall_entry
   idle-0   1d..1.   201.252753: function:
 .probe_hcall_entry
^   ^
   CPU  preempt_count
 
 Entering the function probe_hcall_entry() the preempt_count is 1 (see
 below). But probe_hcall_entry does:
 
   h = get_cpu_var(hcall_stats)[opcode / 4];
 
 Without doing the put (which it does in probe_hcall_exit())
 
 So exiting the probe_hcall_entry() the preempt_count is 2.
 The trace_hcall_entry() will do a preempt_enable() making it leave as 1.
 
 
   offon.sh-3684  6.   201.466488: bprint:   
 .smp_pSeries_kick_cpu : resetting pcnt to 0 for cpu 1
 
 This is CPU 6, changing the preempt count from 1 to 0.
 

 preempt_count() is reset from 1 to 0 by smp_startup_cpu() without the
 QCSS_NOT_STOPPED check from the patch above.

   idle-0   1d   201.466503: function: 
 .__trace_hcall_exit
 
 Note: __trace_hcall_exit() and __trace_hcall_entry() basically do:
 
  preempt_disable();
  call probe();
  preempt_enable();
 
 
   idle-0   1d..1.   201.466505: function:
 .probe_hcall_exit
 
 The preempt_count of 1 entering the probe_hcall_exit() is because of the
 preempt_disable() shown above. It should have been entered as a 2.
 
 But then it does:
 
 
   put_cpu_var(hcall_stats);
 
 making preempt_count 0.
 
 But the preempt_enable() in the trace_hcall_exit() causes this to be -1.
 
 
   idle-0   1d.Hff.   201.466507: bprint:   
 .pseries_mach_cpu_die : after cede: 

 With the preempt_count() being one less than it should be, the final
 preempt_enable() in the trace_hcall path drops preempt_count to
 0xffffffff, which of course is an illegal value and leads to a crash.
 
 I'm confused as to how this works in mainline?

Turns out it didn't. 2.6.33.5 with CONFIG_PREEMPT=y sees this exact same
behavior. The following, part of the 2.6.33.6 stable release, prevents
this from happening:

aef40e87d866355ffd279ab21021de733242d0d5
powerpc/pseries: Only call start-cpu when a CPU is stopped

--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -82,6 +82,12 @@ static inline int __devinit smp_startup_cpu(unsigned
int lcpu)

pcpu = get_hard_smp_processor_id(lcpu);

+   /* Check to see if the CPU out of FW already for kexec */
+   if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
+   cpu_set(lcpu, of_spin_map);
+   return 1;
+   }
+
/* Fixup atomic count: it exited inside IRQ handler. */
task_thread_info(paca[lcpu].__current)->preempt_count = 0;

The question is now, Is this the right fix? If so, perhaps we can 

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-01 Thread Michael Neuling


In message 4c7ebaa8.7030...@us.ibm.com you wrote:
 On 09/01/2010 12:59 PM, Steven Rostedt wrote:
  On Wed, 2010-09-01 at 11:47 -0700, Darren Hart wrote:
  
  from tip/rt/2.6.33 causes the preempt_count() to change across the cede
  call.  This patch appears to prevent the proxy preempt_count assignment
  from happening. This non-local-cpu assignment to 0 would cause an
  underrun of preempt_count() if the local-cpu had disabled preemption
  prior to the assignment and then later tried to enable it. This appears
  to be the case with the stack of __trace_hcall* calls preceding the
  return from extended_cede_processor() in the latency format trace-cmd
  report:
 
idle-0   1d   201.252737: function: .cpu_die
  
  Note, the above 1d is a series of values. The first being the CPU,
  the next if interrupts are disabled, the next if the NEED_RESCHED flag
  is set, the next is softirqs enabled or disabled, next the
  preempt_count, and finally the lockdepth count.
  
  Here we only care about the preempt_count, which is zero when '.' and a
  number if it is something else. It is the second to last field in that
  list.
  
  
idle-0   1d   201.252738: function:.pseries_ma
ch_cpu_die
idle-0   1d   201.252740: function:   .idle_ta
sk_exit
idle-0   1d   201.252741: function:  .swit
ch_slb
idle-0   1d   201.252742: function:   .xics_te
ardown_cpu
idle-0   1d   201.252743: function:  .xics
_set_cpu_priority
idle-0   1d   201.252744: function: .__trace_hcall
_entry
idle-0   1d..1.   201.252745: function:.probe_hcal
l_entry
  
 ^
  preempt_count set to 1
  
idle-0   1d..1.   201.252746: function: .__trace_hcall
_exit
idle-0   1d..2.   201.252747: function:.probe_hcal
l_exit
idle-0   1d   201.252748: function: .__trace_hcall
_entry
idle-0   1d..1.   201.252748: function:.probe_hcal
l_entry
idle-0   1d..1.   201.252750: function: .__trace_hcall
_exit
idle-0   1d..2.   201.252751: function:.probe_hcal
l_exit
idle-0   1d   201.252752: function: .__trace_hcall
_entry
idle-0   1d..1.   201.252753: function:.probe_hcal
l_entry
 ^   ^
CPU  preempt_count
  
  Entering the function probe_hcall_entry() the preempt_count is 1 (see
  below). But probe_hcall_entry does:
  
  h = get_cpu_var(hcall_stats)[opcode / 4];
  
  Without doing the put (which it does in probe_hcall_exit())
  
  So exiting the probe_hcall_entry() the preempt_count is 2.
  The trace_hcall_entry() will do a preempt_enable() making it leave as 1.
  
  
offon.sh-3684  6.   201.466488: bprint:   .smp_pSeries_k
ick_cpu : resetting pcnt to 0 for cpu 1
  
  This is CPU 6, changing the preempt count from 1 to 0.
  
 
  preempt_count() is reset from 1 to 0 by smp_startup_cpu() without the
  QCSS_NOT_STOPPED check from the patch above.
 
idle-0   1d   201.466503: function: .__trace_hcall
_exit
  
  Note: __trace_hcall_exit() and __trace_hcall_entry() basically do:
  
   preempt_disable();
   call probe();
   preempt_enable();
  
  
idle-0   1d..1.   201.466505: function:.probe_hcal
l_exit
  
  The preempt_count of 1 entering the probe_hcall_exit() is because of the
  preempt_disable() shown above. It should have been entered as a 2.
  
  But then it does:
  
  
  put_cpu_var(hcall_stats);
  
  making preempt_count 0.
  
  But the preempt_enable() in the trace_hcall_exit() causes this to be -1.
  
  
idle-0   1d.Hff.   201.466507: bprint:   .pseries_mach
_cpu_die : after cede: 
 
  With the preempt_count() being one less than it should be, the final
  preempt_enable() in the trace_hcall path drops preempt_count to
  0xffffffff, which of course is an illegal value and leads to a crash.
  
  I'm confused as to how this works in mainline?
 
 Turns out it didn't. 2.6.33.5 with CONFIG_PREEMPT=y sees this exact same
 behavior. The following, part of the 2.6.33.6 stable release, prevents
 this from happening:
 
 aef40e87d866355ffd279ab21021de733242d0d5
 powerpc/pseries: Only call start-cpu when a CPU is stopped
 
 --- a/arch/powerpc/platforms/pseries/smp.c
 +++ b/arch/powerpc/platforms/pseries/smp.c
 @@ -82,6 +82,12 @@ static inline int __devinit smp_startup_cpu(unsigned
 int lcpu)
 
 pcpu = get_hard_smp_processor_id(lcpu);
 
 +   /* Check to see if the CPU out of FW already for kexec */
 +   if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
 +   cpu_set(lcpu, of_spin_map);
 +   return 1;
 +   }
 +
 /* Fixup atomic count: it exited inside IRQ 

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-01 Thread Michael Neuling
 +   /* Check to see if the CPU out of FW already for kexec */

Wow, that comment is shit.  The checkin comment in
aef40e87d866355ffd279ab21021de733242d0d5 is much better.


 This comment is really confusing to me. I _think_ it is saying that this test
 determines if the CPU is done executing firmware code and has begun executing
 OS code... Is that right?

Yeah.  

It means for a normal boot, the CPU will not have started yet (still in
firmware (FW)) so we have to call FW to bring it out.  In the kexec case
though, the CPU will have started already (it's spinning in the kernel)
so we don't have to call FW to bring it back out again.  To distinguish
between these two cases, we ask FW if the CPU has started or not (via
smp_query_cpu_stopped()) and if it's already started, don't restart it.

Originally, we could call FW to start a cpu that was already started,
but FW changed recently and stopped allowing us to do this.  Hence this
patch.
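
Annotated (a sketch restating the above, not new code; the hunk itself is
quoted elsewhere in this thread), the check separates the cases like so:

        /* Has the CPU already left FW?  True for kexec, and for a CPU that
         * went offline via cede and is now coming back in the kernel. */
        if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED) {
                /* Already out of FW: leave its state (including the idle
                 * task's preempt_count) alone and don't call start-cpu. */
                cpumask_set_cpu(lcpu, of_spin_mask);
                return 1;
        }

        /* Still held in FW (normal boot): safe to reset the idle task's
         * preempt_count and ask FW/RTAS to start the CPU. */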

Mikey


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-09-01 Thread Steven Rostedt
On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote:

 We need to call smp_startup_cpu on boot when the cpus are still in
 FW.  smp_startup_cpu does this for us on boot.
 
 I'm wondering if we just need to move the test down a bit to make sure
 the preempt_count is set.  I've not been following this thread, but
 maybe this might work?

Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren
even said that adding the exit prevented the bug (although now he's
hitting a hard lockup someplace else). The original code he was using
did not have the condition to return for kexec. It was just a
coincidence that this code helped in bringing a CPU back online.

 
 Untested patch below...
 
 Mikey
 
 diff --git a/arch/powerpc/platforms/pseries/smp.c 
 b/arch/powerpc/platforms/pseries/smp.c
 index 0317cce..3afaba4 100644
 --- a/arch/powerpc/platforms/pseries/smp.c
 +++ b/arch/powerpc/platforms/pseries/smp.c
 @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned 
 int lcpu)
  
   pcpu = get_hard_smp_processor_id(lcpu);
  
 - /* Check to see if the CPU out of FW already for kexec */
 - if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
 - cpumask_set_cpu(lcpu, of_spin_mask);
 - return 1;
 - }
 -
   /* Fixup atomic count: it exited inside IRQ handler. */
   task_thread_info(paca[lcpu].__current)->preempt_count = 0;

We DON'T want to do the above. It's nasty! This is one CPU's task
touching an intimate part of another CPU's task. It's the equivalent of me
putting my hand down your wife's blouse. It's offensive, and rude.

OK, if the CPU was never online, then you can do what you want. But what
we see is that this fails on CPU hotplug.  You stop a CPU, and it goes
into this cede_processor() call. When you wake it up, suddenly the task
on that woken CPU has its preempt count fscked up.  This was really
really hard to debug. We thought it was stack corruption or something.
But it ended up being that this code has one CPU touching the breasts of
another CPU. This code is a pervert!

What the trace clearly showed, was that we take down a CPU, and in doing
so, the code on that CPU set the preempt count to 1, and it expected to
have it as 1 when it returned. But the code that kick-started this CPU
back to life (bring the CPU back online), set the preempt count on the
task of that CPU to 0, and screwed everything up.

-- Steve


  
   if (get_cpu_current_state(lcpu) == CPU_STATE_INACTIVE)
   goto out;
  
 + /* Check to see if the CPU out of FW already for kexec */
 + if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
 + cpumask_set_cpu(lcpu, of_spin_mask);
 + return 1;
 + }
 +
   /* 
* If the RTAS start-cpu token does not exist then presume the
* cpu is already spinning.
 




Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-31 Thread Darren Hart
On 08/19/2010 08:58 AM, Ankita Garg wrote:
 Hi Darren,
 
 On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote:

 With some instrumentation we were able to determine that the
 preempt_count() appears to change across the extended_cede_processor()
 call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
 PREEMPT_RT we call this with preempt_count=1 and return with
 preempt_count=0x. On mainline with CONFIG_PREEMPT=y, the value
 is different (0x65) but is still incorrect.
 
 I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could
 easily reproduce this on the RT kernel and not the non-RT kernel.

This matches my experience.

 However, I hit it every single time I did a cpu online operation. I also
 noticed that the issue persists even when I disable H_CEDE by passing
 the cede_offline=0 kernel commandline parameter. Could you pl confirm
 if you observe the same in your setup ?

Yes, this is my experience as well.

 
 However, the issue still remains. Will spend few cycles looking into
 this issue.
 

I've spent today trying to collect some useful traces. I consistently
see the preempt_count() change to 0x across the H_CEDE call, but
the irq and sched events (to ftrace) do not indicate anything else
running on the CPU that could change that.

  idle-0   1d.h1. 11408us : irq_handler_entry: irq=16 name=IPI
  idle-0   1d.h1. 11411us : irq_handler_exit: irq=16 ret=handled
  ...other cpus
  idle-0   1d 22101us : .pseries_mach_cpu_die: start
  idle-0   1d 22108us : .pseries_mach_cpu_die: cpu 1 (hwid 1) ceding 
for offline with hint 2
  ...other cpus
  idle-0   1d.Hff. 33804us : .pseries_mach_cpu_die: returned from cede 
with pcnt: 
  idle-0   1d.Hff. 33805us : .pseries_mach_cpu_die: forcing pcnt to 0
  idle-0   1d 33807us : .pseries_mach_cpu_die: cpu 1 (hwid 1) 
returned from cede.
  idle-0   1d 33808us : .pseries_mach_cpu_die: Decrementer value = 
7fa49474 Timebase value = baefee6c88113
  idle-0   1d 33809us : .pseries_mach_cpu_die: cpu 1 (hwid 1) got 
prodded to go online
  idle-0   1d 33816us : .pseries_mach_cpu_die: calling 
start_seconday, pcnt: 0
  idle-0   1d 33816us : .pseries_mach_cpu_die: forcing pcnt to 0

-
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last 
unloaded: scsi_wait_scan]
NIP: c006aa50 LR: c006ac40 CTR: c006ac14
REGS: c0008e79ffc0 TRAP: 0300   Not tainted  
(2.6.33-rt-dvhrt16-02358-g61223dd-dirty)
MSR: 80001032 ME,IR,DR  CR: 2822  XER: 
DAR: c18000ba44c0, DSISR: 4000
TASK = c0010e6de040[0] 'swapper' THREAD: c0008e7a CPU: 1
GPR00: 01800040 c0008e7a0240 c0b55008 c0010e6de040
GPR04: c00085d800c0   000f
GPR08: 0180 c0008e7a c0ba4480 c0a32c80
GPR12: 80009032 c0ba4680 c0008e7a08f0 0001
GPR16: f2fa c0010e6de040  7fff
GPR20:  0001 0001 c0f42c80
GPR24: c000850b7638   0001
GPR28: c0f42c80 c0010e6de040 c0ad7698 c0008e7a0240
NIP [c006aa50] .resched_task+0x48/0xd8
LR [c006ac40] .check_preempt_curr_idle+0x2c/0x44
Call Trace:
Instruction dump:
f821ff71 7c3f0b78 ebc2ab28 7c7d1b78 6000 6000 e95e8008 e97e8000
e93d0008 81090010 79084da4 38080040 7d4a002a 7c0b502e 7c74 7800d182
---[ end trace 6d3f1cddaa17382c ]---
Kernel panic - not syncing: Attempted to kill the idle task!
Call Trace:
Rebooting in 180 seconds..



When running with the function plugin I had to stop the trace
immediately before entering start_secondary after an online or my traces
would not include the pseries_mach_cpu_die function, nor the tracing I
added there (possibly buffer size, I am using 2048). The following trace
was collected using trace-cmd record -p function -e irq -e sched and
has been filtered to only show CPU [001] (the CPU undergoing the
offline/online test, and the one seeing preempt_count (pcnt) go to
 after cede. The function tracer does not indicate anything
running on the CPU other than the HCALL - unless the __trace_hcall*
commands might be to blame. Can a POWER guru comment?

  idle-0 [001]   417.625286: function: .cpu_die
  idle-0 [001]   417.625287: function:
.pseries_mach_cpu_die
  idle-0 [001]   417.625290: bprint:   
.pseries_mach_cpu_die : start
  idle-0 [001]   417.625291: function:   
.idle_task_exit
  idle-0 [001]   417.625292: function:  
.switch_slb
  idle-0 [001]   417.625294: function:   
.xics_teardown_cpu
  idle-0 

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-31 Thread Michael Ellerman
On Tue, 2010-08-31 at 00:12 -0700, Darren Hart wrote:
..
 
 When running with the function plugin I had to stop the trace
 immediately before entering start_secondary after an online or my traces
 would not include the pseries_mach_cpu_die function, nor the tracing I
 added there (possibly buffer size, I am using 2048). The following trace
 was collected using trace-cmd record -p function -e irq -e sched and
 has been filtered to only show CPU [001] (the CPU undergoing the
 offline/online test, and the one seeing preempt_count (pcnt) go to
  after cede. The function tracer does not indicate anything
 running on the CPU other than the HCALL - unless the __trace_hcall*
 commands might be to blame. 

It's not impossible. Though normally they're disabled right, so the only
reason they're running is because you're tracing. So if they are causing
the bug then that doesn't explain why you see it normally.

Still, might be worth disabling just the hcall tracepoints just to be
100% sure.

cheers




Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-23 Thread Darren Hart

On 08/19/2010 08:58 AM, Ankita Garg wrote:

Hi Darren,

On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote:


With some instrumentation we were able to determine that the
preempt_count() appears to change across the extended_cede_processor()
call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
PREEMPT_RT we call this with preempt_count=1 and return with
preempt_count=0x. On mainline with CONFIG_PREEMPT=y, the value
is different (0x65) but is still incorrect.


I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could
easily reproduce this on the RT kernel and not the non-RT kernel.
However, I hit it every single time I did a cpu online operation. I also
noticed that the issue persists even when I disable H_CEDE by passing
the cede_offline=0 kernel commandline parameter. Could you pl confirm
if you observe the same in your setup ?


with the following patches:
[r...@igoort1 linux-2.6-combined]# quilt applied
patches/0001-wms-fix01.patch
patches/powerpc-increase-pseries_cpu_die-delay.patch
patches/powerpc-enable-preemption-before-cpu_die.patch
patches/powerpc-silence-__cpu_up-under-normal-operation.patch
patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch
patches/powerpc-wait-for-cpu-to-go-inactive.patch
patches/powerpc-disable-decrementer-on-offline.patch
patches/powerpc-cpu_die-preempt-hack.patch
patches/powerpc-cede-processor-inst.patch
patches/irq-preempt-inst.patch
patches/disable-decrementer-in-cpu_die.patch
patches/powerpc-hard_irq_disable.patch
[r...@igoort1 linux-2.6-combined]# quilt unapplied
patches/powerpc-debug-replace-cede-with-mdelay.patch
patches/powerpc-pad-thread_info.patch

applied to tip/rt/head (2.6.33-rt ish) I will see the following crash 
after 3 or 4 runs:


3Badness at kernel/sched.c:3720
4NIP: c06986f8 LR: c06986dc CTR: c006ec34
4REGS: c0008e7a7e50 TRAP: 0700   Not tainted 
(2.6.33-rt-dvhrt08-02358-g61223dd-dirty)

4MSR: 80021032 ME,CE,IR,DR  CR: 2822  XER: 
4TASK = c0010e7080c0[0] 'swapper' THREAD: c0008e7a8000 CPU: 3
4GPR00:  c0008e7a80d0 c0b54fa0 
0001
4GPR04:  0032  
000f
4GPR08: c0008eb68d00 c0ca5420 0001 
c0bc22e8
4GPR12: 80009032 c0ba4a80 c0008e7a8a70 
0003
4GPR16: f9ba c0010e7080c0  
7fff
4GPR20:  0001 0003 
c1042c80
4GPR24:  c0008eb686a0 0003 
0001
4GPR28:  0001 c0ad7628 
c0008e7a80d0

4NIP [c06986f8] .sub_preempt_count+0x6c/0xdc
4LR [c06986dc] .sub_preempt_count+0x50/0xdc
4Call Trace:
4Instruction dump:
478290464 80090014 7f80e800 40fc002c 4bd08a99 6000 2fa3 419e0068
4e93e87e0 8009 2f80 409e0058 0fe0 4850 2b9d00fe 41fd0038
1Unable to handle kernel paging request for data at address 
0xc18000ba44c0

1Faulting instruction address: 0xc006aae4

This occurs with or without the cede_offline=0 parameter.

Also, a similar experiment seems to corroborate your results, suggesting
the H_CEDE call is not necessarily to blame here.


I had replaced the H_CEDE call with an mdelay(2) and still ran into
issues. I didn't see the preempt count change, but I do see the 
rtmutex.c:684 bug.


cpu 0x0: Vector: 300 (Data Access) at [c0b53ef0]
pc: c006aa54: .resched_task+0x48/0xd8
lr: c006ac44: .check_preempt_curr_idle+0x2c/0x44
sp: c0b54170
   msr: 80001032
   dar: c18000ba44c0
 dsisr: 4000
  current = 0xc0aa1410
  paca= 0xc0ba4480
pid   = 0, comm = swapper
enter ? for help
[c0b54200] c006ac44 .check_preempt_curr_idle+0x2c/0x44
[c0b54290] c007b494 .try_to_wake_up+0x430/0x540
[c0b54360] c007b754 .wake_up_process+0x34/0x48
[c0b543f0] c0089fa8 .wakeup_softirqd+0x78/0x9c
[c0b54480] c008a2c4 .raise_softirq+0x7c/0xb8
[c0b54520] c00977b0 .run_local_timers+0x2c/0x4c
[c0b545a0] c0097828 .update_process_times+0x58/0x9c
[c0b54640] c00beb3c .tick_sched_timer+0xd0/0x120
[c0b546f0] c00b08b8 .__run_hrtimer+0x1a0/0x29c
[c0b547a0] c00b1258 .hrtimer_interrupt+0x21c/0x394
[c0b548d0] c00304c4 .timer_interrupt+0x1ec/0x2f8
[c0b54980] c0003700 decrementer_common+0x100/0x180
--- Exception: 901 (Decrementer) at c00100f8 
.raw_local_irq_restore+0x74/0x8c

[c0b54d00] c0017d14 .cpu_idle+0x12c/0x220
[c0b54da0] c06a1768 .start_secondary+0x3d8/0x418
[c0b54e60] c005c1f0 .pseries_mach_cpu_die+0x244/0x318
[c0b54f10] c001e7e0 .cpu_die+0x4c/0x68
[c0b54f90] 

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-19 Thread Ankita Garg
Hi Darren,

On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote:
 
 With some instrumentation we were able to determine that the
 preempt_count() appears to change across the extended_cede_processor()
 call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
 PREEMPT_RT we call this with preempt_count=1 and return with
 preempt_count=0x. On mainline with CONFIG_PREEMPT=y, the value
 is different (0x65) but is still incorrect.

I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could
easily reproduce this on the RT kernel and not the non-RT kernel.
However, I hit it every single time I did a cpu online operation. I also
noticed that the issue persists even when I disable H_CEDE by passing
the cede_offline=0 kernel commandline parameter. Could you pl confirm
if you observe the same in your setup ? 

However, the issue still remains. Will spend few cycles looking into
this issue.

 
 Also of interest is that this path
 cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
 enters with a preempt_count=1 if it wasn't corrupted across the hcall.
 The early boot path from _start however appears to call
 start_secondary() with a preempt_count of 0.
 
 The following patch is most certainly not correct, but it does eliminate
 the situation on mainline 100% of the time (there is still a 25%
 reproduction rate on PREEMPT_RT). Can someone comment on:
 
 1) How can the preempt_count() get mangled across the H_CEDE hcall?
 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
 
 Hacked-up-by: Darren Hart dvh...@us.ibm.com
 
 Index: linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
 ===
 --- linux-2.6.33.6.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
 +++ linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
 @@ -138,6 +138,7 @@ static void pseries_mach_cpu_die(void)
* Kernel stack will be reset and start_secondary()
* will be called to continue the online operation.
*/
 + preempt_count() = 0;
   start_secondary_resume();
   }
   }
 
 

-- 
Regards,
Ankita Garg (ank...@in.ibm.com)
Linux Technology Center
IBM India Systems  Technology Labs,
Bangalore, India


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-19 Thread Will Schmidt
Ankita Garg ank...@in.ibm.com wrote on 08/19/2010 10:58:24 AM:
 Subject
 Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with
 CONFIG_PREEMPT on pseries

 Hi Darren,

 On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote:
 
  With some instrumentation we were able to determine that the
  preempt_count() appears to change across the extended_cede_processor()
  call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
  PREEMPT_RT we call this with preempt_count=1 and return with
  preempt_count=0x. On mainline with CONFIG_PREEMPT=y, the value
  is different (0x65) but is still incorrect.

 I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could
 easily reproduce this on the RT kernel and not the non-RT kernel.
 However, I hit it every single time I did a cpu online operation. I also
 noticed that the issue persists even when I disable H_CEDE by passing
 the cede_offline=0 kernel commandline parameter. Could you pl confirm
 if you observe the same in your setup ?

If you see it every time, double-check that you have
http://patchwork.ozlabs.org/patch/60922/ or an equivalent in your tree.
(The patch moves the preempt_enable_no_resched() call up above the call to
cpu_die().  An earlier variation called preempt_enable() before calling
start_secondary_resume().)
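
Roughly (a paraphrase of that ordering change, not the verbatim patch), the
idle loop in arch/powerpc/kernel/idle.c:cpu_idle() ends up doing:

        ppc64_runlatch_on();
        tick_nohz_restart_sched_tick();
        preempt_enable_no_resched();    /* moved up, so it runs before cpu_die() */
        if (cpu_should_die())
                cpu_die();
        schedule();
        preempt_disable();

so the idle task no longer carries an unbalanced preempt_disable() through
cede and back into start_secondary().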

Darren and I have been seeing different problems (different
dedicated-processor LPARs on the same physical system).  He's seeing
scrambled preempt_count values; I'm tending to see a system hang/death
[with processor backtraces showing .pseries_mach_cpu_die or
.pseries_dedicated_idle_sleep as expected, but no processes running].

I'll be curious which you end up seeing.  With 2.6.33.7-rt27 as of a few
minutes ago, I'm still seeing a system hang/death.




 However, the issue still remains. Will spend few cycles looking into
 this issue.

 
  Also of interest is that this path
  cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
  enters with a preempt_count=1 if it wasn't corrupted across the hcall.
  The early boot path from _start however appears to call
  start_secondary() with a preempt_count of 0.
 
  The following patch is most certainly not correct, but it does
eliminate
  the situation on mainline 100% of the time (there is still a 25%
  reproduction rate on PREEMPT_RT). Can someone comment on:
 
  1) How can the preempt_count() get mangled across the H_CEDE hcall?
  2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
 
  Hacked-up-by: Darren Hart dvh...@us.ibm.com
 
  Index: linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
  ===
  --- linux-2.6.33.6.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
  +++ linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
  @@ -138,6 +138,7 @@ static void pseries_mach_cpu_die(void)
 * Kernel stack will be reset and start_secondary()
 * will be called to continue the online operation.
 */
  + preempt_count() = 0;
start_secondary_resume();
 }
  }
 
 

 --
 Regards,
 Ankita Garg (ank...@in.ibm.com)
 Linux Technology Center
 IBM India Systems  Technology Labs,
 Bangalore, India

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-10 Thread Darren Hart

On 07/22/2010 11:38 AM, Thomas Gleixner wrote:

On Thu, 22 Jul 2010, Darren Hart wrote:


Also of interest is that this path
cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
enters with a preempt_count=1 if it wasn't corrupted across the hcall.


That triggers the problem as well. preempt_count needs to be 0 when
entering start_secondary(). So I really wonder how that ever worked.


The early boot path from _start however appears to call
start_secondary() with a preempt_count of 0.


Which is correct.


The following patch is most certainly not correct, but it does eliminate


It is correct, but I think it is incomplete as other portions of the
thread_info on the stack might be in some weird state as well.


Just FYI, I took a look at the stack pointers as well as all the fields 
in the thread_info struct. The only one that changes is preempt_count. 
The previous value of preempt_count doesn't impact the value after cede. 
An initial value of 0, 1, or 4 all result in an after-cede value of 
0x. I also added 32 bits of padding on either side of the 
preempt_count in case the change was accidental - it wasn't; the padded
values remained unchanged across the cede call while the preempt_count 
still changed to 0x.
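
For reference, a sketch (reconstructed from the print_thread_info() output
later in this thread, not the actual debug patch) of the padded thread_info
used for that experiment; guard words on both sides of preempt_count would
catch a stray sequential overwrite, yet only preempt_count itself changes:

struct task_struct;
struct exec_domain;

struct thread_info_dbg {
        struct task_struct      *task;
        struct exec_domain      *exec_domain;
        int                     cpu;
        unsigned int            stomp_on_me;      /* 0xDEAD guard, unchanged   */
        int                     preempt_count;    /* the only field that flips */
        unsigned int            stomp_on_me_too;  /* 0xBEEF guard, unchanged   */
        unsigned long           local_flags;
};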



--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-06 Thread Darren Hart
On 08/05/2010 10:09 PM, Vaidyanathan Srinivasan wrote:
 * Darren Hart dvh...@us.ibm.com  [2010-08-05 19:19:00]:
 
 On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:

suggestion I updated the instrumentation to display the
 local_save_flags and irqs_disabled_flags:

 Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c0010e9e3ce0, pcnt: 1
 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c0010e9e3ce0, pcnt: 1

 I'm not sure if I'm reading that right, but I believe interrupts are
 intended to be disabled here. If accomplished via the
 spin_lock_irqsave() this would behave differently on RT. However, this
 path disables the interrupts handled by xics, all but the IPIs anyway.
 On RT I disabled the decrementer as well.

 Is it possible for RT to be receiving other interrupts here?

 Also you may want to call hard_irq_disable() to -really- disable
 interrupts ... since we do lazy-disable on powerpc


 A quick update and request for direction wrt the cede hcall, interrupt 
 handling, and the stack pointer.

 I've added patches one at a time, eliminating bugs with preempt_rt
 (tip/rt/head). With my current patchset I have no crashes with a single run 
 of
random_online.sh (stress testing to hit the hang Will sees is my todo for
 tomorrow).

 Current patchset includes:
 patches/0001-wms-fix01.patch # P7 lazy flushing thing

 # next four are sent to / queued for mainline
 patches/powerpc-increase-pseries_cpu_die-delay.patch
 patches/powerpc-enable-preemption-before-cpu_die.patch
 patches/powerpc-silence-__cpu_up-under-normal-operation.patch
 patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch

 # this one needs to be cleaned up and sent to mainline
 patches/powerpc-wait-for-cpu-to-go-inactive.patch

 patches/powerpc-disable-decrementer-on-offline.patch
 patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after 
 cede
 patches/powerpc-cede-processor-inst.patch # instrumentation
 patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede
 patches/powerpc-pad-thread_info.patch

 I didn't include all the patches as the relevant bits are included below in
 code form.

 With the instrumentation, it's clear the change to preempt_count() is 
 targeted
 (since I added padding before and after preempt_count in the thread_info
 struct) and preempt_count is still changed. It is also clear that it isn't 
 just
 a stray decrement since I set preempt_count() to 4 prior to calling cede and 
 it
 still is 0x after the hcall. Also note that the stack pointer doesn't
 change across the cede call and neither does any other part of the 
 thread_info
 structure.

 Adding hard_irq_disable() did seem to affect things somewhat. Rather than a
 serialized list of before/after thread_info prints around cede, I see
 several befores then several afters. But the preempt_count is still modified.

 The relevant code looks like this (from pseries_mach_cpu_die())

  hard_irq_disable();  /* this doesn't fix things... but
                          does change behavior a bit */
  preempt_count() = 0x4;
  asm("mr %0,1" : "=r" (sp));  /* stack pointer is in R1 */
  printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count());
  print_thread_info();
  extended_cede_processor(cede_latency_hint);
  asm("mr %0,1" : "=r" (sp));
  printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count());
  print_thread_info();
  preempt_count() = 0;


 With these patches applied, the output across cede looks like:

 before cede: sp=c0b57150 pcnt=4
 *** current->thread_info ***
 ti->task: c0aa1410
 ti->exec_domain: c0a59958
 ti->cpu: 0
 ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */
 ti->preempt_count: 4
 ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */
 ti->local_flags: 0
 *** end thread_info ***
 after cede: sp=c0b57150 pcnt=ffffffff
 *** current->thread_info ***
 ti->task: c0aa1410
 ti->exec_domain: c0a59958
 ti->cpu: 0
 
 Is this print sufficient to prove that we did not jump CPU?

Probably not, but the following should be sufficient. Note that the
dump occurs on CPU: 6 which matches the CPU noted in the ti->cpu
before and after the cede call.

*** current->thread_info ***
ti->task: c0008e7b2240
ti->exec_domain: c0a59958
ti->cpu: 6
ti->stomp_on_me: 57005
ti->preempt_count: 4
ti->stomp_on_me_too: 48879
ti->local_flags: 0
*** end thread_info ***
[ cut here ]
Badness at kernel/sched.c:3698
NIP: c06987e4 LR: c06987c8 CTR: c005c668
REGS: c0010e713800 TRAP: 0700   Not tainted  

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-05 Thread Vaidyanathan Srinivasan
* Darren Hart dvh...@us.ibm.com [2010-08-04 21:45:51]:

 On 07/23/2010 12:07 AM, Vaidyanathan Srinivasan wrote:
 * Benjamin Herrenschmidtb...@kernel.crashing.org  [2010-07-23 15:11:00]:
 
 On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
 Yes.  extended_cede_processor() will return with interrupts enabled in
 the cpu. (This is done by the hypervisor).  Under normal cases we
 cannot be interrupted because no IO interrupts are routed to us after
 xics_teardown_cpu() and since the CPU is out of the map, nobody will
 send us IPIs.
 
 What about decrementer ?
 
 Decrementer expiry event handling is bit complex.  The event as such
 may not bring back the extended_cede_processor() cpu, but may be
 marked pending when we get out of this state eventually.  I will find
 more information on this event and update.
 
 Hi Vaidy, have you been able to dig anything up about the
 decrementer expiry?

Hi Darren,

Yes, it is possible that the cpu in extended_cede_processor() can be
woken up by the decrementer.  But the expectation is that we will
return back to this context since we will not have any pending timers.

Our stack should not change even if we get the decrementer interrupts.

--Vaidy



Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-05 Thread Thomas Gleixner
On Thu, 5 Aug 2010, Vaidyanathan Srinivasan wrote:

 * Darren Hart dvh...@us.ibm.com [2010-08-04 21:45:51]:
 
  On 07/23/2010 12:07 AM, Vaidyanathan Srinivasan wrote:
  * Benjamin Herrenschmidtb...@kernel.crashing.org  [2010-07-23 15:11:00]:
  
  On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
  Yes.  extended_cede_processor() will return with interrupts enabled in
  the cpu. (This is done by the hypervisor).  Under normal cases we
  cannot be interrupted because no IO interrupts are routed to us after
  xics_teardown_cpu() and since the CPU is out of the map, nobody will
  send us IPIs.
  
  What about decrementer ?
  
  Decrementer expiry event handling is bit complex.  The event as such
  may not bring back the extended_cede_processor() cpu, but may be
  marked pending when we get out of this state eventually.  I will find
  more information on this event and update.
  
  Hi Vaidy, have you been able to dig anything up about the
  decrementer expiry?
 
 Hi Darren,
 
 Yes, it is possible that the cpu in extended_cede_processor() can be
 woken up by the decrementer.  But the expectation is that we will
 return back to this context since we will not have any pending timers.

The problem is that you might get woken _before_ the timers have been
migrated away. 
 
 Our stack should not change even if we get the decrementer interrupts.

But you should not execute kernel code when you got offlined either.

Thanks,

tglx


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-05 Thread Darren Hart
On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:
 
   suggestion I updated the instrumentation to display the
 local_save_flags and irqs_disabled_flags:
 
 Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current-stack: 
 c0010e9e3ce0, pcnt: 1
 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current-stack: 
 c0010e9e3ce0, pcnt: 1

 I'm not sure if I'm reading that right, but I believe interrupts are
 intended to be disabled here. If accomplished via the
 spin_lock_irqsave() this would behave differently on RT. However, this
 path disables the interrupts handled by xics, all but the IPIs anyway.
 On RT I disabled the decrementer as well.

 Is it possible for RT to be receiving other interrupts here?
 
 Also you may want to call hard_irq_disable() to -really- disable
 interrupts ... since we do lazy-disable on powerpc
 

A quick update and request for direction wrt the cede hcall, interrupt 
handling, and the stack pointer.

I've added patches one at a time, eliminating bugs with preempt_rt
(tip/rt/head). With my current patchset I have no crashes with a single run of
random_online.sh (stress testing to hit the hang Will sees is my todo for
tomorrow).

Current patchset includes:
patches/0001-wms-fix01.patch # P7 lazy flushing thing

# next four are sent to / queued for mainline
patches/powerpc-increase-pseries_cpu_die-delay.patch
patches/powerpc-enable-preemption-before-cpu_die.patch
patches/powerpc-silence-__cpu_up-under-normal-operation.patch
patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch

# this one needs to be cleaned up and sent to mainline
patches/powerpc-wait-for-cpu-to-go-inactive.patch

patches/powerpc-disable-decrementer-on-offline.patch
patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after cede
patches/powerpc-cede-processor-inst.patch # instrumentation
patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede
patches/powerpc-pad-thread_info.patch

I didn't include all the patches as the relevant bits are included below in
code form.

With the instrumentation, it's clear the change to preempt_count() is targeted
(since I added padding before and after preempt_count in the thread_info
struct) and preempt_count is still changed. It is also clear that it isn't just
a stray decrement since I set preempt_count() to 4 prior to calling cede and it
still is 0xffffffff after the hcall. Also note that the stack pointer doesn't
change across the cede call and neither does any other part of the thread_info
structure.

Adding hard_irq_disable() did seem to affect things somewhat. Rather than a
serialized list of before/after thread_info prints around cede, I see
several befores then several afters. But the preempt_count is still modified.

The relevant code looks like this (from pseries_mach_cpu_die())

        hard_irq_disable();  /* this doesn't fix things... but
                                does change behavior a bit */
        preempt_count() = 0x4;
        asm("mr %0,1" : "=r" (sp));  /* stack pointer is in R1 */
        printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count());
        print_thread_info();
        extended_cede_processor(cede_latency_hint);
        asm("mr %0,1" : "=r" (sp));
        printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count());
        print_thread_info();
        preempt_count() = 0;


With these patches applied, the output across cede looks like:

before cede: sp=c0b57150 pcnt=4
*** current->thread_info ***
ti->task: c0aa1410
ti->exec_domain: c0a59958
ti->cpu: 0
ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */
ti->preempt_count: 4
ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */
ti->local_flags: 0
*** end thread_info ***
after cede: sp=c0b57150 pcnt=ffffffff
*** current->thread_info ***
ti->task: c0aa1410
ti->exec_domain: c0a59958
ti->cpu: 0
ti->stomp_on_me: 57005
ti->preempt_count: 
ti->stomp_on_me_too: 48879
ti->local_flags: 0
*** end thread_info ***

Are there any additional thoughts on what might be causing preempt_count to
change? I was thinking that cede might somehow force it to 0 (or perhaps one
of the preempt_count explicit value setters in irq.c) and then decrement it -
but that dec should trigger an error in the dec_preempt_count() as I have
CONFIG_DEBUG_PREEMPT=y.
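
For reference, the underflow check I'd expect to fire looks roughly like this
(paraphrased from kernel/sched.c of this era; the exact source may differ):

void sub_preempt_count(int val)
{
#ifdef CONFIG_DEBUG_PREEMPT
	/* Underflow? A stray decrement past zero should warn here. */
	if (DEBUG_LOCKS_WARN_ON(val > preempt_count()))
		return;
	/* Is the spinlock portion underflowing? */
	if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&
			!(preempt_count() & PREEMPT_MASK)))
		return;
#endif
	preempt_count() -= val;
}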

...

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-05 Thread Vaidyanathan Srinivasan
* Darren Hart dvh...@us.ibm.com [2010-08-05 19:19:00]:

 On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
  On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:
  
suggestion I updated the instrumentation to display the
  local_save_flags and irqs_disabled_flags:
  
  Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
  Jul 22 23:36:58 igoort1 kernel: before H_CEDE current-stack: 
  c0010e9e3ce0, pcnt: 1
  Jul 22 23:36:58 igoort1 kernel: after H_CEDE current-stack: 
  c0010e9e3ce0, pcnt: 1
 
  I'm not sure if I'm reading that right, but I believe interrupts are
  intended to be disabled here. If accomplished via the
  spin_lock_irqsave() this would behave differently on RT. However, this
  path disables the interrupts handled by xics, all but the IPIs anyway.
  On RT I disabled the decrementer as well.
 
  Is it possible for RT to be receiving other interrupts here?
  
  Also you may want to call hard_irq_disable() to -really- disable
  interrupts ... since we do lazy-disable on powerpc
  
 
 A quick update and request for direction wrt the cede hcall, interrupt 
 handling, and the stack pointer.
 
 I've added patches one at a time, eliminating bugs with preempt_rt
 (tip/rt/head). With my current patchset I have no crashes with a single run of
 random_online.sh (stress testing to hit the hang will sees is my todo for
 tomorrow).
 
 Current patchset includes:
 patches/0001-wms-fix01.patch # P7 lazy flushing thing
 
 # next four are sent to / queued for mainline
 patches/powerpc-increase-pseries_cpu_die-delay.patch
 patches/powerpc-enable-preemption-before-cpu_die.patch
 patches/powerpc-silence-__cpu_up-under-normal-operation.patch
 patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch
 
 # this one needs to be cleaned up and sent to mainline
 patches/powerpc-wait-for-cpu-to-go-inactive.patch
 
 patches/powerpc-disable-decrementer-on-offline.patch
 patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after 
 cede
 patches/powerpc-cede-processor-inst.patch # instrumentation
 patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede
 patches/powerpc-pad-thread_info.patch
 
 I didn't include all the patches as the relevant bits are included below in
 code form.
 
 With the instrumentation, it's clear the change to preempt_count() is targeted
 (since I added padding before and after preempt_count in the thread_info
 struct) and preempt_count is still changed. It is also clear that it isn't 
 just
 a stray decrement since I set preempt_count() to 4 prior to calling cede and 
 it
 still is 0x after the hcall. Also note that the stack pointer doesn't
 change across the cede call and neither does any other part of the thread_info
 structure.
 
 Adding hard_irq_disable() did seem to affect things somewhat. Rather than a
 serialized list of before/after thread_info prints around cede, I see
 several befores then several afters. But the preempt_count is still modified.
 
 The relevant code looks like this (from pseries_mach_cpu_die())
 
  hard_irq_disable();  /* this doesn't fix things... but
                          does change behavior a bit */
  preempt_count() = 0x4;
  asm("mr %0,1" : "=r" (sp));  /* stack pointer is in R1 */
  printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count());
  print_thread_info();
  extended_cede_processor(cede_latency_hint);
  asm("mr %0,1" : "=r" (sp));
  printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count());
  print_thread_info();
  preempt_count() = 0;
 
 
 With these patches applied, the output across cede looks like:
 
 before cede: sp=c0b57150 pcnt=4
 *** current->thread_info ***
 ti->task: c0aa1410
 ti->exec_domain: c0a59958
 ti->cpu: 0
 ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */
 ti->preempt_count: 4
 ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */
 ti->local_flags: 0
 *** end thread_info ***
 after cede: sp=c0b57150 pcnt=ffffffff
 *** current->thread_info ***
 ti->task: c0aa1410
 ti->exec_domain: c0a59958
 ti->cpu: 0

Is this print sufficient to prove that we did not jump CPU?

We agree that the decrementer can possibly expire and we could have gone
to the handler and come back, just as we would in an idle loop.
This will not always happen, but I can see a window of time during
which it may.  But that should not change the preempt_count
unconditionally to -1 with the same stack pointer.

 ti->stomp_on_me: 57005
 ti->preempt_count: 
 ti->stomp_on_me_too: 48879
 ti->local_flags: 0

Is there a method to identify whether we executed any of the
interrupt handlers while in this code?
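
One way we could check (purely hypothetical instrumentation, nothing like this
exists in the tree) would be to bump a per-cpu counter in timer_interrupt()
and compare it around the cede call:

/* hypothetical debug patch: count decrementer interrupts taken across cede */
DEFINE_PER_CPU(unsigned long, dec_irqs_seen);

/* in timer_interrupt(), near the top:
 *	__get_cpu_var(dec_irqs_seen)++;
 */

/* around the cede call in pseries_mach_cpu_die(): */
unsigned long before = __get_cpu_var(dec_irqs_seen);
extended_cede_processor(cede_latency_hint);
printk("decrementer irqs across cede: %lu\n",
       __get_cpu_var(dec_irqs_seen) - before);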

 *** end thread_info ***
 
 Are there any additional thoughts on what 

Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-04 Thread Darren Hart

On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:

On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:


1) How can the preempt_count() get mangled across the H_CEDE hcall?
2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?


The preempt count is on the thread info at the bottom of the stack.

Can you check the stack pointers ?


Hi Ben, sorry if I didn't get back to you on this already. I checked the 
stack pointer before and after the cede call and they match.


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-04 Thread Darren Hart

On 07/23/2010 12:07 AM, Vaidyanathan Srinivasan wrote:

* Benjamin Herrenschmidtb...@kernel.crashing.org  [2010-07-23 15:11:00]:


On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:

Yes.  extended_cede_processor() will return with interrupts enabled in
the cpu. (This is done by the hypervisor).  Under normal cases we
cannot be interrupted because no IO interrupts are routed to us after
xics_teardown_cpu() and since the CPU is out of the map, nobody will
send us IPIs.


What about decrementer ?


Decrementer expiry event handling is bit complex.  The event as such
may not bring back the extended_cede_processor() cpu, but may be
marked pending when we get out of this state eventually.  I will find
more information on this event and update.


Hi Vaidy, have you been able to dig anything up about the decrementer 
expiry?


Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team





Though H_CEDE will return with interrupts enabled, it is unlikely that
an interrupt can be delivered in this context.


Well, if interrupts are soft-disabled, even if one occurs, we will just
mask and return, so that at least should be ok.


Yes.  We will immediately return to the extended_cede_processor() in
the while loop until the preferred_offline_state is changed.

--Vaidy





Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-23 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt b...@kernel.crashing.org [2010-07-23 15:11:00]:

 On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
  Yes.  extended_cede_processor() will return with interrupts enabled in
  the cpu. (This is done by the hypervisor).  Under normal cases we
  cannot be interrupted because no IO interrupts are routed to us after
  xics_teardown_cpu() and since the CPU is out of the map, nobody will
  send us IPIs.
 
 What about decrementer ?

Decrementer expiry event handling is a bit complex.  The event as such
may not bring back the extended_cede_processor() cpu, but may be
marked pending when we get out of this state eventually.  I will find
more information on this event and update.

  Though H_CEDE will return with interrupts enabled, it is unlikely that
  an interrupt can be delivered in this context. 
 
 Well, if interrupts are soft-disabled, even if one occurs, we will just
 mask and return, so that at least should be ok.

Yes.  We will immediately return to the extended_cede_processor() in
the while loop until the preferred_offline_state is changed.

--Vaidy


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-23 Thread Will Schmidt
dvh...@linux.vnet.ibm.com wrote on 07/22/2010 06:57:18 PM:
 Subject

 Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with
 CONFIG_PREEMPT on pseries

 On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
  On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
 
  1) How can the preempt_count() get mangled across the H_CEDE hcall?
  2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
 
  The preempt count is on the thread info at the bottom of the stack.
 
  Can you check the stack pointers ?

 Hi Ben, thanks for looking.

 I instrumented the area around extended_cede_processor() as follows
 (please confirm I'm getting the stack pointer correctly).

 while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
         asm("mr %0,1" : "=r" (sp));
         printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp,
                preempt_count());
         extended_cede_processor(cede_latency_hint);
         asm("mr %0,1" : "=r" (sp));
         printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp,
                preempt_count());
 }


 On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
 Jul 22 18:37:08 igoort1 kernel: before H_CEDE current-stack:
 c0010e9e3ce0, pcnt: 1
 Jul 22 18:37:08 igoort1 kernel: after H_CEDE current-stack:
 c0010e9e3ce0, pcnt: 1

 This surprised me as preempt_count is 1 before and after, so no
 corruption appears to occur on mainline. This makes the pcnt of 65 I see
 without the preempt_count()=0 hack very strange. I ran several hundred
 off/on cycles. The issue of preempt_count being 1 is still addressed by
 this patch however.

 On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time,
promise):
 Jul 22 18:51:11 igoort1 kernel: before H_CEDE current-stack:
 c00089bcfcf0, pcnt: 1
 Jul 22 18:51:11 igoort1 kernel: after H_CEDE current-stack:
 c00089bcfcf0, pcnt: 

I'm not seeing the preempt_count value corrupted with my current set of
debug; however, I have added buffers to the thread_info struct, so I
wonder if I've moved the preempt_count variable out of the way of
the corruption.  (Still investigating that point.)

Why?  Because I had been trying to set a DABR on the preempt_count
value to catch the corrupter, and due to hits on the nearby _flags fields
I was getting false positives.


struct thread_info {
	struct task_struct	*task;		/* main task structure */
	struct exec_domain	*exec_domain;	/* execution domain */
	int			cpu;		/* cpu we're on */
	int			pad_buffer[64];
	int			preempt_count;	/* 0 => preemptable,
						   <0 => BUG */
	int			pad_buffer2[256];
	struct restart_block	restart_block;
	unsigned long		local_flags;	/* private flags for thread */

	/* low level flags - has atomic operations done on it */
	unsigned long		flags ____cacheline_aligned_in_smp;
};
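
On the false positives: as I understand it the DABR matches a whole 8-byte
doubleword, so any field that shares the watched doubleword with preempt_count
will also trip the breakpoint. A quick userspace illustration with a
simplified stand-in layout (hypothetical, just to show the offset math):

#include <stdio.h>
#include <stddef.h>

struct ti_layout {			/* simplified stand-in for thread_info */
	void		*task;
	void		*exec_domain;
	int		cpu;
	int		preempt_count;	/* shares a doubleword with 'cpu' here */
	unsigned long	local_flags;
};

int main(void)
{
	/* fields that report the same doubleword index can all trigger the DABR */
	printf("cpu:           offset %2zu, doubleword %zu\n",
	       offsetof(struct ti_layout, cpu),
	       offsetof(struct ti_layout, cpu) / 8);
	printf("preempt_count: offset %2zu, doubleword %zu\n",
	       offsetof(struct ti_layout, preempt_count),
	       offsetof(struct ti_layout, preempt_count) / 8);
	return 0;
}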



 In both cases the stack pointer appears unchanged.

 Note: there is a BUG triggered in between these statements as the
 preempt_count causes the printk to trigger:
 Badness at kernel/sched.c:5572

 Thanks,

 --
 Darren Hart
 IBM Linux Technology Center
 Real-Time Linux Team



[PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Darren Hart
While testing CPU offline/online, we hit various preempt_count related
bugs. Various hacks have been employed for several theoretical corner
cases. One situation however is perfectly repeatable on 2.6.33.6 with
CONFIG_PREEMPT=y.

BUG: scheduling while atomic: swapper/0/0x0065
Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log
dm_mod ehea sg ext4 jbd2 mbcache sd_mod crc_t10dif ibmvscsic
scsi_transport_srp scsi_tgt [last unloaded: scsi_wait_scan]
Call Trace:
[c0010e9e39f0] [c00144d4] .show_stack+0x74/0x1c0 (unreliable)
[c0010e9e3aa0] [c007a680] .__schedule_bug+0xa0/0xb0
[c0010e9e3b30] [c056dea4] .schedule+0x7a4/0xd60
[c0010e9e3cd0] [c0016be8] .cpu_idle+0x1f8/0x220
[c0010e9e3d80] [c057d858] .start_secondary+0x388/0x3c0
[c0010e9e3e30] [c0008278] .start_secondary_resume+0x10/0x14

With some instrumentation we were able to determine that the
preempt_count() appears to change across the extended_cede_processor()
call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
PREEMPT_RT we call this with preempt_count=1 and return with
preempt_count=0xffffffff. On mainline with CONFIG_PREEMPT=y, the value
is different (0x65) but is still incorrect.

Also of interest is that this path
cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
enters with a preempt_count=1 if it wasn't corrupted across the hcall.
The early boot path from _start however appears to call
start_secondary() with a preempt_count of 0.

The following patch is most certainly not correct, but it does eliminate
the situation on mainline 100% of the time (there is still a 25%
reproduction rate on PREEMPT_RT). Can someone comment on:

1) How can the preempt_count() get mangled across the H_CEDE hcall?
2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?

Hacked-up-by: Darren Hart dvh...@us.ibm.com

Index: linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
===================================================================
--- linux-2.6.33.6.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -138,6 +138,7 @@ static void pseries_mach_cpu_die(void)
 		 * Kernel stack will be reset and start_secondary()
 		 * will be called to continue the online operation.
 		 */
+		preempt_count() = 0;
 		start_secondary_resume();
 	}
 }


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Darren Hart

On 07/22/2010 11:24 AM, Darren Hart wrote:


The following patch is most certainly not correct, but it does eliminate
the situation on mainline 100% of the time (there is still a 25%
reproduction rate on PREEMPT_RT). Can someone comment on:


Apologies. This particular issue is also 100% eliminated on PREEMPT_RT. 
We hit another issue, possibly unrelated to this, 25% of the time. Please 
disregard the comment regarding the 25% failure rate on PREEMPT_RT.


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Thomas Gleixner
On Thu, 22 Jul 2010, Darren Hart wrote:
 
 Also of interest is that this path
 cpu_idle()-cpu_die()-pseries_mach_cpu_die() to start_secondary()
 enters with a preempt_count=1 if it wasn't corrupted across the hcall.

That triggers the problem as well. preempt_count needs to be 0 when
entering start_secondary(). So I really wonder how that ever worked.

 The early boot path from _start however appears to call
 start_secondary() with a preempt_count of 0.

Which is correct.
 
 The following patch is most certainly not correct, but it does eliminate

It is correct, but I think it is incomplete as other portions of the
thread_info on the stack might be in some weird state as well.

 the situation on mainline 100% of the time (there is still a 25%
 reproduction rate on PREEMPT_RT).

But those are different issues, for which we have reasonable
explanations and patches/workarounds.

 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?

No
 
Thanks,

tglx


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Benjamin Herrenschmidt
On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
 
 1) How can the preempt_count() get mangled across the H_CEDE hcall?
 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?

The preempt count is on the thread info at the bottom of the stack.

Can you check the stack pointers ?
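
For reference, on powerpc the thread_info is located by masking the stack
pointer, roughly like this (paraphrased from the thread_info header of this
era), so a bogus r1 would mean reading a completely different thread_info:

static inline struct thread_info *current_thread_info(void)
{
	register unsigned long sp asm("r1");

	/* thread_info sits at the base of the THREAD_SIZE-aligned stack */
	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}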

Cheers,
Ben.




Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Darren Hart
On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:

 1) How can the preempt_count() get mangled across the H_CEDE hcall?
 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
 
 The preempt count is on the thread info at the bottom of the stack.
 
 Can you check the stack pointers ?

Hi Ben, thanks for looking.

I instrumented the area around extended_cede_processor() as follows
(please confirm I'm getting the stack pointer correctly).

while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
        asm("mr %0,1" : "=r" (sp));
        printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp,
               preempt_count());
        extended_cede_processor(cede_latency_hint);
        asm("mr %0,1" : "=r" (sp));
        printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp,
               preempt_count());
}


On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
Jul 22 18:37:08 igoort1 kernel: before H_CEDE current->stack: c0010e9e3ce0, 
pcnt: 1
Jul 22 18:37:08 igoort1 kernel: after H_CEDE current->stack:  c0010e9e3ce0, 
pcnt: 1

This surprised me as preempt_count is 1 before and after, so no
corruption appears to occur on mainline. This makes the pcnt of 65 I see
without the preempt_count()=0 hack very strange. I ran several hundred
off/on cycles. The issue of preempt_count being 1 is still addressed by
this patch however.

On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise):
Jul 22 18:51:11 igoort1 kernel: before H_CEDE current->stack: c00089bcfcf0, 
pcnt: 1
Jul 22 18:51:11 igoort1 kernel: after H_CEDE current->stack:  c00089bcfcf0, 
pcnt: ffffffff

In both cases the stack pointer appears unchanged.

Note: there is a BUG triggered in between these statements as the
preempt_count causes the printk to trigger:
Badness at kernel/sched.c:5572

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Darren Hart
On 07/22/2010 04:57 PM, Darren Hart wrote:
 On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:

 1) How can the preempt_count() get mangled across the H_CEDE hcall?
 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?

 The preempt count is on the thread info at the bottom of the stack.

 Can you check the stack pointers ?
 
 Hi Ben, thanks for looking.
 
 I instrumented the area around extended_cede_processor() as follows
 (please confirm I'm getting the stack pointer correctly).
 
 while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
         asm("mr %0,1" : "=r" (sp));
         printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp,
                preempt_count());
         extended_cede_processor(cede_latency_hint);
         asm("mr %0,1" : "=r" (sp));
         printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp,
                preempt_count());
 }
 
 
 On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
 Jul 22 18:37:08 igoort1 kernel: before H_CEDE current-stack: 
 c0010e9e3ce0, pcnt: 1
 Jul 22 18:37:08 igoort1 kernel: after H_CEDE current-stack:  
 c0010e9e3ce0, pcnt: 1
 
 This surprised me as preempt_count is 1 before and after, so no
 corruption appears to occur on mainline. This makes the pcnt of 65 I see
 without the preempt_count()=0 hack very strange. I ran several hundred
 off/on cycles. The issue of preempt_count being 1 is still addressed by
 this patch however.
 
 On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise):
 Jul 22 18:51:11 igoort1 kernel: before H_CEDE current-stack: 
 c00089bcfcf0, pcnt: 1
 Jul 22 18:51:11 igoort1 kernel: after H_CEDE current-stack:  
 c00089bcfcf0, pcnt: 
 
 In both cases the stack pointer appears unchanged.
 
 Note: there is a BUG triggered in between these statements as the
 preempt_count causes the printk to trigger:
 Badness at kernel/sched.c:5572

At Steven's suggestion I updated the instrumentation to display the
local_save_flags and irqs_disabled_flags:

while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
        local_save_flags(flags);
        printk("local flags: %lx, irqs disabled: %d\n", flags,
               irqs_disabled_flags(flags));
        asm("mr %0,1" : "=r" (sp));
        printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp,
               preempt_count());
        extended_cede_processor(cede_latency_hint);
        asm("mr %0,1" : "=r" (sp));
        printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp,
               preempt_count());
}


Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c0010e9e3ce0, 
pcnt: 1
Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c0010e9e3ce0, 
pcnt: 1

I'm not sure if I'm reading that right, but I believe interrupts are
intended to be disabled here. If accomplished via the
spin_lock_irqsave() this would behave differently on RT. However, this
path disables the interrupts handled by xics, all but the IPIs anyway.
On RT I disabled the decrementer as well.

Is it possible for RT to be receiving other interrupts here?

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Vaidyanathan Srinivasan
* Darren Hart dvh...@us.ibm.com [2010-07-22 21:44:04]:

 On 07/22/2010 04:57 PM, Darren Hart wrote:
  On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
  On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
 
  1) How can the preempt_count() get mangled across the H_CEDE hcall?
  2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
 
  The preempt count is on the thread info at the bottom of the stack.
 
  Can you check the stack pointers ?
  
  Hi Ben, thanks for looking.
  
  I instrumented the area around extended_cede_processor() as follows
  (please confirm I'm getting the stack pointer correctly).
  
  while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
          asm("mr %0,1" : "=r" (sp));
          printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp,
                 preempt_count());
          extended_cede_processor(cede_latency_hint);
          asm("mr %0,1" : "=r" (sp));
          printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp,
                 preempt_count());
  }
  
  
  On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
  Jul 22 18:37:08 igoort1 kernel: before H_CEDE current-stack: 
  c0010e9e3ce0, pcnt: 1
  Jul 22 18:37:08 igoort1 kernel: after H_CEDE current-stack:  
  c0010e9e3ce0, pcnt: 1
  
  This surprised me as preempt_count is 1 before and after, so no
  corruption appears to occur on mainline. This makes the pcnt of 65 I see
  without the preempt_count()=0 hack very strange. I ran several hundred
  off/on cycles. The issue of preempt_count being 1 is still addressed by
  this patch however.
  
  On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise):
  Jul 22 18:51:11 igoort1 kernel: before H_CEDE current-stack: 
  c00089bcfcf0, pcnt: 1
  Jul 22 18:51:11 igoort1 kernel: after H_CEDE current-stack:  
  c00089bcfcf0, pcnt: 
  
  In both cases the stack pointer appears unchanged.
  
  Note: there is a BUG triggered in between these statements as the
  preempt_count causes the printk to trigger:
  Badness at kernel/sched.c:5572
 
 At Steven's suggestion I updated the instrumentation to display the
 local_save_flags and irqs_disabled_flags:
 
 while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
         local_save_flags(flags);
         printk("local flags: %lx, irqs disabled: %d\n", flags,
                irqs_disabled_flags(flags));
         asm("mr %0,1" : "=r" (sp));
         printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp,
                preempt_count());
         extended_cede_processor(cede_latency_hint);
         asm("mr %0,1" : "=r" (sp));
         printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp,
                preempt_count());
 }
 
 
 Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current-stack: 
 c0010e9e3ce0, pcnt: 1
 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current-stack: 
 c0010e9e3ce0, pcnt: 1
 
 I'm not sure if I'm reading that right, but I believe interrupts are
 intended to be disabled here. If accomplished via the
 spin_lock_irqsave() this would behave differently on RT. However, this
 path disables the interrupts handled by xics, all but the IPIs anyway.
 On RT I disabled the decrementer as well.
 
 Is it possible for RT to be receiving other interrupts here?

Yes.  extended_cede_processor() will return with interrupts enabled in
the cpu. (This is done by the hypervisor).  Under normal cases we
cannot be interrupted because no IO interrupts are routed to us after
xics_teardown_cpu() and since the CPU is out of the map, nobody will
send us IPIs.

Though H_CEDE will return with interrupts enabled, it is unlikely that
an interrupt can be delivered in this context.
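
For context, the offline path under discussion looks roughly like this
(paraphrased outline of the CPU_STATE_INACTIVE path in pseries_mach_cpu_die();
not verbatim source):

static void pseries_mach_cpu_die(void)
{
	unsigned int cpu = smp_processor_id();
	u8 cede_latency_hint = 0;

	local_irq_disable();
	idle_task_exit();
	xics_teardown_cpu();		/* no more IO interrupts routed here */

	if (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
		set_cpu_current_state(cpu, CPU_STATE_INACTIVE);
		cede_latency_hint = 2;

		/* H_CEDE returns with interrupts enabled; stay in the loop
		 * until the cpu is asked to come back online */
		while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE)
			extended_cede_processor(cede_latency_hint);

		/* kernel stack is reset, start_secondary() continues the
		 * online operation */
		start_secondary_resume();
	}

	/* ... the CPU_STATE_OFFLINE path stops the cpu via RTAS instead ... */
}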

--Vaidy


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Benjamin Herrenschmidt
On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:

  suggestion I updated the instrumentation to display the
 local_save_flags and irqs_disabled_flags:

 Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current-stack: 
 c0010e9e3ce0, pcnt: 1
 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current-stack: 
 c0010e9e3ce0, pcnt: 1
 
 I'm not sure if I'm reading that right, but I believe interrupts are
 intended to be disabled here. If accomplished via the
 spin_lock_irqsave() this would behave differently on RT. However, this
 path disables the interrupts handled by xics, all but the IPIs anyway.
 On RT I disabled the decrementer as well.
 
 Is it possible for RT to be receiving other interrupts here?

Also you may want to call hard_irq_disable() to -really- disable
interrupts ... since we do lazy-disable on powerpc
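
Roughly, the difference is this (paraphrased sketch of the 2.6.33-era ppc64
lazy-disable scheme; field names may not be exact): local_irq_disable() only
clears the paca soft-enable flag, MSR[EE] stays set, and any hardware
interrupt that does arrive is masked and replayed later, whereas:

static inline void hard_irq_disable(void)
{
	__hard_irq_disable();		/* really clear MSR[EE] via mtmsrd */
	get_paca()->soft_enabled = 0;
	get_paca()->hard_enabled = 0;
}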

Cheers,
Ben.




Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Benjamin Herrenschmidt
On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
 Yes.  extended_cede_processor() will return with interrupts enabled in
 the cpu. (This is done by the hypervisor).  Under normal cases we
 cannot be interrupted because no IO interrupts are routed to us after
 xics_teardown_cpu() and since the CPU is out of the map, nobody will
 send us IPIs.

What about decrementer ?

 Though H_CEDE will return with interrupts enabled, it is unlikely that
 an interrupt can be delivered in this context. 

Well, if interrupts are soft-disabled, even if one occurs, we will just
mask and return, so that at least should be ok.

Cheers,
Ben.

