On 08/05/2010 10:09 PM, Vaidyanathan Srinivasan wrote:
> * Darren Hart <dvh...@us.ibm.com> [2010-08-05 19:19:00]:
>
>> On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
>>> On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:
>>>
>>>> suggestion I updated the instrumentation to display the
>>>> local_save_flags and irqs_disabled_flags:
>>>>
>>>> Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
>>>> Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>>>> Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>>>>
>>>> I'm not sure if I'm reading that right, but I believe interrupts are
>>>> intended to be disabled here. If that is accomplished via
>>>> spin_lock_irqsave(), this would behave differently on RT. However, this
>>>> path disables the interrupts handled by xics - all but the IPIs, anyway.
>>>> On RT I disabled the decrementer as well.
>>>>
>>>> Is it possible for RT to be receiving other interrupts here?
>>>
>>> Also you may want to call hard_irq_disable() to -really- disable
>>> interrupts ... since we do lazy-disable on powerpc
>>
>> A quick update and request for direction wrt the cede hcall, interrupt
>> handling, and the stack pointer.
>>
>> I've added patches one at a time, eliminating bugs with preempt_rt
>> (tip/rt/head). With my current patchset I have no crashes with a single
>> run of random_online.sh (stress testing to hit the hang Will sees is my
>> todo for tomorrow).
>>
>> Current patchset includes:
>> patches/0001-wms-fix01.patch                   # P7 lazy flushing thing
>>
>> # next four are sent to / queued for mainline
>> patches/powerpc-increase-pseries_cpu_die-delay.patch
>> patches/powerpc-enable-preemption-before-cpu_die.patch
>> patches/powerpc-silence-__cpu_up-under-normal-operation.patch
>> patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch
>>
>> # this one needs to be cleaned up and sent to mainline
>> patches/powerpc-wait-for-cpu-to-go-inactive.patch
>>
>> patches/powerpc-disable-decrementer-on-offline.patch
>> patches/powerpc-cpu_die-preempt-hack.patch     # reset preempt_count to 0 after cede
>> patches/powerpc-cede-processor-inst.patch      # instrumentation
>> patches/powerpc-hard_irq_disable.patch         # hard_irq_disable before cede
>> patches/powerpc-pad-thread_info.patch
>>
>> I didn't include all the patches, as the relevant bits are included below
>> in code form.
>>
>> With the instrumentation, it's clear the change to preempt_count() is
>> targeted (since I added padding before and after preempt_count in the
>> thread_info struct, and only preempt_count is changed). It is also clear
>> that it isn't just a stray decrement, since I set preempt_count() to 4
>> prior to calling cede and it is still 0xffffffff after the hcall. Also
>> note that the stack pointer doesn't change across the cede call, and
>> neither does any other part of the thread_info structure.
>>
>> Adding hard_irq_disable() did seem to affect things somewhat. Rather than
>> a serialized list of before/after thread_info prints around cede, I see
>> several befores, then several afters. But the preempt_count is still
>> modified.
>>
>> The relevant code looks like this (from pseries_mach_cpu_die()):
>>
>>         hard_irq_disable();     /* doesn't fix things, but does
>>                                    change behavior a bit */
>>         preempt_count() = 0x4;
>>         asm("mr %0,1" : "=r" (sp));     /* stack pointer is in R1 */
>>         printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count());
>>         print_thread_info();
>>         extended_cede_processor(cede_latency_hint);
>>         asm("mr %0,1" : "=r" (sp));
>>         printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count());
>>         print_thread_info();
>>         preempt_count() = 0;
>>
>> With these patches applied, the output across cede looks like:
>>
>> before cede: sp=c000000000b57150 pcnt=4
>> *** current->thread_info ***
>> ti->task:            c000000000aa1410
>> ti->exec_domain:     c000000000a59958
>> ti->cpu:             0
>> ti->stomp_on_me:     57005   /* 0xDEAD - forgot to print in hex */
>> ti->preempt_count:   4
>> ti->stomp_on_me_too: 48879   /* 0xBEEF - forgot to print in hex */
>> ti->local_flags:     0
>> *** end thread_info ***
>> after cede: sp=c000000000b57150 pcnt=ffffffff
>> *** current->thread_info ***
>> ti->task:            c000000000aa1410
>> ti->exec_domain:     c000000000a59958
>> ti->cpu:             0
>
> Is this print sufficient to prove that we did not jump CPU?
Probably not, but the following should be sufficient. Note that the dump
occurs on "CPU: 6", which matches the CPU noted in ti->cpu before and after
the cede call.

*** current->thread_info ***
ti->task:            c00000008e7b2240
ti->exec_domain:     c000000000a59958
ti->cpu:             6
ti->stomp_on_me:     57005
ti->preempt_count:   4
ti->stomp_on_me_too: 48879
ti->local_flags:     0
*** end thread_info ***
------------[ cut here ]------------
Badness at kernel/sched.c:3698
NIP: c0000000006987e4 LR: c0000000006987c8 CTR: c00000000005c668
REGS: c00000010e713800 TRAP: 0700   Not tainted  (2.6.33-rt-dvhrt05-02358-g61223dd-dirty)
MSR: 8000000000021032 <ME,CE,IR,DR>  CR: 24000082  XER: 20000000
TASK = c00000008e7b2240[0] 'swapper' THREAD: c00000010e710000 CPU: 6
GPR00: 0000000000000000 c00000010e713a80 c000000000b54f88 0000000000000001
GPR04: c00000010e713d08 ffffffffffffffff 00000000000000e0 800000000d049138
GPR08: 0000000000000000 c000000000ca5420 0000000000000001 c000000000bc22e8
GPR12: 0000000000000002 c000000000ba5080 0000000000000000 000000000f394b6c
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: c000000000ba50c0 0000000000000004 0000000000000002 0000000000000000
GPR24: 0000000000000004 c00000010e713cd0 c0000000009f1ecc c00000000088aaff
GPR28: 0000000000000000 0000000000000001 c000000000ad7610 c00000010e713a80
NIP [c0000000006987e4] .add_preempt_count+0x6c/0xe0
LR [c0000000006987c8] .add_preempt_count+0x50/0xe0
Call Trace:
[c00000010e713a80] [c00000010e713b30] 0xc00000010e713b30 (unreliable)
[c00000010e713b10] [c0000000000824b4] .vprintk+0xac/0x480
[c00000010e713c40] [c00000000069bcbc] .printk+0x48/0x5c
[c00000010e713cd0] [c00000000005c190] .pseries_mach_cpu_die+0x1c4/0x39c
[c00000010e713db0] [c00000000001e7e0] .cpu_die+0x4c/0x68
[c00000010e713e30] [c000000000017de0] .cpu_idle+0x1f8/0x220
[c00000010e713ed0] [c0000000006a17c4] .start_secondary+0x394/0x3d4
[c00000010e713f90] [c000000000008264] .start_secondary_prolog+0x10/0x14
Instruction dump:
78290464 80090018
2f800000 40fc002c 4bd089c1 60000000 2fa30000 419e006c e93e87e0 80090000
2f800000 409e005c <0fe00000> 48000054 e97e8398 78290464

after cede: sp=c00000010e713cd0 pcnt=ffffffff
*** current->thread_info ***
ti->task:            c00000008e7b2240
ti->exec_domain:     c000000000a59958
ti->cpu:             6
ti->stomp_on_me:     57005
ti->preempt_count:   fffffffe
ti->stomp_on_me_too: 48879
ti->local_flags:     0
*** end thread_info ***

> We agree that the decrementer can possibly expire and we could have gone
> to the handler and come back, just like we would do in an idle loop.

I disable the decrementer in the stop_cpu path... that may not be
sufficient for the pseries_mach_cpu_die() path, I'll have to experiment.
However, even if the decrementer did fire, it shouldn't be just changing
the value, especially not to something illegal.

> This will not happen always, but I could see a window of time during
> which this may happen. But that should not change the preempt_count
> unconditionally to -1 with the same stack pointer.
>
>> ti->stomp_on_me:     57005
>> ti->preempt_count:   ffffffff
>> ti->stomp_on_me_too: 48879
>> ti->local_flags:     0
>
> Is there a method to identify whether we had executed any of the
> interrupt handlers while in this code?

I'd like to know as well. If anyone knows, please share. Otherwise I'll be
trying to sort that out tomorrow.

>
>> *** end thread_info ***
>>
>> Are there any additional thoughts on what might be causing preempt_count
>> to change? I was thinking that cede might somehow force it to 0 (or
>> perhaps one of the explicit preempt_count setters in irq.c) and then
>> decrement it - but that dec should trigger an error in
>> dec_preempt_count(), as I have CONFIG_DEBUG_PREEMPT=y.
>
> The decrementer is only a suspect; we have not yet proved that the dec
> handler is being executed. Can somebody be using our stack? The
> pointer is the same... but do we still own it? I cannot think of how
> somebody else could be using this cpu's stack... but just a thought to
> explain this anomaly.
>
> --Vaidy

Thanks for the thoughts,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev