Michael Ellerman <michael <at> ellerman.id.au> writes:

>
> On Tue, 2010-04-20 at 17:17 -0500, Brian King wrote:
> > In stress testing enabling and disabling of SMT, we are regularly
> > seeing the badness warning below. Looking through the cpu offline
> > path, this is what I see:
> >
> > 1. stop_cpu: IRQ's get disabled
> > 2. pseries_cpu_disable: set cpu offline (no barriers after this)
> > 3. xics_migrate_irqs_away: Remove ourselves from the GIQ, but still allow
> >    IPIs
> > 4. stop_cpu: IRQ's get enabled again (local_irq_enable)
> >
> > It looks to me like there is plenty of opportunity between 1 and 2 for
> > an IPI to get queued, resulting in the badness below. Is there something
> > in xics_migrate_irqs_away that should clear any pending IPIs?

Is that not what this does?

	/* Reject any interrupt that was queued to us... */
	xics_set_cpu_priority(0);

	/* Remove ourselves from the global interrupt queue */
	xics_set_cpu_giq(default_distrib_server, 0);

I thought the above would clear any pending (queued) interrupts and
disable additional interrupts from coming in. Of course the next line
allows IPIs again:

	/* Allow IPIs again... */
	xics_set_cpu_priority(DEFAULT_PRIORITY);

Which I confess I really don't get...

> > If there
> > is, maybe the solution is as simple as adding a barrier after marking
> > the cpu offline. Or is the warning bogus and we should just remove it?
>
> It looks like xics_migrate_irqs_away() doesn't do anything about IPIs,
> at least the comment says "Allow IPIs again". So I don't see what's to
> stop you just taking another IPI after you reenable interrupts in
> stop_cpu(). Maybe xics_ipi_dispatch() should just return if the cpu is
> offline?

We're seeing something possibly related in real-time. Notice how the
decrementer handler interrupts stop_cpu(). Is the decrementer interrupt
delivered as an IPI?

cpu 0x3: Vector: 700 (Program Check) at [c000000084d02d90]
    pc: c000000000068af4: .__might_sleep+0x11c/0x148
    lr: c000000000068af0: .__might_sleep+0x118/0x148
    sp: c000000084d03010
   msr: 8000000000021032
  current = 0xc000000086658240
  paca    = 0xc000000000bb8a80
    pid   = 4045, comm = kstop/3
kernel BUG at kernel/sched.c:10168!
enter ? for help
[c000000084d030b0] c0000000006a2798 .rt_spin_lock+0x4c/0x9c
[c000000084d03140] c0000000000e3c98 .cpuset_cpus_allowed_locked+0x38/0x74
[c000000084d031e0] c000000000070be0 .select_fallback_rq+0x10c/0x1a4
[c000000084d032a0] c00000000007cda8 .try_to_wake_up+0x1b0/0x540
[c000000084d03370] c00000000007d2e8 .wake_up_process+0x34/0x48
[c000000084d03400] c00000000008c5f8 .wakeup_softirqd+0x78/0x9c
[c000000084d03490] c00000000008c8e4 .raise_softirq+0x6c/0xa4
[c000000084d03520] c000000000099c18 .run_local_timers+0x2c/0x4c
[c000000084d035a0] c000000000099c90 .update_process_times+0x58/0x9c
[c000000084d03640] c0000000000c2e70 .tick_sched_timer+0xd0/0x120
[c000000084d036f0] c0000000000b4bec .__run_hrtimer+0x1a0/0x29c
[c000000084d037a0] c0000000000b558c .hrtimer_interrupt+0x21c/0x394
[c000000084d038d0] c0000000000307d8 .timer_interrupt+0x1dc/0x2e4
[c000000084d03970] c000000000003700 decrementer_common+0x100/0x180
--- Exception: 901 (Decrementer) at c00000000000d144 .raw_local_irq_restore+0x48/0x54
[link register   ] c0000000000e57ec .stop_cpu+0x1c0/0x1ec
[c000000084d03c60] c00000000104a4f0 (unreliable)
[c000000084d03ca0] c0000000000e5780 .stop_cpu+0x154/0x1ec
[c000000084d03d40] c0000000000a8b84 .worker_thread+0x25c/0x338
[c000000084d03e60] c0000000000af8c8 .kthread+0xb8/0xc4
[c000000084d03f90] c000000000034408 .kernel_thread+0x54/0x70

Thanks,

Darren Hart
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
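
As a rough illustration of the suggestion quoted in this thread (have
xics_ipi_dispatch() return early when the cpu is offline), here is a
user-space toy model in C. It is only a sketch under assumptions:
struct toy_cpu, toy_cpu_init(), toy_send_ipi(), toy_set_offline() and
toy_ipi_dispatch() are hypothetical names invented for this example, not
kernel code. It also leaves a queued message pending instead of deciding
whether it is safe to drop, since nothing in this thread settles that.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4
#define TOY_NR_MSGS 4

/* Toy per-cpu state (hypothetical; stands in for the online mask and
 * the per-cpu IPI message word). */
struct toy_cpu {
	bool online;            /* cleared when the cpu is marked offline */
	unsigned long ipi_msgs; /* bit N set => IPI message N is pending */
	int handled;            /* total messages processed so far */
};

static struct toy_cpu cpus[TOY_NR_CPUS];

/* Bring a toy cpu up with nothing pending. */
static void toy_cpu_init(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	cpus[cpu].online = true;
	cpus[cpu].ipi_msgs = 0;
	cpus[cpu].handled = 0;
}

/* A sender queues message 'msg' for 'cpu'. */
static void toy_send_ipi(int cpu, int msg)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	assert(msg >= 0 && msg < TOY_NR_MSGS);
	cpus[cpu].ipi_msgs |= 1UL << msg;
}

/* Corresponds to step 2 of the offline path: mark the cpu offline. */
static void toy_set_offline(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	cpus[cpu].online = false;
}

/* Dispatch pending messages; returns how many this call handled. */
static int toy_ipi_dispatch(int cpu)
{
	struct toy_cpu *c;
	int msg, n = 0;

	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	c = &cpus[cpu];

	/* The early return under discussion: do no work (and so never
	 * reach anything that might sleep) once the cpu is offline. */
	if (!c->online)
		return 0;

	for (msg = 0; msg < TOY_NR_MSGS; msg++) {
		if (c->ipi_msgs & (1UL << msg)) {
			c->ipi_msgs &= ~(1UL << msg);
			n++;
		}
	}
	c->handled += n;
	return n;
}
```

In this model, an IPI queued just before toy_set_offline() stays in
ipi_msgs and toy_ipi_dispatch() handles nothing for that cpu afterwards.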