Re: Hard hang in hypervisor!?
On Thu Oct 11 10:04:40 EST 2007, Paul Mackerras wrote:
> Linas Vepstas writes:
> > Err .. it was cpu 0 that was spinlocked. Are interrupts not
> > distributed?
>
> We have some bogosities in the xics code that I noticed a couple of
> days ago. Basically we only set the xics to distribute interrupts to
> all cpus if (a) the affinity mask is equal to CPU_MASK_ALL (which has
> ones in every bit position from 0 to NR_CPUS-1) and (b) all present
> cpus are online (cpu_online_map == cpu_present_map). Otherwise we
> direct interrupts to the first cpu in the affinity map. So you can
> easily have the affinity mask containing all the online cpus and
> still not get distributed interrupts.

The second condition was just added to try to fix some issues where a
vendor wants to always run the kdump kernel with maxcpus=1 on all
architectures, and the emulated xics on js20 was not working. For a
true xics, this should work because we (1) remove all but 1 cpu from
the global server list and (2) raise the priority of the cpu to
disabled, so the hardware will deliver to another cpu in the partition.

http://ozlabs.org/pipermail/linuxppc-dev/2006-December/028941.html
http://ozlabs.org/pipermail/linuxppc-dev/2007-January/029607.html
http://ozlabs.org/pipermail/linuxppc-dev/2007-March/032621.html

However, my experience the other day on a js21 was that firmware
delivered either to all cpus (if we bound to the global server) or to
the first online cpu in the partition, regardless of which cpu we
bound the interrupt to, so I don't know that the change will fix the
original problem. It does mean that taking a cpu offline, but not
dlpar-removing it from the kernel, will make it impossible to actually
distribute interrupts to all cpus. I'd be happy to just remove the
extra check and work with firmware to properly distribute the
interrupts.

milton

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Hard hang in hypervisor!?
Linas Vepstas writes:
> Err .. it was cpu 0 that was spinlocked. Are interrupts not
> distributed?

We have some bogosities in the xics code that I noticed a couple of
days ago. Basically we only set the xics to distribute interrupts to
all cpus if (a) the affinity mask is equal to CPU_MASK_ALL (which has
ones in every bit position from 0 to NR_CPUS-1) and (b) all present
cpus are online (cpu_online_map == cpu_present_map). Otherwise we
direct interrupts to the first cpu in the affinity map. So you can
easily have the affinity mask containing all the online cpus and
still not get distributed interrupts.

So in your case it's quite possible that all interrupts were directed
to cpu 0.

Paul.
Re: Hard hang in hypervisor!?
Linas Vepstas wrote:
> I was futzing with linux-2.6.23-rc8-mm1 in a power6 lpar when, for
> whatever reason, a spinlock locked up. The bizarre thing was that the
> rest of the system locked up as well: an ssh terminal, and also an
> hvc console.
>
> Breaking into the debugger showed 4 cpus, 1 of which was deadlocked
> in the spinlock, and the other 3 in .pseries_dedicated_idle_sleep
>
> This was, ahhh, unexpected. What's up with that? Can anyone provide
> any insight?

Sounds consistent with a task trying to double-acquire the lock, or an
interrupt handler attempting to acquire a lock that the current task
holds. Or maybe even an uninitialized spinlock. Do you know which lock
it was?
Re: Hard hang in hypervisor!?
On Tue, Oct 09, 2007 at 04:18:19PM -0500, Nathan Lynch wrote:
> Linas Vepstas wrote:
> > I was futzing with linux-2.6.23-rc8-mm1 in a power6 lpar when, for
> > whatever reason, a spinlock locked up. The bizarre thing was that
> > the rest of the system locked up as well: an ssh terminal, and also
> > an hvc console.
> >
> > Breaking into the debugger showed 4 cpus, 1 of which was deadlocked
> > in the spinlock, and the other 3 in .pseries_dedicated_idle_sleep
> >
> > This was, ahhh, unexpected. What's up with that? Can anyone provide
> > any insight?
>
> Sounds consistent with a task trying to double-acquire the lock, or
> an interrupt handler attempting to acquire a lock that the current
> task holds. Or maybe even an uninitialized spinlock. Do you know
> which lock it was?

Not sure .. trying to find out now. But why would that kill the ssh
session, and the console? Sure, maybe one cpu is spinning, but the
other three can still take interrupts, right? The ssh session should
have been generating ethernet card interrupts, and the console should
have been generating hvc interrupts.

Err .. it was cpu 0 that was spinlocked. Are interrupts not
distributed?

Perhaps I should IRC this ...

--linas
never mind .. [was Re: Hard hang in hypervisor!?]
On Tue, Oct 09, 2007 at 04:28:10PM -0500, Linas Vepstas wrote:
> Perhaps I should IRC this ...

Yeah. I guess I'd forgotten how funky things can get. So never mind ...

--linas