Re: Hard hang in hypervisor!?

2007-10-11 Thread Milton Miller
On Thu Oct 11 10:04:40 EST 2007, Paul Mackerras wrote:
> Linas Vepstas writes:
> > Err ..  it was cpu 0 that was spinlocked.  Are interrupts not
> > distributed?
>
> We have some bogosities in the xics code that I noticed a couple of
> days ago.  Basically we only set the xics to distribute interrupts to
> all cpus if (a) the affinity mask is equal to CPU_MASK_ALL (which has
> ones in every bit position from 0 to NR_CPUS-1) and (b) all present
> cpus are online (cpu_online_map == cpu_present_map).  Otherwise we
> direct interrupts to the first cpu in the affinity map.  So you can
> easily have the affinity mask containing all the online cpus and still
> not get distributed interrupts.


The second condition was just added to try to fix an issue where a
vendor wants to always run the kdump kernel with maxcpus=1 on all
architectures, and the emulated xics on js20 was not working.
For a true xics, this should work because we (1) remove all but one
cpu from the global server list and (2) raise the priority of that
cpu to disabled, so the hardware will deliver to another cpu in the
partition.
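
For reference, the offline path does roughly the following.  This is a
simplified sketch of xics_migrate_irqs_away() from the 2.6.23-era
arch/powerpc/platforms/pseries/xics.c -- the helper names mirror the
real driver, but the body is trimmed for illustration, so treat it as
a sketch rather than the verbatim source:

	/* Sketch: how a cpu going offline detaches from the xics. */
	void xics_migrate_irqs_away_sketch(void)
	{
		unsigned int cpu = smp_processor_id();

		/* (2) Set this cpu's priority to "disabled": CPPR 0 is
		 * the most favored priority, so no external interrupt
		 * can beat it, and anything already queued to us is
		 * rejected back to firmware for redelivery. */
		xics_set_cpu_priority(cpu, 0);

		/* (1) Remove this cpu from the global interrupt queue,
		 * so distributed interrupts no longer consider it. */
		rtas_set_indicator_fast(GLOBAL_INTERRUPT_QUEUE,
			(1UL << interrupt_server_size) - 1
				- default_distrib_server, 0);

		/* The real code then walks every irq bound to this cpu
		 * and rebinds it to another server. */
	}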

http://ozlabs.org/pipermail/linuxppc-dev/2006-December/028941.html
http://ozlabs.org/pipermail/linuxppc-dev/2007-January/029607.html
http://ozlabs.org/pipermail/linuxppc-dev/2007-March/032621.html

However, my experience the other day on a js21 was that firmware
delivered either to all cpus (if we bound to the global server) or
to the first online cpu in the partition, regardless of which cpu
we bound the interrupt to, so I don't know that the change will fix
the original problem.

It does mean that taking a cpu offline, without dlpar removing it from
the kernel, makes it impossible to actually distribute interrupts to
all cpus.

I'd be happy to say we should remove the extra check and work with
firmware to properly distribute the interrupts.

milton


Re: Hard hang in hypervisor!?

2007-10-10 Thread Paul Mackerras
Linas Vepstas writes:

> Err ..  it was cpu 0 that was spinlocked.  Are interrupts not
> distributed?

We have some bogosities in the xics code that I noticed a couple of
days ago.  Basically we only set the xics to distribute interrupts to
all cpus if (a) the affinity mask is equal to CPU_MASK_ALL (which has
ones in every bit position from 0 to NR_CPUS-1) and (b) all present
cpus are online (cpu_online_map == cpu_present_map).  Otherwise we
direct interrupts to the first cpu in the affinity map.  So you can
easily have the affinity mask containing all the online cpus and still
not get distributed interrupts.
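
In code terms, the selection goes roughly like this -- a simplified
sketch of get_irq_server() from the xics.c of that era, not the
verbatim source:

	static int get_irq_server_sketch(unsigned int virq)
	{
		cpumask_t cpumask = irq_desc[virq].affinity;
		cpumask_t tmp;

		/* Distribute only if the affinity mask is the full
		 * CPU_MASK_ALL *and* every present cpu is online. */
		if (cpus_equal(cpumask, CPU_MASK_ALL) &&
		    cpus_equal(cpu_online_map, cpu_present_map))
			return default_distrib_server;

		/* Otherwise pin the interrupt to the first online cpu
		 * in the affinity mask. */
		cpus_and(tmp, cpu_online_map, cpumask);
		return get_hard_smp_processor_id(first_cpu(tmp));
	}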

So in your case it's quite possible that all interrupts were directed
to cpu 0.

Paul.


Re: Hard hang in hypervisor!?

2007-10-09 Thread Nathan Lynch
Linas Vepstas wrote:
 
> I was futzing with linux-2.6.23-rc8-mm1 in a power6 lpar when,
> for whatever reason, a spinlock locked up. The bizarre thing
> was that the rest of the system locked up as well: an ssh terminal,
> and also an hvc console.
>
> Breaking into the debugger showed 4 cpus, 1 of which was
> deadlocked in the spinlock, and the other 3 in
> .pseries_dedicated_idle_sleep
>
> This was, ahhh, unexpected.  What's up with that? Can
> anyone provide any insight?

Sounds consistent with a task trying to double-acquire the lock, or an
interrupt handler attempting to acquire a lock that the current task
holds.  Or maybe even an uninitialized spinlock.  Do you know which
lock it was?
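
For the interrupt-handler case, the classic pattern is a lock taken
with plain spin_lock() in task context and again from an interrupt
handler on the same cpu.  A minimal illustrative sketch -- the names
are hypothetical, not from the report:

	#include <linux/spinlock.h>
	#include <linux/interrupt.h>

	static DEFINE_SPINLOCK(demo_lock);

	static irqreturn_t demo_irq_handler(int irq, void *dev)
	{
		/* Spins forever if the task this irq interrupted on
		 * this cpu already holds demo_lock. */
		spin_lock(&demo_lock);
		/* ... */
		spin_unlock(&demo_lock);
		return IRQ_HANDLED;
	}

	static void demo_task_context(void)
	{
		/* BUG: a lock shared with an interrupt handler must be
		 * taken with spin_lock_irqsave(); plain spin_lock()
		 * leaves interrupts enabled, so the handler above can
		 * fire here and deadlock this cpu. */
		spin_lock(&demo_lock);
		/* ... */
		spin_unlock(&demo_lock);
	}

Note the hang is per-cpu: the handler and the lock holder wedge only
the cpu they share, which is relevant to the interrupt-distribution
question elsewhere in this thread.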


Re: Hard hang in hypervisor!?

2007-10-09 Thread Linas Vepstas
On Tue, Oct 09, 2007 at 04:18:19PM -0500, Nathan Lynch wrote:
> Linas Vepstas wrote:
> >
> > I was futzing with linux-2.6.23-rc8-mm1 in a power6 lpar when,
> > for whatever reason, a spinlock locked up. The bizarre thing
> > was that the rest of the system locked up as well: an ssh terminal,
> > and also an hvc console.
> >
> > Breaking into the debugger showed 4 cpus, 1 of which was
> > deadlocked in the spinlock, and the other 3 in
> > .pseries_dedicated_idle_sleep
> >
> > This was, ahhh, unexpected.  What's up with that? Can
> > anyone provide any insight?
>
> Sounds consistent with a task trying to double-acquire the lock, or an
> interrupt handler attempting to acquire a lock that the current task
> holds.  Or maybe even an uninitialized spinlock.  Do you know which
> lock it was?

Not sure .. trying to find out now. But why would that kill the
ssh session, and the console? Sure, so maybe one cpu is spinning,
but the other three can still take interrupts, right?  The ssh session
should have been generating ethernet card interrupts, and the console
should have been generating hvc interrupts.  

Err ..  it was cpu 0 that was spinlocked.  Are interrupts not
distributed?

Perhaps I should IRC this ... 

--linas


never mind .. [was Re: Hard hang in hypervisor!?]

2007-10-09 Thread Linas Vepstas
On Tue, Oct 09, 2007 at 04:28:10PM -0500, Linas Vepstas wrote:
 
> Perhaps I should IRC this ...

yeah. I guess I'd forgotten how funky things can get. So never mind ... 

--linas
