Re: [osol-code] About debug hang issues with deadman

Eric Saxe Thu, 07 Dec 2006 11:14:28 -0800

Oliver Yang wrote:

Eric Saxe wrote:
> ::cpuinfo -v
ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
 0 fec254f8  1b    5   10 105   no    no t-5676 a13f7de0 sched
              |    |    |
   RUNNING <--+    |    +--> PIL THREAD
     READY         |          10 a13fade0
    EXISTS         |           6 a13f7de0
    ENABLE         |           - a11ccde0 (idle)
                   |
                   +-->  PRI THREAD   PROC
                         109 a13fade0 sched
                          60 a1b7ede0 sched
                          60 a1587de0 sched
                          60 a27a1de0 sched
                          59 a727f000 snoop
The priority 109 thread is the clock. It passivated because it blocked waiting for a lock that someone else held. At some point, the lock was dropped, which made the clock thread runnable again....which is why it's now in the run queue.
You mean that a13fade0 was unpinned because it blocked on a mutex, and it was in a turnstile before? Later, it was placed into the run queue because the mutex was released by another thread?

The clock interrupt was at some point pinning a13f7de0 (the priority 105 thread). When it passivated because the adaptive mutex it wanted was already held, it became a regular thread (a13fade0), scheduled just like everyone else. It waited in the sleep queue until the lock was dropped, at which point clock became runnable, and was then enqueued in CPU 0's run queue.

If that's true, why was it not put onproc immediately? After all, it has the highest pri in the run queue.

You can check t_flag for the priority 105 thread, but I think it's because of this code in kpreempt():

http://src.opensolaris.org/source/xref/loficc/crypto/usr/src/uts/i86pc/os/trap.c#1440

So kpreempt() doesn't do much if it's an interrupt thread....likely because it's assumed interrupts are short. :)

The status of the a13fade0 thread is run, and I think that when the system panicked, the deadman should have been running on CPU 0. How can I find the stack of the deadman? I know it should be a high-level interrupt, but I can't get any information from cpu_t cpu_intr_stack:
a13fade0 is priority 109, which is PIL 10. This is clock. The deadman cyclic fires at PIL 14, and it will pin whatever thread is running on the CPU. Because you panicked on CPU 0, this is where the deadman cyclic must have fired.
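
The relationship between the PRI and PIL columns in the ::cpuinfo output can be sketched in C as follows. This is a minimal illustration, not kernel code: the base value 99 is inferred only from the two pairs visible in this dump (priority 109 at PIL 10, priority 105 at PIL 6), and the names OBSERVED_INTR_PRI_BASE, MAX_THREAD_PIL and pil_from_priority are hypothetical.

```c
/* Assumption: inferred from this dump only (109 <-> PIL 10, 105 <-> PIL 6). */
#define	OBSERVED_INTR_PRI_BASE	99

/* This thread treats PIL > 10 as high level, so 10 is the top thread PIL. */
#define	MAX_THREAD_PIL		10

/*
 * Return the PIL implied by an interrupt thread's priority, or -1 if the
 * priority falls outside the interrupt-thread range seen in this dump.
 */
int
pil_from_priority(int pri)
{
	int pil = pri - OBSERVED_INTR_PRI_BASE;

	if (pil < 1 || pil > MAX_THREAD_PIL)
		return (-1);
	return (pil);
}
```

Under that assumption, pil_from_priority(109) gives 10 (clock) and pil_from_priority(105) gives 6 (the pinned interrupt thread), while the priority 60 and 59 threads in the run queue are ordinary threads and yield -1.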
Thanks for your explanation. Yes, the deadman panicked on CPU 0; I know that from the output of ::msgbuf. But, as far as I know, the stack of high-level interrupts (PIL>10) should be pointed to by cpu_intr_stack. I tried to get the stack of the deadman by using cpu_intr_stack. It should work, so why did it fail?

It looks like cpu_intr_stack is used to preserve the interrupt stack when we panic, but I would have to dig further to know more....

> fec254f8::print cpu_t cpu_intr_stack
cpu_intr_stack = 0xa13f5000
> 0xa13f5000,20/nap
0xa13f5000:
mdb: failed to read data from target: no mapping for address
0xa13f5000:
mdb: failed to read data from target: no mapping for address
We can see that there are 5112 pending counts of clock, which is why the deadman was triggered. We can also find lots of pending counts of apic_redistribute_compute. Does that mean there is an APIC issue?
Not necessarily. The apic_redistribute_compute cyclic fires at the low level, which is PIL 1, I believe. From the output above, we see that the bge interrupt is at PIL 6, which would block out things firing at PIL 1.
To me, this looks like the bge interrupt is either getting stuck, or is spending *way* too much time in its ISR.
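
The masking argument above can be written as a one-line predicate. This is a simplified model, assuming an interrupt is delivered only when its PIL is strictly higher than the level the CPU is currently running at. The helper can_fire and the enum names are hypothetical, and the PIL values are the ones quoted in this thread (PIL 1 for apic_redistribute_compute is the "I believe" figure above, not a verified value).

```c
#include <stdbool.h>

/* PIL values as quoted in this thread, not read from kernel headers. */
enum {
	PIL_APIC_REDISTRIBUTE = 1,	/* low-level cyclic ("I believe") */
	PIL_BGE = 6,			/* bge interrupt thread */
	PIL_CLOCK = 10,			/* clock */
	PIL_DEADMAN = 14		/* deadman cyclic */
};

/*
 * Simplified model: a handler can interrupt the CPU only if its PIL is
 * strictly above the PIL the CPU is currently running at.
 */
bool
can_fire(int handler_pil, int cpu_pil)
{
	return (handler_pil > cpu_pil);
}
```

With the CPU held at PIL 6 by the bge ISR, can_fire(PIL_APIC_REDISTRIBUTE, PIL_BGE) is false, so the low-level cyclic only accumulates pending counts, while clock and the deadman can still get in, which is consistent with clock having pinned the priority 105 thread and the deadman having fired.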
Now I see. :-)

How did you know the PIL for apic_redistribute_compute and deadman?
It might be a stupid question, because you know the code.
But with mdb, I can't get them from the ::interrupts dcmd. For a kernel newbie, it would be better if we could get them from mdb.

I looked in cbe.c at cbe_set_level(). The low, lock, and high levels correspond to CBE_LOW_PIL, CBE_LOCK_PIL, and CBE_HIGH_PIL, which are defined in the architecture specific clock.h. Actually, it looks like CBE_LOW_PIL is 1 on SPARC and 2 on x86.
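
As a rough summary of that lookup, the three cyclic levels can be tabulated against the PILs mentioned in this thread. The sketch below is illustrative only: the enum and function names are hypothetical rather than the kernel's, the lock-level value of 10 is taken from clock running at PIL 10, the high-level value of 14 from the deadman firing at PIL 14, and the low-level values are the ones reported here, which have not been re-checked against clock.h.

```c
/* Hypothetical names; these are not the kernel's definitions. */
enum cyc_level { LEVEL_LOW, LEVEL_LOCK, LEVEL_HIGH };
enum platform { PLAT_SPARC, PLAT_X86 };

/*
 * Map a cyclic level to the PIL it fires at, using only the values
 * quoted in this thread. Returns -1 for an unknown level.
 */
int
level_to_pil(enum cyc_level level, enum platform plat)
{
	switch (level) {
	case LEVEL_LOW:
		/* Reported in this thread: 1 on SPARC, 2 on x86. */
		return (plat == PLAT_X86 ? 2 : 1);
	case LEVEL_LOCK:
		/* clock runs at PIL 10. */
		return (10);
	case LEVEL_HIGH:
		/* The deadman cyclic fires at PIL 14. */
		return (14);
	}
	return (-1);
}
```

On the x86 system in this dump, level_to_pil(LEVEL_LOW, PLAT_X86) gives 2, which is still below the bge interrupt at PIL 6, so the conclusion about low-level cyclics being blocked is unchanged.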

> ::interrupts
IRQ  Vector IPL Bus   Type  CPU Share APIC/INT# ISR(s)
9    0x81   9   PCI   Fixed 1   1     0x0/0x9   acpi_wrapper_isr
21   0x61   6   PCI   Fixed 1   1     0x0/0x15  nge_chip_intr
22   0x20   1   PCI   Fixed 3   1     0x0/0x16  ohci_intr
47   0x62   6   PCI   Fixed 1   1     0x3/0x17  nge_chip_intr
56   0x60   6   PCI   Fixed 0   1     0x2/0x0   e1000g_intr
63   0x40   5   PCI   MSI   2   1     -         mpt_intr
160  0xa0   0         IPI   ALL 0     -         poke_cpu
192  0xc0   13        IPI   ALL 1     -         xc_serv
208  0xd0   14        IPI   ALL 1     -         kcpc_hw_overflow_intr
209  0xd1   14        IPI   ALL 1     -         cbe_fire
210  0xd3   14        IPI   ALL 1     -         cbe_fire
224  0xe0   15        IPI   ALL 1     -         xc_serv
225  0xe1   15        IPI   ALL 1     -         apic_error_intr

Anyway, I really appreciate your reply, and I did learn something important from your help. :-)

Sure. Thanks for the great question. By the way, I believe what you are seeing here is:

    6498937 system hang while doing MAX and snoop through bge

-Eric
_______________________________________________
opensolaris-code mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code
