Oliver Yang wrote:
Eric Saxe wrote:
> ::cpuinfo -v
 ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
  0 fec254f8  1b    5   10 105   no    no t-5676 a13f7de0 sched
                  |    |    |
       RUNNING <--+    |    +--> PIL THREAD
         READY         |          10 a13fade0
        EXISTS         |           6 a13f7de0
        ENABLE         |           - a11ccde0 (idle)
                       |
                       +-->  PRI THREAD   PROC
                             109 a13fade0 sched
                              60 a1b7ede0 sched
                              60 a1587de0 sched
                              60 a27a1de0 sched
                              59 a727f000 snoop
The priority 109 thread is the clock. It passivated because it
blocked waiting for a lock that someone else held. At some point,
the lock was dropped, which made the clock thread runnable again,
which is why it's now in the run queue.
You mean a13fade0 was unpinned because it blocked on a mutex, and it
was waiting on a turnstile before?
Later, it was placed on the run queue because the mutex was
released by another thread?
The clock interrupt was at some point pinning a13f7de0 (the priority
105 thread). When it passivated because the adaptive mutex it wanted
was already held, it became a regular thread (a13fade0), scheduled
just like everyone else. It waited in the sleep queue until the lock
was dropped, at which point clock became runnable and was then
enqueued on CPU 0's run queue.
If that's true, why wasn't it put on the processor immediately? It
has the highest priority in the run queue.
You can check t_flag for the priority 105 thread, but I think it's
because of this code in kpreempt():
http://src.opensolaris.org/source/xref/loficc/crypto/usr/src/uts/i86pc/os/trap.c#1440
So kpreempt() doesn't do much if it's an interrupt thread, likely
because it's assumed interrupts are short. :)
The state of thread a13fade0 is run, and I think that when the system
panicked, the deadman should have been running on CPU0.
How can I find the deadman's stack? I know it should be a high-level
interrupt, but I can't get any information from cpu_t cpu_intr_stack:
a13fade0 is priority 109, which is PIL 10. This is clock. The deadman
cyclic fires at PIL 14, and it will pin whatever thread
is running on the CPU. Because you panicked on CPU 0, this is where
the deadman cyclic must have fired.
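The priority-to-PIL relationship used above can be sketched as follows.
This is only an illustration: the base priority of 99 is an assumption
inferred from the two interrupt threads in the ::cpuinfo output (priority
109 at PIL 10, priority 105 at PIL 6), the function name is hypothetical,
and the upper bound of 10 follows the thread's description of PIL > 10 as
high-level interrupts, which pin the running thread instead of getting a
thread priority of their own.

```python
# Assumption: interrupt thread priority = base + PIL, with base 99.
# This fits both data points in the ::cpuinfo output
# (109 -> PIL 10, 105 -> PIL 6); it is not read from kernel source.
ASSUMED_INTR_BASE_PRI = 99

# Assumption: only PILs 1..10 run as interrupt threads; PIL > 10 is
# treated as high-level, as described in this thread.
ASSUMED_MAX_THREAD_PIL = 10


def pil_from_intr_pri(pri, base=ASSUMED_INTR_BASE_PRI):
    """Return the PIL implied by an interrupt thread's priority."""
    pil = pri - base
    if not 1 <= pil <= ASSUMED_MAX_THREAD_PIL:
        raise ValueError(f"priority {pri} is not an interrupt thread priority")
    return pil
```

With these assumptions, pil_from_intr_pri(109) gives 10 (clock) and
pil_from_intr_pri(105) gives 6 (the pinned interrupt thread), while a
priority of 60 is rejected as an ordinary thread.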
Thanks for your explanation. Yes, the deadman panicked on CPU0; I know
that from the output of ::msgbuf.
But as far as I know, the stack for high-level interrupts (PIL > 10)
should be pointed to by cpu_intr_stack.
I tried to get the deadman's stack through cpu_intr_stack. It should
work, so why did it fail?
It looks like cpu_intr_stack is used to preserve the interrupt stack
when we panic, but I would have to dig further to know more.
> fec254f8::print cpu_t cpu_intr_stack
cpu_intr_stack = 0xa13f5000
> 0xa13f5000,20/nap
0xa13f5000:
mdb: failed to read data from target: no mapping for address
0xa13f5000:
mdb: failed to read data from target: no mapping for address
We can see there are 5112 pending counts for clock, which is why the
deadman was triggered.
We can also see lots of pending counts for
apic_redistribute_compute. Does that mean there is an APIC issue?
Not necessarily. The apic_redistribute_compute cyclic fires at the
low level, which is PIL 1, I believe. From the
output above, we see that the bge interrupt is at PIL 6, which would
block out things firing at PIL 1.
To me, this looks like the bge interrupt is either getting stuck, or
is spending *way* too much time in its ISR.
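The masking argument can be sketched as a small model: while a handler
runs at some PIL, anything pending at that PIL or lower has to wait. The
function name and the dictionary of pending handlers are hypothetical, and
the PIL values are the ones quoted in this thread (apic_redistribute_compute
at 1, clock at 10, deadman at 14), not values read from the dump.

```python
def blocked_by(running_pil, pending):
    """Return the pending handlers that cannot fire while a handler
    at running_pil is on the CPU (those at the same or a lower PIL)."""
    return [name for name, pil in pending.items() if pil <= running_pil]


# PILs as quoted in this thread; treat them as illustrative.
pending = {
    "clock": 10,
    "apic_redistribute_compute": 1,
    "deadman": 14,
}

# The bge interrupt runs at PIL 6.
print(blocked_by(6, pending))
```

In this model only apic_redistribute_compute is held off by a PIL 6
handler, which matches the point that its pending count grows because of
the bge interrupt and not because of an APIC problem. The deadman at PIL 14
can still fire.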
Now I see. :-)
How did you know the PIL for apic_redistribute_compute and deadman?
It might be a stupid question, since you know the code.
But with mdb, I can't get them from the ::interrupts dcmd. For a
kernel newbie, it would be better if we could get them from mdb.
I looked in cbe.c at cbe_set_level(). The low, lock, and high levels
correspond to CBE_LOW_PIL, CBE_LOCK_PIL, and CBE_HIGH_PIL, which are
defined in the architecture-specific clock.h. Actually, it looks like
CBE_LOW_PIL is 1 on SPARC and 2 on x86.
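As a rough lookup table, the level-to-PIL mapping might be written like
this. Only the low-level values (1 on SPARC, 2 on x86) and the PILs 10 and
14 are quoted in this thread; pairing 10 with the lock level and 14 with
the high level is an assumption based on clock and deadman firing at those
PILs, and the table and function names are hypothetical. The authoritative
values are the CBE_*_PIL definitions in clock.h.

```python
# Assumed PIL for each cyclic level, per architecture.
# Stand-ins for CBE_LOW_PIL, CBE_LOCK_PIL and CBE_HIGH_PIL.
ASSUMED_CBE_PILS = {
    "sparc": {"low": 1, "lock": 10, "high": 14},
    "x86": {"low": 2, "lock": 10, "high": 14},
}


def cyclic_pil(level, arch="x86"):
    """Return the assumed PIL at which a cyclic of the given level fires."""
    return ASSUMED_CBE_PILS[arch][level]
```

For example, cyclic_pil("low", "sparc") returns 1 and cyclic_pil("high")
returns 14 under these assumptions.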
> ::interrupts
IRQ  Vector IPL Bus  Type  CPU Share APIC/INT# ISR(s)
9    0x81   9   PCI  Fixed 1   1     0x0/0x9   acpi_wrapper_isr
21   0x61   6   PCI  Fixed 1   1     0x0/0x15  nge_chip_intr
22   0x20   1   PCI  Fixed 3   1     0x0/0x16  ohci_intr
47   0x62   6   PCI  Fixed 1   1     0x3/0x17  nge_chip_intr
56   0x60   6   PCI  Fixed 0   1     0x2/0x0   e1000g_intr
63   0x40   5   PCI  MSI   2   1     -         mpt_intr
160  0xa0   0        IPI   ALL 0     -         poke_cpu
192  0xc0   13       IPI   ALL 1     -         xc_serv
208  0xd0   14       IPI   ALL 1     -         kcpc_hw_overflow_intr
209  0xd1   14       IPI   ALL 1     -         cbe_fire
210  0xd3   14       IPI   ALL 1     -         cbe_fire
224  0xe0   15       IPI   ALL 1     -         xc_serv
225  0xe1   15       IPI   ALL 1     -         apic_error_intr
Anyway, I really appreciate your reply, and I did learn something
important from your help. :-)
Sure. Thanks for the great question. By the way, I believe what you are
seeing here is:
6498937 system hang while doing MAX and snoop through bge
-Eric