Hi Oliver,
Responses inline below:
Oliver Yang wrote:
Hi All,
Recently I ran into a hang issue, so I enabled the deadman timer by
setting "set snooping=1" in the /etc/system file. Eventually we got a
crash dump file.
Can anybody give me a hint on how to debug a hang issue with the
deadman enabled?
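For reference, the /etc/system fragment looks like this (note there is no space around '='; the interval tunable mentioned in the comment is from my memory of the deadman code, so verify it against your kernel version before relying on it):

```
* /etc/system -- enable the deadman timer
set snooping=1
* The timeout defaults to 50 seconds; if memory serves there is a
* snoop_interval tunable (in microseconds) controlling it, but treat
* that name as an assumption and check your kernel source.
```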
Here are the steps I tried on one of the crash dump files:
> ::msgbuf
panic[cpu0]/thread=a13f7de0:
deadman: timed out after 50 seconds of clock inactivity
a13f4f24 genunix:deadman+159 (0)
a13f4f5c genunix:cyclic_expire+280 (a14fe000, 3, aa2183)
a13f4fb8 genunix:cyclic_fire+17d (fec21d30)
a13f4fd4 unix:cbe_fire+4a (0, 0)
a13f75b8 unix:_interrupt+2a1 (1b0, abb10000, ab36)
a13f7638 unix:bcopy+39 (abb11780)
a13f765c genunix:copymsg+2e (aa228340)
a13f7688 genunix:copymsgchain+23 (aa228340)
a13f76c0 dls:i_dls_link_rx_func+9d (a3041d78, 0, a13f77)
a13f7704 dls:i_dls_link_rx_common_promisc+3b (a3041d78, 0, a13f77)
a13f77a8 dls:i_dls_link_rx_common+11c (a3041d78, 0, aa2283)
a13f77c4 dls:i_dls_link_txloop+18 (a3041d78, aa228340)
a13f77f0 mac:mac_txloop+92 (a3042ee8, aa228340)
a13f780c dls:dls_tx+16 (a3026f80, abb08e80)
a13f7828 dld:dld_tx_single+1e (a2bfdea8, abb08e80)
a13f7840 dld:str_mdata_fastpath_put+60 (a2bfdea8, abb08e80)
a13f78b0 ip:tcp_send_data+85c (aafda500, aafdfc68,)
a13f791c ip:tcp_send+6f1 (aafdfc68, aafda500,)
a13f79bc ip:tcp_wput_data+663 (aafda500, 0, 0)
a13f7aa0 ip:tcp_rput_data+29b2 (aafda100, a70a13c0,)
a13f7af8 ip:squeue_drain+1c3 (a1df0d00, 4, 4bb30e)
a13f7b48 ip:squeue_enter_chain+5be (a1df0d00, abb16f40,)
a13f7bdc ip:ip_input+731 (a2581754, a2afc020,)
a13f7c78 dls:i_dls_link_rx_common+26e (a3041d78, a2afc020,)
a13f7c90 dls:i_dls_link_rx_promisc+19 (a3041d78, a2afc020,)
a13f7ccc mac:mac_rx+53 (a3042ee8, a2afc020,)
a13f7d60 bge:bge_receive+598 (a2c78000, a2f26800)
a13f7dac bge:bge_intr+30f (a2c78000, 0)
syncing file systems...
done
dumping to /dev/dsk/c2d0s1, offset 1719074816, content: kernel
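For context, the check behind that panic can be sketched roughly like this (a simplified illustration from my understanding, not the actual genunix:deadman source): the deadman cyclic fires once per second at high PIL and panics if the clock's tick counter (lbolt) has not advanced for 50 consecutive firings, i.e. 50 seconds of clock inactivity.

```c
#include <assert.h>

/* Simplified per-CPU deadman state (illustrative only). */
typedef struct {
    long last_lbolt;   /* lbolt value seen at the previous firing */
    int  stalled;      /* consecutive firings with no lbolt progress */
} deadman_state_t;

#define DEADMAN_SECONDS 50

/*
 * Called once per second at high PIL.  Returns 1 when the clock has
 * made no progress for DEADMAN_SECONDS firings, i.e. when the real
 * deadman would panic with "timed out after 50 seconds of clock
 * inactivity".
 */
static int deadman_check(deadman_state_t *dm, long lbolt)
{
    if (lbolt != dm->last_lbolt) {
        dm->last_lbolt = lbolt;   /* clock is alive; reset the count */
        dm->stalled = 0;
        return (0);
    }
    return (++dm->stalled >= DEADMAN_SECONDS);
}
```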
Then I used mdb to examine the crash dump file, and found that most of
the CPUs were idle when the system panicked:
The fact that they are idle (and quiesced) is probably a side effect of
panicking.
> ::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fec254f8 1b 5 10 105 no no t-5676 a13f7de0 sched
| | |
RUNNING <--+ | +--> PIL THREAD
READY | 10 a13fade0
EXISTS | 6 a13f7de0
ENABLE | - a11ccde0 (idle)
|
+--> PRI THREAD PROC
109 a13fade0 sched
60 a1b7ede0 sched
60 a1587de0 sched
60 a27a1de0 sched
59 a727f000 snoop
The priority-109 thread is the clock. It passivated because it blocked
waiting for a lock that someone else held. At some point the lock was
dropped, which made the clock thread runnable again....which is why
it's now on the run queue.
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
1 a1d6e200 1f 1 0 -1 no no t-0 a22a9de0 (idle)
| |
RUNNING <--+ +--> PRI THREAD PROC
READY 60 a24a3de0 sched
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
2 a1d6d180 1f 0 0 -1 no no t-0 a23f6de0 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
3 a1d6c100 1f 0 0 -1 no no t-0 a242fde0 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
CPUs 1, 2, and 3 are QUIESCED; I think they must have been disabled by
the deadman via a cross-call.
Likely a side effect of panicking....
On CPU 0, a network interrupt thread (a13f7de0) was interrupted by a
clock (cyclic) interrupt:
> a13f7de0::findstack -v
stack pointer for thread a13f7de0: a13f75a8
a13f75b8 _interrupt+0xe7()
a13f7638 bcopy+0x39(abb11780)
a13f765c copymsg+0x2e(aa228340)
a13f7688 copymsgchain+0x23(aa228340)
a13f76c0 i_dls_link_rx_func+0x9d(a3041d78, 0, a13f773c, aa228340,
10000, 0)
a13f7704 i_dls_link_rx_common_promisc+0x3b(a3041d78, 0, a13f773c,
aa228340, 0, e669fcdc)
a13f77a8 i_dls_link_rx_common+0x11c(a3041d78, 0, aa228340, e669fcdc)
a13f77c4 i_dls_link_txloop+0x18(a3041d78, aa228340)
a13f77f0 mac_txloop+0x92(a3042ee8, aa228340)
a13f780c dls_tx+0x16(a3026f80, abb08e80)
a13f7828 dld_tx_single+0x1e(a2bfdea8, abb08e80)
a13f7840 str_mdata_fastpath_put+0x60(a2bfdea8, abb08e80)
a13f78b0 tcp_send_data+0x85c(aafda500, aafdfc68, abb08e80)
a13f791c tcp_send+0x6f1(aafdfc68, aafda500, 5a8, 34, 20, 0)
a13f79bc tcp_wput_data+0x663(aafda500, 0, 0)
a13f7aa0 tcp_rput_data+0x29b2(aafda100, a70a13c0, a1df0d00)
a13f7af8 squeue_drain+0x1c3(a1df0d00, 4, 4bb30e5e, bd)
a13f7b48 squeue_enter_chain+0x5be(a1df0d00, abb16f40, a70a13c0, 3, 1)
a13f7bdc ip_input+0x731(a2581754, a2afc020, a1359540, a13f7c0c)
a13f7c78 i_dls_link_rx_common+0x26e(a3041d78, a2afc020, a1359540,
e669fc0c)
a13f7c90 i_dls_link_rx_promisc+0x19(a3041d78, a2afc020, a1359540)
a13f7ccc mac_rx+0x53(a3042ee8, a2afc020, a1359540)
a13f7d60 bge_receive+0x598(a2c78000, a2f26800)
a13f7dac bge_intr+0x30f(a2c78000, 0)
a13f7ddc intr_thread+0x152()
I checked the stack of the clock (cyclic) interrupt; it seems to be a
soft interrupt, and it was blocked while trying to process the callout
table in the clock routine.
Right.
I don't know why it was blocked, and I didn't see any other thread
holding the mutex:
Someone already dropped it...which is why the clock thread is now runnable.
> a13fade0::findstack -v
stack pointer for thread a13fade0: a13fabb8
a13fabec swtch+0xc8()
a13fac24 turnstile_block+0x775(aab00900, 0, a1279000, fec04bc8, 0, 0)
a13fac80 mutex_vector_enter+0x34e(a1279000)
a13faca8 callout_schedule_1+0x13(a1279000)
a13facc8 callout_schedule+0x31()
a13fad14 clock+0x488(0)
a13fad80 cyclic_softint+0x29e(fec21d30, 1)
a13fad94 cbe_softclock+0x14(0, 0)
a13fadcc av_dispatch_softvect+0x66(a)
a13faddc dosoftint+0x109()
> a13fade0::thread
ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR DISPTIME
BOUND PR
a13fade0 run 9 0 3 109 0 10 n/a 0
-1 1
> a13fade0::thread -b
ADDR WCHAN TS PITS SOBJ OPS
a13fade0 0 aab00900 0 0
> a1279000::mutex
ADDR TYPE HELD MINSPL OLDSPL WAITERS
a1279000 adapt no - - no
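The pattern the stack and ::mutex output show (a thread asleep in turnstile_block on a held lock, which then becomes runnable once the holder drops it, leaving the mutex unheld by the time of the dump) is just the ordinary adaptive-mutex path. A user-level pthread analogue, purely illustrative since the kernel uses turnstiles rather than pthread primitives:

```c
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int progressed = 0;   /* set once the blocked thread finally runs */

/* Plays the role of the clock thread: sleeps until the holder unlocks. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);   /* blocks, like turnstile_block */
    progressed = 1;
    pthread_mutex_unlock(&lock);
    return (0);
}

/* The holder grabs the lock, lets a waiter pile up, then drops it. */
static int run_demo(void)
{
    pthread_t t;
    int blocked_ok;

    pthread_mutex_lock(&lock);
    pthread_create(&t, 0, waiter, 0);
    usleep(100000);              /* give the waiter time to block */
    /* waiter cannot have made progress while we hold the lock */
    blocked_ok = (progressed == 0);
    pthread_mutex_unlock(&lock); /* waiter becomes runnable again */
    pthread_join(t, 0);
    return (blocked_ok && progressed);
}
```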
The a13fade0 thread's state is "run", and I think that when the system
panicked, the deadman should have been running on CPU 0.
How can I find the deadman's stack? I know it should be a high-level
interrupt, but I can't get any information from the cpu_t cpu_intr_stack:
a13fade0 is priority 109, which is PIL 10. This is clock. The deadman
cyclic fires at PIL 14, and it will pin whatever thread
is running on the CPU. Because you panicked on CPU 0, this is where the
deadman cyclic must have fired.
> fec254f8::print cpu_t cpu_intr_stack
cpu_intr_stack = 0xa13f5000
> 0xa13f5000,20/nap
0xa13f5000:
mdb: failed to read data from target: no mapping for address
0xa13f5000:
mdb: failed to read data from target: no mapping for address
We can see that most of the physical memory is free:
> ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 127163 496 3%
Anon 14338 56 0%
Exec and libs 1775 6 0%
Page cache 509 1 0%
Free (cachelist) 642 2 0%
Free (freelist) 4046691 15807 97%
Total 4191118 16371
Physical 4191117 16371
We found 5112 pending counts for clock, which is why the deadman was
triggered.
We also found lots of pending counts for apic_redistribute_compute;
does that mean there is an APIC issue?
Not necessarily. The apic_redistribute_compute cyclic fires at the low
level, which is PIL 1, I believe. From the output above, we see that
the bge interrupt is at PIL 6, which would block out things firing at
PIL 1.
To me, this looks like the bge interrupt is either getting stuck, or is
spending *way* too much time in its ISR.
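A common driver-side mitigation for that failure mode (sketched generically here; this is not taken from the bge source, and RX_BUDGET is a made-up name) is to cap how many packets a single interrupt may drain, deferring the rest to the next interrupt or a polling pass so time spent at PIL 6 stays bounded:

```c
#include <assert.h>

#define RX_BUDGET 64   /* max packets drained per interrupt (hypothetical) */

/*
 * Drain up to RX_BUDGET packets from a ring holding 'pending' packets.
 * Returns the number processed; leftovers wait for the next interrupt,
 * so lower-PIL work (clock softints, PIL-1 cyclics) can still run.
 */
static int rx_intr(int pending)
{
    int done = 0;

    while (pending > 0 && done < RX_BUDGET) {
        pending--;        /* process one packet */
        done++;
    }
    return (done);
}
```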
-Eric
_______________________________________________
opensolaris-code mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code