Typically this is deadloop in kernel/module code. This case is bge's long loop in ISR, which prevent lbolt from being cleared by clock.
Oliver Yang wrote: > Hi Guys, > > I have a question about using mdb debug hang issue, does anybody can > give me the answers in below mail? > > -------- Original Message -------- > Subject: About debug hang issues with deadman > Date: Mon, 27 Nov 2006 16:40:00 +0800 > From: Oliver Yang <Oliver.Yang at Sun.COM> > To: opensolaris-code at opensolaris.org > > > > Hi All, > > Recently, I ran into a hang issue, and I enabled deadman timer by > setting "set snooping =1" in /etc/system file. Finally, we got a > crashdump file. > > Does anybody can give me a hint on how to debug hang issue with > enabling deadman? > > Here are steps what I have tried on one of crashdump files: > >> ::msgbuf > > panic[cpu0]/thread=a13f7de0: > deadman: timed out after 50 seconds of clock inactivity > > > a13f4f24 genunix:deadman+159 (0) > a13f4f5c genunix:cyclic_expire+280 (a14fe000, 3, aa2183) > a13f4fb8 genunix:cyclic_fire+17d (fec21d30) > a13f4fd4 unix:cbe_fire+4a (0, 0) > a13f75b8 unix:_interrupt+2a1 (1b0, abb10000, ab36) > a13f7638 unix:bcopy+39 (abb11780) > a13f765c genunix:copymsg+2e (aa228340) > a13f7688 genunix:copymsgchain+23 (aa228340) > a13f76c0 dls:i_dls_link_rx_func+9d (a3041d78, 0, a13f77) > a13f7704 dls:i_dls_link_rx_common_promisc+3b (a3041d78, 0, a13f77) > a13f77a8 dls:i_dls_link_rx_common+11c (a3041d78, 0, aa2283) > a13f77c4 dls:i_dls_link_txloop+18 (a3041d78, aa228340) > a13f77f0 mac:mac_txloop+92 (a3042ee8, aa228340) > a13f780c dls:dls_tx+16 (a3026f80, abb08e80) > a13f7828 dld:dld_tx_single+1e (a2bfdea8, abb08e80) > a13f7840 dld:str_mdata_fastpath_put+60 (a2bfdea8, abb08e80) > a13f78b0 ip:tcp_send_data+85c (aafda500, aafdfc68,) > a13f791c ip:tcp_send+6f1 (aafdfc68, aafda500,) > a13f79bc ip:tcp_wput_data+663 (aafda500, 0, 0) > a13f7aa0 ip:tcp_rput_data+29b2 (aafda100, a70a13c0,) > a13f7af8 ip:squeue_drain+1c3 (a1df0d00, 4, 4bb30e) > a13f7b48 ip:squeue_enter_chain+5be (a1df0d00, abb16f40,) > a13f7bdc ip:ip_input+731 (a2581754, a2afc020,) > a13f7c78 dls:i_dls_link_rx_common+26e (a3041d78, a2afc020,) > a13f7c90 dls:i_dls_link_rx_promisc+19 (a3041d78, a2afc020,) > a13f7ccc mac:mac_rx+53 (a3042ee8, a2afc020,) > a13f7d60 bge:bge_receive+598 (a2c78000, a2f26800) > a13f7dac bge:bge_intr+30f (a2c78000, 0) > > syncing file systems... > done > dumping to /dev/dsk/c2d0s1, offset 1719074816, content: kernel > > > Then I used mdb checking the crash dump file, and we found most of > CPUs are IDLE while system paniced: > > >> ::cpuinfo -v > ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC > 0 fec254f8 1b 5 10 105 no no t-5676 a13f7de0 sched > | | | > RUNNING <--+ | +--> PIL THREAD > READY | 10 a13fade0 > EXISTS | 6 a13f7de0 > ENABLE | - a11ccde0 (idle) > | > +--> PRI THREAD PROC > 109 a13fade0 sched > 60 a1b7ede0 sched > 60 a1587de0 sched > 60 a27a1de0 sched > 59 a727f000 snoop > > ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC > 1 a1d6e200 1f 1 0 -1 no no t-0 a22a9de0 (idle) > | | > RUNNING <--+ +--> PRI THREAD PROC > READY 60 a24a3de0 sched > QUIESCED > EXISTS > ENABLE > > ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC > 2 a1d6d180 1f 0 0 -1 no no t-0 a23f6de0 (idle) > | > RUNNING <--+ > READY > QUIESCED > EXISTS > ENABLE > > ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC > 3 a1d6c100 1f 0 0 -1 no no t-0 a242fde0 (idle) > | > RUNNING <--+ > READY > QUIESCED > EXISTS > ENABLE > > The CPU 1,2,3 status is QUIESCED, I think it must be disabled by > deadman via a cross-call. > > On CPU0, a network interrupt thread a13f7de0 was interrupted by a > clock(cyclic) interrupt: > > >> a13f7de0::findstack -v > stack pointer for thread a13f7de0: a13f75a8 > a13f75b8 _interrupt+0xe7() > a13f7638 bcopy+0x39(abb11780) > a13f765c copymsg+0x2e(aa228340) > a13f7688 copymsgchain+0x23(aa228340) > a13f76c0 i_dls_link_rx_func+0x9d(a3041d78, 0, a13f773c, aa228340, > 10000, 0) > a13f7704 i_dls_link_rx_common_promisc+0x3b(a3041d78, 0, a13f773c, > aa228340, 0, e669fcdc) > a13f77a8 i_dls_link_rx_common+0x11c(a3041d78, 0, aa228340, e669fcdc) > a13f77c4 i_dls_link_txloop+0x18(a3041d78, aa228340) > a13f77f0 mac_txloop+0x92(a3042ee8, aa228340) > a13f780c dls_tx+0x16(a3026f80, abb08e80) > a13f7828 dld_tx_single+0x1e(a2bfdea8, abb08e80) > a13f7840 str_mdata_fastpath_put+0x60(a2bfdea8, abb08e80) > a13f78b0 tcp_send_data+0x85c(aafda500, aafdfc68, abb08e80) > a13f791c tcp_send+0x6f1(aafdfc68, aafda500, 5a8, 34, 20, 0) > a13f79bc tcp_wput_data+0x663(aafda500, 0, 0) > a13f7aa0 tcp_rput_data+0x29b2(aafda100, a70a13c0, a1df0d00) > a13f7af8 squeue_drain+0x1c3(a1df0d00, 4, 4bb30e5e, bd) > a13f7b48 squeue_enter_chain+0x5be(a1df0d00, abb16f40, a70a13c0, 3, 1) > a13f7bdc ip_input+0x731(a2581754, a2afc020, a1359540, a13f7c0c) > a13f7c78 i_dls_link_rx_common+0x26e(a3041d78, a2afc020, a1359540, > e669fc0c) > a13f7c90 i_dls_link_rx_promisc+0x19(a3041d78, a2afc020, a1359540) > a13f7ccc mac_rx+0x53(a3042ee8, a2afc020, a1359540) > a13f7d60 bge_receive+0x598(a2c78000, a2f26800) > a13f7dac bge_intr+0x30f(a2c78000, 0) > a13f7ddc intr_thread+0x152() > > I check the stack of clock(cyclic) interrupt, it seems that it is a > soft interrupt, and it was blocked while trying to process the callout > table in clock routine. > > I don't know why it was blocked, and I didn't see any other thread > held the mutex: > >> a13fade0::findstack -v > stack pointer for thread a13fade0: a13fabb8 > a13fabec swtch+0xc8() > a13fac24 turnstile_block+0x775(aab00900, 0, a1279000, fec04bc8, 0, 0) > a13fac80 mutex_vector_enter+0x34e(a1279000) > a13faca8 callout_schedule_1+0x13(a1279000) > a13facc8 callout_schedule+0x31() > a13fad14 clock+0x488(0) > a13fad80 cyclic_softint+0x29e(fec21d30, 1) > a13fad94 cbe_softclock+0x14(0, 0) > a13fadcc av_dispatch_softvect+0x66(a) > a13faddc dosoftint+0x109() > >> a13fade0::thread > ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR DISPTIME BOUND PR > a13fade0 run 9 0 3 109 0 10 n/a 0 > -1 1 >> a13fade0::thread -b > ADDR WCHAN TS PITS SOBJ OPS > a13fade0 0 aab00900 0 0 >> a1279000::mutex > ADDR TYPE HELD MINSPL OLDSPL WAITERS > a1279000 adapt no - - no > > The status of a13fade0 thread is run, and I think while system panic, > the deadman should be running on CPU0. > How can I find the stack of deadman? I know it should be high level > interrupt, but I can't get any infomation from cpu_t cpu_intr_stack: > >> fec254f8::print cpu_t cpu_intr_stack > cpu_intr_stack = 0xa13f5000 >> 0xa13f5000,20/nap > 0xa13f5000: > mdb: failed to read data from target: no mapping for address > 0xa13f5000: > mdb: failed to read data from target: no mapping for address > > > > We can see, most of physical memory are free: > >> ::memstat > Page Summary Pages MB %Tot > ------------ ---------------- ---------------- ---- > Kernel 127163 496 3% > Anon 14338 56 0% > Exec and libs 1775 6 0% > Page cache 509 1 0% > Free (cachelist) 642 2 0% > Free (freelist) 4046691 15807 97% > > Total 4191118 16371 > Physical 4191117 16371 > > > We can found there are 5112 pending counts of clock, that's why > deadman was triggered. > And we also could find lots of pending counts of > apic_redistribute_compute, does it mean there is a APIC issue? > >> ::cycinfo -v > CPU CYC_CPU STATE NELEMS ROOT FIRE HANDLER > 0 a14fe000 online 5 aa2182c0 bd4a938200 deadman > > 3 > | > +------------------+------------------+ > 0 4 > | | > +---------+--------+ +---------+---------+ > 1 2 > | | > +----+----+ +----+----+ > > ADDR NDX HEAP LEVL PEND FIRE USECINT HANDLER > aa2182c0 0 1 high 0 bd4a938200 10000 cbe_hres_tick > aa2182e0 1 3 low 10787 bd4b2c1880 10000 > apic_redistribute_compute > aa218300 2 4 lock 5112 bd4b2c1880 10000 clock > aa218320 3 0 high 0 bd4a938200 1000000 deadman > aa218340 4 2 low 11 beebcf0800 10000000 ao_mca_poll_cyclic > > > CPU CYC_CPU STATE NELEMS ROOT FIRE HANDLER > 1 a2742000 online 2 a133d000 f226b17650 deadman > > 1 > | > +------------------+------------------+ > 0 > | > +---------+--------+ > > ADDR NDX HEAP LEVL PEND FIRE USECINT HANDLER > a133d000 0 1 low 1 f224d4a000 10000000 ao_mca_poll_cyclic > a133d020 1 0 high 0 f226b17650 1000000 deadman > > > CPU CYC_CPU STATE NELEMS ROOT FIRE HANDLER > 2 a274d000 online 4 a26c2a80 f200000000 bge_chip_cyclic > > 2 > | > +------------------+------------------+ > 0 3 > | | > +---------+--------+ +---------+---------+ > 1 > | > +----+----+ > > ADDR NDX HEAP LEVL PEND FIRE USECINT HANDLER > a26c2a80 0 1 low 1 f224d4a000 10000000 ao_mca_poll_cyclic > a26c2aa0 1 3 high 0 f1ecf382a0 1000000 deadman > a26c2ac0 2 0 lock 18 f200000000 536870 bge_chip_cyclic > a26c2ae0 3 2 lock 9 f200000000 1073741 nge_chip_cyclic > > > CPU CYC_CPU STATE NELEMS ROOT FIRE HANDLER > 3 a2758000 online 2 a133d880 f22a6b22f0 deadman > > 1 > | > +------------------+------------------+ > 0 > | > +---------+--------+ > > ADDR NDX HEAP LEVL PEND FIRE USECINT HANDLER > a133d880 0 1 low 1 f224d4a000 10000000 ao_mca_poll_cyclic > a133d8a0 1 0 high 0 f22a6b22f0 1000000 deadman > > > > >