Typically this means a dead loop in kernel or module code. In this case
it is bge's long loop in its ISR, which prevents lbolt from being
advanced by clock().
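
To make the mechanism concrete, here is a rough user-level model of that kind of check, written in Python rather than kernel C. It is only a sketch under assumptions: the names CpuState, deadman_tick and seconds_until_panic are made up for illustration, the check is assumed to run once per second (the 1000000 usec deadman interval in the ::cycinfo output below), and the 50-second grace period is taken from the panic message. It is not the kernel source.

```python
DEADMAN_SECONDS = 50  # taken from the panic message; assumed to be tunable


class CpuState:
    """Hypothetical per-CPU bookkeeping for the watchdog check."""

    def __init__(self, lbolt):
        self.last_lbolt = lbolt           # lbolt value seen at the last check
        self.countdown = DEADMAN_SECONDS  # seconds of grace remaining


def deadman_tick(cpu, lbolt):
    """One watchdog check; returns True if the system would panic now."""
    if lbolt != cpu.last_lbolt:
        # clock() has run since the last check: record it and re-arm.
        cpu.last_lbolt = lbolt
        cpu.countdown = DEADMAN_SECONDS
        return False
    if cpu.countdown > 0:
        # lbolt has not moved, but the grace period is not used up yet.
        cpu.countdown -= 1
        return False
    # lbolt has been frozen for the whole grace period.
    return True


def seconds_until_panic(lbolt_samples):
    """Feed one lbolt sample per second; return the second of the panic, or None."""
    cpu = CpuState(lbolt_samples[0])
    for second, lbolt in enumerate(lbolt_samples[1:], start=1):
        if deadman_tick(cpu, lbolt):
            return second
    return None
```

Calling seconds_until_panic() with a list of identical lbolt values models the situation in this dump: clock never gets to run, so the counter never moves and the check eventually fires. Any sample that differs from the previous one re-arms the countdown.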

Oliver Yang wrote:
> Hi Guys,
>
> I have a question about using mdb to debug a hang. Can anybody answer
> the questions in the mail below?
>
> -------- Original Message --------
> Subject:     About debug hang issues with deadman
> Date:     Mon, 27 Nov 2006 16:40:00 +0800
> From:     Oliver Yang <Oliver.Yang at Sun.COM>
> To:     opensolaris-code at opensolaris.org
>
>
>
> Hi All,
>
> Recently I ran into a hang, so I enabled the deadman timer by adding
> "set snooping=1" to the /etc/system file. Eventually we got a crash
> dump file.
>
> Can anybody give me a hint on how to debug a hang with the deadman
> timer enabled?
>
> Here are the steps I have tried on one of the crash dump files:
>
>> ::msgbuf
>
> panic[cpu0]/thread=a13f7de0:
> deadman: timed out after 50 seconds of clock inactivity
>
>
> a13f4f24 genunix:deadman+159 (0)
> a13f4f5c genunix:cyclic_expire+280 (a14fe000, 3, aa2183)
> a13f4fb8 genunix:cyclic_fire+17d (fec21d30)
> a13f4fd4 unix:cbe_fire+4a (0, 0)
> a13f75b8 unix:_interrupt+2a1 (1b0, abb10000, ab36)
> a13f7638 unix:bcopy+39 (abb11780)
> a13f765c genunix:copymsg+2e (aa228340)
> a13f7688 genunix:copymsgchain+23 (aa228340)
> a13f76c0 dls:i_dls_link_rx_func+9d (a3041d78, 0, a13f77)
> a13f7704 dls:i_dls_link_rx_common_promisc+3b (a3041d78, 0, a13f77)
> a13f77a8 dls:i_dls_link_rx_common+11c (a3041d78, 0, aa2283)
> a13f77c4 dls:i_dls_link_txloop+18 (a3041d78, aa228340)
> a13f77f0 mac:mac_txloop+92 (a3042ee8, aa228340)
> a13f780c dls:dls_tx+16 (a3026f80, abb08e80)
> a13f7828 dld:dld_tx_single+1e (a2bfdea8, abb08e80)
> a13f7840 dld:str_mdata_fastpath_put+60 (a2bfdea8, abb08e80)
> a13f78b0 ip:tcp_send_data+85c (aafda500, aafdfc68,)
> a13f791c ip:tcp_send+6f1 (aafdfc68, aafda500,)
> a13f79bc ip:tcp_wput_data+663 (aafda500, 0, 0)
> a13f7aa0 ip:tcp_rput_data+29b2 (aafda100, a70a13c0,)
> a13f7af8 ip:squeue_drain+1c3 (a1df0d00, 4, 4bb30e)
> a13f7b48 ip:squeue_enter_chain+5be (a1df0d00, abb16f40,)
> a13f7bdc ip:ip_input+731 (a2581754, a2afc020,)
> a13f7c78 dls:i_dls_link_rx_common+26e (a3041d78, a2afc020,)
> a13f7c90 dls:i_dls_link_rx_promisc+19 (a3041d78, a2afc020,)
> a13f7ccc mac:mac_rx+53 (a3042ee8, a2afc020,)
> a13f7d60 bge:bge_receive+598 (a2c78000, a2f26800)
> a13f7dac bge:bge_intr+30f (a2c78000, 0)
>
> syncing file systems...
> done
> dumping to /dev/dsk/c2d0s1, offset 1719074816, content: kernel
>
>
> Then I used mdb to examine the crash dump file, and found that most of
> the CPUs were idle when the system panicked:
>
>
>> ::cpuinfo -v
> ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
> 0 fec254f8  1b    5   10 105   no    no t-5676 a13f7de0 sched
>              |    |    |
>   RUNNING <--+    |    +--> PIL THREAD
>     READY         |          10 a13fade0
>    EXISTS         |           6 a13f7de0
>    ENABLE         |           - a11ccde0 (idle)
>                   |
>                   +-->  PRI THREAD   PROC
>                         109 a13fade0 sched
>                          60 a1b7ede0 sched
>                          60 a1587de0 sched
>                          60 a27a1de0 sched
>                          59 a727f000 snoop
>
> ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
> 1 a1d6e200  1f    1    0  -1   no    no t-0    a22a9de0 (idle)
>              |    |
>   RUNNING <--+    +-->  PRI THREAD   PROC
>     READY                60 a24a3de0 sched
>  QUIESCED
>    EXISTS
>    ENABLE
>
> ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
> 2 a1d6d180  1f    0    0  -1   no    no t-0    a23f6de0 (idle)
>              |
>   RUNNING <--+
>     READY
>  QUIESCED
>    EXISTS
>    ENABLE
>
> ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
> 3 a1d6c100  1f    0    0  -1   no    no t-0    a242fde0 (idle)
>              |
>   RUNNING <--+
>     READY
>  QUIESCED
>    EXISTS
>    ENABLE
>
> CPUs 1, 2 and 3 are in the QUIESCED state; I think they must have been
> disabled by deadman via a cross-call.
>
> On CPU0, the network interrupt thread a13f7de0 was interrupted by a
> clock (cyclic) interrupt:
>
>
>> a13f7de0::findstack -v
> stack pointer for thread a13f7de0: a13f75a8
> a13f75b8 _interrupt+0xe7()
> a13f7638 bcopy+0x39(abb11780)
> a13f765c copymsg+0x2e(aa228340)
> a13f7688 copymsgchain+0x23(aa228340)
> a13f76c0 i_dls_link_rx_func+0x9d(a3041d78, 0, a13f773c, aa228340, 10000, 0)
> a13f7704 i_dls_link_rx_common_promisc+0x3b(a3041d78, 0, a13f773c, aa228340, 0, e669fcdc)
> a13f77a8 i_dls_link_rx_common+0x11c(a3041d78, 0, aa228340, e669fcdc)
> a13f77c4 i_dls_link_txloop+0x18(a3041d78, aa228340)
> a13f77f0 mac_txloop+0x92(a3042ee8, aa228340)
> a13f780c dls_tx+0x16(a3026f80, abb08e80)
> a13f7828 dld_tx_single+0x1e(a2bfdea8, abb08e80)
> a13f7840 str_mdata_fastpath_put+0x60(a2bfdea8, abb08e80)
> a13f78b0 tcp_send_data+0x85c(aafda500, aafdfc68, abb08e80)
> a13f791c tcp_send+0x6f1(aafdfc68, aafda500, 5a8, 34, 20, 0)
> a13f79bc tcp_wput_data+0x663(aafda500, 0, 0)
> a13f7aa0 tcp_rput_data+0x29b2(aafda100, a70a13c0, a1df0d00)
> a13f7af8 squeue_drain+0x1c3(a1df0d00, 4, 4bb30e5e, bd)
> a13f7b48 squeue_enter_chain+0x5be(a1df0d00, abb16f40, a70a13c0, 3, 1)
> a13f7bdc ip_input+0x731(a2581754, a2afc020, a1359540, a13f7c0c)
> a13f7c78 i_dls_link_rx_common+0x26e(a3041d78, a2afc020, a1359540, e669fc0c)
> a13f7c90 i_dls_link_rx_promisc+0x19(a3041d78, a2afc020, a1359540)
> a13f7ccc mac_rx+0x53(a3042ee8, a2afc020, a1359540)
> a13f7d60 bge_receive+0x598(a2c78000, a2f26800)
> a13f7dac bge_intr+0x30f(a2c78000, 0)
> a13f7ddc intr_thread+0x152()
>
> I checked the stack of the clock (cyclic) interrupt. It appears to be
> a soft interrupt, and it was blocked while trying to process the
> callout table in the clock routine.
>
> I don't know why it was blocked, and I didn't see any other thread
> holding the mutex:
>
>> a13fade0::findstack -v
> stack pointer for thread a13fade0: a13fabb8
> a13fabec swtch+0xc8()
> a13fac24 turnstile_block+0x775(aab00900, 0, a1279000, fec04bc8, 0, 0)
> a13fac80 mutex_vector_enter+0x34e(a1279000)
> a13faca8 callout_schedule_1+0x13(a1279000)
> a13facc8 callout_schedule+0x31()
> a13fad14 clock+0x488(0)
> a13fad80 cyclic_softint+0x29e(fec21d30, 1)
> a13fad94 cbe_softclock+0x14(0, 0)
> a13fadcc av_dispatch_softvect+0x66(a)
> a13faddc dosoftint+0x109()
>
>> a13fade0::thread
>   ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL     INTR DISPTIME BOUND PR
> a13fade0 run         9    0    3   109     0  10      n/a        0    -1  1
>> a13fade0::thread -b
>   ADDR    WCHAN       TS     PITS    SOBJ OPS
> a13fade0        0 aab00900        0           0
>> a1279000::mutex
>   ADDR  TYPE     HELD MINSPL OLDSPL WAITERS
> a1279000 adapt       no      -      -      no
>
> The state of thread a13fade0 is "run", and I think that when the
> system panicked, deadman should have been running on CPU0.
> How can I find the stack of deadman? I know it should be a high-level
> interrupt, but I can't get any information from the cpu_t
> cpu_intr_stack:
>
>> fec254f8::print cpu_t cpu_intr_stack
> cpu_intr_stack = 0xa13f5000
>> 0xa13f5000,20/nap
> 0xa13f5000:
> mdb: failed to read data from target: no mapping for address
> 0xa13f5000:
> mdb: failed to read data from target: no mapping for address
>
>
>
> We can see that most of the physical memory is free:
>
>> ::memstat
> Page Summary                Pages                MB  %Tot
> ------------     ----------------  ----------------  ----
> Kernel                     127163               496    3%
> Anon                        14338                56    0%
> Exec and libs                1775                 6    0%
> Page cache                    509                 1    0%
> Free (cachelist)              642                 2    0%
> Free (freelist)           4046691             15807   97%
>
> Total                     4191118             16371
> Physical                  4191117             16371
>
>
> We found 5112 pending counts for clock, which is why deadman was
> triggered.
> We also found a large pending count for apic_redistribute_compute.
> Does that mean there is an APIC issue?
>
>> ::cycinfo -v
> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
> 0 a14fe000  online      5 aa2182c0      bd4a938200 deadman
>
>                                      3
>                                      |
>                   +------------------+------------------+
>                   0                                     4
>                   |                                     |
>         +---------+--------+                  +---------+---------+
>         1                  2
>         |                  |
>    +----+----+        +----+----+
>
>     ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
> aa2182c0   0    1 high     0      bd4a938200   10000 cbe_hres_tick
> aa2182e0   1    3  low 10787      bd4b2c1880   10000 apic_redistribute_compute
> aa218300   2    4 lock  5112      bd4b2c1880   10000 clock
> aa218320   3    0 high     0      bd4a938200 1000000 deadman
> aa218340   4    2  low    11      beebcf0800 10000000 ao_mca_poll_cyclic
>
>
> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
> 1 a2742000  online      2 a133d000      f226b17650 deadman
>
>                                      1
>                                      |
>                   +------------------+------------------+
>                   0
>                   |
>         +---------+--------+
>
>     ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
> a133d000   0    1  low     1      f224d4a000 10000000 ao_mca_poll_cyclic
> a133d020   1    0 high     0      f226b17650 1000000 deadman
>
>
> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
> 2 a274d000  online      4 a26c2a80      f200000000 bge_chip_cyclic
>
>                                      2
>                                      |
>                   +------------------+------------------+
>                   0                                     3
>                   |                                     |
>         +---------+--------+                  +---------+---------+
>         1
>         |
>    +----+----+
>
>     ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
> a26c2a80   0    1  low     1      f224d4a000 10000000 ao_mca_poll_cyclic
> a26c2aa0   1    3 high     0      f1ecf382a0 1000000 deadman
> a26c2ac0   2    0 lock    18      f200000000  536870 bge_chip_cyclic
> a26c2ae0   3    2 lock     9      f200000000 1073741 nge_chip_cyclic
>
>
> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
> 3 a2758000  online      2 a133d880      f22a6b22f0 deadman
>
>                                      1
>                                      |
>                   +------------------+------------------+
>                   0
>                   |
>         +---------+--------+
>
>     ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
> a133d880   0    1  low     1      f224d4a000 10000000 ao_mca_poll_cyclic
> a133d8a0   1    0 high     0      f22a6b22f0 1000000 deadman
>
>
>
>
>
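
On the pending counts: a quick way to read them is to convert PEND into time using the USECINT column. The sketch below does that for the CPU 0 rows of the ::cycinfo -v output quoted above. It assumes that PEND counts one missed firing per interval, so PEND times USECINT approximates how long the handler has been waiting to run. The helper name backlog_seconds is made up for this example.

```python
def backlog_seconds(pending, usec_interval):
    """Time represented by `pending` unprocessed firings, in seconds."""
    return pending * usec_interval / 1_000_000


# (handler, PEND, USECINT) copied from the CPU 0 rows of ::cycinfo -v.
cpu0_cyclics = [
    ("apic_redistribute_compute", 10787, 10_000),
    ("clock", 5112, 10_000),
    ("ao_mca_poll_cyclic", 11, 10_000_000),
]

for handler, pending, usecint in cpu0_cyclics:
    print(f"{handler:26s} {backlog_seconds(pending, usecint):8.2f} s")
```

Under that assumption the clock backlog comes to about 51 seconds, which lines up with the 50-second deadman timeout in the panic message. The two low-level cyclics show a backlog of roughly 108 to 110 seconds, which would suggest that low-level soft interrupts on CPU 0 had been starved for even longer. On that reading the apic_redistribute_compute count looks like a symptom of the same stall rather than evidence of an APIC problem, though the numbers alone do not rule one out.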

