On Monday 27 November 2006 12:40 am, Oliver Yang wrote:
> Hi All,
>
> Recently, I ran into a hang issue, and I enabled the deadman timer by
> setting "set snooping=1" in the /etc/system file. Finally, we got a
> crash dump file.
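>
> For what it's worth, the deadman tunables can be double-checked in the
> dump itself. I believe the relevant clock.c symbols are snooping,
> snoop_interval and deadman_enabled (names from my reading of the
> source, so treat them as approximate):
>
>  > snooping/D
>  > snoop_interval/D
>  > deadman_enabled/D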

This still hung during stress testing last night.

ajd

>
> Can anybody give me a hint on how to debug a hang issue with deadman
> enabled?
>
> Here are the steps I have tried on one of the crash dump files:
>  > ::msgbuf
>
> panic[cpu0]/thread=a13f7de0:
> deadman: timed out after 50 seconds of clock inactivity
>
>
> a13f4f24 genunix:deadman+159 (0)
> a13f4f5c genunix:cyclic_expire+280 (a14fe000, 3, aa2183)
>  a13f4fb8 genunix:cyclic_fire+17d (fec21d30)
> a13f4fd4 unix:cbe_fire+4a (0, 0)
> a13f75b8 unix:_interrupt+2a1 (1b0, abb10000, ab36)
> a13f7638 unix:bcopy+39 (abb11780)
> a13f765c genunix:copymsg+2e (aa228340)
> a13f7688 genunix:copymsgchain+23 (aa228340)
> a13f76c0 dls:i_dls_link_rx_func+9d (a3041d78, 0, a13f77)
> a13f7704 dls:i_dls_link_rx_common_promisc+3b (a3041d78, 0, a13f77)
> a13f77a8 dls:i_dls_link_rx_common+11c (a3041d78, 0, aa2283)
> a13f77c4 dls:i_dls_link_txloop+18 (a3041d78, aa228340)
> a13f77f0 mac:mac_txloop+92 (a3042ee8, aa228340)
> a13f780c dls:dls_tx+16 (a3026f80, abb08e80)
> a13f7828 dld:dld_tx_single+1e (a2bfdea8, abb08e80)
> a13f7840 dld:str_mdata_fastpath_put+60 (a2bfdea8, abb08e80)
> a13f78b0 ip:tcp_send_data+85c (aafda500, aafdfc68,)
> a13f791c ip:tcp_send+6f1 (aafdfc68, aafda500,)
> a13f79bc ip:tcp_wput_data+663 (aafda500, 0, 0)
> a13f7aa0 ip:tcp_rput_data+29b2 (aafda100, a70a13c0,)
> a13f7af8 ip:squeue_drain+1c3 (a1df0d00, 4, 4bb30e)
> a13f7b48 ip:squeue_enter_chain+5be (a1df0d00, abb16f40,)
> a13f7bdc ip:ip_input+731 (a2581754, a2afc020,)
> a13f7c78 dls:i_dls_link_rx_common+26e (a3041d78, a2afc020,)
> a13f7c90 dls:i_dls_link_rx_promisc+19 (a3041d78, a2afc020,)
> a13f7ccc mac:mac_rx+53 (a3042ee8, a2afc020,)
> a13f7d60 bge:bge_receive+598 (a2c78000, a2f26800)
> a13f7dac bge:bge_intr+30f (a2c78000, 0)
>
> syncing file systems...
>  done
> dumping to /dev/dsk/c2d0s1, offset 1719074816, content: kernel
>
>
> Then I used mdb to check the crash dump file, and we found that most of
> the CPUs were idle when the system panicked:
>  > ::cpuinfo -v
>
>  ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
>   0 fec254f8  1b    5   10 105   no    no t-5676 a13f7de0 sched
>
>     RUNNING <--+    |    +--> PIL THREAD
>       READY         |          10 a13fade0
>      EXISTS         |           6 a13f7de0
>      ENABLE         |           - a11ccde0 (idle)
>
>                     +-->  PRI THREAD   PROC
>                           109 a13fade0 sched
>                            60 a1b7ede0 sched
>                            60 a1587de0 sched
>                            60 a27a1de0 sched
>                            59 a727f000 snoop
>
>  ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
>   1 a1d6e200  1f    1    0  -1   no    no t-0    a22a9de0 (idle)
>
>     RUNNING <--+    +-->  PRI THREAD   PROC
>       READY                60 a24a3de0 sched
>    QUIESCED
>      EXISTS
>      ENABLE
>
>  ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
>   2 a1d6d180  1f    0    0  -1   no    no t-0    a23f6de0 (idle)
>
>     RUNNING <--+
>       READY
>    QUIESCED
>      EXISTS
>      ENABLE
>
>  ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
>   3 a1d6c100  1f    0    0  -1   no    no t-0    a242fde0 (idle)
>
>     RUNNING <--+
>       READY
>    QUIESCED
>      EXISTS
>      ENABLE
>
> CPUs 1, 2, and 3 are in the QUIESCED state; I think they must have been
> disabled by deadman via a cross-call.
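>
> This could probably be confirmed by looking at the idle thread on one
> of those CPUs, e.g. CPU 1's idle thread a22a9de0 from the ::cpuinfo
> output above; I would expect it to be sitting in the panic idle loop
> after the cross-call:
>
>  > a22a9de0::findstack -v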
>
> On CPU 0, a network interrupt thread (a13f7de0) was interrupted by a
> clock (cyclic) interrupt:
>  > a13f7de0::findstack -v
>
> stack pointer for thread a13f7de0: a13f75a8
>   a13f75b8 _interrupt+0xe7()
>   a13f7638 bcopy+0x39(abb11780)
>   a13f765c copymsg+0x2e(aa228340)
>   a13f7688 copymsgchain+0x23(aa228340)
>   a13f76c0 i_dls_link_rx_func+0x9d(a3041d78, 0, a13f773c, aa228340,
> 10000, 0)
>   a13f7704 i_dls_link_rx_common_promisc+0x3b(a3041d78, 0, a13f773c,
> aa228340, 0, e669fcdc)
>   a13f77a8 i_dls_link_rx_common+0x11c(a3041d78, 0, aa228340, e669fcdc)
>   a13f77c4 i_dls_link_txloop+0x18(a3041d78, aa228340)
>   a13f77f0 mac_txloop+0x92(a3042ee8, aa228340)
>   a13f780c dls_tx+0x16(a3026f80, abb08e80)
>   a13f7828 dld_tx_single+0x1e(a2bfdea8, abb08e80)
>   a13f7840 str_mdata_fastpath_put+0x60(a2bfdea8, abb08e80)
>   a13f78b0 tcp_send_data+0x85c(aafda500, aafdfc68, abb08e80)
>   a13f791c tcp_send+0x6f1(aafdfc68, aafda500, 5a8, 34, 20, 0)
>   a13f79bc tcp_wput_data+0x663(aafda500, 0, 0)
>   a13f7aa0 tcp_rput_data+0x29b2(aafda100, a70a13c0, a1df0d00)
>   a13f7af8 squeue_drain+0x1c3(a1df0d00, 4, 4bb30e5e, bd)
>   a13f7b48 squeue_enter_chain+0x5be(a1df0d00, abb16f40, a70a13c0, 3, 1)
>   a13f7bdc ip_input+0x731(a2581754, a2afc020, a1359540, a13f7c0c)
>   a13f7c78 i_dls_link_rx_common+0x26e(a3041d78, a2afc020, a1359540,
> e669fc0c)
>   a13f7c90 i_dls_link_rx_promisc+0x19(a3041d78, a2afc020, a1359540)
>   a13f7ccc mac_rx+0x53(a3042ee8, a2afc020, a1359540)
>   a13f7d60 bge_receive+0x598(a2c78000, a2f26800)
>   a13f7dac bge_intr+0x30f(a2c78000, 0)
>   a13f7ddc intr_thread+0x152()
>
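> The PILs here line up with the ::cpuinfo output above (the bge
> interrupt thread at PIL 6 was pinned by the clock soft interrupt at
> PIL 10); that can be double-checked with:
>
>  > a13f7de0::print kthread_t t_pil
>  > a13fade0::print kthread_t t_pil
>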
> I checked the stack of the clock (cyclic) interrupt. It appears to be a
> soft interrupt, and it was blocked while trying to process the callout
> table in the clock routine.
>
> I don't know why it was blocked, and I didn't see any other thread
> holding the mutex:
>  > a13fade0::findstack -v
>
> stack pointer for thread a13fade0: a13fabb8
>   a13fabec swtch+0xc8()
>   a13fac24 turnstile_block+0x775(aab00900, 0, a1279000, fec04bc8, 0, 0)
>   a13fac80 mutex_vector_enter+0x34e(a1279000)
>   a13faca8 callout_schedule_1+0x13(a1279000)
>   a13facc8 callout_schedule+0x31()
>   a13fad14 clock+0x488(0)
>   a13fad80 cyclic_softint+0x29e(fec21d30, 1)
>   a13fad94 cbe_softclock+0x14(0, 0)
>   a13fadcc av_dispatch_softvect+0x66(a)
>   a13faddc dosoftint+0x109()
>
>  > a13fade0::thread
>
>     ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL     INTR DISPTIME BOUND PR
> a13fade0 run         9    0    3   109     0  10      n/a        0    -1  1
>
>  > a13fade0::thread -b
>
>     ADDR    WCHAN       TS     PITS    SOBJ OPS
> a13fade0        0 aab00900        0           0
>
>  > a1279000::mutex
>
>     ADDR  TYPE     HELD MINSPL OLDSPL WAITERS
> a1279000 adapt       no      -      -      no
>
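> One thing I still plan to try is checking what this mutex is embedded
> in, and dumping every thread stack to see whether anything else is in
> the callout code or references a1279000:
>
>  > a1279000::whatis
>  > ::walk thread | ::findstack -v
>
> The second command produces a lot of output, but it can be searched for
> callout_ frames or for the mutex address.
>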
> The state of thread a13fade0 is run, and I think that while the system
> was panicking, deadman should have been running on CPU 0.
> How can I find the stack of deadman? I know it should be a high-level
> interrupt, but I can't get any information from the cpu_t cpu_intr_stack:
>  > fec254f8::print cpu_t cpu_intr_stack
>
> cpu_intr_stack = 0xa13f5000
>
>  > 0xa13f5000,20/nap
>
> 0xa13f5000:
> mdb: failed to read data from target: no mapping for address
> 0xa13f5000:
> mdb: failed to read data from target: no mapping for address
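>
> Maybe ::panicinfo is a better starting point here: deadman is what
> initiated the panic, so the registers recorded at panic time should be
> from deadman's own context (and the panic stack in ::msgbuf above
> already shows deadman being called from cyclic_fire):
>
>  > ::panicinfo
>  > fec254f8::print cpu_t cpu_intr_actv
>
> The second command is a guess on my part; if cpu_intr_actv is the right
> field, it should show which interrupt levels were active on CPU 0 when
> the dump was taken.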
>
> We can see that most of the physical memory is free:
>  > ::memstat
>
> Page Summary                Pages                MB  %Tot
> ------------     ----------------  ----------------  ----
> Kernel                     127163               496    3%
> Anon                        14338                56    0%
> Exec and libs                1775                 6    0%
> Page cache                    509                 1    0%
> Free (cachelist)              642                 2    0%
> Free (freelist)           4046691             15807   97%
>
> Total                     4191118             16371
> Physical                  4191117             16371
>
>
> We can see that there are 5112 pending counts for the clock cyclic,
> which is why deadman was triggered.
> We can also see a lot of pending counts for apic_redistribute_compute;
> does that mean there is an APIC issue?
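>
> As a sanity check on the clock number: the clock cyclic's USECINT is
> 10000, per the ::cycinfo output below, and mdb can be used as a
> calculator:
>
>  > 0t5112*0t10000=D
>
> which gives 51120000, i.e. about 51 seconds of backlogged clock ticks,
> matching the 50-second deadman timeout.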
>
>  > ::cycinfo -v
>
> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
>   0 a14fe000  online      5 aa2182c0      bd4a938200 deadman
>
>                                        3
>
>                     +------------------+------------------+
>                     0                                     4
>
>           +---------+--------+                  +---------+---------+
>           1                  2
>
>      +----+----+        +----+----+
>
>       ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
>   aa2182c0   0    1 high     0      bd4a938200   10000 cbe_hres_tick
>   aa2182e0   1    3  low 10787      bd4b2c1880   10000
> apic_redistribute_compute
>   aa218300   2    4 lock  5112      bd4b2c1880   10000 clock
>   aa218320   3    0 high     0      bd4a938200 1000000 deadman
>   aa218340   4    2  low    11      beebcf0800 10000000 ao_mca_poll_cyclic
>
>
> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
>   1 a2742000  online      2 a133d000      f226b17650 deadman
>
>                                        1
>
>                     +------------------+------------------+
>                     0
>
>           +---------+--------+
>
>       ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
>   a133d000   0    1  low     1      f224d4a000 10000000 ao_mca_poll_cyclic
>   a133d020   1    0 high     0      f226b17650 1000000 deadman
>
>
> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
>   2 a274d000  online      4 a26c2a80      f200000000 bge_chip_cyclic
>
>                                        2
>
>                     +------------------+------------------+
>                     0                                     3
>
>           +---------+--------+                  +---------+---------+
>           1
>
>      +----+----+
>
>       ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
>   a26c2a80   0    1  low     1      f224d4a000 10000000 ao_mca_poll_cyclic
>   a26c2aa0   1    3 high     0      f1ecf382a0 1000000 deadman
>   a26c2ac0   2    0 lock    18      f200000000  536870 bge_chip_cyclic
>   a26c2ae0   3    2 lock     9      f200000000 1073741 nge_chip_cyclic
>
>
> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
>   3 a2758000  online      2 a133d880      f22a6b22f0 deadman
>
>                                        1
>
>                     +------------------+------------------+
>                     0
>
>           +---------+--------+
>
>       ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
>   a133d880   0    1  low     1      f224d4a000 10000000 ao_mca_poll_cyclic
>   a133d8a0   1    0 high     0      f22a6b22f0 1000000 deadman

-- 

Alan DuBoff - Solaris x86 Engineering - IHV/OEM Group
Advocate of Insourcing at Sun, hire people that care about our company!



