Re: [networking-discuss] bad LSO ill state..

Garrett D'Amore Tue, 14 Oct 2008 09:48:53 -0700

Andrew Gallatin wrote:
> Paul Durrant wrote:
>   
>> Andrew Gallatin wrote:
>>     
>>> The plumb/unplumb tests in NICDRV are killing me..  I've got a strange
>>> memory corruption bug I'm still tracking that I may end up asking for
>>> advice on..
>>>
>>>       
>> My only other problem with them was that they seem to expect the 
>> netperf/netserver processes to always succeed; which in the face of an 
>> unplumb isn't necessarily the case. I had no other driver problems.
>>     
>
> That was the problem I started looking at, and I assumed that was the
> only issue.  Then I noticed bad checksums on the receiver, and started
> looking harder.  Maybe too hard :(
>
>   
>> If you're getting corruption in your driver then you may shed some light 
>> on it by setting kmem_flags to 0xf in your /etc/system file and 
>> rebooting before running your test.
>>     
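For reference, the suggestion above corresponds to a one-line entry in /etc/system. This is only a sketch of that entry: 0xf is the value named in the thread, and the comment line is illustrative. A reboot is needed for it to take effect.

```
* Enable kmem debugging (value 0xf as suggested in this thread)
set kmem_flags=0xf
```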
>
> The issue I'm seeing now is somewhat random corruption
> in *other* drivers if I plumb/unplumb a few hundred times in a tight
> loop under load.   I've tried setting kmem_flags, but that doesn't
> seem to affect the problem at all.
>
> The crash is almost always in the bge driver's tx routine on a page
> aligned kernel virtual address,  something like this:
>
> sched:
> #pf Page fault
> Bad kernel fault at addr=0xffffff01db2a5000
> pid=0, pc=0xfffffffffb83bfaa, sp=0xffffff0007d7e5d8, eflags=0x10212
> cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
> cr2: ffffff01db2a5000
> cr3: 3400000
> cr8: c
>
>          rdi: ffffff01db2a4ffa rsi: ffffff01eaa0b918 rdx:               40
>          rcx:                2  r8:               40  r9:                0
>          rax:                0 rbx: ffffff01eda1eec0 rbp: ffffff0007d7e640
>          r10: ffffff01f0316900 r11: ffffff01cb57a040 r12: ffffff01db2a4fca
>          r13: ffffff01cfe0fad0 r14: ffffff01d03c4a20 r15:               40
>          fsb:                0 gsb: ffffff01ceabeac0  ds:               4b
>           es:               4b  fs:                0  gs:              1c3
>          trp:                e err:                2 rip: fffffffffb83bfaa
>           cs:               30 rfl:            10212 rsp: ffffff0007d7e5d8
>           ss:               38
>
> ffffff0007d7e3c0 unix:die+ea ()
> ffffff0007d7e4d0 unix:trap+13b9 ()
> ffffff0007d7e4e0 unix:cmntrap+e9 ()
> ffffff0007d7e640 unix:bcopy+a ()
> ffffff0007d7e680 bge:bge_m_tx+60 ()
> ffffff0007d7e6a0 dls:dls_tx+1d ()
> ffffff0007d7e6d0 dld:dld_tx_single+2a ()
> ffffff0007d7e700 dld:str_mdata_fastpath_put+7f ()
> ffffff0007d7e7f0 ip:tcp_lsosend_data+581 ()
> <....>
>
> If I disassemble the pc, it looks like this faulting address
> is the *source* of a bcopy:
>
>  > 0xfffffffffb83bfaa::dis
> bcopy:                          xchgq  %rdi,%rsi
> bcopy+3:                        movq   %rdx,%rcx
> bcopy+6:                        shrq   $0x3,%rcx
> bcopy+0xa:                      repz movsq (%rsi),(%rdi)
>
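One way to cross-check which side of the copy faulted is to combine the trap error code with the registers from the dump. The sketch below is a user-space helper, not anything from the kernel or from mdb; the function and enum names are made up for illustration. It relies only on the architectural x86 page-fault error bits (bit 1 set means a write) and on the fact that movsq reads from (%rsi) and writes to (%rdi). Fed the values quoted above (err 2, cr2 ffffff01db2a5000, rdi ffffff01db2a4ffa), it points at the destination rather than the source, so that reading may be worth double-checking against the actual dump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 page-fault error code bits. */
#define PF_ERR_PRESENT 0x1u /* 0 = page not present, 1 = protection fault */
#define PF_ERR_WRITE   0x2u /* 1 = faulting access was a write */
#define PF_ERR_USER    0x4u /* 1 = fault happened in user mode */

/* Hypothetical names, for illustration only. */
enum fault_side { FAULT_UNKNOWN, FAULT_SOURCE, FAULT_DEST };

/* True if addr lies inside the width-byte access starting at base. */
static bool touches(uint64_t base, uint64_t width, uint64_t addr)
{
    return addr >= base && addr - base < width;
}

/*
 * Classify a fault taken on "movsq (%rsi),(%rdi)": the instruction
 * reads 8 bytes at rsi and writes 8 bytes at rdi.
 */
enum fault_side classify_movsq_fault(uint64_t rdi, uint64_t rsi,
                                     uint64_t cr2, uint32_t err)
{
    bool is_write = (err & PF_ERR_WRITE) != 0;

    if (is_write && touches(rdi, 8, cr2))
        return FAULT_DEST;
    if (!is_write && touches(rsi, 8, cr2))
        return FAULT_SOURCE;
    return FAULT_UNKNOWN;
}

/* Values copied from the panic output quoted in this thread. */
void check_dump_from_thread(void)
{
    enum fault_side side = classify_movsq_fault(
        0xffffff01db2a4ffaULL,  /* rdi */
        0xffffff01eaa0b918ULL,  /* rsi */
        0xffffff01db2a5000ULL,  /* cr2 */
        0x2u);                  /* err */
    assert(side == FAULT_DEST);
}
```

The other registers look consistent with this: rdx is 0x40 (8 quadwords), rcx shows 2 still to copy, and r12 (ffffff01db2a4fca) plus the 0x30 bytes already copied gives the rdi value at the fault.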
> I don't have the dump handy, but I also saw one case where there
> was a panic in a copyin. So maybe I held a bogus reference to an
> mblk or dblk, and freed it out from under the system?
>
>
> At the time it happens, my driver will usually be in the middle of
> tearing down its transmit ring:
>
>  >  ::pgrep ifconfig |::walk thread |::findstack
> stack pointer for thread ffffff01cfa87040: ffffff000877b170
>    ffffff000877b200 page_ctr_add_internal+0x5c()
>    ffffff000877b250 do_interrupt+0xdb()
>    ffffff000877b260 _interrupt+0xba()
>    ffffff000877b390 mutex_exit+0xc()
>    ffffff000877b3f0 kmem_cache_free+0xa7()
>    ffffff000877b430 rootnex_dma_freehdl+0x3d()
>    ffffff000877b460 ddi_dma_freehdl+0x29()
>    ffffff000877b480 ddi_dma_free_handle+0x1b()
>    ffffff000877b4c0 myri10ge_unprepare_tx_ring+0x61()
>    ffffff000877b4f0 myri10ge_teardown_slice+0x3f()
>    ffffff000877b530 myri10ge_stop_locked+0x6c()
>    ffffff000877b550 myri10ge_m_stop+0x6b()
>    ffffff000877b580 mac_stop+0x47()
>    ffffff000877b5d0 dls_close+0x17a()
> <....>
>
> I'm not good enough yet with mdb to be able to figure out what
> happened..
>


Is this on SPARC or x86 hardware?

It *sounds* like it might be a DMA corruption problem.
Make sure that when you do m_stop, you've really shut down your hardware 
including any DMA transfers *before* you yank the DMA mappings out from 
underneath it.   (I can imagine in particular a DMA region getting 
reused, and if your device is still accessing that region, then problems 
could ensue.)
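The ordering described above can be sketched as a small user-space model. This is not driver code and not the myri10ge implementation: the struct and the model_* functions are hypothetical stand-ins. In a real driver the last two steps would be calls such as ddi_dma_unbind_handle(9F) and ddi_dma_free_handle(9F), and the drain step would poll whatever status the hardware actually provides.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one transmit ring's DMA state. */
struct tx_ring_model {
    bool hw_dma_enabled;    /* device may still start new DMA */
    int  dma_in_flight;     /* transfers the device has not finished */
    bool handles_bound;     /* DMA mappings still in place */
    bool handles_allocated; /* DMA handles still allocated */
};

/* Stand-in for telling the device to stop issuing DMA. */
static void model_hw_disable_dma(struct tx_ring_model *r)
{
    r->hw_dma_enabled = false;
}

/*
 * Stand-in for polling a device status register until outstanding
 * transfers complete. Returns true once nothing is in flight.
 */
static bool model_hw_drain(struct tx_ring_model *r, int max_polls)
{
    while (r->dma_in_flight > 0 && max_polls-- > 0)
        r->dma_in_flight--;
    return r->dma_in_flight == 0;
}

/*
 * Tear down the ring: quiesce the hardware first, and only then
 * release the mappings. Returns 0 on success, -1 if the device did
 * not go idle, in which case the mappings are left untouched.
 */
int model_teardown_tx_ring(struct tx_ring_model *r, int max_polls)
{
    model_hw_disable_dma(r);
    if (!model_hw_drain(r, max_polls))
        return -1;

    /* The device can no longer touch the buffers past this point. */
    assert(!r->hw_dma_enabled && r->dma_in_flight == 0);

    r->handles_bound = false;     /* real driver: ddi_dma_unbind_handle(9F) */
    r->handles_allocated = false; /* real driver: ddi_dma_free_handle(9F) */
    return 0;
}
```

The point of returning early on a failed drain is that freeing the mappings while the device is still active is exactly the case where a reused region could get scribbled on.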

    -- Garrett
> Drew
> _______________________________________________
> networking-discuss mailing list
> [email protected]
>   
