Andrew Gallatin wrote:
> Paul Durrant wrote:
>
>> Andrew Gallatin wrote:
>>
>>> The plumb/unplumb tests in NICDRV are killing me.. I've got a strange
>>> memory corruption bug I'm still tracking that I may end up asking for
>>> advice on..
>>>
>>>
>> My only other problem with them was that they seem to expect the
>> netperf/netserver processes to always succeed, which in the face of an
>> unplumb isn't necessarily the case. I had no other driver problems.
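>> (That is, during an unplumb a run like
>>
>>     netperf -H $peer -l 30
>>
>> can legitimately fail, and the harness shouldn't count that against
>> the driver; the host and duration here are made up.)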
>>
>
> That was the problem I started looking at, and I assumed that was the
> only issue. Then I noticed bad checksums on the receiver, and started
> looking harder. Maybe too hard :(
>
>
>> If you're getting corruption in your driver then you may shed some light
>> on it by setting kmem_flags to 0xf in your /etc/system file and
>> rebooting before running your test.
>>
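>> That is, add the line
>>
>>     set kmem_flags = 0xf
>>
>> and reboot; 0xf turns on all of the kmem debugging flags (audit,
>> deadbeef, redzone and contents).
>>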
>
> The issue now is that I'm seeing somewhat random corruption
> in *other* drivers if I plumb/unplumb a few hundred times in a tight
> loop under load. I've tried setting kmem_flags, but that doesn't
> seem to affect the problem at all.
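>
> (The loop is just something like the following, with the interface
> name and address as placeholders for what the test actually uses:
>
>     while :; do
>             ifconfig myri10ge0 plumb
>             ifconfig myri10ge0 192.0.2.1 netmask 255.255.255.0 up
>             ifconfig myri10ge0 unplumb
>     done
>
> and netperf traffic running over the other interfaces the whole time.)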
>
> The crash is almost always in the bge driver's tx routine, on a
> page-aligned kernel virtual address, something like this:
>
> sched:
> #pf Page fault
> Bad kernel fault at addr=0xffffff01db2a5000
> pid=0, pc=0xfffffffffb83bfaa, sp=0xffffff0007d7e5d8, eflags=0x10212
> cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
> cr2: ffffff01db2a5000
> cr3: 3400000
> cr8: c
>
> rdi: ffffff01db2a4ffa rsi: ffffff01eaa0b918 rdx: 40
> rcx: 2 r8: 40 r9: 0
> rax: 0 rbx: ffffff01eda1eec0 rbp: ffffff0007d7e640
> r10: ffffff01f0316900 r11: ffffff01cb57a040 r12: ffffff01db2a4fca
> r13: ffffff01cfe0fad0 r14: ffffff01d03c4a20 r15: 40
> fsb: 0 gsb: ffffff01ceabeac0 ds: 4b
> es: 4b fs: 0 gs: 1c3
> trp: e err: 2 rip: fffffffffb83bfaa
> cs: 30 rfl: 10212 rsp: ffffff0007d7e5d8
> ss: 38
>
> ffffff0007d7e3c0 unix:die+ea ()
> ffffff0007d7e4d0 unix:trap+13b9 ()
> ffffff0007d7e4e0 unix:cmntrap+e9 ()
> ffffff0007d7e640 unix:bcopy+a ()
> ffffff0007d7e680 bge:bge_m_tx+60 ()
> ffffff0007d7e6a0 dls:dls_tx+1d ()
> ffffff0007d7e6d0 dld:dld_tx_single+2a ()
> ffffff0007d7e700 dld:str_mdata_fastpath_put+7f ()
> ffffff0007d7e7f0 ip:tcp_lsosend_data+581 ()
> <....>
>
> If I disassemble the pc, it looks like this faulting address
> is the *source* of a bcopy:
>
> > 0xfffffffffb83bfaa::dis
> bcopy: xchgq %rdi,%rsi
> bcopy+3: movq %rdx,%rcx
> bcopy+6: shrq $0x3,%rcx
> bcopy+0xa: repz movsq (%rsi),(%rdi)
>
> I don't have the dump handy, but I also saw one case where there
> was a panic in a copyin. So maybe I held a bogus reference to an
> mblk or dblk, and freed it out from under the system?
>
>
> At the time this happens, my driver will usually be in the middle of
> tearing down its transmit ring:
>
> > ::pgrep ifconfig |::walk thread |::findstack
> stack pointer for thread ffffff01cfa87040: ffffff000877b170
> ffffff000877b200 page_ctr_add_internal+0x5c()
> ffffff000877b250 do_interrupt+0xdb()
> ffffff000877b260 _interrupt+0xba()
> ffffff000877b390 mutex_exit+0xc()
> ffffff000877b3f0 kmem_cache_free+0xa7()
> ffffff000877b430 rootnex_dma_freehdl+0x3d()
> ffffff000877b460 ddi_dma_freehdl+0x29()
> ffffff000877b480 ddi_dma_free_handle+0x1b()
> ffffff000877b4c0 myri10ge_unprepare_tx_ring+0x61()
> ffffff000877b4f0 myri10ge_teardown_slice+0x3f()
> ffffff000877b530 myri10ge_stop_locked+0x6c()
> ffffff000877b550 myri10ge_m_stop+0x6b()
> ffffff000877b580 mac_stop+0x47()
> ffffff000877b5d0 dls_close+0x17a()
> <....>
>
> I'm not good enough yet with mdb to be able to figure out what
> happened..
>
Is this on SPARC or x86 hardware?
It *sounds* sort of like it might be a DMA corruption problem.
Make sure that when you do m_stop, you've really shut down your hardware
including any DMA transfers *before* you yank the DMA mappings out from
underneath it. (I can imagine in particular a DMA region getting
reused, and if your device is still accessing that region, then problems
could ensue.)
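
Roughly the ordering I mean, as a sketch; the xx_* names and state
structure here are made up, not your driver's actual code:

    #include <sys/types.h>
    #include <sys/ksynch.h>
    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    typedef struct xx_state {
            kmutex_t                xx_lock;
            boolean_t               xx_running;
            ddi_dma_handle_t        xx_tx_dma_handle;
            ddi_acc_handle_t        xx_tx_acc_handle;
    } xx_state_t;

    extern void xx_disable_tx_dma(xx_state_t *); /* stop the NIC's fetch engine */
    extern void xx_wait_hw_idle(xx_state_t *);   /* poll until it is really idle */

    static void
    xx_m_stop(void *arg)
    {
            xx_state_t *xxp = arg;

            mutex_enter(&xxp->xx_lock);
            xx_disable_tx_dma(xxp);
            xx_wait_hw_idle(xxp);
            xxp->xx_running = B_FALSE;      /* xx_m_tx() must now refuse work */
            mutex_exit(&xxp->xx_lock);

            /* Only after the hardware is quiescent is this safe. */
            (void) ddi_dma_unbind_handle(xxp->xx_tx_dma_handle);
            ddi_dma_mem_free(&xxp->xx_tx_acc_handle);
            ddi_dma_free_handle(&xxp->xx_tx_dma_handle);
    }

If the handles are freed while the chip can still DMA, kmem recycles
those pages, and you get exactly this kind of random corruption in
unrelated code. In the dump, ffffff01db2a5000::whatis may tell you
what that page belongs to now.
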
-- Garrett
> Drew