Paul Durrant wrote:
> Andrew Gallatin wrote:
>>
>> The plumb/unplumb tests in NICDRV are killing me.. I've got a strange
>> memory corruption bug I'm still tracking that I may end up asking for
>> advice on..
>>
>
> My only other problem with them was that they seem to expect the
> netperf/netserver processes to always succeed, which in the face of an
> unplumb isn't necessarily the case. I had no other driver problems.
That was the problem I started looking at, and I assumed that was the
only issue. Then I noticed bad checksums on the receiver, and started
looking harder. Maybe too hard :(
> If you're getting corruption in your driver then you may shed some light
> on it by setting kmem_flags to 0xf in your /etc/system file and
> rebooting before running your test.
The issue now is that I'm seeing somewhat random corruption
in *other* drivers if I plumb/unplumb a few hundred times in a tight
loop under load. I've tried setting kmem_flags to 0xf, but that doesn't
seem to affect the problem at all.
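For reference, a minimal sketch of what the 0xf value is asking the kernel memory allocator to do. The bit names and values below are as I recall them from sys/kmem_impl.h, so treat them as an assumption and check the header on the build in question; decode_kmem_flags is just a hypothetical helper name.

```python
# Low four kmem_flags bits, as recalled from sys/kmem_impl.h (assumption:
# verify against the header for the running build).
KMF_BITS = {
    0x1: "KMF_AUDIT",     # record allocation/free transactions
    0x2: "KMF_DEADBEEF",  # fill freed buffers with a 0xdeadbeef pattern
    0x4: "KMF_REDZONE",   # guard word after each buffer to catch overruns
    0x8: "KMF_CONTENTS",  # log contents of freed buffers
}

def decode_kmem_flags(flags):
    """Return the names of the known debug bits set in flags, lowest bit first."""
    return [name for bit, name in sorted(KMF_BITS.items()) if flags & bit]

enabled = decode_kmem_flags(0xF)
print(enabled)
```

With 0xf, all four checks would be enabled, so a use-after-free through a kmem cache buffer would normally be expected to trip one of them.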
The crash is almost always in the bge driver's tx routine, on a
page-aligned kernel virtual address, something like this:
sched:
#pf Page fault
Bad kernel fault at addr=0xffffff01db2a5000
pid=0, pc=0xfffffffffb83bfaa, sp=0xffffff0007d7e5d8, eflags=0x10212
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: ffffff01db2a5000
cr3: 3400000
cr8: c
rdi: ffffff01db2a4ffa rsi: ffffff01eaa0b918 rdx: 40
rcx: 2 r8: 40 r9: 0
rax: 0 rbx: ffffff01eda1eec0 rbp: ffffff0007d7e640
r10: ffffff01f0316900 r11: ffffff01cb57a040 r12: ffffff01db2a4fca
r13: ffffff01cfe0fad0 r14: ffffff01d03c4a20 r15: 40
fsb: 0 gsb: ffffff01ceabeac0 ds: 4b
es: 4b fs: 0 gs: 1c3
trp: e err: 2 rip: fffffffffb83bfaa
cs: 30 rfl: 10212 rsp: ffffff0007d7e5d8
ss: 38
ffffff0007d7e3c0 unix:die+ea ()
ffffff0007d7e4d0 unix:trap+13b9 ()
ffffff0007d7e4e0 unix:cmntrap+e9 ()
ffffff0007d7e640 unix:bcopy+a ()
ffffff0007d7e680 bge:bge_m_tx+60 ()
ffffff0007d7e6a0 dls:dls_tx+1d ()
ffffff0007d7e6d0 dld:dld_tx_single+2a ()
ffffff0007d7e700 dld:str_mdata_fastpath_put+7f ()
ffffff0007d7e7f0 ip:tcp_lsosend_data+581 ()
<....>
If I disassemble at the pc, it looks like this faulting address
is the *source* of a bcopy:
> 0xfffffffffb83bfaa::dis
bcopy: xchgq %rdi,%rsi
bcopy+3: movq %rdx,%rcx
bcopy+6: shrq $0x3,%rcx
bcopy+0xa: repz movsq (%rsi),(%rdi)
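A hedged cross-check on the source-versus-destination question, using only register values copied from the panic message above. It assumes the standard x86 page-fault error code layout (bit 0 = page present, bit 1 = write, bit 2 = user mode) and that the trap frame holds the registers as they were at the faulting movsq; the helper names are made up for illustration.

```python
# Values copied from the panic message.
ERR = 0x2
CR2 = 0xffffff01db2a5000
RDI = 0xffffff01db2a4ffa   # after the xchgq, %rdi is the movsq destination
RSI = 0xffffff01eaa0b918   # after the xchgq, %rsi is the movsq source
RCX = 0x2                  # quadwords still to copy
RDX = 0x40                 # byte count passed to bcopy
R12 = 0xffffff01db2a4fca

def decode_pf_err(err):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(err & 0x1),
        "write": bool(err & 0x2),
        "user_mode": bool(err & 0x4),
    }

def bytes_copied(rdx, rcx):
    """Bytes already moved by 'repz movsq', given the count and remaining %rcx."""
    return ((rdx >> 3) - rcx) * 8

info = decode_pf_err(ERR)
done = bytes_copied(RDX, RCX)
dst_start = RDI - done
src_start = RSI - done

print(info)
print("bytes copied before fault:", hex(done))
print("destination started at:", hex(dst_start), "matches r12:", dst_start == R12)
print("source started at:", hex(src_start))
print("next 8-byte store through rdi touches cr2:", RDI <= CR2 < RDI + 8)
```

If that reading is right, the fault would be a kernel-mode write to a not-present page through %rdi, which after the xchgq is the bcopy destination rather than the source, and the copy would have started at the address still held in r12. Worth confirming against the dump before drawing conclusions from it.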
I don't have the dump handy, but I also saw one case where there
was a panic in a copyin. So maybe I held a bogus reference to an
mblk or dblk, and freed it out from under the system?
At the time it happens, my driver is usually in the middle of
tearing down its transmit ring:
> ::pgrep ifconfig |::walk thread |::findstack
stack pointer for thread ffffff01cfa87040: ffffff000877b170
ffffff000877b200 page_ctr_add_internal+0x5c()
ffffff000877b250 do_interrupt+0xdb()
ffffff000877b260 _interrupt+0xba()
ffffff000877b390 mutex_exit+0xc()
ffffff000877b3f0 kmem_cache_free+0xa7()
ffffff000877b430 rootnex_dma_freehdl+0x3d()
ffffff000877b460 ddi_dma_freehdl+0x29()
ffffff000877b480 ddi_dma_free_handle+0x1b()
ffffff000877b4c0 myri10ge_unprepare_tx_ring+0x61()
ffffff000877b4f0 myri10ge_teardown_slice+0x3f()
ffffff000877b530 myri10ge_stop_locked+0x6c()
ffffff000877b550 myri10ge_m_stop+0x6b()
ffffff000877b580 mac_stop+0x47()
ffffff000877b5d0 dls_close+0x17a()
<....>
I'm not yet good enough with mdb to figure out what
happened.
Drew
_______________________________________________
networking-discuss mailing list