Hi Drew,

The ire entry is not maintained correctly during the plumb/unplumb
process. As a result, the packet destined for your driver was sent out
through the bge interface. bge does not check the packet length, and it
tries to copy the large LSO packet into its own buffer. 6586787 has been
filed for this.
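
To illustrate the kind of length check that is missing, here is a rough
user-space sketch. It is not the bge code and not the fix for 6586787;
struct msg_frag, msg_total_len() and msg_copy_checked() are hypothetical
names standing in for a message chain and a driver's bounded copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a chained message block; not the real mblk_t. */
struct msg_frag {
	const unsigned char *rptr;	/* start of valid data */
	const unsigned char *wptr;	/* one past the end of valid data */
	struct msg_frag *cont;		/* next fragment, or NULL */
};

/* Sum the payload bytes across the whole chain. */
static size_t
msg_total_len(const struct msg_frag *mp)
{
	size_t len = 0;

	for (; mp != NULL; mp = mp->cont)
		len += (size_t)(mp->wptr - mp->rptr);
	return (len);
}

/*
 * Copy the chain into a fixed-size buffer.
 * Returns the number of bytes copied, or -1 if the chain does not fit,
 * in which case nothing is written to buf.
 */
static long
msg_copy_checked(const struct msg_frag *mp, unsigned char *buf, size_t bufsize)
{
	size_t total = msg_total_len(mp);
	size_t off = 0;

	assert(buf != NULL);

	/* Reject oversized packets before touching the buffer. */
	if (total > bufsize)
		return (-1);

	for (; mp != NULL; mp = mp->cont) {
		size_t n = (size_t)(mp->wptr - mp->rptr);

		memcpy(buf + off, mp->rptr, n);
		off += n;
	}
	return ((long)off);
}
```

Without the size comparison, a chain larger than the buffer would run
past its end, which would be consistent with a fault on a page boundary
inside bcopy.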

Thanks,
Lucy

Andrew Gallatin wrote:
> Paul Durrant wrote:
>> Andrew Gallatin wrote:
>>> The plumb/unplumb tests in NICDRV are killing me..  I've got a strange
>>> memory corruption bug I'm still tracking that I may end up asking for
>>> advice on..
>>>
>> My only other problem with them was that they seem to expect the 
>> netperf/netserver processes to always succeed; which in the face of an 
>> unplumb isn't necessarily the case. I had no other driver problems.
> 
> That was the problem I started looking at, and I assumed that was the
> only issue.  Then I noticed bad checksums on the receiver, and started
> looking harder.  Maybe too hard :(
> 
>> If you're getting corruption in your driver then you may shed some light 
>> on it by setting kmem_flags to 0xf in your /etc/system file and 
>> rebooting before running your test.
> 
> The issue I'm seeing now is that I'm seeing somewhat random corruption
> in *other* drivers if I plumb/unplumb a few hundred times in a tight
> loop under load.   I've tried setting kmem_flags, but that doesn't
> seem to affect the problem at all.
> 
> The crash is almost always in the bge driver's tx routine on a
> page-aligned kernel virtual address, something like this:
> 
> sched:
> #pf Page fault
> Bad kernel fault at addr=0xffffff01db2a5000
> pid=0, pc=0xfffffffffb83bfaa, sp=0xffffff0007d7e5d8, eflags=0x10212
> cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
> cr2: ffffff01db2a5000
> cr3: 3400000
> cr8: c
> 
>          rdi: ffffff01db2a4ffa rsi: ffffff01eaa0b918 rdx:               40
>          rcx:                2  r8:               40  r9:                0
>          rax:                0 rbx: ffffff01eda1eec0 rbp: ffffff0007d7e640
>          r10: ffffff01f0316900 r11: ffffff01cb57a040 r12: ffffff01db2a4fca
>          r13: ffffff01cfe0fad0 r14: ffffff01d03c4a20 r15:               40
>          fsb:                0 gsb: ffffff01ceabeac0  ds:               4b
>           es:               4b  fs:                0  gs:              1c3
>          trp:                e err:                2 rip: fffffffffb83bfaa
>           cs:               30 rfl:            10212 rsp: ffffff0007d7e5d8
>           ss:               38
> 
> ffffff0007d7e3c0 unix:die+ea ()
> ffffff0007d7e4d0 unix:trap+13b9 ()
> ffffff0007d7e4e0 unix:cmntrap+e9 ()
> ffffff0007d7e640 unix:bcopy+a ()
> ffffff0007d7e680 bge:bge_m_tx+60 ()
> ffffff0007d7e6a0 dls:dls_tx+1d ()
> ffffff0007d7e6d0 dld:dld_tx_single+2a ()
> ffffff0007d7e700 dld:str_mdata_fastpath_put+7f ()
> ffffff0007d7e7f0 ip:tcp_lsosend_data+581 ()
> <....>
> 
> If I disassemble the pc, it looks like this faulting address
> is the *source* of a bcopy:
> 
>  > 0xfffffffffb83bfaa::dis
> bcopy:                          xchgq  %rdi,%rsi
> bcopy+3:                        movq   %rdx,%rcx
> bcopy+6:                        shrq   $0x3,%rcx
> bcopy+0xa:                      repz movsq (%rsi),(%rdi)
> 
> I don't have the dump handy, but I also saw one case where there
> was a panic in a copyin. So maybe I held a bogus reference to an
> mblk or dblk, and freed it out from under the system?
> 
> 
> At the time it happens, my driver will usually be in the middle of
> tearing down its transmit ring:
> 
>  >  ::pgrep ifconfig |::walk thread |::findstack
> stack pointer for thread ffffff01cfa87040: ffffff000877b170
>    ffffff000877b200 page_ctr_add_internal+0x5c()
>    ffffff000877b250 do_interrupt+0xdb()
>    ffffff000877b260 _interrupt+0xba()
>    ffffff000877b390 mutex_exit+0xc()
>    ffffff000877b3f0 kmem_cache_free+0xa7()
>    ffffff000877b430 rootnex_dma_freehdl+0x3d()
>    ffffff000877b460 ddi_dma_freehdl+0x29()
>    ffffff000877b480 ddi_dma_free_handle+0x1b()
>    ffffff000877b4c0 myri10ge_unprepare_tx_ring+0x61()
>    ffffff000877b4f0 myri10ge_teardown_slice+0x3f()
>    ffffff000877b530 myri10ge_stop_locked+0x6c()
>    ffffff000877b550 myri10ge_m_stop+0x6b()
>    ffffff000877b580 mac_stop+0x47()
>    ffffff000877b5d0 dls_close+0x17a()
> <....>
> 
> I'm not good enough yet with mdb to be able to figure out what
> happened..
> 
> Drew
> _______________________________________________
> networking-discuss mailing list
> [email protected]
> 

_______________________________________________
networking-discuss mailing list
[email protected]
