Bosko Milekic writes:

> > I'm a bit worried about other devices.  Traditionally, mbufs have
> > never crossed page boundaries, so most drivers never bother to check
> > for a transmit mbuf crossing a page boundary.  Using physically
> > discontiguous mbufs could lead to a lot of subtle data corruption.
>
> I assume here that when you say "mbuf" you mean "jumbo buffer attached
> to an mbuf."

Yes.

> In that case, yeah, all that we need to make sure of is
> that the driver knows that it's dealing with non-physically-contiguous
> pages.  As for regular 2K mbuf clusters, as well as the 256 byte mbufs
> themselves, they never cross page boundaries, so this should not be a
> problem for those drivers that do not use jumbo clusters.

But it would be problematic if we used the 10K jumbo cluster in
so_send() like I'd like to.  Ah, the pains of legacy code... :-(

I'm also having a little trouble convincing myself that page-crossing
jumbo clusters would be safe in all scenarios.  I suppose if you were
to make the transmit logic in all drivers which support jumbo frames
clueful, then using them for receives would be safe.
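
For concreteness, here is a minimal sketch of the kind of segmentation a
"clueful" transmit path would need, assuming FreeBSD-style vtophys(),
PAGE_SIZE and PAGE_MASK; the segment structure and helper name are made
up for illustration, and a real driver would fill in its own DMA
descriptor format instead:

    /*
     * Hypothetical helper: split one virtually contiguous buffer into
     * physically contiguous DMA segments, cutting at every page
     * boundary.  Assumes roughly <sys/param.h>, <vm/vm.h> and
     * <vm/pmap.h> for PAGE_SIZE, PAGE_MASK and vtophys().
     */
    struct dma_seg {
            vm_offset_t ds_paddr;   /* physical start of segment */
            int         ds_len;     /* bytes in this segment */
    };

    static int
    buf_to_segs(caddr_t va, int len, struct dma_seg *segs, int maxsegs)
    {
            int nsegs = 0;

            while (len > 0) {
                    /* Bytes left before the next page boundary. */
                    int chunk = PAGE_SIZE - ((vm_offset_t)va & PAGE_MASK);

                    if (chunk > len)
                            chunk = len;
                    if (nsegs == maxsegs)
                            return (-1);    /* too many segments */
                    segs[nsegs].ds_paddr = vtophys(va);
                    segs[nsegs].ds_len = chunk;
                    nsegs++;
                    va += chunk;
                    len -= chunk;
            }
            return (nsegs);
    }

Hardware that can only take a couple of segments per packet would have
to fall back to copying into a physically contiguous bounce buffer
instead.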

> > One question.  I've observed some really anomalous behaviour under
> > -stable with my Myricom GM driver (2Gb/s + 2Gb/s link speed, dual
> > 1GHz PIII).  When I use 4K mbufs for receives, the best speed I see
> > is about 1300 Mb/sec.  However, if I use private 9K physically
> > contiguous buffers I see 1850 Mb/sec (iperf TCP).
> >
> > The obvious conclusion is that there's a lot of overhead in setting
> > up the DMA engines, but that's not the case; we have a fairly quick
> > chain DMA engine.  I've provided a "control" by breaking my
> > contiguous buffers down into 4K chunks so that I do the same number
> > of DMAs in both cases, and I still see ~1850 Mb/sec for the 9K
> > buffers.
> >
> > A coworker suggested that the problem was that when doing copyouts
> > to userspace, the PIII was doing speculative reads and loading the
> > cache with the next page.  However, we then start copying from a
> > totally different address using discontiguous buffers, so we
> > effectively take 2x the number of cache misses we'd need to.  Does
> > that sound reasonable to you?  I need to try malloc'ing virtually
> > contiguous and physically discontiguous buffers & see if I get the
> > same (good) performance...
>
> I believe that the Intel chips do "virtual page caching" and that the
> logic that does the virtual -> physical address translation sits
> between the L2 cache and RAM.  If that is indeed the case, then your
> idea of testing with virtually contiguous pages is a good one.
> Unfortunately, I don't know if the PIII is doing speculative
> cache-loads, but it could very well be the case.  If it is, and if in
> fact the chip does caching based on virtual addresses, then providing
> it with virtually contiguous address space may yield better results.
> If you try this, please let me know.  I'm extremely interested in
> seeing the results!

Thanks for your input.  I'll post the results when I get to it.  I'm
working on an AIX driver right now and I need to finish that before I
have any playtime.  (AIX is utterly bizarre: pageable kernel,
misleading docs, etc., etc.)

Thanks again,

Drew
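
For reference, a minimal sketch of the allocation side of that test,
assuming FreeBSD 4.x-era malloc(9) and contigmalloc(9); the
contigmalloc() argument list shown here is approximate, and the
variable and function names are invented for illustration:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <sys/errno.h>

    #define JUMBO_LEN   (9 * 1024)

    static void *jumbo_virt;  /* virtually contiguous, pages may be scattered */
    static void *jumbo_phys;  /* physically contiguous */

    static int
    alloc_test_buffers(void)
    {
            /* malloc(9) guarantees kernel-virtual contiguity only. */
            jumbo_virt = malloc(JUMBO_LEN, M_DEVBUF, M_NOWAIT);

            /* contigmalloc(9) additionally forces physical contiguity. */
            jumbo_phys = contigmalloc(JUMBO_LEN, M_DEVBUF, M_NOWAIT,
                0ul, ~0ul, PAGE_SIZE, 0ul);

            if (jumbo_virt == NULL || jumbo_phys == NULL)
                    return (ENOMEM);
            return (0);
    }

Pointing the receive path at one buffer or the other and re-running the
same iperf test would then separate "virtually contiguous" from
"physically contiguous" in the copyout numbers.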