From: Olof Johansson <[EMAIL PROTECTED]>
Date: Thu, 20 Apr 2006 22:04:26 -0500

> On Thu, Apr 20, 2006 at 05:27:42PM -0700, David S. Miller wrote:
> > Besides the control overhead of the DMA engines, the biggest thing
> > lost in my opinion is the perfect cache warming that a cpu based copy
> > does from the kernel socket buffer into userspace.
> 
> It's definitely the easiest way to always make sure the right caches
> are warm for the app, that I agree with.
> 
> But, when warming those caches by copying, the data is pulled in through
> a potentially cold cache in the first place. So the cache misses are
> just moved from the copy loop to userspace with dma offload. Or am I
> missing something?

Yes, and it means that the memory bandwidth costs are equivalent
between I/O AT and cpu copy.

In the cpu copy case you eat the read cache miss, but on the write
side you prewarm the cache properly.  In the I/O AT case you eat the
same read cost, but the cache will not be prewarmed, so you'll eat
the read cache miss in the application instead.  It's moving the
exact same cost from one place to another.
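
To make that concrete, here is a rough userspace sketch of the two
cases (the buffers and the no-op "engine" are made-up stand-ins for
illustration, not the real receive path):

#include <string.h>
#include <stdio.h>

#define BUFSZ (64 * 1024)

static char skb_data[BUFSZ]; /* stands in for the socket buffer     */
static char user_buf[BUFSZ]; /* stands in for the recvmsg() buffer  */

/* cpu copy: the copy loop eats the read misses on skb_data, but
 * its stores leave user_buf hot for the app's reads below */
static void cpu_copy_case(void)
{
        memcpy(user_buf, skb_data, BUFSZ);
}

/* offload: pretend a DMA engine moved the bytes while the cpu sat
 * idle -- user_buf is valid in memory but cold in the cpu cache
 * (here nothing actually writes it, which is enough for the sketch) */
static void dma_offload_case(void)
{
        /* no cpu loads or stores touch user_buf here */
}

/* what the app does after recvmsg() returns: in the cpu copy case
 * these loads mostly hit, in the offload case they all miss */
static unsigned long app_touches_data(void)
{
        unsigned long sum = 0;
        size_t i;

        for (i = 0; i < BUFSZ; i++)
                sum += (unsigned char)user_buf[i];
        return sum;
}

int main(void)
{
        cpu_copy_case();        /* or dma_offload_case() */
        printf("%lu\n", app_touches_data());
        return 0;
}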

Getting the app to make forward progress (meaning returned from the
recvmsg() system call and back in userspace) must, by definition,
take at least as long with I/O AT as it does with cpu copies.  Yet in
the I/O AT case, the application must wait that long and then also
eat the delays of the cache misses when it tries to read the data the
I/O AT engine copied.  Instead of paying the cache miss cost in the
kernel, we pay it in the app, because in the I/O AT case the cpu
won't have the user data fresh and loaded into the cpu cache.
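
Spelled out as a rough cost breakdown (the individual terms are my
labels for the argument above, not measured numbers):

  cpu copy:  t(recvmsg) ~= t(syscall) + t(copy)
             and the app's reads afterwards mostly hit the cache

  I/O AT:    t(recvmsg) ~= t(syscall) + t(pin pages)
                           + t(program engine) + t(copy)
                           + t(completion + wakeup)
             and the app's reads afterwards miss, so add the
             cache miss cost on top of that

t(copy) is identical on both lines; everything else unique to the
I/O AT line is pure addition.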

And I say I/O AT must take "at least as long" as cpu copies because
the same memory copy cost is there, and on top of that I/O AT has to
program the DMA controller and touch a _lot_ of other state to get
things going, and then wake the task up.  We're talking about
non-trivial overheads like grabbing the page mappings out of the page
tables using get_user_pages().  Evgeniy has posted some very nice
performance graphs showing how poorly that function scales.
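
To give a feel for what "programming the DMA controller" means per
call, a hedged kernel-style sketch -- ioat_prep_copy() and
ioat_submit() are hypothetical stand-ins, only get_user_pages() and
the mmap_sem locking are real interfaces (in their 2.6-era form):

/* hypothetical per-recvmsg() setup for an offloaded copy -- every
 * line below is work the plain cpu copy never does */
static int offload_setup(struct dma_chan *chan, struct sk_buff *skb,
                         void __user *ubuf, int nr_pages)
{
        struct page *pages[MAX_SKB_FRAGS];
        int npages, i;

        /* pin the user buffer: walk the page tables for every
         * page, under mmap_sem -- this is the get_user_pages()
         * cost whose poor scaling Evgeniy's graphs showed */
        down_read(&current->mm->mmap_sem);
        npages = get_user_pages(current, current->mm,
                                (unsigned long)ubuf, nr_pages,
                                1 /* write */, 0, pages, NULL);
        up_read(&current->mm->mmap_sem);
        if (npages <= 0)
                return -EFAULT;

        /* build and post one descriptor per page: MMIO writes and
         * channel state the cpu copy simply does not have */
        for (i = 0; i < npages; i++)
                ioat_prep_copy(chan, pages[i], skb, i); /* hypothetical */
        ioat_submit(chan);                              /* hypothetical */

        /* and we still have to sleep until the engine completes,
         * then pay the wakeup before the app can run again */
        return 0;
}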

This is basically why none of the performance gains add up to me.  I
am thus very concerned that the current non-cache-warming
implementation may fall flat performance-wise.