On 4/20/06, David S. Miller <[EMAIL PROTECTED]> wrote:
> Yes, and it means that the memory bandwidth costs are equivalent
> between I/O AT and cpu copy.

The following is a response from the I/OAT architects.  I only point
out that this is not coming directly from me because I have not seen
the data to verify the claims regarding the speed of a copy vs a load
and the cost of the rep mov instruction.  I'll encourage more direct
participation in this discussion from the architects moving forward.

    - Chris

Let's talk about the caching benefits that are seemingly lost when
using the DMA engine. The intent of the DMA engine is to save CPU
cycles spent in copying data (rep mov). In cases where the destination
is already warm in the cache (due to destination buffer re-use) and the
source is in memory, the cycles spent in a host copy are not due just
to the cache misses it encounters in the process of bringing in the
source, but also to the execution of rep mov itself within the host
core. If you contrast this with simply touching (loading) the data
residing in memory, the cost of that load is primarily the cost of the
cache misses and not so much CPU execution time; a rough user-space
sketch contrasting the two follows the numbered points below. Given
this, the following points are noteworthy:

1. While the DMA engine forces the destination to be in memory, so
that touching it afterwards may cause the same number of observable
cache misses as a host copy with a cache-warm destination, the cost of
the host copy (in terms of CPU cycles) is much higher than the cost of
the touch.

2. CPU hardware prefetchers do a pretty good job of staying ahead of
the fetch stream to minimize cache misses. So for loads of medium to
large buffers, cache misses form a much smaller component of the data
fetch time; most of it is dominated by front side bus (FSB) or memory
bandwidth. For small buffers we do not use the DMA engine, but if we
had to, we would insert SW prefetches that do reasonably well (a
sketch appears at the end of this note).

3. If the destination wasn't already warm in the cache, i.e. it was in
memory or in another CPU's cache, the host copy will have to snoop and
bring the destination in, and will encounter additional misses on the
destination buffer as well. These misses are the same as those
encountered in #1 above when using the DMA engine and touching the
data afterwards, so in effect it becomes a wash compared to the DMA
engine's behavior. The case where the destination is already warm in
the cache is common in benchmarks such as iperf, ttcp, etc., where the
same buffer is reused over and over again. Real applications typically
will not exhibit this aggressive buffer re-use behavior.

4. It may take a large number of packets (and several interrupts) to
satisfy a large posted buffer (say 64KB). Even if you use a host copy
to warm the cache with the destination, there is no guarantee that
some or all of the destination will stay in the cache until the
application has a chance to read the data.

5. The source data payload (skb->data) is typically needed only once,
for the copy, and has no use later. The host copy brings it into the
cache, potentially polluting the cache and consuming FSB bandwidth,
whereas the DMA engine avoids this altogether.
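
As promised above, a rough user-space sketch contrasting a full copy
with merely touching the data (illustrative only; the 64MB buffer size
and the 64-byte touch stride are arbitrary choices, and a careful
measurement would pin the CPU and flush the caches between the timed
phases):

/*
 * copy_vs_touch.c - illustrative sketch, not a rigorous benchmark.
 * Contrast a full host copy (memcpy, i.e. rep mov on x86) with merely
 * loading ("touching") the same amount of source data.  The copy pays
 * for loads, stores and execution of the move itself; the touch pays
 * mostly for the loads.  Buffers are sized well past the last-level
 * cache so the timed loops mostly see cold data.
 *
 * Build: gcc -O2 -o copy_vs_touch copy_vs_touch.c -lrt
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (64UL * 1024 * 1024)

static double ms_since(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1e3 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

int main(void)
{
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    unsigned long sum = 0;
    struct timespec t0, t1, t2;
    size_t i;

    if (!src || !dst)
        return 1;
    memset(src, 0xAA, BUF_SIZE);        /* fault the pages in */
    memset(dst, 0x55, BUF_SIZE);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, BUF_SIZE);         /* host copy: loads + stores + rep mov */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (i = 0; i < BUF_SIZE; i += 64)  /* touch: one load per 64B cache line */
        sum += (unsigned char)src[i];
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("copy  %.2f ms\ntouch %.2f ms (checksum %lu)\n",
           ms_since(&t0, &t1), ms_since(&t1, &t2), sum);
    free(src);
    free(dst);
    return 0;
}

This only illustrates the cost structure argued above; it does not
model a cache-warm destination.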

The IxChariot data posted earlier, which touches the data and yet
shows an I/OAT benefit, is due to some of the reasons above. The
bottom line is that I agree with the cache-benefit argument for the
host copy with small buffers (64B to 512B), but for larger buffers and
certain application scenarios (destination in memory), the DMA engine
will show better performance regardless of where the destination
buffer resided to begin with and where it is accessed from.
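
Finally, a minimal sketch of the software-prefetch idea from point 2
(illustrative only; copy_with_prefetch, PF_DIST, and the byte-wise
loop are assumptions for the example, not actual driver code), using
GCC's __builtin_prefetch:

/*
 * prefetch_copy.c - illustrative sketch of a copy that software-
 * prefetches the source a few cache lines ahead of the consuming loop.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE  64
#define PF_DIST     (4 * CACHE_LINE)    /* stay a few lines ahead */

static void copy_with_prefetch(char *dst, const char *src, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        /* Issue one prefetch per cache line, PF_DIST bytes ahead. */
        if ((i % CACHE_LINE) == 0 && i + PF_DIST < len)
            __builtin_prefetch(src + i + PF_DIST, 0 /* read */, 3);
        dst[i] = src[i];
    }
}

int main(void)
{
    size_t len = 2048;                  /* a "small" buffer for the example */
    char *src = malloc(len);
    char *dst = malloc(len);

    if (!src || !dst)
        return 1;
    memset(src, 0x5A, len);
    copy_with_prefetch(dst, src, len);
    printf("copied %zu bytes, dst[0] = 0x%02x\n", len, (unsigned char)dst[0]);
    free(src);
    free(dst);
    return 0;
}

The locality hint (third argument) and PF_DIST would need tuning for a
given core and buffer size.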