On Fri, 2005-12-02 at 11:04 -0700, Grant Grundler wrote:
> On Thu, Dec 01, 2005 at 09:32:37PM -0500, jamal wrote:
[..]
> 
> We've already been down this path before. How and where to prefetch
> is quite dependent on the CPU implementation and workload.
> 

[..]

> At the time you did this, I read the Intel docs on P3 and P4 cache
> behaviors. IIRC, the P4 HW prefetches very aggressively, i.e. the SW
> prefetching just becomes noise or burns extra CPU cycles. 

I think this may be it; i.e., in the case where things got worse, you
waste cycles executing useless prefetches.

[some good stuff deleted]
Thanks for the elucidation on x86, PA-RISC and IA-64.
Like I said, luckily I still have one of the old machines around. I
believe it was either a P3, an early P4, or maybe even a P2-class
machine.

> In the case of David Mosberger's patch, we *know* 100% of the time we
> are going to look at the first cacheline (packet header) after the
> DMA completes. IMHO, all RISC architectures should prefetch that
> once the DMA is unmapped.
> 

I think if the Intel changes were as small as David Mosberger's I
probably wouldn't have said anything.
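
For reference, the kind of change I would have no issue with looks
roughly like this (a sketch only, not David's actual patch; the ring
and skb names are made up):

	/* Prefetch the packet header as soon as the RX buffer is
	 * unmapped, since we know we will read it next.  prefetch()
	 * is the usual linux/prefetch.h helper; the descriptor
	 * layout here is illustrative only.
	 */
	skb = rx_ring[i].skb;
	pci_unmap_single(pdev, rx_ring[i].dma, len, PCI_DMA_FROMDEVICE);
	prefetch(skb->data);	/* header is read right after DMA */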

> BTW, my interest in all this is because NIC DMA causes the system chipset
> to take ownership of the cachelines away from the CPU. The CPU has to
> re-acquire a copy of the cacheline when looking at NIC TX/RX descriptor
> rings and the payload data buffers. This is a big deal for the small
> packet routing/forwarding that Robert Olsson and Jamal are using as a
> primary workload/metric.
> 
> [BTW2, if the NIC could say "I'm done with this cacheline, now give it
> to CPU X" where X is the target of MSI, I think we see some dramatic
> gains too in the control data handling too.]
> 

Interesting ...

> > The danger of prefetching is the close dependence on the "stride" (or
> > load-latency as David Mosberger would call it). 
> 
> stride != load-latency. "stride" is the increment used to prefetch the
> next cacheline when walking through memory.
> 

Yes, you are correct. My brain was telling me there is another term
for prefetch scheduling which still escapes my brain's reach at the
moment, so my fingers insisted it was "stride". Actually, "prefetch
scheduling", which I just used, may be the more generic term.

[description of load-latency deleted]
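
Just so we are using the same vocabulary: "stride" in your sense is
the increment in a loop like this (illustrative only; process_line()
is a made-up consumer):

	/* Walk a buffer one cacheline at a time, keeping the next
	 * line in flight while we work on the current one.
	 */
	const char *p;

	for (p = buf; p < buf + len; p += L1_CACHE_BYTES) {
		prefetch(p + L1_CACHE_BYTES);	/* next line in flight */
		process_line(p);		/* hypothetical consumer */
	}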

> > If somehow the amount of CPU cycles executed to where the data is really
> > used happens to be less than the amount of time it would take to fetch
> > the line into cache, then it is useless and maybe detrimental. 
> 
> That's not quite correct IMHO. The prefetching can get cachelines
> in-flight which will reduce the CPU stall (in the case the cacheline
> hasn't arrived before CPU asked for it). The prefetching just needs
> to reduce the CPU stall enough to cover the cost of issuing the
> prefetch to be a net win. This sounds simple but is not because
> of mispredicted SW prefetches, HW prefetching (depending on Arch),
> eviction of cachelines still in use, changes in i-cache footprint, etc.
> 

You seem to be saying that if software schedules a prefetch, then when
the CPU needs to load that location into cache it will "know" that a
prefetch has already been issued?
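
To put rough (invented) numbers on that: if a miss to memory costs
~150 cycles and the prefetch is issued ~100 cycles of independent work
before the load, the load stalls for only ~50 cycles instead of 150.
Since issuing the prefetch itself costs on the order of a cycle, that
is a clear net win; issued 150+ cycles ahead it hides the miss
completely, issued 5 cycles ahead it buys almost nothing.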

> > If you have a small cache (it is hard to imagine that I just bought a
> > cheap AMD-64 _today_ with only 64KB L1 cache and 512KB L2) then
> > prefetching too early may be detrimental because it evicts other useful
> > data which you are using. If you have a huge cache this may not happen,
> > depending on the workload, but it may result in no performance
> > improvement. 

BTW, I have been corrected: 512KB of cache is not really "that small"
(all the Intel machines I have have >= 1MB L2). So I overstated the
point I was trying to make above. Let me restate what I was trying to
say so it is not lost in the noise because I exaggerated what counts
as "small":

- prefetching has dependencies on workload, memory latencies, cache
sizes and CPU architecture. On the size of the cache in relation to
when you schedule the prefetch (see the sketch after this list):
a) if you issue the prefetch too early then, depending on cache size
(and workload/code), the line may be evicted again before you get to
it. The prefetch may also displace data that is still in use.
b) if you issue it too late, the line won't be there when you need it,
and the CPU will stall fetching it anyway.
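
In code, the two failure modes look something like this (distances and
helper names invented for illustration):

	/* (a) too early: by the time we get back to 'far' the line may
	 *     have been evicted again -- and on a small cache the
	 *     prefetch itself may have pushed out live data.
	 */
	prefetch(far);
	do_lots_of_other_work();	/* hypothetical, touches memory */
	use(far);

	/* (b) too late: only a couple of cycles between prefetch and
	 *     load, so we stall for (almost) the full latency anyway.
	 */
	prefetch(near);
	use(near);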

> > If on the other hand you prefetch at a distance less than
> > the stride you are still gonna stall in any case.
> 
> "the still gonna stall" case has to be evaluated for how long we stall
> and if the prefetching helped (or not), i.e. stalling on AMD64 local memory
> is not as bad as stalling on remote NUMA memory. And it depends on how far
> in advance we can prefetch.
> 

OK, so you seem to be saying again that for case (b) above there is no
harm in issuing the prefetch late, since the CPU won't issue a second
fetch for that address?

> > The repercussions are that a change in the driver that used to work may
> > result in degradation if it changes the "stride distance"; i.e it is
> > voodoo instead of a science since there is nothing in the compiler or at
> > runtime that will tell you "you are placing that prefetch too close".
> 
> Yup. We can tune for workload/load-latency of each architecture.
> I think tuning for all of them in one source code is the current problem.
> We have to come up with a way for the compiler to insert (or not)
> prefetching at different places for different architectures (and maybe
> even CPU model) in order for this to be acceptable/optimal.
> 

I was hoping to just be able to turn the prefetch off in the driver
when I know my hardware does not do well with it.
I suspect most newer hardware won't have observable issues, given that
memory latencies have improved over the last 2-3 years; but we run on
a lot of older hardware too.
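
Something as simple as a module parameter would do for me (a sketch
only; I don't know of a driver that actually has this today):

	/* Let the admin turn SW prefetching off on CPUs where it is
	 * pure overhead.
	 */
	static int use_prefetch = 1;
	module_param(use_prefetch, int, 0444);
	MODULE_PARM_DESC(use_prefetch, "issue SW prefetches of RX headers");

	...
	if (use_prefetch)
		prefetch(skb->data);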

cheers,
jamal
