On Thu, Dec 01, 2005 at 09:32:37PM -0500, jamal wrote:
> > I think until a counter case is shown, the prefetches should
> > remain on unconditionally.  Proof of detriment is the burden
> > of the accuser, especially since the Intel folks apparently
> > did a lot of testing :-)

We've already been down this path before. How and where to prefetch
is quite dependent on the CPU implementation and workload.

> So let me take the liberty of first cc-ing all the usual suspects i have
> talked to about this in the past when i attempted to use prefetching on
> e1000.

I'm just an innocent bystander. Really. :)

> Then i will try to make a case and eventually collect some newer
> evidence ;-> I will be happy actually if it turns out i am wrong since i
> would like to squeeze as much juice out of them processors.
> 
> Background:
> ----------
> 
> After mucking with prefetching for some time a while back and observing
> inconsistent behavior, I came across a patch from David Mosberger (BTW,
> not sure if his email address above is correct - maybe someone could
> forward him this discussion).

I've cc'd him: David Mosberger-Tang <[EMAIL PROTECTED]>

>  With this specific patch (attached), in
> all the Xeons i tested on (~2 years back) there was a consistent
> improvement of anywhere between 5-10 Kpps in the forwarding tests if i
> turned on only the AGGRESSIVE mode. Nothing useful when i used the other
> mode.
> On P3 class hardware prefetching without aggressive mode resulted in
> decreased throughput but also decreased CPU occupancy. No effect with
> aggressive mode. So the overall effect was it was not harmful to have
> aggressive mode on. But i am not sure i even trust that would have been
> the case in every piece of hardware i could find.

At the time you did this, I read the Intel docs on P3 and P4 cache
behaviors. IIRC, the P4 HW prefetches very aggressively, i.e. the SW
prefetching just becomes noise or burns extra CPU cycles. My guess
was that the x86 HW designers had exactly the opposite perspective from
the RISC designers: x86 SW won't change - i.e. you can't add SW-controlled
prefetching to improve perf - therefore the HW has to do it.

I don't know if that's still true or how aggressive the AMD64 CPUs
are at prefetching.

PA-RISC and IA64 processors can execute the SW prefetching "in parallel",
i.e. it essentially costs only backplane/memory-controller bandwidth.
There is a trade-off between being too "smart" about when to prefetch
and being too "stupid". I mean we shouldn't make complex calculations
to figure out whether we need to prefetch something unless we know
exactly what the workload is.

In the case of David Mosberger's patch, we *know* 100% of the time we
are going to look at the first cacheline (packet header) after the
DMA completes. IMHO, all RISC architectures should prefetch that
once the DMA is unmapped.

BTW, my interest in all this is because NIC DMA causes the system chipset
to take ownership of the cachelines away from the CPU. The CPU has to
re-acquire a copy of the cacheline when looking at NIC TX/RX descriptor
rings and the payload data buffers. This is a big deal for the small
packet routing/forwarding that Robert Olsen and Jamal are using as a
primary workload/metric.

[BTW2, if the NIC could say "I'm done with this cacheline, now give it
to CPU X", where X is the target of the MSI, I think we'd see some dramatic
gains in the control-data handling too.]

> Lennert and I also sweated over testing some simple kernel modules,
> with very inconsistent results depending on whether he used his laptop or i used mine.
> 
> I will test with a newer piece of hardware and one of the older ones i
> have (both Xeons) - perhaps this weekend.
> Robert may have some results perhaps on this driver, Robert?
> It would also be nice for the intel folks to post their full results
> somewhere.

Yeah, I'd like to see that too.

> Having said all the above:
> --------------------------
> 
> The danger of prefetching is the close dependence on the "stride" (or
> load-latency as David Mosberger would call it). 

stride != load-latency. "stride" is the increment used to prefetch the
next cacheline when walking through memory.

Load-latency is how long the CPU needs to acquire a current copy (or
ownership) of the cacheline. The SW prefetching is intended to hide
load latency.  Ideally the stride is equal to the IO cacheline size.
Of course, RX/TX descriptors aren't laid out by cacheline size
and I'm not advocating we want them to be. We might issue a prefetch
for the same cacheline multiple times and depending on architecture,
that might be another source of wasted CPU cycles.

BTW, "IO Cacheline size" is typically the same as the "outermost" cache
of the CPU - ie whatever the basic unit of cache coherency is for the
platform.  It is often bigger than the L1 cacheline size.

> If somehow the amount of CPU cycles executed to where the data is really
> used happens to be less than the amount of time it would take to fetch
> the line into cache, then it is useless and maybe detrimental. 

That's not quite correct IMHO. Prefetching can get cachelines
in flight, which reduces the CPU stall (in the case where the cacheline
hasn't arrived before the CPU asks for it). The prefetching just needs
to reduce the CPU stall by enough to cover the cost of issuing the
prefetch to be a net win. This sounds simple but is not, because
of mispredicted SW prefetches, HW prefetching (depending on arch),
eviction of cachelines still in use, changes in i-cache footprint, etc.

> If you have a small cache (it is hard to imagine that i just bought a
> cheap AMD-64 _today_ with only 64KB L1 cache and 512KB L2), then
> prefetching too early may be detrimental because it evicts other useful
> data which you are using. If you have a huge cache this may not happen,
> depending on the workload, but it may result in no performance
> improvement. If on the other hand you prefetch at a distance less than
> the stride you are still gonna stall in any case.

The "still gonna stall" case has to be evaluated for how long we stall
and whether the prefetching helped (or not). I.e. stalling on AMD64 local memory
is not as bad as stalling on remote NUMA memory. And it depends on how far
in advance we can prefetch.

> The repercussions are that a change in the driver that used to work may
> result in degradation if it changes the "stride distance"; i.e. it is
> voodoo instead of a science, since there is nothing in the compiler or at
> runtime that will tell you "you are placing that prefetch too close".

Yup. We can tune for the workload/load-latency of each architecture;
I think tuning for all of them in one source tree is the current problem.
We have to come up with a way for the compiler to insert (or not insert)
prefetching at different places for different architectures (and maybe
even CPU models) in order for this to be acceptable/optimal.

> 
> On the copybreak change
> -----------------------
> 
> Lets defer this since i already said a lot on prefetch ;->
> But someone like Robert with results could shed more light.
> 
> I rest my case.

sorry...I went on a bit too long too...

thanks,
grant

> 
> cheers,
> jamal
> 
> 

> 
> ===== drivers/net/e1000/e1000_main.c 1.134 vs edited =====
> --- 1.134/drivers/net/e1000/e1000_main.c        2004-09-12 16:52:48 -07:00
> +++ edited/drivers/net/e1000/e1000_main.c       2004-09-30 06:05:11 -07:00
> @@ -2278,12 +2278,30 @@
>         uint8_t last_byte;
>         unsigned int i;
>         boolean_t cleaned = FALSE;
> +#define AGGRESSIVE 1
>  
>         i = rx_ring->next_to_clean;
> +#if AGGRESSIVE
> +       prefetch(rx_ring->buffer_info[i].skb->data);
> +#endif
>         rx_desc = E1000_RX_DESC(*rx_ring, i);
>  
>         while(rx_desc->status & E1000_RXD_STAT_DD) {
>                 buffer_info = &rx_ring->buffer_info[i];
> +# if AGGRESSIVE
> +               {
> +                       struct e1000_rx_desc *next_rx;
> +                       unsigned int j = i + 1;
> +
> +                       if (j == rx_ring->count)
> +                               j = 0;
> +                       next_rx = E1000_RX_DESC(*rx_ring, j);
> +                       if (next_rx->status & E1000_RXD_STAT_DD)
> +                               prefetch(rx_ring->buffer_info[j].skb->data);
> +               }
> +# else
> +               prefetch(buffer_info->skb->data);
> +# endif
>  #ifdef CONFIG_E1000_NAPI
>                 if(*work_done >= work_to_do)
>                         break;
> 
> 
