Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-13 Thread Neil Horman
On Wed, Nov 13, 2013 at 01:32:50PM -, David Laight wrote:
> > > I'm not sure, whats the typical capacity for the branch predictors
> > > ability to remember code paths?
> ...
> > 
> > For such simple single-target branches it goes near or over a thousand for
> > recent Intel and AMD microarchitectures. Thousands for really recent CPUs.
> 
> IIRC the x86 can also correctly predict simple sequences - like a branch
> in a loop that is taken every other iteration, or only after a previous
> branch is taken.
> 
> Much simpler cpus may use a much simpler strategy.
> I think one I've used (a fpga soft-core cpu) just uses the low
> bits of the instruction address to index a single bit table.
> This means that branches alias each other.
> In order to get the consistent cycle counts in order to minimise
> the worst case code path we had to disable the dynamic prediction.
> 
> For the checksum code the loop branch isn't a problem.
> Tests on entry to the function might get mispredicted.
> 
> So if you have conditional prefetch when the buffer is long
> then time a short buffer after a 100 long ones you'll almost
> certainly see the mispredition penalty.
> 
> FWIW I remember speeding up a copy (I think) loop on a strongarm by
> adding an extra instruction to fetch a word from later in the buffer
> into a register I never otherwise used.
> (That was an unpaged system so I knew it couldn't fault.)
> 
Fair enough, but the code we're looking at here is arch specific.  If StrongARMs
benefit from different coding patterns, we can handle that in that arch.  This
x86 implementation can still avoid worrying about branch prediction, since its
hardware handles it well.
Neil

>   David
> 
> 
> 
> 
> 


Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-13 Thread Ingo Molnar

* David Laight  wrote:

> > > I'm not sure, whats the typical capacity for the branch predictors 
> > > ability to remember code paths?
> ...
> > 
> > For such simple single-target branches it goes near or over a thousand 
> > for recent Intel and AMD microarchitectures. Thousands for really 
> > recent CPUs.
> 
> IIRC the x86 can also correctly predict simple sequences - like a branch 
> in a loop that is taken every other iteration, or only after a previous 
> branch is taken.

They tend to be rather capable but not very well documented :) With a 
large out-of-order execution design and 20+ pipeline stages, x86 branch 
prediction accuracy is perhaps the most important design aspect for good 
CPU performance.

> Much simpler cpus may use a much simpler strategy.

Yeah. The patches in this thread are about the x86 assembly implementation 
of the csum routines, and for 'typical' x86 CPUs the branch prediction 
units and caches are certainly sophisticated enough.

Also note that here, for real usecases, the csum routines are (or should 
be) memory bandwidth limited, missing the data cache most of the time, 
with a partially idling pipeline, while branch prediction accuracy matters 
most when the pipeline is well fed and there are a lot of instructions in 
flight.

Thanks,

Ingo


RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-13 Thread David Laight
> > I'm not sure, whats the typical capacity for the branch predictors
> > ability to remember code paths?
...
> 
> For such simple single-target branches it goes near or over a thousand for
> recent Intel and AMD microarchitectures. Thousands for really recent CPUs.

IIRC the x86 can also correctly predict simple sequences - like a branch
in a loop that is taken every other iteration, or only after a previous
branch is taken.

Much simpler cpus may use a much simpler strategy.
I think one I've used (an fpga soft-core cpu) just uses the low
bits of the instruction address to index a single-bit table.
This means that branches alias each other.
To get consistent cycle counts, and so minimise the worst-case
code path, we had to disable the dynamic prediction.
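
A minimal sketch of that sort of predictor (the table size, index function
and instruction alignment are arbitrary assumptions, not any particular
core): one bit per entry, indexed by low bits of the branch address, so
unrelated branches that map to the same slot alias:

/* Illustrative one-bit dynamic branch predictor. */
#define PRED_BITS 10
#define PRED_SIZE (1u << PRED_BITS)

static unsigned char pred_taken[PRED_SIZE];     /* 1 = predict taken */

static unsigned int pred_index(unsigned long branch_addr)
{
        /* assumes 4-byte instruction alignment */
        return (branch_addr >> 2) & (PRED_SIZE - 1);
}

static int predict_branch(unsigned long branch_addr)
{
        return pred_taken[pred_index(branch_addr)];
}

static void record_branch(unsigned long branch_addr, int taken)
{
        pred_taken[pred_index(branch_addr)] = taken ? 1 : 0;
}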

For the checksum code the loop branch isn't a problem.
Tests on entry to the function might get mispredicted.

So if you have a conditional prefetch when the buffer is long,
then time a short buffer after 100 long ones, you'll almost
certainly see the misprediction penalty.

FWIW I remember speeding up a copy (I think) loop on a StrongARM by
adding an extra instruction to fetch a word from later in the buffer
into a register I never otherwise used.
(That was an unpaged system so I knew it couldn't fault.)

David






Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-13 Thread Ingo Molnar

* Neil Horman  wrote:

> On Wed, Nov 13, 2013 at 10:09:51AM -, David Laight wrote:
> > > Sure, I modified the code so that we only prefetched 2 cache lines ahead, 
> > > but
> > > only if the overall length of the input buffer is more than 2 cache lines.
> > > Below are the results (all counts are the average of 100 iterations 
> > > of the
> > > csum operation, as previous tests were, I just omitted that column).
> > 
> > Hmmm averaging over 10 iterations means that all the code
> > is in the i-cache and the branch predictor will be correctly primed.
> > 
> > For short checksum requests I'd guess that the relevant data
> > has just been written and is already in the cpu cache (unless
> > there has been a process and cpu switch).
> > So prefetch is likely to be unnecessary.
> > 
> > If you assume that the checksum code isn't in the i-cache then
> > small requests are likely to be dominated by the code size.
> 
> I'm not sure, whats the typical capacity for the branch predictors 
> ability to remember code paths?  I ask because the most likely use of 
> do_csum will be in the receive path of the networking stack 
> (specifically in the softirq handler). So if we run do_csum once, we're 
> likely to run it many more times, as we clean out an adapters receive 
> queue.

For such simple single-target branches it goes near or over a thousand for 
recent Intel and AMD microarchitectures. Thousands for really recent CPUs.

Note that branch prediction caches are hierarchical and are typically 
attached to cache hierarchies (where the uops are fetched from), so the 
first-level BTB is typically shared between SMT CPUs that share an icache, 
and the L2 BTB (which is larger and more associative) is shared by all 
cores in a package.

So it's possible for some other task on another (sibling) CPU to keep 
pressure on your BTB, but I'd say it's relatively rare; it's hard to do it 
at a really high rate that blows away all the cache all the time. (PeterZ 
has written some artificial pseudorandom branching monster just to be able 
to generate cache misses and validate perf's branch stats - but even if 
you deliberately want to, it's pretty hard to beat that cache.)

I'd definitely not worry about the prediction accuracy of repetitive loops 
like csum routines, they'll be cached well.

Thanks,

Ingo


Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-13 Thread Neil Horman
On Wed, Nov 13, 2013 at 10:09:51AM -, David Laight wrote:
> > Sure, I modified the code so that we only prefetched 2 cache lines ahead, 
> > but
> > only if the overall length of the input buffer is more than 2 cache lines.
> > Below are the results (all counts are the average of 100 iterations of 
> > the
> > csum operation, as previous tests were, I just omitted that column).
> 
> Hmmm averaging over 10 iterations means that all the code
> is in the i-cache and the branch predictor will be correctly primed.
> 
> For short checksum requests I'd guess that the relevant data
> has just been written and is already in the cpu cache (unless
> there has been a process and cpu switch).
> So prefetch is likely to be unnecessary.
> 
> If you assume that the checksum code isn't in the i-cache then
> small requests are likely to be dominated by the code size.
> 
I'm not sure, what's the typical capacity for the branch predictor's ability to
remember code paths?  I ask because the most likely use of do_csum will be in
the receive path of the networking stack (specifically in the softirq handler).
So if we run do_csum once, we're likely to run it many more times, as we clean
out an adapter's receive queue.

Neil

>   David
> 
> 
> 
> 


RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-13 Thread David Laight
> Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
> only if the overall length of the input buffer is more than 2 cache lines.
> Below are the results (all counts are the average of 100 iterations of the
> csum operation, as previous tests were, I just omitted that column).

Hmmm averaging over 10 iterations means that all the code
is in the i-cache and the branch predictor will be correctly primed.
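
To make that caveat concrete, a rough user-space sketch of the kind of
hot-loop cycles/byte harness being discussed (csum_stub(), the buffer size
and the iteration count are placeholders, not the code behind the numbers
in this thread); because the same buffer is summed over and over, the
instructions, data and branch history are all warm after the first pass:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>          /* __rdtsc() */

/* Stand-in for do_csum(): a plain byte sum so the sketch compiles. */
static uint64_t csum_stub(const unsigned char *buf, size_t len)
{
        uint64_t sum = 0;
        size_t i;

        for (i = 0; i < len; i++)
                sum += buf[i];
        return sum;
}

int main(void)
{
        const size_t len = 1500;        /* assumed packet-sized buffer */
        const int iterations = 1000000;
        unsigned char *buf = malloc(len);
        volatile uint64_t sink = 0;
        uint64_t start, cycles;
        int i;

        if (!buf)
                return 1;
        memset(buf, 0xa5, len);

        start = __rdtsc();
        for (i = 0; i < iterations; i++)
                sink += csum_stub(buf, len);    /* same hot buffer every pass */
        cycles = __rdtsc() - start;

        printf("%.4f cycles/byte\n",
               (double)cycles / ((double)len * iterations));
        free(buf);
        return 0;
}

A cold-cache or mixed-length run would need fresh buffers (or explicit
flushes) per call, which is exactly what a loop like this does not capture.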

For short checksum requests I'd guess that the relevant data
has just been written and is already in the cpu cache (unless
there has been a process and cpu switch).
So prefetch is likely to be unnecessary.

If you assume that the checksum code isn't in the i-cache then
small requests are likely to be dominated by the code size.

David





Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Neil Horman
On Tue, Nov 12, 2013 at 12:38:01PM -0800, Joe Perches wrote:
> On Tue, 2013-11-12 at 14:50 -0500, Neil Horman wrote:
> > On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> > > On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> []
> > > > So, the numbers are correct now that I returned my hardware to its 
> > > > previous
> > > > interrupt affinity state, but the trend seems to be the same (namely 
> > > > that there
> > > > isn't a clear one).  We seem to find peak performance around a 
> > > > readahead of 2
> > > > cachelines, but its very small (about 3%), and its inconsistent (larger 
> > > > set
> > > > sizes fall to either side of that stride).  So I don't see it as a 
> > > > clear win.  I
> > > > still think we should probably scrap the readahead for now, just take 
> > > > the perf
> > > > bits, and revisit this when we can use the vector instructions or the
> > > > independent carry chain instructions to improve this more consistently.
> > > > 
> > > > Thoughts
> > > 
> > > Perhaps a single prefetch, not of the first addr but of
> > > the addr after PREFETCH_STRIDE would work best but only
> > > if length is > PREFETCH_STRIDE.
> > > 
> > > I'd try:
> > > 
> > >   if (len > PREFETCH_STRIDE)
> > >   prefetch(buf + PREFETCH_STRIDE);
> > >   while (count64) {
> > >   etc...
> > >   }
> > > 
> > > I still don't know how much that impacts very short lengths.
> > > Can you please add a 20 byte length to your tests?
> > Sure, I modified the code so that we only prefetched 2 cache lines ahead, 
> > but
> > only if the overall length of the input buffer is more than 2 cache lines.
> > Below are the results (all counts are the average of 100 iterations of 
> > the
> > csum operation, as previous tests were, I just omitted that column).
> > 
> > len     set     cycles/byte     cycles/byte     improvement
> >                 no prefetch     prefetch
> > ===========================================================
> > 20B     64MB    45.014989       44.402432        1.3%
> > 20B     128MB   44.900317       46.146447       -2.7%
> > 20B     256MB   45.303223       48.193623       -6.3%
> > 20B     512MB   45.615301       44.486872        2.2%
> []
> > I'm still left thinking we should just abandon the prefetch at this point 
> > and
> > keep the perf code until we have new instructions to help us with this 
> > further,
> > unless you see something I dont.
> 
> I tend to agree but perhaps the 3% performance
> increase with a prefetch for longer lengths is
> actually significant and desirable.
> 
Maybe, but I worry that it's not going to be consistent.  At least not with the
cost of the extra comparison and jump.

> It doesn't seem you've done the test I suggested
> where prefetch is done only for
> "len > PREFETCH_STRIDE".
> 
No, that's exactly what I did, I did this:

#define PREFETCH_STRIDE (cache_line_size() * 2)

...

        if (len > PREFETCH_STRIDE)
                prefetch(buf + PREFETCH_STRIDE);

        while (count64) {
                ...

> Is it ever useful to do a prefetch of the
> address/data being accessed by the next
> instruction?
> 
Doubtful, you need to prefetch the data far enough in advance that it's loaded by
the time you need to reference it.  Otherwise you wind up stalling the data
pipeline while the load completes.  So unless you have really fast memory, the
prefetch is effectively a no-op for the next access.  But the next cacheline
ahead is good, as it prevents the stall there.  Any more than that, though (from
this testing), seems to again be a no-op, as modern hardware automatically issues
the prefetch because it notices the linear data access pattern.
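
For illustration only, the overall shape being described looks something
like the sketch below (the 64-byte line size, two-line prefetch distance
and plain 64-bit adds are assumptions for the sketch; the real do_csum()
folds carries and handles unaligned heads and tails):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CACHELINE        64                      /* assumed line size */
#define SKETCH_PREFETCH_AHEAD   (2 * SKETCH_CACHELINE)

static uint64_t csum_sketch(const unsigned char *buf, size_t len)
{
        uint64_t sum = 0;
        size_t off;

        for (off = 0; off + 8 <= len; off += 8) {
                /* Once per cache line, reach two lines ahead of the current
                 * read position, but never past the end of the buffer. */
                if ((off & (SKETCH_CACHELINE - 1)) == 0 &&
                    off + SKETCH_PREFETCH_AHEAD < len)
                        __builtin_prefetch(buf + off + SKETCH_PREFETCH_AHEAD);

                uint64_t v;
                memcpy(&v, buf + off, sizeof(v));       /* avoid alignment assumptions */
                sum += v;                               /* carry folding omitted */
        }
        return sum;
}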

> Anyway, thanks for doing all the work.
> 
No worries, glad to do it.  Thanks for the review

Ingo, what do you think, shall I submit the perf bits as a separate thread, or
do you not want those any more?

Regards
Neil

> Joe
> 
> 


Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Joe Perches
On Tue, 2013-11-12 at 14:50 -0500, Neil Horman wrote:
> On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> > On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
[]
> > > So, the numbers are correct now that I returned my hardware to its 
> > > previous
> > > interrupt affinity state, but the trend seems to be the same (namely that 
> > > there
> > > isn't a clear one).  We seem to find peak performance around a readahead 
> > > of 2
> > > cachelines, but its very small (about 3%), and its inconsistent (larger 
> > > set
> > > sizes fall to either side of that stride).  So I don't see it as a clear 
> > > win.  I
> > > still think we should probably scrap the readahead for now, just take the 
> > > perf
> > > bits, and revisit this when we can use the vector instructions or the
> > > independent carry chain instructions to improve this more consistently.
> > > 
> > > Thoughts
> > 
> > Perhaps a single prefetch, not of the first addr but of
> > the addr after PREFETCH_STRIDE would work best but only
> > if length is > PREFETCH_STRIDE.
> > 
> > I'd try:
> > 
> > if (len > PREFETCH_STRIDE)
> > prefetch(buf + PREFETCH_STRIDE);
> > while (count64) {
> > etc...
> > }
> > 
> > I still don't know how much that impacts very short lengths.
> > Can you please add a 20 byte length to your tests?
> Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
> only if the overall length of the input buffer is more than 2 cache lines.
> Below are the results (all counts are the average of 100 iterations of the
> csum operation, as previous tests were, I just omitted that column).
> 
> len     set     cycles/byte     cycles/byte     improvement
>                 no prefetch     prefetch
> ===========================================================
> 20B     64MB    45.014989       44.402432        1.3%
> 20B     128MB   44.900317       46.146447       -2.7%
> 20B     256MB   45.303223       48.193623       -6.3%
> 20B     512MB   45.615301       44.486872        2.2%
[]
> I'm still left thinking we should just abandon the prefetch at this point and
> keep the perf code until we have new instructions to help us with this 
> further,
> unless you see something I dont.

I tend to agree but perhaps the 3% performance
increase with a prefetch for longer lengths is
actually significant and desirable.

It doesn't seem you've done the test I suggested
where prefetch is done only for
"len > PREFETCH_STRIDE".

Is it ever useful to do a prefetch of the
address/data being accessed by the next
instruction?

Anyway, thanks for doing all the work.

Joe



Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Neil Horman
On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> > On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> > > Hi again Neil.
> > > 
> > > Forwarding on to netdev with a concern as to how often
> > > do_csum is used via csum_partial for very short headers
> > > and what impact any prefetch would have there.
> > > 
> > > Also, what changed in your test environment?
> > > 
> > > Why are the new values 5+% higher cycles/byte than the
> > > previous values?
> > > 
> > > And here is the new table reformatted:
> > > 
> > > len    set    iterations  Readahead cachelines vs cycles/byte
> > >                           1       2       3       4       6       10      20
> > > 1500B  64MB   100         1.4342  1.4300  1.4350  1.4350  1.4396  1.4315  1.4555
> > > 1500B  128MB  100         1.4312  1.4346  1.4271  1.4284  1.4376  1.4318  1.4431
> > > 1500B  256MB  100         1.4309  1.4254  1.4316  1.4308  1.4418  1.4304  1.4367
> > > 1500B  512MB  100         1.4534  1.4516  1.4523  1.4563  1.4554  1.4644  1.4590
> > > 9000B  64MB   100         0.8921  0.8924  0.8932  0.8949  0.8952  0.8939  0.8985
> > > 9000B  128MB  100         0.8841  0.8856  0.8845  0.8854  0.8861  0.8879  0.8861
> > > 9000B  256MB  100         0.8806  0.8821  0.8813  0.8833  0.8814  0.8827  0.8895
> > > 9000B  512MB  100         0.8838  0.8852  0.8841  0.8865  0.8846  0.8901  0.8865
> > > 64KB   64MB   100         0.8132  0.8136  0.8132  0.8150  0.8147  0.8149  0.8147
> > > 64KB   128MB  100         0.8013  0.8014  0.8013  0.8020  0.8041  0.8015  0.8033
> > > 64KB   256MB  100         0.7956  0.7959  0.7956  0.7976  0.7981  0.7967  0.7973
> > > 64KB   512MB  100         0.7934  0.7932  0.7937  0.7951  0.7954  0.7943  0.7948
> > > 
> > 
> > 
> > There we go, thats better:
> > len   set iterations  Readahead cachelines vs cycles/byte
> > 1   2   3   4   5   10  20
> > 1500B 64MB  100 1.3638  1.3288  1.3464  1.3505  1.3586  1.3527  1.3408
> > 1500B 128MB 100 1.3394  1.3357  1.3625  1.3456  1.3536  1.3400  1.3410
> > 1500B 256MB 100 1.3773  1.3362  1.3419  1.3548  1.3543  1.3442  1.4163
> > 1500B 512MB 100 1.3442  1.3390  1.3434  1.3505  1.3767  1.3513  1.3820
> > 9000B 64MB  100 0.8505  0.8492  0.8521  0.8593  0.8566  0.8577  0.8547
> > 9000B 128MB 100 0.8507  0.8507  0.8523  0.8627  0.8593  0.8670  0.8570
> > 9000B 256MB 100 0.8516  0.8515  0.8568  0.8546  0.8549  0.8609  0.8596
> > 9000B 512MB 100 0.8517  0.8526  0.8552  0.8675  0.8547  0.8526  0.8621
> > 64KB  64MB  100 0.7679  0.7689  0.7688  0.7716  0.7714  0.7722  0.7716
> > 64KB  128MB 100 0.7683  0.7687  0.7710  0.7690  0.7717  0.7694  0.7703
> > 64KB  256MB 100 0.7680  0.7703  0.7688  0.7689  0.7726  0.7717  0.7713
> > 64KB  512MB 100 0.7692  0.7690  0.7701  0.7705  0.7698  0.7693  0.7735
> > 
> > 
> > So, the numbers are correct now that I returned my hardware to its previous
> > interrupt affinity state, but the trend seems to be the same (namely that 
> > there
> > isn't a clear one).  We seem to find peak performance around a readahead of 
> > 2
> > cachelines, but its very small (about 3%), and its inconsistent (larger set
> > sizes fall to either side of that stride).  So I don't see it as a clear 
> > win.  I
> > still think we should probably scrap the readahead for now, just take the 
> > perf
> > bits, and revisit this when we can use the vector instructions or the
> > independent carry chain instructions to improve this more consistently.
> > 
> > Thoughts
> 
> Perhaps a single prefetch, not of the first addr but of
> the addr after PREFETCH_STRIDE would work best but only
> if length is > PREFETCH_STRIDE.
> 
> I'd try:
> 
>   if (len > PREFETCH_STRIDE)
>   prefetch(buf + PREFETCH_STRIDE);
>   while (count64) {
>   etc...
>   }
> 
> I still don't know how much that impacts very short lengths.
> 
> Can you please add a 20 byte length to your tests?
> 
> 


Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
only if the overall length of the input buffer is more than 2 cache lines.
Below are the results (all counts are the average of 100 iterations of the
csum operation, as previous tests were, I just omitted that column).

len     set     cycles/byte     cycles/byte     improvement
                no prefetch     prefetch
===========================================================
20B     64MB    45.014989       44.402432        1.3%
20B     128MB   44.900317       46.146447       -2.7%
20B     256MB   45.303223       48.193623       -6.3%
20B     512MB   45.615301       44.486872        2.2%
1500B   64MB     1.364365        1.332285        1.9%
1500B   128MB    1.373945        1.335907        1.4%
1500B   256MB    1.356971        1.339084        1.2%
1500B   512MB

Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Joe Perches
On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> > Hi again Neil.
> > 
> > Forwarding on to netdev with a concern as to how often
> > do_csum is used via csum_partial for very short headers
> > and what impact any prefetch would have there.
> > 
> > Also, what changed in your test environment?
> > 
> > Why are the new values 5+% higher cycles/byte than the
> > previous values?
> > 
> > And here is the new table reformatted:
> > 
> > len    set    iterations  Readahead cachelines vs cycles/byte
> >                           1       2       3       4       6       10      20
> > 1500B  64MB   100         1.4342  1.4300  1.4350  1.4350  1.4396  1.4315  1.4555
> > 1500B  128MB  100         1.4312  1.4346  1.4271  1.4284  1.4376  1.4318  1.4431
> > 1500B  256MB  100         1.4309  1.4254  1.4316  1.4308  1.4418  1.4304  1.4367
> > 1500B  512MB  100         1.4534  1.4516  1.4523  1.4563  1.4554  1.4644  1.4590
> > 9000B  64MB   100         0.8921  0.8924  0.8932  0.8949  0.8952  0.8939  0.8985
> > 9000B  128MB  100         0.8841  0.8856  0.8845  0.8854  0.8861  0.8879  0.8861
> > 9000B  256MB  100         0.8806  0.8821  0.8813  0.8833  0.8814  0.8827  0.8895
> > 9000B  512MB  100         0.8838  0.8852  0.8841  0.8865  0.8846  0.8901  0.8865
> > 64KB   64MB   100         0.8132  0.8136  0.8132  0.8150  0.8147  0.8149  0.8147
> > 64KB   128MB  100         0.8013  0.8014  0.8013  0.8020  0.8041  0.8015  0.8033
> > 64KB   256MB  100         0.7956  0.7959  0.7956  0.7976  0.7981  0.7967  0.7973
> > 64KB   512MB  100         0.7934  0.7932  0.7937  0.7951  0.7954  0.7943  0.7948
> > 
> 
> 
> There we go, thats better:
> len   set iterations  Readahead cachelines vs cycles/byte
>   1   2   3   4   5   10  20
> 1500B 64MB100 1.3638  1.3288  1.3464  1.3505  1.3586  1.3527  1.3408
> 1500B 128MB   100 1.3394  1.3357  1.3625  1.3456  1.3536  1.3400  1.3410
> 1500B 256MB   100 1.3773  1.3362  1.3419  1.3548  1.3543  1.3442  1.4163
> 1500B 512MB   100 1.3442  1.3390  1.3434  1.3505  1.3767  1.3513  1.3820
> 9000B 64MB100 0.8505  0.8492  0.8521  0.8593  0.8566  0.8577  0.8547
> 9000B 128MB   100 0.8507  0.8507  0.8523  0.8627  0.8593  0.8670  0.8570
> 9000B 256MB   100 0.8516  0.8515  0.8568  0.8546  0.8549  0.8609  0.8596
> 9000B 512MB   100 0.8517  0.8526  0.8552  0.8675  0.8547  0.8526  0.8621
> 64KB  64MB100 0.7679  0.7689  0.7688  0.7716  0.7714  0.7722  0.7716
> 64KB  128MB   100 0.7683  0.7687  0.7710  0.7690  0.7717  0.7694  0.7703
> 64KB  256MB   100 0.7680  0.7703  0.7688  0.7689  0.7726  0.7717  0.7713
> 64KB  512MB   100 0.7692  0.7690  0.7701  0.7705  0.7698  0.7693  0.7735
> 
> 
> So, the numbers are correct now that I returned my hardware to its previous
> interrupt affinity state, but the trend seems to be the same (namely that 
> there
> isn't a clear one).  We seem to find peak performance around a readahead of 2
> cachelines, but its very small (about 3%), and its inconsistent (larger set
> sizes fall to either side of that stride).  So I don't see it as a clear win. 
>  I
> still think we should probably scrap the readahead for now, just take the perf
> bits, and revisit this when we can use the vector instructions or the
> independent carry chain instructions to improve this more consistently.
> 
> Thoughts

Perhaps a single prefetch, not of the first addr but of
the addr after PREFETCH_STRIDE would work best but only
if length is > PREFETCH_STRIDE.

I'd try:

        if (len > PREFETCH_STRIDE)
                prefetch(buf + PREFETCH_STRIDE);
        while (count64) {
                etc...
        }

I still don't know how much that impacts very short lengths.

Can you please add a 20 byte length to your tests?



Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Neil Horman
On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> Hi again Neil.
> 
> Forwarding on to netdev with a concern as to how often
> do_csum is used via csum_partial for very short headers
> and what impact any prefetch would have there.
> 
> Also, what changed in your test environment?
> 
> Why are the new values 5+% higher cycles/byte than the
> previous values?
> 
> And here is the new table reformatted:
> 
> len    set    iterations  Readahead cachelines vs cycles/byte
>                           1       2       3       4       6       10      20
> 1500B  64MB   100         1.4342  1.4300  1.4350  1.4350  1.4396  1.4315  1.4555
> 1500B  128MB  100         1.4312  1.4346  1.4271  1.4284  1.4376  1.4318  1.4431
> 1500B  256MB  100         1.4309  1.4254  1.4316  1.4308  1.4418  1.4304  1.4367
> 1500B  512MB  100         1.4534  1.4516  1.4523  1.4563  1.4554  1.4644  1.4590
> 9000B  64MB   100         0.8921  0.8924  0.8932  0.8949  0.8952  0.8939  0.8985
> 9000B  128MB  100         0.8841  0.8856  0.8845  0.8854  0.8861  0.8879  0.8861
> 9000B  256MB  100         0.8806  0.8821  0.8813  0.8833  0.8814  0.8827  0.8895
> 9000B  512MB  100         0.8838  0.8852  0.8841  0.8865  0.8846  0.8901  0.8865
> 64KB   64MB   100         0.8132  0.8136  0.8132  0.8150  0.8147  0.8149  0.8147
> 64KB   128MB  100         0.8013  0.8014  0.8013  0.8020  0.8041  0.8015  0.8033
> 64KB   256MB  100         0.7956  0.7959  0.7956  0.7976  0.7981  0.7967  0.7973
> 64KB   512MB  100         0.7934  0.7932  0.7937  0.7951  0.7954  0.7943  0.7948
> 


There we go, that's better:
len   set iterations  Readahead cachelines vs cycles/byte
1   2   3   4   5   10  20
1500B 64MB  100 1.3638  1.3288  1.3464  1.3505  1.3586  1.3527  1.3408
1500B 128MB 100 1.3394  1.3357  1.3625  1.3456  1.3536  1.3400  1.3410
1500B 256MB 100 1.3773  1.3362  1.3419  1.3548  1.3543  1.3442  1.4163
1500B 512MB 100 1.3442  1.3390  1.3434  1.3505  1.3767  1.3513  1.3820
9000B 64MB  100 0.8505  0.8492  0.8521  0.8593  0.8566  0.8577  0.8547
9000B 128MB 100 0.8507  0.8507  0.8523  0.8627  0.8593  0.8670  0.8570
9000B 256MB 100 0.8516  0.8515  0.8568  0.8546  0.8549  0.8609  0.8596
9000B 512MB 100 0.8517  0.8526  0.8552  0.8675  0.8547  0.8526  0.8621
64KB  64MB  100 0.7679  0.7689  0.7688  0.7716  0.7714  0.7722  0.7716
64KB  128MB 100 0.7683  0.7687  0.7710  0.7690  0.7717  0.7694  0.7703
64KB  256MB 100 0.7680  0.7703  0.7688  0.7689  0.7726  0.7717  0.7713
64KB  512MB 100 0.7692  0.7690  0.7701  0.7705  0.7698  0.7693  0.7735


So, the numbers are correct now that I returned my hardware to its previous
interrupt affinity state, but the trend seems to be the same (namely that there
isn't a clear one).  We seem to find peak performance around a readahead of 2
cachelines, but it's very small (about 3%), and it's inconsistent (larger set
sizes fall to either side of that stride).  So I don't see it as a clear win.  I
still think we should probably scrap the readahead for now, just take the perf
bits, and revisit this when we can use the vector instructions or the
independent carry chain instructions to improve this more consistently.

Thoughts?
Neil





Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Neil Horman
On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> Hi again Neil.
> 
> Forwarding on to netdev with a concern as to how often
> do_csum is used via csum_partial for very short headers
> and what impact any prefetch would have there.
> 
> Also, what changed in your test environment?
> 
> Why are the new values 5+% higher cycles/byte than the
> previous values?
> 
Hmm, thank you, I didn't notice the increase.  I think I rebooted my system and
failed to reset my irq affinity to avoid the processor I was testing on.  Let me
rerun.
Neil

> And here is the new table reformatted:
> 
> len    set    iterations  Readahead cachelines vs cycles/byte
>                           1       2       3       4       6       10      20
> 1500B  64MB   100         1.4342  1.4300  1.4350  1.4350  1.4396  1.4315  1.4555
> 1500B  128MB  100         1.4312  1.4346  1.4271  1.4284  1.4376  1.4318  1.4431
> 1500B  256MB  100         1.4309  1.4254  1.4316  1.4308  1.4418  1.4304  1.4367
> 1500B  512MB  100         1.4534  1.4516  1.4523  1.4563  1.4554  1.4644  1.4590
> 9000B  64MB   100         0.8921  0.8924  0.8932  0.8949  0.8952  0.8939  0.8985
> 9000B  128MB  100         0.8841  0.8856  0.8845  0.8854  0.8861  0.8879  0.8861
> 9000B  256MB  100         0.8806  0.8821  0.8813  0.8833  0.8814  0.8827  0.8895
> 9000B  512MB  100         0.8838  0.8852  0.8841  0.8865  0.8846  0.8901  0.8865
> 64KB   64MB   100         0.8132  0.8136  0.8132  0.8150  0.8147  0.8149  0.8147
> 64KB   128MB  100         0.8013  0.8014  0.8013  0.8020  0.8041  0.8015  0.8033
> 64KB   256MB  100         0.7956  0.7959  0.7956  0.7976  0.7981  0.7967  0.7973
> 64KB   512MB  100         0.7934  0.7932  0.7937  0.7951  0.7954  0.7943  0.7948
> 
>  Forwarded Message 
> From: Neil Horman 
> To: Joe Perches 
> Cc: Dave Jones , linux-kernel@vger.kernel.org,
> sebastien.du...@bull.net, Thomas Gleixner , Ingo
> Molnar , H. Peter Anvin ,
> x...@kernel.org
> Subject: Re: [PATCH v2 2/2] x86: add prefetching to do_csum
> 
> On Fri, Nov 08, 2013 at 12:29:07PM -0800, Joe Perches wrote:
> > On Fri, 2013-11-08 at 15:14 -0500, Neil Horman wrote:
> > > On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
> > > > On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> > > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > > > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > > > > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > > > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > > > > > >  > do_csum was identified via perf recently as a hot spot when 
> > > > > > > > doing
> > > > > > > >  > receive on ip over infiniband workloads.  After alot of 
> > > > > > > > testing and
> > > > > > > >  > ideas, we found the best optimization available to us 
> > > > > > > > currently is to
> > > > > > > >  > prefetch the entire data buffer prior to doing the checksum
> > > > > > []
> > > > > > > I'll fix this up and send a v3, but I'll give it a day in case 
> > > > > > > there are more
> > > > > > > comments first.
> > > > > > 
> > > > > > Perhaps a reduction in prefetch loop count helps.
> > > > > > 
> > > > > > Was capping the amount prefetched and letting the
> > > > > > hardware prefetch also tested?
> > > > > > 
> > > > > > prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > > > > > 
> > > > > 
> > > > > Just tested this out:
> > > > 
> > > > Thanks.
> > > > 
> > > > Reformatting the table so it's a bit more
> > > > readable/comparable for me:
> > > > 
> > > > len     SetSz   Loops   cycles/byte
> > > >                         limited unlimited
> > > > 1500B   64MB    1M      1.3442  1.3605
> > > > 1500B   128MB   1M      1.3410  1.3542
> > > > 1500B   256MB   1M      1.3536  1.3710
> > > > 1500B   512MB   1M      1.3463  1.3536
> > > > 9000B   64MB    1M      0.8522  0.8504
> > > > 9000B   128MB   1M      0.8528  0.8536
> > > > 9000B   256MB   1M      0.8532  0.8520
> > > > 9000B   512MB   1M      0.8527  0.8525
> > > > 64KB    64MB    1M      0.7686  0.7683
> > > > 64KB    128MB   1M      0.7695  0.7686
> > > > 64KB    256MB   1M      0.7699  0.7708
> > > > 64KB    512MB   1M      0.7799  0.7694
> > > > 
> > > > This data appears to show some value
> > > > in capping for 1500b lengths and noise
> > > > for shorter and longer lengths.
> > > > 
> > > > Any idea what the actual distribution of
> > > > do_csum lengths is under various loads?
> > > > 
> > > I don't have any hard data no, sorry.
> > 
> > I think you should before you implement this.
> > You might find extremely short lengths.
> > 
> > > I'll cap the prefetch at 1500B for now, since it
> > > doesn't seem to hurt or help beyond that
> > 
> > The table data has a max prefetch of
> > 8 * boot_cpu_data.x86_cache_alignment so
> > I believe it's always less than 1500 but
> > perhaps 4 might be slightly better still.
> > 
> 
> 
So, you appear to be correct, I reran my test set with different prefetch
ceilings and got the results below.  There are some cases in which there is a
performance gain, but the gain is small, and occurs at different spots depending
on the input buffer size (though most peak gains appear around 2 cache lines).
I'm guessing it takes about 2 prefetches before hardware
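
For reference, a rough sketch of the kind of capped prefetch_lines() helper
discussed in the quoted exchange above (hypothetical; the helper in the
actual patch may differ, and the 64-byte line size is an assumption):

#include <stddef.h>

#define SKETCH_CACHELINE 64

static inline void prefetch_lines(const void *addr, size_t len)
{
        const char *p = (const char *)addr;
        const char *end = p + len;

        /* One software prefetch per assumed cache line; callers cap 'len',
         * e.g. min(len, cache_line_size() * 8u), to bound how far ahead
         * this reaches. */
        for (; p < end; p += SKETCH_CACHELINE)
                __builtin_prefetch(p);
}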

Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Neil Horman
On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
 Hi again Neil.
 
 Forwarding on to netdev with a concern as to how often
 do_csum is used via csum_partial for very short headers
 and what impact any prefetch would have there.
 
 Also, what changed in your test environment?
 
 Why are the new values 5+% higher cycles/byte than the
 previous values?
 
Hmm, thank you, I didn't notice the increase.  I think I rebooted my system and
failed to reset my irq affinity to avoid the processor I was testing on.  Let me
rerun.
Neil

 And here is the new table reformatted:
 
 len   set iterations  Readahead cachelines vs cycles/byte
   1   2   3   4   6   10  20
 1500B 64MB100 1.4342  1.4300  1.4350  1.4350  1.4396  1.4315  1.4555
 1500B 128MB   100 1.4312  1.4346  1.4271  1.4284  1.4376  1.4318  1.4431
 1500B 256MB   100 1.4309  1.4254  1.4316  1.4308  1.4418  1.4304  1.4367
 1500B 512MB   100 1.4534  1.4516  1.4523  1.4563  1.4554  1.4644  1.4590
 9000B 64MB100 0.8921  0.8924  0.8932  0.8949  0.8952  0.8939  0.8985
 9000B 128MB   100 0.8841  0.8856  0.8845  0.8854  0.8861  0.8879  0.8861
 9000B 256MB   100 0.8806  0.8821  0.8813  0.8833  0.8814  0.8827  0.8895
 9000B 512MB   100 0.8838  0.8852  0.8841  0.8865  0.8846  0.8901  0.8865
 64KB  64MB100 0.8132  0.8136  0.8132  0.8150  0.8147  0.8149  0.8147
 64KB  128MB   100 0.8013  0.8014  0.8013  0.8020  0.8041  0.8015  0.8033
 64KB  256MB   100 0.7956  0.7959  0.7956  0.7976  0.7981  0.7967  0.7973
 64KB  512MB   100 0.7934  0.7932  0.7937  0.7951  0.7954  0.7943  0.7948
 
  Forwarded Message 
 From: Neil Horman nhor...@tuxdriver.com
 To: Joe Perches j...@perches.com
 Cc: Dave Jones da...@redhat.com, linux-kernel@vger.kernel.org,
 sebastien.du...@bull.net, Thomas Gleixner t...@linutronix.de, Ingo
 Molnar mi...@redhat.com, H. Peter Anvin h...@zytor.com,
 x...@kernel.org
 Subject: Re: [PATCH v2 2/2] x86: add prefetching to do_csum
 
 On Fri, Nov 08, 2013 at 12:29:07PM -0800, Joe Perches wrote:
  On Fri, 2013-11-08 at 15:14 -0500, Neil Horman wrote:
   On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
 On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
  On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
   On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
  do_csum was identified via perf recently as a hot spot when 
doing
  receive on ip over infiniband workloads.  After alot of 
testing and
  ideas, we found the best optimization available to us 
currently is to
  prefetch the entire data buffer prior to doing the checksum
  []
   I'll fix this up and send a v3, but I'll give it a day in case 
   there are more
   comments first.
  
  Perhaps a reduction in prefetch loop count helps.
  
  Was capping the amount prefetched and letting the
  hardware prefetch also tested?
  
  prefetch_lines(buff, min(len, cache_line_size() * 8u));
  
 
 Just tested this out:

Thanks.

Reformatting the table so it's a bit more
readable/comparable for me:

len SetSz   Loops   cycles/byte
limited unlimited
1500B   64MB1M  1.3442  1.3605
1500B   128MB   1M  1.3410  1.3542
1500B   256MB   1M  1.3536  1.3710
1500B   512MB   1M  1.3463  1.3536
9000B   64MB1M  0.8522  0.8504
9000B   128MB   1M  0.8528  0.8536
9000B   256MB   1M  0.8532  0.8520
9000B   512MB   1M  0.8527  0.8525
64KB64MB1M  0.7686  0.7683
64KB128MB   1M  0.7695  0.7686
64KB256MB   1M  0.7699  0.7708
64KB512MB   1M  0.7799  0.7694

This data appears to show some value
in capping for 1500b lengths and noise
for shorter and longer lengths.

Any idea what the actual distribution of
do_csum lengths is under various loads?

   I don't have any hard data no, sorry.
  
  I think you should before you implement this.
  You might find extremely short lengths.
  
   I'll cap the prefetch at 1500B for now, since it
   doesn't seem to hurt or help beyond that
  
  The table data has a max prefetch of
  8 * boot_cpu_data.x86_cache_alignment so
  I believe it's always less than 1500 but
  perhaps 4 might be slightly better still.
  
 
 
 So, you appear to be correct, I reran my test set with different prefetch
 ceilings and got the results below.  There are some cases in which there is a
 performance gain, but the gain is small, and occurs at different spots 
 depending
 on the input buffer size (though most peak gains appear around 2 cache lines).
 I'm guessing it takes about 2 prefetches before hardware 

Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Neil Horman
On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
 Hi again Neil.
 
 Forwarding on to netdev with a concern as to how often
 do_csum is used via csum_partial for very short headers
 and what impact any prefetch would have there.
 
 Also, what changed in your test environment?
 
 Why are the new values 5+% higher cycles/byte than the
 previous values?
 
 And here is the new table reformatted:
 
 len   set iterations  Readahead cachelines vs cycles/byte
   1   2   3   4   6   10  20
 1500B 64MB100 1.4342  1.4300  1.4350  1.4350  1.4396  1.4315  1.4555
 1500B 128MB   100 1.4312  1.4346  1.4271  1.4284  1.4376  1.4318  1.4431
 1500B 256MB   100 1.4309  1.4254  1.4316  1.4308  1.4418  1.4304  1.4367
 1500B 512MB   100 1.4534  1.4516  1.4523  1.4563  1.4554  1.4644  1.4590
 9000B 64MB100 0.8921  0.8924  0.8932  0.8949  0.8952  0.8939  0.8985
 9000B 128MB   100 0.8841  0.8856  0.8845  0.8854  0.8861  0.8879  0.8861
 9000B 256MB   100 0.8806  0.8821  0.8813  0.8833  0.8814  0.8827  0.8895
 9000B 512MB   100 0.8838  0.8852  0.8841  0.8865  0.8846  0.8901  0.8865
 64KB  64MB100 0.8132  0.8136  0.8132  0.8150  0.8147  0.8149  0.8147
 64KB  128MB   100 0.8013  0.8014  0.8013  0.8020  0.8041  0.8015  0.8033
 64KB  256MB   100 0.7956  0.7959  0.7956  0.7976  0.7981  0.7967  0.7973
 64KB  512MB   100 0.7934  0.7932  0.7937  0.7951  0.7954  0.7943  0.7948
 


There we go, thats better:
len   set iterations  Readahead cachelines vs cycles/byte
1   2   3   4   5   10  20
1500B 64MB  100 1.3638  1.3288  1.3464  1.3505  1.3586  1.3527  1.3408
1500B 128MB 100 1.3394  1.3357  1.3625  1.3456  1.3536  1.3400  1.3410
1500B 256MB 100 1.3773  1.3362  1.3419  1.3548  1.3543  1.3442  1.4163
1500B 512MB 100 1.3442  1.3390  1.3434  1.3505  1.3767  1.3513  1.3820
9000B 64MB  100 0.8505  0.8492  0.8521  0.8593  0.8566  0.8577  0.8547
9000B 128MB 100 0.8507  0.8507  0.8523  0.8627  0.8593  0.8670  0.8570
9000B 256MB 100 0.8516  0.8515  0.8568  0.8546  0.8549  0.8609  0.8596
9000B 512MB 100 0.8517  0.8526  0.8552  0.8675  0.8547  0.8526  0.8621
64KB  64MB  100 0.7679  0.7689  0.7688  0.7716  0.7714  0.7722  0.7716
64KB  128MB 100 0.7683  0.7687  0.7710  0.7690  0.7717  0.7694  0.7703
64KB  256MB 100 0.7680  0.7703  0.7688  0.7689  0.7726  0.7717  0.7713
64KB  512MB 100 0.7692  0.7690  0.7701  0.7705  0.7698  0.7693  0.7735


So, the numbers are correct now that I returned my hardware to its previous
interrupt affinity state, but the trend seems to be the same (namely that there
isn't a clear one).  We seem to find peak performance around a readahead of 2
cachelines, but its very small (about 3%), and its inconsistent (larger set
sizes fall to either side of that stride).  So I don't see it as a clear win.  I
still think we should probably scrap the readahead for now, just take the perf
bits, and revisit this when we can use the vector instructions or the
independent carry chain instructions to improve this more consistently.

Thoughts
Neil



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Joe Perches
On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
 On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
  Hi again Neil.
  
  Forwarding on to netdev with a concern as to how often
  do_csum is used via csum_partial for very short headers
  and what impact any prefetch would have there.
  
  Also, what changed in your test environment?
  
  Why are the new values 5+% higher cycles/byte than the
  previous values?
  
  And here is the new table reformatted:
  
  len set iterations  Readahead cachelines vs cycles/byte
  1   2   3   4   6   10  20
  1500B   64MB100 1.4342  1.4300  1.4350  1.4350  1.4396  1.4315  
  1.4555
  1500B   128MB   100 1.4312  1.4346  1.4271  1.4284  1.4376  1.4318  
  1.4431
  1500B   256MB   100 1.4309  1.4254  1.4316  1.4308  1.4418  1.4304  
  1.4367
  1500B   512MB   100 1.4534  1.4516  1.4523  1.4563  1.4554  1.4644  
  1.4590
  9000B   64MB100 0.8921  0.8924  0.8932  0.8949  0.8952  0.8939  
  0.8985
  9000B   128MB   100 0.8841  0.8856  0.8845  0.8854  0.8861  0.8879  
  0.8861
  9000B   256MB   100 0.8806  0.8821  0.8813  0.8833  0.8814  0.8827  
  0.8895
  9000B   512MB   100 0.8838  0.8852  0.8841  0.8865  0.8846  0.8901  
  0.8865
  64KB64MB100 0.8132  0.8136  0.8132  0.8150  0.8147  0.8149  
  0.8147
  64KB128MB   100 0.8013  0.8014  0.8013  0.8020  0.8041  0.8015  
  0.8033
  64KB256MB   100 0.7956  0.7959  0.7956  0.7976  0.7981  0.7967  
  0.7973
  64KB512MB   100 0.7934  0.7932  0.7937  0.7951  0.7954  0.7943  
  0.7948
  
 
 
 There we go, thats better:
 len   set iterations  Readahead cachelines vs cycles/byte
   1   2   3   4   5   10  20
 1500B 64MB100 1.3638  1.3288  1.3464  1.3505  1.3586  1.3527  1.3408
 1500B 128MB   100 1.3394  1.3357  1.3625  1.3456  1.3536  1.3400  1.3410
 1500B 256MB   100 1.3773  1.3362  1.3419  1.3548  1.3543  1.3442  1.4163
 1500B 512MB   100 1.3442  1.3390  1.3434  1.3505  1.3767  1.3513  1.3820
 9000B 64MB100 0.8505  0.8492  0.8521  0.8593  0.8566  0.8577  0.8547
 9000B 128MB   100 0.8507  0.8507  0.8523  0.8627  0.8593  0.8670  0.8570
 9000B 256MB   100 0.8516  0.8515  0.8568  0.8546  0.8549  0.8609  0.8596
 9000B 512MB   100 0.8517  0.8526  0.8552  0.8675  0.8547  0.8526  0.8621
 64KB  64MB100 0.7679  0.7689  0.7688  0.7716  0.7714  0.7722  0.7716
 64KB  128MB   100 0.7683  0.7687  0.7710  0.7690  0.7717  0.7694  0.7703
 64KB  256MB   100 0.7680  0.7703  0.7688  0.7689  0.7726  0.7717  0.7713
 64KB  512MB   100 0.7692  0.7690  0.7701  0.7705  0.7698  0.7693  0.7735
 
 
 So, the numbers are correct now that I returned my hardware to its previous
 interrupt affinity state, but the trend seems to be the same (namely that 
 there
 isn't a clear one).  We seem to find peak performance around a readahead of 2
 cachelines, but its very small (about 3%), and its inconsistent (larger set
 sizes fall to either side of that stride).  So I don't see it as a clear win. 
  I
 still think we should probably scrap the readahead for now, just take the perf
 bits, and revisit this when we can use the vector instructions or the
 independent carry chain instructions to improve this more consistently.
 
 Thoughts

Perhaps a single prefetch, not of the first addr but of
the addr after PREFETCH_STRIDE would work best but only
if length is  PREFETCH_STRIDE.

I'd try:

if (len  PREFETCH_STRIDE)
prefetch(buf + PREFETCH_STRIDE);
while (count64) {
etc...
}

I still don't know how much that impacts very short lengths.

Can you please add a 20 byte length to your tests?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Neil Horman
On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
 On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
  On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
   Hi again Neil.
   
   Forwarding on to netdev with a concern as to how often
   do_csum is used via csum_partial for very short headers
   and what impact any prefetch would have there.
   
   Also, what changed in your test environment?
   
   Why are the new values 5+% higher cycles/byte than the
   previous values?
   
   And here is the new table reformatted:
   
    len   set   iterations  Readahead cachelines vs cycles/byte
                            1       2       3       4       6       10      20
    1500B 64MB  100         1.4342  1.4300  1.4350  1.4350  1.4396  1.4315  1.4555
    1500B 128MB 100         1.4312  1.4346  1.4271  1.4284  1.4376  1.4318  1.4431
    1500B 256MB 100         1.4309  1.4254  1.4316  1.4308  1.4418  1.4304  1.4367
    1500B 512MB 100         1.4534  1.4516  1.4523  1.4563  1.4554  1.4644  1.4590
    9000B 64MB  100         0.8921  0.8924  0.8932  0.8949  0.8952  0.8939  0.8985
    9000B 128MB 100         0.8841  0.8856  0.8845  0.8854  0.8861  0.8879  0.8861
    9000B 256MB 100         0.8806  0.8821  0.8813  0.8833  0.8814  0.8827  0.8895
    9000B 512MB 100         0.8838  0.8852  0.8841  0.8865  0.8846  0.8901  0.8865
    64KB  64MB  100         0.8132  0.8136  0.8132  0.8150  0.8147  0.8149  0.8147
    64KB  128MB 100         0.8013  0.8014  0.8013  0.8020  0.8041  0.8015  0.8033
    64KB  256MB 100         0.7956  0.7959  0.7956  0.7976  0.7981  0.7967  0.7973
    64KB  512MB 100         0.7934  0.7932  0.7937  0.7951  0.7954  0.7943  0.7948
   
  
  
   There we go, that's better:
  len   set iterations  Readahead cachelines vs cycles/byte
  1   2   3   4   5   10  20
  1500B 64MB  100 1.3638  1.3288  1.3464  1.3505  1.3586  1.3527  1.3408
  1500B 128MB 100 1.3394  1.3357  1.3625  1.3456  1.3536  1.3400  1.3410
  1500B 256MB 100 1.3773  1.3362  1.3419  1.3548  1.3543  1.3442  1.4163
  1500B 512MB 100 1.3442  1.3390  1.3434  1.3505  1.3767  1.3513  1.3820
  9000B 64MB  100 0.8505  0.8492  0.8521  0.8593  0.8566  0.8577  0.8547
  9000B 128MB 100 0.8507  0.8507  0.8523  0.8627  0.8593  0.8670  0.8570
  9000B 256MB 100 0.8516  0.8515  0.8568  0.8546  0.8549  0.8609  0.8596
  9000B 512MB 100 0.8517  0.8526  0.8552  0.8675  0.8547  0.8526  0.8621
  64KB  64MB  100 0.7679  0.7689  0.7688  0.7716  0.7714  0.7722  0.7716
  64KB  128MB 100 0.7683  0.7687  0.7710  0.7690  0.7717  0.7694  0.7703
  64KB  256MB 100 0.7680  0.7703  0.7688  0.7689  0.7726  0.7717  0.7713
  64KB  512MB 100 0.7692  0.7690  0.7701  0.7705  0.7698  0.7693  0.7735
  
  
   So, the numbers are correct now that I returned my hardware to its previous
   interrupt affinity state, but the trend seems to be the same (namely that
   there isn't a clear one).  We seem to find peak performance around a
   readahead of 2 cachelines, but it's very small (about 3%), and it's
   inconsistent (larger set sizes fall to either side of that stride).  So I
   don't see it as a clear win.  I still think we should probably scrap the
   readahead for now, just take the perf bits, and revisit this when we can
   use the vector instructions or the independent carry chain instructions to
   improve this more consistently.
   
   Thoughts?
 
 Perhaps a single prefetch, not of the first addr but of
 the addr after PREFETCH_STRIDE would work best but only
  if length is > PREFETCH_STRIDE.
 
 I'd try:
 
    if (len > PREFETCH_STRIDE)
   prefetch(buf + PREFETCH_STRIDE);
   while (count64) {
   etc...
   }
 
 I still don't know how much that impacts very short lengths.
 
 Can you please add a 20 byte length to your tests?
 
 


Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
only if the overall length of the input buffer is more than 2 cache lines.
Below are the results (all counts are the average of 100 iterations of the
csum operation, as previous tests were, I just omitted that column).

len     set     cycles/byte     cycles/byte     improvement
                no prefetch     prefetch
===========================================================
20B     64MB    45.014989       44.402432        1.3%
20B     128MB   44.900317       46.146447       -2.7%
20B     256MB   45.303223       48.193623       -6.3%
20B     512MB   45.615301       44.486872        2.2%
1500B   64MB    1.364365        1.332285         1.9%
1500B   128MB   1.373945        1.335907         1.4%
1500B   256MB   1.356971        1.339084         1.2%
1500B   512MB   1.351091        1.341431         0.7%
9000B   64MB    0.850966        0.851077        -0.1%
9000B   128MB   0.851013        0.850366         0.1%
9000B   256MB   0.854212        0.851033         0.3%
9000B   
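
(For anyone wanting to reproduce the shape of these numbers in user space,
here is a rough sketch of the kind of cycles/byte measurement described
above: a 100-iteration average over buffers drawn from a large working set.
The rdtsc usage, the way the set is walked, and the byte-wise checksum stub
are assumptions for illustration; the real figures were taken against the
kernel's do_csum().)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>                  /* __rdtsc() */

#define ITERATIONS 100

/* Byte-wise stand-in for the checksum routine under test. */
static uint32_t csum_stub(const unsigned char *buf, size_t len)
{
        uint64_t sum = 0;
        for (size_t i = 0; i < len; i++)
                sum += buf[i];
        return (uint32_t)(sum + (sum >> 32));
}

int main(void)
{
        const size_t len = 1500;                /* packet-sized buffer */
        const size_t set = 64UL << 20;          /* 64MB working set */
        unsigned char *pool = malloc(set);
        volatile uint32_t sink = 0;
        uint64_t total = 0;

        if (!pool)
                return 1;
        memset(pool, 0xa5, set);

        for (size_t i = 0; i < ITERATIONS; i++) {
                /* stride through the pool so most loads miss in cache */
                unsigned char *buf = pool + (i * 8192) % (set - len);
                uint64_t t0 = __rdtsc();
                sink ^= csum_stub(buf, len);
                total += __rdtsc() - t0;
        }

        printf("%zuB over %zuMB set: %.4f cycles/byte\n",
               len, set >> 20, (double)total / ITERATIONS / len);
        free(pool);
        return 0;
}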

Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Joe Perches
On Tue, 2013-11-12 at 14:50 -0500, Neil Horman wrote:
 On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
  On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
[]
   So, the numbers are correct now that I returned my hardware to its previous
   interrupt affinity state, but the trend seems to be the same (namely that
   there isn't a clear one).  We seem to find peak performance around a
   readahead of 2 cachelines, but it's very small (about 3%), and it's
   inconsistent (larger set sizes fall to either side of that stride).  So I
   don't see it as a clear win.  I still think we should probably scrap the
   readahead for now, just take the perf bits, and revisit this when we can
   use the vector instructions or the independent carry chain instructions to
   improve this more consistently.
   
   Thoughts?
  
  Perhaps a single prefetch, not of the first addr but of
  the addr after PREFETCH_STRIDE would work best but only
  if length is > PREFETCH_STRIDE.
  
  I'd try:
  
  if (len > PREFETCH_STRIDE)
  prefetch(buf + PREFETCH_STRIDE);
  while (count64) {
  etc...
  }
  
  I still don't know how much that impacts very short lengths.
  Can you please add a 20 byte length to your tests?
 Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
 only if the overall length of the input buffer is more than 2 cache lines.
 Below are the results (all counts are the average of 100 iterations of the
 csum operation, as previous tests were, I just omitted that column).
 
  len     set     cycles/byte     cycles/byte     improvement
                  no prefetch     prefetch
  ===========================================================
  20B     64MB    45.014989       44.402432        1.3%
  20B     128MB   44.900317       46.146447       -2.7%
  20B     256MB   45.303223       48.193623       -6.3%
  20B     512MB   45.615301       44.486872        2.2%
[]
 I'm still left thinking we should just abandon the prefetch at this point and
 keep the perf code until we have new instructions to help us with this
 further, unless you see something I don't.

I tend to agree but perhaps the 3% performance
increase with a prefetch for longer lengths is
actually significant and desirable.

It doesn't seem you've done the test I suggested
where prefetch is done only for
len > PREFETCH_STRIDE.

Is it ever useful to do a prefetch of the
address/data being accessed by the next
instruction?

Anyway, thanks for doing all the work.

Joe

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-12 Thread Neil Horman
On Tue, Nov 12, 2013 at 12:38:01PM -0800, Joe Perches wrote:
 On Tue, 2013-11-12 at 14:50 -0500, Neil Horman wrote:
  On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
   On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
 []
So, the numbers are correct now that I returned my hardware to its previous
interrupt affinity state, but the trend seems to be the same (namely that
there isn't a clear one).  We seem to find peak performance around a
readahead of 2 cachelines, but it's very small (about 3%), and it's
inconsistent (larger set sizes fall to either side of that stride).  So I
don't see it as a clear win.  I still think we should probably scrap the
readahead for now, just take the perf bits, and revisit this when we can
use the vector instructions or the independent carry chain instructions to
improve this more consistently.

Thoughts?
   
   Perhaps a single prefetch, not of the first addr but of
   the addr after PREFETCH_STRIDE would work best but only
    if length is > PREFETCH_STRIDE.
   
   I'd try:
   
  if (len > PREFETCH_STRIDE)
 prefetch(buf + PREFETCH_STRIDE);
 while (count64) {
 etc...
 }
   
   I still don't know how much that impacts very short lengths.
   Can you please add a 20 byte length to your tests?
  Sure, I modified the code so that we only prefetched 2 cache lines ahead,
  but only if the overall length of the input buffer is more than 2 cache
  lines.  Below are the results (all counts are the average of 100 iterations
  of the csum operation, as previous tests were, I just omitted that column).
  
  len     set     cycles/byte     cycles/byte     improvement
                  no prefetch     prefetch
  ===========================================================
  20B     64MB    45.014989       44.402432        1.3%
  20B     128MB   44.900317       46.146447       -2.7%
  20B     256MB   45.303223       48.193623       -6.3%
  20B     512MB   45.615301       44.486872        2.2%
 []
  I'm still left thinking we should just abandon the prefetch at this point
  and keep the perf code until we have new instructions to help us with this
  further, unless you see something I don't.
 
 I tend to agree but perhaps the 3% performance
 increase with a prefetch for longer lengths is
 actually significant and desirable.
 
Maybe, but I worry that it's not going to be consistent.  At least not with
the cost of the extra comparison and jump.

 It doesn't seem you've done the test I suggested
 where prefetch is done only for
 len > PREFETCH_STRIDE.
 
No, that's exactly what I did. I did this:

#define PREFETCH_STRIDE (cache_line_size() * 2)

...

if (len > PREFETCH_STRIDE)
        prefetch(buf + PREFETCH_STRIDE);

while (count64) {
...

 Is it ever useful to do a prefetch of the
 address/data being accessed by the next
 instruction?
 
Doubtful, you need to prefetch the data far enough in advance that it's loaded
by the time you need to reference it.  Otherwise you wind up stalling the data
pipeline while the load completes.  So unless you have really fast memory, the
prefetch is effectively a no-op for the next access.  But the next cacheline
ahead is good, as it prevents the stall there.  Any more than that, though
(from this testing), seems to again be a no-op, as modern hardware
automatically issues the prefetch because it notices the linear data access
pattern.
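
A rough back-of-the-envelope version of that argument is sketched below.  The
clock frequency and DRAM latency are assumed, ballpark figures (not
measurements); the ~0.8 cycles/byte is taken from the 64KB runs above.

#include <stdio.h>

int main(void)
{
        double mem_latency_ns  = 80.0;          /* assumed DRAM load latency   */
        double cpu_ghz         = 3.0;           /* assumed clock frequency     */
        double cycles_per_line = 0.8 * 64;      /* ~0.8 cycles/byte * 64 bytes */
        double ns_per_line     = cycles_per_line / cpu_ghz;

        /* To fully hide a miss the prefetch must land this many cache
         * lines ahead of the access; with these numbers that is about
         * 4-5 lines, so a prefetch of the very next access hides
         * essentially nothing. */
        printf("useful prefetch distance: ~%.1f cache lines\n",
               mem_latency_ns / ns_per_line);
        return 0;
}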

 Anyway, thanks for doing all the work.
 
No worries, glad to do it.  Thanks for the review

Ingo, what do you think, shall I submit the perf bits as a separate thread, or
do you not want those any more?

Regards
Neil

 Joe
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/