Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
On Wed, Nov 13, 2013 at 01:32:50PM -, David Laight wrote:
> > > I'm not sure, what's the typical capacity for the branch predictor's
> > > ability to remember code paths?
> ...
> >
> > For such simple single-target branches it goes near or over a thousand for
> > recent Intel and AMD microarchitectures. Thousands for really recent CPUs.
>
> IIRC the x86 can also correctly predict simple sequences - like a branch
> in a loop that is taken every other iteration, or only after a previous
> branch is taken.
>
> Much simpler cpus may use a much simpler strategy.
> I think one I've used (an fpga soft-core cpu) just uses the low
> bits of the instruction address to index a single-bit table.
> This means that branches alias each other.
> In order to get consistent cycle counts and minimise the worst-case
> code path, we had to disable the dynamic prediction.
>
> For the checksum code the loop branch isn't a problem.
> Tests on entry to the function might get mispredicted.
>
> So if you have a conditional prefetch when the buffer is long,
> then time a short buffer after 100 long ones, you'll almost
> certainly see the misprediction penalty.
>
> FWIW I remember speeding up a copy (I think) loop on a StrongARM by
> adding an extra instruction to fetch a word from later in the buffer
> into a register I never otherwise used.
> (That was an unpaged system so I knew it couldn't fault.)

Fair enough, but the code we're looking at here is arch specific. If
StrongARMs benefit from different coding patterns, we can handle that in
that arch. This x86 implementation can still avoid worrying about branch
prediction, since its hardware handles it well.

Neil

> David
Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
* David Laight wrote:

> > > I'm not sure, what's the typical capacity for the branch predictor's
> > > ability to remember code paths?
> ...
> >
> > For such simple single-target branches it goes near or over a thousand
> > for recent Intel and AMD microarchitectures. Thousands for really
> > recent CPUs.
>
> IIRC the x86 can also correctly predict simple sequences - like a branch
> in a loop that is taken every other iteration, or only after a previous
> branch is taken.

They tend to be rather capable but not very well documented :) With a
large out-of-order execution design and 20+ pipeline stages, x86 branch
prediction accuracy is perhaps the most important design aspect for good
CPU performance.

> Much simpler cpus may use a much simpler strategy.

Yeah. The patches in this thread are about the x86 assembly
implementation of the csum routines, and for 'typical' x86 CPUs the
branch prediction units and caches are certainly sophisticated enough.

Also note that here, for real use cases, the csum routines are (or
should be) memory bandwidth limited, missing the data cache most of the
time, with a partially idling pipeline - while branch prediction
accuracy matters most when the pipeline is well fed and there are a lot
of instructions in flight.

Thanks,

	Ingo
RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
> > I'm not sure, what's the typical capacity for the branch predictor's
> > ability to remember code paths?
...
> For such simple single-target branches it goes near or over a thousand for
> recent Intel and AMD microarchitectures. Thousands for really recent CPUs.

IIRC the x86 can also correctly predict simple sequences - like a branch
in a loop that is taken every other iteration, or only after a previous
branch is taken.

Much simpler cpus may use a much simpler strategy.
I think one I've used (an fpga soft-core cpu) just uses the low
bits of the instruction address to index a single-bit table.
This means that branches alias each other.
In order to get consistent cycle counts and minimise the worst-case
code path, we had to disable the dynamic prediction.

For the checksum code the loop branch isn't a problem.
Tests on entry to the function might get mispredicted.

So if you have a conditional prefetch when the buffer is long,
then time a short buffer after 100 long ones, you'll almost
certainly see the misprediction penalty.

FWIW I remember speeding up a copy (I think) loop on a StrongARM by
adding an extra instruction to fetch a word from later in the buffer
into a register I never otherwise used.
(That was an unpaged system so I knew it couldn't fault.)

	David
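A minimal C sketch of the single-bit scheme described above (the table
size, index shift, and names are made-up illustration values, not taken
from any particular core): the low bits of the branch address index a
one-bit "taken last time" table, so branches whose addresses share
those bits alias to the same entry and overwrite each other's history.

	#include <stdint.h>
	#include <stdbool.h>

	#define PRED_BITS 10
	#define PRED_SIZE (1u << PRED_BITS)

	static bool pred_table[PRED_SIZE];	/* one bit per entry */

	static bool predict(uintptr_t pc)
	{
		/* two branches whose addresses differ only above
		 * PRED_BITS share (and fight over) the same entry */
		return pred_table[(pc >> 2) & (PRED_SIZE - 1)];
	}

	static void update(uintptr_t pc, bool taken)
	{
		pred_table[(pc >> 2) & (PRED_SIZE - 1)] = taken;
	}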
Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
* Neil Horman wrote:

> On Wed, Nov 13, 2013 at 10:09:51AM -, David Laight wrote:
> > > Sure, I modified the code so that we only prefetched 2 cache lines
> > > ahead, but only if the overall length of the input buffer is more
> > > than 2 cache lines. Below are the results (all counts are the
> > > average of 100 iterations of the csum operation, as previous tests
> > > were, I just omitted that column).
> >
> > Hmmm averaging over 10 iterations means that all the code
> > is in the i-cache and the branch predictor will be correctly primed.
> >
> > For short checksum requests I'd guess that the relevant data
> > has just been written and is already in the cpu cache (unless
> > there has been a process and cpu switch).
> > So prefetch is likely to be unnecessary.
> >
> > If you assume that the checksum code isn't in the i-cache then
> > small requests are likely to be dominated by the code size.
>
> I'm not sure, what's the typical capacity for the branch predictor's
> ability to remember code paths? I ask because the most likely use of
> do_csum will be in the receive path of the networking stack
> (specifically in the softirq handler). So if we run do_csum once,
> we're likely to run it many more times, as we clean out an adapter's
> receive queue.

For such simple single-target branches it goes near or over a thousand
for recent Intel and AMD microarchitectures. Thousands for really
recent CPUs.

Note that branch prediction caches are hierarchical and are typically
attached to cache hierarchies (where the uops are fetched from), so the
first-level BTB is typically shared between SMT CPUs that share an
icache, and L2 BTBs (which are larger and more associative) are shared
by all cores in a package.

So it's possible for some other task on another (sibling) CPU to keep
pressure on your BTB, but I'd say it's relatively rare: it's hard to do
it at a really high rate that blows away all the cache all the time.

(PeterZ has written some artificial pseudorandom branching monster just
to be able to generate cache misses and validate perf's branch stats -
but even if you deliberately want to, it's pretty hard to beat that
cache.)

I'd definitely not worry about the prediction accuracy of repetitive
loops like csum routines, they'll be cached well.

Thanks,

	Ingo
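As a rough sketch of the idea behind such a branch-stressing test (an
illustration of the general technique only, not PeterZ's actual code):
a branch whose condition comes from a PRNG is taken roughly 50% of the
time with no learnable pattern, the worst case for direction
prediction. Stressing BTB capacity would additionally require spreading
branches across many distinct code addresses, which this small function
does not do.

	#include <stdint.h>

	unsigned int stress_branches(uint64_t seed, unsigned long iters)
	{
		/* volatile keeps the compiler from converting the
		 * branch into a branchless increment */
		volatile unsigned int taken = 0;
		uint64_t x = seed ? seed : 1;	/* xorshift state must be nonzero */

		while (iters--) {
			x ^= x << 13;	/* xorshift64 step */
			x ^= x >> 7;
			x ^= x << 17;
			if (x & 1)	/* unpredictable ~50% branch */
				taken++;
		}
		return taken;
	}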
Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
On Wed, Nov 13, 2013 at 10:09:51AM -, David Laight wrote:
> > Sure, I modified the code so that we only prefetched 2 cache lines
> > ahead, but only if the overall length of the input buffer is more
> > than 2 cache lines. Below are the results (all counts are the
> > average of 100 iterations of the csum operation, as previous tests
> > were, I just omitted that column).
>
> Hmmm averaging over 10 iterations means that all the code
> is in the i-cache and the branch predictor will be correctly primed.
>
> For short checksum requests I'd guess that the relevant data
> has just been written and is already in the cpu cache (unless
> there has been a process and cpu switch).
> So prefetch is likely to be unnecessary.
>
> If you assume that the checksum code isn't in the i-cache then
> small requests are likely to be dominated by the code size.

I'm not sure, what's the typical capacity for the branch predictor's
ability to remember code paths? I ask because the most likely use of
do_csum will be in the receive path of the networking stack
(specifically in the softirq handler). So if we run do_csum once, we're
likely to run it many more times, as we clean out an adapter's receive
queue.

Neil

> David
RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
> Sure, I modified the code so that we only prefetched 2 cache lines
> ahead, but only if the overall length of the input buffer is more than
> 2 cache lines. Below are the results (all counts are the average of
> 100 iterations of the csum operation, as previous tests were, I just
> omitted that column).

Hmmm averaging over 10 iterations means that all the code
is in the i-cache and the branch predictor will be correctly primed.

For short checksum requests I'd guess that the relevant data
has just been written and is already in the cpu cache (unless
there has been a process and cpu switch).
So prefetch is likely to be unnecessary.

If you assume that the checksum code isn't in the i-cache then
small requests are likely to be dominated by the code size.

	David
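David's short-after-long scenario can be made concrete with a small
user-space sketch (a hypothetical harness: x86-only rdtsc, a trivial
stand-in for the checksum routine, and made-up buffer lengths):

	#include <stddef.h>
	#include <stdint.h>

	#define CACHE_LINE 64	/* stand-in value */

	static volatile unsigned long sink;	/* keep calls from being optimized out */

	/* stand-in checksum with a conditional prefetch on entry -
	 * the branch expected to mispredict */
	static unsigned long csum_stub(const unsigned char *buf, size_t len)
	{
		unsigned long sum = 0;
		size_t i;

		if (len > 2 * CACHE_LINE)
			__builtin_prefetch(buf + 2 * CACHE_LINE);
		for (i = 0; i < len; i++)
			sum += buf[i];
		return sum;
	}

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;
		__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
		return ((uint64_t)hi << 32) | lo;
	}

	/* train the entry branch with 100 long buffers, then time one
	 * short call, where the branch flips and should mispredict */
	uint64_t short_after_long(const unsigned char *lbuf,
				  const unsigned char *sbuf)
	{
		uint64_t t0;
		int i;

		for (i = 0; i < 100; i++)
			sink = csum_stub(lbuf, 64 * 1024);
		t0 = rdtsc();
		sink = csum_stub(sbuf, 20);
		return rdtsc() - t0;
	}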
Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
On Tue, Nov 12, 2013 at 12:38:01PM -0800, Joe Perches wrote:
> On Tue, 2013-11-12 at 14:50 -0500, Neil Horman wrote:
> > On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> > > On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> []
> > > > So, the numbers are correct now that I returned my hardware to
> > > > its previous interrupt affinity state, but the trend seems to be
> > > > the same (namely that there isn't a clear one). We seem to find
> > > > peak performance around a readahead of 2 cachelines, but it's
> > > > very small (about 3%), and it's inconsistent (larger set sizes
> > > > fall to either side of that stride). So I don't see it as a
> > > > clear win. I still think we should probably scrap the readahead
> > > > for now, just take the perf bits, and revisit this when we can
> > > > use the vector instructions or the independent carry chain
> > > > instructions to improve this more consistently.
> > > >
> > > > Thoughts?
> > >
> > > Perhaps a single prefetch, not of the first addr but of
> > > the addr after PREFETCH_STRIDE would work best, but only
> > > if length is > PREFETCH_STRIDE.
> > >
> > > I'd try:
> > >
> > > 	if (len > PREFETCH_STRIDE)
> > > 		prefetch(buf + PREFETCH_STRIDE);
> > > 	while (count64) {
> > > 		etc...
> > > 	}
> > >
> > > I still don't know how much that impacts very short lengths.
> > > Can you please add a 20 byte length to your tests?
> >
> > Sure, I modified the code so that we only prefetched 2 cache lines
> > ahead, but only if the overall length of the input buffer is more
> > than 2 cache lines. Below are the results (all counts are the
> > average of 100 iterations of the csum operation, as previous tests
> > were, I just omitted that column).
> >
> > len    set    cycles/byte  cycles/byte  improvement
> >               no prefetch  prefetch
> > ===================================================
> > 20B    64MB   45.014989    44.402432     1.3%
> > 20B    128MB  44.900317    46.146447    -2.7%
> > 20B    256MB  45.303223    48.193623    -6.3%
> > 20B    512MB  45.615301    44.486872     2.2%
> []
> > I'm still left thinking we should just abandon the prefetch at this
> > point and keep the perf code until we have new instructions to help
> > us with this further, unless you see something I don't.
>
> I tend to agree but perhaps the 3% performance
> increase with a prefetch for longer lengths is
> actually significant and desirable.

Maybe, but I worry that it's not going to be consistent. At least not
with the cost of the extra comparison and jump.

> It doesn't seem you've done the test I suggested
> where prefetch is done only for
> "len > PREFETCH_STRIDE".

No, that's exactly what I did, I did this:

	#define PREFETCH_STRIDE (cache_line_size() * 2)
	...
	if (len > PREFETCH_STRIDE)
		prefetch(buf + PREFETCH_STRIDE);
	while (count64) {
	...

> Is it ever useful to do a prefetch of the
> address/data being accessed by the next
> instruction?

Doubtful, you need to prefetch the data far enough in advance that it's
loaded by the time you need to reference it. Otherwise you wind up
stalling the data pipeline while the load completes. So unless you have
really fast memory, the prefetch is effectively a no-op for the next
access. But the next cacheline ahead is good, as it prevents the stall
there. Any more than that though (from this testing) seems to again be
a no-op, as modern hardware automatically issues the prefetch because
it notices the linear data access pattern.

> Anyway, thanks for doing all the work.

No worries, glad to do it.

Thanks for the review Ingo, what do you think, shall I submit the perf
bits as a separate thread, or do you not want those any more?

Regards
Neil

> Joe
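For reference, a stand-alone C sketch of the pattern Neil describes (an
approximation only: the real do_csum is the x86 carry-chain
implementation, cache_line_size() is a kernel helper, and a plain
summation loop stands in for the actual checksum):

	#include <stddef.h>

	#define CACHE_LINE_SIZE 64			/* stand-in for cache_line_size() */
	#define PREFETCH_STRIDE (2 * CACHE_LINE_SIZE)	/* two lines ahead, as tested */

	unsigned long csum_sketch(const void *buff, size_t len)
	{
		const unsigned long *p = buff;
		size_t count64 = len / sizeof(*p);
		unsigned long sum = 0;

		/* prefetch one stride ahead, but only when the buffer
		 * is long enough for the prefetched line to be used */
		if (len > PREFETCH_STRIDE)
			__builtin_prefetch((const char *)buff + PREFETCH_STRIDE);

		while (count64--)	/* simplified: no carry folding, no tail bytes */
			sum += *p++;
		return sum;
	}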
Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
On Tue, 2013-11-12 at 14:50 -0500, Neil Horman wrote:
> On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> > On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
[]
> > > So, the numbers are correct now that I returned my hardware to its
> > > previous interrupt affinity state, but the trend seems to be the
> > > same (namely that there isn't a clear one). We seem to find peak
> > > performance around a readahead of 2 cachelines, but it's very
> > > small (about 3%), and it's inconsistent (larger set sizes fall to
> > > either side of that stride). So I don't see it as a clear win. I
> > > still think we should probably scrap the readahead for now, just
> > > take the perf bits, and revisit this when we can use the vector
> > > instructions or the independent carry chain instructions to
> > > improve this more consistently.
> > >
> > > Thoughts?
> >
> > Perhaps a single prefetch, not of the first addr but of
> > the addr after PREFETCH_STRIDE would work best, but only
> > if length is > PREFETCH_STRIDE.
> >
> > I'd try:
> >
> > 	if (len > PREFETCH_STRIDE)
> > 		prefetch(buf + PREFETCH_STRIDE);
> > 	while (count64) {
> > 		etc...
> > 	}
> >
> > I still don't know how much that impacts very short lengths.
> > Can you please add a 20 byte length to your tests?
>
> Sure, I modified the code so that we only prefetched 2 cache lines
> ahead, but only if the overall length of the input buffer is more than
> 2 cache lines. Below are the results (all counts are the average of
> 100 iterations of the csum operation, as previous tests were, I just
> omitted that column).
>
> len    set    cycles/byte  cycles/byte  improvement
>               no prefetch  prefetch
> ===================================================
> 20B    64MB   45.014989    44.402432     1.3%
> 20B    128MB  44.900317    46.146447    -2.7%
> 20B    256MB  45.303223    48.193623    -6.3%
> 20B    512MB  45.615301    44.486872     2.2%
[]
> I'm still left thinking we should just abandon the prefetch at this
> point and keep the perf code until we have new instructions to help us
> with this further, unless you see something I don't.

I tend to agree but perhaps the 3% performance
increase with a prefetch for longer lengths is
actually significant and desirable.

It doesn't seem you've done the test I suggested
where prefetch is done only for
"len > PREFETCH_STRIDE".

Is it ever useful to do a prefetch of the
address/data being accessed by the next
instruction?

Anyway, thanks for doing all the work.

Joe
Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> > On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> > > Hi again Neil.
> > >
> > > Forwarding on to netdev with a concern as to how often
> > > do_csum is used via csum_partial for very short headers
> > > and what impact any prefetch would have there.
> > >
> > > Also, what changed in your test environment?
> > >
> > > Why are the new values 5+% higher cycles/byte than the
> > > previous values?
> > >
> > > And here is the new table reformatted:
> > >
> > > len    set    iterations  Readahead cachelines vs cycles/byte
> > >                           1      2      3      4      6      10     20
> > > 1500B  64MB   100         1.4342 1.4300 1.4350 1.4350 1.4396 1.4315 1.4555
> > > 1500B  128MB  100         1.4312 1.4346 1.4271 1.4284 1.4376 1.4318 1.4431
> > > 1500B  256MB  100         1.4309 1.4254 1.4316 1.4308 1.4418 1.4304 1.4367
> > > 1500B  512MB  100         1.4534 1.4516 1.4523 1.4563 1.4554 1.4644 1.4590
> > > 9000B  64MB   100         0.8921 0.8924 0.8932 0.8949 0.8952 0.8939 0.8985
> > > 9000B  128MB  100         0.8841 0.8856 0.8845 0.8854 0.8861 0.8879 0.8861
> > > 9000B  256MB  100         0.8806 0.8821 0.8813 0.8833 0.8814 0.8827 0.8895
> > > 9000B  512MB  100         0.8838 0.8852 0.8841 0.8865 0.8846 0.8901 0.8865
> > > 64KB   64MB   100         0.8132 0.8136 0.8132 0.8150 0.8147 0.8149 0.8147
> > > 64KB   128MB  100         0.8013 0.8014 0.8013 0.8020 0.8041 0.8015 0.8033
> > > 64KB   256MB  100         0.7956 0.7959 0.7956 0.7976 0.7981 0.7967 0.7973
> > > 64KB   512MB  100         0.7934 0.7932 0.7937 0.7951 0.7954 0.7943 0.7948
> >
> > There we go, that's better:
> >
> > len    set    iterations  Readahead cachelines vs cycles/byte
> >                           1      2      3      4      5      10     20
> > 1500B  64MB   100         1.3638 1.3288 1.3464 1.3505 1.3586 1.3527 1.3408
> > 1500B  128MB  100         1.3394 1.3357 1.3625 1.3456 1.3536 1.3400 1.3410
> > 1500B  256MB  100         1.3773 1.3362 1.3419 1.3548 1.3543 1.3442 1.4163
> > 1500B  512MB  100         1.3442 1.3390 1.3434 1.3505 1.3767 1.3513 1.3820
> > 9000B  64MB   100         0.8505 0.8492 0.8521 0.8593 0.8566 0.8577 0.8547
> > 9000B  128MB  100         0.8507 0.8507 0.8523 0.8627 0.8593 0.8670 0.8570
> > 9000B  256MB  100         0.8516 0.8515 0.8568 0.8546 0.8549 0.8609 0.8596
> > 9000B  512MB  100         0.8517 0.8526 0.8552 0.8675 0.8547 0.8526 0.8621
> > 64KB   64MB   100         0.7679 0.7689 0.7688 0.7716 0.7714 0.7722 0.7716
> > 64KB   128MB  100         0.7683 0.7687 0.7710 0.7690 0.7717 0.7694 0.7703
> > 64KB   256MB  100         0.7680 0.7703 0.7688 0.7689 0.7726 0.7717 0.7713
> > 64KB   512MB  100         0.7692 0.7690 0.7701 0.7705 0.7698 0.7693 0.7735
> >
> > So, the numbers are correct now that I returned my hardware to its
> > previous interrupt affinity state, but the trend seems to be the
> > same (namely that there isn't a clear one). We seem to find peak
> > performance around a readahead of 2 cachelines, but it's very small
> > (about 3%), and it's inconsistent (larger set sizes fall to either
> > side of that stride). So I don't see it as a clear win. I still
> > think we should probably scrap the readahead for now, just take the
> > perf bits, and revisit this when we can use the vector instructions
> > or the independent carry chain instructions to improve this more
> > consistently.
> >
> > Thoughts?
>
> Perhaps a single prefetch, not of the first addr but of
> the addr after PREFETCH_STRIDE would work best, but only
> if length is > PREFETCH_STRIDE.
>
> I'd try:
>
> 	if (len > PREFETCH_STRIDE)
> 		prefetch(buf + PREFETCH_STRIDE);
> 	while (count64) {
> 		etc...
> 	}
>
> I still don't know how much that impacts very short lengths.
> Can you please add a 20 byte length to your tests?

Sure, I modified the code so that we only prefetched 2 cache lines
ahead, but only if the overall length of the input buffer is more than
2 cache lines. Below are the results (all counts are the average of 100
iterations of the csum operation, as previous tests were, I just
omitted that column).

len    set    cycles/byte  cycles/byte  improvement
              no prefetch  prefetch
===================================================
20B    64MB   45.014989    44.402432     1.3%
20B    128MB  44.900317    46.146447    -2.7%
20B    256MB  45.303223    48.193623    -6.3%
20B    512MB  45.615301    44.486872     2.2%
1500B  64MB    1.364365     1.332285     1.9%
1500B  128MB   1.373945     1.335907     1.4%
1500B  256MB   1.356971     1.339084     1.2%
1500B  512MB   1.351091     1.341431     0.7%
9000B  64MB    0.850966     0.851077    -0.1%
9000B  128MB   0.851013     0.850366     0.1%
9000B  256MB   0.854212     0.851033     0.3%
9000B  [...]
Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> > Hi again Neil.
> >
> > Forwarding on to netdev with a concern as to how often
> > do_csum is used via csum_partial for very short headers
> > and what impact any prefetch would have there.
> >
> > Also, what changed in your test environment?
> >
> > Why are the new values 5+% higher cycles/byte than the
> > previous values?
> >
> > And here is the new table reformatted:
> >
> > len    set    iterations  Readahead cachelines vs cycles/byte
> >                           1      2      3      4      6      10     20
> > 1500B  64MB   100         1.4342 1.4300 1.4350 1.4350 1.4396 1.4315 1.4555
> > 1500B  128MB  100         1.4312 1.4346 1.4271 1.4284 1.4376 1.4318 1.4431
> > 1500B  256MB  100         1.4309 1.4254 1.4316 1.4308 1.4418 1.4304 1.4367
> > 1500B  512MB  100         1.4534 1.4516 1.4523 1.4563 1.4554 1.4644 1.4590
> > 9000B  64MB   100         0.8921 0.8924 0.8932 0.8949 0.8952 0.8939 0.8985
> > 9000B  128MB  100         0.8841 0.8856 0.8845 0.8854 0.8861 0.8879 0.8861
> > 9000B  256MB  100         0.8806 0.8821 0.8813 0.8833 0.8814 0.8827 0.8895
> > 9000B  512MB  100         0.8838 0.8852 0.8841 0.8865 0.8846 0.8901 0.8865
> > 64KB   64MB   100         0.8132 0.8136 0.8132 0.8150 0.8147 0.8149 0.8147
> > 64KB   128MB  100         0.8013 0.8014 0.8013 0.8020 0.8041 0.8015 0.8033
> > 64KB   256MB  100         0.7956 0.7959 0.7956 0.7976 0.7981 0.7967 0.7973
> > 64KB   512MB  100         0.7934 0.7932 0.7937 0.7951 0.7954 0.7943 0.7948
>
> There we go, that's better:
>
> len    set    iterations  Readahead cachelines vs cycles/byte
>                           1      2      3      4      5      10     20
> 1500B  64MB   100         1.3638 1.3288 1.3464 1.3505 1.3586 1.3527 1.3408
> 1500B  128MB  100         1.3394 1.3357 1.3625 1.3456 1.3536 1.3400 1.3410
> 1500B  256MB  100         1.3773 1.3362 1.3419 1.3548 1.3543 1.3442 1.4163
> 1500B  512MB  100         1.3442 1.3390 1.3434 1.3505 1.3767 1.3513 1.3820
> 9000B  64MB   100         0.8505 0.8492 0.8521 0.8593 0.8566 0.8577 0.8547
> 9000B  128MB  100         0.8507 0.8507 0.8523 0.8627 0.8593 0.8670 0.8570
> 9000B  256MB  100         0.8516 0.8515 0.8568 0.8546 0.8549 0.8609 0.8596
> 9000B  512MB  100         0.8517 0.8526 0.8552 0.8675 0.8547 0.8526 0.8621
> 64KB   64MB   100         0.7679 0.7689 0.7688 0.7716 0.7714 0.7722 0.7716
> 64KB   128MB  100         0.7683 0.7687 0.7710 0.7690 0.7717 0.7694 0.7703
> 64KB   256MB  100         0.7680 0.7703 0.7688 0.7689 0.7726 0.7717 0.7713
> 64KB   512MB  100         0.7692 0.7690 0.7701 0.7705 0.7698 0.7693 0.7735
>
> So, the numbers are correct now that I returned my hardware to its
> previous interrupt affinity state, but the trend seems to be the same
> (namely that there isn't a clear one). We seem to find peak
> performance around a readahead of 2 cachelines, but it's very small
> (about 3%), and it's inconsistent (larger set sizes fall to either
> side of that stride). So I don't see it as a clear win. I still think
> we should probably scrap the readahead for now, just take the perf
> bits, and revisit this when we can use the vector instructions or the
> independent carry chain instructions to improve this more
> consistently.
>
> Thoughts?

Perhaps a single prefetch, not of the first addr but of
the addr after PREFETCH_STRIDE would work best, but only
if length is > PREFETCH_STRIDE.

I'd try:

	if (len > PREFETCH_STRIDE)
		prefetch(buf + PREFETCH_STRIDE);
	while (count64) {
		etc...
	}

I still don't know how much that impacts very short lengths.
Can you please add a 20 byte length to your tests?
Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> Hi again Neil.
>
> Forwarding on to netdev with a concern as to how often
> do_csum is used via csum_partial for very short headers
> and what impact any prefetch would have there.
>
> Also, what changed in your test environment?
>
> Why are the new values 5+% higher cycles/byte than the
> previous values?
>
> And here is the new table reformatted:
>
> len    set    iterations  Readahead cachelines vs cycles/byte
>                           1      2      3      4      6      10     20
> 1500B  64MB   100         1.4342 1.4300 1.4350 1.4350 1.4396 1.4315 1.4555
> 1500B  128MB  100         1.4312 1.4346 1.4271 1.4284 1.4376 1.4318 1.4431
> 1500B  256MB  100         1.4309 1.4254 1.4316 1.4308 1.4418 1.4304 1.4367
> 1500B  512MB  100         1.4534 1.4516 1.4523 1.4563 1.4554 1.4644 1.4590
> 9000B  64MB   100         0.8921 0.8924 0.8932 0.8949 0.8952 0.8939 0.8985
> 9000B  128MB  100         0.8841 0.8856 0.8845 0.8854 0.8861 0.8879 0.8861
> 9000B  256MB  100         0.8806 0.8821 0.8813 0.8833 0.8814 0.8827 0.8895
> 9000B  512MB  100         0.8838 0.8852 0.8841 0.8865 0.8846 0.8901 0.8865
> 64KB   64MB   100         0.8132 0.8136 0.8132 0.8150 0.8147 0.8149 0.8147
> 64KB   128MB  100         0.8013 0.8014 0.8013 0.8020 0.8041 0.8015 0.8033
> 64KB   256MB  100         0.7956 0.7959 0.7956 0.7976 0.7981 0.7967 0.7973
> 64KB   512MB  100         0.7934 0.7932 0.7937 0.7951 0.7954 0.7943 0.7948

There we go, that's better:

len    set    iterations  Readahead cachelines vs cycles/byte
                          1      2      3      4      5      10     20
1500B  64MB   100         1.3638 1.3288 1.3464 1.3505 1.3586 1.3527 1.3408
1500B  128MB  100         1.3394 1.3357 1.3625 1.3456 1.3536 1.3400 1.3410
1500B  256MB  100         1.3773 1.3362 1.3419 1.3548 1.3543 1.3442 1.4163
1500B  512MB  100         1.3442 1.3390 1.3434 1.3505 1.3767 1.3513 1.3820
9000B  64MB   100         0.8505 0.8492 0.8521 0.8593 0.8566 0.8577 0.8547
9000B  128MB  100         0.8507 0.8507 0.8523 0.8627 0.8593 0.8670 0.8570
9000B  256MB  100         0.8516 0.8515 0.8568 0.8546 0.8549 0.8609 0.8596
9000B  512MB  100         0.8517 0.8526 0.8552 0.8675 0.8547 0.8526 0.8621
64KB   64MB   100         0.7679 0.7689 0.7688 0.7716 0.7714 0.7722 0.7716
64KB   128MB  100         0.7683 0.7687 0.7710 0.7690 0.7717 0.7694 0.7703
64KB   256MB  100         0.7680 0.7703 0.7688 0.7689 0.7726 0.7717 0.7713
64KB   512MB  100         0.7692 0.7690 0.7701 0.7705 0.7698 0.7693 0.7735

So, the numbers are correct now that I returned my hardware to its
previous interrupt affinity state, but the trend seems to be the same
(namely that there isn't a clear one). We seem to find peak performance
around a readahead of 2 cachelines, but it's very small (about 3%), and
it's inconsistent (larger set sizes fall to either side of that
stride). So I don't see it as a clear win. I still think we should
probably scrap the readahead for now, just take the perf bits, and
revisit this when we can use the vector instructions or the independent
carry chain instructions to improve this more consistently.

Thoughts?

Neil
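For context, a user-space sketch of how cycles/byte figures like those
above can be collected (an illustration only: the real tests exercised
the kernel's do_csum over 64MB-512MB working sets to control cache
residency, which this single-buffer stand-in does not reproduce, and
rdtsc assumes x86):

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;
		__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
		return ((uint64_t)hi << 32) | lo;
	}

	/* stand-in for the checksum routine under test */
	static unsigned long do_csum_stub(const unsigned char *buf, size_t len)
	{
		unsigned long sum = 0;
		size_t i;

		for (i = 0; i < len; i++)
			sum += buf[i];
		return sum;
	}

	int main(void)
	{
		const size_t len = 1500, iters = 100;
		unsigned char *buf = malloc(len);
		volatile unsigned long sink;	/* defeat dead-code elimination */
		uint64_t cycles = 0, t0;
		size_t i;

		if (!buf)
			return 1;
		memset(buf, 0xa5, len);
		for (i = 0; i < iters; i++) {
			t0 = rdtsc();
			sink = do_csum_stub(buf, len);
			cycles += rdtsc() - t0;
		}
		printf("%.4f cycles/byte\n", (double)cycles / (iters * len));
		free(buf);
		return 0;
	}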
Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> Hi again Neil.
>
> Forwarding on to netdev with a concern as to how often
> do_csum is used via csum_partial for very short headers
> and what impact any prefetch would have there.
>
> Also, what changed in your test environment?
>
> Why are the new values 5+% higher cycles/byte than the
> previous values?

Hmm, thank you, I didn't notice the increase. I think I rebooted my
system and failed to reset my irq affinity to avoid the processor I was
testing on. Let me rerun.

Neil

> And here is the new table reformatted:
>
> len    set    iterations  Readahead cachelines vs cycles/byte
>                           1      2      3      4      6      10     20
> 1500B  64MB   100         1.4342 1.4300 1.4350 1.4350 1.4396 1.4315 1.4555
> 1500B  128MB  100         1.4312 1.4346 1.4271 1.4284 1.4376 1.4318 1.4431
> 1500B  256MB  100         1.4309 1.4254 1.4316 1.4308 1.4418 1.4304 1.4367
> 1500B  512MB  100         1.4534 1.4516 1.4523 1.4563 1.4554 1.4644 1.4590
> 9000B  64MB   100         0.8921 0.8924 0.8932 0.8949 0.8952 0.8939 0.8985
> 9000B  128MB  100         0.8841 0.8856 0.8845 0.8854 0.8861 0.8879 0.8861
> 9000B  256MB  100         0.8806 0.8821 0.8813 0.8833 0.8814 0.8827 0.8895
> 9000B  512MB  100         0.8838 0.8852 0.8841 0.8865 0.8846 0.8901 0.8865
> 64KB   64MB   100         0.8132 0.8136 0.8132 0.8150 0.8147 0.8149 0.8147
> 64KB   128MB  100         0.8013 0.8014 0.8013 0.8020 0.8041 0.8015 0.8033
> 64KB   256MB  100         0.7956 0.7959 0.7956 0.7976 0.7981 0.7967 0.7973
> 64KB   512MB  100         0.7934 0.7932 0.7937 0.7951 0.7954 0.7943 0.7948
>
> Forwarded Message
> From: Neil Horman
> To: Joe Perches
> Cc: Dave Jones, linux-kernel@vger.kernel.org, sebastien.du...@bull.net,
>     Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x...@kernel.org
> Subject: Re: [PATCH v2 2/2] x86: add prefetching to do_csum
>
> On Fri, Nov 08, 2013 at 12:29:07PM -0800, Joe Perches wrote:
> > On Fri, 2013-11-08 at 15:14 -0500, Neil Horman wrote:
> > > On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
> > > > On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> > > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > > > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > > > > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > > > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > > > > > > > do_csum was identified via perf recently as a hot spot
> > > > > > > > > when doing receive on ip over infiniband workloads.
> > > > > > > > > After a lot of testing and ideas, we found the best
> > > > > > > > > optimization available to us currently is to prefetch
> > > > > > > > > the entire data buffer prior to doing the checksum
> > > > > > []
> > > > > > > I'll fix this up and send a v3, but I'll give it a day in
> > > > > > > case there are more comments first.
> > > > > >
> > > > > > Perhaps a reduction in prefetch loop count helps.
> > > > > >
> > > > > > Was capping the amount prefetched and letting the
> > > > > > hardware prefetch also tested?
> > > > > >
> > > > > > prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > > > >
> > > > > Just tested this out:
> > > >
> > > > Thanks.
> > > >
> > > > Reformatting the table so it's a bit more
> > > > readable/comparable for me:
> > > >
> > > > len    SetSz  Loops  cycles/byte
> > > >                      limited  unlimited
> > > > 1500B  64MB   1M     1.3442   1.3605
> > > > 1500B  128MB  1M     1.3410   1.3542
> > > > 1500B  256MB  1M     1.3536   1.3710
> > > > 1500B  512MB  1M     1.3463   1.3536
> > > > 9000B  64MB   1M     0.8522   0.8504
> > > > 9000B  128MB  1M     0.8528   0.8536
> > > > 9000B  256MB  1M     0.8532   0.8520
> > > > 9000B  512MB  1M     0.8527   0.8525
> > > > 64KB   64MB   1M     0.7686   0.7683
> > > > 64KB   128MB  1M     0.7695   0.7686
> > > > 64KB   256MB  1M     0.7699   0.7708
> > > > 64KB   512MB  1M     0.7799   0.7694
> > > >
> > > > This data appears to show some value
> > > > in capping for 1500b lengths and noise
> > > > for shorter and longer lengths.
> > > >
> > > > Any idea what the actual distribution of
> > > > do_csum lengths is under various loads?
> > >
> > > I don't have any hard data no, sorry.
> >
> > I think you should before you implement this.
> > You might find extremely short lengths.
> >
> > > I'll cap the prefetch at 1500B for now, since it
> > > doesn't seem to hurt or help beyond that
> >
> > The table data has a max prefetch of
> > 8 * boot_cpu_data.x86_cache_alignment so
> > I believe it's always less than 1500 but
> > perhaps 4 might be slightly better still.
>
> So, you appear to be correct: I reran my test set with different
> prefetch ceilings and got the results below. There are some cases in
> which there is a performance gain, but the gain is small, and occurs
> at different spots depending on the input buffer size (though most
> peak gains appear around 2 cache lines). I'm guessing it takes about
> 2 prefetches before hardware [...]
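A sketch of what a capped prefetch loop like the prefetch_lines() call
quoted above might look like (assumed shape only; the kernel's actual
helper and boot_cpu_data.x86_cache_alignment are not reproduced here):

	#include <stddef.h>

	#define CACHE_LINE 64	/* stand-in for boot_cpu_data.x86_cache_alignment */

	/* one software prefetch per cache line over the first len bytes */
	static inline void prefetch_lines(const void *addr, size_t len)
	{
		const char *p = addr;
		const char *end = p + len;

		while (p < end) {
			__builtin_prefetch(p);
			p += CACHE_LINE;
		}
	}

	/* capped as suggested: at most 8 cache lines, less for short buffers */
	static inline void prefetch_capped(const void *buf, size_t len)
	{
		size_t cap = 8u * CACHE_LINE;

		prefetch_lines(buf, len < cap ? len : cap);
	}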