Re: AVX generic mode tuning discussion.
On Mon, Jan 7, 2013 at 7:21 PM, Jagasia, Harsha harsha.jaga...@amd.com wrote:

>> We would like to propose changing AVX generic mode tuning to generate
>> 128-bit AVX instead of 256-bit AVX.
>
> You indicate a 3% reduction on bulldozer with avx256. How does avx128
> compare to -mno-avx -msse4.2? Will the next AMD generation have a useable
> avx256? I'm not keen on the idea of generic mode being tuned for a single
> processor revision that maybe shouldn't actually be using avx at all.
>
> Btw, it looks like the data is massively skewed by 436.cactusADM. What are
> the overall numbers if you disregard cactus?
>
> It's also for sure the case that the vectorizer cost model has not been
> touched for avx256 vs. avx128 vs. sse, so a more sensible approach would
> be to look at differentiating things there to improve the cactus numbers.
> Harsha, did you investigate why avx256 is such a loss for cactus or why it
> is so much of a win for SB?

I know this thread has been left open from our end for a while now, but we
(AMD) would really like to re-open this discussion. So here goes.

We did investigate why cactus is slower in avx-256 mode than in avx-128
mode on AMD processors. Using the -Ofast flag (with the appropriate flags
to generate avx-128 or avx-256 code) and running with the reference data
set, we observe the following runtimes on Bulldozer:

                                 Runtime   %Diff AVX-256 vs AVX-128
  AVX128                         616s
  AVX256 with store splitting    853s      38%

Scheduling and predictive commoning are turned off in the compiler for both
cases, so that the code generated for the avx-128 and avx-256 cases is
mostly equivalent, i.e. avx-128 instructions on one side are simply
replaced by avx-256 instructions on the other.

Looking at the cactus source and the oprofile reports, the hottest loop
nest is a triply nested loop. The innermost loop of this nest has ~400
lines of Fortran code and takes up 99% of the run time of the benchmark.
Gcc vectorizes the innermost loop in both the 128-bit and 256-bit cases.
In order to vectorize the innermost loop, gcc generates a scalar SIMD
prologue loop to align the relevant vectors, followed by a packed avx SIMD
loop, followed by a scalar SIMD epilogue loop to handle what is left after
a whole multiple of the vectorization factor is taken care of. Here are the
oprofile samples seen in the AVX-128 and AVX-256 cases for the innermost
Fortran loop's 3 components:

  Oprofile samples          AVX-128   AVX-256-ss   Gap in samples   Gap as % of total runtime
  Total                     153408    214448       61040            38%
  SIMD vector loop          135653    183074       47421            30%
  SIMD scalar prolog loop     3817     10434        6617             4%
  SIMD scalar epilog loop     3471     10072        6601             4%

The avx-256 code is spending 30% more time in the SIMD vector loop than the
avx-128 code. The code generated for this vector loop appears to be
equivalent in the 128-bit and 256-bit cases, i.e. avx-128 instructions on
one side are simply replaced by avx-256 instructions on the other. The
instruction mix and scheduling are the same, except for the spilling and
loading of one variable. We know this gap exists because fewer physical
registers are available for renaming to the avx-256 code, since our
processor loses the upper halves of the FP registers for renaming. The
entire SIMD pipeline in our processor is 128 bits wide, and we have no
native 256-bit datapath, even for foreseeable future generations, unlike
Sandybridge/Ivybridge.

The avx-256 code is spending 8% more time in the SIMD scalar prologue and
epilogue than the avx-128 code. The code generated for these scalar loops
is exactly the same in the 128-bit and 256-bit cases - i.e. the exact same
instruction mix and scheduling. The reason for the gap is the number of
iterations that gcc executes in these loops in the two cases. Gcc follows
Sandybridge's recommendation and aligns avx-256 vectors to a 32-byte
boundary instead of a 16-byte boundary, even on Bulldozer. The Sandybridge
Software Optimization Guide states that the optimal memory alignment of an
AVX 256-bit vector stored in memory is 32 bytes.
The Bulldozer Software Optimization Guide says: "Align all packed
floating-point data on 16-byte boundaries." In the case of cactus, the
relevant double vector has 118 elements
RE: AVX generic mode tuning discussion.
>> We would like to propose changing AVX generic mode tuning to generate
>> 128-bit AVX instead of 256-bit AVX.
>
> You indicate a 3% reduction on bulldozer with avx256. How does avx128
> compare to -mno-avx -msse4.2?

We see these % differences going from SSE42 to AVX128 to AVX256 on
Bulldozer with -mtune=generic -Ofast. (Positive is improvement, negative is
degradation.)

  Bulldozer:      AVX128/SSE42   AVX256/AVX128
  410.bwaves         -1.4%          -1.4%
  416.gamess         -1.1%           0.0%
  433.milc            0.5%          -2.4%
  434.zeusmp          9.7%          -2.1%
  435.gromacs         5.1%           0.5%
  436.cactusADM       8.2%         -23.8%
  437.leslie3d        8.1%           0.4%
  444.namd            3.6%           0.0%
  447.dealII         -1.4%          -0.4%
  450.soplex         -0.4%          -0.4%
  453.povray          0.0%          -1.5%
  454.calculix       15.7%          -8.3%
  459.GemsFDTD        4.9%           1.4%
  465.tonto           1.3%          -0.6%
  470.lbm             0.9%           0.3%
  481.wrf             7.3%          -3.6%
  482.sphinx3         5.0%          -9.8%
  SPECFP              3.8%          -3.2%

> Will the next AMD generation have a useable avx256? I'm not keen on the
> idea of generic mode being tuned for a single processor revision that
> maybe shouldn't actually be using avx at all.

We see a substantial gain in several SPECFP benchmarks going from SSE42 to
AVX128 on Bulldozer. IMHO, accomplishing even a 5% gain in an individual
benchmark takes a hardware company several man-months. The loss with AVX256
on Bulldozer is much more significant than the gain on SandyBridge. While
the general trend in the industry is a move toward AVX256, for now we would
be disadvantaging Bulldozer with this choice.

We have several customers who use -mtune=generic, and it is the default
unless a user explicitly overrides it with -mtune=native. These are users
who want to experiment with the latest ISA using gcc but want to keep their
ISA selection and tuning agnostic on x86/64. IMHO, it is with these
customers in mind that generic was introduced in the first place.

Since stage 1 closure is around the corner, I just wanted to ping to see if
the maintainers have made up their mind on this one. AVX-128 is an
improvement over SSE42 for Bulldozer, and AVX-256 wipes out pretty much all
of that gain in generic mode.
Until there is convergence on AVX-256 for x86/64, we would like to propose
having generic generate avx-128 by default, with a manual user override to
avx-256 when it is known to benefit performance.

> Did somebody spend the time analyzing why CactusADM shows so much of a
> difference? With the recent improvements in vectorizing for AVX, did you
> re-do the measurements with a recent trunk? I don't think disabling
> avx-256 by default is a good idea until we understand why these numbers
> happen and are convinced we cannot fix this by proper cost modeling.

We have observed cases where AVX 256-bit code is slower than AVX 128-bit
code on Bulldozer. This is because internally the front end, data paths,
etc. of Bulldozer are designed for optimal AVX 128-bit execution. Throwing
densely packed 256-bit code at the pipeline can congest the front end,
causing stalls and hence slowdowns. We expect the behavior of cactus,
calculix and sphinx, the 3 benchmarks with the biggest avx-256 gaps, to be
in the same vein. In general, our hardware design engineers recommend
running AVX 128-bit code on Bulldozer. Given the underlying hardware
design, software tuning can't really change the results here. Any further
analysis of cactus would be a cycle sink at our end, and we may not even be
able to discuss the details on a public mailing list. x86/64 has not yet
converged on avx-256, and generic mode should reflect that.

Posting the re-measurements on trunk for cactus, calculix and sphinx on
Bulldozer:

                  AVX128/SSE42   AVX256/AVX128
  436.cactusADM       10%            -30%
  454.calculix        14.7%           -6%
  482.sphinx3          7%             -9%

All positive % above are improvements; all negative % are degradations. I
will post re-measurements for all of Spec with the latest trunk as soon as
I have them. Thoughts?

Thanks,
Harsha
Re: AVX generic mode tuning discussion.
On Wed, Nov 2, 2011 at 5:57 PM, Jagasia, Harsha harsha.jaga...@amd.com wrote:

>>> We would like to propose changing AVX generic mode tuning to generate
>>> 128-bit AVX instead of 256-bit AVX.
>>
>> You indicate a 3% reduction on bulldozer with avx256. How does avx128
>> compare to -mno-avx -msse4.2?
>
> We see these % differences going from SSE42 to AVX128 to AVX256 on
> Bulldozer with -mtune=generic -Ofast. (Positive is improvement, negative
> is degradation.)
>
>   Bulldozer:      AVX128/SSE42   AVX256/AVX128
>   410.bwaves         -1.4%          -1.4%
>   416.gamess         -1.1%           0.0%
>   433.milc            0.5%          -2.4%
>   434.zeusmp          9.7%          -2.1%
>   435.gromacs         5.1%           0.5%
>   436.cactusADM       8.2%         -23.8%
>   437.leslie3d        8.1%           0.4%
>   444.namd            3.6%           0.0%
>   447.dealII         -1.4%          -0.4%
>   450.soplex         -0.4%          -0.4%
>   453.povray          0.0%          -1.5%
>   454.calculix       15.7%          -8.3%
>   459.GemsFDTD        4.9%           1.4%
>   465.tonto           1.3%          -0.6%
>   470.lbm             0.9%           0.3%
>   481.wrf             7.3%          -3.6%
>   482.sphinx3         5.0%          -9.8%
>   SPECFP              3.8%          -3.2%
>
>> Will the next AMD generation have a useable avx256? I'm not keen on the
>> idea of generic mode being tuned for a single processor revision that
>> maybe shouldn't actually be using avx at all.
>
> We see a substantial gain in several SPECFP benchmarks going from SSE42
> to AVX128 on Bulldozer. IMHO, accomplishing even a 5% gain in an
> individual benchmark takes a hardware company several man-months. The
> loss with AVX256 on Bulldozer is much more significant than the gain on
> SandyBridge. While the general trend in the industry is a move toward
> AVX256, for now we would be disadvantaging Bulldozer with this choice.
>
> We have several customers who use -mtune=generic, and it is the default
> unless a user explicitly overrides it with -mtune=native. These are users
> who want to experiment with the latest ISA using gcc but want to keep
> their ISA selection and tuning agnostic on x86/64. IMHO, it is with these
> customers in mind that generic was introduced in the first place.
>
> Since stage 1 closure is around the corner, I just wanted to ping to see
> if the maintainers have made up their mind on this one.
> AVX-128 is an improvement over SSE42 for Bulldozer, and AVX-256 wipes out
> pretty much all of that gain in generic mode. Until there is convergence
> on AVX-256 for x86/64, we would like to propose having generic generate
> avx-128 by default, with a manual user override to avx-256 when it is
> known to benefit performance.
>
>> Did somebody spend the time analyzing why CactusADM shows so much of a
>> difference? With the recent improvements in vectorizing for AVX, did you
>> re-do the measurements with a recent trunk? I don't think disabling
>> avx-256 by default is a good idea until we understand why these numbers
>> happen and are convinced we cannot fix this by proper cost modeling.
>
> We have observed cases where AVX 256-bit code is slower than AVX 128-bit
> code on Bulldozer. This is because internally the front end, data paths,
> etc. of Bulldozer are designed for optimal AVX 128-bit execution.
> Throwing densely packed 256-bit code at the pipeline can congest the
> front end, causing stalls and hence slowdowns. We expect the behavior of
> cactus, calculix and sphinx, the 3 benchmarks with the biggest avx-256
> gaps, to be in the same vein. In general, our hardware design engineers
> recommend running AVX 128-bit code on Bulldozer. Given the underlying
> hardware design, software tuning can't really change the results here.
> Any further analysis of cactus would be a cycle sink at our end, and we
> may not even be able to discuss the details on a public mailing list.
> x86/64 has not yet converged on avx-256, and generic mode should reflect
> that.

Well, generic hasn't converged on AVX at all. Cost modeling can deal with
code density just fine - are there any differences between the code density
issues of, say, loads vs. stores vs. arithmetic? I specifically ask about
analysis because AVX-256 has instruction set issues for certain patterns
the vectorizer generates, and the cost model currently does not reflect
these at all.

Richard.
Posting the re-measurements on trunk for cactus, calculix and sphinx on
Bulldozer:

                  AVX128/SSE42   AVX256/AVX128
  436.cactusADM       10%            -30%
  454.calculix        14.7%           -6%
  482.sphinx3          7%             -9%

All positive % above are improvements; all negative % are degradations. I
will post re-measurements for all of Spec
Re: AVX generic mode tuning discussion.
On Mon, Oct 31, 2011 at 9:36 PM, Jagasia, Harsha harsha.jaga...@amd.com wrote:

>>> We would like to propose changing AVX generic mode tuning to generate
>>> 128-bit AVX instead of 256-bit AVX.
>>
>> You indicate a 3% reduction on bulldozer with avx256. How does avx128
>> compare to -mno-avx -msse4.2?
>
> We see these % differences going from SSE42 to AVX128 to AVX256 on
> Bulldozer with -mtune=generic -Ofast. (Positive is improvement, negative
> is degradation.)
>
>   Bulldozer:      AVX128/SSE42   AVX256/AVX128
>   410.bwaves         -1.4%          -1.4%
>   416.gamess         -1.1%           0.0%
>   433.milc            0.5%          -2.4%
>   434.zeusmp          9.7%          -2.1%
>   435.gromacs         5.1%           0.5%
>   436.cactusADM       8.2%         -23.8%
>   437.leslie3d        8.1%           0.4%
>   444.namd            3.6%           0.0%
>   447.dealII         -1.4%          -0.4%
>   450.soplex         -0.4%          -0.4%
>   453.povray          0.0%          -1.5%
>   454.calculix       15.7%          -8.3%
>   459.GemsFDTD        4.9%           1.4%
>   465.tonto           1.3%          -0.6%
>   470.lbm             0.9%           0.3%
>   481.wrf             7.3%          -3.6%
>   482.sphinx3         5.0%          -9.8%
>   SPECFP              3.8%          -3.2%
>
>> Will the next AMD generation have a useable avx256? I'm not keen on the
>> idea of generic mode being tuned for a single processor revision that
>> maybe shouldn't actually be using avx at all.
>
> We see a substantial gain in several SPECFP benchmarks going from SSE42
> to AVX128 on Bulldozer. IMHO, accomplishing even a 5% gain in an
> individual benchmark takes a hardware company several man-months. The
> loss with AVX256 on Bulldozer is much more significant than the gain on
> SandyBridge. While the general trend in the industry is a move toward
> AVX256, for now we would be disadvantaging Bulldozer with this choice.
>
> We have several customers who use -mtune=generic, and it is the default
> unless a user explicitly overrides it with -mtune=native. These are users
> who want to experiment with the latest ISA using gcc but want to keep
> their ISA selection and tuning agnostic on x86/64. IMHO, it is with these
> customers in mind that generic was introduced in the first place.
>
> Since stage 1 closure is around the corner, I just wanted to ping to see
> if the maintainers have made up their mind on this one. AVX-128 is an
> improvement over SSE42 for Bulldozer, and AVX-256 wipes out pretty much
> all of that gain in generic mode. Until there is convergence on AVX-256
> for x86/64, we would like to propose having generic generate avx-128 by
> default, with a manual user override to avx-256 when it is known to
> benefit performance.
>
> Thanks,
> Harsha

Did somebody spend the time analyzing why CactusADM shows so much of a
difference? With the recent improvements in vectorizing for AVX, did you
re-do the measurements with a recent trunk? I don't think disabling avx-256
by default is a good idea until we understand why these numbers happen and
are convinced we cannot fix this by proper cost modeling.

Richard.
RE: AVX generic mode tuning discussion.
>> We would like to propose changing AVX generic mode tuning to generate
>> 128-bit AVX instead of 256-bit AVX.
>
> You indicate a 3% reduction on bulldozer with avx256. How does avx128
> compare to -mno-avx -msse4.2?

We see these % differences going from SSE42 to AVX128 to AVX256 on
Bulldozer with -mtune=generic -Ofast. (Positive is improvement, negative is
degradation.)

  Bulldozer:      AVX128/SSE42   AVX256/AVX128
  410.bwaves         -1.4%          -1.4%
  416.gamess         -1.1%           0.0%
  433.milc            0.5%          -2.4%
  434.zeusmp          9.7%          -2.1%
  435.gromacs         5.1%           0.5%
  436.cactusADM       8.2%         -23.8%
  437.leslie3d        8.1%           0.4%
  444.namd            3.6%           0.0%
  447.dealII         -1.4%          -0.4%
  450.soplex         -0.4%          -0.4%
  453.povray          0.0%          -1.5%
  454.calculix       15.7%          -8.3%
  459.GemsFDTD        4.9%           1.4%
  465.tonto           1.3%          -0.6%
  470.lbm             0.9%           0.3%
  481.wrf             7.3%          -3.6%
  482.sphinx3         5.0%          -9.8%
  SPECFP              3.8%          -3.2%

> Will the next AMD generation have a useable avx256? I'm not keen on the
> idea of generic mode being tuned for a single processor revision that
> maybe shouldn't actually be using avx at all.

We see a substantial gain in several SPECFP benchmarks going from SSE42 to
AVX128 on Bulldozer. IMHO, accomplishing even a 5% gain in an individual
benchmark takes a hardware company several man-months. The loss with AVX256
on Bulldozer is much more significant than the gain on SandyBridge. While
the general trend in the industry is a move toward AVX256, for now we would
be disadvantaging Bulldozer with this choice.

We have several customers who use -mtune=generic, and it is the default
unless a user explicitly overrides it with -mtune=native. These are users
who want to experiment with the latest ISA using gcc but want to keep their
ISA selection and tuning agnostic on x86/64. IMHO, it is with these
customers in mind that generic was introduced in the first place.

Since stage 1 closure is around the corner, I just wanted to ping to see if
the maintainers have made up their mind on this one. AVX-128 is an
improvement over SSE42 for Bulldozer, and AVX-256 wipes out pretty much all
of that gain in generic mode. Until there is convergence on AVX-256 for
x86/64, we would like to propose having generic generate avx-128 by
default, with a manual user override to avx-256 when it is known to benefit
performance.

Thanks,
Harsha
RE: AVX generic mode tuning discussion.
> On 07/12/2011 02:22 PM, harsha.jaga...@amd.com wrote:
>> We would like to propose changing AVX generic mode tuning to generate
>> 128-bit AVX instead of 256-bit AVX.
>
> You indicate a 3% reduction on bulldozer with avx256. How does avx128
> compare to -mno-avx -msse4.2?

We see these % differences going from SSE42 to AVX128 to AVX256 on
Bulldozer with -mtune=generic -Ofast. (Positive is improvement, negative is
degradation.)

  Bulldozer:      AVX128/SSE42   AVX256/AVX128
  410.bwaves         -1.4%          -1.4%
  416.gamess         -1.1%           0.0%
  433.milc            0.5%          -2.4%
  434.zeusmp          9.7%          -2.1%
  435.gromacs         5.1%           0.5%
  436.cactusADM       8.2%         -23.8%
  437.leslie3d        8.1%           0.4%
  444.namd            3.6%           0.0%
  447.dealII         -1.4%          -0.4%
  450.soplex         -0.4%          -0.4%
  453.povray          0.0%          -1.5%
  454.calculix       15.7%          -8.3%
  459.GemsFDTD        4.9%           1.4%
  465.tonto           1.3%          -0.6%
  470.lbm             0.9%           0.3%
  481.wrf             7.3%          -3.6%
  482.sphinx3         5.0%          -9.8%
  SPECFP              3.8%          -3.2%

> Will the next AMD generation have a useable avx256? I'm not keen on the
> idea of generic mode being tuned for a single processor revision that
> maybe shouldn't actually be using avx at all.

We see a substantial gain in several SPECFP benchmarks going from SSE42 to
AVX128 on Bulldozer. IMHO, accomplishing even a 5% gain in an individual
benchmark takes a hardware company several man-months. The loss with AVX256
on Bulldozer is much more significant than the gain on SandyBridge. While
the general trend in the industry is a move toward AVX256, for now we would
be disadvantaging Bulldozer with this choice.

We have several customers who use -mtune=generic, and it is the default
unless a user explicitly overrides it with -mtune=native. These are users
who want to experiment with the latest ISA using gcc but want to keep their
ISA selection and tuning agnostic on x86/64. IMHO, it is with these
customers in mind that generic was introduced in the first place.

Thanks,
Harsha
RE: AVX generic mode tuning discussion.
>> We would like to propose changing AVX generic mode tuning to generate
>> 128-bit AVX instead of 256-bit AVX.
>
> You indicate a 3% reduction on bulldozer with avx256. How does avx128
> compare to -mno-avx -msse4.2? Will the next AMD generation have a useable
> avx256? I'm not keen on the idea of generic mode being tuned for a single
> processor revision that maybe shouldn't actually be using avx at all.
>
> Btw, it looks like the data is massively skewed by 436.cactusADM. What
> are the overall numbers if you disregard cactus?

Disregarding cactus, these are the cumulative SPECFP scores we see:

  On Bulldozer:      AVX256/AVX128
  SPECFP                -1.8%

  On SandyBridge:    AVX256/AVX128
  SPECFP                -0.15%

> It's also for sure the case that the vectorizer cost model has not been
> touched for avx256 vs. avx128 vs. sse, so a more sensible approach would
> be to look at differentiating things there to improve the cactus numbers.

I am not sure how much the vectorizer cost model can help here. The cost
model can decide whether to vectorize and/or what vectorization factor to
use. But in generic mode, that decision has to be processor-family neutral
anyway.

> Harsha, did you investigate why avx256 is such a loss for cactus or why
> it is so much of a win for SB?

We are planning to investigate cactus and other cases to better understand
the reasons behind these observations on Bulldozer, but disregarding
cactus, there appear to be no significant gains on Sandybridge with AVX256
over AVX128 either.

Thanks,
Harsha
Re: AVX generic mode tuning discussion.
On Tue, Jul 12, 2011 at 11:56 PM, Richard Henderson r...@redhat.com wrote:

> On 07/12/2011 02:22 PM, harsha.jaga...@amd.com wrote:
>> We would like to propose changing AVX generic mode tuning to generate
>> 128-bit AVX instead of 256-bit AVX.
>
> You indicate a 3% reduction on bulldozer with avx256. How does avx128
> compare to -mno-avx -msse4.2? Will the next AMD generation have a useable
> avx256? I'm not keen on the idea of generic mode being tuned for a single
> processor revision that maybe shouldn't actually be using avx at all.
>
> Btw, it looks like the data is massively skewed by 436.cactusADM. What
> are the overall numbers if you disregard cactus?
>
> It's also for sure the case that the vectorizer cost model has not been
> touched for avx256 vs. avx128 vs. sse, so a more sensible approach would
> be to look at differentiating things there to improve the cactus numbers.
> Harsha, did you investigate why avx256 is such a loss for cactus or why
> it is so much of a win for SB?

I suppose generic tuning is of less importance for AVX, as people need to
enable that manually anyway (and will possibly do so only by means of
-march=native).

Thanks,
Richard.

> r~
Re: AVX generic mode tuning discussion.
On Wed, Jul 13, 2011 at 10:42:41AM +0200, Richard Guenther wrote:

> I suppose generic tuning is of less importance for AVX, as people need to
> enable that manually anyway (and will possibly do so only by means of
> -march=native).

Yeah, but if somebody does compile with -mavx -mtune=generic, I'd expect
the intent is that he wants the fastest code not just on the current
generation of CPUs, but on the next few following ones, and I'd say that
being able to use a twice-as-big vectorization factor ought to be a win in
most cases if the cost model gets it right. If not for the doubled
vectorization factor, what would be the reasons why somebody would compile
code with -mavx -mtune=generic and rule out support for many recent chips?
Yeah, there are the 2-operand forms, and such code can avoid a penalty when
mixed with AVX256 code, but would that be a strong enough reason to lose
the support of most of the recent CPUs? When targeting just a particular
CPU and using -march= with a CPU which already includes AVX, -mtune=generic
probably doesn't make much sense; you probably want -march=native, so that
you are optimizing for the CPU you have.

Jakub
Re: AVX generic mode tuning discussion.
On 07/12/2011 02:22 PM, harsha.jaga...@amd.com wrote:

> We would like to propose changing AVX generic mode tuning to generate
> 128-bit AVX instead of 256-bit AVX.

You indicate a 3% reduction on bulldozer with avx256. How does avx128
compare to -mno-avx -msse4.2?

Will the next AMD generation have a useable avx256? I'm not keen on the
idea of generic mode being tuned for a single processor revision that maybe
shouldn't actually be using avx at all.

r~