On Tue, Jan 28, 2020 at 4:34 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > You could move
> >
> >   (match_test "TARGET_AVX")
> >     (const_string "TI")
> >
> > up to bypass the cases below.
>
> I don't think we can do that.  There are 2 cases where we prefer
> movaps/movups:
>
> /* Use packed single precision instructions where posisble.  I.e.
>    movups instead of movupd.  */
> DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL,
>           "sse_packed_single_insn_optimal",
>           m_BDVER | m_ZNVER)
>
> /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores.  */
> DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
>           m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC)
>
> We should always use movaps/movups for TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL.
> It is wrong to bypass TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL with TARGET_AVX
> as m_BDVER | m_ZNVER support AVX.

The reason for TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL on AMD targets is
only insn size, as advised in e.g. Software Optimization Guide for the
AMD Family 15h Processors [1], section 7.1.2, where it is said:

--quote--
7.1.2 Reduce Instruction Size

Optimization

Reduce the size of instructions when possible.

Rationale

Using smaller instruction sizes improves instruction fetch throughput.
Specific examples include the following:

* In SIMD code, use the single-precision (PS) form of instructions
  instead of the double-precision (PD) form. For example, for register
  to register moves, MOVAPS achieves the same result as MOVAPD, but
  uses one less byte to encode the instruction and has no prefix byte.
  Other examples in which single-precision forms can be substituted
  for double-precision forms include MOVUPS, MOVNTPS, XORPS, ORPS,
  ANDPS, and SHUFPS.
...
--/quote--

Please note that this optimization applies only to non-AVX forms; the
VEX-encoded PS and PD forms are the same size, as demonstrated by:

   0:   0f 28 c8                movaps %xmm0,%xmm1
   3:   66 0f 28 c8             movapd %xmm0,%xmm1
   7:   c5 f8 28 d1             vmovaps %xmm1,%xmm2
   b:   c5 f9 28 d1             vmovapd %xmm1,%xmm2

Also note that MOVDQA is missing from the list in the above
optimization.  It is harmful to substitute MOVDQA with MOVAPS, as it
can (and does) introduce a +1 cycle forwarding penalty between the FLT
(FPA/FPM) and INT (VALU) FP clusters.

Given the above optimization, it is obvious that
TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL handling was cargo-culted from
one pattern to another.  Its use should be reviewed and fixed where
not appropriate.

[1] https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf

Uros.
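
P.S. For anyone who wants to check the size argument mechanically, here
is a minimal sketch in Python.  The byte strings are copied from the
disassembly above; the names ENCODINGS and size_saving are made up for
illustration and are not part of GCC or binutils.  It only compares
encoding lengths and says nothing about the forwarding penalty.

```python
# Encodings copied verbatim from the disassembly in this message.
# ENCODINGS and size_saving() are hypothetical helper names.
ENCODINGS = {
    "movaps %xmm0,%xmm1": bytes.fromhex("0f28c8"),
    "movapd %xmm0,%xmm1": bytes.fromhex("660f28c8"),
    "vmovaps %xmm1,%xmm2": bytes.fromhex("c5f828d1"),
    "vmovapd %xmm1,%xmm2": bytes.fromhex("c5f928d1"),
}


def size_saving(ps_insn, pd_insn):
    """Return how many bytes the PS form saves over the PD form."""
    return len(ENCODINGS[pd_insn]) - len(ENCODINGS[ps_insn])


if __name__ == "__main__":
    for insn, enc in ENCODINGS.items():
        hex_bytes = " ".join(f"{b:02x}" for b in enc)
        print(f"{len(enc)} bytes  {hex_bytes:<12} {insn}")

    # Legacy SSE: movapd needs the 0x66 prefix, so movaps is shorter.
    print("SSE saving:",
          size_saving("movaps %xmm0,%xmm1", "movapd %xmm0,%xmm1"))
    # VEX: PS vs. PD is selected by a bit inside the prefix (f8 vs. f9),
    # so both forms have the same length.
    print("AVX saving:",
          size_saving("vmovaps %xmm1,%xmm2", "vmovapd %xmm1,%xmm2"))
```

With the two-byte VEX prefix the PS/PD distinction lives in the second
prefix byte rather than in a separate 0x66 prefix, which is why the
size rationale from the guide does not carry over to the AVX forms.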