On Wed, Dec 12, 2012 at 4:16 PM, Xinliang David Li <davi...@google.com> wrote:
> On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> Concerning 1 push per cycle, I think it is the same as what the K7
>> hardware did, so a move prologue should be a win.
>>> > Index: config/i386/i386.c
>>> > ===================================================================
>>> > --- config/i386/i386.c (revision 194452)
>>> > +++ config/i386/i386.c (working copy)
>>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>> >   COSTS_N_INSNS (8),	/* cost of FABS instruction.  */
>>> >   COSTS_N_INSNS (8),	/* cost of FCHS instruction.  */
>>> >   COSTS_N_INSNS (40),	/* cost of FSQRT instruction.  */
>>> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>>> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>> >    {-1, libcall, false}}}},
>>> >   {{libcall, {{6, loop_1_byte, true},
>>> >    {24, loop, true},
>>> >    {8192, rep_prefix_4_byte, true},
>>> >    {-1, libcall, false}}},
>>> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>
>> libcall is not faster up to 8KB than the rep sequence, which is better
>> for regalloc/code cache than a full-blown function call.
>
> Be careful with this.  My recollection is that REP sequence is good for

s/good/not good/

David

> any size -- for smaller sizes, the REP initial setup cost is too high
> (10s of cycles), while for large size copies it is less efficient
> compared with the library version.
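To make that tradeoff concrete, here is a minimal stand-alone sketch
(GNU C, x86-64 only; the rdtsc-based timing and the rep_movsb helper
are illustrative, not part of the patch) comparing the fixed setup cost
of REP MOVSB against a memcpy libcall.  If David's recollection is
right, rep movsb should lose on the small sizes and catch up on the
large ones:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static inline uint64_t
rdtsc (void)
{
  uint32_t lo, hi;
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t) hi << 32) | lo;
}

static inline void
rep_movsb (void *dst, const void *src, size_t n)
{
  __asm__ __volatile__ ("rep movsb"
			: "+D" (dst), "+S" (src), "+c" (n)
			: : "memory");
}

int
main (void)
{
  static char src[1 << 16], dst[1 << 16];
  size_t n;

  for (n = 16; n <= sizeof src; n <<= 2)
    {
      uint64_t t0 = rdtsc ();
      rep_movsb (dst, src, n);
      uint64_t t1 = rdtsc ();
      memcpy (dst, src, n);
      uint64_t t2 = rdtsc ();
      /* Keep the memcpy from being optimized away.  */
      __asm__ __volatile__ ("" : : "r" (dst) : "memory");
      printf ("%6zu bytes: rep movsb %6llu cycles, memcpy %6llu cycles\n",
	      n, (unsigned long long) (t1 - t0),
	      (unsigned long long) (t2 - t1));
    }
  return 0;
}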
>
>
>>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_PPRO,
>>> >
>>> >   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>>> > -  m_CORE2I7 | m_GENERIC,
>>> > +  m_GENERIC | m_CORE2,
>>
>> This disables shifts that store just some flags.  According to Agner's
>> manual, i7 handles this well.
>>
>
> ok.
>
>> Partial flags stall
>> The Sandy Bridge uses the method of an extra µop to join partial
>> registers not only for general purpose registers but also for the flags
>> register, unlike previous processors which used this method only for
>> general purpose registers.  This occurs when a write to a part of the
>> flags register is followed by a read from a larger part of the flags
>> register.  The partial flags stall of previous processors (see page 75)
>> is therefore replaced by an extra µop.  The Sandy Bridge also generates
>> an extra µop when reading the flags after a rotate instruction.
>>
>> This is cheaper than the 7 cycle delay on Core that this flag is trying
>> to avoid.
>
> ok.
>
>>> >
>>> >   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>> >    * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_K6,
>>> >
>>> >   /* X86_TUNE_USE_CLTD */
>>> > -  ~(m_PENT | m_ATOM | m_K6),
>>> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
> My change was to enable CLTD for generic.  Is your change intended to
> revert that?
>
>>
>> None of the CPUs that generic cares about are !USE_CLTD now after your
>> change.
>>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_ATHLON_K8,
>>> >
>>> >   /* X86_TUNE_SSE_TYPELESS_STORES */
>>> > -  m_AMD_MULTIPLE,
>>> > +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>>
>> Hmm, I cannot seem to find this in the manual now, but I believe that
>> stores also do not type, so a movaps store is preferred over a movapd
>> store because it is shorter.  If not, this change should produce a lot
>> of slowdowns.
>>> >
>>> >   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>>> > -  m_PPRO | m_P4_NOCONA,
>>> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>>
>> Agner:
>> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX.
>> The Core2 and Nehalem processors recognize that certain instructions are
>> independent of the prior value of the register if the source and
>> destination registers are the same.
>>
>> This applies to all of the following instructions: XOR, SUB, PXOR,
>> XORPS, XORPD, and all variants of PSUBxxx and PCMPxxx except PCMPEQQ.
>>> >
>>> >   /* X86_TUNE_MEMORY_MISMATCH_STALL */
>>> >   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>>> >      than 4 branch instructions in the 16 byte window.  */
>>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
>> prediction.  Intel never had a similar design, so this flag is pointless.
>
> I noticed that too, but Andi has a better answer to it.
>
>>
>> We apparently ought to disable it for K10, at least per Agner's manual.
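For reference, a tiny example of the idiom behind the
X86_TUNE_SSE_LOAD0_BY_PXOR hunk above (plain C; the asm in the comment
sketches typical -O2 output and is not a claim about any particular GCC
version):

/* Loading 0.0 into an SSE register: with the flag set the compiler can
   use the dependence-breaking pxor idiom that Agner describes instead
   of a constant-pool load.  Compile with "gcc -O2 -S" under different
   -mtune settings and compare.  */
double
load_zero (void)
{
  return 0.0;	/* pxor %xmm0, %xmm0   vs.   movsd .LC0(%rip), %xmm0 */
}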
>>> >
>>> >   /* X86_TUNE_SCHEDULE */
>>> >   m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE |
>>> >   m_GENERIC,
>>> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> >
>>> >   /* X86_TUNE_USE_INCDEC */
>>> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
>>> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>>
>> Skipping inc/dec is to avoid the partial flag stall, which happens on
>> P4 only.
>>> >
>
> K8 and K10 partition the flags into groups.  References to flags in the
> same group can still cause the stall -- not sure how that can be handled.
>
>>> >   /* X86_TUNE_PAD_RETURNS */
>>> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
>>> > +  m_AMD_MULTIPLE | m_GENERIC,
>>
>> Again, this deals specifically with AMD K7/K8/K10 branch prediction.
>> I am not even sure this should be enabled for K10.
>>> >
>>> >   /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short function.  */
>>> >   m_ATOM,
>>> > @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 |
>>> >   m_GENERIC,
>>> >
>>> >   /* X86_TUNE_AVOID_VECTOR_DECODE */
>>> > -  m_CORE2I7 | m_K8 | m_GENERIC64,
>>> > +  m_K8 | m_GENERIC64,
>>
>> This avoids AMD vector-decoded instructions; again, if it helped, it
>> did so by accident.
>>> >
>>> >   /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>>> >      and SImode multiply, but 386 and 486 do HImode multiply faster.  */
>>> > @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >   /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>>> >      vector path on AMD machines.  */
>>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>> >
>>> >   /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>>> >      machines.  */
>>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>
>> This is similarly targeted at AMD hardware only.  I did not find a
>> similar limitation in the optimization manual.
>>> >
>>> >   /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>>> >      than a MOV.  */
>>> > @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>>> >      from FP to FP.  */
>>> > -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
>>> > +  m_AMDFAM10 | m_GENERIC,
>>
>> This is a quite specific feature of AMD chips to prefer packed converts
>> over scalar ones.  Nothing like this is documented for Cores.
>>> >
>>> >   /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>>> >      from integer to FP.  */
>>> > @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>>> >   /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>>> >      with a subsequent conditional jump instruction into a single
>>> >      compare-and-branch uop.  */
>>> > -  m_BDVER,
>>> > +  m_BDVER | m_CORE2I7,
>>
>> Core implements fusion similar to what AMD does, so I think this just
>> applies here.
>
> yes.
>
> thanks,
>
> David
>
>
>>> >
>>> >   /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit.  This flag
>>> >      will impact LEA instruction selection.  */
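On the X86_TUNE_OPT_AGU comment just above, a small sketch of the kind
of address arithmetic the LEA selection decision is about (plain C; the
asm comment shows typical x86-64 -O2 output and is illustrative only):

/* Scale, index add, and displacement can all fold into one complex LEA;
   whether that is a win depends on how cheaply the core's AGU executes
   it, which is what this tuning flag encodes.  */
int *
step (int *p, long i)
{
  return p + 2 * i + 1;		/* leaq 4(%rdi,%rsi,8), %rax */
}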
>>> > @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>>> > };
>>> >
>>> > static const unsigned int x86_accumulate_outgoing_args
>>> > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
>>> > +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>>
>> The stack engine should make this cheap, just like move-based prologues.
>> This definitely needs some validation; the accumulate-outgoing-args
>> codegen differs quite a lot.  Also, this leads to unwind table bloat.
>>> >
>>> > static const unsigned int x86_arch_always_fancy_math_387
>>> >   = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>>
>> Honza
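As a footnote on the x86_accumulate_outgoing_args hunk, a small sketch
of what the two codegen strategies look like from C (compile with
"gcc -S", with and without -maccumulate-outgoing-args; the asm comments
are illustrative, not taken from the patch):

static volatile int sink;

/* Seven int arguments force one argument onto the stack on x86-64.  */
__attribute__ ((noinline)) static void
callee (int a, int b, int c, int d, int e, int f, int g)
{
  sink = a + b + c + d + e + f + g;
}

void
caller (int x)
{
  /* With accumulate-outgoing-args, the prologue reserves the outgoing
     argument area once and each call stores into it (movl ...,(%rsp));
     without it, the argument is pushed and the stack pointer adjustment
     is absorbed by the stack engine.  */
  callee (x, x, x, x, x, x, x);
}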