On Wed, Dec 12, 2012 at 4:16 PM, Xinliang David Li <davi...@google.com> wrote:
> On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> Concerning 1 push per cycle, I think it is the same as what the K7
>> hardware did, so a move prologue should be a win.
>>> > Index: config/i386/i386.c
>>> > ===================================================================
>>> > --- config/i386/i386.c (revision 194452)
>>> > +++ config/i386/i386.c (working copy)
>>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>> >   COSTS_N_INSNS (8),	/* cost of FABS instruction.  */
>>> >   COSTS_N_INSNS (8),	/* cost of FCHS instruction.  */
>>> >   COSTS_N_INSNS (40),	/* cost of FSQRT instruction.  */
>>> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>>> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>> >    {-1, libcall, false}}}},
>>> >   {{libcall, {{6, loop_1_byte, true},
>>> >    {24, loop, true},
>>> >    {8192, rep_prefix_4_byte, true},
>>> >    {-1, libcall, false}}},
>>> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>
>> libcall is not faster up to 8KB than the rep sequence, which is better
>> for regalloc/code cache than a full-blown function call.
>
> Be careful with this.  My recollection is that REP sequence is good for

s/good/not good/

David

> any size -- for smaller sizes, the REP initial setup cost is too high
> (10s of cycles), while for large size copies it is less efficient
> compared with the library version.
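To make that tradeoff concrete, here is a minimal stand-alone sketch
(GNU C, x86-64 only; the rdtsc-based timing and the rep_movsb helper
are illustrative, not part of the patch) comparing the fixed setup cost
of REP MOVSB against a memcpy libcall.  If David's recollection is
right, rep movsb should lose on the small sizes and catch up on the
large ones:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static inline uint64_t
rdtsc (void)
{
  uint32_t lo, hi;
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t) hi << 32) | lo;
}

static inline void
rep_movsb (void *dst, const void *src, size_t n)
{
  __asm__ __volatile__ ("rep movsb"
			: "+D" (dst), "+S" (src), "+c" (n)
			: : "memory");
}

int
main (void)
{
  static char src[1 << 16], dst[1 << 16];
  size_t n;

  for (n = 16; n <= sizeof src; n <<= 2)
    {
      uint64_t t0 = rdtsc ();
      rep_movsb (dst, src, n);
      uint64_t t1 = rdtsc ();
      memcpy (dst, src, n);
      uint64_t t2 = rdtsc ();
      /* Keep the memcpy from being optimized away.  */
      __asm__ __volatile__ ("" : : "r" (dst) : "memory");
      printf ("%6zu bytes: rep movsb %6llu cycles, memcpy %6llu cycles\n",
	      n, (unsigned long long) (t1 - t0),
	      (unsigned long long) (t2 - t1));
    }
  return 0;
}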
>
>
>>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_PPRO,
>>> >
>>> >   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>>> > -  m_CORE2I7 | m_GENERIC,
>>> > +  m_GENERIC | m_CORE2,
>>
>> This disables shifts that store just some flags.  According to Agner's
>> manual, i7 handles this well.
>>
>
> ok.
>
>> Partial flags stall
>> The Sandy Bridge uses the method of an extra µop to join partial
>> registers not only for general purpose registers but also for the flags
>> register, unlike previous processors which used this method only for
>> general purpose registers.  This occurs when a write to a part of the
>> flags register is followed by a read from a larger part of the flags
>> register.  The partial flags stall of previous processors (see page 75)
>> is therefore replaced by an extra µop.  The Sandy Bridge also generates
>> an extra µop when reading the flags after a rotate instruction.
>>
>> This is cheaper than the 7 cycle delay on Core that this flag is trying
>> to avoid.
>
> ok.
>
>>> >
>>> >   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>> >    * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_K6,
>>> >
>>> >   /* X86_TUNE_USE_CLTD */
>>> > -  ~(m_PENT | m_ATOM | m_K6),
>>> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
> My change was to enable CLTD for generic.  Is your change intended to
> revert that?
>
>>
>> None of the CPUs that generic cares about are !USE_CLTD now after your
>> change.
>>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_ATHLON_K8,
>>> >
>>> >   /* X86_TUNE_SSE_TYPELESS_STORES */
>>> > -  m_AMD_MULTIPLE,
>>> > +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>>
>> Hmm, I cannot seem to find this in the manual now, but I believe that
>> stores also do not type, so a movaps store is preferred over a movapd
>> store because it is shorter.  If not, this change should produce a lot
>> of slowdowns.
>>> >
>>> >   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>>> > -  m_PPRO | m_P4_NOCONA,
>>> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>>
>> Agner:
>> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX.
>> The Core2 and Nehalem processors recognize that certain instructions are
>> independent of the prior value of the register if the source and
>> destination registers are the same.
>>
>> This applies to all of the following instructions: XOR, SUB, PXOR,
>> XORPS, XORPD, and all variants of PSUBxxx and PCMPxxx except PCMPEQQ.
>>> >
>>> >   /* X86_TUNE_MEMORY_MISMATCH_STALL */
>>> >   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>>> >      than 4 branch instructions in the 16 byte window.  */
>>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
>> prediction.  Intel never had a similar design, so this flag is pointless.
>
> I noticed that too, but Andi has a better answer to it.
>
>>
>> We apparently ought to disable it for K10, at least per Agner's manual.
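For reference, a tiny example of the idiom behind the
X86_TUNE_SSE_LOAD0_BY_PXOR hunk above (plain C; the asm in the comment
sketches typical -O2 output and is not a claim about any particular GCC
version):

/* Loading 0.0 into an SSE register: with the flag set the compiler can
   use the dependence-breaking pxor idiom that Agner describes instead
   of a constant-pool load.  Compile with "gcc -O2 -S" under different
   -mtune settings and compare.  */
double
load_zero (void)
{
  return 0.0;	/* pxor %xmm0, %xmm0   vs.   movsd .LC0(%rip), %xmm0 */
}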
>>> >
>>> >   /* X86_TUNE_SCHEDULE */
>>> >   m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE |
>>> >   m_GENERIC,
>>> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> >
>>> >   /* X86_TUNE_USE_INCDEC */
>>> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
>>> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>>
>> Skipping inc/dec is to avoid the partial flag stall, which happens on
>> P4 only.
>>> >
>
> K8 and K10 partition the flags into groups.  References to flags in the
> same group can still cause the stall -- not sure how that can be handled.
>
>>> >   /* X86_TUNE_PAD_RETURNS */
>>> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
>>> > +  m_AMD_MULTIPLE | m_GENERIC,
>>
>> Again, this deals specifically with AMD K7/K8/K10 branch prediction.
>> I am not even sure this should be enabled for K10.
>>> >
>>> >   /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short function.  */
>>> >   m_ATOM,
>>> > @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>>> >   m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 |
>>> >   m_GENERIC,
>>> >
>>> >   /* X86_TUNE_AVOID_VECTOR_DECODE */
>>> > -  m_CORE2I7 | m_K8 | m_GENERIC64,
>>> > +  m_K8 | m_GENERIC64,
>>
>> This avoids AMD vector-decoded instructions; again, if it helped, it
>> did so by accident.
>>> >
>>> >   /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>>> >      and SImode multiply, but 386 and 486 do HImode multiply faster.  */
>>> > @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >   /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>>> >      vector path on AMD machines.  */
>>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>> >
>>> >   /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>>> >      machines.  */
>>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>
>> This is similarly targeted at AMD hardware only.  I did not find a
>> similar limitation in the optimization manual.
>>> >
>>> >   /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>>> >      than a MOV.  */
>>> > @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>>> >      from FP to FP.  */
>>> > -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
>>> > +  m_AMDFAM10 | m_GENERIC,
>>
>> This is a quite specific feature of AMD chips to prefer packed converts
>> over scalar ones.  Nothing like this is documented for Cores.
>>> >
>>> >   /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>>> >      from integer to FP.  */
>>> > @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>>> >   /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>>> >      with a subsequent conditional jump instruction into a single
>>> >      compare-and-branch uop.  */
>>> > -  m_BDVER,
>>> > +  m_BDVER | m_CORE2I7,
>>
>> Core implements fusion similar to what AMD does, so I think this just
>> applies here.
>
> yes.
>
> thanks,
>
> David
>
>
>>> >
>>> >   /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit.  This flag
>>> >      will impact LEA instruction selection.  */
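On the X86_TUNE_OPT_AGU comment just above, a small sketch of the kind
of address arithmetic the LEA selection decision is about (plain C; the
asm comment shows typical x86-64 -O2 output and is illustrative only):

/* Scale, index add, and displacement can all fold into one complex LEA;
   whether that is a win depends on how cheaply the core's AGU executes
   it, which is what this tuning flag encodes.  */
int *
step (int *p, long i)
{
  return p + 2 * i + 1;		/* leaq 4(%rdi,%rsi,8), %rax */
}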
>>> > @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>>> > };
>>> >
>>> > static const unsigned int x86_accumulate_outgoing_args
>>> > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
>>> > +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>>
>> The stack engine should make this cheap, just like move-based prologues.
>> This definitely needs some validation; the accumulate-outgoing-args
>> codegen differs quite a lot.  Also, this leads to unwind table bloat.
>>> >
>>> > static const unsigned int x86_arch_always_fancy_math_387
>>> >   = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>>
>> Honza
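As a footnote on the x86_accumulate_outgoing_args hunk, a small sketch
of what the two codegen strategies look like from C (compile with
"gcc -S", with and without -maccumulate-outgoing-args; the asm comments
are illustrative, not taken from the patch):

static volatile int sink;

/* Seven int arguments force one argument onto the stack on x86-64.  */
__attribute__ ((noinline)) static void
callee (int a, int b, int c, int d, int e, int f, int g)
{
  sink = a + b + c + d + e + f + g;
}

void
caller (int x)
{
  /* With accumulate-outgoing-args, the prologue reserves the outgoing
     argument area once and each call stores into it (movl ...,(%rsp));
     without it, the argument is pushed and the stack pointer adjustment
     is absorbed by the stack engine.  */
  callee (x, x, x, x, x, x, x);
}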