> Index: config/i386/i386.md
> ===================================================================
> --- config/i386/i386.md       (revision 185920)
> +++ config/i386/i386.md       (working copy)
> @@ -2262,9 +2262,19 @@
>          ]
>          (const_string "SI")))])
>  
> +(define_insn "*movhi_imm_internal"
> +  [(set (match_operand:HI 0 "memory_operand" "=m")
> +        (match_operand:HI 1 "immediate_operand" "n"))]
> +  "!TARGET_LCP_STALL && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
> +{
> +  return "mov{w}\t{%1, %0|%0, %1}";
> +}
> +  [(set (attr "type") (const_string "imov"))
> +   (set (attr "mode") (const_string "HI"))])
> +
>  (define_insn "*movhi_internal"
>    [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r,r,m")
> -     (match_operand:HI 1 "general_operand" "r,rn,rm,rn"))]
> +     (match_operand:HI 1 "general_operand" "r,rn,rm,r"))]

If you do this, you will prevent reload from considering the immediate for
rematerialization when the register holding the constant sits on the stack
on !TARGET_LCP_STALL machines. The matching pattern for moves should really
handle all available alternatives, so that reload is happy.
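That is, the memory alternative of *movhi_internal should keep accepting
immediates, as it does before your patch (quoting the pre-patch constraint
string only for clarity):

  (set (match_operand:HI 0 "nonimmediate_operand" "=r,r,r,m")
       (match_operand:HI 1 "general_operand" "r,rn,rm,rn"))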

You can duplicate the pattern, but I think this is much better done as a
post-reload peephole2, i.e. ask for a scratch register and, if one is
available, do the splitting.  That way the optimization won't happen when no
register is available, and we also rely on the post-reload cleanups to unify
moves of the same constant, but I think this should work well.

You also want to conditionalize the split on optimize_insn_for_speed_p ().
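Something along these lines (a completely untested sketch, assuming the
tunable keeps the TARGET_LCP_STALL name from your patch):

;; After reload, if a scratch register is available, load the 16-bit
;; immediate into it and store the register instead, so the store no
;; longer combines the 0x66 operand-size prefix with a 16-bit immediate
;; (the length-changing prefix case).
(define_peephole2
  [(match_scratch:HI 2 "r")
   (set (match_operand:HI 0 "memory_operand" "")
        (match_operand:HI 1 "immediate_operand" ""))]
  "TARGET_LCP_STALL && optimize_insn_for_speed_p ()"
  [(set (match_dup 2) (match_dup 1))
   (set (match_dup 0) (match_dup 2))])

If no scratch register is free, the peephole simply does not match and the
plain immediate store stays, which is exactly the behaviour we want here.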

>    "!(MEM_P (operands[0]) && MEM_P (operands[1]))"
>  {
>    switch (get_attr_type (insn))
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c        (revision 185920)
> +++ config/i386/i386.c        (working copy)
> @@ -1964,6 +1964,11 @@ static unsigned int initial_ix86_tune_features[X86
>    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>    m_CORE2I7 | m_GENERIC,
>  
> +  /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> +   * on 16-bit immediate moves into memory on Core2 and Corei7,
> +   * which may also affect AMD implementations.  */
> +  m_CORE2I7 | m_GENERIC | m_AMD_MULTIPLE,

Is this supposed to help AMD?  (At least the pre-Bulldozer designs should not
care that much about length-changing prefixes, because they tag instruction
sizes in the instruction cache.)  If not, I would suggest enabling it only for
the Core chips and generic.
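I.e. the initializer entry would then just drop m_AMD_MULTIPLE:

  /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
     on 16-bit immediate moves into memory on Core2 and Corei7.  */
  m_CORE2I7 | m_GENERIC,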

Honza
