Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-18 Thread Vladimir Makarov

On 12-10-15 8:06 AM, Richard Sandiford wrote:

Vladimir Makarov  writes:

if that's accurate.  I dropped the term "reload pseudo" because of
the general comment in my earlier reply about the use of "reload pseudo"
when the code seems to include inheritance and split pseudos too.

There is no inheritance or splitting yet; they are done after the
constraint pass.
So at this stage >= new_regno_start means a reload pseudo.

Ah, OK.


That's a change in the meaning of NEW_CLASS, but seems easier for
callers to handle.  I think all it requires is changing:


+  common_class = ira_reg_class_subset[rclass][cl];
+  if (new_class != NULL)
+   *new_class = common_class;

to:

common_class = ira_reg_class_subset[rclass][cl];
if (new_class != NULL && rclass != common_class)
*new_class = common_class;

This change results in infinite LRA looping when compiling the first
libgcc file.  Unfortunately I have no time to investigate it right now.
I'd like to say that most of this code is very sensitive to changes.
I see it a lot: you change something that looks obvious and a target
breaks.
I am going to investigate it when I have more time.

Thanks.


+default:
+  {
+   const char *fmt = GET_RTX_FORMAT (code);
+   int i;
+
+   if (GET_RTX_LENGTH (code) != 1
+   || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
+ {
+   for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+ if (fmt[i] == 'e')
+   extract_loc_address_regs (false, mode, as, &XEXP (x, i),
+ context_p, code, SCRATCH,
+ modify_p, ad);
+   break;
+ }
+   /* fall through for case UNARY_OP (UNSPEC ...)  */
+  }
+
+case UNSPEC:
+  if (ad->disp_loc == NULL)
+   ad->disp_loc = loc;
+  else if (ad->base_reg_loc == NULL)
+   {
+ ad->base_reg_loc = loc;
+ ad->base_outer_code = outer_code;
+ ad->index_code = index_code;
+ ad->base_modify_p = modify_p;
+   }
+  else
+   {
+ lra_assert (ad->index_reg_loc == NULL);
+ ad->index_reg_loc = loc;
+   }
+  break;
+
+}

Which targets use a bare UNSPEC as a displacement?  I thought a
displacement had to be a link-time constant, in which case it should
satisfy CONSTANT_P.  For UNSPECs, that means wrapping it in a CONST.

I saw it somewhere.  I guess IA64.

I'm just a bit worried that the UNSPEC handling is sensitive to the
order that subrtxes are processed (unlike PLUS, which goes to some
trouble to work out what's what).  It could be especially confusing
because the default case processes operands in reverse order while
PLUS processes them in forward order.

Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough?
Probably deserves a comment.

I don't remember.  To figure it out, I would have to switch it off and try
all the targets supported by LRA.

AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just
recording where reloads of a particular class need to go (obviously
in the case of disp_loc, which isn't reloaded at all).  The fields
have semantic value too.  I.e. we use them to work out the value
of at least part of the address.

In that case it seems dangerous to look through general rtxes
in the way that the default case above does.  Maybe just making
sure that DISP_LOC is involved in a sum with the base would be
enough, but another idea was:


I know of three ways of "mutating" (for want of a better word)
an address:

1. (and X (const_int X)), to align
2. a subreg
3. a unary operator (such as truncation or extension)

So maybe we could:

a. remove outer mutations (using a helper function)
b. handle LO_SUM, PRE_*, POST_*: as now
c. otherwise treat the address of the sum of one, two or three pieces.
   c1. Peel mutations of all pieces.
   c2. Classify the pieces into base, index and displacement.
   This would be similar to the jousting code above, but hopefully
   easier because all three rtxes are to hand.  E.g. we could
   do the base vs. index thing in a similar way to
   commutative_operand_precedence.
   c3. Record which pieces were mutated (e.g. using something like the
   index_loc vs. index_reg_loc distinction in the current code)

That should be general enough for current targets, but if it isn't,
we could generalise it further when we know what generalisation is needed.

That's still going to be a fair amount of code, but hopefully not more,
and we might have more confidence at each stage what each value is.
And it avoids the risk of treating "mutated" addresses as "unmutated" ones.
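
To make (a) concrete, here is a minimal sketch of a mutation-stripping
helper in GCC-internal C (the function name and the exact set of codes
peeled are illustrative, not taken from the patch):

  /* Sketch: walk down through the address "mutations" listed above --
     alignment ANDs, subregs and unary operators -- and return the
     location of the unmutated rtx underneath.  */
  static rtx *
  strip_address_mutations_sketch (rtx *loc)
  {
    for (;;)
      {
        enum rtx_code code = GET_CODE (*loc);
        if (code == AND && CONST_INT_P (XEXP (*loc, 1)))
          /* (and X (const_int N)) used to align an address.  */
          loc = &XEXP (*loc, 0);
        else if (code == SUBREG)
          loc = &SUBREG_REG (*loc);
        else if (UNARY_P (*loc))
          /* Truncation, extension, etc.  */
          loc = &XEXP (*loc, 0);
        else
          return loc;
      }
  }

The classification in (c2) would then only ever see the peeled pieces,
while (c3) records the original locations for reloading.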


Just an idea though.  Probably not for 4.8, although I might try it
if I find time.

I am not sure that you listed all the cases.  

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-18 Thread Vladimir Makarov

On 12-10-17 7:24 AM, Richard Sandiford wrote:

Thanks for all the updates.

Vladimir Makarov  writes:

+ /* index * scale + disp => new base + index * scale  */
+ enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
+
+ lra_assert (INDEX_REG_CLASS != NO_REGS);
+ new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
+ lra_assert (GET_CODE (*addr_loc) == PLUS);
+ lra_emit_move (new_reg, *ad.disp_loc);
+ if (CONSTANT_P (XEXP (*addr_loc, 1)))
+   XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
+ XEXP (*addr_loc, 0) = new_reg;

The canonical form is (plus (mult ...) (reg)) rather than
(plus (reg) (mult ...)), but it looks like we create the latter.
I realise you try both forms here:

It might happen because of equiv substitution in LRA.

+ /* Some targets like ARM, accept address operands in
+specific order -- try exchange them if necessary.  */
+ if (! valid_address_p (mode, *addr_loc, as))
+   {
+ exchange_plus_ops (*addr_loc);
+ if (! valid_address_p (mode, *addr_loc, as))
+   exchange_plus_ops (*addr_loc);
+   }

but I think we should try the canonical form first.  And I'd prefer it
if we didn't try the other form at all, especially in 4.8.  It isn't
really the backend's job to reject non-canonical rtl.  This might well
be another case where some targets need a (hopefully small) tweak in
order to play by the rules.

Also, I suppose this section of code feeds back to my question on
Wednesday about the distinction that LRA seems to make between the
compile-time constant in:

(plus (reg X1) (const_int Y1))

and the link-time constant in:

(plus (reg X2) (symbol_ref Y2))

It looked like extract_address_regs classified X1 as a base register and
X2 as an index register.  The difference between the two constants has
no run-time significance though, and I think we should handle both X1
and X2 as base registers (as I think reload does).

I think the path above would then be specific to scaled indices.
In the original address the "complex" index must come first and the
displacement second.  In the modified address, the index would stay
first and the new base register would be second.  More below.

As I wrote above, the problem is also that equiv substitution can
create non-canonical forms.

Right.  Just in case there's a misunderstanding: I'm not complaining
about these routines internally using forms that are noncanonical
(which could happen because of equiv substitution, like you say).
I just think that what we eventually try to validate should be canonical.
In a way it's similar to how the simplify-rtx.c routines work.

If there are targets that only accept noncanonical rtl (which is after
all just a specific type of invalid rtl), they need to be fixed.
Agreed.  In order not to forget to fix the targets, I am removing the operand
exchange.

+  /* base + scale * index + disp => new base + scale * index  */
+  new_reg = base_plus_disp_to_reg (mode, as, &ad);
+  *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
+  if (! valid_address_p (mode, *addr_loc, as))
+   {
+ /* Some targets like ARM, accept address operands in
+specific order -- try exchange them if necessary.  */
+ exchange_plus_ops (*addr_loc);
+ if (! valid_address_p (mode, *addr_loc, as))
+   exchange_plus_ops (*addr_loc);
+   }

Same comment as above about canonical rtl.  Here we can have two
registers -- in which case the base should come first -- or a more
complex index -- in which case the index should come first.

We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
with the operands in either order, and let it take care of the details.
Using simplify_gen_binary would help with the earlier index+disp case too.
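
For illustration, the gen_rtx_PLUS call above could then become something
like the following sketch (assuming simplify_gen_binary from
simplify-rtx.c, which puts commutative operands into canonical order):

  /* Let simplify-rtx.c canonicalise the PLUS instead of trying both
     operand orders by hand.  */
  *addr_loc = simplify_gen_binary (PLUS, Pmode, new_reg, *ad.index_loc);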

Equiv substitution can create non-canonical forms.  There are two approaches:
o have code for dealing with non-canonical forms (equiv substitution,
target stupidity)
o always support canonical forms and require them from targets.

I decided to use the first variant but I am reconsidering it.  I'll try to
fix it before inclusion, but I am not sure I have time for this.  All
these changes make LRA unstable.  In fact, I've just found that the changes
I have made so far broke two SPEC2000 tests, although the GCC
testsuite and bootstrap look good.

OK.  I'm happy to try fixing the noncanonical thing.


+  /* If this is post-increment, first copy the location to the reload reg.  */
+  if (post && real_in != result)
+emit_insn (gen_move_insn (result, real_in));

Nit, but real_in != result can never be true AIUI, and I was confused how
the code could be correct in that case.  Maybe just remove it, or make
it an assert?

No, it might be true:

real_in = in == value ? incloc : in;
...
if (cond)
result = incloc;
else
result = ...

if (post && real_in != result)

So it is true if in==va

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-17 Thread Steven Bosscher
On Wed, Oct 17, 2012 at 9:53 PM, Vladimir Makarov wrote:
> On 12-10-15 12:49 PM, Richard Sandiford wrote:
>> Getting rid of reload always seemed like a pipe dream, and if the only
>> known drawback of this replacement is that it takes a while on extreme
>> testcases, that's an amazing achievement.  (Not to say compile time
>> isn't important, just that there were so many other hurdles to overcome.)

Just to be clear, LRA now does no worse from a compile time POV than,
say, tree-ssa-live. Most of the scalability problems have been
addressed.

> It is my second attempt.  The first one was the YARA project.  I got a lot of
> experience from that project and learned how not to do this.
> LRA will still be a long-lasting project.  I don't think I found all the
> weirdness of reload just by trying 8 targets (fixing one bug on one target
> frequently resulted in new bugs on other targets, so it frequently required
> cardinal changes to the original code).  Only after trying the 8
> targets did I get the feeling that this approach could work well.

Hats off to you, Vlad, for your years of effort on improving GCC's RA!

Ciao!
Steven


Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-17 Thread Vladimir Makarov

On 12-10-15 12:49 PM, Richard Sandiford wrote:

Hi Vlad,

Some comments about the rest of LRA.  Nothing major here...

Vladimir Makarov  writes:

+/* Info about register in an insn.  */
+struct lra_insn_reg
+{
+  /* The biggest mode through which the insn refers to the register
+ (remember the register can be accessed through a subreg in the
+ insn).  */
+  ENUM_BITFIELD(machine_mode) biggest_mode : 16;

AFAICT, this is actually always the mode of a specific reference,
and if there are references to the same register in different modes,
those references get their own lra_insn_regs.  "mode" might be better
than "biggest_mode" if so.

It seems "mode" is also not good.  I've just modified the comment to
reflect the fact that it is just a reference.

+/* Static part (common info for insns with the same ICODE) of LRA
+   internal insn info. It exists in at most one exemplar for each
+   non-negative ICODE. Warning: if the structure definition is
+   changed, the initializer for debug_insn_static_data in lra.c should
+   be changed too.  */

Probably worth saying (before the warning) that there is also
one structure for each asm.

Good point.  I added a comment.

+/* LRA internal info about an insn (LRA internal insn
+   representation).  */
+struct lra_insn_recog_data
+{
+  int icode; /* The insn code. */
+  rtx insn; /* The insn itself. */
+  /* Common data for insns with the same ICODE. */
+  struct lra_static_insn_data *insn_static_data;

Maybe worth mentioning asms here too.

Fixed.

+  /* Two arrays of size correspondingly equal to the operand and the
+ duplication numbers: */
+  rtx **operand_loc; /* The operand locations, NULL if no operands.  */
+  rtx **dup_loc; /* The dup locations, NULL if no dups. */
+  /* Number of hard registers implicitly used in given call insn.  The
+ value can be NULL or points to array of the hard register numbers
+ ending with a negative value.  */
+  int *arg_hard_regs;
+#ifdef HAVE_ATTR_enabled
+  /* Alternative enabled for the insn. NULL for debug insns.  */
+  bool *alternative_enabled_p;
+#endif
+  /* The alternative should be used for the insn, -1 if invalid, or we
+ should try to use any alternative, or the insn is a debug
+ insn.  */
+  int used_insn_alternative;
+  struct lra_insn_reg *regs;  /* Always NULL for a debug insn. */

Comments consistently above the field.

Fixed.

+extern void lra_expand_reg_info (void);

This doesn't exist any more.

Fixed.

+extern int lra_constraint_new_insn_uid_start;

Just saying in case: this seems to be write-only, with lra-constraints.c
instead using a static variable to track the uid start.

I realise you might want to keep it anyway for consistency with
lra_constraint_new_regno_start, or for debugging.


+extern rtx lra_secondary_memory[NUM_MACHINE_MODES];

This doesn't exist any more.

Removed.  Thanks.

+/* lra-saves.c: */
+
+extern bool lra_save_restore (void);

Same for this file & function.

Removed.

+/* The function returns TRUE if at least one hard register from ones
+   starting with HARD_REGNO and containing value of MODE are in set
+   HARD_REGSET. */
+static inline bool
+lra_hard_reg_set_intersection_p (int hard_regno, enum machine_mode mode,
+HARD_REG_SET hard_regset)
+{
+  int i;
+
+  lra_assert (hard_regno >= 0);
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+if (TEST_HARD_REG_BIT (hard_regset, hard_regno + i))
+  return true;
+  return false;
+}

This is the same as overlaps_hard_reg_set_p.


I removed it and started to use the function overlaps_hard_reg_set_p.
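
For reference, a sketch of the substitution (the argument order differs
between the two helpers; signatures as I understand hard-reg-set.h):

  /* Before: the LRA-local helper.  */
  bool b1 = lra_hard_reg_set_intersection_p (hard_regno, mode, hard_regset);
  /* After: the generic hard-reg-set.h helper; note the set comes first.  */
  bool b2 = overlaps_hard_reg_set_p (hard_regset, mode, hard_regno);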

+/* Return hard regno and offset of (sub-)register X through arguments
+   HARD_REGNO and OFFSET.  If it is not (sub-)register or the hard
+   register is unknown, then return -1 and 0 correspondingly.  */

The function seems to return -1 for both.
Fixed.  It does not matter for the rest of the code as the offset is used only
when hard_regno >= 0.

+/* Add hard registers starting with HARD_REGNO and holding value of
+   MODE to the set S.  */
+static inline void
+lra_add_hard_reg_set (int hard_regno, enum machine_mode mode, HARD_REG_SET *s)
+{
+  int i;
+
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+SET_HARD_REG_BIT (*s, hard_regno + i);
+}

This is add_to_hard_reg_set.

Removed.
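
The corresponding substitution here is, as a sketch (again assuming the
usual hard-reg-set.h signature):

  /* Before: the LRA-local helper.  */
  lra_add_hard_reg_set (hard_regno, mode, &s);
  /* After: the generic hard-reg-set.h helper (set pointer first).  */
  add_to_hard_reg_set (&s, mode, hard_regno);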

+   Here is block diagram of LRA passes:
+
+   [ASCII block diagram, garbled in the archive: "New (and old) pseudos
+    assignment" -> "Memory-memory move coalesce" -> "Undo inheritance
+    for spilled pseudos and splits (for pseudos that got the same hard
+    regs)", starting from Start]

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-17 Thread Richard Sandiford
Thanks for all the updates.

Vladimir Makarov  writes:
>>> + /* index * scale + disp => new base + index * scale  */
>>> + enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
>>> +
>>> + lra_assert (INDEX_REG_CLASS != NO_REGS);
>>> + new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
>>> + lra_assert (GET_CODE (*addr_loc) == PLUS);
>>> + lra_emit_move (new_reg, *ad.disp_loc);
>>> + if (CONSTANT_P (XEXP (*addr_loc, 1)))
>>> +   XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
>>> + XEXP (*addr_loc, 0) = new_reg;
>> The canonical form is (plus (mult ...) (reg)) rather than
>> (plus (reg) (mult ...)), but it looks like we create the latter.
>> I realise you try both forms here:
> It might happen because equiv substitution in LRA.
>>> + /* Some targets like ARM, accept address operands in
>>> +specific order -- try exchange them if necessary.  */
>>> + if (! valid_address_p (mode, *addr_loc, as))
>>> +   {
>>> + exchange_plus_ops (*addr_loc);
>>> + if (! valid_address_p (mode, *addr_loc, as))
>>> +   exchange_plus_ops (*addr_loc);
>>> +   }
>> but I think we should try the canonical form first.  And I'd prefer it
>> if we didn't try the other form at all, especially in 4.8.  It isn't
>> really the backend's job to reject non-canonical rtl.  This might well
>> be another case where some targets need a (hopefully small) tweak in
>> order to play by the rules.
>>
>> Also, I suppose this section of code feeds back to my question on
>> Wednesday about the distinction that LRA seems to make between the
>> compile-time constant in:
>>
>>(plus (reg X1) (const_int Y1))
>>
>> and the link-time constant in:
>>
>>(plus (reg X2) (symbol_ref Y2))
>>
>> It looked like extract_address_regs classified X1 as a base register and
>> X2 as an index register.  The difference between the two constants has
>> no run-time significance though, and I think we should handle both X1
>> and X2 as base registers (as I think reload does).
>>
>> I think the path above would then be specific to scaled indices.
>> In the original address the "complex" index must come first and the
>> displacement second.  In the modified address, the index would stay
>> first and the new base register would be second.  More below.
> As I wrote above the problem is also in that equiv substitution can 
> create non-canonical forms.

Right.  Just in case there's a misunderstanding: I'm not complaining
about these routines internally using forms that are noncanonical
(which could happen because of equiv substitution, like you say).
I just think that what we eventually try to validate should be canonical.
In a way it's similar to how the simplify-rtx.c routines work.

If there are targets that only accept noncanonical rtl (which is after
all just a specific type of invalid rtl), they need to be fixed.

>>> +  /* base + scale * index + disp => new base + scale * index  */
>>> +  new_reg = base_plus_disp_to_reg (mode, as, &ad);
>>> +  *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
>>> +  if (! valid_address_p (mode, *addr_loc, as))
>>> +   {
>>> + /* Some targets like ARM, accept address operands in
>>> +specific order -- try exchange them if necessary.  */
>>> + exchange_plus_ops (*addr_loc);
>>> + if (! valid_address_p (mode, *addr_loc, as))
>>> +   exchange_plus_ops (*addr_loc);
>>> +   }
>> Same comment as above about canonical rtl.  Here we can have two
>> registers -- in which case the base should come first -- or a more
>> complex index -- in which case the index should come first.
>>
>> We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
>> with the operands in either order, and let it take care of the details.
>> Using simplify_gen_binary would help with the earlier index+disp case too.
> Equiv substitution can create non-canonical forms.  There are 2 approaches:
> o have a code for dealing with non-canonical forms (equiv substitution, 
> target stupidity)
> o always support canonical forms and require them from targets.
>
> I decided to use the 1st variant but I am reconsidering it.  I'll try to 
> fix before inclusion.  But I am not sure I have time for this.  All 
> these changes makes LRA unstable. In fact, I've just found that changes 
> I already made so far resulted in 2 SPEC2000 tests broken although GCC 
> testsuite and bootstrap looks good.

OK.  I'm happy to try fixing the noncanonical thing.

>>> +  /* If this is post-increment, first copy the location to the reload reg. 
>>>  */
>>> +  if (post && real_in != result)
>>> +emit_insn (gen_move_insn (result, real_in));
>> Nit, but real_in != result can never be true AIUI, and I was confused how
>> the code could be correct in that case.  Maybe just remove it, or make
>> it an assert?
> No, it might be true:
>
> real_in = in == value ? incloc : in;
> ...
> if (cond)
>result = incloc;
> else
>result = ...
>
> if (

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-16 Thread Vladimir Makarov

On 12-10-12 10:29 AM, Richard Sandiford wrote:

Hi Vlad,

Comments for the rest of ira-constraints.c.

Vladimir Makarov  writes:

+  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
+  change_p = equiv_address_substitution (&ad, addr_loc, mode, as, code);
+  if (ad.base_reg_loc != NULL)
+{
+  if (process_addr_reg
+ (ad.base_reg_loc, before,
+  (ad.base_modify_p && REG_P (*ad.base_reg_loc)
+   && find_regno_note (curr_insn, REG_DEAD,
+   REGNO (*ad.base_reg_loc)) == NULL
+   ? after : NULL),
+  base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
+   change_p = true;
+  if (ad.base_reg_loc2 != NULL)
+   *ad.base_reg_loc2 = *ad.base_reg_loc;
+  saved_base_reg = *ad.base_reg_loc;
+  lra_eliminate_reg_if_possible (ad.base_reg_loc);
+  if (ad.base_reg_loc2 != NULL)
+   {
+ saved_base_reg2 = *ad.base_reg_loc2;
+ lra_eliminate_reg_if_possible (ad.base_reg_loc2);
+   }

We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it
might be clearer without saved_base_reg2.  More below...

+  /* The following addressing is checked by constraints and
+usually target specific legitimate address hooks do not
+consider them valid.  */
+  || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
+  || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC

typo: two PRE_DECs, although:

+  || GET_CODE (*addr_loc) == PRE_MODIFY
+  || GET_CODE (*addr_loc) == POST_MODIFY

the whole lot could just be replaced by ad.base_modify_p, or perhaps
even removed entirely given:

+  /* In this case we can not do anything because if it is wrong
+that is because of wrong displacement.  Remember that any
+address was legitimate in non-strict sense before LRA.  */
+  || ad.disp_loc == NULL)

It doesn't seem worth validating the address at all for ad.disp_loc == NULL.
E.g. something like:

   if (ad.base_reg_loc != NULL
   && (process_addr_reg
  (ad.base_reg_loc, before,
   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
&& find_regno_note (curr_insn, REG_DEAD,
REGNO (*ad.base_reg_loc)) == NULL
? after : NULL),
   base_reg_class (mode, as, ad.base_outer_code, ad.index_code))))
 {
   change_p = true;
   if (ad.base_reg_loc2 != NULL)
 *ad.base_reg_loc2 = *ad.base_reg_loc;
 }

   if (ad.index_reg_loc != NULL
   && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
 change_p = true;

   /* The address was valid before LRA.  We only change its form if the
  address has a displacement, so if it has no displacement it must
  still be valid.  */
   if (ad.disp_loc == NULL)
 return change_p;

   /* See whether the address is still valid.  Some ports do not check
  displacements for eliminable registers, so we replace them
  temporarily with the elimination target.  */
   saved_base_reg = saved_index_reg = NULL_RTX;
   ...
   if (ok_p)
 return change_p;

Yes, it makes sense.  I changed the code as you proposed.

+#ifdef HAVE_lo_sum
+ {
+   rtx insn;
+   rtx last = get_last_insn ();
+
+   /* disp => lo_sum (new_base, disp)   */
+   insn = emit_insn (gen_rtx_SET
+ (VOIDmode, new_reg,
+  gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
+   code = recog_memoized (insn);
+   if (code >= 0)
+ {
+   rtx save = *ad.disp_loc;
+
+   *ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
+   if (! valid_address_p (mode, *ad.disp_loc, as))
+ {
+   *ad.disp_loc = save;
+   code = -1;
+ }
+ }
+   if (code < 0)
+ delete_insns_since (last);
+ }
+#endif

Nice :-)

Purely for the record, I wondered whether the high part should be
generated with emit_move_insn(_1) instead, with the rhs of the move
being the HIGH rtx.  That would allow targets to cope with cases where
the high part isn't represented directly as a HIGH.  E.g. on MIPS and
Alpha, small-data accesses use the global register as the high part instead.

However, both MIPS and Alpha accept small-data addresses as legitimate
constants and addresses before and during reload and only introduce the
split form after reload.  And I think that's how any other cases that
aren't simple HIGHs should be handled too.  E.g. MIPS also represents
GOT page loads as HIGHs until after reload, and only then lowers the
HIGH to a GOT load.  Allowing the backend to generate anything other
than a plain HIGH set here would be a double-edged sword.

So after all that I agree that the gen_rtx_SET above is better than
calling the move expanders.

Thanks for sharing your knowledge.

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-15 Thread Richard Sandiford
Hi Vlad,

Some comments about the rest of LRA.  Nothing major here...

Vladimir Makarov  writes:
> +/* Info about register in an insn.  */
> +struct lra_insn_reg
> +{
> +  /* The biggest mode through which the insn refers to the register
> + (remember the register can be accessed through a subreg in the
> + insn).  */
> +  ENUM_BITFIELD(machine_mode) biggest_mode : 16;

AFAICT, this is actually always the mode of a specific reference,
and if there are references to the same register in different modes,
those references get their own lra_insn_regs.  "mode" might be better
than "biggest_mode" if so.

> +/* Static part (common info for insns with the same ICODE) of LRA
> +   internal insn info.   It exists in at most one exemplar for each
> +   non-negative ICODE.   Warning: if the structure definition is
> +   changed, the initializer for debug_insn_static_data in lra.c should
> +   be changed too.  */

Probably worth saying (before the warning) that there is also
one structure for each asm.

> +/* LRA internal info about an insn (LRA internal insn
> +   representation).  */
> +struct lra_insn_recog_data
> +{
> +  int icode; /* The insn code.   */
> +  rtx insn; /* The insn itself.   */
> +  /* Common data for insns with the same ICODE.   */
> +  struct lra_static_insn_data *insn_static_data;

Maybe worth mentioning asms here too.

> +  /* Two arrays of size correspondingly equal to the operand and the
> + duplication numbers: */
> +  rtx **operand_loc; /* The operand locations, NULL if no operands.  */
> +  rtx **dup_loc; /* The dup locations, NULL if no dups.   */
> +  /* Number of hard registers implicitly used in given call insn.  The
> + value can be NULL or points to array of the hard register numbers
> + ending with a negative value.  */
> +  int *arg_hard_regs;
> +#ifdef HAVE_ATTR_enabled
> +  /* Alternative enabled for the insn.   NULL for debug insns.  */
> +  bool *alternative_enabled_p;
> +#endif
> +  /* The alternative should be used for the insn, -1 if invalid, or we
> + should try to use any alternative, or the insn is a debug
> + insn.  */
> +  int used_insn_alternative;
> +  struct lra_insn_reg *regs;  /* Always NULL for a debug insn.   */

Comments consistently above the field.

> +extern void lra_expand_reg_info (void);

This doesn't exist any more.

> +extern int lra_constraint_new_insn_uid_start;

Just saying in case: this seems to be write-only, with lra-constraints.c
instead using a static variable to track the uid start.

I realise you might want to keep it anyway for consistency with
lra_constraint_new_regno_start, or for debugging.

> +extern rtx lra_secondary_memory[NUM_MACHINE_MODES];

This doesn't exist any more.

> +/* lra-saves.c: */
> +
> +extern bool lra_save_restore (void);

Same for this file & function.

> +/* The function returns TRUE if at least one hard register from ones
> +   starting with HARD_REGNO and containing value of MODE are in set
> +   HARD_REGSET.   */
> +static inline bool
> +lra_hard_reg_set_intersection_p (int hard_regno, enum machine_mode mode,
> +  HARD_REG_SET hard_regset)
> +{
> +  int i;
> +
> +  lra_assert (hard_regno >= 0);
> +  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
> +if (TEST_HARD_REG_BIT (hard_regset, hard_regno + i))
> +  return true;
> +  return false;
> +}

This is the same as overlaps_hard_reg_set_p.

> +/* Return hard regno and offset of (sub-)register X through arguments
> +   HARD_REGNO and OFFSET.  If it is not (sub-)register or the hard
> +   register is unknown, then return -1 and 0 correspondingly.  */

The function seems to return -1 for both.

> +/* Add hard registers starting with HARD_REGNO and holding value of
> +   MODE to the set S.  */
> +static inline void
> +lra_add_hard_reg_set (int hard_regno, enum machine_mode mode, HARD_REG_SET 
> *s)
> +{
> +  int i;
> +
> +  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
> +SET_HARD_REG_BIT (*s, hard_regno + i);
> +}

This is add_to_hard_reg_set.

> +   Here is block diagram of LRA passes:
> +
> +   [ASCII block diagram, garbled in the archive: "New (and old) pseudos
> +    assignment" -> "Memory-memory move coalesce" -> "Undo inheritance
> +    for spilled pseudos and splits (for pseudos that got the same hard
> +    regs)"; "Remove scratches" and "Update virtual register
> +    displacements" also appear, starting from Start]

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-15 Thread Richard Sandiford
Vladimir Makarov  writes:
>> if that's accurate.  I dropped the term "reload pseudo" because of
>> the general comment in my earlier reply about the use of "reload pseudo"
>> when the code seems to include inheritance and split pseudos too.
> There is no inheritance and splitting yet.  It is done after the 
> constraint pass.
> So at this stage >= new_regno_start means reload pseudo.

Ah, OK.

>> That's a change in the meaning of NEW_CLASS, but seems easier for
>> callers to handle.  I think all it requires is changing:
>>
>>> +  common_class = ira_reg_class_subset[rclass][cl];
>>> +  if (new_class != NULL)
>>> +   *new_class = common_class;
>> to:
>>
>>common_class = ira_reg_class_subset[rclass][cl];
>>if (new_class != NULL && rclass != common_class)
>>  *new_class = common_class;
> This change results in infinite LRA looping on a first libgcc file 
> compilation.  Unfortunately I have no time to investigate it.
> I'd like to say that most code of in this code is very sensitive to 
> changes.  I see it a lot.  You change something looking obvious and a 
> target is broken.
> I am going to investigate it when I have more time.

Thanks.

>>> +default:
>>> +  {
>>> +   const char *fmt = GET_RTX_FORMAT (code);
>>> +   int i;
>>> +
>>> +   if (GET_RTX_LENGTH (code) != 1
>>> +   || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
>>> + {
>>> +   for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
>>> + if (fmt[i] == 'e')
>>> +   extract_loc_address_regs (false, mode, as, &XEXP (x, i),
>>> + context_p, code, SCRATCH,
>>> + modify_p, ad);
>>> +   break;
>>> + }
>>> +   /* fall through for case UNARY_OP (UNSPEC ...)  */
>>> +  }
>>> +
>>> +case UNSPEC:
>>> +  if (ad->disp_loc == NULL)
>>> +   ad->disp_loc = loc;
>>> +  else if (ad->base_reg_loc == NULL)
>>> +   {
>>> + ad->base_reg_loc = loc;
>>> + ad->base_outer_code = outer_code;
>>> + ad->index_code = index_code;
>>> + ad->base_modify_p = modify_p;
>>> +   }
>>> +  else
>>> +   {
>>> + lra_assert (ad->index_reg_loc == NULL);
>>> + ad->index_reg_loc = loc;
>>> +   }
>>> +  break;
>>> +
>>> +}
>> Which targets use a bare UNSPEC as a displacement?  I thought a
>> displacement had to be a link-time constant, in which case it should
>> satisfy CONSTANT_P.  For UNSPECs, that means wrapping it in a CONST.
> I saw it somewhere.  I guess IA64.
>> I'm just a bit worried that the UNSPEC handling is sensitive to the
>> order that subrtxes are processed (unlike PLUS, which goes to some
>> trouble to work out what's what).  It could be especially confusing
>> because the default case processes operands in reverse order while
>> PLUS processes them in forward order.
>>
>> Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough?
>> Probably deserves a comment.
> I don't remember.  To figure out, I should switch it off and try all 
> targets supported by LRA.
>> AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just
>> recording where reloads of a particular class need to go (obviously
>> in the case of disp_loc, which isn't reloaded at all).  The fields
>> have semantic value too.  I.e. we use them to work out the value
>> of at least part of the address.
>>
>> In that case it seems dangerous to look through general rtxes
>> in the way that the default case above does.  Maybe just making
>> sure that DISP_LOC is involved in a sum with the base would be
>> enough, but another idea was:
>>
>> 
>> I know of three ways of "mutating" (for want of a better word)
>> an address:
>>
>>1. (and X (const_int X)), to align
>>2. a subreg
>>3. a unary operator (such as truncation or extension)
>>
>> So maybe we could:
>>
>>a. remove outer mutations (using a helper function)
>>b. handle LO_SUM, PRE_*, POST_*: as now
>>c. otherwise treat the address of the sum of one, two or three pieces.
>>   c1. Peel mutations of all pieces.
>>   c2. Classify the pieces into base, index and displacement.
>>   This would be similar to the jousting code above, but hopefully
>>   easier because all three rtxes are to hand.  E.g. we could
>>   do the base vs. index thing in a similar way to
>>   commutative_operand_precedence.
>>   c3. Record which pieces were mutated (e.g. using something like the
>>   index_loc vs. index_reg_loc distinction in the current code)
>>
>> That should be general enough for current targets, but if it isn't,
>> we could generalise it further when we know what generalisation is needed.
>>
>> That's still going to be a fair amount of code, but hopefully not more,
>> and we might have more confidence at each stage what each value is.
>> And it avoids the risk of treating "mutated" addresses as "unmutated" ones.
>> ---

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-13 Thread Richard Sandiford
I'm having to correct my own comments again, sorry.

Richard Sandiford  writes:
>> +  /* If this is post-increment, first copy the location to the reload reg.  
>> */
>> +  if (post && real_in != result)
>> +emit_insn (gen_move_insn (result, real_in));
>
> Nit, but real_in != result can never be true AIUI, and I was confused how
> the code could be correct in that case.  Maybe just remove it, or make
> it an assert?

Probably obvious, but I meant "real_in == result" can never be true.
"real_in != result" could be removed or turned into an assert.

>> +if (GET_CODE (op) == PLUS)
>> +  {
>> +plus = op;
>> +op = XEXP (op, 1);
>> +  }
>
> Sorry, I'm complaining about old reload code again, but: does this
> actually happen in LRA?  In reload, a register operand could become a
> PLUS because of elimination, but I thought LRA did things differently.
> Besides, this is only needed for:
>
>> +if (CONST_POOL_OK_P (mode, op)
>> +&& ((targetm.preferred_reload_class
>> + (op, (enum reg_class) goal_alt[i]) == NO_REGS)
>> +|| no_input_reloads_p)
>> +&& mode != VOIDmode)
>> +  {
>> +rtx tem = force_const_mem (mode, op);
>> +
>> +change_p = true;
>> +/* If we stripped a SUBREG or a PLUS above add it back.  */
>> +if (plus != NULL_RTX)
>> +  tem = gen_rtx_PLUS (mode, XEXP (plus, 0), tem);
>
> and we shouldn't have (plus (constant ...) ...) after elimination
> (or at all outside of a CONST).  I don't understand why the code is
> needed even in reload.

Scratch the thing about needing it for reload.  It's obviously the
second operand we're reloading, not the first, and it could
well be that an elimination displacement needs to be reloaded via
the constant pool.

The question for LRA still stands though.

Richard


Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-12 Thread Richard Sandiford
Vladimir Makarov  writes:
>>> +/* Info about pseudo used during the assignment pass.  Thread is a set
>>> +   of connected reload and inheritance pseudos with the same set of
>>> +   available hard reg set.  Thread is a pseudo itself for other
>>> +   cases.  */
>>> +struct regno_assign_info
>> Maybe:
>>
>> /* Information about the thread to which a pseudo belongs.  Threads are
>> a set of connected reload and inheritance pseudos with the same set of
>> available hard registers.  Lone registers belong to their own threads.  
>> */
> Fixed.
>> Although the condition seems to be:
>>> +   && (ira_class_hard_regs_num[regno_allocno_class_array[regno1]]
>>> +   == ira_class_hard_regs_num[regno_allocno_class_array[regno2]]))
>> i.e. the same _number_ of available hard regs, but not necessarily the
>> same set.
> It should be the same in most cases.  This condition is just a faster 
> approximation of the same available hard reg set.

The distinction does seem important though.   It's possible that
a target has two distinct register files of the same allocatable size.
Would something like:

  (ira_class_subset_p[class1][class2]
   && ira_class_subset_p[class2][class1])

work instead?

>> /* Update the preference for using HARD_REGNO for pseudos that are
>> connected directly or indirectly with REGNO.  Apply divisor DIV
>> to any preference adjustments.
>>
>> The more indirectly a pseudo is connected, the smaller its effect
>> should be.  We therefore increase DIV on each "hop".  */
>>
> Fixed.  By the way, it is your invention from IRA.

Heh, I'd forgotten all about that.

>>> +  /* We are trying to spill a reload pseudo.  That is wrong we
>>> +should assign all reload pseudos, otherwise we cannot reuse
>>> +the selected alternatives.  */
>>> +  hard_regno = find_hard_regno_for (regno, &cost, -1);
>>> +  if (hard_regno >= 0)
>>> +   {
>> Don't really understand this comment, sorry.
> I removed the comment.  It is from an old solution trying to 
> guarantee an assignment to the reload pseudo.
>> Also, why are we passing -1 to find_hard_regno_for, rather than hard_regno?
>> The loop body up till this point has been specifically freeing up registers
>> to make hard_regno allocatable.  I realise that, by spilling everything
>> that overlaps this range, we might have freed up other registers too,
>> and so made others besides hard_regno allocatable.  But wouldn't we want
>> to test those other hard registers in "their" iteration of the loop
>> instead of this one?  The spills in those iterations ought to be more
>> directed (i.e. there should be less incidental spilling).
>>
>> As things stand, doing an rclass_size * rclass_size scan seems
>> unnecessarily expensive, although probably off the radar.
> We cannot just pass hard_regno for multi-word pseudo when hard_regno-1 
> is already free.

But this call is in a loop that iterates over all registers in the class:

  for (i = 0; i < rclass_size; i++)
{
  hard_regno = ira_class_hard_regs[rclass][i];

and we reach the find_hard_regno_for call unless there is some
conflicting register that we cannot spill.  So if "hard_regno - 1"
belongs to the allocation class and is a viable choice, "its" iteration
of the loop would spill specifically for "hard_regno - 1" and get the
most accurate cost for that register.  I couldn't see why any other
iteration of the loop would want to consider "hard_regno - 1".

> You are right about the possibility of speeding up the code, although on 
> the profiles I looked at (including the last huge tests) spill_for and 
> find_hard_regno_for take little time.  That is probably because 
> you don't need to spill frequently.  Freeing one long-live-range pseudo 
> permits finding a hard regno without spilling for many short-lived pseudos 
> (reload and inheritance ones).
> Also, the rclass_size * rclass_size loop is not expensive; the preparation of 
> data for the loop is expensive.

OK, in that case maybe the efficiency concern wasn't justified.
FWIW, I still think passing hard_regno would be clearer though,
in terms of meeting expectations.  It just seems odd to spill for
one specific register and then test all of them.  Especially when the
spilling we actually do after choosing register X is based on "X's"
iteration of this loop.

(I realise I could well be missing the point here though, sorry.)

>>> + && (reload_hard_regno
>>> + = find_hard_regno_for (reload_regno,
>>> +&reload_cost, -1)) >= 0
>>> + && (lra_hard_reg_set_intersection_p
>>> + (reload_hard_regno, PSEUDO_REGNO_MODE (reload_regno),
>>> +  spilled_hard_regs)))
>>> +   {
>>> + if (lra_dump_file != NULL)
>>> +   fprintf (lra_dump_file, " assign %d(cost=%d)",
>>> +reload_regno, reload_cost);
>>> + assign_temporarily (reload_regno, reload_hard_regno);
>>> +  

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-12 Thread Richard Sandiford
Hi Vlad,

Comments for the rest of ira-constraints.c.

Vladimir Makarov  writes:
> +  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
> +  change_p = equiv_address_substitution (&ad, addr_loc, mode, as, code);
> +  if (ad.base_reg_loc != NULL)
> +{
> +  if (process_addr_reg
> +   (ad.base_reg_loc, before,
> +(ad.base_modify_p && REG_P (*ad.base_reg_loc)
> + && find_regno_note (curr_insn, REG_DEAD,
> + REGNO (*ad.base_reg_loc)) == NULL
> + ? after : NULL),
> +base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
> + change_p = true;
> +  if (ad.base_reg_loc2 != NULL)
> + *ad.base_reg_loc2 = *ad.base_reg_loc;
> +  saved_base_reg = *ad.base_reg_loc;
> +  lra_eliminate_reg_if_possible (ad.base_reg_loc);
> +  if (ad.base_reg_loc2 != NULL)
> + {
> +   saved_base_reg2 = *ad.base_reg_loc2;
> +   lra_eliminate_reg_if_possible (ad.base_reg_loc2);
> + }

We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it
might be clearer without saved_base_reg2.  More below...

> +  /* The following addressing is checked by constraints and
> +  usually target specific legitimate address hooks do not
> +  consider them valid.  */
> +  || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
> +  || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC

typo: two PRE_DECs, although:

> +  || GET_CODE (*addr_loc) == PRE_MODIFY
> +  || GET_CODE (*addr_loc) == POST_MODIFY

the whole lot could just be replaced by ad.base_modify_p, or perhaps
even removed entirely given:

> +  /* In this case we can not do anything because if it is wrong
> +  that is because of wrong displacement.  Remember that any
> +  address was legitimate in non-strict sense before LRA.  */
> +  || ad.disp_loc == NULL)

It doesn't seem worth validating the address at all for ad.disp_loc == NULL.
E.g. something like:

  if (ad.base_reg_loc != NULL
  && (process_addr_reg
  (ad.base_reg_loc, before,
   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
&& find_regno_note (curr_insn, REG_DEAD,
REGNO (*ad.base_reg_loc)) == NULL
? after : NULL),
   base_reg_class (mode, as, ad.base_outer_code, ad.index_code))))
{
  change_p = true;
  if (ad.base_reg_loc2 != NULL)
*ad.base_reg_loc2 = *ad.base_reg_loc;
}

  if (ad.index_reg_loc != NULL
  && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
change_p = true;

  /* The address was valid before LRA.  We only change its form if the
 address has a displacement, so if it has no displacement it must
 still be valid.  */
  if (ad.disp_loc == NULL)
return change_p;

  /* See whether the address is still valid.  Some ports do not check
 displacements for eliminable registers, so we replace them
 temporarily with the elimination target.  */
  saved_base_reg = saved_index_reg = NULL_RTX;
  ...
  if (ok_p)
return change_p;

> +#ifdef HAVE_lo_sum
> +   {
> + rtx insn;
> + rtx last = get_last_insn ();
> +
> + /* disp => lo_sum (new_base, disp)  */
> + insn = emit_insn (gen_rtx_SET
> +   (VOIDmode, new_reg,
> +gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
> + code = recog_memoized (insn);
> + if (code >= 0)
> +   {
> + rtx save = *ad.disp_loc;
> +
> + *ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
> + if (! valid_address_p (mode, *ad.disp_loc, as))
> +   {
> + *ad.disp_loc = save;
> + code = -1;
> +   }
> +   }
> + if (code < 0)
> +   delete_insns_since (last);
> +   }
> +#endif

Nice :-)

Purely for the record, I wondered whether the high part should be
generated with emit_move_insn(_1) instead, with the rhs of the move
being the HIGH rtx.  That would allow targets to cope with cases where
the high part isn't represented directly as a HIGH.  E.g. on MIPS and
Alpha, small-data accesses use the global register as the high part instead.

However, both MIPS and Alpha accept small-data addresses as legitimate
constants and addresses before and during reload and only introduce the
split form after reload.  And I think that's how any other cases that
aren't simple HIGHs should be handled too.  E.g. MIPS also represents
GOT page loads as HIGHs until after reload, and only then lowers the
HIGH to a GOT load.  Allowing the backend to generate anything other
than a plain HIGH set here would be a double-edged sword.

So after all that I agree that the gen_rtx_SET above is better than
calling the move expanders.

> +   /* index * scale + disp => new base + index * scale  */
> +   enum reg_class cl = base_reg_class (mode, as, SCRATC

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-11 Thread Vladimir Makarov

On 10/04/2012 11:50 AM, Richard Sandiford wrote:

Hi Vlad,

This message is for lra-assigns.c.  Sorry for the piecemeal reviews,
never sure when I'll get time...


+/* This file contains a pass mostly assigning hard registers to reload
+   pseudos.  There is no any RTL code transformation on this pass.

Maybe:

/* This file's main objective is to assign hard registers to reload pseudos.
It also tries to allocate hard registers to other pseudos, but at a lower
priority than the reload pseudos.  The pass does not transform the RTL.

if that's accurate.

Yes.  That is better.  I used your comment.

+   Reload pseudos get what they need (usually) hard registers in
+   anyway possibly by spilling non-reload pseudos and by assignment
+   reload pseudos with smallest number of available hard registers
+   first.
+
+   If reload pseudos can get hard registers only through spilling
+   other pseudos, we choose what pseudos to spill taking into account
+   how given reload pseudo benefits and also how other reload pseudos
+   not assigned yet benefit too (see function spill_for).

Maybe:

We must allocate a hard register to every reload pseudo.  We try to
increase the chances of finding a viable allocation by assigning the
pseudos in order of fewest available hard registers first.  If we
still fail to find a hard register, we spill other (non-reload)
pseudos in order to make room.

assign_hard_regno_for allocates registers without spilling.
spill_for does the same with spilling.  Both functions use
a cost model to determine the most profitable choice of
hard and spill registers.

Ok.  I just changed two sentences a bit:

 find_hard_regno_for finds hard registers for allocation without  
spilling.  spill_for does the same with spilling.



+   Non-reload pseudos can get hard registers too if it is possible and
+   improves the code.  It might be possible because of spilling
+   non-reload pseudos on given pass.

Maybe:

Once we have finished allocating reload pseudos, we also try to
assign registers to other (non-reload) pseudos.  This is useful
if hard registers were freed up by the spilling just described.


Fixed.

+   We try to assign hard registers processing pseudos by threads.  The
+   thread contains reload and inheritance pseudos connected by copies
+   (move insns).  It improves the chance to get the same hard register
+   to pseudos in the thread and, as the result, to remove some move
+   insns.

Maybe:

We try to assign hard registers by collecting pseudos into threads.
These threads contain reload and inheritance pseudos that are connected
by copies (move insns).  Doing this improves the chances of pseudos
in the thread getting the same hard register and, as a result,
of allowing some move insns to be deleted.

Fixed.

+   When we assign hard register to a pseudo, we decrease the cost of
+   the hard registers for corresponding pseudos connected by copies.

Maybe:

When we assign a hard register to a pseudo, we decrease the cost of
using the same hard register for pseudos that are connected by copies.

Fixed.

+   If two hard registers are equally good for assigning the pseudo
+   with hard register cost point of view, we prefer a hard register in
+   smaller register bank.  By default, there is only one register
+   bank.  A target can define register banks by hook
+   register_bank. For example, x86-64 has a few register banks: hard
+   regs with and without REX prefixes are in different banks.  It
+   permits to generate smaller code as insns without REX prefix are
+   shorter.

Maybe:

If two hard registers have the same frequency-derived cost,
we prefer hard registers in lower register banks.  The mapping
of registers to banks is controlled by the register_bank target hook.
For example, x86-64 has a few register banks: hard registers with and
without REX prefixes are in different banks.  This permits us
to generate smaller code as insns without REX prefixes are shorter.

although this might change if the name of the hook changes.


With the recent change in the hook name, I modified it to:

   If two hard registers have the same frequency-derived cost, we
   prefer hard registers with bigger priorities.  The mapping of
   registers to priorities is controlled by the register_priority
   target hook.  For example, x86-64 has a few register priorities:
   hard registers with and without REX prefixes have different
   priorities.  This permits us to generate smaller code as insns
   without REX prefixes are shorter.
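
As an illustration of the hook, a hypothetical definition (not the real
i386 implementation; the register numbering here is made up):

  /* Sketch of a register_priority hook in the spirit of the x86-64
     example: prefer registers whose use gives shorter insns.  Here
     regnos 0-7 stand in for the non-REX integer registers.  */
  static int
  example_register_priority (int hard_regno)
  {
    return hard_regno < 8 ? 1 : 0;
  }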


+/* Info about pseudo used during the assignment pass.  Thread is a set
+   of connected reload and inheritance pseudos with the same set of
+   available hard reg set.  Thread is a pseudo itself for other
+   cases.  */
+struct regno_assign_info

Maybe:

/* Information about the thread to which a pseudo belongs.  Threads are
a set of connected reload and inheritance pseudos with 

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-10 Thread Vladimir Makarov

On 12-10-03 7:11 AM, Richard Sandiford wrote:

Hi Vlad,

Some comments on lra-spills.c and lra-coalesce.c.


+   The pass creates necessary stack slots and assign spilled pseudos
+   to the stack slots in following way:

s/assign/assigns/

Fixed.

+   (or insn memory constraints) might be not satisfied any more.

s/might be not/might not be/

Fixed.

+   For some targets, the pass can spill some pseudos into hard
+   registers of different class (usually into vector registers)
+   instead of spilling them into memory if it is possible and
+   profitable. Spilling GENERAL_REGS pseudo into SSE registers for
+   modern Intel x86/x86-64 processors is an example of such
+   optimization.  And this is actually recommended by Intel
+   optimization guide.

Maybe mention core i7 specifically?  "Modern" is a bit dangerous
in code that'll live a long time.

Yes, right.  Fixed.  Another bad thing would be using the word "new".

+/* The structure describes a stack slot which can be used for several
+   spilled pseudos.  */
+struct slot
+{

Looks like this describes "a register or stack slot" given the hard_regno case.

Fixed

+/* Array containing info about the stack slots. The array element is
+   indexed by the stack slot number in the range [0..slost_num).  */

Typo: slots_num

Fixed.

+  /* Each pseudo has an inherent size which comes from its own mode,
+ and a total size which provides room for paradoxical subregs
+ which refer to the pseudo reg in wider modes.
+
+ We can use a slot already allocated if it provides both enough
+ inherent space and enough total space.  Otherwise, we allocate a
+ new slot, making sure that it has no less inherent space, and no
+ less total space, then the previous slot. */

The second part of the comment seems a bit misplaced, since the following
code doesn't reuse stack slots.  This is done elsewhere instead.
Maybe the first part would be better above the inherent_size assignment.

Right.  I've changed the comment to reflect the current state of the code.

+  /* If we have any adjustment to make, or if the stack slot is the
+ wrong mode, make a new stack slot. */
+  x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust);

We don't make a new slot here.
I removed the comment.  The same comment is present in reload1.c and 
probably should also be removed.

+/* Sort pseudos according their slot numbers putting ones with smaller
+   numbers first, or last when the frame pointer is not needed. So
+   pseudos with the first slot will be finally addressed with smaller
+   address displacement.  */
+static int
+pseudo_reg_slot_compare (const void *v1p, const void *v2p)
+{
+  const int regno1 = *(const int *) v1p;
+  const int regno2 = *(const int *) v2p;
+  int diff, slot_num1, slot_num2;
+  int total_size1, total_size2;
+
+  slot_num1 = pseudo_slots[regno1].slot_num;
+  slot_num2 = pseudo_slots[regno2].slot_num;
+  if ((diff = slot_num1 - slot_num2) != 0)
+return (frame_pointer_needed
+   || !FRAME_GROWS_DOWNWARD == STACK_GROWS_DOWNWARD ? diff : -diff);

The comment doesn't quite describe the condition.  Maybe:

/* Sort pseudos according to their slots, putting the slots in the order
that they should be allocated.  Slots with lower numbers have the highest
priority and should get the smallest displacement from the stack or
frame pointer (whichever is being used).

The first allocated slot is always closest to the frame pointer,
so prefer lower slot numbers when frame_pointer_needed.  If the stack
and frame grow in the same direction, then the first allocated slot is
always closest to the initial stack pointer and furthest away from the
final stack pointer, so allocate higher numbers first when using the
stack pointer in that case.  The reverse is true if the stack and
frame grow in opposite directions.  */

I used your comment.  Thanks.

+  total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1),
+GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode));
+  total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2),
+GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode));
+  if ((diff = total_size2 - total_size1) != 0)
+return diff;

I think this could do with a bit more commentary.  When is biggest_mode
ever smaller than PSEUDO_REGNO_BYTES?  Is that for pseudos that are only
ever referenced as lowpart subregs?  If so, why does PSEUDO_REGNO_BYTES
matter for those registers here but not when calculating biggest_mode?
The MAX code makes no sense to me either (it was probably wrongly adapted 
from somewhere), so I removed the MAX.

+/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
+   pseudos which did not get a spill hard register at the beginning of
+   array PSEUDO_REGNOS. Return the number of such pseudos.  */

It'd be worth saying that PSEUDO_REGNOS is sorted in order of highest
frequency first.

Fixed.

+  bitmap set_jump_crosses = regstat_get

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-10 Thread Richard Sandiford
Sorry, reading back in different surroundings made me notice a couple
of silly errors:

Richard Sandiford  writes:
> E.g.:
>
>   if ((*loc = get_equiv_substitution (reg)) != reg)
> ...as above...
>   if (*loc != reg || !in_class_p (reg, cl, &new_class))
> ...as above...
>   else if (new_class != NO_REGS && rclass != new_class)
> change_class (regno, new_class, "Change", true);
>   return change_p;
>
> (assuming change to in_class_p suggested earlier) seems like it
> covers the same cases.

...but that same in_class_p change means that the "rclass != new_class"
condition isn't needed.  I.e.

  if ((*loc = get_equiv_substitution (reg)) != reg)
...as above...
  if (*loc != reg || !in_class_p (reg, cl, &new_class))
...as above...
  else if (new_class != NO_REGS)
change_class (regno, new_class, "  Change", true);
  return change_p;

>> +  if (operand_reg[nop] != NULL_RTX)
>> +{
>> +  int last_reload = (lra_reg_info[ORIGINAL_REGNO
>> +  (operand_reg[nop])]
>> + .last_reload);
>> +
>> +  if (last_reload > bb_reload_num)
>> +reload_sum += last_reload;
>> +  else
>> +reload_sum += bb_reload_num;
>
> The comment for reload_sum says:
>
>> +/* Overall number reflecting distances of previous reloading the same
>> +   value.  It is used to improve inheritance chances.  */
>> +static int best_reload_sum;
>
> which made me think of distance from the current instruction.  I see
> it's actually something else, effectively a sum of instruction numbers.
>
> I assumed the idea was to prefer registers that were reloaded more
> recently (closer the current instruction).  In that case I thought that,
> for a distance-based best_reload_sum, smaller would be better,
> while for an instruction-number-based best_reload_sum, larger would
> be better.  It looks like we use instruction-number based best_reload_sums
> but prefer smaller sums:
>
>> +  && (reload_nregs < best_reload_nregs
>> +  || (reload_nregs == best_reload_nregs
>> +  && best_reload_sum < reload_sum
>
> Is that intentional?

Clearly I can't read.  The code _does_ prefer higher numbers.  I still
think "distance" is a bit misleading though. :-)

Just for the record, the rest of my question:

> Also, is this value meaningful for output reloads, which aren't really
> going to be able to inherit a value as such?  We seem to apply the cost
> regardless of whether it's an input or an output, so probably deserves
> a comment.
>
> Same for matched input operands, which as you say elsewhere aren't
> inherited.

still applies.

Richard


Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-10 Thread Vladimir Makarov

On 12-10-02 10:14 AM, Richard Sandiford wrote:

Richard Sandiford  writes:

+/* Merge ranges R1 and R2 and returns the result.  The function
+   maintains the order of ranges and tries to minimize size of the
+   result range list.  */
+lra_live_range_t
+lra_merge_live_ranges (lra_live_range_t r1, lra_live_range_t r2)
+{
+  lra_live_range_t first, last, temp;
+
+  if (r1 == NULL)
+return r2;
+  if (r2 == NULL)
+return r1;
+  for (first = last = NULL; r1 != NULL && r2 != NULL;)
+{
+  if (r1->start < r2->start)
+   {
+ temp = r1;
+ r1 = r2;
+ r2 = temp;
+   }
+  if (r1->start <= r2->finish + 1)
+   {
+ /* Intersected ranges: merge r1 and r2 into r1.  */
+ r1->start = r2->start;
+ if (r1->finish < r2->finish)
+   r1->finish = r2->finish;
+ temp = r2;
+ r2 = r2->next;
+ pool_free (live_range_pool, temp);
+ if (r2 == NULL)
+   {
+ /* To try to merge with subsequent ranges in r1.  */
+ r2 = r1->next;
+ r1->next = NULL;
+   }
+   }
+  else
+   {
+ /* Add r1 to the result.  */
+ if (first == NULL)
+   first = last = r1;
+ else
+   {
+ last->next = r1;
+ last = r1;
+   }
+ r1 = r1->next;
+ if (r1 == NULL)
+   {
+ /* To try to merge with subsequent ranges in r2.  */
+ r1 = r2->next;
+ r2->next = NULL;
+   }
+   }

I might be misreading, but I'm not sure whether this handles merges like:

   r1 = [6,7], [3,4]
   r2 = [3,8], [0,1]

After the first iteration, it looks like we'll have:

   r1 = [3,8], [3,4]
   r2 = [0,1]

Then we'll add both [3,8] and [3,4] to the result.

OK, so I start to read patch b and realise that this is only supposed to
handle non-overlapping live ranges.  It might be worth having a comment
and assert to that effect, for slow readers like me.

Although in that case the function feels a little more complicated than
it needs to be.  When we run out of R1 or R2, why not just use the other
one as the rest of the live range list?  Why is:


+ if (r1 == NULL)
+   {
+ /* To try to merge with subsequent ranges in r2.  */
+ r1 = r2->next;
+ r2->next = NULL;
+   }

needed?


No, it is not necessary (if the assert below is removed).  I simplified 
the code, added comments, and added an assert checking that the ranges do 
not intersect.  Actually I wanted to simplify it later anyway to speed up 
LRA (this function is called many times for PR54146).
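
Roughly, the simplified loop I have in mind looks like this (a sketch
only, not the committed code; it assumes the two lists never intersect,
and it may still leave an occasional adjacent pair unmerged):

lra_live_range_t
lra_merge_live_ranges (lra_live_range_t r1, lra_live_range_t r2)
{
  lra_live_range_t first, *chain, temp;

  for (first = NULL, chain = &first; r1 != NULL && r2 != NULL;)
    {
      if (r1->start < r2->start)
	{
	  /* Make R1 the list whose head starts later.  */
	  temp = r1;
	  r1 = r2;
	  r2 = temp;
	}
      /* The lists are assumed not to intersect.  */
      lra_assert (r1->start > r2->finish);
      if (r1->start == r2->finish + 1)
	{
	  /* Adjacent ranges: extend R1's head down and free R2's head.  */
	  r1->start = r2->start;
	  temp = r2;
	  r2 = r2->next;
	  pool_free (live_range_pool, temp);
	}
      else
	{
	  /* R1's head lies entirely after R2's head: emit it.  */
	  *chain = r1;
	  chain = &r1->next;
	  r1 = r1->next;
	}
    }
  /* One list ran out; the rest of the other is already in order.  */
  *chain = (r1 != NULL ? r1 : r2);
  return first;
}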




Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-10 Thread Vladimir Makarov

On 12-10-02 9:42 AM, Richard Sandiford wrote:

Vladimir Makarov  writes:

This is the major patch containing all new files.  The patch also adds
necessary calls to LRA from IRA.  As the patch is too big, it continues in
the next email.

2012-09-27  Vladimir Makarov  

  * Makefile.in (LRA_INT_H): New.
  (OBJS): Add lra.o, lra-assigns.o, lra-coalesce.o,
  lra-constraints.o, lra-eliminations.o, lra-lives.o, and lra-spills.o.
  (ira.o): Add dependence on lra.h.
  (lra.o, lra-assigns.o, lra-coalesce.o, lra-constraints.o): New entries.
  (lra-eliminations.o, lra-lives.o, lra-spills.o): Ditto.
  * ira.c: Include lra.h.
  (ira_init_once, ira_init, ira_finish_once): Call lra_start_once,
  lra_init, lra_finish_once in anyway.
  (lra_in_progress): Remove.
  (do_reload): Call LRA.
  * lra.h: New.
  * lra-int.h: Ditto.
  * lra.c: Ditto.
  * lra-assigns.c: Ditto.
  * lra-constraints.c: Ditto.
  * lra-coalesce.c: Ditto.
  * lra-eliminations.c: Ditto.
  * lra-lives.c: Ditto.
  * lra-spills.c: Ditto.
  * doc/passes.texi: Describe LRA pass.

Comments on lra-lives.c.  (Sorry for the split, had more time to look
at this than expected)


+/* Copy live range list given by its head R and return the result.  */
+lra_live_range_t
+lra_copy_live_range_list (lra_live_range_t r)
+{
+  lra_live_range_t p, first, last;
+
+  if (r == NULL)
+return NULL;
+  for (first = last = NULL; r != NULL; r = r->next)
+{
+  p = copy_live_range (r);
+  if (first == NULL)
+   first = p;
+  else
+   last->next = p;
+  last = p;
+}
+  return first;
+}

Maybe simpler as:

   lra_live_range_t p, first, *chain;

   first = NULL;
   for (chain = &first; r != NULL; r = r->next)
 {
   p = copy_live_range (r);
   *chain = p;
   chain = &p->next;
 }
   return first;


OK.

+/* Merge ranges R1 and R2 and returns the result.  The function
+   maintains the order of ranges and tries to minimize size of the
+   result range list.  */
+lra_live_range_t
+lra_merge_live_ranges (lra_live_range_t r1, lra_live_range_t r2)
+{
+  lra_live_range_t first, last, temp;
+
+  if (r1 == NULL)
+return r2;
+  if (r2 == NULL)
+return r1;
+  for (first = last = NULL; r1 != NULL && r2 != NULL;)
+{
+  if (r1->start < r2->start)
+   {
+ temp = r1;
+ r1 = r2;
+ r2 = temp;
+   }
+  if (r1->start <= r2->finish + 1)
+   {
+ /* Intersected ranges: merge r1 and r2 into r1.  */
+ r1->start = r2->start;
+ if (r1->finish < r2->finish)
+   r1->finish = r2->finish;
+ temp = r2;
+ r2 = r2->next;
+ pool_free (live_range_pool, temp);
+ if (r2 == NULL)
+   {
+ /* To try to merge with subsequent ranges in r1.  */
+ r2 = r1->next;
+ r1->next = NULL;
+   }
+   }
+  else
+   {
+ /* Add r1 to the result.  */
+ if (first == NULL)
+   first = last = r1;
+ else
+   {
+ last->next = r1;
+ last = r1;
+   }
+ r1 = r1->next;
+ if (r1 == NULL)
+   {
+ /* To try to merge with subsequent ranges in r2.  */
+ r1 = r2->next;
+ r2->next = NULL;
+   }
+   }

I might be misreading, but I'm not sure whether this handles merges like:

   r1 = [6,7], [3,4]
   r2 = [3,8], [0,1]

After the first iteration, it looks like we'll have:

   r1 = [3,8], [3,4]
   r2 = [0,1]

Then we'll add both [3,8] and [3,4] to the result.

Same chain pointer comment as for lra_copy_live_range_list.


+/* Return TRUE if live range R1 is in R2.  */
+bool
+lra_live_range_in_p (lra_live_range_t r1, lra_live_range_t r2)
+{
+  /* Remember the live ranges are always kept ordered. */
+  while (r1 != NULL && r2 != NULL)
+{
+  /* R1's element is in R2's element.  */
+  if (r2->start <= r1->start && r1->finish <= r2->finish)
+   r1 = r1->next;
+  /* Intersection: R1's start is in R2.  */
+  else if (r2->start <= r1->start && r1->start <= r2->finish)
+   return false;
+  /* Intersection: R1's finish is in R2.  */
+  else if (r2->start <= r1->finish && r1->finish <= r2->finish)
+   return false;
+  else if (r1->start > r2->finish)
+   return false; /* No covering R2's element for R1's one.  */
+  else
+   r2 = r2->next;
+}
+  return r1 == NULL;

Does the inner bit reduce to:

   /* R1's element is in R2's element.  */
   if (r2->start <= r1->start && r1->finish <= r2->finish)
r1 = r1->next;
   /* All of R2's element comes after R1's element.  */
   else if (r2->start > r1->finish)
r2 = r2->next;
   else
return false;
Yes, that seems to me the right change.  I found that this function is not 
used in LRA anymore, so I am just removing it.  Sorry for 
wasting your time reviewing this.
I wrot

Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-09 Thread Vladimir Makarov

On 10/09/2012 06:17 AM, Richard Sandiford wrote:

Thanks for the updates.

Vladimir Makarov  writes:

+   a change in the offset between the eliminable register and its
+   substitution if UPDATE_P, or the full offset if FULL_P, or
+   otherwise zero.

I wonder if an enum would be better than two booleans?
It avoids invalid combinations like UPDATE_P && FULL_P
and might make the arguments more obvious too.

IMHO, it is a matter of choice.  I don't like to introduce a new enum just
for one function.  It is a pretty standard situation; I usually introduce
an enum only when some combinations are prohibited.

OK.  I agree this is probably personal preference.


+ /* The only time we want to replace a PLUS with a REG
+(this occurs when the constant operand of the PLUS is
+the negative of the offset) is when we are inside a
+MEM.  We won't want to do so at other times because
+that would change the structure of the insn in a way
+that reload can't handle.  We special-case the
+commonest situation in eliminate_regs_in_insn, so
+just replace a PLUS with a PLUS here, unless inside a
+MEM.  */

Reload reference.  Does this restriction still apply?  The later comment:

I don't think so.  I removed the comment.

Well, the question was about the code as much as the comment.
The comment did describe what the code did:

  if (mem_mode != 0
  && CONST_INT_P (XEXP (x, 1))
  && INTVAL (XEXP (x, 1)) == -offset)
return to;
  else
return gen_rtx_PLUS (Pmode, to,
 plus_constant (Pmode,
XEXP (x, 1), offset));

If the restriction doesn't apply any more then the mem_mode condition
should be removed.  If does apply then we should have some sort of
comment to explain why.

I suppose the question is: what happens for PLUS match_operators?
If elimination changes a (plus (reg X) (const_int Y)) into (reg X'),
and the (plus ...) is matched via a match_operator, will LRA cope
correctly?  Or does LRA require a plus matched via a match_operator
to remain a plus?  Or shouldn't we eliminate match_operators at all,
and just process true operands?

I wasn't sure at this point (and still haven't read through everything,
so am still not sure now).
I guess LRA can handle such a change, possibly with a minor modification (a 
new insn recognition), so I am removing the mem_mode condition.  At least I 
did not find a problem on two targets.  I might bring the code back 
if I find a target where it is really necessary.
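
In other words, the tail of that case ends up roughly like this (a
sketch of the intent, not the exact committed lines):

	  /* Without the mem_mode restriction we collapse
	     (plus (reg from) (const_int -offset)) to the replacement
	     register everywhere, inside a MEM or not.  */
	  if (CONST_INT_P (XEXP (x, 1)) && INTVAL (XEXP (x, 1)) == -offset)
	    return to;
	  return gen_rtx_PLUS (Pmode, to,
			       plus_constant (Pmode, XEXP (x, 1), offset));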



+Note that there is no risk of modifying the structure of the insn,
+since we only get called for its operands, thus we are either
+modifying the address inside a MEM, or something like an address
+operand of a load-address insn.  */

I removed this too.

I think that's still accurate and should be kept.  I was just using
it to emphasise a point (probably badly, sorry).

I returned the comment.

makes it sound on face value like the MEM restriction above is a
reload-specific
thing.  Same question for:


+   /* As above, if we are not inside a MEM we do not want to
+  turn a PLUS into something else.  We might try to do so here
+  for an addition of 0 if we aren't optimizing.  */

It looks like your follow-up patch left this alone FWIW.
I modified the code as above in the hope that the removed code will not be 
necessary.

+#ifdef WORD_REGISTER_OPERATIONS
+  /* On these machines, combine can create RTL of the form
+ (set (subreg:m1 (reg:m2 R) 0) ...)
+ where m1 < m2, and expects something interesting to
+ happen to the entire word.  Moreover, it will use the
+ (reg:m2 R) later, expecting all bits to be preserved.
+ So if the number of words is the same, preserve the
+ subreg so that push_reload can see it.  */
+  && ! ((x_size - 1) / UNITS_PER_WORD
+== (new_size -1 ) / UNITS_PER_WORD)
+#endif

Reload reference (push_reload).  Do we still need this for LRA?

It is hard for me to say, so I would not touch this code, at least for now.
I changed the push_reload reference to LRA.

Could I ask you to reconsider?  The situation that the comment describes
sounds like a bug to me.  Removing it shouldn't affect the 4.8 submission.

It just seems to me that LRA is our big chance of getting rid of some
of this cruft.  If we're too scared to touch code like this even on
a big change like reload to LRA, we'll never be able to touch it.
Yes, you are right.  I am removing this too.  It does not affect x86.  
It might affect other targets (although I don't think so; I guess LRA 
does not need this).  If it does, I'll find why and 

Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-09 Thread Richard Sandiford
Thanks for the updates.

Vladimir Makarov  writes:
>>> +   a change in the offset between the eliminable register and its
>>> +   substitution if UPDATE_P, or the full offset if FULL_P, or
>>> +   otherwise zero.
>> I wonder if an enum would be better than two booleans?
>> It avoids invalid combinations like UPDATE_P && FULL_P
>> and might make the arguments more obvious too.
> IMHO, It is matter of choice.  I don't like to introduce a new enum just 
> for the function.  It is pretty standard situation.  I usually introduce 
> enum when there are a few combinations prohibited.

OK.  I agree this is probably personal preference.

>>> + /* The only time we want to replace a PLUS with a REG
>>> +(this occurs when the constant operand of the PLUS is
>>> +the negative of the offset) is when we are inside a
>>> +MEM.  We won't want to do so at other times because
>>> +that would change the structure of the insn in a way
>>> +that reload can't handle.  We special-case the
>>> +commonest situation in eliminate_regs_in_insn, so
>>> +just replace a PLUS with a PLUS here, unless inside a
>>> +MEM.  */
>> Reload reference.  Does this restriction still apply?  The later comment:
> I don't think so.  I removed the comment.

Well, the question was about the code as much as the comment.
The comment did describe what the code did:

  if (mem_mode != 0
  && CONST_INT_P (XEXP (x, 1))
  && INTVAL (XEXP (x, 1)) == -offset)
return to;
  else
return gen_rtx_PLUS (Pmode, to,
 plus_constant (Pmode,
XEXP (x, 1), offset));

If the restriction doesn't apply any more then the mem_mode condition
should be removed.  If does apply then we should have some sort of
comment to explain why.

I suppose the question is: what happens for PLUS match_operators?
If elimination changes a (plus (reg X) (const_int Y)) into (reg X'),
and the (plus ...) is matched via a match_operator, will LRA cope
correctly?  Or does LRA require a plus matched via a match_operator
to remain a plus?  Or shouldn't we eliminate match_operators at all,
and just process true operands?

I wasn't sure at this point (and still haven't read through everything,
so am still not sure now).

>>> +Note that there is no risk of modifying the structure of the insn,
>>> +since we only get called for its operands, thus we are either
>>> +modifying the address inside a MEM, or something like an address
>>> +operand of a load-address insn.  */
> I removed this too.

I think that's still accurate and should be kept.  I was just using
it to emphasise a point (probably badly, sorry).

>> makes it sound on face value like the MEM restriction above is a
>> reload-specific
>> thing.  Same question for:
>>
>>> +   /* As above, if we are not inside a MEM we do not want to
>>> +  turn a PLUS into something else.  We might try to do so here
>>> +  for an addition of 0 if we aren't optimizing.  */

It looks like your follow-up patch left this alone FWIW.

>>> +#ifdef WORD_REGISTER_OPERATIONS
>>> +  /* On these machines, combine can create RTL of the form
>>> + (set (subreg:m1 (reg:m2 R) 0) ...)
>>> + where m1 < m2, and expects something interesting to
>>> + happen to the entire word.  Moreover, it will use the
>>> + (reg:m2 R) later, expecting all bits to be preserved.
>>> + So if the number of words is the same, preserve the
>>> + subreg so that push_reload can see it.  */
>>> +  && ! ((x_size - 1) / UNITS_PER_WORD
>>> +== (new_size -1 ) / UNITS_PER_WORD)
>>> +#endif
>> Reload reference (push_reload).  Do we still need this for LRA?
> It is hard me to say.  So I would not touch this code at least for now.  
> I changed push reload to LRA.

Could I ask you to reconsider?  The situation that the comment describes
sounds like a bug to me.  Removing it shouldn't affect the 4.8 submission.

It just seems to me that LRA is our big chance of getting rid of some
of this cruft.  If we're too scared to touch code like this even on
a big change like reload to LRA, we'll never be able to touch it.

>>> +static void
>>> +eliminate_regs_in_insn (rtx insn, bool replace_p)
>>> +{
>>> +  int icode = recog_memoized (insn);
>>> +  rtx old_set = single_set (insn);
>>> +  bool val;
>> "validate_p" might be a better name.
>>
> Done.

Sorry for being too terse.  I see you've changed "replace_p" to
"validate_p", but I actually meant that _"val"_ should be changed
to "validate_p".  Elsewhere you use "val" to mean "value number",
and "val" could equally mean "validate_p" or "validated_p".

"replace_p" was a good name. :-)

>>> + /* If an output operand changed from a REG

Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-08 Thread Vladimir Makarov

On 12-10-02 7:21 AM, Richard Sandiford wrote:

Hi Vlad,

Vladimir Makarov  writes:

+LRA is different from the reload pass in LRA division on small,
+manageable, and separated sub-tasks.  All LRA transformations and
+decisions are reflected in RTL as more as possible.  Instruction
+constraints as a primary source of the info and that minimizes number
+of target-depended macros/hooks.

+LRA is run for the targets it were ported.

Suggest something like:

   Unlike the reload pass, intermediate LRA decisions are reflected in
   RTL as much as possible.  This reduces the number of target-dependent
   macros and hooks, leaving instruction constraints as the primary
   source of control.

   LRA is run on targets for which TARGET_LRA_P returns true.

It is better.  Done.

+/* The virtual registers (like argument and frame pointer) are widely
+   used in RTL. Virtual registers should be changed by real hard
+   registers (like stack pointer or hard frame pointer) plus some
+   offset.  The offsets are changed usually every time when stack is
+   expanded.  We know the final offsets only at the very end of LRA.

I always think of "virtual" as [FIRST_VIRTUAL_REGISTER, LAST_VIRTUAL_REGISTER].
Maybe "eliminable" would be better?  E.g.

/* Eliminable registers (like a soft argument or frame pointer) are widely
used in RTL.  These eliminable registers should be replaced by real hard
registers (like the stack pointer or hard frame pointer) plus some offset.
The offsets usually change whenever the stack is expanded.  We know the
final offsets only at the very end of LRA.

Yes, right.  Eliminable is a better term.  I'll change it too.

+   We keep RTL code at most time in such state that the virtual
+   registers can be changed by just the corresponding hard registers
+   (with zero offsets) and we have the right RTL code. To achieve this
+   we should add initial offset at the beginning of LRA work and update
+   offsets after each stack expanding. But actually we update virtual
+   registers to the same virtual registers + corresponding offsets
+   before every constraint pass because it affects constraint
+   satisfaction (e.g. an address displacement became too big for some
+   target).

Suggest:

Within LRA, we usually keep the RTL in such a state that the eliminable
registers can be replaced by just the corresponding hard register
(without any offset).  To achieve this we should add the initial
elimination offset at the beginning of LRA and update the offsets
whenever the stack is expanded.  We need to do this before every
constraint pass because the choice of offset often affects whether
a particular address or memory constraint is satisfied.

Done.

+   The final change of virtual registers to the corresponding hard
+   registers are done at the very end of LRA when there were no change
+   in offsets anymore:
+
+fp + 42 =>  sp + 42

virtual=>eliminable if the above is OK.

Ok.  Done.

+   Such approach requires a few changes in the rest GCC code because
+   virtual registers are not recognized as real ones in some
+   constraints and predicates. Fortunately, such changes are
+   small.  */

Not sure whether the last paragraph really belongs in the code,
since it's more about the reload->LRA transition.

I removed the last paragraph.

+  /* Nonzero if this elimination can be done.  */
+  bool can_eliminate;  
+  /* CAN_ELIMINATE since the last check.  */
+  bool prev_can_eliminate;

AFAICT, these two fields are (now) only ever assigned at the same time,
via init_elim_table and setup_can_eliminate.  Looks like we can do
without prev_can_eliminate.  (And the way that the pass doesn't
need to differentiate between the raw CAN_ELIMINABLE value and
the processed value feels nice and reassuring.)

Ok.  I'll put it on my to-do list.

+/* Map: 'from regno' -> to the current elimination, NULL otherwise.
+   The elimination table may contains more one elimination of a hard
+   register.  The map contains only one currently used elimination of
+   the hard register.  */
+static struct elim_table *elimination_map[FIRST_PSEUDO_REGISTER];

Nit: s/-> to/->/, s/may contains/may contain/.  Maybe:

/* Map: eliminable "from" register -> its current elimination,
or NULL if none.  The elimination table may contain more than
one elimination for the same hard register, but this map specifies
the one that we are currently using.  */

Done.

+/* When an eliminable hard register becomes not eliminable, we use the
+   special following structure to restore original offsets for the
+   register.  */

s/special following/following special/

Done.

+/* Return the current substitution hard register of the elimination of
+   HARD_REGNO. If HARD_REGNO is not eliminable, return itself.  */
+int
+lra_get_elimation_hard_regno (int hard_regno)

Typo: s/get_elimation/get_elimination/

Done.
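
For reference, the function itself is tiny; roughly the following, using
the elimination_map declared above (a sketch, the exact field names may
differ):

int
lra_get_elimination_hard_regno (int hard_regno)
{
  struct elim_table *ep;

  if (hard_regno < 0 || hard_regno >= FIRST_PSEUDO_REGISTER)
    return hard_regno;
  /* No elimination is currently in use for this hard register.  */
  if ((ep = elimination_map[hard_regno]) == NULL)
    return hard_regno;
  return ep->to;
}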

+/* Scan X and replace any eliminable registers (such as fp) wit

Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-05 Thread Vladimir Makarov

On 12-10-05 5:53 PM, Steven Bosscher wrote:

On Thu, Oct 4, 2012 at 2:59 PM, Steven Bosscher  wrote:

On Thu, Oct 4, 2012 at 1:30 PM, Richard Guenther
 wrote:

Isn't _REVERSE vs. non-_REVERSE still kind-of "random" order?

Not at this stage. For cfglayout mode I would answer yes, but IRA/LRA
operates in cfgrtl mode, so the sequence of insns and basic blocks
must match. Therefore, if you walk the basic blocks in reverse, and
the insns in each basic block in reverse, you effectively work on a,
let's say, "reverse extended basic block" (??) in that your program
points are sequential across fallthrough edges.


  Thus, doesn't
the above show there exists an optimal order for processing which we could use?

There may be a smarter order: Could even walk blocks in that order if
you know a priori what path through the CFG minimizes the length of
the live range chains. But to know that order, you have to build the
chains. So chicken-and-egg...

Richi, you had me inspired, and I remembered your (still disgusting
;-)) my_rev_post_order_compute hack.

A better order than walking basic blocks in reverse, is to walk them
in topological order on the reverse CFG. This results in very long
live ranges being compressed into very short chains because the
program points become almost sequential for them.

So here's one more patch:

@@ -994,8 +1092,16 @@ lra_create_live_ranges (bool all_p)
curr_point = 0;
point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
lra_point_freq = VEC_address (int, point_freq_vec);
-  FOR_EACH_BB (bb)
-process_bb_lives (bb);
+  int *post_order_rev_cfg = XNEWVEC (int, last_basic_block);
+  int n_blocks_inverted = inverted_post_order_compute (post_order_rev_cfg);
+  lra_assert (n_blocks_inverted == n_basic_blocks);
+  for (i = n_blocks_inverted - 1; i >= 0; --i)
+{
+  bb = BASIC_BLOCK (post_order_rev_cfg[i]);
+  if (bb == EXIT_BLOCK_PTR || bb == ENTRY_BLOCK_PTR)
+   continue;
+  process_bb_lives (bb, curr_point);
+} // HACK13 must add free (post_order_rev_cfg); here
lra_live_max_point = curr_point;
create_start_finish_chains ();
if (lra_dump_file != NULL)


Without this patch:
Compressing live ranges: from 700458 to 391665 - 55%, pre_count
40730653, post_count 34363983
max per-reg pre_count 12978 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])
max per-reg post_count 10967 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])

With this patch:
Compressing live ranges: from 700458 to 372585 - 53%, pre_count
283937, post_count 271120
max per-reg pre_count 545 (230653, 542 defs, 542 uses) (reg/f:DI
230653 [ SR.13303 ])
max per-reg post_count 544 (230649, 542 defs, 542 uses) (reg/f:DI
230649 [ SR.13305 ])

(the per-reg counts are the lengths of the live range chains for the
mentioned regno).
Yes, that is impressive.  But I think the number of points in a live range is 
the real parameter of the complexity.

So the patch results in fewer program points, and way, way shorter
live range chains. Needless to say, really, compile time benefits
significantly also. Time spent in create_start_finish_chains:
Without this patch, 19.75s (2%) usr
With this patch, 0.14s (0%)

Vlad, is this something you also already tried? If not, then I'm going
to bootstrap&test this (on top of my pending other patches on
lra-lives.c).


No, I was busy fixing a bug in LRA.  I played a bit with your patch 
(including FOR_EACH_BB_REVERSE) on an Intel machine, but I did not see a 
visible improvement on slow.cc.


The code size is a bit different too, which I cannot explain, because 
neither the number of points nor the number of elements in a live range 
list enters into the assignment decisions.  I have no time to find out why 
it is different.  In any case I think it is ok and there is a reasonable 
explanation for it.


As on AMD machines, the improvement is really significant.  I have no 
objections to the previous patch (including FOR_EACH_BB_REVERSE) or to the 
patch you posted here, provided it is committed with a proper ChangeLog.


Thanks for working on this, Steven.  I'd like to attack 
find_hard_regno_for when I have time.  I should say that working on 
improving compilation speed for this particular case is a bit addictive for me.


Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-05 Thread Steven Bosscher
On Thu, Oct 4, 2012 at 2:59 PM, Steven Bosscher  wrote:
> On Thu, Oct 4, 2012 at 1:30 PM, Richard Guenther
>  wrote:
>> Isn't _REVERSE vs. non-_RESERVE still kind-of "random" order?
>
> Not at this stage. For cfglayout mode I would answer yes, but IRA/LRA
> operates in cfgrtl mode, so the sequence of insns and basic blocks
> must match. Therefore, if you walk the basic blocks in reverse, and
> the insns in each basic block in reverse, you effectively work on a,
> let's say, "reverse extended basic block" (??) in that your program
> points are sequential across fallthrough edges.
>
>>  Thus, doesn't
>> the above show there exists an optimal order for processing which we could 
>> use?
>
> There may be a smarter order: Could even walk blocks in that order if
> you know a priori what path through the CFG minimizes the length of
> the live range chains. But to know that order, you have to build the
> chains. So chicken-and-egg...

Richi, you had me inspired, and I remembered your (still disgusting
;-)) my_rev_post_order_compute hack.

A better order than walking basic blocks in reverse, is to walk them
in topological order on the reverse CFG. This results in very long
live ranges being compressed into very short chains because the
program points become almost sequential for them.

So here's one more patch:

@@ -994,8 +1092,16 @@ lra_create_live_ranges (bool all_p)
   curr_point = 0;
   point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
   lra_point_freq = VEC_address (int, point_freq_vec);
-  FOR_EACH_BB (bb)
-process_bb_lives (bb);
+  int *post_order_rev_cfg = XNEWVEC (int, last_basic_block);
+  int n_blocks_inverted = inverted_post_order_compute (post_order_rev_cfg);
+  lra_assert (n_blocks_inverted == n_basic_blocks);
+  for (i = n_blocks_inverted - 1; i >= 0; --i)
+{
+  bb = BASIC_BLOCK (post_order_rev_cfg[i]);
+  if (bb == EXIT_BLOCK_PTR || bb == ENTRY_BLOCK_PTR)
+   continue;
+  process_bb_lives (bb, curr_point);
+} // HACK13 must add free (post_order_rev_cfg); here
   lra_live_max_point = curr_point;
   create_start_finish_chains ();
   if (lra_dump_file != NULL)
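
For the record, the HACK13 note just means the hunk should end with the
order array freed after the loop; roughly (whitespace approximate):

  int *post_order_rev_cfg = XNEWVEC (int, last_basic_block);
  int n_blocks_inverted = inverted_post_order_compute (post_order_rev_cfg);
  lra_assert (n_blocks_inverted == n_basic_blocks);
  for (i = n_blocks_inverted - 1; i >= 0; --i)
    {
      bb = BASIC_BLOCK (post_order_rev_cfg[i]);
      if (bb == EXIT_BLOCK_PTR || bb == ENTRY_BLOCK_PTR)
	continue;
      process_bb_lives (bb, curr_point);
    }
  /* Release the visiting order now that all blocks are processed.  */
  free (post_order_rev_cfg);
  lra_live_max_point = curr_point;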


Without this patch:
Compressing live ranges: from 700458 to 391665 - 55%, pre_count
40730653, post_count 34363983
max per-reg pre_count 12978 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])
max per-reg post_count 10967 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])

With this patch:
Compressing live ranges: from 700458 to 372585 - 53%, pre_count
283937, post_count 271120
max per-reg pre_count 545 (230653, 542 defs, 542 uses) (reg/f:DI
230653 [ SR.13303 ])
max per-reg post_count 544 (230649, 542 defs, 542 uses) (reg/f:DI
230649 [ SR.13305 ])

(the per-reg counts are the lengths of the live range chains for the
mentioned regno).

So the patch results in fewer program points, and way, way shorter
live range chains. Needless to say, really, compile time benefits
significantly also. Time spent in create_start_finish_chains:
Without this patch, 19.75s (2%) usr
With this patch, 0.14s (0%)

Vlad, is this something you also already tried? If not, then I'm going
to bootstrap&test this (on top of my pending other patches on
lra-lives.c).

Ciao!
Steven


Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-04 Thread Steven Bosscher
On Thu, Oct 4, 2012 at 5:31 PM, Vladimir Makarov  wrote:
>
> Wow.  I did not have such effect.  What machine do you use?

I do all my testing on gcc17.

Ciao!
Steven


Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-04 Thread Vladimir Makarov

On 10/04/2012 02:57 AM, Steven Bosscher wrote:

On Thu, Oct 4, 2012 at 5:30 AM, Vladimir Makarov  wrote:

I was going to look at this code too, but I was interested in generating
fewer points and live ranges.  It is strange that in my profiles,
remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
time on these huge tests.   So I was not interested in speeding up the
function, and maybe that is why you see no visible change in compilation
time.

Right. The compression algorithm doesn't care much about the initial
number of program points, only about the number of live ranges before
and after compression. I had expected a bigger effect on the number of
live ranges before compression.

0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?


0.65% (2.78s).

Actually, I have a profile but I am not sure now that it is for 
PR54146.  It might be for PR26854.


I'll check it again to be sure.


Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-04 Thread Richard Sandiford
Hi Vlad,

This message is for lra-assigns.c.  Sorry for the piecemeal reviews,
never sure when I'll get time...

> +/* This file contains a pass mostly assigning hard registers to reload
> +   pseudos.  There is no any RTL code transformation on this pass.

Maybe:

/* This file's main objective is to assign hard registers to reload pseudos.
   It also tries to allocate hard registers to other pseudos, but at a lower
   priority than the reload pseudos.  The pass does not transform the RTL.

if that's accurate.

> +   Reload pseudos get what they need (usually) hard registers in
> +   anyway possibly by spilling non-reload pseudos and by assignment
> +   reload pseudos with smallest number of available hard registers
> +   first.
> +
> +   If reload pseudos can get hard registers only through spilling
> +   other pseudos, we choose what pseudos to spill taking into account
> +   how given reload pseudo benefits and also how other reload pseudos
> +   not assigned yet benefit too (see function spill_for).

Maybe:

   We must allocate a hard register to every reload pseudo.  We try to
   increase the chances of finding a viable allocation by assigning the
   pseudos in order of fewest available hard registers first.  If we
   still fail to find a hard register, we spill other (non-reload)
   pseudos in order to make room.

   assign_hard_regno_for allocates registers without spilling.
   spill_for does the same with spilling.  Both functions use
   a cost model to determine the most profitable choice of
   hard and spill registers.

> +   Non-reload pseudos can get hard registers too if it is possible and
> +   improves the code.  It might be possible because of spilling
> +   non-reload pseudos on given pass.

Maybe:

   Once we have finished allocating reload pseudos, we also try to
   assign registers to other (non-reload) pseudos.  This is useful
   if hard registers were freed up by the spilling just described.

> +   We try to assign hard registers processing pseudos by threads.  The
> +   thread contains reload and inheritance pseudos connected by copies
> +   (move insns).  It improves the chance to get the same hard register
> +   to pseudos in the thread and, as the result, to remove some move
> +   insns.

Maybe:

   We try to assign hard registers by collecting pseudos into threads.
   These threads contain reload and inheritance pseudos that are connected
   by copies (move insns).  Doing this improves the chances of pseudos
   in the thread getting the same hard register and, as a result,
   of allowing some move insns to be deleted.

> +   When we assign hard register to a pseudo, we decrease the cost of
> +   the hard registers for corresponding pseudos connected by copies.

Maybe:

   When we assign a hard register to a pseudo, we decrease the cost of
   using the same hard register for pseudos that are connected by copies.

> +   If two hard registers are equally good for assigning the pseudo
> +   with hard register cost point of view, we prefer a hard register in
> +   smaller register bank.  By default, there is only one register
> +   bank.  A target can define register banks by hook
> +   register_bank. For example, x86-64 has a few register banks: hard
> +   regs with and without REX prefixes are in different banks.  It
> +   permits to generate smaller code as insns without REX prefix are
> +   shorter.

Maybe:

   If two hard registers have the same frequency-derived cost,
   we prefer hard registers in lower register banks.  The mapping
   of registers to banks is controlled by the register_bank target hook.
   For example, x86-64 has a few register banks: hard registers with and
   without REX prefixes are in different banks.  This permits us
   to generate smaller code as insns without REX prefixes are shorter.

although this might change if the name of the hook changes.
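
E.g. an x86-64 definition might look something like this; purely
illustrative, since the hook name may change and the REX_*_REGNO_P tests
here are my assumption rather than a quote of the i386 code:

static int
ix86_register_bank (int hard_regno)
{
  /* Registers that need a REX prefix go into a higher-numbered bank,
     so that, costs being equal, the shorter encodings win.  */
  if (REX_INT_REGNO_P (hard_regno) || REX_SSE_REGNO_P (hard_regno))
    return 1;
  return 0;
}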

> +/* Info about pseudo used during the assignment pass.  Thread is a set
> +   of connected reload and inheritance pseudos with the same set of
> +   available hard reg set.  Thread is a pseudo itself for other
> +   cases.  */
> +struct regno_assign_info

Maybe:

/* Information about the thread to which a pseudo belongs.  Threads are
   a set of connected reload and inheritance pseudos with the same set of
   available hard registers.  Lone registers belong to their own threads.  */

Although the condition seems to be:

> + && (ira_class_hard_regs_num[regno_allocno_class_array[regno1]]
> + == ira_class_hard_regs_num[regno_allocno_class_array[regno2]]))

i.e. the same _number_ of available hard regs, but not necessarily the
same set.

"thread" might be more mnemonic than "regno_assign" in this file,
but that's bikeshed stuff.
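
For readers following along, the initialisation below suggests fields
along these lines (my reconstruction, not the exact declaration):

struct regno_assign_info
{
  /* First pseudo in the thread this pseudo belongs to.  */
  int first;
  /* Next pseudo in the same thread, or -1 for the last one.  */
  int next;
  /* Accumulated frequency of all pseudos in the thread.  */
  int freq;
};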

> +  for (i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
> +{
> +  regno_assign_info[i].first = i;
> +  regno_assign_info[i].next = -1;
> +  regno_assign_info[i].freq = lra_reg_info[i].freq;
> +}

Minor speedup, but it's probably worth caching max_reg_n

Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-04 Thread Vladimir Makarov

On 10/04/2012 05:43 AM, Steven Bosscher wrote:

On Wed, Oct 3, 2012 at 5:35 PM, Steven Bosscher  wrote:

The "worst" result is this:
Compressing live ranges: from 726174 to 64496 - 8%, pre_count 40476128, 
post_count 12483414

But that's still a lot better than before the patch for the same function:
Compressing live ranges: from 1742569 to 73069 - 4%, pre_count 40842330, 
post_count 12479992

Walking basic blocks with FOR_EACH_BB_REVERSE gives:

Only FOR_EACH_BB_REVERSE:
Compressing live ranges: from 1742579 to 429746 - 24% pre_count
41106212, post_count 34376494
Compressing live ranges: from 1742569 to 63000 - 3% pre_count
40835340, post_count 11055747

FOR_EACH_BB_REVERSE + need_curr_point_incr:
Compressing live ranges: from 726184 to 416529 - 57% pre_count
40743516, post_count 34376846
Compressing live ranges: from 726174 to 61840 - 8% pre_count 40472806,
post_count 11055747

The combination of the two changes takes ~20s off the ~180s for "LRA
create live ranges".



Wow.  I did not see such an effect.  What machine do you use?


Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-04 Thread Vladimir Makarov

On 10/04/2012 03:24 AM, Steven Bosscher wrote:

On Thu, Oct 4, 2012 at 8:57 AM, Steven Bosscher  wrote:

On Thu, Oct 4, 2012 at 5:30 AM, Vladimir Makarov  wrote:

I was going to look at this code too, but I was interested in generating
fewer points and live ranges.  It is strange that in my profiles,
remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
time on these huge tests.   So I was not interested in speeding up the
function, and maybe that is why you see no visible change in compilation
time.

Right. The compression algorithm doesn't care much about the initial
number of program points, only about the number of live ranges before
and after compression. I had expected a bigger effect on the number of
live ranges before compression.

0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?



I don't object to the idea of the patch.  I need some time to look at it (the
different results on a function are a bit scary for me) and to check simulator
times on other tests.

Understood.

BTW, it would be great if you can also look at this additional patch hunk:

@@ -994,8 +1044,8 @@ lra_create_live_ranges (bool all_p)
curr_point = 0;
point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
lra_point_freq = VEC_address (int, point_freq_vec);
-  FOR_EACH_BB (bb)
-process_bb_lives (bb);
+  FOR_EACH_BB_REVERSE (bb)
+process_bb_lives (bb, curr_point);
lra_live_max_point = curr_point;
create_start_finish_chains ();
if (lra_dump_file != NULL)

I think this should result in more live ranges being merged. Here's
why I think so, based on my far worse understanding of this code than
yours, so forgive me if I'm Completely Wrong :-)

No, you are not wrong.
Two days ago I worked on a patch which contains the same code.  The patch 
actually takes EBBs into account to decrease the number of calls to 
mark_pseudo_live at the beginning of process_bb_lives and to mark_pseudo_dead 
at its end, and for that I needed FOR_EACH_BB_REVERSE.  The patch was 
half-baked (it did not check hard-reg live changes at the end of a BB 
to set up the right hard-reg conflicts for pseudos), but it gave an idea of how 
much I can get from this.  It is not bad, but not what I expected, so I 
stopped working on it.  But we should still pursue these ideas as they 
improve LRA speed in small steps (many small steps will create a visible 
effect).


We can really solve the scalability problem only by using simpler but still 
good enough algorithms (algorithms that are too simple result in bigger code 
size and actually even worse compilation times).  I've been working on this 
and I'll send a patch soon.

process_bb_lives walks insns in the basic block from last to first, so
say you have a basic block chain 1->2->3, and each block has 4 insns,
then AFAIU the program points in block 1 will be [4,3,2,1], in block 2
it will be [8,7,6,5], and in block 3 it will be [12,11,10,9]. Say a
reg is used in block 3 at point 11, and set in block at point 3. Then
this reg will have a live range chain [3-1],[8-5],[12-11].

If you visit the basic blocks in reverse order, the program points
will be: 1:[12,11,10,9], 2:[8,7,6,5], 3:[4,3,2,1]. Now the same reg
will be set at point 11 and used at point 3, and the live range chain
will be just [11-3].






Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-04 Thread Steven Bosscher
On Thu, Oct 4, 2012 at 1:30 PM, Richard Guenther
 wrote:
> Isn't _REVERSE vs. non-_RESERVE still kind-of "random" order?

Not at this stage. For cfglayout mode I would answer yes, but IRA/LRA
operates in cfgrtl mode, so the sequence of insns and basic blocks
must match. Therefore, if you walk the basic blocks in reverse, and
the insns in each basic block in reverse, you effectively work on a,
let's say, "reverse extended basic block" (??) in that your program
points are sequential across fallthrough edges.

>  Thus, doesn't
> the above show there exists an optimal order for processing which we could 
> use?

There may be a smarter order: Could even walk blocks in that order if
you know a priori what path through the CFG minimizes the length of
the live range chains. But to know that order, you have to build the
chains. So chicken-and-egg...

> (I realize _REVERSE is a simple solution, but might there not exist a
> pathological case where _REVERSE is even worse than non-_REVERSE?)

Intuitively, I'm certain that _REVERSE is always better than
non-_REVERSE, although I don't know how to prove that :-)

Ciao!
Steven


Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 11:43 AM, Steven Bosscher  wrote:
> On Wed, Oct 3, 2012 at 5:35 PM, Steven Bosscher  wrote:
>> The "worst" result is this:
>> Compressing live ranges: from 726174 to 64496 - 8%, pre_count 40476128, 
>> post_count 12483414
>>
>> But that's still a lot better than before the patch for the same function:
>> Compressing live ranges: from 1742569 to 73069 - 4%, pre_count 40842330, 
>> post_count 12479992
>
> Walking basic blocks with FOR_EACH_BB_REVERSE gives:
>
> Only FOR_EACH_BB_REVERSE:
> Compressing live ranges: from 1742579 to 429746 - 24% pre_count
> 41106212, post_count 34376494
> Compressing live ranges: from 1742569 to 63000 - 3% pre_count
> 40835340, post_count 11055747
>
> FOR_EACH_BB_REVERSE + need_curr_point_incr:
> Compressing live ranges: from 726184 to 416529 - 57% pre_count
> 40743516, post_count 34376846
> Compressing live ranges: from 726174 to 61840 - 8% pre_count 40472806,
> post_count 11055747
>
> The combination of the two changes takes ~20s off the ~180s for "LRA
> create live ranges".

Isn't _REVERSE vs. non-_REVERSE still kind-of "random" order?  Thus, doesn't
the above show there exists an optimal order for processing which we could use?
(I realize _REVERSE is a simple solution, but might there not exist a
pathological
case where _REVERSE is even worse than non-_REVERSE?)

Richard.

> Ciao!
> Steven


Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-04 Thread Steven Bosscher
On Wed, Oct 3, 2012 at 5:35 PM, Steven Bosscher  wrote:
> The "worst" result is this:
> Compressing live ranges: from 726174 to 64496 - 8%, pre_count 40476128, 
> post_count 12483414
>
> But that's still a lot better than before the patch for the same function:
> Compressing live ranges: from 1742569 to 73069 - 4%, pre_count 40842330, 
> post_count 12479992

Walking basic blocks with FOR_EACH_BB_REVERSE gives:

Only FOR_EACH_BB_REVERSE:
Compressing live ranges: from 1742579 to 429746 - 24% pre_count
41106212, post_count 34376494
Compressing live ranges: from 1742569 to 63000 - 3% pre_count
40835340, post_count 11055747

FOR_EACH_BB_REVERSE + need_curr_point_incr:
Compressing live ranges: from 726184 to 416529 - 57% pre_count
40743516, post_count 34376846
Compressing live ranges: from 726174 to 61840 - 8% pre_count 40472806,
post_count 11055747

The combination of the two changes takes ~20s off the ~180s for "LRA
create live ranges".

Ciao!
Steven


Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-04 Thread Steven Bosscher
On Thu, Oct 4, 2012 at 8:57 AM, Steven Bosscher  wrote:
> On Thu, Oct 4, 2012 at 5:30 AM, Vladimir Makarov  wrote:
>> I was going to look at this code too but I was interesting in generation of
>> less points and live ranges.  It is strange that in my profiles,
>> remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
>> time on these huge tests.   So I was not interesting to speed up the
>> function and may be therefore you have no visible change in compilation
>> time.
>
> Right. The compression algorithm doesn't care much about the initial
> number of program points, only about the number of live ranges before
> and after compression. I had expected a bigger effect on the number of
> live ranges before compression.
>
> 0.6% sounds really very different from my timings. How much time does
> create_start_finish_chains take for you?
>
>
>> I don't object the idea of the patch.  I need some time to look at it (the
>> different results on a function is a bit scary for me) and check simulator
>> times on other tests.
>
> Understood.

BTW, it would be great if you can also look at this additional patch hunk:

@@ -994,8 +1044,8 @@ lra_create_live_ranges (bool all_p)
   curr_point = 0;
   point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
   lra_point_freq = VEC_address (int, point_freq_vec);
-  FOR_EACH_BB (bb)
-process_bb_lives (bb);
+  FOR_EACH_BB_REVERSE (bb)
+process_bb_lives (bb, curr_point);
   lra_live_max_point = curr_point;
   create_start_finish_chains ();
   if (lra_dump_file != NULL)

I think this should result in more live ranges being merged. Here's
why I think so, based on my far worse understanding of this code than
yours, so forgive me if I'm Completely Wrong :-)

process_bb_lives walks insns in the basic block from last to first, so
say you have a basic block chain 1->2->3, and each block has 4 insns,
then AFAIU the program points in block 1 will be [4,3,2,1], in block 2
it will be [8,7,6,5], and in block 3 it will be [12,11,10,9]. Say a
reg is used in block 3 at point 11, and set in block 1 at point 3. Then
this reg will have a live range chain [3-1],[8-5],[12-11].

If you visit the basic blocks in reverse order, the program points
will be: 1:[12,11,10,9], 2:[8,7,6,5], 3:[4,3,2,1]. Now the same reg
will be set at point 11 and used at point 3, and the live range chain
will be just [11-3].

I'm experimenting with this extra hunk and report back here.

Ciao!
Steven


Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-03 Thread Steven Bosscher
On Thu, Oct 4, 2012 at 5:30 AM, Vladimir Makarov  wrote:
> I was going to look at this code too but I was interesting in generation of
> less points and live ranges.  It is strange that in my profiles,
> remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
> time on these huge tests.   So I was not interesting to speed up the
> function and may be therefore you have no visible change in compilation
> time.

Right. The compression algorithm doesn't care much about the initial
number of program points, only about the number of live ranges before
and after compression. I had expected a bigger effect on the number of
live ranges before compression.

0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?


> I don't object the idea of the patch.  I need some time to look at it (the
> different results on a function is a bit scary for me) and check simulator
> times on other tests.

Understood.

FWIW you can find the test results from before and after the patch here:
http://gcc.gnu.org/ml/gcc-testresults/2012-10/msg00267.html
http://gcc.gnu.org/ml/gcc-testresults/2012-10/msg00303.html

Ciao!
Steven


Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-03 Thread Vladimir Makarov

On 12-10-03 11:35 AM, Steven Bosscher wrote:

On Wed, Oct 3, 2012 at 4:56 PM, Vladimir Makarov  wrote:

On 12-10-03 3:13 AM, Steven Bosscher wrote:

On Tue, Oct 2, 2012 at 3:42 PM, Richard Sandiford
 wrote:

+/* Compress pseudo live ranges by removing program points where
+   nothing happens.  Complexity of many algorithms in LRA is linear
+   function of program points number.  To speed up the code we try to
+   minimize the number of the program points here.  */
+static void
+remove_some_program_points_and_update_live_ranges (void)

Genuine question, but could we do this on the fly instead,
by not incrementing curr_point if the current point had no value?

I suppose the main complication would be checking cases where
all births are recorded by extending the previous just-closed live
range rather than starting a new one, in which case it's the previous
point that needs to be reused.  Hmm...

It does seem to be something worth investigating further. Things like:

Compressing live ranges: from 1742579 to 554532 - 31%
Compressing live ranges: from 1742569 to 73069 - 4%

look extreme, but they're actually the norm. For the same test case
(PR54146 still) but looking only at functions with initially 1000-
program points, you get this picture:

Compressing live ranges: from 1766 to 705 - 39%
Compressing live ranges: from 1449 to 370 - 25%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 2464 to 676 - 27%
Compressing live ranges: from 1433 to 379 - 26%
Compressing live ranges: from 1261 to 348 - 27%
Compressing live ranges: from 2835 to 755 - 26%
Compressing live ranges: from 5426 to 1678 - 30%
Compressing live ranges: from 5227 to 1477 - 28%
Compressing live ranges: from 1845 to 467 - 25%
Compressing live ranges: from 4868 to 1378 - 28%
Compressing live ranges: from 4875 to 1388 - 28%
Compressing live ranges: from 4882 to 1384 - 28%
Compressing live ranges: from 5688 to 1714 - 30%
Compressing live ranges: from 4943 to 1310 - 26%
Compressing live ranges: from 2976 to 792 - 26%
Compressing live ranges: from 5463 to 1526 - 27%
Compressing live ranges: from 2854 to 730 - 25%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 2771 to 904 - 32%
Compressing live ranges: from 4916 to 1429 - 29%
Compressing live ranges: from 6505 to 2238 - 34%
Compressing live ranges: from 6493 to 166 - 2%
Compressing live ranges: from 5498 to 1734 - 31%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 5043 to 1420 - 28%
Compressing live ranges: from 3094 to 788 - 25%
Compressing live ranges: from 4563 to 1311 - 28%
Compressing live ranges: from 4557 to 158 - 3%
Compressing live ranges: from 1050 to 274 - 26%
Compressing live ranges: from 1602 to 434 - 27%
Compressing live ranges: from 2474 to 600 - 24%
Compressing live ranges: from 2718 to 866 - 31%
Compressing live ranges: from 2097 to 716 - 34%
Compressing live ranges: from 4152 to 1099 - 26%
Compressing live ranges: from 5065 to 1514 - 29%
Compressing live ranges: from 1236 to 359 - 29%
Compressing live ranges: from 1722 to 541 - 31%
Compressing live ranges: from 5186 to 1401 - 27%

Unfortunately the dump doesn't mention how many live ranges could be
merged thanks to the compression.

It'd also be good to understand why the compression ratios are so
small, and consistently around ~30%.

This sentence is not clear to me.  30% means three times fewer points.  That is
pretty good compression.


   Maybe curr_point includes things
it should ignore (DEBUG_INSNs, NOTEs, ...)?


After the compression, only the points that matter for conflict info remain,
i.e. only points where some reg dies or starts living.  Moreover, if consecutive
points contain only life starts or only deaths, they are represented by one
point after the compression.

With this patch the amount of compression is reduced. Without the
patch the compression ratio is typically around 30%, with the patch
this goes up to ~65%. The worst compression ratios (where worse is
better in this case :-) are:

$ grep Compressing log4 | egrep " [78].%"
Compressing live ranges: from 31 to 22 - 70%, pre_count 17, post_count 15
Compressing live ranges: from 128 to 94 - 73%, pre_count 87, post_count 77
Compressing live ranges: from 32 to 23 - 71%, pre_count 16, post_count 15
Compressing live ranges: from 38 to 29 - 76%, pre_count 19, post_count 18
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 125 to 89 - 71%, pre_count 71, post_count 62
Compressing live ranges: fr

[patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-03 Thread Steven Bosscher
On Wed, Oct 3, 2012 at 4:56 PM, Vladimir Makarov  wrote:
> On 12-10-03 3:13 AM, Steven Bosscher wrote:
>>
>> On Tue, Oct 2, 2012 at 3:42 PM, Richard Sandiford
>>  wrote:

 +/* Compress pseudo live ranges by removing program points where
 +   nothing happens.  Complexity of many algorithms in LRA is linear
 +   function of program points number.  To speed up the code we try to
 +   minimize the number of the program points here.  */
 +static void
 +remove_some_program_points_and_update_live_ranges (void)
>>>
>>> Genuine question, but could we do this on the fly instead,
>>> by not incrementing curr_point if the current point had no value?
>>>
>>> I suppose the main complication would be checking cases where
>>> all births are recorded by extending the previous just-closed live
>>> range rather than starting a new one, in which case it's the previous
>>> point that needs to be reused.  Hmm...
>>
>> It does seem to be something worth investigating further. Things like:
>>
>> Compressing live ranges: from 1742579 to 554532 - 31%
>> Compressing live ranges: from 1742569 to 73069 - 4%
>>
>> look extreme, but they're actually the norm. For the same test case
>> (PR54146 still) but looking only at functions with initially 1000-
>> program points, you get this picture:
>>
>> Compressing live ranges: from 1766 to 705 - 39%
>> Compressing live ranges: from 1449 to 370 - 25%
>> Compressing live ranges: from 3939 to 1093 - 27%
>> Compressing live ranges: from 3939 to 1093 - 27%
>> Compressing live ranges: from 3939 to 1093 - 27%
>> Compressing live ranges: from 3939 to 1093 - 27%
>> Compressing live ranges: from 2464 to 676 - 27%
>> Compressing live ranges: from 1433 to 379 - 26%
>> Compressing live ranges: from 1261 to 348 - 27%
>> Compressing live ranges: from 2835 to 755 - 26%
>> Compressing live ranges: from 5426 to 1678 - 30%
>> Compressing live ranges: from 5227 to 1477 - 28%
>> Compressing live ranges: from 1845 to 467 - 25%
>> Compressing live ranges: from 4868 to 1378 - 28%
>> Compressing live ranges: from 4875 to 1388 - 28%
>> Compressing live ranges: from 4882 to 1384 - 28%
>> Compressing live ranges: from 5688 to 1714 - 30%
>> Compressing live ranges: from 4943 to 1310 - 26%
>> Compressing live ranges: from 2976 to 792 - 26%
>> Compressing live ranges: from 5463 to 1526 - 27%
>> Compressing live ranges: from 2854 to 730 - 25%
>> Compressing live ranges: from 1810 to 745 - 41%
>> Compressing live ranges: from 2771 to 904 - 32%
>> Compressing live ranges: from 4916 to 1429 - 29%
>> Compressing live ranges: from 6505 to 2238 - 34%
>> Compressing live ranges: from 6493 to 166 - 2%
>> Compressing live ranges: from 5498 to 1734 - 31%
>> Compressing live ranges: from 1810 to 745 - 41%
>> Compressing live ranges: from 5043 to 1420 - 28%
>> Compressing live ranges: from 3094 to 788 - 25%
>> Compressing live ranges: from 4563 to 1311 - 28%
>> Compressing live ranges: from 4557 to 158 - 3%
>> Compressing live ranges: from 1050 to 274 - 26%
>> Compressing live ranges: from 1602 to 434 - 27%
>> Compressing live ranges: from 2474 to 600 - 24%
>> Compressing live ranges: from 2718 to 866 - 31%
>> Compressing live ranges: from 2097 to 716 - 34%
>> Compressing live ranges: from 4152 to 1099 - 26%
>> Compressing live ranges: from 5065 to 1514 - 29%
>> Compressing live ranges: from 1236 to 359 - 29%
>> Compressing live ranges: from 1722 to 541 - 31%
>> Compressing live ranges: from 5186 to 1401 - 27%
>>
>> Unfortunately the dump doesn't mention how many live ranges could be
>> merged thanks to the compression.
>>
>> It'd also be good to understand why the compression ratios are so
>> small, and consistently around ~30%.
>
> This sentence is not clear to me.  30% means roughly three times fewer
> points, which is pretty good compression.
>
>>   Maybe curr_point includes things
>> it should ignore (DEBUG_INSNs, NOTEs, ...)?
>>
> After the compression, only the points important for conflict info remain,
> i.e. only points where some reg dies or starts living.  Moreover, if
> consecutive points contain only life starts or only deaths, they are
> represented by one point after the compression.

With this patch, much less compression is needed.  Without the patch
the compression ratio is typically around 30%; with the patch it goes
up to ~65%. The worst compression ratios (where worse is
better in this case :-) are:

$ grep Compressing log4 | egrep " [78].%"
Compressing live ranges: from 31 to 22 - 70%, pre_count 17, post_count 15
Compressing live ranges: from 128 to 94 - 73%, pre_count 87, post_count 77
Compressing live ranges: from 32 to 23 - 71%, pre_count 16, post_count 15
Compressing live ranges: from 38 to 29 - 76%, pre_count 19, post_count 18
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 60 to 4

Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-03 Thread Vladimir Makarov

On 12-10-03 3:13 AM, Steven Bosscher wrote:

On Tue, Oct 2, 2012 at 3:42 PM, Richard Sandiford
 wrote:

+/* Compress pseudo live ranges by removing program points where
+   nothing happens.  Complexity of many algorithms in LRA is linear
+   function of program points number.  To speed up the code we try to
+   minimize the number of the program points here.  */
+static void
+remove_some_program_points_and_update_live_ranges (void)

Genuine question, but could we do this on the fly instead,
by not incrementing curr_point if the current point had no value?

I suppose the main complication would be checking cases where
all births are recorded by extending the previous just-closed live
range rather than starting a new one, in which case it's the previous
point that needs to be reused.  Hmm...

It does seem to be something worth investigating further. Things like:

Compressing live ranges: from 1742579 to 554532 - 31%
Compressing live ranges: from 1742569 to 73069 - 4%

look extreme, but they're actually the norm. For the same test case
(PR54146 still) but looking only at functions with initially 1000-
program points, you get this picture:

Compressing live ranges: from 1766 to 705 - 39%
Compressing live ranges: from 1449 to 370 - 25%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 2464 to 676 - 27%
Compressing live ranges: from 1433 to 379 - 26%
Compressing live ranges: from 1261 to 348 - 27%
Compressing live ranges: from 2835 to 755 - 26%
Compressing live ranges: from 5426 to 1678 - 30%
Compressing live ranges: from 5227 to 1477 - 28%
Compressing live ranges: from 1845 to 467 - 25%
Compressing live ranges: from 4868 to 1378 - 28%
Compressing live ranges: from 4875 to 1388 - 28%
Compressing live ranges: from 4882 to 1384 - 28%
Compressing live ranges: from 5688 to 1714 - 30%
Compressing live ranges: from 4943 to 1310 - 26%
Compressing live ranges: from 2976 to 792 - 26%
Compressing live ranges: from 5463 to 1526 - 27%
Compressing live ranges: from 2854 to 730 - 25%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 2771 to 904 - 32%
Compressing live ranges: from 4916 to 1429 - 29%
Compressing live ranges: from 6505 to 2238 - 34%
Compressing live ranges: from 6493 to 166 - 2%
Compressing live ranges: from 5498 to 1734 - 31%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 5043 to 1420 - 28%
Compressing live ranges: from 3094 to 788 - 25%
Compressing live ranges: from 4563 to 1311 - 28%
Compressing live ranges: from 4557 to 158 - 3%
Compressing live ranges: from 1050 to 274 - 26%
Compressing live ranges: from 1602 to 434 - 27%
Compressing live ranges: from 2474 to 600 - 24%
Compressing live ranges: from 2718 to 866 - 31%
Compressing live ranges: from 2097 to 716 - 34%
Compressing live ranges: from 4152 to 1099 - 26%
Compressing live ranges: from 5065 to 1514 - 29%
Compressing live ranges: from 1236 to 359 - 29%
Compressing live ranges: from 1722 to 541 - 31%
Compressing live ranges: from 5186 to 1401 - 27%

Unfortunately the dump doesn't mention how many live ranges could be
merged thanks to the compression.

It'd also be good to understand why the compression ratios are so
small, and consistently around ~30%.
This sentence is not clear to me.  30% means roughly three times fewer
points, which is pretty good compression.

  Maybe curr_point includes things
it should ignore (DEBUG_INSNs, NOTEs, ...)?

After the compression, only the points important for conflict info
remain, i.e. only points where some reg dies or starts living.
Moreover, if consecutive points contain only life starts or only
deaths, they are represented by one point after the compression.
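
As a rough stand-alone illustration of that rule (not the patch's code;
the per-point flag arrays and the function name are made up), the number
of points that survive the compression could be computed like this:

#include <stdbool.h>

/* Count the program points left after compression, given hypothetical
   per-point flags saying whether any live range starts (BORN_P) or
   finishes (DIES_P) at that point.  A point where nothing happens is
   dropped; a run of points with only starts, or only finishes, shares
   one compressed point.  */
static int
count_compressed_points (const bool *born_p, const bool *dies_p, int n)
{
  int p, new_n = 0;
  bool prev_born = false, prev_dies = false;

  for (p = 0; p < n; p++)
    {
      if (!born_p[p] && !dies_p[p])
        continue;               /* Nothing happens here: drop the point.  */
      /* Keep the point unless it repeats the kind of the previous kept
         point (only births after only births, or only deaths after only
         deaths); such runs share a single compressed point.  */
      if (new_n == 0
          || !((born_p[p] && !dies_p[p] && prev_born && !prev_dies)
               || (dies_p[p] && !born_p[p] && prev_dies && !prev_born)))
        new_n++;
      prev_born = born_p[p];
      prev_dies = dies_p[p];
    }
  return new_n;
}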




Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-03 Thread Richard Sandiford
Hi Vlad,

Some comments on lra-spills.c and lra-coalesce.c.

> +   The pass creates necessary stack slots and assign spilled pseudos
> +   to the stack slots in following way:

s/assign/assigns/

> +   (or insn memory constraints) might be not satisfied any more.

s/might be not/might not be/

> +   For some targets, the pass can spill some pseudos into hard
> +   registers of different class (usually into vector registers)
> +   instead of spilling them into memory if it is possible and
> +   profitable.   Spilling GENERAL_REGS pseudo into SSE registers for
> +   modern Intel x86/x86-64 processors is an example of such
> +   optimization.  And this is actually recommended by Intel
> +   optimization guide.

Maybe mention core i7 specifically?  "Modern" is a bit dangerous
in code that'll live a long time.

> +/* The structure describes a stack slot which can be used for several
> +   spilled pseudos.  */
> +struct slot
> +{

Looks like this describes "a register or stack slot" given the hard_regno case.

> +/* Array containing info about the stack slots.   The array element is
> +   indexed by the stack slot number in the range [0..slost_num).  */

Typo: slots_num

> +  /* Each pseudo has an inherent size which comes from its own mode,
> + and a total size which provides room for paradoxical subregs
> + which refer to the pseudo reg in wider modes.
> + 
> + We can use a slot already allocated if it provides both enough
> + inherent space and enough total space.  Otherwise, we allocate a
> + new slot, making sure that it has no less inherent space, and no
> + less total space, then the previous slot.   */

The second part of the comment seems a bit misplaced, since the following
code doesn't reuse stack slots.  This is done elsewhere instead.
Maybe the first part would be better above the inherent_size assignment.
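
For what it's worth, the reuse test the first part describes amounts to
something like this (illustrative names only, not the patch's code):

/* Whether an already-allocated slot with sizes SLOT_INHERENT and
   SLOT_TOTAL can also hold a pseudo that needs INHERENT_SIZE and
   TOTAL_SIZE bytes.  */
static bool
slot_fits_p (unsigned int slot_inherent, unsigned int slot_total,
             unsigned int inherent_size, unsigned int total_size)
{
  return inherent_size <= slot_inherent && total_size <= slot_total;
}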

> +  /* If we have any adjustment to make, or if the stack slot is the
> + wrong mode, make a new stack slot.   */
> +  x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust);

We don't make a new slot here.

> +/* Sort pseudos according their slot numbers putting ones with smaller
> +   numbers first, or last when the frame pointer is not needed.   So
> +   pseudos with the first slot will be finally addressed with smaller
> +   address displacement.  */
> +static int
> +pseudo_reg_slot_compare (const void *v1p, const void *v2p)
> +{
> +  const int regno1 = *(const int *) v1p;
> +  const int regno2 = *(const int *) v2p;
> +  int diff, slot_num1, slot_num2;
> +  int total_size1, total_size2;
> +
> +  slot_num1 = pseudo_slots[regno1].slot_num;
> +  slot_num2 = pseudo_slots[regno2].slot_num;
> +  if ((diff = slot_num1 - slot_num2) != 0)
> +return (frame_pointer_needed
> + || !FRAME_GROWS_DOWNWARD == STACK_GROWS_DOWNWARD ? diff : -diff);

The comment doesn't quite describe the condition.  Maybe:

/* Sort pseudos according to their slots, putting the slots in the order
   that they should be allocated.  Slots with lower numbers have the highest
   priority and should get the smallest displacement from the stack or
   frame pointer (whichever is being used).

   The first allocated slot is always closest to the frame pointer,
   so prefer lower slot numbers when frame_pointer_needed.  If the stack
   and frame grow in the same direction, then the first allocated slot is
   always closest to the initial stack pointer and furthest away from the
   final stack pointer, so allocate higher numbers first when using the
   stack pointer in that case.  The reverse is true if the stack and
   frame grow in opposite directions.  */

> +  total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1),
> +  GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode));
> +  total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2),
> +  GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode));
> +  if ((diff = total_size2 - total_size1) != 0)
> +return diff;

I think this could do with a bit more commentary.  When is biggest_mode
ever smaller than PSEUDO_REGNO_BYTES?  Is that for pseudos that are only
ever referenced as lowpart subregs?  If so, why does PSEUDO_REGNO_BYTES
matter for those registers here but not when calculating biggest_mode?

> +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
> +   pseudos which did not get a spill hard register at the beginning of
> +   array PSEUDO_REGNOS.   Return the number of such pseudos.  */

It'd be worth saying that PSEUDO_REGNOS is sorted in order of highest
frequency first.

> +  bitmap set_jump_crosses = regstat_get_setjmp_crosses ();

I notice you use "set_jump" here and "setjump" in parts of 7a.patch.
Probably better to use setjmp across the board.

> +  /* Hard registers which can not be used for any purpose at given
> + program point because they are unallocatable or already allocated
> + for other pseudos.   */ 
> +  HARD_REG_SET *reserved_hard_regs;
> +
> +  if

Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-03 Thread Steven Bosscher
On Tue, Oct 2, 2012 at 3:42 PM, Richard Sandiford
 wrote:
>
>> +/* Compress pseudo live ranges by removing program points where
>> +   nothing happens.  Complexity of many algorithms in LRA is linear
>> +   function of program points number.  To speed up the code we try to
>> +   minimize the number of the program points here.  */
>> +static void
>> +remove_some_program_points_and_update_live_ranges (void)
>
> Genuine question, but could we do this on the fly instead,
> by not incrementing curr_point if the current point had no value?
>
> I suppose the main complication would be checking cases where
> all births are recorded by extending the previous just-closed live
> range rather than starting a new one, in which case it's the previous
> point that needs to be reused.  Hmm...

It does seem to be something worth investigating further. Things like:

Compressing live ranges: from 1742579 to 554532 - 31%
Compressing live ranges: from 1742569 to 73069 - 4%

look extreme, but they're actually the norm. For the same test case
(PR54146 still) but looking only at functions with initially 1000-
program points, you get this picture:

Compressing live ranges: from 1766 to 705 - 39%
Compressing live ranges: from 1449 to 370 - 25%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 2464 to 676 - 27%
Compressing live ranges: from 1433 to 379 - 26%
Compressing live ranges: from 1261 to 348 - 27%
Compressing live ranges: from 2835 to 755 - 26%
Compressing live ranges: from 5426 to 1678 - 30%
Compressing live ranges: from 5227 to 1477 - 28%
Compressing live ranges: from 1845 to 467 - 25%
Compressing live ranges: from 4868 to 1378 - 28%
Compressing live ranges: from 4875 to 1388 - 28%
Compressing live ranges: from 4882 to 1384 - 28%
Compressing live ranges: from 5688 to 1714 - 30%
Compressing live ranges: from 4943 to 1310 - 26%
Compressing live ranges: from 2976 to 792 - 26%
Compressing live ranges: from 5463 to 1526 - 27%
Compressing live ranges: from 2854 to 730 - 25%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 2771 to 904 - 32%
Compressing live ranges: from 4916 to 1429 - 29%
Compressing live ranges: from 6505 to 2238 - 34%
Compressing live ranges: from 6493 to 166 - 2%
Compressing live ranges: from 5498 to 1734 - 31%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 5043 to 1420 - 28%
Compressing live ranges: from 3094 to 788 - 25%
Compressing live ranges: from 4563 to 1311 - 28%
Compressing live ranges: from 4557 to 158 - 3%
Compressing live ranges: from 1050 to 274 - 26%
Compressing live ranges: from 1602 to 434 - 27%
Compressing live ranges: from 2474 to 600 - 24%
Compressing live ranges: from 2718 to 866 - 31%
Compressing live ranges: from 2097 to 716 - 34%
Compressing live ranges: from 4152 to 1099 - 26%
Compressing live ranges: from 5065 to 1514 - 29%
Compressing live ranges: from 1236 to 359 - 29%
Compressing live ranges: from 1722 to 541 - 31%
Compressing live ranges: from 5186 to 1401 - 27%

Unfortunately the dump doesn't mention how many live ranges could be
merged thanks to the compression.

It'd also be good to understand why the compression ratios are so
small, and consistently around ~30%. Maybe curr_point includes things
it should ignore (DEBUG_INSNs, NOTEs, ...)?
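
If so, a quick sanity check would be something along these lines (a
sketch only: FOR_BB_INSNS and NONDEBUG_INSN_P are GCC's usual RTL
macros, but the counting function itself is made up), comparing how
many points a block really needs with what curr_point ends up at:

static int
count_needed_points (basic_block bb)
{
  rtx insn;
  int points = 0;

  /* Only real (non-debug) instructions should get a program point;
     NOTEs and DEBUG_INSNs are skipped.  */
  FOR_BB_INSNS (bb, insn)
    if (NONDEBUG_INSN_P (insn))
      points++;
  return points;
}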

Ciao!
Steven


Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-02 Thread Vladimir Makarov

On 10/02/2012 06:42 PM, Bernd Schmidt wrote:

On 10/03/2012 12:29 AM, Vladimir Makarov wrote:

On 12-10-02 7:30 AM, Bernd Schmidt wrote:

On 09/28/2012 12:59 AM, Vladimir Makarov wrote:

+   We keep RTL code at most time in such state that the virtual
+   registers can be changed by just the corresponding hard registers
+   (with zero offsets) and we have the right RTL code.  To achieve this
+   we should add initial offset at the beginning of LRA work and update
+   offsets after each stack expanding.  But actually we update virtual
+   registers to the same virtual registers + corresponding offsets
+   before every constraint pass because it affects constraint
+   satisfaction (e.g. an address displacement became too big for some
+   target).
+
+   The final change of virtual registers to the corresponding hard
+   registers are done at the very end of LRA when there were no change
+   in offsets anymore:
+
+ fp + 42 => sp + 42

Let me try to understand this.  We have (mem (fp)), which we rewrite to
(mem (fp + 42)), but this is intended to represent (mem (sp + 42))?

Wouldn't this fail on any target which has different addressing ranges
for SP and FP?



Yes, I think so.  It is not a problem for the 9 current targets.  But when I
or somebody else starts porting LRA to such a target, we can introduce new
virtual registers (and switch to them at the beginning of LRA) to distinguish
the situation.  It will not be a big deal.

I'd appreciate it if you could tell me about such a target, so that I know in
advance what to do first when I start working on it.

C6X is such a target. The stack pointer allows a 15 bit unsigned offset,
while all normal registers only allow more limited offsets.

I think the above approach sounds quite hackish and I'd prefer to see
this reworked before merging. Can you retain the current mechanism of
calling eliminate_regs before examining an insn, or is this not workable
for some reason?


I don't think it is worth reworking.  That is the best approach I found
(and I tried many).  Besides, it is useful to see the current offsets
during debugging just by looking at the RTL.  But most important is that
it really saves a lot of code and simplifies the implementation, making
the code much clearer.


So that is exactly what I would like to keep in this implementation.  It
also follows the major design goal of reflecting changes in the RTL as
much as possible.
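
For instance (purely illustrative RTL, assuming the frame pointer
eliminates to the stack pointer and the current elimination offset
happens to be 42), an access with an original displacement of 8 would
read:

  original insn:                  (mem (plus (reg fp) (const_int 8)))
  before a constraint pass
  (current offset folded in):     (mem (plus (reg fp) (const_int 50)))
  at the very end of LRA:         (mem (plus (reg sp) (const_int 50)))

so each constraint pass already sees the displacement it will
eventually have to satisfy.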


Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-02 Thread Bernd Schmidt
On 10/03/2012 12:29 AM, Vladimir Makarov wrote:
> On 12-10-02 7:30 AM, Bernd Schmidt wrote:
>> On 09/28/2012 12:59 AM, Vladimir Makarov wrote:
>>> +   We keep RTL code at most time in such state that the virtual
>>> +   registers can be changed by just the corresponding hard registers
>>> +   (with zero offsets) and we have the right RTL code.  To achieve this
>>> +   we should add initial offset at the beginning of LRA work and update
>>> +   offsets after each stack expanding.  But actually we update virtual
>>> +   registers to the same virtual registers + corresponding offsets
>>> +   before every constraint pass because it affects constraint
>>> +   satisfaction (e.g. an address displacement became too big for some
>>> +   target).
>>> +
>>> +   The final change of virtual registers to the corresponding hard
>>> +   registers are done at the very end of LRA when there were no change
>>> +   in offsets anymore:
>>> +
>>> + fp + 42 => sp + 42
>> Let me try to understand this.  We have (mem (fp)), which we rewrite to
>> (mem (fp + 42)), but this is intended to represent (mem (sp + 42))?
>>
>> Wouldn't this fail on any target which has different addressing ranges
>> for SP and FP?
>>
>>
> Yes, I think so.  It is not a problem for the 9 current targets.  But when I
> or somebody else starts porting LRA to such a target, we can introduce new
> virtual registers (and switch to them at the beginning of LRA) to distinguish
> the situation.  It will not be a big deal.
>
> I'd appreciate it if you could tell me about such a target, so that I know in
> advance what to do first when I start working on it.

C6X is such a target. The stack pointer allows a 15 bit unsigned offset,
while all normal registers only allow more limited offsets.

I think the above approach sounds quite hackish and I'd prefer to see
this reworked before merging. Can you retain the current mechanism of
calling eliminate_regs before examining an insn, or is this not workable
for some reason?


Bernd



Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-02 Thread Vladimir Makarov

On 12-10-02 7:30 AM, Bernd Schmidt wrote:

On 09/28/2012 12:59 AM, Vladimir Makarov wrote:

+   We keep RTL code at most time in such state that the virtual
+   registers can be changed by just the corresponding hard registers
+   (with zero offsets) and we have the right RTL code. To achieve this
+   we should add initial offset at the beginning of LRA work and update
+   offsets after each stack expanding. But actually we update virtual
+   registers to the same virtual registers + corresponding offsets
+   before every constraint pass because it affects constraint
+   satisfaction (e.g. an address displacement became too big for some
+   target).
+
+   The final change of virtual registers to the corresponding hard
+   registers are done at the very end of LRA when there were no change
+   in offsets anymore:
+
+ fp + 42 => sp + 42

Let me try to understand this.  We have (mem (fp)), which we rewrite to
(mem (fp + 42)), but this is intended to represent (mem (sp + 42))?

Wouldn't this fail on any target which has different addressing ranges
for SP and FP?


Yes, I think so.  It is not a problem for the 9 current targets.  But when I
or somebody else starts porting LRA to such a target, we can introduce new
virtual registers (and switch to them at the beginning of LRA) to distinguish
the situation.  It will not be a big deal.


I'd appreciate it if you could tell me about such a target, so that I know in
advance what to do first when I start working on it.





Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-02 Thread Richard Sandiford
Richard Sandiford  writes:
>> +/* Merge ranges R1 and R2 and returns the result.  The function
>> +   maintains the order of ranges and tries to minimize size of the
>> +   result range list.  */
>> +lra_live_range_t 
>> +lra_merge_live_ranges (lra_live_range_t r1, lra_live_range_t r2)
>> +{
>> +  lra_live_range_t first, last, temp;
>> +
>> +  if (r1 == NULL)
>> +return r2;
>> +  if (r2 == NULL)
>> +return r1;
>> +  for (first = last = NULL; r1 != NULL && r2 != NULL;)
>> +{
>> +  if (r1->start < r2->start)
>> +{
>> +  temp = r1;
>> +  r1 = r2;
>> +  r2 = temp;
>> +}
>> +  if (r1->start <= r2->finish + 1)
>> +{
>> +  /* Intersected ranges: merge r1 and r2 into r1.  */
>> +  r1->start = r2->start;
>> +  if (r1->finish < r2->finish)
>> +r1->finish = r2->finish;
>> +  temp = r2;
>> +  r2 = r2->next;
>> +  pool_free (live_range_pool, temp);
>> +  if (r2 == NULL)
>> +{
>> +  /* To try to merge with subsequent ranges in r1.  */
>> +  r2 = r1->next;
>> +  r1->next = NULL;
>> +}
>> +}
>> +  else
>> +{
>> +  /* Add r1 to the result.  */
>> +  if (first == NULL)
>> +first = last = r1;
>> +  else
>> +{
>> +  last->next = r1;
>> +  last = r1;
>> +}
>> +  r1 = r1->next;
>> +  if (r1 == NULL)
>> +{
>> +  /* To try to merge with subsequent ranges in r2.  */
>> +  r1 = r2->next;
>> +  r2->next = NULL;
>> +}
>> +}
>
> I might be misreading, but I'm not sure whether this handles merges like:
>
>   r1 = [6,7], [3,4]
>   r2 = [3,8], [0,1]
>
> After the first iteration, it looks like we'll have:
>
>   r1 = [3,8], [3,4]
>   r2 = [0,1]
>
> Then we'll add both [3,8] and [3,4] to the result.

OK, so I start to read patch b and realise that this is only supposed to
handle non-overlapping live ranges.  It might be worth having a comment
and assert to that effect, for slow readers like me.

Although in that case the function feels a little more complicated than
it needs to be.  When we run out of R1 or R2, why not just use the other
one as the rest of the live range list?  Why is:

>> +  if (r1 == NULL)
>> +{
>> +  /* To try to merge with subsequent ranges in r2.  */
>> +  r1 = r2->next;
>> +  r2->next = NULL;
>> +}

needed?
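
In other words, for ordered, non-overlapping inputs something like the
following sketch would seem to be enough (illustrative only, not the
patch's code; unlike the patch it does not coalesce adjacent ranges,
it only interleaves the two lists):

static lra_live_range_t
merge_disjoint_ranges (lra_live_range_t r1, lra_live_range_t r2)
{
  lra_live_range_t first = NULL, *chain = &first, temp;

  while (r1 != NULL && r2 != NULL)
    {
      if (r1->start < r2->start)
        {
          /* Make R1 the list whose head starts later; ranges are kept
             in decreasing order of START.  */
          temp = r1;
          r1 = r2;
          r2 = temp;
        }
      *chain = r1;
      chain = &r1->next;
      r1 = r1->next;
    }
  /* The remaining list is already ordered and disjoint from the result,
     so use it as the tail instead of walking it element by element.  */
  *chain = (r1 != NULL ? r1 : r2);
  return first;
}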

Richard


Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-02 Thread Richard Sandiford
Vladimir Makarov  writes:
> This is the major patch containing all new files.  The patch also adds 
> necessary calls to LRA from IRA.  As the patch is too big, it continues in 
> the next email.
>
> 2012-09-27  Vladimir Makarov  
>
>  * Makefile.in (LRA_INT_H): New.
>  (OBJS): Add lra.o, lra-assigns.o, lra-coalesce.o,
>  lra-constraints.o, lra-eliminations.o, lra-lives.o, and lra-spills.o.
>  (ira.o): Add dependence on lra.h.
>  (lra.o, lra-assigns.o, lra-coalesce.o, lra-constraints.o): New entries.
>  (lra-eliminations.o, lra-lives.o, lra-spills.o): Ditto.
>  * ira.c: Include lra.h.
>  (ira_init_once, ira_init, ira_finish_once): Call lra_start_once,
>  lra_init, lra_finish_once in anyway.
>  (lra_in_progress): Remove.
>  (do_reload): Call LRA.
>  * lra.h: New.
>  * lra-int.h: Ditto.
>  * lra.c: Ditto.
>  * lra-assigns.c: Ditto.
>  * lra-constraints.c: Ditto.
>  * lra-coalesce.c: Ditto.
>  * lra-eliminations.c: Ditto.
>  * lra-lives.c: Ditto.
>  * lra-spills.c: Ditto.
>  * doc/passes.texi: Describe LRA pass.

Comments on lra-lives.c.  (Sorry for the split, had more time to look
at this than expected)

> +/* Copy live range list given by its head R and return the result.  */
> +lra_live_range_t
> +lra_copy_live_range_list (lra_live_range_t r)
> +{
> +  lra_live_range_t p, first, last;
> +
> +  if (r == NULL)
> +return NULL;
> +  for (first = last = NULL; r != NULL; r = r->next)
> +{
> +  p = copy_live_range (r);
> +  if (first == NULL)
> + first = p;
> +  else
> + last->next = p;
> +  last = p;
> +}
> +  return first;
> +}

Maybe simpler as:

  lra_live_range_t p, first, *chain;

  first = NULL;
  for (chain = &first; r != NULL; r = r->next)
{
  p = copy_live_range (r);
  *chain = p;
  chain = &p->next;
}
  return first;

> +/* Merge ranges R1 and R2 and returns the result.  The function
> +   maintains the order of ranges and tries to minimize size of the
> +   result range list.  */
> +lra_live_range_t 
> +lra_merge_live_ranges (lra_live_range_t r1, lra_live_range_t r2)
> +{
> +  lra_live_range_t first, last, temp;
> +
> +  if (r1 == NULL)
> +return r2;
> +  if (r2 == NULL)
> +return r1;
> +  for (first = last = NULL; r1 != NULL && r2 != NULL;)
> +{
> +  if (r1->start < r2->start)
> + {
> +   temp = r1;
> +   r1 = r2;
> +   r2 = temp;
> + }
> +  if (r1->start <= r2->finish + 1)
> + {
> +   /* Intersected ranges: merge r1 and r2 into r1.  */
> +   r1->start = r2->start;
> +   if (r1->finish < r2->finish)
> + r1->finish = r2->finish;
> +   temp = r2;
> +   r2 = r2->next;
> +   pool_free (live_range_pool, temp);
> +   if (r2 == NULL)
> + {
> +   /* To try to merge with subsequent ranges in r1.  */
> +   r2 = r1->next;
> +   r1->next = NULL;
> + }
> + }
> +  else
> + {
> +   /* Add r1 to the result.  */
> +   if (first == NULL)
> + first = last = r1;
> +   else
> + {
> +   last->next = r1;
> +   last = r1;
> + }
> +   r1 = r1->next;
> +   if (r1 == NULL)
> + {
> +   /* To try to merge with subsequent ranges in r2.  */
> +   r1 = r2->next;
> +   r2->next = NULL;
> + }
> + }

I might be misreading, but I'm not sure whether this handles merges like:

  r1 = [6,7], [3,4]
  r2 = [3,8], [0,1]

After the first iteration, it looks like we'll have:

  r1 = [3,8], [3,4]
  r2 = [0,1]

Then we'll add both [3,8] and [3,4] to the result.

Same chain pointer comment as for lra_copy_live_range_list.

> +/* Return TRUE if live range R1 is in R2.  */
> +bool
> +lra_live_range_in_p (lra_live_range_t r1, lra_live_range_t r2)
> +{
> +  /* Remember the live ranges are always kept ordered.   */
> +  while (r1 != NULL && r2 != NULL)
> +{
> +  /* R1's element is in R2's element.  */
> +  if (r2->start <= r1->start && r1->finish <= r2->finish)
> + r1 = r1->next;
> +  /* Intersection: R1's start is in R2.  */
> +  else if (r2->start <= r1->start && r1->start <= r2->finish)
> + return false;
> +  /* Intersection: R1's finish is in R2.  */
> +  else if (r2->start <= r1->finish && r1->finish <= r2->finish)
> + return false;
> +  else if (r1->start > r2->finish)
> + return false; /* No covering R2's element for R1's one.  */
> +  else
> + r2 = r2->next;
> +}
> +  return r1 == NULL;

Does the inner bit reduce to:

  /* R1's element is in R2's element.  */
  if (r2->start <= r1->start && r1->finish <= r2->finish)
r1 = r1->next;
  /* All of R2's element comes after R1's element.  */
  else if (r2->start > r1->finish)
r2 = r2->next;
  else
return false;

(Genuine question)

> +/* Process the death of hard register REGNO.  This updates
> +   hard_regs_live and START_DYI

Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-02 Thread Bernd Schmidt
On 09/28/2012 12:59 AM, Vladimir Makarov wrote:
> +   We keep RTL code at most time in such state that the virtual
> +   registers can be changed by just the corresponding hard registers
> +   (with zero offsets) and we have the right RTL code.   To achieve this
> +   we should add initial offset at the beginning of LRA work and update
> +   offsets after each stack expanding.   But actually we update virtual
> +   registers to the same virtual registers + corresponding offsets
> +   before every constraint pass because it affects constraint
> +   satisfaction (e.g. an address displacement became too big for some
> +   target).
> +
> +   The final change of virtual registers to the corresponding hard
> +   registers are done at the very end of LRA when there were no change
> +   in offsets anymore:
> +
> +  fp + 42 => sp + 42

Let me try to understand this.  We have (mem (fp)), which we rewrite to
(mem (fp + 42)), but this is intended to represent (mem (sp + 42))?

Wouldn't this fail on any target which has different addressing ranges
for SP and FP?


Bernd


Re: RFC: LRA for x86/x86-64 [7/9]

2012-10-02 Thread Richard Sandiford
Hi Vlad,

Vladimir Makarov  writes:
> This is the major patch containing all new files.  The patch also adds 
> necessary calls to LRA from IRA.  As the patch is too big, it continues in 
> the next email.
>
> 2012-09-27  Vladimir Makarov  
>
>  * Makefile.in (LRA_INT_H): New.
>  (OBJS): Add lra.o, lra-assigns.o, lra-coalesce.o,
>  lra-constraints.o, lra-eliminations.o, lra-lives.o, and lra-spills.o.
>  (ira.o): Add dependence on lra.h.
>  (lra.o, lra-assigns.o, lra-coalesce.o, lra-constraints.o): New entries.
>  (lra-eliminations.o, lra-lives.o, lra-spills.o): Ditto.
>  * ira.c: Include lra.h.
>  (ira_init_once, ira_init, ira_finish_once): Call lra_start_once,
>  lra_init, lra_finish_once in anyway.
>  (lra_in_progress): Remove.
>  (do_reload): Call LRA.
>  * lra.h: New.
>  * lra-int.h: Ditto.
>  * lra.c: Ditto.
>  * lra-assigns.c: Ditto.
>  * lra-constraints.c: Ditto.
>  * lra-coalesce.c: Ditto.
>  * lra-eliminations.c: Ditto.
>  * lra-lives.c: Ditto.
>  * lra-spills.c: Ditto.
>  * doc/passes.texi: Describe LRA pass.

A non-authoritative review of the documentation and lra-eliminations.c:

> +LRA is different from the reload pass in LRA division on small,
> +manageable, and separated sub-tasks.  All LRA transformations and
> +decisions are reflected in RTL as more as possible.  Instruction
> +constraints as a primary source of the info and that minimizes number
> +of target-depended macros/hooks.
>
> +LRA is run for the targets it were ported.

Suggest something like:

  Unlike the reload pass, intermediate LRA decisions are reflected in
  RTL as much as possible.  This reduces the number of target-dependent
  macros and hooks, leaving instruction constraints as the primary
  source of control.

  LRA is run on targets for which TARGET_LRA_P returns true.

> +/* The virtual registers (like argument and frame pointer) are widely
> +   used in RTL.   Virtual registers should be changed by real hard
> +   registers (like stack pointer or hard frame pointer) plus some
> +   offset.  The offsets are changed usually every time when stack is
> +   expanded.  We know the final offsets only at the very end of LRA.

I always think of "virtual" as [FIRST_VIRTUAL_REGISTER, LAST_VIRTUAL_REGISTER].
Maybe "eliminable" would be better?  E.g.

/* Eliminable registers (like a soft argument or frame pointer) are widely
   used in RTL.  These eliminable registers should be replaced by real hard
   registers (like the stack pointer or hard frame pointer) plus some offset.
   The offsets usually change whenever the stack is expanded.  We know the
   final offsets only at the very end of LRA.

> +   We keep RTL code at most time in such state that the virtual
> +   registers can be changed by just the corresponding hard registers
> +   (with zero offsets) and we have the right RTL code.   To achieve this
> +   we should add initial offset at the beginning of LRA work and update
> +   offsets after each stack expanding.   But actually we update virtual
> +   registers to the same virtual registers + corresponding offsets
> +   before every constraint pass because it affects constraint
> +   satisfaction (e.g. an address displacement became too big for some
> +   target).

Suggest:

   Within LRA, we usually keep the RTL in such a state that the eliminable
   registers can be replaced by just the corresponding hard register
   (without any offset).  To achieve this we should add the initial
   elimination offset at the beginning of LRA and update the offsets
   whenever the stack is expanded.  We need to do this before every
   constraint pass because the choice of offset often affects whether
   a particular address or memory constraint is satisfied.

> +   The final change of virtual registers to the corresponding hard
> +   registers are done at the very end of LRA when there were no change
> +   in offsets anymore:
> +
> +  fp + 42 => sp + 42

virtual=>eliminable if the above is OK.

> +   Such approach requires a few changes in the rest GCC code because
> +   virtual registers are not recognized as real ones in some
> +   constraints and predicates.   Fortunately, such changes are
> +   small.  */

Not sure whether the last paragraph really belongs in the code,
since it's more about the reload->LRA transition.

> +  /* Nonzero if this elimination can be done.  */
> +  bool can_eliminate;
> +  /* CAN_ELIMINATE since the last check.  */
> +  bool prev_can_eliminate;

AFAICT, these two fields are (now) only ever assigned at the same time,
via init_elim_table and setup_can_eliminate.  Looks like we can do
without prev_can_eliminate.  (And the way that the pass doesn't
need to differentiate between the raw CAN_ELIMINATE value and
the processed value feels nice and reassuring.)

> +/* Map: 'from regno' -> to the current elimination, NULL otherwise.
> +   The elimination table may contains