Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-18 Thread Vladimir Makarov

On 12-10-17 7:24 AM, Richard Sandiford wrote:

Thanks for all the updates.

Vladimir Makarov vmaka...@redhat.com writes:

+ /* index * scale + disp => new base + index * scale  */
+ enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
+
+ lra_assert (INDEX_REG_CLASS != NO_REGS);
+ new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
+ lra_assert (GET_CODE (*addr_loc) == PLUS);
+ lra_emit_move (new_reg, *ad.disp_loc);
+ if (CONSTANT_P (XEXP (*addr_loc, 1)))
+   XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
+ XEXP (*addr_loc, 0) = new_reg;

The canonical form is (plus (mult ...) (reg)) rather than
(plus (reg) (mult ...)), but it looks like we create the latter.
I realise you try both forms here:

It might happen because of equiv substitution in LRA.

+ /* Some targets like ARM, accept address operands in
+specific order -- try exchange them if necessary.  */
+ if (! valid_address_p (mode, *addr_loc, as))
+   {
+ exchange_plus_ops (*addr_loc);
+ if (! valid_address_p (mode, *addr_loc, as))
+   exchange_plus_ops (*addr_loc);
+   }

but I think we should try the canonical form first.  And I'd prefer it
if we didn't try the other form at all, especially in 4.8.  It isn't
really the backend's job to reject non-canonical rtl.  This might well
be another case where some targets need a (hopefully small) tweak in
order to play by the rules.

Also, I suppose this section of code feeds back to my question on
Wednesday about the distinction that LRA seems to make between the
compile-time constant in:

(plus (reg X1) (const_int Y1))

and the link-time constant in:

(plus (reg X2) (symbol_ref Y2))

It looked like extract_address_regs classified X1 as a base register and
X2 as an index register.  The difference between the two constants has
no run-time significance though, and I think we should handle both X1
and X2 as base registers (as I think reload does).
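The classification being argued for can be sketched standalone (not GCC code; the rtx codes and the INDEX fallback below are illustrative assumptions):

```c
#include <assert.h>

/* In (plus (reg X) Y), X should count as a base register whenever Y
   is any link-time constant -- const_int and symbol_ref alike --
   since the distinction has no run-time significance.  */

enum rtx_kind { CONST_INT, SYMBOL_REF, MULT, REG };
enum reg_role { BASE_REG, INDEX_REG };

static enum reg_role
classify_plus_reg (enum rtx_kind partner)
{
  /* A register added to a compile-time or link-time constant is a
     base register (what reload does); treating the symbol_ref case
     differently is what the review objects to.  */
  if (partner == CONST_INT || partner == SYMBOL_REF)
    return BASE_REG;
  /* Paired with a scaled index (MULT), the plain register is still
     the base; paired with another plain register, other rules pick
     which is which -- call it an index here for illustration.  */
  return partner == MULT ? BASE_REG : INDEX_REG;
}
```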

I think the path above would then be specific to scaled indices.
In the original address the complex index must come first and the
displacement second.  In the modified address, the index would stay
first and the new base register would be second.  More below.

As I wrote above, the problem is also that equiv substitution can
create non-canonical forms.

Right.  Just in case there's a misunderstanding: I'm not complaining
about these routines internally using forms that are noncanonical
(which could happen because of equiv substitution, like you say).
I just think that what we eventually try to validate should be canonical.
In a way it's similar to how the simplify-rtx.c routines work.

If there are targets that only accept noncanonical rtl (which is after
all just a specific type of invalid rtl), they need to be fixed.
Agree.  In order not to forget to fix targets I am removing operand 
exchange.

+  /* base + scale * index + disp => new base + scale * index  */
+  new_reg = base_plus_disp_to_reg (mode, as, ad);
+  *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
+  if (! valid_address_p (mode, *addr_loc, as))
+   {
+ /* Some targets like ARM, accept address operands in
+specific order -- try exchange them if necessary.  */
+ exchange_plus_ops (*addr_loc);
+ if (! valid_address_p (mode, *addr_loc, as))
+   exchange_plus_ops (*addr_loc);
+   }

Same comment as above about canonical rtl.  Here we can have two
registers -- in which case the base should come first -- or a more
complex index -- in which case the index should come first.

We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
with the operands in either order, and let it take care of the details.
Using simplify_gen_binary would help with the earlier index+disp case too.
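The ordering that simplify_gen_binary enforces can be sketched standalone (the kinds and the numeric ranks below are illustrative assumptions, not GCC's actual commutative_operand_precedence values):

```c
#include <assert.h>

/* Standalone sketch (not GCC code): canonical operand ordering for a
   commutative PLUS.  GCC's swap_commutative_operands_p compares
   commutative_operand_precedence values; the net effect is that a
   complex term (e.g. a MULT scaled index) comes first, a plain
   register second, and a constant last.  */

enum kind { CONSTANT, REGISTER, COMPLEX };

static int
precedence (enum kind k)
{
  switch (k)
    {
    case CONSTANT: return -10;  /* constants sort last */
    case REGISTER: return -1;   /* plain objects in the middle */
    case COMPLEX:  return 4;    /* MULT etc. sort first */
    }
  return 0;
}

/* Return nonzero if operands X and Y of a commutative operation
   should be swapped to reach the canonical order.  */
static int
should_swap (enum kind x, enum kind y)
{
  return precedence (x) < precedence (y);
}
```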

Equiv substitution can create non-canonical forms.  There are 2 approaches:
o have code for dealing with non-canonical forms (equiv substitution,
target stupidity)
o always support canonical forms and require them from targets.

I decided to use the 1st variant but I am reconsidering it.  I'll try to
fix it before inclusion.  But I am not sure I have time for this.  All
these changes make LRA unstable.  In fact, I've just found that changes
I already made so far resulted in 2 broken SPEC2000 tests although the GCC
testsuite and bootstrap look good.

OK.  I'm happy to try fixing the noncanonical thing.


+  /* If this is post-increment, first copy the location to the reload reg.  */
+  if (post && real_in != result)
+emit_insn (gen_move_insn (result, real_in));

Nit, but real_in != result can never be true AIUI, and I was confused how
the code could be correct in that case.  Maybe just remove it, or make
it an assert?

No, it might be true:

real_in = in == value ? incloc : in;
...
if (cond)
result = incloc;
else
result = ...

if (post && real_in != result)

So it is true if in == value && cond

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-18 Thread Vladimir Makarov

On 12-10-15 8:06 AM, Richard Sandiford wrote:

Vladimir Makarov vmaka...@redhat.com writes:

if that's accurate.  I dropped the term reload pseudo because of
the general comment in my earlier reply about the use of reload pseudo
when the code seems to include inheritance and split pseudos too.

There is no inheritance and splitting yet.  It is done after the
constraint pass.
So at this stage >= new_regno_start means reload pseudo.

Ah, OK.


That's a change in the meaning of NEW_CLASS, but seems easier for
callers to handle.  I think all it requires is changing:


+  common_class = ira_reg_class_subset[rclass][cl];
+  if (new_class != NULL)
+   *new_class = common_class;

to:

common_class = ira_reg_class_subset[rclass][cl];
if (new_class != NULL && rclass != common_class)
*new_class = common_class;

This change results in infinite LRA looping on a first libgcc file
compilation.  Unfortunately I have no time to investigate it.
I'd like to say that most of this code is very sensitive to
changes.  I see it a lot.  You change something looking obvious and a
target is broken.
I am going to investigate it when I have more time.

Thanks.


+default:
+  {
+   const char *fmt = GET_RTX_FORMAT (code);
+   int i;
+
+   if (GET_RTX_LENGTH (code) != 1
+   || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
+ {
+   for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+ if (fmt[i] == 'e')
+   extract_loc_address_regs (false, mode, as, XEXP (x, i),
+ context_p, code, SCRATCH,
+ modify_p, ad);
+   break;
+ }
+   /* fall through for case UNARY_OP (UNSPEC ...)  */
+  }
+
+case UNSPEC:
+  if (ad->disp_loc == NULL)
+   ad->disp_loc = loc;
+  else if (ad->base_reg_loc == NULL)
+   {
+ ad->base_reg_loc = loc;
+ ad->base_outer_code = outer_code;
+ ad->index_code = index_code;
+ ad->base_modify_p = modify_p;
+   }
+  else
+   {
+ lra_assert (ad->index_reg_loc == NULL);
+ ad->index_reg_loc = loc;
+   }
+  break;
+
+}

Which targets use a bare UNSPEC as a displacement?  I thought a
displacement had to be a link-time constant, in which case it should
satisfy CONSTANT_P.  For UNSPECs, that means wrapping it in a CONST.

I saw it somewhere.  I guess IA64.

I'm just a bit worried that the UNSPEC handling is sensitive to the
order that subrtxes are processed (unlike PLUS, which goes to some
trouble to work out what's what).  It could be especially confusing
because the default case processes operands in reverse order while
PLUS processes them in forward order.

Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough?
Probably deserves a comment.

I don't remember.  To figure out, I should switch it off and try all
targets supported by LRA.

AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just
recording where reloads of a particular class need to go (obviously
in the case of disp_loc, which isn't reloaded at all).  The fields
have semantic value too.  I.e. we use them to work out the value
of at least part of the address.

In that case it seems dangerous to look through general rtxes
in the way that the default case above does.  Maybe just making
sure that DISP_LOC is involved in a sum with the base would be
enough, but another idea was:


I know of three ways of mutating (for want of a better word)
an address:

1. (and X (const_int X)), to align
2. a subreg
3. a unary operator (such as truncation or extension)

So maybe we could:

a. remove outer mutations (using a helper function)
b. handle LO_SUM, PRE_*, POST_*: as now
c. otherwise treat the address as the sum of one, two or three pieces.
   c1. Peel mutations of all pieces.
   c2. Classify the pieces into base, index and displacement.
   This would be similar to the jousting code above, but hopefully
   easier because all three rtxes are to hand.  E.g. we could
   do the base vs. index thing in a similar way to
   commutative_operand_precedence.
   c3. Record which pieces were mutated (e.g. using something like the
   index_loc vs. index_reg_loc distinction in the current code)

That should be general enough for current targets, but if it isn't,
we could generalise it further when we know what generalisation is needed.

That's still going to be a fair amount of code, but hopefully not more,
and we might have more confidence at each stage what each value is.
And it avoids the risk of treating mutated addresses as unmutated ones.


Just an idea though.  Probably not for 4.8, although I might try it
if I find time.
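Step (a) of the sketch above, peeling mutations, might look something like this in isolation (the node type and codes are illustrative stand-ins for rtx, not GCC's representation):

```c
#include <assert.h>
#include <stddef.h>

/* Peel the "mutations" listed above (an aligning AND, a SUBREG, a
   unary operator such as an extension) off an address expression to
   reach the underlying piece.  */

enum code { REG, AND, SUBREG, ZERO_EXTEND };

struct node
{
  enum code code;
  struct node *inner;   /* operand 0, when there is one */
};

static int
mutation_p (enum code c)
{
  return c == AND || c == SUBREG || c == ZERO_EXTEND;
}

/* Return the expression with all outer mutations removed.  */
static struct node *
strip_mutations (struct node *x)
{
  while (mutation_p (x->code))
    x = x->inner;
  return x;
}
```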

I am not sure that you listed all the cases.  It would be great if you
listed all the cases.  In this case we could 

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-17 Thread Richard Sandiford
Thanks for all the updates.

Vladimir Makarov vmaka...@redhat.com writes:
 + /* index * scale + disp => new base + index * scale  */
 + enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
 +
 + lra_assert (INDEX_REG_CLASS != NO_REGS);
 + new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
 + lra_assert (GET_CODE (*addr_loc) == PLUS);
 + lra_emit_move (new_reg, *ad.disp_loc);
 + if (CONSTANT_P (XEXP (*addr_loc, 1)))
 +   XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
 + XEXP (*addr_loc, 0) = new_reg;
 The canonical form is (plus (mult ...) (reg)) rather than
 (plus (reg) (mult ...)), but it looks like we create the latter.
 I realise you try both forms here:
 It might happen because equiv substitution in LRA.
 + /* Some targets like ARM, accept address operands in
 +specific order -- try exchange them if necessary.  */
 + if (! valid_address_p (mode, *addr_loc, as))
 +   {
 + exchange_plus_ops (*addr_loc);
 + if (! valid_address_p (mode, *addr_loc, as))
 +   exchange_plus_ops (*addr_loc);
 +   }
 but I think we should try the canonical form first.  And I'd prefer it
 if we didn't try the other form at all, especially in 4.8.  It isn't
 really the backend's job to reject non-canonical rtl.  This might well
 be another case where some targets need a (hopefully small) tweak in
 order to play by the rules.

 Also, I suppose this section of code feeds back to my question on
 Wednesday about the distinction that LRA seems to make between the
 compile-time constant in:

(plus (reg X1) (const_int Y1))

 and the link-time constant in:

(plus (reg X2) (symbol_ref Y2))

 It looked like extract_address_regs classified X1 as a base register and
 X2 as an index register.  The difference between the two constants has
 no run-time significance though, and I think we should handle both X1
 and X2 as base registers (as I think reload does).

 I think the path above would then be specific to scaled indices.
 In the original address the complex index must come first and the
 displacement second.  In the modified address, the index would stay
 first and the new base register would be second.  More below.
 As I wrote above, the problem is also that equiv substitution can 
 create non-canonical forms.

Right.  Just in case there's a misunderstanding: I'm not complaining
about these routines internally using forms that are noncanonical
(which could happen because of equiv substitution, like you say).
I just think that what we eventually try to validate should be canonical.
In a way it's similar to how the simplify-rtx.c routines work.

If there are targets that only accept noncanonical rtl (which is after
all just a specific type of invalid rtl), they need to be fixed.

 +  /* base + scale * index + disp => new base + scale * index  */
 +  new_reg = base_plus_disp_to_reg (mode, as, ad);
 +  *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
 +  if (! valid_address_p (mode, *addr_loc, as))
 +   {
 + /* Some targets like ARM, accept address operands in
 +specific order -- try exchange them if necessary.  */
 + exchange_plus_ops (*addr_loc);
 + if (! valid_address_p (mode, *addr_loc, as))
 +   exchange_plus_ops (*addr_loc);
 +   }
 Same comment as above about canonical rtl.  Here we can have two
 registers -- in which case the base should come first -- or a more
 complex index -- in which case the index should come first.

 We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
 with the operands in either order, and let it take care of the details.
 Using simplify_gen_binary would help with the earlier index+disp case too.
 Equiv substitution can create non-canonical forms.  There are 2 approaches:
 o have code for dealing with non-canonical forms (equiv substitution, 
 target stupidity)
 o always support canonical forms and require them from targets.

 I decided to use the 1st variant but I am reconsidering it.  I'll try to 
 fix it before inclusion.  But I am not sure I have time for this.  All 
 these changes make LRA unstable.  In fact, I've just found that changes 
 I already made so far resulted in 2 broken SPEC2000 tests although the GCC 
 testsuite and bootstrap look good.

OK.  I'm happy to try fixing the noncanonical thing.

 +  /* If this is post-increment, first copy the location to the reload reg.  */
 +  if (post && real_in != result)
 +emit_insn (gen_move_insn (result, real_in));
 Nit, but real_in != result can never be true AIUI, and I was confused how
 the code could be correct in that case.  Maybe just remove it, or make
 it an assert?
 No, it might be true:

 real_in = in == value ? incloc : in;
 ...
 if (cond)
result = incloc;
 else
result = ...

 if (post && real_in != result)

 So it is true if in == value && cond

Sorry, what I meant was that cond is ! post && REG_P (incloc):

  if (! post && REG_P (incloc))
result = incloc;
  else

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-17 Thread Vladimir Makarov

On 12-10-15 12:49 PM, Richard Sandiford wrote:

Hi Vlad,

Some comments about the rest of LRA.  Nothing major here...

Vladimir Makarov vmaka...@redhat.com writes:

+/* Info about register in an insn.  */
+struct lra_insn_reg
+{
+  /* The biggest mode through which the insn refers to the register
+ (remember the register can be accessed through a subreg in the
+ insn).  */
+  ENUM_BITFIELD(machine_mode) biggest_mode : 16;

AFAICT, this is actually always the mode of a specific reference,
and if there are references to the same register in different modes,
those references get their own lra_insn_regs.  mode might be better
than biggest_mode if so.

It seems mode is also not good.  I've just modified the comment to
reflect the fact that it is just a reference.

+/* Static part (common info for insns with the same ICODE) of LRA
+   internal insn info. It exists in at most one exemplar for each
+   non-negative ICODE. Warning: if the structure definition is
+   changed, the initializer for debug_insn_static_data in lra.c should
+   be changed too.  */

Probably worth saying (before the warning) that there is also
one structure for each asm.

Good point.  I added a comment.

+/* LRA internal info about an insn (LRA internal insn
+   representation).  */
+struct lra_insn_recog_data
+{
+  int icode; /* The insn code. */
+  rtx insn; /* The insn itself. */
+  /* Common data for insns with the same ICODE. */
+  struct lra_static_insn_data *insn_static_data;

Maybe worth mentioning asms here too.

Fixed.

+  /* Two arrays of size correspondingly equal to the operand and the
+ duplication numbers: */
+  rtx **operand_loc; /* The operand locations, NULL if no operands.  */
+  rtx **dup_loc; /* The dup locations, NULL if no dups. */
+  /* Number of hard registers implicitly used in given call insn.  The
+ value can be NULL or points to array of the hard register numbers
+ ending with a negative value.  */
+  int *arg_hard_regs;
+#ifdef HAVE_ATTR_enabled
+  /* Alternative enabled for the insn. NULL for debug insns.  */
+  bool *alternative_enabled_p;
+#endif
+  /* The alternative should be used for the insn, -1 if invalid, or we
+ should try to use any alternative, or the insn is a debug
+ insn.  */
+  int used_insn_alternative;
+  struct lra_insn_reg *regs;  /* Always NULL for a debug insn. */

Comments consistently above the field.

Fixed.

+extern void lra_expand_reg_info (void);

This doesn't exist any more.

Fixed.

+extern int lra_constraint_new_insn_uid_start;

Just saying in case: this seems to be write-only, with lra-constraints.c
instead using a static variable to track the uid start.

I realise you might want to keep it anyway for consistency with
lra_constraint_new_regno_start, or for debugging.


+extern rtx lra_secondary_memory[NUM_MACHINE_MODES];

This doesn't exist any more.

Removed.  Thanks.

+/* lra-saves.c: */
+
+extern bool lra_save_restore (void);

Same for this file and function.

Removed.

+/* The function returns TRUE if at least one hard register from ones
+   starting with HARD_REGNO and containing value of MODE are in set
+   HARD_REGSET. */
+static inline bool
+lra_hard_reg_set_intersection_p (int hard_regno, enum machine_mode mode,
+HARD_REG_SET hard_regset)
+{
+  int i;
+
+  lra_assert (hard_regno >= 0);
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+if (TEST_HARD_REG_BIT (hard_regset, hard_regno + i))
+  return true;
+  return false;
+}

This is the same as overlaps_hard_reg_set_p.


I removed it and started to use the function overlaps_hard_reg_set_p.
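For reference, the check that overlaps_hard_reg_set_p performs (and the removed helper duplicated) can be sketched standalone, modelling the hard-reg set as a 64-bit mask; in GCC the number of registers a value occupies comes from hard_regno_nregs:

```c
#include <assert.h>
#include <stdint.h>

/* Return nonzero if any hard register in the range
   [hard_regno, hard_regno + nregs) is a member of SET.  */
static int
overlaps_p (uint64_t set, int hard_regno, int nregs)
{
  for (int i = nregs - 1; i >= 0; i--)
    if (set & ((uint64_t) 1 << (hard_regno + i)))
      return 1;
  return 0;
}
```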

+/* Return hard regno and offset of (sub-)register X through arguments
+   HARD_REGNO and OFFSET.  If it is not (sub-)register or the hard
+   register is unknown, then return -1 and 0 correspondingly.  */

The function seems to return -1 for both.
Fixed.  It does not matter for the rest of the code as offset is used only 
when hard_regno >= 0.

+/* Add hard registers starting with HARD_REGNO and holding value of
+   MODE to the set S.  */
+static inline void
+lra_add_hard_reg_set (int hard_regno, enum machine_mode mode, HARD_REG_SET *s)
+{
+  int i;
+
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+SET_HARD_REG_BIT (*s, hard_regno + i);
+}

This is add_to_hard_reg_set.

Removed.

+   Here is block diagram of LRA passes:
+
+   [ASCII block diagram, garbled by the archive extraction.  The
+    legible boxes are: Undo inheritance for spilled pseudos and
+    splits (for pseudos that got the same hard regs); Memory-memory
+    move coalesce; New (and old) pseudos assignment.]

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-17 Thread Steven Bosscher
On Wed, Oct 17, 2012 at 9:53 PM, Vladimir Makarov wrote:
 On 12-10-15 12:49 PM, Richard Sandiford wrote:
 Getting rid of reload always seemed like a pipe dream, and if the only
 known drawback of this replacement is that it takes a while on extreme
 testcases, that's an amazing achievement.  (Not to say compile time
 isn't important, just that there were so many other hurdles to overcome.)

Just to be clear, LRA now does no worse from a compile time POV than,
say, tree-ssa-live. Most of the scalability problems have been
addressed.

 It is my second attempt.  The first one was the YARA project.  I got a lot
 of experience from that project and knowledge of how not to do it.
 LRA will still be a long-lasting project.  I don't think I found all the
 weirdness of reload just by trying 8 targets (fixing one bug on one target
 frequently resulted in new bugs on other targets, so it frequently required
 cardinal changes to the original code).  Only after trying the 8 targets
 did I get the feeling that this approach could work well.

Hats off to you, Vlad, for your years of effort on improving GCC's RA!

Ciao!
Steven


Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-16 Thread Vladimir Makarov

On 12-10-12 10:29 AM, Richard Sandiford wrote:

Hi Vlad,

Comments for the rest of ira-constraints.c.

Vladimir Makarovvmaka...@redhat.com  writes:

+  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
+  change_p = equiv_address_substitution (&ad, addr_loc, mode, as, code);
+  if (ad.base_reg_loc != NULL)
+{
+  if (process_addr_reg
+ (ad.base_reg_loc, before,
+  (ad.base_modify_p && REG_P (*ad.base_reg_loc)
+    && find_regno_note (curr_insn, REG_DEAD,
+   REGNO (*ad.base_reg_loc)) == NULL
+   ? after : NULL),
+  base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
+   change_p = true;
+  if (ad.base_reg_loc2 != NULL)
+   *ad.base_reg_loc2 = *ad.base_reg_loc;
+  saved_base_reg = *ad.base_reg_loc;
+  lra_eliminate_reg_if_possible (ad.base_reg_loc);
+  if (ad.base_reg_loc2 != NULL)
+   {
+ saved_base_reg2 = *ad.base_reg_loc2;
+ lra_eliminate_reg_if_possible (ad.base_reg_loc2);
+   }

We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it
might be clearer without saved_base_reg2.  More below...

+  /* The following addressing is checked by constraints and
+usually target specific legitimate address hooks do not
+consider them valid.  */
+  || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
+  || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC

typo: two PRE_DECs, although:

+  || GET_CODE (*addr_loc) == PRE_MODIFY
+  || GET_CODE (*addr_loc) == POST_MODIFY

the whole lot could just be replaced by ad.base_modify_p, or perhaps
even removed entirely given:

+  /* In this case we can not do anything because if it is wrong
+that is because of wrong displacement.  Remember that any
+address was legitimate in non-strict sense before LRA.  */
+  || ad.disp_loc == NULL)

It doesn't seem worth validating the address at all for ad.disp_loc == NULL.
E.g. something like:

   if (ad.base_reg_loc != NULL
       && (process_addr_reg
           (ad.base_reg_loc, before,
            (ad.base_modify_p && REG_P (*ad.base_reg_loc)
             && find_regno_note (curr_insn, REG_DEAD,
                                 REGNO (*ad.base_reg_loc)) == NULL
             ? after : NULL),
            base_reg_class (mode, as, ad.base_outer_code, ad.index_code))))
     {
       change_p = true;
       if (ad.base_reg_loc2 != NULL)
         *ad.base_reg_loc2 = *ad.base_reg_loc;
     }

   if (ad.index_reg_loc != NULL
       && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
     change_p = true;

   /* The address was valid before LRA.  We only change its form if the
  address has a displacement, so if it has no displacement it must
  still be valid.  */
   if (ad.disp_loc == NULL)
 return change_p;

   /* See whether the address is still valid.  Some ports do not check
  displacements for eliminable registers, so we replace them
  temporarily with the elimination target.  */
   saved_base_reg = saved_index_reg = NULL_RTX;
   ...
   if (ok_p)
 return change_p;

Yes, it has sense.  I changed the code as you propose.

+#ifdef HAVE_lo_sum
+ {
+   rtx insn;
+   rtx last = get_last_insn ();
+
+   /* disp = lo_sum (new_base, disp)   */
+   insn = emit_insn (gen_rtx_SET
+ (VOIDmode, new_reg,
+  gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
+   code = recog_memoized (insn);
+   if (code >= 0)
+ {
+   rtx save = *ad.disp_loc;
+
+   *ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
+   if (! valid_address_p (mode, *ad.disp_loc, as))
+ {
+   *ad.disp_loc = save;
+   code = -1;
+ }
+ }
+   if (code < 0)
+ delete_insns_since (last);
+ }
+#endif

Nice :-)

Purely for the record, I wondered whether the high part should be
generated with emit_move_insn(_1) instead, with the rhs of the move
being the HIGH rtx.  That would allow targets to cope with cases where
the high part isn't represented directly as a HIGH.  E.g. on MIPS and
Alpha, small-data accesses use the global register as the high part instead.

However, both MIPS and Alpha accept small-data addresses as legitimate
constants and addresses before and during reload and only introduce the
split form after reload.  And I think that's how any other cases that
aren't simple HIGHs should be handled too.  E.g. MIPS also represents
GOT page loads as HIGHs until after reload, and only then lowers the
HIGH to a GOT load.  Allowing the backend to generate anything other
than a plain HIGH set here would be a double-edged sword.

So after all that I agree that the gen_rtx_SET above is better than
calling the move expanders.
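For readers unfamiliar with HIGH/LO_SUM: on a MIPS-style target the pair computes an address in two parts, roughly as below. The signed 16-bit split is an assumption for illustration; targets differ:

```c
#include <assert.h>
#include <stdint.h>

/* The high part is the address rounded so that the remaining low
   part fits in a signed 16-bit immediate; high + lo reproduces the
   full address.  */

static uint32_t
high_part (uint32_t addr)
{
  /* Round up to compensate for sign-extension of the low part.  */
  return (addr + 0x8000) & 0xffff0000u;
}

static int32_t
lo_part (uint32_t addr)
{
  /* Signed 16-bit remainder.  */
  return (int32_t) (addr << 16) >> 16;
}
```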

Thanks for sharing your 

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-15 Thread Richard Sandiford
Vladimir Makarov vmaka...@redhat.com writes:
 if that's accurate.  I dropped the term reload pseudo because of
 the general comment in my earlier reply about the use of reload pseudo
 when the code seems to include inheritance and split pseudos too.
 There is no inheritance and splitting yet.  It is done after the 
 constraint pass.
 So at this stage >= new_regno_start means reload pseudo.

Ah, OK.

 That's a change in the meaning of NEW_CLASS, but seems easier for
 callers to handle.  I think all it requires is changing:

 +  common_class = ira_reg_class_subset[rclass][cl];
 +  if (new_class != NULL)
 +   *new_class = common_class;
 to:

common_class = ira_reg_class_subset[rclass][cl];
if (new_class != NULL && rclass != common_class)
  *new_class = common_class;
 This change results in infinite LRA looping on a first libgcc file 
 compilation.  Unfortunately I have no time to investigate it.
 I'd like to say that most of this code is very sensitive to 
 changes.  I see it a lot.  You change something looking obvious and a 
 target is broken.
 I am going to investigate it when I have more time.

Thanks.

 +default:
 +  {
 +   const char *fmt = GET_RTX_FORMAT (code);
 +   int i;
 +
 +   if (GET_RTX_LENGTH (code) != 1
 +   || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
 + {
 +   for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
 + if (fmt[i] == 'e')
 +   extract_loc_address_regs (false, mode, as, XEXP (x, i),
 + context_p, code, SCRATCH,
 + modify_p, ad);
 +   break;
 + }
 +   /* fall through for case UNARY_OP (UNSPEC ...)  */
 +  }
 +
 +case UNSPEC:
 +  if (ad->disp_loc == NULL)
 +   ad->disp_loc = loc;
 +  else if (ad->base_reg_loc == NULL)
 +   {
 + ad->base_reg_loc = loc;
 + ad->base_outer_code = outer_code;
 + ad->index_code = index_code;
 + ad->base_modify_p = modify_p;
 +   }
 +  else
 +   {
 + lra_assert (ad->index_reg_loc == NULL);
 + ad->index_reg_loc = loc;
 +   }
 +  break;
 +
 +}
 Which targets use a bare UNSPEC as a displacement?  I thought a
 displacement had to be a link-time constant, in which case it should
 satisfy CONSTANT_P.  For UNSPECs, that means wrapping it in a CONST.
 I saw it somewhere.  I guess IA64.
 I'm just a bit worried that the UNSPEC handling is sensitive to the
 order that subrtxes are processed (unlike PLUS, which goes to some
 trouble to work out what's what).  It could be especially confusing
 because the default case processes operands in reverse order while
 PLUS processes them in forward order.

 Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough?
 Probably deserves a comment.
 I don't remember.  To figure out, I should switch it off and try all 
 targets supported by LRA.
 AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just
 recording where reloads of a particular class need to go (obviously
 in the case of disp_loc, which isn't reloaded at all).  The fields
 have semantic value too.  I.e. we use them to work out the value
 of at least part of the address.

 In that case it seems dangerous to look through general rtxes
 in the way that the default case above does.  Maybe just making
 sure that DISP_LOC is involved in a sum with the base would be
 enough, but another idea was:

 
 I know of three ways of mutating (for want of a better word)
 an address:

1. (and X (const_int X)), to align
2. a subreg
3. a unary operator (such as truncation or extension)

 So maybe we could:

a. remove outer mutations (using a helper function)
b. handle LO_SUM, PRE_*, POST_*: as now
 c. otherwise treat the address as the sum of one, two or three pieces.
   c1. Peel mutations of all pieces.
   c2. Classify the pieces into base, index and displacement.
   This would be similar to the jousting code above, but hopefully
   easier because all three rtxes are to hand.  E.g. we could
   do the base vs. index thing in a similar way to
   commutative_operand_precedence.
   c3. Record which pieces were mutated (e.g. using something like the
   index_loc vs. index_reg_loc distinction in the current code)

 That should be general enough for current targets, but if it isn't,
 we could generalise it further when we know what generalisation is needed.

 That's still going to be a fair amount of code, but hopefully not more,
 and we might have more confidence at each stage what each value is.
 And it avoids the risk of treating mutated addresses as unmutated ones.
 

 Just an idea though.  Probably not for 4.8, although I might try it
 if I find time.
 I am not sure that you listed all the cases.  It would be great if you 
 listed all the cases. In this case we could 

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-15 Thread Richard Sandiford
Hi Vlad,

Some comments about the rest of LRA.  Nothing major here...

Vladimir Makarov vmaka...@redhat.com writes:
 +/* Info about register in an insn.  */
 +struct lra_insn_reg
 +{
 +  /* The biggest mode through which the insn refers to the register
 + (remember the register can be accessed through a subreg in the
 + insn).  */
 +  ENUM_BITFIELD(machine_mode) biggest_mode : 16;

AFAICT, this is actually always the mode of a specific reference,
and if there are references to the same register in different modes,
those references get their own lra_insn_regs.  mode might be better
than biggest_mode if so.

 +/* Static part (common info for insns with the same ICODE) of LRA
 +   internal insn info.   It exists in at most one exemplar for each
 +   non-negative ICODE.   Warning: if the structure definition is
 +   changed, the initializer for debug_insn_static_data in lra.c should
 +   be changed too.  */

Probably worth saying (before the warning) that there is also
one structure for each asm.

 +/* LRA internal info about an insn (LRA internal insn
 +   representation).  */
 +struct lra_insn_recog_data
 +{
 +  int icode; /* The insn code.   */
 +  rtx insn; /* The insn itself.   */
 +  /* Common data for insns with the same ICODE.   */
 +  struct lra_static_insn_data *insn_static_data;

Maybe worth mentioning asms here too.

 +  /* Two arrays of size correspondingly equal to the operand and the
 + duplication numbers: */
 +  rtx **operand_loc; /* The operand locations, NULL if no operands.  */
 +  rtx **dup_loc; /* The dup locations, NULL if no dups.   */
 +  /* Number of hard registers implicitly used in given call insn.  The
 + value can be NULL or points to array of the hard register numbers
 + ending with a negative value.  */
 +  int *arg_hard_regs;
 +#ifdef HAVE_ATTR_enabled
 +  /* Alternative enabled for the insn.   NULL for debug insns.  */
 +  bool *alternative_enabled_p;
 +#endif
 +  /* The alternative should be used for the insn, -1 if invalid, or we
 + should try to use any alternative, or the insn is a debug
 + insn.  */
 +  int used_insn_alternative;
 +  struct lra_insn_reg *regs;  /* Always NULL for a debug insn.   */

Comments consistently above the field.

 +extern void lra_expand_reg_info (void);

This doesn't exist any more.

 +extern int lra_constraint_new_insn_uid_start;

Just saying in case: this seems to be write-only, with lra-constraints.c
instead using a static variable to track the uid start.

I realise you might want to keep it anyway for consistency with
lra_constraint_new_regno_start, or for debugging.

 +extern rtx lra_secondary_memory[NUM_MACHINE_MODES];

This doesn't exist any more.

 +/* lra-saves.c: */
 +
 +extern bool lra_save_restore (void);

Same for this file and function.

 +/* The function returns TRUE if at least one hard register from ones
 +   starting with HARD_REGNO and containing value of MODE are in set
 +   HARD_REGSET.   */
 +static inline bool
 +lra_hard_reg_set_intersection_p (int hard_regno, enum machine_mode mode,
 +  HARD_REG_SET hard_regset)
 +{
 +  int i;
 +
 +  lra_assert (hard_regno >= 0);
 +  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
 +if (TEST_HARD_REG_BIT (hard_regset, hard_regno + i))
 +  return true;
 +  return false;
 +}

This is the same as overlaps_hard_reg_set_p.
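
For reference, the overlap test both helpers implement can be sketched outside GCC with a plain 64-bit mask standing in for HARD_REG_SET and an explicit register count standing in for the hard_regno_nregs lookup (both stand-ins are assumptions of this sketch, not GCC's real types):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for GCC's HARD_REG_SET: one bit per hard register.  */
typedef uint64_t hard_reg_set;

/* True if any of the NREGS consecutive hard registers starting at
   HARD_REGNO is in SET -- the same loop shape as the patch's
   lra_hard_reg_set_intersection_p / overlaps_hard_reg_set_p.  */
static bool
overlaps_p (int hard_regno, int nregs, hard_reg_set set)
{
  for (int i = nregs - 1; i >= 0; i--)
    if (set & ((hard_reg_set) 1 << (hard_regno + i)))
      return true;
  return false;
}
```

A multi-register value starting at regno 4 in a 2-register mode overlaps a set containing only regno 5, which is exactly why the loop must walk all NREGS registers rather than test the first one.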

 +/* Return hard regno and offset of (sub-)register X through arguments
 +   HARD_REGNO and OFFSET.  If it is not (sub-)register or the hard
 +   register is unknown, then return -1 and 0 correspondingly.  */

The function seems to return -1 for both.

 +/* Add hard registers starting with HARD_REGNO and holding value of
 +   MODE to the set S.  */
 +static inline void
 +lra_add_hard_reg_set (int hard_regno, enum machine_mode mode, HARD_REG_SET *s)
 +{
 +  int i;
 +
 +  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
 +SET_HARD_REG_BIT (*s, hard_regno + i);
 +}

This is add_to_hard_reg_set.

 +   Here is block diagram of LRA passes:
 +
 +   [ASCII block diagram garbled in this archive.  The boxes are:
 +    Remove scratches; Update virtual register displacements;
 +    New (and old) pseudos assignment; Memory-memory move coalesce;
 +    and Undo inheritance (for spilled pseudos) and splits (for
 +    pseudos that got the same hard regs).]

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-13 Thread Richard Sandiford
I'm having to correct my own comments again, sorry.

Richard Sandiford rdsandif...@googlemail.com writes:
 +  /* If this is post-increment, first copy the location to the reload reg.  */
 +  if (post && real_in != result)
 +    emit_insn (gen_move_insn (result, real_in));

 Nit, but real_in != result can never be true AIUI, and I was confused how
 the code could be correct in that case.  Maybe just remove it, or make
 it an assert?

Probably obvious, but I meant real_in == result can never be true.
real_in != result could be removed or turned into an assert.

 +if (GET_CODE (op) == PLUS)
 +  {
 +plus = op;
 +op = XEXP (op, 1);
 +  }

 Sorry, I'm complaining about old reload code again, but: does this
 actually happen in LRA?  In reload, a register operand could become a
 PLUS because of elimination, but I thought LRA did things differently.
 Besides, this is only needed for:

 +if (CONST_POOL_OK_P (mode, op)
 +    && ((targetm.preferred_reload_class
 +  (op, (enum reg_class) goal_alt[i]) == NO_REGS)
 +   || no_input_reloads_p)
 +    && mode != VOIDmode)
 +  {
 +rtx tem = force_const_mem (mode, op);
 +
 +change_p = true;
 +/* If we stripped a SUBREG or a PLUS above add it back.  */
 +if (plus != NULL_RTX)
 +  tem = gen_rtx_PLUS (mode, XEXP (plus, 0), tem);

 and we shouldn't have (plus (constant ...) ...) after elimination
 (or at all outside of a CONST).  I don't understand why the code is
 needed even in reload.

Scratch the thing about needing it for reload.  It's obviously the
second second operand we're reloading, not the first, and it could
well be that an elimination displacement needs to be reloaded via
the constant pool.

The question for LRA still stands though.

Richard


Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-12 Thread Richard Sandiford
Hi Vlad,

Comments for the rest of ira-constraints.c.

Vladimir Makarov vmaka...@redhat.com writes:
 +  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
 +  change_p = equiv_address_substitution (&ad, addr_loc, mode, as, code);
 +  if (ad.base_reg_loc != NULL)
 +{
 +  if (process_addr_reg
 +   (ad.base_reg_loc, before,
 +(ad.base_modify_p && REG_P (*ad.base_reg_loc)
 + && find_regno_note (curr_insn, REG_DEAD,
 + REGNO (*ad.base_reg_loc)) == NULL
 + ? after : NULL),
 +base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
 + change_p = true;
 +  if (ad.base_reg_loc2 != NULL)
 + *ad.base_reg_loc2 = *ad.base_reg_loc;
 +  saved_base_reg = *ad.base_reg_loc;
 +  lra_eliminate_reg_if_possible (ad.base_reg_loc);
 +  if (ad.base_reg_loc2 != NULL)
 + {
 +   saved_base_reg2 = *ad.base_reg_loc2;
 +   lra_eliminate_reg_if_possible (ad.base_reg_loc2);
 + }

We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it
might be clearer without saved_base_reg2.  More below...

 +  /* The following addressing is checked by constraints and
 +  usually target specific legitimate address hooks do not
 +  consider them valid.  */
 +  || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
 +  || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC

typo: two PRE_DECs, although:

 +  || GET_CODE (*addr_loc) == PRE_MODIFY
 +  || GET_CODE (*addr_loc) == POST_MODIFY

the whole lot could just be replaced by ad.base_modify_p, or perhaps
even removed entirely given:

 +  /* In this case we can not do anything because if it is wrong
 +  that is because of wrong displacement.  Remember that any
 +  address was legitimate in non-strict sense before LRA.  */
 +  || ad.disp_loc == NULL)

It doesn't seem worth validating the address at all for ad.disp_loc == NULL.
E.g. something like:

  if (ad.base_reg_loc != NULL
      && (process_addr_reg
   (ad.base_reg_loc, before,
    (ad.base_modify_p && REG_P (*ad.base_reg_loc)
     && find_regno_note (curr_insn, REG_DEAD,
REGNO (*ad.base_reg_loc)) == NULL
? after : NULL),
    base_reg_class (mode, as, ad.base_outer_code, ad.index_code))))
{
  change_p = true;
  if (ad.base_reg_loc2 != NULL)
*ad.base_reg_loc2 = *ad.base_reg_loc;
}

  if (ad.index_reg_loc != NULL
      && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
change_p = true;
change_p = true;

  /* The address was valid before LRA.  We only change its form if the
 address has a displacement, so if it has no displacement it must
 still be valid.  */
  if (ad.disp_loc == NULL)
return change_p;

  /* See whether the address is still valid.  Some ports do not check
 displacements for eliminable registers, so we replace them
 temporarily with the elimination target.  */
  saved_base_reg = saved_index_reg = NULL_RTX;
  ...
  if (ok_p)
return change_p;

 +#ifdef HAVE_lo_sum
 +   {
 + rtx insn;
 + rtx last = get_last_insn ();
 +
 + /* disp = lo_sum (new_base, disp)  */
 + insn = emit_insn (gen_rtx_SET
 +   (VOIDmode, new_reg,
 +    gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
 + code = recog_memoized (insn);
 + if (code >= 0)
 +   {
 + rtx save = *ad.disp_loc;
 +
 + *ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
 + if (! valid_address_p (mode, *ad.disp_loc, as))
 +   {
 + *ad.disp_loc = save;
 + code = -1;
 +   }
 +   }
 + if (code < 0)
 +   delete_insns_since (last);
 +   }
 +#endif

Nice :-)

Purely for the record, I wondered whether the high part should be
generated with emit_move_insn(_1) instead, with the rhs of the move
being the HIGH rtx.  That would allow targets to cope with cases where
the high part isn't represented directly as a HIGH.  E.g. on MIPS and
Alpha, small-data accesses use the global register as the high part instead.

However, both MIPS and Alpha accept small-data addresses as legitimate
constants and addresses before and during reload and only introduce the
split form after reload.  And I think that's how any other cases that
aren't simple HIGHs should be handled too.  E.g. MIPS also represents
GOT page loads as HIGHs until after reload, and only then lowers the
HIGH to a GOT load.  Allowing the backend to generate anything other
than a plain HIGH set here would be a double-edged sword.

So after all that I agree that the gen_rtx_SET above is better than
calling the move expanders.
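
For background, a HIGH/LO_SUM pair works because the low part is a sign-extended immediate (16 bits on MIPS-style targets), so the high part must round up whenever bit 15 of the address is set; only then does high + low reproduce the original address. A standalone sketch of that arithmetic (the 16-bit split width is an assumption of this sketch; targets vary):

```c
#include <stdint.h>

/* High part of ADDR for a HIGH/LO_SUM split with a signed 16-bit low
   part: adding 0x8000 first rounds up when the low part will be
   negative.  */
static uint32_t
high_part (uint32_t addr)
{
  return (addr + 0x8000u) & 0xffff0000u;
}

/* Low part: the bottom 16 bits of ADDR, sign-extended.  */
static int32_t
low_part (uint32_t addr)
{
  return (int32_t) (int16_t) (addr & 0xffffu);
}
```

For any address the identity high_part (a) + low_part (a) == a holds, including the awkward case where bit 15 is set and the low part is negative.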

 +   /* index * scale + disp = new base + index * scale  */
 +   enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
 +
 +   lra_assert (INDEX_REG_CLASS != 

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-12 Thread Richard Sandiford
Vladimir Makarov vmaka...@redhat.com writes:
 +/* Info about pseudo used during the assignment pass.  Thread is a set
 +   of connected reload and inheritance pseudos with the same set of
 +   available hard reg set.  Thread is a pseudo itself for other
 +   cases.  */
 +struct regno_assign_info
 Maybe:

 /* Information about the thread to which a pseudo belongs.  Threads are
 a set of connected reload and inheritance pseudos with the same set of
 available hard registers.  Lone registers belong to their own threads.  
 */
 Fixed.
 Although the condition seems to be:
 +(ira_class_hard_regs_num[regno_allocno_class_array[regno1]]
 +   == ira_class_hard_regs_num[regno_allocno_class_array[regno2]]))
 i.e. the same _number_ of available hard regs, but not necessarily the
 same set.
 It should be the same in most cases.  This condition is just a faster 
 approximation of having the same available hard reg set.

The distinction does seem important though.   It's possible that
a target has two distinct register files of the same allocatable size.
Would something like:

  (ira_class_subset_p[class1][class2]
   && ira_class_subset_p[class2][class1])

work instead?
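
The point of the suggestion is that mutual inclusion is a genuine set-equality test, whereas comparing ira_class_hard_regs_num only compares cardinalities. With a bitmask standing in for a class's set of allocatable registers (an assumption of this sketch, not IRA's real representation):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a register class's set of allocatable hard regs.  */
typedef uint64_t reg_set;

/* A is a subset of B iff A has no bit outside B.  */
static bool
subset_p (reg_set a, reg_set b)
{
  return (a & ~b) == 0;
}

/* Mutual inclusion is equivalent to set equality, which is what the
   suggested ira_class_subset_p[][] test checks.  */
static bool
same_set_p (reg_set a, reg_set b)
{
  return subset_p (a, b) && subset_p (b, a);
}
```

Two classes of four registers each, say bits 0-3 and bits 4-7, pass the equal-count test but fail the mutual-subset test, which is exactly the distinct-register-file case raised above.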

 /* Update the preference for using HARD_REGNO for pseudos that are
 connected directly or indirectly with REGNO.  Apply divisor DIV
 to any preference adjustments.

 The more indirectly a pseudo is connected, the smaller its effect
 should be.  We therefore increase DIV on each hop.  */

 Fixed.  By the way, it is your invention from IRA.

Heh, I'd forgotten all about that.

 +  /* We are trying to spill a reload pseudo.  That is wrong we
 +should assign all reload pseudos, otherwise we cannot reuse
 +the selected alternatives.  */
 +  hard_regno = find_hard_regno_for (regno, &cost, -1);
 +  if (hard_regno >= 0)
 +   {
 Don't really understand this comment, sorry.
 I removed the comment.  It is from old solution code that tried to 
 guarantee assignment to the reload pseudo.
 Also, why are we passing -1 to find_hard_regno_for, rather than hard_regno?
 The loop body up till this point has been specifically freeing up registers
 to make hard_regno allocatable.  I realise that, by spilling everything
 that overlaps this range, we might have freed up other registers too,
 and so made others besides hard_regno allocatable.  But wouldn't we want
 to test those other hard registers in their iteration of the loop
 instead of this one?  The spills in those iterations ought to be more
 directed (i.e. there should be less incidental spilling).

 As things stand, doing an rclass_size * rclass_size scan seems
 unnecessarily expensive, although probably off the radar.
 We cannot just pass hard_regno for multi-word pseudo when hard_regno-1 
 is already free.

But this call is in a loop that iterates over all registers in the class:

  for (i = 0; i < rclass_size; i++)
{
  hard_regno = ira_class_hard_regs[rclass][i];

and we reach the find_hard_regno_for call unless there is some
conflicting register that we cannot spill.  So if hard_regno - 1
belongs to the allocation class and is a viable choice, its iteration
of the loop would spill specifically for hard_regno - 1 and get the
most accurate cost for that register.  I couldn't see why any other
iteration of the loop would want to consider hard_regno - 1.

 You are right about the possibility of speeding up the code, although in 
 the profiles I looked at (including the last huge tests) spill_for and 
 find_hard_regno_for take little time.  That is probably because 
 you don't need to spill frequently.  Freeing one long-live-range pseudo 
 permits finding a hard regno without spilling for many short-lived pseudos 
 (reload and inheritance ones).
 Also the rclass_size * rclass_size loop is not expensive; the preparation of 
 data for the loop is expensive.

OK, in that case maybe the efficiency concern wasn't justified.
FWIW, I still think passing hard_regno would be clearer though,
in terms of meeting expectations.  It just seems odd to spill for
one specific register and then test all of them.  Especially when the
spilling we actually do after choosing register X is based on X's
iteration of this loop.

(I realise I could well be missing the point here though, sorry.)

 +  && (reload_hard_regno
 + = find_hard_regno_for (reload_regno,
 +&reload_cost, -1)) >= 0
 +  && (lra_hard_reg_set_intersection_p
 + (reload_hard_regno, PSEUDO_REGNO_MODE (reload_regno),
 +  spilled_hard_regs)))
 +   {
 + if (lra_dump_file != NULL)
 +   fprintf (lra_dump_file, " assign %d(cost=%d)",
 +reload_regno, reload_cost);
 + assign_temporarily (reload_regno, reload_hard_regno);
 + cost += reload_cost;
 It looks like registers that can be reallocated make hard_regno more
 expensive (specifically by reload_cost), but registers 

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-11 Thread Vladimir Makarov

On 10/04/2012 11:50 AM, Richard Sandiford wrote:

Hi Vlad,

This message is for lra-assigns.c.  Sorry for the piecemeal reviews,
never sure when I'll get time...


+/* This file contains a pass mostly assigning hard registers to reload
+   pseudos.  There is no any RTL code transformation on this pass.

Maybe:

/* This file's main objective is to assign hard registers to reload pseudos.
It also tries to allocate hard registers to other pseudos, but at a lower
priority than the reload pseudos.  The pass does not transform the RTL.

if that's accurate.

Yes.  That is better.  I used your comment.

+   Reload pseudos get what they need (usually) hard registers in
+   anyway possibly by spilling non-reload pseudos and by assignment
+   reload pseudos with smallest number of available hard registers
+   first.
+
+   If reload pseudos can get hard registers only through spilling
+   other pseudos, we choose what pseudos to spill taking into account
+   how given reload pseudo benefits and also how other reload pseudos
+   not assigned yet benefit too (see function spill_for).

Maybe:

We must allocate a hard register to every reload pseudo.  We try to
increase the chances of finding a viable allocation by assigning the
pseudos in order of fewest available hard registers first.  If we
still fail to find a hard register, we spill other (non-reload)
pseudos in order to make room.

assign_hard_regno_for allocates registers without spilling.
spill_for does the same with spilling.  Both functions use
a cost model to determine the most profitable choice of
hard and spill registers.

Ok.  I just changed two sentences a bit:

 find_hard_regno_for finds hard registers for allocation without 
spilling.  spill_for does the same with spilling.



+   Non-reload pseudos can get hard registers too if it is possible and
+   improves the code.  It might be possible because of spilling
+   non-reload pseudos on given pass.

Maybe:

Once we have finished allocating reload pseudos, we also try to
assign registers to other (non-reload) pseudos.  This is useful
if hard registers were freed up by the spilling just described.


Fixed.

+   We try to assign hard registers processing pseudos by threads.  The
+   thread contains reload and inheritance pseudos connected by copies
+   (move insns).  It improves the chance to get the same hard register
+   to pseudos in the thread and, as the result, to remove some move
+   insns.

Maybe:

We try to assign hard registers by collecting pseudos into threads.
These threads contain reload and inheritance pseudos that are connected
by copies (move insns).  Doing this improves the chances of pseudos
in the thread getting the same hard register and, as a result,
of allowing some move insns to be deleted.

Fixed.

+   When we assign hard register to a pseudo, we decrease the cost of
+   the hard registers for corresponding pseudos connected by copies.

Maybe:

When we assign a hard register to a pseudo, we decrease the cost of
using the same hard register for pseudos that are connected by copies.

Fixed.

+   If two hard registers are equally good for assigning the pseudo
+   with hard register cost point of view, we prefer a hard register in
+   smaller register bank.  By default, there is only one register
+   bank.  A target can define register banks by hook
+   register_bank. For example, x86-64 has a few register banks: hard
+   regs with and without REX prefixes are in different banks.  It
+   permits to generate smaller code as insns without REX prefix are
+   shorter.

Maybe:

If two hard registers have the same frequency-derived cost,
we prefer hard registers in lower register banks.  The mapping
of registers to banks is controlled by the register_bank target hook.
For example, x86-64 has a few register banks: hard registers with and
without REX prefixes are in different banks.  This permits us
to generate smaller code as insns without REX prefixes are shorter.

although this might change if the name of the hook changes.


With recent change in the hook name, I modified it to:

   If two hard registers have the same frequency-derived cost, we
   prefer hard registers with bigger priorities.  The mapping of
   registers to priorities is controlled by the register_priority
   target hook.  For example, x86-64 has a few register priorities:
   hard registers with and without REX prefixes have different
   priorities.  This permits us to generate smaller code as insns
   without REX prefixes are shorter.


+/* Info about pseudo used during the assignment pass.  Thread is a set
+   of connected reload and inheritance pseudos with the same set of
+   available hard reg set.  Thread is a pseudo itself for other
+   cases.  */
+struct regno_assign_info

Maybe:

/* Information about the thread to which a pseudo belongs.  Threads are
a set of connected reload and inheritance pseudos with 

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-10 Thread Richard Sandiford
Sorry, reading back in different surroundings made me notice a couple
of silly errors:

Richard Sandiford rdsandif...@googlemail.com writes:
 E.g.:

   if ((*loc = get_equiv_substitution (reg)) != reg)
 ...as above...
    if (*loc != reg || !in_class_p (reg, cl, &new_class))
 ...as above...
    else if (new_class != NO_REGS && rclass != new_class)
  change_class (regno, new_class, "Change", true);
   return change_p;

 (assuming change to in_class_p suggested earlier) seems like it
 covers the same cases.

...but that same in_class_p change means that the rclass != new_class
condition isn't needed.  I.e.

  if ((*loc = get_equiv_substitution (reg)) != reg)
...as above...
  if (*loc != reg || !in_class_p (reg, cl, &new_class))
...as above...
  else if (new_class != NO_REGS)
change_class (regno, new_class, "Change", true);
  return change_p;

 +  if (operand_reg[nop] != NULL_RTX)
 +{
 +  int last_reload = (lra_reg_info[ORIGINAL_REGNO
 +  (operand_reg[nop])]
 + .last_reload);
 +
 +  if (last_reload > bb_reload_num)
 +reload_sum += last_reload;
 +  else
 +reload_sum += bb_reload_num;

 The comment for reload_sum says:

 +/* Overall number reflecting distances of previous reloading the same
 +   value.  It is used to improve inheritance chances.  */
 +static int best_reload_sum;

 which made me think of distance from the current instruction.  I see
 it's actually something else, effectively a sum of instruction numbers.

 I assumed the idea was to prefer registers that were reloaded more
 recently (closer the current instruction).  In that case I thought that,
 for a distance-based best_reload_sum, smaller would be better,
 while for an instruction-number-based best_reload_sum, larger would
 be better.  It looks like we use instruction-number based best_reload_sums
 but prefer smaller sums:

 +   && (reload_nregs < best_reload_nregs
 +  || (reload_nregs == best_reload_nregs
 +   && best_reload_sum < reload_sum

 Is that intentional?

Clearly I can't read.  The code _does_ prefer higher numbers.  I still
think distance is a bit misleading though. :-)
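
So, as corrected, the ordering prefers candidates needing fewer reload registers and, on a tie, the larger instruction-number sum (i.e. the more recently reloaded value). That comparison can be condensed into a small helper (names here are illustrative, not LRA's):

```c
#include <stdbool.h>

/* True if candidate A beats candidate B under the ordering discussed
   above: fewer reload registers first; on a tie, the larger sum of
   instruction numbers, meaning the value was reloaded closer to the
   current insn.  */
static bool
better_candidate_p (int nregs_a, int reload_sum_a,
                    int nregs_b, int reload_sum_b)
{
  if (nregs_a != nregs_b)
    return nregs_a < nregs_b;
  return reload_sum_a > reload_sum_b;
}
```

Note how a candidate needing fewer registers wins regardless of its sum, matching the reload_nregs test taking precedence in the quoted condition.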

Just for the record, the rest of my question:

 Also, is this value meaningful for output reloads, which aren't really
 going to be able to inherit a value as such?  We seem to apply the cost
 regardless of whether it's an input or an output, so probably deserves
 a comment.

 Same for matched input operands, which as you say elsewhere aren't
 inherited.

still applies.

Richard


Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-10 Thread Vladimir Makarov

On 12-10-03 7:11 AM, Richard Sandiford wrote:

Hi Vlad,

Some comments on lra-spills.c and lra-coalesce.c.


+   The pass creates necessary stack slots and assign spilled pseudos
+   to the stack slots in following way:

s/assign/assigns/

Fixed.

+   (or insn memory constraints) might be not satisfied any more.

s/might be not/might not be/

Fixed.

+   For some targets, the pass can spill some pseudos into hard
+   registers of different class (usually into vector registers)
+   instead of spilling them into memory if it is possible and
+   profitable. Spilling GENERAL_REGS pseudo into SSE registers for
+   modern Intel x86/x86-64 processors is an example of such
+   optimization.  And this is actually recommended by Intel
+   optimization guide.

Maybe mention "core i7" specifically?  "Modern" is a bit dangerous
in code that'll live a long time.

Yes, right.  Fixed.  Another bad thing would be usage of the word "new".

+/* The structure describes a stack slot which can be used for several
+   spilled pseudos.  */
+struct slot
+{

Looks like this describes a register or stack slot given the hard_regno case.

Fixed.

+/* Array containing info about the stack slots. The array element is
+   indexed by the stack slot number in the range [0..slost_num).  */

Typo: slots_num

Fixed.

+  /* Each pseudo has an inherent size which comes from its own mode,
+ and a total size which provides room for paradoxical subregs
+ which refer to the pseudo reg in wider modes.
+
+ We can use a slot already allocated if it provides both enough
+ inherent space and enough total space.  Otherwise, we allocate a
+ new slot, making sure that it has no less inherent space, and no
+ less total space, then the previous slot. */

The second part of the comment seems a bit misplaced, since the following
code doesn't reuse stack slots.  This is done elsewhere instead.
Maybe the first part would be better above the inherent_size assignment.

Right.  I've changed comment to reflect the current state of the code.

+  /* If we have any adjustment to make, or if the stack slot is the
+ wrong mode, make a new stack slot. */
+  x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust);

We don't make a new slot here.
I removed the comment.  The same comment is present in reload1.c and 
probably should be also removed.

+/* Sort pseudos according their slot numbers putting ones with smaller
+   numbers first, or last when the frame pointer is not needed. So
+   pseudos with the first slot will be finally addressed with smaller
+   address displacement.  */
+static int
+pseudo_reg_slot_compare (const void *v1p, const void *v2p)
+{
+  const int regno1 = *(const int *) v1p;
+  const int regno2 = *(const int *) v2p;
+  int diff, slot_num1, slot_num2;
+  int total_size1, total_size2;
+
+  slot_num1 = pseudo_slots[regno1].slot_num;
+  slot_num2 = pseudo_slots[regno2].slot_num;
+  if ((diff = slot_num1 - slot_num2) != 0)
+return (frame_pointer_needed
+   || !FRAME_GROWS_DOWNWARD == STACK_GROWS_DOWNWARD ? diff : -diff);

The comment doesn't quite describe the condition.  Maybe:

/* Sort pseudos according to their slots, putting the slots in the order
that they should be allocated.  Slots with lower numbers have the highest
priority and should get the smallest displacement from the stack or
frame pointer (whichever is being used).

The first allocated slot is always closest to the frame pointer,
so prefer lower slot numbers when frame_pointer_needed.  If the stack
and frame grow in the same direction, then the first allocated slot is
always closest to the initial stack pointer and furthest away from the
final stack pointer, so allocate higher numbers first when using the
stack pointer in that case.  The reverse is true if the stack and
frame grow in opposite directions.  */

I used your comment.  Thanks.
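
The ordering that comment describes reduces to a tiny comparison (a sketch with illustrative names; GCC's real comparator also consults FRAME_GROWS_DOWNWARD and STACK_GROWS_DOWNWARD, which this sketch folds into a single flag):

```c
/* Negative if SLOT_NUM1 should be allocated before SLOT_NUM2.
   Lower slot numbers go first when addressing off the frame pointer;
   otherwise higher numbers go first, so that low-numbered (highest
   priority) slots end up with the smallest displacement from whichever
   pointer is actually used.  */
static int
slot_order (int slot_num1, int slot_num2, int frame_pointer_needed)
{
  int diff = slot_num1 - slot_num2;
  return frame_pointer_needed ? diff : -diff;
}
```

With a frame pointer, slot 1 sorts before slot 2; without one (stack and frame growing the same way), the order flips, mirroring the `diff : -diff` in the quoted pseudo_reg_slot_compare.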

+  total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1),
+GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode));
+  total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2),
+GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode));
+  if ((diff = total_size2 - total_size1) != 0)
+return diff;

I think this could do with a bit more commentary.  When is biggest_mode
ever smaller than PSEUDO_REGNO_BYTES?  Is that for pseudos that are only
ever referenced as lowpart subregs?  If so, why does PSEUDO_REGNO_BYTES
matter for those registers here but not when calculating biggest_mode?
The MAX code makes no sense to me either (probably it was wrongly adapted 
from somewhere).  So I removed the MAX.

+/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
+   pseudos which did not get a spill hard register at the beginning of
+   array PSEUDO_REGNOS. Return the number of such pseudos.  */

It'd be worth saying that PSEUDO_REGNOS is sorted in order of highest
frequency first.

Fixed.

+  bitmap set_jump_crosses = 

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-04 Thread Richard Sandiford
Hi Vlad,

This message is for lra-assigns.c.  Sorry for the piecemeal reviews,
never sure when I'll get time...

 +/* This file contains a pass mostly assigning hard registers to reload
 +   pseudos.  There is no any RTL code transformation on this pass.

Maybe:

/* This file's main objective is to assign hard registers to reload pseudos.
   It also tries to allocate hard registers to other pseudos, but at a lower
   priority than the reload pseudos.  The pass does not transform the RTL.

if that's accurate.

 +   Reload pseudos get what they need (usually) hard registers in
 +   anyway possibly by spilling non-reload pseudos and by assignment
 +   reload pseudos with smallest number of available hard registers
 +   first.
 +
 +   If reload pseudos can get hard registers only through spilling
 +   other pseudos, we choose what pseudos to spill taking into account
 +   how given reload pseudo benefits and also how other reload pseudos
 +   not assigned yet benefit too (see function spill_for).

Maybe:

   We must allocate a hard register to every reload pseudo.  We try to
   increase the chances of finding a viable allocation by assigning the
   pseudos in order of fewest available hard registers first.  If we
   still fail to find a hard register, we spill other (non-reload)
   pseudos in order to make room.

   assign_hard_regno_for allocates registers without spilling.
   spill_for does the same with spilling.  Both functions use
   a cost model to determine the most profitable choice of
   hard and spill registers.

 +   Non-reload pseudos can get hard registers too if it is possible and
 +   improves the code.  It might be possible because of spilling
 +   non-reload pseudos on given pass.

Maybe:

   Once we have finished allocating reload pseudos, we also try to
   assign registers to other (non-reload) pseudos.  This is useful
   if hard registers were freed up by the spilling just described.

 +   We try to assign hard registers processing pseudos by threads.  The
 +   thread contains reload and inheritance pseudos connected by copies
 +   (move insns).  It improves the chance to get the same hard register
 +   to pseudos in the thread and, as the result, to remove some move
 +   insns.

Maybe:

   We try to assign hard registers by collecting pseudos into threads.
   These threads contain reload and inheritance pseudos that are connected
   by copies (move insns).  Doing this improves the chances of pseudos
   in the thread getting the same hard register and, as a result,
   of allowing some move insns to be deleted.

 +   When we assign hard register to a pseudo, we decrease the cost of
 +   the hard registers for corresponding pseudos connected by copies.

Maybe:

   When we assign a hard register to a pseudo, we decrease the cost of
   using the same hard register for pseudos that are connected by copies.

 +   If two hard registers are equally good for assigning the pseudo
 +   with hard register cost point of view, we prefer a hard register in
 +   smaller register bank.  By default, there is only one register
 +   bank.  A target can define register banks by hook
 +   register_bank. For example, x86-64 has a few register banks: hard
 +   regs with and without REX prefixes are in different banks.  It
 +   permits to generate smaller code as insns without REX prefix are
 +   shorter.

Maybe:

   If two hard registers have the same frequency-derived cost,
   we prefer hard registers in lower register banks.  The mapping
   of registers to banks is controlled by the register_bank target hook.
   For example, x86-64 has a few register banks: hard registers with and
   without REX prefixes are in different banks.  This permits us
   to generate smaller code as insns without REX prefixes are shorter.

although this might change if the name of the hook changes.

 +/* Info about pseudo used during the assignment pass.  Thread is a set
 +   of connected reload and inheritance pseudos with the same set of
 +   available hard reg set.  Thread is a pseudo itself for other
 +   cases.  */
 +struct regno_assign_info

Maybe:

/* Information about the thread to which a pseudo belongs.  Threads are
   a set of connected reload and inheritance pseudos with the same set of
   available hard registers.  Lone registers belong to their own threads.  */

Although the condition seems to be:

 +  (ira_class_hard_regs_num[regno_allocno_class_array[regno1]]
 + == ira_class_hard_regs_num[regno_allocno_class_array[regno2]]))

i.e. the same _number_ of available hard regs, but not necessarily the
same set.

thread might be more mnemonic than regno_assign in this file,
but that's bikeshed stuff.

 +  for (i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
 +{
 +  regno_assign_info[i].first = i;
 +  regno_assign_info[i].next = -1;
 +  regno_assign_info[i].freq = lra_reg_info[i].freq;
 +}

Minor speedup, but it's probably worth caching max_reg_num () rather than
calling it in each loop 

Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

2012-10-03 Thread Richard Sandiford
Hi Vlad,

Some comments on lra-spills.c and lra-coalesce.c.

 +   The pass creates necessary stack slots and assign spilled pseudos
 +   to the stack slots in following way:

s/assign/assigns/

 +   (or insn memory constraints) might be not satisfied any more.

s/might be not/might not be/

 +   For some targets, the pass can spill some pseudos into hard
 +   registers of different class (usually into vector registers)
 +   instead of spilling them into memory if it is possible and
 +   profitable.   Spilling GENERAL_REGS pseudo into SSE registers for
 +   modern Intel x86/x86-64 processors is an example of such
 +   optimization.  And this is actually recommended by Intel
 +   optimization guide.

Maybe mention "core i7" specifically?  "Modern" is a bit dangerous
in code that'll live a long time.

 +/* The structure describes a stack slot which can be used for several
 +   spilled pseudos.  */
 +struct slot
 +{

Looks like this describes a register or stack slot given the hard_regno case.

 +/* Array containing info about the stack slots.   The array element is
 +   indexed by the stack slot number in the range [0..slost_num).  */

Typo: slots_num

 +  /* Each pseudo has an inherent size which comes from its own mode,
 + and a total size which provides room for paradoxical subregs
 + which refer to the pseudo reg in wider modes.
 + 
 + We can use a slot already allocated if it provides both enough
 + inherent space and enough total space.  Otherwise, we allocate a
 + new slot, making sure that it has no less inherent space, and no
 + less total space, then the previous slot.   */

The second part of the comment seems a bit misplaced, since the following
code doesn't reuse stack slots.  This is done elsewhere instead.
Maybe the first part would be better above the inherent_size assignment.

 +  /* If we have any adjustment to make, or if the stack slot is the
 + wrong mode, make a new stack slot.   */
 +  x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust);

We don't make a new slot here.

 +/* Sort pseudos according their slot numbers putting ones with smaller
 +   numbers first, or last when the frame pointer is not needed.   So
 +   pseudos with the first slot will be finally addressed with smaller
 +   address displacement.  */
 +static int
 +pseudo_reg_slot_compare (const void *v1p, const void *v2p)
 +{
 +  const int regno1 = *(const int *) v1p;
 +  const int regno2 = *(const int *) v2p;
 +  int diff, slot_num1, slot_num2;
 +  int total_size1, total_size2;
 +
 +  slot_num1 = pseudo_slots[regno1].slot_num;
 +  slot_num2 = pseudo_slots[regno2].slot_num;
 +  if ((diff = slot_num1 - slot_num2) != 0)
 +return (frame_pointer_needed
 + || !FRAME_GROWS_DOWNWARD == STACK_GROWS_DOWNWARD ? diff : -diff);

The comment doesn't quite describe the condition.  Maybe:

/* Sort pseudos according to their slots, putting the slots in the order
   that they should be allocated.  Slots with lower numbers have the highest
   priority and should get the smallest displacement from the stack or
   frame pointer (whichever is being used).

   The first allocated slot is always closest to the frame pointer,
   so prefer lower slot numbers when frame_pointer_needed.  If the stack
   and frame grow in the same direction, then the first allocated slot is
   always closest to the initial stack pointer and furthest away from the
   final stack pointer, so allocate higher numbers first when using the
   stack pointer in that case.  The reverse is true if the stack and
   frame grow in opposite directions.  */

 +  total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1),
 +  GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode));
 +  total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2),
 +  GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode));
 +  if ((diff = total_size2 - total_size1) != 0)
 +return diff;

I think this could do with a bit more commentary.  When is biggest_mode
ever smaller than PSEUDO_REGNO_BYTES?  Is that for pseudos that are only
ever referenced as lowpart subregs?  If so, why does PSEUDO_REGNO_BYTES
matter for those registers here but not when calculating biggest_mode?

 +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
 +   pseudos which did not get a spill hard register at the beginning of
 +   array PSEUDO_REGNOS.   Return the number of such pseudos.  */

It'd be worth saying that PSEUDO_REGNOS is sorted in order of highest
frequency first.

 +  bitmap set_jump_crosses = regstat_get_setjmp_crosses ();

I notice you use "set_jump" here and "setjump" in parts of 7a.patch.
Probably better to use "setjmp" across the board.

 +  /* Hard registers which can not be used for any purpose at given
 + program point because they are unallocatable or already allocated
 + for other pseudos.   */ 
 +  HARD_REG_SET *reserved_hard_regs;
 +
 +  if (! lra_reg_spill_p)
 +return n;
 +  /* Set up reserved hard