Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
On 12-10-17 7:24 AM, Richard Sandiford wrote:

Thanks for all the updates. Vladimir Makarov vmaka...@redhat.com writes:

+ /* index * scale + disp => new base + index * scale */
+ enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
+
+ lra_assert (INDEX_REG_CLASS != NO_REGS);
+ new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
+ lra_assert (GET_CODE (*addr_loc) == PLUS);
+ lra_emit_move (new_reg, *ad.disp_loc);
+ if (CONSTANT_P (XEXP (*addr_loc, 1)))
+   XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
+ XEXP (*addr_loc, 0) = new_reg;

The canonical form is (plus (mult ...) (reg)) rather than (plus (reg) (mult ...)), but it looks like we create the latter. I realise you try both forms here:

It might happen because of equiv substitution in LRA.

+ /* Some targets like ARM, accept address operands in
+    specific order -- try exchange them if necessary.  */
+ if (! valid_address_p (mode, *addr_loc, as))
+   {
+     exchange_plus_ops (*addr_loc);
+     if (! valid_address_p (mode, *addr_loc, as))
+       exchange_plus_ops (*addr_loc);
+   }

but I think we should try the canonical form first. And I'd prefer it if we didn't try the other form at all, especially in 4.8. It isn't really the backend's job to reject non-canonical rtl. This might well be another case where some targets need a (hopefully small) tweak in order to play by the rules.

Also, I suppose this section of code feeds back to my question on Wednesday about the distinction that LRA seems to make between the compile-time constant in:

  (plus (reg X1) (const_int Y1))

and the link-time constant in:

  (plus (reg X2) (symbol_ref Y2))

It looked like extract_address_regs classified X1 as a base register and X2 as an index register. The difference between the two constants has no run-time significance though, and I think we should handle both X1 and X2 as base registers (as I think reload does). I think the path above would then be specific to scaled indices.
In the original address the complex index must come first and the displacement second. In the modified address, the index would stay first and the new base register would be second. More below.

As I wrote above, the problem is also that equiv substitution can create non-canonical forms.

Right. Just in case there's a misunderstanding: I'm not complaining about these routines internally using forms that are noncanonical (which could happen because of equiv substitution, like you say). I just think that what we eventually try to validate should be canonical. In a way it's similar to how the simplify-rtx.c routines work. If there are targets that only accept noncanonical rtl (which is after all just a specific type of invalid rtl), they need to be fixed.

Agree. In order not to forget to fix targets I am removing the operand exchange.

+ /* base + scale * index + disp => new base + scale * index */
+ new_reg = base_plus_disp_to_reg (mode, as, ad);
+ *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
+ if (! valid_address_p (mode, *addr_loc, as))
+   {
+     /* Some targets like ARM, accept address operands in
+        specific order -- try exchange them if necessary.  */
+     exchange_plus_ops (*addr_loc);
+     if (! valid_address_p (mode, *addr_loc, as))
+       exchange_plus_ops (*addr_loc);
+   }

Same comment as above about canonical rtl. Here we can have two registers -- in which case the base should come first -- or a more complex index -- in which case the index should come first. We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...), with the operands in either order, and let it take care of the details. Using simplify_gen_binary would help with the earlier index+disp case too.

Equiv substitution can create non-canonical forms. There are two approaches:

o have code for dealing with non-canonical forms (equiv substitution, target stupidity)
o always support canonical forms and require them from targets.

I decided to use the first variant but I am reconsidering it.
I'll try to fix this before inclusion. But I am not sure I have time for this. All these changes make LRA unstable. In fact, I've just found that the changes I already made so far resulted in 2 SPEC2000 tests being broken, although the GCC testsuite and bootstrap look good.

OK. I'm happy to try fixing the noncanonical thing.

+  /* If this is post-increment, first copy the location to the reload reg.  */
+  if (post && real_in != result)
+    emit_insn (gen_move_insn (result, real_in));

Nit, but real_in != result can never be true AIUI, and I was confused how the code could be correct in that case. Maybe just remove it, or make it an assert?

No, it might be true:

  real_in = in == value ? incloc : in;
  ...
  if (cond)
    result = incloc;
  else
    result = ...
  if (post && real_in != result)

So it is true
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
On 12-10-15 8:06 AM, Richard Sandiford wrote:

Vladimir Makarov vmaka...@redhat.com writes:

if that's accurate. I dropped the term "reload pseudo" because of the general comment in my earlier reply about the use of "reload pseudo" when the code seems to include inheritance and split pseudos too.

There is no inheritance and splitting yet. It is done after the constraint pass. So at this stage >= new_regno_start means reload pseudo.

Ah, OK.

That's a change in the meaning of NEW_CLASS, but seems easier for callers to handle. I think all it requires is changing:

+ common_class = ira_reg_class_subset[rclass][cl];
+ if (new_class != NULL)
+   *new_class = common_class;

to:

  common_class = ira_reg_class_subset[rclass][cl];
  if (new_class != NULL && rclass != common_class)
    *new_class = common_class;

This change results in infinite LRA looping on the first libgcc file compilation. Unfortunately I have no time to investigate it. I'd like to say that most of this code is very sensitive to changes. I see it a lot. You change something looking obvious and a target is broken. I am going to investigate it when I have more time.

Thanks.

+    default:
+      {
+        const char *fmt = GET_RTX_FORMAT (code);
+        int i;
+
+        if (GET_RTX_LENGTH (code) != 1
+            || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
+          {
+            for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+              if (fmt[i] == 'e')
+                extract_loc_address_regs (false, mode, as, &XEXP (x, i),
+                                          context_p, code, SCRATCH,
+                                          modify_p, ad);
+            break;
+          }
+        /* fall through for case UNARY_OP (UNSPEC ...) */
+      }
+
+    case UNSPEC:
+      if (ad->disp_loc == NULL)
+        ad->disp_loc = loc;
+      else if (ad->base_reg_loc == NULL)
+        {
+          ad->base_reg_loc = loc;
+          ad->base_outer_code = outer_code;
+          ad->index_code = index_code;
+          ad->base_modify_p = modify_p;
+        }
+      else
+        {
+          lra_assert (ad->index_reg_loc == NULL);
+          ad->index_reg_loc = loc;
+        }
+      break;
+    }

Which targets use a bare UNSPEC as a displacement?
I thought a displacement had to be a link-time constant, in which case it should satisfy CONSTANT_P. For UNSPECs, that means wrapping it in a CONST.

I saw it somewhere. I guess IA64.

I'm just a bit worried that the UNSPEC handling is sensitive to the order that subrtxes are processed (unlike PLUS, which goes to some trouble to work out what's what). It could be especially confusing because the default case processes operands in reverse order while PLUS processes them in forward order. Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough? Probably deserves a comment.

I don't remember. To figure out, I should switch it off and try all targets supported by LRA.

AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just recording where reloads of a particular class need to go (obviously in the case of disp_loc, which isn't reloaded at all). The fields have semantic value too. I.e. we use them to work out the value of at least part of the address. In that case it seems dangerous to look through general rtxes in the way that the default case above does. Maybe just making sure that DISP_LOC is involved in a sum with the base would be enough, but another idea was:

I know of three ways of mutating (for want of a better word) an address:

  1. (and X (const_int Y)), to align
  2. a subreg
  3. a unary operator (such as truncation or extension)

So maybe we could:

  a. remove outer mutations (using a helper function)
  b. handle LO_SUM, PRE_*, POST_*: as now
  c. otherwise treat the address as the sum of one, two or three pieces.
     c1. Peel mutations off all pieces.
     c2. Classify the pieces into base, index and displacement. This would be similar to the jousting code above, but hopefully easier because all three rtxes are to hand. E.g. we could do the base vs. index thing in a similar way to commutative_operand_precedence.
     c3. Record which pieces were mutated (e.g. using something like the index_loc vs.
index_reg_loc distinction in the current code)

That should be general enough for current targets, but if it isn't, we could generalise it further when we know what generalisation is needed. That's still going to be a fair amount of code, but hopefully not more, and we might have more confidence at each stage what each value is. And it avoids the risk of treating mutated addresses as unmutated ones. Just an idea though. Probably not for 4.8, although I might try it if I find time.

I am not sure that you listed all the cases. It
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
Thanks for all the updates. Vladimir Makarov vmaka...@redhat.com writes:

+ /* index * scale + disp => new base + index * scale */
+ enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
+
+ lra_assert (INDEX_REG_CLASS != NO_REGS);
+ new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
+ lra_assert (GET_CODE (*addr_loc) == PLUS);
+ lra_emit_move (new_reg, *ad.disp_loc);
+ if (CONSTANT_P (XEXP (*addr_loc, 1)))
+   XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
+ XEXP (*addr_loc, 0) = new_reg;

The canonical form is (plus (mult ...) (reg)) rather than (plus (reg) (mult ...)), but it looks like we create the latter. I realise you try both forms here:

It might happen because of equiv substitution in LRA.

+ /* Some targets like ARM, accept address operands in
+    specific order -- try exchange them if necessary.  */
+ if (! valid_address_p (mode, *addr_loc, as))
+   {
+     exchange_plus_ops (*addr_loc);
+     if (! valid_address_p (mode, *addr_loc, as))
+       exchange_plus_ops (*addr_loc);
+   }

but I think we should try the canonical form first. And I'd prefer it if we didn't try the other form at all, especially in 4.8. It isn't really the backend's job to reject non-canonical rtl. This might well be another case where some targets need a (hopefully small) tweak in order to play by the rules.

Also, I suppose this section of code feeds back to my question on Wednesday about the distinction that LRA seems to make between the compile-time constant in:

  (plus (reg X1) (const_int Y1))

and the link-time constant in:

  (plus (reg X2) (symbol_ref Y2))

It looked like extract_address_regs classified X1 as a base register and X2 as an index register. The difference between the two constants has no run-time significance though, and I think we should handle both X1 and X2 as base registers (as I think reload does). I think the path above would then be specific to scaled indices.

In the original address the complex index must come first and the displacement second.
In the modified address, the index would stay first and the new base register would be second. More below.

As I wrote above, the problem is also that equiv substitution can create non-canonical forms.

Right. Just in case there's a misunderstanding: I'm not complaining about these routines internally using forms that are noncanonical (which could happen because of equiv substitution, like you say). I just think that what we eventually try to validate should be canonical. In a way it's similar to how the simplify-rtx.c routines work. If there are targets that only accept noncanonical rtl (which is after all just a specific type of invalid rtl), they need to be fixed.

+ /* base + scale * index + disp => new base + scale * index */
+ new_reg = base_plus_disp_to_reg (mode, as, ad);
+ *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
+ if (! valid_address_p (mode, *addr_loc, as))
+   {
+     /* Some targets like ARM, accept address operands in
+        specific order -- try exchange them if necessary.  */
+     exchange_plus_ops (*addr_loc);
+     if (! valid_address_p (mode, *addr_loc, as))
+       exchange_plus_ops (*addr_loc);
+   }

Same comment as above about canonical rtl. Here we can have two registers -- in which case the base should come first -- or a more complex index -- in which case the index should come first. We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...), with the operands in either order, and let it take care of the details. Using simplify_gen_binary would help with the earlier index+disp case too.

Equiv substitution can create non-canonical forms. There are two approaches:

o have code for dealing with non-canonical forms (equiv substitution, target stupidity)
o always support canonical forms and require them from targets.

I decided to use the first variant but I am reconsidering it. I'll try to fix this before inclusion. But I am not sure I have time for this. All these changes make LRA unstable.
In fact, I've just found that the changes I already made so far resulted in 2 SPEC2000 tests being broken, although the GCC testsuite and bootstrap look good.

OK. I'm happy to try fixing the noncanonical thing.

+  /* If this is post-increment, first copy the location to the reload reg.  */
+  if (post && real_in != result)
+    emit_insn (gen_move_insn (result, real_in));

Nit, but real_in != result can never be true AIUI, and I was confused how the code could be correct in that case. Maybe just remove it, or make it an assert?

No, it might be true:

  real_in = in == value ? incloc : in;
  ...
  if (cond)
    result = incloc;
  else
    result = ...
  if (post && real_in != result)

So it is true if in == value && cond.

Sorry, what I meant was that cond is ! post && REG_P (incloc):

  if (! post && REG_P (incloc))
    result = incloc;
  else
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
On 12-10-15 12:49 PM, Richard Sandiford wrote:

Hi Vlad, Some comments about the rest of LRA. Nothing major here...

Vladimir Makarov vmaka...@redhat.com writes:

+/* Info about register in an insn.  */
+struct lra_insn_reg
+{
+  /* The biggest mode through which the insn refers to the register
+     (remember the register can be accessed through a subreg in the
+     insn).  */
+  ENUM_BITFIELD(machine_mode) biggest_mode : 16;

AFAICT, this is actually always the mode of a specific reference, and if there are references to the same register in different modes, those references get their own lra_insn_regs. "mode" might be better than "biggest_mode" if so.

It seems "mode" is also not good. I've just modified the comment to reflect the fact that it is just a reference.

+/* Static part (common info for insns with the same ICODE) of LRA
+   internal insn info.  It exists in at most one exemplar for each
+   non-negative ICODE.  Warning: if the structure definition is
+   changed, the initializer for debug_insn_static_data in lra.c should
+   be changed too.  */

Probably worth saying (before the warning) that there is also one structure for each asm.

Good point. I added a comment.

+/* LRA internal info about an insn (LRA internal insn
+   representation).  */
+struct lra_insn_recog_data
+{
+  int icode;  /* The insn code.  */
+  rtx insn;   /* The insn itself.  */
+  /* Common data for insns with the same ICODE.  */
+  struct lra_static_insn_data *insn_static_data;

Maybe worth mentioning asms here too.

Fixed.

+  /* Two arrays of size correspondingly equal to the operand and the
+     duplication numbers: */
+  rtx **operand_loc;  /* The operand locations, NULL if no operands.  */
+  rtx **dup_loc;      /* The dup locations, NULL if no dups.  */
+  /* Number of hard registers implicitly used in given call insn.  The
+     value can be NULL or points to array of the hard register numbers
+     ending with a negative value.  */
+  int *arg_hard_regs;
+#ifdef HAVE_ATTR_enabled
+  /* Alternative enabled for the insn.  NULL for debug insns.
+     */
+  bool *alternative_enabled_p;
+#endif
+  /* The alternative should be used for the insn, -1 if invalid, or we
+     should try to use any alternative, or the insn is a debug
+     insn.  */
+  int used_insn_alternative;
+  struct lra_insn_reg *regs;  /* Always NULL for a debug insn.  */

Comments consistently above the field.

Fixed.

+extern void lra_expand_reg_info (void);

This doesn't exist any more.

Fixed.

+extern int lra_constraint_new_insn_uid_start;

Just saying in case: this seems to be write-only, with lra-constraints.c instead using a static variable to track the uid start. I realise you might want to keep it anyway for consistency with lra_constraint_new_regno_start, or for debugging.

+extern rtx lra_secondary_memory[NUM_MACHINE_MODES];

This doesn't exist any more.

Removed. Thanks.

+/* lra-saves.c: */
+
+extern bool lra_save_restore (void);

Same for this file & function.

Removed.

+/* The function returns TRUE if at least one hard register from ones
+   starting with HARD_REGNO and containing value of MODE are in set
+   HARD_REGSET.  */
+static inline bool
+lra_hard_reg_set_intersection_p (int hard_regno, enum machine_mode mode,
+                                 HARD_REG_SET hard_regset)
+{
+  int i;
+
+  lra_assert (hard_regno >= 0);
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+    if (TEST_HARD_REG_BIT (hard_regset, hard_regno + i))
+      return true;
+  return false;
+}

This is the same as overlaps_hard_reg_set_p.

I removed it and started to use the function overlaps_hard_reg_set_p.

+/* Return hard regno and offset of (sub-)register X through arguments
+   HARD_REGNO and OFFSET.  If it is not (sub-)register or the hard
+   register is unknown, then return -1 and 0 correspondingly.  */

The function seems to return -1 for both.

Fixed. It does not matter for the rest of the code, as offset is used only when hard_regno >= 0.

+/* Add hard registers starting with HARD_REGNO and holding value of
+   MODE to the set S.
+   */
+static inline void
+lra_add_hard_reg_set (int hard_regno, enum machine_mode mode, HARD_REG_SET *s)
+{
+  int i;
+
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+    SET_HARD_REG_BIT (*s, hard_regno + i);
+}

This is add_to_hard_reg_set.

Removed.

+   Here is block diagram of LRA passes:

[ASCII block diagram of the LRA passes ("Remove scratches", "New (and old) pseudos assignment", "Memory-memory move coalesce", "Undo inheritance for spilled pseudos and splits", "Update virtual register displacements", ...); the diagram is mangled in this archive copy.]
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
On Wed, Oct 17, 2012 at 9:53 PM, Vladimir Makarov wrote:

On 12-10-15 12:49 PM, Richard Sandiford wrote:

Getting rid of reload always seemed like a pipe dream, and if the only known drawback of this replacement is that it takes a while on extreme testcases, that's an amazing achievement. (Not to say compile time isn't important, just that there were so many other hurdles to overcome.)

Just to be clear, LRA now does no worse from a compile time POV than, say, tree-ssa-live. Most of the scalability problems have been addressed.

It is my second attempt. The first one was the YARA project. I got a lot of experience from this project and knowledge of how not to do this. LRA will still be a long-lasting project. I don't think I found all the weirdness of reload just by trying 8 targets (fixing one bug on one target frequently resulted in new bugs on other targets, so it frequently required cardinal changes to the original code). Only after trying the 8 targets did I get the feeling that this approach could work well.

Hats off to you, Vlad, for your years of effort on improving GCC's RA!

Ciao! Steven
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
On 12-10-12 10:29 AM, Richard Sandiford wrote:

Hi Vlad, Comments for the rest of ira-constraints.c.

Vladimir Makarov vmaka...@redhat.com writes:

+  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
+  change_p = equiv_address_substitution (ad, addr_loc, mode, as, code);
+  if (ad.base_reg_loc != NULL)
+    {
+      if (process_addr_reg
+          (ad.base_reg_loc, before,
+           (ad.base_modify_p && REG_P (*ad.base_reg_loc)
+            && find_regno_note (curr_insn, REG_DEAD,
+                                REGNO (*ad.base_reg_loc)) == NULL
+            ? after : NULL),
+           base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
+        change_p = true;
+      if (ad.base_reg_loc2 != NULL)
+        *ad.base_reg_loc2 = *ad.base_reg_loc;
+      saved_base_reg = *ad.base_reg_loc;
+      lra_eliminate_reg_if_possible (ad.base_reg_loc);
+      if (ad.base_reg_loc2 != NULL)
+        {
+          saved_base_reg2 = *ad.base_reg_loc2;
+          lra_eliminate_reg_if_possible (ad.base_reg_loc2);
+        }

We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it might be clearer without saved_base_reg2. More below...

+      /* The following addressing is checked by constraints and
+         usually target specific legitimate address hooks do not
+         consider them valid.  */
+      || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
+      || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC

typo: two PRE_DECs, although:

+      || GET_CODE (*addr_loc) == PRE_MODIFY
+      || GET_CODE (*addr_loc) == POST_MODIFY

the whole lot could just be replaced by ad.base_modify_p, or perhaps even removed entirely given:

+      /* In this case we can not do anything because if it is wrong
+         that is because of wrong displacement.  Remember that any
+         address was legitimate in non-strict sense before LRA.  */
+      || ad.disp_loc == NULL)

It doesn't seem worth validating the address at all for ad.disp_loc == NULL. E.g. something like:

  if (ad.base_reg_loc != NULL
      && (process_addr_reg
          (ad.base_reg_loc, before,
           (ad.base_modify_p && REG_P (*ad.base_reg_loc)
            && find_regno_note (curr_insn, REG_DEAD,
                                REGNO (*ad.base_reg_loc)) == NULL
            ?
            after : NULL),
           base_reg_class (mode, as, ad.base_outer_code, ad.index_code))))
    {
      change_p = true;
      if (ad.base_reg_loc2 != NULL)
        *ad.base_reg_loc2 = *ad.base_reg_loc;
    }
  if (ad.index_reg_loc != NULL
      && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
    change_p = true;

  /* The address was valid before LRA.  We only change its form if the
     address has a displacement, so if it has no displacement it must
     still be valid.  */
  if (ad.disp_loc == NULL)
    return change_p;

  /* See whether the address is still valid.  Some ports do not check
     displacements for eliminable registers, so we replace them
     temporarily with the elimination target.  */
  saved_base_reg = saved_index_reg = NULL_RTX;
  ...
  if (ok_p)
    return change_p;

Yes, it makes sense. I changed the code as you proposed.

+#ifdef HAVE_lo_sum
+      {
+        rtx insn;
+        rtx last = get_last_insn ();
+
+        /* disp => lo_sum (new_base, disp)  */
+        insn = emit_insn (gen_rtx_SET
+                          (VOIDmode, new_reg,
+                           gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
+        code = recog_memoized (insn);
+        if (code >= 0)
+          {
+            rtx save = *ad.disp_loc;
+
+            *ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
+            if (! valid_address_p (mode, *ad.disp_loc, as))
+              {
+                *ad.disp_loc = save;
+                code = -1;
+              }
+          }
+        if (code < 0)
+          delete_insns_since (last);
+      }
+#endif

Nice :-)

Purely for the record, I wondered whether the high part should be generated with emit_move_insn(_1) instead, with the rhs of the move being the HIGH rtx. That would allow targets to cope with cases where the high part isn't represented directly as a HIGH. E.g. on MIPS and Alpha, small-data accesses use the global register as the high part instead. However, both MIPS and Alpha accept small-data addresses as legitimate constants and addresses before and during reload and only introduce the split form after reload. And I think that's how any other cases that aren't simple HIGHs should be handled too. E.g.
MIPS also represents GOT page loads as HIGHs until after reload, and only then lowers the HIGH to a GOT load. Allowing the backend to generate anything other than a plain HIGH set here would be a double-edged sword. So after all that I agree that the gen_rtx_SET above is better than calling the move expanders. Thanks for sharing your
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
Vladimir Makarov vmaka...@redhat.com writes:

if that's accurate. I dropped the term "reload pseudo" because of the general comment in my earlier reply about the use of "reload pseudo" when the code seems to include inheritance and split pseudos too.

There is no inheritance and splitting yet. It is done after the constraint pass. So at this stage >= new_regno_start means reload pseudo.

Ah, OK.

That's a change in the meaning of NEW_CLASS, but seems easier for callers to handle. I think all it requires is changing:

+ common_class = ira_reg_class_subset[rclass][cl];
+ if (new_class != NULL)
+   *new_class = common_class;

to:

  common_class = ira_reg_class_subset[rclass][cl];
  if (new_class != NULL && rclass != common_class)
    *new_class = common_class;

This change results in infinite LRA looping on the first libgcc file compilation. Unfortunately I have no time to investigate it. I'd like to say that most of this code is very sensitive to changes. I see it a lot. You change something looking obvious and a target is broken. I am going to investigate it when I have more time.

Thanks.

+    default:
+      {
+        const char *fmt = GET_RTX_FORMAT (code);
+        int i;
+
+        if (GET_RTX_LENGTH (code) != 1
+            || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
+          {
+            for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+              if (fmt[i] == 'e')
+                extract_loc_address_regs (false, mode, as, &XEXP (x, i),
+                                          context_p, code, SCRATCH,
+                                          modify_p, ad);
+            break;
+          }
+        /* fall through for case UNARY_OP (UNSPEC ...) */
+      }
+
+    case UNSPEC:
+      if (ad->disp_loc == NULL)
+        ad->disp_loc = loc;
+      else if (ad->base_reg_loc == NULL)
+        {
+          ad->base_reg_loc = loc;
+          ad->base_outer_code = outer_code;
+          ad->index_code = index_code;
+          ad->base_modify_p = modify_p;
+        }
+      else
+        {
+          lra_assert (ad->index_reg_loc == NULL);
+          ad->index_reg_loc = loc;
+        }
+      break;
+    }

Which targets use a bare UNSPEC as a displacement? I thought a displacement had to be a link-time constant, in which case it should satisfy CONSTANT_P.
For UNSPECs, that means wrapping it in a CONST.

I saw it somewhere. I guess IA64.

I'm just a bit worried that the UNSPEC handling is sensitive to the order that subrtxes are processed (unlike PLUS, which goes to some trouble to work out what's what). It could be especially confusing because the default case processes operands in reverse order while PLUS processes them in forward order. Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough? Probably deserves a comment.

I don't remember. To figure out, I should switch it off and try all targets supported by LRA.

AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just recording where reloads of a particular class need to go (obviously in the case of disp_loc, which isn't reloaded at all). The fields have semantic value too. I.e. we use them to work out the value of at least part of the address. In that case it seems dangerous to look through general rtxes in the way that the default case above does. Maybe just making sure that DISP_LOC is involved in a sum with the base would be enough, but another idea was:

I know of three ways of mutating (for want of a better word) an address:

  1. (and X (const_int Y)), to align
  2. a subreg
  3. a unary operator (such as truncation or extension)

So maybe we could:

  a. remove outer mutations (using a helper function)
  b. handle LO_SUM, PRE_*, POST_*: as now
  c. otherwise treat the address as the sum of one, two or three pieces.
     c1. Peel mutations off all pieces.
     c2. Classify the pieces into base, index and displacement. This would be similar to the jousting code above, but hopefully easier because all three rtxes are to hand. E.g. we could do the base vs. index thing in a similar way to commutative_operand_precedence.
     c3. Record which pieces were mutated (e.g. using something like the index_loc vs.
index_reg_loc distinction in the current code)

That should be general enough for current targets, but if it isn't, we could generalise it further when we know what generalisation is needed. That's still going to be a fair amount of code, but hopefully not more, and we might have more confidence at each stage what each value is. And it avoids the risk of treating mutated addresses as unmutated ones. Just an idea though. Probably not for 4.8, although I might try it if I find time.

I am not sure that you listed all the cases. It would be great if you listed all the cases. In this case we could
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
Hi Vlad, Some comments about the rest of LRA. Nothing major here...

Vladimir Makarov vmaka...@redhat.com writes:

+/* Info about register in an insn.  */
+struct lra_insn_reg
+{
+  /* The biggest mode through which the insn refers to the register
+     (remember the register can be accessed through a subreg in the
+     insn).  */
+  ENUM_BITFIELD(machine_mode) biggest_mode : 16;

AFAICT, this is actually always the mode of a specific reference, and if there are references to the same register in different modes, those references get their own lra_insn_regs. "mode" might be better than "biggest_mode" if so.

+/* Static part (common info for insns with the same ICODE) of LRA
+   internal insn info.  It exists in at most one exemplar for each
+   non-negative ICODE.  Warning: if the structure definition is
+   changed, the initializer for debug_insn_static_data in lra.c should
+   be changed too.  */

Probably worth saying (before the warning) that there is also one structure for each asm.

+/* LRA internal info about an insn (LRA internal insn
+   representation).  */
+struct lra_insn_recog_data
+{
+  int icode;  /* The insn code.  */
+  rtx insn;   /* The insn itself.  */
+  /* Common data for insns with the same ICODE.  */
+  struct lra_static_insn_data *insn_static_data;

Maybe worth mentioning asms here too.

+  /* Two arrays of size correspondingly equal to the operand and the
+     duplication numbers: */
+  rtx **operand_loc;  /* The operand locations, NULL if no operands.  */
+  rtx **dup_loc;      /* The dup locations, NULL if no dups.  */
+  /* Number of hard registers implicitly used in given call insn.  The
+     value can be NULL or points to array of the hard register numbers
+     ending with a negative value.  */
+  int *arg_hard_regs;
+#ifdef HAVE_ATTR_enabled
+  /* Alternative enabled for the insn.  NULL for debug insns.  */
+  bool *alternative_enabled_p;
+#endif
+  /* The alternative should be used for the insn, -1 if invalid, or we
+     should try to use any alternative, or the insn is a debug
+     insn.
+     */
+  int used_insn_alternative;
+  struct lra_insn_reg *regs;  /* Always NULL for a debug insn.  */

Comments consistently above the field.

+extern void lra_expand_reg_info (void);

This doesn't exist any more.

+extern int lra_constraint_new_insn_uid_start;

Just saying in case: this seems to be write-only, with lra-constraints.c instead using a static variable to track the uid start. I realise you might want to keep it anyway for consistency with lra_constraint_new_regno_start, or for debugging.

+extern rtx lra_secondary_memory[NUM_MACHINE_MODES];

This doesn't exist any more.

+/* lra-saves.c: */
+
+extern bool lra_save_restore (void);

Same for this file & function.

+/* The function returns TRUE if at least one hard register from ones
+   starting with HARD_REGNO and containing value of MODE are in set
+   HARD_REGSET.  */
+static inline bool
+lra_hard_reg_set_intersection_p (int hard_regno, enum machine_mode mode,
+                                 HARD_REG_SET hard_regset)
+{
+  int i;
+
+  lra_assert (hard_regno >= 0);
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+    if (TEST_HARD_REG_BIT (hard_regset, hard_regno + i))
+      return true;
+  return false;
+}

This is the same as overlaps_hard_reg_set_p.

+/* Return hard regno and offset of (sub-)register X through arguments
+   HARD_REGNO and OFFSET.  If it is not (sub-)register or the hard
+   register is unknown, then return -1 and 0 correspondingly.  */

The function seems to return -1 for both.

+/* Add hard registers starting with HARD_REGNO and holding value of
+   MODE to the set S.  */
+static inline void
+lra_add_hard_reg_set (int hard_regno, enum machine_mode mode, HARD_REG_SET *s)
+{
+  int i;
+
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+    SET_HARD_REG_BIT (*s, hard_regno + i);
+}

This is add_to_hard_reg_set.
+ Here is a block diagram of LRA passes (the ASCII art did not survive plain text; the boxes it connects are: Remove scratches; Update virtual register displacements; New (and old) pseudos assignment; Memory-memory move coalesce; Undo inheritance for spilled pseudos and splits for pseudos that got the same hard regs).
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
I'm having to correct my own comments again, sorry. Richard Sandiford rdsandif...@googlemail.com writes: + /* If this is post-increment, first copy the location to the reload reg. */ + if (post && real_in != result) +emit_insn (gen_move_insn (result, real_in)); Nit, but real_in != result can never be true AIUI, and I was confused how the code could be correct in that case. Maybe just remove it, or make it an assert? Probably obvious, but I meant real_in == result can never be true. real_in != result could be removed or turned into an assert. +if (GET_CODE (op) == PLUS) + { +plus = op; +op = XEXP (op, 1); + } Sorry, I'm complaining about old reload code again, but: does this actually happen in LRA? In reload, a register operand could become a PLUS because of elimination, but I thought LRA did things differently. Besides, this is only needed for: +if (CONST_POOL_OK_P (mode, op) + && ((targetm.preferred_reload_class + (op, (enum reg_class) goal_alt[i]) == NO_REGS) +|| no_input_reloads_p) + && mode != VOIDmode) + { +rtx tem = force_const_mem (mode, op); + +change_p = true; +/* If we stripped a SUBREG or a PLUS above add it back. */ +if (plus != NULL_RTX) + tem = gen_rtx_PLUS (mode, XEXP (plus, 0), tem); and we shouldn't have (plus (constant ...) ...) after elimination (or at all outside of a CONST). I don't understand why the code is needed even in reload. Scratch the thing about needing it for reload. It's obviously the second operand we're reloading, not the first, and it could well be that an elimination displacement needs to be reloaded via the constant pool. The question for LRA still stands though. Richard
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
Hi Vlad, Comments for the rest of lra-constraints.c. Vladimir Makarov vmaka...@redhat.com writes: + saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX; + change_p = equiv_address_substitution (ad, addr_loc, mode, as, code); + if (ad.base_reg_loc != NULL) +{ + if (process_addr_reg + (ad.base_reg_loc, before, +(ad.base_modify_p && REG_P (*ad.base_reg_loc) + && find_regno_note (curr_insn, REG_DEAD, + REGNO (*ad.base_reg_loc)) == NULL + ? after : NULL), +base_reg_class (mode, as, ad.base_outer_code, ad.index_code))) + change_p = true; + if (ad.base_reg_loc2 != NULL) + *ad.base_reg_loc2 = *ad.base_reg_loc; + saved_base_reg = *ad.base_reg_loc; + lra_eliminate_reg_if_possible (ad.base_reg_loc); + if (ad.base_reg_loc2 != NULL) + { + saved_base_reg2 = *ad.base_reg_loc2; + lra_eliminate_reg_if_possible (ad.base_reg_loc2); + } We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it might be clearer without saved_base_reg2. More below... + /* The following addressing is checked by constraints and + usually target specific legitimate address hooks do not + consider them valid. */ + || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC + || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC typo: two PRE_DECs, although: + || GET_CODE (*addr_loc) == PRE_MODIFY + || GET_CODE (*addr_loc) == POST_MODIFY the whole lot could just be replaced by ad.base_modify_p, or perhaps even removed entirely given: + /* In this case we can not do anything because if it is wrong + that is because of wrong displacement. Remember that any + address was legitimate in non-strict sense before LRA. */ + || ad.disp_loc == NULL) It doesn't seem worth validating the address at all for ad.disp_loc == NULL. E.g. something like: if (ad.base_reg_loc != NULL && (process_addr_reg (ad.base_reg_loc, before, (ad.base_modify_p && REG_P (*ad.base_reg_loc) && find_regno_note (curr_insn, REG_DEAD, REGNO (*ad.base_reg_loc)) == NULL ?
after : NULL), base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))) { change_p = true; if (ad.base_reg_loc2 != NULL) *ad.base_reg_loc2 = *ad.base_reg_loc; } if (ad.index_reg_loc != NULL && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS)) change_p = true; /* The address was valid before LRA. We only change its form if the address has a displacement, so if it has no displacement it must still be valid. */ if (ad.disp_loc == NULL) return change_p; /* See whether the address is still valid. Some ports do not check displacements for eliminable registers, so we replace them temporarily with the elimination target. */ saved_base_reg = saved_index_reg = NULL_RTX; ... if (ok_p) return change_p; +#ifdef HAVE_lo_sum + { + rtx insn; + rtx last = get_last_insn (); + + /* disp = lo_sum (new_base, disp) */ + insn = emit_insn (gen_rtx_SET + (VOIDmode, new_reg, +gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc)))); + code = recog_memoized (insn); + if (code >= 0) + { + rtx save = *ad.disp_loc; + + *ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc); + if (! valid_address_p (mode, *ad.disp_loc, as)) + { + *ad.disp_loc = save; + code = -1; + } + } + if (code < 0) + delete_insns_since (last); + } +#endif Nice :-) Purely for the record, I wondered whether the high part should be generated with emit_move_insn(_1) instead, with the rhs of the move being the HIGH rtx. That would allow targets to cope with cases where the high part isn't represented directly as a HIGH. E.g. on MIPS and Alpha, small-data accesses use the global register as the high part instead. However, both MIPS and Alpha accept small-data addresses as legitimate constants and addresses before and during reload and only introduce the split form after reload. And I think that's how any other cases that aren't simple HIGHs should be handled too. E.g. MIPS also represents GOT page loads as HIGHs until after reload, and only then lowers the HIGH to a GOT load.
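For readers unfamiliar with the HIGH/LO_SUM pair being discussed: the reason the split works is purely arithmetic. On MIPS-style targets the low part is a *signed* 16-bit immediate, so the high part must be rounded up when bit 15 of the address is set. A standalone sketch of that arithmetic (not GCC code, just the invariant the generated rtl relies on):

```c
#include <assert.h>
#include <stdint.h>

/* MIPS-style split of a 32-bit constant address into a HIGH part
   (what lui would load) and a signed 16-bit LO_SUM displacement.
   Adding 0x8000 before masking compensates for the sign extension
   of the low part.  */
static uint32_t
high_part (uint32_t addr)
{
  return (addr + 0x8000) & 0xffff0000u;
}

static int32_t
lo_part (uint32_t addr)
{
  /* Sign-extend the low 16 bits.  */
  return (int32_t) (int16_t) (addr & 0xffff);
}
```

The invariant is that `high_part (a) + lo_part (a) == a` (mod 2^32) for every address, which is exactly what lets LRA replace `disp` with `(lo_sum new_base disp)` after loading `(high disp)` into `new_base`.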
Allowing the backend to generate anything other than a plain HIGH set here would be a double-edged sword. So after all that I agree that the gen_rtx_SET above is better than calling the move expanders. + /* index * scale + disp = new base + index * scale */ + enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH); + + lra_assert (INDEX_REG_CLASS !=
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
Vladimir Makarov vmaka...@redhat.com writes: +/* Info about pseudo used during the assignment pass. Thread is a set + of connected reload and inheritance pseudos with the same set of + available hard reg set. Thread is a pseudo itself for other + cases. */ +struct regno_assign_info Maybe: /* Information about the thread to which a pseudo belongs. Threads are a set of connected reload and inheritance pseudos with the same set of available hard registers. Lone registers belong to their own threads. */ Fixed. Although the condition seems to be: +(ira_class_hard_regs_num[regno_allocno_class_array[regno1]] + == ira_class_hard_regs_num[regno_allocno_class_array[regno2]])) i.e. the same _number_ of available hard regs, but not necessarily the same set. It should be the same in most cases. This condition is just a faster approximation of the same available hard reg set. The distinction does seem important though. It's possible that a target has two distinct register files of the same allocatable size. Would something like: (ira_class_subset_p[class1][class2] && ira_class_subset_p[class2][class1]) work instead? /* Update the preference for using HARD_REGNO for pseudos that are connected directly or indirectly with REGNO. Apply divisor DIV to any preference adjustments. The more indirectly a pseudo is connected, the smaller its effect should be. We therefore increase DIV on each hop. */ Fixed. By the way, it is your invention from IRA. Heh, I'd forgotten all about that. + /* We are trying to spill a reload pseudo. That is wrong we +should assign all reload pseudos, otherwise we cannot reuse +the selected alternatives. */ + hard_regno = find_hard_regno_for (regno, cost, -1); + if (hard_regno >= 0) + { Don't really understand this comment, sorry. I removed the comment. It is from an old solution trying to guarantee assignment to the reload pseudo. Also, why are we passing -1 to find_hard_regno_for, rather than hard_regno?
The loop body up till this point has been specifically freeing up registers to make hard_regno allocatable. I realise that, by spilling everything that overlaps this range, we might have freed up other registers too, and so made others besides hard_regno allocatable. But wouldn't we want to test those other hard registers in their iteration of the loop instead of this one? The spills in those iterations ought to be more directed (i.e. there should be less incidental spilling). As things stand, doing an rclass_size * rclass_size scan seems unnecessarily expensive, although probably off the radar. We cannot just pass hard_regno for a multi-word pseudo when hard_regno-1 is already free. But this call is in a loop that iterates over all registers in the class: for (i = 0; i < rclass_size; i++) { hard_regno = ira_class_hard_regs[rclass][i]; and we reach the find_hard_regno_for call unless there is some conflicting register that we cannot spill. So if hard_regno - 1 belongs to the allocation class and is a viable choice, its iteration of the loop would spill specifically for hard_regno - 1 and get the most accurate cost for that register. I couldn't see why any other iteration of the loop would want to consider hard_regno - 1. You are right about the possibility of speeding up the code, although in the profiles I looked at (including the last huge tests) spill_for and the find_hard_regno_for calls made from it take little time. That is probably because you don't need to spill frequently. Freeing one long-live-range pseudo permits finding a hard regno without spilling for many short-live pseudos (reload and inheritance ones). Also, the rclass_size * rclass_size loop is not expensive; the preparation of data for the loop is expensive. OK, in that case maybe the efficiency concern wasn't justified. FWIW, I still think passing hard_regno would be clearer though, in terms of meeting expectations. It just seems odd to spill for one specific register and then test all of them.
Especially when the spilling we actually do after choosing register X is based on X's iteration of this loop. (I realise I could well be missing the point here though, sorry.) + (reload_hard_regno + = find_hard_regno_for (reload_regno, +reload_cost, -1)) >= 0 + && (lra_hard_reg_set_intersection_p + (reload_hard_regno, PSEUDO_REGNO_MODE (reload_regno), + spilled_hard_regs))) + { + if (lra_dump_file != NULL) + fprintf (lra_dump_file, " assign %d(cost=%d)", +reload_regno, reload_cost); + assign_temporarily (reload_regno, reload_hard_regno); + cost += reload_cost; It looks like registers that can be reallocated make hard_regno more expensive (specifically by reload_cost), but registers
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
On 10/04/2012 11:50 AM, Richard Sandiford wrote: Hi Vlad, This message is for lra-assigns.c. Sorry for the piecemeal reviews, never sure when I'll get time... +/* This file contains a pass mostly assigning hard registers to reload + pseudos. There is no any RTL code transformation on this pass. Maybe: /* This file's main objective is to assign hard registers to reload pseudos. It also tries to allocate hard registers to other pseudos, but at a lower priority than the reload pseudos. The pass does not transform the RTL. if that's accurate. Yes. That is better. I used your comment. + Reload pseudos get what they need (usually) hard registers in + anyway possibly by spilling non-reload pseudos and by assignment + reload pseudos with smallest number of available hard registers + first. + + If reload pseudos can get hard registers only through spilling + other pseudos, we choose what pseudos to spill taking into account + how given reload pseudo benefits and also how other reload pseudos + not assigned yet benefit too (see function spill_for). Maybe: We must allocate a hard register to every reload pseudo. We try to increase the chances of finding a viable allocation by assigning the pseudos in order of fewest available hard registers first. If we still fail to find a hard register, we spill other (non-reload) pseudos in order to make room. assign_hard_regno_for allocates registers without spilling. spill_for does the same with spilling. Both functions use a cost model to determine the most profitable choice of hard and spill registers. Ok. I just changed two sentences a bit: find_hard_regno_for finds hard registers for allocation without spilling. spill_for does the same with spilling. + Non-reload pseudos can get hard registers too if it is possible and + improves the code. It might be possible because of spilling + non-reload pseudos on given pass. Maybe: Once we have finished allocating reload pseudos, we also try to assign registers to other (non-reload) pseudos. 
This is useful if hard registers were freed up by the spilling just described. Fixed. + We try to assign hard registers processing pseudos by threads. The + thread contains reload and inheritance pseudos connected by copies + (move insns). It improves the chance to get the same hard register + to pseudos in the thread and, as the result, to remove some move + insns. Maybe: We try to assign hard registers by collecting pseudos into threads. These threads contain reload and inheritance pseudos that are connected by copies (move insns). Doing this improves the chances of pseudos in the thread getting the same hard register and, as a result, of allowing some move insns to be deleted. Fixed. + When we assign hard register to a pseudo, we decrease the cost of + the hard registers for corresponding pseudos connected by copies. Maybe: When we assign a hard register to a pseudo, we decrease the cost of using the same hard register for pseudos that are connected by copies. Fixed. + If two hard registers are equally good for assigning the pseudo + with hard register cost point of view, we prefer a hard register in + smaller register bank. By default, there is only one register + bank. A target can define register banks by hook + register_bank. For example, x86-64 has a few register banks: hard + regs with and without REX prefixes are in different banks. It + permits to generate smaller code as insns without REX prefix are + shorter. Maybe: If two hard registers have the same frequency-derived cost, we prefer hard registers in lower register banks. The mapping of registers to banks is controlled by the register_bank target hook. For example, x86-64 has a few register banks: hard registers with and without REX prefixes are in different banks. This permits us to generate smaller code as insns without REX prefixes are shorter. although this might change if the name of the hook changes. 
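The bank (later: priority) tie-break described in the quoted comment is a simple lexicographic choice: cost first, bank/priority second. A standalone model of that selection rule, with hypothetical `cost` and `priority` arrays standing in for LRA's per-register data:

```c
#include <assert.h>

/* Pick the best hard register: lowest frequency-derived cost wins;
   on a cost tie, higher priority wins (e.g. non-REX registers on
   x86-64, which give shorter insns).  Standalone sketch of the
   tie-break discussed above, not the real LRA code.  */
static int
choose_hard_reg (const int *cost, const int *priority, int n)
{
  int best = -1;
  for (int i = 0; i < n; i++)
    if (best < 0
        || cost[i] < cost[best]
        || (cost[i] == cost[best] && priority[i] > priority[best]))
      best = i;
  return best;
}
```

With costs {5, 3, 3, 7} and priorities {0, 1, 2, 0}, registers 1 and 2 tie on cost and register 2 wins on priority, which is the x86-64 REX-prefix situation in miniature.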
With recent change in the hook name, I modified it to: If two hard registers have the same frequency-derived cost, we prefer hard registers with bigger priorities. The mapping of registers to priorities is controlled by the register_priority target hook. For example, x86-64 has a few register priorities: hard registers with and without REX prefixes have different priorities. This permits us to generate smaller code as insns without REX prefixes are shorter. +/* Info about pseudo used during the assignment pass. Thread is a set + of connected reload and inheritance pseudos with the same set of + available hard reg set. Thread is a pseudo itself for other + cases. */ +struct regno_assign_info Maybe: /* Information about the thread to which a pseudo belongs. Threads are a set of connected reload and inheritance pseudos with
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
Sorry, reading back in different surroundings made me notice a couple of silly errors: Richard Sandiford rdsandif...@googlemail.com writes: E.g.: if ((*loc = get_equiv_substitution (reg)) != reg) ...as above... if (*loc != reg || !in_class_p (reg, cl, new_class)) ...as above... else if (new_class != NO_REGS && rclass != new_class) change_class (regno, new_class, "Change", true); return change_p; (assuming change to in_class_p suggested earlier) seems like it covers the same cases. ...but that same in_class_p change means that the rclass != new_class condition isn't needed. I.e. if ((*loc = get_equiv_substitution (reg)) != reg) ...as above... if (*loc != reg || !in_class_p (reg, cl, new_class)) ...as above... else if (new_class != NO_REGS) change_class (regno, new_class, "Change", true); return change_p; + if (operand_reg[nop] != NULL_RTX) +{ + int last_reload = (lra_reg_info[ORIGINAL_REGNO + (operand_reg[nop])] + .last_reload); + + if (last_reload > bb_reload_num) +reload_sum += last_reload; + else +reload_sum += bb_reload_num; The comment for reload_sum says: +/* Overall number reflecting distances of previous reloading the same + value. It is used to improve inheritance chances. */ +static int best_reload_sum; which made me think of distance from the current instruction. I see it's actually something else, effectively a sum of instruction numbers. I assumed the idea was to prefer registers that were reloaded more recently (closer to the current instruction). In that case I thought that, for a distance-based best_reload_sum, smaller would be better, while for an instruction-number-based best_reload_sum, larger would be better. It looks like we use instruction-number based best_reload_sums but prefer smaller sums: + (reload_nregs < best_reload_nregs + || (reload_nregs == best_reload_nregs + && best_reload_sum < reload_sum Is that intentional? Clearly I can't read. The code _does_ prefer higher numbers. I still think distance is a bit misleading though.
:-) Just for the record, the rest of my question: Also, is this value meaningful for output reloads, which aren't really going to be able to inherit a value as such? We seem to apply the cost regardless of whether it's an input or an output, so probably deserves a comment. Same for matched input operands, which as you say elsewhere aren't inherited. still applies. Richard
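The preference Richard ends up agreeing the code implements can be stated as a small predicate: fewer registers needed wins, and on a tie a *larger* instruction-number sum wins, because bigger insn numbers mean the previous reloads of the value happened closer to the current insn. A standalone model of that comparison (hypothetical parameter names, not the real LRA code):

```c
#include <assert.h>
#include <stdbool.h>

/* True if a candidate with RELOAD_NREGS registers and instruction-
   number sum RELOAD_SUM should replace the current best candidate.
   Larger sums are better: they indicate more recent reloads of the
   same value, which improves inheritance chances.  */
static bool
better_candidate_p (int reload_nregs, int reload_sum,
                    int best_reload_nregs, int best_reload_sum)
{
  return (reload_nregs < best_reload_nregs
          || (reload_nregs == best_reload_nregs
              && best_reload_sum < reload_sum));
}
```

So between two one-register candidates, the one whose previous reloads sum to insn 120 beats one summing to insn 90, matching "the code _does_ prefer higher numbers" above.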
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
On 12-10-03 7:11 AM, Richard Sandiford wrote: Hi Vlad, Some comments on lra-spills.c and lra-coalesce.c. + The pass creates necessary stack slots and assign spilled pseudos + to the stack slots in following way: s/assign/assigns/ Fixed. + (or insn memory constraints) might be not satisfied any more. s/might be not/might not be/ Fixed. + For some targets, the pass can spill some pseudos into hard + registers of different class (usually into vector registers) + instead of spilling them into memory if it is possible and + profitable. Spilling GENERAL_REGS pseudo into SSE registers for + modern Intel x86/x86-64 processors is an example of such + optimization. And this is actually recommended by Intel + optimization guide. Maybe mention core i7 specifically? Modern is a bit dangerous in code that'll live a long time. Yes, right. Fixed. Another bad thing would be usage of the word new. +/* The structure describes a stack slot which can be used for several + spilled pseudos. */ +struct slot +{ Looks like this describes a register or stack slot given the hard_regno case. Fixed +/* Array containing info about the stack slots. The array element is + indexed by the stack slot number in the range [0..slost_num). */ Typo: slots_num Fixed. + /* Each pseudo has an inherent size which comes from its own mode, + and a total size which provides room for paradoxical subregs + which refer to the pseudo reg in wider modes. + + We can use a slot already allocated if it provides both enough + inherent space and enough total space. Otherwise, we allocate a + new slot, making sure that it has no less inherent space, and no + less total space, then the previous slot. */ The second part of the comment seems a bit misplaced, since the following code doesn't reuse stack slots. This is done elsewhere instead. Maybe the first part would be better above the inherent_size assignment. Right. I've changed comment to reflect the current state of the code.
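The slot-reuse rule the quoted comment describes reduces to a two-part predicate: a slot can be shared only if it has both enough inherent space (for the pseudo's own mode) and enough total space (room for paradoxical subregs in wider modes). A standalone sketch with hypothetical field names, not GCC's real struct slot:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a shareable spill slot.  */
struct spill_slot
{
  int inherent_size;  /* size for the pseudo's own mode */
  int total_size;     /* size including paradoxical-subreg room */
};

/* True if slot S can hold a pseudo needing INHERENT_SIZE bytes for
   its own mode and TOTAL_SIZE bytes overall.  Both conditions must
   hold, per the comment quoted above.  */
static bool
slot_fits_p (const struct spill_slot *s, int inherent_size, int total_size)
{
  return s->inherent_size >= inherent_size && s->total_size >= total_size;
}
```

For example, an 8/16-byte slot can take a 4-byte pseudo accessed through a 16-byte paradoxical subreg, but not an 8-byte pseudo accessed through a 32-byte one.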
+ /* If we have any adjustment to make, or if the stack slot is the + wrong mode, make a new stack slot. */ + x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust); We don't make a new slot here. I removed the comment. The same comment is present in reload1.c and probably should be also removed. +/* Sort pseudos according their slot numbers putting ones with smaller + numbers first, or last when the frame pointer is not needed. So + pseudos with the first slot will be finally addressed with smaller + address displacement. */ +static int +pseudo_reg_slot_compare (const void *v1p, const void *v2p) +{ + const int regno1 = *(const int *) v1p; + const int regno2 = *(const int *) v2p; + int diff, slot_num1, slot_num2; + int total_size1, total_size2; + + slot_num1 = pseudo_slots[regno1].slot_num; + slot_num2 = pseudo_slots[regno2].slot_num; + if ((diff = slot_num1 - slot_num2) != 0) +return (frame_pointer_needed + || !FRAME_GROWS_DOWNWARD == STACK_GROWS_DOWNWARD ? diff : -diff); The comment doesn't quite describe the condition. Maybe: /* Sort pseudos according to their slots, putting the slots in the order that they should be allocated. Slots with lower numbers have the highest priority and should get the smallest displacement from the stack or frame pointer (whichever is being used). The first allocated slot is always closest to the frame pointer, so prefer lower slot numbers when frame_pointer_needed. If the stack and frame grow in the same direction, then the first allocated slot is always closest to the initial stack pointer and furthest away from the final stack pointer, so allocate higher numbers first when using the stack pointer in that case. The reverse is true if the stack and frame grow in opposite directions. */ I used your comment. Thanks. 
+ total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1), +GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode)); + total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2), +GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode)); + if ((diff = total_size2 - total_size1) != 0) +return diff; I think this could do with a bit more commentary. When is biggest_mode ever smaller than PSEUDO_REGNO_BYTES? Is that for pseudos that are only ever referenced as lowpart subregs? If so, why does PSEUDO_REGNO_BYTES matter for those registers here but not when calculating biggest_mode? The MAX code makes no sense to me either (probably it was wrongly adapted from somewhere). So I removed the MAX. +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS. Put the + pseudos which did not get a spill hard register at the beginning of + array PSEUDO_REGNOS. Return the number of such pseudos. */ It'd be worth saying that PSEUDO_REGNOS is sorted in order of highest frequency first. Fixed. + bitmap set_jump_crosses =
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
Hi Vlad, This message is for lra-assigns.c. Sorry for the piecemeal reviews, never sure when I'll get time... +/* This file contains a pass mostly assigning hard registers to reload + pseudos. There is no any RTL code transformation on this pass. Maybe: /* This file's main objective is to assign hard registers to reload pseudos. It also tries to allocate hard registers to other pseudos, but at a lower priority than the reload pseudos. The pass does not transform the RTL. if that's accurate. + Reload pseudos get what they need (usually) hard registers in + anyway possibly by spilling non-reload pseudos and by assignment + reload pseudos with smallest number of available hard registers + first. + + If reload pseudos can get hard registers only through spilling + other pseudos, we choose what pseudos to spill taking into account + how given reload pseudo benefits and also how other reload pseudos + not assigned yet benefit too (see function spill_for). Maybe: We must allocate a hard register to every reload pseudo. We try to increase the chances of finding a viable allocation by assigning the pseudos in order of fewest available hard registers first. If we still fail to find a hard register, we spill other (non-reload) pseudos in order to make room. assign_hard_regno_for allocates registers without spilling. spill_for does the same with spilling. Both functions use a cost model to determine the most profitable choice of hard and spill registers. + Non-reload pseudos can get hard registers too if it is possible and + improves the code. It might be possible because of spilling + non-reload pseudos on given pass. Maybe: Once we have finished allocating reload pseudos, we also try to assign registers to other (non-reload) pseudos. This is useful if hard registers were freed up by the spilling just described. + We try to assign hard registers processing pseudos by threads. The + thread contains reload and inheritance pseudos connected by copies + (move insns). 
It improves the chance to get the same hard register + to pseudos in the thread and, as the result, to remove some move + insns. Maybe: We try to assign hard registers by collecting pseudos into threads. These threads contain reload and inheritance pseudos that are connected by copies (move insns). Doing this improves the chances of pseudos in the thread getting the same hard register and, as a result, of allowing some move insns to be deleted. + When we assign hard register to a pseudo, we decrease the cost of + the hard registers for corresponding pseudos connected by copies. Maybe: When we assign a hard register to a pseudo, we decrease the cost of using the same hard register for pseudos that are connected by copies. + If two hard registers are equally good for assigning the pseudo + with hard register cost point of view, we prefer a hard register in + smaller register bank. By default, there is only one register + bank. A target can define register banks by hook + register_bank. For example, x86-64 has a few register banks: hard + regs with and without REX prefixes are in different banks. It + permits to generate smaller code as insns without REX prefix are + shorter. Maybe: If two hard registers have the same frequency-derived cost, we prefer hard registers in lower register banks. The mapping of registers to banks is controlled by the register_bank target hook. For example, x86-64 has a few register banks: hard registers with and without REX prefixes are in different banks. This permits us to generate smaller code as insns without REX prefixes are shorter. although this might change if the name of the hook changes. +/* Info about pseudo used during the assignment pass. Thread is a set + of connected reload and inheritance pseudos with the same set of + available hard reg set. Thread is a pseudo itself for other + cases. */ +struct regno_assign_info Maybe: /* Information about the thread to which a pseudo belongs. 
Threads are a set of connected reload and inheritance pseudos with the same set of available hard registers. Lone registers belong to their own threads. */ Although the condition seems to be: + (ira_class_hard_regs_num[regno_allocno_class_array[regno1]] + == ira_class_hard_regs_num[regno_allocno_class_array[regno2]])) i.e. the same _number_ of available hard regs, but not necessarily the same set. thread might be more mnemonic than regno_assign in this file, but that's bikeshed stuff. + for (i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++) +{ + regno_assign_info[i].first = i; + regno_assign_info[i].next = -1; + regno_assign_info[i].freq = lra_reg_info[i].freq; +} Minor speedup, but it's probably worth caching max_reg_num () rather than calling it in each loop
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
Hi Vlad, Some comments on lra-spills.c and lra-coalesce.c. + The pass creates necessary stack slots and assign spilled pseudos + to the stack slots in following way: s/assign/assigns/ + (or insn memory constraints) might be not satisfied any more. s/might be not/might not be/ + For some targets, the pass can spill some pseudos into hard + registers of different class (usually into vector registers) + instead of spilling them into memory if it is possible and + profitable. Spilling GENERAL_REGS pseudo into SSE registers for + modern Intel x86/x86-64 processors is an example of such + optimization. And this is actually recommended by Intel + optimization guide. Maybe mention core i7 specifically? Modern is a bit dangerous in code that'll live a long time. +/* The structure describes a stack slot which can be used for several + spilled pseudos. */ +struct slot +{ Looks like this describes a register or stack slot given the hard_regno case. +/* Array containing info about the stack slots. The array element is + indexed by the stack slot number in the range [0..slost_num). */ Typo: slots_num + /* Each pseudo has an inherent size which comes from its own mode, + and a total size which provides room for paradoxical subregs + which refer to the pseudo reg in wider modes. + + We can use a slot already allocated if it provides both enough + inherent space and enough total space. Otherwise, we allocate a + new slot, making sure that it has no less inherent space, and no + less total space, then the previous slot. */ The second part of the comment seems a bit misplaced, since the following code doesn't reuse stack slots. This is done elsewhere instead. Maybe the first part would be better above the inherent_size assignment. + /* If we have any adjustment to make, or if the stack slot is the + wrong mode, make a new stack slot. */ + x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust); We don't make a new slot here. 
+/* Sort pseudos according their slot numbers putting ones with smaller + numbers first, or last when the frame pointer is not needed. So + pseudos with the first slot will be finally addressed with smaller + address displacement. */ +static int +pseudo_reg_slot_compare (const void *v1p, const void *v2p) +{ + const int regno1 = *(const int *) v1p; + const int regno2 = *(const int *) v2p; + int diff, slot_num1, slot_num2; + int total_size1, total_size2; + + slot_num1 = pseudo_slots[regno1].slot_num; + slot_num2 = pseudo_slots[regno2].slot_num; + if ((diff = slot_num1 - slot_num2) != 0) +return (frame_pointer_needed + || !FRAME_GROWS_DOWNWARD == STACK_GROWS_DOWNWARD ? diff : -diff); The comment doesn't quite describe the condition. Maybe: /* Sort pseudos according to their slots, putting the slots in the order that they should be allocated. Slots with lower numbers have the highest priority and should get the smallest displacement from the stack or frame pointer (whichever is being used). The first allocated slot is always closest to the frame pointer, so prefer lower slot numbers when frame_pointer_needed. If the stack and frame grow in the same direction, then the first allocated slot is always closest to the initial stack pointer and furthest away from the final stack pointer, so allocate higher numbers first when using the stack pointer in that case. The reverse is true if the stack and frame grow in opposite directions. */ + total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1), + GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode)); + total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2), + GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode)); + if ((diff = total_size2 - total_size1) != 0) +return diff; I think this could do with a bit more commentary. When is biggest_mode ever smaller than PSEUDO_REGNO_BYTES? Is that for pseudos that are only ever referenced as lowpart subregs? 
If so, why does PSEUDO_REGNO_BYTES matter for those registers here but not when calculating biggest_mode? +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS. Put the + pseudos which did not get a spill hard register at the beginning of + array PSEUDO_REGNOS. Return the number of such pseudos. */ It'd be worth saying that PSEUDO_REGNOS is sorted in order of highest frequency first. + bitmap set_jump_crosses = regstat_get_setjmp_crosses (); I notice you use set_jump here and setjump in parts of 7a.patch. Probably better to use setjmp across the board. + /* Hard registers which can not be used for any purpose at given + program point because they are unallocatable or already allocated + for other pseudos. */ + HARD_REG_SET *reserved_hard_regs; + + if (! lra_reg_spill_p) +return n; + /* Set up reserved hard