Qing Zhao <qing.z...@oracle.com> writes: >> On Sep 14, 2020, at 2:20 PM, Richard Sandiford <richard.sandif...@arm.com> >> wrote: >> >> Qing Zhao <qing.z...@oracle.com <mailto:qing.z...@oracle.com>> writes: >>>> On Sep 14, 2020, at 11:33 AM, Richard Sandiford >>>> <richard.sandif...@arm.com> wrote: >>>> >>>> Qing Zhao <qing.z...@oracle.com> writes: >>>>>> Like I mentioned earlier though, passes that run after >>>>>> pass_thread_prologue_and_epilogue can use call-clobbered registers that >>>>>> weren't previously used. For example, on x86_64, the function might >>>>>> not use %r8 when the prologue, epilogue and returns are generated, >>>>>> but pass_regrename might later introduce a new use of %r8. AIUI, >>>>>> the “used” version of the new command-line option is supposed to clear >>>>>> %r8 in these circumstances, but it wouldn't do so if the data was >>>>>> collected at the point that the return is generated. >>>>> >>>>> Thanks for the information. >>>>> >>>>>> >>>>>> That's why I think it's more robust to do this later (at the beginning >>>>>> of pass_late_compilation) and insert the zeroing before returns that >>>>>> already exist. >>>>> >>>>> Yes, looks like it’s not correct to insert the zeroing at the time when >>>>> prologue, epilogue and return are generated. >>>>> As I also checked, “return” might be also generated as late as pass >>>>> “pass_delay_slots”, So, shall we move the >>>>> New pass as late as possible? >>>> >>>> If we insert the zeroing before pass_delay_slots and describe the >>>> result correctly, pass_delay_slots should do the right thing. >>>> >>>> Describing the result correctly includes ensuring that the cleared >>>> registers are treated as live on exit from the function, so that the >>>> zeroing doesn't get deleted again, or skipped by pass_delay_slots. >>> >>> In the current implementation for x86, when we generating a zeroing insn as >>> the following: >>> >>> (insn 18 16 19 2 (set (reg:SI 1 dx) >>> (const_int 0 [0])) "t10.c":11:1 -1 >>> (nil)) >>> (insn 19 18 20 2 (unspec_volatile [ >>> (reg:SI 1 dx) >>> ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1 >>> (nil)) >>> >>> i.e, after each zeroing insn, the register that is zeroed is marked as >>> “UNSPECV_PRO_EPILOGUE_USE”, >>> By doing this, we can avoid this zeroing insn from being deleted or >>> skipped. >>> >>> Is doing this enough to describe the result correctly? >>> Is there other thing we need to do in addition to this? >> >> I guess that works, but I think it would be better to abstract >> EPILOGUE_USES into a new target-independent wrapper function that >> (a) returns true if EPILOGUE_USES itself returns true and (b) returns >> true for registers that need to be zero on return, if the zeroing >> instructions have already been inserted. The places that currently >> test EPILOGUE_USES should then test this new wrapper function instead. > > Okay, I see. > Looks like that EPILOGUE_USES is used in df-scan.c to compute the data flow > information. If EPILOUGE_USES return true > for registers that need to be zeroed on return, those registers will be > included in the data flow information, as a result, later > passes will not be able to delete them. > > This sounds to be a cleaner approach than the current one that marks the > registers “UNSPECV_PRO_EPILOGUE_USE”. > > A more detailed implementation question on this: > Where should I put this new target-independent wrapper function in? Which > header file will be a proper place to hold this new function?
Not a strong opinion, but: maybe df.h and df-scan.c, since this is really a DF query. >> After inserting the zeroing instructions, the pass should recompute the >> live-out sets based on this. Sorry, I was wrong here. It should *cause* the sets to be recomputed where necessary (rather than recompute them directly), but see below. > Is only computing the live-out sets of the block that including the return > insn enough? Or we should re-compute the whole procedure? > > Which utility routine I should use to recompute the live-out sets? Inserting the instructions will cause the containing block to be marked dirty, via df_set_bb_dirty. I think the pass should also call df_set_bb_dirty on the exit block itself, to indicate that the wrapper around EPILOGUE_USES has changed behaviour, but that might not be necessary. This gives the df machinery enough information to work out what has changed. It will then propagate those changes throughout the function. (I don't think any propagation would be necessary here, but if I'm wrong about that, then the df machinery will do whatever propagation is necessary.) However, the convention is for a pass that uses the df machinery to call df_analyze first. This call to df_analyze updates any stale df information. So unlike what I said yesterday, the pass itself doesn't need to make sure that the df information is up-to-date. It just needs to indicate what has changed, as above. In the case of pass_delay_slots, pass_free_cfg has: /* The resource.c machinery uses DF but the CFG isn't guaranteed to be valid at that point so it would be too late to call df_analyze. */ if (DELAY_SLOTS && optimize > 0 && flag_delayed_branch) { df_note_add_problem (); df_analyze (); } Any other machine-specific passes that use df already need to call df_analyze (if they use the df machinery). So simply marking what has changed is enough (by design). > My understanding is: > > In our new pass that is put in the beginning of the pass_late_compilation, > I,e pass_zero_call_used_regs; > > PUSH_INSERT_PASSES_WITHIN (pass_late_compilation) > ++++ NEXT_PASS (pass_zero_call_used_regs); > NEXT_PASS (pass_compute_alignments); > NEXT_PASS (pass_variable_tracking); > NEXT_PASS (pass_free_cfg); > NEXT_PASS (pass_machine_reorg); > NEXT_PASS (pass_cleanup_barriers); > NEXT_PASS (pass_delay_slots); > > When we scan the EXIT BLOCK of the routine, all the return insns have already > been there. > The later passes including “pass_delay_slots” will not generate additional > returns anymore, they might just call “target.gen_return” or > “target.gen_simple_return() to replace > “ret_rtx” or “simple_ret_rtx” ? Kind-of. pass_delay_slots can also duplicate code, so it's not always a straight replacement. But the point is that returns don't appear out of nowhere. There has to be a semantic reason for them to exist. The behaviour of the function after pass_delay_slots has to be the same as it was before the pass (disregarding undefined behaviour). Once we've added clearing of the zero registers to all return paths, that clearing becomes part of the behaviour of the function, and so will be part of the behaviour after pass_delay_slots as well. So I don't think the problem is with passes generating new returns. It's more whether they could use new registers that then need to be cleared, which is the main justification for running the new pass so late in the pipeline. In principle, there's nothing stopping pass_delay_slots allocating new registers (like pass_regrename does), and in principle that could introduce the need to do more clearing. But I don't think the current pass does that. The pass is also very much legacy code at this point, so the chances of new optimisations being added to it are fairly low. If that did happen, I think it would be reasonable to expect the pass to work within the set of registers that have already been allocated, at least when your new option is in effect. Thanks, Richard