On 8/11/23 17:04, Jeff Law wrote:

I'm wondering (naively) if there is some way to tune this for a given backend. In general it would make sense to do the replacement, but not if the cost changes (e.g. constants can be embedded in an x86 insn freely, but not on RISC-V where this is costly), and if something is split, it might have been intentional.
I'm not immediately aware of a way to tune.

When it comes to tuning, the toplevel question is whether we have any of the info we need to tune at the point where the transformation occurs. The two most obvious pieces here would be loop info and register pressure.

i.e., do we have enough loop structure to know whether the def is at a shallower loop nest than the use?  There's a reasonable chance we have this information, as my recollection is that this analysis is done fairly early in IRA.

But that means we likely don't have any sense of register pressure at the points between the def and use.   So the most useful metric for tuning isn't really available.
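
To make that concrete, the loop-side check could look something like the sketch below.  This is purely illustrative (the function name is made up and it's not existing IRA code); it just uses the loop info that should already be available at that point:

  /* Hypothetical sketch, not actual GCC code: would moving the def down
     to its single use drag it into a deeper loop nest?  Assumes loop
     info (bb_loop_depth) is still valid at this point in IRA.  */
  static bool
  move_would_deepen_loop_p (rtx_insn *def_insn, rtx_insn *use_insn)
  {
    basic_block def_bb = BLOCK_FOR_INSN (def_insn);
    basic_block use_bb = BLOCK_FOR_INSN (use_insn);

    /* If the use is in a deeper loop, moving the def re-executes it on
       every iteration.  The register pressure half of the cost question
       is the part we don't have data for here.  */
    return bb_loop_depth (use_bb) > bb_loop_depth (def_bb);
  }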

I'd argue that even if the register pressure were high, in some cases there's just no way around it and RA needs to honor what the backend did a priori (the split in this case); otherwise we end up with something which literally doesn't compute the right thing and leads to an ICE. I'm puzzled that in this case an intentional implementation choice is getting in the way. So while I don't care about the -0.0 case in itself, it seems that with the current framework we just can't achieve the result, other than via the roundabout peephole2 approach you alluded to.


The one thing that stands out is that we don't do this transformation at all when register pressure sensitive scheduling is enabled. And we really should be turning that on by default.  Our data shows register pressure sensitive scheduling gives about a 6-7% cycle improvement on x264, as it avoids spilling in those key satd loops.

  /* Don't move insns if live range shrinkage or register
     pressure-sensitive scheduling were done because it will not
     improve allocation but likely worsen insn scheduling.  */
  if (optimize
      && !flag_live_range_shrinkage
      && !(flag_sched_pressure && flag_schedule_insns))
    combine_and_move_insns ();


So you might want to look at register pressure sensitive scheduling first.  Go into x264_r from specint and look at x264_pixel_satd_8x4: first verify the loops are fully unrolled. If they are, then look for 32-bit loads/stores into the stack.  If you have them, then you're spilling and getting crappy performance.  Using register pressure sensitive scheduling should help significantly.

Is that -fira-loop-pressure ?

We've certainly seen that internally.  The plan was to submit a patch to make register pressure sensitive scheduling the default when the scheduler is enabled.  We just haven't pushed on it.  If you can verify that you're seeing spilling as well, then it'd certainly bolster the argument that register-pressure-sensitive scheduling is desirable.
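
(For reference, "register pressure sensitive scheduling" is -fsched-pressure, which only matters when the first scheduling pass, -fschedule-insns, is enabled; -fira-loop-pressure is a separate IRA knob.)  A patch along those lines might be little more than the sketch below in the target's option-override code -- to be clear, this is a hypothetical illustration, not the actual patch:

  /* Hypothetical sketch, not a submitted patch: enable register
     pressure sensitive scheduling by default whenever the first
     scheduling pass is on and the user didn't set it explicitly.  */
  if (flag_schedule_insns && !OPTION_SET_P (flag_sched_pressure))
    flag_sched_pressure = 1;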

I can confirm that the loop is fully unrolled and there's a zillion stack spills there for intermediate computes (-Ofast -march=rv64gc_zba_zbb_zbs, no V in that build).

Thx,
-Vineet
