> On 07.09.2024 at 17:56, Jeff Law <jeffreya...@gmail.com> wrote:
> 
> 
> 
> On 9/7/24 1:09 AM, Richard Biener wrote:
>>>> On 06.09.2024 at 17:38, Andrew Carlotti <andrew.carlo...@arm.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I'm working on optimising assignments to the AArch64 Floating-point Mode
>>> Register (FPMR), as part of our FP8 enablement work.  Claudio has already
>>> implemented FPMR as a hard register, with the intention that FP8 intrinsic
>>> functions will compile to a combination of an fpmr register set, followed by
>>> an FP8 operation that takes fpmr as an input operand.
>>> 
>>> It would clearly be inefficient to retain an explicit FPMR assignment prior to
>>> each FP8 instruction (especially in the common case where every assignment
>>> uses the same FPMR value).  I think the best way to optimise this would be to
>>> implement a new pass that can optimise assignments to individual hard
>>> registers.
>>> 
>>> There are a number of existing passes that do similar optimisations, but which
>>> I believe are unsuitable for this scenario for various reasons.  For example:
>>> 
>>> - cse1 can already optimise FPMR assignments within an extended basic block,
>>>  but can't handle broader optimisations.
>>> - pre (in gcse.c) doesn't work with assigning constant values, which would
>>>  miss many potential usages.  It also has limits on how far code can be moved,
>>>  based around ideas of register pressure that don't apply to the context of a
>>>  single hard register that shouldn't be used by the register allocator for
>>>  anything else.  Additionally, it doesn't run at -Os.
>>> - hoist (also using gcse.c) only handles constant values, and only runs when
>>>  optimising for size.  It also has the rest of the issues that pre does.
>>> - mode_sw only handles a small finite set of modes.  The mode requirements are
>>>  determined solely by the instructions that require the specific mode, so
>>>  mode switches don't depend on the output of previous instructions.
>>> 
>>> 
>>> My intention would be for the new pass to reuse ideas, and hopefully some of
>>> the existing code, from the mode-switching and gcse passes.  In particular,
>>> gcse.c (or its dependencies) has code that could identify when values
>>> assigned to the FPMR are known to be the same (although we may not need the
>>> full CSE capabilities of gcse.c), and mode-switching.cc knows how to globally
>>> optimise mode assignments (and unlike gcse.c, doesn't use cautious heuristics
>>> to avoid excessively increasing register pressure).
>>> 
>>> Initially the new pass would only apply to the AArch64 FPMR register, but in
>>> future it could also be used for other hard registers with similar
>>> properties.
>>> 
>>> Does anyone have any comments on this approach, before I start writing any
>>> code?
>> Can you explain in more detail why the mode-switching pass
>> infrastructure isn’t a good fit?  ISTR it already is customizable via
>> target hooks.
> Agreed.  Mode switching seems to be the right pass to look at.
> 
> It probably is worth pointing out that mode switching is LCM based and as 
> such never speculates.  Given the potential cost of a mode switch, failure to 
> speculate may be a notable limitation (though the same would apply to the 
> ideas Andrew floated above).
> 
> This has recently come up in the RISC-V space due to needing VXRM assignments 
> so that we can utilize the vaaddu add-with-averaging instructions.
> Placement of VXRM mode switches looks optimal from an LCM standpoint, but 
> speculation can measurably improve performance.  It was something like 2% on 
> the BPI for x264.  The k1/m1 chip in the BPI is almost certainly flushing its 
> pipelines on the VXRM assignment.
> 
> I've got a hack here that I'll submit upstream at some point.  Just not at 
> the top of my list yet -- especially now that our uarch has been fixed to not 
> flush its pipelines at VXRM assignments ;-)

I suppose LCM could be enhanced to handle partial anticipability, and if the
edges it speculates on are cold, that might even be profitable on less capable
implementations?
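
For concreteness, here is a toy sketch of the difference between full and
partial anticipability. It is plain Python, not GCC code. The CFG, the block
names, and the `anticipability` helper are all made up for illustration, and
kills/transparency are ignored entirely:

```python
# Toy CFG (hypothetical): a loop where only one arm of the body needs the mode.
SUCCS = {
    "preheader": ["header"],
    "header": ["use", "skip"],
    "use": ["latch"],
    "skip": ["latch"],
    "latch": ["header", "exit"],
    "exit": [],
}
NEEDS_MODE = {"use"}  # blocks containing an instruction that requires the mode


def anticipability(succs, needs, partial):
    """Backward dataflow over block entries.

    partial=False: the mode is needed on *every* path from the block entry.
    partial=True:  the mode is needed on *some* path from the block entry.
    """
    combine = any if partial else all
    # The all-paths problem starts optimistic (True) and only lowers values;
    # the some-path problem starts pessimistic (False) and only raises them.
    ant_in = {b: (not partial) for b in succs}
    changed = True
    while changed:
        changed = False
        for b in succs:
            # A block with no successors anticipates nothing at its exit.
            out = combine(ant_in[s] for s in succs[b]) if succs[b] else False
            new = (b in needs) or out
            if new != ant_in[b]:
                ant_in[b] = new
                changed = True
    return ant_in


full = anticipability(SUCCS, NEEDS_MODE, partial=False)
part = anticipability(SUCCS, NEEDS_MODE, partial=True)
print("fully anticipated at preheader:    ", full["preheader"])
print("partially anticipated at preheader:", part["preheader"])
```

On this CFG the mode is not fully anticipated at `preheader`, because the
`skip` arm can reach `exit` without ever needing it, so a placement that never
speculates would have to keep the set inside the loop. It is partially
anticipated there, which is the property a speculating variant would look at,
presumably weighed against edge frequencies.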

> 
> jeff
