> On 07.09.2024 at 17:56, Jeff Law <jeffreya...@gmail.com> wrote:
>
>
>
> On 9/7/24 1:09 AM, Richard Biener wrote:
>>>> On 06.09.2024 at 17:38, Andrew Carlotti <andrew.carlo...@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> I'm working on optimising assignments to the AArch64 Floating-point Mode
>>> Register (FPMR), as part of our FP8 enablement work. Claudio has already
>>> implemented FPMR as a hard register, with the intention that FP8 intrinsic
>>> functions will compile to a combination of an fpmr register set, followed
>>> by an FP8 operation that takes fpmr as an input operand.
>>>
>>> It would clearly be inefficient to retain an explicit FPMR assignment prior
>>> to each FP8 instruction (especially in the common case where every
>>> assignment uses the same FPMR value). I think the best way to optimise this
>>> would be to implement a new pass that can optimise assignments to individual
>>> hard registers.
>>>
>>> There are a number of existing passes that do similar optimisations, but
>>> which I believe are unsuitable for this scenario for various reasons. For
>>> example:
>>>
>>> - cse1 can already optimise FPMR assignments within an extended basic block,
>>>   but can't handle broader optimisations.
>>> - pre (in gcse.c) doesn't work with assigning constant values, which would
>>>   miss many potential usages. It also has limits on how far code can be
>>>   moved, based around ideas of register pressure that don't apply to the
>>>   context of a single hard register that shouldn't be used by the register
>>>   allocator for anything else. Additionally, it doesn't run at -Os.
>>> - hoist (also using gcse.c) only handles constant values, and only runs when
>>>   optimising for size. It also has the rest of the issues that pre does.
>>> - mode_sw only handles a small finite set of modes. The mode requirements
>>>   are determined solely by the instructions that require the specific mode,
>>>   so mode switches don't depend on the output of previous instructions.
>>>
>>>
>>> My intention would be for the new pass to reuse ideas, and hopefully some of
>>> the existing code, from the mode-switching and gcse passes. In particular,
>>> gcse.c (or its dependencies) has code that could identify when values
>>> assigned to the FPMR are known to be the same (although we may not need the
>>> full CSE capabilities of gcse.c), and mode-switching.cc knows how to
>>> globally optimise mode assignments (and unlike gcse.c, doesn't use cautious
>>> heuristics to avoid excessively increasing register pressure).
>>>
>>> Initially the new pass would only apply to the AArch64 FPMR register, but in
>>> future it could also be used for other hard registers with similar
>>> properties.
>>>
>>> Does anyone have any comments on this approach, before I start writing any
>>> code?
>>
>> Can you explain in more detail why the mode-switching pass infrastructure
>> isn’t a good fit? ISTR it already is customizable via target hooks.
>
> Agreed. Mode switching seems to be the right pass to look at.
>
> It probably is worth pointing out that mode switching is LCM based and as
> such never speculates. Given the potential cost of a mode switch, failure to
> speculate may be a notable limitation (though the same would apply to the
> ideas Andrew floated above).
>
> This has recently come up in the RISC-V space due to needing VXRM assignments
> so that we can utilize the vaaddu add-with-averaging instructions.
> Placement of VXRM mode switches looks optimal from an LCM standpoint, but
> speculation can measurably improve performance. It was something like 2% on
> the BPI for x264. The k1/m1 chip in the BPI is almost certainly flushing its
> pipelines on the VXRM assignment.
>
> I've got a hack here that I'll submit upstream at some point. Just not at
> the top of my list yet -- especially now that our uarch has been fixed to not
> flush its pipelines at VXRM assignments ;-)
I suppose LCM could be enhanced to handle partial anticipability, and if the
edges it speculates on are cold, that might even be profitable on less capable
implementations?
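
To make the cold-edge idea a bit more concrete, here is a rough Python sketch.
It is not GCC code and does not mirror lcm.cc or mode-switching.cc: the
function name, the dictionary-based CFG encoding, the edge frequencies and the
cold_fraction threshold are all made up for illustration, and clobbers of the
mode register (calls, for instance) are not modelled. It computes
anticipability of a mode set twice, once with the usual all-paths condition
and once ignoring outgoing edges that carry only a small share of a block's
outgoing frequency.

```python
def anticipated(succs, needs_mode, edge_freq=None, cold_fraction=0.0):
    """Return {block: bool}: is a mode set anticipated at the block's start?

    With cold_fraction == 0 this is the strict "needed on every path" test.
    With cold_fraction > 0, an outgoing edge is ignored when it carries at
    most that fraction of the block's total outgoing frequency, which
    permits speculation across cold edges.
    """
    antic = {b: True for b in succs}  # optimistic start, only ever lowered
    changed = True
    while changed:
        changed = False
        for b, out in succs.items():
            if needs_mode[b]:
                continue  # a use inside the block always anticipates
            considered = out
            if edge_freq is not None and cold_fraction > 0:
                total = sum(edge_freq[(b, s)] for s in out)
                considered = [s for s in out
                              if edge_freq[(b, s)] > cold_fraction * total]
            new = bool(considered) and all(antic[s] for s in considered)
            if new != antic[b]:
                antic[b] = new
                changed = True
    return antic


# Hypothetical loop: the mode is only needed on one arm of a branch inside
# the loop, and the loop exit edge is cold.
succs = {
    "preheader": ["header"],
    "header": ["use", "skip"],
    "use": ["latch"],
    "skip": ["latch"],
    "latch": ["header", "exit"],
    "exit": [],
}
needs_mode = {b: b == "use" for b in succs}
edge_freq = {
    ("preheader", "header"): 1,
    ("header", "use"): 50,
    ("header", "skip"): 50,
    ("use", "latch"): 50,
    ("skip", "latch"): 50,
    ("latch", "header"): 99,
    ("latch", "exit"): 1,
}

strict = anticipated(succs, needs_mode)
relaxed = anticipated(succs, needs_mode, edge_freq, cold_fraction=0.05)
print("strict: ", sorted(b for b, a in strict.items() if a))
print("relaxed:", sorted(b for b, a in relaxed.items() if a))
```

In this toy CFG the strict test only marks "use", so the set stays inside the
loop. Once the cold exit edge is ignored, every block except "exit" is marked,
including "preheader", so a placement step working from that result would be
free to hoist the set out of the loop at the cost of one wasted set whenever
the use is never reached.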
>
> jeff