On 9/7/24 7:06 PM, Andrew Carlotti wrote:
On Sat, Sep 07, 2024 at 09:09:52AM +0200, Richard Biener wrote:
On 06.09.2024 at 17:38, Andrew Carlotti <andrew.carlo...@arm.com> wrote:
Hi,
I'm working on optimising assignments to the AArch64 Floating-point Mode
Register (FPMR), as part of our FP8 enablement work. Claudio has already
implemented FPMR as a hard register, with the intention that FP8 intrinsic
functions will compile to a combination of an fpmr register set, followed by an
FP8 operation that takes fpmr as an input operand.
It would clearly be inefficient to retain an explicit FPMR assignment prior to
each FP8 instruction (especially in the common case where every assignment uses
the same FPMR value). I think the best way to optimise this would be to
implement a new pass that can optimise assignments to individual hard registers.
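To make the redundancy concrete, here is a minimal sketch of the simplest case: straight-line code where each FP8 operation is preceded by its own FPMR write. It uses a toy instruction list, not GCC RTL. The opcode names (set_fpmr, fp8_op, call), the value names (v0, v1) and the function name are all hypothetical, and treating a call as clobbering FPMR is an assumption of the toy model rather than a statement about the ABI.

```python
from typing import List, Optional, Tuple

# Toy instruction: (opcode, operand). Hypothetical stand-in for real RTL.
Insn = Tuple[str, str]


def drop_redundant_fpmr_sets(insns: List[Insn]) -> List[Insn]:
    """Drop FPMR writes that re-assign the value FPMR is known to hold."""
    current: Optional[str] = None  # None means "FPMR contents unknown"
    out: List[Insn] = []
    for op, arg in insns:
        if op == "set_fpmr":
            if arg == current:
                continue  # FPMR already holds this value, so skip the write
            current = arg
        elif op == "call":
            # Assumption of this sketch: a call may change FPMR.
            current = None
        out.append((op, arg))
    return out


naive = [
    ("set_fpmr", "v0"), ("fp8_op", "cvt"),
    ("set_fpmr", "v0"), ("fp8_op", "mul"),
    ("set_fpmr", "v1"), ("fp8_op", "mul"),
]
optimised = drop_redundant_fpmr_sets(naive)
```

Here the second write of v0 is removed and the write of v1 is kept. The sketch only looks at one straight-line sequence, so it says nothing about writes that are redundant across branches or loops.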
There are a number of existing passes that do similar optimisations, but which
I believe are unsuitable for this scenario for various reasons. For example:
- cse1 can already optimise FPMR assignments within an extended basic block,
  but can't handle broader optimisations.
- pre (in gcse.c) doesn't work with assigning constant values, which would miss
  many potential usages. It also has limits on how far code can be moved,
  based around ideas of register pressure that don't apply to the context of a
  single hard register that shouldn't be used by the register allocator for
  anything else. Additionally, it doesn't run at -Os.
- hoist (also using gcse.c) only handles constant values, and only runs when
  optimising for size. It also has the rest of the issues that pre does.
- mode_sw only handles a small finite set of modes. The mode requirements are
  determined solely by the instructions that require the specific mode, so mode
  switches don't depend on the output of previous instructions.
My intention would be for the new pass to reuse ideas, and hopefully some of
the existing code, from the mode-switching and gcse passes. In particular,
gcse.c (or its dependencies) has code that could identify when values assigned
to the FPMR are known to be the same (although we may not need the full CSE
capabilities of gcse.c), and mode-switching.cc knows how to globally optimise
mode assignments (and unlike gcse.c, doesn't use cautious heuristics to avoid
excessively increasing register pressure).
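As a rough illustration of how the two ideas could combine, the sketch below runs a forward dataflow analysis over a toy control-flow graph. Each FPMR value is represented by an opaque name (v0, v1) standing in for a value number, so the values need not be compile-time constants or members of a fixed set of modes. Everything here is hypothetical: the block and opcode names, the function names, and the simplification that only set_fpmr changes FPMR. It is not how gcse.c or mode-switching.cc are implemented.

```python
TOP = object()   # optimistic "not yet visited" state
UNKNOWN = None   # FPMR contents not known at this point


def meet(a, b):
    """Combine states from two predecessors: they agree, or it is unknown."""
    if a is TOP:
        return b
    if b is TOP:
        return a
    return a if a == b else UNKNOWN


def transfer(state, insns):
    """State of FPMR after executing a block that starts in `state`."""
    for op, arg in insns:
        if op == "set_fpmr":
            state = arg
    return state


def fpmr_at_entry(blocks, preds, entry):
    """Map each block to the FPMR value known on entry to it.

    Unreachable blocks keep the TOP marker.
    """
    in_state = {name: TOP for name in blocks}
    in_state[entry] = UNKNOWN  # nothing is known on function entry
    changed = True
    while changed:
        changed = False
        for name in blocks:
            if name == entry:
                continue
            new = TOP
            for p in preds.get(name, []):
                new = meet(new, transfer(in_state[p], blocks[p]))
            if new != in_state[name]:
                in_state[name] = new
                changed = True
    return in_state


def redundant_sets(blocks, preds, entry):
    """List (block, index) of FPMR writes whose value is already present."""
    entry_state = fpmr_at_entry(blocks, preds, entry)
    found = []
    for name, insns in blocks.items():
        state = entry_state[name]
        for i, (op, arg) in enumerate(insns):
            if op == "set_fpmr":
                if arg == state:
                    found.append((name, i))
                state = arg
    return found


# Diamond-shaped CFG: entry -> {left, right} -> join
blocks = {
    "entry": [("set_fpmr", "v0")],
    "left": [("fp8_op", "mul")],
    "right": [("fp8_op", "cvt")],
    "join": [("set_fpmr", "v0"), ("fp8_op", "mul")],
}
preds = {"left": ["entry"], "right": ["entry"], "join": ["left", "right"]}
result = redundant_sets(blocks, preds, "entry")
```

In this diamond, both paths reach join with v0 in FPMR, so the write at the start of join is reported as redundant. If right instead wrote v1, the two paths would disagree and the write in join would have to stay. The sketch only finds fully redundant writes. It does not attempt the kind of placement optimisation (moving or hoisting writes) that a real pass would presumably also want, and it does not model calls or other clobbers.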
Initially the new pass would only apply to the AArch64 FPMR register, but in
future it could also be used for other hard registers with similar properties.
Does anyone have any comments on this approach, before I start writing any
code?
Can you explain in more detail why the mode-switching pass infrastructure
isn’t a good fit? ISTR it already is customizable via target hooks.
Richard
I forgot to explain how FPMR is used.
The FPMR register contains a large number of fields that control the data
formats and saturation/scaling behaviour used in various fp8 conversion and
multiplication intrinsics. At present, I think there are 2^26 valid defined
values that can be used in the FPMR. Furthermore, these values are not always
compile-time constants - we expect that developers will often reuse the same
compiled code (e.g. a matrix multiplication library routine) with different
formats or scaling/saturation behaviour selected at runtime (e.g. by passing a
parameter to the library routine).
(The specification for the FPMR register can be found at [1]. Its usage in
fp8 intrinsics is described in the draft ACLE spec at [2].)
As I understand it, the existing mode-switching pass infrastructure is built
around a small number of modes, where the choice of mode is a compile-time
constant, and the total number of possible modes is fixed when building GCC.
Our usage of the FPMR register does not meet any of these criteria. I don't
see how these limitations could be overcome with target hooks within the
constraints of the existing pass.
You might look at RISC-V's insertion of vsetvls which uses some of the
basic concepts from mode-switching, but is, umm, more complex.
jeff