https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121240
--- Comment #10 from Dhruv Chawla <dhruvc at gcc dot gnu.org> ---
(In reply to Wilco from comment #9)
> We expand floating point constants using MOV/MOVK/DUP in the same way since
> the use of DUP is faster when you are load limited (often the case in FP
> code). We can fine tune which cases should expand into DUP, but the decision
> has to be taken early on, so we can't take register pressure into account.
>

After doing some deeper diving into this kernel, it appears that this is an
edge case being hit on Neoverse V2 because of DUP having a throughput of 1.

> My question is why are these constants generated *inside* the loop if there
> is no high register pressure? They should be lifted out of the loop. Do you
> know what the actual bottleneck is? You could edit some extra MOV or DUP in
> the disassembly to see which results in worse performance.

The constants do actually get lifted out of the loop by LICM. The RA then
spills and rematerializes them in the hot loop as required. I think the agent
got a bit misled here as there is no "low register pressure".

So as per the agent:

"This produces 24 DUP rematerializations inside the hot inner loop (19 MOV +
19 MOVK + 24 DUP = 62 instructions) versus 28 ADRP + 29 LDR Q = 57
instructions in the GOOD build. The FP computation is byte-identical: 91 FMUL,
72 FMLA, 34 FADD/FSUB, 4 FRSQRTE, 8 FRSQRTS.

The DUP path is shorter in latency (5 cycles MOV→MOVK→DUP→FMA vs 7 cycles
ADRP→LDR→FMA) but serializes on a single pipe. With 24 DUPs at throughput 1 on
M0, the last DUP cannot issue until cycle ~24. The GOOD build's 29 LDR Qs at
throughput 3 on L pipes all complete by cycle ~10."

I also asked it to confirm using llvm-mca, which gave similar results. So I
think this is an unfortunate case of a low-throughput instruction on this
particular architecture. Does this make more sense? I will be experimenting
with the assembly as well, just wanted to post the deeper dive first.

(PS: Sorry for throwing AI slop at you. I wanted to use this regression as a
real-world test for these agents and to better learn how to use them.)
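
For reference, the cycle estimates in the quoted analysis can be reproduced with a back-of-envelope calculation. This is only a sketch: it assumes the instructions are independent, that nothing else competes for the same pipes, and that the throughput figures (1 per cycle for DUP, 3 per cycle for LDR Q) are as quoted above; I have not checked them against the optimization guide here. The helper name `last_issue_cycle` is made up for illustration.

```python
import math


def last_issue_cycle(count: int, per_cycle: int) -> int:
    """Cycle by which `count` independent instructions have all issued,
    assuming `per_cycle` of them can issue every cycle (hypothetical model,
    ignores dependencies and contention from other instructions)."""
    return math.ceil(count / per_cycle)


# Instruction counts quoted in the comment above.
bad_total = 19 + 19 + 24   # MOV + MOVK + DUP in the regressed build
good_total = 28 + 29       # ADRP + LDR Q in the GOOD build

# Throughput figures as quoted: DUP 1 per cycle, LDR Q 3 per cycle.
dup_cycles = last_issue_cycle(24, 1)
ldr_cycles = last_issue_cycle(29, 3)

print(f"regressed build: {bad_total} insns, last DUP issues around cycle {dup_cycles}")
print(f"GOOD build:      {good_total} insns, last LDR issues around cycle {ldr_cycles}")
```

Under these assumptions the model gives 24 cycles for the DUPs and 10 for the loads, which matches the "~24" and "~10" figures in the agent's summary. It says nothing about how the rest of the loop overlaps with them.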
