[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

crazylht at gmail dot com Tue, 22 Sep 2020 18:38:52 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127


--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Michael_S from comment #6)
> Why do you see it as addition of peephole pattern?
> I see it as removal. Like, "do what's written in the source and don't try to
> be tricky".
> Probably, I am too removed from how compilers work :(
> 
> Or, may be, handle it at the level of cost of instructions?

Since the costs inside/outside of a loop are different, it's too rough to use
an unified cost to do estimation.

> I don't know how gcc works internally, but it seems that currently the cost
> of register move [on Haswell and Skylake] is underestimated.
> Although it is true that register move has no cost in terms of execution
> ports and latency, it still has the same cost as, say, integer ALU
> instructions, in terms of the front end and renamer.
> 
> Also, as pointed out above by Alexander, the cost of FMA3 with (base+index)
> or (index*scale) memory operands could also be underestimated.
> 
> Unlike Alexander, I am not sure that the difference between (base+index) and
> (base) is really what matters. IMO, the cost of FMA3 with *any* memory
> operands is underestimated, but I am not going to insist on that.
> 

We can increase FMA3 cost if it has memory operand, but not sure would it
benifit all cases, especially for those not in inner loop.

> 
> In the ideal world compiler should reason as I do it myself when coding in
> asm:
> estimate which resource is critical in the given loop and then try to reduce
> pressure on this particular resource. In this particular loop a critical
> resource appears to be renamer, so the cost of instructions should be seen
> as cost at renamer. In other situations critical resource could be
> throughput of particular issue port. In yet another situation it could be
> latency of instructions that form dependency chain across multiple

AFAIK, resource like issue port is only consided in software scheduler, but
scheduler won't do instruction tranformation like that, even scheduler itself
is subtle when target supports out-of-order execution.

> iterations of the loop.
> The later is especially common in "reduction" algorithms of which dot
> product is the most common example.
> A single value of instruction cost simply can't cover all these different
> cases in satisfactory manner.
> May be, gcc it's already here, I don't know.

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

Reply via email to