[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

ubizjak at gmail dot com via Gcc-bugs Thu, 23 Dec 2021 03:16:40 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797


--- Comment #17 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to hubicka from comment #16)
> > > 
> > > It could be done, but I was under impression that the sequence to
> > > load 1.0f into topmost elements nullifies the benefit of operation
> > > to divide two
> > 
> > Sure, so perhaps we should somewhat increase the vectorization cost of
> > V2SFmode division so that we would use it only if it is part of longer
> > sequences?
> 
> I wonder how the hardware implements it.  If divps is of similar latency
> as divss then I guess it is essentially always win to load 1.0 to the
> upper part, since it is slow operation.  On the other hand if divps is
> about 4 times divss, then this may be harmful.
> 
> Agner Fog seems to be listing divss and divps with same latencies.
> For zen it is 10 cycles which should be enough to do the setup.

OK, I'll prepare and test a formal patch.
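
For readers following the thread, the idea discussed above can be sketched
with SSE intrinsics. This is only an illustration and not the patch itself:
the helper name div_v2sf is hypothetical, and the code GCC actually emits
for V2SFmode division may differ. The upper two lanes of both operands are
filled with 1.0f so that the single divps computes a harmless 1.0/1.0 there
instead of dividing leftover or zero values, which could yield NaNs or raise
spurious floating-point exceptions.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set1_ps, _mm_loadl_pi, ... */

/* Hypothetical helper (not from the bug report): divide two 2-element
   float vectors with a single 4-wide divps.  */
static void div_v2sf(const float a[2], const float b[2], float out[2])
{
    /* All four lanes hold 1.0f; the upper two act as padding.  */
    const __m128 ones = _mm_set1_ps(1.0f);

    /* _mm_loadl_pi overwrites the low two lanes with the values from
       memory and keeps the upper two lanes (1.0f) from its first argument.  */
    __m128 va = _mm_loadl_pi(ones, (const __m64 *)a);
    __m128 vb = _mm_loadl_pi(ones, (const __m64 *)b);

    /* One divps: lanes 0-1 compute a[i] / b[i], lanes 2-3 compute 1.0 / 1.0.  */
    __m128 q = _mm_div_ps(va, vb);

    /* Store only the low two lanes back to memory.  */
    _mm_storel_pi((__m64 *)out, q);
}
```

The extra cost relative to a scalar divss pair is the setup of the 1.0f
padding, which is the overhead weighed against the division latency in the
quoted discussion.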
