https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #17 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to hubicka from comment #16)
> > > 
> > > It could be done, but I was under the impression that the sequence to
> > > load 1.0f into topmost elements nullifies the benefit of operation to
> > > divide two
> > 
> > Sure, so perhaps we should somewhat increase the vectorization cost of
> > V2SFmode division so that we would use it only if it is part of longer
> > sequences?
> 
> I wonder how the hardware implements it.  If divps has similar latency
> to divss, then I guess it is essentially always a win to load 1.0 into the
> upper part, since division is a slow operation.  On the other hand, if
> divps takes about 4 times as long as divss, then this may be harmful.
> 
> Agner Fog seems to list divss and divps with the same latencies.
> For Zen it is 10 cycles, which should be enough to do the setup.

OK, I'll prepare and test a formal patch.
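
For reference, the idea discussed above can be sketched with SSE intrinsics. This is only an illustration of the technique, not the code GCC would emit and not the patch itself: the helper name div_v2sf is hypothetical, and it assumes an x86 target with SSE. The two upper lanes of both operands are preloaded with 1.0f, so the full-width divps computes 1.0/1.0 there instead of dividing whatever happens to be in the unused half.

```c
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_loadl_pi, _mm_div_ps, _mm_storel_pi */

/* Hypothetical helper: divide two 2-element float vectors with one divps. */
static void div_v2sf(const float a[2], const float b[2], float out[2])
{
    /* All four lanes set to 1.0f; the upper two stay that way below. */
    const __m128 ones = _mm_set1_ps(1.0f);

    /* Load 64 bits (two floats) into the low half, keep 1.0f in the high half. */
    __m128 va = _mm_loadl_pi(ones, (const __m64 *)a);
    __m128 vb = _mm_loadl_pi(ones, (const __m64 *)b);

    /* One packed divide: the low lanes hold a[i] / b[i], the high lanes 1.0 / 1.0. */
    __m128 q = _mm_div_ps(va, vb);

    /* Store only the low two results. */
    _mm_storel_pi((__m64 *)out, q);
}
```

The extra work compared with a plain divps is the constant setup and the two merging loads, which is the "setup" cost being weighed against the divide latency in the quoted discussion.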
