https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #17 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to hubicka from comment #16)
> > > It could be done, but I was under impression that the sequence to load 1.0f
> > > into topmost elements nullifies the benefit of operation to divide two
> >
> > Sure, so perhaps we should somewhat increase the vectorization cost of V2SFmode
> > division so that we would use it only if it is part of longer sequences?
>
> I wonder how the hardware implements it.  If divps is of similar latency
> as divss then I guess it is essentially always win to load 1.0 to the
> upper part, since it is slow operation.  On the other hand if divps is
> about 4 times divss, then this may be harmful.
>
> Agner Fog seems to be listing divss and divps with same latencies.
> For zen it is 10 cycles which should be enough to do the setup.

OK, I'll prepare and test a formal patch.
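
For readers following the thread, here is a rough sketch of the idea being discussed: doing a two-element float division with a single four-wide division whose unused upper divisor lanes are set to 1.0f. It uses the GCC/Clang vector extension rather than anything from the actual patch. The helper name div_v2sf is hypothetical, and the zero padding of the dividend is an assumption made only so the unused lanes compute a harmless 0.0/1.0.

```c
/* Four packed floats, using the GCC/Clang vector_size extension. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper (not from the GCC patch): divide two 2-element
   float vectors with one 4-wide division. */
void div_v2sf(const float a[2], const float b[2], float out[2])
{
    /* Upper dividend lanes are padding; zero is an assumed choice. */
    v4sf num = { a[0], a[1], 0.0f, 0.0f };

    /* Upper divisor lanes are loaded with 1.0f, as discussed in the
       thread, so the unused lanes never divide by zero. */
    v4sf den = { b[0], b[1], 1.0f, 1.0f };

    /* One packed division; on x86 with SSE this can map to divps. */
    v4sf quot = num / den;

    /* Only the two low lanes carry the result. */
    out[0] = quot[0];
    out[1] = quot[1];
}
```

Whether this beats two scalar divisions depends on the relative cost of the packed division and the extra setup, which is the question the quoted latencies are meant to answer.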