On Wed, Aug 25, 2021 at 02:22:06PM -0400, Michael Meissner wrote:
> On Wed, Aug 25, 2021 at 12:44:16PM -0500, Segher Boessenkool wrote:
> > Out of interest, did you notice any scheduling differences with this?
>
> I don't use the built-ins so I wouldn't notice a difference. I noticed this as
> part of the next patch to add support for XXSPLTIDP (and ultimately XXSPLTIW in
> a future patch). The XXSPLTIDP instruction allows loading up many SFmode,
> DFmode, and V2DFmode constants. The XXSPLTIW instruction allows loading up
> certain V16QImode, V8HImode, V4SImode, and V4SFmode constants.

Yeah, you might notice scheduling differences when "normal" code starts
using this. Builtins are meh :-)

> However, I suspect if you aren't running spec on an otherwise
> idle machine, things will change where XXSPLTIDP will be more of a win by
> eliminating the loads.

It has more bandwidth and it is less contested, yup. It does have the
usual "prefixed" gotchas though.

> While XXSPLTIDP by itself is positive, unfortunately, there is a regression in
> cactuBSSN_r (3%) when I add XXSPLTIW (but not XXSPLTIDP) that I'm trying to
> track down.
>
> If I add both instructions, several of the benchmarks improve (including
> xalancbmk by 11% and x264_r by 27%), but cactuBSSN_r has the 3% regression and
> fotonik3d_r also has a new 3% regression.
>
> Given that many more programs use floating point constants than vector
> constants (66,000 XXSPLTID's created vs. 5,000 XXSPLTIW's), I figure to push
> the XXSPLTIDP now, and try to figure out the differences before submitting the
> XXSPLTIW patch.

It would be interesting to figure out a pattern behind the regressions :-)


Segher