On Wed, Aug 25, 2021 at 02:22:06PM -0400, Michael Meissner wrote:
> On Wed, Aug 25, 2021 at 12:44:16PM -0500, Segher Boessenkool wrote:
> > Out of interest, did you notice any scheduling differences with this?
> 
> I don't use the built-ins so I wouldn't notice a difference.  I noticed this as
> part of the next patch to add support for XXSPLTIDP (and ultimately XXSPLTIW in
> a future patch).  The XXSPLTIDP instruction allows loading up many SFmode,
> DFmode, and V2DFmode constants.  The XXSPLTIW instruction allows loading up
> certain V16QImode, V8HImode, V4SImode, and V4SFmode constants.
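
To make "many" concrete: as far as I read the ISA 3.1 description,
XXSPLTIDP takes a 32-bit single-precision immediate, widens it to double
precision, and splats it, and the result is undefined when the immediate
is a single-precision denormal.  A minimal sketch of such a test follows.
The name fits_sf_immediate is made up for illustration and is not the
predicate in the patch.  It also rejects NaNs and infinities, which is a
simplification on my part.

```c
#include <float.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not GCC's actual predicate.  Return true if VALUE
   can be produced by widening a 32-bit single-precision immediate to
   double precision, and store that immediate in *IMM32.  */
static bool
fits_sf_immediate (double value, uint32_t *imm32)
{
  /* Assumption: be conservative and reject NaNs.  */
  if (value != value)
    return false;

  /* Reject anything outside the finite float range (this also rejects
     infinities, which keeps the narrowing cast below well defined).  */
  if (value > FLT_MAX || value < -FLT_MAX)
    return false;

  /* The constant qualifies only if narrowing to float loses nothing.  */
  float narrowed = (float) value;
  if ((double) narrowed != value)
    return false;

  /* Reject single-precision denormals; zero itself is fine.  */
  if (narrowed != 0.0f && narrowed > -FLT_MIN && narrowed < FLT_MIN)
    return false;

  /* Copy out the raw IEEE single-precision bit pattern.  */
  memcpy (imm32, &narrowed, sizeof *imm32);
  return true;
}
```

With this check, constants like 1.0, 0.5, and -2.0 qualify, while 0.1
does not, because it is not exactly representable in single precision.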

Yeah, you might notice scheduling differences when "normal" code starts
using this.  Builtins are meh :-)

> However, I suspect if you aren't running spec on an otherwise
> idle machine, things will change where XXSPLTIDP will be more of a win by
> eliminating the loads.

It has more bandwidth and it is less contested, yup.  It does have the
usual "prefixed" gotchas though.

> While XXSPLTIDP by itself is positive, unfortunately, there is a regression in
> cactuBSSN_r (3%) when I add XXSPLTIW (but not XXSPLTIDP) that I'm trying to
> track down.
> 
> If I add both instructions, several of the benchmarks improve (including
> xalancbmk by 11% and x264_r by 27%), but cactuBSSN_r has the 3% regression and
> fotonik3d_r also has a new 3% regression.
> 
> Given that many more programs use floating point constants than vector
> constants (66,000 XXSPLTIDP's created vs. 5,000 XXSPLTIW's), I figure to push
> the XXSPLTIDP now, and try to figure out the differences before submitting the
> XXSPLTIW patch.

It would be interesting to figure out a pattern behind the regressions :-)


Segher
