Re: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency

H.J. Lu Tue, 22 Jan 2019 05:29:17 -0800

On Tue, Jan 22, 2019 at 4:08 AM Richard Biener
<[email protected]> wrote:
>
> On Mon, Jan 21, 2019 at 10:27 PM H.J. Lu <[email protected]> wrote:
> >
> > On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <[email protected]> wrote:
> > >
> > > On 1/7/19 6:55 AM, H.J. Lu wrote:
> > > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <[email protected]> wrote:
> > > >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <[email protected]> wrote:
> > > >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <[email protected]> wrote:
> > > >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > > >>>>>>>> Did you mean "the nearest common dominator"?
> > > >>>>>>> If the nearest common dominator appears in the loop while all 
> > > >>>>>>> uses are
> > > >>>>>>> out of loops, this will result in suboptimal xor placement.
> > > >>>>>>> In this case you want to split edges out of the loop.
> > > >>>>>>>
> > > >>>>>>> In general this is what the LCM framework will do for you if the 
> > > >>>>>>> problem
> > > >>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function 
> > > >>>>>>> mode is
> > > >>>>>>> "no zero register needed" and all conversions need mode "zero 
> > > >>>>>>> register
> > > >>>>>>> needed".  Mode switching should then do the correct placement 
> > > >>>>>>> decisions
> > > >>>>>>> (reaching minimal number of executions of xor).
> > > >>>>>>>
> > > >>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
> > > >>>>>>> It seems like a special case of more general issue, but I do not 
> > > >>>>>>> see
> > > >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > > >>>>>>> the placement is correct we can probalby go either with new pass 
> > > >>>>>>> or
> > > >>>>>>> making this part of mode swithcing (which is anyway run by x86 
> > > >>>>>>> backend)
> > > >>>>>> So I haven't followed this discussion at all, but did touch on this
> > > >>>>>> issue with some patch a month or two ago with a target patch that 
> > > >>>>>> was
> > > >>>>>> trying to avoid the partial stalls.
> > > >>>>>>
> > > >>>>>> My assumption is that we're trying to find one or more places to
> > > >>>>>> initialize the upper half of an avx register so as to avoid partial
> > > >>>>>> register stall at existing sites that set the upper half.
> > > >>>>>>
> > > >>>>>> This sounds like a classic PRE/LCM style problem (of which mode
> > > >>>>>> switching is just another variant).   A common-dominator approach 
> > > >>>>>> is
> > > >>>>>> closer to a classic GCSE and is going to result is more 
> > > >>>>>> initializations
> > > >>>>>> at sub-optimal points than a PRE/LCM style.
> > > >>>>> yes, it is usual code placement problem. It is special case because 
> > > >>>>> the
> > > >>>>> zero register is not modified by the conversion (just we need to 
> > > >>>>> have
> > > >>>>> zero somewhere).  So basically we do not have kills to the zero 
> > > >>>>> except
> > > >>>>> for entry block.
> > > >>>>>
> > > >>>> Do you have  testcase to show thatf the nearest common dominator
> > > >>>> in the loop, while all uses areout of loops, leads to suboptimal xor
> > > >>>> placement?
> > > >>> I don't have a testcase, but it's all but certain nearest common
> > > >>> dominator is going to be a suboptimal placement.  That's going to 
> > > >>> create
> > > >>> paths where you're going to emit the xor when it's not used.
> > > >>>
> > > >>> The whole point of the LCM algorithms is they are optimal in terms of
> > > >>> expression evaluations.
> > > >> We tried LCM and it didn't work well for this case.  LCM places a 
> > > >> single
> > > >> VXOR close to the location where it is needed, which can be inside a
> > > >> loop.  There is nothing wrong with the LCM algorithms.   But this 
> > > >> doesn't
> > > >> solve
> > > >>
> > > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> > > >>
> > > >> where VXOR is executed multiple times inside of a function, instead of
> > > >> just once.   We are investigating to generate a single VXOR at entry 
> > > >> of the
> > > >> nearest dominator for basic blocks with SF/DF conversions, which is in
> > > >> the the fake loop that contains the whole function:
> > > >>
> > > >>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> > > >>                                              convert_bbs);
> > > >>       while (bb->loop_father->latch
> > > >>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> > > >>         bb = get_immediate_dominator (CDI_DOMINATORS,
> > > >>                                       bb->loop_father->header);
> > > >>
> > > >>       insn = BB_HEAD (bb);
> > > >>       if (!NONDEBUG_INSN_P (insn))
> > > >>         insn = next_nonnote_nondebug_insn (insn);
> > > >>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > > >>       set_insn = emit_insn_before (set, insn);
> > > >>
> > > > Here is the updated patch.  OK for trunk?
> > > >
> > > > Thanks.
> > > >
> > > > -- H.J.
> > > >
> > > >
> > > > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> > > >
> > > > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> > > > From: "H.J. Lu" <[email protected]>
> > > > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > > > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> > > >
> > > > With -mavx, for
> > > >
> > > > $ cat foo.i
> > > > extern float f;
> > > > extern double d;
> > > > extern int i;
> > > >
> > > > void
> > > > foo (void)
> > > > {
> > > >   d = f;
> > > >   f = i;
> > > > }
> > > >
> > > > we need to generate
> > > >
> > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > >       ...
> > > >       vcvtss2sd       f(%rip), %xmmN, %xmmX
> > > >       ...
> > > >       vcvtsi2ss       i(%rip), %xmmN, %xmmY
> > > >
> > > > to avoid partial XMM register stall.  This patch adds a pass to generate
> > > > a single
> > > >
> > > >       vxorps          %xmmN, %xmmN, %xmmN
> > > >
> > > > at entry of the nearest dominator for basic blocks with SF/DF 
> > > > conversions,
> > > > which is in the fake loop that contains the whole function, instead of
> > > > generating one
> > > >
> > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > >
> > > > for each SF/DF conversion.
> > > >
> > > > NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> > > > inside the loop.  Simple testcase show this:
> > > >
> > > > $ cat badcase.c
> > > >
> > > > extern float f;
> > > > extern double d;
> > > >
> > > > void
> > > > foo (int n, int k)
> > > > {
> > > >   for (int j = 0; j != n; j++)
> > > >     if (j < k)
> > > >       d = f;
> > > > }
> > > >
> > > > It generates
> > > >
> > > >     ...
> > > >     loop:
> > > >       if(j < k)
> > > >         vxorps  %xmm0, %xmm0, %xmm0
> > > >         vcvtss2sd  %xmm1, %xmm0, %xmm0
> > > >       ...
> > > >     loopend
> > > >     ...
> > > >
> > > > This is because LCM only works when there is a certain benifit.  But for
> > > > conditional branch, LCM wouldn't move
> > > >
> > > >    vxorps  %xmm0, %xmm0, %xmm0
> > > It works this way for a reason.  There are obviously paths through the
> > > loop where the conversion does not happen and thus the vxor is not
> > > needed or desirable on those paths.
> > >
> > > That's a fundamental property of the LCM algorithm -- it never inserts
> > > an evaluation on a path through the CFG where it will not be used.
> > >
> > > Your algorithm of inserting into the dominator block will introduce
> > > runtime executions of the vxor on paths where it is not needed.
> > >
> > > It's well known that relaxing that property of LCM can result in better
> > > code generation in some circumstances.  Block copying and loop
> > > restructuring are the gold standard for dealing with this kind of problem.
> > >
> > > In this case you could split the iteration space so that you have two
> > > loops.  one for 0..k and the other for k..n.  Note that GCC has support
> > > for this kind of loop restructuring.  This has the advantage that the j
> > > < k test does not happen each iteration of the loop and the vxor stuff
> > > via LCM would be optimal.
> > >
> > > There's many other cases where copying and restructuring results in
> > > better common subexpression elimination (which is what you're doing).
> > > Probably the best work I've seen in this space is Bodik's thesis.
> > > Click's work from '95 touches on some of this as well, but isn't as
> > > relevant to this specific instance.
> > >
> > > Anyway, whether or not the patch should move forward is really up to Jan
> > > (and Uros if he wants to be involved) I think.  I'm not fundamentally
> > > opposed to HJ's approach as I'm aware of the different tradeoffs.
> > >
> > > HJ's approach of pulling into the dominator block can result in
> > > unnecessary evaluations.  But it can also reduce the number of
> > > evaluations in other cases.  It really depends on the runtime behavior
> > > of the code.   One could argue that the vxor stuff we're talking about
> > > is most likely happening in loops, and probably not in deeply nested
> > > control structures within those loops.  Thus pulling them out more
> > > aggressively ala LICM may be better than LCM.
> >
> > True, there is a trade-off.   My approach inserts a vxorps at the last 
> > possible
> > position.  Yes, vxorps will always be executed even if it may not be 
> > required.
> > Since it is executed only once in all cases, it is a win overall.
>
> Hopefully a simple vpxor won't end up powering up the other AVX512
> unit if it lay dormant ...


A 128-bit AVX vpxor won't touch AVX512.

> And if we ever get to the state of having two separate ISAs in the same
> function then you'd need to make sure you can execute vpxor in the
> place you are inserting since it may now be executed when it wasn't
> before (and I assume you already check that you do not zero the
> reg if there's a value life in it if the conditional def you are instrumenting
> is not executed).

A dedicated pseudo register is allocated and zeroed for INT->FP and
FP->FP conversions.  IRA/LRA take care of the rest.


-- 
H.J.

Re: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency

Reply via email to