On Mon, Jan 7, 2019 at 5:55 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <l...@redhat.com> wrote:
> > >
> > > On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> > > >>
> > > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > > >>>>>
> > > >>>>> Did you mean "the nearest common dominator"?
> > > >>>>
> > > >>>> If the nearest common dominator appears in the loop while all uses 
> > > >>>> are
> > > >>>> out of loops, this will result in suboptimal xor placement.
> > > >>>> In this case you want to split edges out of the loop.
> > > >>>>
> > > >>>> In general this is what the LCM framework will do for you if the 
> > > >>>> problem
> > > >>>> is modelled siimlar way as in mode_swtiching.  At entry function 
> > > >>>> mode is
> > > >>>> "no zero register needed" and all conversions need mode "zero 
> > > >>>> register
> > > >>>> needed".  Mode switching should then do the correct placement 
> > > >>>> decisions
> > > >>>> (reaching minimal number of executions of xor).
> > > >>>>
> > > >>>> Jeff, whan is your optinion on the approach taken by the patch?
> > > >>>> It seems like a special case of more general issue, but I do not see
> > > >>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > > >>>> the placement is correct we can probalby go either with new pass or
> > > >>>> making this part of mode swithcing (which is anyway run by x86 
> > > >>>> backend)
> > > >>> So I haven't followed this discussion at all, but did touch on this
> > > >>> issue with some patch a month or two ago with a target patch that was
> > > >>> trying to avoid the partial stalls.
> > > >>>
> > > >>> My assumption is that we're trying to find one or more places to
> > > >>> initialize the upper half of an avx register so as to avoid partial
> > > >>> register stall at existing sites that set the upper half.
> > > >>>
> > > >>> This sounds like a classic PRE/LCM style problem (of which mode
> > > >>> switching is just another variant).   A common-dominator approach is
> > > >>> closer to a classic GCSE and is going to result is more 
> > > >>> initializations
> > > >>> at sub-optimal points than a PRE/LCM style.
> > > >>
> > > >> yes, it is usual code placement problem. It is special case because the
> > > >> zero register is not modified by the conversion (just we need to have
> > > >> zero somewhere).  So basically we do not have kills to the zero except
> > > >> for entry block.
> > > >>
> > > >
> > > > Do you have  testcase to show thatf the nearest common dominator
> > > > in the loop, while all uses areout of loops, leads to suboptimal xor
> > > > placement?
> > > I don't have a testcase, but it's all but certain nearest common
> > > dominator is going to be a suboptimal placement.  That's going to create
> > > paths where you're going to emit the xor when it's not used.
> > >
> > > The whole point of the LCM algorithms is they are optimal in terms of
> > > expression evaluations.
> >
> > We tried LCM and it didn't work well for this case.  LCM places a single
> > VXOR close to the location where it is needed, which can be inside a
> > loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> > solve
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> >
> > where VXOR is executed multiple times inside of a function, instead of
> > just once.   We are investigating to generate a single VXOR at entry of the
> > nearest dominator for basic blocks with SF/DF conversions, which is in
> > the the fake loop that contains the whole function:
> >
> >       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> >                                              convert_bbs);
> >       while (bb->loop_father->latch
> >              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> >         bb = get_immediate_dominator (CDI_DOMINATORS,
> >                                       bb->loop_father->header);
> >
> >       insn = BB_HEAD (bb);
> >       if (!NONDEBUG_INSN_P (insn))
> >         insn = next_nonnote_nondebug_insn (insn);
> >       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> >       set_insn = emit_insn_before (set, insn);
> >
>
> Here is the updated patch.  OK for trunk?
>

This is a GCC 8/9 regression:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007

PING:

https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html

-- 
H.J.

Reply via email to