On Mon, Jan 7, 2019 at 5:55 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <l...@redhat.com> wrote: > > > > > > On 11/28/18 12:48 PM, H.J. Lu wrote: > > > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubi...@ucw.cz> wrote: > > > >> > > > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote: > > > >>>>> > > > >>>>> Did you mean "the nearest common dominator"? > > > >>>> > > > >>>> If the nearest common dominator appears in the loop while all uses > > > >>>> are > > > >>>> out of loops, this will result in suboptimal xor placement. > > > >>>> In this case you want to split edges out of the loop. > > > >>>> > > > >>>> In general this is what the LCM framework will do for you if the > > > >>>> problem > > > >>>> is modelled siimlar way as in mode_swtiching. At entry function > > > >>>> mode is > > > >>>> "no zero register needed" and all conversions need mode "zero > > > >>>> register > > > >>>> needed". Mode switching should then do the correct placement > > > >>>> decisions > > > >>>> (reaching minimal number of executions of xor). > > > >>>> > > > >>>> Jeff, whan is your optinion on the approach taken by the patch? > > > >>>> It seems like a special case of more general issue, but I do not see > > > >>>> very elegant way to solve it at least in the GCC 9 horisont, so if > > > >>>> the placement is correct we can probalby go either with new pass or > > > >>>> making this part of mode swithcing (which is anyway run by x86 > > > >>>> backend) > > > >>> So I haven't followed this discussion at all, but did touch on this > > > >>> issue with some patch a month or two ago with a target patch that was > > > >>> trying to avoid the partial stalls. > > > >>> > > > >>> My assumption is that we're trying to find one or more places to > > > >>> initialize the upper half of an avx register so as to avoid partial > > > >>> register stall at existing sites that set the upper half. > > > >>> > > > >>> This sounds like a classic PRE/LCM style problem (of which mode > > > >>> switching is just another variant). A common-dominator approach is > > > >>> closer to a classic GCSE and is going to result is more > > > >>> initializations > > > >>> at sub-optimal points than a PRE/LCM style. > > > >> > > > >> yes, it is usual code placement problem. It is special case because the > > > >> zero register is not modified by the conversion (just we need to have > > > >> zero somewhere). So basically we do not have kills to the zero except > > > >> for entry block. > > > >> > > > > > > > > Do you have testcase to show thatf the nearest common dominator > > > > in the loop, while all uses areout of loops, leads to suboptimal xor > > > > placement? > > > I don't have a testcase, but it's all but certain nearest common > > > dominator is going to be a suboptimal placement. That's going to create > > > paths where you're going to emit the xor when it's not used. > > > > > > The whole point of the LCM algorithms is they are optimal in terms of > > > expression evaluations. > > > > We tried LCM and it didn't work well for this case. LCM places a single > > VXOR close to the location where it is needed, which can be inside a > > loop. There is nothing wrong with the LCM algorithms. But this doesn't > > solve > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007 > > > > where VXOR is executed multiple times inside of a function, instead of > > just once. We are investigating to generate a single VXOR at entry of the > > nearest dominator for basic blocks with SF/DF conversions, which is in > > the the fake loop that contains the whole function: > > > > bb = nearest_common_dominator_for_set (CDI_DOMINATORS, > > convert_bbs); > > while (bb->loop_father->latch > > != EXIT_BLOCK_PTR_FOR_FN (cfun)) > > bb = get_immediate_dominator (CDI_DOMINATORS, > > bb->loop_father->header); > > > > insn = BB_HEAD (bb); > > if (!NONDEBUG_INSN_P (insn)) > > insn = next_nonnote_nondebug_insn (insn); > > set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode)); > > set_insn = emit_insn_before (set, insn); > > > > Here is the updated patch. OK for trunk? >
This is a GCC 8/9 regression: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007 PING: https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html -- H.J.