Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

Richard Biener Mon, 12 Jul 2021 07:46:37 -0700

On Mon, 12 Jul 2021, guojiufu wrote:

> On 2021-07-12 18:02, Richard Biener wrote:
> > On Mon, 12 Jul 2021, guojiufu wrote:
> > 
> >> On 2021-07-12 16:57, Richard Biener wrote:
> >> > On Mon, 12 Jul 2021, guojiufu wrote:
> >> >
> >> >> On 2021-07-12 14:20, Richard Biener wrote:
> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
> >> >> >
> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
> >> >> >> > I wonder if there's a way to query the target what modes the doloop
> >> >> >> > pattern can handle (not being too familiar with the doloop code).
> >> >> >>
> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it
> >> >> >> is
> >> >> >> hard to introspect.
> >> >> >>
> >> >> >> > Why do you need to do any checks besides the new type being able to
> >> >> >> > represent all IV values?  The original doloop IV will never wrap
> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
> >> >> >> > become
> >> >> >> > zero ... I suppose the doloop might still do the correct thing here
> >> >> >> > but it also still will with a IV with larger type).
> >> >>
> >> >> The issue comes from U*_MAX (original short MAX), as you said: on which
> >> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
> >> >> larger type 'zero - 1' will be a very large number on larger type
> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
> >> >> value
> >> >> (e.g. "0xff").
> >> >
> >> > But for the larger type the small type MAX + 1 fits and does not yield
> >> > zero so it should still work exactly as before, no?  Of course you
> >> > have to compute the + 1 in the larger type.
> >> >
> >> You are right, if compute the "+ 1" in the larger type it is ok, as below
> >> code:
> >> ```
> >>    /* Use type in word size may fast.  */
> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
> >>       {
> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
> >>         niter = fold_convert (ntype, niter);
> >>       }
> >> 
> >>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
> >>                              build_int_cst (ntype, 1));
> >> 
> >> 
> >>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
> >> true);
> >> ```
> >> The issue of this is, this code generates more stmt for doloop.xxx:
> >>   _12 = (unsigned int) xx(D);
> >>   _10 = _12 + 4294967295;
> >>   _24 = (long unsigned int) _10;
> >>   doloop.6_8 = _24 + 1;
> >> 
> >> if use previous patch, "+ 1" on original type, then the stmts will looks
> >> like:
> >>   _12 = (unsigned int) xx(D);
> >>   doloop.6_8 = (long unsigned int) _12;
> >> 
> >> This is the reason for checking
> >>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
> > 
> > But this then only works when there's an upper bound on the number
> > of iterations.  Note you should not use TYPE_MAX_VALUE here but
> > you can instead use
> > 
> >      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
> 
> Ok, Thanks!
> I remember you mentioned that:
> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN (ntype)),
> TYPE_SIGN (ntype))
> would be better than
> wi::to_widest (TYPE_MAX_VALUE (ntype)).
> 
> It seems that:
> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
> (NODE)->type_non_common.maxval"
> which do a numerical-check and return the field of maxval.  And then call to
> wi::to_widest
> 
> The other code "widest_int::from (wi::max_value (..,..),..)" calls
> wi::max_value
> and widest_int::from.
> 
> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?


TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
underlying modes precision.  At some point we've tried to eliminate
most of its uses, not sure what the situation/position is right now.

> > I think the -1 above comes from number of latch iterations vs. header
> > entries - it's a common source for this kind of issues.  range analysis
> > might be able to prove that we can still merge the two adds even with
> > the intermediate extension.
> Yes, as you mentioned here, it relates to number of latch iterations
> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
> This kind of loop, the niter is used to be 'n - 1' after transformed
> into 'do-while' form.
> I would see how to merge these two adds safely at this point
> when generating doloop iv. (maybe range info, thanks!
>
> > 
> > Is this pre-loop extra add really offsetting the in-loop doloop
> > improvements?
> I'm not catching this question too much, sorry.  I guess your concern
> is if the "+1" is an offset: it may not, "+1" may be just that doloop.xx
> is decreasing niter until 0 (all number >0).
> If misunderstand,  thanks for point out.

I'm questioning the argument that not being able to eliminate the +1-1
pair effects the overall cost improvement for the doloop IV type change
which hopefully is _not_ just loop IV setup cost but improving
performance in the loop body itself?

Richard.
 
> > 
> >> >> >>
> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
> >> >> >>
> >> >> >> > I'd have expected sth like
> >> >> >> >
> >> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
> >> >> >> > (ntype));
> >> >> >> >
> >> >> >> > thus the decision made using a mode - which is also why I wonder
> >> >> >> > if there's a way to query the target for this.  As you say,
> >> >> >> > it _may_ be fast, so better check (somehow).
> >> >>
> >> >>
> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
> >> >>     /* Use type in word size may fast.  */
> >> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
> >> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
> >> >> (ntype))))
> >> >>       {
> >> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
> >> >>         base = fold_convert (ntype, base);
> >> >>       }
> >> >>
> >> >> As you pointed out, this does not query the mode from targets.
> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
> >> >> seems
> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
> >> >> requires
> >> >> Pmode/DImode; while there are other targets that work on other 'mode'
> >> >> (e.g.
> >> >> SI).
> >> >>
> >> >>
> >> >> In doloop_optimize, there is code:
> >> >>
> >> >> ```
> >> >>     mode = desc->mode;
> >> >> .....
> >> >>     doloop_reg = gen_reg_rtx (mode);
> >> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
> >> >> start_label);
> >> >>
> >> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
> >> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
> >> >> 1;
> >> >>     if (! doloop_seq
> >> >>         && mode != word_mode
> >> >>         /* Before trying mode different from the one in that # of
> >> >> iterations is
> >> >>            computed, we must be sure that the number of iterations fits
> >> >> into
> >> >>            the new mode.  */
> >> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
> >> >>             || wi::leu_p (iterations_max, word_mode_max)))
> >> >>       {
> >> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
> >> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
> >> >>         mode);
> >> >>         else
> >> >>           count = lowpart_subreg (word_mode, count, mode);
> >> >>         PUT_MODE (doloop_reg, word_mode);
> >> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
> >> >>       }
> >> >>     if (! doloop_seq)
> >> >>       {
> >> >>         if (dump_file)
> >> >>           fprintf (dump_file,
> >> >>                    "Doloop: Target unwilling to use doloop pattern!\n");
> >> >>         return false;
> >> >>       }
> >> >> ```
> >> >> The above code first tries the mode of niter_desc by call
> >> >> targetm.gen_doloop_end
> >> >> to see if the target can generate doloop insns, if fail, then try to use
> >> >> 'word_mode' against gen_doloop_end.
> >> >>
> >> >>
> >> >> >>
> >> >> >> Almost all targets just use Pmode, but there is no such guarantee I
> >> >> >> think, and esp. some targets that do not have machine insns for this
> >> >> >> (but want to generate different code for this anyway) can do pretty
> >> >> >> much
> >> >> >> anything.
> >> >> >>
> >> >> >> Maybe using just Pmode here is good enough though?
> >> >> >
> >> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
> >> >> > if we go for any hardcoded mode.  s390x for example seems to handle
> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
> >> >> > for SImode?!).  But indeed it looks like somehow querying doloop_end
> >> >> > is going to be difficult since the expander doesn't have any mode,
> >> >> > so we'd have to actually try emit RTL here.
> >> >>
> >> >> Instead of using hardcode mode, maybe we could add a hook for targets to
> >> >> return
> >> >> the preferred mode.
> >> >
> >> > That's a possiblity of course.  Like the following which just shows the
> >> > default implementation then (pass in current mode, return a more
> >> > preferred
> >> > mode or the mode itself)
> >> >
> >> > enum machine_mode
> >> > prefered_doloop_mode (enum machine_mode mode)
> >> > {
> >> >   return mode;
> >> > }
> >> >
> >> Yes, thanks!
> >> 
> >> Checking current do_loop_end in targets, in general, when do_loop_end
> >> requires
> >> SI mode, the target is defining Pmode as SImode and word_mode (from
> >> BITS_PER_WORD
> >> which defaults from UNITS_PER_WORD) is also defined to align with SI mode.
> >> When do_loop_end requires DI mode, the target is defining Pmode as DImode
> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
> >> 
> >> So, if aggressively, then by default we may just return word_mode.
> > 
> > Note we still have to check whether the prefered mode is valid
> > (passing in TImode but then returning DImode would be wrong).
> Yes, some code like in doloop_optimize (rtl code)
> ```
> mode != word_mode
> && (word_mode_size >= GET_MODE_PRECISION (mode)
>     || wi::leu_p (iterations_max, word_mode_max))
> ```
> 
> Thanks again for your comments!
> 
> BR,
> Jiufu Guo
> > 
> > Richard.
> > 
> >> BR,
> >> 
> >> Jiufu Guo.
> >> 
> >> 
> >> >>
> >> >> Thanks for those valuable comments!
> >> >>
> >> >> Jiufu Guo
> >> >>
> >> >>
> >> >>
> >> >> >
> >> >> > Richard.
> >> >>
> >> >>
> >> 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

Reply via email to