On Mon, 12 Jul 2021, guojiufu wrote: > On 2021-07-12 18:02, Richard Biener wrote: > > On Mon, 12 Jul 2021, guojiufu wrote: > > > >> On 2021-07-12 16:57, Richard Biener wrote: > >> > On Mon, 12 Jul 2021, guojiufu wrote: > >> > > >> >> On 2021-07-12 14:20, Richard Biener wrote: > >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote: > >> >> > > >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote: > >> >> >> > I wonder if there's a way to query the target what modes the doloop > >> >> >> > pattern can handle (not being too familiar with the doloop code). > >> >> >> > >> >> >> You can look what modes are allowed for operand 0 of doloop_end, > >> >> >> perhaps? Although that is a define_expand, not a define_insn, so it > >> >> >> is > >> >> >> hard to introspect. > >> >> >> > >> >> >> > Why do you need to do any checks besides the new type being able to > >> >> >> > represent all IV values? The original doloop IV will never wrap > >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will > >> >> >> > become > >> >> >> > zero ... I suppose the doloop might still do the correct thing here > >> >> >> > but it also still will with a IV with larger type). > >> >> > >> >> The issue comes from U*_MAX (original short MAX), as you said: on which > >> >> niter + 1 becomes zero. And because the step for doloop is -1; then, on > >> >> larger type 'zero - 1' will be a very large number on larger type > >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small > >> >> value > >> >> (e.g. "0xff"). > >> > > >> > But for the larger type the small type MAX + 1 fits and does not yield > >> > zero so it should still work exactly as before, no? Of course you > >> > have to compute the + 1 in the larger type. > >> > > >> You are right, if compute the "+ 1" in the larger type it is ok, as below > >> code: > >> ``` > >> /* Use type in word size may fast. */ > >> if (TYPE_PRECISION (ntype) < BITS_PER_WORD) > >> { > >> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1); > >> niter = fold_convert (ntype, niter); > >> } > >> > >> tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter), > >> build_int_cst (ntype, 1)); > >> > >> > >> add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL, > >> true); > >> ``` > >> The issue of this is, this code generates more stmt for doloop.xxx: > >> _12 = (unsigned int) xx(D); > >> _10 = _12 + 4294967295; > >> _24 = (long unsigned int) _10; > >> doloop.6_8 = _24 + 1; > >> > >> if use previous patch, "+ 1" on original type, then the stmts will looks > >> like: > >> _12 = (unsigned int) xx(D); > >> doloop.6_8 = (long unsigned int) _12; > >> > >> This is the reason for checking > >> wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype))) > > > > But this then only works when there's an upper bound on the number > > of iterations. Note you should not use TYPE_MAX_VALUE here but > > you can instead use > > > > wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value > > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype)))); > > Ok, Thanks! > I remember you mentioned that: > widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN (ntype)), > TYPE_SIGN (ntype)) > would be better than > wi::to_widest (TYPE_MAX_VALUE (ntype)). > > It seems that: > "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK > (NODE)->type_non_common.maxval" > which do a numerical-check and return the field of maxval. And then call to > wi::to_widest > > The other code "widest_int::from (wi::max_value (..,..),..)" calls > wi::max_value > and widest_int::from. > > I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
TYPE_MAX_VALUE can be "suprising", it does not necessarily match the underlying modes precision. At some point we've tried to eliminate most of its uses, not sure what the situation/position is right now. > > I think the -1 above comes from number of latch iterations vs. header > > entries - it's a common source for this kind of issues. range analysis > > might be able to prove that we can still merge the two adds even with > > the intermediate extension. > Yes, as you mentioned here, it relates to number of latch iterations > For loop looks like : while (l < n) or for (i = 0; i < n; i++) > This kind of loop, the niter is used to be 'n - 1' after transformed > into 'do-while' form. > I would see how to merge these two adds safely at this point > when generating doloop iv. (maybe range info, thanks! > > > > > Is this pre-loop extra add really offsetting the in-loop doloop > > improvements? > I'm not catching this question too much, sorry. I guess your concern > is if the "+1" is an offset: it may not, "+1" may be just that doloop.xx > is decreasing niter until 0 (all number >0). > If misunderstand, thanks for point out. I'm questioning the argument that not being able to eliminate the +1-1 pair effects the overall cost improvement for the doloop IV type change which hopefully is _not_ just loop IV setup cost but improving performance in the loop body itself? Richard. > > > >> >> >> > >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap. > >> >> >> > >> >> >> > I'd have expected sth like > >> >> >> > > >> >> >> > ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED > >> >> >> > (ntype)); > >> >> >> > > >> >> >> > thus the decision made using a mode - which is also why I wonder > >> >> >> > if there's a way to query the target for this. As you say, > >> >> >> > it _may_ be fast, so better check (somehow). > >> >> > >> >> > >> >> I was also thinking of using hooks like type_for_size/type_for_mode. > >> >> /* Use type in word size may fast. */ > >> >> if (TYPE_PRECISION (ntype) < BITS_PER_WORD > >> >> && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE > >> >> (ntype)))) > >> >> { > >> >> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1); > >> >> base = fold_convert (ntype, base); > >> >> } > >> >> > >> >> As you pointed out, this does not query the mode from targets. > >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it > >> >> seems > >> >> not easy to use it in tree-ssa-loop-ivopts.c. > >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64 > >> >> requires > >> >> Pmode/DImode; while there are other targets that work on other 'mode' > >> >> (e.g. > >> >> SI). > >> >> > >> >> > >> >> In doloop_optimize, there is code: > >> >> > >> >> ``` > >> >> mode = desc->mode; > >> >> ..... > >> >> doloop_reg = gen_reg_rtx (mode); > >> >> rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, > >> >> start_label); > >> >> > >> >> word_mode_size = GET_MODE_PRECISION (word_mode); > >> >> word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - > >> >> 1; > >> >> if (! doloop_seq > >> >> && mode != word_mode > >> >> /* Before trying mode different from the one in that # of > >> >> iterations is > >> >> computed, we must be sure that the number of iterations fits > >> >> into > >> >> the new mode. */ > >> >> && (word_mode_size >= GET_MODE_PRECISION (mode) > >> >> || wi::leu_p (iterations_max, word_mode_max))) > >> >> { > >> >> if (word_mode_size > GET_MODE_PRECISION (mode)) > >> >> count = simplify_gen_unary (ZERO_EXTEND, word_mode, count, > >> >> mode); > >> >> else > >> >> count = lowpart_subreg (word_mode, count, mode); > >> >> PUT_MODE (doloop_reg, word_mode); > >> >> doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label); > >> >> } > >> >> if (! doloop_seq) > >> >> { > >> >> if (dump_file) > >> >> fprintf (dump_file, > >> >> "Doloop: Target unwilling to use doloop pattern!\n"); > >> >> return false; > >> >> } > >> >> ``` > >> >> The above code first tries the mode of niter_desc by call > >> >> targetm.gen_doloop_end > >> >> to see if the target can generate doloop insns, if fail, then try to use > >> >> 'word_mode' against gen_doloop_end. > >> >> > >> >> > >> >> >> > >> >> >> Almost all targets just use Pmode, but there is no such guarantee I > >> >> >> think, and esp. some targets that do not have machine insns for this > >> >> >> (but want to generate different code for this anyway) can do pretty > >> >> >> much > >> >> >> anything. > >> >> >> > >> >> >> Maybe using just Pmode here is good enough though? > >> >> > > >> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode > >> >> > if we go for any hardcoded mode. s390x for example seems to handle > >> >> > both SImode and DImode (but names the helper gen_doloop_si64 > >> >> > for SImode?!). But indeed it looks like somehow querying doloop_end > >> >> > is going to be difficult since the expander doesn't have any mode, > >> >> > so we'd have to actually try emit RTL here. > >> >> > >> >> Instead of using hardcode mode, maybe we could add a hook for targets to > >> >> return > >> >> the preferred mode. > >> > > >> > That's a possiblity of course. Like the following which just shows the > >> > default implementation then (pass in current mode, return a more > >> > preferred > >> > mode or the mode itself) > >> > > >> > enum machine_mode > >> > prefered_doloop_mode (enum machine_mode mode) > >> > { > >> > return mode; > >> > } > >> > > >> Yes, thanks! > >> > >> Checking current do_loop_end in targets, in general, when do_loop_end > >> requires > >> SI mode, the target is defining Pmode as SImode and word_mode (from > >> BITS_PER_WORD > >> which defaults from UNITS_PER_WORD) is also defined to align with SI mode. > >> When do_loop_end requires DI mode, the target is defining Pmode as DImode > >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode. > >> > >> So, if aggressively, then by default we may just return word_mode. > > > > Note we still have to check whether the prefered mode is valid > > (passing in TImode but then returning DImode would be wrong). > Yes, some code like in doloop_optimize (rtl code) > ``` > mode != word_mode > && (word_mode_size >= GET_MODE_PRECISION (mode) > || wi::leu_p (iterations_max, word_mode_max)) > ``` > > Thanks again for your comments! > > BR, > Jiufu Guo > > > > Richard. > > > >> BR, > >> > >> Jiufu Guo. > >> > >> > >> >> > >> >> Thanks for those valuable comments! > >> >> > >> >> Jiufu Guo > >> >> > >> >> > >> >> > >> >> > > >> >> > Richard. > >> >> > >> >> > >> > > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)