On October 21, 2019 4:06:49 PM GMT+02:00, Richard Sandiford <richard.sandif...@arm.com> wrote:
>Richard Biener <richard.guent...@gmail.com> writes:
>> On Mon, Oct 21, 2019 at 12:08 PM Richard Sandiford
>> <richard.sandif...@arm.com> wrote:
>>>
>>> Richard Biener <richard.guent...@gmail.com> writes:
>>> > On October 20, 2019 2:54:48 PM GMT+02:00, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>> >>Richard Biener <richard.guent...@gmail.com> writes:
>>> >>> On October 19, 2019 5:06:40 PM GMT+02:00, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>> >>>>After the previous patch, it seems more natural to apply the
>>> >>>>PARAM_SLP_MAX_INSNS_IN_BB threshold as soon as we know what
>>> >>>>the region is, rather than delaying it to vect_slp_analyze_bb_1.
>>> >>>>(But rather than carve out the biggest region possible and then
>>> >>>>reject it, wouldn't it be better to stop when the region gets
>>> >>>>too big, so we at least have a chance of vectorising something?)
>>> >>>>
>>> >>>>It also seems more natural for vect_slp_bb_region to create the
>>> >>>>bb_vec_info itself rather than (a) having to pass bits of data down
>>> >>>>for the initialisation and (b) forcing vect_slp_analyze_bb_1 to free
>>> >>>>it on every failure return.
>>> >>>>
>>> >>>>Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>>> >>>
>>> >>> Ok.  But I wonder what the reason was for this limit?  Dependency
>>> >>> analysis was greatly simplified, being no longer quadratic here.
>>> >>> Can you check where the limit originally came from?  But indeed
>>> >>> splitting the region makes more sense then, but at dataref group
>>> >>> boundaries I'd say.
>>> >>
>>> >>Yeah, looks like it was the complexity of dependence analysis:
>>> >>
>>> >>  https://gcc.gnu.org/ml/gcc-patches/2009-05/msg01303.html
>>> >
>>> > OK.  We no longer compute dependences between all memory refs but
>>> > only verify we can do those code-motions we need to do.
>>> > That's of course much harder to check a bound on upfront (it's
>>> > still quadratic in the worst case).  I'm also not sure this is
>>> > ever a problem, but we might instead count the number of stmts
>>> > involving memory?
>>>
>>> OK, here's what I tried.  It exposes quadratic behaviour for
>>> gcc.dg/pr87985.c on x86_64 though.  To make things worse, this is
>>> all in the name of "vectorising" using V1DI. :-)
>>>
>>> In that test we have N data references in which the even elements
>>> appear to be vectorisable but the odd elements don't.  First time
>>> round, we hit:
>>>
>>>       /* For basic block SLP, try to break the group up into
>>>          multiples of the vector size.  */
>>>       unsigned HOST_WIDE_INT const_nunits;
>>>       if (is_a <bb_vec_info> (vinfo)
>>>           && STMT_VINFO_GROUPED_ACCESS (stmt_info)
>>>           && DR_GROUP_FIRST_ELEMENT (stmt_info)
>>>           && nunits.is_constant (&const_nunits))
>>>
>>> This breaks off the leading even element and recurses, discards the
>>> leading odd element, and then recurses on the rest.  We then come
>>> round to the same code when recursing on the rest.
>>>
>>> It looks like matches is valid for all elements, so I guess we
>>> should divide the group based on all false elements, not just the
>>> first?
>>
>> What do you mean with "valid for all elements"?  matches[i] is
>> true or false for all elements?  If it is true we shouldn't split
>> at all.
>
> Initially I'd wondered whether the elements were undefined after the
> first "false", i.e. if we broke early at that point.  But it looks
> like true and false elements after the first false are still
> meaningful.
>
>> But indeed the recursion is bad if we split out one vector a step,
>> we should split the whole group in "half" instead.
>>
>> Still the code should ensure we have a "half" that works OK
>> and one that possibly doesn't.
>>
>> OK, I see we have {true, true, false <repeats ... times> } for
>> matches but the 2nd iteration gets {true, false, ... } and doesn't
>> split for me.
>> Of course if you have V1DI then you'll notice
>> that matches[0] is _always_ true ...  I guess we should simply
>> never try splitting for const_nunits == 1?  Or even never
>> do BB vectorization with such a vector type.
>
> I don't think it's specific to const_nunits == 1.  E.g. if we have
> matches == { true, true, false, true } repeating for const_nunits == 2,
> we'll have the same problem of trying:
>
>   N > 2, N-4 > 2, N-8 > 2, N-12 > 2, N-16 > 2, ...
>
> which is still quadratic.
Yeah.

> Dividing genuinely into half would fix that, so would dividing the
> whole group based on the group1_size.

I think the smaller split was meant as an optimization and originally
we didn't recurse on the rest...  So yes, simply halving might help.

> Or we could try using the whole matches array to divide the group up
> into chunks that are worth recursing on, with each chunk having at
> most half the original group size and with the matches elements for
> each chunk being all-true or all-false.
>
> Thanks,
> Richard