On Thu, May 4, 2017 at 10:59 AM, Robin Dapp <rd...@linux.vnet.ibm.com> wrote:
> Hi,
>
>> This one only works for known misalignment, otherwise it's overkill.
>>
>> OTOH if with some refactoring we can end up using a single cost model
>> that would be great.  That is, for the SAME_ALIGN_REFS we want to
>> choose the unknown misalignment with the maximum number of
>> SAME_ALIGN_REFS.  And if we know the misalignment of a single
>> ref then we still may want to align an unknown-misalign ref if that has
>> more SAME_ALIGN_REFS (I think we always choose the known-misalign
>> one currently).
>
> [0/3]
> Attempt to unify the peeling cost model as follows:
>
> - Keep the treatment of known misalignments.
>
> - Save the load and store with the most frequent misalignment.
>   Compare their costs and get the hardware-preferred one via costs.
>
> - Choose the best peeling from the best peeling with known
>   misalignment and the best with unknown misalignment according to
>   the number of aligned data refs.
>
> - Calculate costs for leaving everything misaligned and compare with
>   the best peeling so far.
So the new part is the last point?  There's a lot of refactoring in 3/3
that makes it hard to see what is actually changed ... you need to
resist doing this, it makes review very hard.

> I also performed some refactoring that seemed necessary during writing
> but which is not strictly necessary anymore ([1/3] and [2/3]) yet imho
> simplifies understanding the code.  The bulk of the changes is in [3/3].
>
> Testsuite on i386 and s390x is clean.  I guess some additional test
> cases won't hurt and I will add them later, however I didn't succeed
> in defining a test case with two datarefs with the same but unknown
> misalignment.  How can this be done?

a[i] += b[i] should have the load DR of a[i] have the same misalignment
as the store DR of a[i].  I think that's the only case (load/store pair)
where this happens.  We might want to enhance the machinery to have
a[i] and a[i+4] be recorded, for example in case the VF divides 4.
Richard's patch may have improved things here.

> A thing I did not understand when going over the existing code: In
> vect_get_known_peeling_cost() we have
>
>   /* If peeled iterations are known but number of scalar loop
>      iterations are unknown, count a taken branch per peeled loop.  */
>
>   retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
>                              NULL, 0, vect_prologue);
>   retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
>                              NULL, 0, vect_epilogue);
>
> In all uses of the function, prologue_cost_vec is discarded afterwards,
> only the return value is used.  Should the second statement read
> retval +=?  This is only executed when the number of loop iterations is
> unknown.  Currently we indeed count one taken branch, but why then
> execute record_stmt_cost twice or rather not discard the first retval?

Yes, it should be +=.  It's also somewhat odd code that should be
refactored given it is supposed to be only called when we know the
number of iterations to peel.
That is, we can't use it to get an estimate of the cost of peeling when
the number of prologue iterations is unknown (the
vect_estimate_min_profitable_iters code has this in a path not calling
vect_get_known_peeling_cost).

Can you try producing a simpler patch that does the last '-' only,
without all the rest?

+  /* At this point, we have to choose between peeling for the datarefs with
+     known alignment and the ones with unknown alignment.  Prefer the one
+     that aligns more datarefs in total.  */
+  struct data_reference *dr0 = NULL;
+  if (do_peeling)
     {

I think it's always best to align a ref with known alignment as that
simplifies conditions and allows followup optimizations (unrolling of
the prologue / epilogue).  I think for this it's better to also compute
full costs rather than relying on something as simple as the "number of
same-aligned refs".  Does the code ever end up misaligning a previously
known-aligned ref?

Thanks,
Richard.

>
> Regards
> Robin
>