Re: [RFC] S/390: Alignment peeling prolog generation
On Thu, May 11, 2017 at 2:15 PM, Richard Biener wrote:
> On Thu, May 11, 2017 at 2:14 PM, Richard Biener wrote:
>> On Thu, May 11, 2017 at 1:15 PM, Robin Dapp wrote:
>>> Included the requested changes in the patches (to follow). I removed
>>> the alignment count check now altogether.
>>>
>>>> I'm not sure why you test for unlimited_cost_model here as I said
>>>> elsewhere I'm not sure what not cost modeling means for static
>>>> decisions.  The purpose of unlimited_cost_model is to always
>>>> vectorize when possible and omit the runtime profitability check.
>>>> So for peeling I'd just always use the cost model.  Thus please
>>>> drop this check.
>>>
>>> Without that, I get one additional FAIL gcc.dg/vect/slp-25.c for x86.
>>> It is caused by choosing no peeling (inside costs 0) over peeling for
>>> known alignment with unlimited cost model (inside costs 0 as well).
>>> Costs 0 for no peeling are caused by count == 0 or rather ncopies = vf /
>>> nunits == 4 / 8 == 0 in record_stmt_cost ().  Shouldn't ncopies > 0
>>> always hold?  Even 0.5 would have worked here to make no peeling more
>>> expensive than 0.
>>
>> That's odd.  I can't get
>>
>> Index: gcc/tree-vect-stmts.c
>> ===================================================================
>> --- gcc/tree-vect-stmts.c   (revision 247882)
>> +++ gcc/tree-vect-stmts.c   (working copy)
>> @@ -95,6 +96,7 @@ record_stmt_cost (stmt_vector_for_cost *
>>                enum vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>>                int misalign, enum vect_cost_model_location where)
>>  {
>> +  gcc_assert (count > 0);
>>    if (body_cost_vec)
>>      {
>>        tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;
>>
>> to ICE with the testcase (unpatched trunk)
>>
>> Where's that record_stmt_cost call done?  You can't simply use vf/nunits
>> for SLP.
>
> Ah, of course needs -fvect-cost-model.
>
> I'll investigate.

Ugh.  The vect_peeling_hash_get_lowest_cost isn't handling SLP in any
way, there's quite some refactoring necessary to fix that.

I suggest (eh) to do

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c   (revision 247734)
+++ gcc/tree-vect-data-refs.c   (working copy)
@@ -1129,7 +1129,7 @@ vect_get_data_access_cost (struct data_r
   int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
   int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
-  int ncopies = vf / nunits;
+  int ncopies = MAX (1, vf / nunits); /* TODO: Handle SLP properly */

   if (DR_IS_READ (dr))
     vect_get_load_cost (dr, ncopies, true, inside_cost, outside_cost,

> Richard.
>
>> Richard.
>>
>>> Test suite on s390x is clean.
>>>
>>> Regards
>>>  Robin
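The zero-cost anomaly discussed above is plain C integer truncation: with vf = 4 and nunits = 8, vf / nunits is 0, so every cost recorded with that count vanishes. A minimal illustration of the truncation and of the MAX (1, ...) clamp from the suggested patch (the helper names here are invented for the example, they are not the GCC internals):

```c
#include <assert.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Illustrative stand-in for the ncopies computation in
   vect_get_data_access_cost.  */
static int
ncopies_truncated (int vf, int nunits)
{
  /* Integer division truncates toward zero, so vf < nunits yields 0
     and all recorded costs are multiplied away.  */
  return vf / nunits;
}

static int
ncopies_clamped (int vf, int nunits)
{
  /* The suggested workaround: never let the copy count reach zero.  */
  return MAX (1, vf / nunits);
}
```

With the clamp, the no-peeling variant can no longer come out at cost 0 merely because the copy count truncated away.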
Re: [RFC] S/390: Alignment peeling prolog generation
On Thu, May 11, 2017 at 2:14 PM, Richard Biener wrote:
> On Thu, May 11, 2017 at 1:15 PM, Robin Dapp wrote:
>> Included the requested changes in the patches (to follow). I removed
>> the alignment count check now altogether.
>>
>>> I'm not sure why you test for unlimited_cost_model here as I said
>>> elsewhere I'm not sure what not cost modeling means for static
>>> decisions.  The purpose of unlimited_cost_model is to always
>>> vectorize when possible and omit the runtime profitability check.
>>> So for peeling I'd just always use the cost model.  Thus please
>>> drop this check.
>>
>> Without that, I get one additional FAIL gcc.dg/vect/slp-25.c for x86.
>> It is caused by choosing no peeling (inside costs 0) over peeling for
>> known alignment with unlimited cost model (inside costs 0 as well).
>> Costs 0 for no peeling are caused by count == 0 or rather ncopies = vf /
>> nunits == 4 / 8 == 0 in record_stmt_cost ().  Shouldn't ncopies > 0
>> always hold?  Even 0.5 would have worked here to make no peeling more
>> expensive than 0.
>
> That's odd.  I can't get
>
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c   (revision 247882)
> +++ gcc/tree-vect-stmts.c   (working copy)
> @@ -95,6 +96,7 @@ record_stmt_cost (stmt_vector_for_cost *
>                enum vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>                int misalign, enum vect_cost_model_location where)
>  {
> +  gcc_assert (count > 0);
>    if (body_cost_vec)
>      {
>        tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;
>
> to ICE with the testcase (unpatched trunk)
>
> Where's that record_stmt_cost call done?  You can't simply use vf/nunits
> for SLP.

Ah, of course needs -fvect-cost-model.

I'll investigate.

Richard.

> Richard.
>
>> Test suite on s390x is clean.
>>
>> Regards
>>  Robin
Re: [RFC] S/390: Alignment peeling prolog generation
On Thu, May 11, 2017 at 1:15 PM, Robin Dapp wrote:
> Included the requested changes in the patches (to follow). I removed
> the alignment count check now altogether.
>
>> I'm not sure why you test for unlimited_cost_model here as I said
>> elsewhere I'm not sure what not cost modeling means for static
>> decisions.  The purpose of unlimited_cost_model is to always
>> vectorize when possible and omit the runtime profitability check.
>> So for peeling I'd just always use the cost model.  Thus please
>> drop this check.
>
> Without that, I get one additional FAIL gcc.dg/vect/slp-25.c for x86.
> It is caused by choosing no peeling (inside costs 0) over peeling for
> known alignment with unlimited cost model (inside costs 0 as well).
> Costs 0 for no peeling are caused by count == 0 or rather ncopies = vf /
> nunits == 4 / 8 == 0 in record_stmt_cost ().  Shouldn't ncopies > 0
> always hold?  Even 0.5 would have worked here to make no peeling more
> expensive than 0.

That's odd.  I can't get

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c   (revision 247882)
+++ gcc/tree-vect-stmts.c   (working copy)
@@ -95,6 +96,7 @@ record_stmt_cost (stmt_vector_for_cost *
              enum vect_cost_for_stmt kind, stmt_vec_info stmt_info,
              int misalign, enum vect_cost_model_location where)
 {
+  gcc_assert (count > 0);
  if (body_cost_vec)
    {
      tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;

to ICE with the testcase (unpatched trunk)

Where's that record_stmt_cost call done?  You can't simply use vf/nunits
for SLP.

Richard.

> Test suite on s390x is clean.
>
> Regards
>  Robin
Re: [RFC] S/390: Alignment peeling prolog generation
Included the requested changes in the patches (to follow). I removed
the alignment count check now altogether.

> I'm not sure why you test for unlimited_cost_model here as I said
> elsewhere I'm not sure what not cost modeling means for static
> decisions.  The purpose of unlimited_cost_model is to always vectorize
> when possible and omit the runtime profitability check.  So for
> peeling I'd just always use the cost model.  Thus please drop this
> check.

Without that, I get one additional FAIL gcc.dg/vect/slp-25.c for x86.
It is caused by choosing no peeling (inside costs 0) over peeling for
known alignment with unlimited cost model (inside costs 0 as well).
Costs 0 for no peeling are caused by count == 0 or rather ncopies = vf /
nunits == 4 / 8 == 0 in record_stmt_cost ().  Shouldn't ncopies > 0
always hold?  Even 0.5 would have worked here to make no peeling more
expensive than 0.

Test suite on s390x is clean.

Regards
 Robin
Re: [RFC] S/390: Alignment peeling prolog generation
On Mon, May 8, 2017 at 6:11 PM, Robin Dapp wrote:
>> So the new part is the last point?  There's a lot of refactoring in
>> 3/3 that makes it hard to see what is actually changed ... you need
>> to resist in doing this, it makes review very hard.
>
> The new part is actually spread across the three last "-"s.  Attached is
> a new version of [3/3] split up into two patches with hopefully less
> blending of refactoring and new functionality.
>
> [3/4] Computes full costs when peeling for unknown alignment, uses
> either read or write and compares the better one with the peeling costs
> for known alignment.  If the peeling for unknown alignment "aligns" more
> than twice the number of datarefs, it is preferred over the peeling for
> known alignment.
>
> [4/4] Computes the costs for no peeling and compares them with the costs
> of the best peeling so far.  If it is not more expensive, no peeling
> will be performed.
>
>> I think it's always best to align a ref with known alignment as that
>> simplifies conditions and allows followup optimizations (unrolling of
>> the prologue / epilogue).
>> I think for this it's better to also compute full costs rather than
>> relying on sth as simple as "number of same aligned refs".
>>
>> Does the code ever end up misaligning a previously known aligned ref?
>
> The following case used to get aligned via the known alignment of dd but
> would not anymore since peeling for unknown alignment aligns two
> accesses.  I guess the determining factor is still up for scrutiny and
> should probably be > 2.  Still, on e.g. s390x no peeling is performed
> due to costs.

Ok, in principle this makes sense if we manage to correctly compute the
costs.  What exactly is profitable or not is of course subject to the
target costs.

Richard.

> void foo(int *restrict a, int *restrict b, int *restrict c,
>          int *restrict d, unsigned int n)
> {
>   int *restrict dd = __builtin_assume_aligned (d, 8);
>   for (unsigned int i = 0; i < n; i++)
>     {
>       b[i] = b[i] + a[i];
>       c[i] = c[i] + b[i];
>       dd[i] = a[i];
>     }
> }
>
> Regards
>  Robin
Re: [RFC] S/390: Alignment peeling prolog generation
> So the new part is the last point?  There's a lot of refactoring in
> 3/3 that makes it hard to see what is actually changed ... you need
> to resist in doing this, it makes review very hard.

The new part is actually spread across the three last "-"s.  Attached is
a new version of [3/3] split up into two patches with hopefully less
blending of refactoring and new functionality.

[3/4] Computes full costs when peeling for unknown alignment, uses
either read or write and compares the better one with the peeling costs
for known alignment.  If the peeling for unknown alignment "aligns" more
than twice the number of datarefs, it is preferred over the peeling for
known alignment.

[4/4] Computes the costs for no peeling and compares them with the costs
of the best peeling so far.  If it is not more expensive, no peeling
will be performed.

> I think it's always best to align a ref with known alignment as that
> simplifies conditions and allows followup optimizations (unrolling of
> the prologue / epilogue).
> I think for this it's better to also compute full costs rather than
> relying on sth as simple as "number of same aligned refs".
>
> Does the code ever end up misaligning a previously known aligned ref?

The following case used to get aligned via the known alignment of dd but
would not anymore since peeling for unknown alignment aligns two
accesses.  I guess the determining factor is still up for scrutiny and
should probably be > 2.  Still, on e.g. s390x no peeling is performed
due to costs.

void foo(int *restrict a, int *restrict b, int *restrict c,
         int *restrict d, unsigned int n)
{
  int *restrict dd = __builtin_assume_aligned (d, 8);
  for (unsigned int i = 0; i < n; i++)
    {
      b[i] = b[i] + a[i];
      c[i] = c[i] + b[i];
      dd[i] = a[i];
    }
}

Regards
 Robin
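The decision order described in [3/4] and [4/4] can be condensed into a small sketch (the struct and function names are invented for illustration; the real patches work on GCC's peeling hash table and stmt cost vectors):

```c
#include <assert.h>

struct peel_candidate
{
  int inside_cost;   /* cost of the vector loop body under this choice */
  int aligned_drs;   /* number of data references this peeling aligns */
  int npeel;         /* prolog iterations, 0 meaning no peeling */
};

/* Hypothetical condensation of the selection order: prefer peeling for
   unknown alignment only when it aligns more than twice as many
   datarefs as the known-alignment candidate ([3/4]), then keep no
   peeling unless it is strictly more expensive ([4/4]).  */
static struct peel_candidate
choose_peeling (struct peel_candidate known, struct peel_candidate unknown,
                struct peel_candidate none)
{
  struct peel_candidate best
    = (unknown.aligned_drs > 2 * known.aligned_drs) ? unknown : known;
  if (none.inside_cost <= best.inside_cost)
    return none;
  return best;
}
```

The "> 2 *" factor is exactly the determining factor Robin says is still up for scrutiny; swapping it for a full cost comparison would change only the first line of the function body.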
Re: [RFC] S/390: Alignment peeling prolog generation
On Thu, May 4, 2017 at 10:59 AM, Robin Dapp wrote:
> Hi,
>
>> This one only works for known misalignment, otherwise it's overkill.
>>
>> OTOH if with some refactoring we can end up using a single cost model
>> that would be great.  That is for the SAME_ALIGN_REFS we want to
>> choose the unknown misalignment with the maximum number of
>> SAME_ALIGN_REFS.  And if we know the misalignment of a single
>> ref then we still may want to align a unknown misalign ref if that has
>> more SAME_ALIGN_REFS (I think we always choose the known-misalign
>> one currently).
>
> [0/3]
> Attempt to unify the peeling cost model as follows:
>
> - Keep the treatment of known misalignments.
>
> - Save the load and store with the most frequent misalignment.
>   - Compare their costs and get the hardware-preferred one via costs.
>
> - Choose the best peeling from the best peeling with known
>   misalignment and the best with unknown misalignment according to
>   the number of aligned data refs.
>
> - Calculate costs for leaving everything misaligned and compare with
>   the best peeling so far.

So the new part is the last point?  There's a lot of refactoring in 3/3
that makes it hard to see what is actually changed ... you need to
resist in doing this, it makes review very hard.

> I also performed some refactoring that seemed necessary during writing
> but which is not strictly necessary anymore ([1/3] and [2/3]) yet imho
> simplifies understanding the code.  The bulk of the changes is in [3/3].
>
> Testsuite on i386 and s390x is clean.  I guess some additional test
> cases won't hurt and I will add them later, however I didn't succeed
> defining a test case with two datarefs with same but unknown
> misalignment.  How can this be done?

a[i] += b[i] should have the load DR of a[i] have the same misalignment
as the store DR of a[i].  I think that's the only case (load/store pair)
where this happens.

We might want to enhance the machinery to have a[i] and a[i+4] be
recorded for example in case the VF divides 4.  Richard's patch may have
improved things here.

> A thing I did not understand when going over the existing code: In
> vect_get_known_peeling_cost() we have
>
>   /* If peeled iterations are known but number of scalar loop
>      iterations are unknown, count a taken branch per peeled loop.  */
>
>   retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
>                              NULL, 0, vect_prologue);
>   retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
>                              NULL, 0, vect_epilogue);
>
> In all uses of the function, prologue_cost_vec is discarded afterwards,
> only the return value is used.  Should the second statement read
> retval +=?  This is only executed when the number of loop iterations is
> unknown.  Currently we indeed count one taken branch, but why then
> execute record_stmt_cost twice or rather not discard the first retval?

Yes, it should be +=.  It's also somewhat odd code that should be
refactored given it is supposed to be only called when we know the
number of iterations to peel.  That is, we can't use it to get an
estimate on the cost of peeling when the prologue iteration count is
unknown (the vect_estimate_min_profitable_iters code has this in a path
not calling vect_get_known_peeling_cost).

Can you try producing a simpler patch that does the last '-' only,
without all the rest?

+  /* At this point, we have to choose between peeling for the datarefs with
+     known alignment and the ones with unknown alignment.  Prefer the one
+     that aligns more datarefs in total.  */
+  struct data_reference *dr0 = NULL;
+  if (do_peeling)
    {

I think it's always best to align a ref with known alignment as that
simplifies conditions and allows followup optimizations (unrolling of
the prologue / epilogue).
I think for this it's better to also compute full costs rather than
relying on sth as simple as "number of same aligned refs".

Does the code ever end up misaligning a previously known aligned ref?

Thanks,
Richard.

> Regards
>  Robin
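The overwrite Robin asks about, and Richard's "it should be +=", are easy to reproduce schematically: plain assignment discards the prologue branch cost, while += counts both branches. A toy model (not the actual record_stmt_cost, which also appends an entry to the cost vector):

```c
#include <assert.h>

/* Stand-in for record_stmt_cost: returns the cost of COUNT copies of
   a statement kind; here every taken branch costs 3 units.  */
static unsigned
record_branch_cost (int count)
{
  return 3u * count;
}

static unsigned
peeling_cost_overwriting (void)
{
  unsigned retval;
  retval = record_branch_cost (1);  /* prologue branch: lost below */
  retval = record_branch_cost (1);  /* epilogue branch overwrites it */
  return retval;                    /* only one branch counted */
}

static unsigned
peeling_cost_accumulating (void)
{
  unsigned retval = 0;
  retval += record_branch_cost (1); /* prologue branch */
  retval += record_branch_cost (1); /* epilogue branch */
  return retval;                    /* both branches counted */
}
```

This matches the comment's intent ("count a taken branch per peeled loop", i.e. one each for prologue and epilogue), which the assignment version silently halves.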
Re: [RFC] S/390: Alignment peeling prolog generation
Hi,

> This one only works for known misalignment, otherwise it's overkill.
>
> OTOH if with some refactoring we can end up using a single cost model
> that would be great.  That is for the SAME_ALIGN_REFS we want to
> choose the unknown misalignment with the maximum number of
> SAME_ALIGN_REFS.  And if we know the misalignment of a single
> ref then we still may want to align a unknown misalign ref if that has
> more SAME_ALIGN_REFS (I think we always choose the known-misalign
> one currently).

[0/3]
Attempt to unify the peeling cost model as follows:

- Keep the treatment of known misalignments.

- Save the load and store with the most frequent misalignment.
  - Compare their costs and get the hardware-preferred one via costs.

- Choose the best peeling from the best peeling with known
  misalignment and the best with unknown misalignment according to
  the number of aligned data refs.

- Calculate costs for leaving everything misaligned and compare with
  the best peeling so far.

I also performed some refactoring that seemed necessary during writing
but which is not strictly necessary anymore ([1/3] and [2/3]) yet imho
simplifies understanding the code.  The bulk of the changes is in [3/3].

Testsuite on i386 and s390x is clean.  I guess some additional test
cases won't hurt and I will add them later, however I didn't succeed
defining a test case with two datarefs with same but unknown
misalignment.  How can this be done?

A thing I did not understand when going over the existing code: In
vect_get_known_peeling_cost() we have

  /* If peeled iterations are known but number of scalar loop
     iterations are unknown, count a taken branch per peeled loop.  */

  retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
                             NULL, 0, vect_prologue);
  retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
                             NULL, 0, vect_epilogue);

In all uses of the function, prologue_cost_vec is discarded afterwards,
only the return value is used.  Should the second statement read
retval +=?  This is only executed when the number of loop iterations is
unknown.  Currently we indeed count one taken branch, but why then
execute record_stmt_cost twice or rather not discard the first retval?

Regards
 Robin
Re: [RFC] S/390: Alignment peeling prolog generation
On Wed, Apr 12, 2017 at 9:50 AM, Robin Dapp wrote:
>> Note I was very conservative here to allow store bandwidth starved
>> CPUs to benefit from aligning a store.
>>
>> I think it would be reasonable to apply the same heuristic to the
>> store case that we only peel for same cost if peeling would at least
>> align two refs.
>
> Do you mean checking if peeling aligns >= 2 refs for sure?  (i.e. with a
> known misalignment)  Or the same as currently via
> STMT_VINFO_SAME_ALIGN_REFS just for stores and .length() >= 2?

The latter.

> Is checking via vect_peeling_hash_choose_best_peeling () too costly or
> simply unnecessary if we already know the costs for aligned and
> unaligned are the same?

This one only works for known misalignment, otherwise it's overkill.

OTOH if with some refactoring we can end up using a single cost model
that would be great.  That is for the SAME_ALIGN_REFS we want to
choose the unknown misalignment with the maximum number of
SAME_ALIGN_REFS.  And if we know the misalignment of a single
ref then we still may want to align a unknown misalign ref if that has
more SAME_ALIGN_REFS (I think we always choose the known-misalign
one currently).

Richard.

> Regards
>  Robin
Re: [RFC] S/390: Alignment peeling prolog generation
> Note I was very conservative here to allow store bandwidth starved
> CPUs to benefit from aligning a store.
>
> I think it would be reasonable to apply the same heuristic to the
> store case that we only peel for same cost if peeling would at least
> align two refs.

Do you mean checking if peeling aligns >= 2 refs for sure?  (i.e. with a
known misalignment)  Or the same as currently via
STMT_VINFO_SAME_ALIGN_REFS just for stores and .length() >= 2?

Is checking via vect_peeling_hash_choose_best_peeling () too costly or
simply unnecessary if we already know the costs for aligned and
unaligned are the same?

Regards
 Robin
Re: [RFC] S/390: Alignment peeling prolog generation
On April 11, 2017 4:57:29 PM GMT+02:00, "Bin.Cheng" wrote:
> On Tue, Apr 11, 2017 at 3:38 PM, Robin Dapp wrote:
>> Hi,
>>
>> when looking at various vectorization examples on s390x I noticed that
>> we still peel vf/2 iterations for alignment even though vectorization
>> costs of unaligned loads and stores are the same as normal
>> loads/stores.
>>
>> A simple example is
>>
>> void foo(int *restrict a, int *restrict b, unsigned int n)
>> {
>>   for (unsigned int i = 0; i < n; i++)
>>     {
>>       b[i] = a[i] * 2 + 1;
>>     }
>> }
>>
>> which gets peeled unless __builtin_assume_aligned (a, 8) is used.
>>
>> In tree-vect-data-refs.c there are several checks that involve costs in
>> the peeling decision none of which seems to suffice in this case.  For a
>> loop with only read DRs there is a check that has been triggering (i.e.
>> disable peeling) since we implemented the vectorization costs.
>>
>> Here, we have DR_MISALIGNMENT (dr) == -1 for all DRs but the costs
>> should still dictate to never peel.  I attached a tentative patch for
>> discussion which fixes the problem by checking the costs for npeel = 0
>> and npeel = vf/2 after ensuring we support all misalignments.  Is there a
>> better way and place to do it?  Are we missing something somewhere else
>> that would preclude the peeling from happening?
>>
>> This is not intended for stage 4 obviously :)
>
> Hi Robin,
> Seems Richi added code like below comparing costs between aligned and
> unaligned loads, and only peeling if it's beneficial:
>
>   /* In case there are only loads with different unknown misalignments, use
>      peeling only if it may help to align other accesses in the loop or
>      if it may help improving load bandwith when we'd end up using
>      unaligned loads.  */
>   tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
>   if (!first_store
>       && !STMT_VINFO_SAME_ALIGN_REFS (
>               vinfo_for_stmt (DR_STMT (dr0))).length ()
>       && (vect_supportable_dr_alignment (dr0, false)
>           != dr_unaligned_supported
>           || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
>               == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
>     do_peeling = false;
>
> I think similar codes can be added for store cases too.

Note I was very conservative here to allow store bandwidth starved
CPUs to benefit from aligning a store.

I think it would be reasonable to apply the same heuristic to the
store case that we only peel for same cost if peeling would at least
align two refs.

Richard.

> Thanks,
> bin
>>
>> Regards
>>  Robin
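Richard's proposal, extending the quoted load-only check to stores, amounts to a predicate along these lines (a hypothetical condensation, not GCC code: the flag and the counts stand in for vect_supportable_dr_alignment, builtin_vectorization_cost, and STMT_VINFO_SAME_ALIGN_REFS):

```c
#include <assert.h>
#include <stdbool.h>

/* Skip peeling when the target accesses misaligned data at the same
   cost as aligned data AND peeling would align fewer than two
   references, i.e. aligning the one ref buys nothing measurable.  */
static bool
skip_peeling_p (bool misalign_supported, int aligned_cost,
                int unaligned_cost, unsigned same_align_refs)
{
  return misalign_supported
         && aligned_cost == unaligned_cost
         && same_align_refs < 2;
}
```

The bandwidth caveat in the reply is exactly why the equal-cost condition alone is not enough: a store-bandwidth-starved CPU may report equal per-statement costs yet still profit from an aligned store, hence the extra "at least two refs" requirement.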
Re: [RFC] S/390: Alignment peeling prolog generation
On Tue, Apr 11, 2017 at 4:03 PM, Robin Dapp wrote:
> Hi Bin,
>
>> Seems Richi added code like below comparing costs between aligned and
>> unaligned loads, and only peeling if it's beneficial:
>>
>>   /* In case there are only loads with different unknown misalignments, use
>>      peeling only if it may help to align other accesses in the loop or
>>      if it may help improving load bandwith when we'd end up using
>>      unaligned loads.  */
>>   tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
>>   if (!first_store
>>       && !STMT_VINFO_SAME_ALIGN_REFS (
>>               vinfo_for_stmt (DR_STMT (dr0))).length ()
>>       && (vect_supportable_dr_alignment (dr0, false)
>>           != dr_unaligned_supported
>>           || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
>>               == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
>>     do_peeling = false;
>
> yes this is the "special case" I was referring to.  This successfully
> avoids peeling when there is no store (after we had set vectorization
> costs).  My patch tries to check the costs for all references.

I am not sure if all references need to be checked; on AArch64,
aligned/unaligned costs are set globally, so we only need to make one
check here.

Thanks,
bin

> Regards
>  Robin
Re: [RFC] S/390: Alignment peeling prolog generation
Hi Bin,

> Seems Richi added code like below comparing costs between aligned and
> unaligned loads, and only peeling if it's beneficial:
>
>   /* In case there are only loads with different unknown misalignments, use
>      peeling only if it may help to align other accesses in the loop or
>      if it may help improving load bandwith when we'd end up using
>      unaligned loads.  */
>   tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
>   if (!first_store
>       && !STMT_VINFO_SAME_ALIGN_REFS (
>               vinfo_for_stmt (DR_STMT (dr0))).length ()
>       && (vect_supportable_dr_alignment (dr0, false)
>           != dr_unaligned_supported
>           || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
>               == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
>     do_peeling = false;

yes this is the "special case" I was referring to.  This successfully
avoids peeling when there is no store (after we had set vectorization
costs).  My patch tries to check the costs for all references.

Regards
 Robin
Re: [RFC] S/390: Alignment peeling prolog generation
On Tue, Apr 11, 2017 at 3:38 PM, Robin Dapp wrote:
> Hi,
>
> when looking at various vectorization examples on s390x I noticed that
> we still peel vf/2 iterations for alignment even though vectorization
> costs of unaligned loads and stores are the same as normal loads/stores.
>
> A simple example is
>
> void foo(int *restrict a, int *restrict b, unsigned int n)
> {
>   for (unsigned int i = 0; i < n; i++)
>     {
>       b[i] = a[i] * 2 + 1;
>     }
> }
>
> which gets peeled unless __builtin_assume_aligned (a, 8) is used.
>
> In tree-vect-data-refs.c there are several checks that involve costs in
> the peeling decision none of which seems to suffice in this case.  For a
> loop with only read DRs there is a check that has been triggering (i.e.
> disable peeling) since we implemented the vectorization costs.
>
> Here, we have DR_MISALIGNMENT (dr) == -1 for all DRs but the costs
> should still dictate to never peel.  I attached a tentative patch for
> discussion which fixes the problem by checking the costs for npeel = 0
> and npeel = vf/2 after ensuring we support all misalignments.  Is there a
> better way and place to do it?  Are we missing something somewhere else
> that would preclude the peeling from happening?
>
> This is not intended for stage 4 obviously :)

Hi Robin,
Seems Richi added code like below comparing costs between aligned and
unaligned loads, and only peeling if it's beneficial:

  /* In case there are only loads with different unknown misalignments, use
     peeling only if it may help to align other accesses in the loop or
     if it may help improving load bandwith when we'd end up using
     unaligned loads.  */
  tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
  if (!first_store
      && !STMT_VINFO_SAME_ALIGN_REFS (
              vinfo_for_stmt (DR_STMT (dr0))).length ()
      && (vect_supportable_dr_alignment (dr0, false)
          != dr_unaligned_supported
          || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
              == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
    do_peeling = false;

I think similar codes can be added for store cases too.

Thanks,
bin

> Regards
>  Robin
[RFC] S/390: Alignment peeling prolog generation
Hi,

when looking at various vectorization examples on s390x I noticed that
we still peel vf/2 iterations for alignment even though vectorization
costs of unaligned loads and stores are the same as normal loads/stores.

A simple example is

void foo(int *restrict a, int *restrict b, unsigned int n)
{
  for (unsigned int i = 0; i < n; i++)
    {
      b[i] = a[i] * 2 + 1;
    }
}

which gets peeled unless __builtin_assume_aligned (a, 8) is used.

In tree-vect-data-refs.c there are several checks that involve costs in
the peeling decision none of which seems to suffice in this case.  For a
loop with only read DRs there is a check that has been triggering (i.e.
disable peeling) since we implemented the vectorization costs.

Here, we have DR_MISALIGNMENT (dr) == -1 for all DRs but the costs
should still dictate to never peel.  I attached a tentative patch for
discussion which fixes the problem by checking the costs for npeel = 0
and npeel = vf/2 after ensuring we support all misalignments.  Is there a
better way and place to do it?  Are we missing something somewhere else
that would preclude the peeling from happening?

This is not intended for stage 4 obviously :)

Regards
 Robin

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 3fc762a..795c22c 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -1418,6 +1418,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
   stmt_vec_info stmt_info;
   unsigned int npeel = 0;
   bool all_misalignments_unknown = true;
+  bool all_misalignments_supported = true;
   unsigned int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned possible_npeel_number = 1;
   tree vectype;
@@ -1547,6 +1548,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
             }

           all_misalignments_unknown = false;
+
           /* Data-ref that was chosen for the case that all the
              misalignments are unknown is not relevant anymore, since we
              have a data-ref with known alignment.  */
@@ -1609,6 +1611,24 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
               break;
             }
         }
+
+      /* Check if target supports misaligned data access for current data
+         reference.  */
+      vectype = STMT_VINFO_VECTYPE (stmt_info);
+      machine_mode mode = TYPE_MODE (vectype);
+      if (targetm.vectorize.
+          support_vector_misalignment (mode, TREE_TYPE (DR_REF (dr)),
+                                       DR_MISALIGNMENT (dr), false))
+        {
+          vect_peeling_hash_insert (_htab, loop_vinfo,
+                                    dr, 0);
+          /* Also insert vf/2 peeling that will be used when all
+             misalignments are unknown.  */
+          vect_peeling_hash_insert (_htab, loop_vinfo,
+                                    dr, vf / 2);
+        }
+      else
+        all_misalignments_supported = false;
     }

   /* Check if we can possibly peel the loop.  */
@@ -1687,6 +1707,18 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
           dr0 = first_store;
         }

+  /* If the target supports accessing all data references in a misaligned
+     way, check costs to see if we can leave them unaligned and do not
+     perform any peeling.  */
+  if (all_misalignments_supported)
+    {
+      dr0 = vect_peeling_hash_choose_best_peeling (_htab,
+                                                   loop_vinfo, ,
+                                                   _cost_vec);
+      if (!dr0 || !npeel)
+        do_peeling = false;
+    }
+
   /* In case there are only loads with different unknown misalignments, use
      peeling only if it may help to align other accesses in the loop or
      if it may help improving load bandwith when we'd end up using
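For context on what "peeling vf/2 iterations" buys: the generated prolog executes scalar iterations until the chosen data reference reaches a vector-aligned address, so for a known misalignment the count is computable up front. A sketch of that arithmetic (GCC's real computation lives elsewhere in the vectorizer and differs in detail, e.g. it works on DR_MISALIGNMENT rather than raw addresses):

```c
#include <assert.h>
#include <stdint.h>

/* Number of scalar iterations to peel so that ADDR becomes ALIGN-byte
   aligned, stepping ELEM_SIZE bytes per iteration.  Assumes ALIGN is a
   power of two and ELEM_SIZE divides ALIGN.  */
static unsigned
peel_iterations (uintptr_t addr, unsigned elem_size, unsigned align)
{
  unsigned misalign = addr & (align - 1); /* bytes past the boundary */
  if (misalign == 0)
    return 0;
  return (align - misalign) / elem_size;
}
```

For an int array and 16-byte vectors this is between 0 and 3 iterations; when the misalignment is unknown at compile time, the vectorizer pessimistically budgets vf/2, which is exactly the cost the RFC wants to avoid paying when unaligned accesses are free.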