Re: [RFC] S/390: Alignment peeling prolog generation

2017-05-11 Thread Richard Biener
On Thu, May 11, 2017 at 2:15 PM, Richard Biener
 wrote:
> On Thu, May 11, 2017 at 2:14 PM, Richard Biener
>  wrote:
>> On Thu, May 11, 2017 at 1:15 PM, Robin Dapp  wrote:
>>> Included the requested changes in the patches (to follow).  I removed
>>> the alignment count check now altogether.
>>>
>>>> I'm not sure why you test for unlimited_cost_model here as I said
>>>> elsewhere I'm not sure
>>>> what not cost modeling means for static decisions.  The purpose of
>>>> unlimited_cost_model
>>>> is to always vectorize when possible and omit the runtime
>>>> profitability check.  So for peeling
>>>> I'd just always use the cost model.  Thus please drop this check.
>>>
>>> Without that, I get one additional FAIL gcc.dg/vect/slp-25.c for x86.
>>> It is caused by choosing no peeling (inside costs 0) over peeling for
>>> known alignment with unlimited cost model (inside costs 0 as well).
>>> Costs of 0 for no peeling are caused by count == 0, or rather ncopies = vf /
>>> nunits == 4 / 8 == 0 in record_stmt_cost ().  Shouldn't ncopies > 0
>>> always hold?  Even 0.5 would have worked here to make no peeling more
>>> expensive than 0.
>>
>> That's odd.  I can't get
>>
>> Index: gcc/tree-vect-stmts.c
>> ===================================================================
>> --- gcc/tree-vect-stmts.c   (revision 247882)
>> +++ gcc/tree-vect-stmts.c   (working copy)
>> @@ -95,6 +96,7 @@ record_stmt_cost (stmt_vector_for_cost *
>>   enum vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>>   int misalign, enum vect_cost_model_location where)
>>  {
>> +  gcc_assert (count > 0);
>>if (body_cost_vec)
>>  {
>>tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;
>>
>> to ICE with the testcase (unpatched trunk)
>>
>> Where's that record_stmt_cost call done?  You can't simply use vf/nunits
>> for SLP.
>
> Ah, of course needs -fvect-cost-model.
>
> I'll investigate.

Ugh.  The vect_peeling_hash_get_lowest_cost isn't handling SLP in any way;
there's quite some refactoring necessary to fix that.

I suggest (eh) doing:

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c   (revision 247734)
+++ gcc/tree-vect-data-refs.c   (working copy)
@@ -1129,7 +1129,7 @@ vect_get_data_access_cost (struct data_r
   int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
   int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
-  int ncopies = vf / nunits;
+  int ncopies = MAX (1, vf / nunits); /* TODO: Handle SLP properly  */

   if (DR_IS_READ (dr))
 vect_get_load_cost (dr, ncopies, true, inside_cost, outside_cost,




> Richard.
>
>> Richard.
>>
>>> Test suite on s390x is clean.
>>>
>>> Regards
>>>  Robin
>>>


Re: [RFC] S/390: Alignment peeling prolog generation

2017-05-11 Thread Richard Biener
On Thu, May 11, 2017 at 2:14 PM, Richard Biener
 wrote:
> On Thu, May 11, 2017 at 1:15 PM, Robin Dapp  wrote:
>> Included the requested changes in the patches (to follow).  I removed
>> the alignment count check now altogether.
>>
>>> I'm not sure why you test for unlimited_cost_model here as I said
>>> elsewhere I'm not sure
>>> what not cost modeling means for static decisions.  The purpose of
>>> unlimited_cost_model
>>> is to always vectorize when possible and omit the runtime
>>> profitability check.  So for peeling
>>> I'd just always use the cost model.  Thus please drop this check.
>>
>> Without that, I get one additional FAIL gcc.dg/vect/slp-25.c for x86.
>> It is caused by choosing no peeling (inside costs 0) over peeling for
>> known alignment with unlimited cost model (inside costs 0 as well).
>> Costs of 0 for no peeling are caused by count == 0, or rather ncopies = vf /
>> nunits == 4 / 8 == 0 in record_stmt_cost ().  Shouldn't ncopies > 0
>> always hold?  Even 0.5 would have worked here to make no peeling more
>> expensive than 0.
>
> That's odd.  I can't get
>
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c   (revision 247882)
> +++ gcc/tree-vect-stmts.c   (working copy)
> @@ -95,6 +96,7 @@ record_stmt_cost (stmt_vector_for_cost *
>   enum vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>   int misalign, enum vect_cost_model_location where)
>  {
> +  gcc_assert (count > 0);
>if (body_cost_vec)
>  {
>tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;
>
> to ICE with the testcase (unpatched trunk)
>
> Where's that record_stmt_cost call done?  You can't simply use vf/nunits
> for SLP.

Ah, of course needs -fvect-cost-model.

I'll investigate.

Richard.

> Richard.
>
>> Test suite on s390x is clean.
>>
>> Regards
>>  Robin
>>


Re: [RFC] S/390: Alignment peeling prolog generation

2017-05-11 Thread Richard Biener
On Thu, May 11, 2017 at 1:15 PM, Robin Dapp  wrote:
> Included the requested changes in the patches (to follow).  I removed
> the alignment count check now altogether.
>
>> I'm not sure why you test for unlimited_cost_model here as I said
>> elsewhere I'm not sure
>> what not cost modeling means for static decisions.  The purpose of
>> unlimited_cost_model
>> is to always vectorize when possible and omit the runtime
>> profitability check.  So for peeling
>> I'd just always use the cost model.  Thus please drop this check.
>
> Without that, I get one additional FAIL gcc.dg/vect/slp-25.c for x86.
> It is caused by choosing no peeling (inside costs 0) over peeling for
> known alignment with unlimited cost model (inside costs 0 as well).
> Costs of 0 for no peeling are caused by count == 0, or rather ncopies = vf /
> nunits == 4 / 8 == 0 in record_stmt_cost ().  Shouldn't ncopies > 0
> always hold?  Even 0.5 would have worked here to make no peeling more
> expensive than 0.

That's odd.  I can't get

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c   (revision 247882)
+++ gcc/tree-vect-stmts.c   (working copy)
@@ -95,6 +96,7 @@ record_stmt_cost (stmt_vector_for_cost *
  enum vect_cost_for_stmt kind, stmt_vec_info stmt_info,
  int misalign, enum vect_cost_model_location where)
 {
+  gcc_assert (count > 0);
   if (body_cost_vec)
 {
   tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;

to ICE with the testcase (unpatched trunk)

Where's that record_stmt_cost call done?  You can't simply use vf/nunits
for SLP.

Richard.

> Test suite on s390x is clean.
>
> Regards
>  Robin
>


Re: [RFC] S/390: Alignment peeling prolog generation

2017-05-11 Thread Robin Dapp
Included the requested changes in the patches (to follow).  I removed
the alignment count check now altogether.

> I'm not sure why you test for unlimited_cost_model here as I said
> elsewhere I'm not sure
> what not cost modeling means for static decisions.  The purpose of
> unlimited_cost_model
> is to always vectorize when possible and omit the runtime
> profitability check.  So for peeling
> I'd just always use the cost model.  Thus please drop this check.

Without that, I get one additional FAIL gcc.dg/vect/slp-25.c for x86.
It is caused by choosing no peeling (inside costs 0) over peeling for
known alignment with unlimited cost model (inside costs 0 as well).
Costs of 0 for no peeling are caused by count == 0, or rather ncopies = vf /
nunits == 4 / 8 == 0 in record_stmt_cost ().  Shouldn't ncopies > 0
always hold?  Even 0.5 would have worked here to make no peeling more
expensive than 0.

Test suite on s390x is clean.

Regards
 Robin



Re: [RFC] S/390: Alignment peeling prolog generation

2017-05-09 Thread Richard Biener
On Mon, May 8, 2017 at 6:11 PM, Robin Dapp  wrote:
>> So the new part is the last point?  There's a lot of refactoring in 3/3
>> that makes it hard to see what is actually changed ... you need to resist
>> doing this; it makes review very hard.
>
> The new part is actually spread across the three last "-"s.  Attached is
> a new version of [3/3] split up into two patches with hopefully less
> blending of refactoring and new functionality.
>
> [3/4] Computes full costs when peeling for unknown alignment, uses
> either read or write and compares the better one with the peeling costs
> for known alignment.  If the peeling for unknown alignment "aligns" more
> than twice the number of datarefs, it is preferred over the peeling for
> known alignment.
>
> [4/4] Computes the costs for no peeling and compares them with the costs
> of the best peeling so far.  If it is not more expensive, no peeling
> will be performed.
>
>> I think it's always best to align a ref with known alignment as that
>> simplifies conditions and allows follow-up optimizations (unrolling of
>> the prologue / epilogue).
>> I think for this it's better to also compute full costs rather than
>> relying on sth as simple as "number of same aligned refs".
>>
>> Does the code ever end up misaligning a previously known aligned ref?
>
> The following case used to get aligned via the known alignment of dd but
> would not anymore since peeling for unknown alignment aligns two
> accesses.  I guess the determining factor is still up for scrutiny and
> should probably be > 2.  Still, on e.g. s390x no peeling is performed due
> to costs.

Ok, in principle this makes sense if we manage to correctly compute
the costs.  What exactly is profitable or not is of course subject to
the target costs.

Richard.

> void foo(int *restrict a, int *restrict b, int *restrict c, int
> *restrict d, unsigned int n)
> {
>   int *restrict dd = __builtin_assume_aligned (d, 8);
>   for (unsigned int i = 0; i < n; i++)
> {
>   b[i] = b[i] + a[i];
>   c[i] = c[i] + b[i];
>   dd[i] = a[i];
> }
> }
>
> Regards
>  Robin
>


Re: [RFC] S/390: Alignment peeling prolog generation

2017-05-08 Thread Robin Dapp
> So the new part is the last point?  There's a lot of refactoring in 3/3
> that makes it hard to see what is actually changed ... you need to resist
> doing this; it makes review very hard.

The new part is actually spread across the three last "-"s.  Attached is
a new version of [3/3] split up into two patches with hopefully less
blending of refactoring and new functionality.

[3/4] Computes full costs when peeling for unknown alignment, uses
either read or write and compares the better one with the peeling costs
for known alignment.  If the peeling for unknown alignment "aligns" more
than twice the number of datarefs, it is preferred over the peeling for
known alignment.

[4/4] Computes the costs for no peeling and compares them with the costs
of the best peeling so far.  If it is not more expensive, no peeling
will be performed.

> I think it's always best to align a ref with known alignment as that
> simplifies conditions and allows follow-up optimizations (unrolling of
> the prologue / epilogue).
> I think for this it's better to also compute full costs rather than
> relying on sth as simple as "number of same aligned refs".
>
> Does the code ever end up misaligning a previously known aligned ref?

The following case used to get aligned via the known alignment of dd but
would not anymore since peeling for unknown alignment aligns two
accesses.  I guess the determining factor is still up for scrutiny and
should probably be > 2.  Still, on e.g. s390x no peeling is performed due
to costs.

void foo(int *restrict a, int *restrict b, int *restrict c, int
*restrict d, unsigned int n)
{
  int *restrict dd = __builtin_assume_aligned (d, 8);
  for (unsigned int i = 0; i < n; i++)
{
  b[i] = b[i] + a[i];
  c[i] = c[i] + b[i];
  dd[i] = a[i];
}
}

Regards
 Robin



Re: [RFC] S/390: Alignment peeling prolog generation

2017-05-05 Thread Richard Biener
On Thu, May 4, 2017 at 10:59 AM, Robin Dapp  wrote:
> Hi,
>
>> This one only works for known misalignment, otherwise it's overkill.
>>
>> OTOH if with some refactoring we can end up using a single cost model
>> that would be great.  That is for the SAME_ALIGN_REFS we want to
>> choose the unknown misalignment with the maximum number of
>> SAME_ALIGN_REFS.  And if we know the misalignment of a single
>> ref then we still may want to align an unknown misaligned ref if that has
>> more SAME_ALIGN_REFS (I think we always choose the known-misalign
>> one currently).
>
> [0/3]
> Attempt to unify the peeling cost model as follows:
>
>  - Keep the treatment of known misalignments.
>
>  - Save the load and store with the most frequent misalignment.
>   - Compare their costs and get the hardware-preferred one via costs.
>
>  - Choose the best peeling from the best peeling with known
>misalignment and the best with unknown misalignment according to
>the number of aligned data refs.
>
>  - Calculate costs for leaving everything misaligned and compare with
>the best peeling so far.

So the new part is the last point?  There's a lot of refactoring in 3/3 that
makes it hard to see what is actually changed ... you need to resist
doing this; it makes review very hard.

> I also performed some refactoring that seemed necessary during writing
> but which is not strictly necessary anymore ([1/3] and [2/3]) yet imho
> simplifies understanding the code.  The bulk of the changes is in [3/3].
>
> Testsuite on i386 and s390x is clean.  I guess some additional test
> cases won't hurt and I will add them later; however, I didn't succeed in
> defining a test case with two datarefs with the same but unknown
> misalignment.  How can this be done?

  a[i] += b[i]

should have the load DR of a[i] have the same misalignment as the
store DR of a[i].  I think that's the only case (load/store pair) where
this happens.  We might want to enhance the machinery to
have a[i] and a[i+4] be recorded for example in case the VF divides 4.
Richard's patch may have improved things here.

>
> A thing I did not understand when going over the existing code: In
> vect_get_known_peeling_cost() we have
>
> /* If peeled iterations are known but number of scalar loop
>    iterations are unknown, count a taken branch per peeled loop.  */
>
> retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
>                            NULL, 0, vect_prologue);
> retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
>                            NULL, 0, vect_epilogue);
>
> In all uses of the function, prologue_cost_vec is discarded afterwards,
> only the return value is used.  Should the second statement read retval
> +=?  This is only executed when the number of loop iterations is
> unknown.  Currently we indeed count one taken branch, but why then
> execute record_stmt_cost twice or rather not discard the first retval?

Yes, it should be +=.

It's also somewhat odd code that should be refactored, given it is supposed
to be called only when we know the number of iterations to peel.  That is,
we can't use it to get an estimate of the cost of peeling when the number
of prologue iterations is unknown (the vect_estimate_min_profitable_iters
code has this in a path not calling vect_get_known_peeling_cost).
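For reference, the corrected accumulation would look roughly like this (a
sketch against the snippet quoted above, not a tested patch):

```
-    retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
-                               NULL, 0, vect_epilogue);
+    retval += record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
+                                NULL, 0, vect_epilogue);
```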

Can you try producing a simpler patch that does the last '-' only, without
all the rest?

+  /* At this point, we have to choose between peeling for the datarefs with
+ known alignment and the ones with unknown alignment.  Prefer the one
+ that aligns more datarefs in total.  */
+  struct data_reference *dr0 = NULL;
+  if (do_peeling)
 {

I think it's always best to align a ref with known alignment as that simplifies
conditions and allows followup optimizations (unrolling of the
prologue / epilogue).
I think for this it's better to also compute full costs rather than relying on
sth as simple as "number of same aligned refs".

Does the code ever end up misaligning a previously known aligned ref?

Thanks,
Richard.

>
> Regards
>  Robin
>


Re: [RFC] S/390: Alignment peeling prolog generation

2017-05-04 Thread Robin Dapp
Hi,

> This one only works for known misalignment, otherwise it's overkill.
>
> OTOH if with some refactoring we can end up using a single cost model
> that would be great.  That is for the SAME_ALIGN_REFS we want to
> choose the unknown misalignment with the maximum number of
> SAME_ALIGN_REFS.  And if we know the misalignment of a single
> ref then we still may want to align an unknown misaligned ref if that has
> more SAME_ALIGN_REFS (I think we always choose the known-misalign
> one currently).

[0/3]
Attempt to unify the peeling cost model as follows:

 - Keep the treatment of known misalignments.

 - Save the load and store with the most frequent misalignment.
  - Compare their costs and get the hardware-preferred one via costs.

 - Choose the best peeling from the best peeling with known
   misalignment and the best with unknown misalignment according to
   the number of aligned data refs.

 - Calculate costs for leaving everything misaligned and compare with
   the best peeling so far.

I also performed some refactoring that seemed necessary during writing
but which is not strictly necessary anymore ([1/3] and [2/3]) yet imho
simplifies understanding the code.  The bulk of the changes is in [3/3].

Testsuite on i386 and s390x is clean.  I guess some additional test
cases won't hurt and I will add them later; however, I didn't succeed in
defining a test case with two datarefs with the same but unknown
misalignment.  How can this be done?


A thing I did not understand when going over the existing code: In
vect_get_known_peeling_cost() we have

/* If peeled iterations are known but number of scalar loop
   iterations are unknown, count a taken branch per peeled loop.  */

retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
                           NULL, 0, vect_prologue);
retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
                           NULL, 0, vect_epilogue);

In all uses of the function, prologue_cost_vec is discarded afterwards,
only the return value is used.  Should the second statement read retval
+=?  This is only executed when the number of loop iterations is
unknown.  Currently we indeed count one taken branch, but why then
execute record_stmt_cost twice or rather not discard the first retval?

Regards
 Robin



Re: [RFC] S/390: Alignment peeling prolog generation

2017-04-12 Thread Richard Biener
On Wed, Apr 12, 2017 at 9:50 AM, Robin Dapp  wrote:
>> Note I was very conservative here to allow store bandwidth starved
>> CPUs to benefit from aligning a store.
>>
>> I think it would be reasonable to apply the same heuristic to the
>> store case that we only peel for same cost if peeling would at least
>> align two refs.
>
> Do you mean checking if peeling aligns >= 2 refs for sure? (i.e. with a
> known misalignment) Or the same as currently via
> STMT_VINFO_SAME_ALIGN_REFS just for stores and .length() >= 2?

The latter.

> Is checking via vect_peeling_hash_choose_best_peeling () too costly or
> simply unnecessary if we already know the costs for aligned and
> unaligned are the same?

This one only works for known misalignment, otherwise it's overkill.

OTOH if with some refactoring we can end up using a single cost model
that would be great.  That is for the SAME_ALIGN_REFS we want to
choose the unknown misalignment with the maximum number of
SAME_ALIGN_REFS.  And if we know the misalignment of a single
ref then we still may want to align an unknown misaligned ref if that has
more SAME_ALIGN_REFS (I think we always choose the known-misalign
one currently).

Richard.

> Regards
>  Robin
>


Re: [RFC] S/390: Alignment peeling prolog generation

2017-04-12 Thread Robin Dapp
> Note I was very conservative here to allow store bandwidth starved
> CPUs to benefit from aligning a store.
> 
> I think it would be reasonable to apply the same heuristic to the
> store case that we only peel for same cost if peeling would at least
> align two refs.

Do you mean checking if peeling aligns >= 2 refs for sure? (i.e. with a
known misalignment) Or the same as currently via
STMT_VINFO_SAME_ALIGN_REFS just for stores and .length() >= 2?

Is checking via vect_peeling_hash_choose_best_peeling () too costly or
simply unnecessary if we already know the costs for aligned and
unaligned are the same?

Regards
 Robin



Re: [RFC] S/390: Alignment peeling prolog generation

2017-04-11 Thread Richard Biener
On April 11, 2017 4:57:29 PM GMT+02:00, "Bin.Cheng"  wrote:
>On Tue, Apr 11, 2017 at 3:38 PM, Robin Dapp  wrote:
>> Hi,
>>
>> when looking at various vectorization examples on s390x I noticed that
>> we still peel vf/2 iterations for alignment even though vectorization
>> costs of unaligned loads and stores are the same as normal loads/stores.
>>
>> A simple example is
>>
>> void foo(int *restrict a, int *restrict b, unsigned int n)
>> {
>>   for (unsigned int i = 0; i < n; i++)
>> {
>>   b[i] = a[i] * 2 + 1;
>> }
>> }
>>
>> which gets peeled unless __builtin_assume_aligned (a, 8) is used.
>>
>> In tree-vect-data-refs.c there are several checks that involve costs in
>> the peeling decision, none of which seems to suffice in this case.  For a
>> loop with only read DRs there is a check that has been triggering (i.e.
>> disabling peeling) since we implemented the vectorization costs.
>>
>> Here, we have DR_MISALIGNMENT (dr) == -1 for all DRs but the costs
>> should still dictate to never peel.  I attached a tentative patch for
>> discussion which fixes the problem by checking the costs for npeel = 0
>> and npeel = vf/2 after ensuring we support all misalignments.  Is there a
>> better way and place to do it?  Are we missing something somewhere else
>> that would preclude the peeling from happening?
>>
>> This is not intended for stage 4 obviously :)
>Hi Robin,
>Seems Richi added code like below comparing costs between aligned and
>unaligned loads, and only peeling if it's beneficial:
>
>  /* In case there are only loads with different unknown misalignments, use
>     peeling only if it may help to align other accesses in the loop or
>     if it may help improving load bandwidth when we'd end up using
>     unaligned loads.  */
>  tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
>  if (!first_store
>      && !STMT_VINFO_SAME_ALIGN_REFS (
>              vinfo_for_stmt (DR_STMT (dr0))).length ()
>      && (vect_supportable_dr_alignment (dr0, false)
>          != dr_unaligned_supported
>          || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
>              == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
>    do_peeling = false;
>
>I think similar code can be added for the store case too.

Note I was very conservative here to allow store bandwidth starved CPUs to 
benefit from aligning a store.

I think it would be reasonable to apply the same heuristic to the store case 
that we only peel for same cost if peeling would at least align two refs.

Richard.

>Thanks,
>bin
>>
>> Regards
>>  Robin



Re: [RFC] S/390: Alignment peeling prolog generation

2017-04-11 Thread Bin.Cheng
On Tue, Apr 11, 2017 at 4:03 PM, Robin Dapp  wrote:
> Hi Bin,
>
>> Seems Richi added code like below comparing costs between aligned and
>> unaligned loads, and only peeling if it's beneficial:
>>
>>   /* In case there are only loads with different unknown misalignments, use
>>      peeling only if it may help to align other accesses in the loop or
>>      if it may help improving load bandwidth when we'd end up using
>>      unaligned loads.  */
>>   tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
>>   if (!first_store
>>       && !STMT_VINFO_SAME_ALIGN_REFS (
>>               vinfo_for_stmt (DR_STMT (dr0))).length ()
>>       && (vect_supportable_dr_alignment (dr0, false)
>>           != dr_unaligned_supported
>>           || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
>>               == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
>>     do_peeling = false;
>
> Yes, this is the "special case" I was referring to.  This successfully
> avoids peeling when there is no store (after we set the vectorization
> costs).  My patch tries to check the costs for all references.
I am not sure all references need to be checked; on AArch64,
aligned/unaligned costs are set globally, so only one check needs to be
made here.

Thanks,
bin
>
> Regards
>  Robin
>


Re: [RFC] S/390: Alignment peeling prolog generation

2017-04-11 Thread Robin Dapp
Hi Bin,

> Seems Richi added code like below comparing costs between aligned and
> unaligned loads, and only peeling if it's beneficial:
>
>   /* In case there are only loads with different unknown misalignments, use
>      peeling only if it may help to align other accesses in the loop or
>      if it may help improving load bandwidth when we'd end up using
>      unaligned loads.  */
>   tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
>   if (!first_store
>       && !STMT_VINFO_SAME_ALIGN_REFS (
>               vinfo_for_stmt (DR_STMT (dr0))).length ()
>       && (vect_supportable_dr_alignment (dr0, false)
>           != dr_unaligned_supported
>           || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
>               == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
>     do_peeling = false;

Yes, this is the "special case" I was referring to.  This successfully
avoids peeling when there is no store (after we set the vectorization
costs).  My patch tries to check the costs for all references.

Regards
 Robin



Re: [RFC] S/390: Alignment peeling prolog generation

2017-04-11 Thread Bin.Cheng
On Tue, Apr 11, 2017 at 3:38 PM, Robin Dapp  wrote:
> Hi,
>
> when looking at various vectorization examples on s390x I noticed that
> we still peel vf/2 iterations for alignment even though vectorization
> costs of unaligned loads and stores are the same as normal loads/stores.
>
> A simple example is
>
> void foo(int *restrict a, int *restrict b, unsigned int n)
> {
>   for (unsigned int i = 0; i < n; i++)
> {
>   b[i] = a[i] * 2 + 1;
> }
> }
>
> which gets peeled unless __builtin_assume_aligned (a, 8) is used.
>
> In tree-vect-data-refs.c there are several checks that involve costs in
> the peeling decision, none of which seems to suffice in this case.  For a
> loop with only read DRs there is a check that has been triggering (i.e.
> disabling peeling) since we implemented the vectorization costs.
>
> Here, we have DR_MISALIGNMENT (dr) == -1 for all DRs but the costs
> should still dictate to never peel. I attached a tentative patch for
> discussion which fixes the problem by checking the costs for npeel = 0
> and npeel = vf/2 after ensuring we support all misalignments. Is there a
> better way and place to do it? Are we missing something somewhere else
> that would preclude the peeling from happening?
>
> This is not intended for stage 4 obviously :)
Hi Robin,
Seems Richi added code like below comparing costs between aligned and
unaligned loads, and only peeling if it's beneficial:

  /* In case there are only loads with different unknown misalignments, use
     peeling only if it may help to align other accesses in the loop or
     if it may help improving load bandwidth when we'd end up using
     unaligned loads.  */
  tree dr0_vt = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr0)));
  if (!first_store
  && !STMT_VINFO_SAME_ALIGN_REFS (
  vinfo_for_stmt (DR_STMT (dr0))).length ()
  && (vect_supportable_dr_alignment (dr0, false)
  != dr_unaligned_supported
  || (builtin_vectorization_cost (vector_load, dr0_vt, 0)
          == builtin_vectorization_cost (unaligned_load, dr0_vt, -1))))
    do_peeling = false;

I think similar code can be added for the store case too.

Thanks,
bin
>
> Regards
>  Robin


[RFC] S/390: Alignment peeling prolog generation

2017-04-11 Thread Robin Dapp
Hi,

when looking at various vectorization examples on s390x I noticed that
we still peel vf/2 iterations for alignment even though vectorization
costs of unaligned loads and stores are the same as normal loads/stores.

A simple example is

void foo(int *restrict a, int *restrict b, unsigned int n)
{
  for (unsigned int i = 0; i < n; i++)
{
  b[i] = a[i] * 2 + 1;
}
}

which gets peeled unless __builtin_assume_aligned (a, 8) is used.

In tree-vect-data-refs.c there are several checks that involve costs in
the peeling decision, none of which seems to suffice in this case.  For a
loop with only read DRs there is a check that has been triggering (i.e.
disabling peeling) since we implemented the vectorization costs.

Here, we have DR_MISALIGNMENT (dr) == -1 for all DRs but the costs
should still dictate to never peel. I attached a tentative patch for
discussion which fixes the problem by checking the costs for npeel = 0
and npeel = vf/2 after ensuring we support all misalignments. Is there a
better way and place to do it? Are we missing something somewhere else
that would preclude the peeling from happening?

This is not intended for stage 4 obviously :)

Regards
 Robin
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 3fc762a..795c22c 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -1418,6 +1418,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
   stmt_vec_info stmt_info;
   unsigned int npeel = 0;
   bool all_misalignments_unknown = true;
+  bool all_misalignments_supported = true;
   unsigned int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned possible_npeel_number = 1;
   tree vectype;
@@ -1547,6 +1548,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
 }
 
   all_misalignments_unknown = false;
+
   /* Data-ref that was chosen for the case that all the
  misalignments are unknown is not relevant anymore, since we
  have a data-ref with known alignment.  */
@@ -1609,6 +1611,24 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
   break;
 }
 }
+
+  /* Check if target supports misaligned data access for current data
+	 reference.  */
+  vectype = STMT_VINFO_VECTYPE (stmt_info);
+  machine_mode mode = TYPE_MODE (vectype);
+  if (targetm.vectorize.
+	  support_vector_misalignment (mode, TREE_TYPE (DR_REF (dr)),
+   DR_MISALIGNMENT (dr), false))
+	{
+	  vect_peeling_hash_insert (&peeling_htab, loop_vinfo, dr, 0);
+	  /* Also insert vf/2 peeling that will be used when all
+	     misalignments are unknown.  */
+	  vect_peeling_hash_insert (&peeling_htab, loop_vinfo, dr, vf / 2);
+	}
+  else
+	all_misalignments_supported = false;
 }
 
   /* Check if we can possibly peel the loop.  */
@@ -1687,6 +1707,18 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
 dr0 = first_store;
 }
 
+  /* If the target supports accessing all data references in a misaligned
+	 way, check costs to see if we can leave them unaligned and do not
+	 perform any peeling.  */
+  if (all_misalignments_supported)
+	{
+	  dr0 = vect_peeling_hash_choose_best_peeling (&peeling_htab,
+						       loop_vinfo, &npeel,
+						       &body_cost_vec);
+	  if (!dr0 || !npeel)
+	do_peeling = false;
+	}
+
   /* In case there are only loads with different unknown misalignments, use
  peeling only if it may help to align other accesses in the loop or
	 if it may help improving load bandwidth when we'd end up using