Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-22 Thread Jeff Law via Gcc-patches



On 11/22/22 04:08, Richard Biener via Gcc-patches wrote:

On Tue, 22 Nov 2022, Richard Sandiford wrote:


Tamar Christina  writes:

-Original Message-
From: Richard Biener 
Sent: Tuesday, November 22, 2022 10:59 AM
To: Richard Sandiford 
Cc: Tamar Christina via Gcc-patches ; Tamar
Christina ; Richard Biener
; nd 
Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
bitfields and array_refs

On Tue, 22 Nov 2022, Richard Sandiford wrote:


Tamar Christina via Gcc-patches  writes:

So it's not easily possible within the current infrastructure.  But
it does look like ARM might eventually benefit from something like STV

on x86?

I'm not sure.  The problem with trying to do this in RTL is that
you'd have to be able to decide from two pseudos whether they come
from extracts that are sequential. When coming in from a hard
register that's easy yes.  When coming in from a load, or any other

operation that produces pseudos that becomes harder.

Yeah.

Just in case anyone reading the above is tempted to implement STV for
AArch64: I think it would set a bad precedent if we had a
paste-&-adjust version of the x86 pass.  AFAIK, the target
capabilities and constraints are mostly modelled correctly using
existing mechanisms, so I don't think there's anything particularly
target-specific about the process of forcing things to be on the general or

SIMD/FP side.

So if we did have an STV-ish thing for AArch64, I think it should be a
target-independent pass that uses hooks and recog, even if the pass is
initially enabled for AArch64 only.

Agreed - maybe some of the x86 code can be leveraged, but of course the
cost modeling is the most difficult to get right - IIRC the x86 backend resorts
to backend specific tuning flags rather than trying to get rtx_cost or insn_cost
"correct" here.


(FWIW, on the patch itself, I tend to agree that this is really an SLP
optimisation.  If the vectoriser fails to see the benefit, or if it
fails to handle more complex cases, then it would be good to try to
fix that.)

Also agreed - but costing is hard ;)

I guess, I still disagree here but I've clearly been out-Richard.  The problem 
is still
that this is just basic codegen.  I still don't think it requires -O2 to be 
usable.

So I guess the only correct implementation is to use an STV-like patch.  But 
given
that this is already the second attempt, first RTL one was rejected by Richard,
second GIMPLE one was rejected by Richi I'd like to get an agreement on this STV
thing before I waste months more..

I don't think this in itself is a good motivation for STV.  My comment
above was more about the idea of STV for AArch64 in general (since it
had been raised).

Personally I still think the reduction should be generated in gimple.

I agree, and the proper place to generate the reduction is in SLP.


Sorry to have sent things astray with my earlier ACK.  It looked 
reasonable to me.


jeff



Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-22 Thread Richard Biener via Gcc-patches
On Tue, 22 Nov 2022, Richard Sandiford wrote:

> Tamar Christina  writes:
> >> -Original Message-
> >> From: Richard Biener 
> >> Sent: Tuesday, November 22, 2022 10:59 AM
> >> To: Richard Sandiford 
> >> Cc: Tamar Christina via Gcc-patches ; Tamar
> >> Christina ; Richard Biener
> >> ; nd 
> >> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> >> bitfields and array_refs
> >>
> >> On Tue, 22 Nov 2022, Richard Sandiford wrote:
> >>
> >> > Tamar Christina via Gcc-patches  writes:
> >> > >> So it's not easily possible within the current infrastructure.  But
> >> > >> it does look like ARM might eventually benefit from something like STV
> >> on x86?
> >> > >>
> >> > >
> >> > > I'm not sure.  The problem with trying to do this in RTL is that
> >> > > you'd have to be able to decide from two pseudos whether they come
> >> > > from extracts that are sequential. When coming in from a hard
> >> > > register that's easy yes.  When coming in from a load, or any other
> >> operation that produces pseudos that becomes harder.
> >> >
> >> > Yeah.
> >> >
> >> > Just in case anyone reading the above is tempted to implement STV for
> >> > AArch64: I think it would set a bad precedent if we had a
> >> > paste-&-adjust version of the x86 pass.  AFAIK, the target
> >> > capabilities and constraints are mostly modelled correctly using
> >> > existing mechanisms, so I don't think there's anything particularly
> >> > target-specific about the process of forcing things to be on the general 
> >> > or
> >> SIMD/FP side.
> >> >
> >> > So if we did have an STV-ish thing for AArch64, I think it should be a
> >> > target-independent pass that uses hooks and recog, even if the pass is
> >> > initially enabled for AArch64 only.
> >>
> >> Agreed - maybe some of the x86 code can be leveraged, but of course the
> >> cost modeling is the most difficult to get right - IIRC the x86 backend 
> >> resorts
> >> to backend specific tuning flags rather than trying to get rtx_cost or 
> >> insn_cost
> >> "correct" here.
> >>
> >> > (FWIW, on the patch itself, I tend to agree that this is really an SLP
> >> > optimisation.  If the vectoriser fails to see the benefit, or if it
> >> > fails to handle more complex cases, then it would be good to try to
> >> > fix that.)
> >>
> >> Also agreed - but costing is hard ;)
> >
> > I guess, I still disagree here but I've clearly been out-Richard.  The 
> > problem is still
> > that this is just basic codegen.  I still don't think it requires -O2 to be 
> > usable.
> >
> > So I guess the only correct implementation is to use an STV-like patch.  
> > But given
> > that this is already the second attempt, first RTL one was rejected by 
> > Richard,
> > second GIMPLE one was rejected by Richi I'd like to get an agreement on 
> > this STV
> > thing before I waste months more..
> 
> I don't think this in itself is a good motivation for STV.  My comment
> above was more about the idea of STV for AArch64 in general (since it
> had been raised).
> 
> Personally I still think the reduction should be generated in gimple.

I agree, and the proper place to generate the reduction is in SLP.

Richard.


Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-22 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Biener 
>> Sent: Tuesday, November 22, 2022 10:59 AM
>> To: Richard Sandiford 
>> Cc: Tamar Christina via Gcc-patches ; Tamar
>> Christina ; Richard Biener
>> ; nd 
>> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
>> bitfields and array_refs
>>
>> On Tue, 22 Nov 2022, Richard Sandiford wrote:
>>
>> > Tamar Christina via Gcc-patches  writes:
>> > >> So it's not easily possible within the current infrastructure.  But
>> > >> it does look like ARM might eventually benefit from something like STV
>> on x86?
>> > >>
>> > >
>> > > I'm not sure.  The problem with trying to do this in RTL is that
>> > > you'd have to be able to decide from two pseudos whether they come
>> > > from extracts that are sequential. When coming in from a hard
>> > > register that's easy yes.  When coming in from a load, or any other
>> operation that produces pseudos that becomes harder.
>> >
>> > Yeah.
>> >
>> > Just in case anyone reading the above is tempted to implement STV for
>> > AArch64: I think it would set a bad precedent if we had a
>> > paste-&-adjust version of the x86 pass.  AFAIK, the target
>> > capabilities and constraints are mostly modelled correctly using
>> > existing mechanisms, so I don't think there's anything particularly
>> > target-specific about the process of forcing things to be on the general or
>> SIMD/FP side.
>> >
>> > So if we did have an STV-ish thing for AArch64, I think it should be a
>> > target-independent pass that uses hooks and recog, even if the pass is
>> > initially enabled for AArch64 only.
>>
>> Agreed - maybe some of the x86 code can be leveraged, but of course the
>> cost modeling is the most difficult to get right - IIRC the x86 backend 
>> resorts
>> to backend specific tuning flags rather than trying to get rtx_cost or 
>> insn_cost
>> "correct" here.
>>
>> > (FWIW, on the patch itself, I tend to agree that this is really an SLP
>> > optimisation.  If the vectoriser fails to see the benefit, or if it
>> > fails to handle more complex cases, then it would be good to try to
>> > fix that.)
>>
>> Also agreed - but costing is hard ;)
>
> I guess, I still disagree here but I've clearly been out-Richard.  The 
> problem is still
> that this is just basic codegen.  I still don't think it requires -O2 to be 
> usable.
>
> So I guess the only correct implementation is to use an STV-like patch.  But 
> given
> that this is already the second attempt, first RTL one was rejected by 
> Richard,
> second GIMPLE one was rejected by Richi I'd like to get an agreement on this 
> STV
> thing before I waste months more..

I don't think this in itself is a good motivation for STV.  My comment
above was more about the idea of STV for AArch64 in general (since it
had been raised).

Personally I still think the reduction should be generated in gimple.

Thanks,
Richard


RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-22 Thread Tamar Christina via Gcc-patches
> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, November 22, 2022 10:59 AM
> To: Richard Sandiford 
> Cc: Tamar Christina via Gcc-patches ; Tamar
> Christina ; Richard Biener
> ; nd 
> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
> 
> On Tue, 22 Nov 2022, Richard Sandiford wrote:
> 
> > Tamar Christina via Gcc-patches  writes:
> > >> So it's not easily possible within the current infrastructure.  But
> > >> it does look like ARM might eventually benefit from something like STV
> on x86?
> > >>
> > >
> > > I'm not sure.  The problem with trying to do this in RTL is that
> > > you'd have to be able to decide from two pseudos whether they come
> > > from extracts that are sequential. When coming in from a hard
> > > register that's easy yes.  When coming in from a load, or any other
> operation that produces pseudos that becomes harder.
> >
> > Yeah.
> >
> > Just in case anyone reading the above is tempted to implement STV for
> > AArch64: I think it would set a bad precedent if we had a
> > paste-&-adjust version of the x86 pass.  AFAIK, the target
> > capabilities and constraints are mostly modelled correctly using
> > existing mechanisms, so I don't think there's anything particularly
> > target-specific about the process of forcing things to be on the general or
> SIMD/FP side.
> >
> > So if we did have an STV-ish thing for AArch64, I think it should be a
> > target-independent pass that uses hooks and recog, even if the pass is
> > initially enabled for AArch64 only.
> 
> Agreed - maybe some of the x86 code can be leveraged, but of course the
> cost modeling is the most difficult to get right - IIRC the x86 backend 
> resorts
> to backend specific tuning flags rather than trying to get rtx_cost or 
> insn_cost
> "correct" here.
> 
> > (FWIW, on the patch itself, I tend to agree that this is really an SLP
> > optimisation.  If the vectoriser fails to see the benefit, or if it
> > fails to handle more complex cases, then it would be good to try to
> > fix that.)
> 
> Also agreed - but costing is hard ;)

I guess, I still disagree here but I've clearly been out-Richard.  The problem 
is still
that this is just basic codegen.  I still don't think it requires -O2 to be 
usable.

So I guess the only correct implementation is to use an STV-like patch.  But 
given
that this is already the second attempt, first RTL one was rejected by Richard,
second GIMPLE one was rejected by Richi I'd like to get an agreement on this STV
thing before I waste months more..

Thanks,
Tamar

> 
> Richard.


Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-22 Thread Richard Biener via Gcc-patches
On Tue, 22 Nov 2022, Richard Sandiford wrote:

> Tamar Christina via Gcc-patches  writes:
> >> So it's not easily possible within the current infrastructure.  But it 
> >> does look
> >> like ARM might eventually benefit from something like STV on x86?
> >> 
> >
> > I'm not sure.  The problem with trying to do this in RTL is that you'd have 
> > to be
> > able to decide from two pseudos whether they come from extracts that are
> > sequential. When coming in from a hard register that's easy yes.  When 
> > coming in
> > from a load, or any other operation that produces pseudos that becomes 
> > harder.
> 
> Yeah.
> 
> Just in case anyone reading the above is tempted to implement STV for
> AArch64: I think it would set a bad precedent if we had a paste-&-adjust
> version of the x86 pass.  AFAIK, the target capabilities and constraints
> are mostly modelled correctly using existing mechanisms, so I don't
> think there's anything particularly target-specific about the process
> of forcing things to be on the general or SIMD/FP side.
> 
> So if we did have an STV-ish thing for AArch64, I think it should be
> a target-independent pass that uses hooks and recog, even if the pass
> is initially enabled for AArch64 only.

Agreed - maybe some of the x86 code can be leveraged, but of course
the cost modeling is the most difficult to get right - IIRC the x86
backend resorts to backend specific tuning flags rather than trying
to get rtx_cost or insn_cost "correct" here.

> (FWIW, on the patch itself, I tend to agree that this is really an
> SLP optimisation.  If the vectoriser fails to see the benefit, or if
> it fails to handle more complex cases, then it would be good to try
> to fix that.)

Also agreed - but costing is hard ;)

Richard.


Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-22 Thread Richard Sandiford via Gcc-patches
Tamar Christina via Gcc-patches  writes:
>> So it's not easily possible within the current infrastructure.  But it does 
>> look
>> like ARM might eventually benefit from something like STV on x86?
>> 
>
> I'm not sure.  The problem with trying to do this in RTL is that you'd have 
> to be
> able to decide from two pseudos whether they come from extracts that are
> sequential. When coming in from a hard register that's easy yes.  When coming 
> in
> from a load, or any other operation that produces pseudos that becomes harder.

Yeah.

Just in case anyone reading the above is tempted to implement STV for
AArch64: I think it would set a bad precedent if we had a paste-&-adjust
version of the x86 pass.  AFAIK, the target capabilities and constraints
are mostly modelled correctly using existing mechanisms, so I don't
think there's anything particularly target-specific about the process
of forcing things to be on the general or SIMD/FP side.

So if we did have an STV-ish thing for AArch64, I think it should be
a target-independent pass that uses hooks and recog, even if the pass
is initially enabled for AArch64 only.

(FWIW, on the patch itself, I tend to agree that this is really an
SLP optimisation.  If the vectoriser fails to see the benefit, or if
it fails to handle more complex cases, then it would be good to try
to fix that.)

Thanks,
Richard


RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-07 Thread Tamar Christina via Gcc-patches
> -Original Message-
> From: Richard Biener 
> Sent: Monday, November 7, 2022 11:23 AM
> To: Tamar Christina 
> Cc: Richard Biener ; gcc-
> patc...@gcc.gnu.org; nd 
> Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
> 
> On Mon, 7 Nov 2022, Tamar Christina wrote:
> 
> > > -Original Message-
> > > From: Richard Biener 
> > > Sent: Monday, November 7, 2022 10:18 AM
> > > To: Tamar Christina 
> > > Cc: Richard Biener ; gcc-
> > > patc...@gcc.gnu.org; nd 
> > > Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > > bitfields and array_refs
> > >
> > > On Mon, 7 Nov 2022, Tamar Christina wrote:
> > >
> > > > > -Original Message-
> > > > > From: Richard Biener 
> > > > > Sent: Saturday, November 5, 2022 11:33 AM
> > > > > To: Tamar Christina 
> > > > > Cc: gcc-patches@gcc.gnu.org; nd ;
> rguent...@suse.de
> > > > > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions
> > > > > from bitfields and array_refs
> > > > >
> > > > > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
> > > > >  wrote:
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > This patch series is to add recognition of pairwise operations
> > > > > > (reductions) in match.pd such that we can benefit from them
> > > > > > even at
> > > > > > -O1 when the vectorizer isn't enabled.
> > > > > >
> > > > > > The use of these allows for a lot simpler codegen in AArch64
> > > > > > and allows us to avoid quite a lot of codegen warts.
> > > > > >
> > > > > > As an example a simple:
> > > > > >
> > > > > > typedef float v4sf __attribute__((vector_size (16)));
> > > > > >
> > > > > > float
> > > > > > foo3 (v4sf x)
> > > > > > {
> > > > > >   return x[1] + x[2];
> > > > > > }
> > > > > >
> > > > > > currently generates:
> > > > > >
> > > > > > foo3:
> > > > > > dup s1, v0.s[1]
> > > > > > dup s0, v0.s[2]
> > > > > > fadd    s0, s1, s0
> > > > > > ret
> > > > > >
> > > > > > while with this patch series now generates:
> > > > > >
> > > > > > foo3:
> > > > > > ext v0.16b, v0.16b, v0.16b, #4
> > > > > > faddp   s0, v0.2s
> > > > > > ret
> > > > > >
> > > > > > This patch will not perform the operation if the source is not
> > > > > > a gimple register and leaves memory sources to the vectorizer
> > > > > > as it's able to deal correctly with clobbers.
> > > > >
> > > > > But the vectorizer should also be able to cope with the above.
> > > >
> > > > There are several problems with leaving it up to the vectorizer to do:
> > > >
> > > > 1. We only get it at -O2 and higher.
> > > > 2. The way the vectorizer costs the reduction makes the resulting
> > > > cost
> > > always too high for AArch64.
> > > >
> > > > As an example the following:
> > > >
> > > > typedef unsigned int u32v4 __attribute__((vector_size(16)));
> > > > unsigned int f (u32v4 a, u32v4 b) {
> > > > return a[0] + a[1];
> > > > }
> > > >
> > > > Doesn't get SLP'ed because the vectorizer costs it as:
> > > >
> > > > node 0x485eb30 0 times vec_perm costs 0 in body
> > > > _1 + _2 1 times vector_stmt costs 1 in body
> > > > _1 + _2 1 times vec_perm costs 2 in body
> > > > _1 + _2 1 times vec_to_scalar costs 2 in body
> > > >
> > > > And so ultimately you fail because:
> > > >
> > > > /app/example.c:8:17: note: Cost model analysis for part in loop 0:
> > > >   Vector cost: 5
> > > >   Scalar cost: 3
> > > >
> > > > This looks like it's because the vectorizer costs the operation to
> > > > create the BIT_FIELD_REF ; For the red

RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-07 Thread Richard Biener via Gcc-patches
On Mon, 7 Nov 2022, Tamar Christina wrote:

> > -Original Message-
> > From: Richard Biener 
> > Sent: Monday, November 7, 2022 10:18 AM
> > To: Tamar Christina 
> > Cc: Richard Biener ; gcc-
> > patc...@gcc.gnu.org; nd 
> > Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > bitfields and array_refs
> > 
> > On Mon, 7 Nov 2022, Tamar Christina wrote:
> > 
> > > > -Original Message-
> > > > From: Richard Biener 
> > > > Sent: Saturday, November 5, 2022 11:33 AM
> > > > To: Tamar Christina 
> > > > Cc: gcc-patches@gcc.gnu.org; nd ; rguent...@suse.de
> > > > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > > > bitfields and array_refs
> > > >
> > > > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > This patch series is to add recognition of pairwise operations
> > > > > (reductions) in match.pd such that we can benefit from them even
> > > > > at
> > > > > -O1 when the vectorizer isn't enabled.
> > > > >
> > > > > The use of these allows for a lot simpler codegen in AArch64 and
> > > > > allows us to avoid quite a lot of codegen warts.
> > > > >
> > > > > As an example a simple:
> > > > >
> > > > > typedef float v4sf __attribute__((vector_size (16)));
> > > > >
> > > > > float
> > > > > foo3 (v4sf x)
> > > > > {
> > > > >   return x[1] + x[2];
> > > > > }
> > > > >
> > > > > currently generates:
> > > > >
> > > > > foo3:
> > > > > dup s1, v0.s[1]
> > > > > dup s0, v0.s[2]
> > > > > fadd    s0, s1, s0
> > > > > ret
> > > > >
> > > > > while with this patch series now generates:
> > > > >
> > > > > foo3:
> > > > > ext v0.16b, v0.16b, v0.16b, #4
> > > > > faddp   s0, v0.2s
> > > > > ret
> > > > >
> > > > > This patch will not perform the operation if the source is not a
> > > > > gimple register and leaves memory sources to the vectorizer as
> > > > > it's able to deal correctly with clobbers.
> > > >
> > > > But the vectorizer should also be able to cope with the above.
> > >
> > > There are several problems with leaving it up to the vectorizer to do:
> > >
> > > 1. We only get it at -O2 and higher.
> > > 2. The way the vectorizer costs the reduction makes the resulting cost
> > always too high for AArch64.
> > >
> > > As an example the following:
> > >
> > > typedef unsigned int u32v4 __attribute__((vector_size(16))); unsigned
> > > int f (u32v4 a, u32v4 b) {
> > > return a[0] + a[1];
> > > }
> > >
> > > Doesn't get SLP'ed because the vectorizer costs it as:
> > >
> > > node 0x485eb30 0 times vec_perm costs 0 in body
> > > _1 + _2 1 times vector_stmt costs 1 in body
> > > _1 + _2 1 times vec_perm costs 2 in body
> > > _1 + _2 1 times vec_to_scalar costs 2 in body
> > >
> > > And so ultimately you fail because:
> > >
> > > /app/example.c:8:17: note: Cost model analysis for part in loop 0:
> > >   Vector cost: 5
> > >   Scalar cost: 3
> > >
> > > This looks like it's because the vectorizer costs the operation to
> > > create the BIT_FIELD_REF ; For the reduction as
> > > requiring two scalar extracts and a permute. While it ultimately does
> > produce a BIT_FIELD_REF ; that's not what it costs.
> > >
> > > This causes the reduction to almost always be more expensive, so
> > > unless the rest of the SLP tree amortizes the cost we never generate them.
> > 
> > On x86 for example the hadds are prohibitively expensive here.  Are you sure
> > the horizontal add is actually profitable on arm?  Your pattern-matching has
> > no cost modeling at all?
> 
> Yes, they are dirt cheap, that's why we use them for a lot of our codegen for
> e.g. compressing values.
> 
> > 
> > > 3. The SLP only happens on operations that are SLP shaped and where SLP
> > didn&

RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-07 Thread Tamar Christina via Gcc-patches
> -Original Message-
> From: Richard Biener 
> Sent: Monday, November 7, 2022 10:18 AM
> To: Tamar Christina 
> Cc: Richard Biener ; gcc-
> patc...@gcc.gnu.org; nd 
> Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
> 
> On Mon, 7 Nov 2022, Tamar Christina wrote:
> 
> > > -Original Message-
> > > From: Richard Biener 
> > > Sent: Saturday, November 5, 2022 11:33 AM
> > > To: Tamar Christina 
> > > Cc: gcc-patches@gcc.gnu.org; nd ; rguent...@suse.de
> > > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > > bitfields and array_refs
> > >
> > > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
> > >  wrote:
> > > >
> > > > Hi All,
> > > >
> > > > This patch series is to add recognition of pairwise operations
> > > > (reductions) in match.pd such that we can benefit from them even
> > > > at
> > > > -O1 when the vectorizer isn't enabled.
> > > >
> > > > The use of these allows for a lot simpler codegen in AArch64 and
> > > > allows us to avoid quite a lot of codegen warts.
> > > >
> > > > As an example a simple:
> > > >
> > > > typedef float v4sf __attribute__((vector_size (16)));
> > > >
> > > > float
> > > > foo3 (v4sf x)
> > > > {
> > > >   return x[1] + x[2];
> > > > }
> > > >
> > > > currently generates:
> > > >
> > > > foo3:
> > > > dup s1, v0.s[1]
> > > > dup s0, v0.s[2]
> > > > fadd    s0, s1, s0
> > > > ret
> > > >
> > > > while with this patch series now generates:
> > > >
> > > > foo3:
> > > > ext v0.16b, v0.16b, v0.16b, #4
> > > > faddp   s0, v0.2s
> > > > ret
> > > >
> > > > This patch will not perform the operation if the source is not a
> > > > gimple register and leaves memory sources to the vectorizer as
> > > > it's able to deal correctly with clobbers.
> > >
> > > But the vectorizer should also be able to cope with the above.
> >
> > There are several problems with leaving it up to the vectorizer to do:
> >
> > 1. We only get it at -O2 and higher.
> > 2. The way the vectorizer costs the reduction makes the resulting cost
> always too high for AArch64.
> >
> > As an example the following:
> >
> > typedef unsigned int u32v4 __attribute__((vector_size(16))); unsigned
> > int f (u32v4 a, u32v4 b) {
> > return a[0] + a[1];
> > }
> >
> > Doesn't get SLP'ed because the vectorizer costs it as:
> >
> > node 0x485eb30 0 times vec_perm costs 0 in body
> > _1 + _2 1 times vector_stmt costs 1 in body
> > _1 + _2 1 times vec_perm costs 2 in body
> > _1 + _2 1 times vec_to_scalar costs 2 in body
> >
> > And so ultimately you fail because:
> >
> > /app/example.c:8:17: note: Cost model analysis for part in loop 0:
> >   Vector cost: 5
> >   Scalar cost: 3
> >
> > This looks like it's because the vectorizer costs the operation to
> > create the BIT_FIELD_REF ; For the reduction as
> > requiring two scalar extracts and a permute. While it ultimately does
> produce a BIT_FIELD_REF ; that's not what it costs.
> >
> > This causes the reduction to almost always be more expensive, so
> > unless the rest of the SLP tree amortizes the cost we never generate them.
> 
> On x86 for example the hadds are prohibitively expensive here.  Are you sure
> the horizontal add is actually profitable on arm?  Your pattern-matching has
> no cost modeling at all?

Yes, they are dirt cheap, that's why we use them for a lot of our codegen for
e.g. compressing values.

> 
> > 3. The SLP only happens on operations that are SLP shaped and where SLP
> didn't fail.
> >
> > As a simple example, the vectorizer can't SLP the following:
> >
> > unsigned int f (u32v4 a, u32v4 b)
> > {
> > a[0] += b[0];
> > return a[0] + a[1];
> > }
> >
> > Because there's not enough VF here and it can't unroll. This and many
> > others fail because they're not an SLP-able operation, or SLP build fails.
> 
> That's of course because the pattern matching for reductions is too simple
> here, getting us a group size of three.  Bad associ

RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-07 Thread Richard Biener via Gcc-patches
On Mon, 7 Nov 2022, Tamar Christina wrote:

> > -Original Message-
> > From: Richard Biener 
> > Sent: Saturday, November 5, 2022 11:33 AM
> > To: Tamar Christina 
> > Cc: gcc-patches@gcc.gnu.org; nd ; rguent...@suse.de
> > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > bitfields and array_refs
> > 
> > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches  > patc...@gcc.gnu.org> wrote:
> > >
> > > Hi All,
> > >
> > > This patch series is to add recognition of pairwise operations
> > > (reductions) in match.pd such that we can benefit from them even at
> > > -O1 when the vectorizer isn't enabled.
> > >
> > > The use of these allows for a lot simpler codegen in AArch64 and allows
> > > us to avoid quite a lot of codegen warts.
> > >
> > > As an example a simple:
> > >
> > > typedef float v4sf __attribute__((vector_size (16)));
> > >
> > > float
> > > foo3 (v4sf x)
> > > {
> > >   return x[1] + x[2];
> > > }
> > >
> > > currently generates:
> > >
> > > foo3:
> > > dup s1, v0.s[1]
> > > dup s0, v0.s[2]
> > > fadd    s0, s1, s0
> > > ret
> > >
> > > while with this patch series now generates:
> > >
> > > foo3:
> > > ext v0.16b, v0.16b, v0.16b, #4
> > > faddp   s0, v0.2s
> > > ret
> > >
> > > This patch will not perform the operation if the source is not a
> > > gimple register and leaves memory sources to the vectorizer as it's
> > > able to deal correctly with clobbers.
> > 
> > But the vectorizer should also be able to cope with the above.  
> 
> There are several problems with leaving it up to the vectorizer to do:
> 
> 1. We only get it at -O2 and higher.
> 2. The way the vectorizer costs the reduction makes the resulting cost always 
> too high for AArch64.
> 
> As an example the following:
> 
> typedef unsigned int u32v4 __attribute__((vector_size(16)));
> unsigned int f (u32v4 a, u32v4 b)
> {
> return a[0] + a[1];
> }
> 
> Doesn't get SLP'ed because the vectorizer costs it as:
> 
> node 0x485eb30 0 times vec_perm costs 0 in body
> _1 + _2 1 times vector_stmt costs 1 in body
> _1 + _2 1 times vec_perm costs 2 in body
> _1 + _2 1 times vec_to_scalar costs 2 in body
> 
> And so ultimately you fail because:
> 
> /app/example.c:8:17: note: Cost model analysis for part in loop 0:
>   Vector cost: 5
>   Scalar cost: 3
> 
> This looks like it's because the vectorizer costs the operation to create the 
> BIT_FIELD_REF ;
> For the reduction as requiring two scalar extracts and a permute. While it 
> ultimately does produce a
> BIT_FIELD_REF ; that's not what it costs.
> 
> This causes the reduction to almost always be more expensive, so unless the 
> rest of the SLP tree amortizes
> the cost we never generate them.

On x86 for example the hadds are prohibitively expensive here.  Are you sure
the horizontal add is actually profitable on arm?  Your pattern-matching
has no cost modeling at all?

> 3. The SLP only happens on operations that are SLP shaped and where SLP didn't 
> fail.
> 
> As a simple example, the vectorizer can't SLP the following:
> 
> unsigned int f (u32v4 a, u32v4 b)
> {
> a[0] += b[0];
> return a[0] + a[1];
> }
> 
> Because there's not enough VF here and it can't unroll. This and many others 
> fail because they're not an
> SLP-able operation, or SLP build fails.

That's of course because the pattern matching for reductions is too simple
here, getting us a group size of three.  Bad association would make your
simple pattern matching fail as well.

> This causes us to generate for e.g. this example:
> 
> f:
> dup s2, v0.s[1]
> fmov    w1, s1
> add v0.2s, v2.2s, v0.2s
> fmov    w0, s0
> add w0, w0, w1
> ret
> 
> instead of with my patch:
> 
> f:
> addp    v0.2s, v0.2s, v0.2s
> add v0.2s, v0.2s, v1.2s
> fmov    w0, s0
> ret
> 
> which is significantly better code.  So I don't think the vectorizer is the 
> right solution for this.

Simple pattern matching isn't either.  In fact basic-block SLP is supposed
to be the advanced pattern matching including a cost model.

IMHO the correct approach is to improve that, 
vect_slp_check_for_constructors plus how we handle/recover from SLP
discovery fails as i

RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-06 Thread Tamar Christina via Gcc-patches
> -Original Message-
> From: Richard Biener 
> Sent: Saturday, November 5, 2022 11:33 AM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; rguent...@suse.de
> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
> 
> On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches  patc...@gcc.gnu.org> wrote:
> >
> > Hi All,
> >
> > This patch series is to add recognition of pairwise operations
> > (reductions) in match.pd such that we can benefit from them even at
> > -O1 when the vectorizer isn't enabled.
> >
> > The use of these allows for a lot simpler codegen in AArch64 and allows
> > us to avoid quite a lot of codegen warts.
> >
> > As an example a simple:
> >
> > typedef float v4sf __attribute__((vector_size (16)));
> >
> > float
> > foo3 (v4sf x)
> > {
> >   return x[1] + x[2];
> > }
> >
> > currently generates:
> >
> > foo3:
> > dup s1, v0.s[1]
> > dup s0, v0.s[2]
> > fadd    s0, s1, s0
> > ret
> >
> > while with this patch series now generates:
> >
> > foo3:
> > ext v0.16b, v0.16b, v0.16b, #4
> > faddp   s0, v0.2s
> > ret
> >
> > This patch will not perform the operation if the source is not a
> > gimple register and leaves memory sources to the vectorizer as it's
> > able to deal correctly with clobbers.
> 
> But the vectorizer should also be able to cope with the above.  

There are several problems with leaving it up to the vectorizer to do:

1. We only get it at -O2 and higher.
2. The way the vectorizer costs the reduction makes the resulting cost always 
too high for AArch64.

As an example the following:

typedef unsigned int u32v4 __attribute__((vector_size(16)));
unsigned int f (u32v4 a, u32v4 b)
{
return a[0] + a[1];
}

Doesn't get SLP'ed because the vectorizer costs it as:

node 0x485eb30 0 times vec_perm costs 0 in body
_1 + _2 1 times vector_stmt costs 1 in body
_1 + _2 1 times vec_perm costs 2 in body
_1 + _2 1 times vec_to_scalar costs 2 in body

And so ultimately you fail because:

/app/example.c:8:17: note: Cost model analysis for part in loop 0:
  Vector cost: 5
  Scalar cost: 3
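
(In other words, the SLP body is costed at 0 + 1 + 2 + 2 = 5 units against a
scalar cost of 3, presumably the two extracts and the add at one unit each, so
the vector form always loses by this accounting.)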

This looks like it's because the vectorizer costs the operation to create the 
BIT_FIELD_REF ;
For the reduction as requiring two scalar extracts and a permute. While it 
ultimately does produce a
BIT_FIELD_REF ; that's not what it costs.

This causes the reduction to almost always be more expensive, so unless the 
rest of the SLP tree amortizes
the cost we never generate them.

3. The SLP only happens on operations that are SLP shaped and where SLP didn't 
fail.

As a simple example, the vectorizer can't SLP the following:

unsigned int f (u32v4 a, u32v4 b)
{
a[0] += b[0];
return a[0] + a[1];
}

Because there's not enough VF here and it can't unroll. This and many others 
fail because they're not an
SLP-able operation, or SLP build fails.

This causes us to generate for e.g. this example:

f:
dup s2, v0.s[1]
fmov    w1, s1
add v0.2s, v2.2s, v0.2s
fmov    w0, s0
add w0, w0, w1
ret

instead of with my patch:

f:
addp    v0.2s, v0.2s, v0.2s
add v0.2s, v0.2s, v1.2s
fmov    w0, s0
ret

which is significantly better code.  So I don't think the vectorizer is the 
right solution for this.

> I don't think
> we want to do this as part of general folding.  Iff, then this belongs in 
> specific
> points of the pass pipeline, no?

The reason I currently have it as such is because in general the compiler 
doesn't really deal with
horizontal reductions at all.  Also since the vectorizer itself can introduce 
reductions I figured it's
better to have one representation for this.  So admittedly perhaps this should 
only be done after
vectorization as that's when today we expect reductions to be in Gimple.

As for having it in a specific point in the pass pipeline, I have it as a 
general one since a number of
passes could create the form for the reduction, for instance vec_lower could 
break up an operation
to allow this to match.  The bigger BIT_FIELD_REF it creates could also lead 
to other optimizations.

Additionally you had mentioned last time that Andrew was trying to move min/max 
detection to match.pd
So I had figured this was the correct place for it.

That said, I have no intuition for what would be better here, since the check is
quite cheap.  But do you have a particular place you want this moved to then?
Ideally I'd want it before the last FRE pass, but perhaps isel?

Thanks,
Tamar

> 
> > The use of thes

Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-11-05 Thread Richard Biener via Gcc-patches
On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
 wrote:
>
> Hi All,
>
> This patch series is to add recognition of pairwise operations (reductions)
> in match.pd such that we can benefit from them even at -O1 when the vectorizer
> isn't enabled.
>
> The use of these allows for a lot simpler codegen in AArch64 and allows us to
> avoid quite a lot of codegen warts.
>
> As an example a simple:
>
> typedef float v4sf __attribute__((vector_size (16)));
>
> float
> foo3 (v4sf x)
> {
>   return x[1] + x[2];
> }
>
> currently generates:
>
> foo3:
> dup s1, v0.s[1]
> dup s0, v0.s[2]
> fadd    s0, s1, s0
> ret
>
> while with this patch series now generates:
>
> foo3:
> ext v0.16b, v0.16b, v0.16b, #4
> faddp   s0, v0.2s
> ret
>
> This patch will not perform the operation if the source is not a gimple
> register and leaves memory sources to the vectorizer as it's able to deal
> correctly with clobbers.

But the vectorizer should also be able to cope with the above.  I don't think
we want to do this as part of general folding.  Iff, then this belongs
in specific
points of the pass pipeline, no?

> The use of these instructions makes a significant difference in codegen quality
> for AArch64 and Arm.
>
> NOTE: The last entry in the series contains tests for all of the previous
> patches as it's a bit of an all or nothing thing.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> * match.pd (adjacent_data_access_p): Import.
> Add new pattern for pairwise plus, min, max, fmax, fmin.
> * tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
> * tree.cc (adjacent_data_access_p): New.
> * tree.h (adjacent_data_access_p): New.
>
> --- inline copy of patch --
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 
> 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10eb18db23c0cd9f7
>  100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
> HONOR_NANS
> uniform_vector_p
> expand_vec_cmp_expr_p
> -   bitmask_inv_cst_vector_p)
> +   bitmask_inv_cst_vector_p
> +   adjacent_data_access_p)
>
>  /* Operator lists.  */
>  (define_operator_list tcc_comparison
> @@ -7195,6 +7196,47 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>
>  /* Canonicalizations of BIT_FIELD_REFs.  */
>
> +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */
> +(for op (plus min max FMIN_ALL FMAX_ALL)
> + ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
> + IFN_REDUC_FMIN IFN_REDUC_FMAX)
> + (simplify
> +  (op @0 @1)
> +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> +(with { poly_uint64 nloc = 0;
> +   tree src = adjacent_data_access_p (@0, @1, &nloc, true);
> +   tree ntype = build_vector_type (type, 2);
> +   tree size = TYPE_SIZE (ntype);
> +   tree pos = build_int_cst (TREE_TYPE (size), nloc);
> +   poly_uint64 _sz;
> +   poly_uint64 _total; }
> + (if (src && is_gimple_reg (src) && ntype
> + && poly_int_tree_p (size, &_sz)
> + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> + && known_ge (_total, _sz + nloc))
> +  (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; })))
> +
> +(for op (lt gt)
> + ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
> + ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)
> + (simplify
> +  (cond (op @0 @1) @0 @1)
> +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> +(with { poly_uint64 nloc = 0;
> +   tree src = adjacent_data_access_p (@0, @1, &nloc, false);
> +   tree ntype = build_vector_type (type, 2);
> +   tree size = TYPE_SIZE (ntype);
> +   tree pos = build_int_cst (TREE_TYPE (size), nloc);
> +   poly_uint64 _sz;
> +   poly_uint64 _total; }
> + (if (src && is_gimple_reg (src) && ntype
> + && poly_int_tree_p (size, &_sz)
> + && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> + && known_ge (_total, _sz + nloc))
> +  (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
> +   (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
> +   (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }
> +
>  (simplify
>   (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
>   (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
> diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
> index 
> 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006603ae1e3deb6b76
>  100644
> --- a/gcc/tree-cfg.cc
> +++ b/gcc/tree-cfg.cc
> @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
>  {
>tree arg = gimple_call_arg (stmt, i);
>if ((is_gimple_reg_type (TREE_TYPE (arg))
> +  && !is_gimple_variable (arg)
>&& !is_gimple_val (arg))
>   || (!is_gi
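
As a rough, illustrative sketch (not taken from the patch; SSA names and dump
syntax are approximate), the first new match.pd pattern above rewrites the
GIMPLE for the foo3 example from two scalar extracts plus a scalar add into a
single two-lane extract reduced by an internal function:

  /* Before: two scalar extracts feeding a scalar add.  */
  _1 = BIT_FIELD_REF <x_3(D), 32, 32>;
  _2 = BIT_FIELD_REF <x_3(D), 32, 64>;
  _4 = _1 + _2;

  /* After: one 64-bit extract of the adjacent pair, reduced in one step.  */
  _5 = BIT_FIELD_REF <x_3(D), 64, 32>;
  _4 = .REDUC_PLUS (_5);

The second pattern treats the a < b ? a : b and a > b ? a : b forms of adjacent
accesses analogously, mapping them to .REDUC_MIN / .REDUC_MAX, or to
.REDUC_FMIN / .REDUC_FMAX for scalar floating-point types.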

Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/31/22 05:56, Tamar Christina wrote:

Hi All,

This patch series is to add recognition of pairwise operations (reductions)
in match.pd such that we can benefit from them even at -O1 when the vectorizer
isn't enabled.

The use of these allows for a lot simpler codegen in AArch64 and allows us to
avoid quite a lot of codegen warts.

As an example a simple:

typedef float v4sf __attribute__((vector_size (16)));

float
foo3 (v4sf x)
{
   return x[1] + x[2];
}

currently generates:

foo3:
 dup s1, v0.s[1]
 dup s0, v0.s[2]
 fadd    s0, s1, s0
 ret

while with this patch series now generates:

foo3:
ext v0.16b, v0.16b, v0.16b, #4
faddp   s0, v0.2s
ret

This patch will not perform the operation if the source is not a gimple
register and leaves memory sources to the vectorizer as it's able to deal
correctly with clobbers.
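
A minimal sketch of the kind of memory-sourced case that is left to the
vectorizer (a hypothetical example, not part of the patch; foo3_mem and g are
invented names):

typedef float v4sf __attribute__((vector_size (16)));

/* g lives in memory, so g[1] and g[2] are loads rather than extracts from a
   gimple register; the pattern skips this case and the vectorizer, which can
   reason about clobbers, stays responsible for it.  */
v4sf g;

float
foo3_mem (void)
{
  return g[1] + g[2];
}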

The use of these instructions makes a significant difference in codegen quality
for AArch64 and Arm.

NOTE: The last entry in the series contains tests for all of the previous
patches as it's a bit of an all or nothing thing.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* match.pd (adjacent_data_access_p): Import.
Add new pattern for pairwise plus, min, max, fmax, fmin.
* tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
* tree.cc (adjacent_data_access_p): New.
* tree.h (adjacent_data_access_p): New.


Nice stuff.  I'd pondered some similar stuff at Tachyum, but got dragged 
away before it could be implemented.

diff --git a/gcc/tree.cc b/gcc/tree.cc
index 
007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19de9841e29d311
 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
return builder.build ();
  }
  
+/* Returns base address if the two operands represent adjacent access of data

+   such that a pairwise operation can be used.  OP1 must be a lower subpart
+   than OP2.  If POS is not NULL then on return if a value is returned POS
+   will indicate the position of the lower address.  If COMMUTATIVE_P then
+   the operation is also tried by flipping op1 and op2.  */
+
+tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
+bool commutative_p)


Formatting nit.  Return type on a different line.


OK with that fixed.


jeff
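
For readers following the review without the full patch, a heavily simplified
sketch of what a helper with the adjacent_data_access_p signature above might
check for the plain BIT_FIELD_REF case (illustrative only; the posted helper is
not reproduced here and, per the subject line, also handles array and bit-field
accesses):

/* Sketch: if OP1 and OP2 are equal-width BIT_FIELD_REFs of the same base with
   OP1 immediately below OP2, return the base and record the lower bit
   position in *POS.  If COMMUTATIVE_P, also accept the operands swapped.  */

static tree
adjacent_bit_field_refs_sketch (tree op1, tree op2, poly_uint64 *pos,
				bool commutative_p)
{
  if (TREE_CODE (op1) != BIT_FIELD_REF || TREE_CODE (op2) != BIT_FIELD_REF)
    return NULL_TREE;

  /* Both references must be rooted at the same object.  */
  if (!operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND (op2, 0), 0))
    return NULL_TREE;

  poly_uint64 size1, size2, off1, off2;
  if (!poly_int_tree_p (TREE_OPERAND (op1, 1), &size1)
      || !poly_int_tree_p (TREE_OPERAND (op2, 1), &size2)
      || !poly_int_tree_p (TREE_OPERAND (op1, 2), &off1)
      || !poly_int_tree_p (TREE_OPERAND (op2, 2), &off2)
      || maybe_ne (size1, size2))
    return NULL_TREE;

  /* OP1 ends exactly where OP2 starts: an adjacent pair.  */
  if (known_eq (off1 + size1, off2))
    {
      *pos = off1;
      return TREE_OPERAND (op1, 0);
    }
  if (commutative_p && known_eq (off2 + size2, off1))
    {
      *pos = off2;
      return TREE_OPERAND (op1, 0);
    }
  return NULL_TREE;
}

With a helper along these lines, the match.pd patterns quoted earlier in the
thread only need to build the two-element vector type and feed a BIT_FIELD_REF
of twice the element size at *POS to the appropriate IFN_REDUC_* internal
function.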