Re: [RFC][PATCH] Remove a bad use of SLOW_UNALIGNED_ACCESS

2016-11-15 Thread Jeff Law

On 11/01/2016 03:39 PM, Wilco Dijkstra wrote:

 Jeff Law  wrote:


I think you'll need to look at bz61320 before this could go in.


I had a look, but there is nothing there that is related - eventually
a latent alignment bug was fixed in IVOpt.

Excellent.  Thanks for digging into what really happened.


Note that the bswap phase
currently inserts unaligned accesses irrespective of STRICT_ALIGNMENT
or SLOW_UNALIGNED_ACCESS:

-  if (bswap
-      && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
-      && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
-    return false;

If bswap is false no byte swap is needed, so we found a native endian load
and it will always perform the optimization by inserting an unaligned load.
This apparently works on all targets, and doesn't cause alignment traps or
huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
So I'm at a loss what these macros are supposed to mean and how I can query
whether a backend supports fast unaligned access for a particular mode.

What I actually want to write is something like:

 if (!FAST_UNALIGNED_LOAD (mode, align)) return false;

And know that it only accepts unaligned accesses that are efficient on the target.
Maybe we need a new hook like this and get rid of the old one?
As Richi indicated later, these decisions are probably made best at 
expansion time -- as long as we have the required information.  So I'd 
only go with a hook if (for example) the alignment information is lost 
by the time we get to expansion and thus we can't DTRT at expansion time.


Patch is OK.

jeff


Re: [RFC][PATCH] Remove a bad use of SLOW_UNALIGNED_ACCESS

2016-11-02 Thread Richard Biener
On Wed, Nov 2, 2016 at 2:43 PM, Wilco Dijkstra  wrote:
> Richard Biener wrote:
> On Tue, Nov 1, 2016 at 10:39 PM, Wilco Dijkstra wrote:
>
>> > If bswap is false no byte swap is needed, so we found a native endian load
>> > and it will always perform the optimization by inserting an unaligned load.
>>
>> Yes, the general agreement is that the expander can do best and thus we
>> should canonicalize accesses to larger ones even for SLOW_UNALIGNED_ACCESS.
>> The expander will generate the canonical best code (hopefully...).
>
> Right, but there are cases where you have to choose between unaligned or 
> aligned
> accesses and you need to know whether the unaligned access is fast.
>
> A good example is memcpy expansion: if you have fast unaligned accesses then
> you should use them to deal with the last few bytes, but if they get expanded,
> using several aligned accesses is much faster than a single unaligned access.

Yes.  That's RTL expansion at which point you of course have to look
at SLOW_UNALIGNED_ACCESS.

>> > This apparently works on all targets, and doesn't cause alignment traps or
>> > huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
>> > So I'm at a loss what these macros are supposed to mean and how I can query
>> > whether a backend supports fast unaligned access for a particular mode.
>> >
>> > What I actually want to write is something like:
>> >
>> >  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
>> >
>> > And know that it only accepts unaligned accesses that are efficient on the target.
>> > Maybe we need a new hook like this and get rid of the old one?
>>
>> No, we don't need another hook.
>>
>> Note there is another similar user in gimple-fold.c when folding small
>> memcpy/memmove to single load/store pairs (patch posted but not applied
>> by me -- I've asked for strict-align target maintainer feedback but got
>> none).
>
> I didn't find it, do you have a link?

https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00598.html

>> Now - for bswap I'm only 99% sure that unaligned load + bswap is
>> better than piecewise loads plus manual swap.
>
> It depends on whether unaligned loads and bswap are expanded or not. Even if
> we assume the expansion is at least as efficient as doing it explicitly
> (definitely true for modes larger than the native integer size - as we found
> out in PR77308!), if both the unaligned load and bswap are expanded it seems
> better not to make the transformation for modes up to the word size. But
> there is no way to find out as SLOW_UNALIGNED_ACCESS must be true whenever
> STRICT_ALIGNMENT is true.

The case I was thinking about is availability of a bswap load operating only on
aligned memory and "regular" register bswap being "fake" provided by first
spilling to an aligned stack slot and then loading from that.

Maybe a bit far-fetched.

>> But generally I'm always in favor of removing SLOW_UNALIGNED_ACCESS /
>> STRICT_ALIGNMENT checks from the GIMPLE side of the compiler.
>
> I sort of agree because the purpose of these macros is unclear - the
> documentation is insufficient and out of date. I do believe however we need
> an accurate way to find out whether a target supports fast unaligned accesses
> as that is required to generate good target code.

I believe the target macros are solely for RTL expansion and say that
it has to avoid unaligned ops as those would trap.

Richard.

> Wilco


Re: [RFC][PATCH] Remove a bad use of SLOW_UNALIGNED_ACCESS

2016-11-02 Thread Wilco Dijkstra
Richard Biener wrote:
On Tue, Nov 1, 2016 at 10:39 PM, Wilco Dijkstra  wrote:

> > If bswap is false no byte swap is needed, so we found a native endian load
> > and it will always perform the optimization by inserting an unaligned load.
>
> Yes, the general agreement is that the expander can do best and thus we
> should canonicalize accesses to larger ones even for SLOW_UNALIGNED_ACCESS.
> The expander will generate the canonical best code (hopefully...).

Right, but there are cases where you have to choose between unaligned or aligned
accesses and you need to know whether the unaligned access is fast.

A good example is memcpy expansion: if you have fast unaligned accesses then
you should use them to deal with the last few bytes, but if they get expanded,
using several aligned accesses is much faster than a single unaligned access.
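
As a rough sketch (not GCC's actual memcpy expander; the helper names are
made up), the two strategies for the tail of an n-byte copy with n >= 8,
after the aligned 8-byte main loop, could look like this:

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Fast unaligned access: finish with one overlapping, possibly
     unaligned, 8-byte copy covering the last bytes.  */
  static void
  copy_tail_unaligned (unsigned char *dst, const unsigned char *src, size_t n)
  {
    uint64_t tmp;
    memcpy (&tmp, src + n - 8, 8);   /* one unaligned 8-byte load  */
    memcpy (dst + n - 8, &tmp, 8);   /* one unaligned 8-byte store */
  }

  /* Unaligned access slow or expanded: finish with up to 7 byte copies.  */
  static void
  copy_tail_bytes (unsigned char *dst, const unsigned char *src, size_t n)
  {
    for (size_t i = n & ~(size_t) 7; i < n; i++)
      dst[i] = src[i];
  }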

> > This apparently works on all targets, and doesn't cause alignment traps or
> > huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
> > So I'm at a loss what these macros are supposed to mean and how I can query
> > whether a backend supports fast unaligned access for a particular mode.
> >
> > What I actually want to write is something like:
> >
> >  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
> >
> > And know that it only accepts unaligned accesses that are efficient on the target.
> > Maybe we need a new hook like this and get rid of the old one?
>
> No, we don't need another hook.
> 
> Note there is another similar user in gimple-fold.c when folding small
> memcpy/memmove to single load/store pairs (patch posted but not applied
> by me -- I've asked for strict-align target maintainer feedback but got
> none).

I didn't find it, do you have a link?

> Now - for bswap I'm only 99% sure that unaligned load + bswap is
> better than piecewise loads plus manual swap.

It depends on whether unaligned loads and bswap are expanded or not. Even if we
assume the expansion is at least as efficient as doing it explicitly (definitely
true for modes larger than the native integer size - as we found out in PR77308!),
if both the unaligned load and bswap are expanded it seems better not to make the
transformation for modes up to the word size. But there is no way to find out as
SLOW_UNALIGNED_ACCESS must be true whenever STRICT_ALIGNMENT is true.

> But generally I'm always in favor of removing SLOW_UNALIGNED_ACCESS /
> STRICT_ALIGNMENT checks from the GIMPLE side of the compiler.

I sort of agree because the purpose of these macros is unclear - the
documentation is insufficient and out of date. I do believe however we need an
accurate way to find out whether a target supports fast unaligned accesses as
that is required to generate good target code.

Wilco 

Re: [RFC][PATCH] Remove a bad use of SLOW_UNALIGNED_ACCESS

2016-11-02 Thread Richard Biener
On Tue, Nov 1, 2016 at 10:39 PM, Wilco Dijkstra  wrote:
>  Jeff Law  wrote:
>
>> I think you'll need to look at bz61320 before this could go in.
>
> I had a look, but there is nothing there that is related - eventually
> a latent alignment bug was fixed in IVOpt. Note that the bswap phase
> currently inserts unaligned accesses irrespective of STRICT_ALIGNMENT
> or SLOW_UNALIGNED_ACCESS:
>
> -  if (bswap
> -      && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
> -      && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
> -    return false;
>
> If bswap is false no byte swap is needed, so we found a native endian load
> and it will always perform the optimization by inserting an unaligned load.

Yes, the general agreement is that the expander can do best and thus we
should canonicalize accesses to larger ones even for SLOW_UNALIGNED_ACCESS.
The expander will generate the canonical best code (hopefully...).

> This apparently works on all targets, and doesn't cause alignment traps or
> huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
> So I'm at a loss what these macros are supposed to mean and how I can query
> whether a backend supports fast unaligned access for a particular mode.
>
> What I actually want to write is something like:
>
>  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
>
> > And know that it only accepts unaligned accesses that are efficient on the target.
> Maybe we need a new hook like this and get rid of the old one?

No, we don't need another hook.

Note there is another similar user in gimple-fold.c when folding small
memcpy/memmove to single load/store pairs (patch posted but not applied
by me -- I've asked for strict-align target maintainer feedback but got
none).
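
For example (illustrative only, not taken from that patch), a fixed-size
copy such as the following can be folded to a single, possibly unaligned,
32-bit load and store instead of a libcall:

  #include <stdint.h>
  #include <string.h>

  /* 4-byte memcpy with a constant size: a candidate for folding into a
     single load/store pair.  */
  uint32_t
  load_u32 (const void *p)
  {
    uint32_t v;
    memcpy (&v, p, sizeof v);
    return v;
  }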

Now - for bswap I'm only 99% sure that unaligned load + bswap is
better than piecewise loads plus manual swap.

But generally I'm always in favor of removing SLOW_UNALIGNED_ACCESS /
STRICT_ALIGNMENT checks from the GIMPLE side of the compiler.

Richard.

> Wilco
>


Re: [RFC][PATCH] Remove a bad use of SLOW_UNALIGNED_ACCESS

2016-11-01 Thread Wilco Dijkstra
 Jeff Law  wrote:

> I think you'll need to look at bz61320 before this could go in.

I had a look, but there is nothing there that is related - eventually
a latent alignment bug was fixed in IVOpt. Note that the bswap phase
currently inserts unaligned accesses irrespective of STRICT_ALIGNMENT
or SLOW_UNALIGNED_ACCESS:

-  if (bswap
-      && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
-      && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
-    return false;

If bswap is false no byte swap is needed, so we found a native endian load
and it will always perform the optimization by inserting an unaligned load.
This apparently works on all targets, and doesn't cause alignment traps or
huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
So I'm at a loss what these macros are supposed to mean and how I can query
whether a backend supports fast unaligned access for a particular mode.
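
For illustration only (this is not the PR77308 testcase), a byte-wise
little-endian read like the one below is recognised as a native endian load
on a little-endian target, so bswap is false and the pass emits a single,
possibly unaligned, 32-bit load:

  #include <stdint.h>

  /* Byte-wise little-endian read; the bswap pass combines the four byte
     loads into one SImode load, with no byte swap needed.  */
  uint32_t
  read_le32 (const unsigned char *p)
  {
    return (uint32_t) p[0]
           | ((uint32_t) p[1] << 8)
           | ((uint32_t) p[2] << 16)
           | ((uint32_t) p[3] << 24);
  }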

What I actually want to write is something like:

 if (!FAST_UNALIGNED_LOAD (mode, align)) return false;

And know that it only accepts unaligned accesses that are efficient on the target.
Maybe we need a new hook like this and get rid of the old one?

Wilco


Re: [RFC][PATCH] Remove a bad use of SLOW_UNALIGNED_ACCESS

2016-11-01 Thread Jeff Law

On 11/01/2016 11:36 AM, Wilco Dijkstra wrote:

Looking at PR77308, one of the issues is that the bswap optimization
phase doesn't work on ARM.  This is due to an odd check that uses
SLOW_UNALIGNED_ACCESS (which is always true on ARM).  Since the testcase
in PR77308 generates much better code with this patch (~13% fewer
instructions), it seems best to remove this odd check.

This exposes a problem with SLOW_UNALIGNED_ACCESS - what is it supposed
to mean or do? According to its current definition, it means we should
never emit an unaligned access for a given mode as it would lead to a
trap.  However, that is not what happens; for example, all integer modes on
ARM support really fast unaligned access and we generate unaligned instructions
without any issues.  Some Thumb-1 targets automatically expand unaligned
accesses if necessary.  So this macro clearly doesn't stop unaligned accesses
from being generated.

So I want to set it to false for most modes on ARM as they are not slow.
However doing so causes incorrect code generation and unaligned traps.
How can we differentiate between modes that support fast unaligned access
in hardware, modes that get expanded and modes that should never be used in
an unaligned access?

Bootstrap & regress OK.

ChangeLog:
2016-11-01  Wilco Dijkstra  

gcc/
* tree-ssa-math-opts.c (bswap_replace): Remove test
of SLOW_UNALIGNED_ACCESS.

testsuite/
* gcc.dg/optimize-bswapdi-3.c: Remove xfail.
* gcc.dg/optimize-bswaphi-1.c: Likewise.
* gcc.dg/optimize-bswapsi-2.c: Likewise.

I think you'll need to look at bz61320 before this could go in.

jeff






[RFC][PATCH] Remove a bad use of SLOW_UNALIGNED_ACCESS

2016-11-01 Thread Wilco Dijkstra
Looking at PR77308, one of the issues is that the bswap optimization 
phase doesn't work on ARM.  This is due to an odd check that uses
SLOW_UNALIGNED_ACCESS (which is always true on ARM).  Since the testcase
in PR77308 generates much better code with this patch (~13% fewer
instructions), it seems best to remove this odd check.
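
As a minimal illustration (not the PR77308 testcase itself), the kind of
pattern the bswap phase targets is a byte-wise big-endian read such as:

  #include <stdint.h>

  /* Byte-wise big-endian read; on a little-endian target this should
     become one 32-bit load (possibly unaligned) followed by a byte
     reverse, e.g. REV on ARM, instead of four loads plus shifts/ors.  */
  uint32_t
  read_be32 (const unsigned char *p)
  {
    return ((uint32_t) p[0] << 24)
           | ((uint32_t) p[1] << 16)
           | ((uint32_t) p[2] << 8)
           | (uint32_t) p[3];
  }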

This exposes a problem with SLOW_UNALIGNED_ACCESS - what is it supposed
to mean or do? According to its current definition, it means we should
never emit an unaligned access for a given mode as it would lead to a
trap.  However, that is not what happens; for example, all integer modes on
ARM support really fast unaligned access and we generate unaligned instructions
without any issues.  Some Thumb-1 targets automatically expand unaligned
accesses if necessary.  So this macro clearly doesn't stop unaligned accesses
from being generated.

So I want to set it to false for most modes on ARM as they are not slow. 
However doing so causes incorrect code generation and unaligned traps.
How can we differentiate between modes that support fast unaligned access
in hardware, modes that get expanded and modes that should never be used in
an unaligned access?

Bootstrap & regress OK.

ChangeLog:
2016-11-01  Wilco Dijkstra  

gcc/
* tree-ssa-math-opts.c (bswap_replace): Remove test
of SLOW_UNALIGNED_ACCESS.

testsuite/
* gcc.dg/optimize-bswapdi-3.c: Remove xfail.
* gcc.dg/optimize-bswaphi-1.c: Likewise.
* gcc.dg/optimize-bswapsi-2.c: Likewise.

--

diff --git a/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c b/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c
index 273b4bc622cb32564533e1352b5fc8ad52054f8b..6f682014622ab79e541cdf26d13f16a7d87f158d 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c
@@ -61,4 +61,4 @@ uint64_t read_be64_3 (unsigned char *data)
 }
 
 /* { dg-final { scan-tree-dump-times "64 bit load in target endianness found at" 3 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "64 bit bswap implementation found at" 3 "bswap" { xfail alpha*-*-* arm*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "64 bit bswap implementation found at" 3 "bswap" } } */
diff --git a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
index c18ca6174d12a786a71252dfe47cfe78ca58750a..852ccfe5c1acd519f2cf340cc55f3ea74b1ec21f 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
@@ -55,5 +55,4 @@ swap16 (HItype in)
 }
 
 /* { dg-final { scan-tree-dump-times "16 bit load in target endianness found at" 3 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 1 "bswap" { target alpha*-*-* arm*-*-* } } } */
-/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 4 "bswap" { xfail alpha*-*-* arm*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 4 "bswap" } } */
diff --git a/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c b/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c
index a1558af2cc74adde439d42223b00977d9eeb9639..01ae3776ed3f44fbc300d001f8c67ec11625d03b 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c
@@ -45,4 +45,4 @@ uint32_t read_be32_3 (unsigned char *data)
 }
 
 /* { dg-final { scan-tree-dump-times "32 bit load in target endianness found at" 3 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "32 bit bswap implementation found at" 3 "bswap" { xfail alpha*-*-* arm*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "32 bit bswap implementation found at" 3 "bswap" } } */
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index 0cea1a8472d5d9c4f0e4a7bd82930e201948c4ec..cbb2f9367a287ad8cfcfc5740c0e49b2c83bafd0 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -2651,11 +2651,6 @@ bswap_replace (gimple *cur_stmt, gimple *src_stmt, tree fndecl,
}
}
 
-  if (bswap
-      && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
-      && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
-    return false;
-
   /* Move cur_stmt just before  one of the load of the original
 to ensure it has the same VUSE.  See PR61517 for what could
 go wrong.  */