Re: [RS6000] PR60737, expand_block_clear uses word stores

2014-05-07 Thread Alan Modra
On Wed, May 07, 2014 at 01:39:50PM -0400, David Edelsohn wrote:
> On Tue, May 6, 2014 at 4:32 AM, Alan Modra  wrote:
> > BTW, the latest patch in my tree has a slight refinement, the
> > reload-by-hand addition.
> >
> > PR target/60737
> > * config/rs6000/rs6000.c (expand_block_move): Allow 64-bit
> > loads and stores when -mno-strict-align at any alignment.
> > (expand_block_clear): Similarly.  Also correct calculation of
> > instruction count.
> 
> Based on results of your experiment, the revised patch is okay.
> 
> You did not include gcc-patches in the distribution list for the revised 
> patch.

Thanks, David.  Patch copied here for gcc-patches and committed as
revision 210201.

PR target/60737
* config/rs6000/rs6000.c (expand_block_move): Allow 64-bit
loads and stores when -mno-strict-align at any alignment.
(expand_block_clear): Similarly.  Also correct calculation of
instruction count.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c  (revision 210200)
+++ gcc/config/rs6000/rs6000.c  (working copy)
@@ -15443,7 +15443,7 @@ expand_block_clear (rtx operands[])
      load zero and three to do clearing.  */
   if (TARGET_ALTIVEC && align >= 128)
     clear_step = 16;
-  else if (TARGET_POWERPC64 && align >= 32)
+  else if (TARGET_POWERPC64 && (align >= 64 || !STRICT_ALIGNMENT))
     clear_step = 8;
   else if (TARGET_SPE && align >= 64)
     clear_step = 8;
@@ -15471,12 +15471,27 @@ expand_block_clear (rtx operands[])
           mode = V2SImode;
         }
       else if (bytes >= 8 && TARGET_POWERPC64
-               /* 64-bit loads and stores require word-aligned
-                  displacements.  */
-               && (align >= 64 || (!STRICT_ALIGNMENT && align >= 32)))
+               && (align >= 64 || !STRICT_ALIGNMENT))
         {
           clear_bytes = 8;
           mode = DImode;
+          if (offset == 0 && align < 64)
+            {
+              rtx addr;
+
+              /* If the address form is reg+offset with offset not a
+                 multiple of four, reload into reg indirect form here
+                 rather than waiting for reload.  This way we get one
+                 reload, not one per store.  */
+              addr = XEXP (orig_dest, 0);
+              if ((GET_CODE (addr) == PLUS || GET_CODE (addr) == LO_SUM)
+                  && GET_CODE (XEXP (addr, 1)) == CONST_INT
+                  && (INTVAL (XEXP (addr, 1)) & 3) != 0)
+                {
+                  addr = copy_addr_to_reg (addr);
+                  orig_dest = replace_equiv_address (orig_dest, addr);
+                }
+            }
         }
       else if (bytes >= 4 && (align >= 32 || !STRICT_ALIGNMENT))
         {                       /* move 4 bytes */
@@ -15604,13 +15619,36 @@ expand_block_move (rtx operands[])
           gen_func.movmemsi = gen_movmemsi_4reg;
         }
       else if (bytes >= 8 && TARGET_POWERPC64
-               /* 64-bit loads and stores require word-aligned
-                  displacements.  */
-               && (align >= 64 || (!STRICT_ALIGNMENT && align >= 32)))
+               && (align >= 64 || !STRICT_ALIGNMENT))
         {
           move_bytes = 8;
           mode = DImode;
           gen_func.mov = gen_movdi;
+          if (offset == 0 && align < 64)
+            {
+              rtx addr;
+
+              /* If the address form is reg+offset with offset not a
+                 multiple of four, reload into reg indirect form here
+                 rather than waiting for reload.  This way we get one
+                 reload, not one per load and/or store.  */
+              addr = XEXP (orig_dest, 0);
+              if ((GET_CODE (addr) == PLUS || GET_CODE (addr) == LO_SUM)
+                  && GET_CODE (XEXP (addr, 1)) == CONST_INT
+                  && (INTVAL (XEXP (addr, 1)) & 3) != 0)
+                {
+                  addr = copy_addr_to_reg (addr);
+                  orig_dest = replace_equiv_address (orig_dest, addr);
+                }
+              addr = XEXP (orig_src, 0);
+              if ((GET_CODE (addr) == PLUS || GET_CODE (addr) == LO_SUM)
+                  && GET_CODE (XEXP (addr, 1)) == CONST_INT
+                  && (INTVAL (XEXP (addr, 1)) & 3) != 0)
+                {
+                  addr = copy_addr_to_reg (addr);
+                  orig_src = replace_equiv_address (orig_src, addr);
+                }
+            }
         }
       else if (TARGET_STRING && bytes > 4 && !TARGET_POWERPC64)
         {                       /* move up to 8 bytes at a time */
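
(An illustrative testcase, not part of the committed patch: with the
change above, a call like the following compiled at -O2
-mno-strict-align on powerpc64 gives a destination address of the
form reg+6, which DS-form std instructions cannot encode, so the
reload-by-hand code copies the address into a register once and all
the 8-byte stores then share that base.  The function name and sizes
are chosen only for illustration.)

void
clear32 (char *p)
{
  /* Destination is p + 6; (6 & 3) != 0, so without the reload-by-hand
     code each DImode store would be reloaded separately.  */
  __builtin_memset (p + 6, 0, 32);
}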

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RS6000] PR60737, expand_block_clear uses word stores

2014-05-06 Thread Alan Modra
On Tue, May 06, 2014 at 06:00:14PM +0930, Alan Modra wrote:
> __asm__ __volatile__ ("ld %0,0(%1)"
> : "=r" (scratch[0]) : "r" (p));

Ick, I can't write ppc asm.  That should be "b" (p).
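
(For context: the "b" constraint restricts the operand to base
registers, i.e. any GPR other than r0.  In a D- or DS-form address
such as "0(%1)", r0 in the base slot means the literal value 0, not
the contents of r0, so plain "r" could let the compiler pick r0 and
load from address zero.  The corrected statement would read:)

__asm__ __volatile__ ("ld %0,0(%1)"
                      : "=r" (scratch[0]) : "b" (p));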

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RS6000] PR60737, expand_block_clear uses word stores

2014-05-06 Thread Alan Modra
On Fri, May 02, 2014 at 11:28:27AM -0400, David Edelsohn wrote:
> On Fri, May 2, 2014 at 6:20 AM, Alan Modra  wrote:
> > In cases where the compiler has no alignment info, powerpc64le-linux
> gcc generates byte-at-a-time copies for -mstrict-align (which is on
> > for little-endian power7).  That's awful code, a problem shared by
> > other strict-align targets, see pr50417.  However, we also have a case
> > when -mno-strict-align generates less than ideal code, which I believe
> > stems from using alignment as a proxy for testing an address offset.
> > See http://gcc.gnu.org/ml/gcc-patches/1999-09n/msg01072.html.
> >
> > So my first attempt at fixing this problem looked at address offsets
> > directly.  That worked fine too, but on thinking some more, I believe
> > we no longer have the movdi restriction.  Nowadays we'll reload the
> > address if we have an offset that doesn't satisfy the "Y" constraint
> > (ie. a multiple of 4 offset).  Which led to this simpler patch.
> > Bootstrapped and regression tested powerpc64le-linux, powerpc64-linux
> > and powerpc-linux.  OK to apply?
> 
> Hi, Alan
> 
> Thanks for finding and addressing this.
> 
> As you mention, recent server-class processors, at least POWER8, do
> not suffer the performance degradation for common misaligned loads
> and stores of wider modes.  But the patch should not impose this
> default on the large installed base of processors where misaligned
> loads can incur a severe performance penalty.  This heuristic has
> become processor-dependent and should not be hard-coded in the
> block_move and block_clear algorithms.
> 
> PROCESSOR_DEFAULT is POWER8 for ELFv2 (and should be updated as the
> default for PowerLinux in general).  Please update the patch to test
> rs6000_cpu, probably via another boolean flag set in
> rs6000_option_override_internal().  Because of the processor defaults,
> the preferred instruction sequence will still be generated by default,
> without encoding an assumption about the heuristics in the algorithms
> themselves.

David,
I agree that mis-aligned loads can cause large performance penalties
on old 32-bit processors that take alignment traps.  However, the code
that I'm touching here is inside TARGET_POWERPC64 and STRICT_ALIGNMENT.
So we're talking about rs64, 620 and later.  I don't have the Book IV
info handy for rs64 or 620, but on my G5, which is getting quite old,
using 64-bit loads is a win over 32-bit.  I expect this to be true for
power4 and later too, for which unaligned loads run at full speed
except when crossing certain byte boundaries related to the size of
cache lines and whether the load was a cache hit or miss.  Note that
it's the crossing of the relevant byte boundary that causes a
slow-down, not the size of the access.

On my G5 I used the following to test the cache hit case:

#include <stdlib.h>

int
main (int argc, char **argv)
{
  unsigned int i;
  unsigned int align = 0;
  unsigned long buf[33] __attribute__ ((__aligned__ (128)));
  unsigned long *p;

  if (--argc > 0)
    align = strtol (*++argv, NULL, 0);

  p = (unsigned long *) ((char *) buf + align);
  for (i = 0; i < 1; i++)
    {
#ifdef USE64
      unsigned long scratch[8];
      __asm__ __volatile__ ("ld %0,0(%1)"
                            : "=r" (scratch[0]) : "r" (p));
      __asm__ __volatile__ ("ld %0,0(%1)"
                            : "=r" (scratch[1]) : "r" (p));
      __asm__ __volatile__ ("ld %0,0(%1)"
                            : "=r" (scratch[2]) : "r" (p));
      __asm__ __volatile__ ("ld %0,0(%1)"
                            : "=r" (scratch[3]) : "r" (p));
      __asm__ __volatile__ ("ld %0,0(%1)"
                            : "=r" (scratch[4]) : "r" (p));
      __asm__ __volatile__ ("ld %0,0(%1)"
                            : "=r" (scratch[5]) : "r" (p));
      __asm__ __volatile__ ("ld %0,0(%1)"
                            : "=r" (scratch[6]) : "r" (p));
      __asm__ __volatile__ ("ld %0,0(%1)"
                            : "=r" (scratch[7]) : "r" (p));
#else
      unsigned int scratch[16];
      __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                            : "=r" (scratch[0]), "=r" (scratch[1]) : "r" (p));
      __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                            : "=r" (scratch[2]), "=r" (scratch[3]) : "r" (p));
      __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                            : "=r" (scratch[4]), "=r" (scratch[5]) : "r" (p));
      __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                            : "=r" (scratch[6]), "=r" (scratch[7]) : "r" (p));
      __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                            : "=r" (scratch[8]), "=r" (scratch[9]) : "r" (p));
      __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                            : "=r" (scratch[10]), "=r" (scratch[11]) : "r" (p));
      __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                            : "=r" (scratch[12]), "=r" (scratch[13]) : "r" (p));
      __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                            : "=r" (scratch[14]), "=r" (scratch[15]) : "r" (p));
#endif
    }
  return 0;
}
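
(Hypothetical build-and-run commands for the two variants; the
original message does not show them, and a timing run would also need
a large loop count rather than the single iteration shown:)

gcc -O2 -DUSE64 -o ld64 test.c
gcc -O2 -o lwz32 test.c
time ./ld64 4     # one 8-byte ld per group, at a 4-byte-aligned address
time ./lwz32 4    # paired 4-byte lwz per group, at the same address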

Re: [RS6000] PR60737, expand_block_clear uses word stores

2014-05-02 Thread David Edelsohn
On Fri, May 2, 2014 at 6:20 AM, Alan Modra  wrote:
> In cases where the compiler has no alignment info, powerpc64le-linux
> gcc generates byte-at-a-time copies for -mstrict-align (which is on
> for little-endian power7).  That's awful code, a problem shared by
> other strict-align targets, see pr50417.  However, we also have a case
> when -mno-strict-align generates less than ideal code, which I believe
> stems from using alignment as a proxy for testing an address offset.
> See http://gcc.gnu.org/ml/gcc-patches/1999-09n/msg01072.html.
>
> So my first attempt at fixing this problem looked at address offsets
> directly.  That worked fine too, but on thinking some more, I believe
> we no longer have the movdi restriction.  Nowadays we'll reload the
> address if we have an offset that doesn't satisfy the "Y" constraint
> (ie. a multiple of 4 offset).  Which led to this simpler patch.
> Bootstrapped and regression tested powerpc64le-linux, powerpc64-linux
> and powerpc-linux.  OK to apply?

Hi, Alan

Thanks for finding and addressing this.

As you mention, recent server-class processors, at least POWER8, do
not suffer the performance degradation for common misaligned loads
and stores of wider modes.  But the patch should not impose this
default on the large installed base of processors where misaligned
loads can incur a severe performance penalty.  This heuristic has
become processor-dependent and should not be hard-coded in the
block_move and block_clear algorithms.

PROCESSOR_DEFAULT is POWER8 for ELFv2 (and should be updated as the
default for PowerLinux in general).  Please update the patch to test
rs6000_cpu, probably via another boolean flag set in
rs6000_option_override_internal().  Because of the processor defaults,
the preferred instruction sequence will still be generated by default,
without encoding an assumption about the heuristics in the algorithms
themselves.

Thanks, David
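
(A sketch of the kind of flag David suggests; the flag and function
names below are invented for illustration and are not actual rs6000.c
code, though rs6000_cpu and PROCESSOR_POWER8 are real:)

/* Hypothetical: set once at option-override time, then tested by
   expand_block_move and expand_block_clear in place of a hard-coded
   !STRICT_ALIGNMENT check.  */
static bool rs6000_fast_unaligned_access;

static void
rs6000_set_block_ops_heuristic (void)
{
  switch (rs6000_cpu)
    {
    case PROCESSOR_POWER8:
      rs6000_fast_unaligned_access = true;
      break;
    default:
      rs6000_fast_unaligned_access = false;
      break;
    }
}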


[RS6000] PR60737, expand_block_clear uses word stores

2014-05-02 Thread Alan Modra
In cases where the compiler has no alignment info, powerpc64le-linux
gcc generates byte-at-a-time copies for -mstrict-align (which is on
for little-endian power7).  That's awful code, a problem shared by
other strict-align targets, see pr50417.  However, we also have a case
when -mno-strict-align generates less than ideal code, which I believe
stems from using alignment as a proxy for testing an address offset.
See http://gcc.gnu.org/ml/gcc-patches/1999-09n/msg01072.html.

So my first attempt at fixing this problem looked at address offsets
directly.  That worked fine too, but on thinking some more, I believe
we no longer have the movdi restriction.  Nowadays we'll reload the
address if we have an offset that doesn't satisfy the "Y" constraint
(ie. a multiple of 4 offset).  Which led to this simpler patch.
Bootstrapped and regression tested powerpc64le-linux, powerpc64-linux
and powerpc-linux.  OK to apply?
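
(Background on the "Y" constraint: DS-form memory instructions such
as ld and std encode a 14-bit displacement that is implicitly shifted
left by two bits, hence the multiple-of-4 requirement.  A sketch,
with registers and offsets chosen only for illustration:

	ld   r3,8(r9)     # encodable: 8 & 3 == 0
	ld   r3,6(r9)     # not encodable: 6 & 3 != 0, so reload emits:
	addi r10,r9,6
	ld   r3,0(r10)    # reg-indirect load off the copied address
)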

PR target/60737
* config/rs6000/rs6000.c (expand_block_move): Allow 64-bit
loads and stores when -mno-strict-align at any alignment.
(expand_block_clear): Similarly.  Also correct calculation of
instruction count.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c  (revision 209926)
+++ gcc/config/rs6000/rs6000.c  (working copy)
@@ -15438,7 +15422,7 @@ expand_block_clear (rtx operands[])
      load zero and three to do clearing.  */
   if (TARGET_ALTIVEC && align >= 128)
     clear_step = 16;
-  else if (TARGET_POWERPC64 && align >= 32)
+  else if (TARGET_POWERPC64 && (align >= 64 || !STRICT_ALIGNMENT))
     clear_step = 8;
   else if (TARGET_SPE && align >= 64)
     clear_step = 8;
@@ -15466,9 +15450,7 @@ expand_block_clear (rtx operands[])
           mode = V2SImode;
         }
       else if (bytes >= 8 && TARGET_POWERPC64
-               /* 64-bit loads and stores require word-aligned
-                  displacements.  */
-               && (align >= 64 || (!STRICT_ALIGNMENT && align >= 32)))
+               && (align >= 64 || !STRICT_ALIGNMENT))
         {
           clear_bytes = 8;
           mode = DImode;
@@ -15599,9 +15581,7 @@ expand_block_move (rtx operands[])
           gen_func.movmemsi = gen_movmemsi_4reg;
         }
       else if (bytes >= 8 && TARGET_POWERPC64
-               /* 64-bit loads and stores require word-aligned
-                  displacements.  */
-               && (align >= 64 || (!STRICT_ALIGNMENT && align >= 32)))
+               && (align >= 64 || !STRICT_ALIGNMENT))
         {
           move_bytes = 8;
           mode = DImode;

-- 
Alan Modra
Australia Development Lab, IBM