On Fri, May 02, 2014 at 11:28:27AM -0400, David Edelsohn wrote:
> On Fri, May 2, 2014 at 6:20 AM, Alan Modra <amo...@gmail.com> wrote:
> > In cases where the compiler has no alignment info, powerpc64le-linux
> > gcc generates byte at a time copies for -mstrict-align (which is on
> > for little-endian power7).  That's awful code, a problem shared by
> > other strict-align targets, see pr50417.  However, we also have a case
> > when -mno-strict-align generates less than ideal code, which I believe
> > stems from using alignment as a proxy for testing an address offset.
> > See http://gcc.gnu.org/ml/gcc-patches/1999-09n/msg01072.html.
> >
> > So my first attempt at fixing this problem looked at address offsets
> > directly.  That worked fine too, but on thinking some more, I believe
> > we no longer have the movdi restriction.  Nowadays we'll reload the
> > address if we have an offset that doesn't satisfy the "Y" constraint
> > (ie. a multiple of 4 offset).  Which led to this simpler patch.
> > Bootstrapped and regression tested powerpc64le-linux, powerpc64-linux
> > and powerpc-linux.  OK to apply?
> 
> Hi, Alan
> 
> Thanks for finding and addressing this.
> 
> As you mention, recent server-class processors, at least POWER8, do
> not have the performance degradation for common, mis-aligned loads and
> stores of wider modes. But the patch should not impose this default on
> the large installed base of processors, where mis-aligned loads can
> incur a severe performance penalty.  This heuristic has become
> processor-dependent and should not be hard-coded in the block_move and
> block_clear algorithms.
> 
> PROCESSOR_DEFAULT is POWER8 for ELFv2 (and should be updated as the
> default for PowerLinux in general). Please update the patch to test
> rs6000_cpu, probably another boolean flag set in
> rs6000_option_override_internal(). Because of the processor defaults,
> the preferred instruction sequence will be the default without
> encoding an assumption about the heuristics in the algorithm itself.

David,
I agree that mis-aligned loads can cause large performance penalties
on old 32-bit processors that take alignment traps.  However, the code
that I'm touching here is inside TARGET_POWERPC64 and STRICT_ALIGNMENT.
So we're talking about rs64, 620 and later.  I don't have the Book IV
info handy for rs64 or 620, but on my G5, which is getting quite old,
using 64-bit loads is a win over 32-bit.  I expect this to be true for
power4 and later too, for which unaligned loads run at full speed
except when crossing certain byte boundaries related to the size of
cache lines and whether the load was a cache hit or miss.  Note that
it's the crossing of the relevant byte boundary that causes a
slow-down, not the size of the access.

On my G5 I used the following to test the cache hit case:

#include <stdlib.h>

int
main (int argc, char **argv)
{
  unsigned int i;
  unsigned int align = 0;
  unsigned long buf[33] __attribute__ ((__aligned__ (128)));
  unsigned long *p;

  if (--argc > 0)
    align = strtol (*++argv, NULL, 0);

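  /* Point p 'align' bytes into the 128-byte aligned buffer, so the
     command line argument controls the misalignment of every load.  */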
  p = (unsigned long *)((char *) buf + align);
  for (i = 0; i < 100000000; i++)
    {
#ifdef USE64
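      /* 64-bit variant: eight doubleword loads, all from the same,
         possibly misaligned, address.  */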
      unsigned long scratch[8];
    __asm__ __volatile__ ("ld %0,0(%1)"
                          : "=r" (scratch[0]) : "r" (p));
    __asm__ __volatile__ ("ld %0,0(%1)"
                          : "=r" (scratch[1]) : "r" (p));
    __asm__ __volatile__ ("ld %0,0(%1)"
                          : "=r" (scratch[2]) : "r" (p));
    __asm__ __volatile__ ("ld %0,0(%1)"
                          : "=r" (scratch[3]) : "r" (p));
    __asm__ __volatile__ ("ld %0,0(%1)"
                          : "=r" (scratch[4]) : "r" (p));
    __asm__ __volatile__ ("ld %0,0(%1)"
                          : "=r" (scratch[5]) : "r" (p));
    __asm__ __volatile__ ("ld %0,0(%1)"
                          : "=r" (scratch[6]) : "r" (p));
    __asm__ __volatile__ ("ld %0,0(%1)"
                          : "=r" (scratch[7]) : "r" (p));
#else
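    /* 32-bit variant: eight pairs of word loads covering the same
       eight bytes as a single ld above.  */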
    unsigned int scratch[16];
    __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                          : "=r" (scratch[0]), "=r" (scratch[1]) : "r" (p));
    __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                          : "=r" (scratch[2]), "=r" (scratch[3]) : "r" (p));
    __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                          : "=r" (scratch[4]), "=r" (scratch[5]) : "r" (p));
    __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                          : "=r" (scratch[6]), "=r" (scratch[7]) : "r" (p));
    __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                          : "=r" (scratch[8]), "=r" (scratch[9]) : "r" (p));
    __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                          : "=r" (scratch[10]), "=r" (scratch[11]) : "r" (p));
    __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                          : "=r" (scratch[12]), "=r" (scratch[13]) : "r" (p));
    __asm__ __volatile__ ("lwz %0,0(%2); lwz %1,4(%2)"
                          : "=r" (scratch[14]), "=r" (scratch[15]) : "r" (p));
#endif
    }
  return 0;
}
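
(To reproduce: build once with -DUSE64 for the 64-bit loads and once
without for the 32-bit pairs, then time a run with the byte offset as
the sole argument, eg. "time ./a.out 57" for a boundary-crossing case.)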

It shows some quite interesting behaviour.  For 64-bit loads with
alignment offsets of 0 to 56 I get a time of 0.205s.  Offsets of 57 to
63 (ie. the load is crossing a 64-byte boundary) give me 9.41s.

For 32-bit loads with alignment offsets of 0 to 56, and 60, I see a
time of 0.435s.  Offsets of 57 to 59 give me 13.174s, and offsets of
61 to 63 give me 9.41s.
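
Those ranges are what you'd expect from a 64-byte line: an access is
penalised exactly when its last byte spills into the next line.  As a
quick sanity check (separate from the timing loop above, just a sketch
to show where the slow offsets come from):

#include <stdio.h>

/* An access of 'size' bytes starting at byte 'offset' crosses a
   64-byte boundary iff its last byte lands in the next 64-byte
   block.  */
static int
crosses64 (unsigned int offset, unsigned int size)
{
  return offset % 64 + size > 64;
}

int
main (void)
{
  unsigned int off;

  for (off = 0; off < 64; off++)
    printf ("%2u: ld %s, lwz pair %s\n", off,
            crosses64 (off, 8) ? "crosses" : "ok",
            (crosses64 (off, 4) || crosses64 (off + 4, 4))
            ? "crosses" : "ok");
  return 0;
}

That flags offsets 57 to 63 for the ld, and 57 to 59 (second lwz) plus
61 to 63 (first lwz) for the pair, matching the slow cases above.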

That makes it advantageous to use 64-bit loads on G5, if my testcase
is at all representative of block copies.  I also tried with smaller
numbers of loads in the loop, which told the same story.  I
excluded unaligned writes from the testcase since those are only
penalised on crossing a page boundary (at least for the reasonably
recent processors that I know something about).

Do you have access to an rs64 or 620 machine to collect some numbers
for me?  I'm loath to add a tuning knob without putting in correct
values for these older machines you're worrying about.

-- 
Alan Modra
Australia Development Lab, IBM
