From: Joakim Tjernlund 
> On Mon, 2015-01-12 at 11:55 +1100, Anton Blanchard wrote:
> > Hi David,
> >
> > > The unrolled loop (deleted) looks excessive.
> > > On a modern cpu with multiple execution units you can usually
> > > manage to get the loop overhead to execute in parallel to the
> > > actual 'work'.
> > > So I suspect that a much simpler 'word at a time' loop will be almost
> > > as fast - especially in the case where the code isn't already in the
> > > cache and the compare is relatively short.
> >
> > I'm always keen to keep things as simple as possible, but your loop is
> > over 50% slower. Once the loop hits a steady state you are going to run
> > into front end issues with instruction fetch on POWER8.

Interesting, I'm not an expert on ppc scheduling, but on my old x86 Athlon 700
(I think it was that one) a similar loop ran as fast as 'rep movsw'.
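
For reference, the kind of simple 'word at a time' loop being discussed might
look roughly like the sketch below. The name first_diff_word and its return
convention are assumptions made for illustration; this is not the code from
the patch, and it ignores alignment and tail bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns the index of the first word that differs,
 * or n if the first n words of a and b are all equal.
 */
size_t first_diff_word(const unsigned long *a, const unsigned long *b, size_t n)
{
    size_t i;

    /* One load pair, one compare and one branch per word. */
    for (i = 0; i < n; i++) {
        if (a[i] != b[i])
            break;
    }
    return i;
}
```

Whether the loop overhead really overlaps with the loads here depends on the
cpu, which is the point being argued above.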

> Out of curiosity, does preincrement make any difference (or can gcc do that
> for you nowadays)?

It will only change register pressure slightly, and might allow any execution
delays to be filled - but that is very processor dependent.
Actually you probably want to do 'a += 2' somewhere to reduce the instruction
count.
Similarly the end condition needs to compare one of the pointers.

Elsewhere (not ppc) I've used (++p)[-1] instead of *p++ to move the increment
before the load to get better scheduling.
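
A minimal sketch of the two forms, using a made-up summing loop purely for
illustration (sum_postinc and sum_preinc are hypothetical names). Both return
the same value; whether the second actually schedules better depends on the
compiler and the processor.

```c
#include <stddef.h>

/* Usual form: load through p, then advance it. */
unsigned long sum_postinc(const unsigned long *p, size_t n)
{
    unsigned long sum = 0;

    while (n--)
        sum += *p++;
    return sum;
}

/* Increment written first: advance p, then load from the old address. */
unsigned long sum_preinc(const unsigned long *p, size_t n)
{
    unsigned long sum = 0;

    while (n--)
        sum += (++p)[-1];
    return sum;
}
```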

>          a1 = *a;
>          b1 = *b;
>          do {
>                  a2 = *++a;
>                  b2 = *++b;
>                  if (a1 != a2)
     That should have been a1 != b1
>                       break;
>                  a1 = *++a;
>                  b1 = *++b;
>          } while (a2 != a1);
     and a2 != b2
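
Putting those corrections together with the 'a += 2' and pointer end
condition mentioned above gives something like the sketch below. It assumes
the loop is meant to keep going while the words are equal, that the word
count is even and non-zero, and that only an equal/not-equal answer is
wanted; words_equal is a hypothetical name, not anything from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the first n words of a and b are equal,
 * 0 otherwise.  n must be even and at least 2.
 */
int words_equal(const unsigned long *a, const unsigned long *b, size_t n)
{
    const unsigned long *end = a + n;   /* end test uses one pointer only */
    unsigned long a1, b1, a2, b2;

    assert(n >= 2 && n % 2 == 0);

    a1 = a[0];
    b1 = b[0];
    for (;;) {
        a2 = a[1];          /* issue the next loads ... */
        b2 = b[1];
        if (a1 != b1)       /* ... before testing the previous pair */
            return 0;
        a += 2;             /* one increment per pointer per iteration */
        b += 2;
        if (a == end)
            return a2 == b2;
        a1 = a[0];
        b1 = b[0];
        if (a2 != b2)
            return 0;
    }
}
```

The end check sits between the two halves so that nothing is loaded past the
end of either buffer.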

        David
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
