From: Joakim Tjernlund
> On Mon, 2015-01-12 at 11:55 +1100, Anton Blanchard wrote:
> > Hi David,
> >
> > > The unrolled loop (deleted) looks excessive.
> > > On a modern cpu with multiple execution units you can usually
> > > manage to get the loop overhead to execute in parallel to the
> > > actual 'work'.
> > > So I suspect that a much simpler 'word at a time' loop will be almost as
> > > fast - especially in the case where the code isn't
> > > already in the cache and the compare is relatively short.
> >
> > I'm always keen to keep things as simple as possible, but your loop is over
> > 50% slower. Once the loop hits a steady state you are going to run into
> > front end issues with instruction fetch on POWER8.
Interesting, I'm not an expert on ppc scheduling, but on my old x86 Athlon 700
(I think it was that one) a similar loop ran as fast as 'rep movsw'.

> Out of curiosity, does preincrement make any difference(or can gcc do that
> for you nowadays)?

It will only change register pressure slightly, and might allow any
execution delays to be filled - but that is very processor dependent.
Actually you probably want to do 'a += 2' somewhere to reduce the
instruction count.
Similarly the end condition needs to compare one of the pointers.

Elsewhere (not ppc) I've used (++p)[-1] instead of *p++ to move the
increment before the load to get better scheduling.

>       a1 = *a;
>       b1 = *b;
>       while {
>               a2 = *++a;
>               b2 = *++b;
>               if (a1 != a2)

That should have been a1 != b1

>                       break;
>               a1 = *++a;
>               b1 = *++b;
>       } while (a2 != a1);

and a2 != b2

	David

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
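
For reference, a minimal sketch (not code from this thread) of how the quoted
loop might look with both comparisons corrected, a single 'a += 2' per pair
of loads, and an end condition that compares one of the pointers. The
function name first_diff_word, the unsigned long word type, and the
assumption that the word count n is known up front are all illustrative
choices, not anything the patch under discussion uses.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the first differing word,
 * or n if the first n words of a and b are all equal.
 * Two words are compared per iteration, as in the quoted loop.
 */
static size_t first_diff_word(const unsigned long *a,
                              const unsigned long *b, size_t n)
{
    const unsigned long *start = a;
    /* End of the part that can be handled two words at a time. */
    const unsigned long *end2 = a + (n & ~(size_t)1);

    while (a != end2) {          /* end condition compares one pointer */
        unsigned long a1 = a[0], b1 = b[0];
        unsigned long a2 = a[1], b2 = b[1];

        if (a1 != b1)            /* corrected: a1 against b1 */
            return (size_t)(a - start);
        if (a2 != b2)            /* corrected: a2 against b2 */
            return (size_t)(a - start) + 1;

        a += 2;                  /* one increment per pair of loads */
        b += 2;
    }

    /* Odd trailing word, if any. */
    if ((n & 1) && *a != *b)
        return (size_t)(a - start);

    return n;
}
```

Whether this shape schedules any better than the preincrement form is, as
noted above, very processor dependent; the sketch only shows the corrected
logic and makes no performance claim.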