On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
> r5 contains the value to be updated, so let's use r5 all the way
> through. It makes the code more readable.
>
> To avoid confusion, it is better to use adde instead of addc.
>
> The first addition is useless: its only purpose is to clear the
> carry. As r4 is a signed int that is always positive, this can be
> done by using srawi instead of srwi.
>
> Let's also remove the comment about bdnz having no overhead, as it
> is not correct on all powerpc, at least not on the MPC8xx.
>
> In the last part, the remaining number of bytes to be processed is
> between 0 and 3. Therefore, we can base that part on bits 31 and 30
> of r4 instead of anding r4 with 3 and then proceeding with
> comparisons and subtractions.
>
> Signed-off-by: Christophe Leroy <christophe.le...@c-s.fr>
> ---
>  arch/powerpc/lib/checksum_32.S | 37 +++++++++++++++++--------------------
>  1 file changed, 17 insertions(+), 20 deletions(-)
Do you have benchmarks for these optimizations?

-Scott

>
> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
> index 3472372..9c12602 100644
> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -27,35 +27,32 @@
>   * csum_partial(buff, len, sum)
>   */
>  _GLOBAL(csum_partial)
> -	addic	r0,r5,0
>  	subi	r3,r3,4
> -	srwi.	r6,r4,2
> +	srawi.	r6,r4,2		/* Divide len by 4 and also clear carry */
>  	beq	3f		/* if we're doing < 4 bytes */
> -	andi.	r5,r3,2		/* Align buffer to longword boundary */
> +	andi.	r0,r3,2		/* Align buffer to longword boundary */
>  	beq+	1f
> -	lhz	r5,4(r3)	/* do 2 bytes to get aligned */
> -	addi	r3,r3,2
> +	lhz	r0,4(r3)	/* do 2 bytes to get aligned */
>  	subi	r4,r4,2
> -	addc	r0,r0,r5
> +	addi	r3,r3,2
>  	srwi.	r6,r4,2		/* # words to do */
> +	adde	r5,r5,r0
>  	beq	3f
> 1:	mtctr	r6
> -2:	lwzu	r5,4(r3)	/* the bdnz has zero overhead, so it should */
> -	adde	r0,r0,r5	/* be unnecessary to unroll this loop */
> +2:	lwzu	r0,4(r3)
> +	adde	r5,r5,r0
> 	bdnz	2b
> -	andi.	r4,r4,3
> -3:	cmpwi	0,r4,2
> -	blt+	4f
> -	lhz	r5,4(r3)
> +3:	andi.	r0,r4,2
> +	beq+	4f
> +	lhz	r0,4(r3)
>  	addi	r3,r3,2
> -	subi	r4,r4,2
> -	adde	r0,r0,r5
> -4:	cmpwi	0,r4,1
> -	bne+	5f
> -	lbz	r5,4(r3)
> -	slwi	r5,r5,8		/* Upper byte of word */
> -	adde	r0,r0,r5
> -5:	addze	r3,r0	/* add in final carry */
> +4:	andi.	r0,r4,1
> +	beq+	5f
> +	lbz	r0,4(r3)
> +	slwi	r0,r0,8		/* Upper byte of word */
> +	adde	r5,r5,r0
> +5:	addze	r3,r5	/* add in final carry */
>  	blr
>
>  /*

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev