Hi Anton,

On Fri, Jul 01, 2016 at 08:19:45AM +1000, Anton Blanchard wrote:
> +#ifdef BYTESWAP_DATA
> +     addis   r3,r2,.byteswap_constant@toc@ha
> +     addi    r3,r3,.byteswap_constant@toc@l
> +
> +     lvx     byteswap,0,r3
> +     addi    r3,r3,16
> +#endif

You already have r0=0, so you can just do

        lvsr byteswap,0,r0
        vnot byteswap,byteswap

(the top bits of the permute vector bytes aren't used after all).

Or if you find that distasteful,

        lvsl byteswap,0,r0
        vspltisb v0,15
        vxor byteswap,byteswap,v0
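
If it helps, here is a throwaway C sketch (plain C, names made up,
just mimicking the byte arithmetic) of why both sequences end up as
the same byteswap permute:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint8_t a[16], b[16];

                for (int i = 0; i < 16; i++) {
                        /* lvsr with a zero EA yields 0x10..0x1F;
                           vnot turns that into 0xEF..0xE0. */
                        a[i] = (uint8_t)~(0x10 + i);

                        /* lvsl with a zero EA yields 0x00..0x0F;
                           xor with 15 gives 0x0F..0x00 directly. */
                        b[i] = i ^ 15;
                }

                /* vperm only uses the low 5 bits of each control
                   byte, so both select bytes 15,14,...,0. */
                for (int i = 0; i < 16; i++)
                        printf("%2d %2d\n", a[i] & 0x1f, b[i] & 0x1f);

                return 0;
        }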

Btw, the value in r3 isn't used after this; isn't that last addi useless?

> +     /*
> +      * The reflected version of Barrett reduction. Instead of bit
> +      * reflecting our data (which is expensive to do), we bit reflect our
> +      * constants and our algorithm, which means the intermediate data in
> +      * our vector registers goes from 0-63 instead of 63-0. We can reflect
> +      * the algorithm because we don't carry in mod 2 arithmetic.
> +      */

Expensive?  Ha!

        vgbbd v0,v0
        vperm v0,v0,v0,byteswap
        vgbbd v0,v0
        vperm v0,v0,v0,byteswap
        vsldoi v0,v0,v0,8

(or fold those last two into a single vperm; that needs another permute
constant, though).
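
For the record, all five instructions together compute nothing more
than a full 128-bit bit reflection, i.e. this throwaway C (names
made up):

        #include <stdint.h>

        /* Reverse the bits within one byte. */
        static uint8_t bitrev8(uint8_t b)
        {
                b = (b >> 4) | (b << 4);
                b = ((b & 0xcc) >> 2) | ((b & 0x33) << 2);
                b = ((b & 0xaa) >> 1) | ((b & 0x55) << 1);
                return b;
        }

        /* Reflect a 16-byte value end to end: reverse the byte
           order and the bits within each byte. */
        static void bitrev128(uint8_t out[16], const uint8_t in[16])
        {
                for (int i = 0; i < 16; i++)
                        out[i] = bitrev8(in[15 - i]);
        }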

> +     lvx     v0,0,r4
> +     lvx     v16,0,r3
> +     VPERM(v0,v0,v16,byteswap)
> +     vxor    v0,v0,v8        /* xor in initial value */
> +     VPMSUMW(v0,v0,v16)
> +     bdz     .Lv0

That VPERM looks strange...  You probably want v0 instead of v16 as
the second source.  Not that it matters here: the byteswap control
only ever selects bytes from the first operand, so the second one is
a don't-care.
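
Since the comment above talks about mod 2 arithmetic: each word lane
of that VPMSUMW is just a carry-less multiply, roughly this C sketch
(vpmsumw additionally xors the two products in each doubleword):

        #include <stdint.h>

        /* Carry-less (GF(2)[x]) multiply of two 32-bit polynomials.
           Addition is xor, so no carries ever propagate between
           bits; that is why the algorithm can be bit reflected in
           the first place. */
        static uint64_t clmul32(uint32_t a, uint32_t b)
        {
                uint64_t r = 0;

                for (int i = 0; i < 32; i++)
                        if (b & (1u << i))
                                r ^= (uint64_t)a << i;
                return r;
        }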


Segher