On Mon, Mar 1, 2021 at 3:39 AM Gabriel Paubert <paub...@iram.es> wrote:
>
> On Sun, Feb 28, 2021 at 11:52:12PM +0000, Luke Kenneth Casson Leighton wrote:
> > On Monday, March 1, 2021, Riccardo Mottola <riccardo.mott...@libero.it>
> > wrote:
> > ...
> > Tulio Magno Quites Machado Filho is currently working on glibc6 patches
> > which reverse these erroneous assumptions, replacing them with "#ifdef VSX"
> > thus allowing people to compile code that does not rely on SIMD.
>
> Beware that VSX is not Altivec. Altivec was called VMX by IBM and
> VSX is a superset of Altivec (IIRC).

Based on my experience with Botan and Crypto++... VSX is available
starting with POWER7 and requires the -mvsx compiler option. VSX is
part of the POWER8 core, so it does not need a separate compiler
option there.

VSX is a lot like Intel's tick/tock features. VSX provides 64-bit
vector loads and stores, but it does not provide arithmetic on 64-bit
elements. You have to use POWER8 to get the 64-bit add (vaddudm),
subtract (vsubudm), etc.

So a POWER7+VSX 64-bit add might look like:

typedef __vector unsigned int    uint32x4_p;
typedef __vector unsigned long long uint64x2_p;

// Load 64-bit vectors from uint64_t[2] arrays
uint64x2_p a = vec_ld(...);
uint64x2_p b = vec_ld(...);

// But still perform the add with 32-bit elements
uint64x2_p c = (uint64x2_p)VecAdd64((uint32x4_p)a, (uint32x4_p)b);

And:

uint32x4_p
VecAdd64(const uint32x4_p vec1, const uint32x4_p vec2)
{
    // The carry mask selects the carries out of elements 1 and 3
    // (the low words) and sets the remaining elements to 0. The
    // result is then shifted so the carried values are added into
    // elements 0 and 2 (the high words).
#if defined(MYLIB_BIG_ENDIAN)
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {0, 1, 0, 1};
#else
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {1, 0, 1, 0};
#endif

    uint32x4_p cy = vec_addc(vec1, vec2);
    uint32x4_p res = vec_add(vec1, vec2);
    cy = vec_and(mask, cy);
    cy = vec_sld(cy, zero, 4);
    return vec_add(res, cy);
}

A POWER8 add looks as expected:

uint64x2_p
VecAdd64(const uint64x2_p vec1, const uint64x2_p vec2)
{
    return vec_add(vec1, vec2);
}

Even with the crippled 64-bit add built from 32-bit elements, some
algorithms, like Bernstein's ChaCha, run about 2.5x faster than on
the scalar unit.

Jeff
