On Mon, Mar 1, 2021 at 3:39 AM Gabriel Paubert <paub...@iram.es> wrote:
>
> On Sun, Feb 28, 2021 at 11:52:12PM +0000, Luke Kenneth Casson Leighton wrote:
> > On Monday, March 1, 2021, Riccardo Mottola <riccardo.mott...@libero.it>
> > wrote:
> > ...
> > Tulio Magno Quites Machado Filho is currently working on glibc6 patches
> > which reverse these erroneous assumptions, replacing them with "#ifdef VSX"
> > thus allowing people to compile code that does not rely on SIMD.
>
> Beware that VSX is not Altivec. Altivec was called VMX by IBM and
> VSX is a superset of Altivec (IIRC).
Based on my experience with Botan and Crypto++...

VSX is available on POWER7 with the -mvsx compiler option. On POWER8,
VSX is part of the core and does not need its own compiler option.

VSX is a lot like Intel's tick/tock features. POWER7 VSX allows loads
and stores of vectors with 64-bit elements, but it does not provide
arithmetic on 64-bit elements. You have to use POWER8 to get the 64-bit
add (vaddudm), subtract (vsubudm), etc.

So a POWER7+VSX 64-bit add might look like:

    typedef __vector unsigned int uint32x4_p;
    typedef __vector unsigned long long uint64x2_p;

    // Load 64-bit vector from uint64_t[2]
    uint64x2_p a = vec_ld(...);
    uint64x2_p b = vec_ld(...);

    // But still perform the add on 32-bit elements
    uint64x2_p c = (uint64x2_p)VecAdd64((uint32x4_p)a, (uint32x4_p)b);

And:

    uint32x4_p VecAdd64(const uint32x4_p vec1, const uint32x4_p vec2)
    {
        // The carry mask selects the carries for elements 1 and 3 and
        // sets the remaining elements to 0. The result is then shifted
        // so the carried values are added to elements 0 and 2
        // (big-endian element numbering).
        const uint32x4_p zero = {0, 0, 0, 0};
    #if defined(MYLIB_BIG_ENDIAN)
        const uint32x4_p mask = {0, 1, 0, 1};
    #else
        const uint32x4_p mask = {1, 0, 1, 0};
    #endif

        uint32x4_p cy = vec_addc(vec1, vec2);
        uint32x4_p res = vec_add(vec1, vec2);
        cy = vec_and(mask, cy);
        cy = vec_sld(cy, zero, 4);
        return vec_add(res, cy);
    }

A POWER8 add looks as expected:

    uint64x2_p VecAdd64(const uint64x2_p vec1, const uint64x2_p vec2)
    {
        return vec_add(vec1, vec2);
    }

Even with the crippled 64-bit add using 32-bit elements, some
algorithms, like Bernstein's ChaCha, run about 2.5x faster than on the
scalar unit.

Jeff
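P.S. For anyone without POWER hardware at hand, the carry trick in the POWER7 VecAdd64 can be checked with a plain scalar model. This is only a sketch under stated assumptions: the names model_u32x4, model_add, model_addc, model_add64, model_pack and model_unpack are hypothetical and not part of any library, the lanes use big-endian numbering (lane 0 is the high half of the first 64-bit element), and the mask-and-shift step is collapsed into one lane move.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for a 4 x 32-bit vector register.
   Big-endian lane order is assumed: lanes 0 and 2 hold the high halves,
   lanes 1 and 3 hold the low halves of the two 64-bit elements. */
typedef struct { uint32_t lane[4]; } model_u32x4;

/* Models vec_add on 32-bit lanes: each lane wraps modulo 2^32. */
static model_u32x4 model_add(model_u32x4 a, model_u32x4 b)
{
    model_u32x4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Models vec_addc: each lane is 1 if that lane's add carries out, else 0. */
static model_u32x4 model_addc(model_u32x4 a, model_u32x4 b)
{
    model_u32x4 r;
    for (int i = 0; i < 4; i++) {
        uint32_t sum = a.lane[i] + b.lane[i];
        r.lane[i] = (sum < a.lane[i]) ? 1u : 0u;
    }
    return r;
}

/* 64-bit add built from 32-bit lane operations. */
static model_u32x4 model_add64(model_u32x4 a, model_u32x4 b)
{
    model_u32x4 cy  = model_addc(a, b);
    model_u32x4 res = model_add(a, b);

    /* Mask {0,1,0,1} keeps the carries of the low lanes (1 and 3);
       the 4-byte shift then moves them into the high lanes (0 and 2). */
    model_u32x4 moved = {{ cy.lane[1], 0u, cy.lane[3], 0u }};
    return model_add(res, moved);
}

/* Packs two 64-bit values into the four 32-bit lanes. */
static model_u32x4 model_pack(uint64_t e0, uint64_t e1)
{
    model_u32x4 r = {{ (uint32_t)(e0 >> 32), (uint32_t)e0,
                       (uint32_t)(e1 >> 32), (uint32_t)e1 }};
    return r;
}

/* Extracts 64-bit element idx (0 or 1) from the four lanes. */
static uint64_t model_unpack(model_u32x4 v, int idx)
{
    assert(idx == 0 || idx == 1);
    return ((uint64_t)v.lane[2 * idx] << 32) | v.lane[2 * idx + 1];
}
```

With this model, adding 0x00000000FFFFFFFF and 1 in element 0 gives 0x0000000100000000: the low lane wraps to 0 and its carry lands in the high lane, which is the work the vec_and and vec_sld steps do in the real code.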