On Fri, Nov 20, 2020 at 3:40 PM Niels Möller <ni...@lysator.liu.se> wrote:
>
> ni...@lysator.liu.se (Niels Möller) writes:
>
> > It could likely be speedup further by processing 2, 3 or 4 blocks in
> > parallel.
>
> I've given 2 blocks in parallel a try, but not quite working yet. My
> work-in-progress code below.
>
> When I test it on the gcc112 machine, it fails with an illegal
> instruction (SIGILL) on this line, close to function entry:
>
>         .globl _nettle_chacha_2core
>         .type _nettle_chacha_2core,%function
>         .align 5
> _nettle_chacha_2core:
>         addis 2,12,(.TOC.-_nettle_chacha_2core)@ha
>         addi 2,2,(.TOC.-_nettle_chacha_2core)@l
>         .localentry _nettle_chacha_2core, .-_nettle_chacha_2core
>
>
>         li      r8, 0x30
>         vspltisw v1, 1
> =>      vextractuw v1, v1, 0
>
> I don't understand, from the manual, what's wrong with this. The
> intention of this piece of code is just to construct the value {1, 0, 0,
> 0} in one of the vector registers. Maybe there's a better way to do
> that?

GCC112 is a POWER8 machine. According to the POWER manual, vextractuw
is a POWER9 instruction (it was added in Power ISA 3.0), so executing
it on POWER8 raises SIGILL.

POWER8 manual:
https://openpowerfoundation.org/?resource_lib=power8-processor-users-manual

POWER9 manual:
https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual

Jeff
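
On the "better way" question: one possibility that avoids POWER9
instructions is to combine vspltisw with vsldoi, both of which are older
AltiVec/VMX instructions. The C below is only a rough byte-level model of
those two instructions, not Nettle code. The type vec128 and the model_*
and word_elem helpers are hypothetical names made up for this sketch, and
it uses the big-endian element numbering from the ISA descriptions.
Whether that matches the element order the chacha code needs on a
little-endian system is an assumption that would have to be checked on
real hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector register. Byte 0 is the most
   significant byte, following the numbering in the ISA descriptions. */
typedef struct { uint8_t b[16]; } vec128;

/* Model of vspltisw: sign-extend a 5-bit immediate (-16..15) to 32 bits
   and copy it into all four word elements. */
static vec128 model_vspltisw(int simm)
{
    vec128 r;
    uint32_t w;
    assert(simm >= -16 && simm <= 15);
    w = (uint32_t)(int32_t)simm;
    for (int i = 0; i < 4; i++) {
        r.b[4 * i + 0] = (uint8_t)(w >> 24);
        r.b[4 * i + 1] = (uint8_t)(w >> 16);
        r.b[4 * i + 2] = (uint8_t)(w >> 8);
        r.b[4 * i + 3] = (uint8_t)w;
    }
    return r;
}

/* Model of vsldoi: concatenate a and b into 32 bytes, then take the
   16 bytes starting at byte offset shb (0..15). */
static vec128 model_vsldoi(vec128 a, vec128 b, unsigned shb)
{
    uint8_t cat[32];
    vec128 r;
    assert(shb < 16);
    memcpy(cat, a.b, 16);
    memcpy(cat + 16, b.b, 16);
    memcpy(r.b, cat + shb, 16);
    return r;
}

/* Read word element i (0..3) in the same big-endian numbering. */
static uint32_t word_elem(vec128 v, int i)
{
    return ((uint32_t)v.b[4 * i] << 24) | ((uint32_t)v.b[4 * i + 1] << 16)
         | ((uint32_t)v.b[4 * i + 2] << 8) | (uint32_t)v.b[4 * i + 3];
}

/* Models the untested assembly sketch
       vspltisw v1, 1          ; {1,1,1,1}
       vspltisw v0, 0          ; {0,0,0,0}
       vsldoi   v1, v1, v0, 12 ; last word of v1, then three zero words
   and returns the resulting register contents. */
static vec128 build_one_zero_zero_zero(void)
{
    vec128 ones = model_vspltisw(1);
    vec128 zero = model_vspltisw(0);
    return model_vsldoi(ones, zero, 12);
}
```

In this model, word_elem(build_one_zero_zero_zero(), 0) is 1 and the other
three elements are 0. If the opposite element order is wanted, swapping
the two vsldoi source operands and using a shift of 4 gives {0, 0, 0, 1}
in the same numbering.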