Re: neon logops
Richard Henderson r...@twiddle.net writes:

  Building on the copyi that tege committed the other day, use neon
  for the logical operations too.

I committed the 128-bit version to arm/neon, so it is now used for all Neon-capable processors. I put it there since it is a speedup for A9 as well as A15, compared to the core code.

Note that this is not yet Copyright FSF since we're waiting for FSF to handle the paperwork. This is a departure from our previous policy of waiting out the FSF before committing contributions.

--
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
neon logops
Building on the copyi that tege committed the other day, use neon for the logical operations too.

I did both a 128-bit aligned version,

$ ./speed-128 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n
clock_gettime is 1.000ns accurate
overhead 6.00 cycles, precision 10 units of 1.00e-09 secs, CPU freq 1694.10 MHz
        mpn_and_n  mpn_nand_n
10        #1.7987      1.8986
50        #0.9393      1.0692
100       #1.2491      1.3890
500       #0.8154      0.9753
1000      #0.7786      0.9435
5000      #1.4955      1.5765
1         #1.6532      1.7415

and a 256-bit aligned version, just to see if having a higher ratio of operation insns to memory insns would help,

$ ./speed-256 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n
clock_gettime is 1.000ns accurate
overhead 6.00 cycles, precision 10 units of 1.00e-09 secs, CPU freq 1694.10 MHz
        mpn_and_n  mpn_nand_n
10        #1.5989      1.6988
50        #1.0992      1.1592
100       #1.0393      1.0593
500       #1.0373      1.0413
1000      #1.0303      1.0313
5000      #1.5914      1.6003
1          1.6824     #1.6768

It's a bit curious how the latter is less jaggy, but slightly slower.

r~

dnl  ARM mpn_and_n, et al.

dnl  Copyright 2013 Free Software Foundation, Inc.

dnl  This file is part of the GNU MP Library.

dnl  The GNU MP Library is free software; you can redistribute it and/or modify
dnl  it under the terms of the GNU Lesser General Public License as published
dnl  by the Free Software Foundation; either version 3 of the License, or (at
dnl  your option) any later version.

dnl  The GNU MP Library is distributed in the hope that it will be useful, but
dnl  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
dnl  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
dnl  License for more details.

dnl  You should have received a copy of the GNU Lesser General Public License
dnl  along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.

include(`../config.m4')

C              and cyc/l   nand cyc/l
C StrongARM        ?           ?
C XScale           ?           ?
C Cortex-A8        ?           ?
C Cortex-A9        ?           ?
C Cortex-A15      0.78        0.94

define(`rp', `r0')
define(`up', `r1')
define(`vp', `r2')
define(`n',  `r3')

define(`POSTOP')

ifdef(`OPERATION_and_n',`
  define(`func',`mpn_and_n')
  define(`LOGOP', `vand $1, $2, $3')')
ifdef(`OPERATION_andn_n',`
  define(`func',`mpn_andn_n')
  define(`LOGOP', `vbic $1, $2, $3')')
ifdef(`OPERATION_nand_n',`
  define(`func',`mpn_nand_n')
  define(`POSTOP', `vmvn $1, $1')
  define(`LOGOP', `vand $1, $2, $3')')
ifdef(`OPERATION_ior_n',`
  define(`func',`mpn_ior_n')
  define(`LOGOP', `vorr $1, $2, $3')')
ifdef(`OPERATION_iorn_n',`
  define(`func',`mpn_iorn_n')
  define(`LOGOP', `vorn $1, $2, $3')')
ifdef(`OPERATION_nior_n',`
  define(`func',`mpn_nior_n')
  define(`POSTOP', `vmvn $1, $1')
  define(`LOGOP', `vorr $1, $2, $3')')
ifdef(`OPERATION_xor_n',`
  define(`func',`mpn_xor_n')
  define(`LOGOP', `veor $1, $2, $3')')
ifdef(`OPERATION_xnor_n',`
  define(`func',`mpn_xnor_n')
  define(`POSTOP', `vmvn $1, $1')
  define(`LOGOP', `veor $1, $2, $3')')

MULFUNC_PROLOGUE(mpn_and_n mpn_andn_n mpn_nand_n mpn_ior_n mpn_iorn_n mpn_nior_n mpn_xor_n mpn_xnor_n)

ASM_START()
        .fpu    neon
PROLOGUE(func)
        cmp     n, #7
        ble     L(bc)

C Copy until rp is 128-bit aligned
        tst     rp, #4
        beq     L(al1)
        vld1.32 {d0[0]}, [up]!
        vld1.32 {d1[0]}, [vp]!
        sub     n, n, #1
        LOGOP(  d0, d0, d1)
        POSTOP( d0, d0)
        vst1.32 {d0[0]}, [rp]!
L(al1): tst     rp, #8
        beq     L(al2)
        vld1.32 {d0}, [up]!
        vld1.32 {d1}, [vp]!
        sub     n, n, #2
        LOGOP(  d0, d0, d1)
        POSTOP( d0, d0)
        vst1.32 {d0}, [rp:64]!
L(al2): vld1.32 {q2}, [up]!
        vld1.32 {q3}, [vp]!
        subs    n, n, #12
        blt     L(end)

        ALIGN(16)
L(top): vld1.32 {q0}, [up]!
        LOGOP(  q2, q2, q3)
        vld1.32 {q1}, [vp]!
        POSTOP( q2, q2)
        subs    n, n, #8
        vst1.32 {q2}, [rp:128]!
        vld1.32 {q2}, [up]!
        LOGOP(  q0, q0, q1)
        vld1.32 {q3}, [vp]!
        POSTOP( q0, q0)
        vst1.32 {q0}, [rp:128]!
        bge     L(top)

L(end): LOGOP(  q2, q2, q3)
        POSTOP( q2, q2)
        vst1.32 {q2}, [rp:128]!

C Copy last 0-7
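For readers checking what each LOGOP/POSTOP pair in the patch computes, a minimal portable C sketch of the same eight operations may help. It is a plain reference loop, not the NEON code and not GMP's own generic implementation. The names `limb_t`, `enum logop`, `logop_limb` and `ref_logop_n` are hypothetical, and 32-bit limbs are assumed, as on 32-bit ARM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit limbs, as on 32-bit ARM. */
typedef uint32_t limb_t;

/* Hypothetical names for the eight operations; not part of GMP's API. */
enum logop {
    OP_AND, OP_ANDN, OP_NAND, OP_IOR, OP_IORN, OP_NIOR, OP_XOR, OP_XNOR
};

/* One limb: the LOGOP, followed by a bitwise NOT where the patch
   defines POSTOP as vmvn. */
limb_t logop_limb(enum logop op, limb_t u, limb_t v)
{
    switch (op) {
    case OP_AND:  return u & v;              /* vand            */
    case OP_ANDN: return u & (limb_t)~v;     /* vbic            */
    case OP_NAND: return (limb_t)~(u & v);   /* vand, then vmvn */
    case OP_IOR:  return u | v;              /* vorr            */
    case OP_IORN: return u | (limb_t)~v;     /* vorn            */
    case OP_NIOR: return (limb_t)~(u | v);   /* vorr, then vmvn */
    case OP_XOR:  return u ^ v;              /* veor            */
    case OP_XNOR: return (limb_t)~(u ^ v);   /* veor, then vmvn */
    }
    return 0; /* not reached for a valid op */
}

/* Apply the operation to n limbs; n >= 1, as the mpn functions require. */
void ref_logop_n(enum logop op, limb_t *rp,
                 const limb_t *up, const limb_t *vp, size_t n)
{
    assert(n >= 1);
    for (size_t i = 0; i < n; i++)
        rp[i] = logop_limb(op, up[i], vp[i]);
}
```

A loop like this could serve as a reference when comparing the output of the assembly on random operands.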
Re: neon logops
Richard Henderson r...@twiddle.net writes:

  Building on the copyi that tege committed the other day, use neon
  for the logical operations too.

  I did both a 128-bit aligned version,

  $ ./speed-128 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n
  clock_gettime is 1.000ns accurate
  overhead 6.00 cycles, precision 10 units of 1.00e-09 secs, CPU freq 1694.10 MHz
          mpn_and_n  mpn_nand_n
  10        #1.7987      1.8986
  50        #0.9393      1.0692
  100       #1.2491      1.3890
  500       #0.8154      0.9753
  1000      #0.7786      0.9435
  5000      #1.4955      1.5765
  1         #1.6532      1.7415

  and a 256-bit aligned version, just to see if having a higher ratio
  of operation insns to memory insns would help,

  $ ./speed-256 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n
  clock_gettime is 1.000ns accurate
  overhead 6.00 cycles, precision 10 units of 1.00e-09 secs, CPU freq 1694.10 MHz
          mpn_and_n  mpn_nand_n
  10        #1.5989      1.6988
  50        #1.0992      1.1592
  100       #1.0393      1.0593
  500       #1.0373      1.0413
  1000      #1.0303      1.0313
  5000      #1.5914      1.6003
  1          1.6824     #1.6768

  It's a bit curious how the latter is less jaggy, but slightly slower.

I assume you mean that the destination ptr is naturally aligned, while the source ptrs are 32-bit aligned?

My guess for the jaggyness is that, with two src ptrs, you rarely strike a case where either is 256-bit aligned, and even more rarely one where both are. But that happens much more often for 128-bit alignment.

My copy was alignment insensitive, perhaps thanks to scheduling, or because it stresses the unaligned load logic less, with its one load per store?

You can play with -x -y -w -W to force alignment. They are for src1, src2, dst1, dst2, respectively, IIRC. 0 would mean aligned, except that's not too well-defined. 1 means the pointer mod 2^something = 1, etc.

--
Torbjörn
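The alignment guess above can be made concrete with a small counting sketch. It assumes, purely for illustration, that each source pointer is equally likely to land on any 4-byte offset within an aligned block; real allocations need not behave that way. The names `align_counts` and `count_aligned` are hypothetical.

```c
#include <assert.h>

/* Counts over all (src1, src2) offset pairs within one aligned block. */
struct align_counts {
    unsigned pairs;   /* total number of offset pairs          */
    unsigned either;  /* pairs where at least one is aligned   */
    unsigned both;    /* pairs where both pointers are aligned */
};

/* Enumerate the 4-byte-granular offsets of two pointers inside a block
   of block_bytes (16 for 128-bit alignment, 32 for 256-bit alignment). */
struct align_counts count_aligned(unsigned block_bytes)
{
    struct align_counts c = {0, 0, 0};
    assert(block_bytes > 0 && block_bytes % 4 == 0);
    for (unsigned a = 0; a < block_bytes; a += 4) {
        for (unsigned b = 0; b < block_bytes; b += 4) {
            c.pairs++;
            if (a == 0 || b == 0)
                c.either++;
            if (a == 0 && b == 0)
                c.both++;
        }
    }
    return c;
}
```

Under that uniformity assumption, both sources are 128-bit aligned in 1 of 16 cases but 256-bit aligned in only 1 of 64, which is consistent with the guess that the fully aligned case is hit far more often by the 128-bit version.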
Re: neon logops
On 2013-03-08 03:46, Torbjorn Granlund wrote:
> I assume you mean that the destination ptr is naturally aligned, while
> the source ptrs are 32-bit aligned?

Yes.

> My guess for the jaggyness is that of two src ptrs, you rarely strike a
> case where they are 256-bit aligned, in particular not when both are
> 256-bit aligned. But that happens much more often for 128-bit alignment.
>
> My copy was alignment insensitive, perhaps thanks to scheduling, or that
> it stresses the unaligned load logic less, with its one load-per-store?

I don't know. I do know there's something bizarre going on that probably needs some chip knowledge to figure out.

For instance, testing the -128 patch I posted here, and making no other change except *adding* :128 markers to both source operands, I hoped to determine what effect source alignment has on the loop. (This change is not generally correct, but does work for the case of speed with specified alignment.)

The peak result is slightly *slower* than before.

              with align             without align
        mpn_and_n  mpn_nand_n   mpn_and_n  mpn_nand_n
10        #1.7989      1.8987      1.7990      1.8989
50        #0.9393      1.0693      0.9395      1.0694
100       #1.2491      1.3891      1.2496      1.3893
500       #0.8154      0.9753      0.8156      0.9756
1000       0.8746      1.0642     #0.7787      0.9435
5000      #1.4067      1.4939      1.5012      1.5577
1         #1.5454      1.6702      1.5521      1.5926
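As a side note on why the :128 change is not generally correct: an address used with a :128 specifier has to be 16-byte aligned, while the sources here are only guaranteed 32-bit alignment. A minimal C sketch of the two checks involved follows. The helper names `is_aligned` and `limbs_to_align16` are hypothetical, and 32-bit limbs are assumed. The second helper mirrors the `tst rp, #4` / `tst rp, #8` head of the posted loop.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if p is a multiple of `bytes`, which must be a power of two. */
int is_aligned(const void *p, uintptr_t bytes)
{
    assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
    return ((uintptr_t)p & (bytes - 1)) == 0;
}

/* Number of 32-bit limbs to process before p reaches a 16-byte boundary.
   Assumes p is at least 4-byte aligned. */
unsigned limbs_to_align16(const void *p)
{
    assert(is_aligned(p, 4));
    return (unsigned)(((uintptr_t)0 - (uintptr_t)p) & 15) / 4;
}
```

A general version of the experiment would need `is_aligned(up, 16)` and `is_aligned(vp, 16)` to hold after the destination has been aligned, which is only the case when all three pointers share the same offset modulo 16.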