Re: neon logops

2013-04-26 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes:

  Building on the copyi that tege committed the other day, use neon for
  the logical operations too.
  
I committed the 128-bit version to arm/neon, so it is now used for
all Neon-capable processors.  I put it there since it is a speedup for
A9 as well as A15, compared to the core code.

Note that this is not yet Copyright FSF since we're waiting for FSF to
handle the paperwork.  This is a departure from our previous policy of
waiting out the FSF before committing contributions.

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


neon logops

2013-03-08 Thread Richard Henderson
Building on the copyi that tege committed the other day, use neon for the 
logical operations too.


I did both a 128-bit aligned version,


$ ./speed-128 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n
clock_gettime is 1.000ns accurate
overhead 6.00 cycles, precision 10 units of 1.00e-09 secs, CPU freq 1694.10 MHz
          mpn_and_n    mpn_nand_n
10          #1.7987        1.8986
50          #0.9393        1.0692
100         #1.2491        1.3890
500         #0.8154        0.9753
1000        #0.7786        0.9435
5000        #1.4955        1.5765
1           #1.6532        1.7415


and a 256-bit aligned version, just to see if having a higher ratio of 
operation insns to memory insns would help,



$ ./speed-256 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n
clock_gettime is 1.000ns accurate
overhead 6.00 cycles, precision 10 units of 1.00e-09 secs, CPU freq 1694.10 MHz
          mpn_and_n    mpn_nand_n
10          #1.5989        1.6988
50          #1.0992        1.1592
100         #1.0393        1.0593
500         #1.0373        1.0413
1000        #1.0303        1.0313
5000        #1.5914        1.6003
1            1.6824       #1.6768


It's a bit curious how the latter is less jaggy, but slightly slower.


r~
dnl  ARM mpn_and_n, et al.

dnl  Copyright 2013 Free Software Foundation, Inc.

dnl  This file is part of the GNU MP Library.

dnl  The GNU MP Library is free software; you can redistribute it and/or modify
dnl  it under the terms of the GNU Lesser General Public License as published
dnl  by the Free Software Foundation; either version 3 of the License, or (at
dnl  your option) any later version.

dnl  The GNU MP Library is distributed in the hope that it will be useful, but
dnl  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
dnl  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
dnl  License for more details.

dnl  You should have received a copy of the GNU Lesser General Public License
dnl  along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.

include(`../config.m4')

C              and cyc/l   nand cyc/l
C StrongARM        ?           ?
C XScale           ?           ?
C Cortex-A8        ?           ?
C Cortex-A9        ?           ?
C Cortex-A15      0.78        0.94

define(`rp', `r0')
define(`up', `r1')
define(`vp', `r2')
define(`n',  `r3')

define(`POSTOP')

ifdef(`OPERATION_and_n',`
  define(`func',`mpn_and_n')
  define(`LOGOP',   `vand   $1, $2, $3')')
ifdef(`OPERATION_andn_n',`
  define(`func',`mpn_andn_n')
  define(`LOGOP',   `vbic   $1, $2, $3')')
ifdef(`OPERATION_nand_n',`
  define(`func',`mpn_nand_n')
  define(`POSTOP',  `vmvn   $1, $1')
  define(`LOGOP',   `vand   $1, $2, $3')')
ifdef(`OPERATION_ior_n',`
  define(`func',`mpn_ior_n')
  define(`LOGOP',   `vorr   $1, $2, $3')')
ifdef(`OPERATION_iorn_n',`
  define(`func',`mpn_iorn_n')
  define(`LOGOP',   `vorn   $1, $2, $3')')
ifdef(`OPERATION_nior_n',`
  define(`func',`mpn_nior_n')
  define(`POSTOP',  `vmvn   $1, $1')
  define(`LOGOP',   `vorr   $1, $2, $3')')
ifdef(`OPERATION_xor_n',`
  define(`func',`mpn_xor_n')
  define(`LOGOP',   `veor   $1, $2, $3')')
ifdef(`OPERATION_xnor_n',`
  define(`func',`mpn_xnor_n')
  define(`POSTOP',  `vmvn   $1, $1')
  define(`LOGOP',   `veor   $1, $2, $3')')

MULFUNC_PROLOGUE(mpn_and_n mpn_andn_n mpn_nand_n mpn_ior_n mpn_iorn_n mpn_nior_n mpn_xor_n mpn_xnor_n)

ASM_START()
        .fpu    neon
PROLOGUE(func)
        cmp     n, #7
        ble     L(bc)

C Copy until rp is 128-bit aligned
        tst     rp, #4
        beq     L(al1)
        vld1.32 {d0[0]}, [up]!
        vld1.32 {d1[0]}, [vp]!
        sub     n, n, #1
        LOGOP(  d0, d0, d1)
        POSTOP( d0, d0)
        vst1.32 {d0[0]}, [rp]!
L(al1): tst     rp, #8
        beq     L(al2)
        vld1.32 {d0}, [up]!
        vld1.32 {d1}, [vp]!
        sub     n, n, #2
        LOGOP(  d0, d0, d1)
        POSTOP( d0, d0)
        vst1.32 {d0}, [rp:64]!
L(al2): vld1.32 {q2}, [up]!
        vld1.32 {q3}, [vp]!
        subs    n, n, #12
        blt     L(end)

        ALIGN(16)
L(top): vld1.32 {q0}, [up]!
        LOGOP(  q2, q2, q3)
        vld1.32 {q1}, [vp]!
        POSTOP( q2, q2)
        subs    n, n, #8
        vst1.32 {q2}, [rp:128]!
        vld1.32 {q2}, [up]!
        LOGOP(  q0, q0, q1)
        vld1.32 {q3}, [vp]!
        POSTOP( q0, q0)
        vst1.32 {q0}, [rp:128]!
        bge     L(top)

L(end): LOGOP(  q2, q2, q3)
        POSTOP( q2, q2)
        vst1.32 {q2}, [rp:128]!

C Copy last 0-7 
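
For readers without an ARM toolchain, the control flow of the attached file can be modelled in portable C. This is only a sketch under stated assumptions, not the committed code: limbs are taken to be 32 bits wide, the names limb_t, enum logop, apply and logop_n are hypothetical, the software pipelining of the NEON loop is not reproduced, and the tail is written as a plain scalar loop because the attachment is cut off at that point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on 32-bit ARM */

/* Hypothetical selector mirroring the LOGOP/POSTOP macro pairs. */
enum logop { OP_AND, OP_ANDN, OP_NAND, OP_IOR, OP_IORN, OP_NIOR, OP_XOR, OP_XNOR };

static limb_t apply(enum logop op, limb_t u, limb_t v)
{
    switch (op) {
    case OP_AND:  return u & v;      /* vand */
    case OP_ANDN: return u & ~v;     /* vbic */
    case OP_NAND: return ~(u & v);   /* vand, then vmvn */
    case OP_IOR:  return u | v;      /* vorr */
    case OP_IORN: return u | ~v;     /* vorn */
    case OP_NIOR: return ~(u | v);   /* vorr, then vmvn */
    case OP_XOR:  return u ^ v;      /* veor */
    case OP_XNOR: return ~(u ^ v);   /* veor, then vmvn */
    }
    return 0;
}

/* Scalar model: align rp to 16 bytes, then 4-limb blocks, then the tail. */
static void logop_n(enum logop op, limb_t *rp, const limb_t *up,
                    const limb_t *vp, size_t n)
{
    assert(n >= 1);
    if (n > 7) {
        /* Peel limbs until rp is 16-byte aligned (at most 3 limbs). */
        while (((uintptr_t)rp & 15) != 0) {
            *rp++ = apply(op, *up++, *vp++);
            n--;
        }
        /* Main part: one 128-bit vector's worth (4 limbs) per step. */
        for (; n >= 4; n -= 4) {
            for (int i = 0; i < 4; i++)
                rp[i] = apply(op, up[i], vp[i]);
            rp += 4; up += 4; vp += 4;
        }
    }
    /* Leftover limbs, or the whole operand when n <= 7. */
    while (n > 0) {
        *rp++ = apply(op, *up++, *vp++);
        n--;
    }
}
```

Only rp is aligned by the peeling step; up and vp keep whatever limb alignment the caller supplied, which is why only the stores carry :64 and :128 qualifiers in the assembly.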

Re: neon logops

2013-03-08 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes:

  Building on the copyi that tege committed the other day, use neon for
  the logical operations too.
  
  I did both a 128-bit aligned version,
  
   $ ./speed-128 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n
   clock_gettime is 1.000ns accurate
   overhead 6.00 cycles, precision 10 units of 1.00e-09 secs, CPU freq 1694.10 MHz
             mpn_and_n    mpn_nand_n
   10          #1.7987        1.8986
   50          #0.9393        1.0692
   100         #1.2491        1.3890
   500         #0.8154        0.9753
   1000        #0.7786        0.9435
   5000        #1.4955        1.5765
   1           #1.6532        1.7415
  
  and a 256-bit aligned version, just to see if having a higher ratio of
  operation insns to memory insns would help,
  
   $ ./speed-256 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n
   clock_gettime is 1.000ns accurate
   overhead 6.00 cycles, precision 10 units of 1.00e-09 secs, CPU freq 1694.10 MHz
             mpn_and_n    mpn_nand_n
   10          #1.5989        1.6988
   50          #1.0992        1.1592
   100         #1.0393        1.0593
   500         #1.0373        1.0413
   1000        #1.0303        1.0313
   5000        #1.5914        1.6003
   1            1.6824       #1.6768
  
  It's a bit curious how the latter is less jaggy, but slightly slower.
  
I assume you mean that the destination ptr is naturally aligned, while
the source ptrs are 32-bit aligned?

My guess for the jagginess is that with two src ptrs, you rarely strike
a case where one of them is 256-bit aligned, and even more rarely one
where both are.  But that happens much more often for 128-bit
alignment.  My copy was alignment insensitive, perhaps thanks to
scheduling, or because it stresses the unaligned load logic less, with
its one load-per-store?
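
The "rarely both aligned" argument can be made concrete with a small count. The sketch below rests on an assumption that is not from this thread: each source pointer is 4-byte aligned, and its offset within an aligned window is independent and uniformly distributed. Real allocators and the speed program may well behave differently. The function name both_aligned_fraction is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of (offset_u, offset_v) pairs for which both pointers are
   'align'-byte aligned, when each offset ranges over the multiples of
   4 bytes inside one 'align'-byte window.  Only the pair (0, 0) hits. */
static double both_aligned_fraction(size_t align)
{
    assert(align >= 4 && align % 4 == 0);
    size_t slots = align / 4;
    size_t hits = 0, total = 0;
    for (size_t a = 0; a < slots; a++)
        for (size_t b = 0; b < slots; b++) {
            total++;
            if ((a * 4) % align == 0 && (b * 4) % align == 0)
                hits++;
        }
    return (double)hits / (double)total;
}
```

Under that assumption, both sources are 128-bit aligned in 1 of 16 cases (align = 16) and 256-bit aligned in 1 of 64 cases (align = 32), so the fully aligned case is four times rarer for the 256-bit variant.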

You can play with -x -y -w -W to force alignment.  They are for src1,
src2, dst1, dst2, respectively, IIRC.  0 would mean aligned, except
that's not too well-defined.  1 means the pointer mod 2^something = 1,
etc.
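
To illustrate what "the pointer mod 2^something = 1" means, here is a generic C sketch that places a pointer at a chosen residue inside a buffer. It is not taken from the speed program; the unit and modulus that speed actually uses are not stated in this thread, and the names ptr_residue and ptr_with_residue are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Residue of p modulo 'modulus' bytes; modulus must be a power of two. */
static size_t ptr_residue(const void *p, size_t modulus)
{
    assert(modulus != 0 && (modulus & (modulus - 1)) == 0);
    return (size_t)((uintptr_t)p & (modulus - 1));
}

/* First address at or after 'base' whose residue modulo 'modulus' is
   'want'.  The caller must own at least modulus - 1 bytes of slack
   beyond 'base'. */
static unsigned char *ptr_with_residue(unsigned char *base, size_t modulus,
                                       size_t want)
{
    assert(want < modulus);
    size_t have = ptr_residue(base, modulus);
    return base + ((want - have) & (modulus - 1));
}
```

With modulus 16, a request for residue 0 gives a 128-bit aligned pointer, and a request for residue 4 gives one that is only 32-bit aligned.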

-- 
Torbjörn


Re: neon logops

2013-03-08 Thread Richard Henderson

On 2013-03-08 03:46, Torbjorn Granlund wrote:

  I assume you mean that the destination ptr is naturally aligned, while
  the source ptrs are 32-bit aligned?


Yes.


  My guess for the jagginess is that with two src ptrs, you rarely strike
  a case where one of them is 256-bit aligned, and even more rarely one
  where both are.  But that happens much more often for 128-bit
  alignment.  My copy was alignment insensitive, perhaps thanks to
  scheduling, or because it stresses the unaligned load logic less, with
  its one load-per-store?


I don't know.  I do know there's something bizarre going on that probably
needs some chip knowledge to figure out.


For instance, testing the -128 patch I posted here, and making no other change 
except *adding* :128 markers to both source operands, I hoped to determine what 
effect source alignment has on the loop.  (This change is not generally 
correct, but does work for the case of speed with specified alignment.)


The peak result is slightly *slower* than before.

          with align                 without align
          mpn_and_n    mpn_nand_n    mpn_and_n    mpn_nand_n
10          #1.7989        1.8987       1.7990        1.8989
50          #0.9393        1.0693       0.9395        1.0694
100         #1.2491        1.3891       1.2496        1.3893
500         #0.8154        0.9753       0.8156        0.9756
1000         0.8746        1.0642      #0.7787        0.9435
5000        #1.4067        1.4939       1.5012        1.5577
1           #1.5454        1.6702       1.5521        1.5926
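
The size of the effect is easier to judge as a relative difference between the two mpn_and_n columns. The helper below is only an illustrative sketch; percent_change is a hypothetical name, and the inputs are the cycles/limb figures from the table above.

```c
#include <assert.h>

/* Relative change of 'with_align' against 'without_align', in percent.
   Positive means the aligned variant needs more cycles per limb. */
static double percent_change(double with_align, double without_align)
{
    assert(without_align > 0.0);
    return (with_align / without_align - 1.0) * 100.0;
}
```

For mpn_and_n, percent_change(0.8746, 0.7787) comes out at roughly +12 percent at size 1000, while percent_change(1.4067, 1.5012) is roughly -6 percent at size 5000. The :128 markers therefore hurt at the peak and help at the larger size.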
