Re: ARM public key benchmark

2013-04-04 Thread Richard Henderson
On 2013-04-04 06:51, Niels Möller wrote: And it's no use to even think of porting the loop mixer to arm without access to cycle-accurate timing. Looking around the web it seems that what most folks do is write a minimal kernel module that toggles the bit that allows userspace access to the

neon logops

2013-03-08 Thread Richard Henderson
Building on the copyi that tege committed the other day, use neon for the logical operations too. I did both a 128-bit aligned version, $ ./speed-128 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n clock_gettime is 1.000ns accurate overhead 6.00 cycles, precision
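For readers unfamiliar with the two routines being timed, their semantics can be sketched in portable C. This is a reference model only, not the NEON code from the post; `limb_t`, `ref_and_n` and `ref_nand_n` are hypothetical names, and a 32-bit limb is assumed, as on 32-bit ARM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_limb_t on a 32-bit ARM build. */
typedef uint32_t limb_t;

/* rp[i] = up[i] & vp[i] for 0 <= i < n: the operation mpn_and_n performs. */
static void ref_and_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    assert(n > 0);  /* mpn functions expect at least one limb */
    for (size_t i = 0; i < n; i++)
        rp[i] = up[i] & vp[i];
}

/* rp[i] = ~(up[i] & vp[i]) for 0 <= i < n: the operation mpn_nand_n performs. */
static void ref_nand_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        rp[i] = (limb_t)~(up[i] & vp[i]);
}
```

A vector version would process several limbs per iteration, which is where the alignment of the three pointers starts to matter.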

Re: neon logops

2013-03-08 Thread Richard Henderson
On 2013-03-08 03:46, Torbjorn Granlund wrote: I assume you mean that the destination ptrs are naturally aligned, while the source ptrs are 32-bit aligned? Yes. My guess for the jaggyness is that of two src ptrs, you rarely strike a case where they are 256-bit aligned, in particular not when

Re: [PATCH 3/3] Optimize 64-bit mpn_add_N and mpn_sub_N for sparc T3 and later.

2013-03-05 Thread Richard Henderson
On 03/05/2013 04:54 AM, Torbjorn Granlund wrote: There is currently no mpn/sparc64/ultrasparct3, only ultrasparct1. For which CPUs are these new add_n/sub_n intended? Why not also for other CPUs? For T3 and T4. This file makes use of new instructions: addxc(cc). Honestly, why they

Re: [PATCH 3/3] Optimize 64-bit mpn_add_N and mpn_sub_N for sparc T3 and later.

2013-03-05 Thread Richard Henderson
On 03/05/2013 12:05 PM, David Miller wrote: Which still hasn't made it to the list yet. I wonder what is rejecting it, as I never receive any kind of notification. Torbjorn, did you at least receive it this time, as you're on the CC:? BTW, I also noticed that Richard's 20 piece patch set

Re: [PATCH 06/20] Use gmp-renamei.h for renaming the internal routines

2013-03-05 Thread Richard Henderson
On 03/05/2013 12:51 PM, bodr...@mail.dm.unipi.it wrote: +__GMP_INTERN (extern const mp_limb_t, __gmp_oddfac_table, []); +__GMP_INTERN (extern const mp_limb_t, __gmp_odd2fac_table, []); +__GMP_INTERN (extern const unsigned char, __gmp_fac2cnt_table, []); +__GMP_INTERN (extern const

[PATCH 20/20] Delete support for varargs.h

2013-03-04 Thread Richard Henderson
ANSI C is now 25 years old. We already use ANSI-C-isms all over the source tree. This sort of paranoia check is now well out of date. --- acinclude.m4 | 55 +++--- config.in | 3 --- configure.ac | 1 -

[PATCH 01/20] Delete mpn/generic/sizeinbase.c

2013-03-04 Thread Richard Henderson
As far as I can tell it hasn't been used since 2002-02-09 Kevin Ryde ke...@swox.se * configure.in, mpn/Makefile.am, gmp-impl.h (mpn_sizeinbase): Remove. * mpn/generic/sizeinbase.c: Remove file. removed it from MPN_OBJECTS. It's certainly never built. I'm not sure how the

[PATCH 17/20] Convert the scanf directory to __GMP_*_DEFINE and include changes

2013-03-04 Thread Richard Henderson
--- scanf/doscan.c | 32 ++-- scanf/fscanf.c | 12 +--- scanf/fscanffuns.c | 3 +-- scanf/scanf.c | 12 +--- scanf/sscanf.c | 12 +--- scanf/sscanffuns.c | 4 +--- scanf/vfscanf.c| 12 +--- scanf/vscanf.c | 12

[PATCH 04/20] Make proper use of gmp-rename.h

2013-03-04 Thread Richard Henderson
This lets us delete all of the defines cluttering the human maintained source. Note that __MPN gets to move to the implementation, and we had a redundant definition of mpn_sqr. --- gmp-h.in | 413 ++--- gmp-impl.h | 7 +- 2 files

[PATCH 09/20] Prepare for creating hidden aliases of all routines

2013-03-04 Thread Richard Henderson
All of the mechanism is here, but not enabled -- configure has not yet been updated to define HAVE_HIDDEN_ALIAS. However, by hacking the generated config.h file by hand we'll be able to find errors as they occur without having to create one monster patch to do everything all at once. ---

[PATCH 08/20] Squish include requirements

2013-03-04 Thread Richard Henderson
It is vital that config.h be included before gmp.h in order to get the right expansions for __GMP_PUBLIC_FULL and __GMP_PUBLIC_DEFINE, once we start declaring hidden symbols. Ease the burden of include orderings by always using only gmp-impl.h, and getting everything else -- including system

[PATCH 18/20] Automatic hidden aliases inside existing asm-defs.m4 macros

2013-03-04 Thread Richard Henderson
This means we don't have to edit every single asm file. --- mpn/asm-defs.m4 | 342 1 file changed, 174 insertions(+), 168 deletions(-) diff --git a/mpn/asm-defs.m4 b/mpn/asm-defs.m4 index b95cad7..e3cdcb5 100644 --- a/mpn/asm-defs.m4 +++

[PATCH 03/20] Build and include gmp-rename.h

2013-03-04 Thread Richard Henderson
Placeholder commit. The defines are still in gmp-h.in, so the generated file contains lots of self-defines of ABI to ABI name, which the script actually removes, creating an empty file. But the makefile rule works... --- Makefile.am| 7 +-- gen-rename.awk | 34

[PATCH 16/20] Convert the toplevel directory to __GMP_*_DEFINE and include changes

2013-03-04 Thread Richard Henderson
--- assert.c | 5 ++--- compat.c | 2 -- errno.c| 7 +-- extract-dbl.c | 2 +- gen-bases.c| 2 +- gen-fib.c | 2 +- gen-rename.c | 10 ++ gen-renamei.c | 6 -- invalid.c | 18 ++ memory.c | 7 +++ mp_bpl.c

[PATCH 19/20] Configure for hidden aliases.

2013-03-04 Thread Richard Henderson
--- acinclude.m4 | 20 config.in| 4 configure.ac | 1 + 3 files changed, 25 insertions(+) diff --git a/acinclude.m4 b/acinclude.m4 index 15f71b1..f7b128e 100644 --- a/acinclude.m4 +++ b/acinclude.m4 @@ -3169,6 +3169,26 @@ fi ]) +dnl GMP_C_HIDDEN_ALIAS +dnl

Re: [PATCH 09/20] Prepare for creating hidden aliases of all routines

2013-03-04 Thread Richard Henderson
On 03/04/2013 11:48 AM, Niels Möller wrote: Richard Henderson r...@twiddle.net writes: index 1b27998..ff0dc45 100644 --- a/gmp-h.in +++ b/gmp-h.in @@ -251,6 +251,10 @@ typedef __mpq_struct *mpq_ptr; __GMP_PUBLIC_DATA - for declaring data variables __GMP_PUBLIC_ALIAS

Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so

2013-03-04 Thread Richard Henderson
On 03/04/2013 12:25 PM, Torbjorn Granlund wrote: Did you use gmp-func-list.txt for determining which functions to make public? No, I used the existing gmp-h.in file, as I mentioned elsewhere. Note that all symbols that are visible today are still visible with the patch. I'm not really

Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so

2013-03-04 Thread Richard Henderson
On 03/04/2013 12:21 PM, Torbjorn Granlund wrote: What sort of paperwork do you and Red Hat have in place? We need to extend it as soon as possible, if the current paperwork needs amending. (Last time, for David Miller, it took something like two months, and only after some nagging.)

Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so

2013-03-04 Thread Richard Henderson
On 03/04/2013 12:47 PM, Torbjorn Granlund wrote: But we might as well address this in the next stage. Do you agree? Yes. I think the macros added here will aid in cleaning things up. r~ ___ gmp-devel mailing list gmp-devel@gmplib.org

[PATCH 11/20] Convert the mpf subdirectory to __GMP_*_DEFINE and include changes

2013-03-04 Thread Richard Henderson
--- mpf/abs.c | 2 +- mpf/add.c | 2 +- mpf/add_ui.c | 2 +- mpf/ceilfloor.c| 3 ++- mpf/clear.c| 2 +- mpf/clears.c | 11 +-- mpf/cmp.c | 2 +- mpf/cmp_d.c| 8 +--- mpf/cmp_si.c | 2 +- mpf/cmp_ui.c | 2

Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so

2013-03-04 Thread Richard Henderson
On 03/04/2013 12:21 PM, Torbjorn Granlund wrote: I've tried to do this in a series of steps that are as mechanical as possible, and therefore as easy to review as possible. Is the patch set intended to be applied as a whole, or will applying each (in number order) give something which

Re: Register r9 in the ARM ABI

2013-03-01 Thread Richard Henderson
On 02/28/2013 11:41 PM, Niels Möller wrote: I have a general ARM question, that maybe Richard or someone else on the list can answer. From the ABI documentation I've read, register r9 is in some way reserved for implementation of things like thread local storage. If I write a leaf function

Re: GMP symbol naming (and the history thereof)?

2013-02-28 Thread Richard Henderson
On 02/28/2013 12:50 AM, Torbjorn Granlund wrote: Richard Henderson r...@twiddle.net writes: Several times over the past week as I debug my neon routines, it has become painfully apparent (as I accidentally single-step into the dynamic linker) that the shared libgmp could use some help

Re: ARM Neon popcount

2013-02-27 Thread Richard Henderson
On 2013-02-27 13:27, Torbjorn Granlund wrote: Specific questions: * I completely ignore alignment. Is that bad? I'm not sure about that. It's something that perhaps we should experiment with. As written, the code will work, as the chip will handle totally unaligned data. What I don't

Re: ARM Neon popcount

2013-02-27 Thread Richard Henderson
On 2013-02-27 14:33, Torbjorn Granlund wrote: vld1.32 { q1, q2 }, [r0@128]! As specified in section A.3.2.1, if you specify the alignment it will also be checked, so you'll get SIGBUS if it's not right. I wanted to experiment, but I cannot find any syntax which is accepted
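One common way to satisfy an alignment-checked load such as `[r0@128]` is to peel off leading limbs until the pointer is 128-bit aligned and only then enter the main loop. The portable C sketch below models that structure for a popcount; it is not the NEON code under discussion, the names `limb_t`, `popc_limb`, `is_aligned_128` and `ref_popcount` are hypothetical, and the four-limb body merely stands in for a vector loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;  /* hypothetical stand-in for a 32-bit mp_limb_t */

/* Bit count of one limb using the classic SWAR reduction. */
static unsigned popc_limb(limb_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (unsigned)((x * 0x01010101u) >> 24);
}

/* Nonzero if p is 128-bit (16-byte) aligned. */
static int is_aligned_128(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Total number of set bits in up[0..n-1]. */
static unsigned long ref_popcount(const limb_t *up, size_t n)
{
    unsigned long total = 0;

    /* Peel leading limbs until the pointer is 128-bit aligned. */
    while (n > 0 && !is_aligned_128(up)) {
        total += popc_limb(*up++);
        n--;
    }
    /* Aligned body: four limbs (128 bits) per step. */
    for (; n >= 4; up += 4, n -= 4) {
        assert(is_aligned_128(up));
        total += popc_limb(up[0]) + popc_limb(up[1])
               + popc_limb(up[2]) + popc_limb(up[3]);
    }
    /* Remaining tail limbs. */
    while (n > 0) {
        total += popc_limb(*up++);
        n--;
    }
    return total;
}
```

The result does not depend on where the buffer starts; only the split between the peeled limbs, the aligned body and the tail changes.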

GMP symbol naming (and the history thereof)?

2013-02-27 Thread Richard Henderson
Several times over the past week as I debug my neon routines, it has become painfully apparent (as I accidentally single-step into the dynamic linker) that the shared libgmp could use some help in modernizing its internal linkage. Most important is arranging for calls within GMP to go through
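The "[PATCH 00/20] Create and use hidden aliases in libgmp.so" series listed above follows up on this. As a rough illustration of the general technique, and not of the actual GMP macros, a hidden alias can be written with GCC attributes on an ELF target as below. The names `demo_add`, `demo_add_internal` and `demo_add3` are hypothetical, and the `alias` attribute is not available on every platform (Mach-O targets reject it, for instance).

```c
#include <assert.h>

/* Public entry point, exported from the shared library. */
int demo_add(int a, int b)
{
    return a + b;
}

/* Hidden alias: a second name for the same code that is not exported.
   Calls through it bind inside the library, so they cannot be
   interposed and do not need to go through the PLT. */
extern __typeof__(demo_add) demo_add_internal
    __attribute__((alias("demo_add"), visibility("hidden")));

/* Library-internal caller uses the hidden name. */
int demo_add3(int a, int b, int c)
{
    return demo_add_internal(demo_add_internal(a, b), c);
}
```

External users keep calling `demo_add`; only code inside the library refers to the hidden name, so the exported symbol set stays the same.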

Neon column-wise addmul_4

2013-02-26 Thread Richard Henderson
On 02/26/2013 05:14 AM, Niels Möller wrote: Untried tricks: One could try to use vuzp to separate high and low parts of the products. Then only the low parts need shifting around. I guess I'll try that with addmul_4 first, to see if it makes for any improvement. One could maybe use vaddw, to

Re: Neon addmul_8

2013-02-26 Thread Richard Henderson
On 02/26/2013 10:41 AM, Torbjorn Granlund wrote: I'm not sure quite what's going on with the 3/4 issue rates. I really would have expected to see either exactly 1, or very nearly 1/2, especially for vadd. I think you mean 4/3. But also that is an underestimate. With 8-way unrolling

Re: arm neon

2013-02-23 Thread Richard Henderson
On 2013-02-23 06:06, Niels Möller wrote: Not sure what the bottlenecks of your loop are though; instruction decoding, load/store, or the recurrency chain (but at least it shouldn't be multiplier throughput, right?). Yeah, neither am I. I can't find any info on what latency of neon insns

Re: arm neon

2013-02-23 Thread Richard Henderson
On 2013-02-23 05:31, Torbjorn Granlund wrote: Richard Henderson r...@twiddle.net writes: Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time with larger unrolling to make full use of the vector load insns, and less over-prefetching. Good improvement! Keep

Neon addmul_4

2013-02-23 Thread Richard Henderson
Down to 2.8-3.0 cyc/limb. r~ dnl ARM neon mpn_addmul_4. dnl dnl Copyright 2013 Free Software Foundation, Inc. dnl dnl This file is part of the GNU MP Library. dnl dnl The GNU MP Library is free software; you can redistribute it and/or modify dnl it under the terms of the GNU Lesser General

Neon addmul_8

2013-02-23 Thread Richard Henderson
gcc -O2 -g3 [...] addmul_N.c -DN=8 -DCLOCK=169400 $ ./t.out mpn_addmul_8: 2845ms (1.782 cycles/limb) [973.59 Gb/s] mpn_addmul_8: 2620ms (1.641 cycles/limb) [1057.20 Gb/s] mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19 Gb/s] mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19

Re: arm neon

2013-02-22 Thread Richard Henderson
On 02/22/2013 02:32 AM, Torbjorn Granlund wrote: Richard Henderson r...@twiddle.net writes: Indeed, the last version that Niels posted doesn't pass this test. Oops. The following does pass, but if I'm to believe the arithmetic it's still fairly slow -- around 12cyc/sec

Re: arm neon

2013-02-22 Thread Richard Henderson
On 02/22/2013 10:20 AM, Torbjorn Granlund wrote: Useful. Is there any 32+32 32 - 32? I.e., carry-out. Sadly, no. Or if there is, I missed it. Also interesting, as I'm looking around, is VEXT. Consider vmull.u32 Qa01, Du00, Dv01 vmull.u32 Qb12, Du11, Dv01 which

Re: arm neon

2013-02-22 Thread Richard Henderson
On 02/22/2013 12:08 PM, Richard Henderson wrote: Perhaps I should give this another go... Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time with larger unrolling to make full use of the vector load insns, and less over-prefetching. I guess the target is anything under

Re: arm neon

2013-01-12 Thread Richard Henderson
On 01/12/2013 10:30 AM, Niels Möller wrote: Using Neon in a robust way might be a bit tricky, though. I have no idea how to determine if a CPU has Neon or not, and ARM has made most useful meta instructions supervisor-only. For a start, I guess it could be a configure time option (with no

Re: Some arm cortex-a8 improvements

2012-04-24 Thread Richard Henderson
On 04/24/12 00:18, Torbjorn Granlund wrote: On my system, umaal has a latency of 3, whatever dependencies I create. (There are 4 input regs and 2 output, so there are quite a few possible dependency combinations; I only tried a subset.) Either the docs are plain wrong, or there are several
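For context on why `umaal` matters here: the instruction computes RdHi:RdLo = Rn * Rm + RdLo + RdHi, and the result always fits in 64 bits because (2^32 - 1)^2 + 2(2^32 - 1) = 2^64 - 1. That is exactly the inner step of an addmul_1 loop. The sketch below is a portable C model of that, not GMP's assembly; `umaal_model` and `addmul_1_model` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable model of UMAAL RdLo, RdHi, Rn, Rm:
   hi:lo = n * m + lo + hi. The sum cannot overflow 64 bits. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t n, uint32_t m)
{
    uint64_t t = (uint64_t)n * m + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb,
   matching the documented behaviour of mpn_addmul_1. */
static uint32_t addmul_1_model(uint32_t *rp, const uint32_t *up,
                               size_t n, uint32_t v)
{
    uint32_t cy = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        uint32_t lo = rp[i];
        umaal_model(&lo, &cy, up[i], v);  /* cy:lo = up[i]*v + rp[i] + cy */
        rp[i] = lo;
    }
    return cy;
}
```

Each iteration feeds the previous high word back in as the carry, so the instruction's latency along that dependency is what bounds a simple loop.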

Re: Some arm cortex-a8 improvements

2012-04-23 Thread Richard Henderson
On 04/22/2012 03:06 PM, Torbjorn Granlund wrote: Richard Henderson r...@twiddle.net writes: I used the following, almost certainly not appropriate for general application. [snip] Thanks. It would be very useful to make GMP timing work with the Linux kernel running on ARM. Do you

Re: Some arm cortex-a8 improvements

2012-04-23 Thread Richard Henderson
On 04/23/12 07:49, Torbjorn Granlund wrote: Do you know the repeat rate of umull, umlal, umaal, assuming no reg dependencies? For a8: 3 cycles. r~

Re: Some arm cortex-a8 improvements

2012-04-23 Thread Richard Henderson
On 04/22/12 13:06, Torbjorn Granlund wrote: Thanks. It would be very useful to make GMP timing work with the Linux kernel running on ARM. Do you know if there are similar problems with, say, NetBSD? Apparently it's possible to enable the perf counter registers at user level. See

Re: Some arm cortex-a8 improvements

2012-04-23 Thread Richard Henderson
On 04/23/12 15:32, Torbjorn Granlund wrote: Richard Henderson r...@twiddle.net writes: On 04/23/12 07:49, Torbjorn Granlund wrote: Do you know the repeat rate of umull, umlal, umaal, assuming no reg dependencies? For a8: 3 cycles. For a9 it seems to be 2 cycles, so 3.25 c

Re: Some arm cortex-a8 improvements

2012-04-21 Thread Richard Henderson
On 04/20/2012 11:34 AM, Torbjorn Granlund wrote: It's a bit touchy speed testing these. There's no cycle counter available in userspace, and Hz is depressingly low. So I've had to bump the minimum iterations way way up in order to get semi-reliable results. Which causes the speed
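Without a userspace cycle counter, the usual fallback is to repeat the routine until a wall-clock timer has run long enough to be trustworthy, then convert to cycles using a known clock rate. The sketch below shows that idea with POSIX `clock_gettime`; it is not GMP's `speed` program, the names `now_seconds`, `work` and `cycles_per_limb` are hypothetical, `work` is a placeholder workload, and the clock frequency is an assumption the caller must supply.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Placeholder workload (hypothetical): sums n limbs. */
static uint32_t work(const uint32_t *up, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += up[i];
    return s;
}

/* Doubles the repetition count until one timed batch lasts at least
   min_seconds, then reports cycles per limb for an assumed clock_hz. */
static double cycles_per_limb(const uint32_t *up, size_t n,
                              double min_seconds, double clock_hz)
{
    volatile uint32_t sink = 0;  /* keeps the calls from being optimised away */
    unsigned long reps = 1;
    assert(n > 0 && min_seconds > 0.0 && clock_hz > 0.0);
    for (;;) {
        double t0 = now_seconds();
        for (unsigned long i = 0; i < reps; i++)
            sink += work(up, n);
        double dt = now_seconds() - t0;
        if (dt >= min_seconds)
            return dt * clock_hz / ((double)reps * (double)n);
        reps *= 2;
    }
}
```

A larger `min_seconds` reduces the effect of timer granularity at the cost of longer runs, which is the trade-off described in the message.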