On 2013-04-04 06:51, Niels Möller wrote:
And it's no use to even think of porting the loop mixer to arm without
access to cycle-accurate timing.
Looking around the web it seems that what most folks do is write a
minimal kernel module that toggles the bit that allows userspace
access to the
Building on the copyi that tege committed the other day, use neon for the
logical operations too.
I did both a 128-bit aligned version,
$ ./speed-128 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n
mpn_nand_n
clock_gettime is 1.000ns accurate
overhead 6.00 cycles, precision
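For reference, the limb-wise behaviour that a NEON version of these logical operations has to reproduce can be written in plain C. This is only a sketch under assumptions: `limb_t`, `ref_and_n` and `ref_nand_n` are names invented here, a 32-bit limb is assumed as on ARMv7, and nand is taken to mean the complement of the and of corresponding limbs. It is not GMP's generic code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_limb_t; 32-bit limbs assumed. */
typedef uint32_t limb_t;

/* rp[i] = up[i] & vp[i] for 0 <= i < n.  rp may be the same as up or vp. */
void ref_and_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        rp[i] = up[i] & vp[i];
}

/* rp[i] = ~(up[i] & vp[i]) for 0 <= i < n.  rp may be the same as up or vp. */
void ref_nand_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        rp[i] = ~(up[i] & vp[i]);
}
```

A vector version can be checked against these by comparing outputs over random operands and over the different pointer alignments.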
On 2013-03-08 03:46, Torbjorn Granlund wrote:
I assume you mean that the destination ptrs are naturally aligned, while
the source ptrs are 32-bit aligned?
Yes.
My guess for the jaggyness is that of two src ptrs, you rarely strike
a case where they are 256-bit aligned, in particular not when
On 03/05/2013 04:54 AM, Torbjorn Granlund wrote:
There is currently no mpn/sparc64/ultrasparct3, only ultrasparct1. For
which CPUs are these new add_n/sub_n intended? Why not also for
other CPUs?
For T3 and T4. This file makes use of new instructions: addxc(cc).
Honestly, why they
On 03/05/2013 12:05 PM, David Miller wrote:
Which still hasn't made it to the list yet. I wonder what is
rejecting it, as I never receive any kind of notification. Torbjorn,
did you at least receive it this time, as you're on the CC:?
BTW, I also noticed that Richard's 20 piece patch set
On 03/05/2013 12:51 PM, bodr...@mail.dm.unipi.it wrote:
+__GMP_INTERN (extern const mp_limb_t, __gmp_oddfac_table, []);
+__GMP_INTERN (extern const mp_limb_t, __gmp_odd2fac_table, []);
+__GMP_INTERN (extern const unsigned char, __gmp_fac2cnt_table, []);
+__GMP_INTERN (extern const
ANSI C is now 25 years old. We already use ANSI-C-isms all over the
source tree. This sort of paranoia check is now well out of date.
---
acinclude.m4 | 55 +++---
config.in | 3 ---
configure.ac | 1 -
As far as I can tell it hasn't been used since
2002-02-09 Kevin Ryde ke...@swox.se
* configure.in, mpn/Makefile.am, gmp-impl.h (mpn_sizeinbase): Remove.
* mpn/generic/sizeinbase.c: Remove file.
removed it from MPN_OBJECTS. It's certainly never built.
I'm not sure how the
---
scanf/doscan.c | 32 ++--
scanf/fscanf.c | 12 +---
scanf/fscanffuns.c | 3 +--
scanf/scanf.c | 12 +---
scanf/sscanf.c | 12 +---
scanf/sscanffuns.c | 4 +---
scanf/vfscanf.c | 12 +---
scanf/vscanf.c | 12
This lets us delete all of the defines cluttering the
human maintained source.
Note that __MPN gets to move to the implementation, and we had
a redundant definition of mpn_sqr.
---
gmp-h.in | 413 ++---
gmp-impl.h | 7 +-
2 files
All of the mechanism is here, but not enabled -- configure has not
yet been updated to define HAVE_HIDDEN_ALIAS.
However, by hacking the generated config.h file by hand we'll be
able to find errors as they occur without having to create one
monster patch to do everything all at once.
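As an illustration of the kind of construct a hidden alias is, here is a minimal sketch assuming GCC or Clang on an ELF target. The names `demo_add`, `demo_add_internal` and `demo_add3` are made up for this example and are not the macros from the patch set.

```c
/* Assumes GCC or Clang targeting ELF; alias attributes are not portable. */

/* Public entry point.  In a shared library, calls to this symbol may be
   routed through the PLT and can be interposed. */
int demo_add(int a, int b)
{
    return a + b;
}

/* Second name for the same code, with hidden visibility.  The symbol is
   not exported from the shared object, so calls to it bind locally. */
extern __typeof__(demo_add) demo_add_internal
    __attribute__((alias("demo_add"), visibility("hidden")));

/* Library-internal caller: uses the hidden name, not the public one. */
int demo_add3(int a, int b, int c)
{
    return demo_add_internal(demo_add_internal(a, b), c);
}
```

The public name keeps its exported symbol, so existing users of the library see no difference; only calls made from inside the library change how they bind.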
---
It is vital that config.h be included before gmp.h in order to get
the right expansions for __GMP_PUBLIC_FULL and __GMP_PUBLIC_DEFINE,
once we start declaring hidden symbols.
Ease the burden of include orderings by always using only gmp-impl.h,
and getting everything else -- including system
This means we don't have to edit every single asm file.
---
mpn/asm-defs.m4 | 342
1 file changed, 174 insertions(+), 168 deletions(-)
diff --git a/mpn/asm-defs.m4 b/mpn/asm-defs.m4
index b95cad7..e3cdcb5 100644
--- a/mpn/asm-defs.m4
+++
Placeholder commit. The defines are still in gmp-h.in, so the
generated file contains lots of self-defines of ABI to ABI name,
which the script actually removes, creating an empty file.
But the makefile rule works...
---
Makefile.am | 7 +--
gen-rename.awk | 34
---
assert.c | 5 ++---
compat.c | 2 --
errno.c | 7 +--
extract-dbl.c | 2 +-
gen-bases.c | 2 +-
gen-fib.c | 2 +-
gen-rename.c | 10 ++
gen-renamei.c | 6 --
invalid.c | 18 ++
memory.c | 7 +++
mp_bpl.c
---
acinclude.m4 | 20
config.in | 4
configure.ac | 1 +
3 files changed, 25 insertions(+)
diff --git a/acinclude.m4 b/acinclude.m4
index 15f71b1..f7b128e 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -3169,6 +3169,26 @@ fi
])
+dnl GMP_C_HIDDEN_ALIAS
+dnl
On 03/04/2013 11:48 AM, Niels Möller wrote:
Richard Henderson r...@twiddle.net writes:
index 1b27998..ff0dc45 100644
--- a/gmp-h.in
+++ b/gmp-h.in
@@ -251,6 +251,10 @@ typedef __mpq_struct *mpq_ptr;
__GMP_PUBLIC_DATA - for declaring data variables
__GMP_PUBLIC_ALIAS
On 03/04/2013 12:25 PM, Torbjorn Granlund wrote:
Did you use gmp-func-list.txt for determining which functions to make
public?
No, I used the existing gmp-h.in file, as I mentioned elsewhere.
Note that all symbols that are visible today are still visible with the patch.
I'm not really
On 03/04/2013 12:21 PM, Torbjorn Granlund wrote:
What sort of paperwork do you and Red Hat have in place? We need to
extend it as soon as possible, if the current paperwork needs amending.
(Last time, for David Miller, it took something like two months, and
only after some nagging.)
On 03/04/2013 12:47 PM, Torbjorn Granlund wrote:
But we might as well address this in the next stage. Do you agree?
Yes. I think the macros added here will aid in cleaning things up.
r~
___
gmp-devel mailing list
gmp-devel@gmplib.org
---
mpf/abs.c | 2 +-
mpf/add.c | 2 +-
mpf/add_ui.c | 2 +-
mpf/ceilfloor.c | 3 ++-
mpf/clear.c | 2 +-
mpf/clears.c | 11 +--
mpf/cmp.c | 2 +-
mpf/cmp_d.c | 8 +---
mpf/cmp_si.c | 2 +-
mpf/cmp_ui.c | 2
On 03/04/2013 12:21 PM, Torbjorn Granlund wrote:
I've tried to do this in a series of steps that are as mechanical as
possible, and therefore as easy to review as possible.
Is the patch set intended to be applied as a whole, or will applying
each (in number order) give something which
On 02/28/2013 11:41 PM, Niels Möller wrote:
I have a general ARM question, that maybe Richard or someone else on the
list can answer.
From the ABI documentation I've read, register r9 is in some way
reserved for implementation of things like thread local storage. If I
write a leaf function
On 02/28/2013 12:50 AM, Torbjorn Granlund wrote:
Richard Henderson r...@twiddle.net writes:
Several times over the past week as I debug my neon routines, it has
become painfully apparent (as I accidentally single-step into the
dynamic linker) that the shared libgmp could use some help
On 2013-02-27 13:27, Torbjorn Granlund wrote:
Specific questions:
* I completely ignore alignment. Is that bad?
I'm not sure about that. It's something that perhaps we should
experiment with. As written, the code will work, as the chip will
handle totally unaligned data. What I don't
On 2013-02-27 14:33, Torbjorn Granlund wrote:
vld1.32 { q1, q2 }, [r0@128]!
As specified in section A.3.2.1, if you specify the alignment it will
also be checked, so you'll get SIGBUS if it's not right.
I wanted to experiment, but I cannot find any syntax which is accepted
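Since an alignment qualifier such as @128 is checked by the hardware, code that uses it needs to know the pointer really is 16-byte aligned, or needs to peel off leading limbs until it is. A small C sketch of that bookkeeping follows; `is_aligned` and `limbs_to_align16` are hypothetical helper names, and 32-bit limbs on pointers that are at least 4-byte aligned are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of align bytes.  align must be a power of two. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t) p & (align - 1)) == 0;
}

/* Number of 32-bit limbs to process one at a time before p reaches a
   16-byte (128-bit) boundary.  Assumes p is at least 4-byte aligned. */
size_t limbs_to_align16(const uint32_t *p)
{
    size_t misalign = (uintptr_t) p & 15;   /* bytes past the last boundary */
    return ((16 - misalign) & 15) / 4;      /* 0 when already aligned */
}
```

A routine could run a scalar loop for `limbs_to_align16(rp)` limbs and only then enter a loop that uses the aligned form of the load or store.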
Several times over the past week as I debug my neon routines, it has
become painfully apparent (as I accidentally single-step into the
dynamic linker) that the shared libgmp could use some help in
modernizing its internal linkage.
Most important is arranging for calls within GMP to go through
On 02/26/2013 05:14 AM, Niels Möller wrote:
Untried tricks: One could try to use vuzp to separate high and low
parts of the products. Then only the low parts need shifting around.
I guess I'll try that with addmul_4 first, to see if it makes for any
improvement. One could maybe use vaddw, to
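The vuzp idea can be modelled in scalar C. The sketch below is not NEON code and not the routine under discussion: it only shows what the registers would hold after four 32x32 to 64-bit lane products (as produced by vmull.u32) are de-interleaved, with all low halves in one vector and all high halves in the other. The name `mul_split4` is invented for the example.

```c
#include <stdint.h>

/* Scalar model: form four 64-bit products u[i] * v[i], then separate them
   into low and high 32-bit halves, which is the layout a vuzp.32 of the
   two product registers would give. */
void mul_split4(uint32_t lo[4], uint32_t hi[4],
                const uint32_t u[4], const uint32_t v[4])
{
    for (int i = 0; i < 4; i++) {
        uint64_t p = (uint64_t) u[i] * v[i];   /* full 64-bit product */
        lo[i] = (uint32_t) p;                  /* low 32 bits */
        hi[i] = (uint32_t) (p >> 32);          /* high 32 bits */
    }
}
```

With the halves separated this way, only the low parts need to be moved by one limb position before the two vectors are added, which is the saving the message is after.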
On 02/26/2013 10:41 AM, Torbjorn Granlund wrote:
I'm not sure quite what's going on with the 3/4 issue rates. I really would
have expected to see either exactly 1, or very nearly 1/2, especially for
vadd.
I think you mean 4/3. But also that is an underestimate. With 8-way
unrolling
On 2013-02-23 06:06, Niels Möller wrote:
Not sure what the bottlenecks of your loop are though; instruction
decoding, load/store, or the recurrence chain (but at least it shouldn't
be multiplier throughput, right?).
Yeah, neither am I. I can't find any info on what latency of neon insns
On 2013-02-23 05:31, Torbjorn Granlund wrote:
Richard Henderson r...@twiddle.net writes:
Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time
with larger unrolling to make full use of the vector load insns, and less
over-prefetching.
Good improvement!
Keep
Down to 2.8-3.0 cyc/limb.
r~
dnl ARM neon mpn_addmul_4.
dnl
dnl Copyright 2013 Free Software Foundation, Inc.
dnl
dnl This file is part of the GNU MP Library.
dnl
dnl The GNU MP Library is free software; you can redistribute it and/or modify
dnl it under the terms of the GNU Lesser General
gcc -O2 -g3 [...] addmul_N.c -DN=8 -DCLOCK=169400
$ ./t.out
mpn_addmul_8: 2845ms (1.782 cycles/limb) [973.59 Gb/s]
mpn_addmul_8: 2620ms (1.641 cycles/limb) [1057.20 Gb/s]
mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19 Gb/s]
mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19
On 02/22/2013 02:32 AM, Torbjorn Granlund wrote:
Richard Henderson r...@twiddle.net writes:
Indeed, the last version that Niels posted doesn't pass this test.
Oops.
The following does pass, but if I'm to believe the arithmetic it's
still fairly slow -- around 12cyc/sec
On 02/22/2013 10:20 AM, Torbjorn Granlund wrote:
Useful. Is there any 32+32 32 - 32? I.e., carry-out.
Sadly, no. Or if there is, I missed it.
Also interesting, as I'm looking around, is VEXT. Consider
vmull.u32 Qa01, Du00, Dv01
vmull.u32 Qb12, Du11, Dv01
which
On 02/22/2013 12:08 PM, Richard Henderson wrote:
Perhaps I should give this another go...
Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time
with larger unrolling to make full use of the vector load insns, and less
over-prefetching.
I guess the target is anything under
On 01/12/2013 10:30 AM, Niels Möller wrote:
Using Neon in a robust way might be a bit tricky, though. I have no
idea how to determine if a CPU has Neon or not, and ARM has made most
useful meta instructions supervisor-only.
For a start, I guess it could be a configure time option (with no
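One way to ask at run time, as opposed to configure time, is the auxiliary vector that Linux passes to every process. The sketch below assumes Linux on 32-bit ARM with glibc 2.16 or later for `getauxval`; on any other platform it simply reports that it cannot tell. `cpu_has_neon` is a name chosen for this example.

```c
#if defined(__linux__) && defined(__arm__)
#include <sys/auxv.h>    /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>   /* HWCAP_NEON */
#endif

/* Returns 1 if the kernel reports NEON, 0 if it does not, and -1 when
   this sketch has no way to find out on the current platform. */
int cpu_has_neon(void)
{
#if defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return -1;
#endif
}
```

This relies on the kernel's view of the CPU and needs no privileged instructions, so it works from user space.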
On 04/24/12 00:18, Torbjorn Granlund wrote:
On my system, umaal has a latency of 3, whatever dependencies I create.
(There are 4 input regs and 2 output, so there are quite a few
possible dependency combinations; I only tried a subset.)
Either the docs are plain wrong, or there are several
On 04/22/2012 03:06 PM, Torbjorn Granlund wrote:
Richard Henderson r...@twiddle.net writes:
I used the following, almost certainly not appropriate for general
application.
[snip]
Thanks. It would be very useful to make GMP timing work with the kernel
Linux running on ARM. Do you
On 04/23/12 07:49, Torbjorn Granlund wrote:
Do you know the repeat rate of umull, umlal, umaal, assuming no reg
dependencies?
For a8: 3 cycles.
r~
On 04/22/12 13:06, Torbjorn Granlund wrote:
Thanks. It would be very useful to make GMP timing work with the kernel
Linux running on ARM. Do you know if there are similar problems with,
say, NetBSD?
Apparently it's possible to enable the perf counter registers at user level.
See
On 04/23/12 15:32, Torbjorn Granlund wrote:
Richard Henderson r...@twiddle.net writes:
On 04/23/12 07:49, Torbjorn Granlund wrote:
Do you know the repeat rate of umull, umlal, umaal, assuming no reg
dependencies?
For a8: 3 cycles.
For a9 it seems to be 2 cycles, so 3.25 c
On 04/20/2012 11:34 AM, Torbjorn Granlund wrote:
It's a bit touchy speed testing these. There's no cycle counter
available in userspace, and Hz is depressingly low. So I've had
to bump the minimum iterations way way up in order to get semi-
reliable results. Which causes the speed
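Without a user-space cycle counter, one fallback is to time with `clock_gettime`, repeat the call until the measured interval is long compared with the clock resolution, and convert to cycles with a clock frequency supplied by hand. The sketch below shows that approach; it is not the code from GMP's tune/speed program, and `now_seconds`, `time_per_call`, `cycles_per_limb` and `demo_work` are all names made up here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <time.h>

/* Current time in seconds from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}

/* Calls fn(arg) in batches, doubling the batch size until a single batch
   takes at least min_seconds.  Returns the average seconds per call. */
double time_per_call(void (*fn)(void *), void *arg, double min_seconds)
{
    unsigned long reps = 1;
    for (;;) {
        double t0 = now_seconds();
        for (unsigned long i = 0; i < reps; i++)
            fn(arg);
        double dt = now_seconds() - t0;
        if (dt >= min_seconds)
            return dt / (double) reps;
        reps *= 2;
    }
}

/* cycles/limb = seconds * hz / (calls * limbs per call).
   hz is the CPU clock frequency, which the caller has to supply. */
double cycles_per_limb(double seconds, double hz, double calls, double limbs)
{
    return seconds * hz / (calls * limbs);
}

/* Trivial workload for trying the timer: bumps a counter through arg. */
void demo_work(void *arg)
{
    volatile unsigned long *counter = arg;
    (*counter)++;
}
```

The result is only as good as the assumed frequency, and frequency scaling will skew it, which is why a real cycle counter is preferable when one is available.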