David Miller da...@davemloft.net writes:
I use rd %pc here, since moving forward that's what we're going to
use.
I checked in something similar, and also updated the other sparc64 files
to use rd %pc.
I tried to look into doing something like:
L(pc): rd %pc, %g2
swift gmake -k
gcc -m64 -fPIC -c -o test1_shared.o test1.S
/usr/ccs/bin/as: /var/tmp//ccqorjdc.s: , approx line 18: internal error:
pic_relocs(): hh reltype?
gmake: *** [test1_shared.o] Error 1
gcc -m64 -c -o test1_static.o test1.S
gcc -m64 -fPIC -c -o test2_shared.o test2.S
/usr/ccs/bin/as:
David Miller da...@davemloft.net writes:
BTW, you traded one failure for another; now PIC is broken
for ultrasparct3 builds, because now in invert_limb.asm we're
back to:
diff -r bd92f35223f8 mpn/sparc64/ultrasparct3/invert_limb.asm
--- a/mpn/sparc64/ultrasparct3/invert_limb.asm Sun
Torbjorn Granlund t...@gmplib.org writes:
Else, I cannot understand how this could have worked!
Arrg, I didn't actually make a T3 simulated build.
--
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp
David Miller da...@davemloft.net writes:
Some of this stuff doesn't work, for example, from invert_limb.asm:
+ rd %pc, %g3
+ sethi %hi(_GLOBAL_OFFSET_TABLE_+4), %g4
+ add %g4, %lo(_GLOBAL_OFFSET_TABLE_+8), %g4
+ add %g3, %g4, %g4
sethi
David Miller da...@davemloft.net writes:
Since using rdpc avoids the whole issue of corrupting the return
address stack, it seems pretty desirable to move over to it.
Let's do it.
We'll see a slight slowdown for T3, but probably its general slowness
will make this new slowdown almost
Torbjorn Granlund t...@gmplib.org writes:
Torbjorn Granlund t...@gmplib.org writes:
ld: fatal: relocation error: R_SPARC_GOTDATA_OP_LOX10: file
mpn/.libs/gcd_1.o: symbol ctz_table: relocation illegal for TLS symbol
ld: fatal: relocation error: R_SPARC_GOTDATA_OP: file mpn/.libs
David Miller da...@davemloft.net writes:
ld: fatal: relocation error: R_SPARC_GOTDATA_OP_LOX10: file
mpn/.libs/gcd_1.o: symbol ctz_table: relocation illegal for TLS symbol
ld: fatal: relocation error: R_SPARC_GOTDATA_OP: file
mpn/.libs/gcd_1.o: symbol ctz_table: relocation illegal
I'd suggest using the loop below for sparc64. It limits `which' to be
< 2^32 by creating the mask based on a 32-bit comparison. It would be
possible to replace subcc o1,1,o1; subc ... by addcc o1,-1,o1; addxc
... for newer chips, but I think that's no use.
I sincerely apologise for the odd number
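The borrow-based mask creation described above (subcc/subc on a value known to be below 2^32) can be modelled in portable C. This is an illustrative sketch, not the actual sparc64 loop; mask_if_equal is a hypothetical helper name, not a GMP function.

```c
#include <stdint.h>

/* Sketch of the borrow trick: for d < 2^32, computing d - 1 in 64-bit
   arithmetic wraps to all-ones only when d == 0, so the top bit of
   d - 1 tells us whether i == which, without any branch or flag use
   beyond the borrow.  mask_if_equal is a hypothetical name. */
static uint64_t mask_if_equal(uint32_t i, uint32_t which)
{
    uint64_t d = (uint64_t)(i ^ which);  /* 0 iff i == which, always < 2^32 */
    return -((d - 1) >> 63);             /* all-ones iff i == which, else 0 */
}
```

The same borrow idea is what subcc o1,1,o1; subc ... exploits in the asm: the subtract borrows exactly when the value is zero.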
David Miller da...@davemloft.net writes:
It isn't really conditional execution on sparc, the resources and
timing required for the move instruction are constant whether the
condition matches or not.
That's not enough.
It needs to have the same data-dependency behaviour too.
And it
David Miller da...@davemloft.net writes:
I sincerely apologise for the odd number of insns in the loop. :-)
Easily solved by using the pointer trick on 'tp' and making 'i'
instead be 'i * stride'. That'll get us down to 16 instructions.
I'll try to find time to play with this
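A hedged C rendering of the pointer trick being suggested: bias the pointer by the loop bound and run a single negative index up toward zero, so the loop carries one update instead of two. Names here are illustrative, not GMP code; it assumes n >= 1.

```c
#include <stddef.h>
#include <stdint.h>

/* One induction variable instead of two: 'end' is fixed, and the
   negative offset i serves as both array index and loop counter, so
   the per-iteration overhead is a single add plus the branch. */
static uint64_t sum_limbs(const uint64_t *tp, size_t n)
{
    const uint64_t *end = tp + n;
    ptrdiff_t i = -(ptrdiff_t)n;   /* counts from -n up to 0 */
    uint64_t s = 0;
    do {
        s += end[i];
    } while (++i != 0);
    return s;
}
```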
David Miller da...@davemloft.net writes:
From: Torbjorn Granlund t...@gmplib.org
Date: Fri, 12 Apr 2013 10:04:35 +0200
I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
per limb on T4, BTW. On US1-2 it might run at 7/4 c/l and on US3-4 it
again probably
There are syntax errors for swift.nada.kth.se, a Solaris system. See
http://gmplib.org/devel/tm-date.html.
The offending lines:
swift (ABI=64)
99: sethi %gdop_hix22(ctz_table), %i5
swift-32 (ABI=32)
99: sethi %gdop_hix22(.Lnoll), %l0
We need things to work on Solaris, *BSD.
--
I've written a few variants of tabselect using a different table
traversal order. I think of this as horizontal, making the old one
vertical.
An arm neon variant which I think has become nice, thanks to neon's
elegance. It improves the A9 performance by ~100% and the A15
performance by ~30%
Richard Henderson earlier wrote an addmul_8 which runs at an impressive
1.6 c/l (or something thereabout).
Since accumulating in Neon is somewhat tricky, I decided to try
alternatives. I now have a mul_1 loop which performs multiplies using
neon insns and then adds the result using plain old
Please use LEA* instead of LOAD_SYMBOL*, since that's what we use
elsewhere. (OK, LEA might be a misnomer, but a well-established one in
and outside of GMP.)
I assume your broad testing covers every modified file. Do you have an
idea of whether that is true?
When testing shared libs, I have
I think my original tabselect method is not the best, at least not if we
implement it in assembly.
The current method takes one full table vector entry at a time, and needs
to perform two loads and one store per entry in the large table of
vectors.
It seems better to work in the opposite
ni...@lysator.liu.se (Niels Möller) writes:
mp_limb_t rl;
for (rl = 0, i = 0; i < nelems; i++)
rl += table[...] & -(mp_limb_t) (i == k);
rp[...] = rl;
Reduces the number of stores from O(n^2) to O(n), and instead increases
the mask creation from O(n) to O(n^2). Loads
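A self-contained C sketch of this "horizontal" tabselect, filling in the elided indexing with one plausible flat layout (nelems entries of nlimbs limbs each). The function name and layout are assumptions for illustration, not the GMP implementation.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Horizontal traversal: for each limb position j, scan all table
   entries and accumulate only the entry whose index equals k, via a
   branch-free mask.  One store per limb position (O(n) stores), with
   the mask recomputed per (i, j) pair (O(n^2) mask creations). */
static void tabselect(mp_limb_t *rp, const mp_limb_t *table,
                      size_t nelems, size_t nlimbs, size_t k)
{
    for (size_t j = 0; j < nlimbs; j++) {
        mp_limb_t rl = 0;
        for (size_t i = 0; i < nelems; i++)
            rl += table[i * nlimbs + j] & -(mp_limb_t)(i == k);
        rp[j] = rl;
    }
}
```

Only the entry with i == k contributes a nonzero term, so the sum is exactly that entry; no data-dependent branches or loads occur, which is the point for side-channel resistance.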
I assume your broad testing covers every modified file. Do you have an
idea of whether that is true?
I rechecked everything and the one case I missed was supersparc-*
Even the current tree has a build problem of the supersparc target
with current tools due to combination of a
David Miller da...@davemloft.net writes:
I added such code, but since I couldn't find any cpp symbol which is
triggered by -mvis2, I invented a name we need to set ourselves.
CPP will define __VIS__ to 0x200 in that case (and likewise to 0x300
for -mvis3).
Thanks, I missed that one in
David Miller da...@davemloft.net writes:
Attached is a dive_1.asm that works for me on real hardware as
well as T4 timings from:
tune/speed -p1000 -s1-1000 -f1.1 -C mpn_divexact_1.3
This timing is most curious. The cost of inversion computation should
be clearly visible for tiny
David Miller da...@davemloft.net writes:
First mul_1, renamed again, now encoding the load scheduling. Only the
6c variant is new. Please time it. If it doesn't run at 3 c/l, then
there are 2 simple things to try, indicated in a comment.
This gets the expected 3 cycles per limb
ni...@lysator.liu.se (Niels Möller) writes:
I had on the other hand not realised David's ones complement + pre-invert
carry trick.
Not sure I understand what you are referring to here. I haven't been
following the sparc developments very closely (and I don't remember much
of sparc
Richard Henderson r...@twiddle.net writes:
On 2013-04-04 06:51, Niels Möller wrote:
And it's no use to even think of porting the loop mixer to arm without
access to cycle-accurate timing.
Looking around the web it seems that what most folks do is write a
minimal kernel module that
ni...@lysator.liu.se (Niels Möller) writes:
I guess it's lowest numbered first (and lowest memory address).
But a loop with
use r7
ldm up!, {r4,r5,r6,r7}
use r4
looks like poor scheduling between load of r4 and use of it, and the ldm
can't be moved earlier since
ni...@lysator.liu.se (Niels Möller) writes:
1. I guess one can expect submul_1 to always be a bit slower than
addmul_1, since submul_1 needs additional arithmetics besides the
umaal? One could perhaps do some negations on the fly, a - b*C =
-((-a) + b*C), maybe that
ni...@lysator.liu.se (Niels Möller) writes:
For large operands, it's strictly between add_n and addmul_1, which I
guess is as expected. For small sizes, I had a look at the loop setup
for add_n, which checks bit 0 and 1 of n separately. If that's faster,
maybe one could borrow that logic.
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
So it should be doable with the addmul_1 loop and two additional,
non-recurrency `not' instructions per limb, and then maybe some extra
logic for the return value. One could aim for 4.25 c/l, I
David,
First mul_1, renamed again, now encoding the load scheduling. Only the
6c variant is new. Please time it. If it doesn't run at 3 c/l, then
there are 2 simple things to try, indicated in a comment.
sparct34-mul_1-3c.asm
Description: Binary data
sparct34-mul_1-6c.asm
Description:
David Miller da...@davemloft.net writes:
First mul_1, renamed again, now encoding the load scheduling. Only the
6c variant is new. Please time it. If it doesn't run at 3 c/l, then
there are 2 simple things to try, indicated in a comment.
Looks exciting, I'll play around with this
David Miller da...@davemloft.net writes:
Please don't do this, you checked in code that doesn't even compile
again.
Easy to fix. Please pull again.
I was just starting to work on getting the information for you
so this is very disappointing. :-/
Well, bugs happen.
--
Torbjörn
David Miller da...@davemloft.net writes:
I can tell by looking at the commit that it's still broken, can you
please stop jumping the gun and simply be patient enough for me to
test things out?
Since I am wrapping up, I wanted to push things and clean out unfinished
things.
Why is is
Torbjorn Granlund t...@gmplib.org writes:
This version probably overschedules loads, I'll try another variant some
day which fixes that.
Two variants. The 1st is just the previous 3 c/l one, with a bug fix,
and renamed. The 2nd is a version which I hope still runs at 3 c/l
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd
like to try.
That wasn't a clear win... I use addmul_1 and submul_1 as a fallback
(and I always do in-place operation,
David Miller da...@davemloft.net writes:
See attached, looks like mul1b isn't able to reach 3 c/l like mul1a can.
overhead 6.00 cycles, precision 1000 units of 3.51e-10 secs, CPU freq
2847.41 MHz
Darn. Is the load latency 3 cycles?
The old code had a load-use schedule of 8 cycles,
David Miller da...@davemloft.net writes:
I started playing with these, and one problem is that the
addxc/addxccc instructions do not accept an immediate field. They
only accept rs1, rs2, rd arguments. Please update your compat macros
to catch this.
Oops. Missed that.
With this
David Miller da...@davemloft.net writes:
Attached are the output of:
tune/speed -p1000 -s1-1000 -f1.1 -C mpn_mul_2.3
3.25 c/l, not 3 c/l as I had hoped.
tune/speed -p1000 -s1-1000 -f1.1 -C mpn_addmul_2.3
3.75 c/l, not 3.5 c/l as I had hoped...
I will accept this, since
David Miller da...@davemloft.net writes:
Yes, understood. We have to transpose a few of the shifts with
their neighbouring arithmetic ops in this loop to make it optimal
for Ultra-I/II/IIi
I found a powered-up US2 and ran timing tests. No slowdown there
for the new generic
Plain, non-pipelined version of bdiv_dbm1c.asm, mod_1_4.asm, mode1o.asm,
dive_1.asm, invert_limb.asm.
I wrote this with help of gcc, having first told longlong.h about
umulxhi and addxc. Then I hand-optimised the result to varying degree.
In no case did I software pipeline the loops, so these
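For readers without longlong.h at hand, here is roughly what umulxhi contributes: the high 64 bits of a 64x64-bit product. A portable stand-in using gcc/clang's unsigned __int128 (the helper name is illustrative, not the longlong.h macro):

```c
#include <stdint.h>

/* 64x64 -> 128-bit multiply split into high and low limbs.  On T3/T4
   the asm would use mulx for the low half and umulxhi for the high
   half; __int128 is a portable gcc/clang substitute for illustration. */
static void umul_hi_lo(uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
{
    unsigned __int128 p = (unsigned __int128)u * v;
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
}
```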
David Miller da...@davemloft.net writes:
The code uses lzcnt, which I hope is implemented in T3 and T4. I added
it to the missing.m4 file, so that I could test the code on my old
sparcs.
Just be forewarned, lzcnt is very slow, as slow as popc.
I use both. I use lzcnt in
For the most critical functions, i.e., mul_1, addmul_1, submul_1, mul_2,
and addmul_2, we should not stick to 2-way unrolling.
I played with a 4-way unrolled mul_1, but not using your multi-pointer
trick, meaning that we will spend two cycles instead of one cycle for
bookkeeping.
Our current is
David Miller da...@davemloft.net writes:
I'm going to play around with some things to try and fix this.
Interestingly, UltraSPARC-1 and UltraSPARC-2 would not group the
final cycle of the loop this way, because of its requirement that
integer operations must occur in the first three
David Miller da...@davemloft.net writes:
For values of N = 1 we would expect 1 cycle per iteration. But
that's not exactly what happens.
N  cycles
=========
1    2
2    3
3    4
4    5
5    6
6
David Miller da...@davemloft.net writes:
So here is a working generic 64-bit sparc lshift.asm that seems
to work well on all chips.
I'm now going to iterate over lshiftc and rshift.
I whacked off some code at the end, and generalised the resulting code
to become a lorrshift.asm. Some
David Miller da...@davemloft.net writes:
Looks good, and here is my lshiftc:
add up, -8, up
srlx u1, tcnt, %l4
andn %l3, %l4, r0
stx r0, [rp + 0]
bnz,pt %xcc, L(loop0)
sllx u1, cnt, %l3
I'd claim that branch is taken with
David Miller da...@davemloft.net writes:
And I want to take this opportunity to mention that 'try' is very
non-useful at times for testing newly coded routines.
I don't use it for that.
It seems to perform several unrelated mpn calculations during
initialization before it does the
David Miller da...@davemloft.net writes:
From: David Miller da...@davemloft.net
Date: Fri, 29 Mar 2013 22:14:05 -0400 (EDT)
Great. Let's sort out the strange hang behavior I get with your code.
I think it's rshift. I actually happened to be working on rshift when
you sent
David Miller da...@davemloft.net writes:
The TODO list grows, it never seems to shrink :-)
Oh no, it shrinks. But I only show you the first few lines of it. :-)
Wrt. scheduling mulx/umulxhi, I think that to a certain extent the
out-of-order completion unit in the backend of the
There will be many spurious failures reported by the automated GMP
testing system in the next days. This is caused by construction work.
http://gmplib.org/devel/tm-date.html
The goal is to test both static and dynamic builds for every config.
We're getting rid of cron initiated testing too, in
David Miller da...@davemloft.net writes:
These give a modest speedup compared to the T1 routines.
I also added missing T3 timings to existing code.
The first thing to try then is finding code that runs well on both.
There is a cost in having more variants than we need.
Also, I worked on
David Miller da...@davemloft.net writes:
L(top):
or %g4, %g1, %l1
sllx %g2, cnt, %g1
srlx %g2, tcnt, %g4
ldx [up - 8], %g2
stx %l1, [rp - 8]
or %g3, %l2, %l7
sllx %g5, cnt, %l2
David Miller da...@davemloft.net writes:
These give a modest speedup compared to the T1 routines.
I also added missing T3 timings to existing code.
The first thing to try then is finding code that runs well on both.
There is a cost in having more variants than we need.
David Miller da...@davemloft.net writes:
Technically we could use this on some chips we don't distinguish on
a fine enough granularity yet. For example we can assume popc is
available on T2 as well as UltraSPARC-IV.
But for now, just T3 and later.
I suppose we should mention this
Vincent Lefevre vinc...@vinc17.net writes:
I haven't thought a lot about this, but it is not clear that -1 + eps
should be considered to fit an unsigned type.
Why?
We need to decide how to define the edge conditions.
We could either see it as "can be assigned to the type without havoc" or
Vincent Lefevre vinc...@vinc17.net writes:
What is the behavior for MAXIMUM + eps (both for signed and unsigned
types)?
That's indeed something we (GMP and MPFR) should worry about. Whatever
we decide, we should handle the upper and lower boundaries analogously!
--
Torbjörn
Zimmermann Paul paul.zimmerm...@inria.fr writes:
indeed this is inconsistent. mpf_fits_uint_p(-0.5) should return true, as well
as mpf_fits_uint_p(-0.999).
I am not so sure that would be the right fix here.
--
Torbjörn
bodr...@mail.dm.unipi.it writes:
long vectors of bignums... do you mean that it might be used for the
point-wise multiplication in Schönhage–Strassen?
No. Surely GPUs could be used for individual huge multiplies, but
that's not going to benefit a lot of GMP applications.
I have
Emmanuel Thomé emmanuel.th...@gmail.com writes:
Hi,
On Mon, Mar 11, 2013 at 9:53 PM, Torbjorn Granlund t...@gmplib.org wrote:
I have 'mpz_vec_t', 'mpf_vec_t' in mind, which have some number of mpz_t
elements, each probably (padded to) the same size counted in limbs
Richard Henderson r...@twiddle.net writes:
Building on the copyi that tege committed the other day, use neon for
the logical operations too.
I did both a 128-bit aligned version,
$ ./speed-128 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n
mpn_nand_n
clock_gettime
David Miller da...@davemloft.net writes:
From: Torbjorn Granlund t...@gmplib.org
Date: Thu, 07 Mar 2013 20:58:51 +0100
I'm reasonably sure this is correct.
Needs some work still:
It was a one character bug in the non-emulation stuff. This is also in
the smaller checked
David Miller da...@davemloft.net writes:
Seems to work fine, here are some speed runs:
davem@patience:~/src/GMP/HG/build-sparc64-ultrasparct4/tune$ ./speed -C -s
32-64 -t 2 mpn_mul_2
overhead 6.06 cycles, precision 1 units of 3.51e-10 secs, CPU freq
2847.34 MHz
Here's a patch that reorders the arguments for mpn_addcnd_n and
mpn_subcnd_n (I think it's best to keep this change separate from the
renaming, since the potential problems are quite different).
It's tested on x86_64, arm, and with --disable-assembly. I've run a
regular make check and
I wrote 4-way unrolled mul_2 and addmul_2 for T3/T4.
The FAKE_T3 stuff includes missing.m4, which implements some
instructions missing from my old systems around here. I might retain
that stuff for a while to allow local regression testing, even if it is
a bit ugly.
Could you please run time
romes p romes_12...@yahoo.com writes:
Hello developers
I noticed that there is also a CUMP site
http://www.hpcs.cs.tsukuba.ac.jp/~nakayama/cump/
Sheesh, the guy has copied and edited the GMP webpages and now claims
the default all rights reserved with himself as owner. Not a serious
I only now spotted FPMADDXHI and FPMADDX. No Sun/Oracle SPARC has been
a floating-point demon, and these integer multiply instructions are
performed in the fpu.
Multiply-accumulate instructions are tricky, since one may easily put
the accumulation on a carry recurrency path, and thereby kill
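To make the recurrency concrete, here is a reference-style addmul_1 in C (an illustrative model, not GMP's implementation; it uses gcc/clang's __int128):

```c
#include <stddef.h>
#include <stdint.h>

/* rp[] += up[] * v, returning the carry-out limb.  The variable 'cy'
   is the carry recurrency in question: each iteration's carry-out
   feeds the next iteration's sum, so any multiply-accumulate whose
   carry lands on this chain lengthens the loop's critical path. */
static uint64_t addmul_1_ref(uint64_t *rp, const uint64_t *up,
                             size_t n, uint64_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 acc = (unsigned __int128)up[i] * v + rp[i] + cy;
        rp[i] = (uint64_t)acc;
        cy = (uint64_t)(acc >> 64);
    }
    return cy;
}
```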
I think all your T3/T4 changes are now in. Please check that I didn't
mess something up.
Thanks for this contribution!
--
Torbjörn
David Miller da...@davemloft.net writes:
From: Torbjorn Granlund t...@gmplib.org
Date: Wed, 06 Mar 2013 00:08:09 +0100
The addmul code could be similarly improved.
Grumble... and I did this work already, I sent older versions
of my T3/T4 changes, let me go see how I screwed
David Miller da...@davemloft.net writes:
I optimised submul_1.asm, and then edited both addmul_1 and submul_1 to
use as similar operand order as possible. Please test these using
tests/devel/try, and please time this new submul_1.
The testsuite starts failing very early with these
David Miller da...@davemloft.net writes:
From: David Miller da...@davemloft.net
Date: Thu, 07 Mar 2013 01:06:55 -0500 (EST)
I'll test your routines with the obvious fix in a moment.
With the one-liner fix both of your new implementations work.
Thanks for testing!
submul_1 is
ni...@lysator.liu.se (Niels Möller) writes:
Actually, I think that's incorrect.
Everyone has some *familiarity* with the C preprocessor, which surely is
an advantage. And maybe most C programmers think they understand
it. But in my experience, very few understand the fine details
ni...@lysator.liu.se (Niels Möller) writes:
That would certainly cause some additional confusion. Any suggestion for
appropriate m4 quote characters to use? ;-)
I think one should be kind and use [ and ]. The resulting C dialect,
where indexing would be written arr[[i]] is not too bad...
David Miller da...@davemloft.net writes:
* mpn/sparc32/ultrasparct1/mul_1.asm (mpn_mul_1): Unroll main loop
one time, align code on 32-byte boundary, add T2/T3/T4 timings.
* mpn/sparc32/ultrasparct1/addmul_1.asm (mpn_addmul_1): Likewise.
*
David Miller da...@davemloft.net writes:
This is a resubmit of the work I did 2 months ago now
that my FSF assignment has finally been completed.
Just the simple stuff, use of mulx/umulx/addxccc and 1
level of loop unrolling.
We now got patches 1/3 and 3/3. Is there a 2/3 too?
--
David Miller da...@davemloft.net writes:
* mpn/sparc64/ultrasparct3/add_n.asm: New file.
* mpn/sparc64/ultrasparct3/sub_n.asm: New file.
There is currently no mpn/sparc64/ultrasparct3, only ultrasparct1. For
which CPUs are these new add_n/sub_n intended? Why not also for
Richard Henderson r...@twiddle.net writes:
For T3 and T4. This file makes use of new instructions: addxc(cc).
Thanks.
Honestly, why they didn't have a proper 64-bit add-with-carry insn right
from the very first v9 cpu is a mystery.
The SPARC cpu is so full of design mistakes that I am not
Richard Henderson r...@twiddle.net writes:
One extra add insn here (copy-paste from addmul)?
addcc %o5, %g3, %g3
addxccc %g2, %g1, %g1
addxc %g0, %o4, %o5
Since I cannot test this at all (qemu-system-sparc64 persistently resists
all my usage attempts) I need you
David Miller da...@davemloft.net writes:
From: Torbjorn Granlund t...@gmplib.org
Date: Tue, 05 Mar 2013 21:35:19 +0100
Richard Henderson r...@twiddle.net writes:
One extra add insn here (copy-paste from addmul)?
addcc %o5, %g3, %g3
addxccc %g2, %g1, %g1
David Miller da...@davemloft.net writes:
The versions I posted passed all of the tests.
What does "all of the tests" mean?
I insist that you run tests/devel/try. Please send me the output of
the command I asked you to run.
Running GMP's test suite is *not* adequate for testing new assembly
David Miller da...@davemloft.net writes:
diff --git a/mpn/sparc64/ultrasparct3/mul_1.asm
b/mpn/sparc64/ultrasparct3/mul_1.asm
index df52647..6a3f193 100644
--- a/mpn/sparc64/ultrasparct3/mul_1.asm
+++ b/mpn/sparc64/ultrasparct3/mul_1.asm
@@ -50,8 +50,7 @@ L(top):
umulxhi %o4,
David Miller da...@davemloft.net writes:
From: Torbjorn Granlund t...@gmplib.org
Date: Tue, 05 Mar 2013 23:27:45 +0100
David Miller da...@davemloft.net writes:
diff --git a/mpn/sparc64/ultrasparct3/mul_1.asm
b/mpn/sparc64/ultrasparct3/mul_1.asm
index df52647..6a3f193
David Miller da...@davemloft.net writes:
This is a respin of patch #2 from last night, it incorporates all of
the improvements either explicitly or implicitly suggested :-)
Torbjorn, I'm leaving out the configure regeneration from the patch,
so that the patch is not so large, since I'm
ni...@lysator.liu.se (Niels Möller) writes:
Does anyone remember why it was deleted back then? I think it makes a
lot of sense as a public mpn function.
Checked old mail, but it is only mentioned 7 months earlier, when Kevin
added it. It does not make sense as an internal function, the
Richard Henderson r...@twiddle.net writes:
This does not adjust the public interface at all, or tidy the
internal namespace at all. What it does do is annotate the source
(in as few places as possible) so that we automatically create and
use the hidden internal aliases inside the
Did you use gmp-func-list.txt for determining which functions to make
public?
--
Torbjörn
Richard Henderson r...@twiddle.net writes:
No, I used the existing gmp-h.in file, as I mentioned elsewhere.
Note that all symbols that are visible today are still visible with the patch.
I'm not really cleaning up the set of exported symbols. Just making sure
that
gmp itself
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
And this. (I think I'd prefer
mp_limb_t
mpn_cnd_add_n (mp_limb_t cnd, mp_ptr rp, mp_srcptr ap, mp_srcptr bp,
mp_size_t n)
but that's a minor detail, and view the cnd_
ni...@lysator.liu.se (Niels Möller) writes:
As I understand it, that plan would imply that for assembly files
currently providing both _n and _nc, the _n entry point gets obsolete
That was not the idea, at least not for internal calls. It sometimes
has a cycle or two of overhead.
It does
ni...@lysator.liu.se (Niels Möller) writes:
I have a general ARM question, that maybe Richard or someone else on the
list can answer.
From the ABI documentation I've read, register r9 is in some way
reserved for implementation of things like thread local storage. If I
write a leaf
Richard Henderson r...@twiddle.net writes:
Excellent. That's more or less exactly what I want to do.
That would be another welcome contribution!
I believe that IFUNC and the fallback fat system can live side-by side,
sharing most of the actual logic. The choice of which implementation
Richard Henderson r...@twiddle.net writes:
Several times over the past week as I debug my neon routines, it has
become painfully apparent (as I accidentally single-step into the
dynamic linker) that the shared libgmp could use some help in
modernizing its internal linkage.
We are at
ni...@lysator.liu.se (Niels Möller) writes:
What about vldm? Like
vldm up!, {q0,q1,q2,q3}
As far as I understand the manual, it supports a larger number of
registers. The registers must be consecutive, but that's no problem
here.
I added a long list of things to try.
ni...@lysator.liu.se (Niels Möller) writes:
I think it make sense with some level of name mangling from API symbols
to linker names. First, it's good practice to use a single prefix for
all linker symbols, while it's nice to use multiple prefixes for API
symbols (mpz_*, mpn_*, gmp_*,
I try to classify things on my list a bit further, and correct errors.
I stumbled over some functions. I need feedback here.
mpn_div_qr_2 is decl but not doc.
I suggest we move the declaration to gmp-impl.h.
mpn_divrem_2 is decl but not doc.
I suggest we move the declaration to gmp-impl.h.
I decided to play a bit with Neon, but instead of doing something hard
like addmul_k, I wrote an mpn_popcount. :-)
The code runs well for A15 at about 0.56 c/l, but much worse on A9 at
about 2.8 c/l. (The inner loop's hard whacking on q8 is a problem on A9;
using q8 and q9 alternately shaves
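The alternating-accumulator idea mentioned above can be shown in scalar C: a single accumulator serializes every add into one dependency chain, while two independent accumulators let adjacent iterations overlap, merged once after the loop. This is a sketch of the scheduling idea only, not the Neon code (it uses gcc/clang's __builtin_popcountll).

```c
#include <stddef.h>
#include <stdint.h>

/* Two-accumulator popcount: acc0 and acc1 form independent add
   chains, so consecutive iterations need not wait on each other's
   accumulate.  Handles an odd trailing word separately. */
static uint64_t popcount_2acc(const uint64_t *p, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0;
    size_t i;
    for (i = 0; i + 2 <= n; i += 2) {
        acc0 += (uint64_t)__builtin_popcountll(p[i]);
        acc1 += (uint64_t)__builtin_popcountll(p[i + 1]);
    }
    if (i < n)
        acc0 += (uint64_t)__builtin_popcountll(p[i]);
    return acc0 + acc1;
}
```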
ni...@lysator.liu.se (Niels Möller) writes:
Hmm, I tried changing all output registers to unique registers (only
written once in the loop, never ever read (except as vmlal reads the
output register before accumulating to it). Do you mean that I need to
change the *input* registers of all
ni...@lysator.liu.se (Niels Möller) writes:
I'm attaching the functions I've been testing, in case anyone else would
like to play with them.
May I innocently ask if the functions have survived the prescribed
testing (tests/devel/addmul_N.c and/or tests/devel/try.c)? ;-)
--
Torbjörn
Richard Henderson r...@twiddle.net writes:
Perhaps I got the methodology wrong here, but it sure appears as if vmlal does
not require the addend input until the 4th cycle, producing full output on the
5th. This seems to be the easiest way to hide a lot of output latency.
I measured a
Richard Henderson r...@twiddle.net writes:
gcc -O2 -g3 [...] addmul_N.c -DN=8 -DCLOCK=169400
$ ./t.out
mpn_addmul_8: 2845ms (1.782 cycles/limb) [973.59 Gb/s]
mpn_addmul_8: 2620ms (1.641 cycles/limb) [1057.20 Gb/s]
mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19 Gb/s]
Richard Henderson r...@twiddle.net writes:
Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time
with larger unrolling to make full use of the vector load insns, and less
over-prefetching.
Good improvement!
Keep in mind that addmul_ will be used for smallish count
Richard Henderson r...@twiddle.net writes:
On 2013-02-23 06:06, Niels Möller wrote:
Not sure what the bottlenecks of your loop are though; instruction
decoding, load/store, or the recurrency chain (but at least it shouldn't
be multiplier throughput, right?).
Yeah, neither am I. I