Re: [PATCH] Fix 64-bit T3 invert_limb.asm on PIC again.

2013-04-15 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: I use rd %pc here, since moving forward that's what we're going to use. I checked in something similar, and also updated the other sparc64 files to use rd %pc. I tried to look into doing something like: L(pc):rd %pc, %g2

Re: [PATCH] Improve and consolidate sparc PIC assembler.

2013-04-15 Thread Torbjorn Granlund
swift gmake -k gcc -m64 -fPIC -c -o test1_shared.o test1.S /usr/ccs/bin/as: /var/tmp//ccqorjdc.s: , approx line 18: internal error: pic_relocs(): hh reltype? gmake: *** [test1_shared.o] Error 1 gcc -m64 -c -o test1_static.o test1.S gcc -m64 -fPIC -c -o test2_shared.o test2.S /usr/ccs/bin/as:

Re: [PATCH] Improve and consolidate sparc PIC assembler.

2013-04-15 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: BTW, you traded one failure for another, now PIC is broke for ultrasparct3 builds, because now in invert_limb.asm we're back to: diff -r bd92f35223f8 mpn/sparc64/ultrasparct3/invert_limb.asm --- a/mpn/sparc64/ultrasparct3/invert_limb.asm Sun

Re: [PATCH] Fix 64-bit T3 invert_limb.asm on PIC again.

2013-04-15 Thread Torbjorn Granlund
Torbjorn Granlund t...@gmplib.org writes: Else, I cannot understand how this could have worked! Arrg, I didn't actually make a T3 simulated build. -- Torbjörn ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp

Re: [PATCH] Fix 64-bit T3 invert_limb.asm on PIC again.

2013-04-15 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Some of this stuff doesn't work, for example, from invert_limb.asm: + rd %pc, %g3 + sethi %hi(_GLOBAL_OFFSET_TABLE_+4), %g4 + add %g4, %lo(_GLOBAL_OFFSET_TABLE_+8), %g4 + add %g3, %g4, %g4 sethi

Re: [PATCH] Improve and consolidate sparc PIC assembler.

2013-04-14 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Since using rdpc avoids the whole issue of corrupting the return address stack, it seems pretty desirable to move over to it. Let's do it. We'll see a slight slowdown for T3, but probably its general slowness will make this new slowdown almost

Re: [PATCH] Improve and consolidate sparc PIC assembler.

2013-04-13 Thread Torbjorn Granlund
Torbjorn Granlund t...@gmplib.org writes: Torbjorn Granlund t...@gmplib.org writes: ld: fatal: relocation error: R_SPARC_GOTDATA_OP_LOX10: file mpn/.libs/gcd_1.o: symbol ctz_table: relocation illegal for TLS symbol ld: fatal: relocation error: R_SPARC_GOTDATA_OP: file mpn/.libs

Re: [PATCH] Improve and consolidate sparc PIC assembler.

2013-04-13 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: ld: fatal: relocation error: R_SPARC_GOTDATA_OP_LOX10: file mpn/.libs/gcd_1.o: symbol ctz_table: relocation illegal for TLS symbol ld: fatal: relocation error: R_SPARC_GOTDATA_OP: file mpn/.libs/gcd_1.o: symbol ctz_table: relocation illegal

Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
I'd suggest to use the loop below for sparc64. It limits `which' to be 2^32 by creating the mask based on 32-bit comparison. It would be possible to replace subcc o1,1,o1; subc ... by addcc o1,-1,o1; addxc ... for newer chips, but I think that's no use. I sincerely apologise for the odd number

Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: It isn't really conditional execution on sparc, the resources and timing required for the move instruction are constant whether the condition matches or not. That's not enough. It needs to have the same data-dependency behaviour too. And it

Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: I sincerely apologise for the odd number of insns in the loop. :-) Easily solved by using the pointer trick on 'tp' and making 'i' instead be 'i * stride'. That'll get us down to 16 instructions. I'll try to find time to play with this

Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: Torbjorn Granlund t...@gmplib.org Date: Fri, 12 Apr 2013 10:04:35 +0200 I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles per limb on T4, BTW. On US1-2 it might run at 7/4 c/l and on US3-4 it again probably

Re: [PATCH] Improve and consolidate sparc PIC assembler.

2013-04-11 Thread Torbjorn Granlund
There are syntax errors for swift.nada.kth.se, a Solaris system. See http://gmplib.org/devel/tm-date.html. The offending lines: swift (ABI=64) 99: sethi %gdop_hix22(ctz_table), %i5 swift-32 (ABI=32) 99: sethi %gdop_hix22(.Lnoll), %l0 We need things to work on Solaris, *BSD. --

Re: Better tabselect

2013-04-11 Thread Torbjorn Granlund
I've written a few variants of tabselect using a different table traversal order. I think of this as horisontal, making the old one vertical. An arm neon variant which I think has become nice, thanks to neon's elegance. It improves the A9 performance by ~100% and the A15 performance by ~30%

ARM Neon multiplication

2013-04-10 Thread Torbjorn Granlund
Richard Henderson earlier wrote an addmul_8 which runs at an impressive 1.6 c/l (or something thereabout). Since accumulating in Neon is somewhat tricky, I decided to try alternatives. I now have a mul_1 loop which performs multiplies using neon insns and then adds the result using plain old

Re: [PATCH] Improve and consolidate sparc PIC assembler.

2013-04-10 Thread Torbjorn Granlund
Please use LEA* instead of LOAD_SYMBOL*, since that's what we use elsewhere. (OK, LEA might be a misnomer, but a well-established one in and outside of GMP.) I assume your broad testing covers every modified file. Do you have an idea of whether that is true? When testing shared libs, I have

Better tabselect

2013-04-10 Thread Torbjorn Granlund
I think my original tabselect method is not the best, at least not if we implement it in assembly. The current method takes one full table vector entry at a time, and needs to perform two loads and one store per entry in the large table of vectors. It seems better to work in the opposite

Re: Better tabselect

2013-04-10 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: mp_limb_t rl; for (rl = 0, i = 0; i < nelems; i++) rl += table[...] & -(mp_limb_t) (i == k); rp[...] = rl; Reduces the number of stores from O(n^2) to O(n), and instead increases the mask creation from O(n) to O(n^2). Loads
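The loop quoted above can be fleshed out as a minimal sketch in plain C. The names `limb_t` and `tabselect_sketch`, and the flat `tab[i * n + j]` table layout, are assumptions made for illustration and are not taken from the GMP sources. Every table entry is read for every output limb, and each result limb is stored once. Whether a compiler turns `i == which` into branch-free code is not guaranteed, which is one reason the thread discusses assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* Copy entry `which` (n limbs) out of a table of `nents` entries into rp.
   All entries are read regardless of `which`; selection is done by masking. */
void tabselect_sketch(limb_t *rp, const limb_t *tab,
                      size_t n, size_t nents, size_t which)
{
    assert(which < nents);
    for (size_t j = 0; j < n; j++) {
        limb_t rl = 0;
        for (size_t i = 0; i < nents; i++) {
            /* mask is all ones when i == which, else all zeros */
            limb_t mask = -(limb_t)(i == which);
            rl |= tab[i * n + j] & mask;
        }
        rp[j] = rl;        /* one store per result limb */
    }
}
```

Since at most one mask is non-zero, accumulating with `|=` gives the same result as the `+=` in the quoted loop.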

Re: [PATCH] Improve and consolidate sparc PIC assembler.

2013-04-10 Thread Torbjorn Granlund
I assume your broad testing covers every modified file. Do you have an idea of whether that is true? I rechecked everything and the one case I missed was supersparc-* Even the current tree has a build problem of the supersparc target with current tools due to combination of a

Re: longlong.h and cpu type vectoring...

2013-04-09 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: I added such code, but since I couldn't find any cpp symbol which is triggered by -mvis2, I invented a name we need to set ourselves. CPP will define __VIS__ = 0x200 in that case (and likewise = 0x300 for -mvis3). Thanks, I missed that one in
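As a small sketch of how the `__VIS__` values mentioned above (0x200 for -mvis2, 0x300 for -mvis3) could be tested from C: the helper names `vis_level` and `have_vis3` are made up for this example and are not what longlong.h uses.

```c
/* Returns the value of __VIS__ when the compiler defines it, else 0.
   On non-SPARC targets, or without any -mvis* option, this is 0. */
int vis_level(void)
{
#if defined(__VIS__)
    return __VIS__;
#else
    return 0;
#endif
}

/* Non-zero when VIS3-level instructions may be assumed (__VIS__ >= 0x300). */
int have_vis3(void)
{
    return vis_level() >= 0x300;
}
```

In a header the same test would more likely be written directly as `#if defined(__VIS__) && __VIS__ >= 0x300`.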

Re: Some secondary asm T3,T4,T5 functions

2013-04-04 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Attached is a dive_1.asm that works for me on real hardware as well as T4 timings from: tune/speed -p1000 -s1-1000 -f1.1 -C mpn_divexact_1.3 This timing is most curious. The cost of inversion computation should be clearly visible for tiny

Re: New T3/T4 code batch

2013-04-04 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: First mul_1, renamed again, now encoding the load scheduling. Only the 6c variant is new. Please time it. If it doesn't run at 3 c/l, then there are 2 simple things to try, indicated in a comment. This gets the expected 3 cycles per limb

Re: ARM public key benchmark

2013-04-04 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I had on the other hand not realised David's ones complement + pre-invert carry trick. Not sure I understand what you are referring to here. I haven't been following the sparc developments very closely (and I don't remember much of sparc

Re: ARM public key benchmark

2013-04-04 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: On 2013-04-04 06:51, Niels Möller wrote: And it's no use to even think of porting the loop mixer to arm without access to cycle-accurate timing. Looking around the web it seems that what most folks do is write a minimal kernel module that

Re: ARM public key benchmark

2013-04-04 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I guess it's lowest numbered first (and lowest memory address). But a loop with use r7 ldm up!, {r4,r5,r6,r7} use r4 looks like poor scheduling between load of r4 and use of it, and the ldm can't be moved earlier since

Re: ARM public key benchmark

2013-04-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: 1. I guess one can expect submul_1 to always be a bit slower than addmul_1, since submul_1 needs additional arithmetics besides the umaal? One could perhaps do some negations on the fly, a - b*C = -((-a) + b*C), maybe that

Re: ARM public key benchmark

2013-04-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: For large operands, it's strictly between add_n and addmul_1, which I guess is as expected. For small sizes, I had a look at the loop setup for add_n, which checks bit 0 and 1 of n separately. If that's faster, maybe one could borrow that logic.

Re: ARM public key benchmark

2013-04-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: So it should be doable with the addmul_1 loop and two additional, non-recurrency, not instructions per limb, and then maybe some extra logic for the return value. One could aim for 4.25 c/l, I
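One reading of the "two not instructions per limb" idea, sketched in portable C with 32-bit limbs: complement each limb of the destination, run an ordinary addmul step, and complement the result. Because ~x = B^n - 1 - x, the stored limbs equal r - u*v modulo B^n and the carry limb of the addmul is exactly the borrow. The function name and limb type are assumptions for illustration; this is not the ARM code under discussion.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb */

/* {rp,n} -= {up,n} * v, returning the borrow-out limb.
   Computed as ~(~r + u*v): two complements wrapped around an addmul_1 loop. */
limb_t submul_1_via_not(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so t cannot overflow */
        uint64_t t = (uint64_t)up[i] * v + (limb_t)~rp[i] + cy;
        rp[i] = ~(limb_t)t;          /* second complement, off the carry chain */
        cy = (limb_t)(t >> 32);      /* carry recurrence is that of addmul_1 */
    }
    return cy;                       /* addmul carry == submul borrow */
}
```

Both complements sit outside the carry recurrence, which matches the "non-recurrency" remark in the snippet above.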

New T3/T4 code batch

2013-04-03 Thread Torbjorn Granlund
David, First mul_1, renamed again, now encoding the load scheduling. Only the 6c variant is new. Please time it. If it doesn't run at 3 c/l, then there are 2 simple things to try, indicated in a comment. sparct34-mul_1-3c.asm Description: Binary data sparct34-mul_1-6c.asm Description:

Re: New T3/T4 code batch

2013-04-03 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: First mul_1, renamed again, now encoding the load scheduling. Only the 6c variant is new. Please time it. If it doesn't run at 3 c/l, then there are 2 simple things to try, indicated in a comment. Looks exciting, I'll play around with this

Re: New T3/T4 code batch

2013-04-03 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Please don't do this, you checked in code that doesn't even compile again. Easy to fix. Please pull again. I was just starting to work on getting the information for you so this is very disappointing. :-/ Well, bugs happen. -- Torbjörn

Re: New T3/T4 code batch

2013-04-03 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: I can tell by looking at the commit that it's still broken, can you please stop jumping the gun and simply be patient enough for me to test things out? Since I am wrapping up, I wanted to push things and clean out unfinished things. Why is is

Re: Possible new T3-T5 mul_1

2013-04-02 Thread Torbjorn Granlund
Torbjorn Granlund t...@gmplib.org writes: This version probably overschedules loads, I'll try another variant some day which fixes that. Two variants. The 1st is just the previous 3 c/l one, with a bug fix, and renamed. The 2nd is a version which I hope still runs at 3 c/l

Re: ARM public key benchmark

2013-04-02 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd like to try. That wasn't a clear win... I use addmul_1 and submul_1 as a fallback (and I always do in-place operation,

Re: Possible new T3-T5 mul_1

2013-04-02 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: See attached, looks like mul1b isn't able to reach 3 c/l like mul1a can. overhead 6.00 cycles, precision 1000 units of 3.51e-10 secs, CPU freq 2847.41 MHz Darn. Is the load latency 3 cycles? The old code had a load-use schedule of 8 cycles,

Re: Some secondary asm T3,T4,T5 functions

2013-04-02 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: I started playing with these, and one problem is that the addxc/addxccc instructions do not accept an immediate field. They only accept rs1, rs2, rd arguments. Please update your compat macros to catch this. Oops. Missed that. With this

Re: Possible new T3-T5 mul_1

2013-04-02 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Attached are the output of: tune/speed -p1000 -s1-1000 -f1.1 -C mpn_mul_2.3 3.25 c/l, not 3 c/l as I had hoped. tune/speed -p1000 -s1-1000 -f1.1 -C mpn_addmul_2.3 3.75 c/l, not 3.5 c/l as I had hoped... I will accept this, since

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-04-01 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Yes, understood. We have to transpose a few of the shifts with their neighbouring arithmetic ops in this loop to make it optimal for Ultra-I/II/IIi I found a powered up US2 and ran the timing tests. No slowdown there for the new generic

Some secondary asm T3,T4,T5 functions

2013-04-01 Thread Torbjorn Granlund
Plain, non-pipelined version of bdiv_dbm1c.asm, mod_1_4.asm, mode1o.asm, dive_1.asm, invert_limb.asm. I wrote this with help of gcc, having first told longlong.h about umulxhi and addxc. Then I hand-optimised the result to varying degree. In no case did I software pipeline the loops, so these

Re: Some secondary asm T3,T4,T5 functions

2013-04-01 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: The code uses lzcnt, which I hope is implemented in T3 and T4. I added it to the missing.m4 file, so that I could test the code on my old sparcs. Just be forewarned, lzcnt is very slow, as slow as popc. I use both. I use lzcnt in

Possible new T3-T5 mul_1

2013-04-01 Thread Torbjorn Granlund
For the most critical functions, i.e., mul_1, addmul_1, submul_1, mul_2, and addmul_2, we should not stick to 2-way unrolling. I played with a 4-way unrolled mul_1, but not using your multi-pointer trick, meaning that we will spend two cycles instead of one cycle for bookkeeping. Our current is

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-31 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: I'm going to play around with some things to try and fix this. Interestingly, UltraSPARC-1 and UltraSPARC-2 would not group the final cycle of the loop this way, because of its requirement that integer operations must occur in the first three

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-31 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: For values of N = 1 we would expect 1 cycle per iteration. But that's not exactly what happens. N cycles == 1 2 2 3 3 4 4 5 5 6 6

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-29 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: So here is a working generic 64-bit sparc lshift.asm that seems to work well on all chips. I'm now going to iterate over lshiftc and rshift. I whacked off some code at the end, and generalised the resulting code to become a lorrshift.asm. Some

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-29 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Looks good, and here is my lshiftc: add up, -8, up srlx u1, tcnt, %l4 andn %l3, %l4, r0 stx r0, [rp + 0] bnz,pt %xcc, L(loop0) sllx u1, cnt, %l3 I'd claim that branch is taken with

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-29 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: And I want to take this opportunity to mention that 'try' is very non-useful at times for testing newly coded routines. I don't use it for that. It seems to perform several unrelated mpn calculations during initialization before it does the

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-29 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: David Miller da...@davemloft.net Date: Fri, 29 Mar 2013 22:14:05 -0400 (EDT) Great. Let's sort out the strange hang behavior I get with your code. I think it's rshift. I actually happened to be working on rshift when you sent

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-28 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: The TODO list grows, it never seems to shrink :-) Oh no, it shrinks. But I only show you the first few lines of it. :-) Wrt. scheduling mulx/umulxhi, I think to a certain extent the out-of-order completion unit in the backend of the

GMP testing system

2013-03-28 Thread Torbjorn Granlund
There will be many spurious failures reported by the automated GMP testing system in the next days. This is caused by construction work. http://gmplib.org/devel/tm-date.html The goal is to test both static and dynamic builds for every config. We're getting rid of cron initiated testing too, in

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-26 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: These give a modest speedup compared to the T1 routines. I also added missing T3 timings to existing code. The first thing to try then is finding code that runs well on both. There is a cost in having more variants than we need. Also, I worked on

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-26 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: L(top): or %g4, %g1, %l1 sllx %g2, cnt, %g1 srlx %g2, tcnt, %g4 ldx [up - 8], %g2 stx %l1, [rp - 8] or %g3, %l2, %l7 sllx %g5, cnt, %l2

Re: [PATCH] T3/T4 sparc shifts, plus more timings

2013-03-26 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: These give a modest speedup compared to the T1 routines. I also added missing T3 timings to existing code. The first thing to try then is finding code that runs well on both. There is a cost in having more variants than we need.

Re: [PATCH] 64-bit Popcount/Hweight for T3 and later

2013-03-25 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Technically we could use this on some chips we don't distinguish on a fine enough granularity yet. For example we can assume popc is available on T2 as well as UltraSPARC-IV. But for now, just T3 and later. I suppose we should mention this

Re: Does -0.5 fit an unsigned when truncated to an integer?

2013-03-19 Thread Torbjorn Granlund
Vincent Lefevre vinc...@vinc17.net writes: I haven't thought a lot about this, but it is not clear that -1 + eps should be considered to fit an unsigned type. Why? We need to decide how to define the edge conditions. We could either see it as "can be assigned to type without havoc" or

Re: Does -0.5 fit an unsigned when truncated to an integer?

2013-03-19 Thread Torbjorn Granlund
Vincent Lefevre vinc...@vinc17.net writes: What is the behavior for MAXIMUM + eps (both for signed and unsigned types)? That's indeed something we (GMP and MPFR) should worry about. Whatever we decide, we should handle the upper and lower boundaries analogously! -- Torbjörn

Re: Does -0.5 fit an unsigned when truncated to an integer?

2013-03-18 Thread Torbjorn Granlund
Zimmermann Paul paul.zimmerm...@inria.fr writes: indeed this is inconsistent. mpf_fits_uint_p(-0.5) should return true, as well as mpf_fits_uint_p(-0.999). I am not so sure that would be the right fix here. -- Torbjörn
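The two candidate definitions of "fits" in this thread can be contrasted with a small sketch on plain doubles. This is not GMP code and does not show what mpf_fits_uint_p does; the function names are invented, and the sketch assumes `unsigned` is narrow enough (typically 32 bits) for UINT_MAX to be exact as a double.

```c
#include <limits.h>

/* Strict reading: the value itself must lie in [0, UINT_MAX]. */
int fits_uint_strict(double d)
{
    return d >= 0.0 && d <= (double)UINT_MAX;
}

/* Truncating reading: the value rounded toward zero must lie in
   [0, UINT_MAX], i.e. -1 < d < UINT_MAX + 1.  Both boundaries are
   treated the same way; NaN fails both comparisons. */
int fits_uint_trunc(double d)
{
    return d > -1.0 && d < (double)UINT_MAX + 1.0;
}
```

Under the truncating reading -0.5 and -0.999 fit, and so does UINT_MAX + 0.5; under the strict reading none of them do.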

Re: GMP and CUMP

2013-03-11 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: long vectors of bignums... do you mean that it might be used for the point-wise multiplication in Schönhage–Strassen? No. Surely GPUs could be used for individual huge multiplies, but that's not going to benefit a lot of GMP applications. I have

Re: GMP and CUMP

2013-03-11 Thread Torbjorn Granlund
Emmanuel Thomé emmanuel.th...@gmail.com writes: Hi, On Mon, Mar 11, 2013 at 9:53 PM, Torbjorn Granlund t...@gmplib.org wrote: I have 'mpz_vec_t', 'mpf_vec_t' in mind, which have some number of mpz_t elements, each probably (padded to) the same size counted in limbs

Re: neon logops

2013-03-08 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Building on the copyi that tege committed the other day, use neon for the logical operations too. I did both a 128-bit aligned version, $ ./speed-128 -p 10 -C -s 10,50,100,500,1000,5000,1 mpn_and_n mpn_nand_n clock_gettime

Re: T3/T3 mul_2 and addmul_2

2013-03-08 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: Torbjorn Granlund t...@gmplib.org Date: Thu, 07 Mar 2013 20:58:51 +0100 I'm reasonably sure this is correct. Needs some work still: It was a one character bug in the non-emulation stuff. This is also in the smaller checked

Re: T3/T3 mul_2 and addmul_2

2013-03-08 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Seems to work fine, here are some speed runs: davem@patience:~/src/GMP/HG/build-sparc64-ultrasparct4/tune$ ./speed -C -s 32-64 -t 2 mpn_mul_2 overhead 6.06 cycles, precision 1 units of 3.51e-10 secs, CPU freq 2847.34 MHz

Re: mpn_cnd_add_n

2013-03-07 Thread Torbjorn Granlund
Here's a patch that reorders the arguments for mpn_addcnd_n and mpn_subcnd_n (I think it's best to keep this change separate from the renaming, since the potential problems are quite different). It's tested on x86_64, arm, and with --disable-assembly. I've run a regular make check and

T3/T3 mul_2 and addmul_2

2013-03-07 Thread Torbjorn Granlund
I wrote 4-way unrolled mul_2 and addmul_2 for T3/T4. The FAKE_T3 stuff includes missing.m4, which implements some instructions missing from my old systems around here. I might retain that stuff for a while to allow local regression testing, even if it is a bit ugly. Could you please run time

Re: GMP and CUMP

2013-03-07 Thread Torbjorn Granlund
romes p romes_12...@yahoo.com writes: Hello developers I noticed that there is also a CUMP site http:/www.hpcs.cs.tsukuba.ac.jp/~nakayama/cump/ Sheesh, the guy has copied and edited the GMP webpages and now claims the default all rights reserved with himself as owner. Not a serious

Re: T3/T3 mul_2 and addmul_2

2013-03-07 Thread Torbjorn Granlund
I only now spotted FPMADDXHI and FPMADDX. No Sun/Oracle SPARC has been a floating-point demon, and these integer multiply instructions are performed in the fpu. Multiply-accumulate instructions are tricky, since one may easily put the accumulation on a carry recurrency path, and thereby kill

Re: [PATCH 2/2] Optimize 64-bit mpn_add_N and mpn_sub_N for sparc T3 and later.

2013-03-06 Thread Torbjorn Granlund
I think all your T3/T4 changes are now in. Please check that I didn't mess something up. Thanks for this contribution! -- Torbjörn

Re: [PATCH 0/3] Resubmit of Sparc T3/T4 patches.

2013-03-06 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: Torbjorn Granlund t...@gmplib.org Date: Wed, 06 Mar 2013 00:08:09 +0100 The addmul code could be similarly improved. Grumble... and I did this work already, I sent older versions of my T3/T4 changes, let me go see how I screwed

Re: [PATCH 2/2] Optimize 64-bit mpn_add_N and mpn_sub_N for sparc T3 and later.

2013-03-06 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: I optimised submul_1.asm, and then edited both addmul_1 and submul_1 to use as similar operand order as possible. Please test these using tests/devel/try, and please time this new submul_1. The testsuite starts failing very early with these

Re: [PATCH 2/2] Optimize 64-bit mpn_add_N and mpn_sub_N for sparc T3 and later.

2013-03-06 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: David Miller da...@davemloft.net Date: Thu, 07 Mar 2013 01:06:55 -0500 (EST) I'll test your routines with the obvious fix in a moment. With the one-liner fix both of your new implementations work. Thanks for testing! submul_1 is

Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so

2013-03-05 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Actually, I think that's incorrect. Everyone has some *familiarity* with the C preprocessor, which surely is an advantage. And maybe most C programmers think they understand it. But in my experience, very few understand the fine details

Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so

2013-03-05 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: That would certainly cause some additional confusion. Any suggestion for appropriate m4 quote characters to use? ;-) I think one should be kind and use [ and ]. The resulting C dialect, where indexing would be written arr[[i]] is not too bad...

Re: [PATCH 1/3] Optimize 32-bit sparc T1 multiply routines.

2013-03-05 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: * mpn/sparc32/ultrasparct1/mul_1.asm (mpn_mul_1): Unroll main loop one time, align code on 32-byte boundary, add T2/T3/T4 timings. * mpn/sparc32/ultrasparct1/addmul_1.asm (mpn_addmul_1): Likewise. *

Re: [PATCH 0/3] Resubmit of Sparc T3/T4 patches.

2013-03-05 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: This is a resubmit of the work I did 2 months ago now that my FSF assignment has finally been completed. Just the simple stuff, use of mulx/umulx/addxccc and 1 level of loop unrolling. We now got patches 1/3 and 3/3. Is there a 2/3 too? --

Re: [PATCH 3/3] Optimize 64-bit mpn_add_N and mpn_sub_N for sparc T3 and later.

2013-03-05 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: * mpn/sparc64/ultrasparct3/add_n.asm: New file. * mpn/sparc64/ultrasparct3/sub_n.asm: New file. There is currently no mpn/sparc64/ultrasparct3, only ultrasparct1. For which CPUs are these new add_n/sub_n intended? Why not also for

Re: [PATCH 3/3] Optimize 64-bit mpn_add_N and mpn_sub_N for sparc T3 and later.

2013-03-05 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: For T3 and T4. This file makes use of new instructions: addxc(cc). Thanks. Honestly, why they didn't have a proper 64-bit with carry insn right from the very first v9 cpu is a mystery. The SPARC cpu is so full of design mistakes that I am not

Re: [PATCH 0/3] Resubmit of Sparc T3/T4 patches.

2013-03-05 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: One extra add insn here (copy-paste from addmul)? addcc %o5, %g3, %g3 addxccc %g2, %g1, %g1 addxc %g0, %o4, %o5 Since I cannot test this at all (qemu-system-sparc64 persistently resists all my usage attempts) I need you

Re: [PATCH 0/3] Resubmit of Sparc T3/T4 patches.

2013-03-05 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: Torbjorn Granlund t...@gmplib.org Date: Tue, 05 Mar 2013 21:35:19 +0100 Richard Henderson r...@twiddle.net writes: One extra add insn here (copy-paste from addmul)? addcc %o5, %g3, %g3 addxccc %g2, %g1, %g1

Re: [PATCH 0/3] Resubmit of Sparc T3/T4 patches.

2013-03-05 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: The versions I posted passed all of the tests. What does all the tests mean? I insist that you run tests/devel/try. Please send me the output of the command I asked you to run. Running GMP's test suite is *not* adequate for testing new assembly

Re: [PATCH 0/3] Resubmit of Sparc T3/T4 patches.

2013-03-05 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: diff --git a/mpn/sparc64/ultrasparct3/mul_1.asm b/mpn/sparc64/ultrasparct3/mul_1.asm index df52647..6a3f193 100644 --- a/mpn/sparc64/ultrasparct3/mul_1.asm +++ b/mpn/sparc64/ultrasparct3/mul_1.asm @@ -50,8 +50,7 @@ L(top): umulxhi %o4,

Re: [PATCH 0/3] Resubmit of Sparc T3/T4 patches.

2013-03-05 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: Torbjorn Granlund t...@gmplib.org Date: Tue, 05 Mar 2013 23:27:45 +0100 David Miller da...@davemloft.net writes: diff --git a/mpn/sparc64/ultrasparct3/mul_1.asm b/mpn/sparc64/ultrasparct3/mul_1.asm index df52647..6a3f193

Re: [PATCH 1/2] Add 64-bit sparc multiply routines for T3 and later.

2013-03-05 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: This is a respin of patch #2 from last night, it incorporates all of the improvements either explicitly or implicitly suggested :-) Torbjorn, I'm leaving out the configure regeneration from the patch, so that the patch is not so large, since I'm

Re: [PATCH 01/20] Delete mpn/generic/sizeinbase.c

2013-03-04 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Does anyone remember why it was deleted back then? I think it makes a lot of sense as a public mpn function. Checked old mail, but it is only mentioned 7 months earlier, when Kevin added it. It does not make sense as an internal function, the

Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so

2013-03-04 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: This does not adjust the public interface at all, or tidy the internal namespace at all. What it does do is annotate the source (in as few places as possible) so that we automatically create and use the hidden internal aliases inside the

Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so

2013-03-04 Thread Torbjorn Granlund
Did you use gmp-func-list.txt for determining which functions to make public? -- Torbjörn

Re: [PATCH 00/20] Create and use hidden aliases in libgmp.so

2013-03-04 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: No, I used the existing gmp-h.in file, as I mentioned elsewhere. Note that all symbols that are visible today are still visible with the patch. I'm not really cleaning up the set of exported symbols. Just making sure that gmp itself

Re: mpn_cnd_add_n

2013-03-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: And this. (I think I'd prefer mp_limb_t mpn_cnd_add_n (mp_limb_t cnd, mp_ptr rp, mp_srcptr ap, mp_srcptr bp, mp_size_t n) but that's a minor detail, and view the cnd_
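To make the proposed argument order concrete, here is a minimal portable sketch of a conditional add with the condition as the first parameter. It only illustrates the masking idea; the name `cnd_add_n_sketch` and the `limb_t` type are assumptions, and the real GMP routines are written with more care about what the compiler may do to the mask.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* If cnd is non-zero, {rp,n} = {ap,n} + {bp,n}; otherwise {rp,n} = {ap,n}.
   Returns the carry-out.  The same loads, adds and stores run either way. */
limb_t cnd_add_n_sketch(limb_t cnd, limb_t *rp,
                        const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t mask = -(limb_t)(cnd != 0);   /* all ones or all zeros */
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t b = bp[i] & mask;         /* b is zeroed when cnd == 0 */
        limb_t s = ap[i] + b;
        limb_t c1 = s < b;               /* carry from a + b */
        limb_t r = s + cy;
        limb_t c2 = r < cy;              /* carry from adding carry-in */
        rp[i] = r;
        cy = c1 | c2;                    /* at most one of them is set */
    }
    return cy;
}
```

Putting `cnd` first keeps the remaining parameters in the same order as mpn_add_n, which appears to be the point of the preference stated above.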

Re: Public mpn_add_nc

2013-03-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: As I understand it, that plan would imply that for assembly files currently providing both _n and _nc, the _n entry point gets obsolete That was not the idea, at least not for internal calls. It sometimes has a cycle or two overhead. It does

Re: Register r9 in the ARM ABI

2013-03-01 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I have a general ARM question, that maybe Richard or someone else on the list can answer. From the ABI documentation I've read, register r9 is in some way reserved for implementation of things like thread local storage. If I write a leaf

Re: GMP symbol naming (and the history thereof)?

2013-03-01 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Excellent. That's more or less exactly what I want to do. That would be another welcome contribution! I believe that IFUNC and the fallback fat system can live side-by side, sharing most of the actual logic. The choice of which implementation

Re: GMP symbol naming (and the history thereof)?

2013-02-28 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Several times over the past week as I debug my neon routines, it has become painfully apparent (as I accidentally single-step into the dynamic linker) that the shared libgmp could use some help in modernizing its internal linkage. We are at

Re: ARM Neon popcount

2013-02-28 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: What about vldm? Like vldm up!, {q0,q1,q2,q3} As far as I understand the manual, it supports a larger number of registers. The registers must be consecutive, but that's no problem here. I added a long list of things to try.

Re: GMP symbol naming (and the history thereof)?

2013-02-28 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I think it makes sense with some level of name mangling from API symbols to linker names. First, it's good practice to use a single prefix for all linker symbols, while it's nice to use multiple prefixes for API symbols (mpz_*, mpn_*, gmp_*,

Re: GMP symbol naming (and the history thereof)?

2013-02-28 Thread Torbjorn Granlund
I try to classify things on my list a bit further, and correct errors. I stumbled over some functions. I need feedback here. mpn_div_qr_2 is decl but not doc. I suggest we move the declaration to gmp-impl.h. mpn_divrem_2 is decl but not doc. I suggest we move the declaration to gmp-impl.h.

ARM Neon popcount

2013-02-27 Thread Torbjorn Granlund
I decided to play a bit with Neon, but instead of doing something hard like addmul_k, I wrote an mpn_popcount. :-) The code runs well for A15 at about 0.56 c/l, but much worse on A9 at about 2.8 c/l. (The inner loop's hard whacking on q8 is a problem on A9; using a8 and a9 alternatingly shaves

Re: Neon addmul_8

2013-02-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Hmm, I tried changing all output registers to unique registers (only written once in the loop, never ever read (except as vmlal reads the output register before accumulating to it). Do you mean that I need to change the *input* registers of all

Re: Neon addmul_8

2013-02-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I'm attaching the functions I've been testing, in case anyone else would like to play with them. May I innocently ask if the functions have survived the prescribed testing (tests/devel/addmul_N.c and/or tests/devel/try.c)? ;-) -- Torbjörn

Re: Neon addmul_8

2013-02-26 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Perhaps I got the methodology wrong here, but it sure appears as if vmlal does not require the addend input until the 4th cycle, producing full output on the 5th. This seems to be the easiest way to hide a lot of output latency. I measured a

Re: Neon addmul_8

2013-02-24 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: gcc -O2 -g3 [...] addmul_N.c -DN=8 -DCLOCK=169400 $ ./t.out mpn_addmul_8: 2845ms (1.782 cycles/limb) [973.59 Gb/s] mpn_addmul_8: 2620ms (1.641 cycles/limb) [1057.20 Gb/s] mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19 Gb/s]

Re: arm neon

2013-02-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time with larger unrolling to make full use of the vector load insns, and less over-prefetching. Good improvement! Keep in mind that addmul_ will be used for smallish count

Re: arm neon

2013-02-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: On 2013-02-23 06:06, Niels Möller wrote: Not sure what the bottlenecks of your loop are though; instruction decoding, load/store, or the recurrency chain (but at least it shouldn't be multiplier throughput, right?). Yeah, neither am I. I
