Re: arm neon

2013-02-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: The reason for that is that we use Karatsuba's algoritm for counts of over about 20. Good to know. I won't base my tuning on tests/devel/addmul_N's default of ~600 limbs then. Using a larger count L1size might still be useful for

Re: Neon addmul_4

2013-02-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Down to 2.8-3.0 cyc/limb. Good leaps towards 0.7 c/l, and not far from the current code. (On A9 it runs at 4.25 c/l.) -- Torbjörn ___ gmp-devel mailing list gmp-devel@gmplib.org

Re: arm neon

2013-02-22 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Indeed, the last version that Niels posted doesn't pass this test. Oops. The following does pass, but if I'm to believe the arithmetic it's still fairly slow -- around 12cyc/sec. 12cyc/sec is a poor clock frequency. :-) If one is even

Re: arm neon

2013-02-22 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: The widening add insns are: VADDL 32+32-64 Qd[n] = Dn[n] + Dm[n] VADDW 64+32-64 Qd[n] = Qn[n] + Dm[n] VPADDL 32+32-64 Qd[n/2] = Dn[2n] + Dn[2n+1](horizontal add) VPADAL 32+32+64-64 Qd[n/2] += Dn[2n] + Dn[2n+1] All

Re: arm neon

2013-02-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Found the vmlal instruction now. Makes for a cute loop, .Loop: vld1.32 l01[1], [vp]! vld1.32 {u00[]}, [up]! vaddl.u32 q1, l01, c01 vmlal.u32 q1, u00, v01 C q1 overlaps with c01

Re: arm neon

2013-02-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: One could hope so. I'm still a bit skeptic to mul and accumulate, at least for addmul_2, since it's seems like a good idea to schedule the multiplications far in advance. Multiply-accumulate can have the drawback to put multiplication in the

Re: arm neon

2013-02-20 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Unfortunately not. speed -C -s ... mpn_addmul_2 reported around 14 cycles, so it's 7 c/l, compared to 2.38 for the current non-simd code. If I interpret speed output correctly. Include addmul_1.3 in the measurements as a sanity check. What

Re: Annoying mpz_set_str behaviour

2013-02-20 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: This call fails: mpz_set_str (z, 0xbade, 16) Why? Because when given an explicit base the 0x prefix is not allowed. Do we really want it to work like that? I think so. Unless

GMP repo changes

2013-02-11 Thread Torbjorn Granlund
I have created a new repo http://gmplib.org:8000/gmp-5.1/ for the stable 5.1.x meaning that http://gmplib.org:8000/gmp/ is now open for checkins for the next major release. I've also removed the obsolete sp-fft repo. -- Torbjörn ___ gmp-devel mailing

Re: speed of unbalanced multiplication

2013-02-07 Thread Torbjorn Granlund
Zimmermann Paul paul.zimmerm...@loria.fr writes: if the culprit is the macro used in speed, it should be fixed! I stared at it for an hour yesterday, and I cannot see any problems. Operand alignment will differ, but then we shouldn't get consistently worse performance from mpn_mul. --

Re: speed of unbalanced multiplication

2013-02-07 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: If other developers does not dislike the changed meaning of the .r parameter to mpn_mul, this patch can be applied to the main repo... I don't mind, since I never remember which way it is anyway. Avoiding the local allocation is nice too. It would be nice

Re: speed of unbalanced division

2013-02-05 Thread Torbjorn Granlund
You might want to try this patch. It removes lots of code, originally put in place for saving memory. It allows the divappr function to do its job more smoothly. I am not sure this is the best change. Selecting 'in' better in the first place might have certain advantages. More work is needed.

Re: Public mpz_ptr and mpz_srcptr

2013-02-04 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: gmp.h includes the following: /* Types for function declarations in gmp files. */ /* ??? Should not pollute user name space with these ??? */ typedef const __mpz_struct *mpz_srcptr; typedef __mpz_struct *mpz_ptr; I'm afraid

Re: Toom sqr recursion

2013-01-27 Thread Torbjorn Granlund
Torbjorn Granlund t...@gmplib.org writes: I just spotted that we're not depending on SQR_BASECASE_THRESHOLD in any of the recursive toomN_sqr calls. I think we should, at least from toom2_sqr. And in toom44_mul.c and toom4_sqr.c we set MAYBE_*_toom4* using MUL_FFT_THRESHOLD when we should

Re: Toom sqr recursion

2013-01-27 Thread Torbjorn Granlund
Torbjorn Granlund t...@gmplib.org writes: And in toom44_mul.c and toom4_sqr.c we set MAYBE_*_toom4* using MUL_FFT_THRESHOLD when we should really consider TOOM6 or (if TOOM6 = 0) TOOM8. This was forgotten when the 6H and 8H code was merged. Here's is a possible patch set: diff -Nrc2 gmp

Re: speed of unbalanced multiplication

2013-01-26 Thread Torbjorn Granlund
Zimmermann Paul paul.zimmerm...@loria.fr writes: in GMP 5.1.0, a multiplication of n x m limbs for m n can be slower than a multiplication of n x n limbs. Compare for example the line starting with 775660 in the first output from speed, and the one starting with 100 in the second one

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: And for neon instructions, cycle numbers are in http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf Page? Seems it should be able to do one vmull per cycle. Not sure how to get latency from the given

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: In Chapter 3, multiplication instructions listed in a table starting on page 3-14. But now I see I read the entry for a smaller data size. For 32-bit inputs, it's apparently 2 cycles, not 1. It seems to be 2 cycles indeed: .text

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
The corresponding code sustains one vmull.u32 per cycle on A15. That's 4 times the bandwidth of its umul implementation. It is usually tricky to make use of SIMD operations for addmul_(k) and friends. The well-designed ARM instructions will surely make it easier, but it might still require many

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: But IIUC, we are thus performing a 32 x 32 - 64 mul per cycle. Can one stick addition here without consuming cycles? As I understand the manual, operations in the main cpu can be done

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
There are a few aspects worth noticing for prospective Neon hackers: There are 32 64-bit register, available both in in VFPv3-D32 and Neon. (There are IIRC at least 4 levels of FP support, VFP, VFPv2, VFPv3-D16, and VFPv3-D32... I've seen references to VFPv4 too.) In Neon, the registers can

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: I played with vmlal.u32 on A9 and A15. Surprisingly, both CPUs are very cooperative in that the accumulation dependency is very shallow. Nice. Is the same true for the non-simd umaal instruction

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: Michael Mohr m...@linuxcertified.com Date: Mon, 14 Jan 2013 12:10:30 -0800 At least for Android this may not be an option. It provides a cpufeatures library for use at runtime. I suggest using a configure option which explicitly

Re: arm neon

2013-01-12 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Hmm, if I understand you correctly, it is preferable if the cpu can start doing the multiplication without any dependency on the carry from previous iteration, right? At least in theory, umaal could be implemented in such a way. IIRC, the

Re: [patch] add x32 support

2013-01-10 Thread Torbjorn Granlund
Mike Frysinger vap...@gentoo.org writes: On Thursday 10 January 2013 15:05:12 Torbjorn Granlund wrote: Mike Frysinger vap...@gentoo.org writes: looks like support for the x32 ABI slipped through the cracks. here's the patch updated to latest hg branch (http://gmplib.org:8000/gmp

Re: speed and mpn_set_str

2013-01-09 Thread Torbjorn Granlund
Zimmermann Paul paul.zimmerm...@loria.fr writes: it seems the speed command considers for mpn_set_str the size of the *input*, i.e., the number of *bytes* of the input string, and not the number of limbs of the corresponding mpn number. There are a couple of quirks in tune/speed program

Re: GMP 6 the incomatible GMP?

2013-01-09 Thread Torbjorn Granlund
Marc Glisse marc.gli...@inria.fr writes: Experimentally (for me, others can have wildly different use cases), what really helps for small nums isn't to save space, it is to not allocate at all, for instance storing a few limbs in the base type and only allocating when that's not enough

Re: GMP 6 the incomatible GMP?

2013-01-09 Thread Torbjorn Granlund
Most people seem to prefer splitting the alloc data over a field in the structure and a field before mp_d[0], while I think that'd be slower as well as more complex than the floating-point coding I have in mind. My proposal will make allocations somewhat inaccurate, but I'll make sure it is never

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-09 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: BTW I noticed the following while researching and working on this T4 stuff: http://download.intel.com/embedded/processor/whitepaper/327831.pdf Note that there are no subtract equivalents being added of these new add-with-carry instructions.

Re: speed and mpn_set_str

2013-01-09 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Are the thresholds for mpn_set_str compared to input size or (estimated) output limb count? I imagine optimal threshold may also be base-dependent. Input size. The thresholds are done for base 10. No work has been done to be clever about this

Re: GMP 6 the incomatible GMP?

2013-01-08 Thread Torbjorn Granlund
Marc Glisse marc.gli...@inria.fr writes: Then we might as well always put it in _mp_d[-1], no? IIRC that's what mpfr does. I see some problems with that: * There is a serial memory dependency problem, making the allocation check at least one more cache latency away. I don't think OoO

GMP 6 the incomatible GMP?

2013-01-07 Thread Torbjorn Granlund
We have discussed, perhaps mostly off-list, to fix several nuisances in the library that will break binary compatibility. There is actually a terse page about it (linked from the developer's corner): https://gmplib.org/devel/incompatibility.html I would like to lift the precision limit (for

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-06 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Thanks for your help, the following works. I'll work on unrolling and scheduling it. PROLOGUE(mpn_sub_nc) ba,pt %xcc, L(ent) xorcy, 1, cy EPILOGUE() PROLOGUE(mpn_sub_n) mov 1, cy L(ent): cmp %g0,

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-05 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Each load can issue in 1 cycle, there is a 4 cycle latency, the loads will fully pipeline. Therefore the overhead is around 3n. At most one memop / cycle? Our current Karatsuba code (evaluating in 0, -1, oo) will suffer from the forgotten

Generated files in the repo

2012-12-29 Thread Torbjorn Granlund
I just realised that demos/calc/calc.{c,h} and demos/calc/calclex.c are generated files. Are there any reasons to keep such files in the repo? Having them in releases (and snapshots) makes some sense, since that avoids a build dependency on bison and flex. -- Torbjörn

Re: Generated files in the repo

2012-12-29 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Are there any reasons to keep such files in the repo? I don't think so. OK, so I went ahead and removed them. There should be no difference to releases. I discovered this when I found that the release scripts made tar files with updated

Re: mpz_combit

2012-12-28 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: bodr...@mail.dm.unipi.it writes: About mpz_combit code, may I suggest replacing the two lines MPN_NORMALIZE (dp, dsize); ASSERT (dsize 0); with MPN_NORMALIZE_NOT_ZERO (dp, dsize); ? Conciseness is nice. Is

Re: mpz_combit

2012-12-27 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: Ciao, Il Gio, 27 Dicembre 2012 12:12 pm, Niels ha scritto: Torbjorn Granlund t...@gmplib.org writes: I thought we had discussed this. The change is fine, optimising the common case is a great thing in general. Unfortunately, the common case

Re: Jacobi/Legendre/Kronecker

2012-12-18 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Zimmermann Paul paul.zimmerm...@loria.fr writes: I'm glad you asked (on the AMD processor): frite% gcc -I$GMP/include -O2 -g e.c $GMP/lib/libgmp.a frite% ./a.out GMP: header 5.0.5, library 5.0.5 mpz_jacobi: 408ms

Re: Undefined-behavior overflows in GMP?

2012-11-20 Thread Torbjorn Granlund
Roberto Bagnara bagn...@cs.unipr.it writes: I have just finished reading Understanding Integer Overflow in C/C++, by Will Dietz, Peng Li, John Regehr, and Vikram Adve (http://www.cs.utah.edu/~regehr/papers/overflow12.pdf). On page 9, it says: Finally, we reported nine undefined

Re: Undefined-behavior overflows in GMP?

2012-11-20 Thread Torbjorn Granlund
Marc Glisse marc.gli...@inria.fr writes: On Tue, 20 Nov 2012, Roberto Bagnara wrote: I have just finished reading Understanding Integer Overflow in C/C++, by Will Dietz, Peng Li, John Regehr, and Vikram Adve (http://www.cs.utah.edu/~regehr/papers/overflow12.pdf). On page 9, it

Re: Undefined-behavior overflows in GMP?

2012-11-20 Thread Torbjorn Granlund
Vincent Lefevre vinc...@vinc17.net writes: Concerning the mul_i.h problem, it was fixed there, AFAIK: 2012-02-09 Marc Glisse marc.gli...@inria.fr * gmp-impl.h (ABS_CAST): New macro. * mpf/cmp_si.c: Use ABS_CAST. * mpf/get_si.c: Use ABS_CAST. *

Re: Side-channel silent division

2012-11-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I was been thinking of the following algorithm (I think I wrote about the mod variant a while ago). Say we want to divide by an n-limb number D, and for simplicity, assume that D is normalised. First precompute the inverse v =

Re: broot vs brootinv performancs

2012-11-13 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Trying $ ./speed -o cycles-broken -s 1-30,100,200 -r mpn_mullo_n mpn_binvert it appears binvert is up to 5 times slower than mullo for just a few limbs, 2 times slower for n = 10, and roughly 50% slower in the range between 20 and 200

Re: mpz_combit

2012-10-30 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: I've tried rewriting mpz_combit. Two reasons: 1. To reduce overhead int he common case. 2. I just realized that the common case, for both positive and negative numbers, is to complement the

Re: A perfect power, and then?

2012-10-28 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: 0. Support in speed, for benchmarking. Not checked in yet, but here are some benchmark numbers, comparing to binvert: Do you have any interpretation of these numbers? I am hacking out the

Re: A perfect power, and then?

2012-10-27 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Perhaps you could add some of your remarks as a comment to the file? Makes sense. Not tonight, though. I saw your edits. And I edited them sligtly, adding some more FIXMEs and clarifying the operation of binv_root. Please read and tell me

Re: A perfect power, and then?

2012-10-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Sounds right. Such convolution type sum-of-products might want a separate function, returning 3 limbs. But I'm not talking about a convolution, but about the complete powering. If x is a candidate nth root, and x^n is of size s bits, I

Re: A perfect power, and then?

2012-10-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I guess this means: Avoid using functions from libm, and avoid using floating point in critical loops. Right? There's a (non-critical) floating point operation in mpn_perfect_power_p involving some logarithm table. Right. I would actually

Re: A perfect power, and then?

2012-10-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Ah, and if we're brain storming, yet another variant would be to compute an extra limb (or maybe only half a limb) of precision when we compute the 2-adic k'th root. If the extra limb is non-zero, then the input can't be a k'th power. That's

Re: A perfect power, and then?

2012-10-25 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Do you think we should have an advertised binv_sqrt function returning a^{-1/2}? (And if so, should we have something analogous also for euclidean square root? And for nth roots?) Perhaps. A start would be to advertise it internally. My

Re: A perfect power, and then?

2012-10-23 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: How should we solve this? It is tempting to let mpz_perfect_power_p return p, but unfortunately it has type 'int' which might be too small. Which type would be right? mp_bitcnt_t? Yes

A perfect power, and then?

2012-10-22 Thread Torbjorn Granlund
I recently played with the Quadratic Sieve (together with Niels) making some use of GMP, but of course the bulk of the work in QS is done sieving using 8-bit numbers. QS doesn't factor perfect powers, t = a^p. We can detect such numbers quickly using mpz_perfect_power_p. But we cannot get any p

Re: bdiv vs redc

2012-07-17 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: Some other test cases fail, though: t-root, t-perfpow, t-divis, and t-cong. I'm not familiar with this code, I had a quick look at the root code but didn't see any obvious dependency on bdiv.

Re: bdiv vs redc

2012-07-17 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: How do you define hensel square root with remainder? Given a and n, if there exists an x such that x^2 = a (mod B^n), that seems like the reasonable definition of the square root. But what if no such x exists; where should we put the remainder

Re: bdiv vs redc

2012-06-28 Thread Torbjorn Granlund
Torbjorn Granlund t...@gmplib.org writes: iterate nn-dn times q = np[0] * dinv hi = mpn_addmul_1 (np, dp, dn, q) hi += cy cy = hi cy // can this be true? hi += np[dn] cy2 += hi np[dn] np[dn] = hi np += 1; I wrote a possible start

Re: GMP 5.0.5 build fails with tcc due to x86_64 asm (unknown register...)

2012-06-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I think it makes sense to have --disable-assembly disable *all* use of assembly source files. Maybe. I suppose the file now in question is used only by tests/devel/try for detecting ABI breaches. Compiler generated code should not need such

Re: LIBGMPXX_LT_*

2012-06-26 Thread Torbjorn Granlund
Marc Glisse marc.gli...@inria.fr writes: I just added a new symbol to libgmpxx. Should I touch one of the LIBGMPXX_LT_* in Makefile.am, mark somewhere that it may need to be done prior to release, or ignore the (non-)issue? You might fill in the table in Makefile.am to make sure a CXX

Re: LIBGMPXX_LT_*

2012-06-26 Thread Torbjorn Granlund
Marc Glisse marc.gli...@inria.fr writes: Ok, I did that. And noticed that, although LIBMP_LT_* are still there, mpbsd was removed a while ago (I had completely forgotten about that...). There always seem to be a libmp bone remaining. -- Torbjörn

Re: mpz_get_ld

2012-06-23 Thread Torbjorn Granlund
Marc Glisse marc.gli...@inria.fr writes: On Tue, 19 Jun 2012, Torbjorn Granlund wrote: I don't think we require C99 for GMP, only C89. The ldexpl function was introduced with C99. Would mpz_get_ld become acceptable with an appropriate autoconf test for ldexpl and a silly loop

Re: GMP 5.0.5 build fails with tcc due to x86_64 asm (unknown register...)

2012-06-20 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: tmp-amd64call.s:120: error: unknown opcode 'movq' ... Even with the --disable-assembly option given to configure, make check is asking to the compiler to assemble a .s file. It didn't happen with the --host=none option given to GMP-5.0 . If you have

Re: mpz_get_ld

2012-06-19 Thread Torbjorn Granlund
I don't think we require C99 for GMP, only C89. The ldexpl function was introduced with C99. -- Torbjörn ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel

Re: Pseudo prime tests

2012-06-14 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: You mean the BPSW primality test? I wasn't aware of this BPSW suggestion. What I am suggesting is similar, but I am not familiar with the Lucas test. I also think one should do more trial dividing, and let its limits be operand dependent. Some months ago,

Re: Pseudo prime tests

2012-06-11 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: I'd like to keep the Return 2 if n is definitely prime feature of the current _probab_prime_ function. Yes, that would be nice. 2. mpz_notdiv_pprime_p, like mpz_pprime_p but without trial dividing. Do we really need a new function for this? An

Re: Pseudo prime tests

2012-06-11 Thread Torbjorn Granlund
I have read up on this subject. Selfridge and Pomerance have noticed that an M-R test with base 2 plus a Fibonacci test (Fib_n + Legendre(n,5) == 0 (mod n)) seem to correctly identify composite numbers. They will give USD 620 to anyone who finds a composite identified as a prime, or proves that

Re: _mp_alloc vs ALLOC

2012-06-04 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: The old_size argument should be removed also from the free function, not just realloc. Fixed. Add return value to mpz_sqrt, for consistency with mpz_root (assuming such a flag can be returned very cheaply, which I think it can). Added.

Re: _mp_alloc vs ALLOC

2012-06-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: The point, as I understand it, is to collect various problems where the fixes imply interface changes, and which hence are not going to happen in any release currently planned. So we don't forget them, if/when we ever to planning a release which

Re: _mp_alloc vs ALLOC

2012-06-03 Thread Torbjorn Granlund
Marc Glisse marc.gli...@inria.fr writes: Naive question: are there many ABIs where adding a return type breaks compatibility? Probably few. But it breaks source code compatibility, if it is called via a typed function pointer. Our own tests/mpz/reuse.c would need modifications... --

Re: tunefat: run-time vs. compile-time constant thresholds

2012-05-14 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: Toom-2 and Toom-3 code for fat builds should be basically equivalent to the code used by tuneup, now: the thresholds are global variables, and threshold checking consist in a single comparison. We may extend the mechanism to other toom thresholds now...

Re: tunefat: run-time vs. compile-time constant thresholds

2012-05-06 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: I pushed a few changes to the MAYBE_ mechanism of lower degree toom functions, and to the ABOVE_THRESHOLD macro (only for GCC). The effect of the change can be measured looking at Hydra's coverage report: - before the change we had a total of 43613

Re: tunefat: run-time vs. compile-time constant thresholds

2012-04-27 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: Many macros in our code assume that thresholds are compile-time constants, e.g. the MAYBE_ machinery in toom code, but also the ABOVE_THRESHOLD macro, with its three stage comparison... #define ABOVE_THRESHOLD(size,thresh)

Re: Some arm cortex-a8 improvements

2012-04-24 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: arm.com has them, free registration required. Found it. Had to click through some zany revision numbers, making me thing these were revision guides, not proper documentation. Table B.5. Multiplication instruction cycle timings Instruction

Re: Some arm cortex-a8 improvements

2012-04-24 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: On my system, umaal has a latency if 3, whatever dependencies I create. (There are 4 input regs and 2 output, so there are quite a few possible dependency combinations; I only tried a subset.) Either the docs are plain wrong, or there are

Re: Some arm cortex-a8 improvements

2012-04-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Indeed I know that the hw registers that allow such recognition are all privileged. For linux they best one can do is /proc/cpuinfo or (to some extent) the values in AT_HWCAP. Something portable would be nice... FYI, I dug out the

Re: Some arm cortex-a8 improvements

2012-04-23 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: On 04/23/12 07:49, Torbjorn Granlund wrote: Do you know the repeat rate of umull, umlal, umaal, assuming no reg dependencies? For a8: 3 cycles. For a9 it seems to be 2 cycles, so 3.25 c/l for the current addmul_1 is not very good. I have

Re: Some arm cortex-a8 improvements

2012-04-22 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: I used the following, almost certainly not appropriate for general application. [snip] Thanks. I would be very useful to make GMP timing work with the kernel Linux running om ARM. Do you know if there are similar problems with, say, NetBSD? I

Re: Primorial and faster binomial.

2012-04-20 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: Yet, it fits in the result area and is overwritten by it. Such small overhead is impressive! The result area is preallocated with an overestimate: N*3/2 bits ... Now I am less impressed. :-) Not storing the full bit-array is a possible improvement for

Re: Some arm cortex-a8 improvements

2012-04-20 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: Three patches herein. If there's a better way to submit patches, please advise; I've never used hg before. The first patch gives gcc control over ctz/clz. Particularly for armv6t2 and later, which have rbit for use for ctz. The second

Broken mpn_copyi

2012-04-17 Thread Torbjorn Granlund
The mpn_copyi I checked in yesterday is broken, it does not copy things strictly incrementing. Unfortunately, fixing it probably requires a rewrite, which will take a couple of days. Affected machines are all x86-64 Intel CPUs except Pentium 4, and VIA Nano. -- Torbjörn

Re: Unused code

2012-04-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: Are you aware if any more files or functions that are unused? I suspect quite a lot of the new (since a couple of years now...) division code is not used by any advertised mpn_function. *One* C

Unused code

2012-04-13 Thread Torbjorn Granlund
I'd like to locate unused code in GMP. Unused code adds size to the source, which is not too bad, but if it is compiled, it adds size to the compiled library, in particular to the shared library (which is always mapped in its entire when a process linked to it is running). My current list is

Re: Unused code

2012-04-13 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: It is a different information, but a code coverage tool run during the testsuite tells which functions are never called there (on one specific platform, with one set of options): http://hydra.nixos.org/build/2398430/download/1/coverage/ Very

New insns for GMP

2012-04-10 Thread Torbjorn Granlund
Next year, Intel will add an instruction MULX that will be extremely useful for GMP's most critical functions. It is a 64x64-128 bit multiply with two separate explcit destination operands and one explcit source operand and one implicit source operand (in rdx). The instruction does not clobber

Re: Question about STD64 ABI

2012-03-30 Thread Torbjorn Granlund
Torbjorn Granlund t...@gmplib.org writes: If GCC generates code that takes advantage if this rule, then I expect no problems at any standard platform. GCC 4.7 on FreeBSD generates such code fpr bdiv_qr.c. This is not a leaf function, so rsp is moved down before the calls. But the fact

Re: Need for GMP 5.0.5

2012-03-28 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: For future releases, it might be interesting to just have a single code for processor recognition; instead of two. Do you think it would be possible? You mean the code in config.guess/configure.in and mpn/x86{,_64}/fat/fat.c? It would be desirable to

Need for GMP 5.0.5

2012-03-27 Thread Torbjorn Granlund
I believe we will need a GMP 5.0.5 release. (1) GMP 5.0.4 recognises AMD 11h processors as 10h (aka K10), then making use of the LZCNT instruction. Unfortunately, 11h processors while later than 10h processors, don't implement all K10 instructions, and also don't raise an illegal

GMP performance leap

2012-03-22 Thread Torbjorn Granlund
I reran GMPbench with current head repo sources and got a surprise. There is a large leap in performance for AMD K10, and nice improvements for a lot of other CPUs as well, compared to GMP 5.0. AMD K10 went from 2985 to 3500 with improvements at every measured point. The thing is that I am not

Re: Improved gcd_1 code

2012-03-13 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: I pushed some new gcd_1 code for x86_64, helping AMD k10 and bulldozer, Intel conroe/penryn, nehalem/westmere, and sandybridge, and VIA nano. What algorithm do you use? Plain binary

Improved gcd_1 code

2012-03-12 Thread Torbjorn Granlund
I pushed some new gcd_1 code for x86_64, helping AMD k10 and bulldozer, Intel conroe/penryn, nehalem/westmere, and sandybridge, and VIA nano. The actual code lives in x86_64/core2, other CPUs either inherit it or grab it from there. Additional improvements are welcome. One idea: Before

Re: Sandybridge addmul_N challenge

2012-03-10 Thread Torbjorn Granlund
I just realised a platform where your karatsuba addmul_2 could help a lot: atom-64. There, addmul_2 currently runs at 20 c/l. The pipeline is two-way superscalar and in-order. The mul insn can be repeated someting like one every 16th cycle. There is a dual-core atom-64 in the test systems

Re: test if an unsigned long fits in a limb.

2012-03-07 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: The first one is the one I prefer. Has this problem been addressed in the alimb branch? We generalised this and separated long and size arithmetic from limb arithmetic. More work is needed on the alimb code, and help is most welcome. I think Per Olofsson

Re: test if an unsigned long fits in a limb.

2012-03-06 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: One can have alternative functions accepting mp_limb_t and mp_limb_signed_t, if one finds that useful. That would be nice, and so would functions accepting long long. But unfortunatly, neither is possible if we want one GMP interface per release,

Re: mpz_invert (x, 1, 0)

2012-03-03 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes: Again, Niels propose an elegant way out: we can let inverses mod 0 undefined. To obtain this we can keep the old library code, and patch only the recent testing program and documentation. I vote for this option. Me too. The behaviour of computations

New failures related to recent developments

2012-02-28 Thread Torbjorn Granlund
Three powerpc systems report failure this morning. I suspect they would have reported failure already yesterday, if compilation hadn't failed due to a missing file... The failures seem to be for small cases, for which the code (mpz/lucnum2_ui.c) uses a dumpmp/mini-gmp generated table. I

Re: New failures related to recent developments

2012-02-28 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I'm not entirely sure I understand fib-gen is supposed work. luc_limit is only assigned like this, in gen-fib.c:generate, if (mpz_cmp (l, limit) 0) luc_limit = i-1; Looking at mini-gmp.c:mpz_cmp, I've spotted

Re: New failures related to recent developments

2012-02-28 Thread Torbjorn Granlund
There is a systematic problem in mini-gmp.c when MPZ_REALLOC is called when a destination variable is the same as some other (source or destination) variable. After MPZ_REALLOC, all cached pointers must be considered to be defunct. I've spotted this error in 4 functions, but I haven't made a

Re: New failures related to recent developments

2012-02-28 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: After MPZ_REALLOC, all cached pointers must be considered to be defunct. It seems I was only paying attention to the destination pointer. I'll go through all uses of MPZ_REALLOC

Re: New failures related to recent developments

2012-02-28 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Will you commit these fixes, or do you want me to do that? Please take care of any fixing. It's your baby after all. :-) I have found the same four direct MPZ_REALLOC problems when reviewing the code: mpz_abs_add, mpz_and, mpz_ior and

New fastsse assembly

2012-02-27 Thread Torbjorn Granlund
I pushed new x86_64 assembly making use of 128-bit instructions working on xmm registers. While all x86_64 processors probably support the instructions used, some have less throughput using these than when using plain 64-bit instructions. The idea is to include these just before x86_64 in the

Re: Sandybridge addmul_N challenge

2012-02-27 Thread Torbjorn Granlund
Torbjorn Granlund t...@gmplib.org writes: carry-in lo in r14 carry-in hi in rcx mov 0(up), %rax mul v1 mov 8(rp), %r8 add %rax, %r8 mov %rdx, %r9 adc $0, %r9 mov 8(up), %rax mul v0

<    1   2   3   4   >