Richard Henderson r...@twiddle.net writes:
The reason for that is that we use Karatsuba's algoritm for counts of
over about 20.
Good to know. I won't base my tuning on tests/devel/addmul_N's
default of ~600 limbs then.
Using a larger count L1size might still be useful for
Richard Henderson r...@twiddle.net writes:
Down to 2.8-3.0 cyc/limb.
Good leaps towards 0.7 c/l, and not far from the current code.
(On A9 it runs at 4.25 c/l.)
--
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
Richard Henderson r...@twiddle.net writes:
Indeed, the last version that Niels posted doesn't pass this test.
Oops.
The following does pass, but if I'm to believe the arithmetic it's
still fairly slow -- around 12cyc/sec.
12cyc/sec is a poor clock frequency. :-)
If one is even
Richard Henderson r...@twiddle.net writes:
The widening add insns are:
VADDL 32+32-64 Qd[n] = Dn[n] + Dm[n]
VADDW 64+32-64 Qd[n] = Qn[n] + Dm[n]
VPADDL 32+32-64 Qd[n/2] = Dn[2n] + Dn[2n+1](horizontal add)
VPADAL 32+32+64-64 Qd[n/2] += Dn[2n] + Dn[2n+1]
All
ni...@lysator.liu.se (Niels Möller) writes:
Found the vmlal instruction now. Makes for a cute loop,
.Loop:
vld1.32 l01[1], [vp]!
vld1.32 {u00[]}, [up]!
vaddl.u32 q1, l01, c01
vmlal.u32 q1, u00, v01 C q1 overlaps with c01
ni...@lysator.liu.se (Niels Möller) writes:
One could hope so. I'm still a bit skeptic to mul and accumulate, at
least for addmul_2, since it's seems like a good idea to schedule the
multiplications far in advance.
Multiply-accumulate can have the drawback to put multiplication in the
ni...@lysator.liu.se (Niels Möller) writes:
Unfortunately not. speed -C -s ... mpn_addmul_2 reported around 14
cycles, so it's 7 c/l, compared to 2.38 for the current non-simd code.
If I interpret speed output correctly.
Include addmul_1.3 in the measurements as a sanity check.
What
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
This call fails:
mpz_set_str (z, 0xbade, 16)
Why? Because when given an explicit base the 0x prefix is not allowed.
Do we really want it to work like that?
I think so. Unless
I have created a new repo http://gmplib.org:8000/gmp-5.1/ for the stable
5.1.x meaning that http://gmplib.org:8000/gmp/ is now open for checkins
for the next major release.
I've also removed the obsolete sp-fft repo.
--
Torbjörn
___
gmp-devel mailing
Zimmermann Paul paul.zimmerm...@loria.fr writes:
if the culprit is the macro used in speed, it should be fixed!
I stared at it for an hour yesterday, and I cannot see any problems.
Operand alignment will differ, but then we shouldn't get consistently
worse performance from mpn_mul.
--
bodr...@mail.dm.unipi.it writes:
If other developers does not dislike the changed meaning of the .r
parameter to mpn_mul, this patch can be applied to the main repo...
I don't mind, since I never remember which way it is anyway.
Avoiding the local allocation is nice too.
It would be nice
You might want to try this patch. It removes lots of code, originally
put in place for saving memory. It allows the divappr function to do
its job more smoothly.
I am not sure this is the best change. Selecting 'in' better in the
first place might have certain advantages. More work is needed.
ni...@lysator.liu.se (Niels Möller) writes:
gmp.h includes the following:
/* Types for function declarations in gmp files. */
/* ??? Should not pollute user name space with these ??? */
typedef const __mpz_struct *mpz_srcptr;
typedef __mpz_struct *mpz_ptr;
I'm afraid
Torbjorn Granlund t...@gmplib.org writes:
I just spotted that we're not depending on SQR_BASECASE_THRESHOLD in any
of the recursive toomN_sqr calls. I think we should, at least from
toom2_sqr.
And in toom44_mul.c and toom4_sqr.c we set MAYBE_*_toom4* using
MUL_FFT_THRESHOLD when we should
Torbjorn Granlund t...@gmplib.org writes:
And in toom44_mul.c and toom4_sqr.c we set MAYBE_*_toom4* using
MUL_FFT_THRESHOLD when we should really consider TOOM6 or (if TOOM6 = 0)
TOOM8. This was forgotten when the 6H and 8H code was merged.
Here's is a possible patch set:
diff -Nrc2 gmp
Zimmermann Paul paul.zimmerm...@loria.fr writes:
in GMP 5.1.0, a multiplication of n x m limbs for m n can be slower than
a multiplication of n x n limbs. Compare for example the line starting with
775660 in the first output from speed, and the one starting with 100 in
the second one
ni...@lysator.liu.se (Niels Möller) writes:
And for neon instructions, cycle numbers are in
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf
Page?
Seems it should be able to do one vmull per cycle. Not sure how to
get latency from the given
ni...@lysator.liu.se (Niels Möller) writes:
In Chapter 3, multiplication instructions listed in a table starting on
page 3-14. But now I see I read the entry for a smaller data size. For
32-bit inputs, it's apparently 2 cycles, not 1.
It seems to be 2 cycles indeed:
.text
The corresponding code sustains one vmull.u32 per cycle on A15. That's
4 times the bandwidth of its umul implementation.
It is usually tricky to make use of SIMD operations for addmul_(k) and
friends. The well-designed ARM instructions will surely make it easier,
but it might still require many
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
But IIUC, we are thus performing a 32 x 32 - 64 mul per cycle.
Can one stick addition here without consuming cycles?
As I understand the manual, operations in the main cpu can be done
There are a few aspects worth noticing for prospective Neon hackers:
There are 32 64-bit register, available both in in VFPv3-D32 and Neon.
(There are IIRC at least 4 levels of FP support, VFP, VFPv2,
VFPv3-D16, and VFPv3-D32... I've seen references to VFPv4 too.)
In Neon, the registers can
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
I played with vmlal.u32 on A9 and A15. Surprisingly, both CPUs are very
cooperative in that the accumulation dependency is very shallow.
Nice. Is the same true for the non-simd umaal instruction
David Miller da...@davemloft.net writes:
From: Michael Mohr m...@linuxcertified.com
Date: Mon, 14 Jan 2013 12:10:30 -0800
At least for Android this may not be an option. It provides a
cpufeatures library for use at runtime. I suggest using a configure
option which explicitly
ni...@lysator.liu.se (Niels Möller) writes:
Hmm, if I understand you correctly, it is preferable if the cpu can
start doing the multiplication without any dependency on the carry from
previous iteration, right? At least in theory, umaal could be
implemented in such a way.
IIRC, the
Mike Frysinger vap...@gentoo.org writes:
On Thursday 10 January 2013 15:05:12 Torbjorn Granlund wrote:
Mike Frysinger vap...@gentoo.org writes:
looks like support for the x32 ABI slipped through the cracks. here's
the patch updated to latest hg branch (http://gmplib.org:8000/gmp
Zimmermann Paul paul.zimmerm...@loria.fr writes:
it seems the speed command considers for mpn_set_str the size of the *input*,
i.e., the number of *bytes* of the input string, and not the number of limbs
of the corresponding mpn number.
There are a couple of quirks in tune/speed program
Marc Glisse marc.gli...@inria.fr writes:
Experimentally (for me, others can have wildly different use cases),
what really helps for small nums isn't to save space, it is to not
allocate at all, for instance storing a few limbs in the base type and
only allocating when that's not enough
Most people seem to prefer splitting the alloc data over a field in the
structure and a field before mp_d[0], while I think that'd be slower as
well as more complex than the floating-point coding I have in mind.
My proposal will make allocations somewhat inaccurate, but I'll make
sure it is never
David Miller da...@davemloft.net writes:
BTW I noticed the following while researching and working on this
T4 stuff:
http://download.intel.com/embedded/processor/whitepaper/327831.pdf
Note that there are no subtract equivalents being added of these new
add-with-carry instructions.
ni...@lysator.liu.se (Niels Möller) writes:
Are the thresholds for mpn_set_str compared to input size or (estimated)
output limb count? I imagine optimal threshold may also be
base-dependent.
Input size. The thresholds are done for base 10. No work has been done
to be clever about this
Marc Glisse marc.gli...@inria.fr writes:
Then we might as well always put it in _mp_d[-1], no? IIRC that's what
mpfr does.
I see some problems with that:
* There is a serial memory dependency problem, making the allocation
check at least one more cache latency away. I don't think OoO
We have discussed, perhaps mostly off-list, to fix several nuisances in
the library that will break binary compatibility.
There is actually a terse page about it (linked from the developer's
corner): https://gmplib.org/devel/incompatibility.html
I would like to lift the precision limit (for
David Miller da...@davemloft.net writes:
Thanks for your help, the following works. I'll work on unrolling
and scheduling it.
PROLOGUE(mpn_sub_nc)
ba,pt %xcc, L(ent)
xorcy, 1, cy
EPILOGUE()
PROLOGUE(mpn_sub_n)
mov 1, cy
L(ent): cmp %g0,
David Miller da...@davemloft.net writes:
Each load can issue in 1 cycle, there is a 4 cycle latency, the
loads will fully pipeline. Therefore the overhead is around 3n.
At most one memop / cycle?
Our current Karatsuba code (evaluating in 0, -1, oo) will suffer from
the forgotten
I just realised that demos/calc/calc.{c,h} and demos/calc/calclex.c are
generated files.
Are there any reasons to keep such files in the repo?
Having them in releases (and snapshots) makes some sense, since that
avoids a build dependency on bison and flex.
--
Torbjörn
ni...@lysator.liu.se (Niels Möller) writes:
Are there any reasons to keep such files in the repo?
I don't think so.
OK, so I went ahead and removed them. There should be no difference to
releases.
I discovered this when I found that the release scripts made tar files
with updated
ni...@lysator.liu.se (Niels Möller) writes:
bodr...@mail.dm.unipi.it writes:
About mpz_combit code, may I suggest replacing the two lines
MPN_NORMALIZE (dp, dsize);
ASSERT (dsize 0);
with
MPN_NORMALIZE_NOT_ZERO (dp, dsize);
?
Conciseness is nice. Is
bodr...@mail.dm.unipi.it writes:
Ciao,
Il Gio, 27 Dicembre 2012 12:12 pm, Niels ha scritto:
Torbjorn Granlund t...@gmplib.org writes:
I thought we had discussed this. The change is fine, optimising the
common case is a great thing in general.
Unfortunately, the common case
ni...@lysator.liu.se (Niels Möller) writes:
Zimmermann Paul paul.zimmerm...@loria.fr writes:
I'm glad you asked (on the AMD processor):
frite% gcc -I$GMP/include -O2 -g e.c $GMP/lib/libgmp.a
frite% ./a.out
GMP: header 5.0.5, library 5.0.5
mpz_jacobi: 408ms
Roberto Bagnara bagn...@cs.unipr.it writes:
I have just finished reading Understanding Integer Overflow in C/C++,
by Will Dietz, Peng Li, John Regehr, and Vikram Adve
(http://www.cs.utah.edu/~regehr/papers/overflow12.pdf).
On page 9, it says:
Finally, we reported nine undefined
Marc Glisse marc.gli...@inria.fr writes:
On Tue, 20 Nov 2012, Roberto Bagnara wrote:
I have just finished reading Understanding Integer Overflow in C/C++,
by Will Dietz, Peng Li, John Regehr, and Vikram Adve
(http://www.cs.utah.edu/~regehr/papers/overflow12.pdf).
On page 9, it
Vincent Lefevre vinc...@vinc17.net writes:
Concerning the mul_i.h problem, it was fixed there, AFAIK:
2012-02-09 Marc Glisse marc.gli...@inria.fr
* gmp-impl.h (ABS_CAST): New macro.
* mpf/cmp_si.c: Use ABS_CAST.
* mpf/get_si.c: Use ABS_CAST.
*
ni...@lysator.liu.se (Niels Möller) writes:
I was been thinking of the following algorithm (I think I wrote about
the mod variant a while ago).
Say we want to divide by an n-limb number D, and for simplicity, assume
that D is normalised.
First precompute the inverse v =
ni...@lysator.liu.se (Niels Möller) writes:
Trying
$ ./speed -o cycles-broken -s 1-30,100,200 -r mpn_mullo_n mpn_binvert
it appears binvert is up to 5 times slower than mullo for just a few
limbs, 2 times slower for n = 10, and roughly 50% slower in the range
between 20 and 200
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
I've tried rewriting mpz_combit. Two reasons: 1. To reduce overhead int
he common case. 2. I just realized that the common case, for both
positive and negative numbers, is to complement the
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
0. Support in speed, for benchmarking.
Not checked in yet, but here are some benchmark numbers, comparing to
binvert:
Do you have any interpretation of these numbers?
I am hacking out the
ni...@lysator.liu.se (Niels Möller) writes:
Perhaps you could add some of your remarks as a comment to the file?
Makes sense. Not tonight, though.
I saw your edits. And I edited them sligtly, adding some more FIXMEs
and clarifying the operation of binv_root. Please read and tell me
ni...@lysator.liu.se (Niels Möller) writes:
Sounds right. Such convolution type sum-of-products might want a
separate function, returning 3 limbs.
But I'm not talking about a convolution, but about the complete
powering. If x is a candidate nth root, and x^n is of size s bits, I
ni...@lysator.liu.se (Niels Möller) writes:
I guess this means: Avoid using functions from libm, and avoid using
floating point in critical loops. Right? There's a (non-critical)
floating point operation in mpn_perfect_power_p involving some logarithm
table.
Right. I would actually
ni...@lysator.liu.se (Niels Möller) writes:
Ah, and if we're brain storming, yet another variant would be to compute
an extra limb (or maybe only half a limb) of precision when we compute
the 2-adic k'th root. If the extra limb is non-zero, then the input can't
be a k'th power. That's
ni...@lysator.liu.se (Niels Möller) writes:
Do you think we should have an advertised binv_sqrt function returning
a^{-1/2}? (And if so, should we have something analogous also for
euclidean square root? And for nth roots?)
Perhaps. A start would be to advertise it internally.
My
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
How should we solve this? It is tempting to let mpz_perfect_power_p
return p, but unfortunately it has type 'int' which might be too small.
Which type would be right? mp_bitcnt_t?
Yes
I recently played with the Quadratic Sieve (together with Niels) making
some use of GMP, but of course the bulk of the work in QS is done
sieving using 8-bit numbers.
QS doesn't factor perfect powers, t = a^p. We can detect such numbers
quickly using mpz_perfect_power_p. But we cannot get any p
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
Some other test cases fail, though: t-root, t-perfpow, t-divis, and
t-cong. I'm not familiar with this code, I had a quick look at the root
code but didn't see any obvious dependency on bdiv.
ni...@lysator.liu.se (Niels Möller) writes:
How do you define hensel square root with remainder? Given a and n, if
there exists an x such that x^2 = a (mod B^n), that seems like the
reasonable definition of the square root. But what if no such x exists;
where should we put the remainder
Torbjorn Granlund t...@gmplib.org writes:
iterate nn-dn times
q = np[0] * dinv
hi = mpn_addmul_1 (np, dp, dn, q)
hi += cy
cy = hi cy // can this be true?
hi += np[dn]
cy2 += hi np[dn]
np[dn] = hi
np += 1;
I wrote a possible start
ni...@lysator.liu.se (Niels Möller) writes:
I think it makes sense to have --disable-assembly disable *all* use of
assembly source files.
Maybe. I suppose the file now in question is used only by
tests/devel/try for detecting ABI breaches. Compiler generated code
should not need such
Marc Glisse marc.gli...@inria.fr writes:
I just added a new symbol to libgmpxx. Should I touch one of the
LIBGMPXX_LT_* in Makefile.am, mark somewhere that it may need to be
done prior to release, or ignore the (non-)issue?
You might fill in the table in Makefile.am to make sure a CXX
Marc Glisse marc.gli...@inria.fr writes:
Ok, I did that. And noticed that, although LIBMP_LT_* are still there,
mpbsd was removed a while ago (I had completely forgotten about
that...).
There always seem to be a libmp bone remaining.
--
Torbjörn
Marc Glisse marc.gli...@inria.fr writes:
On Tue, 19 Jun 2012, Torbjorn Granlund wrote:
I don't think we require C99 for GMP, only C89.
The ldexpl function was introduced with C99.
Would mpz_get_ld become acceptable with an appropriate autoconf test
for ldexpl and a silly loop
bodr...@mail.dm.unipi.it writes:
tmp-amd64call.s:120: error: unknown opcode 'movq'
...
Even with the --disable-assembly option given to configure, make check
is asking to the compiler to assemble a .s file. It didn't happen with the
--host=none option given to GMP-5.0 .
If you have
I don't think we require C99 for GMP, only C89.
The ldexpl function was introduced with C99.
--
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
bodr...@mail.dm.unipi.it writes:
You mean the BPSW primality test?
I wasn't aware of this BPSW suggestion.
What I am suggesting is similar, but I am not familiar with the Lucas
test. I also think one should do more trial dividing, and let its
limits be operand dependent.
Some months ago,
bodr...@mail.dm.unipi.it writes:
I'd like to keep the Return 2 if n is definitely prime feature of the
current _probab_prime_ function.
Yes, that would be nice.
2. mpz_notdiv_pprime_p, like mpz_pprime_p but without trial dividing.
Do we really need a new function for this? An
I have read up on this subject.
Selfridge and Pomerance have noticed that an M-R test with base 2 plus
a Fibonacci test (Fib_n + Legendre(n,5) == 0 (mod n)) seem to correctly
identify composite numbers. They will give USD 620 to anyone who
finds a composite identified as a prime, or proves that
ni...@lysator.liu.se (Niels Möller) writes:
The old_size argument should be removed also from the free function, not
just realloc.
Fixed.
Add return value to mpz_sqrt, for consistency with mpz_root (assuming
such a flag can be returned very cheaply, which I think it can).
Added.
ni...@lysator.liu.se (Niels Möller) writes:
The point, as I understand it, is to collect various problems where the
fixes imply interface changes, and which hence are not going to happen
in any release currently planned. So we don't forget them, if/when we
ever to planning a release which
Marc Glisse marc.gli...@inria.fr writes:
Naive question:
are there many ABIs where adding a return type breaks compatibility?
Probably few. But it breaks source code compatibility, if it is called
via a typed function pointer. Our own tests/mpz/reuse.c would need
modifications...
--
bodr...@mail.dm.unipi.it writes:
Toom-2 and Toom-3 code for fat builds should be basically equivalent to
the code used by tuneup, now: the thresholds are global variables, and
threshold checking consist in a single comparison.
We may extend the mechanism to other toom thresholds now...
bodr...@mail.dm.unipi.it writes:
I pushed a few changes to the MAYBE_ mechanism of lower degree toom
functions, and to the ABOVE_THRESHOLD macro (only for GCC).
The effect of the change can be measured looking at Hydra's coverage report:
- before the change we had a total of 43613
bodr...@mail.dm.unipi.it writes:
Many macros in our code assume that thresholds are compile-time constants,
e.g. the MAYBE_ machinery in toom code, but also the ABOVE_THRESHOLD
macro, with its three stage comparison...
#define ABOVE_THRESHOLD(size,thresh)
Richard Henderson r...@twiddle.net writes:
arm.com has them, free registration required.
Found it. Had to click through some zany revision numbers, making me
thing these were revision guides, not proper documentation.
Table B.5. Multiplication instruction cycle timings
Instruction
Richard Henderson r...@twiddle.net writes:
On my system, umaal has a latency if 3, whatever dependencies I create.
(There are 4 input regs and 2 output, so there are quite a few
possible dependency combinations; I only tried a subset.)
Either the docs are plain wrong, or there are
Richard Henderson r...@twiddle.net writes:
Indeed I know that the hw registers that allow such recognition
are all privileged. For linux they best one can do is /proc/cpuinfo
or (to some extent) the values in AT_HWCAP.
Something portable would be nice...
FYI, I dug out the
Richard Henderson r...@twiddle.net writes:
On 04/23/12 07:49, Torbjorn Granlund wrote:
Do you know the repeat rate of umull, umlal, umaal, assuming no reg
dependencies?
For a8: 3 cycles.
For a9 it seems to be 2 cycles, so 3.25 c/l for the current addmul_1 is
not very good.
I have
Richard Henderson r...@twiddle.net writes:
I used the following, almost certainly not appropriate for general
application.
[snip]
Thanks. I would be very useful to make GMP timing work with the kernel
Linux running om ARM. Do you know if there are similar problems with,
say, NetBSD?
I
bodr...@mail.dm.unipi.it writes:
Yet, it fits in the result area and is overwritten by it.
Such small overhead is impressive!
The result area is preallocated with an overestimate: N*3/2 bits ...
Now I am less impressed. :-)
Not storing the full bit-array is a possible improvement for
Richard Henderson r...@twiddle.net writes:
Three patches herein. If there's a better way to submit patches,
please advise; I've never used hg before.
The first patch gives gcc control over ctz/clz. Particularly for
armv6t2 and later, which have rbit for use for ctz.
The second
The mpn_copyi I checked in yesterday is broken, it does not copy things
strictly incrementing.
Unfortunately, fixing it probably requires a rewrite, which will take a
couple of days.
Affected machines are all x86-64 Intel CPUs except Pentium 4, and VIA
Nano.
--
Torbjörn
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
Are you aware if any more files or functions that are unused?
I suspect quite a lot of the new (since a couple of years now...)
division code is not used by any advertised mpn_function.
*One* C
I'd like to locate unused code in GMP.
Unused code adds size to the source, which is not too bad, but if it is
compiled, it adds size to the compiled library, in particular to the
shared library (which is always mapped in its entire when a process
linked to it is running).
My current list is
bodr...@mail.dm.unipi.it writes:
It is a different information, but a code coverage tool run during the
testsuite tells which functions are never called there (on one specific
platform, with one set of options):
http://hydra.nixos.org/build/2398430/download/1/coverage/
Very
Next year, Intel will add an instruction MULX that will be extremely
useful for GMP's most critical functions.
It is a 64x64-128 bit multiply with two separate explcit destination
operands and one explcit source operand and one implicit source operand
(in rdx).
The instruction does not clobber
Torbjorn Granlund t...@gmplib.org writes:
If GCC generates code that takes advantage if this rule, then I expect
no problems at any standard platform.
GCC 4.7 on FreeBSD generates such code fpr bdiv_qr.c. This is not a
leaf function, so rsp is moved down before the calls. But the fact
bodr...@mail.dm.unipi.it writes:
For future releases, it might be interesting to just have a single code
for processor recognition; instead of two. Do you think it would be
possible?
You mean the code in config.guess/configure.in and
mpn/x86{,_64}/fat/fat.c?
It would be desirable to
I believe we will need a GMP 5.0.5 release.
(1) GMP 5.0.4 recognises AMD 11h processors as 10h (aka K10), then
making use of the LZCNT instruction. Unfortunately, 11h processors
while later than 10h processors, don't implement all K10
instructions, and also don't raise an illegal
I reran GMPbench with current head repo sources and got a surprise.
There is a large leap in performance for AMD K10, and nice improvements
for a lot of other CPUs as well, compared to GMP 5.0.
AMD K10 went from 2985 to 3500 with improvements at every measured
point.
The thing is that I am not
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
I pushed some new gcd_1 code for x86_64, helping AMD k10 and bulldozer,
Intel conroe/penryn, nehalem/westmere, and sandybridge, and VIA nano.
What algorithm do you use?
Plain binary
I pushed some new gcd_1 code for x86_64, helping AMD k10 and bulldozer,
Intel conroe/penryn, nehalem/westmere, and sandybridge, and VIA nano.
The actual code lives in x86_64/core2, other CPUs either inherit it or
grab it from there.
Additional improvements are welcome.
One idea:
Before
I just realised a platform where your karatsuba addmul_2 could help a
lot: atom-64. There, addmul_2 currently runs at 20 c/l. The pipeline
is two-way superscalar and in-order. The mul insn can be repeated
someting like one every 16th cycle.
There is a dual-core atom-64 in the test systems
bodr...@mail.dm.unipi.it writes:
The first one is the one I prefer. Has this problem been addressed in the
alimb branch?
We generalised this and separated long and size arithmetic from limb
arithmetic.
More work is needed on the alimb code, and help is most welcome. I
think Per Olofsson
ni...@lysator.liu.se (Niels Möller) writes:
One can have alternative functions accepting mp_limb_t and
mp_limb_signed_t, if one finds that useful.
That would be nice, and so would functions accepting long long.
But unfortunatly, neither is possible if we want one GMP interface per
release,
bodr...@mail.dm.unipi.it writes:
Again, Niels propose an elegant way out: we can let inverses mod 0
undefined. To obtain this we can keep the old library code, and patch only
the recent testing program and documentation.
I vote for this option.
Me too.
The behaviour of computations
Three powerpc systems report failure this morning. I suspect they would
have reported failure already yesterday, if compilation hadn't failed
due to a missing file...
The failures seem to be for small cases, for which the code
(mpz/lucnum2_ui.c) uses a dumpmp/mini-gmp generated table.
I
ni...@lysator.liu.se (Niels Möller) writes:
I'm not entirely sure I understand fib-gen is supposed work. luc_limit
is only assigned like this, in gen-fib.c:generate,
if (mpz_cmp (l, limit) 0)
luc_limit = i-1;
Looking at mini-gmp.c:mpz_cmp, I've spotted
There is a systematic problem in mini-gmp.c when MPZ_REALLOC is called
when a destination variable is the same as some other (source or
destination) variable.
After MPZ_REALLOC, all cached pointers must be considered to be defunct.
I've spotted this error in 4 functions, but I haven't made a
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
After MPZ_REALLOC, all cached pointers must be considered to be defunct.
It seems I was only paying attention to the destination pointer. I'll
go through all uses of MPZ_REALLOC
ni...@lysator.liu.se (Niels Möller) writes:
Will you commit these fixes, or do you want me to do that?
Please take care of any fixing. It's your baby after all. :-)
I have found the same four direct MPZ_REALLOC problems when reviewing
the code: mpz_abs_add, mpz_and, mpz_ior and
I pushed new x86_64 assembly making use of 128-bit instructions working
on xmm registers. While all x86_64 processors probably support the
instructions used, some have less throughput using these than when using
plain 64-bit instructions.
The idea is to include these just before x86_64 in the
Torbjorn Granlund t...@gmplib.org writes:
carry-in lo in r14
carry-in hi in rcx
mov 0(up), %rax
mul v1
mov 8(rp), %r8
add %rax, %r8
mov %rdx, %r9
adc $0, %r9
mov 8(up), %rax
mul v0
201 - 300 of 335 matches
Mail list logo