Re: How to calculate cycles/limb in assembly routines

2024-04-04 Thread Torbjörn Granlund
Albin Ahlbäck writes: I am looking at Torbjörn's `aorsmul_1.asm' for Apple M1, and I am having trouble understanding how the cycles per limb number was calculated. It was not calculated. It was measured. But that asm code was written with understanding of the M1 pipeline. Here, 1.25 c/l

Re: Has there been historical work done to investigate small integer optimization?

2024-02-12 Thread Torbjörn Granlund
marco.bodr...@tutanota.com writes: But implementing it with the current mpz type is "impossible". I mean, one should break the current interface. Currently, _mp_d is always a pointer AND it always point to a readable limb. Even if _mp_alloc is zero. If we set alloc = 0 and size >= 2^30,

Re: What's a reasonable size ratio for toom32?

2023-11-19 Thread Torbjörn Granlund
Niels Möller writes: I'd like to get the changes back in, piece by piece... Sounds good! The below patch is the change to add /r syntax, and enable it only for mpn_mul and mpn_mul_basecase. Seems to work for me, tested by running ./tuneup and ./speed -r -s 10-500 -f 1.2 -C

Re: What's a reasonable size ratio for toom32?

2023-10-19 Thread Torbjörn Granlund
Niels Möller writes: Looks like tuning MUL_TOOM42_TO_TOOM63_THRESHOLD crashes. Even though I can measure these independently using speed. I can't debug this further at the moment, so I'm reverting these changes for now. I can confirm that the tuneup program now works. (It took a few

Re: What's a reasonable size ratio for toom32?

2023-10-16 Thread Torbjörn Granlund
Niels Möller writes: I can't debug this further at the moment, so I'm reverting these changes for now. Thank you! The abort() happens in tuneup.c's one() function. (I didn't analyse it beyond running gdb to see which abort() was called.) -- Torbjörn Please encrypt, key id 0xC8601622

Re: What's a reasonable size ratio for toom32?

2023-10-15 Thread Torbjörn Granlund
Niels Möller writes: Pushed now. I've done some benchmarks on shell, on tip-of-tree GMP (no local changes). See numbers at the end of this message, comparing mpn_mul_basecase, mpn_toom22_mul and mpn_toom32_mul. Every single tuneup invocation made by the nightly builds in the last few

Re: [PATCH] Fix typo in z13 bdiv_dbm1c.

2023-10-10 Thread Torbjörn Granlund
Stefan Liebler writes: In my case, before calling mpn_divexact_by3c v6 was {0x1, 0x0} and thus mpn_bdiv_dbm1c returns 0x1, which is ANDed with 3 and then also returned by mpn_divexact_by3c. Therefore the test fails by chance. OK. We should clearly have tests/*call.asm and tests/*check.c

Re: [PATCH] Fix typo in z13 bdiv_dbm1c.

2023-09-30 Thread Torbjörn Granlund
Stefan Liebler writes: The returned limb was retrieved from v6 instead of v2. This is also observable in failing testcase t-fat. While I agree the code looks suspicious, I fail to trigger any test suite failure for any Z/arch configuration. How exactly did you trigger it? (I try to avoid

Re: [PATCH] Revert "Move popcount and hamdist back from z14 to z13 after needed edits."

2023-08-03 Thread Torbjörn Granlund
I committed your patch (while also impersonating you). -- Torbjörn Please encrypt, key id 0xC8601622 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel

Re: [PATCH] Revert "Move popcount and hamdist back from z14 to z13 after needed edits."

2023-08-03 Thread Torbjörn Granlund
Stefan Liebler writes: Thanks for that. In the meantime, gmp 6.3.0 + this patch should be used. Do you have plans for 6.3.1? We tend to make point releases. For me "qemu-system-s390x --cpu help" at least lists some z13 models, but to be honest, I don't use it. I have to check. Yes,

Re: [PATCH] Revert "Move popcount and hamdist back from z14 to z13 after needed edits."

2023-08-03 Thread Torbjörn Granlund
Stefan Liebler writes: Unfortunately not only the extended mnemonics are not available with z13, but also vpopct M3=1-3 is reserved. Thus you'll get an illegal-instruction if run on z13 as vector enhancement facility 1 (introduced with z14) is not available. Ah, darn. This will need to

Re: Why different runtimes for constant exponent and (huge) modulus for mpz_powm()?

2023-07-26 Thread Torbjörn Granlund
This thread does not seem relevant to gmp-devel readers. Please move it elsewhere, perhaps gmp-discuss.

Re: AMD Ryzen 5 7600X gmpbench submission (result 8010.2) -- new: 9176.2

2023-07-08 Thread Torbjörn Granlund
The result is 14.5% increase in GMPbench value, new value is 9176.2 ! That's much better, but still not as good as it should be with some optimisation. Specifically, there seems to be something slightly off with the multiply performance. Unfortunately, I don't have Zen4 hardware, nor do I the

Re: AMD Ryzen 5 7600X gmpbench submission (result 8010.2)

2023-07-08 Thread Torbjörn Granlund
herm...@stamm-wilbrandt.de writes: I recently bought a new 7600X PC that was not-too-expensive (619$), but is rank 18 on PassMark's Single Thread performance list of >3100 CPUs: [snip] GMPbench: 8010.2 Interesting! I would hope Zen 4 could perform much better than that, though.

Re: [PATCH v3 0/4] Add addmul_1, addmul_2, and mul_basecase for IBM z13 and later

2023-06-23 Thread Torbjörn Granlund
These improvements are now (finally!) in the GMP repo. I have not run any timing tests, as I trust you to worry about the performance. A mistake we GMP developers have made in the past is counting cycles for inner loops for quite large trip counts, and then accidentally adding overhead as a side

Re: Requests from Microsoft IP Addresses

2023-06-19 Thread Torbjörn Granlund
Pedro Gimeno writes: Stupid question maybe but, couldn't the online getbundle command be disabled? We don't want to disable functionality for legitimate use. If I ever get the time, I will assign more CPU cores to handling clone commands. It is not just bumping a number somewhere, it

Re: Requests from Microsoft IP Addresses

2023-06-19 Thread Torbjörn Granlund
I think we now understand what happened. It has everything to do with how Github works. It started when the FFmpeg project changed their CI script to clone GMP for each of their checkins. This is a bad idea, but not a terrible idea. (The right way would be to keep a local checkout and pull

Re: Requests from Microsoft IP Addresses

2023-06-18 Thread Torbjörn Granlund
I added "Usage conditions" to . "These resources are open to the public. Yet, we expect the usage of these resources to be used responsibly. Repeated clone command should be avoided. Scripting of clone commands is strongly discouraged. As a

Re: Requests from Microsoft IP Addresses

2023-06-18 Thread Torbjörn Granlund
Niels Möller writes: Would have been helpful with a specific reference to the user and script in question. I would guess this revert is intended to stop the hg clones (instead getting release tarballs from ftp.gnu.org):

Re: Requests from Microsoft IP Addresses

2023-06-18 Thread Torbjörn Granlund
Marc Glisse writes: One thing that should be doable is set up a mirror of GMP's repository on github, and advertise that one for CI purposes. Any user could do that (there are already a few), but if it was advertised on the GMP website, it would be more likely to be used by more people.

Re: Requests from Microsoft IP Addresses

2023-06-18 Thread Torbjörn Granlund
Niels Möller writes: Do we have more capacity to serve the nightly tarballs to lots of clients (at least, they shouldn't require any CPU compression work per request)? We do indeed have much more capacity for that. But the 1 GbE connection would quickly be saturated if a few more people

Re: Requests from Microsoft IP Addresses

2023-06-17 Thread Torbjörn Granlund
Mike Blacker writes: Microsoft and GitHub have investigated the issue and determined that a Github user updated a script within the FFMPeg-Builds project that pulled content from https://gmplib.org. This build was configured to run parallel simultaneous tests on 100 different types of

Re: Requests from Microsoft IP Addresses

2023-06-17 Thread Torbjörn Granlund
Mike Blacker writes: Microsoft and GitHub have investigated the issue and determined that a Github user updated a script within the FFMPeg-Builds project that pulled content from https://gmplib.org. This build was configured to run parallel simultaneous tests on 100 different types of

GMP servers are under DoS attack from Microsoft

2023-06-16 Thread Torbjörn Granlund
The GMP servers are under attack by several hundred IP addresses owned by Microsoft Corporation. We do not know if this is made with malice by Microsoft, if it is some sort of mistake, or if one of their cloud customers is running the attack. The attack targets the GMP repo, with thousands of

Re: Status of latest release for macos and arm64?

2023-05-10 Thread Torbjörn Granlund
Niels Möller writes: Hi, do I get it right that latest stable release (6.2.1) is not quite working on Macos arm64? With the "register x18" known issue listed at https://gmplib.org/#STATUS, fixed in https://gmplib.org/repo/gmp-6.2/rev/f4ff6ff711ed (and similar fix in the main repo).

Re: Fast constant-time gcd computation and modular inversion

2022-09-04 Thread Torbjörn Granlund
Marco Bodrato writes: We should start writing mpn_sec_binvert :-) I think mpn_binvert is almost sec_ naturally. The exception is when sbpi1_bdiv_q.c or dbpi1_bdiv_q.c are invoked. The former has some < on data (for carry computations) and the latter has a mpn_incr_u which is very leaky.

Re: Fast constant-time gcd computation and modular inversion

2022-09-01 Thread Torbjörn Granlund
/* FIXME: Using mpz_invert is cheating. Instead, first compute m' = m^-1 (mod 2^k) via Newton/Hensel. We can then get the inverse via 2^{-k} (mod m) = ((2^k - m') * m + 1)/2^k. */ mpz_invert (t, t, m); mpn_copyi (info->ip, mpz_limbs_read (t), mpz_size (t)); You might
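The identity in that FIXME comment can be checked on single-word values. The following is only a sketch under stated assumptions: k is fixed at 32, m is odd with 1 < m < 2^32, and the function names `inv_mod_2_32` and `inv_2_32_mod_m` are made up for this illustration (they are not GMP functions). It works because m'*m = 1 + q*2^k for some q, so (2^k - m')*m + 1 = 2^k*(m - q), which is divisible by 2^k and congruent to 1 modulo m.

```c
#include <stdint.h>

/* Inverse of odd m modulo 2^32 by Newton/Hensel lifting.
   x = m is already correct to 3 bits, since m*m == 1 (mod 8) for odd m.
   Each step x <- x*(2 - m*x) doubles the number of correct low bits:
   3 -> 6 -> 12 -> 24 -> 48, so four steps are enough for 32 bits.  */
static uint32_t inv_mod_2_32(uint32_t m)
{
    uint32_t x = m;
    for (int i = 0; i < 4; i++)
        x *= 2u - m * x;          /* arithmetic is modulo 2^32 */
    return x;
}

/* 2^{-32} mod m for odd m, 1 < m < 2^32, via
   2^{-k} = ((2^k - m') * m + 1) / 2^k  with  m' = m^{-1} mod 2^k.
   The numerator is at most (2^32 - 1)^2 + 1, so it fits in 64 bits.  */
static uint32_t inv_2_32_mod_m(uint32_t m)
{
    uint64_t mp = inv_mod_2_32(m);
    uint64_t t = ((UINT64_C(1) << 32) - mp) * m + 1;  /* exact multiple of 2^32 */
    return (uint32_t)(t >> 32);
}
```

For m = 7 this gives m' = 0xB6DB6DB7 and a result of 2, and indeed 2 * 2^32 = 1 (mod 7). A multi-limb version would presumably use mpn_binvert for the first step; that part is not shown here.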

Re: Fast constant-time gcd computation and modular inversion

2022-08-31 Thread Torbjörn Granlund
Much more unclear to me how close it might be to the typical or average number of iterations needed. That's perhaps not very interesting, as early exit is not an option here. (Unless this algorithm would beat plain, "leaky" inverse.) Currently uses exponentiation for the field inverse,

Re: Fast constant-time gcd computation and modular inversion

2022-08-31 Thread Torbjörn Granlund
> count = (49 * bits + 57) / 17; > > Odd. For sure. This isn't based on local progress of the algorithm (there ain't no guaranteed progress for a short sequence of reduction steps), but based on rather complex analysis of the number of steps needed for the complete 2-adic

Re: Fast constant-time gcd computation and modular inversion

2022-08-24 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: count = (49 * bits + 57) / 17; Odd. (For production code, one will need to cap the intermediates there, at least for 32-bit machines. Perhaps something like this: count = (51 * bits - 2 * bits + 57) / 17 = = 3 * bits - (2 * bits -
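One way to keep the intermediates small, shown here only as a sketch and not necessarily the rewrite the message goes on to propose (the snippet is cut off), is to split bits as 17*q + r. Then 49*bits + 57 = 17*(49*q) + (49*r + 57), and the second term is at most 49*16 + 57 = 841. The function name `iteration_count` is hypothetical.

```c
#include <stdint.h>

/* floor((49*bits + 57) / 17) without ever forming 49*bits.
   Valid whenever the final count itself fits in 32 bits, which holds
   for every bits <= 1000000000 and a good way beyond that.  */
static uint32_t iteration_count(uint32_t bits)
{
    uint32_t q = bits / 17;
    uint32_t r = bits % 17;              /* 0 <= r < 17 */
    return 49 * q + (49 * r + 57) / 17;  /* 49*r + 57 <= 841 */
}
```

The direct 32-bit expression (49 * bits + 57) / 17 already wraps once bits exceeds roughly 87 million, while this form only wraps when the result itself no longer fits.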

GMP testing

2022-06-19 Thread Torbjörn Granlund
The automated GMP testing has had a momentary hiatus due to a disruptive thunderstorm, further extended by a bad buggy BIOS image. Furthermore, we decided to make testing sparser in order to save electricity. Now, each configuration is tested within 10 days, up from 7 days.

Re: Fast constant-time gcd computation and modular inversion

2022-06-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Extract least significant 96 bits of each number. Is that 3 32-bit limbs or 1.5 64-bit limbs?

Re: Fast constant-time gcd computation and modular inversion

2022-06-03 Thread Torbjörn Granlund
Any algorithm with these properties would be a huge improvement compared to what we have today: 0. No side-channel leakage of anything but bit counts of input operands. (I suppose there are usages where the mod argument is not sensitive, such as for the elliptic curve usages). This is

Re: mul_fft, cleaning up some details of the code

2022-03-06 Thread Torbjörn Granlund
Marco Bodrato writes: This is not really needed to solve a bug. The comment before the mpn_mul_fft_decompose function says "We must have nl <= 2*K*l", this means that we should never have ((dif = nl - Kl) > Kl), and the code in that branch should never be used. I noticed that at some

Re: binvert_limb speedup on 64 bit machines with UHWtype

2022-02-27 Thread Torbjörn Granlund
John Gatrell writes: I think you missed why the 0x7F is unnecessary. If you start with 8 bits and divide by 2 then the top bit must become zero. gcc does this itself and suppresses the 0x7F. So this idea will help other compilers start with 8-bits to achieve the same. The same trick can

Re: binvert_limb speedup on 64 bit machines with UHWtype

2022-02-27 Thread Torbjörn Granlund
John Gatrell writes: I noticed that replacing '(n/2)&0x7F' with '(unsigned char)n/2', may give a hint to assembler implementers that the 7F mask is unnecessary. For your consideration It is necessary to portably extract the least significant bits. Perhaps one could write it (n & 0xff)/2
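The three spellings discussed in this thread can be compared directly. This is a small sketch with hypothetical helper names, assuming a 64-bit limb; note that `(unsigned char)n/2` parses as `((unsigned char)n)/2`, and that the cast only extracts exactly 8 bits on platforms where CHAR_BIT is 8, which is presumably the portability concern behind `(n & 0xff)/2`.

```c
#include <stdint.h>

/* Three ways of forming the 7-bit table index from the low byte of a limb. */

/* Shift first, then mask off the bits above bit 6. */
static unsigned idx_mask_after(uint64_t n)
{
    return (unsigned)((n / 2) & 0x7F);
}

/* Truncate to unsigned char first; the division then clears the top bit.
   Only equivalent to the others when unsigned char is exactly 8 bits wide. */
static unsigned idx_cast_first(uint64_t n)
{
    return (unsigned char)n / 2;
}

/* Portable 8-bit extraction, then divide. */
static unsigned idx_mask_first(uint64_t n)
{
    return (unsigned)((n & 0xFF) / 2);
}
```

On the usual 8-bit-char platforms all three return the same value for every n; which one yields the best machine code is a separate question that this sketch does not answer.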

Re: New mulmod_bknp1

2022-02-20 Thread Torbjörn Granlund
Very nice speedups there! I am too busy to examine the code to see what you've done. Perhaps you could outline the algorithms here? Is n = 3^t-k now slower than n' = 3^t for small k (with k mod 3 != 0)? Then we could zero-pad such operands...

Re: mpq_mul_ui

2022-01-24 Thread Torbjörn Granlund
Marc Glisse writes: What would you think of adding mpq_mul_ui, mpq_div_ui, mpq_ui_div, and also the _z versions? That would make sense to me.

[ADMIN] GMP mailing list

2022-01-15 Thread Torbjörn Granlund
The GMP mailing lists have been down for several months, but I am now working on resuming their operation. If you get this message, chances are that things now work again. The background is that a security update to the (FreeBSD) virtual server broke the mailing list software (mailman). I

Re: Suggested tune/tuneup.c patch

2021-10-31 Thread Torbjörn Granlund
Marco Bodrato writes: Is this still active? I can access https://gmplib.org/devel/thres/2021-10-30/ , but I'd like to check how FAC_ODD_THRESHOLD evolves after my last commit to fac_ui... and I find an empty page if I look at https://gmplib.org/devel/thres/2021-10-30/FAC_ODD_THRESHOLD

Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-08 Thread Torbjörn Granlund
Torbjörn Granlund writes: zen1 2 2 2 = all equal (saturated mul) Typo. The last number should be 4, not 2.

Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-08 Thread Torbjörn Granlund
I created an "ununrolled" version, and a 4x unrolled version. I then compared these with some other variants. Here are the results: mul_1 addmul_1 addaddmul result best variant zen1 2 2 2 = all equal (saturated mul) zen2 1.7 2.1

Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-08 Thread Torbjörn Granlund
Your version is faster than my versions (where I tested them). I made some minor changes to your code. 1. Got rid of c1 by moving two adox earlier. That also made for a speedup. 2. Simplified the feed-in code by jumping into the loop for the odd n case. 3. Use rbx for the bp variable as rbp

Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund
And will cause an interesting failure if one can ever afford enough RAM to use an input size larger than 2^63 limbs ;-) Nobody in his right mind will ever need more than 2 EiB of memory. :-) Attaching a version that actually passes some tests (I should commit the unit tests, but not

Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund
Torbjörn Granlund writes: Problem: the adc will write a useless value to the O flag. That is then read by the first adox, yielding incorrect results. Clearing O without creating any (too bad false) dependencies could perhaps be done with an additional dummy adox zero, zero. On 2nd

Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: L(top): mov (ap, n, 8), %rdx mulx %r8, alo, hi adox ahi, alo mov hi, ahi C 2-way unroll. adox zero, ahi C Clears O mov (bp, n), %rdx

Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Here's a sketch of a loop, that should work for both addaddmul_1msb0 and addsubmul_1msb0: L(top): mov (ap, n, 8), %rdx mulx %r8, alo, hi adox ahi, alo mov hi, ahi C 2-way unroll.

Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Gave it a run on my closest x86_64 (intel broadwell, no mulx)), and numbers for mpn_addaddmul_1msb0 are not impressing. Also, it appears mpn_addmul_2 is significantly slower than two addmul_1. I believe addmul_2 is inhibited for that CPU. It

Re: Risc V greatly underperforms

2021-10-06 Thread Torbjörn Granlund
Hans Petter Selasky writes: Then you get a penalty. But the penalty might not be so big assuming random input. Adding one to a number is pretty cheap and you only need to continue traversing the data words making up the number when the increment overflows. Which in turn gets you a

Re: Please update addaddmul_1msb0.asm to support ABI in mingw64

2021-10-06 Thread Torbjörn Granlund
I haven't followed this discussion very closely, and did not see if you have considered the following. OK, so the code is 3-ways unrolled. That's always a bit inconvenient and tends to cause some code bloat. I am pretty sure we have that at least in some other place, but still make all the work

Re: Risc V greatly underperforms

2021-10-06 Thread Torbjörn Granlund
Hans Petter Selasky writes: If the GMP could utilize multiple cores when doing bignum multiplication and addition, I think the picture would look different. For example for addition, you could split the number in two parts, and then speculate if there is an addition for the higher

Re: Risc V greatly underperforms

2021-09-21 Thread Torbjörn Granlund
A carry bit helps for some codes, GMP being a prime example. Keeping carry/borrow conditions in plain registers can be made to work well too. But then you need good ways of computing carry/borrow, and good ways of inputting the carry/borrow result to dependent add/subtract instructions. Risc V
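To make "carry in plain registers" concrete, here is a minimal C sketch of multi-limb addition on a machine without a carry flag. It assumes 64-bit limbs and uses a hypothetical function name `add_n` (mirroring, but not claiming to be, GMP's mpn_add_n). Each limb needs two additions and two unsigned compares to recover the carries, which is the kind of extra work the message refers to.

```c
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] = ap[0..n-1] + bp[0..n-1]; returns the carry-out (0 or 1).
   The carry lives in an ordinary register and is recomputed with
   unsigned compares: a sum wrapped around iff it is smaller than an addend. */
static uint64_t add_n(uint64_t *rp, const uint64_t *ap,
                      const uint64_t *bp, size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = ap[i] + bp[i];
        uint64_t c1 = s < ap[i];   /* carry out of ap[i] + bp[i] */
        uint64_t r  = s + cy;
        uint64_t c2 = r < s;       /* carry out of adding the incoming carry */
        rp[i] = r;
        cy = c1 | c2;              /* both cannot be set at the same time */
    }
    return cy;
}
```

With an add-with-carry instruction the loop body is a single dependent instruction per limb; here the carry chain runs through an add, a compare and an OR.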

Risc V greatly underperforms

2021-09-20 Thread Torbjörn Granlund
It seems safe to assume that most people on this list have heard of Risc V by now, the license-free instruction set. I trust that much fewer have looked at the technical details. I have, though, as we implement critical inner loops for GMP in assembly. My conclusion is that Risc V is a terrible

Re: div_qr_1n_pi1

2021-07-09 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Same as in the current (from 2013) version. Delaying the write is a bit tricky, since we already use all registers. But it would be better to update the quotient limbs in memory only in the unlikely carry-propagation case. I figure adc to memory

Re: div_qr_1n_pi1

2021-07-08 Thread Torbjörn Granlund
I think you should delay writing through QP to avoid adc to a memory place, and have just one plain write through QP per iteration. The dec UN and the branch might run faster if put adjacent to each other, as many CPUs fuse these into a single instruction. Your cycle numbers should probably be

Re: div_qr_1n_pi1

2021-07-04 Thread Torbjörn Granlund
Perhaps we should write a little low-level, m4-based asm compiler? We could define ad-hoc primitives, like some equivalent to the inline asm umul_ppmm, add/subtract with carry, loads, stores, branches, etc. The set of primitives should be separated in must-define and optional. Writing a definition

Re: div_qr_1n_pi1

2021-07-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Looks better, but few tuneup results yet. Method 3 wins on a few machines (https://gmplib.org/devel/thres/DIV_QR_1N_PI1_METHOD), with all of method 1, 2 and 4 appearing as runner up on some machine. The presentation in those "/thres/" pages is a

Re: div_qr_1n_pi1

2021-07-02 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Danger of "easy" last minute changes... Fix pushed now. Thanks, let's see that things clean up. (The autobuild system is daft and still does not run anything as a result of a repo change. It is calendar triggered but can also be run manually.)

Re: div_qr_1n_pi1

2021-06-30 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I'm tempted to commit this code. I.e., new variants (not enabled) + tuneup changes. To see which variants are favorites on the various test machines. Should give some guidance as to what's most promising for assembly implementation. What do

Re: mul_fft

2021-06-30 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I think add_n_sub_n was originally motivated by improved locality (could apply at different levels of memory hierarchy). But maybe we could get close to twice the speed using newer instructions with multiple carry flags (I guess that's what

Re: div_qr_1n_pi1

2021-06-07 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I've tried it out. Works nicely, but no speedup on my machine. I'm attaching another patch. There are then 4 methods: method 1: Old loop around udiv_qrnnd_preinv. method 2: The clever code from 10 years ago, with the microoptimization I

Re: div_qr_1n_pi1

2021-06-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: And I don't quite trust these cycle numbers, they should probably be twice as large, on the order of 10 cycles/limb for all variants. Less than 5 cycles is too good to be true, right? Yes. "Turbo" messes things up. The TSC cycle counter stays it

Re: div_qr_1n_pi1

2021-06-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Maybe we should have some macrology for that? Or do all relevant processors and compilers support efficient cmov these days? I'm sticking to masking expressions for now. Let's not trust results from compiler generated code for these things. The
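A "masking expression" in the sense used here can look like the following sketch (assumed 64-bit limbs, hypothetical function name `cond_sub`). The comparison result is widened to an all-ones or all-zeros mask, and the mask selects whether d is subtracted. Whether a given compiler turns this into straight-line code, a cmov, or a branch is exactly what the reply warns cannot be taken for granted.

```c
#include <stdint.h>

/* Returns x - d if x >= d, else x, written without an explicit branch.
   mask is all ones when the subtraction should happen, otherwise zero.  */
static uint64_t cond_sub(uint64_t x, uint64_t d)
{
    uint64_t mask = -(uint64_t)(x >= d);   /* 0 or 0xFFFF...FFFF */
    return x - (mask & d);
}
```

Inspecting the generated assembly is the only way to know what the compiler actually emitted for the comparison.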

Re: div_qr_1n_pi1

2021-06-03 Thread Torbjörn Granlund
Don't forget about the adcx and adox instructions. They might come in handy here.

Re: div_qr_1n_pi1

2021-06-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: The critical path, via the u1 variable, is umul_ppmm (p1, p0, u1, B2); add_mss (cy, u1, u0, u0, up[j], p1, p0); u1 -= cy & d; Nice! The (cy & d) term is multiplied in the next iteration by B2, i.e., we have either the

Re: 2-adic Svoboda

2021-05-03 Thread Torbjörn Granlund
Paul Zimmermann writes: I tried to implement Montgomery-Svoboda at the C level, but did not manage to beat the mpn_redc_x routines. I'm very interested to see your results! Without invariance from e.g modexp, I don't believe one can beat sbpi1_bdiv_r (or the older redc_1). Newer

Re: 2-adic Svoboda

2021-05-02 Thread Torbjörn Granlund
Paul Zimmermann writes: yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it "Montgomery-Svoboda". The quotient selection becomes trivial, which means one can reduce the latency between two mpn_addmul_1 calls. It really becomes a mul_basecase except that the first round

2-adic Svoboda

2021-05-02 Thread Torbjörn Granlund
IIRC, Svoboda's division trick for N/D is to find a small multiplier m such that, for the division mN/(mD) we have mD = 10...XX... with the high limb of mD being 1000...0. This idea works also for 2-adic division. Find m = D^(-1) mod \beta where \beta is the limb base. Then do mN/(mD) or mN

Re: [PATCH] Add Zhaoxin x86 processor support

2021-04-08 Thread Torbjörn Granlund
I try to use paths for zen and k8 compiler parameter. This runs great in test. I now realize that this is a wrong setup. I'm going to modify it with the strategy you recommend. I have attached a script which might be useful for choosing the best existing asm file for the new CPUs. Please

Re: [PATCH] Add Zhaoxin x86 processor support

2021-04-07 Thread Torbjörn Granlund
Let me make this clear: The changes look basically sound. My comments concentrated on things that looked odd to me.

Re: [PATCH] Add Zhaoxin x86 processor support

2021-04-07 Thread Torbjörn Granlund
Some comments about the patches: (1) Why do you set up paths for zen (as a fallback)? Doing that seems wrong unless all these 3 CPUs support every zen instruction. Do they? Also, passing k8 to the compiler and choosing zen asm code makes very little sense to me. If zen makes sense for asm

Re: [PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13

2021-03-10 Thread Torbjörn Granlund
Marius Hillenbrand writes: z14 introduced "alignment hints" for vector loads, where 8-byte aligned reads have more bandwidth (e.g., "vl %v,,3" # 3 for 8-byte alignment, 4 for 16-byte alignment). vlerg does not take these hints. Empirically, I observe a slight advantage for vlerg

Re: [PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13

2021-03-09 Thread Torbjörn Granlund
Marius Hillenbrand writes: One minor proposal (patch to follow): Some versions of GCC only accept the -march=arch variant of the most recent CPU level they support (e.g., GCC-9 accepts -march=arch13 but not -march=z15; 13 as in the 13th edition of the Principles of Operations that

Re: [PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13

2021-03-09 Thread Torbjörn Granlund
I applied patch 1/4 and 2/4 with modifications. Please take a look at the repo code when you have the time. The main change I made to your suggested change is that I added z14 and z15 to the recognised cpu types. I also made z13 a fallback for z14 (if the latter is not understood by tools), and

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-06 Thread Torbjörn Granlund
Marius Hillenbrand writes: z13: introduced the vector extensions PoP: Vector Facility for z/Architecture Linux: vx / HWCAP_S390_VX z14: Vector-Enhancements Facility 1 vxe / HWCAP_S390_VXE z15: Vector-Enhancements Facility 2 (adds VLERG and VSTERG, among others)

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-05 Thread Torbjörn Granlund
From 4x unrolling to 8x unrolling with pipelining we gain ~40%. Wow, that's much more than I would have expected (or actually seen at my end). Which brings up another question: Which zarch pipelines does it make sense to optimise for? I thought I had z196 access, but that is probably not

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-02 Thread Torbjörn Granlund
Marius Hillenbrand writes: Most notably, I changed the order so that the mlgr's are next to each other. The reason is that decode and dispatch happens in two "groups" of up to three instructions each, with each group going into one of the two "issue sides" of the core (both are symmetric

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-01 Thread Torbjörn Granlund
Torbjörn Granlund writes: I played a bit with an addmul_1 of my own, with some ideas from your code. I don't plan to do more work on this. Does this perform well on hardware? I now realise that the instruction sequence of my example is essentially the same as in your code, except

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-01 Thread Torbjörn Granlund
I played a bit with an addmul_1 of my own, with some ideas from your code. I don't plan to do more work on this. Does this perform well on hardware? Note that it only works for n = 0 (mod 4). (Attachment: z14-addmul_1-ur.asm)

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-03-01 Thread Torbjörn Granlund
My analysis and reports in this thread had several problems. For example, I had made shared-lib builds of some qemu images used for "user mode" emulation; that does not work unless the host dynlibs are made available in the guest file system. Trying again: your submul_1 works fine on all tested

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-26 Thread Torbjörn Granlund
Torbjörn Granlund writes: Ehum. That failure was apparently due to a qemu bug. Sorry. I tested qemu 4.2.0 and qemu 5.2.0. The latter does not work at all, not even for "ls" or "cat". The former runs submul_1 for a while, but ultimately reports an error. The error

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-26 Thread Torbjörn Granlund
Torbjörn Granlund writes: Please make sure to test GMP contributions properly. And being honest about the tests performed is critically important. Ehum. That failure was apparently due to a qemu bug. Sorry.

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-26 Thread Torbjörn Granlund
Marius Hillenbrand writes: Done, completed successfully for both mpn_addmul_1 and mpn_submul_1. Are you sure you did that? I tried your submul_1 under qemu (version 4.1.1). It does NOT pass any tests with tests/devel/try. Please make sure to test GMP contributions properly. And being

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-20 Thread Torbjörn Granlund
Marius Hillenbrand writes: Done, completed successfully for both mpn_addmul_1 and mpn_submul_1. Good! I measure ~40% reduction in cycles per limb on z14 and ~60% reduction on z15, for both addmul_1 and submul_1 (smaller gains for small n <10 but no regressions), yet ... Great speedup!

Re: [PATCH] Add optimized addmul_1 and submul_1 for IBM z13

2021-02-17 Thread Torbjörn Granlund
Thanks for contributing to GMP! Marius Hillenbrand writes: These patches add IBM z13 as a new s390_64 CPU level to mpn and add optimized versions of addmul_1 and submul_1 that exploit the SIMD extensions that were introduced with the IBM z13 generation. Both implementations share the same

Re: t-constants FAILs with GMP 6.2.1 on aarch64

2020-12-28 Thread Torbjörn Granlund
[Sorry for the slow reply to your report! I have been quite busy.] I think I understand the problem, but how to fix it is not obvious to me. And GMP_ASM_RODATA likely arrives at the .text result since CFLAGS includes -flto. Is there any reason to use CFLAGS in those tests? It is common

Re: 答复: [PATCH] Add Zhaoxin x86 processor support

2020-12-18 Thread Torbjörn Granlund
DylanFan-oc writes: We are willing to sign the papers. Great! However, we would like to know what we need to do and where to download the FSF paperwork. I will provide paperwork for you. Please bear with me for a couple of weeks; we have holidays coming up over here, and my day work

Re: [PATCH] Add Zhaoxin x86 processor support

2020-12-17 Thread Torbjörn Granlund
Thanks for the GMP contribution! This change is significant enough that we will require FSF paperwork from you and your employer. The paperwork gives the GNU project the legal right to distribute your code. Are you and your employer willing to sign such paperwork?

Re: gcd_11 without abs

2020-11-21 Thread Torbjörn Granlund
Niels Möller writes: And then there will be some conditional operations somewhere. For shortest path, it's best if the code can be arranged to do those operations in parallel with count trailing zeros. Current ARM code (I'm looking at the v6t2 version) does that, with two conditional

Re: gcd_11 without abs

2020-11-21 Thread Torbjörn Granlund
Niels Möller writes: One could get rid of the absolute value by letting one of the working values be negative. Something like this, in pseudo code

  b = -b
  while ( (sum = a + b) != 0)
    {
      if (sum > 0)
        a = sum;
      else
        b = sum;
    }

[snip] That's one
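
A minimal C rendering of the quoted fragment, under stated assumptions: the function name gcd_no_abs is hypothetical, both inputs are assumed positive, and only the subtraction step visible in the quote is implemented. Whatever followed the [snip] in the original message (for instance, removal of trailing zeros) is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Subtractive gcd in which b is kept negative, so no absolute value is
   ever taken. */
int64_t gcd_no_abs(int64_t a, int64_t b)
{
    int64_t sum;

    assert(a > 0 && b > 0);
    b = -b;                        /* from here on, a > 0 and b < 0 */
    while ((sum = a + b) != 0) {
        if (sum > 0)
            a = sum;               /* a > |b|: replace a by a - |b| */
        else
            b = sum;               /* a < |b|: replace b by -(|b| - a) */
    }
    return a;                      /* here a == -b, the gcd */
}
```

With a = 12 and b = 18 the loop sees sum = -6, then 6, then 0, and returns 6. The sign of sum alone decides which working value is updated.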

Re: Revert from backup of main GMP repo

2020-11-21 Thread Torbjörn Granlund
Seth Troisi writes: This dropped the mpz_prevprime commits (the final commit previously had a hash of 970b7221873f) When fires are out I'd appreciate it if they could be committed again. It will be re-committed. I hope Marco will take care of that.

Revert from backup of main GMP repo

2020-11-18 Thread Torbjörn Granlund
I've reverted the main gmp repository /var/hg/gmp from backup in order to resolve broad breakage. The public mirror will be auto-synched soon. I don't think a repo which has been synchronised with /var/hg/gmp since 2020-11-15 will work properly. For those of you that have a checkout locally

Re: State of PRNG code in GMP

2020-06-09 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: It's a constraint on what the algorithm internal struct can look like, e.g., it can't have internal pointers (but it could have offsets). So not necessarily a show-stopper, but we should be aware when designing the interfaces. The latest code
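
The constraint about internal pointers can be illustrated with a small sketch. Everything here is hypothetical (demo_state, demo_init, demo_next_byte and demo_copy are not GMP names), and it assumes that generic code duplicates an algorithm's state with a plain byte-wise copy. A struct that records its read position as an offset survives such a copy; a struct holding a pointer into its own buffer would end up pointing into the source object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical generator state: buffered bytes plus a read position. */
typedef struct {
    uint8_t buf[16];
    size_t  next;      /* offset into buf, deliberately not a pointer */
} demo_state;

/* Fill the buffer with a simple seed-dependent pattern (illustration only,
   not a real generator). */
void demo_init(demo_state *s, uint8_t seed)
{
    for (size_t i = 0; i < sizeof s->buf; i++)
        s->buf[i] = (uint8_t)(seed + 7 * i);
    s->next = 0;
}

/* Return the next buffered byte and advance the offset. */
uint8_t demo_next_byte(demo_state *s)
{
    uint8_t r = s->buf[s->next];
    s->next = (s->next + 1) % sizeof s->buf;
    return r;
}

/* Generic copy: knows only the size of the state, nothing about its layout. */
void demo_copy(demo_state *dst, const demo_state *src)
{
    memcpy(dst, src, sizeof *dst);
}
```

After demo_copy, the two states produce the same bytes and advance independently, because the offset is meaningful relative to whichever copy holds it.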

Re: State of PRNG code in GMP

2020-06-09 Thread Torbjörn Granlund
Marco Bodrato writes: Mersenne Twister only uses mpz for initialization. Moreover there is a little "bug" in the initialization procedure, so that the sequence can be the same even if the seed is different (in the range where it is supposed to generate different sequences). Oops.

Re: State of PRNG code in GMP

2020-06-03 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: > /* PRNG algorithm specific data for any constants, buffered random bits, or other state. The _mp_data field should typically point to an algorithm specific struct. The _mp_datasize field is used by generic code for

Re: State of PRNG code in GMP

2020-06-03 Thread Torbjörn Granlund
Bradley Lucier writes: I don't know whether this is of interest to GMP developers. We probably don't have time for a very large project around random numbers. Independent random number generators are possible already today at the mpz level. Just initialise any needed number of randstate_t

Re: State of PRNG code in GMP

2020-06-02 Thread Torbjörn Granlund
Thanks Pedro for a quick and thorough answer! Much appreciated! Pedro Gimeno writes: > Question 1: Why is _mp_lc wrapped in a union? Historical reasons. It was that way when I implemented MT. I see. > Question 2: "_lc" = Linear Congruential? This is supposed to be a > generic

State of PRNG code in GMP

2020-06-01 Thread Torbjörn Granlund
I am looking into adding AES CTR as a new, fast PRNG in GMP. Unfortunately, the current code is somewhat confusing. The main structure for storing random state is the following: typedef struct { mpz_t _mp_seed; /* _mp_d member points to state of the generator. */

gmplib.org change

2020-05-26 Thread Torbjörn Granlund
We've reconfigured our web server to suppress .html from URLs. It seems to work fine, but please let me know if you encounter any problems.

Re: GMP 6.2.0 doesn't build on C89 compilers - Patch attached

2020-05-07 Thread Torbjörn Granlund
Colin Finck writes: We build GMP as part of a GCC build for a build environment and have just upgraded to GMP 6.2.0. Unfortunately, this version fails to build using C89 compilers or under Linux distributions that don't advertise C99 support. In particular, one of our developers
