ni...@lysator.liu.se (Niels Möller) writes: > So one could perhaps provide a set of uniformly distribted > random numbers for div1 and get adequate results?
Are you saying that the quotient of two independently uniformly distributed numbers has a similar distribution as the euclidean quotient sequence? Otherwise, one could tweak the distribution in some way. Not exactly. I think running div1 like hgcd2 but without any of hgcd2 bookkeeping would make some sense. I.e., feed div1 with the input of Euclid's algorithm. To avoid skew from particular operands, perhaps table 10 uniformly distrinuted random numbers, then do all (a,b) pairs from that table. The idea is to make measurements more accurate. > I'm appending another iteration of the patch to add div2 function based > on div1 on the high limbs. Selected via HGCD2_DIV2_METHOD. Benchmarks: > > HGCD2_DIV2_METHOD mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 > 1 #1504.47 1740.53 1513.56 div1 + unlikely case > 2 #1739.66 1858.01 1753.73 current code > 3 #1615.32 1736.69 1621.95 the if:ed out version > > Looks good. Ready for repo? I think it's quite ready, but not today. I also wonder if we can delete the "current" code right away, to reduce number ov variants. Or if we should wire it all up for tuning before deciding that. The "current" div2, you mean? The branchy variant is hardly ever going to be the fastest, so pls go ahead and delete! > What is the current reasoning on suppressing the q = 1 code in the hgcd2 > code calling div1 and div2? I don't quite remember, but the operations in the q == 1 case are sub_ddmmss (or just ah -= bh in the single-limb loop) if (ah <= bh) True 41.5%, if I find the right place in Knuth u01 += u00; u11 += u10; So that's pretty cheap, except for the branch each mispredictions can cost several cycles. "Several" is an understatement. A misprediced branch cost more than 10 cycles today. The case of general q obviously needs to compute a quotient, where div1 gives a candidate quotient that's usually correct. Then the update needs two multiplies, u01 += q * u00; u11 += q * u10; In addition, the computation of the remainder is not not so cheap, at least not for the double-limb loop, where we get one or two additilan multiplies, some carry propagation, and logic for adjusting the div1 quotient when it happens to be one off. My guess is that special q=1 is often beneficial for the double-limb loop, and rarely beneficial in the single-limb loop. But needs some measurements. OK. I've written several div1 in asm (arm v5 method 2, 64-bit arm v8 method 2 and 3, 64-bit x86 method 2). Since these will not be static or inlined, I call them mpn_div_11. I am not sure how to time them within your framework. I realised one thing for div1: All current variant will pay a great branch misprediction price except method 1. Even method 3 pays, as q >=8 happens in some 20% of the case, so average cost perhaps 20 cycles times 0.2 = 4 cycles per call. Or perhaps 8 as two branches must be passed. The new method 2 pays as the loop iteration will vary very much and very unpredictably. I expect the number of mispredictions will be > 1 per call, actually. This is why method 1 stands a chance, in spite of how slow the division instructions generally are. I now believe a table-based thing which gets the quotient right > 90% of the time will be the best approach. And perhaps hgcd2 could be made to cope with inaccurate quotients, and that without itself executing unpredictable branches? As long as progress is made, perhaps sloppy quotients can be tolerated? -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel