On Wed, Sep 20, 2017 at 6:19 PM, Kyle Butt <k...@iteratee.net> wrote:
> > > On Mon, Sep 11, 2017 at 10:37 AM, Kyle Butt <k...@iteratee.net> wrote: > >> >> >> On Fri, Sep 8, 2017 at 9:55 PM, Mike Hamburg <m...@shiftleft.org> wrote: >> >>> I might possibly have written an ECC implementation that documents: “The >>> following operations are not implemented for NIST P224 because I didn’t >>> want to compute square roots: …” >>> >>> At least with only 2^24+1, you don’t have to implement Sutherland’s >>> algorithm. But the cost of Tonelli-Shanks is nontrivial, since it can >>> still take hundreds of multiplications. >>> >> >> I implemented it and added counting. For the first 10,000 residues the >> average # of multiplies for the iterative portion is 183. 22 of those would >> have to be done anyway if p were 3 mod 4. >> So 161 extra field multiplications for square roots. Taking the square >> root of u/v requires an additional 24 squares and 9 multiplies. >> I at least figured out how to do equality checks in band without having >> to reduce the value completely. My insight was to set the first component >> to t + t/2, and then do a parallel carry. If the value is 0, all the >> carries will be done, and the you check that all components are the same. >> >> I'll try to get some speed #'s but it's at least nice to know that >> Tonelli-Shanks isn't too difficult to implement (at least in Haskell), and >> that the # of multiplies isn't insurmountable. >> >> For example, my phi 17 prime has p-1 = 2^22*q, 431 bits, and 136 >> mulitplies per field multiply or square. Comparing 24 vs 22, you should >> expect 133 extra field multiplies per square root. >> The Ed41417 prime paper lists 144 multiplies per square. Saving 8 >> mulitplies per field multiplication, with ~8 field multiplies per point >> addition is 64 multiplies per point, saving a field multiply just over >> every 2 point additions. >> It's easy to see that over a scalar multiply, you could theoretically >> make it up. >> >> I've never done CUDA/OpenCL, but these should all vectorize nicely. It >> will be a worthwhile reason to learn vector programming. >> > > So the half-cyclic convolution is a pain to vectorize, and the compiler > certainly won't do it for you. > > I implemented multiply and square for avx2. On my workstation at work I > get 163 cycles/multiply and 121/square. Which isn't competitive with the > state of the art, but it's close. > > If you wanted differential power resistance, they may be competitive, > because they get it for (almost) free. (Instead of setting the accumulator > to 0, you broadcast a random value 0 t-1) > > Someone with more vectorizing experience may be able to push them further, > but that's a lot of work. > > If anyone is interested in the code, I can clean it up and post it. > I figured I'd write a last follow-up on some of the work I'd done before leaving it. I realized that you could use division to reduce the coefficients by t instead of montgomery style reduction. For an arbitrary t, that would be expensive, but for t = (2^26 - 2) It's relatively cheap. shift right by 26, that becomes a carry to the next coefficient, and add back 2 times that as an error term. You have to do this twice because the terms are bigger than t^2. You can find several t's that are close enough to a power of 2 to use this method of reduction. (2^26 - 2) is really nice in that 2 has hamming weight 1. p = phi 19 t is 3 mod 4, for cheap square roots. If you reduce this way, the reduction is exact, which is nice for 2 reasons: Equality checking: Subtract, reduce once. All terms will be the same if the values were equal. Bits: Because the reduction is exact, you can use a slightly larger t value. t can be slightly larger than 27 bits, For instance t = 2^27 + 9, p = phi 19 t is 486 bits. 9 requires an extra addition, but not an extra shift, because you have to compute the carry anyway. I did something else as well. I realized you could view the Granger Moss construction as a Karatsuba level, and tried t = (2^112 - 8), and p = phi 5 t I implemented a reduced carry version. You get the first quotient for free because shifting right by 112 is just the upper results of the product. I was able to get multiply reduce around 190 cycles implemented that way. Reduction was 26 of those cycles, and auto-vectorized nicely. I just used intel intrinsics for mulx and addc, no vector intrinsics. I may try the 32 bit version on skylake to see how much more I can get out of avx-512. The majority of the multiplies can fill a 512 bit vector, lane masking helps with the ones that can't. It also has embedded broadcast, which should help out a lot. The way I got performance from the cyclic convolution was to tilt it so that half of each operand was a broadcast. That along with the increased registers should lower the # of spills. The carry propagation runs in 3 passes instead of 5, and avx-512 has 64 bit arithmetic right shift, and 64-32 bit packing, used to finish the reduction. I may also try Neon. The same 4-way vectorization that I did for avx2 splits into 2-way vectorization, with long multiply accumulate. > > >> >>> Anyway, I’ll be interested to know what performance you get. >>> >>> > On Sep 8, 2017, at 9:21 PM, Kyle Butt <k...@iteratee.net> wrote: >>> > >>> > TLDR: How much do square roots matter for ECC primes? >>> > >>> > I've been working on finding Granger Moss primes where t fits in a 32 >>> bit integer. >>> > Along the way, I worked out tighter bounds for the min and max values >>> after reduction. >>> > >>> > In the paper, they aren't explicit about how much extra room is needed >>> to handle additions for curve operations. For edwards curves, you need to >>> account for a factor of 6. This changes their formula >>> > ceil(log_2(m / 2)) + 2k + 5 <= 2w >>> > to: >>> > ceil(log_2(3 * m) + 2k + 5 <= 2w >>> > >>> > for m + 1 = 17 and 19, this requires k < 26.5 >>> > >>> > Similarly for their reduction algorithm, they work in bits, but I did >>> the math on the actual bounds. >>> > >>> > Assuming z_i = 2^63 - 1, after 1 reduction the max value is 2^(63 - l) >>> - 1 + (t - c) >>> > t - c = (b - 1) * c, the max value carried from z_(i+1), assuming c < >>> b, after 2 reductions the max value is: >>> > 2^(63 - 2*l) - 1 + c + (t - c). = 2^(63 - 2*l) + t - 1. The first c >>> occurs because t/b = c. >>> > The minimum value is easier to calculate. In this case, the worst case >>> carry is 0. >>> > so the min value is 2^(63 - 2 * l). >>> > >>> > Upon multiplying these values are subtracted. If their difference is >>> less than 2^27.5, then we can adjust the formula above to: >>> > >>> > ceil(log_2(3 * m) + 2k + 4 <= 2w >>> > >>> > This does place a greater constraint on l, but it yields larger primes. >>> > >>> > because we know that the result of the product takes 1 less bit. >>> > This allows us to construct larger primes for m + 1 = 17, 19. >>> > For m+1 = 19 we can construct a prime with 483 bits, specifically: >>> > p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24 >>> > >>> > Unfortunately, these primes are not fast square root primes, by >>> construction. It should be clear that p % (2^24) = 1. How much does this >>> matter? S is 24 in this case, so Tonelli Shanks should be reasonably fast. >>> Equality checking is also a little tricky due to the redundancy, but a >>> subtraction single round of modular carries can be done in parallel. Then >>> you can verify that all the limbs are the same value. >>> > >>> > I found an edwards curve that satisfies the safecurves criteria, the >>> constant is unusual but rigid: >>> > It is the least d in absolute value such that x^2 + y^2 = 1 + >>> (d/b)x^2*y^2 has cofactor {4, 8} and so does its twist. The reason for >>> choosing such a d is that coefficient multiplication requires reduction, >>> which means that even for small constants, they must be in the montgomery >>> domain, taking up at least 2 limbs. But for a constant d/b it can be a >>> single limb. >>> > >>> > the least such d is: -56904 >>> > for that curve the trace is: 180587655261632913936542935860 >>> 4192390835234313183868353172046570215054410 >>> > >>> > I prototyped the code in Haskell, for quick turnaround and so that I >>> have something where I can print the intermediate values as test vectors >>> for other implementations. I can tidy it up and share it if there's >>> interest. It's not written for speed but for verification. >>> > >>> > I plan to try and implement a cuda or opencl version, to try and take >>> advantage of the parallel nature of the primes. >>> > >>> > Thoughts or suggestions? Would it be worthwhile to write any of this >>> up in a more formal setting? >>> > >>> > I think at a minimum borrowing the half bit by computing tighter >>> bounds is pretty cool. >>> > >>> > Kyle >>> > _______________________________________________ >>> > Curves mailing list >>> > Curves@moderncrypto.org >>> > https://moderncrypto.org/mailman/listinfo/curves >>> >>> >> >
_______________________________________________ Curves mailing list Curves@moderncrypto.org https://moderncrypto.org/mailman/listinfo/curves