> On Jan 22, 2016, at 4:20 AM, Michael Scott <mike.sc...@miracl.com> wrote: > > Have a look at the references in http://www.google.com/patents/US9148282 > <http://www.google.com/patents/US9148282> > > Particularly Siguna Mueller's Lucas-function based method..
Thanks! To your knowledge, is the Mueller method itself patented? I’ll check it out when I can get the right login. > Not sure about the cost.. The abstract claims “expected cost = 1.33 * the number of multiplications for a square and multiply”, which would be 468 in this case. I’ve also improved the splitting in my code, pasted below, which improves its worst-case time to 718 multiplies, with 7 elements in RAM and 2 in ROM plus still that pesky dlog table. A little tuning can save another few multiplies, but I went with simplicity instead. So it’s still presumably worse than the Mueller method, except perhaps in that it is deterministic. > Mike Thanks again, — also Mike > > > On Thu, Jan 21, 2016 at 11:30 PM, Mike Hamburg <m...@shiftleft.org > <mailto:m...@shiftleft.org>> wrote: > Hello Curves, > > I was porting an embedded EC library to work with other curves, and I was > wondering if anyone knew a compact, fast, deterministic algorithm for square > roots mod NIST p224 (or other annoying primes)? I’m not worried about > constant-time operation, since I probably have to blind anyway to reduce > other side channels, but a deterministic algorithm is preferable because of > possible real-time constraints. > > OpenSSL’s implementation takes something like 4700 multiplications on > average, which is more than a point*scalar multiply. > > Dan’s paper [1] gives a variant of the Tonelli-Shanks algorithm which is fast > and deterministic, requiring only 364 multiplies. But it’s very heavy, with > 1024 precomputed elements. Reducing the table size makes the algorithm > considerably slower. > > I’ve tried to combine Dan’s Tonelli-Shanks variant with Sutherland’s > algorithm, below. A variant with 4 precomputed points, a 64-element discrete > log table (not as big as it sounds) and 10 or so field elements in memory > takes 900 multiplies worst-case (~760 average-case). > > Cipolla’s algorithm is nondeterministic. > > Is Pocklington the way to go? It’s technically nondeterministic as well, but > if I read [1] correctly the failure probability is 2^-96. A straightforward > implementation probably takes more than 900 multiplies, but hopefully it’s > simpler and uses less memory, right? > > Is there some other algorithm I’m missing? > > Thanks, > — Mike > > [1] http://cr.yp.to/papers/sqroot.pdf <http://cr.yp.to/papers/sqroot.pdf> > p = 2^224 - 2^96 + 1 F = GF(p) qnr = 11^(2^128-1) dlog_table = dict(( (qnr^(2^90*i),2^6-i) for i in xrange(2^6) )) precmp = [qnr^2^66,qnr^2^84] def isqrt(x): s = x^3 s = s^2 * x # 2^3 - 1 u = s^2^3 * s # 2^6 - 1 s = u^2^6 * u # 2^12 - 1 t = s^2^12 * s # 2^24 - 1 s = t^2^24 * t # 2^48 - 1 s = s^2^48 * s # 2^96 - 1 s = s^2^24 * t # 2^120 - 1 s = s^2^6 * u # 2^126 - 1 s = s^2 * x # here s == x^(2^127-1) ret = [s] rou = [qnr] def leaf(r,n=6): dlog = dlog_table[r]>>(tbits-n) for i in xrange(len(ret)): for j in xrange(n): if dlog & (1<<j): ret[i] = ret[i] * rou[i] rou[i] = rou[i]^2 def stage(m): if len(ret) > 1: return ret[-1]^2^(6*m) else: return (ret[-1]^2*x)^2^(6*m-1) def descent(n): for i in xrange(n-1,0,-1): leaf(stage(i),6) if len(ret) > 1: rou.pop() leaf(ret.pop(),6) else: leaf(ret[0]^2*x,5) # Have to apply leaf() 96/6 = 16 times # Split it as 11+5, 11=6+5, 6=4+2, 5=3+2 # Precomputed roots of unity for 2^(6*5) and 2^(6*2) ret.append(stage(11)); rou.append(precmp[0]) ret.append(stage(3)); rou.append(precmp[1]) descent(2) descent(3) ret.append(stage(6)); rou.append(precmp[0]) ret.append(stage(3)); rou.append(precmp[1]) descent(2) descent(3) ret.append(stage(4)); rou.append(precmp[1]) descent(2) descent(4) return ret[0]
_______________________________________________ Curves mailing list Curves@moderncrypto.org https://moderncrypto.org/mailman/listinfo/curves