Re: [curves] Compact square root algorithm for annoying primes

Mike Hamburg Fri, 22 Jan 2016 13:57:06 -0800

> On Jan 22, 2016, at 4:20 AM, Michael Scott <mike.sc...@miracl.com> wrote:
> 
> Have a look at the references in http://www.google.com/patents/US9148282 
> <http://www.google.com/patents/US9148282>
> 
> Particularly Siguna Mueller's Lucas-function based method..


Thanks!  To your knowledge, is the Mueller method itself patented?  I’ll check 
it out when I can get the right login.

> Not sure about the cost..

The abstract claims “expected cost = 1.33 * the number of multiplications for a 
square and multiply”, which would be 468 in this case.

I’ve also improved the splitting in my code, pasted below, which improves its 
worst-case time to 718 multiplies, with 7 elements in RAM and 2 in ROM plus 
still that pesky dlog table.  A little tuning can save another few multiplies, 
but I went with simplicity instead.  So it’s still presumably worse than the 
Mueller method, except perhaps in that it is deterministic.

> Mike

Thanks again,
— also Mike

> 
> 
> On Thu, Jan 21, 2016 at 11:30 PM, Mike Hamburg <m...@shiftleft.org 
> <mailto:m...@shiftleft.org>> wrote:
> Hello Curves,
> 
> I was porting an embedded EC library to work with other curves, and I was 
> wondering if anyone knew a compact, fast, deterministic algorithm for square 
> roots mod NIST p224 (or other annoying primes)?  I’m not worried about 
> constant-time operation, since I probably have to blind anyway to reduce 
> other side channels, but a deterministic algorithm is preferable because of 
> possible real-time constraints.
> 
> OpenSSL’s implementation takes something like 4700 multiplications on 
> average, which is more than a point*scalar multiply.
> 
> Dan’s paper [1] gives a variant of the Tonelli-Shanks algorithm which is fast 
> and deterministic, requiring only 364 multiplies.  But it’s very heavy, with 
> 1024 precomputed elements.  Reducing the table size makes the algorithm 
> considerably slower.
> 
> I’ve tried to combine Dan’s Tonelli-Shanks variant with Sutherland’s 
> algorithm, below.  A variant with 4 precomputed points, a 64-element discrete 
> log table (not as big as it sounds) and 10 or so field elements in memory 
> takes 900 multiplies worst-case (~760 average-case).
> 
> Cipolla’s algorithm is nondeterministic.
> 
> Is Pocklington the way to go?  It’s technically nondeterministic as well, but 
> if I read [1] correctly the failure probability is 2^-96.  A straightforward 
> implementation probably takes more than 900 multiplies, but hopefully it’s 
> simpler and uses less memory, right?
> 
> Is there some other algorithm I’m missing?
> 
> Thanks,
> — Mike
> 
> [1] http://cr.yp.to/papers/sqroot.pdf <http://cr.yp.to/papers/sqroot.pdf>
> 

p = 2^224 - 2^96 + 1
F = GF(p)
qnr = 11^(2^128-1)
dlog_table = dict((
    (qnr^(2^90*i),2^6-i) for i in xrange(2^6)
))
precmp = [qnr^2^66,qnr^2^84]

def isqrt(x):
    s = x^3
    s = s^2 * x # 2^3 - 1
    u = s^2^3 * s # 2^6 - 1
    s = u^2^6 * u # 2^12 - 1
    t = s^2^12 * s # 2^24 - 1
    s = t^2^24 * t # 2^48 - 1
    s = s^2^48 * s # 2^96 - 1
    s = s^2^24 * t # 2^120 - 1
    s = s^2^6 * u # 2^126 - 1
    s = s^2 * x
    # here s == x^(2^127-1)
    
    ret = [s]
    rou = [qnr]
    
    def leaf(r,n=6):
        dlog = dlog_table[r]>>(tbits-n)
        for i in xrange(len(ret)):
            for j in xrange(n):
                if dlog & (1<<j): ret[i] = ret[i] * rou[i]
                rou[i] = rou[i]^2
    
    def stage(m):
        if len(ret) > 1:
            return ret[-1]^2^(6*m)
        else:
            return (ret[-1]^2*x)^2^(6*m-1)
    
    def descent(n):
        for i in xrange(n-1,0,-1):
            leaf(stage(i),6)
        
        if len(ret) > 1:
            rou.pop()
            leaf(ret.pop(),6)
        else:
            leaf(ret[0]^2*x,5)
    
    # Have to apply leaf() 96/6 = 16 times
    # Split it as 11+5, 11=6+5, 6=4+2, 5=3+2
    # Precomputed roots of unity for 2^(6*5) and 2^(6*2)
    ret.append(stage(11)); rou.append(precmp[0])
    ret.append(stage(3)); rou.append(precmp[1])
    descent(2)
    descent(3)
    
    ret.append(stage(6)); rou.append(precmp[0])
    ret.append(stage(3)); rou.append(precmp[1])
    descent(2)
    descent(3)
    
    ret.append(stage(4)); rou.append(precmp[1])
    descent(2)
    descent(4)
    
    return ret[0]

_______________________________________________
Curves mailing list
Curves@moderncrypto.org
https://moderncrypto.org/mailman/listinfo/curves

Re: [curves] Compact square root algorithm for annoying primes

Reply via email to