Hello Curves,

I was porting an embedded EC library to work with other curves, and I was 
wondering if anyone knew a compact, fast, deterministic algorithm for square 
roots mod NIST p224 (or other annoying primes)?  I’m not worried about 
constant-time operation, since I probably have to blind anyway to reduce other 
side channels, but a deterministic algorithm is preferable because of possible 
real-time constraints.

OpenSSL’s implementation takes something like 4700 multiplications on average, 
which is more than a point*scalar multiply.

Dan’s paper [1] gives a variant of the Tonelli-Shanks algorithm which is fast 
and deterministic, requiring only 364 multiplies.  But it’s very heavy, with 
1024 precomputed elements.  Reducing the table size makes the algorithm 
considerably slower.

I’ve tried to combine Dan’s Tonelli-Shanks variant with Sutherland’s algorithm, 
below.  A variant with 4 precomputed points, a 64-element discrete log table 
(not as big as it sounds) and 10 or so field elements in memory takes 900 
multiplies worst-case (~760 average-case).

Cipolla’s algorithm is nondeterministic.

Is Pocklington the way to go?  It’s technically nondeterministic as well, but 
if I read [1] correctly the failure probability is 2^-96.  A straightforward 
implementation probably takes more than 900 multiplies, but hopefully it’s 
simpler and uses less memory, right?

Is there some other algorithm I’m missing?

Thanks,
— Mike

[1] http://cr.yp.to/papers/sqroot.pdf <http://cr.yp.to/papers/sqroot.pdf>

####################################################

p = 2^224 - 2^96 + 1
F = GF(p)
qnr = F(11)^(2^128-1)

# 3-element precomputed table with roots of unity
precmp = dict((
    (3*2^i, qnr^2^(96-3*2^i))
    for i in [2,3,4]
))

# 64-element discrete log table
tbits = 6
tlogger = qnr^2^(96-tbits)
dlog_table = dict((
    (tlogger^i,2^tbits-i) for i in xrange(2^tbits)
))

def isqrt(x):
    # inverse square root of x, assuming x is a QR
    # First, DJB’s addition chain
    s = x^3
    s = s^2 * x # 2^3 - 1
    u = s^2^3 * s # 2^6 - 1
    s = u^2^6 * u # 2^12 - 1
    t = s^2^12 * s # 2^24 - 1
    s = t^2^24 * t # 2^48 - 1
    s = s^2^48 * s # 2^96 - 1
    s = s^2^24 * t # 2^120 - 1
    s = s^2^6 * u # 2^126 - 1
    s = s^2 * x
    # here s == x^(2^127-1)
    
    ret = [s,s^2*x] # working powers of return value
    rou = [qnr,qnr^2] # working roots of unity
    stack = [95]
    
    def leaf(n):
        dlog = dlog_table[ret.pop()]>>(tbits-n)
        for i in xrange(len(rou)):
            for j in xrange(n):
                if dlog & (1<<j): ret[i] = ret[i] * rou[i]
                rou[i] = rou[i]^2
    
    # max stack depth is 4, so 5+5 field elements in ret+rou
    while len(stack):
        n = stack.pop()
        ret.append(ret[-1]^2^floor(n/2))
        
        if n <= 2*tbits:
            leaf(ceil(n/2))
            rou.pop()
            leaf(floor(n/2))
        else:
            rou.append(precmp[ceil(n/2)])
            stack.append(floor(n/2))
            stack.append(ceil(n/2))
            
    return ret[0]
_______________________________________________
Curves mailing list
[email protected]
https://moderncrypto.org/mailman/listinfo/curves

Reply via email to