ni...@lysator.liu.se (Niels Möller) writes:

> Maamoun TK <maamoun...@googlemail.com> writes:
>
>> +L4x_loop:
> [...]
>> +    C polynomial multiplication "classical" pre-processing
>> +    xxmrghd        VSR(C23h),VSR(C2),VSR(C3)
>> +    xxmrgld        VSR(C23l),VSR(C2),VSR(C3)
>> +    xxmrghd        VSR(C01h),VSR(C0),VSR(C1)
>> +    xxmrgld        VSR(C01l),VSR(C0),VSR(C1)
>> +
>> +    C polynomial multiplication "classical"
>> +    vpmsumd        C3,C3,H1                      C M3 = H^1l*C3h⊕H^1h*C3l
>> +    vpmsumd        C2,C2,H2                      C M2 = H^2l*C2h⊕H^2h*C2l
>> +    vpmsumd        C1,C1,H3                      C M1 = H^3l*C1h⊕H^3h*C1l
>> +    vpmsumd        C0,C0,H4                      C M0 = H^4l*C0h⊕H^4h*C0l
>> +    vpmsumd        C23h,C23h,H21h                C H23 = H^2h*C2h⊕H^1h*C3h
>> +    vpmsumd        C23l,C23l,H21l                C L23 = H^2l*C2l⊕H^1l*C3l
>> +    vpmsumd        C01h,C01h,H43h                C H01 = H^4h*C0h⊕H^3h*C1h
>> +    vpmsumd        C01l,C01l,H43l                C L01 = H^4l*C0l⊕H^3l*C1l
>
> So you do 4 blocks with only 10 vpmsumd instructions (the 8 above, and 2
> more below for the reduction). That's nice! It would be even nicer if
> the H terms could be rearranged to eliminate the pre-processing of the C
> values.

I think I've found a slightly better way to organize it. I'm assuming
bit reversal and that we have pre-multiplied the key factor by x, so
that we need to compute

  A(x) B(x) / x^128 (mod P(x))

where P(x) is the reversed polynomial P(x) = x^128 + x^127 + x^126 +
x^121 + 1. Denote the 64-bit pieces as

  A(x) = a_1(x) x^64 + a_0(x), B(x) = b_1(x) x^64 + b_0(x)

We do precomputations on B (the key, invariant when processing a message
of multiple blocks).

First, compute b_0(x) / x^64 (mod P(x)), which expands it from 64 bits to
128,

  c_1(x) x^64 + c_0(x) = b_0(x) / x^64 (mod P(x))
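
For concreteness, here's a minimal C model of that precomputation (my
names, and natural bit order, bit i = coefficient of x^i, so it models
only the algebra, not GHASH's bit-reversed representation). Dividing by
x means shifting right and, when the constant term was set, adding in
1/x = x^127 + x^126 + x^125 + x^120 (mod P); doing that 64 times gives
b_0 / x^64.

  #include <stdint.h>

  /* 128-bit polynomial over GF(2): hi holds the coefficients of
     x^64..x^127, lo holds x^0..x^63 (natural bit order). */
  typedef struct { uint64_t hi, lo; } poly128;

  /* Divide by x, mod P(x) = x^128 + x^127 + x^126 + x^121 + 1. */
  static poly128
  div_x (poly128 u)
  {
    uint64_t c0 = u.lo & 1;
    u.lo = (u.lo >> 1) | (u.hi << 63);
    u.hi >>= 1;
    if (c0)
      u.hi ^= 0xe100000000000000ULL; /* x^127 + x^126 + x^125 + x^120 */
    return u;
  }

  /* c_1 x^64 + c_0 = b_0 / x^64 (mod P) */
  static poly128
  precompute_c (uint64_t b0)
  {
    poly128 u = { 0, b0 };
    for (int i = 0; i < 64; i++)
      u = div_x (u);
    return u;
  }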

Next, form the sum d_1 = b_1 + c_0. We can then write (to simplify
notation, everything below is (mod P(x)), + means xor, and we also omit
the (x) arguments):

  A B / x^128 = a_1 b_1 + (a_0 b_1 + a_1 b_0) / x^64 + a_0 b_0 / x^128
              = a_1 b_1 + (a_0 b_1 + a_1 b_0) / x^64 + a_0 (c_1 x^64 + c_0) / x^64
              = a_1 b_1 + a_0 c_1 + (a_0 b_1 + a_1 b_0 + a_0 c_0) / x^64
              = a_1 b_1 + a_0 c_1 + (a_0 d_1 + a_1 b_0) / x^64
                `------R--------'   `--------F--------'

So if we have the input in register A (loaded from memory with no
processing besides ensuring proper *byte* order), and precompute two
values, M representing b_1(x) x^64 + c_1(x), and L representing b_0(x)
x^64 + d_1(x), then we get the two halves above with two vpmsumd,

  vpmsumd R, M, A
  vpmsumd F, L, A
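
Continuing the C model (still a sketch under my naming, not the real
code), the key setup and per-block step become:

  /* Carryless 64x64 -> 128 multiply (portable reference). */
  static poly128
  clmul64 (uint64_t a, uint64_t b)
  {
    poly128 r = { 0, 0 };
    for (int i = 0; i < 64; i++)
      if ((b >> i) & 1)
        {
          r.lo ^= a << i;
          if (i > 0)
            r.hi ^= a >> (64 - i);
        }
    return r;
  }

  /* Model of POWER vpmsumd: (hi * hi) xor (lo * lo), carryless. */
  static poly128
  vpmsumd (poly128 a, poly128 b)
  {
    poly128 h = clmul64 (a.hi, b.hi);
    poly128 l = clmul64 (a.lo, b.lo);
    return (poly128) { h.hi ^ l.hi, h.lo ^ l.lo };
  }

  /* Key setup: M = b_1 x^64 + c_1, L = b_0 x^64 + d_1. */
  static void
  gcm_set_key (uint64_t b1, uint64_t b0, poly128 *M, poly128 *L)
  {
    poly128 c = precompute_c (b0);    /* c_1 x^64 + c_0 */
    *M = (poly128) { b1, c.hi };
    *L = (poly128) { b0, b1 ^ c.lo }; /* d_1 = b_1 + c_0 */
  }

  /* Per block: R = a_1 b_1 + a_0 c_1, F = a_1 b_0 + a_0 d_1. */
  static void
  gcm_block (poly128 M, poly128 L, poly128 a, poly128 *R, poly128 *F)
  {
    *R = vpmsumd (M, a);
    *F = vpmsumd (L, a);
  }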

When doing more than one block at a time, I think it's easiest to
accumulate the R and F values separately.
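
With the four-block unrolling of the quoted patch, that accumulation
could look like this in the model. I'm assuming block k is multiplied
by H^(4-k), that M[k], L[k] are the M, L values derived from that
power, and that the previous hash state has already been xored into
a[0]:

  static void
  gcm_hash4 (const poly128 M[4], const poly128 L[4],
             const poly128 a[4], poly128 *R, poly128 *F)
  {
    *R = (poly128) { 0, 0 };
    *F = (poly128) { 0, 0 };
    for (int k = 0; k < 4; k++)
      {
        poly128 r = vpmsumd (M[k], a[k]);
        poly128 f = vpmsumd (L[k], a[k]);
        R->hi ^= r.hi; R->lo ^= r.lo;
        F->hi ^= f.hi; F->lo ^= f.lo;
      }
    /* a single reduction of (R, F) follows, see below */
  }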

After accumulation, we have the pieces

  R = r_1 x^64 + r_0, F = f_1 x^64 + f_0
 
To do the division by x^64 (mod P) and reduce to a 128-bit result, we
need to cancel f_0, and we do that by multiplying f_0 by P and adding
the product in:

  (f_1 x^64 + f_0 + f_0 P) / x^64 = f_1 + f_0 (x^64 + x^63 + x^62 + x^57)

If we prepare a register G representing the constant polynomial (x^63 +
x^62 + x^57), i.e., with all-zero high part, we can do that as

  vpmsumd T, F, G

and then to get the final result, we need to xor together the three
values

   r_1 x^64 + r_0
   f_0 x^64 + f_1   (F swapped)
   t_1 x^64 + t_0
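
In the C model, the whole final reduction is then (in this bit order,
the low doubleword of G is the mask 0xc200000000000000):

  /* Final reduction: divide the accumulated F by x^64 (mod P) and
     fold it into the accumulated R, giving the 128-bit digest. */
  static poly128
  gcm_reduce (poly128 R, poly128 F)
  {
    poly128 G = { 0, 0xc200000000000000ULL }; /* x^63 + x^62 + x^57 */
    poly128 T = vpmsumd (F, G);  /* f_0 G; the f_1 * 0 half vanishes */
    /* xor of R, F swapped, and T */
    return (poly128) { R.hi ^ F.lo ^ T.hi, R.lo ^ F.hi ^ T.lo };
  }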

It may be worth noting that both r_1 and t_1 are only 63 bits, so it's
only the contribution of f_0 in the high half position that increases
the size of the result from 127 bits to 128.

There are some possible variations of the final reduction, e.g., one
could use G' = x^63 + x^62 + x^61 + x^56 (i.e., one more term), but then
left shift the result of f_0 G'. With some different and really clever
handling of the extra x factor (here, assumed premultiplied into b),
maybe one could use a single vpmsumd to multiply f_0 G' and add in f_1
at the right place before shifting.

So to sum up, with this method, there's no preprocessing of the input,
two vpmsumd per block to produce the values to be accumulated (same as
in your patch), and a single vpmsumd (one less than in your patch) for
the final reduction.

Furthermore, there's actually no need to do the final reduction until
the end of the message. If we can keep the accumulated R and F values
separately in the gcm_ctx struct (an ABI change, though), the final
reduction can be postponed until gcm_digest is called.
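
Sketched (hypothetically; this is not the current struct gcm_ctx
layout):

  /* Hypothetical state layout: keep both unreduced accumulators and
     call gcm_reduce only from gcm_digest. */
  struct gcm_hash_state
  {
    poly128 R, F;
  };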

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.