Hi Michael,

thanks for your comments, more inline:

On Dec 21, 2005, at 2:09 PM, Michael Torla wrote:

David McGrew wrote:
You mean advantage in terms of latency, right? I'm not sure that this is the case, since both XCB and EME* need to do one pass over the data before any data can be output, and I suspect that the circuit depth of those two passes isn't much different. It would be interesting to see a detailed comparison. For that matter, it would be worthwhile to discuss the implementation scenarios enough to get a good idea of what the "success criteria" for wide-block modes like these are. (E.g. since all of these modes require the data to be buffered, what critical path should be measured? The path to output the first byte, or to output all of the bytes?)
I've looked at this to some extent.

From the point of view of supporting an arbitrary block size, XCB is much more costly. To support a block that is larger than the AES hardware accelerator's buffer size, the data must be fetched twice. This requirement is unique to XCB; I've not seen it in any other mode of any crypto algorithm I've looked at.

AFAICT, having the encryptor buffer the block that it is encrypting is a fundamental requirement for any cipher that is a pseudorandom permutation with an input width that matches the plaintext size. Any mode that meets that goal would need to buffer the data.

EME and EME*, and several other modes, also have this property, IIUC. In the EME specifications, the dependency of the second ECB pass on the results of the first ECB pass is somewhat hidden, because it is expressed indirectly through intermediate variables. Note, for example, on page 4 of http://seclab.cs.ucdavis.edu/papers/eme.pdf that the variable M, which is needed to compute the second ECB pass, is only computed after the first ECB pass completes.
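(The following is schematic only, not the EME specification; it just mirrors the dependency described above: a value M is derived from every output of the first ECB pass, and the second pass consumes M, so the second pass cannot start until the first has finished.)

from functools import reduce
from operator import xor

def two_pass_structure(ecb, blocks):
    first = [ecb(b) for b in blocks]    # first ECB pass over every block
    m = reduce(xor, first)              # M depends on *all* first-pass outputs
    return [ecb(b ^ m) for b in first]  # second pass cannot begin before this point

# Toy stand-in for a block cipher, just to make the sketch runnable.
print(two_pass_structure(lambda x: (x * 167 + 13) % 256, [10, 20, 30]))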


Having to fetch data twice is very costly.

I bet it is!


For a block size of 4096 bits, it is reasonable to buffer the entire block within the AES hardware accelerator.


Which would be a significant advantage, AFAIK.

So, assuming a block of 4096 bits (512 bytes), the computation counts are:

LRW:  32 AES computations + 1 GF multiply
The first GF multiply must be performed before the first AES computation; all of the others can be performed in parallel.

EME:  66 AES computations
All of the GF multiplies can be performed in parallel.

XCB:  37 AES computations + 34 GF multiplies
There are 68 GF multiplies in total, split between two separate GHASH computations; 35 of them, I think, can be parallelized. The ciphertext stream B depends on D, which is AES(P_0 ^ I, K) ^ GHASH(H, Z, B). Computing H requires an AES computation, computing I requires an AES computation, and computing C requires an AES computation. Computing D requires a GHASH, which is a series of GF multiply-accumulates (34, to be precise). Given one AES engine and one GF multiply engine, those first three AES computations must occur serially, followed by the 34 GF multiplies, also serially. Only once D is computed can the AES engine and the multiplier be used concurrently. (A rough sketch of the GHASH multiply-accumulate chain follows the list.)

GCM:  34 AES computations + 2 GF multiplies
GHASH(AAD) can't start until AES(0, K) produces H. For a typical 96-bit AAD, that takes 2 GF multiplies. I'm assuming that the multiply-accumulate of len(A) || len(C) can be performed in parallel with a rear-loaded AES computation of CTR_0.

CCM:  66 AES computations
Counter mode requires 32 AES computations and CBC-MAC requires 32 AES computations, plus there are two additional AES computations -- one associated with the AAD, and the other to encrypt the MAC.
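To make that serial chain concrete, here's a rough Python sketch of the GF(2^128) multiply-accumulate that GHASH performs. It's just the core X_i = (X_{i-1} xor B_i) * H recurrence from the GCM definition, applied to raw 16-byte blocks; the AAD/length formatting of a full GHASH call is omitted, and the function names are mine.

def gf128_mult(x, y):
    """Multiply two elements of GF(2^128) using GCM's bit ordering
    (reduction polynomial x^128 + x^7 + x^2 + x + 1)."""
    R = 0xE1000000000000000000000000000000
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:          # walk the bits of x from the x^0 end
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z

def ghash_core(h, data):
    """Chained multiply-accumulate over 16-byte blocks.  Each step needs the
    previous accumulator value, which is why a single multiplier has to do
    them strictly in order."""
    assert len(data) % 16 == 0
    x = 0
    for i in range(0, len(data), 16):
        block = int.from_bytes(data[i:i + 16], "big")
        x = gf128_mult(x ^ block, h)      # one GF multiply per 128-bit block
    return x

# A 512-byte block is 32 chained multiplies; the 34-per-GHASH figure above
# presumably also counts the blocks carrying the tweak and length fields.
print(hex(ghash_core(int.from_bytes(b"\x42" * 16, "big"), b"\x01" * 512)))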

One useful strategy for implementing XCB is to store the values of H and I along with K. This enables the first ghash invocation and the first AES invocation to be computed in parallel. Of course, it requires additional storage.
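For concreteness, here's a minimal sketch of that key-setup caching in Python, using pycryptodome's AES for the block cipher. The constants used to derive H and I below are placeholders rather than the ones in the XCB specification; the point is only that both values depend on K alone, so they can be computed once at key-setup time and stored alongside the key.

from Crypto.Cipher import AES  # pip install pycryptodome

class CachedXCBKey:
    """Key material with the key-dependent values H and I precomputed."""
    def __init__(self, key: bytes):
        ecb = AES.new(key, AES.MODE_ECB)
        self.key = key
        # Spend the two AES operations here, once, instead of at the start of
        # every block encryption.
        self.h = ecb.encrypt(b"\x00" * 16)            # placeholder constant
        self.i = ecb.encrypt(b"\x00" * 15 + b"\x01")  # placeholder constant

ks = CachedXCBKey(b"\x00" * 16)

With self.h and self.i already stored, the first GHASH invocation and the first AES invocation of a block encryption no longer have to wait on each other.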


Let's assume 16 clock cycles per AES computation and 16 clock cycles per GF multiply. In that case, encrypting takes:
LRW:  33*16 = 528 clock cycles
EME:  66*16 = 1056 clock cycles
XCB:  (37+34)*16 = 1136 clock cycles
GCM:  (34+2) * 16 = 576 clock cycles
CCM:  66*16 = 1056 clock cycles
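(A throwaway Python snippet that reproduces the numbers above; the 16-cycle figure and the serial operation counts are just the ones from this thread.)

CYCLES_PER_OP = 16  # assumed: 16 cycles per AES block and per GF multiply

serial_ops = {
    "LRW": 32 + 1,    # 32 AES + 1 GF multiply
    "EME": 66,        # 66 AES; the GF multiplies hide in parallel
    "XCB": 37 + 34,   # 37 AES + 34 serial GF multiplies
    "GCM": 34 + 2,    # 34 AES + 2 serial GF multiplies
    "CCM": 66,        # 32 CTR + 32 CBC-MAC + 2 extra AES
}

for mode, ops in serial_ops.items():
    print(f"{mode}: {ops} * {CYCLES_PER_OP} = {ops * CYCLES_PER_OP} clock cycles")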


Roughly speaking, the cost of non-malleability (pseudorandom permutation modes) is equal to that of combined encryption & authentication modes.

The 16-clock-cycle figure per AES computation is based on experience. The GF multiply time is arbitrary -- it assumes a 128x16 multiplier used for 16 cycles, because that matches the performance of an AES engine. If you can afford the die area and power, you can of course make the GF multiplier much faster.


Here's my understanding of a hardware XCB implementation; please check me if I'm wrong. For large data sets, the computation is dominated by two operations: the initial ghash pass, and the final combined counter mode and ghash pass. So the latency between when the data is first provided to the encryptor and when the first data starts to leave the encryptor is roughly equal to the time it takes for the first ghash pass, and the total time is the sum of the time for both operations. As you point out, ghash can be made to run quickly at the cost of increased circuit size and power, so the first-data-out latency can be made small by cranking up the speed of the multiply operation to a rate faster than the AES engine. (Though of course in the second stage, it is desirable to have AES and ghash running at the same rate.) Since EME* has a speed determined by the AES engine, XCB can have lower latency, at the cost of a high-speed multiplier.
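To put rough numbers on that, here's a back-of-the-envelope latency model in Python. The per-block cycle counts are just the figures from this thread, it ignores the handful of serial setup operations, and the EME* model is my simplification (two full AES passes over the data), so treat it as illustrative only.

def xcb_latency(n_blocks, aes_cycles=16, gf_cycles=16):
    first_pass = n_blocks * gf_cycles                    # initial GHASH pass
    second_pass = n_blocks * max(aes_cycles, gf_cycles)  # CTR + GHASH together
    return first_pass, first_pass + second_pass          # (first data out, total)

def eme_star_latency(n_blocks, aes_cycles=16):
    ecb_pass = n_blocks * aes_cycles                     # speed pinned to AES
    return ecb_pass, 2 * ecb_pass                        # (first data out, total)

n = 32  # a 512-byte block is 32 AES-sized blocks
print("XCB, 16-cycle multiplier:", xcb_latency(n))
print("XCB,  4-cycle multiplier:", xcb_latency(n, gf_cycles=4))
print("EME*:                    ", eme_star_latency(n))

With a multiplier four times as fast as the AES engine, the first-data-out latency drops from 512 to 128 cycles in this model, while EME* stays pinned to a full AES pass.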

Comments welcome.

David

Note this analysis ignores two factors:
- bus latency -- the time resources are locked while waiting for data to arrive. If there is a bus latency issue, it is more likely to be an issue with XCB, where for large data sets the data would likely have to be fetched twice.
- control complexity -- LRW is clearly much simpler than either EME or XCB, and EME and XCB both present implementation challenges beyond those of CCM or GCM, because of message block scheduling.

mt
