Niels Möller <ni...@lysator.liu.se> writes:

> My current understanding is that the most important improvement over the
> initial implementation is to call the underlying block cipher with more
> than one block at a time (enabling parallelism on some hardware, and
> reducing overhead).

I've kept doing this, and now process up to 16 blocks at a time: a
function fills out the 16 offset blocks (with a minor optimization:
handling even and odd message count values separately, but not
precomputing any more values), followed by xor and cipher operations on
the 16 blocks.
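To illustrate the offset fill, here is a minimal sketch, not the actual
nettle code. It assumes the RFC 7253 rule Offset_i = Offset_{i-1} xor
L_{ntz(i)}, where ntz is the number of trailing zero bits and L_j is L_0
doubled j times in GF(2^128). The names struct block16, block16_double,
sketch_l_ntz and sketch_fill_offsets are hypothetical. Odd block numbers
use L_0 directly, and even ones compute L_{ntz(i)} on the fly, with no
table of precomputed values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OCB_BLOCK_SIZE 16
#define SKETCH_MAX_BLOCKS 16

/* Hypothetical stand-in for nettle's union nettle_block16. */
struct block16 { uint8_t b[OCB_BLOCK_SIZE]; };

/* dst = a XOR b; dst may alias a or b. */
static void
block16_xor3 (struct block16 *dst,
              const struct block16 *a, const struct block16 *b)
{
  for (size_t i = 0; i < OCB_BLOCK_SIZE; i++)
    dst->b[i] = a->b[i] ^ b->b[i];
}

/* Doubling in GF(2^128) as in RFC 7253: shift the big-endian block left
   by one bit, and xor 0x87 into the last byte if the top bit was set.
   dst may alias src. */
static void
block16_double (struct block16 *dst, const struct block16 *src)
{
  uint8_t carry = src->b[0] >> 7;
  for (size_t i = 0; i + 1 < OCB_BLOCK_SIZE; i++)
    dst->b[i] = (uint8_t) ((src->b[i] << 1) | (src->b[i + 1] >> 7));
  dst->b[OCB_BLOCK_SIZE - 1]
    = (uint8_t) ((src->b[OCB_BLOCK_SIZE - 1] << 1) ^ (carry ? 0x87 : 0));
}

/* Compute L_{ntz(i)} by doubling L_0 once per trailing zero bit of i. */
static void
sketch_l_ntz (struct block16 *l, const struct block16 *l0, size_t i)
{
  assert (i > 0);
  *l = *l0;
  for (; (i & 1) == 0; i >>= 1)
    block16_double (l, l);
}

/* Fill o[0..n-1] with Offset_{count+1} ... Offset_{count+n}, where
   *offset holds Offset_{count} on entry and Offset_{count+n} on return. */
static void
sketch_fill_offsets (const struct block16 *l0, struct block16 *offset,
                     size_t count, size_t n, struct block16 *o)
{
  assert (n <= SKETCH_MAX_BLOCKS);
  for (size_t k = 0; k < n; k++)
    {
      size_t i = count + k + 1;
      struct block16 l;
      if (i & 1)
        l = *l0;                    /* Odd block number: ntz (i) == 0. */
      else
        sketch_l_ntz (&l, l0, i);   /* Even block number: double as needed. */
      block16_xor3 (&o[k], offset, &l);
      *offset = o[k];
    }
}
```

With all offsets for a batch in one array, the xor with the message
blocks and the block cipher call can each run over the whole batch.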
The same fill function can then also be used for ocb_update (i.e.,
processing of associated data).

I also found that the xoring for the checksum was a bottleneck. In the
initial implementation this was just

  static void
  ocb_checksum_n (union nettle_block16 *checksum,
                  size_t n, const uint8_t *src)
  {
    for (; n > 0; n--, src += OCB_BLOCK_SIZE)
      memxor (checksum->b, src, OCB_BLOCK_SIZE);
  }

This is a lot of overhead per block, in particular since src does not
have any guaranteed alignment. I rewrote this as a main loop reading and
xoring aligned uint64_t, plus some rather hairy code to handle the first
and last bytes and rotate bytes to the right position. The bulk of the
work is then the very simple loop

  /* Now src is 64-bit aligned, so do 64-bit reads. */
  for (s0 = s1 = 0 ; n > 0; n--, src += OCB_BLOCK_SIZE)
    {
      s0 ^= ((const uint64_t *) src)[0];
      s1 ^= ((const uint64_t *) src)[1];
    }

Together, these optimizations gave a speedup of about 3x for all of
encrypt, decrypt, and update. Speed is now slightly faster than gcm
(except for update, which is still slower), and considerably faster than
eax. Benchmarked on my x86_64 laptop, which has special instructions for
both aes and gcm.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
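As a point of comparison for the checksum rewrite described above, here
is a minimal sketch of the two-accumulator idea. It is not the code from
the patch: instead of aligning src and rotating bytes, it loads each
64-bit word with memcpy, which is valid for any alignment and which
compilers commonly turn into a plain load. The names union block16,
checksum_n_bytewise and checksum_n_words are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OCB_BLOCK_SIZE 16

/* Hypothetical stand-in for nettle's union nettle_block16. */
union block16
{
  uint8_t b[OCB_BLOCK_SIZE];
  uint64_t u64[2];
};

/* Reference version: xor n blocks into the checksum one byte at a time. */
static void
checksum_n_bytewise (union block16 *checksum, size_t n, const uint8_t *src)
{
  for (; n > 0; n--, src += OCB_BLOCK_SIZE)
    for (size_t i = 0; i < OCB_BLOCK_SIZE; i++)
      checksum->b[i] ^= src[i];
}

/* Word version: accumulate in two 64-bit variables, and fold them into
   the checksum once at the end. Loads and stores both use native byte
   order, so the result matches the bytewise xor on any endianness. */
static void
checksum_n_words (union block16 *checksum, size_t n, const uint8_t *src)
{
  uint64_t s0 = 0, s1 = 0;
  assert (src != NULL || n == 0);
  for (; n > 0; n--, src += OCB_BLOCK_SIZE)
    {
      uint64_t w0, w1;
      memcpy (&w0, src, sizeof w0);      /* Alignment-safe 64-bit load. */
      memcpy (&w1, src + 8, sizeof w1);
      s0 ^= w0;
      s1 ^= w1;
    }
  checksum->u64[0] ^= s0;
  checksum->u64[1] ^= s1;
}
```

Whether this is as fast as the aligned-read loop depends on how the
target handles unaligned loads, so it would need its own benchmark.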