It's gotten better with this patch: it now takes 0.000049 seconds to execute under the same circumstances.
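(For reference, figures like these can be reproduced outside the library with a small standalone harness. gcm_fill is static in gcm.c, so the sketch below times a local copy of the portable fallback logic rather than Nettle's own symbol; BLOCKS and REPEAT match the 32,768-block, 10000-repetition setup quoted below, and the file and function names are illustrative only.)

/* bench_gcm_fill.c -- standalone timing sketch.
   Build: cc -O3 bench_gcm_fill.c -o bench_gcm_fill */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BLOCKS 32768   /* 512 KB of 16-byte blocks, as in the thread */
#define REPEAT 10000

union nettle_block16 { uint8_t b[16]; uint64_t u64[2]; };

/* Local copy of the portable gcm_fill logic: copy the 12 IV bytes,
   write an incrementing big-endian counter into the last 4 bytes. */
static void
gcm_fill_c (uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
{
  uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
    | ((uint32_t) ctr[14] << 8) | ctr[15];
  size_t i;
  for (i = 0; i < blocks; i++, c++)
    {
      memcpy (buffer[i].b, ctr, 12);
      buffer[i].b[12] = c >> 24;
      buffer[i].b[13] = (c >> 16) & 0xff;
      buffer[i].b[14] = (c >> 8) & 0xff;
      buffer[i].b[15] = c & 0xff;
    }
  ctr[12] = c >> 24;
  ctr[13] = (c >> 16) & 0xff;
  ctr[14] = (c >> 8) & 0xff;
  ctr[15] = c & 0xff;
}

int
main (void)
{
  static union nettle_block16 buffer[BLOCKS];
  uint8_t ctr[16] = { 0 };
  struct timespec t0, t1;
  double total;
  int r;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (r = 0; r < REPEAT; r++)
    gcm_fill_c (ctr, BLOCKS, buffer);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  total = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
  /* Touch the output so the fill loop can't be optimized away. */
  printf ("%.9f seconds per call (last byte %u)\n",
          total / REPEAT, buffer[BLOCKS - 1].b[15]);
  return 0;
}

The printed per-call time is directly comparable to the figures in this thread, though the absolute numbers of course depend on the machine.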
On Fri, Sep 25, 2020 at 9:59 AM Niels Möller <ni...@lysator.liu.se> wrote:

> Maamoun TK <maamoun...@googlemail.com> writes:
>
> >> What's the speedup you get from assembly gcm_fill? I see the C
> >> implementation uses memcpy and WRITE_UINT32, and is likely significantly
> >> slower than the ctr_fill16 in ctr.c. But it could be improved using
> >> portable means. If done well, it should be a very small fraction of the
> >> cpu time spent for gcm encryption.
>
> > I measured the execution time of both C and altivec implementations on
> > POWER8 for 32,768 blocks (512 KB), repeated 10000 times and compiled
> > with -O3. gcm_fill_c() took 0.000073 seconds to execute;
> > gcm_fill_altivec() took 0.000019 seconds to execute. As you can see,
> > the function itself isn't time consuming at all, and maybe optimizing
> > it is not worth it,
>
> Can you try the patch below? For now, tested on little endian (x86_64) only,
> and there the loop compiles to
>
>   50:  89 c8           mov    %ecx,%eax
>   52:  4c 89 0a        mov    %r9,(%rdx)
>   55:  48 83 c2 10     add    $0x10,%rdx
>   59:  83 c1 01        add    $0x1,%ecx
>   5c:  0f c8           bswap  %eax
>   5e:  48 c1 e0 20     shl    $0x20,%rax
>   62:  4c 01 d0        add    %r10,%rax
>   65:  48 89 42 f8     mov    %rax,-0x8(%rdx)
>   69:  4c 39 c2        cmp    %r8,%rdx
>   6c:  75 e2           jne    50 <gcm_fill+0x20>
>
> It should run in a few cycles per block (6 cycles per block, assuming
> dual issue and decent out-of-order capabilities). I would expect unrolling,
> to do multiple blocks in parallel, to give a large performance
> improvement only on strict in-order processors.
>
> > but gcm_fill is part of AES_CTR, and what other
> > libraries usually do is optimize AES_CTR as a whole, so I considered
> > optimizing it to stay on the same track.
>
> In Nettle, I strive to go to the extra complexity of an assembler
> implementation only when there's a significant performance benefit.
>
> Regards,
> /Niels
>
> diff --git a/gcm.c b/gcm.c
> index cf615daf..71e9f365 100644
> --- a/gcm.c
> +++ b/gcm.c
> @@ -334,6 +334,46 @@ gcm_update(struct gcm_ctx *ctx, const struct gcm_key *key,
>  }
>
>  static nettle_fill16_func gcm_fill;
> +#if WORDS_BIGENDIAN
> +static void
> +gcm_fill(uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
> +{
> +  uint64_t hi, mid;
> +  uint32_t lo;
> +  size_t i;
> +  hi = READ_UINT64(ctr);
> +  mid = (uint64_t) READ_UINT32(ctr + 8) << 32;
> +  lo = READ_UINT32(ctr + 12);
> +
> +  for (i = 0; i < blocks; i++)
> +    {
> +      buffer[i].u64[0] = hi;
> +      buffer[i].u64[1] = mid + lo++;
> +    }
> +  WRITE_UINT32(ctr + 12, lo);
> +
> +}
> +#elif HAVE_BUILTIN_BSWAP64
> +/* Assume __builtin_bswap32 is also available */
> +static void
> +gcm_fill(uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
> +{
> +  uint64_t hi, mid;
> +  uint32_t lo;
> +  size_t i;
> +  hi = LE_READ_UINT64(ctr);
> +  mid = LE_READ_UINT32(ctr + 8);
> +  lo = READ_UINT32(ctr + 12);
> +
> +  for (i = 0; i < blocks; i++)
> +    {
> +      buffer[i].u64[0] = hi;
> +      buffer[i].u64[1] = mid + ((uint64_t)__builtin_bswap32(lo) << 32);
> +      lo++;
> +    }
> +  WRITE_UINT32(ctr + 12, lo);
> +}
> +#else
>  static void
>  gcm_fill(uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
>  {
> @@ -349,6 +389,7 @@ gcm_fill(uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
>
>    WRITE_UINT32(ctr + GCM_BLOCK_SIZE - 4, c);
>  }
> +#endif
>
>  void
>  gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>
> --
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.
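(As a sanity check on the endianness logic in the patch above: on a little-endian host the low half of u64[1] holds block bytes 8-11 and the high half bytes 12-15, so adding (uint64_t)__builtin_bswap32(lo) << 32 rebuilds the big-endian counter in place, and since mid < 2^32 the addition can never carry between the two halves. The standalone sketch below, assuming a little-endian host and the GCC/Clang bswap builtin, checks that this fast path is byte-for-byte equivalent to the portable version; rd32/wr32 stand in for Nettle's READ_UINT32/WRITE_UINT32, and all names are illustrative, not part of Nettle's test suite.)

/* check_gcm_fill.c -- equivalence check sketch (little-endian hosts).
   Build: cc -O2 check_gcm_fill.c -o check_gcm_fill */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

union nettle_block16 { uint8_t b[16]; uint64_t u64[2]; };

/* Big-endian 32-bit read/write, standing in for READ_UINT32/WRITE_UINT32. */
static uint32_t
rd32 (const uint8_t *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
    | ((uint32_t) p[2] << 8) | p[3];
}

static void
wr32 (uint8_t *p, uint32_t x)
{
  p[0] = x >> 24; p[1] = (x >> 16) & 0xff;
  p[2] = (x >> 8) & 0xff; p[3] = x & 0xff;
}

/* Portable reference: copy the 12 IV bytes, bump the big-endian counter. */
static void
fill_ref (uint8_t *ctr, size_t blocks, union nettle_block16 *buf)
{
  uint32_t c = rd32 (ctr + 12);
  size_t i;
  for (i = 0; i < blocks; i++, c++)
    {
      memcpy (buf[i].b, ctr, 12);
      wr32 (buf[i].b + 12, c);
    }
  wr32 (ctr + 12, c);
}

/* Little-endian fast path from the patch: bytes 0-11 are kept as two
   host-order words; bytes 12-15 are rebuilt with a single bswap. */
static void
fill_bswap (uint8_t *ctr, size_t blocks, union nettle_block16 *buf)
{
  uint64_t hi, mid;
  uint32_t m32, lo;
  size_t i;
  memcpy (&hi, ctr, 8);      /* LE_READ_UINT64 on a little-endian host */
  memcpy (&m32, ctr + 8, 4); /* LE_READ_UINT32 on a little-endian host */
  mid = m32;
  lo = rd32 (ctr + 12);      /* counter value; big-endian in memory */
  for (i = 0; i < blocks; i++, lo++)
    {
      buf[i].u64[0] = hi;
      buf[i].u64[1] = mid + ((uint64_t) __builtin_bswap32 (lo) << 32);
    }
  wr32 (ctr + 12, lo);
}

int
main (void)
{
  union nettle_block16 a[4], b[4];
  uint8_t c1[16], c2[16];
  int i;
  for (i = 0; i < 16; i++)
    c1[i] = c2[i] = (uint8_t) (17 * i + 3); /* arbitrary IV and counter */
  fill_ref (c1, 4, a);
  fill_bswap (c2, 4, b);
  assert (memcmp (a, b, sizeof a) == 0);  /* same counter blocks */
  assert (memcmp (c1, c2, 16) == 0);      /* same updated counter */
  puts ("ok: bswap fast path matches portable gcm_fill");
  return 0;
}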