> Date: Thu, 18 Jun 2020 20:21:36 -0400
> From: Greg Troxel <g...@lexort.com>
> 
> Taylor R Campbell <riastr...@netbsd.org> writes:
> 
> > Can you be more specific about the systems you're concerned about?
> 
> What I meant is: consider an external USB disk of say 4T, which has a
> cgd partition within which is ffs.
> 
> Someone attaches it to several systems in turn, doing cgd_attach, mount,
> and then runs bup with /mnt/bup as the target, getting deduplication
> across systems.

(Side note: as a matter of architecture I would recommend
incorporating the cryptography into the application, like borgbackup,
restic, or Tarsnap do -- anything at a higher level than disks (even
at the level of the file system, like zfs encryption) has much more
flexibility and can also provide authentication.  Generally the main
use case for disk encryption is to enable recycling disks without
worrying about information disclosure; the threat model and security
of disk encryption systems are both qualitatively very weak.)

> [...]
> So, using the new faster cipher won't work, because it's not supported
> by the older systems.
> 
> However, if the -current system does AES slowly because it has the new
> constant-time implementation, and the older ones do it like they used
> to, I don't see a real problem.

OK.  If you encounter a scenario where this is likely to be a real
problem, let me know.

I drafted an SSE2 implementation which considerably improves on the
BearSSL aes_ct implementation on a number of amd64 CPUs I tested from
around a decade ago.  It is still slower than before -- and AES-CBC
encryption hurts by far the most, because it is necessarily
sequential, whereas AES-CBC decryption and AES-XTS in both directions
can be vectorized -- but it does mitigate the problem somewhat.  This
covers all amd64 CPUs and probably most `i386' CPUs of the last 15-20
years.
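Why CBC encryption is the worst case follows from the mode's definitions. In encryption, C[i] = E(P[i] xor C[i-1]), so block i cannot start until block i-1 is finished. In decryption, P[i] = D(C[i]) xor C[i-1], and every C[i] is already known, so blocks are independent. A bitsliced implementation typically processes several blocks per pass, so a mode that only ever has one block ready leaves most of that capacity idle. A sketch in C, where `toy_enc`/`toy_dec` are a hypothetical, insecure byte-wise permutation standing in for AES:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Hypothetical toy block cipher standing in for AES.  NOT secure. */
static void toy_enc(const uint8_t k[BLK], const uint8_t in[BLK], uint8_t out[BLK])
{
	for (int i = 0; i < BLK; i++)
		out[i] = (uint8_t)((in[i] ^ k[i]) + 1);
}

static void toy_dec(const uint8_t k[BLK], const uint8_t in[BLK], uint8_t out[BLK])
{
	for (int i = 0; i < BLK; i++)
		out[i] = (uint8_t)((in[i] - 1) ^ k[i]);
}

/* CBC encryption: C[i] = E(P[i] ^ C[i-1]), C[-1] = IV.
 * Each iteration needs the previous iteration's output: inherently sequential. */
static void cbc_encrypt(const uint8_t k[BLK], const uint8_t iv[BLK],
    const uint8_t *in, uint8_t *out, size_t nblk)
{
	uint8_t chain[BLK], tmp[BLK];

	memcpy(chain, iv, BLK);
	for (size_t b = 0; b < nblk; b++) {
		for (int i = 0; i < BLK; i++)
			tmp[i] = in[b * BLK + i] ^ chain[i];
		toy_enc(k, tmp, &out[b * BLK]);
		memcpy(chain, &out[b * BLK], BLK);
	}
}

/* CBC decryption: P[i] = D(C[i]) ^ C[i-1].  All inputs are ciphertext that
 * is already available, so the iterations are independent and could be done
 * several blocks at a time.  (This sketch requires in and out not to overlap.) */
static void cbc_decrypt(const uint8_t k[BLK], const uint8_t iv[BLK],
    const uint8_t *in, uint8_t *out, size_t nblk)
{
	for (size_t b = 0; b < nblk; b++) {
		const uint8_t *prev = (b == 0) ? iv : &in[(b - 1) * BLK];

		toy_dec(k, &in[b * BLK], &out[b * BLK]);
		for (int i = 0; i < BLK; i++)
			out[b * BLK + i] ^= prev[i];
	}
}
```

Flipping one bit of the first plaintext block changes every later ciphertext block, which is the same dependency chain that forces encryption to run one block at a time.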

There is some more room for improvement -- SSSE3 provides PSHUFB, which
can speed up parts of AES even when blocks must be processed one at a
time, and is supported by a good number of amd64 CPUs starting around
14 years ago that lack AES-NI --
but there are diminishing returns for increasing implementation and
maintenance effort, so I'd like to focus on making an impact on
systems that matter.  (That includes non-x86 CPUs -- e.g., we could
probably easily adapt the Intel SSE2 logic to ARM NEON -- but I would
like to focus on systems where there is demand.)
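PSHUFB does sixteen lookups into a 16-entry table held in a register, in one instruction whose timing does not depend on the indices. A sketch of how that can be used, assuming a GF(2)-linear byte map such as AES's multiply-by-x (`xtime`): linearity gives f(a) = f(lo) xor f(hi << 4), so two nibble-indexed lookups replace one 256-entry table. `pshufb_model` is a hypothetical portable model of the semantics of the real intrinsic `_mm_shuffle_epi8` (from <tmmintrin.h>); being plain C with indexed loads, the model itself is not constant-time and only demonstrates the arithmetic.

```c
#include <stdint.h>

/* Portable model of one PSHUFB on a 16-byte register:
 * out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0f].
 * Describes semantics only; this C code is NOT constant-time. */
static void pshufb_model(const uint8_t table[16], const uint8_t idx[16],
    uint8_t out[16])
{
	for (int i = 0; i < 16; i++)
		out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0f];
}

/* Multiply by x in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t xtime(uint8_t a)
{
	return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0));
}

/* xtime is linear over GF(2), so xtime(a) = xtime(lo) ^ xtime(hi << 4).
 * Sixteen bytes are transformed with two 16-entry lookups and one XOR. */
static void xtime16(const uint8_t in[16], uint8_t out[16])
{
	uint8_t lo_tab[16], hi_tab[16], lo[16], hi[16], r_lo[16], r_hi[16];

	/* tables depend only on public constants, built once in practice */
	for (int n = 0; n < 16; n++) {
		lo_tab[n] = xtime((uint8_t)n);
		hi_tab[n] = xtime((uint8_t)(n << 4));
	}
	for (int i = 0; i < 16; i++) {
		lo[i] = in[i] & 0x0f;
		hi[i] = in[i] >> 4;
	}
	pshufb_model(lo_tab, lo, r_lo);
	pshufb_model(hi_tab, hi, r_hi);
	for (int i = 0; i < 16; i++)
		out[i] = r_lo[i] ^ r_hi[i];
}
```

Non-linear steps such as the S-box inversion need a more elaborate decomposition; this only shows the basic nibble-lookup idea.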

> > The best way to test this is to just boot a new kernel and try a
> > workload.  But I assume you are looking for a userland program that
> > one can compile and run to test it without booting a new kernel.
> 
> Yes, that's what I meant.  Kind of like "openssl speed".

I drafted a couple programs to approximately measure performance from
userland.  They are very naive and do nothing to measure overhead from
cgd(4) or disk i/o itself.

https://www.NetBSD.org/~riastradh/tmp/20200621/aestest.tgz
https://www.NetBSD.org/~riastradh/tmp/20200622/adiantum.tgz

aestest usage:

# build it
make aestest bad

# measure performance of the bad AES implementation
progress -f /dev/zero sh -c 'exec ./bad >/dev/null'

# list new AES implementations supported on this CPU; on Intel,
# descending order of preference according to hardware support is:
# - Intel AES-NI
# - Intel SSE2 bitsliced
# - BearSSL aes_ct
./aestest -l

# measure performance of one of them for AES-XTS encryption with
# AES-256 (equivalent of `aes-xts 512' in cgd, because XTS takes two
# AES keys)
progress -f /dev/zero sh -c 'exec ./aestest -e -b 256 -c aes-xts -i "Intel SSE2 bitsliced" > /dev/null'
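In IEEE Std 1619 terms, a 512-bit XTS key is two AES-256 keys: K1 encrypts the data and K2 encrypts the sector number into an initial tweak T. Block j of the sector uses T times alpha^j in GF(2^128), so every block's tweak can be computed up front and the blocks C[j] = E_K1(P[j] xor T[j]) xor T[j] encrypted independently, which is why XTS vectorizes in both directions. A sketch with hypothetical helper names; the Key1-then-Key2 split follows the standard's convention and is not a statement about cgd's internal key layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Split 64 bytes (512 bits) of key material into two AES-256 keys. */
static void xts_split_key(const uint8_t key[64], uint8_t k1[32], uint8_t k2[32])
{
	memcpy(k1, key, 32);		/* data key */
	memcpy(k2, key + 32, 32);	/* tweak key */
}

/* Multiply a 128-bit tweak (little-endian bytes) by alpha = x in GF(2^128)
 * modulo x^128 + x^7 + x^2 + x + 1. */
static void xts_mul_alpha(uint8_t t[16])
{
	uint8_t carry = 0;

	for (int i = 0; i < 16; i++) {
		uint8_t next = t[i] >> 7;

		t[i] = (uint8_t)((t[i] << 1) | carry);
		carry = next;
	}
	t[0] ^= (uint8_t)(0x87 & -carry);	/* reduce without a branch */
}

/* Tweaks for blocks 0..n-1 depend only on the initial tweak t0, so they can
 * be computed first and the blocks then processed in parallel. */
static void xts_tweaks(const uint8_t t0[16], uint8_t t[][16], size_t n)
{
	if (n == 0)
		return;
	memcpy(t[0], t0, 16);
	for (size_t j = 1; j < n; j++) {
		memcpy(t[j], t[j - 1], 16);
		xts_mul_alpha(t[j]);
	}
}
```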

adiantum usage:

# build it
make adiantum

# measure performance of adiantum on 512-byte units (will see ~20%
# improvement if/when we teach cgd(4) to handle 4096-byte units;
# change `#define SECSIZE 512' to `#define SECSIZE 4096' to simulate)
progress -f /dev/zero sh -c 'exec ./adiantum >/dev/null'

Generally I expect adiantum to outperform all AES-based ciphers on all
CPUs _except_ those CPUs with hardware AES support -- and there is
plenty of room for improvement of Adiantum performance with MD
vectorization of ChaCha and Poly1305; this implementation is almost
the naivest possible portable C code.
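The ChaCha quarter-round is the core operation such MD vectorization would target. A portable sketch following RFC 8439 (Adiantum uses the XChaCha12 variant, which differs in round count and nonce handling but uses the same quarter-round). The four quarter-rounds in a column round, and likewise in a diagonal round, touch disjoint state words, which is what lets an SSE2 or NEON implementation run them in four lanes at once. This is illustrative and is not the code in adiantum.tgz.

```c
#include <stdint.h>

static inline uint32_t rotl32(uint32_t v, int c)
{
	return (v << c) | (v >> (32 - c));	/* c is never 0 here */
}

/* ChaCha quarter-round (RFC 8439, section 2.1). */
static void chacha_qr(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	*a += *b; *d ^= *a; *d = rotl32(*d, 16);
	*c += *d; *b ^= *c; *b = rotl32(*b, 12);
	*a += *b; *d ^= *a; *d = rotl32(*d, 8);
	*c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/* One double round on the 4x4 state.  Within each half, the four
 * quarter-rounds are independent and could run in parallel SIMD lanes. */
static void chacha_doubleround(uint32_t x[16])
{
	/* column round */
	chacha_qr(&x[0], &x[4], &x[8],  &x[12]);
	chacha_qr(&x[1], &x[5], &x[9],  &x[13]);
	chacha_qr(&x[2], &x[6], &x[10], &x[14]);
	chacha_qr(&x[3], &x[7], &x[11], &x[15]);
	/* diagonal round */
	chacha_qr(&x[0], &x[5], &x[10], &x[15]);
	chacha_qr(&x[1], &x[6], &x[11], &x[12]);
	chacha_qr(&x[2], &x[7], &x[8],  &x[13]);
	chacha_qr(&x[3], &x[4], &x[9],  &x[14]);
}
```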

> > (Note that there is no impact on userland crypto, which means no
> > impact on TLS or OpenVPN or anything like that[...])
> 
> So it remains to make userland AES use also constant time, as a separate
> step?

Correct.
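For context on what `constant time' rules out: table-based AES indexes memory with secret-dependent bytes, which can leak through cache timing. A minimal C illustration with hypothetical function names. The scan-and-mask lookup is the simplest constant-time substitute, but it is not what BearSSL aes_ct or the SSE2 code do (those compute the S-box with bitsliced boolean logic instead), and a C compiler is not formally obliged to keep it branch-free.

```c
#include <stdint.h>

/* Leaky: the address loaded depends on the secret, as in table-based AES. */
static uint8_t lookup_leaky(const uint8_t table[256], uint8_t secret)
{
	return table[secret];
}

/* Constant-time pattern: read every entry and keep the wanted one with a
 * mask, so the sequence of addresses does not depend on the secret.
 * Costs 256 loads instead of 1. */
static uint8_t lookup_ct(const uint8_t table[256], uint8_t secret)
{
	uint8_t r = 0;

	for (unsigned i = 0; i < 256; i++) {
		uint32_t diff = i ^ secret;
		/* 0xff if diff == 0 (i == secret), else 0x00, without a branch */
		uint8_t mask = (uint8_t)((diff - 1) >> 8);

		r |= table[i] & mask;
	}
	return r;
}
```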

> >> I'm unclear on openssl and hardware support; "openssl speed" might be a
> >> good home for this, and I don't know if openssl needs the same treatment
> >> as cgd.  (Fine to keep separable things separate; not a complaint!)
> >
> > OpenSSL is a mixed bag.  It has a lot more MD implementations of
> > various cryptographic primitives.  But many of them are still leaky.
> > So it's probably not a very good proxy for what the performance impact
> > of this patch set will be.
> 
> I sort of meant putting the new code in there so it can be measured, but
> I realize that's messy.

`messy' indeed.
