RE: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT

2018-03-10 Thread Vakul Garg
Hi

How does this patchset affect the throughput performance of crypto?
Is it expected to increase?

Regards

Vakul

> -----Original Message-----
> From: linux-crypto-ow...@vger.kernel.org [mailto:linux-crypto-
> ow...@vger.kernel.org] On Behalf Of Ard Biesheuvel
> Sent: Saturday, March 10, 2018 8:52 PM
> To: linux-crypto@vger.kernel.org
> Cc: herb...@gondor.apana.org.au; linux-arm-ker...@lists.infradead.org;
> Ard Biesheuvel ; Dave Martin
> ; Russell King - ARM Linux
> ; Sebastian Andrzej Siewior
> ; Mark Rutland ; linux-rt-
> us...@vger.kernel.org; Peter Zijlstra ; Catalin
> Marinas ; Will Deacon
> ; Steven Rostedt ; Thomas
> Gleixner 
> Subject: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
> 
> As reported by Sebastian, the way the arm64 NEON crypto code currently
> keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
> causing problems with RT builds, given that the skcipher walk API may
> allocate and free temporary buffers it uses to present the input and output
> arrays to the crypto algorithm in blocksize sized chunks (where blocksize is
> the natural blocksize of the crypto algorithm), and doing so with NEON
> enabled means we're alloc/free'ing memory with preemption disabled.
> 
> This was deliberate: when this code was introduced, each
> kernel_neon_begin() and kernel_neon_end() call incurred a fixed penalty of
> storing or loading, respectively, the contents of all NEON registers
> to/from memory, and so doing it less often had an obvious performance
> benefit. However, in the meantime, we have refactored the core kernel mode
> NEON code, and now kernel_neon_begin() only incurs this penalty the first
> time it is called after entering the kernel, and the NEON register restore
> is deferred until returning to userland. This means pulling those calls
> into the loops that iterate over the input/output of the crypto algorithm
> is no longer a big deal (although there are some places in the code where
> we relied on the NEON registers retaining their values between calls).
> 
> So let's clean this up for arm64: update the NEON based skcipher drivers to
> no longer keep the NEON enabled when calling into the skcipher walk API.
> 
> As pointed out by Peter, this only solves part of the problem. So let's
> tackle it more thoroughly, and update the algorithms to test the
> NEED_RESCHED flag each time after processing a fixed chunk of input.
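> 
> To illustrate, the glue code ends up following roughly this pattern (a
> minimal sketch with a hypothetical my_cipher_encrypt_neon() transform and
> my_ctx context, not the actual driver code):
> 
> #include <crypto/aes.h>
> #include <crypto/internal/skcipher.h>
> #include <asm/neon.h>
> 
> struct my_ctx { u32 key_enc[AES_MAX_KEYLENGTH_U32]; };
> 
> asmlinkage void my_cipher_encrypt_neon(u8 *dst, const u8 *src,
>                                        const u32 *rk, int blocks, u8 *iv);
> 
> static int my_skcipher_encrypt(struct skcipher_request *req)
> {
>         struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
>         struct my_ctx *ctx = crypto_skcipher_ctx(tfm);
>         struct skcipher_walk walk;
>         int err;
> 
>         /* may allocate/free memory, so run with preemption enabled */
>         err = skcipher_walk_virt(&walk, req, false);
> 
>         while (walk.nbytes > 0) {
>                 unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
> 
>                 kernel_neon_begin();            /* preemption off */
>                 my_cipher_encrypt_neon(walk.dst.virt.addr,
>                                        walk.src.virt.addr,
>                                        ctx->key_enc, blocks, walk.iv);
>                 kernel_neon_end();              /* preemption back on */
> 
>                 /* may allocate/free again, so NEON must be off here */
>                 err = skcipher_walk_done(&walk,
>                                          walk.nbytes % AES_BLOCK_SIZE);
>         }
>         return err;
> }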
> 
> Given that this issue was flagged by the RT people, I would appreciate it if
> they could confirm whether they are happy with this approach.
> 
> Changes since v4:
> - rebase onto v4.16-rc3
> - apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
>   in v4.16-rc1
> 
> Changes since v3:
> - incorporate Dave's feedback on the asm macros to push/pop frames and to
>   yield the NEON conditionally
> - make frame_push/pop easier to use, by recording the arguments to
>   frame_push, removing the need to specify them again when calling frame_pop
> - emit local symbol .Lframe_local_offset to allow code using the frame
>   push/pop macros to index the stack more easily
> - use the magic \@ macro invocation counter provided by GAS to generate
>   unique labels in the NEON yield macros, rather than relying on chance
> 
> Changes since v2:
> - Drop logic to yield only after so many blocks - as it turns out, the
>   throughput of the algorithms that are most likely to be affected by the
>   overhead (GHASH and AES-CE) only drops by ~1% (on Cortex-A57), and if
>   that is unacceptable, you are probably not using CONFIG_PREEMPT in the
>   first place.
> - Add yield support to the AES-CCM driver
> - Clean up macros based on feedback from Dave
> - Given that I had to add stack frame logic to many of these functions, factor
>   it out and wrap it in a couple of macros
> - Merge the changes to the core asm driver and glue code of the GHASH/GCM
>   driver. The latter was not correct without the former.
> 
> Changes since v1:
> - add CRC-T10DIF test vector (#1)
> - stop using GFP_ATOMIC in scatterwalk API calls, now that they are
>   executed with preemption enabled (#2 - #6)
> - do some preparatory refactoring on the AES block mode code (#7 - #9)
> - add yield patches (#10 - #18)
> - add test patch (#19) - DO NOT MERGE
> 
> Cc: Dave Martin 
> Cc: Russell King - ARM Linux 
> Cc: Sebastian Andrzej Siewior 
> Cc: Mark Rutland 
> Cc: linux-rt-us...@vger.kernel.org
> Cc: Peter Zijlstra 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Steven Rostedt 
> Cc: Thomas Gleixner 
> 
> Ard Biesheuvel (23):
>   

[RFC PATCH] crypto: pcrypt - forbid recursive instantiation

2018-03-10 Thread Eric Biggers
From: Eric Biggers 

If the pcrypt template is used multiple times in an algorithm, then a
deadlock occurs because all pcrypt instances share the same
padata_instance, which completes requests in the order submitted.  That
is, the inner pcrypt request waits for the outer pcrypt request while
the outer request is already waiting for the inner.

Fix this by making pcrypt forbid instantiation if pcrypt appears in the
underlying ->cra_driver_name.  This is somewhat of a hack, but it's a
simple fix that should be sufficient to prevent the deadlock.

Reproducer:

#include <linux/if_alg.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
struct sockaddr_alg addr = {
.salg_type = "aead",
.salg_name = "pcrypt(pcrypt(rfc4106-gcm-aesni))"
};
int algfd, reqfd;
char buf[32] = { 0 };

algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(algfd, (void *)&addr, sizeof(addr));
setsockopt(algfd, SOL_ALG, ALG_SET_KEY, buf, 20);
reqfd = accept(algfd, 0, 0);
write(reqfd, buf, 32);
read(reqfd, buf, 16);
}

Reported-by: 
syzbot+56c7151cad94eec37c521f0e47d2eee53f936...@syzkaller.appspotmail.com
Fixes: 5068c7a883d1 ("crypto: pcrypt - Add pcrypt crypto parallelization 
wrapper")
Cc:  # v2.6.34+
Signed-off-by: Eric Biggers 
---
 crypto/pcrypt.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index f8ec3d4ba4a80..3ec64604f6a56 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -265,6 +265,12 @@ static void pcrypt_free(struct aead_instance *inst)
 static int pcrypt_init_instance(struct crypto_instance *inst,
struct crypto_alg *alg)
 {
+   /* Recursive pcrypt deadlocks due to the shared padata_instance */
+   if (!strncmp(alg->cra_driver_name, "pcrypt(", 7) ||
+   strstr(alg->cra_driver_name, "(pcrypt(") ||
+   strstr(alg->cra_driver_name, ",pcrypt("))
+   return -EINVAL;
+
if (snprintf(inst->alg.cra_driver_name, CRYPTO_MAX_ALG_NAME,
 "pcrypt(%s)", alg->cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
return -ENAMETOOLONG;
-- 
2.16.2



[PATCH v3 6/6] tpm2-sessions: NOT FOR COMMITTING add sessions testing

2018-03-10 Thread James Bottomley
This runs through a preset sequence using sessions to demonstrate that
the session handling code functions.  It exercises HMAC, encryption
and decryption by testing an encrypted sealing operation with
authority and proving that the same sealed data comes back again via
an HMAC and response encryption.  It also does policy unsealing, which
mimics the more complex of the trusted key scenarios.

Signed-off-by: James Bottomley 

---
v3: add policy unseal testing with two sessions
---
 drivers/char/tpm/Makefile |   1 +
 drivers/char/tpm/tpm-chip.c   |   1 +
 drivers/char/tpm/tpm.h|   1 +
 drivers/char/tpm/tpm2-cmd.c   |   2 +
 drivers/char/tpm/tpm2-sessions-test.c | 359 ++
 5 files changed, 364 insertions(+)
 create mode 100644 drivers/char/tpm/tpm2-sessions-test.c

diff --git a/drivers/char/tpm/Makefile b/drivers/char/tpm/Makefile
index b83737ccaa81..1ac7a4046630 100644
--- a/drivers/char/tpm/Makefile
+++ b/drivers/char/tpm/Makefile
@@ -6,6 +6,7 @@ obj-$(CONFIG_TCG_TPM) += tpm.o
 tpm-y := tpm-interface.o tpm-dev.o tpm-sysfs.o tpm-chip.o tpm2-cmd.o \
 tpm-dev-common.o tpmrm-dev.o tpm1_eventlog.o tpm2_eventlog.o \
  tpm2-space.o tpm-buf.o tpm2-sessions.o
+obj-m +=  tpm2-sessions-test.o
 tpm-$(CONFIG_ACPI) += tpm_ppi.o tpm_eventlog_acpi.o
 tpm-$(CONFIG_EFI) += tpm_eventlog_efi.o
 tpm-$(CONFIG_OF) += tpm_eventlog_of.o
diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
index 0a62c19937b6..ca174ee1e670 100644
--- a/drivers/char/tpm/tpm-chip.c
+++ b/drivers/char/tpm/tpm-chip.c
@@ -118,6 +118,7 @@ struct tpm_chip *tpm_chip_find_get(struct tpm_chip *chip)
 
return res;
 }
+EXPORT_SYMBOL(tpm_chip_find_get);
 
 /**
  * tpm_dev_release() - free chip memory and the device number
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index b1eee56cbbb5..8a652d36939d 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -146,6 +146,7 @@ enum tpm2_command_codes {
TPM2_CC_CONTEXT_LOAD= 0x0161,
TPM2_CC_CONTEXT_SAVE= 0x0162,
TPM2_CC_FLUSH_CONTEXT   = 0x0165,
+   TPM2_CC_POLICY_COMMAND_CODE = 0x16c,
TPM2_CC_READ_PUBLIC = 0x0173,
TPM2_CC_START_AUTH_SESS = 0x0176,
TPM2_CC_GET_CAPABILITY  = 0x017A,
diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index 8b164b7347de..3f47d8b3d361 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -418,6 +418,7 @@ void tpm2_flush_context_cmd(struct tpm_chip *chip, u32 
handle,
 
tpm_buf_destroy(&buf);
 }
+EXPORT_SYMBOL_GPL(tpm2_flush_context_cmd);
 
 /**
  * tpm_buf_append_auth() - append TPMS_AUTH_COMMAND to the buffer.
@@ -448,6 +449,7 @@ void tpm2_buf_append_auth(struct tpm_buf *buf, u32 
session_handle,
if (hmac && hmac_len)
tpm_buf_append(buf, hmac, hmac_len);
 }
+EXPORT_SYMBOL_GPL(tpm2_buf_append_auth);
 
 /**
  * tpm2_seal_trusted() - seal the payload of a trusted key
diff --git a/drivers/char/tpm/tpm2-sessions-test.c 
b/drivers/char/tpm/tpm2-sessions-test.c
new file mode 100644
index ..4559e1a5f4d8
--- /dev/null
+++ b/drivers/char/tpm/tpm2-sessions-test.c
@@ -0,0 +1,359 @@
+/* run a set of tests of the sessions code */
+#include "tpm.h"
+#include "tpm2-sessions.h"
+
+#include 
+
+#include 
+
+/* simple policy: command code must be TPM2_CC_UNSEAL */
+static u8 policy[] = {
+   0xe6, 0x13, 0x13, 0x70, 0x76, 0x52, 0x4b, 0xde,
+   0x48, 0x75, 0x33, 0x86, 0x58, 0x84, 0xe9, 0x73,
+   0x2e, 0xbe, 0xe3, 0xaa, 0xcb, 0x09, 0x5d, 0x94,
+   0xa6, 0xde, 0x49, 0x2e, 0xc0, 0x6c, 0x46, 0xfa,
+};
+
+static u32 get_policy(struct tpm_chip *chip)
+{
+   struct tpm_buf buf;
+   u8 nonce[SHA256_DIGEST_SIZE];
+   u32 h;
+   int rc;
+
+   rc = tpm_buf_init(&buf, TPM2_ST_NO_SESSIONS, TPM2_CC_START_AUTH_SESS);
+   if (rc)
+   return 0;
+
+   /* salt key */
+   tpm_buf_append_u32(&buf, TPM2_RH_NULL);
+   /* bind key */
+   tpm_buf_append_u32(&buf, TPM2_RH_NULL);
+   /* zero nonce */
+   memset(nonce, 0, sizeof(nonce));
+   tpm_buf_append_u16(&buf, sizeof(nonce));
+   tpm_buf_append(&buf, nonce, sizeof(nonce));
+   /* encrypted salt (empty) */
+   tpm_buf_append_u16(&buf, 0);
+   /* session type (HMAC, audit or policy) */
+   tpm_buf_append_u8(&buf, TPM2_SE_POLICY);
+   /* symmetric encryption parameters */
+   /* symmetric algorithm */
+   tpm_buf_append_u16(&buf, TPM2_ALG_NULL);
+
+   /* hash algorithm for session */
+   tpm_buf_append_u16(&buf, TPM2_ALG_SHA256);
+
+   rc = tpm_transmit_cmd(chip, &chip->kernel_space, buf.data, PAGE_SIZE,
+ 0, 0, "start policy session");
+
+   h = get_unaligned_be32(&buf.data[TPM_HEADER_SIZE]);
+
+   tpm_buf_reset_cmd(&buf, TPM2_ST_NO_SESSIONS,
+ TPM2_CC_POLICY_COMMAND_CODE);
+   tpm_buf_append_u32(&buf, h);
+   

[PATCH v3 5/6] trusted keys: Add session encryption protection to the seal/unseal path

2018-03-10 Thread James Bottomley
If some entity is snooping the TPM bus, they can see the data going in
to be sealed and the data coming out as it is unsealed.  Add parameter
and response encryption to these cases to ensure that no secrets are
leaked even if the bus is snooped.

As part of doing this conversion it was discovered that policy
sessions can't work with HMAC protected authority because of missing
pieces (the tpm Nonce).  I've added code to work the same way as
before, which will result in potential authority exposure (while still
adding security for the command and the returned blob), and a fixme to
redo the API to get rid of this security hole.

Signed-off-by: James Bottomley 
---
 drivers/char/tpm/tpm2-cmd.c | 156 
 1 file changed, 98 insertions(+), 58 deletions(-)

diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index 47395c455ae1..8b164b7347de 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -463,8 +463,9 @@ int tpm2_seal_trusted(struct tpm_chip *chip,
  struct trusted_key_options *options)
 {
unsigned int blob_len;
-   struct tpm_buf buf;
+   struct tpm_buf buf, t2b;
u32 hash, rlength;
+   struct tpm2_auth *auth;
int i;
int rc;
 
@@ -478,45 +479,56 @@ int tpm2_seal_trusted(struct tpm_chip *chip,
if (i == ARRAY_SIZE(tpm2_hash_map))
return -EINVAL;
 
-   rc = tpm_buf_init(&buf, TPM2_ST_SESSIONS, TPM2_CC_CREATE);
+   rc = tpm2_start_auth_session(chip, &auth);
if (rc)
return rc;
 
-   tpm_buf_append_u32(&buf, options->keyhandle);
-   tpm2_buf_append_auth(&buf, TPM2_RS_PW,
-NULL /* nonce */, 0,
-0 /* session_attributes */,
-options->keyauth /* hmac */,
-TPM_DIGEST_SIZE);
+   rc = tpm_buf_init(&buf, TPM2_ST_SESSIONS, TPM2_CC_CREATE);
+   if (rc) {
+   tpm2_end_auth_session(auth);
+   return rc;
+   }
 
+   rc = tpm_buf_init_2b(&t2b);
+   if (rc) {
+   tpm_buf_destroy(&buf);
+   tpm2_end_auth_session(auth);
+   return rc;
+   }
+
+   tpm_buf_append_name(&buf, auth, options->keyhandle, NULL);
+   tpm_buf_append_hmac_session(&buf, auth, TPM2_SA_DECRYPT,
+   options->keyauth, TPM_DIGEST_SIZE);
/* sensitive */
-   tpm_buf_append_u16(&buf, 4 + TPM_DIGEST_SIZE + payload->key_len + 1);
+   tpm_buf_append_u16(&t2b, TPM_DIGEST_SIZE);
+   tpm_buf_append(&t2b, options->blobauth, TPM_DIGEST_SIZE);
+   tpm_buf_append_u16(&t2b, payload->key_len + 1);
+   tpm_buf_append(&t2b, payload->key, payload->key_len);
+   tpm_buf_append_u8(&t2b, payload->migratable);
 
-   tpm_buf_append_u16(&buf, TPM_DIGEST_SIZE);
-   tpm_buf_append(&buf, options->blobauth, TPM_DIGEST_SIZE);
-   tpm_buf_append_u16(&buf, payload->key_len + 1);
-   tpm_buf_append(&buf, payload->key, payload->key_len);
-   tpm_buf_append_u8(&buf, payload->migratable);
+   tpm_buf_append_2b(&buf, &t2b);
 
/* public */
-   tpm_buf_append_u16(&buf, 14 + options->policydigest_len);
-   tpm_buf_append_u16(&buf, TPM2_ALG_KEYEDHASH);
-   tpm_buf_append_u16(&buf, hash);
+   tpm_buf_append_u16(&t2b, TPM2_ALG_KEYEDHASH);
+   tpm_buf_append_u16(&t2b, hash);
 
/* policy */
if (options->policydigest_len) {
-   tpm_buf_append_u32(&buf, 0);
-   tpm_buf_append_u16(&buf, options->policydigest_len);
-   tpm_buf_append(&buf, options->policydigest,
+   tpm_buf_append_u32(&t2b, 0);
+   tpm_buf_append_u16(&t2b, options->policydigest_len);
+   tpm_buf_append(&t2b, options->policydigest,
   options->policydigest_len);
} else {
-   tpm_buf_append_u32(&buf, TPM2_OA_USER_WITH_AUTH);
-   tpm_buf_append_u16(&buf, 0);
+   tpm_buf_append_u32(&t2b, TPM2_OA_USER_WITH_AUTH);
+   tpm_buf_append_u16(&t2b, 0);
}
 
/* public parameters */
-   tpm_buf_append_u16(&buf, TPM2_ALG_NULL);
-   tpm_buf_append_u16(&buf, 0);
+   tpm_buf_append_u16(&t2b, TPM2_ALG_NULL);
+   /* unique (zero) */
+   tpm_buf_append_u16(&t2b, 0);
+
+   tpm_buf_append_2b(&buf, &t2b);
 
/* outside info */
tpm_buf_append_u16(&buf, 0);
@@ -529,8 +541,11 @@ int tpm2_seal_trusted(struct tpm_chip *chip,
goto out;
}
 
-   rc = tpm_transmit_cmd(chip, NULL, buf.data, PAGE_SIZE, 4, 0,
- "sealing data");
+   tpm_buf_fill_hmac_session(&buf, auth);
+
+   rc = tpm_transmit_cmd(chip, &chip->kernel_space, buf.data,
+ PAGE_SIZE, 4, 0, "sealing data");
+   rc = tpm_buf_check_hmac_response(&buf, auth, rc);
if (rc)
goto out;
 
@@ -549,6 +564,7 @@ int tpm2_seal_trusted(struct tpm_chip *chip,
payload->blob_len = blob_len;
 
 out:
+   

[PATCH v3 4/6] tpm2: add session encryption protection to tpm2_get_random()

2018-03-10 Thread James Bottomley
If some entity is snooping the TPM bus, they can see the random
numbers we're extracting from the TPM and do prediction attacks
against their consumers.  Foil this attack by using response
encryption to prevent the attacker from seeing the random sequence.

Signed-off-by: James Bottomley 

---

v3: add error handling to sessions and redo to be outside loop
---
 drivers/char/tpm/tpm2-cmd.c | 73 +++--
 1 file changed, 38 insertions(+), 35 deletions(-)

diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index 6ed07ca4a5e8..47395c455ae1 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -38,10 +38,6 @@ struct tpm2_get_tpm_pt_out {
__be32  value;
 } __packed;
 
-struct tpm2_get_random_in {
-   __be16  size;
-} __packed;
-
 struct tpm2_get_random_out {
__be16  size;
u8  buffer[TPM_MAX_RNG_DATA];
@@ -51,8 +47,6 @@ union tpm2_cmd_params {
struct  tpm2_startup_in startup_in;
struct  tpm2_get_tpm_pt_in  get_tpm_pt_in;
struct  tpm2_get_tpm_pt_out get_tpm_pt_out;
-   struct  tpm2_get_random_in  getrandom_in;
-   struct  tpm2_get_random_out getrandom_out;
 };
 
 struct tpm2_cmd {
@@ -304,17 +298,6 @@ int tpm2_pcr_extend(struct tpm_chip *chip, int pcr_idx, 
u32 count,
return rc;
 }
 
-
-#define TPM2_GETRANDOM_IN_SIZE \
-   (sizeof(struct tpm_input_header) + \
-sizeof(struct tpm2_get_random_in))
-
-static const struct tpm_input_header tpm2_getrandom_header = {
-   .tag = cpu_to_be16(TPM2_ST_NO_SESSIONS),
-   .length = cpu_to_be32(TPM2_GETRANDOM_IN_SIZE),
-   .ordinal = cpu_to_be32(TPM2_CC_GET_RANDOM)
-};
-
 /**
  * tpm2_get_random() - get random bytes from the TPM RNG
  *
@@ -327,44 +310,64 @@ static const struct tpm_input_header 
tpm2_getrandom_header = {
  */
 int tpm2_get_random(struct tpm_chip *chip, u8 *out, size_t max)
 {
-   struct tpm2_cmd cmd;
-   u32 recd, rlength;
+   u32 recd;
u32 num_bytes;
int err;
int total = 0;
int retries = 5;
u8 *dest = out;
+   struct tpm_buf buf;
+   struct tpm2_get_random_out *rout;
+   struct tpm2_auth *auth;
 
-   num_bytes = min_t(u32, max, sizeof(cmd.params.getrandom_out.buffer));
+   num_bytes = min_t(u32, max, TPM_MAX_RNG_DATA);
 
-   if (!out || !num_bytes ||
-   max > sizeof(cmd.params.getrandom_out.buffer))
+   if (!out || !num_bytes
+   || max > TPM_MAX_RNG_DATA)
return -EINVAL;
 
-   do {
-   cmd.header.in = tpm2_getrandom_header;
-   cmd.params.getrandom_in.size = cpu_to_be16(num_bytes);
+   err = tpm2_start_auth_session(chip, &auth);
+   if (err)
+   return err;
+
+   err = tpm_buf_init(&buf, TPM2_ST_SESSIONS, TPM2_CC_GET_RANDOM);
+   if (err) {
+   tpm2_end_auth_session(auth);
+   return err;
+   }
 
-   err = tpm_transmit_cmd(chip, NULL, &cmd, sizeof(cmd),
-  offsetof(struct tpm2_get_random_out,
-   buffer),
+   do {
+   tpm_buf_append_hmac_session(&buf, auth, TPM2_SA_ENCRYPT
+   | TPM2_SA_CONTINUE_SESSION,
+   NULL, 0);
+   tpm_buf_append_u16(&buf, num_bytes);
+   tpm_buf_fill_hmac_session(&buf, auth);
+   err = tpm_transmit_cmd(chip, &chip->kernel_space, buf.data,
+  PAGE_SIZE, TPM_HEADER_SIZE + 2,
   0, "attempting get random");
+   err = tpm_buf_check_hmac_response(&buf, auth, err);
if (err)
break;
 
-   recd = min_t(u32, be16_to_cpu(cmd.params.getrandom_out.size),
-num_bytes);
-   rlength = be32_to_cpu(cmd.header.out.length);
-   if (rlength < offsetof(struct tpm2_get_random_out, buffer) +
- recd)
-   return -EFAULT;
-   memcpy(dest, cmd.params.getrandom_out.buffer, recd);
+   rout = (struct tpm2_get_random_out *)&buf.data[TPM_HEADER_SIZE + 4];
+   recd = be16_to_cpu(rout->size);
+   recd = min_t(u32, recd, num_bytes);
+   if (tpm_buf_length(&buf) < TPM_HEADER_SIZE + 4
+   + 2 + recd) {
+   total = -EFAULT;
+   break;
+   }
+   memcpy(dest, rout->buffer, recd);
 
dest += recd;
total += recd;
num_bytes -= recd;
+   tpm_buf_reset_cmd(&buf, TPM2_ST_SESSIONS, TPM2_CC_GET_RANDOM);
} while (retries-- && total < max);
 
+   tpm_buf_destroy(&buf);
+   tpm2_end_auth_session(auth);
+
return total ? total : -EIO;
 }
 
-- 

[PATCH v3 3/6] tpm2: add hmac checks to tpm2_pcr_extend()

2018-03-10 Thread James Bottomley
We use tpm2_pcr_extend() in trusted keys to extend a PCR to prevent a
key from being re-loaded until the next reboot.  To use this
functionality securely, that extend must be protected by a session
hmac.

Signed-off-by: James Bottomley 

---

v3: add error handling to sessions
---
 drivers/char/tpm/tpm2-cmd.c | 33 +
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index c0ebfc4efd4d..6ed07ca4a5e8 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -247,13 +247,6 @@ int tpm2_pcr_read(struct tpm_chip *chip, int pcr_idx, u8 
*res_buf)
return rc;
 }
 
-struct tpm2_null_auth_area {
-   __be32  handle;
-   __be16  nonce_size;
-   u8  attributes;
-   __be16  auth_size;
-} __packed;
-
 /**
  * tpm2_pcr_extend() - extend a PCR value
  *
@@ -268,7 +261,7 @@ int tpm2_pcr_extend(struct tpm_chip *chip, int pcr_idx, u32 
count,
struct tpm2_digest *digests)
 {
struct tpm_buf buf;
-   struct tpm2_null_auth_area auth_area;
+   struct tpm2_auth *auth;
int rc;
int i;
int j;
@@ -276,20 +269,19 @@ int tpm2_pcr_extend(struct tpm_chip *chip, int pcr_idx, 
u32 count,
if (count > ARRAY_SIZE(chip->active_banks))
return -EINVAL;
 
-   rc = tpm_buf_init(&buf, TPM2_ST_SESSIONS, TPM2_CC_PCR_EXTEND);
+   rc = tpm2_start_auth_session(chip, &auth);
if (rc)
return rc;
 
-   tpm_buf_append_u32(&buf, pcr_idx);
+   rc = tpm_buf_init(&buf, TPM2_ST_SESSIONS, TPM2_CC_PCR_EXTEND);
+   if (rc) {
+   tpm2_end_auth_session(auth);
+   return rc;
+   }
 
-   auth_area.handle = cpu_to_be32(TPM2_RS_PW);
-   auth_area.nonce_size = 0;
-   auth_area.attributes = 0;
-   auth_area.auth_size = 0;
+   tpm_buf_append_name(&buf, auth, pcr_idx, NULL);
+   tpm_buf_append_hmac_session(&buf, auth, 0, NULL, 0);
 
-   tpm_buf_append_u32(&buf, sizeof(struct tpm2_null_auth_area));
-   tpm_buf_append(&buf, (const unsigned char *)&auth_area,
-  sizeof(auth_area));
tpm_buf_append_u32(&buf, count);
 
for (i = 0; i < count; i++) {
@@ -302,9 +294,10 @@ int tpm2_pcr_extend(struct tpm_chip *chip, int pcr_idx, 
u32 count,
   hash_digest_size[tpm2_hash_map[j].crypto_id]);
}
}
-
-   rc = tpm_transmit_cmd(chip, NULL, buf.data, PAGE_SIZE, 0, 0,
- "attempting extend a PCR value");
+   tpm_buf_fill_hmac_session(&buf, auth);
+   rc = tpm_transmit_cmd(chip, &chip->kernel_space, buf.data, PAGE_SIZE,
+ 0, 0, "attempting extend a PCR value");
+   rc = tpm_buf_check_hmac_response(&buf, auth, rc);
 
tpm_buf_destroy(&buf);
 
-- 
2.12.3


[PATCH v3 2/6] tpm2-sessions: Add full HMAC and encrypt/decrypt session handling

2018-03-10 Thread James Bottomley
This code adds true session based HMAC authentication plus parameter
decryption and response encryption using AES.

The basic design of this code is to segregate all the nasty crypto,
hash and hmac code into tpm2-sessions.c and export a usable API.

The API first of all starts off by gaining a session with

tpm2_start_auth_session()

Which initiates a session with the TPM and allocates an opaque
tpm2_auth structure to handle the session parameters.  Then the use is
simply:

* tpm_buf_append_name() in place of the tpm_buf_append_u32 for the
  handles

* tpm_buf_append_hmac_session() where tpm2_append_auth() would go

* tpm_buf_fill_hmac_session() called after the entire command buffer
  is finished but before tpm_transmit_cmd() is called which computes
  the correct HMAC and places it in the command at the correct
  location.

Finally, after tpm_transmit_cmd() is called,
tpm_buf_check_hmac_response() is called to check that the returned
HMAC matched and collect the new state for the next use of the
session, if any.

The features of the session are controlled by the session attributes
set in tpm_buf_append_hmac_session().  If TPM2_SA_CONTINUE_SESSION is
not specified, the session will be flushed and the tpm2_auth structure
freed in tpm_buf_check_hmac_response(); otherwise the session may be
used again.  Parameter encryption is specified by or'ing the flag
TPM2_SA_DECRYPT and response encryption by or'ing the flag
TPM2_SA_ENCRYPT.  The various encryptions will be taken care of by
tpm_buf_fill_hmac_session() and tpm_buf_check_hmac_response()
respectively.
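
For reference, a minimal sketch of that call sequence, modelled on the
tpm2_pcr_extend() conversion later in this series (command-specific
parameters are elided; this is an illustration, not a drop-in function):

static int example_authenticated_command(struct tpm_chip *chip, u32 handle)
{
        struct tpm_buf buf;
        struct tpm2_auth *auth;
        int rc;

        rc = tpm2_start_auth_session(chip, &auth);
        if (rc)
                return rc;

        rc = tpm_buf_init(&buf, TPM2_ST_SESSIONS, TPM2_CC_PCR_EXTEND);
        if (rc) {
                tpm2_end_auth_session(auth);
                return rc;
        }

        /* name of the handle instead of a bare tpm_buf_append_u32() */
        tpm_buf_append_name(&buf, auth, handle, NULL);
        /* plain HMAC session: no parameter or response encryption */
        tpm_buf_append_hmac_session(&buf, auth, 0, NULL, 0);

        /* ... append the command-specific parameters here ... */

        tpm_buf_fill_hmac_session(&buf, auth);  /* compute and insert HMAC */
        rc = tpm_transmit_cmd(chip, &chip->kernel_space, buf.data, PAGE_SIZE,
                              0, 0, "example authenticated command");
        rc = tpm_buf_check_hmac_response(&buf, auth, rc);

        tpm_buf_destroy(&buf);
        return rc;
}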

To get all of this to work securely, the Kernel now needs a primary
key to encrypt the session salt to, so we derive an EC key from the
NULL seed and store it in the tpm_chip structure.  We also make sure
that this seed remains for the kernel by using a kernel space to take
it out of the TPM when userspace wants to use it.

Signed-off-by: James Bottomley 

---

v2: Added docbook and improved response check API
v3: Add readpublic, fix hmac length, add API for close on error
allow for the hmac session not being first in the sessions
---
 drivers/char/tpm/Kconfig |3 +
 drivers/char/tpm/Makefile|2 +-
 drivers/char/tpm/tpm.h   |   27 +
 drivers/char/tpm/tpm2-cmd.c  |   34 +-
 drivers/char/tpm/tpm2-sessions.c | 1166 ++
 drivers/char/tpm/tpm2-sessions.h |   57 ++
 6 files changed, 1273 insertions(+), 16 deletions(-)
 create mode 100644 drivers/char/tpm/tpm2-sessions.c
 create mode 100644 drivers/char/tpm/tpm2-sessions.h

diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 0aee88df98d1..8c714d8550c4 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -8,6 +8,9 @@ menuconfig TCG_TPM
select SECURITYFS
select CRYPTO
select CRYPTO_HASH_INFO
+   select CRYPTO_ECDH
+   select CRYPTO_AES
+   select CRYPTO_CFB
---help---
  If you have a TPM security chip in your system, which
  implements the Trusted Computing Group's specification,
diff --git a/drivers/char/tpm/Makefile b/drivers/char/tpm/Makefile
index 41b2482b97c3..b83737ccaa81 100644
--- a/drivers/char/tpm/Makefile
+++ b/drivers/char/tpm/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_TCG_TPM) += tpm.o
 tpm-y := tpm-interface.o tpm-dev.o tpm-sysfs.o tpm-chip.o tpm2-cmd.o \
 tpm-dev-common.o tpmrm-dev.o tpm1_eventlog.o tpm2_eventlog.o \
- tpm2-space.o tpm-buf.o
+ tpm2-space.o tpm-buf.o tpm2-sessions.o
 tpm-$(CONFIG_ACPI) += tpm_ppi.o tpm_eventlog_acpi.o
 tpm-$(CONFIG_EFI) += tpm_eventlog_efi.o
 tpm-$(CONFIG_OF) += tpm_eventlog_of.o
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 2fca263d4ca3..b1eee56cbbb5 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -42,6 +42,9 @@
 #include 
 #endif
 
+/* fixed define for the curve we use which is NIST_P256 */
+#define EC_PT_SZ   32
+
 enum tpm_const {
TPM_MINOR = 224,/* officially assigned */
TPM_BUFSIZE = 4096,
@@ -93,6 +96,7 @@ enum tpm2_const {
 enum tpm2_structures {
TPM2_ST_NO_SESSIONS = 0x8001,
TPM2_ST_SESSIONS= 0x8002,
+   TPM2_ST_CREATION= 0x8021,
 };
 
 /* Indicates from what layer of the software stack the error comes from */
@@ -114,16 +118,25 @@ enum tpm2_return_codes {
 enum tpm2_algorithms {
TPM2_ALG_ERROR  = 0x,
TPM2_ALG_SHA1   = 0x0004,
+   TPM2_ALG_AES= 0x0006,
TPM2_ALG_KEYEDHASH  = 0x0008,
TPM2_ALG_SHA256 = 0x000B,
TPM2_ALG_SHA384 = 0x000C,
TPM2_ALG_SHA512 = 0x000D,
TPM2_ALG_NULL   = 0x0010,
TPM2_ALG_SM3_256= 0x0012,
+   TPM2_ALG_ECC= 0x0023,
+   TPM2_ALG_CFB= 0x0043,
+};
+
+enum tpm2_curves {
+   TPM2_ECC_NONE   = 0x,
+   TPM2_ECC_NIST_P256

[PATCH v3 1/6] tpm-buf: create new functions for handling TPM buffers

2018-03-10 Thread James Bottomley
This separates the old tpm_buf_... handling functions out of the static
inlines in tpm.h and moves them into their own tpm-buf.c file.  It also
adds handling for tpm2b structures and incremental pointer-advancing
parsers.
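
As an illustration of the resulting API, a build-only sketch (the ordinal
and appended fields are placeholders and do not form a valid TPM command
body; the point is only the tpm_buf/tpm2b handling, and t2b teardown is
elided):

static int example_build(struct tpm_chip *chip, const u8 *secret, u16 len)
{
        struct tpm_buf buf, t2b;
        int rc;

        rc = tpm_buf_init(&buf, TPM2_ST_NO_SESSIONS, TPM2_CC_GET_CAPABILITY);
        if (rc)
                return rc;
        rc = tpm_buf_init_2b(&t2b);
        if (rc) {
                tpm_buf_destroy(&buf);
                return rc;
        }

        /* simple big-endian appends go straight into the command body */
        tpm_buf_append_u32(&buf, 0);

        /* build a TPM2B parameter in its own buffer ... */
        tpm_buf_append_u16(&t2b, len);
        tpm_buf_append(&t2b, secret, len);
        /* ... then splice it into the command with its size prefix */
        tpm_buf_append_2b(&buf, &t2b);

        rc = tpm_transmit_cmd(chip, NULL, buf.data, PAGE_SIZE, 0, 0,
                              "example command");
        tpm_buf_destroy(&buf);
        return rc;
}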

Signed-off-by: James Bottomley 

---

v2: added this patch to separate out the API changes
v3: added tpm_buf_reset_cmd()
---
 drivers/char/tpm/Makefile  |   2 +-
 drivers/char/tpm/tpm-buf.c | 191 +
 drivers/char/tpm/tpm.h |  95 --
 3 files changed, 208 insertions(+), 80 deletions(-)
 create mode 100644 drivers/char/tpm/tpm-buf.c

diff --git a/drivers/char/tpm/Makefile b/drivers/char/tpm/Makefile
index d37c4a1748f5..41b2482b97c3 100644
--- a/drivers/char/tpm/Makefile
+++ b/drivers/char/tpm/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_TCG_TPM) += tpm.o
 tpm-y := tpm-interface.o tpm-dev.o tpm-sysfs.o tpm-chip.o tpm2-cmd.o \
 tpm-dev-common.o tpmrm-dev.o tpm1_eventlog.o tpm2_eventlog.o \
- tpm2-space.o
+ tpm2-space.o tpm-buf.o
 tpm-$(CONFIG_ACPI) += tpm_ppi.o tpm_eventlog_acpi.o
 tpm-$(CONFIG_EFI) += tpm_eventlog_efi.o
 tpm-$(CONFIG_OF) += tpm_eventlog_of.o
diff --git a/drivers/char/tpm/tpm-buf.c b/drivers/char/tpm/tpm-buf.c
new file mode 100644
index ..146a71cec067
--- /dev/null
+++ b/drivers/char/tpm/tpm-buf.c
@@ -0,0 +1,191 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Handling for tpm2b structures to facilitate the building of commands
+ */
+
+#include "tpm.h"
+
+#include 
+
+#include 
+
+static int __tpm_buf_init(struct tpm_buf *buf)
+{
+   buf->data_page = alloc_page(GFP_HIGHUSER);
+   if (!buf->data_page)
+   return -ENOMEM;
+
+   buf->flags = 0;
+   buf->data = kmap(buf->data_page);
+
+   return 0;
+}
+
+void tpm_buf_reset_cmd(struct tpm_buf *buf, u16 tag, u32 ordinal)
+{
+   struct tpm_input_header *head;
+
+   head = (struct tpm_input_header *) buf->data;
+
+   head->tag = cpu_to_be16(tag);
+   head->length = cpu_to_be32(sizeof(*head));
+   head->ordinal = cpu_to_be32(ordinal);
+}
+EXPORT_SYMBOL_GPL(tpm_buf_reset_cmd);
+
+int tpm_buf_init(struct tpm_buf *buf, u16 tag, u32 ordinal)
+{
+   int rc;
+
+   rc = __tpm_buf_init(buf);
+   if (rc)
+   return rc;
+
+   tpm_buf_reset_cmd(buf, tag, ordinal);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(tpm_buf_init);
+
+int tpm_buf_init_2b(struct tpm_buf *buf)
+{
+   struct tpm_input_header *head;
+   int rc;
+
+   rc = __tpm_buf_init(buf);
+   if (rc)
+   return rc;
+
+   head = (struct tpm_input_header *) buf->data;
+
+   head->length = cpu_to_be32(sizeof(*head));
+
+   buf->flags = TPM_BUF_2B;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(tpm_buf_init_2b);
+
+void tpm_buf_destroy(struct tpm_buf *buf)
+{
+   kunmap(buf->data_page);
+   __free_page(buf->data_page);
+}
+EXPORT_SYMBOL_GPL(tpm_buf_destroy);
+
+static void *tpm_buf_data(struct tpm_buf *buf)
+{
+   if (buf->flags & TPM_BUF_2B)
+   return buf->data + TPM_HEADER_SIZE;
+   return buf->data;
+}
+
+u32 tpm_buf_length(struct tpm_buf *buf)
+{
+   struct tpm_input_header *head = (struct tpm_input_header *)buf->data;
+   u32 len;
+
+   len = be32_to_cpu(head->length);
+   if (buf->flags & TPM_BUF_2B)
+   len -= sizeof(*head);
+   return len;
+}
+EXPORT_SYMBOL_GPL(tpm_buf_length);
+
+u16 tpm_buf_tag(struct tpm_buf *buf)
+{
+   struct tpm_input_header *head = (struct tpm_input_header *)buf->data;
+
+   return be16_to_cpu(head->tag);
+}
+EXPORT_SYMBOL_GPL(tpm_buf_tag);
+
+void tpm_buf_append(struct tpm_buf *buf,
+   const unsigned char *new_data,
+   unsigned int new_len)
+{
+   struct tpm_input_header *head = (struct tpm_input_header *) buf->data;
+   u32 len = be32_to_cpu(head->length);
+
+   /* Return silently if overflow has already happened. */
+   if (buf->flags & TPM_BUF_OVERFLOW)
+   return;
+
+   if ((len + new_len) > PAGE_SIZE) {
+   WARN(1, "tpm_buf: overflow\n");
+   buf->flags |= TPM_BUF_OVERFLOW;
+   return;
+   }
+
+   memcpy(&buf->data[len], new_data, new_len);
+   head->length = cpu_to_be32(len + new_len);
+}
+EXPORT_SYMBOL_GPL(tpm_buf_append);
+
+void tpm_buf_append_u8(struct tpm_buf *buf, const u8 value)
+{
+   tpm_buf_append(buf, &value, 1);
+}
+EXPORT_SYMBOL_GPL(tpm_buf_append_u8);
+
+void tpm_buf_append_u16(struct tpm_buf *buf, const u16 value)
+{
+   __be16 value2 = cpu_to_be16(value);
+
+   tpm_buf_append(buf, (u8 *) &value2, 2);
+}
+EXPORT_SYMBOL_GPL(tpm_buf_append_u16);
+
+void tpm_buf_append_u32(struct tpm_buf *buf, const u32 value)
+{
+   __be32 value2 = cpu_to_be32(value);
+
+   tpm_buf_append(buf, (u8 *) &value2, 4);
+}
+EXPORT_SYMBOL_GPL(tpm_buf_append_u32);
+
+static void tpm_buf_reset(struct tpm_buf *buf)
+{
+   struct tpm_input_header 

[PATCH v3 0/6] add integrity and security to TPM2 transactions

2018-03-10 Thread James Bottomley
By now, everybody knows we have a problem with the TPM2_RS_PW easy
button on TPM2 in that transactions on the TPM bus can be intercepted
and altered.  The way to fix this is to use real sessions for HMAC
capabilities to ensure integrity and to use parameter and response
encryption to ensure confidentiality of the data flowing over the TPM
bus.

This patch series is about adding a simple API which can ensure the
above properties as a layered addition to the existing TPM handling
code.  This series now includes protections for PCR extend, getting
random numbers from the TPM and data sealing and unsealing.  It
therefore eliminates all uses of TPM2_RS_PW in the kernel and adds
encryption protection to sensitive data flowing into and out of the
TPM.

This series is also dependent on additions to the crypto subsystem to
fix problems in the elliptic curve key handling and add the Cipher
FeedBack encryption scheme:

https://marc.info/?l=linux-crypto-vger=151994371015475

In the third version I've added data sealing and unsealing protection,
apart from one API-based problem: because of the way trusted keys were
protected, it's not currently possible to HMAC protect an authority that
comes with a policy, so the API will have to be extended to fix that
case.

I've verified this using the test suite in the last patch on a VM
connected to a tpm2 emulator.  I also instrumented the emulator to make
sure the sensitive data was properly encrypted.

James

---

James Bottomley (6):
  tpm-buf: create new functions for handling TPM buffers
  tpm2-sessions: Add full HMAC and encrypt/decrypt session handling
  tpm2: add hmac checks to tpm2_pcr_extend()
  tpm2: add session encryption protection to tpm2_get_random()
  trusted keys: Add session encryption protection to the seal/unseal path
  tpm2-sessions: NOT FOR COMMITTING add sessions testing

 drivers/char/tpm/Kconfig  |3 +
 drivers/char/tpm/Makefile |3 +-
 drivers/char/tpm/tpm-buf.c|  191 ++
 drivers/char/tpm/tpm-chip.c   |1 +
 drivers/char/tpm/tpm.h|  123 ++--
 drivers/char/tpm/tpm2-cmd.c   |  298 +
 drivers/char/tpm/tpm2-sessions-test.c |  359 ++
 drivers/char/tpm/tpm2-sessions.c  | 1166 +
 drivers/char/tpm/tpm2-sessions.h  |   57 ++
 9 files changed, 1993 insertions(+), 208 deletions(-)
 create mode 100644 drivers/char/tpm/tpm-buf.c
 create mode 100644 drivers/char/tpm/tpm2-sessions-test.c
 create mode 100644 drivers/char/tpm/tpm2-sessions.c
 create mode 100644 drivers/char/tpm/tpm2-sessions.h

-- 
2.12.3


Re: [RFC 0/5] add integrity and security to TPM2 transactions

2018-03-10 Thread James Bottomley
On Sat, 2018-03-10 at 14:49 +0200, Jarkko Sakkinen wrote:
> On Wed, 2018-03-07 at 15:29 -0800, James Bottomley wrote:
> > 
> > By now, everybody knows we have a problem with the TPM2_RS_PW easy
> > button on TPM2 in that transactions on the TPM bus can be
> > intercepted
> > and altered.  The way to fix this is to use real sessions for HMAC
> > capabilities to ensure integrity and to use parameter and response
> > encryption to ensure confidentiality of the data flowing over the
> > TPM
> > bus.
> > 
> > This RFC is about adding a simple API which can ensure the above
> > properties as a layered addition to the existing TPM handling code.
> >  Eventually we can add this to the random number generator, the PCR
> > extensions and the trusted key handling, but this all depends on
> > the
> > conversion to tpm_buf which is not yet upstream, so I've
> > constructed a
> > second patch which demonstrates the new API in a test module for
> > those
> > who wish to play with it.
> > 
> > This series is also dependent on additions to the crypto subsystem
> > to
> > fix problems in the elliptic curve key handling and add the Cipher
> > FeedBack encryption scheme:
> > 
> > https://marc.info/?l=linux-crypto-vger=151994371015475
> > 
> > In the second version, I added security HMAC to our PCR extend and
> > encryption to the returned random number generators and also
> > extracted
> > the parsing and tpm2b construction API into a new file.
> > 
> > James
> 
> Might take up until end of next week before I have time to try this
> out. Anyway, I'll see if I get this running on my systems before looking
> at the code that much.

OK, you might want to wait for v3 then.  I've got it working with
sealed (trusted) keys, well except for a problem with the trusted keys
API that means we can't protect the password for policy based keys.  I
think the API is finally complete, so I'll send v3 as a PATCH not an
RFC.

The point of the last patch is to show the test rig for this I'm
running in a VM using an instrumented tpm2 emulator to prove we're
getting all the correct data in and out (and that the encryption and
hmac are working); more physical TPM testing would be useful ..

Thanks,

James



[PATCH v5 08/23] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path

2018-03-10 Thread Ard Biesheuvel
CBC MAC is strictly sequential, and so the current AES code simply
processes the input one block at a time. However, we are about to add
yield support, which adds a bit of overhead, and which we prefer to
align with other modes in terms of granularity (i.e., it is better to
have all routines yield every 64 bytes and not have an exception for
CBC MAC, which yields every 16 bytes).

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.
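
In C-like pseudocode the unrolled loop is roughly the following
(aes_encrypt_block() stands in for the existing AES rounds and is not a
real kernel symbol; includes are elided, this is a sketch only):

static void cbc_mac_4x(const u32 *rk, int rounds, u8 mac[AES_BLOCK_SIZE],
                       const u8 *src, int blocks)
{
        u8 in[4 * AES_BLOCK_SIZE];
        int i;

        while (blocks >= 4) {
                memcpy(in, src, sizeof(in));    /* one merged 64-byte load */
                src += sizeof(in);

                for (i = 0; i < 4; i++) {       /* AES itself stays serial */
                        crypto_xor(mac, in + i * AES_BLOCK_SIZE,
                                   AES_BLOCK_SIZE);
                        aes_encrypt_block(rk, rounds, mac);
                }
                blocks -= 4;
        }
        /* any remaining blocks are handled one at a time, as before */
}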

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-modes.S | 23 ++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index e86535a1329d..a68412e1e3a4 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -395,8 +395,28 @@ AES_ENDPROC(aes_xts_decrypt)
 AES_ENTRY(aes_mac_update)
ld1 {v0.16b}, [x4]  /* get dg */
enc_prepare w2, x1, x7
-   cbnzw5, .Lmacenc
+   cbz w5, .Lmacloop4x
 
+   encrypt_block   v0, w2, x1, x7, w8
+
+.Lmacloop4x:
+   subsw3, w3, #4
+   bmi .Lmac1x
+   ld1 {v1.16b-v4.16b}, [x0], #64  /* get next pt block */
+   eor v0.16b, v0.16b, v1.16b  /* ..and xor with dg */
+   encrypt_block   v0, w2, x1, x7, w8
+   eor v0.16b, v0.16b, v2.16b
+   encrypt_block   v0, w2, x1, x7, w8
+   eor v0.16b, v0.16b, v3.16b
+   encrypt_block   v0, w2, x1, x7, w8
+   eor v0.16b, v0.16b, v4.16b
+   cmp w3, wzr
+   csinv   x5, x6, xzr, eq
+   cbz w5, .Lmacout
+   encrypt_block   v0, w2, x1, x7, w8
+   b   .Lmacloop4x
+.Lmac1x:
+   add w3, w3, #4
 .Lmacloop:
cbz w3, .Lmacout
ld1 {v1.16b}, [x0], #16 /* get next pt block */
@@ -406,7 +426,6 @@ AES_ENTRY(aes_mac_update)
csinv   x5, x6, xzr, eq
cbz w5, .Lmacout
 
-.Lmacenc:
encrypt_block   v0, w2, x1, x7, w8
b   .Lmacloop
 
-- 
2.15.1



[PATCH v5 23/23] DO NOT MERGE

2018-03-10 Thread Ard Biesheuvel
Test code to force a kernel_neon_end+begin sequence at every yield point,
and wipe the entire NEON state before resuming the algorithm.
---
 arch/arm64/include/asm/assembler.h | 33 
 1 file changed, 33 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 61168cbe9781..b471b0bbdfe6 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -678,6 +678,7 @@ alternative_else_nop_endif
cmp w1, #PREEMPT_DISABLE_OFFSET
cselx0, x0, xzr, eq
tbnzx0, #TIF_NEED_RESCHED, .Lyield_\@   // needs 
rescheduling?
+   b   .Lyield_\@
 #endif
/* fall through to endif_yield_neon */
.subsection 1
@@ -687,6 +688,38 @@ alternative_else_nop_endif
.macro  do_cond_yield_neon
bl  kernel_neon_end
bl  kernel_neon_begin
+   moviv0.16b, #0x55
+   moviv1.16b, #0x55
+   moviv2.16b, #0x55
+   moviv3.16b, #0x55
+   moviv4.16b, #0x55
+   moviv5.16b, #0x55
+   moviv6.16b, #0x55
+   moviv7.16b, #0x55
+   moviv8.16b, #0x55
+   moviv9.16b, #0x55
+   moviv10.16b, #0x55
+   moviv11.16b, #0x55
+   moviv12.16b, #0x55
+   moviv13.16b, #0x55
+   moviv14.16b, #0x55
+   moviv15.16b, #0x55
+   moviv16.16b, #0x55
+   moviv17.16b, #0x55
+   moviv18.16b, #0x55
+   moviv19.16b, #0x55
+   moviv20.16b, #0x55
+   moviv21.16b, #0x55
+   moviv22.16b, #0x55
+   moviv23.16b, #0x55
+   moviv24.16b, #0x55
+   moviv25.16b, #0x55
+   moviv26.16b, #0x55
+   moviv27.16b, #0x55
+   moviv28.16b, #0x55
+   moviv29.16b, #0x55
+   moviv30.16b, #0x55
+   moviv31.16b, #0x55
.endm
 
.macro  endif_yield_neon, lbl
-- 
2.15.1



[PATCH v5 11/23] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT

2018-03-10 Thread Ard Biesheuvel
Add support macros to conditionally yield the NEON (and thus the CPU)
that may be called from the assembler code.

In some cases, yielding the NEON involves saving and restoring a non
trivial amount of context (especially in the CRC folding algorithms),
and so the macro is split into three, and the code in between is only
executed when the yield path is taken, allowing the context to be preserved.
The third macro takes an optional label argument that marks the resume
path after a yield has been performed.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/assembler.h | 64 
 arch/arm64/kernel/asm-offsets.c|  2 +
 2 files changed, 66 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index eef1fd2c1c0b..61168cbe9781 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -635,4 +635,68 @@ alternative_else_nop_endif
.endif
.endm
 
+/*
+ * Check whether to yield to another runnable task from kernel mode NEON code
+ * (which runs with preemption disabled).
+ *
+ * if_will_cond_yield_neon
+ *// pre-yield patchup code
+ * do_cond_yield_neon
+ *// post-yield patchup code
+ * endif_yield_neon
+ *
+ * where <lbl> is optional, and marks the point where execution will resume
+ * after a yield has been performed. If omitted, execution resumes right after
+ * the endif_yield_neon invocation.
+ *
+ * Note that the patchup code does not support assembler directives that change
+ * the output section, any use of such directives is undefined.
+ *
+ * The yield itself consists of the following:
+ * - Check whether the preempt count is exactly 1, in which case disabling
+ *   preemption once will make the task preemptible. If this is not the case,
+ *   yielding is pointless.
+ * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
+ *   kernel mode NEON (which will trigger a reschedule), and branch to the
+ *   yield fixup code.
+ *
+ * This macro sequence clobbers x0, x1 and the flags register unconditionally,
+ * and may clobber x2 .. x18 if the yield path is taken.
+ */
+
+   .macro  cond_yield_neon, lbl
+   if_will_cond_yield_neon
+   do_cond_yield_neon
+   endif_yield_neon\lbl
+   .endm
+
+   .macro  if_will_cond_yield_neon
+#ifdef CONFIG_PREEMPT
+   get_thread_info x0
+   ldr w1, [x0, #TSK_TI_PREEMPT]
+   ldr x0, [x0, #TSK_TI_FLAGS]
+   cmp w1, #PREEMPT_DISABLE_OFFSET
+   cselx0, x0, xzr, eq
+   tbnzx0, #TIF_NEED_RESCHED, .Lyield_\@   // needs 
rescheduling?
+#endif
+   /* fall through to endif_yield_neon */
+   .subsection 1
+.Lyield_\@ :
+   .endm
+
+   .macro  do_cond_yield_neon
+   bl  kernel_neon_end
+   bl  kernel_neon_begin
+   .endm
+
+   .macro  endif_yield_neon, lbl
+   .ifnb   \lbl
+   b   \lbl
+   .else
+   b   .Lyield_out_\@
+   .endif
+   .previous
+.Lyield_out_\@ :
+   .endm
+
 #endif /* __ASM_ASSEMBLER_H */
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 1303e04110cd..1e2ea2e51acb 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -93,6 +93,8 @@ int main(void)
   DEFINE(DMA_TO_DEVICE,DMA_TO_DEVICE);
   DEFINE(DMA_FROM_DEVICE,  DMA_FROM_DEVICE);
   BLANK();
+  DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
+  BLANK();
   DEFINE(CLOCK_REALTIME,   CLOCK_REALTIME);
   DEFINE(CLOCK_MONOTONIC,  CLOCK_MONOTONIC);
   DEFINE(CLOCK_MONOTONIC_RAW,  CLOCK_MONOTONIC_RAW);
-- 
2.15.1



[PATCH v5 14/23] crypto: arm64/aes-ccm - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 150 +---
 1 file changed, 95 insertions(+), 55 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S 
b/arch/arm64/crypto/aes-ce-ccm-core.S
index e3a375c4cb83..88f5aef7934c 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -19,24 +19,33 @@
 *   u32 *macp, u8 const rk[], u32 rounds);
 */
 ENTRY(ce_aes_ccm_auth_data)
-   ldr w8, [x3]/* leftover from prev round? */
+   frame_push  7
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+
+   ldr w25, [x22]  /* leftover from prev round? */
ld1 {v0.16b}, [x0]  /* load mac */
-   cbz w8, 1f
-   sub w8, w8, #16
+   cbz w25, 1f
+   sub w25, w25, #16
eor v1.16b, v1.16b, v1.16b
-0: ldrbw7, [x1], #1/* get 1 byte of input */
-   subsw2, w2, #1
-   add w8, w8, #1
+0: ldrbw7, [x20], #1   /* get 1 byte of input */
+   subsw21, w21, #1
+   add w25, w25, #1
ins v1.b[0], w7
ext v1.16b, v1.16b, v1.16b, #1  /* rotate in the input bytes */
beq 8f  /* out of input? */
-   cbnzw8, 0b
+   cbnzw25, 0b
eor v0.16b, v0.16b, v1.16b
-1: ld1 {v3.4s}, [x4]   /* load first round key */
-   prfmpldl1strm, [x1]
-   cmp w5, #12 /* which key size? */
-   add x6, x4, #16
-   sub w7, w5, #2  /* modified # of rounds */
+1: ld1 {v3.4s}, [x23]  /* load first round key */
+   prfmpldl1strm, [x20]
+   cmp w24, #12/* which key size? */
+   add x6, x23, #16
+   sub w7, w24, #2 /* modified # of rounds */
bmi 2f
bne 5f
mov v5.16b, v3.16b
@@ -55,33 +64,43 @@ ENTRY(ce_aes_ccm_auth_data)
ld1 {v5.4s}, [x6], #16  /* load next round key */
bpl 3b
aesev0.16b, v4.16b
-   subsw2, w2, #16 /* last data? */
+   subsw21, w21, #16   /* last data? */
eor v0.16b, v0.16b, v5.16b  /* final round */
bmi 6f
-   ld1 {v1.16b}, [x1], #16 /* load next input block */
+   ld1 {v1.16b}, [x20], #16/* load next input block */
eor v0.16b, v0.16b, v1.16b  /* xor with mac */
-   bne 1b
-6: st1 {v0.16b}, [x0]  /* store mac */
+   beq 6f
+
+   if_will_cond_yield_neon
+   st1 {v0.16b}, [x19] /* store mac */
+   do_cond_yield_neon
+   ld1 {v0.16b}, [x19] /* reload mac */
+   endif_yield_neon
+
+   b   1b
+6: st1 {v0.16b}, [x19] /* store mac */
beq 10f
-   addsw2, w2, #16
+   addsw21, w21, #16
beq 10f
-   mov w8, w2
-7: ldrbw7, [x1], #1
+   mov w25, w21
+7: ldrbw7, [x20], #1
umovw6, v0.b[0]
eor w6, w6, w7
-   strbw6, [x0], #1
-   subsw2, w2, #1
+   strbw6, [x19], #1
+   subsw21, w21, #1
beq 10f
ext v0.16b, v0.16b, v0.16b, #1  /* rotate out the mac bytes */
b   7b
-8: mov w7, w8
-   add w8, w8, #16
+8: mov w7, w25
+   add w25, w25, #16
 9: ext v1.16b, v1.16b, v1.16b, #1
addsw7, w7, #1
bne 9b
eor v0.16b, v0.16b, v1.16b
-   st1 {v0.16b}, [x0]
-10:str w8, [x3]
+   st1 {v0.16b}, [x19]
+10:str w25, [x22]
+
+   frame_pop
ret
 ENDPROC(ce_aes_ccm_auth_data)
 
@@ -126,19 +145,29 @@ ENTRY(ce_aes_ccm_final)
 ENDPROC(ce_aes_ccm_final)
 
.macro  aes_ccm_do_crypt,enc
-   ldr x8, [x6, #8]/* load lower ctr */
-   ld1 {v0.16b}, [x5]  /* load mac */
-CPU_LE(rev x8, x8  )   /* keep swabbed ctr in 
reg */
+   frame_push  8
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+   mov x25, x6
+
+   ldr x26, [x25, #8]  /* load lower ctr */
+   ld1 {v0.16b}, [x24] /* load mac */
+CPU_LE(rev x26, x26)   /* keep swabbed ctr 

[PATCH v5 20/23] crypto: arm64/sha3-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha3-ce-core.S | 77 +---
 1 file changed, 50 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/crypto/sha3-ce-core.S b/arch/arm64/crypto/sha3-ce-core.S
index 332ad7530690..a7d587fa54f6 100644
--- a/arch/arm64/crypto/sha3-ce-core.S
+++ b/arch/arm64/crypto/sha3-ce-core.S
@@ -41,9 +41,16 @@
 */
.text
 ENTRY(sha3_ce_transform)
-   /* load state */
-   add x8, x0, #32
-   ld1 { v0.1d- v3.1d}, [x0]
+   frame_push  4
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+
+0: /* load state */
+   add x8, x19, #32
+   ld1 { v0.1d- v3.1d}, [x19]
ld1 { v4.1d- v7.1d}, [x8], #32
ld1 { v8.1d-v11.1d}, [x8], #32
ld1 {v12.1d-v15.1d}, [x8], #32
@@ -51,13 +58,13 @@ ENTRY(sha3_ce_transform)
ld1 {v20.1d-v23.1d}, [x8], #32
ld1 {v24.1d}, [x8]
 
-0: sub w2, w2, #1
+1: sub w21, w21, #1
mov w8, #24
adr_l   x9, .Lsha3_rcon
 
/* load input */
-   ld1 {v25.8b-v28.8b}, [x1], #32
-   ld1 {v29.8b-v31.8b}, [x1], #24
+   ld1 {v25.8b-v28.8b}, [x20], #32
+   ld1 {v29.8b-v31.8b}, [x20], #24
eor v0.8b, v0.8b, v25.8b
eor v1.8b, v1.8b, v26.8b
eor v2.8b, v2.8b, v27.8b
@@ -66,10 +73,10 @@ ENTRY(sha3_ce_transform)
eor v5.8b, v5.8b, v30.8b
eor v6.8b, v6.8b, v31.8b
 
-   tbnzx3, #6, 2f  // SHA3-512
+   tbnzx22, #6, 3f // SHA3-512
 
-   ld1 {v25.8b-v28.8b}, [x1], #32
-   ld1 {v29.8b-v30.8b}, [x1], #16
+   ld1 {v25.8b-v28.8b}, [x20], #32
+   ld1 {v29.8b-v30.8b}, [x20], #16
eor  v7.8b,  v7.8b, v25.8b
eor  v8.8b,  v8.8b, v26.8b
eor  v9.8b,  v9.8b, v27.8b
@@ -77,34 +84,34 @@ ENTRY(sha3_ce_transform)
eor v11.8b, v11.8b, v29.8b
eor v12.8b, v12.8b, v30.8b
 
-   tbnzx3, #4, 1f  // SHA3-384 or SHA3-224
+   tbnzx22, #4, 2f // SHA3-384 or SHA3-224
 
// SHA3-256
-   ld1 {v25.8b-v28.8b}, [x1], #32
+   ld1 {v25.8b-v28.8b}, [x20], #32
eor v13.8b, v13.8b, v25.8b
eor v14.8b, v14.8b, v26.8b
eor v15.8b, v15.8b, v27.8b
eor v16.8b, v16.8b, v28.8b
-   b   3f
+   b   4f
 
-1: tbz x3, #2, 3f  // bit 2 cleared? SHA-384
+2: tbz x22, #2, 4f // bit 2 cleared? SHA-384
 
// SHA3-224
-   ld1 {v25.8b-v28.8b}, [x1], #32
-   ld1 {v29.8b}, [x1], #8
+   ld1 {v25.8b-v28.8b}, [x20], #32
+   ld1 {v29.8b}, [x20], #8
eor v13.8b, v13.8b, v25.8b
eor v14.8b, v14.8b, v26.8b
eor v15.8b, v15.8b, v27.8b
eor v16.8b, v16.8b, v28.8b
eor v17.8b, v17.8b, v29.8b
-   b   3f
+   b   4f
 
// SHA3-512
-2: ld1 {v25.8b-v26.8b}, [x1], #16
+3: ld1 {v25.8b-v26.8b}, [x20], #16
eor  v7.8b,  v7.8b, v25.8b
eor  v8.8b,  v8.8b, v26.8b
 
-3: sub w8, w8, #1
+4: sub w8, w8, #1
 
eor3v29.16b,  v4.16b,  v9.16b, v14.16b
eor3v26.16b,  v1.16b,  v6.16b, v11.16b
@@ -183,17 +190,33 @@ ENTRY(sha3_ce_transform)
 
eor  v0.16b,  v0.16b, v31.16b
 
-   cbnzw8, 3b
-   cbnzw2, 0b
+   cbnzw8, 4b
+   cbz w21, 5f
+
+   if_will_cond_yield_neon
+   add x8, x19, #32
+   st1 { v0.1d- v3.1d}, [x19]
+   st1 { v4.1d- v7.1d}, [x8], #32
+   st1 { v8.1d-v11.1d}, [x8], #32
+   st1 {v12.1d-v15.1d}, [x8], #32
+   st1 {v16.1d-v19.1d}, [x8], #32
+   st1 {v20.1d-v23.1d}, [x8], #32
+   st1 {v24.1d}, [x8]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/* save state */
-   st1 { v0.1d- v3.1d}, [x0], #32
-   st1 { v4.1d- v7.1d}, [x0], #32
-   st1 { v8.1d-v11.1d}, [x0], #32
-   st1 {v12.1d-v15.1d}, [x0], #32
-   st1 {v16.1d-v19.1d}, [x0], #32
-   st1 {v20.1d-v23.1d}, [x0], #32
-   st1 {v24.1d}, [x0]
+5: st1 { v0.1d- v3.1d}, [x19], #32
+   st1 { v4.1d- v7.1d}, [x19], #32
+   st1 { v8.1d-v11.1d}, [x19], #32
+   st1 {v12.1d-v15.1d}, [x19], #32
+   st1 {v16.1d-v19.1d}, [x19], #32
+   st1 {v20.1d-v23.1d}, [x19], #32
+   st1 {v24.1d}, [x19]
+   frame_pop
ret
 ENDPROC(sha3_ce_transform)
 
-- 
2.15.1



[PATCH v5 22/23] crypto: arm64/sm3-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sm3-ce-core.S | 30 +++-
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/crypto/sm3-ce-core.S b/arch/arm64/crypto/sm3-ce-core.S
index 27169fe07a68..5a116c8d0cee 100644
--- a/arch/arm64/crypto/sm3-ce-core.S
+++ b/arch/arm64/crypto/sm3-ce-core.S
@@ -77,19 +77,25 @@
 */
.text
 ENTRY(sm3_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load state */
-   ld1 {v8.4s-v9.4s}, [x0]
+   ld1 {v8.4s-v9.4s}, [x19]
rev64   v8.4s, v8.4s
rev64   v9.4s, v9.4s
ext v8.16b, v8.16b, v8.16b, #8
ext v9.16b, v9.16b, v9.16b, #8
 
-   adr_l   x8, .Lt
+0: adr_l   x8, .Lt
ldp s13, s14, [x8]
 
/* load input */
-0: ld1 {v0.16b-v3.16b}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v0.16b-v3.16b}, [x20], #64
+   sub w21, w21, #1
 
mov v15.16b, v8.16b
mov v16.16b, v9.16b
@@ -125,14 +131,24 @@ CPU_LE(   rev32   v3.16b, v3.16b  )
eor v9.16b, v9.16b, v16.16b
 
/* handled all input blocks? */
-   cbnzw2, 0b
+   cbz w21, 2f
+
+   if_will_cond_yield_neon
+   st1 {v8.4s-v9.4s}, [x19]
+   do_cond_yield_neon
+   ld1 {v8.4s-v9.4s}, [x19]
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/* save state */
-   rev64   v8.4s, v8.4s
+2: rev64   v8.4s, v8.4s
rev64   v9.4s, v9.4s
ext v8.16b, v8.16b, v8.16b, #8
ext v9.16b, v9.16b, v9.16b, #8
-   st1 {v8.4s-v9.4s}, [x0]
+   st1 {v8.4s-v9.4s}, [x19]
+   frame_pop
ret
 ENDPROC(sm3_ce_transform)
 
-- 
2.15.1



[PATCH v5 13/23] crypto: arm64/sha2-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha2-ce-core.S | 37 ++--
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 4c3c89b812ce..cd8b36412469 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -79,30 +79,36 @@
 */
.text
 ENTRY(sha2_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load round constants */
-   adr_l   x8, .Lsha2_rcon
+0: adr_l   x8, .Lsha2_rcon
ld1 { v0.4s- v3.4s}, [x8], #64
ld1 { v4.4s- v7.4s}, [x8], #64
ld1 { v8.4s-v11.4s}, [x8], #64
ld1 {v12.4s-v15.4s}, [x8]
 
/* load state */
-   ld1 {dgav.4s, dgbv.4s}, [x0]
+   ld1 {dgav.4s, dgbv.4s}, [x19]
 
/* load sha256_ce_state::finalize */
ldr_l   w4, sha256_ce_offsetof_finalize, x4
-   ldr w4, [x0, x4]
+   ldr w4, [x19, x4]
 
/* load input */
-0: ld1 {v16.4s-v19.4s}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v16.4s-v19.4s}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev32   v16.16b, v16.16b)
 CPU_LE(rev32   v17.16b, v17.16b)
 CPU_LE(rev32   v18.16b, v18.16b)
 CPU_LE(rev32   v19.16b, v19.16b)
 
-1: add t0.4s, v16.4s, v0.4s
+2: add t0.4s, v16.4s, v0.4s
mov dg0v.16b, dgav.16b
mov dg1v.16b, dgbv.16b
 
@@ -131,16 +137,24 @@ CPU_LE(   rev32   v19.16b, v19.16b)
add dgbv.4s, dgbv.4s, dg1v.4s
 
/* handled all input blocks? */
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {dgav.4s, dgbv.4s}, [x19]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/*
 * Final block: add padding and total bit count.
 * Skip if the input size was not a round multiple of the block size,
 * the padding is handled by the C code in that case.
 */
-   cbz x4, 3f
+3: cbz x4, 4f
ldr_l   w4, sha256_ce_offsetof_count, x4
-   ldr x4, [x0, x4]
+   ldr x4, [x19, x4]
moviv17.2d, #0
mov x8, #0x8000
moviv18.2d, #0
@@ -149,9 +163,10 @@ CPU_LE(rev32   v19.16b, v19.16b)
mov x4, #0
mov v19.d[0], xzr
mov v19.d[1], x7
-   b   1b
+   b   2b
 
/* store new state */
-3: st1 {dgav.4s, dgbv.4s}, [x0]
+4: st1 {dgav.4s, dgbv.4s}, [x19]
+   frame_pop
ret
 ENDPROC(sha2_ce_transform)
-- 
2.15.1



[PATCH v5 09/23] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels

2018-03-10 Thread Ard Biesheuvel
Tweak the SHA256 update routines to invoke the SHA256 block transform
block by block, to avoid excessive scheduling delays caused by the
NEON algorithm running with preemption disabled.

Also, remove a stale comment which no longer applies now that kernel
mode NEON is actually disallowed in some contexts.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha256-glue.c | 36 +---
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/crypto/sha256-glue.c b/arch/arm64/crypto/sha256-glue.c
index b064d925fe2a..e8880ccdc71f 100644
--- a/arch/arm64/crypto/sha256-glue.c
+++ b/arch/arm64/crypto/sha256-glue.c
@@ -89,21 +89,32 @@ static struct shash_alg algs[] = { {
 static int sha256_update_neon(struct shash_desc *desc, const u8 *data,
  unsigned int len)
 {
-   /*
-* Stacking and unstacking a substantial slice of the NEON register
-* file may significantly affect performance for small updates when
-* executing in interrupt context, so fall back to the scalar code
-* in that case.
-*/
+   struct sha256_state *sctx = shash_desc_ctx(desc);
+
if (!may_use_simd())
return sha256_base_do_update(desc, data, len,
(sha256_block_fn *)sha256_block_data_order);
 
-   kernel_neon_begin();
-   sha256_base_do_update(desc, data, len,
-   (sha256_block_fn *)sha256_block_neon);
-   kernel_neon_end();
+   while (len > 0) {
+   unsigned int chunk = len;
+
+   /*
+* Don't hog the CPU for the entire time it takes to process all
+* input when running on a preemptible kernel, but process the
+* data block by block instead.
+*/
+   if (IS_ENABLED(CONFIG_PREEMPT) &&
+   chunk + sctx->count % SHA256_BLOCK_SIZE > SHA256_BLOCK_SIZE)
+   chunk = SHA256_BLOCK_SIZE -
+   sctx->count % SHA256_BLOCK_SIZE;
 
+   kernel_neon_begin();
+   sha256_base_do_update(desc, data, chunk,
+ (sha256_block_fn *)sha256_block_neon);
+   kernel_neon_end();
+   data += chunk;
+   len -= chunk;
+   }
return 0;
 }
 
@@ -117,10 +128,9 @@ static int sha256_finup_neon(struct shash_desc *desc, 
const u8 *data,
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_data_order);
} else {
-   kernel_neon_begin();
if (len)
-   sha256_base_do_update(desc, data, len,
-   (sha256_block_fn *)sha256_block_neon);
+   sha256_update_neon(desc, data, len);
+   kernel_neon_begin();
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_neon);
kernel_neon_end();
-- 
2.15.1
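
On a preemptible kernel, the chunking logic above caps each kernel-mode NEON
section at whatever is needed to complete the block currently being
assembled, so no single kernel_neon_begin()/kernel_neon_end() section covers
more than one SHA-256 block transform. A standalone sketch of that
arithmetic (BLOCK_SIZE and next_chunk() are illustrative names, not kernel
code; "buffered" plays the role of sctx->count):

#include <stddef.h>
#include <stdio.h>

#define BLOCK_SIZE 64   /* SHA256_BLOCK_SIZE */

/* How much of 'len' may be fed into one NEON section, given how many bytes
 * are already sitting in the partial block buffer. */
static size_t next_chunk(size_t buffered, size_t len, int preemptible)
{
        size_t chunk = len;

        if (preemptible && chunk + buffered % BLOCK_SIZE > BLOCK_SIZE)
                chunk = BLOCK_SIZE - buffered % BLOCK_SIZE;

        return chunk;
}

int main(void)
{
        /* 10 bytes already buffered, 1000 bytes of new input -> 54 bytes */
        printf("first chunk: %zu bytes\n", next_chunk(10, 1000, 1));
        return 0;
}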



[PATCH v5 18/23] crypto: arm64/crc32-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/crc32-ce-core.S | 40 +++-
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/crypto/crc32-ce-core.S 
b/arch/arm64/crypto/crc32-ce-core.S
index 16ed3c7ebd37..8061bf0f9c66 100644
--- a/arch/arm64/crypto/crc32-ce-core.S
+++ b/arch/arm64/crypto/crc32-ce-core.S
@@ -100,9 +100,10 @@
dCONSTANT   .reqd0
qCONSTANT   .reqq0
 
-   BUF .reqx0
-   LEN .reqx1
-   CRC .reqx2
+   BUF .reqx19
+   LEN .reqx20
+   CRC .reqx21
+   CONST   .reqx22
 
vzr .reqv9
 
@@ -123,7 +124,14 @@ ENTRY(crc32_pmull_le)
 ENTRY(crc32c_pmull_le)
adr_l   x3, .Lcrc32c_constants
 
-0: bic LEN, LEN, #15
+0: frame_push  4, 64
+
+   mov BUF, x0
+   mov LEN, x1
+   mov CRC, x2
+   mov CONST, x3
+
+   bic LEN, LEN, #15
ld1 {v1.16b-v4.16b}, [BUF], #0x40
movivzr.16b, #0
fmovdCONSTANT, CRC
@@ -132,7 +140,7 @@ ENTRY(crc32c_pmull_le)
cmp LEN, #0x40
b.ltless_64
 
-   ldr qCONSTANT, [x3]
+   ldr qCONSTANT, [CONST]
 
 loop_64:   /* 64 bytes Full cache line folding */
sub LEN, LEN, #0x40
@@ -162,10 +170,21 @@ loop_64:  /* 64 bytes Full cache line folding */
eor v4.16b, v4.16b, v8.16b
 
cmp LEN, #0x40
-   b.geloop_64
+   b.ltless_64
+
+   if_will_cond_yield_neon
+   stp q1, q2, [sp, #.Lframe_local_offset]
+   stp q3, q4, [sp, #.Lframe_local_offset + 32]
+   do_cond_yield_neon
+   ldp q1, q2, [sp, #.Lframe_local_offset]
+   ldp q3, q4, [sp, #.Lframe_local_offset + 32]
+   ldr qCONSTANT, [CONST]
+   movivzr.16b, #0
+   endif_yield_neon
+   b   loop_64
 
 less_64:   /* Folding cache line into 128bit */
-   ldr qCONSTANT, [x3, #16]
+   ldr qCONSTANT, [CONST, #16]
 
pmull2  v5.1q, v1.2d, vCONSTANT.2d
pmull   v1.1q, v1.1d, vCONSTANT.1d
@@ -204,8 +223,8 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
 
/* final 32-bit fold */
-   ldr dCONSTANT, [x3, #32]
-   ldr d3, [x3, #40]
+   ldr dCONSTANT, [CONST, #32]
+   ldr d3, [CONST, #40]
 
ext v2.16b, v1.16b, vzr.16b, #4
and v1.16b, v1.16b, v3.16b
@@ -213,7 +232,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
 
/* Finish up with the bit-reversed barrett reduction 64 ==> 32 bits */
-   ldr qCONSTANT, [x3, #48]
+   ldr qCONSTANT, [CONST, #48]
 
and v2.16b, v1.16b, v3.16b
ext v2.16b, vzr.16b, v2.16b, #8
@@ -223,6 +242,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
mov w0, v1.s[1]
 
+   frame_pop
ret
 ENDPROC(crc32_pmull_le)
 ENDPROC(crc32c_pmull_le)
-- 
2.15.1



[PATCH v5 16/23] crypto: arm64/aes-bs - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-neonbs-core.S | 305 +++-
 1 file changed, 170 insertions(+), 135 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-core.S 
b/arch/arm64/crypto/aes-neonbs-core.S
index ca0472500433..e613a87f8b53 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -565,54 +565,61 @@ ENDPROC(aesbs_decrypt8)
 *   int blocks)
 */
.macro  __ecb_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
 
 99:mov x5, #1
-   lsl x5, x5, x4
-   subsw4, w4, #8
-   cselx4, x4, xzr, pl
+   lsl x5, x5, x23
+   subsw23, w23, #8
+   cselx23, x23, xzr, pl
cselx5, x5, xzr, mi
 
-   ld1 {v0.16b}, [x1], #16
+   ld1 {v0.16b}, [x20], #16
tbnzx5, #1, 0f
-   ld1 {v1.16b}, [x1], #16
+   ld1 {v1.16b}, [x20], #16
tbnzx5, #2, 0f
-   ld1 {v2.16b}, [x1], #16
+   ld1 {v2.16b}, [x20], #16
tbnzx5, #3, 0f
-   ld1 {v3.16b}, [x1], #16
+   ld1 {v3.16b}, [x20], #16
tbnzx5, #4, 0f
-   ld1 {v4.16b}, [x1], #16
+   ld1 {v4.16b}, [x20], #16
tbnzx5, #5, 0f
-   ld1 {v5.16b}, [x1], #16
+   ld1 {v5.16b}, [x20], #16
tbnzx5, #6, 0f
-   ld1 {v6.16b}, [x1], #16
+   ld1 {v6.16b}, [x20], #16
tbnzx5, #7, 0f
-   ld1 {v7.16b}, [x1], #16
+   ld1 {v7.16b}, [x20], #16
 
-0: mov bskey, x2
-   mov rounds, x3
+0: mov bskey, x21
+   mov rounds, x22
bl  \do8
 
-   st1 {\o0\().16b}, [x0], #16
+   st1 {\o0\().16b}, [x19], #16
tbnzx5, #1, 1f
-   st1 {\o1\().16b}, [x0], #16
+   st1 {\o1\().16b}, [x19], #16
tbnzx5, #2, 1f
-   st1 {\o2\().16b}, [x0], #16
+   st1 {\o2\().16b}, [x19], #16
tbnzx5, #3, 1f
-   st1 {\o3\().16b}, [x0], #16
+   st1 {\o3\().16b}, [x19], #16
tbnzx5, #4, 1f
-   st1 {\o4\().16b}, [x0], #16
+   st1 {\o4\().16b}, [x19], #16
tbnzx5, #5, 1f
-   st1 {\o5\().16b}, [x0], #16
+   st1 {\o5\().16b}, [x19], #16
tbnzx5, #6, 1f
-   st1 {\o6\().16b}, [x0], #16
+   st1 {\o6\().16b}, [x19], #16
tbnzx5, #7, 1f
-   st1 {\o7\().16b}, [x0], #16
+   st1 {\o7\().16b}, [x19], #16
 
-   cbnzx4, 99b
+   cbz x23, 1f
+   cond_yield_neon
+   b   99b
 
-1: ldp x29, x30, [sp], #16
+1: frame_pop
ret
.endm
 
@@ -632,43 +639,49 @@ ENDPROC(aesbs_ecb_decrypt)
 */
.align  4
 ENTRY(aesbs_cbc_decrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  6
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
 
 99:mov x6, #1
-   lsl x6, x6, x4
-   subsw4, w4, #8
-   cselx4, x4, xzr, pl
+   lsl x6, x6, x23
+   subsw23, w23, #8
+   cselx23, x23, xzr, pl
cselx6, x6, xzr, mi
 
-   ld1 {v0.16b}, [x1], #16
+   ld1 {v0.16b}, [x20], #16
mov v25.16b, v0.16b
tbnzx6, #1, 0f
-   ld1 {v1.16b}, [x1], #16
+   ld1 {v1.16b}, [x20], #16
mov v26.16b, v1.16b
tbnzx6, #2, 0f
-   ld1 {v2.16b}, [x1], #16
+   ld1 {v2.16b}, [x20], #16
mov v27.16b, v2.16b
tbnzx6, #3, 0f
-   ld1 {v3.16b}, [x1], #16
+   ld1 {v3.16b}, [x20], #16
mov v28.16b, v3.16b
tbnzx6, #4, 0f
-   ld1 

[PATCH v5 15/23] crypto: arm64/aes-blk - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-ce.S|  15 +-
 arch/arm64/crypto/aes-modes.S | 331 
 2 files changed, 216 insertions(+), 130 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 50330f5c3adc..623e74ed1c67 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -30,18 +30,21 @@
.endm
 
/* prepare for encryption with key in rk[] */
-   .macro  enc_prepare, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  enc_prepare, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
/* prepare for encryption (again) but with new key in rk[] */
-   .macro  enc_switch_key, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  enc_switch_key, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
/* prepare for decryption with key in rk[] */
-   .macro  dec_prepare, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  dec_prepare, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
.macro  do_enc_Nx, de, mc, k, i0, i1, i2, i3
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index a68412e1e3a4..483a7130cf0e 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
.align  4
 
 aes_encrypt_block4x:
-   encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+   encrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
 ENDPROC(aes_encrypt_block4x)
 
 aes_decrypt_block4x:
-   decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+   decrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
 ENDPROC(aes_decrypt_block4x)
 
@@ -31,57 +31,71 @@ ENDPROC(aes_decrypt_block4x)
 */
 
 AES_ENTRY(aes_ecb_encrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
 
-   enc_prepare w3, x2, x5
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+
+.Lecbencrestart:
+   enc_prepare w22, x21, x5
 
 .LecbencloopNx:
-   subsw4, w4, #4
+   subsw23, w23, #4
bmi .Lecbenc1x
-   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
+   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
bl  aes_encrypt_block4x
-   st1 {v0.16b-v3.16b}, [x0], #64
+   st1 {v0.16b-v3.16b}, [x19], #64
+   cond_yield_neon .Lecbencrestart
b   .LecbencloopNx
 .Lecbenc1x:
-   addsw4, w4, #4
+   addsw23, w23, #4
beq .Lecbencout
 .Lecbencloop:
-   ld1 {v0.16b}, [x1], #16 /* get next pt block */
-   encrypt_block   v0, w3, x2, x5, w6
-   st1 {v0.16b}, [x0], #16
-   subsw4, w4, #1
+   ld1 {v0.16b}, [x20], #16/* get next pt block */
+   encrypt_block   v0, w22, x21, x5, w6
+   st1 {v0.16b}, [x19], #16
+   subsw23, w23, #1
bne .Lecbencloop
 .Lecbencout:
-   ldp x29, x30, [sp], #16
+   frame_pop
ret
 AES_ENDPROC(aes_ecb_encrypt)
 
 
 AES_ENTRY(aes_ecb_decrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
 
-   dec_prepare w3, x2, x5
+.Lecbdecrestart:
+   dec_prepare w22, x21, x5
 
 .LecbdecloopNx:
-   subsw4, w4, #4
+   subsw23, w23, #4
bmi .Lecbdec1x
-   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 ct blocks */
+   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
bl  aes_decrypt_block4x
-   st1 {v0.16b-v3.16b}, [x0], #64
+   st1 {v0.16b-v3.16b}, [x19], #64
+   cond_yield_neon .Lecbdecrestart
b   .LecbdecloopNx
 .Lecbdec1x:
-   addsw4, w4, #4
+   addsw23, w23, #4
beq .Lecbdecout
 .Lecbdecloop:
-   ld1 {v0.16b}, [x1], #16 /* get next ct block */
-   decrypt_block   v0, w3, x2, x5, w6
-   st1 {v0.16b}, [x0], #16
-   subsw4, w4, #1
+   

[PATCH v5 12/23] crypto: arm64/sha1-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha1-ce-core.S | 42 ++--
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 46049850727d..78eb35fb5056 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -69,30 +69,36 @@
 *int blocks)
 */
 ENTRY(sha1_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load round constants */
-   loadrc  k0.4s, 0x5a827999, w6
+0: loadrc  k0.4s, 0x5a827999, w6
loadrc  k1.4s, 0x6ed9eba1, w6
loadrc  k2.4s, 0x8f1bbcdc, w6
loadrc  k3.4s, 0xca62c1d6, w6
 
/* load state */
-   ld1 {dgav.4s}, [x0]
-   ldr dgb, [x0, #16]
+   ld1 {dgav.4s}, [x19]
+   ldr dgb, [x19, #16]
 
/* load sha1_ce_state::finalize */
ldr_l   w4, sha1_ce_offsetof_finalize, x4
-   ldr w4, [x0, x4]
+   ldr w4, [x19, x4]
 
/* load input */
-0: ld1 {v8.4s-v11.4s}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v8.4s-v11.4s}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev32   v8.16b, v8.16b  )
 CPU_LE(rev32   v9.16b, v9.16b  )
 CPU_LE(rev32   v10.16b, v10.16b)
 CPU_LE(rev32   v11.16b, v11.16b)
 
-1: add t0.4s, v8.4s, k0.4s
+2: add t0.4s, v8.4s, k0.4s
mov dg0v.16b, dgav.16b
 
add_update  c, ev, k0,  8,  9, 10, 11, dgb
@@ -123,16 +129,25 @@ CPU_LE(   rev32   v11.16b, v11.16b)
add dgbv.2s, dgbv.2s, dg1v.2s
add dgav.4s, dgav.4s, dg0v.4s
 
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {dgav.4s}, [x19]
+   str dgb, [x19, #16]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/*
 * Final block: add padding and total bit count.
 * Skip if the input size was not a round multiple of the block size,
 * the padding is handled by the C code in that case.
 */
-   cbz x4, 3f
+3: cbz x4, 4f
ldr_l   w4, sha1_ce_offsetof_count, x4
-   ldr x4, [x0, x4]
+   ldr x4, [x19, x4]
moviv9.2d, #0
mov x8, #0x80000000
moviv10.2d, #0
@@ -141,10 +156,11 @@ CPU_LE(   rev32   v11.16b, v11.16b)
mov x4, #0
mov v11.d[0], xzr
mov v11.d[1], x7
-   b   1b
+   b   2b
 
/* store new state */
-3: st1 {dgav.4s}, [x0]
-   str dgb, [x0, #16]
+4: st1 {dgav.4s}, [x19]
+   str dgb, [x19, #16]
+   frame_pop
ret
 ENDPROC(sha1_ce_transform)
-- 
2.15.1



[PATCH v5 10/23] arm64: assembler: add utility macros to push/pop stack frames

2018-03-10 Thread Ard Biesheuvel
We are going to add code to all the NEON crypto routines that will
turn them into non-leaf functions, so we need to manage the stack
frames. To make this less tedious and error prone, add some macros
that take the number of callee saved registers to preserve and the
extra size to allocate in the stack frame (for locals) and emit
the ldp/stp sequences.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/assembler.h | 70 
 1 file changed, 70 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 053d83e8db6f..eef1fd2c1c0b 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -555,6 +555,19 @@ USER(\label, icivau, \tmp2)// 
invalidate I line PoU
 #endif
.endm
 
+/*
+ * Errata workaround post TTBR0_EL1 update.
+ */
+   .macro  post_ttbr0_update_workaround
+#ifdef CONFIG_CAVIUM_ERRATUM_27456
+alternative_if ARM64_WORKAROUND_CAVIUM_27456
+   ic  iallu
+   dsb nsh
+   isb
+alternative_else_nop_endif
+#endif
+   .endm
+
 /**
  * Errata workaround prior to disable MMU. Insert an ISB immediately prior
  * to executing the MSR that will change SCTLR_ELn[M] from a value of 1 to 0.
@@ -565,4 +578,61 @@ USER(\label, icivau, \tmp2)// 
invalidate I line PoU
 #endif
.endm
 
+   /*
+* frame_push - Push @regcount callee saved registers to the stack,
+*  starting at x19, as well as x29/x30, and set x29 to
+*  the new value of sp. Add @extra bytes of stack space
+*  for locals.
+*/
+   .macro  frame_push, regcount:req, extra
+   __frame st, \regcount, \extra
+   .endm
+
+   /*
+* frame_pop  - Pop the callee saved registers from the stack that were
+*  pushed in the most recent call to frame_push, as well
+*  as x29/x30 and any extra stack space that may have been
+*  allocated.
+*/
+   .macro  frame_pop
+   __frame ld
+   .endm
+
+   .macro  __frame_regs, reg1, reg2, op, num
+   .if .Lframe_regcount == \num
+   \op\()r \reg1, [sp, #(\num + 1) * 8]
+   .elseif .Lframe_regcount > \num
+   \op\()p \reg1, \reg2, [sp, #(\num + 1) * 8]
+   .endif
+   .endm
+
+   .macro  __frame, op, regcount, extra=0
+   .ifc\op, st
+   .if (\regcount) < 0 || (\regcount) > 10
+   .error  "regcount should be in the range [0 ... 10]"
+   .endif
+   .if ((\extra) % 16) != 0
+   .error  "extra should be a multiple of 16 bytes"
+   .endif
+   .set.Lframe_regcount, \regcount
+   .set.Lframe_extra, \extra
+   .set.Lframe_local_offset, ((\regcount + 3) / 2) * 16
+   stp x29, x30, [sp, #-.Lframe_local_offset - .Lframe_extra]!
+   mov x29, sp
+   .elseif .Lframe_regcount == -1 // && op == 'ld'
+   .error  "frame_push/frame_pop may not be nested"
+   .endif
+
+   __frame_regsx19, x20, \op, 1
+   __frame_regsx21, x22, \op, 3
+   __frame_regsx23, x24, \op, 5
+   __frame_regsx25, x26, \op, 7
+   __frame_regsx27, x28, \op, 9
+
+   .ifc\op, ld
+   ldp x29, x30, [sp], #.Lframe_local_offset + .Lframe_extra
+   .set.Lframe_regcount, -1
+   .endif
+   .endm
+
 #endif /* __ASM_ASSEMBLER_H */
-- 
2.15.1
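
The space reserved by frame_push follows directly from the
.Lframe_local_offset expression above: x29/x30 plus the requested
callee-saved registers are stored pairwise below the local area, and @extra
bytes of locals (a multiple of 16) sit on top. A small userspace sketch of
that arithmetic, for illustration only:

#include <stdio.h>

int main(void)
{
        int extra = 64;         /* must be a multiple of 16 */

        for (int regcount = 0; regcount <= 10; regcount++) {
                /* mirrors: .set .Lframe_local_offset, ((\regcount + 3) / 2) * 16 */
                int local_offset = ((regcount + 3) / 2) * 16;

                printf("regcount=%2d  local_offset=%3d  total frame=%3d bytes\n",
                       regcount, local_offset, local_offset + extra);
        }
        return 0;
}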



[PATCH v5 21/23] crypto: arm64/sha512-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha512-ce-core.S | 27 +++-
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/crypto/sha512-ce-core.S 
b/arch/arm64/crypto/sha512-ce-core.S
index 7f3bca5c59a2..ce65e3abe4f2 100644
--- a/arch/arm64/crypto/sha512-ce-core.S
+++ b/arch/arm64/crypto/sha512-ce-core.S
@@ -107,17 +107,23 @@
 */
.text
 ENTRY(sha512_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load state */
-   ld1 {v8.2d-v11.2d}, [x0]
+0: ld1 {v8.2d-v11.2d}, [x19]
 
/* load first 4 round constants */
adr_l   x3, .Lsha512_rcon
ld1 {v20.2d-v23.2d}, [x3], #64
 
/* load input */
-0: ld1 {v12.2d-v15.2d}, [x1], #64
-   ld1 {v16.2d-v19.2d}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v12.2d-v15.2d}, [x20], #64
+   ld1 {v16.2d-v19.2d}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev64   v12.16b, v12.16b)
 CPU_LE(rev64   v13.16b, v13.16b)
@@ -196,9 +202,18 @@ CPU_LE(rev64   v19.16b, v19.16b)
add v11.2d, v11.2d, v3.2d
 
/* handled all input blocks? */
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {v8.2d-v11.2d}, [x19]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/* store new state */
-3: st1 {v8.2d-v11.2d}, [x0]
+3: st1 {v8.2d-v11.2d}, [x19]
+   frame_pop
ret
 ENDPROC(sha512_ce_transform)
-- 
2.15.1



[PATCH v5 19/23] crypto: arm64/crct10dif-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/crct10dif-ce-core.S | 32 +---
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S 
b/arch/arm64/crypto/crct10dif-ce-core.S
index f179c01bd55c..663ea71cdb38 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -74,13 +74,19 @@
.text
.cpugeneric+crypto
 
-   arg1_low32  .reqw0
-   arg2.reqx1
-   arg3.reqx2
+   arg1_low32  .reqw19
+   arg2.reqx20
+   arg3.reqx21
 
vzr .reqv13
 
 ENTRY(crc_t10dif_pmull)
+   frame_push  3, 128
+
+   mov arg1_low32, w0
+   mov arg2, x1
+   mov arg3, x2
+
movivzr.16b, #0 // init zero register
 
// adjust the 16-bit initial_crc value, scale it to 32 bits
@@ -175,8 +181,25 @@ CPU_LE(ext v12.16b, v12.16b, v12.16b, #8   
)
subsarg3, arg3, #128
 
// check if there is another 64B in the buffer to be able to fold
-   b.ge_fold_64_B_loop
+   b.lt_fold_64_B_end
+
+   if_will_cond_yield_neon
+   stp q0, q1, [sp, #.Lframe_local_offset]
+   stp q2, q3, [sp, #.Lframe_local_offset + 32]
+   stp q4, q5, [sp, #.Lframe_local_offset + 64]
+   stp q6, q7, [sp, #.Lframe_local_offset + 96]
+   do_cond_yield_neon
+   ldp q0, q1, [sp, #.Lframe_local_offset]
+   ldp q2, q3, [sp, #.Lframe_local_offset + 32]
+   ldp q4, q5, [sp, #.Lframe_local_offset + 64]
+   ldp q6, q7, [sp, #.Lframe_local_offset + 96]
+   ldr_l   q10, rk3, x8
+   movivzr.16b, #0 // init zero register
+   endif_yield_neon
+
+   b   _fold_64_B_loop
 
+_fold_64_B_end:
// at this point, the buffer pointer is pointing at the last y Bytes
// of the buffer the 64B of folded data is in 4 of the vector
// registers: v0, v1, v2, v3
@@ -304,6 +327,7 @@ _barrett:
 _cleanup:
// scale the result back to 16 bits
lsr x0, x0, #16
+   frame_pop
ret
 
 _less_than_128:
-- 
2.15.1



[PATCH v5 17/23] crypto: arm64/aes-ghash - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-core.S | 113 ++--
 arch/arm64/crypto/ghash-ce-glue.c |  28 +++--
 2 files changed, 97 insertions(+), 44 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S 
b/arch/arm64/crypto/ghash-ce-core.S
index 11ebf1ae248a..dcffb9e77589 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -213,22 +213,31 @@
.endm
 
.macro  __pmull_ghash, pn
-   ld1 {SHASH.2d}, [x3]
-   ld1 {XL.2d}, [x1]
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+
+0: ld1 {SHASH.2d}, [x22]
+   ld1 {XL.2d}, [x20]
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
__pmull_pre_\pn
 
/* do the head block first, if supplied */
-   cbz x4, 0f
-   ld1 {T1.2d}, [x4]
-   b   1f
+   cbz x23, 1f
+   ld1 {T1.2d}, [x23]
+   mov x23, xzr
+   b   2f
 
-0: ld1 {T1.2d}, [x2], #16
-   sub w0, w0, #1
+1: ld1 {T1.2d}, [x21], #16
+   sub w19, w19, #1
 
-1: /* multiply XL by SHASH in GF(2^128) */
+2: /* multiply XL by SHASH in GF(2^128) */
 CPU_LE(rev64   T1.16b, T1.16b  )
 
ext T2.16b, XL.16b, XL.16b, #8
@@ -250,9 +259,18 @@ CPU_LE(rev64   T1.16b, T1.16b  )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
 
-   cbnzw0, 0b
+   cbz w19, 3f
+
+   if_will_cond_yield_neon
+   st1 {XL.2d}, [x20]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
-   st1 {XL.2d}, [x1]
+3: st1 {XL.2d}, [x20]
+   frame_pop
ret
.endm
 
@@ -304,38 +322,55 @@ ENDPROC(pmull_ghash_update_p8)
.endm
 
.macro  pmull_gcm_do_crypt, enc
-   ld1 {SHASH.2d}, [x4]
-   ld1 {XL.2d}, [x1]
-   ldr x8, [x5, #8]// load lower counter
+   frame_push  10
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+   mov x25, x6
+   mov x26, x7
+   .if \enc == 1
+   ldr x27, [sp, #96]  // first stacked arg
+   .endif
+
+   ldr x28, [x24, #8]  // load lower counter
+CPU_LE(rev x28, x28)
+
+0: mov x0, x25
+   load_round_keys w26, x0
+   ld1 {SHASH.2d}, [x23]
+   ld1 {XL.2d}, [x20]
 
moviMASK.16b, #0xe1
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
-CPU_LE(rev x8, x8  )
shl MASK.2d, MASK.2d, #57
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
.if \enc == 1
-   ld1 {KS.16b}, [x7]
+   ld1 {KS.16b}, [x27]
.endif
 
-0: ld1 {CTR.8b}, [x5]  // load upper counter
-   ld1 {INP.16b}, [x3], #16
-   rev x9, x8
-   add x8, x8, #1
-   sub w0, w0, #1
+1: ld1 {CTR.8b}, [x24] // load upper counter
+   ld1 {INP.16b}, [x22], #16
+   rev x9, x28
+   add x28, x28, #1
+   sub w19, w19, #1
ins CTR.d[1], x9// set lower counter
 
.if \enc == 1
eor INP.16b, INP.16b, KS.16b// encrypt input
-   st1 {INP.16b}, [x2], #16
+   st1 {INP.16b}, [x21], #16
.endif
 
rev64   T1.16b, INP.16b
 
-   cmp w6, #12
-   b.ge2f  // AES-192/256?
+   cmp w26, #12
+   b.ge4f  // AES-192/256?
 
-1: enc_round   CTR, v21
+2: enc_round   CTR, v21
 
ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
@@ -390,27 +425,39 @@ CPU_LE(   rev x8, x8  )
 
.if \enc == 0
eor INP.16b, INP.16b, KS.16b
-   st1 

[PATCH v5 01/23] crypto: testmgr - add a new test case for CRC-T10DIF

2018-03-10 Thread Ard Biesheuvel
In order to be able to test yield support under preempt, add a test
vector for CRC-T10DIF that is long enough to require multiple iterations
of the primary loop of the accelerated x86 and arm64 implementations
(and thus allow possible preemption between them).

Signed-off-by: Ard Biesheuvel 
---
 crypto/testmgr.h | 259 
 1 file changed, 259 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 6044f6906bd6..52d9ff93beac 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -2044,6 +2044,265 @@ static const struct hash_testvec 
crct10dif_tv_template[] = {
.digest = (u8 *)(u16 []){ 0x44c6 },
.np = 4,
.tap= { 1, 255, 57, 6 },
+   }, {
+   .plaintext ="\x6e\x05\x79\x10\xa7\x1b\xb2\x49"
+   "\xe0\x54\xeb\x82\x19\x8d\x24\xbb"
+   "\x2f\xc6\x5d\xf4\x68\xff\x96\x0a"
+   "\xa1\x38\xcf\x43\xda\x71\x08\x7c"
+   "\x13\xaa\x1e\xb5\x4c\xe3\x57\xee"
+   "\x85\x1c\x90\x27\xbe\x32\xc9\x60"
+   "\xf7\x6b\x02\x99\x0d\xa4\x3b\xd2"
+   "\x46\xdd\x74\x0b\x7f\x16\xad\x21"
+   "\xb8\x4f\xe6\x5a\xf1\x88\x1f\x93"
+   "\x2a\xc1\x35\xcc\x63\xfa\x6e\x05"
+   "\x9c\x10\xa7\x3e\xd5\x49\xe0\x77"
+   "\x0e\x82\x19\xb0\x24\xbb\x52\xe9"
+   "\x5d\xf4\x8b\x22\x96\x2d\xc4\x38"
+   "\xcf\x66\xfd\x71\x08\x9f\x13\xaa"
+   "\x41\xd8\x4c\xe3\x7a\x11\x85\x1c"
+   "\xb3\x27\xbe\x55\xec\x60\xf7\x8e"
+   "\x02\x99\x30\xc7\x3b\xd2\x69\x00"
+   "\x74\x0b\xa2\x16\xad\x44\xdb\x4f"
+   "\xe6\x7d\x14\x88\x1f\xb6\x2a\xc1"
+   "\x58\xef\x63\xfa\x91\x05\x9c\x33"
+   "\xca\x3e\xd5\x6c\x03\x77\x0e\xa5"
+   "\x19\xb0\x47\xde\x52\xe9\x80\x17"
+   "\x8b\x22\xb9\x2d\xc4\x5b\xf2\x66"
+   "\xfd\x94\x08\x9f\x36\xcd\x41\xd8"
+   "\x6f\x06\x7a\x11\xa8\x1c\xb3\x4a"
+   "\xe1\x55\xec\x83\x1a\x8e\x25\xbc"
+   "\x30\xc7\x5e\xf5\x69\x00\x97\x0b"
+   "\xa2\x39\xd0\x44\xdb\x72\x09\x7d"
+   "\x14\xab\x1f\xb6\x4d\xe4\x58\xef"
+   "\x86\x1d\x91\x28\xbf\x33\xca\x61"
+   "\xf8\x6c\x03\x9a\x0e\xa5\x3c\xd3"
+   "\x47\xde\x75\x0c\x80\x17\xae\x22"
+   "\xb9\x50\xe7\x5b\xf2\x89\x20\x94"
+   "\x2b\xc2\x36\xcd\x64\xfb\x6f\x06"
+   "\x9d\x11\xa8\x3f\xd6\x4a\xe1\x78"
+   "\x0f\x83\x1a\xb1\x25\xbc\x53\xea"
+   "\x5e\xf5\x8c\x00\x97\x2e\xc5\x39"
+   "\xd0\x67\xfe\x72\x09\xa0\x14\xab"
+   "\x42\xd9\x4d\xe4\x7b\x12\x86\x1d"
+   "\xb4\x28\xbf\x56\xed\x61\xf8\x8f"
+   "\x03\x9a\x31\xc8\x3c\xd3\x6a\x01"
+   "\x75\x0c\xa3\x17\xae\x45\xdc\x50"
+   "\xe7\x7e\x15\x89\x20\xb7\x2b\xc2"
+   "\x59\xf0\x64\xfb\x92\x06\x9d\x34"
+   "\xcb\x3f\xd6\x6d\x04\x78\x0f\xa6"
+   "\x1a\xb1\x48\xdf\x53\xea\x81\x18"
+   "\x8c\x23\xba\x2e\xc5\x5c\xf3\x67"
+   "\xfe\x95\x09\xa0\x37\xce\x42\xd9"
+   "\x70\x07\x7b\x12\xa9\x1d\xb4\x4b"
+   "\xe2\x56\xed\x84\x1b\x8f\x26\xbd"
+   "\x31\xc8\x5f\xf6\x6a\x01\x98\x0c"
+   "\xa3\x3a\xd1\x45\xdc\x73\x0a\x7e"
+   "\x15\xac\x20\xb7\x4e\xe5\x59\xf0"
+   "\x87\x1e\x92\x29\xc0\x34\xcb\x62"
+   "\xf9\x6d\x04\x9b\x0f\xa6\x3d\xd4"
+   "\x48\xdf\x76\x0d\x81\x18\xaf\x23"
+   "\xba\x51\xe8\x5c\xf3\x8a\x21\x95"
+   "\x2c\xc3\x37\xce\x65\xfc\x70\x07"
+   "\x9e\x12\xa9\x40\xd7\x4b\xe2\x79"
+   "\x10\x84\x1b\xb2\x26\xbd\x54\xeb"
+   "\x5f\xf6\x8d\x01\x98\x2f\xc6\x3a"
+   "\xd1\x68\xff\x73\x0a\xa1\x15\xac"
+   

[PATCH v5 02/23] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop

2018-03-10 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-ce-ccm-glue.c | 47 ++--
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c 
b/arch/arm64/crypto/aes-ce-ccm-glue.c
index a1254036f2b1..68b11aa690e4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -107,11 +107,13 @@ static int ccm_init_mac(struct aead_request *req, u8 
maciv[], u32 msglen)
 }
 
 static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
-  u32 abytes, u32 *macp, bool use_neon)
+  u32 abytes, u32 *macp)
 {
-   if (likely(use_neon)) {
+   if (may_use_simd()) {
+   kernel_neon_begin();
ce_aes_ccm_auth_data(mac, in, abytes, macp, key->key_enc,
 num_rounds(key));
+   kernel_neon_end();
} else {
if (*macp > 0 && *macp < AES_BLOCK_SIZE) {
int added = min(abytes, AES_BLOCK_SIZE - *macp);
@@ -143,8 +145,7 @@ static void ccm_update_mac(struct crypto_aes_ctx *key, u8 
mac[], u8 const in[],
}
 }
 
-static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
-  bool use_neon)
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
 {
struct crypto_aead *aead = crypto_aead_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
@@ -163,7 +164,7 @@ static void ccm_calculate_auth_mac(struct aead_request 
*req, u8 mac[],
ltag.len = 6;
}
 
-   ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp, use_neon);
+   ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp);
scatterwalk_start(&walk, req->src);
 
do {
@@ -175,7 +176,7 @@ static void ccm_calculate_auth_mac(struct aead_request 
*req, u8 mac[],
n = scatterwalk_clamp(&walk, len);
}
p = scatterwalk_map(&walk);
-   ccm_update_mac(ctx, mac, p, n, &macp, use_neon);
+   ccm_update_mac(ctx, mac, p, n, &macp);
len -= n;
 
scatterwalk_unmap(p);
@@ -242,43 +243,42 @@ static int ccm_encrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen;
-   bool use_neon = may_use_simd();
int err;
 
err = ccm_init_mac(req, mac, len);
if (err)
return err;
 
-   if (likely(use_neon))
-   kernel_neon_begin();
-
if (req->assoclen)
-   ccm_calculate_auth_mac(req, mac, use_neon);
+   ccm_calculate_auth_mac(req, mac);
 
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
 
err = skcipher_walk_aead_encrypt(&walk, req, true);
 
-   if (likely(use_neon)) {
+   if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
 
if (walk.nbytes == walk.total)
tail = 0;
 
+   kernel_neon_begin();
ce_aes_ccm_encrypt(walk.dst.virt.addr,
   walk.src.virt.addr,
   walk.nbytes - tail, ctx->key_enc,
   num_rounds(ctx), mac, walk.iv);
+   kernel_neon_end();
 
err = skcipher_walk_done(&walk, tail);
}
-   if (!err)
+   if (!err) {
+   kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
 num_rounds(ctx));
-
-   kernel_neon_end();
+   kernel_neon_end();
+   }
} else {
err = 

[PATCH v5 04/23] crypto: arm64/aes-bs - move kernel mode neon en/disable into loop

2018-03-10 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-neonbs-glue.c | 36 +---
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-glue.c 
b/arch/arm64/crypto/aes-neonbs-glue.c
index 9d823c77ec84..e7a95a566462 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -99,9 +99,8 @@ static int __ecb_crypt(struct skcipher_request *req,
struct skcipher_walk walk;
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
@@ -109,12 +108,13 @@ static int __ecb_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
 
+   kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->rk,
   ctx->rounds, blocks);
+   kernel_neon_end();
err = skcipher_walk_done(&walk,
 walk.nbytes - blocks * AES_BLOCK_SIZE);
}
-   kernel_neon_end();
 
return err;
 }
@@ -158,19 +158,19 @@ static int cbc_encrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
/* fall back to the non-bitsliced NEON implementation */
+   kernel_neon_begin();
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
 ctx->enc, ctx->key.rounds, blocks,
 walk.iv);
+   kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
-   kernel_neon_end();
return err;
 }
 
@@ -181,9 +181,8 @@ static int cbc_decrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
@@ -191,13 +190,14 @@ static int cbc_decrypt(struct skcipher_request *req)
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
 
+   kernel_neon_begin();
aesbs_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
  ctx->key.rk, ctx->key.rounds, blocks,
  walk.iv);
+   kernel_neon_end();
err = skcipher_walk_done(&walk,
 walk.nbytes - blocks * AES_BLOCK_SIZE);
}
-   kernel_neon_end();
 
return err;
 }
@@ -229,9 +229,8 @@ static int ctr_encrypt(struct skcipher_request *req)
u8 buf[AES_BLOCK_SIZE];
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *final = (walk.total % AES_BLOCK_SIZE) ? buf : NULL;
@@ -242,8 +241,10 @@ static int ctr_encrypt(struct skcipher_request *req)
final = NULL;
}
 
+   kernel_neon_begin();
aesbs_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
  ctx->rk, ctx->rounds, blocks, walk.iv, final);
+   kernel_neon_end();
 
if 

[PATCH v5 03/23] crypto: arm64/aes-blk - move kernel mode neon en/disable into loop

2018-03-10 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-glue.c| 95 ++--
 arch/arm64/crypto/aes-modes.S   | 90 +--
 arch/arm64/crypto/aes-neonbs-glue.c | 14 ++-
 3 files changed, 97 insertions(+), 102 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 2fa850e86aa8..253188fb8cb0 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -64,17 +64,17 @@ MODULE_LICENSE("GPL v2");
 
 /* defined in aes-modes.S */
 asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, int first);
+   int rounds, int blocks);
 asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, int first);
+   int rounds, int blocks);
 
 asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, u8 iv[], int first);
+   int rounds, int blocks, u8 iv[]);
 asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, u8 iv[], int first);
+   int rounds, int blocks, u8 iv[]);
 
 asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, u8 ctr[], int first);
+   int rounds, int blocks, u8 ctr[]);
 
 asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
int rounds, int blocks, u8 const rk2[], u8 iv[],
@@ -133,19 +133,19 @@ static int ecb_encrypt(struct skcipher_request *req)
 {
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-   int err, first, rounds = 6 + ctx->key_length / 4;
+   int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
-   for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+   while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+   kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_enc, rounds, blocks, first);
+   (u8 *)ctx->key_enc, rounds, blocks);
+   kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
-   kernel_neon_end();
return err;
 }
 
@@ -153,19 +153,19 @@ static int ecb_decrypt(struct skcipher_request *req)
 {
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-   int err, first, rounds = 6 + ctx->key_length / 4;
+   int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
-   for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+   while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+   kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_dec, rounds, blocks, first);
+   (u8 *)ctx->key_dec, rounds, blocks);
+   kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % 

[PATCH v5 07/23] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path

2018-03-10 Thread Ard Biesheuvel
CBC encryption is strictly sequential, and so the current AES code
simply processes the input one block at a time. However, we are
about to add yield support, which adds a bit of overhead, and which
we prefer to align with other modes in terms of granularity (i.e.,
it is better to have all routines yield every 64 bytes and not have
an exception for CBC encrypt which yields every 16 bytes)

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-modes.S | 31 
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 27a235b2ddee..e86535a1329d 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -94,17 +94,36 @@ AES_ENDPROC(aes_ecb_decrypt)
 */
 
 AES_ENTRY(aes_cbc_encrypt)
-   ld1 {v0.16b}, [x5]  /* get iv */
+   ld1 {v4.16b}, [x5]  /* get iv */
enc_prepare w3, x2, x6
 
-.Lcbcencloop:
-   ld1 {v1.16b}, [x1], #16 /* get next pt block */
-   eor v0.16b, v0.16b, v1.16b  /* ..and xor with iv */
+.Lcbcencloop4x:
+   subsw4, w4, #4
+   bmi .Lcbcenc1x
+   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
+   eor v0.16b, v0.16b, v4.16b  /* ..and xor with iv */
encrypt_block   v0, w3, x2, x6, w7
-   st1 {v0.16b}, [x0], #16
+   eor v1.16b, v1.16b, v0.16b
+   encrypt_block   v1, w3, x2, x6, w7
+   eor v2.16b, v2.16b, v1.16b
+   encrypt_block   v2, w3, x2, x6, w7
+   eor v3.16b, v3.16b, v2.16b
+   encrypt_block   v3, w3, x2, x6, w7
+   st1 {v0.16b-v3.16b}, [x0], #64
+   mov v4.16b, v3.16b
+   b   .Lcbcencloop4x
+.Lcbcenc1x:
+   addsw4, w4, #4
+   beq .Lcbcencout
+.Lcbcencloop:
+   ld1 {v0.16b}, [x1], #16 /* get next pt block */
+   eor v4.16b, v4.16b, v0.16b  /* ..and xor with iv */
+   encrypt_block   v4, w3, x2, x6, w7
+   st1 {v4.16b}, [x0], #16
subsw4, w4, #1
bne .Lcbcencloop
-   st1 {v0.16b}, [x5]  /* return iv */
+.Lcbcencout:
+   st1 {v4.16b}, [x5]  /* return iv */
ret
 AES_ENDPROC(aes_cbc_encrypt)
 
-- 
2.15.1
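
To make the sequential dependency explicit, here is a plain C sketch of CBC
encryption: each ciphertext block is the chaining value for the next, which
is why the 4-way unroll above can only merge the loads and stores, not run
the four AES invocations in parallel. The single-block cipher below is a
trivial stand-in so the sketch is self-contained; it is not AES.

#include <stdio.h>
#include <string.h>

/* Stand-in single-block "cipher"; the real code runs the AES rounds here. */
static void encrypt_block_stub(unsigned char blk[16])
{
        for (int i = 0; i < 16; i++)
                blk[i] = (unsigned char)((blk[i] ^ 0xa5) + i);
}

static void cbc_encrypt_sketch(unsigned char *out, const unsigned char *in,
                               unsigned char iv[16], int blocks)
{
        for (int i = 0; i < blocks; i++) {
                for (int j = 0; j < 16; j++)
                        iv[j] ^= in[16 * i + j];  /* xor plaintext into chaining value */
                encrypt_block_stub(iv);           /* ciphertext = E(chaining value) */
                memcpy(out + 16 * i, iv, 16);     /* and it chains into the next block */
        }
}

int main(void)
{
        unsigned char iv[16] = { 0 }, in[32] = { 1, 2, 3 }, out[32];

        cbc_encrypt_sketch(out, in, iv, 2);
        printf("second ciphertext block starts with 0x%02x\n", out[16]);
        return 0;
}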



[PATCH v5 06/23] crypto: arm64/aes-blk - remove configurable interleave

2018-03-10 Thread Ard Biesheuvel
The AES block mode implementation using Crypto Extensions or plain NEON
was written before real hardware existed, and so its interleave factor
was made build-time configurable (along with an option to instantiate
all interleaved sequences inline rather than as subroutines)

We ended up using INTERLEAVE=4 with inlining disabled for both flavors
of the core AES routines, so let's stick with that, and remove the option
to configure this at build time. This makes the code easier to modify,
which is nice now that we're adding yield support.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/Makefile|   3 -
 arch/arm64/crypto/aes-modes.S | 237 
 2 files changed, 40 insertions(+), 200 deletions(-)

diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index cee9b8d9830b..b6b624602582 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -59,9 +59,6 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
 obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
 aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
 
-AFLAGS_aes-ce.o:= -DINTERLEAVE=4
-AFLAGS_aes-neon.o  := -DINTERLEAVE=4
-
 CFLAGS_aes-glue-ce.o   := -DUSE_V8_CRYPTO_EXTENSIONS
 
 $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 65b273667b34..27a235b2ddee 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -13,44 +13,6 @@
.text
.align  4
 
-/*
- * There are several ways to instantiate this code:
- * - no interleave, all inline
- * - 2-way interleave, 2x calls out of line (-DINTERLEAVE=2)
- * - 2-way interleave, all inline (-DINTERLEAVE=2 -DINTERLEAVE_INLINE)
- * - 4-way interleave, 4x calls out of line (-DINTERLEAVE=4)
- * - 4-way interleave, all inline (-DINTERLEAVE=4 -DINTERLEAVE_INLINE)
- *
- * Macros imported by this code:
- * - enc_prepare   - setup NEON registers for encryption
- * - dec_prepare   - setup NEON registers for decryption
- * - enc_switch_key- change to new key after having prepared for encryption
- * - encrypt_block - encrypt a single block
- * - decrypt block - decrypt a single block
- * - encrypt_block2x   - encrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - decrypt_block2x   - decrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - encrypt_block4x   - encrypt 4 blocks in parallel (if INTERLEAVE == 4)
- * - decrypt_block4x   - decrypt 4 blocks in parallel (if INTERLEAVE == 4)
- */
-
-#if defined(INTERLEAVE) && !defined(INTERLEAVE_INLINE)
-#define FRAME_PUSH stp x29, x30, [sp,#-16]! ; mov x29, sp
-#define FRAME_POP  ldp x29, x30, [sp],#16
-
-#if INTERLEAVE == 2
-
-aes_encrypt_block2x:
-   encrypt_block2x v0, v1, w3, x2, x8, w7
-   ret
-ENDPROC(aes_encrypt_block2x)
-
-aes_decrypt_block2x:
-   decrypt_block2x v0, v1, w3, x2, x8, w7
-   ret
-ENDPROC(aes_decrypt_block2x)
-
-#elif INTERLEAVE == 4
-
 aes_encrypt_block4x:
encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
@@ -61,48 +23,6 @@ aes_decrypt_block4x:
ret
 ENDPROC(aes_decrypt_block4x)
 
-#else
-#error INTERLEAVE should equal 2 or 4
-#endif
-
-   .macro  do_encrypt_block2x
-   bl  aes_encrypt_block2x
-   .endm
-
-   .macro  do_decrypt_block2x
-   bl  aes_decrypt_block2x
-   .endm
-
-   .macro  do_encrypt_block4x
-   bl  aes_encrypt_block4x
-   .endm
-
-   .macro  do_decrypt_block4x
-   bl  aes_decrypt_block4x
-   .endm
-
-#else
-#define FRAME_PUSH
-#define FRAME_POP
-
-   .macro  do_encrypt_block2x
-   encrypt_block2x v0, v1, w3, x2, x8, w7
-   .endm
-
-   .macro  do_decrypt_block2x
-   decrypt_block2x v0, v1, w3, x2, x8, w7
-   .endm
-
-   .macro  do_encrypt_block4x
-   encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
-   .endm
-
-   .macro  do_decrypt_block4x
-   decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
-   .endm
-
-#endif
-
/*
 * aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
 * int blocks)
@@ -111,28 +31,21 @@ ENDPROC(aes_decrypt_block4x)
 */
 
 AES_ENTRY(aes_ecb_encrypt)
-   FRAME_PUSH
+   stp x29, x30, [sp, #-16]!
+   mov x29, sp
 
enc_prepare w3, x2, x5
 
 .LecbencloopNx:
-#if INTERLEAVE >= 2
-   subsw4, w4, #INTERLEAVE
+   subsw4, w4, #4
bmi .Lecbenc1x
-#if INTERLEAVE == 2
-   ld1 {v0.16b-v1.16b}, [x1], #32  /* get 2 pt blocks */
-   do_encrypt_block2x
-   st1 {v0.16b-v1.16b}, [x0], #32
-#else
ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
-   do_encrypt_block4x
+   bl  aes_encrypt_block4x
st1

[PATCH v5 05/23] crypto: arm64/chacha20 - move kernel mode neon en/disable into loop

2018-03-10 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/chacha20-neon-glue.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/crypto/chacha20-neon-glue.c 
b/arch/arm64/crypto/chacha20-neon-glue.c
index cbdb75d15cd0..727579c93ded 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -37,12 +37,19 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 
*src,
u8 buf[CHACHA20_BLOCK_SIZE];
 
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+   kernel_neon_begin();
chacha20_4block_xor_neon(state, dst, src);
+   kernel_neon_end();
bytes -= CHACHA20_BLOCK_SIZE * 4;
src += CHACHA20_BLOCK_SIZE * 4;
dst += CHACHA20_BLOCK_SIZE * 4;
state[12] += 4;
}
+
+   if (!bytes)
+   return;
+
+   kernel_neon_begin();
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE;
@@ -55,6 +62,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 
*src,
chacha20_block_xor_neon(state, buf, buf);
memcpy(dst, buf, bytes);
}
+   kernel_neon_end();
 }
 
 static int chacha20_neon(struct skcipher_request *req)
@@ -68,11 +76,10 @@ static int chacha20_neon(struct skcipher_request *req)
if (!may_use_simd() || req->cryptlen <= CHACHA20_BLOCK_SIZE)
return crypto_chacha20_crypt(req);
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
crypto_chacha20_init(state, ctx, walk.iv);
 
-   kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
 
@@ -83,7 +90,6 @@ static int chacha20_neon(struct skcipher_request *req)
nbytes);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
-   kernel_neon_end();
 
return err;
 }
-- 
2.15.1
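
The glue-code changes in this patch and in the AES/CCM patches above all
follow the same shape; a minimal C sketch of that shape, with hypothetical
helpers standing in for the per-driver NEON routine and the skcipher walk
bookkeeping (the kernel_neon_* prototypes are declared here only to keep the
sketch self-contained):

void kernel_neon_begin(void);                   /* real arm64 kernel API */
void kernel_neon_end(void);

unsigned int get_next_chunk(void);              /* hypothetical: 0 when the walk is done */
void process_chunk_neon(unsigned int nbytes);   /* hypothetical NEON routine */
int finish_chunk(unsigned int nbytes);          /* hypothetical, e.g. skcipher_walk_done() */

int crypt_loop_sketch(void)
{
        unsigned int nbytes;
        int err = 0;

        while ((nbytes = get_next_chunk()) > 0) {
                kernel_neon_begin();            /* NEON held (and preemption off) ...   */
                process_chunk_neon(nbytes);
                kernel_neon_end();              /* ... only around the actual NEON work */
                err = finish_chunk(nbytes);     /* bookkeeping may allocate or sleep    */
        }
        return err;
}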



[PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT

2018-03-10 Thread Ard Biesheuvel
As reported by Sebastian, the way the arm64 NEON crypto code currently
keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
causing problems with RT builds, given that the skcipher walk API may
allocate and free temporary buffers it uses to present the input and
output arrays to the crypto algorithm in blocksize sized chunks (where
blocksize is the natural blocksize of the crypto algorithm), and doing
so with NEON enabled means we're alloc/free'ing memory with preemption
disabled.

This was deliberate: when this code was introduced, each kernel_neon_begin()
and kernel_neon_end() call incurred a fixed penalty of storing resp.
loading the contents of all NEON registers to/from memory, and so doing
it less often had an obvious performance benefit. However, in the mean time,
we have refactored the core kernel mode NEON code, and now kernel_neon_begin()
only incurs this penalty the first time it is called after entering the kernel,
and the NEON register restore is deferred until returning to userland. This
means pulling those calls into the loops that iterate over the input/output
of the crypto algorithm is not a big deal anymore (although there are some
places in the code where we relied on the NEON registers retaining their
values between calls)

So let's clean this up for arm64: update the NEON based skcipher drivers to
no longer keep the NEON enabled when calling into the skcipher walk API.

As pointed out by Peter, this only solves part of the problem. So let's
tackle it more thoroughly, and update the algorithms to test the NEED_RESCHED
flag each time after processing a fixed chunk of input.

Given that this issue was flagged by the RT people, I would appreciate it
if they could confirm whether they are happy with this approach.

Changes since v4:
- rebase onto v4.16-rc3
- apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
  in v4.16-rc1

Changes since v3:
- incorporate Dave's feedback on the asm macros to push/pop frames and to yield
  the NEON conditionally
- make frame_push/pop easier to use, by recording the arguments to
  frame_push, removing the need to specify them again when calling frame_pop
- emit local symbol .Lframe_local_offset to allow code using the frame push/pop
  macros to index the stack more easily
- use the magic \@ macro invocation counter provided by GAS to generate unique
  labels in the NEON yield macros, rather than relying on chance

Changes since v2:
- Drop logic to yield only after so many blocks - as it turns out, the
  throughput of the algorithms that are most likely to be affected by the
  overhead (GHASH and AES-CE) only drops by ~1% (on Cortex-A57), and if that
  is unacceptable, you are probably not using CONFIG_PREEMPT in the first
  place.
- Add yield support to the AES-CCM driver
- Clean up macros based on feedback from Dave
- Given that I had to add stack frame logic to many of these functions, factor
  it out and wrap it in a couple of macros
- Merge the changes to the core asm driver and glue code of the GHASH/GCM
  driver. The latter was not correct without the former.

Changes since v1:
- add CRC-T10DIF test vector (#1)
- stop using GFP_ATOMIC in scatterwalk API calls, now that they are executed
  with preemption enabled (#2 - #6)
- do some preparatory refactoring on the AES block mode code (#7 - #9)
- add yield patches (#10 - #18)
- add test patch (#19) - DO NOT MERGE

Cc: Dave Martin 
Cc: Russell King - ARM Linux 
Cc: Sebastian Andrzej Siewior 
Cc: Mark Rutland 
Cc: linux-rt-us...@vger.kernel.org
Cc: Peter Zijlstra 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 

Ard Biesheuvel (23):
  crypto: testmgr - add a new test case for CRC-T10DIF
  crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
  crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
  crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
  crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
  crypto: arm64/aes-blk - remove configurable interleave
  crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
  crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
  crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
  arm64: assembler: add utility macros to push/pop stack frames
  arm64: assembler: add macros to conditionally yield the NEON under
PREEMPT
  crypto: arm64/sha1-ce - yield NEON after every block of input
  crypto: arm64/sha2-ce - yield NEON after every block of input
  crypto: arm64/aes-ccm - yield NEON after every block of input
  crypto: arm64/aes-blk - yield NEON after every block of input
  crypto: arm64/aes-bs - yield NEON after every block of input
  crypto: arm64/aes-ghash - yield NEON after every block of input
  

Re: [RFC 0/5] add integrity and security to TPM2 transactions

2018-03-10 Thread Jarkko Sakkinen
On Wed, 2018-03-07 at 15:29 -0800, James Bottomley wrote:
> By now, everybody knows we have a problem with the TPM2_RS_PW easy
> button on TPM2 in that transactions on the TPM bus can be intercepted
> and altered.  The way to fix this is to use real sessions for HMAC
> capabilities to ensure integrity and to use parameter and response
> encryption to ensure confidentiality of the data flowing over the TPM
> bus.
> 
> This RFC is about adding a simple API which can ensure the above
> properties as a layered addition to the existing TPM handling code.
>  Eventually we can add this to the random number generator, the PCR
> extensions and the trusted key handling, but this all depends on the
> conversion to tpm_buf which is not yet upstream, so I've constructed a
> second patch which demonstrates the new API in a test module for those
> who wish to play with it.
> 
> This series is also dependent on additions to the crypto subsystem to
> fix problems in the elliptic curve key handling and add the Cipher
> FeedBack encryption scheme:
> 
> https://marc.info/?l=linux-crypto-vger=151994371015475
> 
> In the second version, I added security HMAC to our PCR extend and
> encryption to the returned random number generators and also extracted
> the parsing and tpm2b construction API into a new file.
> 
> James

Might take up until end of next week before I have time to try this out.
Anyway, I'll see if I get this running on my systems before looking at the
code that much.

/Jarkko