[LSF/MM TOPIC] Improve performance of fget/fput

2019-02-15 Thread Dave Watson
In some of our hottest network services, fget_light + fput overhead
can represent 1-2% of the processes' total CPU usage.  I'd like to
discuss ways to reduce this overhead.

One proposal we have been testing is removing the refcount increment
and decrement, and using some sort of safe memory reclamation
instead. The hottest callers include recvmsg, sendmsg, epoll_wait, etc.
- mostly networking calls, often used on non-blocking sockets.  Often
we are only incrementing and decrementing the refcount for a very
short period of time; ideally we wouldn't adjust the refcount unless
we know we are going to block.

We could use RCU, but we would have to be particularly careful that
none of these calls ever block, or ensure that we increment the
refcount at the blocking locations.  As an alternative to RCU, hazard
pointers have similar overhead to SRCU, and could work equally well on
blocking or nonblocking syscalls without additional changes.
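
As a rough illustration of the direction (a conceptual sketch, not
working kernel code - fd_borrow(), do_recvmsg_blocking() and
do_recvmsg_nonblocking() are hypothetical placeholders), the fast path
would borrow the file under an RCU or hazard-pointer read section and
only take a real reference when it might sleep:

long recvmsg_fastpath(int fd, struct msghdr *msg, unsigned int flags)
{
        struct file *file;
        long err;

        rcu_read_lock();                /* or a hazard-pointer acquire */
        file = fd_borrow(fd);           /* hypothetical: no refcount bump */
        if (!file) {
                rcu_read_unlock();
                return -EBADF;
        }
        if (!(flags & MSG_DONTWAIT)) {
                get_file(file);         /* we might sleep: take a real ref */
                rcu_read_unlock();
                err = do_recvmsg_blocking(file, msg, flags);
                fput(file);
                return err;
        }
        err = do_recvmsg_nonblocking(file, msg, flags);
        rcu_read_unlock();              /* file must not be touched after this */
        return err;
}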

(There were also recent related discussions on SCM_RIGHTS refcount
cycle issues, which are the other half of a file* GC.)

There might also be ways to rearrange the file* struct or fd table so
that we're not taking so many cache misses for sockfd_lookup_light,
since for sockets we don't use most of the file* struct at all.



[PATCH 10/12] x86/crypto: aesni: Introduce READ_PARTIAL_BLOCK macro

2018-12-10 Thread Dave Watson
Introduce the READ_PARTIAL_BLOCK macro, and use it in the two existing
partial block cases: AAD and the end of ENC_DEC.  In particular, the
ENC_DEC case should be faster, since we now read 8 or 4 bytes at a
time where possible.

This macro will also be used to read partial blocks between
enc_update and dec_update calls.
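
For reference, a small standalone C model of the access pattern the
macro implements (illustration only - the real asm assembles the bytes
into an XMM register with shifts rather than into a buffer):

#include <stdint.h>
#include <string.h>

/* Copy len (1..15) bytes into a zero-padded 16-byte block without
 * reading past the end of src, preferring 8- and 4-byte reads. */
static void read_partial_block(const uint8_t *src, size_t len,
                               uint8_t block[16])
{
        size_t off = 0;

        memset(block, 0, 16);
        if (len >= 8) {                 /* one 8-byte read if possible */
                memcpy(block, src, 8);
                off = 8;
        }
        if (len - off >= 4) {           /* then a 4-byte read */
                memcpy(block + off, src + off, 4);
                off += 4;
        }
        for (; off < len; off++)        /* remaining 1-3 bytes */
                block[off] = src[off];
}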

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 102 +--
 1 file changed, 59 insertions(+), 43 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 44a4a8b43ca4..ff00ad19064d 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -415,68 +415,56 @@ _zero_cipher_left\@:
 vmovdqu %xmm14, AadHash(arg2)
 vmovdqu %xmm9, CurCount(arg2)
 
-cmp $16, arg5
-jl  _only_less_than_16\@
-
+# check for 0 length
 mov arg5, %r13
 and $15, %r13# r13 = (arg5 mod 16)
 
 je  _multiple_of_16_bytes\@
 
-# handle the last <16 Byte block seperately
+# handle the last <16 Byte block separately
 
 mov %r13, PBlockLen(arg2)
 
-vpaddd   ONE(%rip), %xmm9, %xmm9 # INCR CNT to get Yn
+vpaddd  ONE(%rip), %xmm9, %xmm9  # INCR CNT to get Yn
 vmovdqu %xmm9, CurCount(arg2)
 vpshufb SHUF_MASK(%rip), %xmm9, %xmm9
 
 ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Yn)
 vmovdqu %xmm9, PBlockEncKey(arg2)
 
-sub $16, %r11
-add %r13, %r11
-vmovdqu (arg4, %r11), %xmm1  # receive the last <16 
Byte block
-
-lea SHIFT_MASK+16(%rip), %r12
-sub %r13, %r12   # adjust the shuffle mask 
pointer to be
-# able to shift 16-r13 
bytes (r13 is the
-# number of bytes in 
plaintext mod 16)
-vmovdqu (%r12), %xmm2# get the appropriate 
shuffle mask
-vpshufb %xmm2, %xmm1, %xmm1  # shift right 16-r13 bytes
-jmp _final_ghash_mul\@
-
-_only_less_than_16\@:
-# check for 0 length
-mov arg5, %r13
-and $15, %r13# r13 = (arg5 mod 16)
+cmp $16, arg5
+jge _large_enough_update\@
 
-je  _multiple_of_16_bytes\@
+lea (arg4,%r11,1), %r10
+mov %r13, %r12
 
-# handle the last <16 Byte block separately
-
-
-vpaddd  ONE(%rip), %xmm9, %xmm9  # INCR CNT to get Yn
-vpshufb SHUF_MASK(%rip), %xmm9, %xmm9
-ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Yn)
-
-vmovdqu %xmm9, PBlockEncKey(arg2)
+READ_PARTIAL_BLOCK %r10 %r12 %xmm1
 
 lea SHIFT_MASK+16(%rip), %r12
 sub %r13, %r12   # adjust the shuffle mask 
pointer to be
 # able to shift 16-r13 
bytes (r13 is the
-# number of bytes in 
plaintext mod 16)
+   # number of bytes in plaintext mod 16)
 
-_get_last_16_byte_loop\@:
-movb(arg4, %r11),  %al
-movb%al,  TMP1 (%rsp , %r11)
-add $1, %r11
-cmp %r13,  %r11
-jne _get_last_16_byte_loop\@
+jmp _final_ghash_mul\@
+
+_large_enough_update\@:
+sub $16, %r11
+add %r13, %r11
+
+# receive the last <16 Byte block
+vmovdqu(arg4, %r11, 1), %xmm1
 
-vmovdqu  TMP1(%rsp), %xmm1
+sub%r13, %r11
+add$16, %r11
 
-sub $16, %r11
+leaSHIFT_MASK+16(%rip), %r12
+# adjust the shuffle mask pointer to be able to shift 16-r13 bytes
+# (r13 is the number of bytes in plaintext mod 16)
+sub%r13, %r12
+# get the appropriate shuffle mask
+vmovdqu(%r12), %xmm2
+# shift right 16-r13 bytes
+vpshufb  %xmm2, %xmm1, %xmm1
 
 _final_ghash_mul\@:
 .if  \ENC_DEC ==  DEC
@@ -490,8 +478,6 @@ _final_ghash_mul\@:
 vpxor   %xmm2, %xmm14, %xmm14
 
 vmovdqu %xmm14, AadHash(arg2)
-sub %r13, %r11
-add $16, %r11
 .else
 vpxor   %xmm1, %xmm9, %xmm9  # Plaintext XOR E(K, Yn)
 vmovdqu ALL_F-SHIFT_MASK(%r12), %xmm1# get the appropriate 
mask to
@@ -501,8 +487,6 @@ _final_ghash_mul\@:
 vpxor   %xmm9, %xmm14, %xmm14
 
 vmovdqu %xmm14, AadHash(arg2)
-sub %r13, %r11
-add $16, %r11
 vpshufb SHUF_MASK(%rip), %xmm9, %xmm9# shuffle xmm9 back to 
output as ciphertext
 .endif
 
@@ -721,6 +705,38 @@ _get_AAD_done\@:
 \PRECOMPUTE  %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
 .endm
 
+
+# R

[PATCH 12/12] x86/crypto: aesni: Add scatter/gather avx stubs, and use them in C

2018-12-10 Thread Dave Watson
Add the appropriate scatter/gather stubs to the AVX asm.
In the C code, we can now always use crypt_by_sg, since both the
SSE and AVX code now support scatter/gather.

Introduce a new struct, aesni_gcm_tfm, that is initialized on
startup to point to either the SSE, AVX, or AVX2 versions of the
four necessary encryption/decryption routines.

GENX_OPTSIZE is still checked at the start of crypt_by_sg.  The
total size of the data is checked, since the additional overhead
is in the init function, which calculates the additional HashKeys.
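
A sketch of the shape the dispatch takes in the C glue code (field
names and signatures are approximate - see the aesni-intel_glue.c
changes in this patch for the authoritative version):

struct aesni_gcm_tfm_s {
        void (*init)(void *ctx, struct gcm_context_data *gdata, u8 *iv,
                     u8 *hash_subkey, const u8 *aad, unsigned long aad_len);
        void (*enc_update)(void *ctx, struct gcm_context_data *gdata,
                           u8 *out, const u8 *in, unsigned long plaintext_len);
        void (*dec_update)(void *ctx, struct gcm_context_data *gdata,
                           u8 *out, const u8 *in, unsigned long ciphertext_len);
        void (*finalize)(void *ctx, struct gcm_context_data *gdata,
                         u8 *auth_tag, unsigned long auth_tag_len);
};

/* chosen once at module init, e.g.: */
if (boot_cpu_has(X86_FEATURE_AVX2))
        aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4;
else if (boot_cpu_has(X86_FEATURE_AVX))
        aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen2;
else
        aesni_gcm_tfm = &aesni_gcm_tfm_sse;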

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 181 ++--
 arch/x86/crypto/aesni-intel_glue.c   | 349 +++
 2 files changed, 198 insertions(+), 332 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index af45fc57db90..91c039ab5699 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -518,14 +518,13 @@ _less_than_8_bytes_left\@:
 #
 
 _multiple_of_16_bytes\@:
-GCM_COMPLETE \GHASH_MUL \REP
 .endm
 
 
 # GCM_COMPLETE Finishes update of tag of last partial block
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
-.macro GCM_COMPLETE GHASH_MUL REP
+.macro GCM_COMPLETE GHASH_MUL REP AUTH_TAG AUTH_TAG_LEN
 vmovdqu AadHash(arg2), %xmm14
 vmovdqu HashKey(arg2), %xmm13
 
@@ -560,8 +559,8 @@ _partial_done\@:
 
 
 _return_T\@:
-mov arg9, %r10  # r10 = authTag
-mov arg10, %r11  # r11 = auth_tag_len
+mov \AUTH_TAG, %r10  # r10 = authTag
+mov \AUTH_TAG_LEN, %r11  # r11 = auth_tag_len
 
 cmp $16, %r11
 je  _T_16\@
@@ -680,14 +679,14 @@ _get_AAD_done\@:
 
 mov %r11, PBlockLen(arg2) # ctx_data.partial_block_length = 0
 mov %r11, PBlockEncKey(arg2) # ctx_data.partial_block_enc_key = 0
-mov arg4, %rax
+mov arg3, %rax
 movdqu (%rax), %xmm0
 movdqu %xmm0, OrigIV(arg2) # ctx_data.orig_IV = iv
 
 vpshufb SHUF_MASK(%rip), %xmm0, %xmm0
 movdqu %xmm0, CurCount(arg2) # ctx_data.current_counter = iv
 
-vmovdqu  (arg3), %xmm6  # xmm6 = HashKey
+vmovdqu  (arg4), %xmm6  # xmm6 = HashKey
 
 vpshufb  SHUF_MASK(%rip), %xmm6, %xmm6
 ###  PRECOMPUTATION of HashKey<<1 mod poly from the HashKey
@@ -1776,88 +1775,100 @@ _initial_blocks_done\@:
 #const   u8 *aad, /* Additional Authentication Data (AAD)*/
 #u64 aad_len) /* Length of AAD in bytes. With RFC4106 this is 
going to be 8 or 12 Bytes */
 #
-ENTRY(aesni_gcm_precomp_avx_gen2)
+ENTRY(aesni_gcm_init_avx_gen2)
 FUNC_SAVE
 INIT GHASH_MUL_AVX, PRECOMPUTE_AVX
 FUNC_RESTORE
 ret
-ENDPROC(aesni_gcm_precomp_avx_gen2)
+ENDPROC(aesni_gcm_init_avx_gen2)
 
 ###
-#void   aesni_gcm_enc_avx_gen2(
+#void   aesni_gcm_enc_update_avx_gen2(
 #gcm_data*my_ctx_data, /* aligned to 16 Bytes */
 #gcm_context_data *data,
 #u8  *out, /* Ciphertext output. Encrypt in-place is allowed.  */
 #const   u8 *in, /* Plaintext input */
-#u64 plaintext_len, /* Length of data in Bytes for encryption. */
-#u8  *iv, /* Pre-counter block j0: 4 byte salt
-#  (from Security Association) concatenated with 8 byte
-#  Initialisation Vector (from IPSec ESP Payload)
-#  concatenated with 0x0001. 16-byte aligned pointer. 
*/
-#const   u8 *aad, /* Additional Authentication Data (AAD)*/
-#u64 aad_len, /* Length of AAD in bytes. With RFC4106 this is 
going to be 8 or 12 Bytes */
-#u8  *auth_tag, /* Authenticated Tag output. */
-#u64 auth_tag_len)# /* Authenticated Tag Length in bytes.
-#  Valid values are 16 (most likely), 12 or 8. */
+#u64 plaintext_len) /* Length of data in Bytes for encryption. */
 ###
-ENTRY(aesni_gcm_enc_avx_gen2)
+ENTRY(aesni_gcm_enc_update_avx_gen2)
 FUNC_SAVE
 mov keysize, %eax
 cmp $32, %eax
-je  key_256_enc
+je  key_256_enc_update
 cmp $16, %eax
-je  key_128_enc
+je  key_128_enc_update
 # must be 192
 GCM_ENC_DEC INITIAL_BLOCKS_AVX, GHASH_8_ENCRYPT_8_PARALLEL_AVX, 
GHASH_LAST_8_AVX, GHASH_MUL_AVX, ENC, 11
 FUNC_RESTORE
 ret
-key_128_enc:
+key_128_enc_update:
 GCM_ENC_DEC INITIAL_BLOCKS_AVX, GHASH_8_ENCRYPT_8_PARALLEL_AVX, 
GHASH_LAST_8_AVX, GHASH_MUL_AVX, 

[PATCH 07/12] x86/crypto: aesni: Merge avx precompute functions

2018-12-10 Thread Dave Watson
The precompute functions differ only in the sub-macros
they call; merge them into a single macro.  Later diffs
add more code to fill in the gcm_context_data structure,
and this keeps those changes in a single place.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 76 +---
 1 file changed, 27 insertions(+), 49 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 305abece93ad..e347ba61db65 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -661,6 +661,31 @@ _get_AAD_done\@:
 vmovdqu \T7, AadHash(arg2)
 .endm
 
+.macro INIT GHASH_MUL PRECOMPUTE
+vmovdqu  (arg3), %xmm6  # xmm6 = HashKey
+
+vpshufb  SHUF_MASK(%rip), %xmm6, %xmm6
+###  PRECOMPUTATION of HashKey<<1 mod poly from the HashKey
+vmovdqa  %xmm6, %xmm2
+vpsllq   $1, %xmm6, %xmm6
+vpsrlq   $63, %xmm2, %xmm2
+vmovdqa  %xmm2, %xmm1
+vpslldq  $8, %xmm2, %xmm2
+vpsrldq  $8, %xmm1, %xmm1
+vpor %xmm2, %xmm6, %xmm6
+#reduction
+vpshufd  $0b00100100, %xmm1, %xmm2
+vpcmpeqd TWOONE(%rip), %xmm2, %xmm2
+vpandPOLY(%rip), %xmm2, %xmm2
+vpxor%xmm2, %xmm6, %xmm6# xmm6 holds the HashKey<<1 mod 
poly
+###
+vmovdqu  %xmm6, HashKey(arg2)   # store HashKey<<1 mod poly
+
+CALC_AAD_HASH \GHASH_MUL, arg5, arg6, %xmm2, %xmm6, %xmm3, %xmm4, 
%xmm5, %xmm7, %xmm1, %xmm0
+
+\PRECOMPUTE  %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
+.endm
+
 #ifdef CONFIG_AS_AVX
 ###
 # GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
@@ -1558,31 +1583,7 @@ _initial_blocks_done\@:
 #
 ENTRY(aesni_gcm_precomp_avx_gen2)
 FUNC_SAVE
-
-vmovdqu  (arg3), %xmm6  # xmm6 = HashKey
-
-vpshufb  SHUF_MASK(%rip), %xmm6, %xmm6
-###  PRECOMPUTATION of HashKey<<1 mod poly from the HashKey
-vmovdqa  %xmm6, %xmm2
-vpsllq   $1, %xmm6, %xmm6
-vpsrlq   $63, %xmm2, %xmm2
-vmovdqa  %xmm2, %xmm1
-vpslldq  $8, %xmm2, %xmm2
-vpsrldq  $8, %xmm1, %xmm1
-vpor %xmm2, %xmm6, %xmm6
-#reduction
-vpshufd  $0b00100100, %xmm1, %xmm2
-vpcmpeqd TWOONE(%rip), %xmm2, %xmm2
-vpandPOLY(%rip), %xmm2, %xmm2
-vpxor%xmm2, %xmm6, %xmm6# xmm6 holds the HashKey<<1 mod 
poly
-###
-vmovdqu  %xmm6, HashKey(arg2)   # store HashKey<<1 mod poly
-
-
-CALC_AAD_HASH GHASH_MUL_AVX, arg5, arg6, %xmm2, %xmm6, %xmm3, %xmm4, 
%xmm5, %xmm7, %xmm1, %xmm0
-
-PRECOMPUTE_AVX  %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
-
+INIT GHASH_MUL_AVX, PRECOMPUTE_AVX
 FUNC_RESTORE
 ret
 ENDPROC(aesni_gcm_precomp_avx_gen2)
@@ -2547,30 +2548,7 @@ _initial_blocks_done\@:
 #
 ENTRY(aesni_gcm_precomp_avx_gen4)
 FUNC_SAVE
-
-vmovdqu  (arg3), %xmm6# xmm6 = HashKey
-
-vpshufb  SHUF_MASK(%rip), %xmm6, %xmm6
-###  PRECOMPUTATION of HashKey<<1 mod poly from the HashKey
-vmovdqa  %xmm6, %xmm2
-vpsllq   $1, %xmm6, %xmm6
-vpsrlq   $63, %xmm2, %xmm2
-vmovdqa  %xmm2, %xmm1
-vpslldq  $8, %xmm2, %xmm2
-vpsrldq  $8, %xmm1, %xmm1
-vpor %xmm2, %xmm6, %xmm6
-#reduction
-vpshufd  $0b00100100, %xmm1, %xmm2
-vpcmpeqd TWOONE(%rip), %xmm2, %xmm2
-vpandPOLY(%rip), %xmm2, %xmm2
-vpxor%xmm2, %xmm6, %xmm6  # xmm6 holds the HashKey<<1 mod 
poly
-###
-vmovdqu  %xmm6, HashKey(arg2) # store HashKey<<1 mod poly
-
-CALC_AAD_HASH GHASH_MUL_AVX2, arg5, arg6, %xmm2, %xmm6, %xmm3, %xmm4, 
%xmm5, %xmm7, %xmm1, %xmm0
-
-PRECOMPUTE_AVX2  %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
-
+INIT GHASH_MUL_AVX2, PRECOMPUTE_AVX2
 FUNC_RESTORE
 ret
 ENDPROC(aesni_gcm_precomp_avx_gen4)
-- 
2.17.1



[PATCH 11/12] x86/crypto: aesni: Introduce partial block macro

2018-12-10 Thread Dave Watson
Before this diff, multiple calls to GCM_ENC_DEC succeed,
but only if every call processes a multiple of 16 bytes.

Handle partial blocks at the start of GCM_ENC_DEC, and update
aadhash as appropriate.

The data offset %r11 is also updated after the partial block.
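
A simplified C model of the state carried between update calls (shown
for the encrypt direction; not the asm itself - the real code also
byte-swaps and masks in XMM registers, and ghash_absorb() stands in
for folding the resulting ciphertext into the AAD hash):

#include <stddef.h>
#include <stdint.h>

struct gcm_partial {
        uint8_t enc_key[16];    /* E(K, Yn) for the unfinished block (PBlockEncKey) */
        size_t used;            /* bytes of it already consumed (PBlockLen) */
};

/* Finish (or extend) a carried-over partial block; returns how many
 * input bytes were consumed. */
static size_t finish_partial(struct gcm_partial *p, uint8_t *out,
                             const uint8_t *in, size_t n,
                             void (*ghash_absorb)(const uint8_t *c, size_t len))
{
        size_t i, take;

        if (!p->used)
                return 0;               /* no partial block carried over */
        take = 16 - p->used;
        if (take > n)
                take = n;
        for (i = 0; i < take; i++)      /* XOR against leftover keystream */
                out[i] = in[i] ^ p->enc_key[p->used + i];
        ghash_absorb(out, take);        /* fold ciphertext into the hash */
        p->used = (p->used + take) % 16; /* back to 0 once the block fills */
        return take;
}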

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 156 ++-
 1 file changed, 150 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index ff00ad19064d..af45fc57db90 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -301,6 +301,12 @@ VARIABLE_OFFSET = 16*8
 vmovdqu  HashKey(arg2), %xmm13  # xmm13 = HashKey
 add arg5, InLen(arg2)
 
+# initialize the data pointer offset as zero
+xor %r11d, %r11d
+
+PARTIAL_BLOCK \GHASH_MUL, arg3, arg4, arg5, %r11, %xmm8, \ENC_DEC
+sub %r11, arg5
+
 mov arg5, %r13  # save the number of bytes of 
plaintext/ciphertext
 and $-16, %r13  # r13 = r13 - (r13 mod 16)
 
@@ -737,6 +743,150 @@ _read_next_byte_lt8_\@:
 _done_read_partial_block_\@:
 .endm
 
+# PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks
+# between update calls.
+# Requires the input data be at least 1 byte long due to READ_PARTIAL_BLOCK
+# Outputs encrypted bytes, and updates hash and partial info in 
gcm_data_context
+# Clobbers rax, r10, r12, r13, xmm0-6, xmm9-13
+.macro PARTIAL_BLOCK GHASH_MUL CYPH_PLAIN_OUT PLAIN_CYPH_IN PLAIN_CYPH_LEN 
DATA_OFFSET \
+AAD_HASH ENC_DEC
+movPBlockLen(arg2), %r13
+cmp$0, %r13
+je _partial_block_done_\@  # Leave Macro if no partial blocks
+# Read in input data without over reading
+cmp$16, \PLAIN_CYPH_LEN
+jl _fewer_than_16_bytes_\@
+vmovdqu(\PLAIN_CYPH_IN), %xmm1 # If more than 16 bytes, just 
fill xmm
+jmp_data_read_\@
+
+_fewer_than_16_bytes_\@:
+lea(\PLAIN_CYPH_IN, \DATA_OFFSET, 1), %r10
+mov\PLAIN_CYPH_LEN, %r12
+READ_PARTIAL_BLOCK %r10 %r12 %xmm1
+
+mov PBlockLen(arg2), %r13
+
+_data_read_\@: # Finished reading in data
+
+vmovdquPBlockEncKey(arg2), %xmm9
+vmovdquHashKey(arg2), %xmm13
+
+leaSHIFT_MASK(%rip), %r12
+
+# adjust the shuffle mask pointer to be able to shift r13 bytes
+# r16-r13 is the number of bytes in plaintext mod 16)
+add%r13, %r12
+vmovdqu(%r12), %xmm2   # get the appropriate shuffle 
mask
+vpshufb %xmm2, %xmm9, %xmm9# shift right r13 bytes
+
+.if  \ENC_DEC ==  DEC
+vmovdqa%xmm1, %xmm3
+pxor   %xmm1, %xmm9# Cyphertext XOR E(K, Yn)
+
+mov\PLAIN_CYPH_LEN, %r10
+add%r13, %r10
+# Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling
+sub$16, %r10
+# Determine if if partial block is not being filled and
+# shift mask accordingly
+jge_no_extra_mask_1_\@
+sub%r10, %r12
+_no_extra_mask_1_\@:
+
+vmovdquALL_F-SHIFT_MASK(%r12), %xmm1
+# get the appropriate mask to mask out bottom r13 bytes of xmm9
+vpand  %xmm1, %xmm9, %xmm9 # mask out bottom r13 bytes of 
xmm9
+
+vpand  %xmm1, %xmm3, %xmm3
+vmovdqaSHUF_MASK(%rip), %xmm10
+vpshufb%xmm10, %xmm3, %xmm3
+vpshufb%xmm2, %xmm3, %xmm3
+vpxor  %xmm3, \AAD_HASH, \AAD_HASH
+
+cmp$0, %r10
+jl _partial_incomplete_1_\@
+
+# GHASH computation for the last <16 Byte block
+\GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+xor%eax,%eax
+
+mov%rax, PBlockLen(arg2)
+jmp_dec_done_\@
+_partial_incomplete_1_\@:
+add\PLAIN_CYPH_LEN, PBlockLen(arg2)
+_dec_done_\@:
+vmovdqu\AAD_HASH, AadHash(arg2)
+.else
+vpxor  %xmm1, %xmm9, %xmm9 # Plaintext XOR E(K, Yn)
+
+mov\PLAIN_CYPH_LEN, %r10
+add%r13, %r10
+# Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling
+sub$16, %r10
+# Determine if if partial block is not being filled and
+# shift mask accordingly
+jge_no_extra_mask_2_\@
+sub%r10, %r12
+_no_extra_mask_2_\@:
+
+vmovdquALL_F-SHIFT_MASK(%r12), %xmm1
+# get the appropriate mask to mask out bottom r13 bytes of xmm9
+vpand  %xmm1, %xmm9, %xmm9
+
+vmovdqaSHUF_MASK(%rip), %xmm1
+vpshufb %xmm1, %xmm9, %xmm9
+vpshufb %xmm2, %xmm9, %xmm9
+vpxor  %xmm9, \AAD_HASH, \AAD_HASH
+
+cmp$0, %r10
+jl _partial_incomplete_2_\@
+
+# GH

[PATCH 09/12] x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE

2018-12-10 Thread Dave Watson
Prepare to handle partial blocks between scatter/gather calls.
For the last partial block, we only want to calculate the aadhash
in GCM_COMPLETE, and a new partial block macro will handle both
aadhash update and encrypting partial blocks between calls.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 0a9cdcfdd987..44a4a8b43ca4 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -488,8 +488,7 @@ _final_ghash_mul\@:
 vpand   %xmm1, %xmm2, %xmm2
 vpshufb SHUF_MASK(%rip), %xmm2, %xmm2
 vpxor   %xmm2, %xmm14, %xmm14
-   #GHASH computation for the last <16 Byte block
-\GHASH_MUL   %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+
 vmovdqu %xmm14, AadHash(arg2)
 sub %r13, %r11
 add $16, %r11
@@ -500,8 +499,7 @@ _final_ghash_mul\@:
 vpand   %xmm1, %xmm9, %xmm9  # mask out top 16-r13 
bytes of xmm9
 vpshufb SHUF_MASK(%rip), %xmm9, %xmm9
 vpxor   %xmm9, %xmm14, %xmm14
-   #GHASH computation for the last <16 Byte block
-\GHASH_MUL   %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+
 vmovdqu %xmm14, AadHash(arg2)
 sub %r13, %r11
 add $16, %r11
@@ -541,6 +539,14 @@ _multiple_of_16_bytes\@:
 vmovdqu AadHash(arg2), %xmm14
 vmovdqu HashKey(arg2), %xmm13
 
+mov PBlockLen(arg2), %r12
+cmp $0, %r12
+je _partial_done\@
+
+   #GHASH computation for the last <16 Byte block
+\GHASH_MUL   %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+
+_partial_done\@:
 mov AadLen(arg2), %r12  # r12 = aadLen (number 
of bytes)
 shl $3, %r12 # convert into number of 
bits
 vmovd   %r12d, %xmm15# len(A) in xmm15
-- 
2.17.1



[PATCH 06/12] x86/crypto: aesni: Split AAD hash calculation to separate macro

2018-12-10 Thread Dave Watson
AAD hash only needs to be calculated once for each scatter/gather operation.
Move it to its own macro, and call it from GCM_INIT instead of
INITIAL_BLOCKS.
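
In C terms, what CALC_AAD_HASH computes once at init time is the GHASH
of the AAD, one 16-byte block at a time with the last block zero-padded
(byte-swapping details omitted; ghash_mul() stands in for the GHASH_MUL
sub-macro):

#include <stdint.h>
#include <string.h>

static void calc_aad_hash(uint8_t hash[16], const uint8_t *aad,
                          size_t aad_len, const uint8_t hkey[16],
                          void (*ghash_mul)(uint8_t acc[16],
                                            const uint8_t h[16]))
{
        uint8_t block[16];
        size_t i, n;

        memset(hash, 0, 16);
        while (aad_len) {
                n = aad_len < 16 ? aad_len : 16;
                memset(block, 0, 16);           /* zero-pad the last block */
                memcpy(block, aad, n);
                for (i = 0; i < 16; i++)        /* accumulate: hash ^= block */
                        hash[i] ^= block[i];
                ghash_mul(hash, hkey);          /* hash = hash * H in GF(2^128) */
                aad += n;
                aad_len -= n;
        }
}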

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 228 ++-
 arch/x86/crypto/aesni-intel_glue.c   |  28 ++-
 2 files changed, 115 insertions(+), 141 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 8e9ae4b26118..305abece93ad 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -182,6 +182,14 @@ aad_shift_arr:
 .text
 
 
+#define AadHash 16*0
+#define AadLen 16*1
+#define InLen (16*1)+8
+#define PBlockEncKey 16*2
+#define OrigIV 16*3
+#define CurCount 16*4
+#define PBlockLen 16*5
+
 HashKey= 16*6   # store HashKey <<1 mod poly here
 HashKey_2  = 16*7   # store HashKey^2 <<1 mod poly here
 HashKey_3  = 16*8   # store HashKey^3 <<1 mod poly here
@@ -585,6 +593,74 @@ _T_16\@:
 _return_T_done\@:
 .endm
 
+.macro CALC_AAD_HASH GHASH_MUL AAD AADLEN T1 T2 T3 T4 T5 T6 T7 T8
+
+   mov \AAD, %r10  # r10 = AAD
+   mov \AADLEN, %r12  # r12 = aadLen
+
+
+   mov %r12, %r11
+
+   vpxor   \T8, \T8, \T8
+   vpxor   \T7, \T7, \T7
+   cmp $16, %r11
+   jl  _get_AAD_rest8\@
+_get_AAD_blocks\@:
+   vmovdqu (%r10), \T7
+   vpshufb SHUF_MASK(%rip), \T7, \T7
+   vpxor   \T7, \T8, \T8
+   \GHASH_MUL   \T8, \T2, \T1, \T3, \T4, \T5, \T6
+   add $16, %r10
+   sub $16, %r12
+   sub $16, %r11
+   cmp $16, %r11
+   jge _get_AAD_blocks\@
+   vmovdqu \T8, \T7
+   cmp $0, %r11
+   je  _get_AAD_done\@
+
+   vpxor   \T7, \T7, \T7
+
+   /* read the last <16B of AAD. since we have at least 4B of
+   data right after the AAD (the ICV, and maybe some CT), we can
+   read 4B/8B blocks safely, and then get rid of the extra stuff */
+_get_AAD_rest8\@:
+   cmp $4, %r11
+   jle _get_AAD_rest4\@
+   movq(%r10), \T1
+   add $8, %r10
+   sub $8, %r11
+   vpslldq $8, \T1, \T1
+   vpsrldq $8, \T7, \T7
+   vpxor   \T1, \T7, \T7
+   jmp _get_AAD_rest8\@
+_get_AAD_rest4\@:
+   cmp $0, %r11
+   jle  _get_AAD_rest0\@
+   mov (%r10), %eax
+   movq%rax, \T1
+   add $4, %r10
+   sub $4, %r11
+   vpslldq $12, \T1, \T1
+   vpsrldq $4, \T7, \T7
+   vpxor   \T1, \T7, \T7
+_get_AAD_rest0\@:
+   /* finalize: shift out the extra bytes we read, and align
+   left. since pslldq can only shift by an immediate, we use
+   vpshufb and an array of shuffle masks */
+   movq%r12, %r11
+   salq$4, %r11
+   vmovdqu  aad_shift_arr(%r11), \T1
+   vpshufb \T1, \T7, \T7
+_get_AAD_rest_final\@:
+   vpshufb SHUF_MASK(%rip), \T7, \T7
+   vpxor   \T8, \T7, \T7
+   \GHASH_MUL   \T7, \T2, \T1, \T3, \T4, \T5, \T6
+
+_get_AAD_done\@:
+vmovdqu \T7, AadHash(arg2)
+.endm
+
 #ifdef CONFIG_AS_AVX
 ###
 # GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
@@ -701,72 +777,9 @@ _return_T_done\@:
 
 .macro INITIAL_BLOCKS_AVX REP num_initial_blocks T1 T2 T3 T4 T5 CTR XMM1 XMM2 
XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T6 T_key ENC_DEC
i = (8-\num_initial_blocks)
-   j = 0
setreg
+vmovdqu AadHash(arg2), reg_i
 
-   mov arg7, %r10  # r10 = AAD
-   mov arg8, %r12  # r12 = aadLen
-
-
-   mov %r12, %r11
-
-   vpxor   reg_j, reg_j, reg_j
-   vpxor   reg_i, reg_i, reg_i
-   cmp $16, %r11
-   jl  _get_AAD_rest8\@
-_get_AAD_blocks\@:
-   vmovdqu (%r10), reg_i
-   vpshufb SHUF_MASK(%rip), reg_i, reg_i
-   vpxor   reg_i, reg_j, reg_j
-   GHASH_MUL_AVX   reg_j, \T2, \T1, \T3, \T4, \T5, \T6
-   add $16, %r10
-   sub $16, %r12
-   sub $16, %r11
-   cmp $16, %r11
-   jge _get_AAD_blocks\@
-   vmovdqu reg_j, reg_i
-   cmp $0, %r11
-   je  _get_AAD_done\@
-
-   vpxor   reg_i, reg_i, reg_i
-
-   /* read the last <16B of AAD. since we have at least 4B of
-   data right after the AAD (the ICV, and maybe some CT), we can
-   read 4B/8B blocks safely, and then get rid of the extra stuff */
-_get_AAD_rest8\@:
-   cmp $4, %r11
-   jle _get_AAD_rest4\@
-   movq(%r10), \T1
-   add $8, %r10
-   sub $8, %r11
-   vpslldq $8, \T1, \T1
-   vpsrldq $8, reg_i, reg_i
-   vpxor   \T1, reg_i, reg_i
-   jmp _get_AAD_rest8\@
-_get_AAD_rest4\@:
-   cmp $0, %r11
-   jle  _get_AAD_rest0\@
-   mov (%r10), %eax
-   movq%rax, \T1
-   add $4,

[PATCH 08/12] x86/crypto: aesni: Fill in new context data structures

2018-12-10 Thread Dave Watson
Fill in aadhash, aadlen, pblocklen, curcount with appropriate values.
pblocklen, aadhash, and pblockenckey are also updated at the end
of each scatter/gather operation, to be carried over to the next
operation.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 51 +---
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index e347ba61db65..0a9cdcfdd987 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -297,7 +297,9 @@ VARIABLE_OFFSET = 16*8
 # clobbering all xmm registers
 # clobbering r10, r11, r12, r13, r14, r15
 .macro  GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 
GHASH_MUL ENC_DEC REP
+vmovdqu AadHash(arg2), %xmm8
 vmovdqu  HashKey(arg2), %xmm13  # xmm13 = HashKey
+add arg5, InLen(arg2)
 
 mov arg5, %r13  # save the number of bytes of 
plaintext/ciphertext
 and $-16, %r13  # r13 = r13 - (r13 mod 16)
@@ -410,6 +412,9 @@ _eight_cipher_left\@:
 
 
 _zero_cipher_left\@:
+vmovdqu %xmm14, AadHash(arg2)
+vmovdqu %xmm9, CurCount(arg2)
+
 cmp $16, arg5
 jl  _only_less_than_16\@
 
@@ -420,10 +425,14 @@ _zero_cipher_left\@:
 
 # handle the last <16 Byte block seperately
 
+mov %r13, PBlockLen(arg2)
 
 vpaddd   ONE(%rip), %xmm9, %xmm9 # INCR CNT to get Yn
+vmovdqu %xmm9, CurCount(arg2)
 vpshufb SHUF_MASK(%rip), %xmm9, %xmm9
+
 ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Yn)
+vmovdqu %xmm9, PBlockEncKey(arg2)
 
 sub $16, %r11
 add %r13, %r11
@@ -451,6 +460,7 @@ _only_less_than_16\@:
 vpshufb SHUF_MASK(%rip), %xmm9, %xmm9
 ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Yn)
 
+vmovdqu %xmm9, PBlockEncKey(arg2)
 
 lea SHIFT_MASK+16(%rip), %r12
 sub %r13, %r12   # adjust the shuffle mask 
pointer to be
@@ -480,6 +490,7 @@ _final_ghash_mul\@:
 vpxor   %xmm2, %xmm14, %xmm14
#GHASH computation for the last <16 Byte block
 \GHASH_MUL   %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+vmovdqu %xmm14, AadHash(arg2)
 sub %r13, %r11
 add $16, %r11
 .else
@@ -491,6 +502,7 @@ _final_ghash_mul\@:
 vpxor   %xmm9, %xmm14, %xmm14
#GHASH computation for the last <16 Byte block
 \GHASH_MUL   %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+vmovdqu %xmm14, AadHash(arg2)
 sub %r13, %r11
 add $16, %r11
 vpshufb SHUF_MASK(%rip), %xmm9, %xmm9# shuffle xmm9 back to 
output as ciphertext
@@ -526,12 +538,16 @@ _multiple_of_16_bytes\@:
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
 .macro GCM_COMPLETE GHASH_MUL REP
-mov arg8, %r12   # r12 = aadLen (number of 
bytes)
+vmovdqu AadHash(arg2), %xmm14
+vmovdqu HashKey(arg2), %xmm13
+
+mov AadLen(arg2), %r12  # r12 = aadLen (number 
of bytes)
 shl $3, %r12 # convert into number of 
bits
 vmovd   %r12d, %xmm15# len(A) in xmm15
 
-shl $3, arg5 # len(C) in bits  (*128)
-vmovq   arg5, %xmm1
+mov InLen(arg2), %r12
+shl $3, %r12# len(C) in bits  (*128)
+vmovq   %r12, %xmm1
 vpslldq $8, %xmm15, %xmm15   # xmm15 = len(A)|| 
0x
 vpxor   %xmm1, %xmm15, %xmm15# xmm15 = len(A)||len(C)
 
@@ -539,8 +555,7 @@ _multiple_of_16_bytes\@:
 \GHASH_MUL   %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6   
 # final GHASH computation
 vpshufb SHUF_MASK(%rip), %xmm14, %xmm14  # perform a 16Byte swap
 
-mov arg6, %rax   # rax = *Y0
-vmovdqu (%rax), %xmm9# xmm9 = Y0
+vmovdqu OrigIV(arg2), %xmm9
 
 ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Y0)
 
@@ -662,6 +677,20 @@ _get_AAD_done\@:
 .endm
 
 .macro INIT GHASH_MUL PRECOMPUTE
+mov arg6, %r11
+mov %r11, AadLen(arg2) # ctx_data.aad_length = aad_length
+xor %r11d, %r11d
+mov %r11, InLen(arg2) # ctx_data.in_length = 0
+
+mov %r11, PBlockLen(arg2) # ctx_data.partial_block_length = 0
+mov %r11, PBlockEncKey(arg2) # ctx_data.partial_block_enc_key = 0
+mov arg4, %rax
+movdqu (%rax), %xmm0
+movdqu %xmm0, OrigIV(arg2) # ctx_data.orig_IV = iv
+
+vpshufb SHUF_MASK(%rip), %xmm0, %xmm0
+movdqu %xmm0, CurCount(arg2) # ctx_data.current_cou

[PATCH 05/12] x86/crypto: aesni: Add GCM_COMPLETE macro

2018-12-10 Thread Dave Watson
Merge encode and decode tag calculations in GCM_COMPLETE macro.
Scatter/gather routines will call this once at the end of encryption
or decryption.
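
In C terms, the shared tag computation amounts to the following
(byte-order details omitted; ghash_mul() stands in for the GHASH_MUL
sub-macro and y0_enc is the precomputed E(K, Y0)):

#include <stdint.h>
#include <string.h>

static void gcm_complete(uint8_t tag[16], uint8_t ghash_acc[16],
                         uint64_t aad_len, uint64_t text_len,
                         const uint8_t y0_enc[16], const uint8_t hkey[16],
                         void (*ghash_mul)(uint8_t acc[16],
                                           const uint8_t h[16]))
{
        uint8_t lenblock[16];
        uint64_t abits = aad_len * 8, cbits = text_len * 8;
        int i;

        memcpy(lenblock, &abits, 8);            /* len(A) || len(C), in bits */
        memcpy(lenblock + 8, &cbits, 8);
        for (i = 0; i < 16; i++)
                ghash_acc[i] ^= lenblock[i];
        ghash_mul(ghash_acc, hkey);             /* final GHASH step */
        for (i = 0; i < 16; i++)                /* tag = GHASH ^ E(K, Y0) */
                tag[i] = ghash_acc[i] ^ y0_enc[i];
}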

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 2aa11c503bb9..8e9ae4b26118 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -510,6 +510,14 @@ _less_than_8_bytes_left\@:
 #
 
 _multiple_of_16_bytes\@:
+GCM_COMPLETE \GHASH_MUL \REP
+.endm
+
+
+# GCM_COMPLETE Finishes update of tag of last partial block
+# Output: Authorization Tag (AUTH_TAG)
+# Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
+.macro GCM_COMPLETE GHASH_MUL REP
 mov arg8, %r12   # r12 = aadLen (number of 
bytes)
 shl $3, %r12 # convert into number of 
bits
 vmovd   %r12d, %xmm15# len(A) in xmm15
-- 
2.17.1



[PATCH 04/12] x86/crypto: aesni: support 256-bit keys in avx asm

2018-12-10 Thread Dave Watson
Add support for 192/256-bit keys using the avx gcm/aes routines.
The sse routines were previously updated in e31ac32d3b (Add support
for 192 & 256 bit keys to AESNI RFC4106).

Instead of adding an additional loop in the hot path as in e31ac32d3b,
this diff generates separate versions of the code using macros, and the
entry routines choose the appropriate version once.  This results
in a 5% performance improvement vs. adding a loop to the hot path.
This is the same strategy chosen by the Intel isa-l_crypto library.

The key size checks are removed from the C code where appropriate.

Note that this diff depends on using gcm_context_data - 256 bit keys
require 16 HashKeys + 15 expanded keys, which is larger than
struct crypto_aes_ctx, so they are stored in struct gcm_context_data.
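
The generated variants differ only in the number of AES rounds; the
entry routines map the key size to the vaesenc repeat count once (a
sketch of that mapping - the asm passes rounds minus one as \REP and
finishes with vaesenclast):

/* AES round counts per key size; \REP in the asm is rounds - 1. */
static int aes_rep_count(unsigned int key_len_bytes)
{
        switch (key_len_bytes) {
        case 16: return 9;      /* AES-128: 10 rounds */
        case 24: return 11;     /* AES-192: 12 rounds */
        case 32: return 13;     /* AES-256: 14 rounds */
        default: return -1;     /* unsupported key size */
        }
}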

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 188 +--
 arch/x86/crypto/aesni-intel_glue.c   |  18 +--
 2 files changed, 145 insertions(+), 61 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index dd895f69399b..2aa11c503bb9 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -209,6 +209,7 @@ HashKey_8_k= 16*21   # store XOR of HashKey^8 <<1 mod 
poly here (for Karatsu
 #define arg8 STACK_OFFSET+8*2(%r14)
 #define arg9 STACK_OFFSET+8*3(%r14)
 #define arg10 STACK_OFFSET+8*4(%r14)
+#define keysize 2*15*16(arg1)
 
 i = 0
 j = 0
@@ -272,22 +273,22 @@ VARIABLE_OFFSET = 16*8
 .endm
 
 # Encryption of a single block
-.macro ENCRYPT_SINGLE_BLOCK XMM0
+.macro ENCRYPT_SINGLE_BLOCK REP XMM0
 vpxor(arg1), \XMM0, \XMM0
-   i = 1
-   setreg
-.rep 9
+   i = 1
+   setreg
+.rep \REP
 vaesenc  16*i(arg1), \XMM0, \XMM0
-   i = (i+1)
-   setreg
+   i = (i+1)
+   setreg
 .endr
-vaesenclast 16*10(arg1), \XMM0, \XMM0
+vaesenclast 16*i(arg1), \XMM0, \XMM0
 .endm
 
 # combined for GCM encrypt and decrypt functions
 # clobbering all xmm registers
 # clobbering r10, r11, r12, r13, r14, r15
-.macro  GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 
GHASH_MUL ENC_DEC
+.macro  GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 
GHASH_MUL ENC_DEC REP
 vmovdqu  HashKey(arg2), %xmm13  # xmm13 = HashKey
 
 mov arg5, %r13  # save the number of bytes of 
plaintext/ciphertext
@@ -314,42 +315,42 @@ VARIABLE_OFFSET = 16*8
 jmp _initial_num_blocks_is_1\@
 
 _initial_num_blocks_is_7\@:
-\INITIAL_BLOCKS  7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+\INITIAL_BLOCKS  \REP, 7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, 
%xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, 
\ENC_DEC
 sub $16*7, %r13
 jmp _initial_blocks_encrypted\@
 
 _initial_num_blocks_is_6\@:
-\INITIAL_BLOCKS  6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+\INITIAL_BLOCKS  \REP, 6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, 
%xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, 
\ENC_DEC
 sub $16*6, %r13
 jmp _initial_blocks_encrypted\@
 
 _initial_num_blocks_is_5\@:
-\INITIAL_BLOCKS  5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+\INITIAL_BLOCKS  \REP, 5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, 
%xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, 
\ENC_DEC
 sub $16*5, %r13
 jmp _initial_blocks_encrypted\@
 
 _initial_num_blocks_is_4\@:
-\INITIAL_BLOCKS  4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+\INITIAL_BLOCKS  \REP, 4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, 
%xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, 
\ENC_DEC
 sub $16*4, %r13
 jmp _initial_blocks_encrypted\@
 
 _initial_num_blocks_is_3\@:
-\INITIAL_BLOCKS  3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+\INITIAL_BLOCKS  \REP, 3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, 
%xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, 
\ENC_DEC
 sub $16*3, %r13
 jmp _initial_blocks_encrypted\@
 
 _initial_num_blocks_is_2\@:
-\INITIAL_BLOCKS  2, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+\INITIAL_BL

[PATCH 02/12] x86/crypto: aesni: Introduce gcm_context_data

2018-12-10 Thread Dave Watson
Add the gcm_context_data structure to the avx asm routines.
This will be necessary to support both 256 bit keys and
scatter/gather.

The pre-computed HashKeys are now stored in the gcm_context_data
struct, which is expanded to hold the greater number of hashkeys
necessary for avx.

Loads and stores to the new struct are always done unaligned to
avoid compiler issues; see e5b954e8 ("Use unaligned loads from
gcm_context_data").
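
For orientation, the asm offsets used by this series (AadHash at 16*0
through PBlockLen at 16*5, with HashKey starting at 16*6) correspond to
a layout roughly like the following sketch; the authoritative struct
gcm_context_data definition lives in aesni-intel_glue.c:

struct gcm_context_data {
        u8  aad_hash[16];               /* AadHash,      offset 16*0     */
        u64 aad_length;                 /* AadLen,       offset 16*1     */
        u64 in_length;                  /* InLen,        offset 16*1 + 8 */
        u8  partial_block_enc_key[16];  /* PBlockEncKey, offset 16*2     */
        u8  orig_IV[16];                /* OrigIV,       offset 16*3     */
        u8  current_counter[16];        /* CurCount,     offset 16*4     */
        u64 partial_block_length;       /* PBlockLen,    offset 16*5     */
        u64 unused;
        u8  hash_keys[16 * 16];         /* HashKey .. HashKey_8_k        */
};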

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 378 +++
 arch/x86/crypto/aesni-intel_glue.c   |  58 ++--
 2 files changed, 215 insertions(+), 221 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 318135a77975..284f1b8b88fc 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -182,43 +182,22 @@ aad_shift_arr:
 .text
 
 
-##define the fields of the gcm aes context
-#{
-#u8 expanded_keys[16*11] store expanded keys
-#u8 shifted_hkey_1[16]   store HashKey <<1 mod poly here
-#u8 shifted_hkey_2[16]   store HashKey^2 <<1 mod poly here
-#u8 shifted_hkey_3[16]   store HashKey^3 <<1 mod poly here
-#u8 shifted_hkey_4[16]   store HashKey^4 <<1 mod poly here
-#u8 shifted_hkey_5[16]   store HashKey^5 <<1 mod poly here
-#u8 shifted_hkey_6[16]   store HashKey^6 <<1 mod poly here
-#u8 shifted_hkey_7[16]   store HashKey^7 <<1 mod poly here
-#u8 shifted_hkey_8[16]   store HashKey^8 <<1 mod poly here
-#u8 shifted_hkey_1_k[16] store XOR HashKey <<1 mod poly here (for 
Karatsuba purposes)
-#u8 shifted_hkey_2_k[16] store XOR HashKey^2 <<1 mod poly here (for 
Karatsuba purposes)
-#u8 shifted_hkey_3_k[16] store XOR HashKey^3 <<1 mod poly here (for 
Karatsuba purposes)
-#u8 shifted_hkey_4_k[16] store XOR HashKey^4 <<1 mod poly here (for 
Karatsuba purposes)
-#u8 shifted_hkey_5_k[16] store XOR HashKey^5 <<1 mod poly here (for 
Karatsuba purposes)
-#u8 shifted_hkey_6_k[16] store XOR HashKey^6 <<1 mod poly here (for 
Karatsuba purposes)
-#u8 shifted_hkey_7_k[16] store XOR HashKey^7 <<1 mod poly here (for 
Karatsuba purposes)
-#u8 shifted_hkey_8_k[16] store XOR HashKey^8 <<1 mod poly here (for 
Karatsuba purposes)
-#} gcm_ctx#
-
-HashKey= 16*11   # store HashKey <<1 mod poly here
-HashKey_2  = 16*12   # store HashKey^2 <<1 mod poly here
-HashKey_3  = 16*13   # store HashKey^3 <<1 mod poly here
-HashKey_4  = 16*14   # store HashKey^4 <<1 mod poly here
-HashKey_5  = 16*15   # store HashKey^5 <<1 mod poly here
-HashKey_6  = 16*16   # store HashKey^6 <<1 mod poly here
-HashKey_7  = 16*17   # store HashKey^7 <<1 mod poly here
-HashKey_8  = 16*18   # store HashKey^8 <<1 mod poly here
-HashKey_k  = 16*19   # store XOR of HashKey <<1 mod poly here (for 
Karatsuba purposes)
-HashKey_2_k= 16*20   # store XOR of HashKey^2 <<1 mod poly here (for 
Karatsuba purposes)
-HashKey_3_k= 16*21   # store XOR of HashKey^3 <<1 mod poly here (for 
Karatsuba purposes)
-HashKey_4_k= 16*22   # store XOR of HashKey^4 <<1 mod poly here (for 
Karatsuba purposes)
-HashKey_5_k= 16*23   # store XOR of HashKey^5 <<1 mod poly here (for 
Karatsuba purposes)
-HashKey_6_k= 16*24   # store XOR of HashKey^6 <<1 mod poly here (for 
Karatsuba purposes)
-HashKey_7_k= 16*25   # store XOR of HashKey^7 <<1 mod poly here (for 
Karatsuba purposes)
-HashKey_8_k= 16*26   # store XOR of HashKey^8 <<1 mod poly here (for 
Karatsuba purposes)
+HashKey= 16*6   # store HashKey <<1 mod poly here
+HashKey_2  = 16*7   # store HashKey^2 <<1 mod poly here
+HashKey_3  = 16*8   # store HashKey^3 <<1 mod poly here
+HashKey_4  = 16*9   # store HashKey^4 <<1 mod poly here
+HashKey_5  = 16*10   # store HashKey^5 <<1 mod poly here
+HashKey_6  = 16*11   # store HashKey^6 <<1 mod poly here
+HashKey_7  = 16*12   # store HashKey^7 <<1 mod poly here
+HashKey_8  = 16*13   # store HashKey^8 <<1 mod poly here
+HashKey_k  = 16*14   # store XOR of HashKey <<1 mod poly here (for 
Karatsuba purposes)
+HashKey_2_k= 16*15   # store XOR of HashKey^2 <<1 mod poly here (for 
Karatsuba purposes)
+HashKey_3_k= 16*16   # store XOR of HashKey^3 <<1 mod poly here (for 
Karatsuba purposes)
+HashKey_4_k= 16*17   # store XOR of HashKey^4 <<1 mod poly here (for 
Karatsuba purposes)
+HashKey_5_k= 16*18   # store XOR of HashKey^5 <<1 mod poly here (for 
Karatsuba purposes)
+HashKey_6_k= 16*19   # store XOR of HashKey^6 <<1 mod poly here (for 
Karatsuba purposes)
+HashKey_7_k= 16*20   # stor

[PATCH 03/12] x86/crypto: aesni: Macro-ify func save/restore

2018-12-10 Thread Dave Watson
Macro-ify function save and restore.  These will be used in new functions
added for scatter/gather update operations.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 94 +---
 1 file changed, 36 insertions(+), 58 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 284f1b8b88fc..dd895f69399b 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -247,6 +247,30 @@ VARIABLE_OFFSET = 16*8
 # Utility Macros
 
 
+.macro FUNC_SAVE
+#the number of pushes must equal STACK_OFFSET
+push%r12
+push%r13
+push%r14
+push%r15
+
+mov %rsp, %r14
+
+
+
+sub $VARIABLE_OFFSET, %rsp
+and $~63, %rsp# align rsp to 64 bytes
+.endm
+
+.macro FUNC_RESTORE
+mov %r14, %rsp
+
+pop %r15
+pop %r14
+pop %r13
+pop %r12
+.endm
+
 # Encryption of a single block
 .macro ENCRYPT_SINGLE_BLOCK XMM0
 vpxor(arg1), \XMM0, \XMM0
@@ -264,22 +288,6 @@ VARIABLE_OFFSET = 16*8
 # clobbering all xmm registers
 # clobbering r10, r11, r12, r13, r14, r15
 .macro  GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 
GHASH_MUL ENC_DEC
-
-#the number of pushes must equal STACK_OFFSET
-push%r12
-push%r13
-push%r14
-push%r15
-
-mov %rsp, %r14
-
-
-
-
-sub $VARIABLE_OFFSET, %rsp
-and $~63, %rsp  # align rsp to 64 bytes
-
-
 vmovdqu  HashKey(arg2), %xmm13  # xmm13 = HashKey
 
 mov arg5, %r13  # save the number of bytes of 
plaintext/ciphertext
@@ -566,12 +574,6 @@ _T_16\@:
 vmovdqu %xmm9, (%r10)
 
 _return_T_done\@:
-mov %r14, %rsp
-
-pop %r15
-pop %r14
-pop %r13
-pop %r12
 .endm
 
 #ifdef CONFIG_AS_AVX
@@ -1511,18 +1513,7 @@ _initial_blocks_done\@:
 #u8 *hash_subkey)# /* H, the Hash sub key input. Data starts on a 
16-byte boundary. */
 #
 ENTRY(aesni_gcm_precomp_avx_gen2)
-#the number of pushes must equal STACK_OFFSET
-push%r12
-push%r13
-push%r14
-push%r15
-
-mov %rsp, %r14
-
-
-
-sub $VARIABLE_OFFSET, %rsp
-and $~63, %rsp  # align rsp to 64 bytes
+FUNC_SAVE
 
 vmovdqu  (arg3), %xmm6  # xmm6 = HashKey
 
@@ -1546,12 +1537,7 @@ ENTRY(aesni_gcm_precomp_avx_gen2)
 
 PRECOMPUTE_AVX  %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
 
-mov %r14, %rsp
-
-pop %r15
-pop %r14
-pop %r13
-pop %r12
+FUNC_RESTORE
 ret
 ENDPROC(aesni_gcm_precomp_avx_gen2)
 
@@ -1573,7 +1559,9 @@ ENDPROC(aesni_gcm_precomp_avx_gen2)
 #  Valid values are 16 (most likely), 12 or 8. */
 ###
 ENTRY(aesni_gcm_enc_avx_gen2)
+FUNC_SAVE
 GCM_ENC_DEC INITIAL_BLOCKS_AVX GHASH_8_ENCRYPT_8_PARALLEL_AVX 
GHASH_LAST_8_AVX GHASH_MUL_AVX ENC
+FUNC_RESTORE
ret
 ENDPROC(aesni_gcm_enc_avx_gen2)
 
@@ -1595,7 +1583,9 @@ ENDPROC(aesni_gcm_enc_avx_gen2)
 #  Valid values are 16 (most likely), 12 or 8. */
 ###
 ENTRY(aesni_gcm_dec_avx_gen2)
+FUNC_SAVE
 GCM_ENC_DEC INITIAL_BLOCKS_AVX GHASH_8_ENCRYPT_8_PARALLEL_AVX 
GHASH_LAST_8_AVX GHASH_MUL_AVX DEC
+FUNC_RESTORE
ret
 ENDPROC(aesni_gcm_dec_avx_gen2)
 #endif /* CONFIG_AS_AVX */
@@ -2525,18 +2515,7 @@ _initial_blocks_done\@:
 #  Data starts on a 16-byte boundary. */
 #
 ENTRY(aesni_gcm_precomp_avx_gen4)
-#the number of pushes must equal STACK_OFFSET
-push%r12
-push%r13
-push%r14
-push%r15
-
-mov %rsp, %r14
-
-
-
-sub $VARIABLE_OFFSET, %rsp
-and $~63, %rsp# align rsp to 64 bytes
+FUNC_SAVE
 
 vmovdqu  (arg3), %xmm6# xmm6 = HashKey
 
@@ -2560,12 +2539,7 @@ ENTRY(aesni_gcm_precomp_avx_gen4)
 
 PRECOMPUTE_AVX2  %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
 
-mov %r14, %rsp
-
-pop %r15
-pop %r14
-pop %r13
-pop %r12
+FUNC_RESTORE
 ret
 ENDPROC(aesni_gcm_precomp_avx_gen4)
 
@@ -2588,7 +2562,9 @@ ENDPROC(aesni_gcm_precomp_avx_gen4)
 #  Valid values are 16 (most likely), 12 or 8

[PATCH 00/12] x86/crypto: gcmaes AVX scatter/gather support

2018-12-10 Thread Dave Watson
This patch set refactors the x86 aes/gcm AVX crypto routines to
support true scatter/gather by adding gcm_enc/dec_update methods.

It is similar to the previous SSE patchset starting at e1fd316f.
Unlike the SSE routines, the AVX routines did not support
key sizes of 192 and 256 bits; this patchset also adds support
for those key sizes.

The final patch updates the C glue code, passing everything through
the crypt_by_sg() function instead of the previous memcpy based
routines.
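
The resulting call pattern in the glue code is, in outline
(illustrative pseudo-C; seg_len() and walk_advance() are placeholders
for the scatterwalk handling, and gcm_tfm is the per-ISA dispatch
table chosen at init):

gcm_tfm->init(aes_ctx, &data, iv, hash_subkey, assoc, assoclen);
while (left) {
        unsigned long len = seg_len(&src_walk, &dst_walk, left);

        if (enc)
                gcm_tfm->enc_update(aes_ctx, &data, dst_ptr, src_ptr, len);
        else
                gcm_tfm->dec_update(aes_ctx, &data, dst_ptr, src_ptr, len);
        walk_advance(&src_walk, &dst_walk, len);
        left -= len;
}
gcm_tfm->finalize(aes_ctx, &data, auth_tag, auth_tag_len);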

Dave Watson (12):
  x86/crypto: aesni: Merge GCM_ENC_DEC
  x86/crypto: aesni: Introduce gcm_context_data
  x86/crypto: aesni: Macro-ify func save/restore
  x86/crypto: aesni: support 256-bit keys in avx asm
  x86/crypto: aesni: Add GCM_COMPLETE macro
  x86/crypto: aesni: Split AAD hash calculation to separate macro
  x86/crypto: aesni: Merge avx precompute functions
  x86/crypto: aesni: Fill in new context data structures
  x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE
  x86/crypto: aesni: Introduce READ_PARTIAL_BLOCK macro
  x86/crypto: aesni: Introduce partial block macro
  x86/crypto: aesni: Add scatter/gather avx stubs, and use them in C

 arch/x86/crypto/aesni-intel_avx-x86_64.S | 2125 ++
 arch/x86/crypto/aesni-intel_glue.c   |  353 ++--
 2 files changed, 1117 insertions(+), 1361 deletions(-)

-- 
2.17.1



[PATCH 01/12] x86/crypto: aesni: Merge GCM_ENC_DEC

2018-12-10 Thread Dave Watson
The GCM_ENC_DEC routines for AVX and AVX2 are identical, except they
call separate sub-macros.  Pass the macros as arguments, and merge them.
This facilitates additional refactoring, by requiring changes in only
one place.

The GCM_ENC_DEC macro was moved above the CONFIG_AS_AVX* ifdefs,
since it will be used by both AVX and AVX2.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 951 ---
 1 file changed, 318 insertions(+), 633 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 1985ea0b551b..318135a77975 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -280,6 +280,320 @@ VARIABLE_OFFSET = 16*8
 vaesenclast 16*10(arg1), \XMM0, \XMM0
 .endm
 
+# combined for GCM encrypt and decrypt functions
+# clobbering all xmm registers
+# clobbering r10, r11, r12, r13, r14, r15
+.macro  GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 
GHASH_MUL ENC_DEC
+
+#the number of pushes must equal STACK_OFFSET
+push%r12
+push%r13
+push%r14
+push%r15
+
+mov %rsp, %r14
+
+
+
+
+sub $VARIABLE_OFFSET, %rsp
+and $~63, %rsp  # align rsp to 64 bytes
+
+
+vmovdqu  HashKey(arg1), %xmm13  # xmm13 = HashKey
+
+mov arg4, %r13  # save the number of bytes of 
plaintext/ciphertext
+and $-16, %r13  # r13 = r13 - (r13 mod 16)
+
+mov %r13, %r12
+shr $4, %r12
+and $7, %r12
+jz  _initial_num_blocks_is_0\@
+
+cmp $7, %r12
+je  _initial_num_blocks_is_7\@
+cmp $6, %r12
+je  _initial_num_blocks_is_6\@
+cmp $5, %r12
+je  _initial_num_blocks_is_5\@
+cmp $4, %r12
+je  _initial_num_blocks_is_4\@
+cmp $3, %r12
+je  _initial_num_blocks_is_3\@
+cmp $2, %r12
+je  _initial_num_blocks_is_2\@
+
+jmp _initial_num_blocks_is_1\@
+
+_initial_num_blocks_is_7\@:
+\INITIAL_BLOCKS  7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+sub $16*7, %r13
+jmp _initial_blocks_encrypted\@
+
+_initial_num_blocks_is_6\@:
+\INITIAL_BLOCKS  6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+sub $16*6, %r13
+jmp _initial_blocks_encrypted\@
+
+_initial_num_blocks_is_5\@:
+\INITIAL_BLOCKS  5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+sub $16*5, %r13
+jmp _initial_blocks_encrypted\@
+
+_initial_num_blocks_is_4\@:
+\INITIAL_BLOCKS  4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+sub $16*4, %r13
+jmp _initial_blocks_encrypted\@
+
+_initial_num_blocks_is_3\@:
+\INITIAL_BLOCKS  3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+sub $16*3, %r13
+jmp _initial_blocks_encrypted\@
+
+_initial_num_blocks_is_2\@:
+\INITIAL_BLOCKS  2, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+sub $16*2, %r13
+jmp _initial_blocks_encrypted\@
+
+_initial_num_blocks_is_1\@:
+\INITIAL_BLOCKS  1, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+sub $16*1, %r13
+jmp _initial_blocks_encrypted\@
+
+_initial_num_blocks_is_0\@:
+\INITIAL_BLOCKS  0, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, 
%xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+
+
+_initial_blocks_encrypted\@:
+cmp $0, %r13
+je  _zero_cipher_left\@
+
+sub $128, %r13
+je  _eight_cipher_left\@
+
+
+
+
+vmovd   %xmm9, %r15d
+and $255, %r15d
+vpshufb SHUF_MASK(%rip), %xmm9, %xmm9
+
+
+_encrypt_by_8_new\@:
+cmp $(255-8), %r15d
+jg  _encrypt_by_8\@
+
+
+
+add $8, %r15b
+\GHASH_8_ENCRYPT_8_PARALLEL  %xmm0, %xmm10, %xmm11, %xmm12, 
%xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, 
%xmm15, out_order, \ENC_DEC
+add $128, %r11
+sub $128, %r13
+jne _encrypt_by_8_new\@
+
+vpshufb SHUF_MASK(%rip), %xmm9, %xmm9
+jmp _eight_cipher_left\@
+
+_encrypt_by_8

Re: [PATCH] net/tls: Remove VLA usage

2018-04-11 Thread Dave Watson
On 04/10/18 05:52 PM, Kees Cook wrote:
> In the quest to remove VLAs from the kernel[1], this replaces the VLA
> size with the only possible size used in the code, and adds a mechanism
> to double-check future IV sizes.
> 
> [1] 
> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
> 
> Signed-off-by: Kees Cook <keesc...@chromium.org>

Thanks

Acked-by: Dave Watson <davejwat...@fb.com>

> ---
>  net/tls/tls_sw.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 4dc766b03f00..71e79597f940 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -41,6 +41,8 @@
>  #include 
>  #include 
>  
> +#define MAX_IV_SIZE  TLS_CIPHER_AES_GCM_128_IV_SIZE
> +
>  static int tls_do_decryption(struct sock *sk,
>struct scatterlist *sgin,
>struct scatterlist *sgout,
> @@ -673,7 +675,7 @@ static int decrypt_skb(struct sock *sk, struct sk_buff 
> *skb,
>  {
>   struct tls_context *tls_ctx = tls_get_ctx(sk);
>   struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
> - char iv[TLS_CIPHER_AES_GCM_128_SALT_SIZE + tls_ctx->rx.iv_size];
> + char iv[TLS_CIPHER_AES_GCM_128_SALT_SIZE + MAX_IV_SIZE];
>   struct scatterlist sgin_arr[MAX_SKB_FRAGS + 2];
>   struct scatterlist *sgin = &sgin_arr[0];
>   struct strp_msg *rxm = strp_msg(skb);
> @@ -1094,6 +1096,12 @@ int tls_set_sw_offload(struct sock *sk, struct 
> tls_context *ctx, int tx)
>   goto free_priv;
>   }
>  
> + /* Sanity-check the IV size for stack allocations. */
> + if (iv_size > MAX_IV_SIZE) {
> + rc = -EINVAL;
> + goto free_priv;
> + }
> +
>   cctx->prepend_size = TLS_HEADER_SIZE + nonce_size;
>   cctx->tag_size = tag_size;
>   cctx->overhead_size = cctx->prepend_size + cctx->tag_size;
> -- 
> 2.7.4
> 
> 
> -- 
> Kees Cook
> Pixel Security



[PATCH v2 04/14] x86/crypto: aesni: Add GCM_COMPLETE macro

2018-02-14 Thread Dave Watson
Merge encode and decode tag calculations in GCM_COMPLETE macro.
Scatter/gather routines will call this once at the end of encryption
or decryption.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 172 ++
 1 file changed, 63 insertions(+), 109 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index b9fe2ab..529c542 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -222,6 +222,67 @@ ALL_F:  .octa 0x
mov %r13, %r12
 .endm
 
+# GCM_COMPLETE Finishes update of tag of last partial block
+# Output: Authorization Tag (AUTH_TAG)
+# Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
+.macro GCM_COMPLETE
+   mov arg8, %r12# %r13 = aadLen (number of bytes)
+   shl $3, %r12  # convert into number of bits
+   movd%r12d, %xmm15 # len(A) in %xmm15
+   shl $3, %arg4 # len(C) in bits (*128)
+   MOVQ_R64_XMM%arg4, %xmm1
+   pslldq  $8, %xmm15# %xmm15 = len(A)||0x
+   pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
+   pxor%xmm15, %xmm8
+   GHASH_MUL   %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+   # final GHASH computation
+   movdqa SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM %xmm10, %xmm8
+
+   mov %arg5, %rax   # %rax = *Y0
+   movdqu  (%rax), %xmm0 # %xmm0 = Y0
+   ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
+   pxor%xmm8, %xmm0
+_return_T_\@:
+   mov arg9, %r10 # %r10 = authTag
+   mov arg10, %r11# %r11 = auth_tag_len
+   cmp $16, %r11
+   je  _T_16_\@
+   cmp $8, %r11
+   jl  _T_4_\@
+_T_8_\@:
+   MOVQ_R64_XMM%xmm0, %rax
+   mov %rax, (%r10)
+   add $8, %r10
+   sub $8, %r11
+   psrldq  $8, %xmm0
+   cmp $0, %r11
+   je  _return_T_done_\@
+_T_4_\@:
+   movd%xmm0, %eax
+   mov %eax, (%r10)
+   add $4, %r10
+   sub $4, %r11
+   psrldq  $4, %xmm0
+   cmp $0, %r11
+   je  _return_T_done_\@
+_T_123_\@:
+   movd%xmm0, %eax
+   cmp $2, %r11
+   jl  _T_1_\@
+   mov %ax, (%r10)
+   cmp $2, %r11
+   je  _return_T_done_\@
+   add $2, %r10
+   sar $16, %eax
+_T_1_\@:
+   mov %al, (%r10)
+   jmp _return_T_done_\@
+_T_16_\@:
+   movdqu  %xmm0, (%r10)
+_return_T_done_\@:
+.endm
+
 #ifdef __x86_64__
 /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
 *
@@ -1271,61 +1332,7 @@ _less_than_8_bytes_left_decrypt:
sub $1, %r13
jne _less_than_8_bytes_left_decrypt
 _multiple_of_16_bytes_decrypt:
-   mov arg8, %r12# %r13 = aadLen (number of bytes)
-   shl $3, %r12  # convert into number of bits
-   movd%r12d, %xmm15 # len(A) in %xmm15
-   shl $3, %arg4 # len(C) in bits (*128)
-   MOVQ_R64_XMM%arg4, %xmm1
-   pslldq  $8, %xmm15# %xmm15 = len(A)||0x
-   pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
-   pxor%xmm15, %xmm8
-   GHASH_MUL   %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
-# final GHASH computation
-movdqa SHUF_MASK(%rip), %xmm10
-   PSHUFB_XMM %xmm10, %xmm8
-
-   mov %arg5, %rax   # %rax = *Y0
-   movdqu  (%rax), %xmm0 # %xmm0 = Y0
-   ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
-   pxor%xmm8, %xmm0
-_return_T_decrypt:
-   mov arg9, %r10# %r10 = authTag
-   mov arg10, %r11   # %r11 = auth_tag_len
-   cmp $16, %r11
-   je  _T_16_decrypt
-   cmp $8, %r11
-   jl  _T_4_decrypt
-_T_8_decrypt:
-   MOVQ_R64_XMM%xmm0, %rax
-   mov %rax, (%r10)
-   add $8, %r10
-   sub $8, %r11
-   psrldq  $8, %xmm0
-   cmp $0, %r11
-   je  _return_T_done_decrypt
-_T_4_decrypt:
-   movd%xmm0, %eax
-   mov %eax, (%r10)
-   add $4, %r10
-   sub $4, %r11
-   psrldq  $4, %xmm0
-   cmp $0, %r11
-   je  _return_T_done_decrypt
-_T_123_decrypt:
-   movd%xmm0, %eax
-   cmp $2, %r11
-   jl  _T_1_decrypt
-   mov %ax, (%r10)
-   cmp $2, %r11
-   je  _return_T_done_decrypt
-   add $2, %r10
-   sar $16, %eax
-_T_1_decrypt:
-   mov %al, (%r10)
-   jmp _return_T_done_decrypt
-_T_16_decrypt:
-   movdqu  %xmm0, (%r10)
-_return_T_done_decrypt:
+   GCM_COMPLETE
FUNC_RESTORE
ret
 E

[PATCH v2 06/14] x86/crypto: aesni: Introduce gcm_context_data

2018-02-14 Thread Dave Watson
Introduce a gcm_context_data struct that will be used to pass
context data between scatter/gather update calls.  It is passed
as the second argument (after the crypto keys); the remaining
arguments are renumbered accordingly.
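
For orientation, the offsets added in this patch (AadHash at 16*0, AadLen
at 16*1, InLen at 16*1+8, PBlockEncKey at 16*2, OrigIV at 16*3, CurCount
at 16*4, PBlockLen at 16*5) imply a C layout roughly like the sketch
below.  The field names are illustrative, chosen to match the asm
comments; treat this as a reading aid, not the struct definition from the
glue patch.

    /* Rough C view of gcm_context_data as implied by the asm offsets. */
    #include <linux/types.h>

    struct gcm_context_data {
            u8  aad_hash[16];               /* AadHash:      running GHASH of the AAD        */
            u64 aad_length;                 /* AadLen:       total AAD bytes                  */
            u64 in_length;                  /* InLen:        total payload bytes so far       */
            u8  partial_block_enc_key[16];  /* PBlockEncKey: E(K, Yn) for a pending block     */
            u8  orig_IV[16];                /* OrigIV:       J0, kept for the final tag       */
            u8  current_counter[16];        /* CurCount:     counter for the next block       */
            u64 partial_block_len;          /* PBlockLen:    bytes pending in a partial block */
    };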

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S  | 115 +
 arch/x86/crypto/aesni-intel_glue.c |  81 ++
 2 files changed, 121 insertions(+), 75 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 8021fd1..6c5a80d 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -111,6 +111,14 @@ ALL_F:  .octa 0x
// (for Karatsuba purposes)
 #defineVARIABLE_OFFSET 16*8
 
+#define AadHash 16*0
+#define AadLen 16*1
+#define InLen (16*1)+8
+#define PBlockEncKey 16*2
+#define OrigIV 16*3
+#define CurCount 16*4
+#define PBlockLen 16*5
+
 #define arg1 rdi
 #define arg2 rsi
 #define arg3 rdx
@@ -121,6 +129,7 @@ ALL_F:  .octa 0x
 #define arg8 STACK_OFFSET+16(%r14)
 #define arg9 STACK_OFFSET+24(%r14)
 #define arg10 STACK_OFFSET+32(%r14)
+#define arg11 STACK_OFFSET+40(%r14)
 #define keysize 2*15*16(%arg1)
 #endif
 
@@ -195,9 +204,9 @@ ALL_F:  .octa 0x
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
 .macro GCM_INIT
-   mov %arg6, %r12
+   mov arg7, %r12
movdqu  (%r12), %xmm13
-   movdqa  SHUF_MASK(%rip), %xmm2
+   movdqa  SHUF_MASK(%rip), %xmm2
PSHUFB_XMM %xmm2, %xmm13
 
# precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
@@ -217,7 +226,7 @@ ALL_F:  .octa 0x
pandPOLY(%rip), %xmm2
pxor%xmm2, %xmm13
movdqa  %xmm13, HashKey(%rsp)
-   mov %arg4, %r13 # %xmm13 holds HashKey<<1 (mod 
poly)
+   mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly)
and $-16, %r13
mov %r13, %r12
 .endm
@@ -271,18 +280,18 @@ _four_cipher_left_\@:
GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
 %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
 _zero_cipher_left_\@:
-   mov %arg4, %r13
-   and $15, %r13   # %r13 = arg4 (mod 16)
+   mov %arg5, %r13
+   and $15, %r13   # %r13 = arg5 (mod 16)
je  _multiple_of_16_bytes_\@
 
# Handle the last <16 Byte block separately
paddd ONE(%rip), %xmm0# INCR CNT to get Yn
-movdqa SHUF_MASK(%rip), %xmm10
+   movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm0
 
ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
 
-   lea (%arg3,%r11,1), %r10
+   lea (%arg4,%r11,1), %r10
mov %r13, %r12
READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1
 
@@ -320,13 +329,13 @@ _zero_cipher_left_\@:
MOVQ_R64_XMM %xmm0, %rax
cmp $8, %r13
jle _less_than_8_bytes_left_\@
-   mov %rax, (%arg2 , %r11, 1)
+   mov %rax, (%arg3 , %r11, 1)
add $8, %r11
psrldq $8, %xmm0
MOVQ_R64_XMM %xmm0, %rax
sub $8, %r13
 _less_than_8_bytes_left_\@:
-   mov %al,  (%arg2, %r11, 1)
+   mov %al,  (%arg3, %r11, 1)
add $1, %r11
shr $8, %rax
sub $1, %r13
@@ -338,11 +347,11 @@ _multiple_of_16_bytes_\@:
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
 .macro GCM_COMPLETE
-   mov arg8, %r12# %r13 = aadLen (number of bytes)
+   mov arg9, %r12# %r13 = aadLen (number of bytes)
shl $3, %r12  # convert into number of bits
movd%r12d, %xmm15 # len(A) in %xmm15
-   shl $3, %arg4 # len(C) in bits (*128)
-   MOVQ_R64_XMM%arg4, %xmm1
+   shl $3, %arg5 # len(C) in bits (*128)
+   MOVQ_R64_XMM%arg5, %xmm1
pslldq  $8, %xmm15# %xmm15 = len(A)||0x
pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
pxor%xmm15, %xmm8
@@ -351,13 +360,13 @@ _multiple_of_16_bytes_\@:
movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm8
 
-   mov %arg5, %rax   # %rax = *Y0
+   mov %arg6, %rax   # %rax = *Y0
movdqu  (%rax), %xmm0 # %xmm0 = Y0
ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
pxor%xmm8, %xmm0
 _return_T_\@:
-   mov arg9, %r10 # %r10 = authTag
-   mov arg10, %r11# %r11 = auth_tag_len
+   mov arg10, %r10 # %r10 = authTag

[PATCH v2 14/14] x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather

2018-02-14 Thread Dave Watson
Add a gcmaes_crypt_by_sg routine that performs the encryption or
decryption directly over scatterlists.  Either src or dst may contain
multiple buffers, so iterate over both at the same time if they are
different.  If the input is the same as the output, iterate over only
one.

Currently both the AAD and the TAG must be linear, so copy them out
with scatterwalk_map_and_copy.  If the first buffer contains the
entire AAD, we can optimize and skip the copy.  Since the AAD can be
any size, a copy must live on the heap.  The TAG can stay on the
stack since it is at most 16 bytes.

Only the SSE routines are updated so far, so keep the previous
gcmaes_en/decrypt routines and branch to the scatter/gather ones only
if the key size is inappropriate for AVX, or we are SSE-only.
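
As a reading aid, the overall shape of the new routine is roughly the
sketch below.  Function names match the diff; the tag handling at the end
is an assumption about how the routine finishes (the quoted diff is
truncated before that point), so treat this as an outline, not the
literal patch.

    /* Outline of gcmaes_crypt_by_sg(): init once, stream the payload
     * through the update calls, then finalize and deal with the tag. */
    aesni_gcm_init(aes_ctx, &data, iv, hash_subkey, assoc, assoclen);
    while (left) {
            len = min(srclen, dstlen);      /* clamped to the current sg entries */
            if (enc)
                    aesni_gcm_enc_update(aes_ctx, &data, dst, src, len);
            else
                    aesni_gcm_dec_update(aes_ctx, &data, dst, src, len);
            left -= len;
    }
    aesni_gcm_finalize(aes_ctx, &data, authTag, auth_tag_len);
    kernel_fpu_end();

    if (!enc) {
            /* decryption: compare the computed tag with the trailing tag in src
             * (authTagMsg is a hypothetical local for this sketch) */
            u8 authTagMsg[16];
            scatterwalk_map_and_copy(authTagMsg, req->src,
                                     req->assoclen + req->cryptlen - auth_tag_len,
                                     auth_tag_len, 0);
            return crypto_memneq(authTagMsg, authTag, auth_tag_len) ? -EBADMSG : 0;
    }
    /* encryption: append the computed tag after the ciphertext */
    scatterwalk_map_and_copy(authTag, req->dst, req->assoclen + req->cryptlen,
                             auth_tag_len, 1);
    return 0;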

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_glue.c | 133 +
 1 file changed, 133 insertions(+)

diff --git a/arch/x86/crypto/aesni-intel_glue.c 
b/arch/x86/crypto/aesni-intel_glue.c
index de986f9..acbe7e8 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -791,6 +791,127 @@ static int generic_gcmaes_set_authsize(struct crypto_aead 
*tfm,
return 0;
 }
 
+static int gcmaes_crypt_by_sg(bool enc, struct aead_request *req,
+ unsigned int assoclen, u8 *hash_subkey,
+ u8 *iv, void *aes_ctx)
+{
+   struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+   unsigned long auth_tag_len = crypto_aead_authsize(tfm);
+   struct gcm_context_data data AESNI_ALIGN_ATTR;
+   struct scatter_walk dst_sg_walk = {};
+   unsigned long left = req->cryptlen;
+   unsigned long len, srclen, dstlen;
+   struct scatter_walk assoc_sg_walk;
+   struct scatter_walk src_sg_walk;
+   struct scatterlist src_start[2];
+   struct scatterlist dst_start[2];
+   struct scatterlist *src_sg;
+   struct scatterlist *dst_sg;
+   u8 *src, *dst, *assoc;
+   u8 *assocmem = NULL;
+   u8 authTag[16];
+
+   if (!enc)
+   left -= auth_tag_len;
+
+   /* Linearize assoc, if not already linear */
+   if (req->src->length >= assoclen && req->src->length &&
+   (!PageHighMem(sg_page(req->src)) ||
+   req->src->offset + req->src->length < PAGE_SIZE)) {
+   scatterwalk_start(&assoc_sg_walk, req->src);
+   assoc = scatterwalk_map(&assoc_sg_walk);
+   } else {
+   /* assoc can be any length, so must be on heap */
+   assocmem = kmalloc(assoclen, GFP_ATOMIC);
+   if (unlikely(!assocmem))
+   return -ENOMEM;
+   assoc = assocmem;
+
+   scatterwalk_map_and_copy(assoc, req->src, 0, assoclen, 0);
+   }
+
+   src_sg = scatterwalk_ffwd(src_start, req->src, req->assoclen);
+   scatterwalk_start(&src_sg_walk, src_sg);
+   if (req->src != req->dst) {
+   dst_sg = scatterwalk_ffwd(dst_start, req->dst, req->assoclen);
+   scatterwalk_start(&dst_sg_walk, dst_sg);
+   }
+
+   kernel_fpu_begin();
+   aesni_gcm_init(aes_ctx, &data, iv,
+   hash_subkey, assoc, assoclen);
+   if (req->src != req->dst) {
+   while (left) {
+   src = scatterwalk_map(&src_sg_walk);
+   dst = scatterwalk_map(&dst_sg_walk);
+   srclen = scatterwalk_clamp(&src_sg_walk, left);
+   dstlen = scatterwalk_clamp(&dst_sg_walk, left);
+   len = min(srclen, dstlen);
+   if (len) {
+   if (enc)
+   aesni_gcm_enc_update(aes_ctx, &data,
+dst, src, len);
+   else
+   aesni_gcm_dec_update(aes_ctx, &data,
+dst, src, len);
+   }
+   left -= len;
+
+   scatterwalk_unmap(src);
+   scatterwalk_unmap(dst);
+   scatterwalk_advance(&src_sg_walk, len);
+   scatterwalk_advance(&dst_sg_walk, len);
+   scatterwalk_done(&src_sg_walk, 0, left);
+   scatterwalk_done(&dst_sg_walk, 1, left);
+   }
+   } else {
+   while (left) {
+   dst = src = scatterwalk_map(&src_sg_walk);
+   len = scatterwalk_clamp(&src_sg_walk, left);
+   if (len) {
+   if (enc)
+   aesni_gcm_enc_update(aes_ctx, &data,
+src, src, len);
+   else
+   aesni_gcm_dec_u

[PATCH v2 13/14] x86/crypto: aesni: Introduce scatter/gather asm function stubs

2018-02-14 Thread Dave Watson
The asm macros are all set up now; introduce the entry points.

GCM_INIT and GCM_COMPLETE now have their arguments supplied
explicitly, so the new scatter/gather entry points don't have to
take all of the arguments, only the ones they need.
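
The C-side declarations for these entry points presumably end up looking
something like the sketch below (argument order inferred from the
register usage here and from the callers in the later glue patch; treat
the exact prototypes as an approximation):

    /* Sketch of the scatter/gather entry points; gdata is the
     * gcm_context_data introduced earlier in the series. */
    asmlinkage void aesni_gcm_init(void *ctx, struct gcm_context_data *gdata,
                                   u8 *iv, u8 *hash_subkey,
                                   const u8 *aad, unsigned long aad_len);
    asmlinkage void aesni_gcm_enc_update(void *ctx, struct gcm_context_data *gdata,
                                         u8 *out, const u8 *in,
                                         unsigned long plaintext_len);
    asmlinkage void aesni_gcm_dec_update(void *ctx, struct gcm_context_data *gdata,
                                         u8 *out, const u8 *in,
                                         unsigned long ciphertext_len);
    asmlinkage void aesni_gcm_finalize(void *ctx, struct gcm_context_data *gdata,
                                       u8 *auth_tag, unsigned long auth_tag_len);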

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S  | 116 -
 arch/x86/crypto/aesni-intel_glue.c |  16 +
 2 files changed, 106 insertions(+), 26 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index b941952..311b2de 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -200,8 +200,8 @@ ALL_F:  .octa 0x
 # Output: HashKeys stored in gcm_context_data.  Only needs to be called
 # once per key.
 # clobbers r12, and tmp xmm registers.
-.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
-   mov arg7, %r12
+.macro PRECOMPUTE SUBKEY TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
+   mov \SUBKEY, %r12
movdqu  (%r12), \TMP3
movdqa  SHUF_MASK(%rip), \TMP2
PSHUFB_XMM \TMP2, \TMP3
@@ -254,14 +254,14 @@ ALL_F:  .octa 0x
 
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
-.macro GCM_INIT
-   mov arg9, %r11
+.macro GCM_INIT Iv SUBKEY AAD AADLEN
+   mov \AADLEN, %r11
mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length
xor %r11, %r11
mov %r11, InLen(%arg2) # ctx_data.in_length = 0
mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0
mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0
-   mov %arg6, %rax
+   mov \Iv, %rax
movdqu (%rax), %xmm0
movdqu %xmm0, OrigIV(%arg2) # ctx_data.orig_IV = iv
 
@@ -269,11 +269,11 @@ ALL_F:  .octa 0x
PSHUFB_XMM %xmm2, %xmm0
movdqu %xmm0, CurCount(%arg2) # ctx_data.current_counter = iv
 
-   PRECOMPUTE %xmm1 %xmm2 %xmm3 %xmm4 %xmm5 %xmm6 %xmm7
+   PRECOMPUTE \SUBKEY, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
movdqa HashKey(%arg2), %xmm13
 
-   CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
-   %xmm5 %xmm6
+   CALC_AAD_HASH %xmm13, \AAD, \AADLEN, %xmm0, %xmm1, %xmm2, %xmm3, \
+   %xmm4, %xmm5, %xmm6
 .endm
 
 # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
@@ -435,7 +435,7 @@ _multiple_of_16_bytes_\@:
 # GCM_COMPLETE Finishes update of tag of last partial block
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
-.macro GCM_COMPLETE
+.macro GCM_COMPLETE AUTHTAG AUTHTAGLEN
movdqu AadHash(%arg2), %xmm8
movdqu HashKey(%arg2), %xmm13
 
@@ -466,8 +466,8 @@ _partial_done\@:
ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
pxor%xmm8, %xmm0
 _return_T_\@:
-   mov arg10, %r10 # %r10 = authTag
-   mov arg11, %r11# %r11 = auth_tag_len
+   mov \AUTHTAG, %r10 # %r10 = authTag
+   mov \AUTHTAGLEN, %r11# %r11 = auth_tag_len
cmp $16, %r11
je  _T_16_\@
cmp $8, %r11
@@ -599,11 +599,11 @@ _done_read_partial_block_\@:
 
 # CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
 # clobbers r10-11, xmm14
-.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \
+.macro CALC_AAD_HASH HASHKEY AAD AADLEN TMP1 TMP2 TMP3 TMP4 TMP5 \
TMP6 TMP7
MOVADQ SHUF_MASK(%rip), %xmm14
-   movarg8, %r10   # %r10 = AAD
-   movarg9, %r11   # %r11 = aadLen
+   mov\AAD, %r10   # %r10 = AAD
+   mov\AADLEN, %r11# %r11 = aadLen
pxor   \TMP7, \TMP7
pxor   \TMP6, \TMP6
 
@@ -1103,18 +1103,18 @@ TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 
operation
mov   keysize,%eax
shr   $2,%eax   # 128->4, 192->6, 256->8
sub   $4,%eax   # 128->0, 192->2, 256->4
-   jzaes_loop_par_enc_done
+   jzaes_loop_par_enc_done\@
 
-aes_loop_par_enc:
+aes_loop_par_enc\@:
MOVADQ(%r10),\TMP3
 .irpc  index, 1234
AESENC\TMP3, %xmm\index
 .endr
add   $16,%r10
sub   $1,%eax
-   jnz   aes_loop_par_enc
+   jnz   aes_loop_par_enc\@
 
-aes_loop_par_enc_done:
+aes_loop_par_enc_done\@:
MOVADQ(%r10), \TMP3
AESENCLAST \TMP3, \XMM1   # Round 10
AESENCLAST \TMP3, \XMM2
@@ -1311,18 +1311,18 @@ TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 
operation
mov   keysize,%eax
shr   $2,%eax   # 128->4, 192->6, 256->8
sub   $4,%eax  

[PATCH v2 09/14] x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE

2018-02-14 Thread Dave Watson
Prepare to handle partial blocks between scatter/gather calls.
For the last partial block, we only want to calculate the aadhash
in GCM_COMPLETE; a new partial-block macro will handle both the
aadhash update and the encryption of partial blocks between calls.
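
In C-like terms, the control flow this sets up in GCM_COMPLETE is roughly
the following (a model only; ghash_mul() stands in for the GHASH_MUL
macro, and the field names for the gcm_context_data slots):

    /* Conceptual model of GCM_COMPLETE after this change: the GHASH of a
     * pending partial block is deferred until finalization. */
    if (ctx->partial_block_len)                     /* PBlockLen != 0         */
            ghash_mul(ctx->aad_hash, hash_subkey);  /* fold the pending block */
    /* ...then XOR in len(A) || len(C), run GHASH once more, and encrypt the
     * result with E(K, Y0) to produce the tag, exactly as before. */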

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index aa82493..37b1cee 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -345,7 +345,6 @@ _zero_cipher_left_\@:
pxor%xmm0, %xmm8
 .endif
 
-   GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
movdqu %xmm8, AadHash(%arg2)
 .ifc \operation, enc
# GHASH computation for the last <16 byte block
@@ -378,6 +377,15 @@ _multiple_of_16_bytes_\@:
 .macro GCM_COMPLETE
movdqu AadHash(%arg2), %xmm8
movdqu HashKey(%rsp), %xmm13
+
+   mov PBlockLen(%arg2), %r12
+
+   cmp $0, %r12
+   je _partial_done\@
+
+   GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+
+_partial_done\@:
mov AadLen(%arg2), %r12  # %r13 = aadLen (number of bytes)
shl $3, %r12  # convert into number of bits
movd%r12d, %xmm15 # len(A) in %xmm15
-- 
2.9.5



[PATCH v2 12/14] x86/crypto: aesni: Add fast path for > 16 byte update

2018-02-14 Thread Dave Watson
We can fast-path any < 16 byte read if the full message is > 16 bytes,
and shift over by the appropriate amount.  Usually we are
reading > 16 bytes, so in the average case this should be faster than
the READ_PARTIAL_BLOCK macro introduced in b20209c91e2.
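
In plain C, the trick is: when the overall input is at least 16 bytes,
the trailing partial block can be fetched with a single 16-byte load that
ends at the last byte of the data, then shifted into place, instead of
the byte-at-a-time READ_PARTIAL_BLOCK loop.  A minimal sketch of the idea
(portable C, assuming len >= 16; the asm does the shift with a PSHUFB
mask instead of memcpy):

    #include <stdint.h>
    #include <string.h>

    /* Copy the trailing (len % 16) bytes of msg into the low bytes of block,
     * zero-filling the rest.  Requires len >= 16 so that the single 16-byte
     * load ending at msg[len - 1] cannot run off either end of the buffer. */
    static void read_tail_fast(uint8_t block[16], const uint8_t *msg, size_t len)
    {
            size_t rem = len % 16;                  /* bytes in the final partial block */
            uint8_t tmp[16];

            memcpy(tmp, msg + len - 16, 16);        /* one load, no overrun      */
            memset(block, 0, 16);
            memcpy(block, tmp + 16 - rem, rem);     /* "shift right 16-rem bytes" */
    }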

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 25 +
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 398bd2237f..b941952 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -355,12 +355,37 @@ _zero_cipher_left_\@:
ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
movdqu %xmm0, PBlockEncKey(%arg2)
 
+   cmp $16, %arg5
+   jge _large_enough_update_\@
+
lea (%arg4,%r11,1), %r10
mov %r13, %r12
READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1
+   jmp _data_read_\@
+
+_large_enough_update_\@:
+   sub $16, %r11
+   add %r13, %r11
+
+   # receive the last <16 Byte block
+   movdqu  (%arg4, %r11, 1), %xmm1
 
+   sub %r13, %r11
+   add $16, %r11
+
+   lea SHIFT_MASK+16(%rip), %r12
+   # adjust the shuffle mask pointer to be able to shift 16-r13 bytes
+   # (r13 is the number of bytes in plaintext mod 16)
+   sub %r13, %r12
+   # get the appropriate shuffle mask
+   movdqu  (%r12), %xmm2
+   # shift right 16-r13 bytes
+   PSHUFB_XMM  %xmm2, %xmm1
+
+_data_read_\@:
lea ALL_F+16(%rip), %r12
sub %r13, %r12
+
 .ifc \operation, dec
movdqa  %xmm1, %xmm2
 .endif
-- 
2.9.5



[PATCH v2 11/14] x86/crypto: aesni: Introduce partial block macro

2018-02-14 Thread Dave Watson
Before this diff, multiple calls to GCM_ENC_DEC would
succeed, but only if every call was a multiple of 16 bytes.

Handle partial blocks at the start of GCM_ENC_DEC, and update
aadhash as appropriate.

The data offset %r11 is also updated after the partial block.
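
To illustrate what this buys at the API level, a hedged usage sketch:
after this patch the update entry points can be fed arbitrary chunk sizes
and the partial-block state is carried in gcm_context_data between calls.
Buffer names (in, out, tag) are hypothetical; this is not test code from
the patch.

    /* Update calls whose lengths are not multiples of 16 now compose
     * correctly, because the partial-block state (PBlockLen, PBlockEncKey,
     * AadHash) is carried in gcm_context_data between calls. */
    aesni_gcm_init(aes_ctx, &data, iv, hash_subkey, assoc, assoclen);
    aesni_gcm_enc_update(aes_ctx, &data, out,      in,      7);  /* leaves a 7-byte partial block */
    aesni_gcm_enc_update(aes_ctx, &data, out + 7,  in + 7,  21); /* completes it, leaves 12 bytes */
    aesni_gcm_enc_update(aes_ctx, &data, out + 28, in + 28, 4);  /* 12 + 4 = 16, block completed  */
    aesni_gcm_finalize(aes_ctx, &data, tag, 16);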

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 151 +-
 1 file changed, 150 insertions(+), 1 deletion(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 3ada06b..398bd2237f 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -284,7 +284,13 @@ ALL_F:  .octa 0x
movdqu AadHash(%arg2), %xmm8
movdqu HashKey(%arg2), %xmm13
add %arg5, InLen(%arg2)
+
+   xor %r11, %r11 # initialise the data pointer offset as zero
+   PARTIAL_BLOCK %arg3 %arg4 %arg5 %r11 %xmm8 \operation
+
+   sub %r11, %arg5 # sub partial block data used
mov %arg5, %r13 # save the number of bytes
+
and $-16, %r13  # %r13 = %r13 - (%r13 mod 16)
mov %r13, %r12
# Encrypt/Decrypt first few blocks
@@ -605,6 +611,150 @@ _get_AAD_done\@:
movdqu \TMP6, AadHash(%arg2)
 .endm
 
+# PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks
+# between update calls.
+# Requires the input data be at least 1 byte long due to READ_PARTIAL_BLOCK
+# Outputs encrypted bytes, and updates hash and partial info in 
gcm_data_context
+# Clobbers rax, r10, r12, r13, xmm0-6, xmm9-13
+.macro PARTIAL_BLOCK CYPH_PLAIN_OUT PLAIN_CYPH_IN PLAIN_CYPH_LEN DATA_OFFSET \
+   AAD_HASH operation
+   mov PBlockLen(%arg2), %r13
+   cmp $0, %r13
+   je  _partial_block_done_\@  # Leave Macro if no partial blocks
+   # Read in input data without over reading
+   cmp $16, \PLAIN_CYPH_LEN
+   jl  _fewer_than_16_bytes_\@
+   movups  (\PLAIN_CYPH_IN), %xmm1 # If more than 16 bytes, just fill xmm
+   jmp _data_read_\@
+
+_fewer_than_16_bytes_\@:
+   lea (\PLAIN_CYPH_IN, \DATA_OFFSET, 1), %r10
+   mov \PLAIN_CYPH_LEN, %r12
+   READ_PARTIAL_BLOCK %r10 %r12 %xmm0 %xmm1
+
+   mov PBlockLen(%arg2), %r13
+
+_data_read_\@: # Finished reading in data
+
+   movdqu  PBlockEncKey(%arg2), %xmm9
+   movdqu  HashKey(%arg2), %xmm13
+
+   lea SHIFT_MASK(%rip), %r12
+
+   # adjust the shuffle mask pointer to be able to shift r13 bytes
+   # r16-r13 is the number of bytes in plaintext mod 16)
+   add %r13, %r12
+   movdqu  (%r12), %xmm2   # get the appropriate shuffle mask
+   PSHUFB_XMM %xmm2, %xmm9 # shift right r13 bytes
+
+.ifc \operation, dec
+   movdqa  %xmm1, %xmm3
+   pxor%xmm1, %xmm9# Cyphertext XOR E(K, Yn)
+
+   mov \PLAIN_CYPH_LEN, %r10
+   add %r13, %r10
+   # Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling
+   sub $16, %r10
+   # Determine if if partial block is not being filled and
+   # shift mask accordingly
+   jge _no_extra_mask_1_\@
+   sub %r10, %r12
+_no_extra_mask_1_\@:
+
+   movdqu  ALL_F-SHIFT_MASK(%r12), %xmm1
+   # get the appropriate mask to mask out bottom r13 bytes of xmm9
+   pand%xmm1, %xmm9# mask out bottom r13 bytes of xmm9
+
+   pand%xmm1, %xmm3
+   movdqa  SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM  %xmm10, %xmm3
+   PSHUFB_XMM  %xmm2, %xmm3
+   pxor%xmm3, \AAD_HASH
+
+   cmp $0, %r10
+   jl  _partial_incomplete_1_\@
+
+   # GHASH computation for the last <16 Byte block
+   GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+   xor %rax,%rax
+
+   mov %rax, PBlockLen(%arg2)
+   jmp _dec_done_\@
+_partial_incomplete_1_\@:
+   add \PLAIN_CYPH_LEN, PBlockLen(%arg2)
+_dec_done_\@:
+   movdqu  \AAD_HASH, AadHash(%arg2)
+.else
+   pxor%xmm1, %xmm9# Plaintext XOR E(K, Yn)
+
+   mov \PLAIN_CYPH_LEN, %r10
+   add %r13, %r10
+   # Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling
+   sub $16, %r10
+   # Determine if if partial block is not being filled and
+   # shift mask accordingly
+   jge _no_extra_mask_2_\@
+   sub %r10, %r12
+_no_extra_mask_2_\@:
+
+   movdqu  ALL_F-SHIFT_MASK(%r12), %xmm1
+   # get the appropriate mask to mask out bottom r13 bytes of xmm9
+   pand%xmm1, %xmm9
+
+   movdqa  SHUF_MASK(%rip), %xmm1
+   PSHUFB_XMM %xmm1, %xmm9
+   PSHUFB_XMM %xmm2, %xmm9
+   pxor%xmm9, \AAD_HASH
+
+   cmp $0, %r10
+   jl  _partial_incomplete_2_\@
+
+   # GHASH computation for the last <16 Byte block
+   GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6

[PATCH v2 10/14] x86/crypto: aesni: Move HashKey computation from stack to gcm_context

2018-02-14 Thread Dave Watson
HashKey computation only needs to happen once per scatter/gather operation,
so save it between calls in the gcm_context struct instead of on the stack.
Since the asm no longer stores anything on the stack, we can use
%rsp directly and clean up the frame save/restore macros a bit.

Hash keys actually only need to be calculated once per key and could
be computed when set_key is called; however, the current glue code
falls back to the generic AES code if the FPU is disabled.
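
Relative to the context layout sketched earlier in the series, this
amounts to appending the hash-key table after PBlockLen (the HashKey ..
HashKey_4_k defines now sit at offsets 16*6 through 16*13).  Roughly,
with illustrative names:

    struct gcm_context_data {
            /* ... AadHash through PBlockLen as sketched before ... */
            u64 unused;             /* pads the key table out to offset 16*6     */
            u8  hash_keys[16 * 8];  /* HashKey, HashKey_2..4 and their _k halves */
    };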

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 205 --
 1 file changed, 106 insertions(+), 99 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 37b1cee..3ada06b 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -93,23 +93,6 @@ ALL_F:  .octa 0x
 
 
 #defineSTACK_OFFSET8*3
-#defineHashKey 16*0// store HashKey <<1 mod poly here
-#defineHashKey_2   16*1// store HashKey^2 <<1 mod poly here
-#defineHashKey_3   16*2// store HashKey^3 <<1 mod poly here
-#defineHashKey_4   16*3// store HashKey^4 <<1 mod poly here
-#defineHashKey_k   16*4// store XOR of High 64 bits and Low 64
-   // bits of  HashKey <<1 mod poly here
-   //(for Karatsuba purposes)
-#defineHashKey_2_k 16*5// store XOR of High 64 bits and Low 64
-   // bits of  HashKey^2 <<1 mod poly here
-   // (for Karatsuba purposes)
-#defineHashKey_3_k 16*6// store XOR of High 64 bits and Low 64
-   // bits of  HashKey^3 <<1 mod poly here
-   // (for Karatsuba purposes)
-#defineHashKey_4_k 16*7// store XOR of High 64 bits and Low 64
-   // bits of  HashKey^4 <<1 mod poly here
-   // (for Karatsuba purposes)
-#defineVARIABLE_OFFSET 16*8
 
 #define AadHash 16*0
 #define AadLen 16*1
@@ -118,6 +101,22 @@ ALL_F:  .octa 0x
 #define OrigIV 16*3
 #define CurCount 16*4
 #define PBlockLen 16*5
+#defineHashKey 16*6// store HashKey <<1 mod poly here
+#defineHashKey_2   16*7// store HashKey^2 <<1 mod poly here
+#defineHashKey_3   16*8// store HashKey^3 <<1 mod poly here
+#defineHashKey_4   16*9// store HashKey^4 <<1 mod poly here
+#defineHashKey_k   16*10   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey <<1 mod poly here
+   //(for Karatsuba purposes)
+#defineHashKey_2_k 16*11   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey^2 <<1 mod poly here
+   // (for Karatsuba purposes)
+#defineHashKey_3_k 16*12   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey^3 <<1 mod poly here
+   // (for Karatsuba purposes)
+#defineHashKey_4_k 16*13   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey^4 <<1 mod poly here
+   // (for Karatsuba purposes)
 
 #define arg1 rdi
 #define arg2 rsi
@@ -125,11 +124,11 @@ ALL_F:  .octa 0x
 #define arg4 rcx
 #define arg5 r8
 #define arg6 r9
-#define arg7 STACK_OFFSET+8(%r14)
-#define arg8 STACK_OFFSET+16(%r14)
-#define arg9 STACK_OFFSET+24(%r14)
-#define arg10 STACK_OFFSET+32(%r14)
-#define arg11 STACK_OFFSET+40(%r14)
+#define arg7 STACK_OFFSET+8(%rsp)
+#define arg8 STACK_OFFSET+16(%rsp)
+#define arg9 STACK_OFFSET+24(%rsp)
+#define arg10 STACK_OFFSET+32(%rsp)
+#define arg11 STACK_OFFSET+40(%rsp)
 #define keysize 2*15*16(%arg1)
 #endif
 
@@ -183,28 +182,79 @@ ALL_F:  .octa 0x
push%r12
push%r13
push%r14
-   mov %rsp, %r14
 #
 # states of %xmm registers %xmm6:%xmm15 not saved
 # all %xmm registers are clobbered
 #
-   sub $VARIABLE_OFFSET, %rsp
-   and $~63, %rsp
 .endm
 
 
 .macro FUNC_RESTORE
-   mov %r14, %rsp
pop %r14
pop %r13
pop %r12
 .endm
 
+# Precompute hashkeys.
+# Input: Hash subkey.
+# Output: HashKeys stored in gcm_context_data.  Only needs to be called
+# once per key.
+# clobbers r12, and tmp xmm registers.
+.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
+   mov arg7, %r12
+   movdqu  (%r12), \TMP3
+   movdqa  SHUF_MASK(%rip), \TMP2
+   PSHUFB_XMM \TMP2, \TMP3
+
+   # precompute HashK

[PATCH v2 08/14] x86/crypto: aesni: Fill in new context data structures

2018-02-14 Thread Dave Watson
Fill in aadhash, aadlen, pblocklen, curcount with appropriate values.
pblocklen, aadhash, and pblockenckey are also updated at the end
of each scatter/gather operation, to be carried over to the next
operation.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 51 ++-
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 58bbfac..aa82493 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -204,6 +204,21 @@ ALL_F:  .octa 0x
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
 .macro GCM_INIT
+
+   mov arg9, %r11
+   mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length
+   xor %r11, %r11
+   mov %r11, InLen(%arg2) # ctx_data.in_length = 0
+   mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0
+   mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0
+   mov %arg6, %rax
+   movdqu (%rax), %xmm0
+   movdqu %xmm0, OrigIV(%arg2) # ctx_data.orig_IV = iv
+
+   movdqa  SHUF_MASK(%rip), %xmm2
+   PSHUFB_XMM %xmm2, %xmm0
+   movdqu %xmm0, CurCount(%arg2) # ctx_data.current_counter = iv
+
mov arg7, %r12
movdqu  (%r12), %xmm13
movdqa  SHUF_MASK(%rip), %xmm2
@@ -226,13 +241,9 @@ ALL_F:  .octa 0x
pandPOLY(%rip), %xmm2
pxor%xmm2, %xmm13
movdqa  %xmm13, HashKey(%rsp)
-   mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly)
-   and $-16, %r13
-   mov %r13, %r12
 
CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
%xmm5 %xmm6
-   mov %r13, %r12
 .endm
 
 # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
@@ -240,6 +251,12 @@ ALL_F:  .octa 0x
 # Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK
 # Clobbers rax, r10-r13, and xmm0-xmm15
 .macro GCM_ENC_DEC operation
+   movdqu AadHash(%arg2), %xmm8
+   movdqu HashKey(%rsp), %xmm13
+   add %arg5, InLen(%arg2)
+   mov %arg5, %r13 # save the number of bytes
+   and $-16, %r13  # %r13 = %r13 - (%r13 mod 16)
+   mov %r13, %r12
# Encrypt/Decrypt first few blocks
 
and $(3<<4), %r12
@@ -284,16 +301,23 @@ _four_cipher_left_\@:
GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
 %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
 _zero_cipher_left_\@:
+   movdqu %xmm8, AadHash(%arg2)
+   movdqu %xmm0, CurCount(%arg2)
+
mov %arg5, %r13
and $15, %r13   # %r13 = arg5 (mod 16)
je  _multiple_of_16_bytes_\@
 
+   mov %r13, PBlockLen(%arg2)
+
# Handle the last <16 Byte block separately
paddd ONE(%rip), %xmm0# INCR CNT to get Yn
+   movdqu %xmm0, CurCount(%arg2)
movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm0
 
ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
+   movdqu %xmm0, PBlockEncKey(%arg2)
 
lea (%arg4,%r11,1), %r10
mov %r13, %r12
@@ -322,6 +346,7 @@ _zero_cipher_left_\@:
 .endif
 
GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+   movdqu %xmm8, AadHash(%arg2)
 .ifc \operation, enc
# GHASH computation for the last <16 byte block
movdqa SHUF_MASK(%rip), %xmm10
@@ -351,11 +376,15 @@ _multiple_of_16_bytes_\@:
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
 .macro GCM_COMPLETE
-   mov arg9, %r12# %r13 = aadLen (number of bytes)
+   movdqu AadHash(%arg2), %xmm8
+   movdqu HashKey(%rsp), %xmm13
+   mov AadLen(%arg2), %r12  # %r13 = aadLen (number of bytes)
shl $3, %r12  # convert into number of bits
movd%r12d, %xmm15 # len(A) in %xmm15
-   shl $3, %arg5 # len(C) in bits (*128)
-   MOVQ_R64_XMM%arg5, %xmm1
+   mov InLen(%arg2), %r12
+   shl $3, %r12  # len(C) in bits (*128)
+   MOVQ_R64_XMM%r12, %xmm1
+
pslldq  $8, %xmm15# %xmm15 = len(A)||0x
pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
pxor%xmm15, %xmm8
@@ -364,8 +393,7 @@ _multiple_of_16_bytes_\@:
movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm8
 
-   mov %arg6, %rax   # %rax = *Y0
-   movdqu  (%rax), %xmm0 # %xmm0 = Y0
+   movdqu OrigIV(%arg2), %xmm0   # %xmm0 = Y0
ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
pxor%xmm8, %xmm0
 _return_T_\@:
@@ -553,15 

[PATCH v2 07/14] x86/crypto: aesni: Split AAD hash calculation to separate macro

2018-02-14 Thread Dave Watson
AAD hash only needs to be calculated once for each scatter/gather operation.
Move it to its own macro, and call it from GCM_INIT instead of
INITIAL_BLOCKS.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 71 ---
 1 file changed, 43 insertions(+), 28 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 6c5a80d..58bbfac 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -229,6 +229,10 @@ ALL_F:  .octa 0x
mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly)
and $-16, %r13
mov %r13, %r12
+
+   CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
+   %xmm5 %xmm6
+   mov %r13, %r12
 .endm
 
 # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
@@ -496,51 +500,62 @@ _read_next_byte_lt8_\@:
 _done_read_partial_block_\@:
 .endm
 
-/*
-* if a = number of total plaintext bytes
-* b = floor(a/16)
-* num_initial_blocks = b mod 4
-* encrypt the initial num_initial_blocks blocks and apply ghash on
-* the ciphertext
-* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers
-* are clobbered
-* arg1, %arg3, %arg4, %r14 are used as a pointer only, not modified
-*/
-
-
-.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \
-XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
-MOVADQ SHUF_MASK(%rip), %xmm14
-   movarg8, %r10   # %r10 = AAD
-   movarg9, %r11   # %r11 = aadLen
-   pxor   %xmm\i, %xmm\i
-   pxor   \XMM2, \XMM2
+# CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
+# clobbers r10-11, xmm14
+.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \
+   TMP6 TMP7
+   MOVADQ SHUF_MASK(%rip), %xmm14
+   movarg8, %r10   # %r10 = AAD
+   movarg9, %r11   # %r11 = aadLen
+   pxor   \TMP7, \TMP7
+   pxor   \TMP6, \TMP6
 
cmp$16, %r11
jl _get_AAD_rest\@
 _get_AAD_blocks\@:
-   movdqu (%r10), %xmm\i
-   PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   %xmm\i, \XMM2
-   GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
+   movdqu (%r10), \TMP7
+   PSHUFB_XMM   %xmm14, \TMP7 # byte-reflect the AAD data
+   pxor   \TMP7, \TMP6
+   GHASH_MUL  \TMP6, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5
add$16, %r10
sub$16, %r11
cmp$16, %r11
jge_get_AAD_blocks\@
 
-   movdqu \XMM2, %xmm\i
+   movdqu \TMP6, \TMP7
 
/* read the last <16B of AAD */
 _get_AAD_rest\@:
cmp$0, %r11
je _get_AAD_done\@
 
-   READ_PARTIAL_BLOCK %r10, %r11, \TMP1, %xmm\i
-   PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   \XMM2, %xmm\i
-   GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
+   READ_PARTIAL_BLOCK %r10, %r11, \TMP1, \TMP7
+   PSHUFB_XMM   %xmm14, \TMP7 # byte-reflect the AAD data
+   pxor   \TMP6, \TMP7
+   GHASH_MUL  \TMP7, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5
+   movdqu \TMP7, \TMP6
 
 _get_AAD_done\@:
+   movdqu \TMP6, AadHash(%arg2)
+.endm
+
+/*
+* if a = number of total plaintext bytes
+* b = floor(a/16)
+* num_initial_blocks = b mod 4
+* encrypt the initial num_initial_blocks blocks and apply ghash on
+* the ciphertext
+* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers
+* are clobbered
+* arg1, %arg2, %arg3, %r14 are used as a pointer only, not modified
+*/
+
+
+.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \
+   XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
+
+   movdqu AadHash(%arg2), %xmm\i   # XMM0 = Y0
+
xor%r11, %r11 # initialise the data pointer offset as zero
# start AES for num_initial_blocks blocks
 
-- 
2.9.5



[PATCH v2 10/14] x86/crypto: aesni: Move HashKey computation from stack to gcm_context

2018-02-14 Thread Dave Watson
HashKey computation only needs to happen once per scatter/gather operation,
save it between calls in gcm_context struct instead of on the stack.
Since the asm no longer stores anything on the stack, we can use
%rsp directly, and clean up the frame save/restore macros a bit.

Hashkeys actually only need to be calculated once per key and could
be moved to when set_key is called, however, the current glue code
falls back to generic aes code if fpu is disabled.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_asm.S | 205 --
 1 file changed, 106 insertions(+), 99 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 37b1cee..3ada06b 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -93,23 +93,6 @@ ALL_F:  .octa 0x
 
 
 #defineSTACK_OFFSET8*3
-#defineHashKey 16*0// store HashKey <<1 mod poly here
-#defineHashKey_2   16*1// store HashKey^2 <<1 mod poly here
-#defineHashKey_3   16*2// store HashKey^3 <<1 mod poly here
-#defineHashKey_4   16*3// store HashKey^4 <<1 mod poly here
-#defineHashKey_k   16*4// store XOR of High 64 bits and Low 64
-   // bits of  HashKey <<1 mod poly here
-   //(for Karatsuba purposes)
-#defineHashKey_2_k 16*5// store XOR of High 64 bits and Low 64
-   // bits of  HashKey^2 <<1 mod poly here
-   // (for Karatsuba purposes)
-#defineHashKey_3_k 16*6// store XOR of High 64 bits and Low 64
-   // bits of  HashKey^3 <<1 mod poly here
-   // (for Karatsuba purposes)
-#defineHashKey_4_k 16*7// store XOR of High 64 bits and Low 64
-   // bits of  HashKey^4 <<1 mod poly here
-   // (for Karatsuba purposes)
-#defineVARIABLE_OFFSET 16*8
 
 #define AadHash 16*0
 #define AadLen 16*1
@@ -118,6 +101,22 @@ ALL_F:  .octa 0x
 #define OrigIV 16*3
 #define CurCount 16*4
 #define PBlockLen 16*5
+#defineHashKey 16*6// store HashKey <<1 mod poly here
+#defineHashKey_2   16*7// store HashKey^2 <<1 mod poly here
+#defineHashKey_3   16*8// store HashKey^3 <<1 mod poly here
+#defineHashKey_4   16*9// store HashKey^4 <<1 mod poly here
+#defineHashKey_k   16*10   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey <<1 mod poly here
+   //(for Karatsuba purposes)
+#defineHashKey_2_k 16*11   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey^2 <<1 mod poly here
+   // (for Karatsuba purposes)
+#defineHashKey_3_k 16*12   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey^3 <<1 mod poly here
+   // (for Karatsuba purposes)
+#defineHashKey_4_k 16*13   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey^4 <<1 mod poly here
+   // (for Karatsuba purposes)
 
 #define arg1 rdi
 #define arg2 rsi
@@ -125,11 +124,11 @@ ALL_F:  .octa 0x
 #define arg4 rcx
 #define arg5 r8
 #define arg6 r9
-#define arg7 STACK_OFFSET+8(%r14)
-#define arg8 STACK_OFFSET+16(%r14)
-#define arg9 STACK_OFFSET+24(%r14)
-#define arg10 STACK_OFFSET+32(%r14)
-#define arg11 STACK_OFFSET+40(%r14)
+#define arg7 STACK_OFFSET+8(%rsp)
+#define arg8 STACK_OFFSET+16(%rsp)
+#define arg9 STACK_OFFSET+24(%rsp)
+#define arg10 STACK_OFFSET+32(%rsp)
+#define arg11 STACK_OFFSET+40(%rsp)
 #define keysize 2*15*16(%arg1)
 #endif
 
@@ -183,28 +182,79 @@ ALL_F:  .octa 0x
push%r12
push%r13
push%r14
-   mov %rsp, %r14
 #
 # states of %xmm registers %xmm6:%xmm15 not saved
 # all %xmm registers are clobbered
 #
-   sub $VARIABLE_OFFSET, %rsp
-   and $~63, %rsp
 .endm
 
 
 .macro FUNC_RESTORE
-   mov %r14, %rsp
pop %r14
pop %r13
pop %r12
 .endm
 
+# Precompute hashkeys.
+# Input: Hash subkey.
+# Output: HashKeys stored in gcm_context_data.  Only needs to be called
+# once per key.
+# clobbers r12, and tmp xmm registers.
+.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
+   mov arg7, %r12
+   movdqu  (%r12), \TMP3
+   movdqa  SHUF_MASK(%rip), \TMP2
+   PSHUFB_XMM \TMP2, \TMP3
+
+   # precompute HashKey<<1 mod poly from t

[PATCH v2 08/14] x86/crypto: aesni: Fill in new context data structures

2018-02-14 Thread Dave Watson
Fill in aadhash, aadlen, pblocklen, curcount with appropriate values.
pblocklen, aadhash, and pblockenckey are also updated at the end
of each scatter/gather operation, to be carried over to the next
operation.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_asm.S | 51 ++-
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 58bbfac..aa82493 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -204,6 +204,21 @@ ALL_F:  .octa 0x
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
 .macro GCM_INIT
+
+   mov arg9, %r11
+   mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length
+   xor %r11, %r11
+   mov %r11, InLen(%arg2) # ctx_data.in_length = 0
+   mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0
+   mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0
+   mov %arg6, %rax
+   movdqu (%rax), %xmm0
+   movdqu %xmm0, OrigIV(%arg2) # ctx_data.orig_IV = iv
+
+   movdqa  SHUF_MASK(%rip), %xmm2
+   PSHUFB_XMM %xmm2, %xmm0
+   movdqu %xmm0, CurCount(%arg2) # ctx_data.current_counter = iv
+
mov arg7, %r12
movdqu  (%r12), %xmm13
movdqa  SHUF_MASK(%rip), %xmm2
@@ -226,13 +241,9 @@ ALL_F:  .octa 0x
pandPOLY(%rip), %xmm2
pxor%xmm2, %xmm13
movdqa  %xmm13, HashKey(%rsp)
-   mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly)
-   and $-16, %r13
-   mov %r13, %r12
 
CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
%xmm5 %xmm6
-   mov %r13, %r12
 .endm
 
 # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
@@ -240,6 +251,12 @@ ALL_F:  .octa 0x
 # Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK
 # Clobbers rax, r10-r13, and xmm0-xmm15
 .macro GCM_ENC_DEC operation
+   movdqu AadHash(%arg2), %xmm8
+   movdqu HashKey(%rsp), %xmm13
+   add %arg5, InLen(%arg2)
+   mov %arg5, %r13 # save the number of bytes
+   and $-16, %r13  # %r13 = %r13 - (%r13 mod 16)
+   mov %r13, %r12
# Encrypt/Decrypt first few blocks
 
and $(3<<4), %r12
@@ -284,16 +301,23 @@ _four_cipher_left_\@:
GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
 %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
 _zero_cipher_left_\@:
+   movdqu %xmm8, AadHash(%arg2)
+   movdqu %xmm0, CurCount(%arg2)
+
mov %arg5, %r13
and $15, %r13   # %r13 = arg5 (mod 16)
je  _multiple_of_16_bytes_\@
 
+   mov %r13, PBlockLen(%arg2)
+
# Handle the last <16 Byte block separately
paddd ONE(%rip), %xmm0# INCR CNT to get Yn
+   movdqu %xmm0, CurCount(%arg2)
movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm0
 
ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
+   movdqu %xmm0, PBlockEncKey(%arg2)
 
lea (%arg4,%r11,1), %r10
mov %r13, %r12
@@ -322,6 +346,7 @@ _zero_cipher_left_\@:
 .endif
 
GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+   movdqu %xmm8, AadHash(%arg2)
 .ifc \operation, enc
# GHASH computation for the last <16 byte block
movdqa SHUF_MASK(%rip), %xmm10
@@ -351,11 +376,15 @@ _multiple_of_16_bytes_\@:
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
 .macro GCM_COMPLETE
-   mov arg9, %r12# %r13 = aadLen (number of bytes)
+   movdqu AadHash(%arg2), %xmm8
+   movdqu HashKey(%rsp), %xmm13
+   mov AadLen(%arg2), %r12  # %r13 = aadLen (number of bytes)
shl $3, %r12  # convert into number of bits
movd%r12d, %xmm15 # len(A) in %xmm15
-   shl $3, %arg5 # len(C) in bits (*128)
-   MOVQ_R64_XMM%arg5, %xmm1
+   mov InLen(%arg2), %r12
+   shl $3, %r12  # len(C) in bits (*128)
+   MOVQ_R64_XMM%r12, %xmm1
+
pslldq  $8, %xmm15# %xmm15 = len(A)||0x
pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
pxor%xmm15, %xmm8
@@ -364,8 +393,7 @@ _multiple_of_16_bytes_\@:
movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm8
 
-   mov %arg6, %rax   # %rax = *Y0
-   movdqu  (%rax), %xmm0 # %xmm0 = Y0
+   movdqu OrigIV(%arg2), %xmm0   # %xmm0 = Y0
ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
pxor%xmm8, %xmm0
 _return_T_\@:
@@ -553,15 +581,14 @@ _get_AAD

[PATCH v2 07/14] x86/crypto: aesni: Split AAD hash calculation to separate macro

2018-02-14 Thread Dave Watson
AAD hash only needs to be calculated once for each scatter/gather operation.
Move it to its own macro, and call it from GCM_INIT instead of
INITIAL_BLOCKS.
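
For anyone following along, what CALC_AAD_HASH computes is just the GHASH of
the AAD: each 16-byte block (the final one zero-padded) is XORed into the
accumulator, which is then multiplied by the hash subkey H in GF(2^128).  A
plain, unoptimized C sketch of that computation is below (spec byte order; the
asm gets the same effect by byte-reflecting with PSHUFB), meant only to
illustrate the macros, not as the kernel implementation:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Reference GF(2^128) multiply as in the GCM spec (bit-by-bit, unoptimized). */
static void gf128_mul(uint8_t x[16], const uint8_t h[16])
{
        uint8_t z[16] = { 0 }, v[16];
        int i, j, carry;

        memcpy(v, h, 16);
        for (i = 0; i < 128; i++) {
                if (x[i / 8] & (0x80 >> (i % 8)))       /* bit i of x, MSB first */
                        for (j = 0; j < 16; j++)
                                z[j] ^= v[j];
                carry = v[15] & 1;
                for (j = 15; j > 0; j--)                /* v >>= 1 over 128 bits */
                        v[j] = (v[j] >> 1) | (v[j - 1] << 7);
                v[0] >>= 1;
                if (carry)                              /* reduce by x^128+x^7+x^2+x+1 */
                        v[0] ^= 0xe1;
        }
        memcpy(x, z, 16);
}

/* GHASH over the AAD: conceptually what CALC_AAD_HASH leaves in AadHash. */
static void ghash_aad(uint8_t hash[16], const uint8_t h[16],
                      const uint8_t *aad, size_t aadlen)
{
        uint8_t block[16];
        size_t n;
        int i;

        memset(hash, 0, 16);
        while (aadlen) {
                n = aadlen < 16 ? aadlen : 16;
                memset(block, 0, 16);
                memcpy(block, aad, n);          /* final partial block is zero-padded */
                for (i = 0; i < 16; i++)
                        hash[i] ^= block[i];
                gf128_mul(hash, h);
                aad += n;
                aadlen -= n;
        }
}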

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_asm.S | 71 ---
 1 file changed, 43 insertions(+), 28 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 6c5a80d..58bbfac 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -229,6 +229,10 @@ ALL_F:  .octa 0x
mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly)
and $-16, %r13
mov %r13, %r12
+
+   CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
+   %xmm5 %xmm6
+   mov %r13, %r12
 .endm
 
 # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
@@ -496,51 +500,62 @@ _read_next_byte_lt8_\@:
 _done_read_partial_block_\@:
 .endm
 
-/*
-* if a = number of total plaintext bytes
-* b = floor(a/16)
-* num_initial_blocks = b mod 4
-* encrypt the initial num_initial_blocks blocks and apply ghash on
-* the ciphertext
-* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers
-* are clobbered
-* arg1, %arg3, %arg4, %r14 are used as a pointer only, not modified
-*/
-
-
-.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \
-XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
-MOVADQ SHUF_MASK(%rip), %xmm14
-   movarg8, %r10   # %r10 = AAD
-   movarg9, %r11   # %r11 = aadLen
-   pxor   %xmm\i, %xmm\i
-   pxor   \XMM2, \XMM2
+# CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
+# clobbers r10-11, xmm14
+.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \
+   TMP6 TMP7
+   MOVADQ SHUF_MASK(%rip), %xmm14
+   movarg8, %r10   # %r10 = AAD
+   movarg9, %r11   # %r11 = aadLen
+   pxor   \TMP7, \TMP7
+   pxor   \TMP6, \TMP6
 
cmp$16, %r11
jl _get_AAD_rest\@
 _get_AAD_blocks\@:
-   movdqu (%r10), %xmm\i
-   PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   %xmm\i, \XMM2
-   GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
+   movdqu (%r10), \TMP7
+   PSHUFB_XMM   %xmm14, \TMP7 # byte-reflect the AAD data
+   pxor   \TMP7, \TMP6
+   GHASH_MUL  \TMP6, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5
add$16, %r10
sub$16, %r11
cmp$16, %r11
jge_get_AAD_blocks\@
 
-   movdqu \XMM2, %xmm\i
+   movdqu \TMP6, \TMP7
 
/* read the last <16B of AAD */
 _get_AAD_rest\@:
cmp$0, %r11
je _get_AAD_done\@
 
-   READ_PARTIAL_BLOCK %r10, %r11, \TMP1, %xmm\i
-   PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   \XMM2, %xmm\i
-   GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
+   READ_PARTIAL_BLOCK %r10, %r11, \TMP1, \TMP7
+   PSHUFB_XMM   %xmm14, \TMP7 # byte-reflect the AAD data
+   pxor   \TMP6, \TMP7
+   GHASH_MUL  \TMP7, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5
+   movdqu \TMP7, \TMP6
 
 _get_AAD_done\@:
+   movdqu \TMP6, AadHash(%arg2)
+.endm
+
+/*
+* if a = number of total plaintext bytes
+* b = floor(a/16)
+* num_initial_blocks = b mod 4
+* encrypt the initial num_initial_blocks blocks and apply ghash on
+* the ciphertext
+* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers
+* are clobbered
+* arg1, %arg2, %arg3, %r14 are used as a pointer only, not modified
+*/
+
+
+.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \
+   XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
+
+   movdqu AadHash(%arg2), %xmm\i   # XMM0 = Y0
+
xor%r11, %r11 # initialise the data pointer offset as zero
# start AES for num_initial_blocks blocks
 
-- 
2.9.5



[PATCH v2 05/14] x86/crypto: aesni: Merge encode and decode to GCM_ENC_DEC macro

2018-02-14 Thread Dave Watson
Make a macro for the main encode/decode routine.  Only a small handful
of lines differ for enc and dec.   This will also become the main
scatter/gather update routine.
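
Concretely, the differing lines come down to which buffer feeds GHASH:
decryption must hash the incoming ciphertext before it is overwritten, while
encryption hashes the ciphertext it just produced; the CTR XOR itself is
identical.  A rough C sketch of one block follows (the helper types are
hypothetical stand-ins for illustration, not the kernel API):

#include <stdint.h>

/* Hypothetical helpers, for illustration only. */
typedef void (*aes_enc_fn)(const void *key, const uint8_t in[16], uint8_t out[16]);
typedef void (*ghash_fn)(void *state, const uint8_t block[16]);

/* One CTR block of GCM: enc and dec share everything except which buffer
 * (input or output) is absorbed into the GHASH state. */
static void gcm_ctr_block(aes_enc_fn aes_enc, const void *key,
                          ghash_fn ghash, void *ghash_state,
                          const uint8_t counter[16],
                          const uint8_t in[16], uint8_t out[16], int decrypt)
{
        uint8_t ks[16];
        int i;

        aes_enc(key, counter, ks);              /* E(K, Yi) */
        if (decrypt)
                ghash(ghash_state, in);         /* hash the ciphertext we were given */
        for (i = 0; i < 16; i++)
                out[i] = in[i] ^ ks[i];         /* the CTR XOR is the same either way */
        if (!decrypt)
                ghash(ghash_state, out);        /* hash the ciphertext we just produced */
}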

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 293 +++---
 1 file changed, 114 insertions(+), 179 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 529c542..8021fd1 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -222,6 +222,118 @@ ALL_F:  .octa 0x
mov %r13, %r12
 .endm
 
+# GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
+# struct has been initialized by GCM_INIT.
+# Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK
+# Clobbers rax, r10-r13, and xmm0-xmm15
+.macro GCM_ENC_DEC operation
+   # Encrypt/Decrypt first few blocks
+
+   and $(3<<4), %r12
+   jz  _initial_num_blocks_is_0_\@
+   cmp $(2<<4), %r12
+   jb  _initial_num_blocks_is_1_\@
+   je  _initial_num_blocks_is_2_\@
+_initial_num_blocks_is_3_\@:
+   INITIAL_BLOCKS_ENC_DEC  %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
+%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 5, 678, \operation
+   sub $48, %r13
+   jmp _initial_blocks_\@
+_initial_num_blocks_is_2_\@:
+   INITIAL_BLOCKS_ENC_DEC  %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
+%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 6, 78, \operation
+   sub $32, %r13
+   jmp _initial_blocks_\@
+_initial_num_blocks_is_1_\@:
+   INITIAL_BLOCKS_ENC_DEC  %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
+%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 7, 8, \operation
+   sub $16, %r13
+   jmp _initial_blocks_\@
+_initial_num_blocks_is_0_\@:
+   INITIAL_BLOCKS_ENC_DEC  %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
+%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 8, 0, \operation
+_initial_blocks_\@:
+
+   # Main loop - Encrypt/Decrypt remaining blocks
+
+   cmp $0, %r13
+   je  _zero_cipher_left_\@
+   sub $64, %r13
+   je  _four_cipher_left_\@
+_crypt_by_4_\@:
+   GHASH_4_ENCRYPT_4_PARALLEL_\operation   %xmm9, %xmm10, %xmm11, %xmm12, \
+   %xmm13, %xmm14, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, \
+   %xmm7, %xmm8, enc
+   add $64, %r11
+   sub $64, %r13
+   jne _crypt_by_4_\@
+_four_cipher_left_\@:
+   GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
+%xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
+_zero_cipher_left_\@:
+   mov %arg4, %r13
+   and $15, %r13   # %r13 = arg4 (mod 16)
+   je  _multiple_of_16_bytes_\@
+
+   # Handle the last <16 Byte block separately
+   paddd ONE(%rip), %xmm0# INCR CNT to get Yn
+movdqa SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM %xmm10, %xmm0
+
+   ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
+
+   lea (%arg3,%r11,1), %r10
+   mov %r13, %r12
+   READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1
+
+   lea ALL_F+16(%rip), %r12
+   sub %r13, %r12
+.ifc \operation, dec
+   movdqa  %xmm1, %xmm2
+.endif
+   pxor%xmm1, %xmm0# XOR Encrypt(K, Yn)
+   movdqu  (%r12), %xmm1
+   # get the appropriate mask to mask out top 16-r13 bytes of xmm0
+   pand%xmm1, %xmm0# mask out top 16-r13 bytes of xmm0
+.ifc \operation, dec
+   pand%xmm1, %xmm2
+   movdqa SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM %xmm10 ,%xmm2
+
+   pxor %xmm2, %xmm8
+.else
+   movdqa SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM %xmm10,%xmm0
+
+   pxor%xmm0, %xmm8
+.endif
+
+   GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+.ifc \operation, enc
+   # GHASH computation for the last <16 byte block
+   movdqa SHUF_MASK(%rip), %xmm10
+   # shuffle xmm0 back to output as ciphertext
+   PSHUFB_XMM %xmm10, %xmm0
+.endif
+
+   # Output %r13 bytes
+   MOVQ_R64_XMM %xmm0, %rax
+   cmp $8, %r13
+   jle _less_than_8_bytes_left_\@
+   mov %rax, (%arg2 , %r11, 1)
+   add $8, %r11
+   psrldq $8, %xmm0
+   MOVQ_R64_XMM %xmm0, %rax
+   sub $8, %r13
+_less_than_8_bytes_left_\@:
+   mov %al,  (%arg2, %r11, 1)
+   add $1, %r11
+   shr $8, %rax
+   sub $1, %r13
+   jne _less_than_8_bytes_left_\@
+_multiple_of_16_bytes_\@:
+.endm
+
 # GCM_COMPLETE Finishes update of tag of last partial block
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
@@ -1245,93 +1357,7 @@ ENTRY(aesni_gcm_dec)
FUNC_SAVE
 
GCM_INIT
-
-# Decrypt first few blocks
-
-   and $(3<<4), %r12
-   jz _initial_num_blocks_is_0_decrypt
- 

[PATCH v2 02/14] x86/crypto: aesni: Macro-ify func save/restore

2018-02-14 Thread Dave Watson
Macro-ify function save and restore.  These will be used in new functions
added for scatter/gather update operations.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 53 ++-
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 48911fe..39b42b1 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -170,6 +170,26 @@ ALL_F:  .octa 0x
 #define TKEYP  T1
 #endif
 
+.macro FUNC_SAVE
+   push%r12
+   push%r13
+   push%r14
+   mov %rsp, %r14
+#
+# states of %xmm registers %xmm6:%xmm15 not saved
+# all %xmm registers are clobbered
+#
+   sub $VARIABLE_OFFSET, %rsp
+   and $~63, %rsp
+.endm
+
+
+.macro FUNC_RESTORE
+   mov %r14, %rsp
+   pop %r14
+   pop %r13
+   pop %r12
+.endm
 
 #ifdef __x86_64__
 /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
@@ -1130,16 +1150,7 @@ _esb_loop_\@:
 *
 */
 ENTRY(aesni_gcm_dec)
-   push%r12
-   push%r13
-   push%r14
-   mov %rsp, %r14
-/*
-* states of %xmm registers %xmm6:%xmm15 not saved
-* all %xmm registers are clobbered
-*/
-   sub $VARIABLE_OFFSET, %rsp
-   and $~63, %rsp# align rsp to 64 bytes
+   FUNC_SAVE
mov %arg6, %r12
movdqu  (%r12), %xmm13# %xmm13 = HashKey
 movdqa  SHUF_MASK(%rip), %xmm2
@@ -1309,10 +1320,7 @@ _T_1_decrypt:
 _T_16_decrypt:
movdqu  %xmm0, (%r10)
 _return_T_done_decrypt:
-   mov %r14, %rsp
-   pop %r14
-   pop %r13
-   pop %r12
+   FUNC_RESTORE
ret
 ENDPROC(aesni_gcm_dec)
 
@@ -1393,22 +1401,12 @@ ENDPROC(aesni_gcm_dec)
 * poly = x^128 + x^127 + x^126 + x^121 + 1
 ***/
 ENTRY(aesni_gcm_enc)
-   push%r12
-   push%r13
-   push%r14
-   mov %rsp, %r14
-#
-# states of %xmm registers %xmm6:%xmm15 not saved
-# all %xmm registers are clobbered
-#
-   sub $VARIABLE_OFFSET, %rsp
-   and $~63, %rsp
+   FUNC_SAVE
mov %arg6, %r12
movdqu  (%r12), %xmm13
 movdqa  SHUF_MASK(%rip), %xmm2
PSHUFB_XMM %xmm2, %xmm13
 
-
 # precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
 
movdqa  %xmm13, %xmm2
@@ -1576,10 +1574,7 @@ _T_1_encrypt:
 _T_16_encrypt:
movdqu  %xmm0, (%r10)
 _return_T_done_encrypt:
-   mov %r14, %rsp
-   pop %r14
-   pop %r13
-   pop %r12
+   FUNC_RESTORE
ret
 ENDPROC(aesni_gcm_enc)
 
-- 
2.9.5



[PATCH v2 03/14] x86/crypto: aesni: Add GCM_INIT macro

2018-02-14 Thread Dave Watson
Reduce code duplication by introducing the GCM_INIT macro.  This macro
will also be exposed as a function for implementing scatter/gather
support, since INIT only needs to be called once for the full
operation.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 84 +++
 1 file changed, 33 insertions(+), 51 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 39b42b1..b9fe2ab 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -191,6 +191,37 @@ ALL_F:  .octa 0x
pop %r12
 .endm
 
+
+# GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
+# Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
+.macro GCM_INIT
+   mov %arg6, %r12
+   movdqu  (%r12), %xmm13
+   movdqa  SHUF_MASK(%rip), %xmm2
+   PSHUFB_XMM %xmm2, %xmm13
+
+   # precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
+
+   movdqa  %xmm13, %xmm2
+   psllq   $1, %xmm13
+   psrlq   $63, %xmm2
+   movdqa  %xmm2, %xmm1
+   pslldq  $8, %xmm2
+   psrldq  $8, %xmm1
+   por %xmm2, %xmm13
+
+   # reduce HashKey<<1
+
+   pshufd  $0x24, %xmm1, %xmm2
+   pcmpeqd TWOONE(%rip), %xmm2
+   pandPOLY(%rip), %xmm2
+   pxor%xmm2, %xmm13
+   movdqa  %xmm13, HashKey(%rsp)
+   mov %arg4, %r13 # %xmm13 holds HashKey<<1 (mod 
poly)
+   and $-16, %r13
+   mov %r13, %r12
+.endm
+
 #ifdef __x86_64__
 /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
 *
@@ -1151,36 +1182,11 @@ _esb_loop_\@:
 */
 ENTRY(aesni_gcm_dec)
FUNC_SAVE
-   mov %arg6, %r12
-   movdqu  (%r12), %xmm13# %xmm13 = HashKey
-movdqa  SHUF_MASK(%rip), %xmm2
-   PSHUFB_XMM %xmm2, %xmm13
-
-
-# Precompute HashKey<<1 (mod poly) from the hash key (required for GHASH)
-
-   movdqa  %xmm13, %xmm2
-   psllq   $1, %xmm13
-   psrlq   $63, %xmm2
-   movdqa  %xmm2, %xmm1
-   pslldq  $8, %xmm2
-   psrldq  $8, %xmm1
-   por %xmm2, %xmm13
-
-# Reduction
-
-   pshufd  $0x24, %xmm1, %xmm2
-   pcmpeqd TWOONE(%rip), %xmm2
-   pandPOLY(%rip), %xmm2
-   pxor%xmm2, %xmm13 # %xmm13 holds the HashKey<<1 (mod poly)
 
+   GCM_INIT
 
 # Decrypt first few blocks
 
-   movdqa %xmm13, HashKey(%rsp)   # store HashKey<<1 (mod poly)
-   mov %arg4, %r13# save the number of bytes of plaintext/ciphertext
-   and $-16, %r13  # %r13 = %r13 - (%r13 mod 16)
-   mov %r13, %r12
and $(3<<4), %r12
jz _initial_num_blocks_is_0_decrypt
cmp $(2<<4), %r12
@@ -1402,32 +1408,8 @@ ENDPROC(aesni_gcm_dec)
 ***/
 ENTRY(aesni_gcm_enc)
FUNC_SAVE
-   mov %arg6, %r12
-   movdqu  (%r12), %xmm13
-movdqa  SHUF_MASK(%rip), %xmm2
-   PSHUFB_XMM %xmm2, %xmm13
-
-# precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
-
-   movdqa  %xmm13, %xmm2
-   psllq   $1, %xmm13
-   psrlq   $63, %xmm2
-   movdqa  %xmm2, %xmm1
-   pslldq  $8, %xmm2
-   psrldq  $8, %xmm1
-   por %xmm2, %xmm13
-
-# reduce HashKey<<1
-
-   pshufd  $0x24, %xmm1, %xmm2
-   pcmpeqd TWOONE(%rip), %xmm2
-   pandPOLY(%rip), %xmm2
-   pxor%xmm2, %xmm13
-   movdqa  %xmm13, HashKey(%rsp)
-   mov %arg4, %r13# %xmm13 holds HashKey<<1 (mod poly)
-   and $-16, %r13
-   mov %r13, %r12
 
+   GCM_INIT
 # Encrypt first few blocks
 
and $(3<<4), %r12
-- 
2.9.5



[PATCH v2 00/14] x86/crypto gcmaes SSE scatter/gather support

2018-02-14 Thread Dave Watson
This patch set refactors the x86 aes/gcm SSE crypto routines to
support true scatter/gather by adding gcm_enc/dec_update methods.

The layout is:

* First 5 patches refactor the code to use macros, so changes only
  need to be applied once for encode and decode.  There should be no
  functional changes.

* The next 6 patches introduce a gcm_context structure to be passed
  between scatter/gather calls to maintain state.  The struct is also
  used as scratch space for the existing enc/dec routines.

* The last 2 set up the asm function entry points for scatter/gather
  support, and then call the new routines per buffer in the passed-in
  sglist in aesni-intel_glue.

Testing: 
asm itself fuzz tested vs. existing code and isa-l asm.
Ran libkcapi test suite, passes.

perf of large (16k message) TLS sends, sg vs. no sg:

no-sg

33287255597  cycles  
53702871176  instructions

43.47%   _crypt_by_4
17.83%   memcpy
16.36%   aes_loop_par_enc_done

sg

27568944591  cycles 
54580446678  instructions

49.87%   _crypt_by_4
17.40%   aes_loop_par_enc_done
1.79%aes_loop_initial_5416
1.52%aes_loop_initial_4974
1.27%gcmaes_encrypt_sg.constprop.15

V1 -> V2:

patch 14: merge enc/dec
  also use new routine if cryptlen < AVX_GEN2_OPTSIZE
  optimize case if assoc is already linear

Dave Watson (14):
  x86/crypto: aesni: Merge INITIAL_BLOCKS_ENC/DEC
  x86/crypto: aesni: Macro-ify func save/restore
  x86/crypto: aesni: Add GCM_INIT macro
  x86/crypto: aesni: Add GCM_COMPLETE macro
  x86/crypto: aesni: Merge encode and decode to GCM_ENC_DEC macro
  x86/crypto: aesni: Introduce gcm_context_data
  x86/crypto: aesni: Split AAD hash calculation to separate macro
  x86/crypto: aesni: Fill in new context data structures
  x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE
  x86/crypto: aesni: Move HashKey computation from stack to gcm_context
  x86/crypto: aesni: Introduce partial block macro
  x86/crypto: aesni: Add fast path for > 16 byte update
  x86/crypto: aesni: Introduce scatter/gather asm function stubs
  x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather

 arch/x86/crypto/aesni-intel_asm.S  | 1414 ++--
 arch/x86/crypto/aesni-intel_glue.c |  230 +-
 2 files changed, 899 insertions(+), 745 deletions(-)

-- 
2.9.5



[PATCH v2 01/14] x86/crypto: aesni: Merge INITIAL_BLOCKS_ENC/DEC

2018-02-14 Thread Dave Watson
Use macro operations to merge implementations of INITIAL_BLOCKS,
since they differ by only a small handful of lines.

Use macro counter \@ to simplify implementation.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 298 ++
 1 file changed, 48 insertions(+), 250 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 76d8cd4..48911fe 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -275,234 +275,7 @@ _done_read_partial_block_\@:
 */
 
 
-.macro INITIAL_BLOCKS_DEC num_initial_blocks TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 
XMM1 \
-XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
-MOVADQ SHUF_MASK(%rip), %xmm14
-   movarg7, %r10   # %r10 = AAD
-   movarg8, %r11   # %r11 = aadLen
-   pxor   %xmm\i, %xmm\i
-   pxor   \XMM2, \XMM2
-
-   cmp$16, %r11
-   jl _get_AAD_rest\num_initial_blocks\operation
-_get_AAD_blocks\num_initial_blocks\operation:
-   movdqu (%r10), %xmm\i
-   PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   %xmm\i, \XMM2
-   GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-   add$16, %r10
-   sub$16, %r11
-   cmp$16, %r11
-   jge_get_AAD_blocks\num_initial_blocks\operation
-
-   movdqu \XMM2, %xmm\i
-
-   /* read the last <16B of AAD */
-_get_AAD_rest\num_initial_blocks\operation:
-   cmp$0, %r11
-   je _get_AAD_done\num_initial_blocks\operation
-
-   READ_PARTIAL_BLOCK %r10, %r11, \TMP1, %xmm\i
-   PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   \XMM2, %xmm\i
-   GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-
-_get_AAD_done\num_initial_blocks\operation:
-   xor%r11, %r11 # initialise the data pointer offset as zero
-   # start AES for num_initial_blocks blocks
-
-   mov%arg5, %rax  # %rax = *Y0
-   movdqu (%rax), \XMM0# XMM0 = Y0
-   PSHUFB_XMM   %xmm14, \XMM0
-
-.if (\i == 5) || (\i == 6) || (\i == 7)
-   MOVADQ  ONE(%RIP),\TMP1
-   MOVADQ  (%arg1),\TMP2
-.irpc index, \i_seq
-   paddd  \TMP1, \XMM0 # INCR Y0
-   movdqa \XMM0, %xmm\index
-   PSHUFB_XMM   %xmm14, %xmm\index  # perform a 16 byte swap
-   pxor   \TMP2, %xmm\index
-.endr
-   lea 0x10(%arg1),%r10
-   mov keysize,%eax
-   shr $2,%eax # 128->4, 192->6, 256->8
-   add $5,%eax   # 128->9, 192->11, 256->13
-
-aes_loop_initial_dec\num_initial_blocks:
-   MOVADQ  (%r10),\TMP1
-.irpc  index, \i_seq
-   AESENC  \TMP1, %xmm\index
-.endr
-   add $16,%r10
-   sub $1,%eax
-   jnz aes_loop_initial_dec\num_initial_blocks
-
-   MOVADQ  (%r10), \TMP1
-.irpc index, \i_seq
-   AESENCLAST \TMP1, %xmm\index # Last Round
-.endr
-.irpc index, \i_seq
-   movdqu (%arg3 , %r11, 1), \TMP1
-   pxor   \TMP1, %xmm\index
-   movdqu %xmm\index, (%arg2 , %r11, 1)
-   # write back plaintext/ciphertext for num_initial_blocks
-   add$16, %r11
-
-   movdqa \TMP1, %xmm\index
-   PSHUFB_XMM %xmm14, %xmm\index
-# prepare plaintext/ciphertext for GHASH computation
-.endr
-.endif
-
-# apply GHASH on num_initial_blocks blocks
-
-.if \i == 5
-pxor   %xmm5, %xmm6
-   GHASH_MUL  %xmm6, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-pxor   %xmm6, %xmm7
-   GHASH_MUL  %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-pxor   %xmm7, %xmm8
-   GHASH_MUL  %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-.elseif \i == 6
-pxor   %xmm6, %xmm7
-   GHASH_MUL  %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-pxor   %xmm7, %xmm8
-   GHASH_MUL  %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-.elseif \i == 7
-pxor   %xmm7, %xmm8
-   GHASH_MUL  %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-.endif
-   cmp$64, %r13
-   jl  _initial_blocks_done\num_initial_blocks\operation
-   # no need for precomputed values
-/*
-*
-* Precomputations for HashKey parallel with encryption of first 4 blocks.
-* Haskey_i_k holds XORed values of the low and high parts of the Haskey_i
-*/
-   MOVADQ ONE(%rip), \TMP1
-   paddd  \TMP1, \XMM0  # INCR Y0
-   MOVADQ \XMM0, \XMM1
-   PSHUFB_XMM  %xmm14, \XMM1# perform a 16 byte swap
-
-   paddd  \TMP1, \XMM0  # INCR Y0
-   MOVADQ \XMM0, \XMM2
-   PSHUFB_XMM  %xmm14, \XMM2# perform a 16 byte swap
-
-   paddd  \TMP1, \XMM0  # INCR Y0
-   

Re: [PATCH 14/14] x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather

2018-02-13 Thread Dave Watson
On 02/13/18 08:42 AM, Stephan Mueller wrote:
> > +static int gcmaes_encrypt_sg(struct aead_request *req, unsigned int
> > assoclen, + u8 *hash_subkey, u8 *iv, void *aes_ctx)
> > +{
> > +   struct crypto_aead *tfm = crypto_aead_reqtfm(req);
> > +   unsigned long auth_tag_len = crypto_aead_authsize(tfm);
> > +   struct gcm_context_data data AESNI_ALIGN_ATTR;
> > +   struct scatter_walk dst_sg_walk = {};
> > +   unsigned long left = req->cryptlen;
> > +   unsigned long len, srclen, dstlen;
> > +   struct scatter_walk src_sg_walk;
> > +   struct scatterlist src_start[2];
> > +   struct scatterlist dst_start[2];
> > +   struct scatterlist *src_sg;
> > +   struct scatterlist *dst_sg;
> > +   u8 *src, *dst, *assoc;
> > +   u8 authTag[16];
> > +
> > +   assoc = kmalloc(assoclen, GFP_ATOMIC);
> > +   if (unlikely(!assoc))
> > +   return -ENOMEM;
> > +   scatterwalk_map_and_copy(assoc, req->src, 0, assoclen, 0);
> 
> Have you tested that this code does not barf when assoclen is 0?
> 
> Maybe it is worth while to finally add a test vector to testmgr.h which 
> validates such scenario. If you would like, here is a vector you could add to 
> testmgr:
> 
> https://github.com/smuellerDD/libkcapi/blob/master/test/test.sh#L315

I tested assoclen and cryptlen being 0 and it works, yes.  Both
kmalloc and scatterwalk_map_and_copy work correctly with 0 assoclen.

> This is a decryption of gcm(aes) with no message, no AAD and just a tag. The 
> result should be EBADMSG.
> > +
> > +   src_sg = scatterwalk_ffwd(src_start, req->src, req->assoclen);
> 
> Why do you use assoclen in the map_and_copy, and req->assoclen in the ffwd?

If I understand correctly, rfc4106 appends extra data after the assoc.
assoclen is the real assoc length, req->assoclen is assoclen + the
extra data length.  So we ffwd by req->assoclen in the scatterlist,
but use assoclen when doing the memcpy and testing.
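
Roughly, in the context of gcmaes_encrypt_sg() quoted above (sketch only; the
8-byte figure for the rfc4106 extra data is assumed here for illustration,
it is not spelled out in the patch):

        /* req->assoclen covers the real AAD plus the trailing extra data */
        unsigned int assoclen = req->assoclen - 8;      /* real AAD length (assumption) */

        /* copy only the real AAD for hashing ... */
        scatterwalk_map_and_copy(assoc, req->src, 0, assoclen, 0);
        /* ... but skip past the AAD *and* the extra data to reach the payload */
        src_sg = scatterwalk_ffwd(src_start, req->src, req->assoclen);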

> > 
> > +static int gcmaes_decrypt_sg(struct aead_request *req, unsigned int
> > assoclen, + u8 *hash_subkey, u8 *iv, void *aes_ctx)
> > +{
> 
> This is a lot of code duplication.

I will merge them and send a V2.

> Ciao
> Stephan
> 
> 

Thanks!


Re: [PATCH 14/14] x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather

2018-02-13 Thread Dave Watson
On 02/12/18 03:12 PM, Junaid Shahid wrote:
> Hi Dave,
> 
> 
> On 02/12/2018 11:51 AM, Dave Watson wrote:
> 
> > +static int gcmaes_encrypt_sg(struct aead_request *req, unsigned int 
> > assoclen,
> > +   u8 *hash_subkey, u8 *iv, void *aes_ctx)
> >  
> > +static int gcmaes_decrypt_sg(struct aead_request *req, unsigned int 
> > assoclen,
> > +   u8 *hash_subkey, u8 *iv, void *aes_ctx)
> 
> These two functions are almost identical. Wouldn't it be better to combine 
> them into a single encrypt/decrypt function, similar to what you have done 
> for the assembly macros?
> 
> > +   if (((struct crypto_aes_ctx *)aes_ctx)->key_length != AES_KEYSIZE_128 ||
> > +   aesni_gcm_enc_tfm == aesni_gcm_enc) {
> 
> Shouldn't we also include a check for the buffer length being less than 
> AVX_GEN2_OPTSIZE? AVX will not be used in that case either.

Yes, these both sound reasonable.  I will send a V2.
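
For reference, the combined check would presumably end up looking something
like this (sketch only; the exact final form is for the V2):

        /* take the new scatter/gather path whenever AVX would not be used
         * anyway: non-128-bit keys, the plain SSE tfm, or (per the comment
         * above) requests below the AVX size cutoff */
        if (((struct crypto_aes_ctx *)aes_ctx)->key_length != AES_KEYSIZE_128 ||
            aesni_gcm_enc_tfm == aesni_gcm_enc ||
            req->cryptlen < AVX_GEN2_OPTSIZE)
                return gcmaes_encrypt_sg(req, assoclen, hash_subkey, iv, aes_ctx);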

Thanks!


[PATCH 13/14] x86/crypto: aesni: Introduce scatter/gather asm function stubs

2018-02-12 Thread Dave Watson
The asm macros are all set up now; introduce the entry points.

GCM_INIT and GCM_COMPLETE have their arguments supplied explicitly, so that
the new scatter/gather entry points don't have to take all the arguments,
only the ones they need.
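
The resulting per-request flow from the glue code is then roughly the
following (a fragment only; function names follow the init/update/finalize
pattern from the cover letter and the prototypes are abbreviated here for
illustration):

        /* sketch: INIT once, one UPDATE per scatterlist buffer, COMPLETE once */
        aesni_gcm_init(aes_ctx, &data, iv, hash_subkey, assoc, assoclen);
        while (left) {
                len = min(left, srclen);        /* one contiguous chunk from the sg walk */
                aesni_gcm_enc_update(aes_ctx, &data, dst, src, len);
                left -= len;
        }
        aesni_gcm_finalize(aes_ctx, &data, auth_tag, auth_tag_len);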

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S  | 116 -
 arch/x86/crypto/aesni-intel_glue.c |  16 +
 2 files changed, 106 insertions(+), 26 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index b941952..311b2de 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -200,8 +200,8 @@ ALL_F:  .octa 0x
 # Output: HashKeys stored in gcm_context_data.  Only needs to be called
 # once per key.
 # clobbers r12, and tmp xmm registers.
-.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
-   mov arg7, %r12
+.macro PRECOMPUTE SUBKEY TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
+   mov \SUBKEY, %r12
movdqu  (%r12), \TMP3
movdqa  SHUF_MASK(%rip), \TMP2
PSHUFB_XMM \TMP2, \TMP3
@@ -254,14 +254,14 @@ ALL_F:  .octa 0x
 
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
-.macro GCM_INIT
-   mov arg9, %r11
+.macro GCM_INIT Iv SUBKEY AAD AADLEN
+   mov \AADLEN, %r11
mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length
xor %r11, %r11
mov %r11, InLen(%arg2) # ctx_data.in_length = 0
mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0
mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0
-   mov %arg6, %rax
+   mov \Iv, %rax
movdqu (%rax), %xmm0
movdqu %xmm0, OrigIV(%arg2) # ctx_data.orig_IV = iv
 
@@ -269,11 +269,11 @@ ALL_F:  .octa 0x
PSHUFB_XMM %xmm2, %xmm0
movdqu %xmm0, CurCount(%arg2) # ctx_data.current_counter = iv
 
-   PRECOMPUTE %xmm1 %xmm2 %xmm3 %xmm4 %xmm5 %xmm6 %xmm7
+   PRECOMPUTE \SUBKEY, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
movdqa HashKey(%arg2), %xmm13
 
-   CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
-   %xmm5 %xmm6
+   CALC_AAD_HASH %xmm13, \AAD, \AADLEN, %xmm0, %xmm1, %xmm2, %xmm3, \
+   %xmm4, %xmm5, %xmm6
 .endm
 
 # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
@@ -435,7 +435,7 @@ _multiple_of_16_bytes_\@:
 # GCM_COMPLETE Finishes update of tag of last partial block
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
-.macro GCM_COMPLETE
+.macro GCM_COMPLETE AUTHTAG AUTHTAGLEN
movdqu AadHash(%arg2), %xmm8
movdqu HashKey(%arg2), %xmm13
 
@@ -466,8 +466,8 @@ _partial_done\@:
ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
pxor%xmm8, %xmm0
 _return_T_\@:
-   mov arg10, %r10 # %r10 = authTag
-   mov arg11, %r11# %r11 = auth_tag_len
+   mov \AUTHTAG, %r10 # %r10 = authTag
+   mov \AUTHTAGLEN, %r11# %r11 = auth_tag_len
cmp $16, %r11
je  _T_16_\@
cmp $8, %r11
@@ -599,11 +599,11 @@ _done_read_partial_block_\@:
 
 # CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
 # clobbers r10-11, xmm14
-.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \
+.macro CALC_AAD_HASH HASHKEY AAD AADLEN TMP1 TMP2 TMP3 TMP4 TMP5 \
TMP6 TMP7
MOVADQ SHUF_MASK(%rip), %xmm14
-   movarg8, %r10   # %r10 = AAD
-   movarg9, %r11   # %r11 = aadLen
+   mov\AAD, %r10   # %r10 = AAD
+   mov\AADLEN, %r11# %r11 = aadLen
pxor   \TMP7, \TMP7
pxor   \TMP6, \TMP6
 
@@ -1103,18 +1103,18 @@ TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 
operation
mov   keysize,%eax
shr   $2,%eax   # 128->4, 192->6, 256->8
sub   $4,%eax   # 128->0, 192->2, 256->4
-   jzaes_loop_par_enc_done
+   jzaes_loop_par_enc_done\@
 
-aes_loop_par_enc:
+aes_loop_par_enc\@:
MOVADQ(%r10),\TMP3
 .irpc  index, 1234
AESENC\TMP3, %xmm\index
 .endr
add   $16,%r10
sub   $1,%eax
-   jnz   aes_loop_par_enc
+   jnz   aes_loop_par_enc\@
 
-aes_loop_par_enc_done:
+aes_loop_par_enc_done\@:
MOVADQ(%r10), \TMP3
AESENCLAST \TMP3, \XMM1   # Round 10
AESENCLAST \TMP3, \XMM2
@@ -1311,18 +1311,18 @@ TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 
operation
mov   keysize,%eax
shr   $2,%eax   # 128->4, 192->6, 256->8
sub   $4,%eax  

[PATCH 06/14] x86/crypto: aesni: Introduce gcm_context_data

2018-02-12 Thread Dave Watson
Introduce a gcm_context_data struct that will be used to pass
context data between scatter/gather update calls.  It is passed
as the second argument (after the crypto keys); the other args are
renumbered accordingly.
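
On the glue side, each request now keeps the context on its own stack and
hands it to the asm as the new second argument, e.g. (sketch only, argument
names illustrative, matching the arg6-arg11 layout in the diff below):

        struct gcm_context_data data AESNI_ALIGN_ATTR;

        aesni_gcm_enc(aes_ctx, &data, dst, src, plaintext_len,
                      iv, hash_subkey, assoc, assoclen,
                      auth_tag, auth_tag_len);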

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S  | 115 +
 arch/x86/crypto/aesni-intel_glue.c |  81 ++
 2 files changed, 121 insertions(+), 75 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 8021fd1..6c5a80d 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -111,6 +111,14 @@ ALL_F:  .octa 0x
// (for Karatsuba purposes)
 #define VARIABLE_OFFSET 16*8
 
+#define AadHash 16*0
+#define AadLen 16*1
+#define InLen (16*1)+8
+#define PBlockEncKey 16*2
+#define OrigIV 16*3
+#define CurCount 16*4
+#define PBlockLen 16*5
+
 #define arg1 rdi
 #define arg2 rsi
 #define arg3 rdx
@@ -121,6 +129,7 @@ ALL_F:  .octa 0x
 #define arg8 STACK_OFFSET+16(%r14)
 #define arg9 STACK_OFFSET+24(%r14)
 #define arg10 STACK_OFFSET+32(%r14)
+#define arg11 STACK_OFFSET+40(%r14)
 #define keysize 2*15*16(%arg1)
 #endif
 
@@ -195,9 +204,9 @@ ALL_F:  .octa 0x
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
 .macro GCM_INIT
-   mov %arg6, %r12
+   mov arg7, %r12
movdqu  (%r12), %xmm13
-   movdqa  SHUF_MASK(%rip), %xmm2
+   movdqa  SHUF_MASK(%rip), %xmm2
PSHUFB_XMM %xmm2, %xmm13
 
# precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
@@ -217,7 +226,7 @@ ALL_F:  .octa 0x
pandPOLY(%rip), %xmm2
pxor%xmm2, %xmm13
movdqa  %xmm13, HashKey(%rsp)
-   mov %arg4, %r13 # %xmm13 holds HashKey<<1 (mod 
poly)
+   mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly)
and $-16, %r13
mov %r13, %r12
 .endm
@@ -271,18 +280,18 @@ _four_cipher_left_\@:
GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
 %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
 _zero_cipher_left_\@:
-   mov %arg4, %r13
-   and $15, %r13   # %r13 = arg4 (mod 16)
+   mov %arg5, %r13
+   and $15, %r13   # %r13 = arg5 (mod 16)
je  _multiple_of_16_bytes_\@
 
# Handle the last <16 Byte block separately
paddd ONE(%rip), %xmm0# INCR CNT to get Yn
-movdqa SHUF_MASK(%rip), %xmm10
+   movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm0
 
ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
 
-   lea (%arg3,%r11,1), %r10
+   lea (%arg4,%r11,1), %r10
mov %r13, %r12
READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1
 
@@ -320,13 +329,13 @@ _zero_cipher_left_\@:
MOVQ_R64_XMM %xmm0, %rax
cmp $8, %r13
jle _less_than_8_bytes_left_\@
-   mov %rax, (%arg2 , %r11, 1)
+   mov %rax, (%arg3 , %r11, 1)
add $8, %r11
psrldq $8, %xmm0
MOVQ_R64_XMM %xmm0, %rax
sub $8, %r13
 _less_than_8_bytes_left_\@:
-   mov %al,  (%arg2, %r11, 1)
+   mov %al,  (%arg3, %r11, 1)
add $1, %r11
shr $8, %rax
sub $1, %r13
@@ -338,11 +347,11 @@ _multiple_of_16_bytes_\@:
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
 .macro GCM_COMPLETE
-   mov arg8, %r12# %r13 = aadLen (number of bytes)
+   mov arg9, %r12# %r13 = aadLen (number of bytes)
shl $3, %r12  # convert into number of bits
movd%r12d, %xmm15 # len(A) in %xmm15
-   shl $3, %arg4 # len(C) in bits (*128)
-   MOVQ_R64_XMM%arg4, %xmm1
+   shl $3, %arg5 # len(C) in bits (*128)
+   MOVQ_R64_XMM%arg5, %xmm1
pslldq  $8, %xmm15# %xmm15 = len(A)||0x
pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
pxor%xmm15, %xmm8
@@ -351,13 +360,13 @@ _multiple_of_16_bytes_\@:
movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm8
 
-   mov %arg5, %rax   # %rax = *Y0
+   mov %arg6, %rax   # %rax = *Y0
movdqu  (%rax), %xmm0 # %xmm0 = Y0
ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
pxor%xmm8, %xmm0
 _return_T_\@:
-   mov arg9, %r10 # %r10 = authTag
-   mov arg10, %r11# %r11 = auth_tag_len
+   mov arg10, %r10 # %r10 = authTag

[PATCH 06/14] x86/crypto: aesni: Introduce gcm_context_data

2018-02-12 Thread Dave Watson
Introduce a gcm_context_data struct that will be used to pass
context data between scatter/gather update calls.  It is passed
as the second argument (after crypto keys), other args are
renumbered.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_asm.S  | 115 +
 arch/x86/crypto/aesni-intel_glue.c |  81 ++
 2 files changed, 121 insertions(+), 75 deletions(-)
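
The AadHash/AadLen/InLen/... offsets added in the diff below imply a C-side
context layout roughly like this sketch.  It is only an illustration of the
layout the asm expects; the struct and field names here are placeholders, not
copied from the glue code.

#include <stddef.h>
#include <stdint.h>

/* Sketch of the layout implied by the AadHash/AadLen/... offsets below. */
struct gcm_context_data_sketch {
	uint8_t  aad_hash[16];              /* AadHash:      16*0       */
	uint64_t aad_length;                /* AadLen:       16*1       */
	uint64_t in_length;                 /* InLen:        16*1 + 8   */
	uint8_t  partial_block_enc_key[16]; /* PBlockEncKey: 16*2       */
	uint8_t  orig_iv[16];               /* OrigIV:       16*3       */
	uint8_t  current_counter[16];       /* CurCount:     16*4       */
	uint64_t partial_block_len;         /* PBlockLen:    16*5       */
};

/* The asm addresses these fields by fixed offsets, so the C struct and the
 * asm constants have to stay in sync. */
_Static_assert(offsetof(struct gcm_context_data_sketch, aad_length) == 16*1, "AadLen");
_Static_assert(offsetof(struct gcm_context_data_sketch, in_length) == 16*1 + 8, "InLen");
_Static_assert(offsetof(struct gcm_context_data_sketch, partial_block_enc_key) == 16*2, "PBlockEncKey");
_Static_assert(offsetof(struct gcm_context_data_sketch, orig_iv) == 16*3, "OrigIV");
_Static_assert(offsetof(struct gcm_context_data_sketch, current_counter) == 16*4, "CurCount");
_Static_assert(offsetof(struct gcm_context_data_sketch, partial_block_len) == 16*5, "PBlockLen");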

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 8021fd1..6c5a80d 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -111,6 +111,14 @@ ALL_F:  .octa 0x
// (for Karatsuba purposes)
 #defineVARIABLE_OFFSET 16*8
 
+#define AadHash 16*0
+#define AadLen 16*1
+#define InLen (16*1)+8
+#define PBlockEncKey 16*2
+#define OrigIV 16*3
+#define CurCount 16*4
+#define PBlockLen 16*5
+
 #define arg1 rdi
 #define arg2 rsi
 #define arg3 rdx
@@ -121,6 +129,7 @@ ALL_F:  .octa 0x
 #define arg8 STACK_OFFSET+16(%r14)
 #define arg9 STACK_OFFSET+24(%r14)
 #define arg10 STACK_OFFSET+32(%r14)
+#define arg11 STACK_OFFSET+40(%r14)
 #define keysize 2*15*16(%arg1)
 #endif
 
@@ -195,9 +204,9 @@ ALL_F:  .octa 0x
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
 .macro GCM_INIT
-   mov %arg6, %r12
+   mov arg7, %r12
movdqu  (%r12), %xmm13
-   movdqa  SHUF_MASK(%rip), %xmm2
+   movdqa  SHUF_MASK(%rip), %xmm2
PSHUFB_XMM %xmm2, %xmm13
 
# precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
@@ -217,7 +226,7 @@ ALL_F:  .octa 0x
pandPOLY(%rip), %xmm2
pxor%xmm2, %xmm13
movdqa  %xmm13, HashKey(%rsp)
-   mov %arg4, %r13 # %xmm13 holds HashKey<<1 (mod 
poly)
+   mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly)
and $-16, %r13
mov %r13, %r12
 .endm
@@ -271,18 +280,18 @@ _four_cipher_left_\@:
GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
 %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
 _zero_cipher_left_\@:
-   mov %arg4, %r13
-   and $15, %r13   # %r13 = arg4 (mod 16)
+   mov %arg5, %r13
+   and $15, %r13   # %r13 = arg5 (mod 16)
je  _multiple_of_16_bytes_\@
 
# Handle the last <16 Byte block separately
paddd ONE(%rip), %xmm0# INCR CNT to get Yn
-movdqa SHUF_MASK(%rip), %xmm10
+   movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm0
 
ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
 
-   lea (%arg3,%r11,1), %r10
+   lea (%arg4,%r11,1), %r10
mov %r13, %r12
READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1
 
@@ -320,13 +329,13 @@ _zero_cipher_left_\@:
MOVQ_R64_XMM %xmm0, %rax
cmp $8, %r13
jle _less_than_8_bytes_left_\@
-   mov %rax, (%arg2 , %r11, 1)
+   mov %rax, (%arg3 , %r11, 1)
add $8, %r11
psrldq $8, %xmm0
MOVQ_R64_XMM %xmm0, %rax
sub $8, %r13
 _less_than_8_bytes_left_\@:
-   mov %al,  (%arg2, %r11, 1)
+   mov %al,  (%arg3, %r11, 1)
add $1, %r11
shr $8, %rax
sub $1, %r13
@@ -338,11 +347,11 @@ _multiple_of_16_bytes_\@:
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
 .macro GCM_COMPLETE
-   mov arg8, %r12# %r13 = aadLen (number of bytes)
+   mov arg9, %r12# %r13 = aadLen (number of bytes)
shl $3, %r12  # convert into number of bits
movd%r12d, %xmm15 # len(A) in %xmm15
-   shl $3, %arg4 # len(C) in bits (*128)
-   MOVQ_R64_XMM%arg4, %xmm1
+   shl $3, %arg5 # len(C) in bits (*128)
+   MOVQ_R64_XMM%arg5, %xmm1
pslldq  $8, %xmm15# %xmm15 = len(A)||0x
pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
pxor%xmm15, %xmm8
@@ -351,13 +360,13 @@ _multiple_of_16_bytes_\@:
movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm8
 
-   mov %arg5, %rax   # %rax = *Y0
+   mov %arg6, %rax   # %rax = *Y0
movdqu  (%rax), %xmm0 # %xmm0 = Y0
ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
pxor%xmm8, %xmm0
 _return_T_\@:
-   mov arg9, %r10 # %r10 = authTag
-   mov arg10, %r11# %r11 = auth_tag_len
+   mov arg10, %r10 # %r10 = authTag
+   mov arg11, %r11 

[PATCH 14/14] x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather

2018-02-12 Thread Dave Watson
Add gcmaes_en/decrypt_sg routines, that will do scatter/gather
by sg. Either src or dst may contain multiple buffers, so
iterate over both at the same time if they are different.
If the input is the same as the output, iterate only over one.

Currently both the AAD and TAG must be linear, so copy them out
with scatterwalk_map_and_copy.

Only the SSE routines are updated so far, so leave the previous
gcmaes_en/decrypt routines, and branch to the sg ones if the
keysize is inappropriate for avx, or we are SSE only.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_glue.c | 166 +
 1 file changed, 166 insertions(+)
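
Since src and dst can be chunked differently, the sg walk below clamps both
sides and advances by the smaller length on each pass.  The same control flow
in plain, standalone C, with the cipher call replaced by memcpy so it compiles
on its own (the struct and function names here are made up for the sketch):

#include <stddef.h>
#include <string.h>

/* A buffer fragment, standing in for one scatterlist entry. */
struct chunk {
	unsigned char *base;
	size_t len;
};

/* Walk the src and dst fragment lists in lockstep, handling min(srclen,
 * dstlen) bytes per step, the way gcmaes_encrypt_sg() walks two
 * scatterlists.  Assumes total does not exceed the bytes available in
 * either list.  The "cipher" is just memcpy to keep the sketch standalone. */
static void process_pairwise(struct chunk *src, struct chunk *dst, size_t total)
{
	size_t si = 0, di = 0;      /* current fragment index */
	size_t soff = 0, doff = 0;  /* offset within the current fragment */

	while (total) {
		size_t srclen = src[si].len - soff;  /* like scatterwalk_clamp() */
		size_t dstlen = dst[di].len - doff;
		size_t len = srclen < dstlen ? srclen : dstlen;

		if (len > total)
			len = total;

		/* real code: aesni_gcm_enc_update(ctx, &data, dst, src, len) */
		memcpy(dst[di].base + doff, src[si].base + soff, len);

		total -= len;
		soff += len;                         /* like scatterwalk_advance() */
		doff += len;
		if (soff == src[si].len) { si++; soff = 0; }
		if (doff == dst[di].len) { di++; doff = 0; }
	}
}

When req->src == req->dst only one walk is needed, which is the else branch
in the patch.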

diff --git a/arch/x86/crypto/aesni-intel_glue.c 
b/arch/x86/crypto/aesni-intel_glue.c
index de986f9..1e32fbe 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -791,6 +791,82 @@ static int generic_gcmaes_set_authsize(struct crypto_aead 
*tfm,
return 0;
 }
 
+static int gcmaes_encrypt_sg(struct aead_request *req, unsigned int assoclen,
+   u8 *hash_subkey, u8 *iv, void *aes_ctx)
+{
+   struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+   unsigned long auth_tag_len = crypto_aead_authsize(tfm);
+   struct gcm_context_data data AESNI_ALIGN_ATTR;
+   struct scatter_walk dst_sg_walk = {};
+   unsigned long left = req->cryptlen;
+   unsigned long len, srclen, dstlen;
+   struct scatter_walk src_sg_walk;
+   struct scatterlist src_start[2];
+   struct scatterlist dst_start[2];
+   struct scatterlist *src_sg;
+   struct scatterlist *dst_sg;
+   u8 *src, *dst, *assoc;
+   u8 authTag[16];
+
+   assoc = kmalloc(assoclen, GFP_ATOMIC);
+   if (unlikely(!assoc))
+   return -ENOMEM;
+   scatterwalk_map_and_copy(assoc, req->src, 0, assoclen, 0);
+
+   src_sg = scatterwalk_ffwd(src_start, req->src, req->assoclen);
+   scatterwalk_start(&src_sg_walk, src_sg);
+   if (req->src != req->dst) {
+   dst_sg = scatterwalk_ffwd(dst_start, req->dst, req->assoclen);
+   scatterwalk_start(&dst_sg_walk, dst_sg);
+   }
+
+   kernel_fpu_begin();
+   aesni_gcm_init(aes_ctx, &data, iv,
+   hash_subkey, assoc, assoclen);
+   if (req->src != req->dst) {
+   while (left) {
+   src = scatterwalk_map(&src_sg_walk);
+   dst = scatterwalk_map(&dst_sg_walk);
+   srclen = scatterwalk_clamp(&src_sg_walk, left);
+   dstlen = scatterwalk_clamp(&dst_sg_walk, left);
+   len = min(srclen, dstlen);
+   if (len)
+   aesni_gcm_enc_update(aes_ctx, &data,
+        dst, src, len);
+   left -= len;
+
+   scatterwalk_unmap(src);
+   scatterwalk_unmap(dst);
+   scatterwalk_advance(&src_sg_walk, len);
+   scatterwalk_advance(&dst_sg_walk, len);
+   scatterwalk_done(&src_sg_walk, 0, left);
+   scatterwalk_done(&dst_sg_walk, 1, left);
+   }
+   } else {
+   while (left) {
+   dst = src = scatterwalk_map(&src_sg_walk);
+   len = scatterwalk_clamp(&src_sg_walk, left);
+   if (len)
+   aesni_gcm_enc_update(aes_ctx, &data,
+        src, src, len);
+   left -= len;
+   scatterwalk_unmap(src);
+   scatterwalk_advance(&src_sg_walk, len);
+   scatterwalk_done(&src_sg_walk, 1, left);
+   }
+   }
+   aesni_gcm_finalize(aes_ctx, &data, authTag, auth_tag_len);
+   kernel_fpu_end();
+
+   kfree(assoc);
+
+   /* Copy in the authTag */
+   scatterwalk_map_and_copy(authTag, req->dst,
+   req->assoclen + req->cryptlen,
+   auth_tag_len, 1);
+   return 0;
+}
+
 static int gcmaes_encrypt(struct aead_request *req, unsigned int assoclen,
  u8 *hash_subkey, u8 *iv, void *aes_ctx)
 {
@@ -802,6 +878,11 @@ static int gcmaes_encrypt(struct aead_request *req, 
unsigned int assoclen,
struct scatter_walk dst_sg_walk = {};
struct gcm_context_data data AESNI_ALIGN_ATTR;
 
+   if (((struct crypto_aes_ctx *)aes_ctx)->key_length != AES_KEYSIZE_128 ||
+   aesni_gcm_enc_tfm == aesni_gcm_enc) {
+   return gcmaes_encrypt_sg(req, assoclen, hash_subkey, iv,
+   aes_ctx);
+   }
if (sg_is_last(req->src) &&
(!PageHighMem(sg_page(req->src)) ||
req->src->offset + req->src->length <= PAGE_SIZE) &&
@@ -854,6 +935,86 @@ static int gc

[PATCH 05/14] x86/crypto: aesni: Merge encode and decode to GCM_ENC_DEC macro

2018-02-12 Thread Dave Watson
Make a macro for the main encode/decode routine.  Only a small handful
of lines differ for enc and dec.   This will also become the main
scatter/gather update routine.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_asm.S | 293 +++---
 1 file changed, 114 insertions(+), 179 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 529c542..8021fd1 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -222,6 +222,118 @@ ALL_F:  .octa 0x
mov %r13, %r12
 .endm
 
+# GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
+# struct has been initialized by GCM_INIT.
+# Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK
+# Clobbers rax, r10-r13, and xmm0-xmm15
+.macro GCM_ENC_DEC operation
+   # Encrypt/Decrypt first few blocks
+
+   and $(3<<4), %r12
+   jz  _initial_num_blocks_is_0_\@
+   cmp $(2<<4), %r12
+   jb  _initial_num_blocks_is_1_\@
+   je  _initial_num_blocks_is_2_\@
+_initial_num_blocks_is_3_\@:
+   INITIAL_BLOCKS_ENC_DEC  %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
+%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 5, 678, \operation
+   sub $48, %r13
+   jmp _initial_blocks_\@
+_initial_num_blocks_is_2_\@:
+   INITIAL_BLOCKS_ENC_DEC  %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
+%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 6, 78, \operation
+   sub $32, %r13
+   jmp _initial_blocks_\@
+_initial_num_blocks_is_1_\@:
+   INITIAL_BLOCKS_ENC_DEC  %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
+%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 7, 8, \operation
+   sub $16, %r13
+   jmp _initial_blocks_\@
+_initial_num_blocks_is_0_\@:
+   INITIAL_BLOCKS_ENC_DEC  %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
+%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 8, 0, \operation
+_initial_blocks_\@:
+
+   # Main loop - Encrypt/Decrypt remaining blocks
+
+   cmp $0, %r13
+   je  _zero_cipher_left_\@
+   sub $64, %r13
+   je  _four_cipher_left_\@
+_crypt_by_4_\@:
+   GHASH_4_ENCRYPT_4_PARALLEL_\operation   %xmm9, %xmm10, %xmm11, %xmm12, \
+   %xmm13, %xmm14, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, \
+   %xmm7, %xmm8, enc
+   add $64, %r11
+   sub $64, %r13
+   jne _crypt_by_4_\@
+_four_cipher_left_\@:
+   GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
+%xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
+_zero_cipher_left_\@:
+   mov %arg4, %r13
+   and $15, %r13   # %r13 = arg4 (mod 16)
+   je  _multiple_of_16_bytes_\@
+
+   # Handle the last <16 Byte block separately
+   paddd ONE(%rip), %xmm0# INCR CNT to get Yn
+movdqa SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM %xmm10, %xmm0
+
+   ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
+
+   lea (%arg3,%r11,1), %r10
+   mov %r13, %r12
+   READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1
+
+   lea ALL_F+16(%rip), %r12
+   sub %r13, %r12
+.ifc \operation, dec
+   movdqa  %xmm1, %xmm2
+.endif
+   pxor%xmm1, %xmm0# XOR Encrypt(K, Yn)
+   movdqu  (%r12), %xmm1
+   # get the appropriate mask to mask out top 16-r13 bytes of xmm0
+   pand%xmm1, %xmm0# mask out top 16-r13 bytes of xmm0
+.ifc \operation, dec
+   pand%xmm1, %xmm2
+   movdqa SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM %xmm10 ,%xmm2
+
+   pxor %xmm2, %xmm8
+.else
+   movdqa SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM %xmm10,%xmm0
+
+   pxor%xmm0, %xmm8
+.endif
+
+   GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+.ifc \operation, enc
+   # GHASH computation for the last <16 byte block
+   movdqa SHUF_MASK(%rip), %xmm10
+   # shuffle xmm0 back to output as ciphertext
+   PSHUFB_XMM %xmm10, %xmm0
+.endif
+
+   # Output %r13 bytes
+   MOVQ_R64_XMM %xmm0, %rax
+   cmp $8, %r13
+   jle _less_than_8_bytes_left_\@
+   mov %rax, (%arg2 , %r11, 1)
+   add $8, %r11
+   psrldq $8, %xmm0
+   MOVQ_R64_XMM %xmm0, %rax
+   sub $8, %r13
+_less_than_8_bytes_left_\@:
+   mov %al,  (%arg2, %r11, 1)
+   add $1, %r11
+   shr $8, %rax
+   sub $1, %r13
+   jne _less_than_8_bytes_left_\@
+_multiple_of_16_bytes_\@:
+.endm
+
 # GCM_COMPLETE Finishes update of tag of last partial block
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
@@ -1245,93 +1357,7 @@ ENTRY(aesni_gcm_dec)
FUNC_SAVE
 
GCM_INIT
-
-# Decrypt first few blocks
-
-   and $(3<<4), %r12
-   jz _initial_num_blocks_is_0_decrypt
-   cmp $(2<<4), %r12
-   jb _initial_num_

[PATCH 12/14] x86/crypto: aesni: Add fast path for > 16 byte update

2018-02-12 Thread Dave Watson
We can fast-path any < 16 byte read if the full message is > 16 bytes,
and shift over by the appropriate amount.  Usually we are
reading > 16 bytes, so this should be faster than the READ_PARTIAL
macro introduced in b20209c91e2 for the average case.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 25 +
 1 file changed, 25 insertions(+)
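
The idea: once the whole message is at least 16 bytes, the trailing partial
block can be fetched with one unaligned 16-byte load that ends exactly at the
end of the data, then shifted down, instead of the byte-at-a-time loop in
READ_PARTIAL_BLOCK.  A userspace sketch of the same trick (the function name
is mine, not from the kernel):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read the trailing len%16 bytes of msg into block[0..rem-1] without reading
 * past the end of msg and without a per-byte loop.  Requires len >= 16 so the
 * backwards 16-byte load stays in bounds (the asm checks arg5 >= 16 first). */
static void read_trailing_partial(uint8_t block[16], const uint8_t *msg, size_t len)
{
	size_t rem = len % 16;              /* bytes in the final partial block */
	uint8_t tmp[16];

	/* One 16-byte load ending exactly at msg + len; it overlaps the
	 * preceding full block, which the movdqu above does via r11-16+r13. */
	memcpy(tmp, msg + len - 16, 16);

	/* Shift right by 16-rem so the wanted bytes start at offset 0; the asm
	 * does this with a PSHUFB through the SHIFT_MASK table. */
	memcpy(block, tmp + (16 - rem), rem);
	memset(block + rem, 0, 16 - rem);
}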

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 398bd2237f..b941952 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -355,12 +355,37 @@ _zero_cipher_left_\@:
ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
movdqu %xmm0, PBlockEncKey(%arg2)
 
+   cmp $16, %arg5
+   jge _large_enough_update_\@
+
lea (%arg4,%r11,1), %r10
mov %r13, %r12
READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1
+   jmp _data_read_\@
+
+_large_enough_update_\@:
+   sub $16, %r11
+   add %r13, %r11
+
+   # receive the last <16 Byte block
+   movdqu  (%arg4, %r11, 1), %xmm1
 
+   sub %r13, %r11
+   add $16, %r11
+
+   lea SHIFT_MASK+16(%rip), %r12
+   # adjust the shuffle mask pointer to be able to shift 16-r13 bytes
+   # (r13 is the number of bytes in plaintext mod 16)
+   sub %r13, %r12
+   # get the appropriate shuffle mask
+   movdqu  (%r12), %xmm2
+   # shift right 16-r13 bytes
+   PSHUFB_XMM  %xmm2, %xmm1
+
+_data_read_\@:
lea ALL_F+16(%rip), %r12
sub %r13, %r12
+
 .ifc \operation, dec
movdqa  %xmm1, %xmm2
 .endif
-- 
2.9.5



[PATCH 10/14] x86/crypto: aesni: Move HashKey computation from stack to gcm_context

2018-02-12 Thread Dave Watson
HashKey computation only needs to happen once per scatter/gather operation,
save it between calls in gcm_context struct instead of on the stack.
Since the asm no longer stores anything on the stack, we can use
%rsp directly, and clean up the frame save/restore macros a bit.

Hashkeys actually only need to be calculated once per key and could
be moved to when set_key is called, however, the current glue code
falls back to generic aes code if fpu is disabled.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_asm.S | 205 --
 1 file changed, 106 insertions(+), 99 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 37b1cee..3ada06b 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -93,23 +93,6 @@ ALL_F:  .octa 0x
 
 
 #defineSTACK_OFFSET8*3
-#defineHashKey 16*0// store HashKey <<1 mod poly here
-#defineHashKey_2   16*1// store HashKey^2 <<1 mod poly here
-#defineHashKey_3   16*2// store HashKey^3 <<1 mod poly here
-#defineHashKey_4   16*3// store HashKey^4 <<1 mod poly here
-#defineHashKey_k   16*4// store XOR of High 64 bits and Low 64
-   // bits of  HashKey <<1 mod poly here
-   //(for Karatsuba purposes)
-#defineHashKey_2_k 16*5// store XOR of High 64 bits and Low 64
-   // bits of  HashKey^2 <<1 mod poly here
-   // (for Karatsuba purposes)
-#defineHashKey_3_k 16*6// store XOR of High 64 bits and Low 64
-   // bits of  HashKey^3 <<1 mod poly here
-   // (for Karatsuba purposes)
-#defineHashKey_4_k 16*7// store XOR of High 64 bits and Low 64
-   // bits of  HashKey^4 <<1 mod poly here
-   // (for Karatsuba purposes)
-#defineVARIABLE_OFFSET 16*8
 
 #define AadHash 16*0
 #define AadLen 16*1
@@ -118,6 +101,22 @@ ALL_F:  .octa 0x
 #define OrigIV 16*3
 #define CurCount 16*4
 #define PBlockLen 16*5
+#defineHashKey 16*6// store HashKey <<1 mod poly here
+#defineHashKey_2   16*7// store HashKey^2 <<1 mod poly here
+#defineHashKey_3   16*8// store HashKey^3 <<1 mod poly here
+#defineHashKey_4   16*9// store HashKey^4 <<1 mod poly here
+#defineHashKey_k   16*10   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey <<1 mod poly here
+   //(for Karatsuba purposes)
+#defineHashKey_2_k 16*11   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey^2 <<1 mod poly here
+   // (for Karatsuba purposes)
+#defineHashKey_3_k 16*12   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey^3 <<1 mod poly here
+   // (for Karatsuba purposes)
+#defineHashKey_4_k 16*13   // store XOR of High 64 bits and Low 64
+   // bits of  HashKey^4 <<1 mod poly here
+   // (for Karatsuba purposes)
 
 #define arg1 rdi
 #define arg2 rsi
@@ -125,11 +124,11 @@ ALL_F:  .octa 0x
 #define arg4 rcx
 #define arg5 r8
 #define arg6 r9
-#define arg7 STACK_OFFSET+8(%r14)
-#define arg8 STACK_OFFSET+16(%r14)
-#define arg9 STACK_OFFSET+24(%r14)
-#define arg10 STACK_OFFSET+32(%r14)
-#define arg11 STACK_OFFSET+40(%r14)
+#define arg7 STACK_OFFSET+8(%rsp)
+#define arg8 STACK_OFFSET+16(%rsp)
+#define arg9 STACK_OFFSET+24(%rsp)
+#define arg10 STACK_OFFSET+32(%rsp)
+#define arg11 STACK_OFFSET+40(%rsp)
 #define keysize 2*15*16(%arg1)
 #endif
 
@@ -183,28 +182,79 @@ ALL_F:  .octa 0x
push%r12
push%r13
push%r14
-   mov %rsp, %r14
 #
 # states of %xmm registers %xmm6:%xmm15 not saved
 # all %xmm registers are clobbered
 #
-   sub $VARIABLE_OFFSET, %rsp
-   and $~63, %rsp
 .endm
 
 
 .macro FUNC_RESTORE
-   mov %r14, %rsp
pop %r14
pop %r13
pop %r12
 .endm
 
+# Precompute hashkeys.
+# Input: Hash subkey.
+# Output: HashKeys stored in gcm_context_data.  Only needs to be called
+# once per key.
+# clobbers r12, and tmp xmm registers.
+.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
+   mov arg7, %r12
+   movdqu  (%r12), \TMP3
+   movdqa  SHUF_MASK(%rip), \TMP2
+   PSHUFB_XMM \TMP2, \TMP3
+
+   # precompute HashKey<<1 mod poly from t

[PATCH 11/14] x86/crypto: aesni: Introduce partial block macro

2018-02-12 Thread Dave Watson
Before this diff, multiple calls to GCM_ENC_DEC will
succeed, but only if all calls are a multiple of 16 bytes.

Handle partial blocks at the start of GCM_ENC_DEC, and update
aadhash as appropriate.

The data offset %r11 is also updated after the partial block.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_asm.S | 151 +-
 1 file changed, 150 insertions(+), 1 deletion(-)
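
Most of the new macro is length bookkeeping: top up whatever partial block the
previous update call left behind, fold it into the hash once it reaches 16
bytes, and report how many input bytes were consumed so the caller can advance
the data offset.  With the encrypt/GHASH work stripped out, that bookkeeping
looks roughly like the standalone sketch below (names invented for
illustration):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct partial_state {
	uint8_t block[16];       /* buffered bytes of an unfinished block */
	size_t  partial_len;     /* like PBlockLen(%arg2) */
};

/* Stand-in for "encrypt/GHASH one completed 16-byte block". */
static void process_block(const uint8_t block[16])
{
	(void)block;
}

/* Consume input to top up a previously buffered partial block.  Returns how
 * many bytes of in[] were used, which is what PARTIAL_BLOCK reports back via
 * the data offset so GCM_ENC_DEC only sees the remaining bytes. */
static size_t consume_partial(struct partial_state *st, const uint8_t *in, size_t len)
{
	size_t need, take;

	if (st->partial_len == 0)
		return 0;                     /* no partial block pending */

	need = 16 - st->partial_len;
	take = len < need ? len : need;
	memcpy(st->block + st->partial_len, in, take);
	st->partial_len += take;

	if (st->partial_len == 16) {          /* block completed: hash it */
		process_block(st->block);
		st->partial_len = 0;
	}
	return take;                          /* caller: in += take, len -= take */
}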

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 3ada06b..398bd2237f 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -284,7 +284,13 @@ ALL_F:  .octa 0x
movdqu AadHash(%arg2), %xmm8
movdqu HashKey(%arg2), %xmm13
add %arg5, InLen(%arg2)
+
+   xor %r11, %r11 # initialise the data pointer offset as zero
+   PARTIAL_BLOCK %arg3 %arg4 %arg5 %r11 %xmm8 \operation
+
+   sub %r11, %arg5 # sub partial block data used
mov %arg5, %r13 # save the number of bytes
+
and $-16, %r13  # %r13 = %r13 - (%r13 mod 16)
mov %r13, %r12
# Encrypt/Decrypt first few blocks
@@ -605,6 +611,150 @@ _get_AAD_done\@:
movdqu \TMP6, AadHash(%arg2)
 .endm
 
+# PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks
+# between update calls.
+# Requires the input data be at least 1 byte long due to READ_PARTIAL_BLOCK
+# Outputs encrypted bytes, and updates hash and partial info in 
gcm_data_context
+# Clobbers rax, r10, r12, r13, xmm0-6, xmm9-13
+.macro PARTIAL_BLOCK CYPH_PLAIN_OUT PLAIN_CYPH_IN PLAIN_CYPH_LEN DATA_OFFSET \
+   AAD_HASH operation
+   mov PBlockLen(%arg2), %r13
+   cmp $0, %r13
+   je  _partial_block_done_\@  # Leave Macro if no partial blocks
+   # Read in input data without over reading
+   cmp $16, \PLAIN_CYPH_LEN
+   jl  _fewer_than_16_bytes_\@
+   movups  (\PLAIN_CYPH_IN), %xmm1 # If more than 16 bytes, just fill xmm
+   jmp _data_read_\@
+
+_fewer_than_16_bytes_\@:
+   lea (\PLAIN_CYPH_IN, \DATA_OFFSET, 1), %r10
+   mov \PLAIN_CYPH_LEN, %r12
+   READ_PARTIAL_BLOCK %r10 %r12 %xmm0 %xmm1
+
+   mov PBlockLen(%arg2), %r13
+
+_data_read_\@: # Finished reading in data
+
+   movdqu  PBlockEncKey(%arg2), %xmm9
+   movdqu  HashKey(%arg2), %xmm13
+
+   lea SHIFT_MASK(%rip), %r12
+
+   # adjust the shuffle mask pointer to be able to shift r13 bytes
+   # (16-r13 is the number of bytes in plaintext mod 16)
+   add %r13, %r12
+   movdqu  (%r12), %xmm2   # get the appropriate shuffle mask
+   PSHUFB_XMM %xmm2, %xmm9 # shift right r13 bytes
+
+.ifc \operation, dec
+   movdqa  %xmm1, %xmm3
+   pxor%xmm1, %xmm9# Cyphertext XOR E(K, Yn)
+
+   mov \PLAIN_CYPH_LEN, %r10
+   add %r13, %r10
+   # Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling
+   sub $16, %r10
+   # Determine if the partial block is not being filled and
+   # shift mask accordingly
+   jge _no_extra_mask_1_\@
+   sub %r10, %r12
+_no_extra_mask_1_\@:
+
+   movdqu  ALL_F-SHIFT_MASK(%r12), %xmm1
+   # get the appropriate mask to mask out bottom r13 bytes of xmm9
+   pand%xmm1, %xmm9# mask out bottom r13 bytes of xmm9
+
+   pand%xmm1, %xmm3
+   movdqa  SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM  %xmm10, %xmm3
+   PSHUFB_XMM  %xmm2, %xmm3
+   pxor%xmm3, \AAD_HASH
+
+   cmp $0, %r10
+   jl  _partial_incomplete_1_\@
+
+   # GHASH computation for the last <16 Byte block
+   GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+   xor %rax,%rax
+
+   mov %rax, PBlockLen(%arg2)
+   jmp _dec_done_\@
+_partial_incomplete_1_\@:
+   add \PLAIN_CYPH_LEN, PBlockLen(%arg2)
+_dec_done_\@:
+   movdqu  \AAD_HASH, AadHash(%arg2)
+.else
+   pxor%xmm1, %xmm9# Plaintext XOR E(K, Yn)
+
+   mov \PLAIN_CYPH_LEN, %r10
+   add %r13, %r10
+   # Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling
+   sub $16, %r10
+   # Determine if the partial block is not being filled and
+   # shift mask accordingly
+   jge _no_extra_mask_2_\@
+   sub %r10, %r12
+_no_extra_mask_2_\@:
+
+   movdqu  ALL_F-SHIFT_MASK(%r12), %xmm1
+   # get the appropriate mask to mask out bottom r13 bytes of xmm9
+   pand%xmm1, %xmm9
+
+   movdqa  SHUF_MASK(%rip), %xmm1
+   PSHUFB_XMM %xmm1, %xmm9
+   PSHUFB_XMM %xmm2, %xmm9
+   pxor%xmm9, \AAD_HASH
+
+   cmp $0, %r10
+   jl  _partial_incomplete_2_\@
+
+   # GHASH computation for the last <16 Byte block
+   GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+   xor %ra

[PATCH 08/14] x86/crypto: aesni: Fill in new context data structures

2018-02-12 Thread Dave Watson
Fill in aadhash, aadlen, pblocklen, curcount with appropriate values.
pblocklen, aadhash, and pblockenckey are also updated at the end
of each scatter/gather operation, to be carried over to the next
operation.

Signed-off-by: Dave Watson 
---
 arch/x86/crypto/aesni-intel_asm.S | 51 ++-
 1 file changed, 39 insertions(+), 12 deletions(-)
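
The stores this patch adds to GCM_INIT are easier to read as the C they
correspond to: record the AAD length, zero the running counters, keep the
original IV, and keep a byte-reflected copy as the working counter (the PSHUFB
with SHUF_MASK is a 16-byte byte reversal).  A rough standalone equivalent,
using an illustrative context struct rather than the real gcm_context_data:

#include <stdint.h>
#include <string.h>

struct gcm_ctx_sketch {            /* illustrative layout, see the 06/14 note */
	uint8_t  aad_hash[16];
	uint64_t aad_length;
	uint64_t in_length;
	uint8_t  partial_block_enc_key[16];
	uint8_t  orig_iv[16];
	uint8_t  current_counter[16];
	uint64_t partial_block_len;
};

/* What the new GCM_INIT stores amount to, in C. */
static void gcm_init_fill(struct gcm_ctx_sketch *ctx, const uint8_t iv[16],
			  uint64_t aad_len)
{
	ctx->aad_length = aad_len;                 /* mov arg9 -> AadLen      */
	ctx->in_length = 0;                        /* InLen = 0               */
	ctx->partial_block_len = 0;                /* PBlockLen = 0           */
	memset(ctx->partial_block_enc_key, 0, 16); /* PBlockEncKey cleared    */
	memcpy(ctx->orig_iv, iv, 16);              /* OrigIV = iv             */

	/* CurCount = byte-reflected iv (PSHUFB with SHUF_MASK). */
	for (int i = 0; i < 16; i++)
		ctx->current_counter[i] = iv[15 - i];
}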

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 58bbfac..aa82493 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -204,6 +204,21 @@ ALL_F:  .octa 0x
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
 .macro GCM_INIT
+
+   mov arg9, %r11
+   mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length
+   xor %r11, %r11
+   mov %r11, InLen(%arg2) # ctx_data.in_length = 0
+   mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0
+   mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0
+   mov %arg6, %rax
+   movdqu (%rax), %xmm0
+   movdqu %xmm0, OrigIV(%arg2) # ctx_data.orig_IV = iv
+
+   movdqa  SHUF_MASK(%rip), %xmm2
+   PSHUFB_XMM %xmm2, %xmm0
+   movdqu %xmm0, CurCount(%arg2) # ctx_data.current_counter = iv
+
mov arg7, %r12
movdqu  (%r12), %xmm13
movdqa  SHUF_MASK(%rip), %xmm2
@@ -226,13 +241,9 @@ ALL_F:  .octa 0x
pandPOLY(%rip), %xmm2
pxor%xmm2, %xmm13
movdqa  %xmm13, HashKey(%rsp)
-   mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly)
-   and $-16, %r13
-   mov %r13, %r12
 
CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
%xmm5 %xmm6
-   mov %r13, %r12
 .endm
 
 # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
@@ -240,6 +251,12 @@ ALL_F:  .octa 0x
 # Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK
 # Clobbers rax, r10-r13, and xmm0-xmm15
 .macro GCM_ENC_DEC operation
+   movdqu AadHash(%arg2), %xmm8
+   movdqu HashKey(%rsp), %xmm13
+   add %arg5, InLen(%arg2)
+   mov %arg5, %r13 # save the number of bytes
+   and $-16, %r13  # %r13 = %r13 - (%r13 mod 16)
+   mov %r13, %r12
# Encrypt/Decrypt first few blocks
 
and $(3<<4), %r12
@@ -284,16 +301,23 @@ _four_cipher_left_\@:
GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
 %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
 _zero_cipher_left_\@:
+   movdqu %xmm8, AadHash(%arg2)
+   movdqu %xmm0, CurCount(%arg2)
+
mov %arg5, %r13
and $15, %r13   # %r13 = arg5 (mod 16)
je  _multiple_of_16_bytes_\@
 
+   mov %r13, PBlockLen(%arg2)
+
# Handle the last <16 Byte block separately
paddd ONE(%rip), %xmm0# INCR CNT to get Yn
+   movdqu %xmm0, CurCount(%arg2)
movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm0
 
ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn)
+   movdqu %xmm0, PBlockEncKey(%arg2)
 
lea (%arg4,%r11,1), %r10
mov %r13, %r12
@@ -322,6 +346,7 @@ _zero_cipher_left_\@:
 .endif
 
GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+   movdqu %xmm8, AadHash(%arg2)
 .ifc \operation, enc
# GHASH computation for the last <16 byte block
movdqa SHUF_MASK(%rip), %xmm10
@@ -351,11 +376,15 @@ _multiple_of_16_bytes_\@:
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
 .macro GCM_COMPLETE
-   mov arg9, %r12# %r13 = aadLen (number of bytes)
+   movdqu AadHash(%arg2), %xmm8
+   movdqu HashKey(%rsp), %xmm13
+   mov AadLen(%arg2), %r12  # %r13 = aadLen (number of bytes)
shl $3, %r12  # convert into number of bits
movd%r12d, %xmm15 # len(A) in %xmm15
-   shl $3, %arg5 # len(C) in bits (*128)
-   MOVQ_R64_XMM%arg5, %xmm1
+   mov InLen(%arg2), %r12
+   shl $3, %r12  # len(C) in bits (*128)
+   MOVQ_R64_XMM%r12, %xmm1
+
pslldq  $8, %xmm15# %xmm15 = len(A)||0x
pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
pxor%xmm15, %xmm8
@@ -364,8 +393,7 @@ _multiple_of_16_bytes_\@:
movdqa SHUF_MASK(%rip), %xmm10
PSHUFB_XMM %xmm10, %xmm8
 
-   mov %arg6, %rax   # %rax = *Y0
-   movdqu  (%rax), %xmm0 # %xmm0 = Y0
+   movdqu OrigIV(%arg2), %xmm0   # %xmm0 = Y0
ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
pxor%xmm8, %xmm0
 _return_T_\@:
@@ -553,15 +581,14 @@ _get_AAD

[PATCH 09/14] x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE

2018-02-12 Thread Dave Watson
Prepare to handle partial blocks between scatter/gather calls.
For the last partial block, we only want to calculate the aadhash
in GCM_COMPLETE, and a new partial block macro will handle both
aadhash update and encrypting partial blocks between calls.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index aa82493..37b1cee 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -345,7 +345,6 @@ _zero_cipher_left_\@:
pxor%xmm0, %xmm8
 .endif
 
-   GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
movdqu %xmm8, AadHash(%arg2)
 .ifc \operation, enc
# GHASH computation for the last <16 byte block
@@ -378,6 +377,15 @@ _multiple_of_16_bytes_\@:
 .macro GCM_COMPLETE
movdqu AadHash(%arg2), %xmm8
movdqu HashKey(%rsp), %xmm13
+
+   mov PBlockLen(%arg2), %r12
+
+   cmp $0, %r12
+   je _partial_done\@
+
+   GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+
+_partial_done\@:
mov AadLen(%arg2), %r12  # %r13 = aadLen (number of bytes)
shl $3, %r12  # convert into number of bits
movd%r12d, %xmm15 # len(A) in %xmm15
-- 
2.9.5



[PATCH 07/14] x86/crypto: aesni: Split AAD hash calculation to separate macro

2018-02-12 Thread Dave Watson
AAD hash only needs to be calculated once for each scatter/gather operation.
Move it to its own macro, and call it from GCM_INIT instead of
INITIAL_BLOCKS.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 71 ---
 1 file changed, 43 insertions(+), 28 deletions(-)
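
Structurally CALC_AAD_HASH is a plain absorb loop: full 16-byte blocks of AAD
first, then the zero-padded tail via READ_PARTIAL_BLOCK, folding each block
into the running hash.  The sketch below keeps that shape but reduces the
per-block fold to a bare XOR so it stays self-contained; the real macro also
byte-reflects each block and multiplies by the hash key (GHASH_MUL) after the
XOR.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold one 16-byte block into the accumulator.  In the real macro this is
 * pxor followed by GHASH_MUL with the hash key; here it is XOR only. */
static void absorb_block(uint8_t acc[16], const uint8_t block[16])
{
	for (int i = 0; i < 16; i++)
		acc[i] ^= block[i];
}

/* Walk the AAD: full 16-byte blocks first, then a zero-padded tail, the
 * _get_AAD_blocks / _get_AAD_rest split in CALC_AAD_HASH. */
static void hash_aad(uint8_t acc[16], const uint8_t *aad, size_t aad_len)
{
	uint8_t tail[16];

	memset(acc, 0, 16);
	while (aad_len >= 16) {
		absorb_block(acc, aad);
		aad += 16;
		aad_len -= 16;
	}
	if (aad_len) {
		memset(tail, 0, 16);      /* tail is zero-padded, as GHASH needs */
		memcpy(tail, aad, aad_len);
		absorb_block(acc, tail);
	}
	/* The result is stored once into AadHash(%arg2) and reused by every
	 * subsequent update call. */
}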

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 6c5a80d..58bbfac 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -229,6 +229,10 @@ ALL_F:  .octa 0x
mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly)
and $-16, %r13
mov %r13, %r12
+
+   CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
+   %xmm5 %xmm6
+   mov %r13, %r12
 .endm
 
 # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
@@ -496,51 +500,62 @@ _read_next_byte_lt8_\@:
 _done_read_partial_block_\@:
 .endm
 
-/*
-* if a = number of total plaintext bytes
-* b = floor(a/16)
-* num_initial_blocks = b mod 4
-* encrypt the initial num_initial_blocks blocks and apply ghash on
-* the ciphertext
-* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers
-* are clobbered
-* arg1, %arg3, %arg4, %r14 are used as a pointer only, not modified
-*/
-
-
-.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \
-XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
-MOVADQ SHUF_MASK(%rip), %xmm14
-   mov    arg8, %r10   # %r10 = AAD
-   mov    arg9, %r11   # %r11 = aadLen
-   pxor   %xmm\i, %xmm\i
-   pxor   \XMM2, \XMM2
+# CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
+# clobbers r10-11, xmm14
+.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \
+   TMP6 TMP7
+   MOVADQ SHUF_MASK(%rip), %xmm14
+   mov    arg8, %r10   # %r10 = AAD
+   mov    arg9, %r11   # %r11 = aadLen
+   pxor   \TMP7, \TMP7
+   pxor   \TMP6, \TMP6
 
cmp$16, %r11
jl _get_AAD_rest\@
 _get_AAD_blocks\@:
-   movdqu (%r10), %xmm\i
-   PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   %xmm\i, \XMM2
-   GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
+   movdqu (%r10), \TMP7
+   PSHUFB_XMM   %xmm14, \TMP7 # byte-reflect the AAD data
+   pxor   \TMP7, \TMP6
+   GHASH_MUL  \TMP6, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5
add$16, %r10
sub$16, %r11
cmp$16, %r11
jge_get_AAD_blocks\@
 
-   movdqu \XMM2, %xmm\i
+   movdqu \TMP6, \TMP7
 
/* read the last <16B of AAD */
 _get_AAD_rest\@:
cmp$0, %r11
je _get_AAD_done\@
 
-   READ_PARTIAL_BLOCK %r10, %r11, \TMP1, %xmm\i
-   PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   \XMM2, %xmm\i
-   GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
+   READ_PARTIAL_BLOCK %r10, %r11, \TMP1, \TMP7
+   PSHUFB_XMM   %xmm14, \TMP7 # byte-reflect the AAD data
+   pxor   \TMP6, \TMP7
+   GHASH_MUL  \TMP7, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5
+   movdqu \TMP7, \TMP6
 
 _get_AAD_done\@:
+   movdqu \TMP6, AadHash(%arg2)
+.endm
+
+/*
+* if a = number of total plaintext bytes
+* b = floor(a/16)
+* num_initial_blocks = b mod 4
+* encrypt the initial num_initial_blocks blocks and apply ghash on
+* the ciphertext
+* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers
+* are clobbered
+* arg1, %arg2, %arg3, %r14 are used as a pointer only, not modified
+*/
+
+
+.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \
+   XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
+
+   movdqu AadHash(%arg2), %xmm\i   # XMM0 = Y0
+
xor%r11, %r11 # initialise the data pointer offset as zero
# start AES for num_initial_blocks blocks
 
-- 
2.9.5



[PATCH 00/14] x86/crypto gcmaes SSE scatter/gather support

2018-02-12 Thread Dave Watson
This patch set refactors the x86 aes/gcm SSE crypto routines to
support true scatter/gather by adding gcm_enc/dec_update methods.

The layout is:

* First 5 patches refactor the code to use macros, so changes only
  need to be applied once for encode and decode.  There should be no
  functional changes.

* The next 6 patches introduce a gcm_context structure to be passed
  between scatter/gather calls to maintain state.  The struct is also
  used as scratch space for the existing enc/dec routines.

* The last 2 set up the asm function entry points for scatter gather
  support, and then call the new routines per buffer in the passed in
  sglist in aesni-intel_glue.

Testing: 
asm itself fuzz tested vs. existing code and isa-l asm.
Ran libkcapi test suite, passes.
Passes my TLS tests.
IPSec or testing of other aesni users would be appreciated.
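
To make the intended call flow concrete, here is a rough sketch of how the
glue code can drive the new entry points once per scatterlist element (init
once, update per buffer, finalize once).  The function and struct names are
illustrative; the real patches walk the scatterlist with the scatterwalk
helpers and handle partial blocks between updates:

/* Sketch only: encrypt an already-mapped sglist in place.  Error
 * handling, scatterwalk mapping and partial-block handling omitted. */
static void gcmaes_encrypt_sg_sketch(void *aes_ctx, struct scatterlist *sgl,
                                     int nents, u8 *iv, u8 *hash_subkey,
                                     u8 *aad, unsigned long aad_len,
                                     u8 *auth_tag, unsigned long tag_len)
{
        struct gcm_context_data data;   /* per-request state from this series */
        struct scatterlist *sg;
        int i;

        aesni_gcm_init(aes_ctx, &data, iv, hash_subkey, aad, aad_len);
        for_each_sg(sgl, sg, nents, i)
                aesni_gcm_enc_update(aes_ctx, &data, sg_virt(sg),
                                     sg_virt(sg), sg->length);
        aesni_gcm_finalize(aes_ctx, &data, auth_tag, tag_len);
}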

perf of a large (16k messages) TLS sends sg vs. no sg:

no-sg

33287255597  cycles  
53702871176  instructions

43.47%   _crypt_by_4
17.83%   memcpy
16.36%   aes_loop_par_enc_done

sg

27568944591  cycles 
54580446678  instructions

49.87%   _crypt_by_4
17.40%   aes_loop_par_enc_done
1.79%aes_loop_initial_5416
1.52%aes_loop_initial_4974
1.27%gcmaes_encrypt_sg.constprop.15


Dave Watson (14):
  x86/crypto: aesni: Merge INITIAL_BLOCKS_ENC/DEC
  x86/crypto: aesni: Macro-ify func save/restore
  x86/crypto: aesni: Add GCM_INIT macro
  x86/crypto: aesni: Add GCM_COMPLETE macro
  x86/crypto: aesni: Merge encode and decode to GCM_ENC_DEC macro
  x86/crypto: aesni: Introduce gcm_context_data
  x86/crypto: aesni: Split AAD hash calculation to separate macro
  x86/crypto: aesni: Fill in new context data structures
  x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE
  x86/crypto: aesni: Move HashKey computation from stack to gcm_context
  x86/crypto: aesni: Introduce partial block macro
  x86/crypto: aesni: Add fast path for > 16 byte update
  x86/crypto: aesni: Introduce scatter/gather asm function stubs
  x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather

 arch/x86/crypto/aesni-intel_asm.S  | 1414 ++--
 arch/x86/crypto/aesni-intel_glue.c |  263 ++-
 2 files changed, 932 insertions(+), 745 deletions(-)

-- 
2.9.5



[PATCH 04/14] x86/crypto: aesni: Add GCM_COMPLETE macro

2018-02-12 Thread Dave Watson
Merge encode and decode tag calculations in GCM_COMPLETE macro.
Scatter/gather routines will call this once at the end of encryption
or decryption.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 172 ++
 1 file changed, 63 insertions(+), 109 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index b9fe2ab..529c542 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -222,6 +222,67 @@ ALL_F:  .octa 0x
mov %r13, %r12
 .endm
 
+# GCM_COMPLETE Finishes update of tag of last partial block
+# Output: Authorization Tag (AUTH_TAG)
+# Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
+.macro GCM_COMPLETE
+   mov arg8, %r12# %r13 = aadLen (number of bytes)
+   shl $3, %r12  # convert into number of bits
+   movd%r12d, %xmm15 # len(A) in %xmm15
+   shl $3, %arg4 # len(C) in bits (*128)
+   MOVQ_R64_XMM%arg4, %xmm1
+   pslldq  $8, %xmm15# %xmm15 = len(A)||0x
+   pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
+   pxor%xmm15, %xmm8
+   GHASH_MUL   %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+   # final GHASH computation
+   movdqa SHUF_MASK(%rip), %xmm10
+   PSHUFB_XMM %xmm10, %xmm8
+
+   mov %arg5, %rax   # %rax = *Y0
+   movdqu  (%rax), %xmm0 # %xmm0 = Y0
+   ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
+   pxor%xmm8, %xmm0
+_return_T_\@:
+   mov arg9, %r10 # %r10 = authTag
+   mov arg10, %r11# %r11 = auth_tag_len
+   cmp $16, %r11
+   je  _T_16_\@
+   cmp $8, %r11
+   jl  _T_4_\@
+_T_8_\@:
+   MOVQ_R64_XMM%xmm0, %rax
+   mov %rax, (%r10)
+   add $8, %r10
+   sub $8, %r11
+   psrldq  $8, %xmm0
+   cmp $0, %r11
+   je  _return_T_done_\@
+_T_4_\@:
+   movd%xmm0, %eax
+   mov %eax, (%r10)
+   add $4, %r10
+   sub $4, %r11
+   psrldq  $4, %xmm0
+   cmp $0, %r11
+   je  _return_T_done_\@
+_T_123_\@:
+   movd%xmm0, %eax
+   cmp $2, %r11
+   jl  _T_1_\@
+   mov %ax, (%r10)
+   cmp $2, %r11
+   je  _return_T_done_\@
+   add $2, %r10
+   sar $16, %eax
+_T_1_\@:
+   mov %al, (%r10)
+   jmp _return_T_done_\@
+_T_16_\@:
+   movdqu  %xmm0, (%r10)
+_return_T_done_\@:
+.endm
+
 #ifdef __x86_64__
 /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
 *
@@ -1271,61 +1332,7 @@ _less_than_8_bytes_left_decrypt:
sub $1, %r13
jne _less_than_8_bytes_left_decrypt
 _multiple_of_16_bytes_decrypt:
-   mov arg8, %r12# %r13 = aadLen (number of bytes)
-   shl $3, %r12  # convert into number of bits
-   movd%r12d, %xmm15 # len(A) in %xmm15
-   shl $3, %arg4 # len(C) in bits (*128)
-   MOVQ_R64_XMM%arg4, %xmm1
-   pslldq  $8, %xmm15# %xmm15 = len(A)||0x
-   pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C)
-   pxor%xmm15, %xmm8
-   GHASH_MUL   %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
-# final GHASH computation
-movdqa SHUF_MASK(%rip), %xmm10
-   PSHUFB_XMM %xmm10, %xmm8
-
-   mov %arg5, %rax   # %rax = *Y0
-   movdqu  (%rax), %xmm0 # %xmm0 = Y0
-   ENCRYPT_SINGLE_BLOCK%xmm0,  %xmm1 # E(K, Y0)
-   pxor%xmm8, %xmm0
-_return_T_decrypt:
-   mov arg9, %r10# %r10 = authTag
-   mov arg10, %r11   # %r11 = auth_tag_len
-   cmp $16, %r11
-   je  _T_16_decrypt
-   cmp $8, %r11
-   jl  _T_4_decrypt
-_T_8_decrypt:
-   MOVQ_R64_XMM%xmm0, %rax
-   mov %rax, (%r10)
-   add $8, %r10
-   sub $8, %r11
-   psrldq  $8, %xmm0
-   cmp $0, %r11
-   je  _return_T_done_decrypt
-_T_4_decrypt:
-   movd%xmm0, %eax
-   mov %eax, (%r10)
-   add $4, %r10
-   sub $4, %r11
-   psrldq  $4, %xmm0
-   cmp $0, %r11
-   je  _return_T_done_decrypt
-_T_123_decrypt:
-   movd%xmm0, %eax
-   cmp $2, %r11
-   jl  _T_1_decrypt
-   mov %ax, (%r10)
-   cmp $2, %r11
-   je  _return_T_done_decrypt
-   add $2, %r10
-   sar $16, %eax
-_T_1_decrypt:
-   mov %al, (%r10)
-   jmp _return_T_done_decrypt
-_T_16_decrypt:
-   movdqu  %xmm0, (%r10)
-_return_T_done_decrypt:
+   GCM_COMPLETE
FUNC_RESTORE
ret
 ENDPROC(aesni_gcm_dec)

[PATCH 03/14] x86/crypto: aesni: Add GCM_INIT macro

2018-02-12 Thread Dave Watson
Reduce code duplication by introducing GCM_INIT macro.  This macro
will also be exposed as a function for implementing scatter/gather
support, since INIT only needs to be called once for the full
operation.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 84 +++
 1 file changed, 33 insertions(+), 51 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 39b42b1..b9fe2ab 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -191,6 +191,37 @@ ALL_F:  .octa 0x
pop %r12
 .endm
 
+
+# GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
+# Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
+.macro GCM_INIT
+   mov %arg6, %r12
+   movdqu  (%r12), %xmm13
+   movdqa  SHUF_MASK(%rip), %xmm2
+   PSHUFB_XMM %xmm2, %xmm13
+
+   # precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
+
+   movdqa  %xmm13, %xmm2
+   psllq   $1, %xmm13
+   psrlq   $63, %xmm2
+   movdqa  %xmm2, %xmm1
+   pslldq  $8, %xmm2
+   psrldq  $8, %xmm1
+   por %xmm2, %xmm13
+
+   # reduce HashKey<<1
+
+   pshufd  $0x24, %xmm1, %xmm2
+   pcmpeqd TWOONE(%rip), %xmm2
+   pandPOLY(%rip), %xmm2
+   pxor%xmm2, %xmm13
+   movdqa  %xmm13, HashKey(%rsp)
+   mov %arg4, %r13 # %xmm13 holds HashKey<<1 (mod 
poly)
+   and $-16, %r13
+   mov %r13, %r12
+.endm
+
 #ifdef __x86_64__
 /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
 *
@@ -1151,36 +1182,11 @@ _esb_loop_\@:
 */
 ENTRY(aesni_gcm_dec)
FUNC_SAVE
-   mov %arg6, %r12
-   movdqu  (%r12), %xmm13# %xmm13 = HashKey
-movdqa  SHUF_MASK(%rip), %xmm2
-   PSHUFB_XMM %xmm2, %xmm13
-
-
-# Precompute HashKey<<1 (mod poly) from the hash key (required for GHASH)
-
-   movdqa  %xmm13, %xmm2
-   psllq   $1, %xmm13
-   psrlq   $63, %xmm2
-   movdqa  %xmm2, %xmm1
-   pslldq  $8, %xmm2
-   psrldq  $8, %xmm1
-   por %xmm2, %xmm13
-
-# Reduction
-
-   pshufd  $0x24, %xmm1, %xmm2
-   pcmpeqd TWOONE(%rip), %xmm2
-   pandPOLY(%rip), %xmm2
-   pxor%xmm2, %xmm13 # %xmm13 holds the HashKey<<1 (mod poly)
 
+   GCM_INIT
 
 # Decrypt first few blocks
 
-   movdqa %xmm13, HashKey(%rsp)   # store HashKey<<1 (mod poly)
-   mov %arg4, %r13# save the number of bytes of plaintext/ciphertext
-   and $-16, %r13  # %r13 = %r13 - (%r13 mod 16)
-   mov %r13, %r12
and $(3<<4), %r12
jz _initial_num_blocks_is_0_decrypt
cmp $(2<<4), %r12
@@ -1402,32 +1408,8 @@ ENDPROC(aesni_gcm_dec)
 ***/
 ENTRY(aesni_gcm_enc)
FUNC_SAVE
-   mov %arg6, %r12
-   movdqu  (%r12), %xmm13
-movdqa  SHUF_MASK(%rip), %xmm2
-   PSHUFB_XMM %xmm2, %xmm13
-
-# precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
-
-   movdqa  %xmm13, %xmm2
-   psllq   $1, %xmm13
-   psrlq   $63, %xmm2
-   movdqa  %xmm2, %xmm1
-   pslldq  $8, %xmm2
-   psrldq  $8, %xmm1
-   por %xmm2, %xmm13
-
-# reduce HashKey<<1
-
-   pshufd  $0x24, %xmm1, %xmm2
-   pcmpeqd TWOONE(%rip), %xmm2
-   pandPOLY(%rip), %xmm2
-   pxor%xmm2, %xmm13
-   movdqa  %xmm13, HashKey(%rsp)
-   mov %arg4, %r13# %xmm13 holds HashKey<<1 (mod poly)
-   and $-16, %r13
-   mov %r13, %r12
 
+   GCM_INIT
 # Encrypt first few blocks
 
and $(3<<4), %r12
-- 
2.9.5



[PATCH 01/14] x86/crypto: aesni: Merge INITIAL_BLOCKS_ENC/DEC

2018-02-12 Thread Dave Watson
Use macro operations to merge implementations of INITIAL_BLOCKS,
since they differ by only a small handful of lines.

Use macro counter \@ to simplify implementation.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 298 ++
 1 file changed, 48 insertions(+), 250 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 76d8cd4..48911fe 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -275,234 +275,7 @@ _done_read_partial_block_\@:
 */
 
 
-.macro INITIAL_BLOCKS_DEC num_initial_blocks TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 
XMM1 \
-XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
-MOVADQ SHUF_MASK(%rip), %xmm14
-   movarg7, %r10   # %r10 = AAD
-   movarg8, %r11   # %r11 = aadLen
-   pxor   %xmm\i, %xmm\i
-   pxor   \XMM2, \XMM2
-
-   cmp$16, %r11
-   jl _get_AAD_rest\num_initial_blocks\operation
-_get_AAD_blocks\num_initial_blocks\operation:
-   movdqu (%r10), %xmm\i
-   PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   %xmm\i, \XMM2
-   GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-   add$16, %r10
-   sub$16, %r11
-   cmp$16, %r11
-   jge_get_AAD_blocks\num_initial_blocks\operation
-
-   movdqu \XMM2, %xmm\i
-
-   /* read the last <16B of AAD */
-_get_AAD_rest\num_initial_blocks\operation:
-   cmp$0, %r11
-   je _get_AAD_done\num_initial_blocks\operation
-
-   READ_PARTIAL_BLOCK %r10, %r11, \TMP1, %xmm\i
-   PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
-   pxor   \XMM2, %xmm\i
-   GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-
-_get_AAD_done\num_initial_blocks\operation:
-   xor%r11, %r11 # initialise the data pointer offset as zero
-   # start AES for num_initial_blocks blocks
-
-   mov%arg5, %rax  # %rax = *Y0
-   movdqu (%rax), \XMM0# XMM0 = Y0
-   PSHUFB_XMM   %xmm14, \XMM0
-
-.if (\i == 5) || (\i == 6) || (\i == 7)
-   MOVADQ  ONE(%RIP),\TMP1
-   MOVADQ  (%arg1),\TMP2
-.irpc index, \i_seq
-   paddd  \TMP1, \XMM0 # INCR Y0
-   movdqa \XMM0, %xmm\index
-   PSHUFB_XMM   %xmm14, %xmm\index  # perform a 16 byte swap
-   pxor   \TMP2, %xmm\index
-.endr
-   lea 0x10(%arg1),%r10
-   mov keysize,%eax
-   shr $2,%eax # 128->4, 192->6, 256->8
-   add $5,%eax   # 128->9, 192->11, 256->13
-
-aes_loop_initial_dec\num_initial_blocks:
-   MOVADQ  (%r10),\TMP1
-.irpc  index, \i_seq
-   AESENC  \TMP1, %xmm\index
-.endr
-   add $16,%r10
-   sub $1,%eax
-   jnz aes_loop_initial_dec\num_initial_blocks
-
-   MOVADQ  (%r10), \TMP1
-.irpc index, \i_seq
-   AESENCLAST \TMP1, %xmm\index # Last Round
-.endr
-.irpc index, \i_seq
-   movdqu (%arg3 , %r11, 1), \TMP1
-   pxor   \TMP1, %xmm\index
-   movdqu %xmm\index, (%arg2 , %r11, 1)
-   # write back plaintext/ciphertext for num_initial_blocks
-   add$16, %r11
-
-   movdqa \TMP1, %xmm\index
-   PSHUFB_XMM %xmm14, %xmm\index
-# prepare plaintext/ciphertext for GHASH computation
-.endr
-.endif
-
-# apply GHASH on num_initial_blocks blocks
-
-.if \i == 5
-pxor   %xmm5, %xmm6
-   GHASH_MUL  %xmm6, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-pxor   %xmm6, %xmm7
-   GHASH_MUL  %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-pxor   %xmm7, %xmm8
-   GHASH_MUL  %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-.elseif \i == 6
-pxor   %xmm6, %xmm7
-   GHASH_MUL  %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-pxor   %xmm7, %xmm8
-   GHASH_MUL  %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-.elseif \i == 7
-pxor   %xmm7, %xmm8
-   GHASH_MUL  %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-.endif
-   cmp$64, %r13
-   jl  _initial_blocks_done\num_initial_blocks\operation
-   # no need for precomputed values
-/*
-*
-* Precomputations for HashKey parallel with encryption of first 4 blocks.
-* Haskey_i_k holds XORed values of the low and high parts of the Haskey_i
-*/
-   MOVADQ ONE(%rip), \TMP1
-   paddd  \TMP1, \XMM0  # INCR Y0
-   MOVADQ \XMM0, \XMM1
-   PSHUFB_XMM  %xmm14, \XMM1# perform a 16 byte swap
-
-   paddd  \TMP1, \XMM0  # INCR Y0
-   MOVADQ \XMM0, \XMM2
-   PSHUFB_XMM  %xmm14, \XMM2# perform a 16 byte swap
-
-   paddd  \TMP1, \XMM0  # INCR Y0
-   MOVADQ \XMM0, \XMM3
-   

[PATCH 02/14] x86/crypto: aesni: Macro-ify func save/restore

2018-02-12 Thread Dave Watson
Macro-ify function save and restore.  These will be used in new functions
added for scatter/gather update operations.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 53 ++-
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 48911fe..39b42b1 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -170,6 +170,26 @@ ALL_F:  .octa 0x
 #define TKEYP  T1
 #endif
 
+.macro FUNC_SAVE
+   push%r12
+   push%r13
+   push%r14
+   mov %rsp, %r14
+#
+# states of %xmm registers %xmm6:%xmm15 not saved
+# all %xmm registers are clobbered
+#
+   sub $VARIABLE_OFFSET, %rsp
+   and $~63, %rsp
+.endm
+
+
+.macro FUNC_RESTORE
+   mov %r14, %rsp
+   pop %r14
+   pop %r13
+   pop %r12
+.endm
 
 #ifdef __x86_64__
 /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
@@ -1130,16 +1150,7 @@ _esb_loop_\@:
 *
 */
 ENTRY(aesni_gcm_dec)
-   push%r12
-   push%r13
-   push%r14
-   mov %rsp, %r14
-/*
-* states of %xmm registers %xmm6:%xmm15 not saved
-* all %xmm registers are clobbered
-*/
-   sub $VARIABLE_OFFSET, %rsp
-   and $~63, %rsp# align rsp to 64 bytes
+   FUNC_SAVE
mov %arg6, %r12
movdqu  (%r12), %xmm13# %xmm13 = HashKey
 movdqa  SHUF_MASK(%rip), %xmm2
@@ -1309,10 +1320,7 @@ _T_1_decrypt:
 _T_16_decrypt:
movdqu  %xmm0, (%r10)
 _return_T_done_decrypt:
-   mov %r14, %rsp
-   pop %r14
-   pop %r13
-   pop %r12
+   FUNC_RESTORE
ret
 ENDPROC(aesni_gcm_dec)
 
@@ -1393,22 +1401,12 @@ ENDPROC(aesni_gcm_dec)
 * poly = x^128 + x^127 + x^126 + x^121 + 1
 ***/
 ENTRY(aesni_gcm_enc)
-   push%r12
-   push%r13
-   push%r14
-   mov %rsp, %r14
-#
-# states of %xmm registers %xmm6:%xmm15 not saved
-# all %xmm registers are clobbered
-#
-   sub $VARIABLE_OFFSET, %rsp
-   and $~63, %rsp
+   FUNC_SAVE
mov %arg6, %r12
movdqu  (%r12), %xmm13
 movdqa  SHUF_MASK(%rip), %xmm2
PSHUFB_XMM %xmm2, %xmm13
 
-
 # precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
 
movdqa  %xmm13, %xmm2
@@ -1576,10 +1574,7 @@ _T_1_encrypt:
 _T_16_encrypt:
movdqu  %xmm0, (%r10)
 _return_T_done_encrypt:
-   mov %r14, %rsp
-   pop %r14
-   pop %r13
-   pop %r12
+   FUNC_RESTORE
ret
 ENDPROC(aesni_gcm_enc)
 
-- 
2.9.5



Re: [PATCH v4] membarrier: expedited private command

2017-07-31 Thread Dave Watson
On 07/28/17 04:40 PM, Mathieu Desnoyers wrote:
> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
> from all runqueues for which current thread's mm is the same as the
> thread calling sys_membarrier. It executes faster than the non-expedited
> variant (no blocking). It also works on NOHZ_FULL configurations.

I tested this with our hazard pointer use case on x86_64, and it seems
to work great.  We don't currently have any uses needing SHARED.

Tested-by: Dave Watson <davejwat...@fb.com>

Thanks!

https://github.com/facebook/folly/blob/master/folly/experimental/hazptr/hazptr-impl.h#L555
https://github.com/facebook/folly/blob/master/folly/experimental/AsymmetricMemoryBarrier.cpp#L86
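
For reference, a minimal user-space sketch of the asymmetric-fence pattern
those links implement; it assumes the membarrier(2) syscall with the
MEMBARRIER_CMD_PRIVATE_EXPEDITED command from this patch (as later merged,
the command also needs a one-time MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED
per process):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

/* Reader fast path: publishing a hazard pointer only needs a
 * compiler barrier on x86_64. */
static inline void light_barrier(void)
{
        asm volatile("" ::: "memory");
}

/* Reclaimer slow path: force a full barrier on every running thread of
 * this process before scanning published hazard pointers. */
static inline void heavy_barrier(void)
{
        membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}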



Re: Udpated sys_membarrier() speedup patch, FYI

2017-07-31 Thread Dave Watson
Hi Paul, 

Thanks for looking at this again!

On 07/27/17 11:12 AM, Paul E. McKenney wrote:
> Hello!
> 
> But my main question is whether the throttling shown below is acceptable
> for your use cases, namely only one expedited sys_membarrier() permitted
> per scheduling-clock period (1 millisecond on many platforms), with any
> excess being silently converted to non-expedited form.  The reason for
> the throttling is concerns about DoS attacks based on user code with a
> tight loop invoking this system call.

We've been using sys_membarrier for the last year or so in a handful
of places with no issues.  Recently we made it an option in our hazard
pointers implementation, giving us something with performance between
hazard pointers and RCU:

https://github.com/facebook/folly/blob/master/folly/experimental/hazptr/hazptr-impl.h#L555

Currently hazard pointers tries to free retired memory the same thread
that did the retire(), so the latency spiked for workloads that were
retire() heavy.   For the moment we dropped back to using mprotect
hacks.

I've tested Mathieu's v4 patch, it works great.  We currently don't
have any cases where we need SHARED. 

I also tested the rate-limited version, while better than the current
non-EXPEDITED SHARED version, we still hit the slow path pretty often.
I agree with other commenters that returning an error code instead of
silently slowing down might be better.

> + case MEMBARRIER_CMD_SHARED_EXPEDITED:
> + if (num_online_cpus() > 1) {
> + static unsigned long lastexp;
> + unsigned long j;
> +
> + j = jiffies;
> + if (READ_ONCE(lastexp) == j) {
> + synchronize_sched();
> + WRITE_ONCE(lastexp, j);

It looks like this update of lastexp should be in the other branch?

> + } else {
> + synchronize_sched_expedited();
> + }
> + }
> + return 0;
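
For illustration, a sketch of the quoted hunk with the lastexp update moved
to the expedited branch, as the comment above suggests:

        case MEMBARRIER_CMD_SHARED_EXPEDITED:
                if (num_online_cpus() > 1) {
                        static unsigned long lastexp;
                        unsigned long j;

                        j = jiffies;
                        if (READ_ONCE(lastexp) == j) {
                                /* throttled: silently fall back */
                                synchronize_sched();
                        } else {
                                synchronize_sched_expedited();
                                WRITE_ONCE(lastexp, j);
                        }
                }
                return 0;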



Re: [RFC PATCH v8 1/9] Restartable sequences system call

2016-08-19 Thread Dave Watson
On 08/19/16 02:24 PM, Josh Triplett wrote:
> On Fri, Aug 19, 2016 at 01:56:11PM -0700, Andi Kleen wrote:
> > > Nobody gets a cpu number just to get a cpu number - it's not a useful
> > > thing to benchmark. What does getcpu() so much that we care?
> > 
> > malloc is the primary target I believe. Saves lots of memory to keep
> > caches per CPU rather than per thread.
> 
> Also improves locality; that does seem like a good idea.  Has anyone
> written and tested the corresponding changes to a malloc implementation?
> 

I had modified jemalloc to use rseq instead of per-thread caches, and
did some testing on one of our services.

Memory usage decreased by ~20% due to fewer caches.  Our services
generally have lots and lots of idle threads (~400), and we already go
through a few hoops to try and flush idle thread caches.  Threads are
often coming from dependent libraries written by disparate teams,
making them harder to reduce to a smaller number.

We also have quite a few data structures that are sharded
thread-locally only to avoid contention, for example we have extensive
statistics code that would also be a prime candidate for rseq .  We
often have to prune some stats because they're taking up too much
memory, rseq would let us fit a bit more in.
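
To make the statistics case concrete, a rough sketch of a per-cpu counter
bump against the rseq_start()/rseq_finish() helpers from the selftests in
this series; the cpu accessor below is a placeholder (its name varies across
patchset revisions), MAX_CPUS is illustrative, and the retry/fallback policy
is simplified:

struct sharded_counter {
        struct rseq_lock lock;
        intptr_t count[MAX_CPUS];       /* one slot per cpu */
};

static void counter_inc(struct sharded_counter *c)
{
        struct rseq_state start;
        int cpu;

        do {
                start = rseq_start(&c->lock);
                cpu = rseq_state_cpu(start);    /* placeholder accessor */
        } while (!rseq_finish(&c->lock, &c->count[cpu],
                              c->count[cpu] + 1, start));
}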

jemalloc diff here (pretty stale now):

https://github.com/djwatson/jemalloc/commit/51f6e6f61b88eee8de981f0f2d52bc48f85e0d01

Original numbers posted here:

https://lkml.org/lkml/2015/10/22/588



Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests

2016-08-12 Thread Dave Watson

>>>> Would pairing one rseq_start with two rseq_finish do the trick
>>>> there ?
>>>
>>> Yes, two rseq_finish works, as long as the extra rseq management overhead
>>> is not substantial.
>>
>> I've added a commit implementing rseq_finish2() in my rseq volatile
>> dev branch. You can fetch it at:
>>
>> https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback
>>
>> I also have a separate test and benchmark tree in addition to the
>> kernel selftests here:
>>
>> https://github.com/compudj/rseq-test
>>
>> I named the first write a "speculative" write, and the second write
>> the "final" write.
>>
>> Would you like to extend the test cases to cover your intended use-case ?
>>
>
>Hi Dave!
>
>I just pushed a rseq_finish2() test in my rseq-fallback branch. It implements
>a per-cpu buffer holding pointers, and pushes/pops items to/from it.
>
>To use it:
>
>cd tools/testing/selftests/rseq
>./param_test -T b
>
>(see -h for advanced usage)
>
>Let me know if I got it right!

Hi Mathieu,

Thanks, you beat me to it. I commented on the github, that's pretty much it.
 

> In the kernel, if rather than testing for:
> 
> if ((void __user *)instruction_pointer(regs) < post_commit_ip) {
> 
> we could test for both start_ip and post_commit_ip:
> 
> if ((void __user *)instruction_pointer(regs) < post_commit_ip
> && (void __user *)instruction_pointer(regs) >= start_ip) {
> 
> We could perform the failure path (storing NULL into the rseq_cs
> field of struct rseq) in C rather than being required to do it in
> assembly at addresses >= to post_commit_ip, all because the kernel
> would test whether we are within the assembly block address range
> using both the lower and upper bounds (start_ip and post_commit_ip).

Sounds reasonable to me.  I agree it would be best to move the failure path 
out of the asm if possible.


Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests

2016-07-24 Thread Dave Watson
>> +static inline __attribute__((always_inline))
>> +bool rseq_finish(struct rseq_lock *rlock,
>> + intptr_t *p, intptr_t to_write,
>> + struct rseq_state start_value)

>> This ABI looks like it will work fine for our use case. I don't think it
>> has been mentioned yet, but we may still need multiple asm blocks
>> for differing numbers of writes. For example, an array-based freelist push:

>> void push(void *obj) {
>> if (index < maxlen) {
>> freelist[index++] = obj;
>> }
>> }

>> would be more efficiently implemented with a two-write rseq_finish:

>> rseq_finish2(&freelist[index], obj, // first write
>>              &index, index + 1,     // second write
>>              ...);

> Would pairing one rseq_start with two rseq_finish do the trick
> there ?

Yes, two rseq_finish works, as long as the extra rseq management overhead
is not substantial.  


Re: [RFC PATCH 2/2] Crypto kernel tls socket

2015-11-23 Thread Dave Watson
On 11/23/15 02:27 PM, Sowmini Varadhan wrote:
> On (11/23/15 09:43), Dave Watson wrote:
> > Currently gcm(aes) represents ~80% of our SSL connections.
> >
> > Userspace interface:
> >
> > 1) A transform and op socket are created using the userspace crypto 
> > interface
> > 2) Setsockopt ALG_SET_AUTHSIZE is called
> > 3) Setsockopt ALG_SET_KEY is called twice, since we need both send/recv keys
> > 4) ALG_SET_IV cmsgs are sent twice, since we need both send/recv IVs.
> >To support userspace heartbeats, changeciphersuite, etc, we would also 
> > need
> >to get these back out, use them, then reset them via CMSG.
> > 5) ALG_SET_OP cmsg is overloaded to mean FD to read/write from.
>
> [from patch 0/2:]
> > If a non application-data TLS record is seen, it is left on the TCP
> > socket and an error is returned on the ALG socket, and the record is
> > left for userspace to manage.
>
> I'm trying to see how your approach would fit with the RDS-type of
> use-case. RDS-TCP is mostly similar in concept to kcm,
> except that rds has its own header for multiplexing, and has no
> dependancy on BPF for basic things like re-assembling the datagram.
> If I were to try to use this for RDS-TCP, the tls_tcp_read_sock() logic
> would be merged into the recv_actor callback for RDS, right?  Thus tls
> control-plane message could be seen in the middle of the
> data-stream, so we really have to freeze the processing of the data
> stream till the control-plane message is processed?

Correct.

> In the tls.c example that you have, the opfd is generated from
> the accept() on the AF_ALG socket- how would this work if I wanted
> my opfd to be a PF_RDS or a PF_KCM or similar?

For kcm, opfd is the fd you would pass along in kcm_attach.
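
For illustration, attaching that op socket with the KCM uapi as it was later
merged (kcm_fd, opfd and bpf_prog_fd are placeholders):

#include <linux/kcm.h>
#include <sys/ioctl.h>

static int attach_tls_op_socket(int kcm_fd, int opfd, int bpf_prog_fd)
{
        struct kcm_attach attach = {
                .fd     = opfd,         /* the TLS op socket from accept() */
                .bpf_fd = bpf_prog_fd,  /* framing BPF program */
        };

        return ioctl(kcm_fd, SIOCKCMATTACH, &attach);
}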

For rds, it looks like you'd want to use opfd as the sock instead of
the new one created by sock_create_kern in rds_tcp_conn_connect.

> One concern is that this patchset provides a solution for the "80%"
> case but what about the other 20% (and the non x86 platforms)?

Almost all the rest are aes sha.  The actual encrypt / decrypt code
would be similar to this previous patch:

http://marc.info/?l=linux-kernel&m=140662647602192&w=2

The software routines in gcm(aes) should work for all platforms
without aesni.

> E.g., if I get a cipher-suite request outside the aes-ni, what would
> happen (punt to uspace?)
>
> --Sowmini

Right, bind() would fail and you would fallback to uspace.


[RFC PATCH 1/2] Crypto support aesni rfc5288

2015-11-23 Thread Dave Watson
Support rfc5288 using intel aesni routines.  See also rfc5246.

AAD length is 13 bytes padded out to 16. Padding bytes have to be
passed in in scatterlist currently, which probably isn't quite the
right fix.
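
For reference, the 13-byte TLS 1.2 AEAD additional data being padded here
(layout per RFC 5246; the struct is only illustrative, not part of the patch):

struct tls12_aad {
        __be64 seq;             /* record sequence number, 8 bytes */
        u8     type;            /* content type, 1 byte */
        u8     version[2];      /* protocol version, 2 bytes */
        __be16 len;             /* plaintext length, 2 bytes */
} __packed;                     /* 13 bytes; (13 + 3) & ~3 == 16 */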

The assoclen checks were moved to the individual rfc stubs, and the
common routines support all assoc lengths.

---
 arch/x86/crypto/aesni-intel_asm.S|   6 ++
 arch/x86/crypto/aesni-intel_avx-x86_64.S |   4 ++
 arch/x86/crypto/aesni-intel_glue.c   | 105 +++
 3 files changed, 88 insertions(+), 27 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 6bd2c6c..49667c4 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -228,6 +228,9 @@ XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
 MOVADQ SHUF_MASK(%rip), %xmm14
movarg7, %r10   # %r10 = AAD
movarg8, %r12   # %r12 = aadLen
+   add$3, %r12
+   and$~3, %r12
+
mov%r12, %r11
pxor   %xmm\i, %xmm\i

@@ -453,6 +456,9 @@ XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
 MOVADQ SHUF_MASK(%rip), %xmm14
movarg7, %r10   # %r10 = AAD
movarg8, %r12   # %r12 = aadLen
+   add$3, %r12
+   and$~3, %r12
+
mov%r12, %r11
pxor   %xmm\i, %xmm\i
 _get_AAD_loop\num_initial_blocks\operation:
diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 522ab68..0756e4a 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -360,6 +360,8 @@ VARIABLE_OFFSET = 16*8

 mov arg6, %r10  # r10 = AAD
 mov arg7, %r12  # r12 = aadLen
+add $3, %r12
+and $~3, %r12


 mov %r12, %r11
@@ -1619,6 +1621,8 @@ ENDPROC(aesni_gcm_dec_avx_gen2)

 mov arg6, %r10   # r10 = AAD
 mov arg7, %r12   # r12 = aadLen
+add $3, %r12
+and $~3, %r12


 mov %r12, %r11
diff --git a/arch/x86/crypto/aesni-intel_glue.c 
b/arch/x86/crypto/aesni-intel_glue.c
index 3633ad6..00a42ca 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -949,12 +949,7 @@ static int helper_rfc4106_encrypt(struct aead_request *req)
struct scatter_walk src_sg_walk;
struct scatter_walk dst_sg_walk;
unsigned int i;
-
-   /* Assuming we are supporting rfc4106 64-bit extended */
-   /* sequence numbers We need to have the AAD length equal */
-   /* to 16 or 20 bytes */
-   if (unlikely(req->assoclen != 16 && req->assoclen != 20))
-   return -EINVAL;
+   unsigned int padded_assoclen = (req->assoclen + 3) & ~3;

/* IV below built */
for (i = 0; i < 4; i++)
@@ -970,21 +965,21 @@ static int helper_rfc4106_encrypt(struct aead_request 
*req)
one_entry_in_sg = 1;
scatterwalk_start(&src_sg_walk, req->src);
assoc = scatterwalk_map(&src_sg_walk);
-   src = assoc + req->assoclen;
+   src = assoc + padded_assoclen;
dst = src;
if (unlikely(req->src != req->dst)) {
scatterwalk_start(&dst_sg_walk, req->dst);
-   dst = scatterwalk_map(&dst_sg_walk) + req->assoclen;
+   dst = scatterwalk_map(&dst_sg_walk) + padded_assoclen;
}
} else {
/* Allocate memory for src, dst, assoc */
-   assoc = kmalloc(req->cryptlen + auth_tag_len + req->assoclen,
+   assoc = kmalloc(req->cryptlen + auth_tag_len + padded_assoclen,
GFP_ATOMIC);
if (unlikely(!assoc))
return -ENOMEM;
scatterwalk_map_and_copy(assoc, req->src, 0,
-req->assoclen + req->cryptlen, 0);
-   src = assoc + req->assoclen;
+padded_assoclen + req->cryptlen, 0);
+   src = assoc + padded_assoclen;
dst = src;
}

@@ -998,7 +993,7 @@ static int helper_rfc4106_encrypt(struct aead_request *req)
 * back to the packet. */
if (one_entry_in_sg) {
if (unlikely(req->src != req->dst)) {
-   scatterwalk_unmap(dst - req->assoclen);
+   scatterwalk_unmap(dst - padded_assoclen);
scatterwalk_advance(&dst_sg_walk, req->dst->length);
scatterwalk_done(&dst_sg_walk, 1, 0);
}
@@ -1006,7 +1001,7 @@ static int helper_rfc4106_encrypt(struct aead_request 
*req)
scatterwalk_advance(&src_sg_walk, req->src->length);
scatterwalk_done(&src_sg_walk, 

[RFC PATCH 2/2] Crypto kernel tls socket

2015-11-23 Thread Dave Watson
Userspace crypto interface for TLS.  Currently only gcm(aes) with 128-bit
keys is supported; however, the interface is the same as the rest of the
AF_ALG interface, so more ciphers could be added without any user-interface
changes.

Currently gcm(aes) represents ~80% of our SSL connections.

Userspace interface:

1) A transform and op socket are created using the userspace crypto interface
2) Setsockopt ALG_SET_AUTHSIZE is called
3) Setsockopt ALG_SET_KEY is called twice, since we need both send/recv keys
4) ALG_SET_IV cmsgs are sent twice, since we need both send/recv IVs.
   To support userspace heartbeats, ChangeCipherSpec, etc., we would also need
   to get these back out, use them, and then reset them via CMSG.
5) ALG_SET_OP cmsg is overloaded to mean FD to read/write from.
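
A rough userspace sketch of steps 1-5 (error handling omitted).  The "tls"
salg_type, the "rfc5288(gcm(aes))" algorithm name, the 20-byte key (16-byte
AES key plus 4-byte salt) and the exact cmsg packing are assumptions about
what this RFC expects, not a documented ABI:

#include <linux/if_alg.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int tls_opfd_setup(int tcp_fd, const uint8_t send_key[20],
                          const uint8_t recv_key[20], const uint8_t send_iv[8])
{
        struct sockaddr_alg sa = {
                .salg_family = AF_ALG,
                .salg_type   = "tls",                 /* assumed type */
                .salg_name   = "rfc5288(gcm(aes))",   /* assumed name */
        };
        int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);          /* 1) transform socket */

        bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
        setsockopt(tfm, SOL_ALG, ALG_SET_AUTHSIZE, NULL, 16); /* 2) 16-byte GCM tag */
        setsockopt(tfm, SOL_ALG, ALG_SET_KEY, send_key, 20);  /* 3) send key, then */
        setsockopt(tfm, SOL_ALG, ALG_SET_KEY, recv_key, 20);  /*    recv key */

        int opfd = accept(tfm, NULL, NULL);                   /* 1) op socket */

        /* 4) one ALG_SET_IV cmsg (a second would follow for the recv IV)
         * 5) ALG_SET_OP cmsg, overloaded here to carry the TCP fd to chain to */
        char cbuf[CMSG_SPACE(sizeof(struct af_alg_iv) + 8) +
                  CMSG_SPACE(sizeof(int))] = { 0 };
        struct msghdr msg = { .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_ALG;
        cmsg->cmsg_type  = ALG_SET_IV;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(struct af_alg_iv) + 8);
        struct af_alg_iv *iv = (struct af_alg_iv *)CMSG_DATA(cmsg);
        iv->ivlen = 8;                                        /* TLS_IV_SIZE */
        memcpy(iv->iv, send_iv, 8);

        cmsg = CMSG_NXTHDR(&msg, cmsg);
        cmsg->cmsg_level = SOL_ALG;
        cmsg->cmsg_type  = ALG_SET_OP;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &tcp_fd, sizeof(int));

        sendmsg(opfd, &msg, 0);
        close(tfm);
        return opfd;
}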

Example program:

https://github.com/djwatson/ktls

At a high level, this could instead be implemented directly on TCP sockets,
with various tradeoffs.

The userspace crypto interface might benefit from some tweaking to deal
with multiple keys / IVs better.  The crypto accept() op-socket interface
isn't a great fit, since there are never multiple parallel operations.

There are also some open questions around using skbuffs instead of
scatterlists for send/recv, and, if we are buffering on recv, when we
should be decrypting the data.
---
 crypto/Kconfig |   12 +
 crypto/Makefile|1 +
 crypto/algif_tls.c | 1233 
 3 files changed, 1246 insertions(+)
 create mode 100644 crypto/algif_tls.c

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 7240821..c15638a 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1639,6 +1639,18 @@ config CRYPTO_USER_API_AEAD
  This option enables the user-spaces interface for AEAD
  cipher algorithms.

+config CRYPTO_USER_API_TLS
+   tristate "User-space interface for TLS net sockets"
+   depends on NET
+   select CRYPTO_AEAD
+   select CRYPTO_USER_API
+   help
+ This option enables kernel TLS socket framing
+ cipher algorithms.  TLS framing is added/removed and
+  chained to a TCP socket.  Handshake is done in
+  userspace.
+
+
 config CRYPTO_HASH_INFO
bool

diff --git a/crypto/Makefile b/crypto/Makefile
index f7aba92..fc26012 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -121,6 +121,7 @@ obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
 obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o
 obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
+obj-$(CONFIG_CRYPTO_USER_API_TLS) += algif_tls.o

 #
 # generic algorithms and the async_tx api
diff --git a/crypto/algif_tls.c b/crypto/algif_tls.c
new file mode 100644
index 0000000..123ade3
--- /dev/null
+++ b/crypto/algif_tls.c
@@ -0,0 +1,1233 @@
+/*
+ * algif_tls: User-space interface for TLS
+ *
+ * Copyright (C) 2015, Dave Watson 
+ *
+ * This file provides the user-space API for AEAD ciphers.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define TLS_HEADER_SIZE 13
+#define TLS_TAG_SIZE 16
+#define TLS_IV_SIZE 8
+#define TLS_PADDED_AADLEN 16
+#define TLS_MAX_MESSAGE_LEN (1 << 14)
+
+/* Bytes not included in tls msg size field */
+#define TLS_FRAMING_SIZE 5
+
+#define TLS_APPLICATION_DATA_MSG 0x17
+#define TLS_VERSION 3
+
+struct tls_tfm_pair {
+   struct crypto_aead *tfm_send;
+   struct crypto_aead *tfm_recv;
+   int cur_setkey;
+};
+
+static struct workqueue_struct *tls_wq;
+
+struct tls_sg_list {
+   unsigned int cur;
+   struct scatterlist sg[ALG_MAX_PAGES];
+};
+
+#define RSGL_MAX_ENTRIES ALG_MAX_PAGES
+
+struct tls_ctx {
+   /* Send and encrypted transmit buffers */
+   struct tls_sg_list tsgl;
+   struct scatterlist tcsgl[ALG_MAX_PAGES];
+
+   /* Encrypted receive and receive buffers. */
+   struct tls_sg_list rcsgl;
+   struct af_alg_sgl rsgl[RSGL_MAX_ENTRIES];
+
+   /* Sequence numbers. */
+   int iv_set;
+   void *iv_send;
+   void *iv_recv;
+
+   struct af_alg_completion completion;
+
+   /* Bytes to send */
+   unsigned long used;
+
+   /* padded */
+   size_t aead_assoclen;
+   /* unpadded */
+   size_t assoclen;
+   struct aead_request aead_req;
+   struct aead_request aead_resp;
+
+   bool more;
+   bool merge;
+
+   /* Chained TCP socket */
+   struct sock *sock;
+   struct socket *socket;
+
+   void (*save_data_ready)(struct sock *sk);
+   void (*save_write_space)(struct sock *sk);
+   void (*save_state_change)(struct sock *sk);
+   struct work_struct tx_work

[RFC PATCH 0/2] Crypto kernel TLS socket

2015-11-23 Thread Dave Watson
An approach for a kernel TLS socket.

Only the symmetric encryption / decryption and minimal framing handling
are done in-kernel.  The handshake is kept in userspace, and the
negotiated cipher / keys / IVs are then set on the algif_tls socket,
which is then hooked into a TCP socket using the
sk_write_space/sk_data_ready callbacks.
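
Roughly, that hook-up looks like the following sketch.  This is not the
patch's code; tls_data_ready()/tls_write_space() stand in for the workers
algif_tls schedules on its workqueue, and the ctx fields mirror the
save_data_ready/save_write_space members of the patch's struct tls_ctx:

#include <net/sock.h>

struct tls_hook_ctx {
        void (*save_data_ready)(struct sock *sk);
        void (*save_write_space)(struct sock *sk);
};

static void tls_data_ready(struct sock *sk)
{
        /* the real code would queue decrypt/framing work on tls_wq */
}

static void tls_write_space(struct sock *sk)
{
        /* the real code would queue pending transmit work on tls_wq */
}

static void tls_chain_sock(struct tls_hook_ctx *ctx, struct sock *sk)
{
        write_lock_bh(&sk->sk_callback_lock);
        ctx->save_data_ready  = sk->sk_data_ready;    /* save originals */
        ctx->save_write_space = sk->sk_write_space;
        sk->sk_user_data      = ctx;
        sk->sk_data_ready     = tls_data_ready;       /* install TLS hooks */
        sk->sk_write_space    = tls_write_space;
        write_unlock_bh(&sk->sk_callback_lock);
}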

If a non-application-data TLS record is seen, it is left on the TCP
socket for userspace to manage, and an error is returned on the ALG
socket.  Userspace can't ignore the message, but it could simply close
the socket.

TLS could potentially also be done directly on the TCP socket, but it
seemed harder to deal with the OOB handling of non-application_data
messages, and the sockopts / CMSGs already exist for ALG sockets.  The
flip side is having to manage two fds in userspace.

Some reasons we're looking at this:

1) Access to sendfile/splice for CDN-type applications.  We were
   inspired by Netflix exploring this in FreeBSD

   https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf

   For perf, this patch is almost on par with userspace OpenSSL.
   Currently there are some copies and allocs to support
   scatter/gather in aesni-intel_glue.c, but with some extra work to
   remove those (not included here), a sendfile() is 2-7% faster than
   the equivalent userspace read/SSL_write using a 128k buffer (see the
   sendfile sketch after this list).

2) Access to the unencrypted bytes in kernelspace.  For example, Tom
   Herbert's kcm would need this

   https://lwn.net/Articles/657999/

3) NIC offload. To support running the aesni routines on the NIC
   instead of the processor, we would probably need enough of the
   framing interface in the kernel.
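
As a concrete example of (1), once the algif_tls op socket is set up and
chained to a connected TCP socket, file data can be pushed through it with
plain sendfile(); the sketch below assumes an already-configured opfd and
does only minimal error handling:

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static int serve_file_over_tls(int opfd, const char *path)
{
        int fd = open(path, O_RDONLY);
        struct stat st;
        off_t off = 0;

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }

        while (off < st.st_size) {
                /* encryption and TLS framing happen in-kernel on opfd */
                ssize_t n = sendfile(opfd, fd, &off, st.st_size - off);
                if (n <= 0)
                        break;          /* real code would check errno/EAGAIN */
        }
        close(fd);
        return off == st.st_size ? 0 : -1;
}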


Dave Watson (2):
  Crypto support aesni rfc5288
  Crypto kernel tls socket

 arch/x86/crypto/aesni-intel_asm.S|6 +
 arch/x86/crypto/aesni-intel_avx-x86_64.S |4 +
 arch/x86/crypto/aesni-intel_glue.c   |  105 ++-
 crypto/Kconfig   |   12 +
 crypto/Makefile  |1 +
 crypto/algif_tls.c   | 1233 ++
 6 files changed, 1334 insertions(+), 27 deletions(-)
 create mode 100644 crypto/algif_tls.c

--
2.4.6


[RFC PATCH 1/2] Crypto support aesni rfc5288

2015-11-23 Thread Dave Watson
Support rfc5288 using the Intel aesni routines.  See also rfc5246.

The AAD length is 13 bytes, padded out to 16.  The padding bytes
currently have to be passed in via the scatterlist, which probably
isn't quite the right fix.
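
For reference, the 13 bytes are the standard GCM TLS additional data
(8-byte sequence number, 1-byte record type, 2-byte version, 2-byte
length), and the rounding used in this patch, (aadLen + 3) & ~3, pads 13
up to 16.  A small standalone illustration (the version and length values
are just examples):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint8_t aad[16] = { 0 };         /* 13 bytes of AAD, zero-padded to 16 */
        uint64_t seq = 1;                /* 64-bit record sequence number */
        uint16_t version = 0x0303;       /* example: TLS 1.2 */
        uint16_t length = 1024;          /* example plaintext length */
        int i;

        for (i = 0; i < 8; i++)          /* seq_num, big-endian */
                aad[i] = (uint8_t)(seq >> (8 * (7 - i)));
        aad[8]  = 0x17;                  /* application_data record type */
        aad[9]  = (uint8_t)(version >> 8);
        aad[10] = (uint8_t)version;
        aad[11] = (uint8_t)(length >> 8);
        aad[12] = (uint8_t)length;

        unsigned int aadlen = 13;
        unsigned int padded = (aadlen + 3) & ~3u;    /* 13 -> 16 */

        printf("aadlen=%u padded=%u byte after AAD=%u\n",
               aadlen, padded, aad[13]);
        return 0;
}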

The assoclen checks were moved to the individual rfc stubs, and the
common routines support all assoc lengths.

---
 arch/x86/crypto/aesni-intel_asm.S|   6 ++
 arch/x86/crypto/aesni-intel_avx-x86_64.S |   4 ++
 arch/x86/crypto/aesni-intel_glue.c   | 105 +++
 3 files changed, 88 insertions(+), 27 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 6bd2c6c..49667c4 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -228,6 +228,9 @@ XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
 MOVADQ SHUF_MASK(%rip), %xmm14
movarg7, %r10   # %r10 = AAD
movarg8, %r12   # %r12 = aadLen
+   add$3, %r12
+   and$~3, %r12
+
mov%r12, %r11
pxor   %xmm\i, %xmm\i

@@ -453,6 +456,9 @@ XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
 MOVADQ SHUF_MASK(%rip), %xmm14
movarg7, %r10   # %r10 = AAD
movarg8, %r12   # %r12 = aadLen
+   add$3, %r12
+   and$~3, %r12
+
mov%r12, %r11
pxor   %xmm\i, %xmm\i
 _get_AAD_loop\num_initial_blocks\operation:
diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 522ab68..0756e4a 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -360,6 +360,8 @@ VARIABLE_OFFSET = 16*8

 mov arg6, %r10  # r10 = AAD
 mov arg7, %r12  # r12 = aadLen
+add $3, %r12
+and $~3, %r12


 mov %r12, %r11
@@ -1619,6 +1621,8 @@ ENDPROC(aesni_gcm_dec_avx_gen2)

 mov arg6, %r10   # r10 = AAD
 mov arg7, %r12   # r12 = aadLen
+add $3, %r12
+and $~3, %r12


 mov %r12, %r11
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 3633ad6..00a42ca 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -949,12 +949,7 @@ static int helper_rfc4106_encrypt(struct aead_request *req)
struct scatter_walk src_sg_walk;
struct scatter_walk dst_sg_walk;
unsigned int i;
-
-   /* Assuming we are supporting rfc4106 64-bit extended */
-   /* sequence numbers We need to have the AAD length equal */
-   /* to 16 or 20 bytes */
-   if (unlikely(req->assoclen != 16 && req->assoclen != 20))
-   return -EINVAL;
+   unsigned int padded_assoclen = (req->assoclen + 3) & ~3;

/* IV below built */
for (i = 0; i < 4; i++)
@@ -970,21 +965,21 @@ static int helper_rfc4106_encrypt(struct aead_request *req)
one_entry_in_sg = 1;
scatterwalk_start(&src_sg_walk, req->src);
assoc = scatterwalk_map(&src_sg_walk);
-   src = assoc + req->assoclen;
+   src = assoc + padded_assoclen;
dst = src;
if (unlikely(req->src != req->dst)) {
scatterwalk_start(&dst_sg_walk, req->dst);
-   dst = scatterwalk_map(&dst_sg_walk) + req->assoclen;
+   dst = scatterwalk_map(&dst_sg_walk) + padded_assoclen;
}
} else {
/* Allocate memory for src, dst, assoc */
-   assoc = kmalloc(req->cryptlen + auth_tag_len + req->assoclen,
+   assoc = kmalloc(req->cryptlen + auth_tag_len + padded_assoclen,
GFP_ATOMIC);
if (unlikely(!assoc))
return -ENOMEM;
scatterwalk_map_and_copy(assoc, req->src, 0,
-req->assoclen + req->cryptlen, 0);
-   src = assoc + req->assoclen;
+padded_assoclen + req->cryptlen, 0);
+   src = assoc + padded_assoclen;
dst = src;
}

@@ -998,7 +993,7 @@ static int helper_rfc4106_encrypt(struct aead_request *req)
 * back to the packet. */
if (one_entry_in_sg) {
if (unlikely(req->src != req->dst)) {
-   scatterwalk_unmap(dst - req->assoclen);
+   scatterwalk_unmap(dst - padded_assoclen);
scatterwalk_advance(&dst_sg_walk, req->dst->length);
scatterwalk_done(&dst_sg_walk, 1, 0);
}
@@ -1006,7 +1001,7 @@ static int helper_rfc4106_encrypt(struct aead_request *req)
scatterwalk_advance(&src_sg_walk, req->src->length);
scatterwalk_done(&src_sg_walk, 

Re: [RFC PATCH 2/2] Crypto kernel tls socket

2015-11-23 Thread Dave Watson
On 11/23/15 02:27 PM, Sowmini Varadhan wrote:
> On (11/23/15 09:43), Dave Watson wrote:
> > Currently gcm(aes) represents ~80% of our SSL connections.
> >
> > Userspace interface:
> >
> > 1) A transform and op socket are created using the userspace crypto 
> > interface
> > 2) Setsockopt ALG_SET_AUTHSIZE is called
> > 3) Setsockopt ALG_SET_KEY is called twice, since we need both send/recv keys
> > 4) ALG_SET_IV cmsgs are sent twice, since we need both send/recv IVs.
> >To support userspace heartbeats, changeciphersuite, etc, we would also 
> > need
> >to get these back out, use them, then reset them via CMSG.
> > 5) ALG_SET_OP cmsg is overloaded to mean FD to read/write from.
>
> [from patch 0/2:]
> > If a non application-data TLS record is seen, it is left on the TCP
> > socket and an error is returned on the ALG socket, and the record is
> > left for userspace to manage.
>
> I'm trying to see how your approach would fit with the RDS-type of
> use-case. RDS-TCP is mostly similar in concept to kcm,
> except that rds has its own header for multiplexing, and has no
> dependancy on BPF for basic things like re-assembling the datagram.
> If I were to try to use this for RDS-TCP, the tls_tcp_read_sock() logic
> would be merged into the recv_actor callback for RDS, right?  Thus tls
> control-plane message could be seen in the middle of the
> data-stream, so we really have to freeze the processing of the data
> stream till the control-plane message is processed?

Correct.

> In the tls.c example that you have, the opfd is generated from
> the accept() on the AF_ALG socket- how would this work if I wanted
> my opfd to be a PF_RDS or a PF_KCM or similar?

For kcm, opfd is the fd you would pass along in kcm_attach.
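
For context, the attach step with KCM (using the API as it was later
merged) would look roughly like the sketch below; passing the algif_tls
opfd as the attached fd follows the suggestion above and is untested here:

#include <linux/kcm.h>
#include <linux/sockios.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

static int kcm_attach_opfd(int opfd, int bpf_prog_fd)
{
        int kcm = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
        struct kcm_attach attach = {
                .fd     = opfd,         /* the TLS op socket fd */
                .bpf_fd = bpf_prog_fd,  /* BPF program that finds message boundaries */
        };

        if (kcm < 0 || ioctl(kcm, SIOCKCMATTACH, &attach) < 0)
                return -1;
        return kcm;
}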

For rds, it looks like you'd want to use opfd as the sock instead of
the new one created by sock_create_kern in rds_tcp_conn_connect.

> One concern is that this patchset provides a solution for the "80%"
> case but what about the other 20% (and the non x86 platforms)?

Almost all of the rest are AES-SHA suites.  The actual encrypt / decrypt
code would be similar to this previous patch:

http://marc.info/?l=linux-kernel&m=140662647602192&w=2

The software routines in gcm(aes) should work for all platforms
without aesni.

> E.g., if I get a cipher-suite request outside the aes-ni, what would
> happen (punt to uspace?)
>
> --Sowmini

Right, bind() would fail and you would fall back to userspace.


Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections

2015-10-28 Thread Dave Watson
On 10/27/15 04:56 PM, Paul Turner wrote:
> This series is a new approach which introduces an alternate ABI that does not
> depend on open-coded assembly nor a central 'repository' of rseq sequences.
> Sequences may now be inlined and the preparatory[*] work for the sequence can
> be written in a higher level language.

Very nice, it's definitely much easier to use.

> Exactly, for x86_64 this looks like:
>   movq , rcx [1]
>   movq $1f,  [2]
>   cmpq ,  [3] (start is in rcx)
>   jnz  (4)
>   movq , () (5)
>   1: movq $0, 
>
> There has been some related discussion, which I am supportive of, in which
> we use fs/gs instead of TLS.  This maps naturally to the above and removes
> the current requirement for per-thread initialization (this is a good thing!).
>
> On debugger interactions:
>
> There are some nice properties about this new style of API which allow it to
> actually support safe interactions with a debugger:
>  a) The event counter is a per-cpu value.  This means that we can not advance
> it if no threads from the same process execute on that cpu.  This
> naturally allows basic single step support with thread-isolation.

I think this means multiple processes would no longer be able to use
per-cpu variables in shared memory, since they would no longer restart
with respect to each other?

>  b) Single-step can be augmented to evalute the ABI without incrementing the
> event count.
>  c) A debugger can also be augmented to evaluate this ABI and push restarts
> on the kernel's behalf.
>
> This is also compatible with David's approach of not single stepping between
> 2-4 above.  However, I think these are ultimately a little stronger since true
> single-stepping and breakpoint support would be available.  Which would be
> nice to allow actual debugging of sequences.

