[LSF/MM TOPIC] Improve performance of fget/fput
In some of our hottest network services, fget_light + fput overhead can represent 1-2% of a process's total CPU usage. I'd like to discuss ways to reduce this overhead. One proposal we have been testing is removing the refcount increment and decrement, and using some form of safe memory reclamation instead.

The hottest callers include recvmsg, sendmsg, epoll_wait, etc - mostly networking calls, often used on non-blocking sockets. Typically we hold the reference only for a very short time; ideally we wouldn't touch the refcount at all unless we know we are going to block.

We could use RCU, but we would have to be particularly careful that none of these calls ever block, or else ensure that we take a reference at the blocking locations. As an alternative to RCU, hazard pointers have similar overhead to SRCU, and would work equally well for blocking or nonblocking syscalls without additional changes.

(There were also recent related discussions on SCM_RIGHTS refcount cycle issues, which is the other half of a file* gc.)

There might also be ways to rearrange the file* struct or the fd table so that we aren't taking so many cache misses in sockfd_lookup_light, since for sockets most of the file* struct is unused.
[PATCH 10/12] x86/crypto: aesni: Introduce READ_PARTIAL_BLOCK macro
Introduce READ_PARTIAL_BLOCK macro, and use it in the two existing partial block cases: AAD and the end of ENC_DEC. In particular, the ENC_DEC case should be faster, since we read by 8/4 bytes if possible. This macro will also be used to read partial blocks between enc_update and dec_update calls. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 102 +-- 1 file changed, 59 insertions(+), 43 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 44a4a8b43ca4..ff00ad19064d 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -415,68 +415,56 @@ _zero_cipher_left\@: vmovdqu %xmm14, AadHash(arg2) vmovdqu %xmm9, CurCount(arg2) -cmp $16, arg5 -jl _only_less_than_16\@ - +# check for 0 length mov arg5, %r13 and $15, %r13# r13 = (arg5 mod 16) je _multiple_of_16_bytes\@ -# handle the last <16 Byte block seperately +# handle the last <16 Byte block separately mov %r13, PBlockLen(arg2) -vpaddd ONE(%rip), %xmm9, %xmm9 # INCR CNT to get Yn +vpaddd ONE(%rip), %xmm9, %xmm9 # INCR CNT to get Yn vmovdqu %xmm9, CurCount(arg2) vpshufb SHUF_MASK(%rip), %xmm9, %xmm9 ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Yn) vmovdqu %xmm9, PBlockEncKey(arg2) -sub $16, %r11 -add %r13, %r11 -vmovdqu (arg4, %r11), %xmm1 # receive the last <16 Byte block - -lea SHIFT_MASK+16(%rip), %r12 -sub %r13, %r12 # adjust the shuffle mask pointer to be -# able to shift 16-r13 bytes (r13 is the -# number of bytes in plaintext mod 16) -vmovdqu (%r12), %xmm2# get the appropriate shuffle mask -vpshufb %xmm2, %xmm1, %xmm1 # shift right 16-r13 bytes -jmp _final_ghash_mul\@ - -_only_less_than_16\@: -# check for 0 length -mov arg5, %r13 -and $15, %r13# r13 = (arg5 mod 16) +cmp $16, arg5 +jge _large_enough_update\@ -je _multiple_of_16_bytes\@ +lea (arg4,%r11,1), %r10 +mov %r13, %r12 -# handle the last <16 Byte block separately - - -vpaddd ONE(%rip), %xmm9, %xmm9 # INCR CNT to get Yn -vpshufb 
SHUF_MASK(%rip), %xmm9, %xmm9 -ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Yn) - -vmovdqu %xmm9, PBlockEncKey(arg2) +READ_PARTIAL_BLOCK %r10 %r12 %xmm1 lea SHIFT_MASK+16(%rip), %r12 sub %r13, %r12 # adjust the shuffle mask pointer to be # able to shift 16-r13 bytes (r13 is the -# number of bytes in plaintext mod 16) + # number of bytes in plaintext mod 16) -_get_last_16_byte_loop\@: -movb(arg4, %r11), %al -movb%al, TMP1 (%rsp , %r11) -add $1, %r11 -cmp %r13, %r11 -jne _get_last_16_byte_loop\@ +jmp _final_ghash_mul\@ + +_large_enough_update\@: +sub $16, %r11 +add %r13, %r11 + +# receive the last <16 Byte block +vmovdqu(arg4, %r11, 1), %xmm1 -vmovdqu TMP1(%rsp), %xmm1 +sub%r13, %r11 +add$16, %r11 -sub $16, %r11 +leaSHIFT_MASK+16(%rip), %r12 +# adjust the shuffle mask pointer to be able to shift 16-r13 bytes +# (r13 is the number of bytes in plaintext mod 16) +sub%r13, %r12 +# get the appropriate shuffle mask +vmovdqu(%r12), %xmm2 +# shift right 16-r13 bytes +vpshufb %xmm2, %xmm1, %xmm1 _final_ghash_mul\@: .if \ENC_DEC == DEC @@ -490,8 +478,6 @@ _final_ghash_mul\@: vpxor %xmm2, %xmm14, %xmm14 vmovdqu %xmm14, AadHash(arg2) -sub %r13, %r11 -add $16, %r11 .else vpxor %xmm1, %xmm9, %xmm9 # Plaintext XOR E(K, Yn) vmovdqu ALL_F-SHIFT_MASK(%r12), %xmm1# get the appropriate mask to @@ -501,8 +487,6 @@ _final_ghash_mul\@: vpxor %xmm9, %xmm14, %xmm14 vmovdqu %xmm14, AadHash(arg2) -sub %r13, %r11 -add $16, %r11 vpshufb SHUF_MASK(%rip), %xmm9, %xmm9# shuffle xmm9 back to output as ciphertext .endif @@ -721,6 +705,38 @@ _get_AAD_done\@: \PRECOMPUTE %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5 .endm + +# R
[PATCH 12/12] x86/crypto: aesni: Add scatter/gather avx stubs, and use them in C
Add the appropriate scatter/gather stubs to the avx asm. In the C code, we can now always use crypt_by_sg, since both sse and asm code now support scatter/gather. Introduce a new struct, aesni_gcm_tfm, that is initialized on startup to point to either the SSE, AVX, or AVX2 versions of the four necessary encryption/decryption routines. GENX_OPTSIZE is still checked at the start of crypt_by_sg. The total size of the data is checked, since the additional overhead is in the init function, calculating additional HashKeys. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 181 ++-- arch/x86/crypto/aesni-intel_glue.c | 349 +++ 2 files changed, 198 insertions(+), 332 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index af45fc57db90..91c039ab5699 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -518,14 +518,13 @@ _less_than_8_bytes_left\@: # _multiple_of_16_bytes\@: -GCM_COMPLETE \GHASH_MUL \REP .endm # GCM_COMPLETE Finishes update of tag of last partial block # Output: Authorization Tag (AUTH_TAG) # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 -.macro GCM_COMPLETE GHASH_MUL REP +.macro GCM_COMPLETE GHASH_MUL REP AUTH_TAG AUTH_TAG_LEN vmovdqu AadHash(arg2), %xmm14 vmovdqu HashKey(arg2), %xmm13 @@ -560,8 +559,8 @@ _partial_done\@: _return_T\@: -mov arg9, %r10 # r10 = authTag -mov arg10, %r11 # r11 = auth_tag_len +mov \AUTH_TAG, %r10 # r10 = authTag +mov \AUTH_TAG_LEN, %r11 # r11 = auth_tag_len cmp $16, %r11 je _T_16\@ @@ -680,14 +679,14 @@ _get_AAD_done\@: mov %r11, PBlockLen(arg2) # ctx_data.partial_block_length = 0 mov %r11, PBlockEncKey(arg2) # ctx_data.partial_block_enc_key = 0 -mov arg4, %rax +mov arg3, %rax movdqu (%rax), %xmm0 movdqu %xmm0, OrigIV(arg2) # ctx_data.orig_IV = iv vpshufb SHUF_MASK(%rip), %xmm0, %xmm0 movdqu %xmm0, CurCount(arg2) # ctx_data.current_counter = iv -vmovdqu (arg3), %xmm6 # xmm6 = HashKey +vmovdqu 
(arg4), %xmm6 # xmm6 = HashKey vpshufb SHUF_MASK(%rip), %xmm6, %xmm6 ### PRECOMPUTATION of HashKey<<1 mod poly from the HashKey @@ -1776,88 +1775,100 @@ _initial_blocks_done\@: #const u8 *aad, /* Additional Authentication Data (AAD)*/ #u64 aad_len) /* Length of AAD in bytes. With RFC4106 this is going to be 8 or 12 Bytes */ # -ENTRY(aesni_gcm_precomp_avx_gen2) +ENTRY(aesni_gcm_init_avx_gen2) FUNC_SAVE INIT GHASH_MUL_AVX, PRECOMPUTE_AVX FUNC_RESTORE ret -ENDPROC(aesni_gcm_precomp_avx_gen2) +ENDPROC(aesni_gcm_init_avx_gen2) ### -#void aesni_gcm_enc_avx_gen2( +#void aesni_gcm_enc_update_avx_gen2( #gcm_data*my_ctx_data, /* aligned to 16 Bytes */ #gcm_context_data *data, #u8 *out, /* Ciphertext output. Encrypt in-place is allowed. */ #const u8 *in, /* Plaintext input */ -#u64 plaintext_len, /* Length of data in Bytes for encryption. */ -#u8 *iv, /* Pre-counter block j0: 4 byte salt -# (from Security Association) concatenated with 8 byte -# Initialisation Vector (from IPSec ESP Payload) -# concatenated with 0x0001. 16-byte aligned pointer. */ -#const u8 *aad, /* Additional Authentication Data (AAD)*/ -#u64 aad_len, /* Length of AAD in bytes. With RFC4106 this is going to be 8 or 12 Bytes */ -#u8 *auth_tag, /* Authenticated Tag output. */ -#u64 auth_tag_len)# /* Authenticated Tag Length in bytes. -# Valid values are 16 (most likely), 12 or 8. */ +#u64 plaintext_len) /* Length of data in Bytes for encryption. */ ### -ENTRY(aesni_gcm_enc_avx_gen2) +ENTRY(aesni_gcm_enc_update_avx_gen2) FUNC_SAVE mov keysize, %eax cmp $32, %eax -je key_256_enc +je key_256_enc_update cmp $16, %eax -je key_128_enc +je key_128_enc_update # must be 192 GCM_ENC_DEC INITIAL_BLOCKS_AVX, GHASH_8_ENCRYPT_8_PARALLEL_AVX, GHASH_LAST_8_AVX, GHASH_MUL_AVX, ENC, 11 FUNC_RESTORE ret -key_128_enc: +key_128_enc_update: GCM_ENC_DEC INITIAL_BLOCKS_AVX, GHASH_8_ENCRYPT_8_PARALLEL_AVX, GHASH_LAST_8_AVX, GHASH_MUL_AVX,
[PATCH 07/12] x86/crypto: aesni: Merge avx precompute functions
The precompute functions differ only by the sub-macros they call, merge them to a single macro. Later diffs add more code to fill in the gcm_context_data structure, this allows changes in a single place. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 76 +--- 1 file changed, 27 insertions(+), 49 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 305abece93ad..e347ba61db65 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -661,6 +661,31 @@ _get_AAD_done\@: vmovdqu \T7, AadHash(arg2) .endm +.macro INIT GHASH_MUL PRECOMPUTE +vmovdqu (arg3), %xmm6 # xmm6 = HashKey + +vpshufb SHUF_MASK(%rip), %xmm6, %xmm6 +### PRECOMPUTATION of HashKey<<1 mod poly from the HashKey +vmovdqa %xmm6, %xmm2 +vpsllq $1, %xmm6, %xmm6 +vpsrlq $63, %xmm2, %xmm2 +vmovdqa %xmm2, %xmm1 +vpslldq $8, %xmm2, %xmm2 +vpsrldq $8, %xmm1, %xmm1 +vpor %xmm2, %xmm6, %xmm6 +#reduction +vpshufd $0b00100100, %xmm1, %xmm2 +vpcmpeqd TWOONE(%rip), %xmm2, %xmm2 +vpandPOLY(%rip), %xmm2, %xmm2 +vpxor%xmm2, %xmm6, %xmm6# xmm6 holds the HashKey<<1 mod poly +### +vmovdqu %xmm6, HashKey(arg2) # store HashKey<<1 mod poly + +CALC_AAD_HASH \GHASH_MUL, arg5, arg6, %xmm2, %xmm6, %xmm3, %xmm4, %xmm5, %xmm7, %xmm1, %xmm0 + +\PRECOMPUTE %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5 +.endm + #ifdef CONFIG_AS_AVX ### # GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) @@ -1558,31 +1583,7 @@ _initial_blocks_done\@: # ENTRY(aesni_gcm_precomp_avx_gen2) FUNC_SAVE - -vmovdqu (arg3), %xmm6 # xmm6 = HashKey - -vpshufb SHUF_MASK(%rip), %xmm6, %xmm6 -### PRECOMPUTATION of HashKey<<1 mod poly from the HashKey -vmovdqa %xmm6, %xmm2 -vpsllq $1, %xmm6, %xmm6 -vpsrlq $63, %xmm2, %xmm2 -vmovdqa %xmm2, %xmm1 -vpslldq $8, %xmm2, %xmm2 -vpsrldq $8, %xmm1, %xmm1 -vpor %xmm2, %xmm6, %xmm6 -#reduction -vpshufd $0b00100100, %xmm1, %xmm2 -vpcmpeqd TWOONE(%rip), %xmm2, %xmm2 -vpandPOLY(%rip), 
%xmm2, %xmm2 -vpxor%xmm2, %xmm6, %xmm6# xmm6 holds the HashKey<<1 mod poly -### -vmovdqu %xmm6, HashKey(arg2) # store HashKey<<1 mod poly - - -CALC_AAD_HASH GHASH_MUL_AVX, arg5, arg6, %xmm2, %xmm6, %xmm3, %xmm4, %xmm5, %xmm7, %xmm1, %xmm0 - -PRECOMPUTE_AVX %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5 - +INIT GHASH_MUL_AVX, PRECOMPUTE_AVX FUNC_RESTORE ret ENDPROC(aesni_gcm_precomp_avx_gen2) @@ -2547,30 +2548,7 @@ _initial_blocks_done\@: # ENTRY(aesni_gcm_precomp_avx_gen4) FUNC_SAVE - -vmovdqu (arg3), %xmm6# xmm6 = HashKey - -vpshufb SHUF_MASK(%rip), %xmm6, %xmm6 -### PRECOMPUTATION of HashKey<<1 mod poly from the HashKey -vmovdqa %xmm6, %xmm2 -vpsllq $1, %xmm6, %xmm6 -vpsrlq $63, %xmm2, %xmm2 -vmovdqa %xmm2, %xmm1 -vpslldq $8, %xmm2, %xmm2 -vpsrldq $8, %xmm1, %xmm1 -vpor %xmm2, %xmm6, %xmm6 -#reduction -vpshufd $0b00100100, %xmm1, %xmm2 -vpcmpeqd TWOONE(%rip), %xmm2, %xmm2 -vpandPOLY(%rip), %xmm2, %xmm2 -vpxor%xmm2, %xmm6, %xmm6 # xmm6 holds the HashKey<<1 mod poly -### -vmovdqu %xmm6, HashKey(arg2) # store HashKey<<1 mod poly - -CALC_AAD_HASH GHASH_MUL_AVX2, arg5, arg6, %xmm2, %xmm6, %xmm3, %xmm4, %xmm5, %xmm7, %xmm1, %xmm0 - -PRECOMPUTE_AVX2 %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5 - +INIT GHASH_MUL_AVX2, PRECOMPUTE_AVX2 FUNC_RESTORE ret ENDPROC(aesni_gcm_precomp_avx_gen4) -- 2.17.1
[PATCH 11/12] x86/crypto: aesni: Introduce partial block macro
Before this diff, multiple calls to GCM_ENC_DEC will succeed, but only if all calls are a multiple of 16 bytes. Handle partial blocks at the start of GCM_ENC_DEC, and update aadhash as appropriate. The data offset %r11 is also updated after the partial block. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 156 ++- 1 file changed, 150 insertions(+), 6 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index ff00ad19064d..af45fc57db90 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -301,6 +301,12 @@ VARIABLE_OFFSET = 16*8 vmovdqu HashKey(arg2), %xmm13 # xmm13 = HashKey add arg5, InLen(arg2) +# initialize the data pointer offset as zero +xor %r11d, %r11d + +PARTIAL_BLOCK \GHASH_MUL, arg3, arg4, arg5, %r11, %xmm8, \ENC_DEC +sub %r11, arg5 + mov arg5, %r13 # save the number of bytes of plaintext/ciphertext and $-16, %r13 # r13 = r13 - (r13 mod 16) @@ -737,6 +743,150 @@ _read_next_byte_lt8_\@: _done_read_partial_block_\@: .endm +# PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks +# between update calls. 
+# Requires the input data be at least 1 byte long due to READ_PARTIAL_BLOCK +# Outputs encrypted bytes, and updates hash and partial info in gcm_data_context +# Clobbers rax, r10, r12, r13, xmm0-6, xmm9-13 +.macro PARTIAL_BLOCK GHASH_MUL CYPH_PLAIN_OUT PLAIN_CYPH_IN PLAIN_CYPH_LEN DATA_OFFSET \ +AAD_HASH ENC_DEC +movPBlockLen(arg2), %r13 +cmp$0, %r13 +je _partial_block_done_\@ # Leave Macro if no partial blocks +# Read in input data without over reading +cmp$16, \PLAIN_CYPH_LEN +jl _fewer_than_16_bytes_\@ +vmovdqu(\PLAIN_CYPH_IN), %xmm1 # If more than 16 bytes, just fill xmm +jmp_data_read_\@ + +_fewer_than_16_bytes_\@: +lea(\PLAIN_CYPH_IN, \DATA_OFFSET, 1), %r10 +mov\PLAIN_CYPH_LEN, %r12 +READ_PARTIAL_BLOCK %r10 %r12 %xmm1 + +mov PBlockLen(arg2), %r13 + +_data_read_\@: # Finished reading in data + +vmovdquPBlockEncKey(arg2), %xmm9 +vmovdquHashKey(arg2), %xmm13 + +leaSHIFT_MASK(%rip), %r12 + +# adjust the shuffle mask pointer to be able to shift r13 bytes +# r16-r13 is the number of bytes in plaintext mod 16) +add%r13, %r12 +vmovdqu(%r12), %xmm2 # get the appropriate shuffle mask +vpshufb %xmm2, %xmm9, %xmm9# shift right r13 bytes + +.if \ENC_DEC == DEC +vmovdqa%xmm1, %xmm3 +pxor %xmm1, %xmm9# Cyphertext XOR E(K, Yn) + +mov\PLAIN_CYPH_LEN, %r10 +add%r13, %r10 +# Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling +sub$16, %r10 +# Determine if if partial block is not being filled and +# shift mask accordingly +jge_no_extra_mask_1_\@ +sub%r10, %r12 +_no_extra_mask_1_\@: + +vmovdquALL_F-SHIFT_MASK(%r12), %xmm1 +# get the appropriate mask to mask out bottom r13 bytes of xmm9 +vpand %xmm1, %xmm9, %xmm9 # mask out bottom r13 bytes of xmm9 + +vpand %xmm1, %xmm3, %xmm3 +vmovdqaSHUF_MASK(%rip), %xmm10 +vpshufb%xmm10, %xmm3, %xmm3 +vpshufb%xmm2, %xmm3, %xmm3 +vpxor %xmm3, \AAD_HASH, \AAD_HASH + +cmp$0, %r10 +jl _partial_incomplete_1_\@ + +# GHASH computation for the last <16 Byte block +\GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6 
+xor%eax,%eax + +mov%rax, PBlockLen(arg2) +jmp_dec_done_\@ +_partial_incomplete_1_\@: +add\PLAIN_CYPH_LEN, PBlockLen(arg2) +_dec_done_\@: +vmovdqu\AAD_HASH, AadHash(arg2) +.else +vpxor %xmm1, %xmm9, %xmm9 # Plaintext XOR E(K, Yn) + +mov\PLAIN_CYPH_LEN, %r10 +add%r13, %r10 +# Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling +sub$16, %r10 +# Determine if if partial block is not being filled and +# shift mask accordingly +jge_no_extra_mask_2_\@ +sub%r10, %r12 +_no_extra_mask_2_\@: + +vmovdquALL_F-SHIFT_MASK(%r12), %xmm1 +# get the appropriate mask to mask out bottom r13 bytes of xmm9 +vpand %xmm1, %xmm9, %xmm9 + +vmovdqaSHUF_MASK(%rip), %xmm1 +vpshufb %xmm1, %xmm9, %xmm9 +vpshufb %xmm2, %xmm9, %xmm9 +vpxor %xmm9, \AAD_HASH, \AAD_HASH + +cmp$0, %r10 +jl _partial_incomplete_2_\@ + +# GH
[PATCH 09/12] x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE
Prepare to handle partial blocks between scatter/gather calls. For the
last partial block, we only want to calculate the aadhash in
GCM_COMPLETE, and a new partial block macro will handle both aadhash
update and encrypting partial blocks between calls.

Signed-off-by: Dave Watson
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 0a9cdcfdd987..44a4a8b43ca4 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -488,8 +488,7 @@ _final_ghash_mul\@:
 	vpand	%xmm1, %xmm2, %xmm2
 	vpshufb	SHUF_MASK(%rip), %xmm2, %xmm2
 	vpxor	%xmm2, %xmm14, %xmm14
-	#GHASH computation for the last <16 Byte block
-	\GHASH_MUL	%xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+
 	vmovdqu	%xmm14, AadHash(arg2)
 	sub	%r13, %r11
 	add	$16, %r11
@@ -500,8 +499,7 @@ _final_ghash_mul\@:
 	vpand	%xmm1, %xmm9, %xmm9		# mask out top 16-r13 bytes of xmm9
 	vpshufb	SHUF_MASK(%rip), %xmm9, %xmm9
 	vpxor	%xmm9, %xmm14, %xmm14
-	#GHASH computation for the last <16 Byte block
-	\GHASH_MUL	%xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+
 	vmovdqu	%xmm14, AadHash(arg2)
 	sub	%r13, %r11
 	add	$16, %r11
@@ -541,6 +539,14 @@ _multiple_of_16_bytes\@:
 	vmovdqu	AadHash(arg2), %xmm14
 	vmovdqu	HashKey(arg2), %xmm13
 
+	mov	PBlockLen(arg2), %r12
+	cmp	$0, %r12
+	je	_partial_done\@
+
+	#GHASH computation for the last <16 Byte block
+	\GHASH_MUL	%xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+
+_partial_done\@:
 	mov	AadLen(arg2), %r12	# r12 = aadLen (number of bytes)
 	shl	$3, %r12		# convert into number of bits
 	vmovd	%r12d, %xmm15		# len(A) in xmm15
--
2.17.1
[PATCH 06/12] x86/crypto: aesni: Split AAD hash calculation to separate macro
AAD hash only needs to be calculated once for each scatter/gather operation. Move it to its own macro, and call it from GCM_INIT instead of INITIAL_BLOCKS. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 228 ++- arch/x86/crypto/aesni-intel_glue.c | 28 ++- 2 files changed, 115 insertions(+), 141 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 8e9ae4b26118..305abece93ad 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -182,6 +182,14 @@ aad_shift_arr: .text +#define AadHash 16*0 +#define AadLen 16*1 +#define InLen (16*1)+8 +#define PBlockEncKey 16*2 +#define OrigIV 16*3 +#define CurCount 16*4 +#define PBlockLen 16*5 + HashKey= 16*6 # store HashKey <<1 mod poly here HashKey_2 = 16*7 # store HashKey^2 <<1 mod poly here HashKey_3 = 16*8 # store HashKey^3 <<1 mod poly here @@ -585,6 +593,74 @@ _T_16\@: _return_T_done\@: .endm +.macro CALC_AAD_HASH GHASH_MUL AAD AADLEN T1 T2 T3 T4 T5 T6 T7 T8 + + mov \AAD, %r10 # r10 = AAD + mov \AADLEN, %r12 # r12 = aadLen + + + mov %r12, %r11 + + vpxor \T8, \T8, \T8 + vpxor \T7, \T7, \T7 + cmp $16, %r11 + jl _get_AAD_rest8\@ +_get_AAD_blocks\@: + vmovdqu (%r10), \T7 + vpshufb SHUF_MASK(%rip), \T7, \T7 + vpxor \T7, \T8, \T8 + \GHASH_MUL \T8, \T2, \T1, \T3, \T4, \T5, \T6 + add $16, %r10 + sub $16, %r12 + sub $16, %r11 + cmp $16, %r11 + jge _get_AAD_blocks\@ + vmovdqu \T8, \T7 + cmp $0, %r11 + je _get_AAD_done\@ + + vpxor \T7, \T7, \T7 + + /* read the last <16B of AAD. 
since we have at least 4B of + data right after the AAD (the ICV, and maybe some CT), we can + read 4B/8B blocks safely, and then get rid of the extra stuff */ +_get_AAD_rest8\@: + cmp $4, %r11 + jle _get_AAD_rest4\@ + movq(%r10), \T1 + add $8, %r10 + sub $8, %r11 + vpslldq $8, \T1, \T1 + vpsrldq $8, \T7, \T7 + vpxor \T1, \T7, \T7 + jmp _get_AAD_rest8\@ +_get_AAD_rest4\@: + cmp $0, %r11 + jle _get_AAD_rest0\@ + mov (%r10), %eax + movq%rax, \T1 + add $4, %r10 + sub $4, %r11 + vpslldq $12, \T1, \T1 + vpsrldq $4, \T7, \T7 + vpxor \T1, \T7, \T7 +_get_AAD_rest0\@: + /* finalize: shift out the extra bytes we read, and align + left. since pslldq can only shift by an immediate, we use + vpshufb and an array of shuffle masks */ + movq%r12, %r11 + salq$4, %r11 + vmovdqu aad_shift_arr(%r11), \T1 + vpshufb \T1, \T7, \T7 +_get_AAD_rest_final\@: + vpshufb SHUF_MASK(%rip), \T7, \T7 + vpxor \T8, \T7, \T7 + \GHASH_MUL \T7, \T2, \T1, \T3, \T4, \T5, \T6 + +_get_AAD_done\@: +vmovdqu \T7, AadHash(arg2) +.endm + #ifdef CONFIG_AS_AVX ### # GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) @@ -701,72 +777,9 @@ _return_T_done\@: .macro INITIAL_BLOCKS_AVX REP num_initial_blocks T1 T2 T3 T4 T5 CTR XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T6 T_key ENC_DEC i = (8-\num_initial_blocks) - j = 0 setreg +vmovdqu AadHash(arg2), reg_i - mov arg7, %r10 # r10 = AAD - mov arg8, %r12 # r12 = aadLen - - - mov %r12, %r11 - - vpxor reg_j, reg_j, reg_j - vpxor reg_i, reg_i, reg_i - cmp $16, %r11 - jl _get_AAD_rest8\@ -_get_AAD_blocks\@: - vmovdqu (%r10), reg_i - vpshufb SHUF_MASK(%rip), reg_i, reg_i - vpxor reg_i, reg_j, reg_j - GHASH_MUL_AVX reg_j, \T2, \T1, \T3, \T4, \T5, \T6 - add $16, %r10 - sub $16, %r12 - sub $16, %r11 - cmp $16, %r11 - jge _get_AAD_blocks\@ - vmovdqu reg_j, reg_i - cmp $0, %r11 - je _get_AAD_done\@ - - vpxor reg_i, reg_i, reg_i - - /* read the last <16B of AAD. 
since we have at least 4B of - data right after the AAD (the ICV, and maybe some CT), we can - read 4B/8B blocks safely, and then get rid of the extra stuff */ -_get_AAD_rest8\@: - cmp $4, %r11 - jle _get_AAD_rest4\@ - movq(%r10), \T1 - add $8, %r10 - sub $8, %r11 - vpslldq $8, \T1, \T1 - vpsrldq $8, reg_i, reg_i - vpxor \T1, reg_i, reg_i - jmp _get_AAD_rest8\@ -_get_AAD_rest4\@: - cmp $0, %r11 - jle _get_AAD_rest0\@ - mov (%r10), %eax - movq%rax, \T1 - add $4,
[PATCH 08/12] x86/crypto: aesni: Fill in new context data structures
Fill in aadhash, aadlen, pblocklen, curcount with appropriate values. pblocklen, aadhash, and pblockenckey are also updated at the end of each scatter/gather operation, to be carried over to the next operation. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 51 +--- 1 file changed, 37 insertions(+), 14 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index e347ba61db65..0a9cdcfdd987 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -297,7 +297,9 @@ VARIABLE_OFFSET = 16*8 # clobbering all xmm registers # clobbering r10, r11, r12, r13, r14, r15 .macro GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 GHASH_MUL ENC_DEC REP +vmovdqu AadHash(arg2), %xmm8 vmovdqu HashKey(arg2), %xmm13 # xmm13 = HashKey +add arg5, InLen(arg2) mov arg5, %r13 # save the number of bytes of plaintext/ciphertext and $-16, %r13 # r13 = r13 - (r13 mod 16) @@ -410,6 +412,9 @@ _eight_cipher_left\@: _zero_cipher_left\@: +vmovdqu %xmm14, AadHash(arg2) +vmovdqu %xmm9, CurCount(arg2) + cmp $16, arg5 jl _only_less_than_16\@ @@ -420,10 +425,14 @@ _zero_cipher_left\@: # handle the last <16 Byte block seperately +mov %r13, PBlockLen(arg2) vpaddd ONE(%rip), %xmm9, %xmm9 # INCR CNT to get Yn +vmovdqu %xmm9, CurCount(arg2) vpshufb SHUF_MASK(%rip), %xmm9, %xmm9 + ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Yn) +vmovdqu %xmm9, PBlockEncKey(arg2) sub $16, %r11 add %r13, %r11 @@ -451,6 +460,7 @@ _only_less_than_16\@: vpshufb SHUF_MASK(%rip), %xmm9, %xmm9 ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Yn) +vmovdqu %xmm9, PBlockEncKey(arg2) lea SHIFT_MASK+16(%rip), %r12 sub %r13, %r12 # adjust the shuffle mask pointer to be @@ -480,6 +490,7 @@ _final_ghash_mul\@: vpxor %xmm2, %xmm14, %xmm14 #GHASH computation for the last <16 Byte block \GHASH_MUL %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6 +vmovdqu %xmm14, AadHash(arg2) sub %r13, %r11 add $16, %r11 .else @@ -491,6 
+502,7 @@ _final_ghash_mul\@: vpxor %xmm9, %xmm14, %xmm14 #GHASH computation for the last <16 Byte block \GHASH_MUL %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6 +vmovdqu %xmm14, AadHash(arg2) sub %r13, %r11 add $16, %r11 vpshufb SHUF_MASK(%rip), %xmm9, %xmm9# shuffle xmm9 back to output as ciphertext @@ -526,12 +538,16 @@ _multiple_of_16_bytes\@: # Output: Authorization Tag (AUTH_TAG) # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 .macro GCM_COMPLETE GHASH_MUL REP -mov arg8, %r12 # r12 = aadLen (number of bytes) +vmovdqu AadHash(arg2), %xmm14 +vmovdqu HashKey(arg2), %xmm13 + +mov AadLen(arg2), %r12 # r12 = aadLen (number of bytes) shl $3, %r12 # convert into number of bits vmovd %r12d, %xmm15# len(A) in xmm15 -shl $3, arg5 # len(C) in bits (*128) -vmovq arg5, %xmm1 +mov InLen(arg2), %r12 +shl $3, %r12# len(C) in bits (*128) +vmovq %r12, %xmm1 vpslldq $8, %xmm15, %xmm15 # xmm15 = len(A)|| 0x vpxor %xmm1, %xmm15, %xmm15# xmm15 = len(A)||len(C) @@ -539,8 +555,7 @@ _multiple_of_16_bytes\@: \GHASH_MUL %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6 # final GHASH computation vpshufb SHUF_MASK(%rip), %xmm14, %xmm14 # perform a 16Byte swap -mov arg6, %rax # rax = *Y0 -vmovdqu (%rax), %xmm9# xmm9 = Y0 +vmovdqu OrigIV(arg2), %xmm9 ENCRYPT_SINGLE_BLOCK\REP, %xmm9# E(K, Y0) @@ -662,6 +677,20 @@ _get_AAD_done\@: .endm .macro INIT GHASH_MUL PRECOMPUTE +mov arg6, %r11 +mov %r11, AadLen(arg2) # ctx_data.aad_length = aad_length +xor %r11d, %r11d +mov %r11, InLen(arg2) # ctx_data.in_length = 0 + +mov %r11, PBlockLen(arg2) # ctx_data.partial_block_length = 0 +mov %r11, PBlockEncKey(arg2) # ctx_data.partial_block_enc_key = 0 +mov arg4, %rax +movdqu (%rax), %xmm0 +movdqu %xmm0, OrigIV(arg2) # ctx_data.orig_IV = iv + +vpshufb SHUF_MASK(%rip), %xmm0, %xmm0 +movdqu %xmm0, CurCount(arg2) # ctx_data.current_cou
[PATCH 05/12] x86/crypto: aesni: Add GCM_COMPLETE macro
Merge encode and decode tag calculations in GCM_COMPLETE macro.
Scatter/gather routines will call this once at the end of encryption
or decryption.

Signed-off-by: Dave Watson
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 2aa11c503bb9..8e9ae4b26118 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -510,6 +510,14 @@ _less_than_8_bytes_left\@:
 #
 
 _multiple_of_16_bytes\@:
+	GCM_COMPLETE \GHASH_MUL \REP
+.endm
+
+
+# GCM_COMPLETE Finishes update of tag of last partial block
+# Output: Authorization Tag (AUTH_TAG)
+# Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
+.macro GCM_COMPLETE GHASH_MUL REP
 	mov	arg8, %r12		# r12 = aadLen (number of bytes)
 	shl	$3, %r12		# convert into number of bits
 	vmovd	%r12d, %xmm15		# len(A) in xmm15
--
2.17.1
[PATCH 04/12] x86/crypto: aesni: support 256 byte keys in avx asm
Add support for 192/256-bit keys using the avx gcm/aes routines. The sse routines were previously updated in e31ac32d3b (Add support for 192 & 256 bit keys to AESNI RFC4106). Instead of adding an additional loop in the hotpath as in e31ac32d3b, this diff instead generates separate versions of the code using macros, and the entry routines choose which version once. This results in a 5% performance improvement vs. adding a loop to the hot path. This is the same strategy chosen by the intel isa-l_crypto library. The key size checks are removed from the c code where appropriate. Note that this diff depends on using gcm_context_data - 256 bit keys require 16 HashKeys + 15 expanded keys, which is larger than struct crypto_aes_ctx, so they are stored in struct gcm_context_data. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 188 +-- arch/x86/crypto/aesni-intel_glue.c | 18 +-- 2 files changed, 145 insertions(+), 61 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index dd895f69399b..2aa11c503bb9 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -209,6 +209,7 @@ HashKey_8_k= 16*21 # store XOR of HashKey^8 <<1 mod poly here (for Karatsu #define arg8 STACK_OFFSET+8*2(%r14) #define arg9 STACK_OFFSET+8*3(%r14) #define arg10 STACK_OFFSET+8*4(%r14) +#define keysize 2*15*16(arg1) i = 0 j = 0 @@ -272,22 +273,22 @@ VARIABLE_OFFSET = 16*8 .endm # Encryption of a single block -.macro ENCRYPT_SINGLE_BLOCK XMM0 +.macro ENCRYPT_SINGLE_BLOCK REP XMM0 vpxor(arg1), \XMM0, \XMM0 - i = 1 - setreg -.rep 9 + i = 1 + setreg +.rep \REP vaesenc 16*i(arg1), \XMM0, \XMM0 - i = (i+1) - setreg + i = (i+1) + setreg .endr -vaesenclast 16*10(arg1), \XMM0, \XMM0 +vaesenclast 16*i(arg1), \XMM0, \XMM0 .endm # combined for GCM encrypt and decrypt functions # clobbering all xmm registers # clobbering r10, r11, r12, r13, r14, r15 -.macro GCM_ENC_DEC INITIAL_BLOCKS 
GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 GHASH_MUL ENC_DEC +.macro GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 GHASH_MUL ENC_DEC REP vmovdqu HashKey(arg2), %xmm13 # xmm13 = HashKey mov arg5, %r13 # save the number of bytes of plaintext/ciphertext @@ -314,42 +315,42 @@ VARIABLE_OFFSET = 16*8 jmp _initial_num_blocks_is_1\@ _initial_num_blocks_is_7\@: -\INITIAL_BLOCKS 7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +\INITIAL_BLOCKS \REP, 7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC sub $16*7, %r13 jmp _initial_blocks_encrypted\@ _initial_num_blocks_is_6\@: -\INITIAL_BLOCKS 6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +\INITIAL_BLOCKS \REP, 6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC sub $16*6, %r13 jmp _initial_blocks_encrypted\@ _initial_num_blocks_is_5\@: -\INITIAL_BLOCKS 5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +\INITIAL_BLOCKS \REP, 5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC sub $16*5, %r13 jmp _initial_blocks_encrypted\@ _initial_num_blocks_is_4\@: -\INITIAL_BLOCKS 4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +\INITIAL_BLOCKS \REP, 4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC sub $16*4, %r13 jmp _initial_blocks_encrypted\@ _initial_num_blocks_is_3\@: -\INITIAL_BLOCKS 3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, 
%xmm10, %xmm0, \ENC_DEC +\INITIAL_BLOCKS \REP, 3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC sub $16*3, %r13 jmp _initial_blocks_encrypted\@ _initial_num_blocks_is_2\@: -\INITIAL_BLOCKS 2, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +\INITIAL_BL
[PATCH 02/12] x86/crypto: aesni: Introduce gcm_context_data
Add the gcm_context_data structure to the avx asm routines. This will be necessary to support both 256 bit keys and scatter/gather. The pre-computed HashKeys are now stored in the gcm_context_data struct, which is expanded to hold the greater number of hashkeys necessary for avx. Loads and stores to the new struct are always done unlaligned to avoid compiler issues, see e5b954e8 "Use unaligned loads from gcm_context_data" Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 378 +++ arch/x86/crypto/aesni-intel_glue.c | 58 ++-- 2 files changed, 215 insertions(+), 221 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 318135a77975..284f1b8b88fc 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -182,43 +182,22 @@ aad_shift_arr: .text -##define the fields of the gcm aes context -#{ -#u8 expanded_keys[16*11] store expanded keys -#u8 shifted_hkey_1[16] store HashKey <<1 mod poly here -#u8 shifted_hkey_2[16] store HashKey^2 <<1 mod poly here -#u8 shifted_hkey_3[16] store HashKey^3 <<1 mod poly here -#u8 shifted_hkey_4[16] store HashKey^4 <<1 mod poly here -#u8 shifted_hkey_5[16] store HashKey^5 <<1 mod poly here -#u8 shifted_hkey_6[16] store HashKey^6 <<1 mod poly here -#u8 shifted_hkey_7[16] store HashKey^7 <<1 mod poly here -#u8 shifted_hkey_8[16] store HashKey^8 <<1 mod poly here -#u8 shifted_hkey_1_k[16] store XOR HashKey <<1 mod poly here (for Karatsuba purposes) -#u8 shifted_hkey_2_k[16] store XOR HashKey^2 <<1 mod poly here (for Karatsuba purposes) -#u8 shifted_hkey_3_k[16] store XOR HashKey^3 <<1 mod poly here (for Karatsuba purposes) -#u8 shifted_hkey_4_k[16] store XOR HashKey^4 <<1 mod poly here (for Karatsuba purposes) -#u8 shifted_hkey_5_k[16] store XOR HashKey^5 <<1 mod poly here (for Karatsuba purposes) -#u8 shifted_hkey_6_k[16] store XOR HashKey^6 <<1 mod poly here (for Karatsuba purposes) -#u8 shifted_hkey_7_k[16] store 
XOR HashKey^7 <<1 mod poly here (for Karatsuba purposes) -#u8 shifted_hkey_8_k[16] store XOR HashKey^8 <<1 mod poly here (for Karatsuba purposes) -#} gcm_ctx# - -HashKey= 16*11 # store HashKey <<1 mod poly here -HashKey_2 = 16*12 # store HashKey^2 <<1 mod poly here -HashKey_3 = 16*13 # store HashKey^3 <<1 mod poly here -HashKey_4 = 16*14 # store HashKey^4 <<1 mod poly here -HashKey_5 = 16*15 # store HashKey^5 <<1 mod poly here -HashKey_6 = 16*16 # store HashKey^6 <<1 mod poly here -HashKey_7 = 16*17 # store HashKey^7 <<1 mod poly here -HashKey_8 = 16*18 # store HashKey^8 <<1 mod poly here -HashKey_k = 16*19 # store XOR of HashKey <<1 mod poly here (for Karatsuba purposes) -HashKey_2_k= 16*20 # store XOR of HashKey^2 <<1 mod poly here (for Karatsuba purposes) -HashKey_3_k= 16*21 # store XOR of HashKey^3 <<1 mod poly here (for Karatsuba purposes) -HashKey_4_k= 16*22 # store XOR of HashKey^4 <<1 mod poly here (for Karatsuba purposes) -HashKey_5_k= 16*23 # store XOR of HashKey^5 <<1 mod poly here (for Karatsuba purposes) -HashKey_6_k= 16*24 # store XOR of HashKey^6 <<1 mod poly here (for Karatsuba purposes) -HashKey_7_k= 16*25 # store XOR of HashKey^7 <<1 mod poly here (for Karatsuba purposes) -HashKey_8_k= 16*26 # store XOR of HashKey^8 <<1 mod poly here (for Karatsuba purposes) +HashKey= 16*6 # store HashKey <<1 mod poly here +HashKey_2 = 16*7 # store HashKey^2 <<1 mod poly here +HashKey_3 = 16*8 # store HashKey^3 <<1 mod poly here +HashKey_4 = 16*9 # store HashKey^4 <<1 mod poly here +HashKey_5 = 16*10 # store HashKey^5 <<1 mod poly here +HashKey_6 = 16*11 # store HashKey^6 <<1 mod poly here +HashKey_7 = 16*12 # store HashKey^7 <<1 mod poly here +HashKey_8 = 16*13 # store HashKey^8 <<1 mod poly here +HashKey_k = 16*14 # store XOR of HashKey <<1 mod poly here (for Karatsuba purposes) +HashKey_2_k= 16*15 # store XOR of HashKey^2 <<1 mod poly here (for Karatsuba purposes) +HashKey_3_k= 16*16 # store XOR of HashKey^3 <<1 mod poly here (for Karatsuba purposes) 
+HashKey_4_k= 16*17 # store XOR of HashKey^4 <<1 mod poly here (for Karatsuba purposes) +HashKey_5_k= 16*18 # store XOR of HashKey^5 <<1 mod poly here (for Karatsuba purposes) +HashKey_6_k= 16*19 # store XOR of HashKey^6 <<1 mod poly here (for Karatsuba purposes) +HashKey_7_k= 16*20 # stor
[PATCH 03/12] x86/crypto: aesni: Macro-ify func save/restore
Macro-ify function save and restore. These will be used in new functions added for scatter/gather update operations. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 94 +--- 1 file changed, 36 insertions(+), 58 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 284f1b8b88fc..dd895f69399b 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -247,6 +247,30 @@ VARIABLE_OFFSET = 16*8 # Utility Macros +.macro FUNC_SAVE +#the number of pushes must equal STACK_OFFSET +push%r12 +push%r13 +push%r14 +push%r15 + +mov %rsp, %r14 + + + +sub $VARIABLE_OFFSET, %rsp +and $~63, %rsp# align rsp to 64 bytes +.endm + +.macro FUNC_RESTORE +mov %r14, %rsp + +pop %r15 +pop %r14 +pop %r13 +pop %r12 +.endm + # Encryption of a single block .macro ENCRYPT_SINGLE_BLOCK XMM0 vpxor(arg1), \XMM0, \XMM0 @@ -264,22 +288,6 @@ VARIABLE_OFFSET = 16*8 # clobbering all xmm registers # clobbering r10, r11, r12, r13, r14, r15 .macro GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 GHASH_MUL ENC_DEC - -#the number of pushes must equal STACK_OFFSET -push%r12 -push%r13 -push%r14 -push%r15 - -mov %rsp, %r14 - - - - -sub $VARIABLE_OFFSET, %rsp -and $~63, %rsp # align rsp to 64 bytes - - vmovdqu HashKey(arg2), %xmm13 # xmm13 = HashKey mov arg5, %r13 # save the number of bytes of plaintext/ciphertext @@ -566,12 +574,6 @@ _T_16\@: vmovdqu %xmm9, (%r10) _return_T_done\@: -mov %r14, %rsp - -pop %r15 -pop %r14 -pop %r13 -pop %r12 .endm #ifdef CONFIG_AS_AVX @@ -1511,18 +1513,7 @@ _initial_blocks_done\@: #u8 *hash_subkey)# /* H, the Hash sub key input. Data starts on a 16-byte boundary. 
*/ # ENTRY(aesni_gcm_precomp_avx_gen2) -#the number of pushes must equal STACK_OFFSET -push%r12 -push%r13 -push%r14 -push%r15 - -mov %rsp, %r14 - - - -sub $VARIABLE_OFFSET, %rsp -and $~63, %rsp # align rsp to 64 bytes +FUNC_SAVE vmovdqu (arg3), %xmm6 # xmm6 = HashKey @@ -1546,12 +1537,7 @@ ENTRY(aesni_gcm_precomp_avx_gen2) PRECOMPUTE_AVX %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5 -mov %r14, %rsp - -pop %r15 -pop %r14 -pop %r13 -pop %r12 +FUNC_RESTORE ret ENDPROC(aesni_gcm_precomp_avx_gen2) @@ -1573,7 +1559,9 @@ ENDPROC(aesni_gcm_precomp_avx_gen2) # Valid values are 16 (most likely), 12 or 8. */ ### ENTRY(aesni_gcm_enc_avx_gen2) +FUNC_SAVE GCM_ENC_DEC INITIAL_BLOCKS_AVX GHASH_8_ENCRYPT_8_PARALLEL_AVX GHASH_LAST_8_AVX GHASH_MUL_AVX ENC +FUNC_RESTORE ret ENDPROC(aesni_gcm_enc_avx_gen2) @@ -1595,7 +1583,9 @@ ENDPROC(aesni_gcm_enc_avx_gen2) # Valid values are 16 (most likely), 12 or 8. */ ### ENTRY(aesni_gcm_dec_avx_gen2) +FUNC_SAVE GCM_ENC_DEC INITIAL_BLOCKS_AVX GHASH_8_ENCRYPT_8_PARALLEL_AVX GHASH_LAST_8_AVX GHASH_MUL_AVX DEC +FUNC_RESTORE ret ENDPROC(aesni_gcm_dec_avx_gen2) #endif /* CONFIG_AS_AVX */ @@ -2525,18 +2515,7 @@ _initial_blocks_done\@: # Data starts on a 16-byte boundary. */ # ENTRY(aesni_gcm_precomp_avx_gen4) -#the number of pushes must equal STACK_OFFSET -push%r12 -push%r13 -push%r14 -push%r15 - -mov %rsp, %r14 - - - -sub $VARIABLE_OFFSET, %rsp -and $~63, %rsp# align rsp to 64 bytes +FUNC_SAVE vmovdqu (arg3), %xmm6# xmm6 = HashKey @@ -2560,12 +2539,7 @@ ENTRY(aesni_gcm_precomp_avx_gen4) PRECOMPUTE_AVX2 %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5 -mov %r14, %rsp - -pop %r15 -pop %r14 -pop %r13 -pop %r12 +FUNC_RESTORE ret ENDPROC(aesni_gcm_precomp_avx_gen4) @@ -2588,7 +2562,9 @@ ENDPROC(aesni_gcm_precomp_avx_gen4) # Valid values are 16 (most likely), 12 or 8
[PATCH 00/12] x86/crypto: gcmaes AVX scatter/gather support
This patch set refactors the x86 aes/gcm AVX crypto routines to support true scatter/gather by adding gcm_enc/dec_update methods. It is similar to the previous SSE patchset starting at e1fd316f. Unlike the SSE routines, the AVX routines did not support key sizes 192 and 256; this patchset also adds support for those key sizes. The final patch updates the C glue code, passing everything through the crypt_by_sg() function instead of the previous memcpy based routines. Dave Watson (12): x86/crypto: aesni: Merge GCM_ENC_DEC x86/crypto: aesni: Introduce gcm_context_data x86/crypto: aesni: Macro-ify func save/restore x86/crypto: aesni: support 256 byte keys in avx asm x86/crypto: aesni: Add GCM_COMPLETE macro x86/crypto: aesni: Split AAD hash calculation to separate macro x86/crypto: aesni: Merge avx precompute functions x86/crypto: aesni: Fill in new context data structures x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE x86/crypto: aesni: Introduce READ_PARTIAL_BLOCK macro x86/crypto: aesni: Introduce partial block macro x86/crypto: aesni: Add scatter/gather avx stubs, and use them in C arch/x86/crypto/aesni-intel_avx-x86_64.S | 2125 ++ arch/x86/crypto/aesni-intel_glue.c | 353 ++-- 2 files changed, 1117 insertions(+), 1361 deletions(-) -- 2.17.1
[PATCH 01/12] x86/crypto: aesni: Merge GCM_ENC_DEC
The GCM_ENC_DEC routines for AVX and AVX2 are identical, except they call separate sub-macros. Pass the macros as arguments, and merge them. This facilitates additional refactoring, by requiring changes in only one place. The GCM_ENC_DEC macro was moved above the CONFIG_AS_AVX* ifdefs, since it will be used by both AVX and AVX2. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 951 --- 1 file changed, 318 insertions(+), 633 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 1985ea0b551b..318135a77975 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -280,6 +280,320 @@ VARIABLE_OFFSET = 16*8 vaesenclast 16*10(arg1), \XMM0, \XMM0 .endm +# combined for GCM encrypt and decrypt functions +# clobbering all xmm registers +# clobbering r10, r11, r12, r13, r14, r15 +.macro GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 GHASH_MUL ENC_DEC + +#the number of pushes must equal STACK_OFFSET +push%r12 +push%r13 +push%r14 +push%r15 + +mov %rsp, %r14 + + + + +sub $VARIABLE_OFFSET, %rsp +and $~63, %rsp # align rsp to 64 bytes + + +vmovdqu HashKey(arg1), %xmm13 # xmm13 = HashKey + +mov arg4, %r13 # save the number of bytes of plaintext/ciphertext +and $-16, %r13 # r13 = r13 - (r13 mod 16) + +mov %r13, %r12 +shr $4, %r12 +and $7, %r12 +jz _initial_num_blocks_is_0\@ + +cmp $7, %r12 +je _initial_num_blocks_is_7\@ +cmp $6, %r12 +je _initial_num_blocks_is_6\@ +cmp $5, %r12 +je _initial_num_blocks_is_5\@ +cmp $4, %r12 +je _initial_num_blocks_is_4\@ +cmp $3, %r12 +je _initial_num_blocks_is_3\@ +cmp $2, %r12 +je _initial_num_blocks_is_2\@ + +jmp _initial_num_blocks_is_1\@ + +_initial_num_blocks_is_7\@: +\INITIAL_BLOCKS 7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +sub $16*7, %r13 +jmp _initial_blocks_encrypted\@ + +_initial_num_blocks_is_6\@: 
+\INITIAL_BLOCKS 6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +sub $16*6, %r13 +jmp _initial_blocks_encrypted\@ + +_initial_num_blocks_is_5\@: +\INITIAL_BLOCKS 5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +sub $16*5, %r13 +jmp _initial_blocks_encrypted\@ + +_initial_num_blocks_is_4\@: +\INITIAL_BLOCKS 4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +sub $16*4, %r13 +jmp _initial_blocks_encrypted\@ + +_initial_num_blocks_is_3\@: +\INITIAL_BLOCKS 3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +sub $16*3, %r13 +jmp _initial_blocks_encrypted\@ + +_initial_num_blocks_is_2\@: +\INITIAL_BLOCKS 2, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +sub $16*2, %r13 +jmp _initial_blocks_encrypted\@ + +_initial_num_blocks_is_1\@: +\INITIAL_BLOCKS 1, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC +sub $16*1, %r13 +jmp _initial_blocks_encrypted\@ + +_initial_num_blocks_is_0\@: +\INITIAL_BLOCKS 0, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC + + +_initial_blocks_encrypted\@: +cmp $0, %r13 +je _zero_cipher_left\@ + +sub $128, %r13 +je _eight_cipher_left\@ + + + + +vmovd %xmm9, %r15d +and $255, %r15d +vpshufb SHUF_MASK(%rip), %xmm9, %xmm9 + + +_encrypt_by_8_new\@: +cmp $(255-8), %r15d +jg _encrypt_by_8\@ + + + +add $8, %r15b +\GHASH_8_ENCRYPT_8_PARALLEL %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm15, out_order, \ENC_DEC +add $128, %r11 +sub $128, 
%r13 +jne _encrypt_by_8_new\@ + +vpshufb SHUF_MASK(%rip), %xmm9, %xmm9 +jmp _eight_cipher_left\@ + +_encrypt_by_8
Re: [PATCH] net/tls: Remove VLA usage
On 04/10/18 05:52 PM, Kees Cook wrote: > In the quest to remove VLAs from the kernel[1], this replaces the VLA > size with the only possible size used in the code, and adds a mechanism > to double-check future IV sizes. > > [1] > https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com > > Signed-off-by: Kees Cook <keesc...@chromium.org> Thanks Acked-by: Dave Watson <davejwat...@fb.com> > --- > net/tls/tls_sw.c | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c > index 4dc766b03f00..71e79597f940 100644 > --- a/net/tls/tls_sw.c > +++ b/net/tls/tls_sw.c > @@ -41,6 +41,8 @@ > #include > #include > > +#define MAX_IV_SIZE TLS_CIPHER_AES_GCM_128_IV_SIZE > + > static int tls_do_decryption(struct sock *sk, >struct scatterlist *sgin, >struct scatterlist *sgout, > @@ -673,7 +675,7 @@ static int decrypt_skb(struct sock *sk, struct sk_buff > *skb, > { > struct tls_context *tls_ctx = tls_get_ctx(sk); > struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); > - char iv[TLS_CIPHER_AES_GCM_128_SALT_SIZE + tls_ctx->rx.iv_size]; > + char iv[TLS_CIPHER_AES_GCM_128_SALT_SIZE + MAX_IV_SIZE]; > struct scatterlist sgin_arr[MAX_SKB_FRAGS + 2]; > struct scatterlist *sgin = &sgin_arr[0]; > struct strp_msg *rxm = strp_msg(skb); > @@ -1094,6 +1096,12 @@ int tls_set_sw_offload(struct sock *sk, struct > tls_context *ctx, int tx) > goto free_priv; > } > > + /* Sanity-check the IV size for stack allocations. */ > + if (iv_size > MAX_IV_SIZE) { > + rc = -EINVAL; > + goto free_priv; > + } > + > cctx->prepend_size = TLS_HEADER_SIZE + nonce_size; > cctx->tag_size = tag_size; > cctx->overhead_size = cctx->prepend_size + cctx->tag_size; > -- > 2.7.4 > > > -- > Kees Cook > Pixel Security
[PATCH v2 04/14] x86/crypto: aesni: Add GCM_COMPLETE macro
Merge encode and decode tag calculations in GCM_COMPLETE macro. Scatter/gather routines will call this once at the end of encryption or decryption. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 172 ++ 1 file changed, 63 insertions(+), 109 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index b9fe2ab..529c542 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -222,6 +222,67 @@ ALL_F: .octa 0x mov %r13, %r12 .endm +# GCM_COMPLETE Finishes update of tag of last partial block +# Output: Authorization Tag (AUTH_TAG) +# Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 +.macro GCM_COMPLETE + mov arg8, %r12# %r13 = aadLen (number of bytes) + shl $3, %r12 # convert into number of bits + movd%r12d, %xmm15 # len(A) in %xmm15 + shl $3, %arg4 # len(C) in bits (*128) + MOVQ_R64_XMM%arg4, %xmm1 + pslldq $8, %xmm15# %xmm15 = len(A)||0x + pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C) + pxor%xmm15, %xmm8 + GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 + # final GHASH computation + movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10, %xmm8 + + mov %arg5, %rax # %rax = *Y0 + movdqu (%rax), %xmm0 # %xmm0 = Y0 + ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1 # E(K, Y0) + pxor%xmm8, %xmm0 +_return_T_\@: + mov arg9, %r10 # %r10 = authTag + mov arg10, %r11# %r11 = auth_tag_len + cmp $16, %r11 + je _T_16_\@ + cmp $8, %r11 + jl _T_4_\@ +_T_8_\@: + MOVQ_R64_XMM%xmm0, %rax + mov %rax, (%r10) + add $8, %r10 + sub $8, %r11 + psrldq $8, %xmm0 + cmp $0, %r11 + je _return_T_done_\@ +_T_4_\@: + movd%xmm0, %eax + mov %eax, (%r10) + add $4, %r10 + sub $4, %r11 + psrldq $4, %xmm0 + cmp $0, %r11 + je _return_T_done_\@ +_T_123_\@: + movd%xmm0, %eax + cmp $2, %r11 + jl _T_1_\@ + mov %ax, (%r10) + cmp $2, %r11 + je _return_T_done_\@ + add $2, %r10 + sar $16, %eax +_T_1_\@: + mov %al, (%r10) + jmp _return_T_done_\@ +_T_16_\@: + movdqu %xmm0, (%r10) +_return_T_done_\@: +.endm + 
#ifdef __x86_64__ /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) * @@ -1271,61 +1332,7 @@ _less_than_8_bytes_left_decrypt: sub $1, %r13 jne _less_than_8_bytes_left_decrypt _multiple_of_16_bytes_decrypt: - mov arg8, %r12# %r13 = aadLen (number of bytes) - shl $3, %r12 # convert into number of bits - movd%r12d, %xmm15 # len(A) in %xmm15 - shl $3, %arg4 # len(C) in bits (*128) - MOVQ_R64_XMM%arg4, %xmm1 - pslldq $8, %xmm15# %xmm15 = len(A)||0x - pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C) - pxor%xmm15, %xmm8 - GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 -# final GHASH computation -movdqa SHUF_MASK(%rip), %xmm10 - PSHUFB_XMM %xmm10, %xmm8 - - mov %arg5, %rax # %rax = *Y0 - movdqu (%rax), %xmm0 # %xmm0 = Y0 - ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1 # E(K, Y0) - pxor%xmm8, %xmm0 -_return_T_decrypt: - mov arg9, %r10# %r10 = authTag - mov arg10, %r11 # %r11 = auth_tag_len - cmp $16, %r11 - je _T_16_decrypt - cmp $8, %r11 - jl _T_4_decrypt -_T_8_decrypt: - MOVQ_R64_XMM%xmm0, %rax - mov %rax, (%r10) - add $8, %r10 - sub $8, %r11 - psrldq $8, %xmm0 - cmp $0, %r11 - je _return_T_done_decrypt -_T_4_decrypt: - movd%xmm0, %eax - mov %eax, (%r10) - add $4, %r10 - sub $4, %r11 - psrldq $4, %xmm0 - cmp $0, %r11 - je _return_T_done_decrypt -_T_123_decrypt: - movd%xmm0, %eax - cmp $2, %r11 - jl _T_1_decrypt - mov %ax, (%r10) - cmp $2, %r11 - je _return_T_done_decrypt - add $2, %r10 - sar $16, %eax -_T_1_decrypt: - mov %al, (%r10) - jmp _return_T_done_decrypt -_T_16_decrypt: - movdqu %xmm0, (%r10) -_return_T_done_decrypt: + GCM_COMPLETE FUNC_RESTORE ret E
[PATCH v2 06/14] x86/crypto: aesni: Introduce gcm_context_data
Introduce a gcm_context_data struct that will be used to pass context data between scatter/gather update calls. It is passed as the second argument (after crypto keys); other args are renumbered. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 115 + arch/x86/crypto/aesni-intel_glue.c | 81 ++ 2 files changed, 121 insertions(+), 75 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 8021fd1..6c5a80d 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -111,6 +111,14 @@ ALL_F: .octa 0x // (for Karatsuba purposes) #defineVARIABLE_OFFSET 16*8 +#define AadHash 16*0 +#define AadLen 16*1 +#define InLen (16*1)+8 +#define PBlockEncKey 16*2 +#define OrigIV 16*3 +#define CurCount 16*4 +#define PBlockLen 16*5 + #define arg1 rdi #define arg2 rsi #define arg3 rdx @@ -121,6 +129,7 @@ ALL_F: .octa 0x #define arg8 STACK_OFFSET+16(%r14) #define arg9 STACK_OFFSET+24(%r14) #define arg10 STACK_OFFSET+32(%r14) +#define arg11 STACK_OFFSET+40(%r14) #define keysize 2*15*16(%arg1) #endif @@ -195,9 +204,9 @@ ALL_F: .octa 0x # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding. 
# Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13 .macro GCM_INIT - mov %arg6, %r12 + mov arg7, %r12 movdqu (%r12), %xmm13 - movdqa SHUF_MASK(%rip), %xmm2 + movdqa SHUF_MASK(%rip), %xmm2 PSHUFB_XMM %xmm2, %xmm13 # precompute HashKey<<1 mod poly from the HashKey (required for GHASH) @@ -217,7 +226,7 @@ ALL_F: .octa 0x pandPOLY(%rip), %xmm2 pxor%xmm2, %xmm13 movdqa %xmm13, HashKey(%rsp) - mov %arg4, %r13 # %xmm13 holds HashKey<<1 (mod poly) + mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly) and $-16, %r13 mov %r13, %r12 .endm @@ -271,18 +280,18 @@ _four_cipher_left_\@: GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \ %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8 _zero_cipher_left_\@: - mov %arg4, %r13 - and $15, %r13 # %r13 = arg4 (mod 16) + mov %arg5, %r13 + and $15, %r13 # %r13 = arg5 (mod 16) je _multiple_of_16_bytes_\@ # Handle the last <16 Byte block separately paddd ONE(%rip), %xmm0# INCR CNT to get Yn -movdqa SHUF_MASK(%rip), %xmm10 + movdqa SHUF_MASK(%rip), %xmm10 PSHUFB_XMM %xmm10, %xmm0 ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn) - lea (%arg3,%r11,1), %r10 + lea (%arg4,%r11,1), %r10 mov %r13, %r12 READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1 @@ -320,13 +329,13 @@ _zero_cipher_left_\@: MOVQ_R64_XMM %xmm0, %rax cmp $8, %r13 jle _less_than_8_bytes_left_\@ - mov %rax, (%arg2 , %r11, 1) + mov %rax, (%arg3 , %r11, 1) add $8, %r11 psrldq $8, %xmm0 MOVQ_R64_XMM %xmm0, %rax sub $8, %r13 _less_than_8_bytes_left_\@: - mov %al, (%arg2, %r11, 1) + mov %al, (%arg3, %r11, 1) add $1, %r11 shr $8, %rax sub $1, %r13 @@ -338,11 +347,11 @@ _multiple_of_16_bytes_\@: # Output: Authorization Tag (AUTH_TAG) # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 .macro GCM_COMPLETE - mov arg8, %r12# %r13 = aadLen (number of bytes) + mov arg9, %r12# %r13 = aadLen (number of bytes) shl $3, %r12 # convert into number of bits movd%r12d, %xmm15 # len(A) in %xmm15 - shl $3, %arg4 # len(C) in bits (*128) - MOVQ_R64_XMM%arg4, %xmm1 + shl $3, %arg5 # len(C) in bits (*128) + 
MOVQ_R64_XMM%arg5, %xmm1 pslldq $8, %xmm15# %xmm15 = len(A)||0x pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C) pxor%xmm15, %xmm8 @@ -351,13 +360,13 @@ _multiple_of_16_bytes_\@: movdqa SHUF_MASK(%rip), %xmm10 PSHUFB_XMM %xmm10, %xmm8 - mov %arg5, %rax # %rax = *Y0 + mov %arg6, %rax # %rax = *Y0 movdqu (%rax), %xmm0 # %xmm0 = Y0 ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1 # E(K, Y0) pxor%xmm8, %xmm0 _return_T_\@: - mov arg9, %r10 # %r10 = authTag - mov arg10, %r11# %r11 = auth_tag_len + mov arg10, %r10 # %r10 = authTag
[PATCH v2 14/14] x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather
Add a gcmaes_crypt_by_sg routine that will do scatter/gather by sg. Either src or dst may contain multiple buffers, so iterate over both at the same time if they are different. If the input is the same as the output, iterate only over one. Currently both the AAD and TAG must be linear, so copy them out with scatterlist_map_and_copy. If the first buffer contains the entire AAD, we can optimize and not copy. Since the AAD can be any size, if copied it must be on the heap. TAG can be on the stack since it is always < 16 bytes. Only the SSE routines are updated so far, so leave the previous gcmaes_en/decrypt routines, and branch to the sg ones if the keysize is inappropriate for avx, or we are SSE only. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_glue.c | 133 + 1 file changed, 133 insertions(+) diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index de986f9..acbe7e8 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -791,6 +791,127 @@ static int generic_gcmaes_set_authsize(struct crypto_aead *tfm, return 0; } +static int gcmaes_crypt_by_sg(bool enc, struct aead_request *req, + unsigned int assoclen, u8 *hash_subkey, + u8 *iv, void *aes_ctx) +{ + struct crypto_aead *tfm = crypto_aead_reqtfm(req); + unsigned long auth_tag_len = crypto_aead_authsize(tfm); + struct gcm_context_data data AESNI_ALIGN_ATTR; + struct scatter_walk dst_sg_walk = {}; + unsigned long left = req->cryptlen; + unsigned long len, srclen, dstlen; + struct scatter_walk assoc_sg_walk; + struct scatter_walk src_sg_walk; + struct scatterlist src_start[2]; + struct scatterlist dst_start[2]; + struct scatterlist *src_sg; + struct scatterlist *dst_sg; + u8 *src, *dst, *assoc; + u8 *assocmem = NULL; + u8 authTag[16]; + + if (!enc) + left -= auth_tag_len; + + /* Linearize assoc, if not already linear */ + if (req->src->length >= assoclen && req->src->length + (!PageHighMem(sg_page(req->src)) || + 
req->src->offset + req->src->length < PAGE_SIZE)) { + scatterwalk_start(&assoc_sg_walk, req->src); + assoc = scatterwalk_map(&assoc_sg_walk); + } else { + /* assoc can be any length, so must be on heap */ + assocmem = kmalloc(assoclen, GFP_ATOMIC); + if (unlikely(!assocmem)) + return -ENOMEM; + assoc = assocmem; + + scatterwalk_map_and_copy(assoc, req->src, 0, assoclen, 0); + } + + src_sg = scatterwalk_ffwd(src_start, req->src, req->assoclen); + scatterwalk_start(&src_sg_walk, src_sg); + if (req->src != req->dst) { + dst_sg = scatterwalk_ffwd(dst_start, req->dst, req->assoclen); + scatterwalk_start(&dst_sg_walk, dst_sg); + } + + kernel_fpu_begin(); + aesni_gcm_init(aes_ctx, &data, iv, + hash_subkey, assoc, assoclen); + if (req->src != req->dst) { + while (left) { + src = scatterwalk_map(&src_sg_walk); + dst = scatterwalk_map(&dst_sg_walk); + srclen = scatterwalk_clamp(&src_sg_walk, left); + dstlen = scatterwalk_clamp(&dst_sg_walk, left); + len = min(srclen, dstlen); + if (len) { + if (enc) + aesni_gcm_enc_update(aes_ctx, &data, +dst, src, len); + else + aesni_gcm_dec_update(aes_ctx, &data, +dst, src, len); + } + left -= len; + + scatterwalk_unmap(src); + scatterwalk_unmap(dst); + scatterwalk_advance(&src_sg_walk, len); + scatterwalk_advance(&dst_sg_walk, len); + scatterwalk_done(&src_sg_walk, 0, left); + scatterwalk_done(&dst_sg_walk, 1, left); + } + } else { + while (left) { + dst = src = scatterwalk_map(&src_sg_walk); + len = scatterwalk_clamp(&src_sg_walk, left); + if (len) { + if (enc) + aesni_gcm_enc_update(aes_ctx, &data, +src, src, len); + else + aesni_gcm_dec_u
[PATCH v2 14/14] x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather
Add gcmaes_crypt_by_sg routine, that will do scatter/gather by sg. Either src or dst may contain multiple buffers, so iterate over both at the same time if they are different. If the input is the same as the output, iterate only over one. Currently both the AAD and TAG must be linear, so copy them out with scatterlist_map_and_copy. If first buffer contains the entire AAD, we can optimize and not copy. Since the AAD can be any size, if copied it must be on the heap. TAG can be on the stack since it is always < 16 bytes. Only the SSE routines are updated so far, so leave the previous gcmaes_en/decrypt routines, and branch to the sg ones if the keysize is inappropriate for avx, or we are SSE only. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_glue.c | 133 + 1 file changed, 133 insertions(+) diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index de986f9..acbe7e8 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -791,6 +791,127 @@ static int generic_gcmaes_set_authsize(struct crypto_aead *tfm, return 0; } +static int gcmaes_crypt_by_sg(bool enc, struct aead_request *req, + unsigned int assoclen, u8 *hash_subkey, + u8 *iv, void *aes_ctx) +{ + struct crypto_aead *tfm = crypto_aead_reqtfm(req); + unsigned long auth_tag_len = crypto_aead_authsize(tfm); + struct gcm_context_data data AESNI_ALIGN_ATTR; + struct scatter_walk dst_sg_walk = {}; + unsigned long left = req->cryptlen; + unsigned long len, srclen, dstlen; + struct scatter_walk assoc_sg_walk; + struct scatter_walk src_sg_walk; + struct scatterlist src_start[2]; + struct scatterlist dst_start[2]; + struct scatterlist *src_sg; + struct scatterlist *dst_sg; + u8 *src, *dst, *assoc; + u8 *assocmem = NULL; + u8 authTag[16]; + + if (!enc) + left -= auth_tag_len; + + /* Linearize assoc, if not already linear */ + if (req->src->length >= assoclen && req->src->length && + (!PageHighMem(sg_page(req->src)) || + req->src->offset + 
req->src->length < PAGE_SIZE)) { + scatterwalk_start(_sg_walk, req->src); + assoc = scatterwalk_map(_sg_walk); + } else { + /* assoc can be any length, so must be on heap */ + assocmem = kmalloc(assoclen, GFP_ATOMIC); + if (unlikely(!assocmem)) + return -ENOMEM; + assoc = assocmem; + + scatterwalk_map_and_copy(assoc, req->src, 0, assoclen, 0); + } + + src_sg = scatterwalk_ffwd(src_start, req->src, req->assoclen); + scatterwalk_start(_sg_walk, src_sg); + if (req->src != req->dst) { + dst_sg = scatterwalk_ffwd(dst_start, req->dst, req->assoclen); + scatterwalk_start(_sg_walk, dst_sg); + } + + kernel_fpu_begin(); + aesni_gcm_init(aes_ctx, , iv, + hash_subkey, assoc, assoclen); + if (req->src != req->dst) { + while (left) { + src = scatterwalk_map(_sg_walk); + dst = scatterwalk_map(_sg_walk); + srclen = scatterwalk_clamp(_sg_walk, left); + dstlen = scatterwalk_clamp(_sg_walk, left); + len = min(srclen, dstlen); + if (len) { + if (enc) + aesni_gcm_enc_update(aes_ctx, , +dst, src, len); + else + aesni_gcm_dec_update(aes_ctx, , +dst, src, len); + } + left -= len; + + scatterwalk_unmap(src); + scatterwalk_unmap(dst); + scatterwalk_advance(_sg_walk, len); + scatterwalk_advance(_sg_walk, len); + scatterwalk_done(_sg_walk, 0, left); + scatterwalk_done(_sg_walk, 1, left); + } + } else { + while (left) { + dst = src = scatterwalk_map(_sg_walk); + len = scatterwalk_clamp(_sg_walk, left); + if (len) { + if (enc) + aesni_gcm_enc_update(aes_ctx, , +src, src, len); + else + aesni_gcm_dec_update(
[PATCH v2 13/14] x86/crypto: aesni: Introduce scatter/gather asm function stubs
The asm macros are all set up now, introduce entry points. GCM_INIT and GCM_COMPLETE have arguments supplied, so that the new scatter/gather entry points don't have to take all the arguments, and only the ones they need. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 116 - arch/x86/crypto/aesni-intel_glue.c | 16 + 2 files changed, 106 insertions(+), 26 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index b941952..311b2de 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -200,8 +200,8 @@ ALL_F: .octa 0x # Output: HashKeys stored in gcm_context_data. Only needs to be called # once per key. # clobbers r12, and tmp xmm registers. -.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7 - mov arg7, %r12 +.macro PRECOMPUTE SUBKEY TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7 + mov \SUBKEY, %r12 movdqu (%r12), \TMP3 movdqa SHUF_MASK(%rip), \TMP2 PSHUFB_XMM \TMP2, \TMP3 @@ -254,14 +254,14 @@ ALL_F: .octa 0x # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding. 
# Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13 -.macro GCM_INIT - mov arg9, %r11 +.macro GCM_INIT Iv SUBKEY AAD AADLEN + mov \AADLEN, %r11 mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length xor %r11, %r11 mov %r11, InLen(%arg2) # ctx_data.in_length = 0 mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0 mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0 - mov %arg6, %rax + mov \Iv, %rax movdqu (%rax), %xmm0 movdqu %xmm0, OrigIV(%arg2) # ctx_data.orig_IV = iv @@ -269,11 +269,11 @@ ALL_F: .octa 0x PSHUFB_XMM %xmm2, %xmm0 movdqu %xmm0, CurCount(%arg2) # ctx_data.current_counter = iv - PRECOMPUTE %xmm1 %xmm2 %xmm3 %xmm4 %xmm5 %xmm6 %xmm7 + PRECOMPUTE \SUBKEY, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, movdqa HashKey(%arg2), %xmm13 - CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \ - %xmm5 %xmm6 + CALC_AAD_HASH %xmm13, \AAD, \AADLEN, %xmm0, %xmm1, %xmm2, %xmm3, \ + %xmm4, %xmm5, %xmm6 .endm # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context @@ -435,7 +435,7 @@ _multiple_of_16_bytes_\@: # GCM_COMPLETE Finishes update of tag of last partial block # Output: Authorization Tag (AUTH_TAG) # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 -.macro GCM_COMPLETE +.macro GCM_COMPLETE AUTHTAG AUTHTAGLEN movdqu AadHash(%arg2), %xmm8 movdqu HashKey(%arg2), %xmm13 @@ -466,8 +466,8 @@ _partial_done\@: ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1 # E(K, Y0) pxor%xmm8, %xmm0 _return_T_\@: - mov arg10, %r10 # %r10 = authTag - mov arg11, %r11# %r11 = auth_tag_len + mov \AUTHTAG, %r10 # %r10 = authTag + mov \AUTHTAGLEN, %r11# %r11 = auth_tag_len cmp $16, %r11 je _T_16_\@ cmp $8, %r11 @@ -599,11 +599,11 @@ _done_read_partial_block_\@: # CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted. 
# clobbers r10-11, xmm14 -.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \ +.macro CALC_AAD_HASH HASHKEY AAD AADLEN TMP1 TMP2 TMP3 TMP4 TMP5 \ TMP6 TMP7 MOVADQ SHUF_MASK(%rip), %xmm14 - movarg8, %r10 # %r10 = AAD - movarg9, %r11 # %r11 = aadLen + mov\AAD, %r10 # %r10 = AAD + mov\AADLEN, %r11# %r11 = aadLen pxor \TMP7, \TMP7 pxor \TMP6, \TMP6 @@ -1103,18 +1103,18 @@ TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 operation mov keysize,%eax shr $2,%eax # 128->4, 192->6, 256->8 sub $4,%eax # 128->0, 192->2, 256->4 - jzaes_loop_par_enc_done + jzaes_loop_par_enc_done\@ -aes_loop_par_enc: +aes_loop_par_enc\@: MOVADQ(%r10),\TMP3 .irpc index, 1234 AESENC\TMP3, %xmm\index .endr add $16,%r10 sub $1,%eax - jnz aes_loop_par_enc + jnz aes_loop_par_enc\@ -aes_loop_par_enc_done: +aes_loop_par_enc_done\@: MOVADQ(%r10), \TMP3 AESENCLAST \TMP3, \XMM1 # Round 10 AESENCLAST \TMP3, \XMM2 @@ -1311,18 +1311,18 @@ TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 operation mov keysize,%eax shr $2,%eax # 128->4, 192->6, 256->8 sub $4,%eax
[PATCH v2 09/14] x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE
Prepare to handle partial blocks between scatter/gather calls. For the last partial block, we only want to calculate the aadhash in GCM_COMPLETE, and a new partial block macro will handle both aadhash update and encrypting partial blocks between calls. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index aa82493..37b1cee 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -345,7 +345,6 @@ _zero_cipher_left_\@: pxor%xmm0, %xmm8 .endif - GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 movdqu %xmm8, AadHash(%arg2) .ifc \operation, enc # GHASH computation for the last <16 byte block @@ -378,6 +377,15 @@ _multiple_of_16_bytes_\@: .macro GCM_COMPLETE movdqu AadHash(%arg2), %xmm8 movdqu HashKey(%rsp), %xmm13 + + mov PBlockLen(%arg2), %r12 + + cmp $0, %r12 + je _partial_done\@ + + GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 + +_partial_done\@: mov AadLen(%arg2), %r12 # %r13 = aadLen (number of bytes) shl $3, %r12 # convert into number of bits movd%r12d, %xmm15 # len(A) in %xmm15 -- 2.9.5
[PATCH v2 12/14] x86/crypto: aesni: Add fast path for > 16 byte update
We can fast-path any < 16 byte read if the full message is > 16 bytes, and shift over by the appropriate amount. Usually we are reading > 16 bytes, so this should be faster than the READ_PARTIAL macro introduced in b20209c91e2 for the average case. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 25 + 1 file changed, 25 insertions(+) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 398bd2237f..b941952 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -355,12 +355,37 @@ _zero_cipher_left_\@: ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn) movdqu %xmm0, PBlockEncKey(%arg2) + cmp $16, %arg5 + jge _large_enough_update_\@ + lea (%arg4,%r11,1), %r10 mov %r13, %r12 READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1 + jmp _data_read_\@ + +_large_enough_update_\@: + sub $16, %r11 + add %r13, %r11 + + # receive the last <16 Byte block + movdqu (%arg4, %r11, 1), %xmm1 + sub %r13, %r11 + add $16, %r11 + + lea SHIFT_MASK+16(%rip), %r12 + # adjust the shuffle mask pointer to be able to shift 16-r13 bytes + # (r13 is the number of bytes in plaintext mod 16) + sub %r13, %r12 + # get the appropriate shuffle mask + movdqu (%r12), %xmm2 + # shift right 16-r13 bytes + PSHUFB_XMM %xmm2, %xmm1 + +_data_read_\@: lea ALL_F+16(%rip), %r12 sub %r13, %r12 + .ifc \operation, dec movdqa %xmm1, %xmm2 .endif -- 2.9.5
[PATCH v2 11/14] x86/crypto: aesni: Introduce partial block macro
Before this diff, multiple calls to GCM_ENC_DEC will succeed, but only if all calls are a multiple of 16 bytes. Handle partial blocks at the start of GCM_ENC_DEC, and update aadhash as appropriate. The data offset %r11 is also updated after the partial block. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 151 +- 1 file changed, 150 insertions(+), 1 deletion(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 3ada06b..398bd2237f 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -284,7 +284,13 @@ ALL_F: .octa 0x movdqu AadHash(%arg2), %xmm8 movdqu HashKey(%arg2), %xmm13 add %arg5, InLen(%arg2) + + xor %r11, %r11 # initialise the data pointer offset as zero + PARTIAL_BLOCK %arg3 %arg4 %arg5 %r11 %xmm8 \operation + + sub %r11, %arg5 # sub partial block data used mov %arg5, %r13 # save the number of bytes + and $-16, %r13 # %r13 = %r13 - (%r13 mod 16) mov %r13, %r12 # Encrypt/Decrypt first few blocks @@ -605,6 +611,150 @@ _get_AAD_done\@: movdqu \TMP6, AadHash(%arg2) .endm +# PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks +# between update calls. 
+# Requires the input data be at least 1 byte long due to READ_PARTIAL_BLOCK +# Outputs encrypted bytes, and updates hash and partial info in gcm_context_data +# Clobbers rax, r10, r12, r13, xmm0-6, xmm9-13 +.macro PARTIAL_BLOCK CYPH_PLAIN_OUT PLAIN_CYPH_IN PLAIN_CYPH_LEN DATA_OFFSET \ + AAD_HASH operation + mov PBlockLen(%arg2), %r13 + cmp $0, %r13 + je _partial_block_done_\@ # Leave Macro if no partial blocks + # Read in input data without over reading + cmp $16, \PLAIN_CYPH_LEN + jl _fewer_than_16_bytes_\@ + movups (\PLAIN_CYPH_IN), %xmm1 # If more than 16 bytes, just fill xmm + jmp _data_read_\@ + +_fewer_than_16_bytes_\@: + lea (\PLAIN_CYPH_IN, \DATA_OFFSET, 1), %r10 + mov \PLAIN_CYPH_LEN, %r12 + READ_PARTIAL_BLOCK %r10 %r12 %xmm0 %xmm1 + + mov PBlockLen(%arg2), %r13 + +_data_read_\@: # Finished reading in data + + movdqu PBlockEncKey(%arg2), %xmm9 + movdqu HashKey(%arg2), %xmm13 + + lea SHIFT_MASK(%rip), %r12 + + # adjust the shuffle mask pointer to be able to shift r13 bytes + # (16-r13 is the number of bytes in plaintext mod 16) + add %r13, %r12 + movdqu (%r12), %xmm2 # get the appropriate shuffle mask + PSHUFB_XMM %xmm2, %xmm9 # shift right r13 bytes + +.ifc \operation, dec + movdqa %xmm1, %xmm3 + pxor %xmm1, %xmm9 # Cyphertext XOR E(K, Yn) + + mov \PLAIN_CYPH_LEN, %r10 + add %r13, %r10 + # Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling + sub $16, %r10 + # Determine if partial block is not being filled and + # shift mask accordingly + jge _no_extra_mask_1_\@ + sub %r10, %r12 +_no_extra_mask_1_\@: + + movdqu ALL_F-SHIFT_MASK(%r12), %xmm1 + # get the appropriate mask to mask out bottom r13 bytes of xmm9 + pand %xmm1, %xmm9 # mask out bottom r13 bytes of xmm9 + + pand %xmm1, %xmm3 + movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10, %xmm3 + PSHUFB_XMM %xmm2, %xmm3 + pxor %xmm3, \AAD_HASH + + cmp $0, %r10 + jl _partial_incomplete_1_\@ + + # GHASH computation for the last <16 Byte block + GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11,
%xmm5, %xmm6 + xor %rax,%rax + + mov %rax, PBlockLen(%arg2) + jmp _dec_done_\@ +_partial_incomplete_1_\@: + add \PLAIN_CYPH_LEN, PBlockLen(%arg2) +_dec_done_\@: + movdqu \AAD_HASH, AadHash(%arg2) +.else + pxor %xmm1, %xmm9 # Plaintext XOR E(K, Yn) + + mov \PLAIN_CYPH_LEN, %r10 + add %r13, %r10 + # Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling + sub $16, %r10 + # Determine if partial block is not being filled and + # shift mask accordingly + jge _no_extra_mask_2_\@ + sub %r10, %r12 +_no_extra_mask_2_\@: + + movdqu ALL_F-SHIFT_MASK(%r12), %xmm1 + # get the appropriate mask to mask out bottom r13 bytes of xmm9 + pand %xmm1, %xmm9 + + movdqa SHUF_MASK(%rip), %xmm1 + PSHUFB_XMM %xmm1, %xmm9 + PSHUFB_XMM %xmm2, %xmm9 + pxor %xmm9, \AAD_HASH + + cmp $0, %r10 + jl _partial_incomplete_2_\@ + + # GHASH computation for the last <16 Byte block + GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
[PATCH v2 10/14] x86/crypto: aesni: Move HashKey computation from stack to gcm_context
HashKey computation only needs to happen once per scatter/gather operation, save it between calls in gcm_context struct instead of on the stack. Since the asm no longer stores anything on the stack, we can use %rsp directly, and clean up the frame save/restore macros a bit. Hashkeys actually only need to be calculated once per key and could be moved to when set_key is called, however, the current glue code falls back to generic aes code if fpu is disabled. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 205 -- 1 file changed, 106 insertions(+), 99 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 37b1cee..3ada06b 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -93,23 +93,6 @@ ALL_F: .octa 0x #defineSTACK_OFFSET8*3 -#defineHashKey 16*0// store HashKey <<1 mod poly here -#defineHashKey_2 16*1// store HashKey^2 <<1 mod poly here -#defineHashKey_3 16*2// store HashKey^3 <<1 mod poly here -#defineHashKey_4 16*3// store HashKey^4 <<1 mod poly here -#defineHashKey_k 16*4// store XOR of High 64 bits and Low 64 - // bits of HashKey <<1 mod poly here - //(for Karatsuba purposes) -#defineHashKey_2_k 16*5// store XOR of High 64 bits and Low 64 - // bits of HashKey^2 <<1 mod poly here - // (for Karatsuba purposes) -#defineHashKey_3_k 16*6// store XOR of High 64 bits and Low 64 - // bits of HashKey^3 <<1 mod poly here - // (for Karatsuba purposes) -#defineHashKey_4_k 16*7// store XOR of High 64 bits and Low 64 - // bits of HashKey^4 <<1 mod poly here - // (for Karatsuba purposes) -#defineVARIABLE_OFFSET 16*8 #define AadHash 16*0 #define AadLen 16*1 @@ -118,6 +101,22 @@ ALL_F: .octa 0x #define OrigIV 16*3 #define CurCount 16*4 #define PBlockLen 16*5 +#defineHashKey 16*6// store HashKey <<1 mod poly here +#defineHashKey_2 16*7// store HashKey^2 <<1 mod poly here +#defineHashKey_3 16*8// store HashKey^3 <<1 mod poly here +#defineHashKey_4 16*9// 
store HashKey^4 <<1 mod poly here +#defineHashKey_k 16*10 // store XOR of High 64 bits and Low 64 + // bits of HashKey <<1 mod poly here + //(for Karatsuba purposes) +#defineHashKey_2_k 16*11 // store XOR of High 64 bits and Low 64 + // bits of HashKey^2 <<1 mod poly here + // (for Karatsuba purposes) +#defineHashKey_3_k 16*12 // store XOR of High 64 bits and Low 64 + // bits of HashKey^3 <<1 mod poly here + // (for Karatsuba purposes) +#defineHashKey_4_k 16*13 // store XOR of High 64 bits and Low 64 + // bits of HashKey^4 <<1 mod poly here + // (for Karatsuba purposes) #define arg1 rdi #define arg2 rsi @@ -125,11 +124,11 @@ ALL_F: .octa 0x #define arg4 rcx #define arg5 r8 #define arg6 r9 -#define arg7 STACK_OFFSET+8(%r14) -#define arg8 STACK_OFFSET+16(%r14) -#define arg9 STACK_OFFSET+24(%r14) -#define arg10 STACK_OFFSET+32(%r14) -#define arg11 STACK_OFFSET+40(%r14) +#define arg7 STACK_OFFSET+8(%rsp) +#define arg8 STACK_OFFSET+16(%rsp) +#define arg9 STACK_OFFSET+24(%rsp) +#define arg10 STACK_OFFSET+32(%rsp) +#define arg11 STACK_OFFSET+40(%rsp) #define keysize 2*15*16(%arg1) #endif @@ -183,28 +182,79 @@ ALL_F: .octa 0x push%r12 push%r13 push%r14 - mov %rsp, %r14 # # states of %xmm registers %xmm6:%xmm15 not saved # all %xmm registers are clobbered # - sub $VARIABLE_OFFSET, %rsp - and $~63, %rsp .endm .macro FUNC_RESTORE - mov %r14, %rsp pop %r14 pop %r13 pop %r12 .endm +# Precompute hashkeys. +# Input: Hash subkey. +# Output: HashKeys stored in gcm_context_data. Only needs to be called +# once per key. +# clobbers r12, and tmp xmm registers. +.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7 + mov arg7, %r12 + movdqu (%r12), \TMP3 + movdqa SHUF_MASK(%rip), \TMP2 + PSHUFB_XMM \TMP2, \TMP3 + + # precompute HashK
[PATCH v2 08/14] x86/crypto: aesni: Fill in new context data structures
[PATCH v2 07/14] x86/crypto: aesni: Split AAD hash calculation to separate macro
AAD hash only needs to be calculated once for each scatter/gather operation. Move it to its own macro, and call it from GCM_INIT instead of INITIAL_BLOCKS. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 71 --- 1 file changed, 43 insertions(+), 28 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 6c5a80d..58bbfac 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -229,6 +229,10 @@ ALL_F: .octa 0x mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly) and $-16, %r13 mov %r13, %r12 + + CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \ + %xmm5 %xmm6 + mov %r13, %r12 .endm # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context @@ -496,51 +500,62 @@ _read_next_byte_lt8_\@: _done_read_partial_block_\@: .endm -/* -* if a = number of total plaintext bytes -* b = floor(a/16) -* num_initial_blocks = b mod 4 -* encrypt the initial num_initial_blocks blocks and apply ghash on -* the ciphertext -* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers -* are clobbered -* arg1, %arg3, %arg4, %r14 are used as a pointer only, not modified -*/ - - -.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \ -XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation -MOVADQ SHUF_MASK(%rip), %xmm14 - movarg8, %r10 # %r10 = AAD - movarg9, %r11 # %r11 = aadLen - pxor %xmm\i, %xmm\i - pxor \XMM2, \XMM2 +# CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted. 
+# clobbers r10-11, xmm14 +.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \ + TMP6 TMP7 + MOVADQ SHUF_MASK(%rip), %xmm14 + movarg8, %r10 # %r10 = AAD + movarg9, %r11 # %r11 = aadLen + pxor \TMP7, \TMP7 + pxor \TMP6, \TMP6 cmp$16, %r11 jl _get_AAD_rest\@ _get_AAD_blocks\@: - movdqu (%r10), %xmm\i - PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data - pxor %xmm\i, \XMM2 - GHASH_MUL \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 + movdqu (%r10), \TMP7 + PSHUFB_XMM %xmm14, \TMP7 # byte-reflect the AAD data + pxor \TMP7, \TMP6 + GHASH_MUL \TMP6, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5 add$16, %r10 sub$16, %r11 cmp$16, %r11 jge_get_AAD_blocks\@ - movdqu \XMM2, %xmm\i + movdqu \TMP6, \TMP7 /* read the last <16B of AAD */ _get_AAD_rest\@: cmp$0, %r11 je _get_AAD_done\@ - READ_PARTIAL_BLOCK %r10, %r11, \TMP1, %xmm\i - PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data - pxor \XMM2, %xmm\i - GHASH_MUL %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 + READ_PARTIAL_BLOCK %r10, %r11, \TMP1, \TMP7 + PSHUFB_XMM %xmm14, \TMP7 # byte-reflect the AAD data + pxor \TMP6, \TMP7 + GHASH_MUL \TMP7, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5 + movdqu \TMP7, \TMP6 _get_AAD_done\@: + movdqu \TMP6, AadHash(%arg2) +.endm + +/* +* if a = number of total plaintext bytes +* b = floor(a/16) +* num_initial_blocks = b mod 4 +* encrypt the initial num_initial_blocks blocks and apply ghash on +* the ciphertext +* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers +* are clobbered +* arg1, %arg2, %arg3, %r14 are used as a pointer only, not modified +*/ + + +.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \ + XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation + + movdqu AadHash(%arg2), %xmm\i # XMM0 = Y0 + xor%r11, %r11 # initialise the data pointer offset as zero # start AES for num_initial_blocks blocks -- 2.9.5
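For reference, the control flow of the new CALC_AAD_HASH macro can be sketched in C. This is only a sketch: `ghash_mul` below is an XOR stand-in for the real GHASH_MUL carry-less multiply, and the PSHUFB byte-reflection steps are omitted — only the full-block/partial-block loop structure mirrors the asm.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the GHASH_MUL macro: the real routine is a carry-less
 * multiply mod the GCM polynomial; a plain XOR keeps this sketch testable. */
static void ghash_mul(uint8_t hash[16], const uint8_t hkey[16])
{
	for (int i = 0; i < 16; i++)
		hash[i] ^= hkey[i];
}

/* Mirrors CALC_AAD_HASH: absorb full 16-byte AAD blocks, then feed the
 * final partial block (READ_PARTIAL_BLOCK in the asm) through one more
 * GHASH round, and store the result (AadHash(%arg2) in the asm). */
static void calc_aad_hash(const uint8_t *aad, size_t aadlen,
			  const uint8_t hkey[16], uint8_t out[16])
{
	uint8_t acc[16] = { 0 }, blk[16];

	while (aadlen >= 16) {			/* _get_AAD_blocks */
		for (int i = 0; i < 16; i++)
			acc[i] ^= aad[i];
		ghash_mul(acc, hkey);
		aad += 16;
		aadlen -= 16;
	}
	if (aadlen) {				/* _get_AAD_rest */
		memset(blk, 0, 16);
		memcpy(blk, aad, aadlen);	/* READ_PARTIAL_BLOCK */
		for (int i = 0; i < 16; i++)
			blk[i] ^= acc[i];
		ghash_mul(blk, hkey);
		memcpy(acc, blk, 16);
	}
	memcpy(out, acc, 16);			/* movdqu \TMP6, AadHash(%arg2) */
}
```

Because the whole AAD is absorbed here, GCM_INIT only needs to run this once per scatter/gather operation, which is the point of the patch.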
[PATCH v2 10/14] x86/crypto: aesni: Move HashKey computation from stack to gcm_context
HashKey computation only needs to happen once per scatter/gather operation, so save it between calls in the gcm_context struct instead of on the stack. Since the asm no longer stores anything on the stack, we can use %rsp directly, and clean up the frame save/restore macros a bit. HashKeys actually only need to be calculated once per key, and the computation could be moved to set_key; however, the current glue code falls back to the generic aes code if the fpu is disabled. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_asm.S | 205 -- 1 file changed, 106 insertions(+), 99 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 37b1cee..3ada06b 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -93,23 +93,6 @@ ALL_F: .octa 0x #defineSTACK_OFFSET8*3 -#defineHashKey 16*0// store HashKey <<1 mod poly here -#defineHashKey_2 16*1// store HashKey^2 <<1 mod poly here -#defineHashKey_3 16*2// store HashKey^3 <<1 mod poly here -#defineHashKey_4 16*3// store HashKey^4 <<1 mod poly here -#defineHashKey_k 16*4// store XOR of High 64 bits and Low 64 - // bits of HashKey <<1 mod poly here - //(for Karatsuba purposes) -#defineHashKey_2_k 16*5// store XOR of High 64 bits and Low 64 - // bits of HashKey^2 <<1 mod poly here - // (for Karatsuba purposes) -#defineHashKey_3_k 16*6// store XOR of High 64 bits and Low 64 - // bits of HashKey^3 <<1 mod poly here - // (for Karatsuba purposes) -#defineHashKey_4_k 16*7// store XOR of High 64 bits and Low 64 - // bits of HashKey^4 <<1 mod poly here - // (for Karatsuba purposes) -#defineVARIABLE_OFFSET 16*8 #define AadHash 16*0 #define AadLen 16*1 @@ -118,6 +101,22 @@ ALL_F: .octa 0x #define OrigIV 16*3 #define CurCount 16*4 #define PBlockLen 16*5 +#defineHashKey 16*6// store HashKey <<1 mod poly here +#defineHashKey_2 16*7// store HashKey^2 <<1 mod poly here +#defineHashKey_3 16*8// store HashKey^3 <<1 mod poly here +#defineHashKey_4 16*9// store HashKey^4 <<1 mod 
poly here +#defineHashKey_k 16*10 // store XOR of High 64 bits and Low 64 + // bits of HashKey <<1 mod poly here + //(for Karatsuba purposes) +#defineHashKey_2_k 16*11 // store XOR of High 64 bits and Low 64 + // bits of HashKey^2 <<1 mod poly here + // (for Karatsuba purposes) +#defineHashKey_3_k 16*12 // store XOR of High 64 bits and Low 64 + // bits of HashKey^3 <<1 mod poly here + // (for Karatsuba purposes) +#defineHashKey_4_k 16*13 // store XOR of High 64 bits and Low 64 + // bits of HashKey^4 <<1 mod poly here + // (for Karatsuba purposes) #define arg1 rdi #define arg2 rsi @@ -125,11 +124,11 @@ ALL_F: .octa 0x #define arg4 rcx #define arg5 r8 #define arg6 r9 -#define arg7 STACK_OFFSET+8(%r14) -#define arg8 STACK_OFFSET+16(%r14) -#define arg9 STACK_OFFSET+24(%r14) -#define arg10 STACK_OFFSET+32(%r14) -#define arg11 STACK_OFFSET+40(%r14) +#define arg7 STACK_OFFSET+8(%rsp) +#define arg8 STACK_OFFSET+16(%rsp) +#define arg9 STACK_OFFSET+24(%rsp) +#define arg10 STACK_OFFSET+32(%rsp) +#define arg11 STACK_OFFSET+40(%rsp) #define keysize 2*15*16(%arg1) #endif @@ -183,28 +182,79 @@ ALL_F: .octa 0x push%r12 push%r13 push%r14 - mov %rsp, %r14 # # states of %xmm registers %xmm6:%xmm15 not saved # all %xmm registers are clobbered # - sub $VARIABLE_OFFSET, %rsp - and $~63, %rsp .endm .macro FUNC_RESTORE - mov %r14, %rsp pop %r14 pop %r13 pop %r12 .endm +# Precompute hashkeys. +# Input: Hash subkey. +# Output: HashKeys stored in gcm_context_data. Only needs to be called +# once per key. +# clobbers r12, and tmp xmm registers. +.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7 + mov arg7, %r12 + movdqu (%r12), \TMP3 + movdqa SHUF_MASK(%rip), \TMP2 + PSHUFB_XMM \TMP2, \TMP3 + + # precompute HashKey<<1 mod poly from t
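The new offsets describe a flat layout of 16-byte slots with the HashKey powers now sitting directly after the per-operation context fields. A C-level sketch of the implied layout follows; the field names are paraphrased from the defines, and the slot at 16*2 is shown as reserved because the defines that cover it (InLen/PBlockEncKey) fall in the elided part of the hunk above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of gcm_context_data as implied by the offset defines.  All
 * members are raw 16-byte slots, matching the 16*n offsets in the asm. */
struct gcm_context_data_sketch {
	uint8_t aad_hash[16];		/* AadHash   = 16*0 */
	uint8_t aad_len[16];		/* AadLen    = 16*1 */
	uint8_t reserved[16];		/* InLen / PBlockEncKey (offsets elided) */
	uint8_t orig_iv[16];		/* OrigIV    = 16*3 */
	uint8_t cur_count[16];		/* CurCount  = 16*4 */
	uint8_t pblock_len[16];		/* PBlockLen = 16*5 */
	uint8_t hash_keys[8][16];	/* HashKey .. HashKey_4_k = 16*6 .. 16*13 */
};
```

Keeping the eight HashKey slots in this struct rather than on the stack is what lets the patch drop the VARIABLE_OFFSET stack carve-out and address everything off %rsp directly.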
[PATCH v2 08/14] x86/crypto: aesni: Fill in new context data structures
Fill in aadhash, aadlen, pblocklen, curcount with appropriate values. pblocklen, aadhash, and pblockenckey are also updated at the end of each scatter/gather operation, to be carried over to the next operation. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_asm.S | 51 ++- 1 file changed, 39 insertions(+), 12 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 58bbfac..aa82493 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -204,6 +204,21 @@ ALL_F: .octa 0x # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding. # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13 .macro GCM_INIT + + mov arg9, %r11 + mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length + xor %r11, %r11 + mov %r11, InLen(%arg2) # ctx_data.in_length = 0 + mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0 + mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0 + mov %arg6, %rax + movdqu (%rax), %xmm0 + movdqu %xmm0, OrigIV(%arg2) # ctx_data.orig_IV = iv + + movdqa SHUF_MASK(%rip), %xmm2 + PSHUFB_XMM %xmm2, %xmm0 + movdqu %xmm0, CurCount(%arg2) # ctx_data.current_counter = iv + mov arg7, %r12 movdqu (%r12), %xmm13 movdqa SHUF_MASK(%rip), %xmm2 @@ -226,13 +241,9 @@ ALL_F: .octa 0x pandPOLY(%rip), %xmm2 pxor%xmm2, %xmm13 movdqa %xmm13, HashKey(%rsp) - mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly) - and $-16, %r13 - mov %r13, %r12 CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \ %xmm5 %xmm6 - mov %r13, %r12 .endm # GCM_ENC_DEC Encodes/Decodes given data. 
Assumes that the passed gcm_context @@ -240,6 +251,12 @@ ALL_F: .octa 0x # Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK # Clobbers rax, r10-r13, and xmm0-xmm15 .macro GCM_ENC_DEC operation + movdqu AadHash(%arg2), %xmm8 + movdqu HashKey(%rsp), %xmm13 + add %arg5, InLen(%arg2) + mov %arg5, %r13 # save the number of bytes + and $-16, %r13 # %r13 = %r13 - (%r13 mod 16) + mov %r13, %r12 # Encrypt/Decrypt first few blocks and $(3<<4), %r12 @@ -284,16 +301,23 @@ _four_cipher_left_\@: GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \ %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8 _zero_cipher_left_\@: + movdqu %xmm8, AadHash(%arg2) + movdqu %xmm0, CurCount(%arg2) + mov %arg5, %r13 and $15, %r13 # %r13 = arg5 (mod 16) je _multiple_of_16_bytes_\@ + mov %r13, PBlockLen(%arg2) + # Handle the last <16 Byte block separately paddd ONE(%rip), %xmm0# INCR CNT to get Yn + movdqu %xmm0, CurCount(%arg2) movdqa SHUF_MASK(%rip), %xmm10 PSHUFB_XMM %xmm10, %xmm0 ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn) + movdqu %xmm0, PBlockEncKey(%arg2) lea (%arg4,%r11,1), %r10 mov %r13, %r12 @@ -322,6 +346,7 @@ _zero_cipher_left_\@: .endif GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 + movdqu %xmm8, AadHash(%arg2) .ifc \operation, enc # GHASH computation for the last <16 byte block movdqa SHUF_MASK(%rip), %xmm10 @@ -351,11 +376,15 @@ _multiple_of_16_bytes_\@: # Output: Authorization Tag (AUTH_TAG) # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 .macro GCM_COMPLETE - mov arg9, %r12# %r13 = aadLen (number of bytes) + movdqu AadHash(%arg2), %xmm8 + movdqu HashKey(%rsp), %xmm13 + mov AadLen(%arg2), %r12 # %r13 = aadLen (number of bytes) shl $3, %r12 # convert into number of bits movd%r12d, %xmm15 # len(A) in %xmm15 - shl $3, %arg5 # len(C) in bits (*128) - MOVQ_R64_XMM%arg5, %xmm1 + mov InLen(%arg2), %r12 + shl $3, %r12 # len(C) in bits (*128) + MOVQ_R64_XMM%r12, %xmm1 + pslldq $8, %xmm15# %xmm15 = len(A)||0x pxor%xmm1, %xmm15 # %xmm15 = 
len(A)||len(C) pxor%xmm15, %xmm8 @@ -364,8 +393,7 @@ _multiple_of_16_bytes_\@: movdqa SHUF_MASK(%rip), %xmm10 PSHUFB_XMM %xmm10, %xmm8 - mov %arg6, %rax # %rax = *Y0 - movdqu (%rax), %xmm0 # %xmm0 = Y0 + movdqu OrigIV(%arg2), %xmm0 # %xmm0 = Y0 ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1 # E(K, Y0) pxor%xmm8, %xmm0 _return_T_\@: @@ -553,15 +581,14 @@ _get_AAD
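The GCM_COMPLETE changes above build the final len(A)||len(C) block from the context fields instead of the call arguments. A C sketch of that lane arithmetic: lengths are converted from bytes to bits (the `shl $3`), len(A) is placed in the high qword (the `pslldq $8`), and len(C) is ORed into the low qword (the `pxor`). This assumes a little-endian host, as on x86, and omits the later PSHUFB byte-reflection of the whole block; note also that the asm moves len(A) via `movd` (32 bits), while the sketch uses 64-bit values throughout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the 16-byte len(A) || len(C) block folded into the GHASH state
 * at the end of GCM_COMPLETE (before the final E(K, Y0) masking). */
static void gcm_len_block(uint64_t aad_bytes, uint64_t in_bytes,
			  uint8_t out[16])
{
	uint64_t len_a = aad_bytes << 3;	/* len(A) in bits (shl $3) */
	uint64_t len_c = in_bytes << 3;		/* len(C) in bits (shl $3) */

	memcpy(out, &len_c, 8);			/* low qword  = len(C) */
	memcpy(out + 8, &len_a, 8);		/* high qword = len(A), pslldq $8 */
}
```

Sourcing aad_bytes and in_bytes from AadLen/InLen in the context is what allows the tag to be finalized after any number of update calls.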
[PATCH v2 05/14] x86/crypto: aesni: Merge encode and decode to GCM_ENC_DEC macro
Make a macro for the main encode/decode routine. Only a small handful of lines differ for enc and dec. This will also become the main scatter/gather update routine. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 293 +++--- 1 file changed, 114 insertions(+), 179 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 529c542..8021fd1 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -222,6 +222,118 @@ ALL_F: .octa 0x mov %r13, %r12 .endm +# GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context +# struct has been initialized by GCM_INIT. +# Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK +# Clobbers rax, r10-r13, and xmm0-xmm15 +.macro GCM_ENC_DEC operation + # Encrypt/Decrypt first few blocks + + and $(3<<4), %r12 + jz _initial_num_blocks_is_0_\@ + cmp $(2<<4), %r12 + jb _initial_num_blocks_is_1_\@ + je _initial_num_blocks_is_2_\@ +_initial_num_blocks_is_3_\@: + INITIAL_BLOCKS_ENC_DEC %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \ +%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 5, 678, \operation + sub $48, %r13 + jmp _initial_blocks_\@ +_initial_num_blocks_is_2_\@: + INITIAL_BLOCKS_ENC_DEC %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \ +%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 6, 78, \operation + sub $32, %r13 + jmp _initial_blocks_\@ +_initial_num_blocks_is_1_\@: + INITIAL_BLOCKS_ENC_DEC %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \ +%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 7, 8, \operation + sub $16, %r13 + jmp _initial_blocks_\@ +_initial_num_blocks_is_0_\@: + INITIAL_BLOCKS_ENC_DEC %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \ +%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 8, 0, \operation +_initial_blocks_\@: + + # Main loop - Encrypt/Decrypt remaining blocks + + cmp $0, %r13 + je _zero_cipher_left_\@ + sub $64, %r13 + je _four_cipher_left_\@ 
+_crypt_by_4_\@: + GHASH_4_ENCRYPT_4_PARALLEL_\operation %xmm9, %xmm10, %xmm11, %xmm12, \ + %xmm13, %xmm14, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, \ + %xmm7, %xmm8, enc + add $64, %r11 + sub $64, %r13 + jne _crypt_by_4_\@ +_four_cipher_left_\@: + GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \ +%xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8 +_zero_cipher_left_\@: + mov %arg4, %r13 + and $15, %r13 # %r13 = arg4 (mod 16) + je _multiple_of_16_bytes_\@ + + # Handle the last <16 Byte block separately + paddd ONE(%rip), %xmm0# INCR CNT to get Yn +movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10, %xmm0 + + ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn) + + lea (%arg3,%r11,1), %r10 + mov %r13, %r12 + READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1 + + lea ALL_F+16(%rip), %r12 + sub %r13, %r12 +.ifc \operation, dec + movdqa %xmm1, %xmm2 +.endif + pxor%xmm1, %xmm0# XOR Encrypt(K, Yn) + movdqu (%r12), %xmm1 + # get the appropriate mask to mask out top 16-r13 bytes of xmm0 + pand%xmm1, %xmm0# mask out top 16-r13 bytes of xmm0 +.ifc \operation, dec + pand%xmm1, %xmm2 + movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10 ,%xmm2 + + pxor %xmm2, %xmm8 +.else + movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10,%xmm0 + + pxor%xmm0, %xmm8 +.endif + + GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 +.ifc \operation, enc + # GHASH computation for the last <16 byte block + movdqa SHUF_MASK(%rip), %xmm10 + # shuffle xmm0 back to output as ciphertext + PSHUFB_XMM %xmm10, %xmm0 +.endif + + # Output %r13 bytes + MOVQ_R64_XMM %xmm0, %rax + cmp $8, %r13 + jle _less_than_8_bytes_left_\@ + mov %rax, (%arg2 , %r11, 1) + add $8, %r11 + psrldq $8, %xmm0 + MOVQ_R64_XMM %xmm0, %rax + sub $8, %r13 +_less_than_8_bytes_left_\@: + mov %al, (%arg2, %r11, 1) + add $1, %r11 + shr $8, %rax + sub $1, %r13 + jne _less_than_8_bytes_left_\@ +_multiple_of_16_bytes_\@: +.endm + # GCM_COMPLETE Finishes update of tag of last partial block # Output: Authorization Tag (AUTH_TAG) # Clobbers 
rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 @@ -1245,93 +1357,7 @@ ENTRY(aesni_gcm_dec) FUNC_SAVE GCM_INIT - -# Decrypt first few blocks - - and $(3<<4), %r12 - jz _initial_num_blocks_is_0_decrypt -
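The block dispatch at the top of GCM_ENC_DEC can be restated in C. `and $-16, %r13` keeps only full-block bytes, and `and $(3<<4), %r12` extracts (number-of-blocks mod 4) scaled by 16: those 0-3 initial blocks go through INITIAL_BLOCKS_ENC_DEC so the main loop always consumes exactly four blocks per iteration, with any sub-16-byte tail handled at `_zero_cipher_left`. A sketch (names are illustrative, not from the source):

```c
#include <assert.h>
#include <stddef.h>

struct carve { size_t initial_bytes, by4_bytes, partial_bytes; };

/* Mirrors the carve-up in GCM_ENC_DEC:
 *   %r13 = len & ~15      (full-block bytes)
 *   %r12 = %r13 & (3<<4)  ((nblocks mod 4) * 16 initial bytes)    */
static struct carve carve_up(size_t len)
{
	struct carve c;
	size_t full = len & ~(size_t)15;	/* and $-16, %r13 */

	c.initial_bytes = full & (3 << 4);	/* and $(3<<4), %r12 */
	c.by4_bytes = full - c.initial_bytes;	/* multiple of 64 for _crypt_by_4 */
	c.partial_bytes = len & 15;		/* and $15 at _zero_cipher_left */
	return c;
}
```

For example, a 100-byte input is 6 full blocks plus 4 bytes: two initial blocks, one by-4 iteration, and a 4-byte partial tail.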
[PATCH v2 02/14] x86/crypto: aesni: Macro-ify func save/restore
Macro-ify function save and restore. These will be used in new functions added for scatter/gather update operations. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 53 ++- 1 file changed, 24 insertions(+), 29 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 48911fe..39b42b1 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -170,6 +170,26 @@ ALL_F: .octa 0x #define TKEYP T1 #endif +.macro FUNC_SAVE + push%r12 + push%r13 + push%r14 + mov %rsp, %r14 +# +# states of %xmm registers %xmm6:%xmm15 not saved +# all %xmm registers are clobbered +# + sub $VARIABLE_OFFSET, %rsp + and $~63, %rsp +.endm + + +.macro FUNC_RESTORE + mov %r14, %rsp + pop %r14 + pop %r13 + pop %r12 +.endm #ifdef __x86_64__ /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) @@ -1130,16 +1150,7 @@ _esb_loop_\@: * */ ENTRY(aesni_gcm_dec) - push%r12 - push%r13 - push%r14 - mov %rsp, %r14 -/* -* states of %xmm registers %xmm6:%xmm15 not saved -* all %xmm registers are clobbered -*/ - sub $VARIABLE_OFFSET, %rsp - and $~63, %rsp# align rsp to 64 bytes + FUNC_SAVE mov %arg6, %r12 movdqu (%r12), %xmm13# %xmm13 = HashKey movdqa SHUF_MASK(%rip), %xmm2 @@ -1309,10 +1320,7 @@ _T_1_decrypt: _T_16_decrypt: movdqu %xmm0, (%r10) _return_T_done_decrypt: - mov %r14, %rsp - pop %r14 - pop %r13 - pop %r12 + FUNC_RESTORE ret ENDPROC(aesni_gcm_dec) @@ -1393,22 +1401,12 @@ ENDPROC(aesni_gcm_dec) * poly = x^128 + x^127 + x^126 + x^121 + 1 ***/ ENTRY(aesni_gcm_enc) - push%r12 - push%r13 - push%r14 - mov %rsp, %r14 -# -# states of %xmm registers %xmm6:%xmm15 not saved -# all %xmm registers are clobbered -# - sub $VARIABLE_OFFSET, %rsp - and $~63, %rsp + FUNC_SAVE mov %arg6, %r12 movdqu (%r12), %xmm13 movdqa SHUF_MASK(%rip), %xmm2 PSHUFB_XMM %xmm2, %xmm13 - # precompute HashKey<<1 mod poly from the HashKey (required for GHASH) movdqa %xmm13, %xmm2 @@ -1576,10 +1574,7 @@ 
_T_1_encrypt: _T_16_encrypt: movdqu %xmm0, (%r10) _return_T_done_encrypt: - mov %r14, %rsp - pop %r14 - pop %r13 - pop %r12 + FUNC_RESTORE ret ENDPROC(aesni_gcm_enc) -- 2.9.5
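The FUNC_SAVE arithmetic is worth spelling out: after saving the entry %rsp in %r14, it carves VARIABLE_OFFSET bytes of scratch below the stack pointer and rounds down to a 64-byte boundary (`and $~63, %rsp`) so the aligned `movdqa` stores to the HashKey slots are safe. A C sketch of that align-down math, using the VARIABLE_OFFSET value of 16*8 visible elsewhere in the series:

```c
#include <assert.h>
#include <stdint.h>

#define VARIABLE_OFFSET (16 * 8)	/* scratch area for the HashKey slots */

/* Mirrors FUNC_SAVE: reserve scratch below the saved stack pointer, then
 * round down to a 64-byte boundary (and $~63, %rsp). */
static uintptr_t aligned_scratch(uintptr_t rsp)
{
	rsp -= VARIABLE_OFFSET;			/* sub $VARIABLE_OFFSET, %rsp */
	rsp &= ~(uintptr_t)63;			/* and $~63, %rsp */
	return rsp;
}
```

Because the rounding can move %rsp by up to 63 bytes, the original value must be stashed (in %r14 here) for FUNC_RESTORE — a plain `add` could not undo the `and`.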
[PATCH v2 03/14] x86/crypto: aesni: Add GCM_INIT macro
Reduce code duplication by introducing a GCM_INIT macro. This macro will also be exposed as a function for implementing scatter/gather support, since INIT only needs to be called once for the full operation. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 84 +++ 1 file changed, 33 insertions(+), 51 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 39b42b1..b9fe2ab 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -191,6 +191,37 @@ ALL_F: .octa 0x pop %r12 .endm + +# GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding. +# Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13 +.macro GCM_INIT + mov %arg6, %r12 + movdqu (%r12), %xmm13 + movdqa SHUF_MASK(%rip), %xmm2 + PSHUFB_XMM %xmm2, %xmm13 + + # precompute HashKey<<1 mod poly from the HashKey (required for GHASH) + + movdqa %xmm13, %xmm2 + psllq $1, %xmm13 + psrlq $63, %xmm2 + movdqa %xmm2, %xmm1 + pslldq $8, %xmm2 + psrldq $8, %xmm1 + por %xmm2, %xmm13 + + # reduce HashKey<<1 + + pshufd $0x24, %xmm1, %xmm2 + pcmpeqd TWOONE(%rip), %xmm2 + pandPOLY(%rip), %xmm2 + pxor%xmm2, %xmm13 + movdqa %xmm13, HashKey(%rsp) + mov %arg4, %r13 # %xmm13 holds HashKey<<1 (mod poly) + and $-16, %r13 + mov %r13, %r12 +.endm + #ifdef __x86_64__ /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) * @@ -1151,36 +1182,11 @@ _esb_loop_\@: */ ENTRY(aesni_gcm_dec) FUNC_SAVE - mov %arg6, %r12 - movdqu (%r12), %xmm13# %xmm13 = HashKey -movdqa SHUF_MASK(%rip), %xmm2 - PSHUFB_XMM %xmm2, %xmm13 - - -# Precompute HashKey<<1 (mod poly) from the hash key (required for GHASH) - - movdqa %xmm13, %xmm2 - psllq $1, %xmm13 - psrlq $63, %xmm2 - movdqa %xmm2, %xmm1 - pslldq $8, %xmm2 - psrldq $8, %xmm1 - por %xmm2, %xmm13 - -# Reduction - - pshufd $0x24, %xmm1, %xmm2 - pcmpeqd TWOONE(%rip), %xmm2 - pandPOLY(%rip), %xmm2 - pxor%xmm2, %xmm13 # %xmm13 holds the HashKey<<1 (mod poly) + GCM_INIT # 
Decrypt first few blocks - movdqa %xmm13, HashKey(%rsp) # store HashKey<<1 (mod poly) - mov %arg4, %r13# save the number of bytes of plaintext/ciphertext - and $-16, %r13 # %r13 = %r13 - (%r13 mod 16) - mov %r13, %r12 and $(3<<4), %r12 jz _initial_num_blocks_is_0_decrypt cmp $(2<<4), %r12 @@ -1402,32 +1408,8 @@ ENDPROC(aesni_gcm_dec) ***/ ENTRY(aesni_gcm_enc) FUNC_SAVE - mov %arg6, %r12 - movdqu (%r12), %xmm13 -movdqa SHUF_MASK(%rip), %xmm2 - PSHUFB_XMM %xmm2, %xmm13 - -# precompute HashKey<<1 mod poly from the HashKey (required for GHASH) - - movdqa %xmm13, %xmm2 - psllq $1, %xmm13 - psrlq $63, %xmm2 - movdqa %xmm2, %xmm1 - pslldq $8, %xmm2 - psrldq $8, %xmm1 - por %xmm2, %xmm13 - -# reduce HashKey<<1 - - pshufd $0x24, %xmm1, %xmm2 - pcmpeqd TWOONE(%rip), %xmm2 - pandPOLY(%rip), %xmm2 - pxor%xmm2, %xmm13 - movdqa %xmm13, HashKey(%rsp) - mov %arg4, %r13# %xmm13 holds HashKey<<1 (mod poly) - and $-16, %r13 - mov %r13, %r12 + GCM_INIT # Encrypt first few blocks and $(3<<4), %r12 -- 2.9.5
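The HashKey<<1 (mod poly) precompute in GCM_INIT is a 128-bit left shift built from two 64-bit lanes (the psllq/psrlq/pslldq/psrldq/por sequence), followed by a conditional XOR with the reduction constant when a bit is shifted out of bit 127 (the pshufd/pcmpeqd/pand/pxor dance). A C sketch follows; POLY_HI/POLY_LO are written from the usual bit-reflected GHASH convention (0xc2...01) — treat them as an assumption, since the POLY table entry is truncated in this excerpt.

```c
#include <assert.h>
#include <stdint.h>

#define POLY_HI 0xc200000000000000ULL	/* assumed high qword of POLY */
#define POLY_LO 0x0000000000000001ULL	/* assumed low qword of POLY */

struct u128 { uint64_t lo, hi; };

/* Mirrors GCM_INIT's HashKey<<1 (mod poly) precompute. */
static struct u128 hashkey_shl1_mod_poly(struct u128 h)
{
	uint64_t carry_out = h.hi >> 63;	/* bit shifted out of bit 127 */
	struct u128 r;

	r.hi = (h.hi << 1) | (h.lo >> 63);	/* psllq/psrlq/pslldq/por */
	r.lo = h.lo << 1;
	if (carry_out) {			/* pcmpeqd TWOONE / pand POLY */
		r.hi ^= POLY_HI;
		r.lo ^= POLY_LO;
	}
	return r;
}
```

This is the value stored to HashKey(%rsp) and reused by every GHASH_MUL, so hoisting it into GCM_INIT removes the duplicated copy from both entry points.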
[PATCH v2 00/14] x86/crypto gcmaes SSE scatter/gather support
This patch set refactors the x86 aes/gcm SSE crypto routines to support true scatter/gather by adding gcm_enc/dec_update methods. The layout is:

* First 5 patches refactor the code to use macros, so changes only need to be applied once for encode and decode. There should be no functional changes.

* The next 6 patches introduce a gcm_context structure to be passed between scatter/gather calls to maintain state. The struct is also used as scratch space for the existing enc/dec routines.

* The last 2 set up the asm function entry points for scatter gather support, and then call the new routines per buffer in the passed in sglist in aesni-intel_glue.

Testing: asm itself fuzz tested vs. existing code and isa-l asm. Ran libkcapi test suite, passes.

perf of a large (16k messages) TLS sends sg vs. no sg:

no-sg
 33287255597 cycles
 53702871176 instructions
 43.47% _crypt_by_4
 17.83% memcpy
 16.36% aes_loop_par_enc_done

sg
 27568944591 cycles
 54580446678 instructions
 49.87% _crypt_by_4
 17.40% aes_loop_par_enc_done
  1.79% aes_loop_initial_5416
  1.52% aes_loop_initial_4974
  1.27% gcmaes_encrypt_sg.constprop.15

V1 -> V2:
 patch 14:
  merge enc/dec
  also use new routine if cryptlen < AVX_GEN2_OPTSIZE
  optimize case if assoc is already linear

Dave Watson (14):
  x86/crypto: aesni: Merge INITIAL_BLOCKS_ENC/DEC
  x86/crypto: aesni: Macro-ify func save/restore
  x86/crypto: aesni: Add GCM_INIT macro
  x86/crypto: aesni: Add GCM_COMPLETE macro
  x86/crypto: aesni: Merge encode and decode to GCM_ENC_DEC macro
  x86/crypto: aesni: Introduce gcm_context_data
  x86/crypto: aesni: Split AAD hash calculation to separate macro
  x86/crypto: aesni: Fill in new context data structures
  x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE
  x86/crypto: aesni: Move HashKey computation from stack to gcm_context
  x86/crypto: aesni: Introduce partial block macro
  x86/crypto: aesni: Add fast path for > 16 byte update
  x86/crypto: aesni: Introduce scatter/gather asm function stubs
  x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather

 arch/x86/crypto/aesni-intel_asm.S  | 1414 ++--
 arch/x86/crypto/aesni-intel_glue.c |  230 +-
 2 files changed, 899 insertions(+), 745 deletions(-)

-- 
2.9.5
[PATCH v2 01/14] x86/crypto: aesni: Merge INITIAL_BLOCKS_ENC/DEC
Use macro operations to merge implementations of INITIAL_BLOCKS, since they differ by only a small handful of lines. Use macro counter \@ to simplify implementation.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 298 ++
 1 file changed, 48 insertions(+), 250 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 76d8cd4..48911fe 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -275,234 +275,7 @@ _done_read_partial_block_\@:
 */
 
-.macro INITIAL_BLOCKS_DEC num_initial_blocks TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \
-XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
-	MOVADQ	SHUF_MASK(%rip), %xmm14
-	mov	arg7, %r10		# %r10 = AAD
-	mov	arg8, %r11		# %r11 = aadLen
-	pxor	%xmm\i, %xmm\i
-	pxor	\XMM2, \XMM2
-
-	cmp	$16, %r11
-	jl	_get_AAD_rest\num_initial_blocks\operation
-_get_AAD_blocks\num_initial_blocks\operation:
-	movdqu	(%r10), %xmm\i
-	PSHUFB_XMM %xmm14, %xmm\i	# byte-reflect the AAD data
-	pxor	%xmm\i, \XMM2
-	GHASH_MUL \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-	add	$16, %r10
-	sub	$16, %r11
-	cmp	$16, %r11
-	jge	_get_AAD_blocks\num_initial_blocks\operation
-
-	movdqu	\XMM2, %xmm\i
-
-	/* read the last <16B of AAD */
-_get_AAD_rest\num_initial_blocks\operation:
-	cmp	$0, %r11
-	je	_get_AAD_done\num_initial_blocks\operation
-
-	READ_PARTIAL_BLOCK %r10, %r11, \TMP1, %xmm\i
-	PSHUFB_XMM %xmm14, %xmm\i	# byte-reflect the AAD data
-	pxor	\XMM2, %xmm\i
-	GHASH_MUL %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-
-_get_AAD_done\num_initial_blocks\operation:
-	xor	%r11, %r11		# initialise the data pointer offset as zero
-	# start AES for num_initial_blocks blocks
-
-	mov	%arg5, %rax		# %rax = *Y0
-	movdqu	(%rax), \XMM0		# XMM0 = Y0
-	PSHUFB_XMM %xmm14, \XMM0
-
-.if (\i == 5) || (\i == 6) || (\i == 7)
-	MOVADQ	ONE(%RIP),\TMP1
-	MOVADQ	(%arg1),\TMP2
-.irpc index, \i_seq
-	paddd	\TMP1, \XMM0		# INCR Y0
-	movdqa	\XMM0, %xmm\index
-	PSHUFB_XMM %xmm14, %xmm\index	# perform a 16 byte swap
-	pxor	\TMP2, %xmm\index
-.endr
-	lea	0x10(%arg1),%r10
-	mov	keysize,%eax
-	shr	$2,%eax			# 128->4, 192->6, 256->8
-	add	$5,%eax			# 128->9, 192->11, 256->13
-
-aes_loop_initial_dec\num_initial_blocks:
-	MOVADQ	(%r10),\TMP1
-.irpc	index, \i_seq
-	AESENC	\TMP1, %xmm\index
-.endr
-	add	$16,%r10
-	sub	$1,%eax
-	jnz	aes_loop_initial_dec\num_initial_blocks
-
-	MOVADQ	(%r10), \TMP1
-.irpc index, \i_seq
-	AESENCLAST \TMP1, %xmm\index	# Last Round
-.endr
-.irpc index, \i_seq
-	movdqu	(%arg3 , %r11, 1), \TMP1
-	pxor	\TMP1, %xmm\index
-	movdqu	%xmm\index, (%arg2 , %r11, 1)
-	# write back plaintext/ciphertext for num_initial_blocks
-	add	$16, %r11
-
-	movdqa	\TMP1, %xmm\index
-	PSHUFB_XMM %xmm14, %xmm\index
-# prepare plaintext/ciphertext for GHASH computation
-.endr
-.endif
-
-# apply GHASH on num_initial_blocks blocks
-
-.if \i == 5
-	pxor	%xmm5, %xmm6
-	GHASH_MUL %xmm6, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-	pxor	%xmm6, %xmm7
-	GHASH_MUL %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-	pxor	%xmm7, %xmm8
-	GHASH_MUL %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-.elseif \i == 6
-	pxor	%xmm6, %xmm7
-	GHASH_MUL %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-	pxor	%xmm7, %xmm8
-	GHASH_MUL %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-.elseif \i == 7
-	pxor	%xmm7, %xmm8
-	GHASH_MUL %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
-.endif
-	cmp	$64, %r13
-	jl	_initial_blocks_done\num_initial_blocks\operation
-	# no need for precomputed values
-/*
-*
-* Precomputations for HashKey parallel with encryption of first 4 blocks.
-* Haskey_i_k holds XORed values of the low and high parts of the Haskey_i
-*/
-	MOVADQ	ONE(%rip), \TMP1
-	paddd	\TMP1, \XMM0		# INCR Y0
-	MOVADQ	\XMM0, \XMM1
-	PSHUFB_XMM %xmm14, \XMM1	# perform a 16 byte swap
-
-	paddd	\TMP1, \XMM0		# INCR Y0
-	MOVADQ	\XMM0, \XMM2
-	PSHUFB_XMM %xmm14, \XMM2	# perform a 16 byte swap
-
-	paddd	\TMP1, \XMM0		# INCR Y0
-
Re: [PATCH 14/14] x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather
On 02/13/18 08:42 AM, Stephan Mueller wrote:
> > +static int gcmaes_encrypt_sg(struct aead_request *req, unsigned int assoclen,
> > +			u8 *hash_subkey, u8 *iv, void *aes_ctx)
> > +{
> > +	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
> > +	unsigned long auth_tag_len = crypto_aead_authsize(tfm);
> > +	struct gcm_context_data data AESNI_ALIGN_ATTR;
> > +	struct scatter_walk dst_sg_walk = {};
> > +	unsigned long left = req->cryptlen;
> > +	unsigned long len, srclen, dstlen;
> > +	struct scatter_walk src_sg_walk;
> > +	struct scatterlist src_start[2];
> > +	struct scatterlist dst_start[2];
> > +	struct scatterlist *src_sg;
> > +	struct scatterlist *dst_sg;
> > +	u8 *src, *dst, *assoc;
> > +	u8 authTag[16];
> > +
> > +	assoc = kmalloc(assoclen, GFP_ATOMIC);
> > +	if (unlikely(!assoc))
> > +		return -ENOMEM;
> > +	scatterwalk_map_and_copy(assoc, req->src, 0, assoclen, 0);
>
> Have you tested that this code does not barf when assoclen is 0?
>
> Maybe it is worth while to finally add a test vector to testmgr.h which
> validates such scenario. If you would like, here is a vector you could add to
> testmgr:
>
> https://github.com/smuellerDD/libkcapi/blob/master/test/test.sh#L315

I tested assoclen and cryptlen being 0 and it works, yes. Both kmalloc and
scatterwalk_map_and_copy work correctly with 0 assoclen.

> This is a decryption of gcm(aes) with no message, no AAD and just a tag. The
> result should be EBADMSG.
>
> > +
> > +	src_sg = scatterwalk_ffwd(src_start, req->src, req->assoclen);
>
> Why do you use assoclen in the map_and_copy, and req->assoclen in the ffwd?

If I understand correctly, rfc4106 appends extra data after the assoc.
assoclen is the real assoc length, req->assoclen is assoclen + the extra data
length. So we ffwd by req->assoclen in the scatterlist, but use assoclen when
memcpy and testing.

> >
> > +static int gcmaes_decrypt_sg(struct aead_request *req, unsigned int assoclen,
> > +			u8 *hash_subkey, u8 *iv, void *aes_ctx)
> > +{
>
> This is a lot of code duplication.

I will merge them and send a V2.

> Ciao
> Stephan

Thanks!
Re: [PATCH 14/14] x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather
On 02/12/18 03:12 PM, Junaid Shahid wrote:
> Hi Dave,
>
> On 02/12/2018 11:51 AM, Dave Watson wrote:
>
> > +static int gcmaes_encrypt_sg(struct aead_request *req, unsigned int assoclen,
> > +			u8 *hash_subkey, u8 *iv, void *aes_ctx)
> >
> > +static int gcmaes_decrypt_sg(struct aead_request *req, unsigned int assoclen,
> > +			u8 *hash_subkey, u8 *iv, void *aes_ctx)
>
> These two functions are almost identical. Wouldn't it be better to combine
> them into a single encrypt/decrypt function, similar to what you have done
> for the assembly macros?
>
> > +	if (((struct crypto_aes_ctx *)aes_ctx)->key_length != AES_KEYSIZE_128 ||
> > +	    aesni_gcm_enc_tfm == aesni_gcm_enc) {
>
> Shouldn't we also include a check for the buffer length being less than
> AVX_GEN2_OPTSIZE? AVX will not be used in that case either.

Yes, these both sound reasonable. I will send a V2. Thanks!
[PATCH 13/14] x86/crypto: aesni: Introduce scatter/gather asm function stubs
The asm macros are all set up now; introduce the entry points. GCM_INIT and GCM_COMPLETE take their arguments as macro parameters, so that the new scatter/gather entry points only have to take the arguments they need.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S  | 116 -
 arch/x86/crypto/aesni-intel_glue.c |  16 +
 2 files changed, 106 insertions(+), 26 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index b941952..311b2de 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -200,8 +200,8 @@ ALL_F: .octa 0x
 # Output: HashKeys stored in gcm_context_data. Only needs to be called
 # once per key.
 # clobbers r12, and tmp xmm registers.
-.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
-	mov	arg7, %r12
+.macro PRECOMPUTE SUBKEY TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
+	mov	\SUBKEY, %r12
 	movdqu	(%r12), \TMP3
 	movdqa	SHUF_MASK(%rip), \TMP2
 	PSHUFB_XMM \TMP2, \TMP3
@@ -254,14 +254,14 @@ ALL_F: .octa 0x
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
-.macro GCM_INIT
-	mov	arg9, %r11
+.macro GCM_INIT Iv SUBKEY AAD AADLEN
+	mov	\AADLEN, %r11
 	mov	%r11, AadLen(%arg2)	# ctx_data.aad_length = aad_length
 	xor	%r11, %r11
 	mov	%r11, InLen(%arg2)	# ctx_data.in_length = 0
 	mov	%r11, PBlockLen(%arg2)	# ctx_data.partial_block_length = 0
 	mov	%r11, PBlockEncKey(%arg2)	# ctx_data.partial_block_enc_key = 0
-	mov	%arg6, %rax
+	mov	\Iv, %rax
 	movdqu	(%rax), %xmm0
 	movdqu	%xmm0, OrigIV(%arg2)	# ctx_data.orig_IV = iv
@@ -269,11 +269,11 @@ ALL_F: .octa 0x
 	PSHUFB_XMM %xmm2, %xmm0
 	movdqu	%xmm0, CurCount(%arg2)	# ctx_data.current_counter = iv
 
-	PRECOMPUTE %xmm1 %xmm2 %xmm3 %xmm4 %xmm5 %xmm6 %xmm7
+	PRECOMPUTE \SUBKEY, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7
 	movdqa	HashKey(%arg2), %xmm13
 
-	CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
-	%xmm5 %xmm6
+	CALC_AAD_HASH %xmm13, \AAD, \AADLEN, %xmm0, %xmm1, %xmm2, %xmm3, \
+	%xmm4, %xmm5, %xmm6
 .endm
 
 # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
@@ -435,7 +435,7 @@ _multiple_of_16_bytes_\@:
 # GCM_COMPLETE Finishes update of tag of last partial block
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
-.macro GCM_COMPLETE
+.macro GCM_COMPLETE AUTHTAG AUTHTAGLEN
 	movdqu	AadHash(%arg2), %xmm8
 	movdqu	HashKey(%arg2), %xmm13
@@ -466,8 +466,8 @@ _partial_done\@:
 	ENCRYPT_SINGLE_BLOCK %xmm0, %xmm1	# E(K, Y0)
 	pxor	%xmm8, %xmm0
_return_T_\@:
-	mov	arg10, %r10		# %r10 = authTag
-	mov	arg11, %r11		# %r11 = auth_tag_len
+	mov	\AUTHTAG, %r10		# %r10 = authTag
+	mov	\AUTHTAGLEN, %r11	# %r11 = auth_tag_len
 	cmp	$16, %r11
 	je	_T_16_\@
 	cmp	$8, %r11
@@ -599,11 +599,11 @@ _done_read_partial_block_\@:
 # CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
 # clobbers r10-11, xmm14
-.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \
+.macro CALC_AAD_HASH HASHKEY AAD AADLEN TMP1 TMP2 TMP3 TMP4 TMP5 \
	TMP6 TMP7
 	MOVADQ	SHUF_MASK(%rip), %xmm14
-	mov	arg8, %r10		# %r10 = AAD
-	mov	arg9, %r11		# %r11 = aadLen
+	mov	\AAD, %r10		# %r10 = AAD
+	mov	\AADLEN, %r11		# %r11 = aadLen
 	pxor	\TMP7, \TMP7
 	pxor	\TMP6, \TMP6
@@ -1103,18 +1103,18 @@ TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 operation
 	mov	keysize,%eax
 	shr	$2,%eax			# 128->4, 192->6, 256->8
 	sub	$4,%eax			# 128->0, 192->2, 256->4
-	jz	aes_loop_par_enc_done
+	jz	aes_loop_par_enc_done\@
 
-aes_loop_par_enc:
+aes_loop_par_enc\@:
 	MOVADQ	(%r10),\TMP3
 .irpc	index, 1234
 	AESENC	\TMP3, %xmm\index
 .endr
 	add	$16,%r10
 	sub	$1,%eax
-	jnz	aes_loop_par_enc
+	jnz	aes_loop_par_enc\@
 
-aes_loop_par_enc_done:
+aes_loop_par_enc_done\@:
 	MOVADQ	(%r10), \TMP3
 	AESENCLAST \TMP3, \XMM1	# Round 10
 	AESENCLAST \TMP3, \XMM2
@@ -1311,18 +1311,18 @@ TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 operation
 	mov	keysize,%eax
 	shr	$2,%eax			# 128->4, 192->6, 256->8
 	sub	$4,%eax
[PATCH 06/14] x86/crypto: aesni: Introduce gcm_context_data
Introduce a gcm_context_data struct that will be used to pass context data between scatter/gather update calls. It is passed as the second argument (after crypto keys); other args are renumbered.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S  | 115 +
 arch/x86/crypto/aesni-intel_glue.c |  81 ++
 2 files changed, 121 insertions(+), 75 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 8021fd1..6c5a80d 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -111,6 +111,14 @@ ALL_F: .octa 0x
 // (for Karatsuba purposes)
 #define	VARIABLE_OFFSET	16*8
 
+#define AadHash 16*0
+#define AadLen 16*1
+#define InLen (16*1)+8
+#define PBlockEncKey 16*2
+#define OrigIV 16*3
+#define CurCount 16*4
+#define PBlockLen 16*5
+
 #define arg1 rdi
 #define arg2 rsi
 #define arg3 rdx
@@ -121,6 +129,7 @@ ALL_F: .octa 0x
 #define arg8 STACK_OFFSET+16(%r14)
 #define arg9 STACK_OFFSET+24(%r14)
 #define arg10 STACK_OFFSET+32(%r14)
+#define arg11 STACK_OFFSET+40(%r14)
 #define keysize 2*15*16(%arg1)
 #endif
@@ -195,9 +204,9 @@ ALL_F: .octa 0x
 # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
 # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13
 .macro GCM_INIT
-	mov	%arg6, %r12
+	mov	arg7, %r12
 	movdqu	(%r12), %xmm13
-	movdqa SHUF_MASK(%rip), %xmm2
+	movdqa	SHUF_MASK(%rip), %xmm2
 	PSHUFB_XMM %xmm2, %xmm13
 
 	# precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
@@ -217,7 +226,7 @@ ALL_F: .octa 0x
 	pand	POLY(%rip), %xmm2
 	pxor	%xmm2, %xmm13
 	movdqa	%xmm13, HashKey(%rsp)
-	mov	%arg4, %r13		# %xmm13 holds HashKey<<1 (mod poly)
+	mov	%arg5, %r13		# %xmm13 holds HashKey<<1 (mod poly)
 	and	$-16, %r13
 	mov	%r13, %r12
 .endm
@@ -271,18 +280,18 @@ _four_cipher_left_\@:
 	GHASH_LAST_4 %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
 	%xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
_zero_cipher_left_\@:
-	mov	%arg4, %r13
-	and	$15, %r13		# %r13 = arg4 (mod 16)
+	mov	%arg5, %r13
+	and	$15, %r13		# %r13 = arg5 (mod 16)
 	je	_multiple_of_16_bytes_\@
 
 	# Handle the last <16 Byte block separately
 	paddd	ONE(%rip), %xmm0	# INCR CNT to get Yn
-	movdqa SHUF_MASK(%rip), %xmm10
+	movdqa	SHUF_MASK(%rip), %xmm10
 	PSHUFB_XMM %xmm10, %xmm0
 
 	ENCRYPT_SINGLE_BLOCK %xmm0, %xmm1	# Encrypt(K, Yn)
-	lea	(%arg3,%r11,1), %r10
+	lea	(%arg4,%r11,1), %r10
 	mov	%r13, %r12
 	READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1
@@ -320,13 +329,13 @@ _zero_cipher_left_\@:
 	MOVQ_R64_XMM %xmm0, %rax
 	cmp	$8, %r13
 	jle	_less_than_8_bytes_left_\@
-	mov	%rax, (%arg2 , %r11, 1)
+	mov	%rax, (%arg3 , %r11, 1)
 	add	$8, %r11
 	psrldq	$8, %xmm0
 	MOVQ_R64_XMM %xmm0, %rax
 	sub	$8, %r13
_less_than_8_bytes_left_\@:
-	mov	%al, (%arg2, %r11, 1)
+	mov	%al, (%arg3, %r11, 1)
 	add	$1, %r11
 	shr	$8, %rax
 	sub	$1, %r13
@@ -338,11 +347,11 @@ _multiple_of_16_bytes_\@:
 # Output: Authorization Tag (AUTH_TAG)
 # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
 .macro GCM_COMPLETE
-	mov	arg8, %r12		# %r13 = aadLen (number of bytes)
+	mov	arg9, %r12		# %r13 = aadLen (number of bytes)
 	shl	$3, %r12		# convert into number of bits
 	movd	%r12d, %xmm15		# len(A) in %xmm15
-	shl	$3, %arg4		# len(C) in bits (*128)
-	MOVQ_R64_XMM %arg4, %xmm1
+	shl	$3, %arg5		# len(C) in bits (*128)
+	MOVQ_R64_XMM %arg5, %xmm1
 	pslldq	$8, %xmm15		# %xmm15 = len(A)||0x
 	pxor	%xmm1, %xmm15		# %xmm15 = len(A)||len(C)
 	pxor	%xmm15, %xmm8
@@ -351,13 +360,13 @@ _multiple_of_16_bytes_\@:
 	movdqa	SHUF_MASK(%rip), %xmm10
 	PSHUFB_XMM %xmm10, %xmm8
 
-	mov	%arg5, %rax		# %rax = *Y0
+	mov	%arg6, %rax		# %rax = *Y0
 	movdqu	(%rax), %xmm0		# %xmm0 = Y0
 	ENCRYPT_SINGLE_BLOCK %xmm0, %xmm1	# E(K, Y0)
 	pxor	%xmm8, %xmm0
_return_T_\@:
-	mov	arg9, %r10		# %r10 = authTag
-	mov	arg10, %r11		# %r11 = auth_tag_len
+	mov	arg10, %r10		# %r10 = authTag
[PATCH 06/14] x86/crypto: aesni: Introduce gcm_context_data
Introduce a gcm_context_data struct that will be used to pass context data between scatter/gather update calls. It is passed as the second argument (after crypto keys), other args are renumbered. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_asm.S | 115 + arch/x86/crypto/aesni-intel_glue.c | 81 ++ 2 files changed, 121 insertions(+), 75 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 8021fd1..6c5a80d 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -111,6 +111,14 @@ ALL_F: .octa 0x // (for Karatsuba purposes) #defineVARIABLE_OFFSET 16*8 +#define AadHash 16*0 +#define AadLen 16*1 +#define InLen (16*1)+8 +#define PBlockEncKey 16*2 +#define OrigIV 16*3 +#define CurCount 16*4 +#define PBlockLen 16*5 + #define arg1 rdi #define arg2 rsi #define arg3 rdx @@ -121,6 +129,7 @@ ALL_F: .octa 0x #define arg8 STACK_OFFSET+16(%r14) #define arg9 STACK_OFFSET+24(%r14) #define arg10 STACK_OFFSET+32(%r14) +#define arg11 STACK_OFFSET+40(%r14) #define keysize 2*15*16(%arg1) #endif @@ -195,9 +204,9 @@ ALL_F: .octa 0x # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding. 
# Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13 .macro GCM_INIT - mov %arg6, %r12 + mov arg7, %r12 movdqu (%r12), %xmm13 - movdqa SHUF_MASK(%rip), %xmm2 + movdqa SHUF_MASK(%rip), %xmm2 PSHUFB_XMM %xmm2, %xmm13 # precompute HashKey<<1 mod poly from the HashKey (required for GHASH) @@ -217,7 +226,7 @@ ALL_F: .octa 0x pandPOLY(%rip), %xmm2 pxor%xmm2, %xmm13 movdqa %xmm13, HashKey(%rsp) - mov %arg4, %r13 # %xmm13 holds HashKey<<1 (mod poly) + mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly) and $-16, %r13 mov %r13, %r12 .endm @@ -271,18 +280,18 @@ _four_cipher_left_\@: GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \ %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8 _zero_cipher_left_\@: - mov %arg4, %r13 - and $15, %r13 # %r13 = arg4 (mod 16) + mov %arg5, %r13 + and $15, %r13 # %r13 = arg5 (mod 16) je _multiple_of_16_bytes_\@ # Handle the last <16 Byte block separately paddd ONE(%rip), %xmm0# INCR CNT to get Yn -movdqa SHUF_MASK(%rip), %xmm10 + movdqa SHUF_MASK(%rip), %xmm10 PSHUFB_XMM %xmm10, %xmm0 ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn) - lea (%arg3,%r11,1), %r10 + lea (%arg4,%r11,1), %r10 mov %r13, %r12 READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1 @@ -320,13 +329,13 @@ _zero_cipher_left_\@: MOVQ_R64_XMM %xmm0, %rax cmp $8, %r13 jle _less_than_8_bytes_left_\@ - mov %rax, (%arg2 , %r11, 1) + mov %rax, (%arg3 , %r11, 1) add $8, %r11 psrldq $8, %xmm0 MOVQ_R64_XMM %xmm0, %rax sub $8, %r13 _less_than_8_bytes_left_\@: - mov %al, (%arg2, %r11, 1) + mov %al, (%arg3, %r11, 1) add $1, %r11 shr $8, %rax sub $1, %r13 @@ -338,11 +347,11 @@ _multiple_of_16_bytes_\@: # Output: Authorization Tag (AUTH_TAG) # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 .macro GCM_COMPLETE - mov arg8, %r12# %r13 = aadLen (number of bytes) + mov arg9, %r12# %r13 = aadLen (number of bytes) shl $3, %r12 # convert into number of bits movd%r12d, %xmm15 # len(A) in %xmm15 - shl $3, %arg4 # len(C) in bits (*128) - MOVQ_R64_XMM%arg4, %xmm1 + shl $3, %arg5 # len(C) in bits (*128) + 
MOVQ_R64_XMM%arg5, %xmm1 pslldq $8, %xmm15# %xmm15 = len(A)||0x pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C) pxor%xmm15, %xmm8 @@ -351,13 +360,13 @@ _multiple_of_16_bytes_\@: movdqa SHUF_MASK(%rip), %xmm10 PSHUFB_XMM %xmm10, %xmm8 - mov %arg5, %rax # %rax = *Y0 + mov %arg6, %rax # %rax = *Y0 movdqu (%rax), %xmm0 # %xmm0 = Y0 ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1 # E(K, Y0) pxor%xmm8, %xmm0 _return_T_\@: - mov arg9, %r10 # %r10 = authTag - mov arg10, %r11# %r11 = auth_tag_len + mov arg10, %r10 # %r10 = authTag + mov arg11, %r11
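The byte offsets that the new `AadHash` … `PBlockLen` defines encode can be pictured as a C struct. The sketch below is illustrative only — the field names are invented here; only the offsets come from the patch's `#define` list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the layout the asm #defines describe.
 * Field names are illustrative; offsets match the patch:
 *   AadHash 16*0, AadLen 16*1, InLen 16*1+8,
 *   PBlockEncKey 16*2, OrigIV 16*3, CurCount 16*4, PBlockLen 16*5 */
struct gcm_context_data {
	uint8_t  aad_hash[16];              /* AadHash      = 16*0     */
	uint64_t aad_length;                /* AadLen       = 16*1     */
	uint64_t in_length;                 /* InLen        = 16*1 + 8 */
	uint8_t  partial_block_enc_key[16]; /* PBlockEncKey = 16*2     */
	uint8_t  orig_iv[16];               /* OrigIV       = 16*3     */
	uint8_t  current_counter[16];       /* CurCount     = 16*4     */
	uint64_t partial_block_len;         /* PBlockLen    = 16*5     */
};

/* compile-time checks that the struct really lands on those offsets */
_Static_assert(offsetof(struct gcm_context_data, aad_hash) == 16 * 0, "");
_Static_assert(offsetof(struct gcm_context_data, aad_length) == 16 * 1, "");
_Static_assert(offsetof(struct gcm_context_data, in_length) == 16 * 1 + 8, "");
_Static_assert(offsetof(struct gcm_context_data, partial_block_enc_key) == 16 * 2, "");
_Static_assert(offsetof(struct gcm_context_data, orig_iv) == 16 * 3, "");
_Static_assert(offsetof(struct gcm_context_data, current_counter) == 16 * 4, "");
_Static_assert(offsetof(struct gcm_context_data, partial_block_len) == 16 * 5, "");
```

Passing one pointer to this block (as `arg2`) is what lets the asm address all per-operation state with fixed displacements like `AadHash(%arg2)`.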
[PATCH 14/14] x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather
Add gcmaes_en/decrypt_sg routines that do scatter/gather by sg. Either src or dst may contain multiple buffers, so iterate over both at the same time if they are different. If the input is the same as the output, iterate only over one. Currently both the AAD and TAG must be linear, so copy them out with scatterwalk_map_and_copy. Only the SSE routines are updated so far, so leave the previous gcmaes_en/decrypt routines, and branch to the sg ones if the keysize is inappropriate for avx, or we are SSE only. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_glue.c | 166 + 1 file changed, 166 insertions(+) diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index de986f9..1e32fbe 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -791,6 +791,82 @@ static int generic_gcmaes_set_authsize(struct crypto_aead *tfm,
 	return 0;
 }
+static int gcmaes_encrypt_sg(struct aead_request *req, unsigned int assoclen,
+			     u8 *hash_subkey, u8 *iv, void *aes_ctx)
+{
+	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+	unsigned long auth_tag_len = crypto_aead_authsize(tfm);
+	struct gcm_context_data data AESNI_ALIGN_ATTR;
+	struct scatter_walk dst_sg_walk = {};
+	unsigned long left = req->cryptlen;
+	unsigned long len, srclen, dstlen;
+	struct scatter_walk src_sg_walk;
+	struct scatterlist src_start[2];
+	struct scatterlist dst_start[2];
+	struct scatterlist *src_sg;
+	struct scatterlist *dst_sg;
+	u8 *src, *dst, *assoc;
+	u8 authTag[16];
+
+	assoc = kmalloc(assoclen, GFP_ATOMIC);
+	if (unlikely(!assoc))
+		return -ENOMEM;
+	scatterwalk_map_and_copy(assoc, req->src, 0, assoclen, 0);
+
+	src_sg = scatterwalk_ffwd(src_start, req->src, req->assoclen);
+	scatterwalk_start(&src_sg_walk, src_sg);
+	if (req->src != req->dst) {
+		dst_sg = scatterwalk_ffwd(dst_start, req->dst, req->assoclen);
+		scatterwalk_start(&dst_sg_walk, dst_sg);
+	}
+
+	kernel_fpu_begin();
+	aesni_gcm_init(aes_ctx, &data, iv,
+		       hash_subkey, assoc, assoclen);
+	if (req->src != req->dst) {
+		while (left) {
+			src = scatterwalk_map(&src_sg_walk);
+			dst = scatterwalk_map(&dst_sg_walk);
+			srclen = scatterwalk_clamp(&src_sg_walk, left);
+			dstlen = scatterwalk_clamp(&dst_sg_walk, left);
+			len = min(srclen, dstlen);
+			if (len)
+				aesni_gcm_enc_update(aes_ctx, &data,
+						     dst, src, len);
+			left -= len;
+
+			scatterwalk_unmap(src);
+			scatterwalk_unmap(dst);
+			scatterwalk_advance(&src_sg_walk, len);
+			scatterwalk_advance(&dst_sg_walk, len);
+			scatterwalk_done(&src_sg_walk, 0, left);
+			scatterwalk_done(&dst_sg_walk, 1, left);
+		}
+	} else {
+		while (left) {
+			dst = src = scatterwalk_map(&src_sg_walk);
+			len = scatterwalk_clamp(&src_sg_walk, left);
+			if (len)
+				aesni_gcm_enc_update(aes_ctx, &data,
+						     src, src, len);
+			left -= len;
+			scatterwalk_unmap(src);
+			scatterwalk_advance(&src_sg_walk, len);
+			scatterwalk_done(&src_sg_walk, 1, left);
+		}
+	}
+	aesni_gcm_finalize(aes_ctx, &data, authTag, auth_tag_len);
+	kernel_fpu_end();
+
+	kfree(assoc);
+
+	/* Copy out the authTag */
+	scatterwalk_map_and_copy(authTag, req->dst,
+				 req->assoclen + req->cryptlen,
+				 auth_tag_len, 1);
+	return 0;
+}
+
 static int gcmaes_encrypt(struct aead_request *req, unsigned int assoclen,
 			  u8 *hash_subkey, u8 *iv, void *aes_ctx)
 {
@@ -802,6 +878,11 @@ static int gcmaes_encrypt(struct aead_request *req, unsigned int assoclen,
 	struct scatter_walk dst_sg_walk = {};
 	struct gcm_context_data data AESNI_ALIGN_ATTR;
+	if (((struct crypto_aes_ctx *)aes_ctx)->key_length != AES_KEYSIZE_128 ||
+	    aesni_gcm_enc_tfm == aesni_gcm_enc) {
+		return gcmaes_encrypt_sg(req, assoclen, hash_subkey, iv,
+					 aes_ctx);
+	}
 	if (sg_is_last(req->src) &&
 	    (!PageHighMem(sg_page(req->src)) ||
 	    req->src->offset + req->src->length <= PAGE_SIZE) && @@ -854,6
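The dual src/dst walk in gcmaes_encrypt_sg can be modeled outside the kernel. The sketch below is a simplified stand-in: plain buffer segments replace scatterlists, and a memcpy replaces aesni_gcm_enc_update; `struct seg` and `walk_pair` are invented names, not the kernel scatterwalk API. The key point it demonstrates is the `min(srclen, dstlen)` clamp, which keeps each step inside both the current source segment and the current destination segment:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One scatter "segment": a pointer and a length (hypothetical helper type). */
struct seg { const char *p; size_t len; };

/* Walk src and dst segment lists in lockstep.  Each iteration processes
 * min(remaining-in-src-segment, remaining-in-dst-segment) bytes, so the
 * per-chunk "crypto" call (here just a memcpy into out) never crosses a
 * segment boundary on either side. */
static size_t walk_pair(const struct seg *src, const struct seg *dst,
			size_t left, char *out)
{
	size_t si = 0, di = 0, soff = 0, doff = 0, done = 0;

	while (left) {
		size_t srclen = src[si].len - soff;   /* clamp to src segment */
		size_t dstlen = dst[di].len - doff;   /* clamp to dst segment */
		size_t len = srclen < dstlen ? srclen : dstlen;

		if (len > left)
			len = left;
		memcpy(out + done, src[si].p + soff, len); /* enc_update stand-in */
		done += len;
		left -= len;
		soff += len;
		doff += len;
		if (soff == src[si].len) { si++; soff = 0; } /* advance src walk */
		if (doff == dst[di].len) { di++; doff = 0; } /* advance dst walk */
	}
	return done;
}
```

With src segments `{"abc","defgh"}` and dst segments of lengths `{2,6}`, the loop takes chunks of 2, 1, and 5 bytes — exactly the boundary pattern the kernel loop's clamp/advance/done calls produce.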
[PATCH 05/14] x86/crypto: aesni: Merge encode and decode to GCM_ENC_DEC macro
Make a macro for the main encode/decode routine. Only a small handful of lines differ for enc and dec. This will also become the main scatter/gather update routine. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 293 +++--- 1 file changed, 114 insertions(+), 179 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 529c542..8021fd1 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -222,6 +222,118 @@ ALL_F: .octa 0x mov %r13, %r12 .endm +# GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context +# struct has been initialized by GCM_INIT. +# Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK +# Clobbers rax, r10-r13, and xmm0-xmm15 +.macro GCM_ENC_DEC operation + # Encrypt/Decrypt first few blocks + + and $(3<<4), %r12 + jz _initial_num_blocks_is_0_\@ + cmp $(2<<4), %r12 + jb _initial_num_blocks_is_1_\@ + je _initial_num_blocks_is_2_\@ +_initial_num_blocks_is_3_\@: + INITIAL_BLOCKS_ENC_DEC %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \ +%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 5, 678, \operation + sub $48, %r13 + jmp _initial_blocks_\@ +_initial_num_blocks_is_2_\@: + INITIAL_BLOCKS_ENC_DEC %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \ +%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 6, 78, \operation + sub $32, %r13 + jmp _initial_blocks_\@ +_initial_num_blocks_is_1_\@: + INITIAL_BLOCKS_ENC_DEC %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \ +%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 7, 8, \operation + sub $16, %r13 + jmp _initial_blocks_\@ +_initial_num_blocks_is_0_\@: + INITIAL_BLOCKS_ENC_DEC %xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \ +%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 8, 0, \operation +_initial_blocks_\@: + + # Main loop - Encrypt/Decrypt remaining blocks + + cmp $0, %r13 + je _zero_cipher_left_\@ + sub $64, %r13 + je _four_cipher_left_\@ 
+_crypt_by_4_\@: + GHASH_4_ENCRYPT_4_PARALLEL_\operation %xmm9, %xmm10, %xmm11, %xmm12, \ + %xmm13, %xmm14, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, \ + %xmm7, %xmm8, enc + add $64, %r11 + sub $64, %r13 + jne _crypt_by_4_\@ +_four_cipher_left_\@: + GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \ +%xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8 +_zero_cipher_left_\@: + mov %arg4, %r13 + and $15, %r13 # %r13 = arg4 (mod 16) + je _multiple_of_16_bytes_\@ + + # Handle the last <16 Byte block separately + paddd ONE(%rip), %xmm0# INCR CNT to get Yn +movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10, %xmm0 + + ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn) + + lea (%arg3,%r11,1), %r10 + mov %r13, %r12 + READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1 + + lea ALL_F+16(%rip), %r12 + sub %r13, %r12 +.ifc \operation, dec + movdqa %xmm1, %xmm2 +.endif + pxor%xmm1, %xmm0# XOR Encrypt(K, Yn) + movdqu (%r12), %xmm1 + # get the appropriate mask to mask out top 16-r13 bytes of xmm0 + pand%xmm1, %xmm0# mask out top 16-r13 bytes of xmm0 +.ifc \operation, dec + pand%xmm1, %xmm2 + movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10 ,%xmm2 + + pxor %xmm2, %xmm8 +.else + movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10,%xmm0 + + pxor%xmm0, %xmm8 +.endif + + GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 +.ifc \operation, enc + # GHASH computation for the last <16 byte block + movdqa SHUF_MASK(%rip), %xmm10 + # shuffle xmm0 back to output as ciphertext + PSHUFB_XMM %xmm10, %xmm0 +.endif + + # Output %r13 bytes + MOVQ_R64_XMM %xmm0, %rax + cmp $8, %r13 + jle _less_than_8_bytes_left_\@ + mov %rax, (%arg2 , %r11, 1) + add $8, %r11 + psrldq $8, %xmm0 + MOVQ_R64_XMM %xmm0, %rax + sub $8, %r13 +_less_than_8_bytes_left_\@: + mov %al, (%arg2, %r11, 1) + add $1, %r11 + shr $8, %rax + sub $1, %r13 + jne _less_than_8_bytes_left_\@ +_multiple_of_16_bytes_\@: +.endm + # GCM_COMPLETE Finishes update of tag of last partial block # Output: Authorization Tag (AUTH_TAG) # Clobbers 
rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 @@ -1245,93 +1357,7 @@ ENTRY(aesni_gcm_dec) FUNC_SAVE GCM_INIT - -# Decrypt first few blocks - - and $(3<<4), %r12 - jz _initial_num_blocks_is_0_decrypt -
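The dispatch at the top of GCM_ENC_DEC (`and $(3<<4), %r12` followed by the `cmp`/`jb`/`je` chain) chooses how many blocks to encrypt before entering the 4-blocks-at-a-time main loop. A tiny C model of that arithmetic, with an invented helper name:

```c
#include <assert.h>

/* Models GCM_ENC_DEC's initial dispatch: r13 = whole-block byte count
 * (len & ~15), and (r13 & (3 << 4)) >> 4 = number of 16-byte blocks mod 4.
 * That many blocks are handled by INITIAL_BLOCKS_ENC_DEC, leaving the
 * remainder an exact multiple of 4 for the _crypt_by_4_ loop. */
static unsigned initial_blocks(unsigned long len)
{
	unsigned long r13 = len & ~15UL;          /* full 16-byte blocks only */
	return (unsigned)((r13 & (3UL << 4)) >> 4); /* block count mod 4 */
}
```

For example, a 117-byte input has 7 full blocks, so 3 go through the initial path and the remaining 4 run through one iteration of the parallel loop; the 5 leftover bytes fall to `_zero_cipher_left_`.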
[PATCH 12/14] x86/crypto: aesni: Add fast path for > 16 byte update
We can fast-path any < 16 byte read if the full message is > 16 bytes, and shift over by the appropriate amount. Usually we are reading > 16 bytes, so this should be faster than the READ_PARTIAL macro introduced in b20209c91e2 for the average case. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 25 + 1 file changed, 25 insertions(+) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 398bd2237f..b941952 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -355,12 +355,37 @@ _zero_cipher_left_\@: ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn) movdqu %xmm0, PBlockEncKey(%arg2) + cmp $16, %arg5 + jge _large_enough_update_\@ + lea (%arg4,%r11,1), %r10 mov %r13, %r12 READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1 + jmp _data_read_\@ + +_large_enough_update_\@: + sub $16, %r11 + add %r13, %r11 + + # receive the last <16 Byte block + movdqu (%arg4, %r11, 1), %xmm1 + sub %r13, %r11 + add $16, %r11 + + lea SHIFT_MASK+16(%rip), %r12 + # adjust the shuffle mask pointer to be able to shift 16-r13 bytes + # (r13 is the number of bytes in plaintext mod 16) + sub %r13, %r12 + # get the appropriate shuffle mask + movdqu (%r12), %xmm2 + # shift right 16-r13 bytes + PSHUFB_XMM %xmm2, %xmm1 + +_data_read_\@: lea ALL_F+16(%rip), %r12 sub %r13, %r12 + .ifc \operation, dec movdqa %xmm1, %xmm2 .endif -- 2.9.5
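The fast path can be sketched in C. Assuming the message is at least 16 bytes, a single unaligned 16-byte load ending exactly at the end of the data replaces the byte-by-byte READ_PARTIAL_BLOCK, and the trailing shift models what PSHUFB does with the `SHIFT_MASK + 16 - r13` mask. The helper name and memcpy-based "load" are illustrative, not the kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fetch the final r13 (< 16) bytes of a message with one 16-byte load.
 * The load starts at offset + r13 - 16, so it needs offset + r13 >= 16
 * (i.e. the message is "large enough") but never reads past the end.
 * The wanted bytes land in block[16-r13 .. 15]; shifting them down by
 * 16 - r13 positions mirrors the PSHUFB with the SHIFT_MASK entry. */
static void read_last_partial(const uint8_t *data, size_t offset, size_t r13,
			      uint8_t out[16])
{
	uint8_t block[16];

	memcpy(block, data + offset + r13 - 16, 16); /* one unaligned load */

	memset(out, 0, 16);
	memcpy(out, block + (16 - r13), r13);        /* shift right 16-r13 */
}
```

This is why the asm temporarily rewinds `%r11` by `16 - r13`: the load window slides back so it ends on the last message byte instead of overrunning the buffer.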
[PATCH 10/14] x86/crypto: aesni: Move HashKey computation from stack to gcm_context
HashKey computation only needs to happen once per scatter/gather operation, save it between calls in gcm_context struct instead of on the stack. Since the asm no longer stores anything on the stack, we can use %rsp directly, and clean up the frame save/restore macros a bit. Hashkeys actually only need to be calculated once per key and could be moved to when set_key is called, however, the current glue code falls back to generic aes code if fpu is disabled. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 205 -- 1 file changed, 106 insertions(+), 99 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 37b1cee..3ada06b 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -93,23 +93,6 @@ ALL_F: .octa 0x #defineSTACK_OFFSET8*3 -#defineHashKey 16*0// store HashKey <<1 mod poly here -#defineHashKey_2 16*1// store HashKey^2 <<1 mod poly here -#defineHashKey_3 16*2// store HashKey^3 <<1 mod poly here -#defineHashKey_4 16*3// store HashKey^4 <<1 mod poly here -#defineHashKey_k 16*4// store XOR of High 64 bits and Low 64 - // bits of HashKey <<1 mod poly here - //(for Karatsuba purposes) -#defineHashKey_2_k 16*5// store XOR of High 64 bits and Low 64 - // bits of HashKey^2 <<1 mod poly here - // (for Karatsuba purposes) -#defineHashKey_3_k 16*6// store XOR of High 64 bits and Low 64 - // bits of HashKey^3 <<1 mod poly here - // (for Karatsuba purposes) -#defineHashKey_4_k 16*7// store XOR of High 64 bits and Low 64 - // bits of HashKey^4 <<1 mod poly here - // (for Karatsuba purposes) -#defineVARIABLE_OFFSET 16*8 #define AadHash 16*0 #define AadLen 16*1 @@ -118,6 +101,22 @@ ALL_F: .octa 0x #define OrigIV 16*3 #define CurCount 16*4 #define PBlockLen 16*5 +#defineHashKey 16*6// store HashKey <<1 mod poly here +#defineHashKey_2 16*7// store HashKey^2 <<1 mod poly here +#defineHashKey_3 16*8// store HashKey^3 <<1 mod poly here +#defineHashKey_4 16*9// 
store HashKey^4 <<1 mod poly here +#defineHashKey_k 16*10 // store XOR of High 64 bits and Low 64 + // bits of HashKey <<1 mod poly here + //(for Karatsuba purposes) +#defineHashKey_2_k 16*11 // store XOR of High 64 bits and Low 64 + // bits of HashKey^2 <<1 mod poly here + // (for Karatsuba purposes) +#defineHashKey_3_k 16*12 // store XOR of High 64 bits and Low 64 + // bits of HashKey^3 <<1 mod poly here + // (for Karatsuba purposes) +#defineHashKey_4_k 16*13 // store XOR of High 64 bits and Low 64 + // bits of HashKey^4 <<1 mod poly here + // (for Karatsuba purposes) #define arg1 rdi #define arg2 rsi @@ -125,11 +124,11 @@ ALL_F: .octa 0x #define arg4 rcx #define arg5 r8 #define arg6 r9 -#define arg7 STACK_OFFSET+8(%r14) -#define arg8 STACK_OFFSET+16(%r14) -#define arg9 STACK_OFFSET+24(%r14) -#define arg10 STACK_OFFSET+32(%r14) -#define arg11 STACK_OFFSET+40(%r14) +#define arg7 STACK_OFFSET+8(%rsp) +#define arg8 STACK_OFFSET+16(%rsp) +#define arg9 STACK_OFFSET+24(%rsp) +#define arg10 STACK_OFFSET+32(%rsp) +#define arg11 STACK_OFFSET+40(%rsp) #define keysize 2*15*16(%arg1) #endif @@ -183,28 +182,79 @@ ALL_F: .octa 0x push%r12 push%r13 push%r14 - mov %rsp, %r14 # # states of %xmm registers %xmm6:%xmm15 not saved # all %xmm registers are clobbered # - sub $VARIABLE_OFFSET, %rsp - and $~63, %rsp .endm .macro FUNC_RESTORE - mov %r14, %rsp pop %r14 pop %r13 pop %r12 .endm +# Precompute hashkeys. +# Input: Hash subkey. +# Output: HashKeys stored in gcm_context_data. Only needs to be called +# once per key. +# clobbers r12, and tmp xmm registers. +.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7 + mov arg7, %r12 + movdqu (%r12), \TMP3 + movdqa SHUF_MASK(%rip), \TMP2 + PSHUFB_XMM \TMP2, \TMP3 + + # precompute HashK
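The "once per key" point in the commit message can be illustrated with a small C sketch of the caching pattern. Note the guard shown here is the proposed set_key-time behaviour the message alludes to, not what this patch implements (the patch still runs PRECOMPUTE from GCM_INIT); all names are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Key powers live in the context instead of on the stack, so they can
 * survive across update calls and, ideally, across whole operations. */
struct gcm_ctx {
	uint8_t hash_keys[8][16]; /* H, H^2..H^4 and their Karatsuba halves */
	int precomputed;
};

static int precompute_calls; /* instrumentation for the sketch */

/* Stand-in for PRECOMPUTE: derive and cache the key powers. */
static void expand(struct gcm_ctx *c, const uint8_t subkey[16])
{
	precompute_calls++;
	memcpy(c->hash_keys[0], subkey, 16); /* real code also fills H^2..H^4 */
	c->precomputed = 1;
}

/* Stand-in for an enc/dec update: only the first call per key pays the
 * precomputation cost. */
static void gcm_update(struct gcm_ctx *c, const uint8_t subkey[16])
{
	if (!c->precomputed)
		expand(c, subkey);
	/* ... per-call enc/dec work would go here ... */
}
```

Moving the expansion to set_key would make even the first update cheap, but, as the commit message notes, the current glue falls back to generic AES when the FPU is unavailable, so the precompute stays in GCM_INIT for now.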
[PATCH 11/14] x86/crypto: aesni: Introduce partial block macro
Before this diff, multiple calls to GCM_ENC_DEC will succeed, but only if all calls are a multiple of 16 bytes. Handle partial blocks at the start of GCM_ENC_DEC, and update aadhash as appropriate. The data offset %r11 is also updated after the partial block. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 151 +- 1 file changed, 150 insertions(+), 1 deletion(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 3ada06b..398bd2237f 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -284,7 +284,13 @@ ALL_F: .octa 0x movdqu AadHash(%arg2), %xmm8 movdqu HashKey(%arg2), %xmm13 add %arg5, InLen(%arg2) + + xor %r11, %r11 # initialise the data pointer offset as zero + PARTIAL_BLOCK %arg3 %arg4 %arg5 %r11 %xmm8 \operation + + sub %r11, %arg5 # sub partial block data used mov %arg5, %r13 # save the number of bytes + and $-16, %r13 # %r13 = %r13 - (%r13 mod 16) mov %r13, %r12 # Encrypt/Decrypt first few blocks @@ -605,6 +611,150 @@ _get_AAD_done\@: movdqu \TMP6, AadHash(%arg2) .endm +# PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks +# between update calls. 
+# Requires the input data be at least 1 byte long due to READ_PARTIAL_BLOCK
+# Outputs encrypted bytes, and updates hash and partial info in gcm_context_data
+# Clobbers rax, r10, r12, r13, xmm0-6, xmm9-13
+.macro PARTIAL_BLOCK CYPH_PLAIN_OUT PLAIN_CYPH_IN PLAIN_CYPH_LEN DATA_OFFSET \
+	AAD_HASH operation
+	mov	PBlockLen(%arg2), %r13
+	cmp	$0, %r13
+	je	_partial_block_done_\@	# Leave Macro if no partial blocks
+	# Read in input data without over reading
+	cmp	$16, \PLAIN_CYPH_LEN
+	jl	_fewer_than_16_bytes_\@
+	movups	(\PLAIN_CYPH_IN), %xmm1	# If more than 16 bytes, just fill xmm
+	jmp	_data_read_\@
+
+_fewer_than_16_bytes_\@:
+	lea	(\PLAIN_CYPH_IN, \DATA_OFFSET, 1), %r10
+	mov	\PLAIN_CYPH_LEN, %r12
+	READ_PARTIAL_BLOCK %r10 %r12 %xmm0 %xmm1
+
+	mov	PBlockLen(%arg2), %r13
+
+_data_read_\@:	# Finished reading in data
+
+	movdqu	PBlockEncKey(%arg2), %xmm9
+	movdqu	HashKey(%arg2), %xmm13
+
+	lea	SHIFT_MASK(%rip), %r12
+
+	# adjust the shuffle mask pointer to be able to shift r13 bytes
+	# (16-r13 is the number of bytes in plaintext mod 16)
+	add	%r13, %r12
+	movdqu	(%r12), %xmm2		# get the appropriate shuffle mask
+	PSHUFB_XMM %xmm2, %xmm9		# shift right r13 bytes
+
+.ifc \operation, dec
+	movdqa	%xmm1, %xmm3
+	pxor	%xmm1, %xmm9		# Cyphertext XOR E(K, Yn)
+
+	mov	\PLAIN_CYPH_LEN, %r10
+	add	%r13, %r10
+	# Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling
+	sub	$16, %r10
+	# Determine if partial block is not being filled and
+	# shift mask accordingly
+	jge	_no_extra_mask_1_\@
+	sub	%r10, %r12
+_no_extra_mask_1_\@:
+
+	movdqu	ALL_F-SHIFT_MASK(%r12), %xmm1
+	# get the appropriate mask to mask out bottom r13 bytes of xmm9
+	pand	%xmm1, %xmm9		# mask out bottom r13 bytes of xmm9
+
+	pand	%xmm1, %xmm3
+	movdqa	SHUF_MASK(%rip), %xmm10
+	PSHUFB_XMM %xmm10, %xmm3
+	PSHUFB_XMM %xmm2, %xmm3
+	pxor	%xmm3, \AAD_HASH
+
+	cmp	$0, %r10
+	jl	_partial_incomplete_1_\@
+
+	# GHASH computation for the last <16 Byte block
+	GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
+	xor	%rax, %rax
+
+	mov	%rax, PBlockLen(%arg2)
+	jmp	_dec_done_\@
+_partial_incomplete_1_\@:
+	add	\PLAIN_CYPH_LEN, PBlockLen(%arg2)
+_dec_done_\@:
+	movdqu	\AAD_HASH, AadHash(%arg2)
+.else
+	pxor	%xmm1, %xmm9		# Plaintext XOR E(K, Yn)
+
+	mov	\PLAIN_CYPH_LEN, %r10
+	add	%r13, %r10
+	# Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling
+	sub	$16, %r10
+	# Determine if partial block is not being filled and
+	# shift mask accordingly
+	jge	_no_extra_mask_2_\@
+	sub	%r10, %r12
+_no_extra_mask_2_\@:
+
+	movdqu	ALL_F-SHIFT_MASK(%r12), %xmm1
+	# get the appropriate mask to mask out bottom r13 bytes of xmm9
+	pand	%xmm1, %xmm9
+
+	movdqa	SHUF_MASK(%rip), %xmm1
+	PSHUFB_XMM %xmm1, %xmm9
+	PSHUFB_XMM %xmm2, %xmm9
+	pxor	%xmm9, \AAD_HASH
+
+	cmp	$0, %r10
+	jl	_partial_incomplete_2_\@
+
+	# GHASH computation for the last <16 Byte block
+	GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
[PATCH 11/14] x86/crypto: aesni: Introduce partial block macro
Before this diff, multiple calls to GCM_ENC_DEC will succeed, but only if all calls are a multiple of 16 bytes. Handle partial blocks at the start of GCM_ENC_DEC, and update aadhash as appropriate. The data offset %r11 is also updated after the partial block. Signed-off-by: Dave Watson --- arch/x86/crypto/aesni-intel_asm.S | 151 +- 1 file changed, 150 insertions(+), 1 deletion(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 3ada06b..398bd2237f 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -284,7 +284,13 @@ ALL_F: .octa 0x movdqu AadHash(%arg2), %xmm8 movdqu HashKey(%arg2), %xmm13 add %arg5, InLen(%arg2) + + xor %r11, %r11 # initialise the data pointer offset as zero + PARTIAL_BLOCK %arg3 %arg4 %arg5 %r11 %xmm8 \operation + + sub %r11, %arg5 # sub partial block data used mov %arg5, %r13 # save the number of bytes + and $-16, %r13 # %r13 = %r13 - (%r13 mod 16) mov %r13, %r12 # Encrypt/Decrypt first few blocks @@ -605,6 +611,150 @@ _get_AAD_done\@: movdqu \TMP6, AadHash(%arg2) .endm +# PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks +# between update calls. 
+# Requires the input data be at least 1 byte long due to READ_PARTIAL_BLOCK +# Outputs encrypted bytes, and updates hash and partial info in gcm_data_context +# Clobbers rax, r10, r12, r13, xmm0-6, xmm9-13 +.macro PARTIAL_BLOCK CYPH_PLAIN_OUT PLAIN_CYPH_IN PLAIN_CYPH_LEN DATA_OFFSET \ + AAD_HASH operation + mov PBlockLen(%arg2), %r13 + cmp $0, %r13 + je _partial_block_done_\@ # Leave Macro if no partial blocks + # Read in input data without over reading + cmp $16, \PLAIN_CYPH_LEN + jl _fewer_than_16_bytes_\@ + movups (\PLAIN_CYPH_IN), %xmm1 # If more than 16 bytes, just fill xmm + jmp _data_read_\@ + +_fewer_than_16_bytes_\@: + lea (\PLAIN_CYPH_IN, \DATA_OFFSET, 1), %r10 + mov \PLAIN_CYPH_LEN, %r12 + READ_PARTIAL_BLOCK %r10 %r12 %xmm0 %xmm1 + + mov PBlockLen(%arg2), %r13 + +_data_read_\@: # Finished reading in data + + movdqu PBlockEncKey(%arg2), %xmm9 + movdqu HashKey(%arg2), %xmm13 + + lea SHIFT_MASK(%rip), %r12 + + # adjust the shuffle mask pointer to be able to shift r13 bytes + # r16-r13 is the number of bytes in plaintext mod 16) + add %r13, %r12 + movdqu (%r12), %xmm2 # get the appropriate shuffle mask + PSHUFB_XMM %xmm2, %xmm9 # shift right r13 bytes + +.ifc \operation, dec + movdqa %xmm1, %xmm3 + pxor%xmm1, %xmm9# Cyphertext XOR E(K, Yn) + + mov \PLAIN_CYPH_LEN, %r10 + add %r13, %r10 + # Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling + sub $16, %r10 + # Determine if if partial block is not being filled and + # shift mask accordingly + jge _no_extra_mask_1_\@ + sub %r10, %r12 +_no_extra_mask_1_\@: + + movdqu ALL_F-SHIFT_MASK(%r12), %xmm1 + # get the appropriate mask to mask out bottom r13 bytes of xmm9 + pand%xmm1, %xmm9# mask out bottom r13 bytes of xmm9 + + pand%xmm1, %xmm3 + movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10, %xmm3 + PSHUFB_XMM %xmm2, %xmm3 + pxor%xmm3, \AAD_HASH + + cmp $0, %r10 + jl _partial_incomplete_1_\@ + + # GHASH computation for the last <16 Byte block + GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, 
%xmm5, %xmm6 + xor %rax,%rax + + mov %rax, PBlockLen(%arg2) + jmp _dec_done_\@ +_partial_incomplete_1_\@: + add \PLAIN_CYPH_LEN, PBlockLen(%arg2) +_dec_done_\@: + movdqu \AAD_HASH, AadHash(%arg2) +.else + pxor%xmm1, %xmm9# Plaintext XOR E(K, Yn) + + mov \PLAIN_CYPH_LEN, %r10 + add %r13, %r10 + # Set r10 to be the amount of data left in CYPH_PLAIN_IN after filling + sub $16, %r10 + # Determine if if partial block is not being filled and + # shift mask accordingly + jge _no_extra_mask_2_\@ + sub %r10, %r12 +_no_extra_mask_2_\@: + + movdqu ALL_F-SHIFT_MASK(%r12), %xmm1 + # get the appropriate mask to mask out bottom r13 bytes of xmm9 + pand%xmm1, %xmm9 + + movdqa SHUF_MASK(%rip), %xmm1 + PSHUFB_XMM %xmm1, %xmm9 + PSHUFB_XMM %xmm2, %xmm9 + pxor%xmm9, \AAD_HASH + + cmp $0, %r10 + jl _partial_incomplete_2_\@ + + # GHASH computation for the last <16 Byte block + GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6 + xor %ra
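The bookkeeping PARTIAL_BLOCK implements can be sketched in Python. This is a model of the state machine, not the kernel code: field names mirror the context offsets (PBlockLen, PBlockEncKey), and the saved keystream is treated as plain bytes.

```python
def partial_block_update(ctx, data):
    """Model of PARTIAL_BLOCK: consume leftover keystream bytes saved by
    the previous update call, then hand the 16-byte-aligned remainder on.
    Returns (bytes_produced, remaining_input)."""
    out = b""
    pblen = ctx["pblock_len"]               # PBlockLen(%arg2)
    if pblen:
        ks = ctx["pblock_enc_key"]          # PBlockEncKey: saved E(K, Yn)
        take = min(len(data), 16 - pblen)
        out = bytes(d ^ k for d, k in zip(data[:take], ks[pblen:pblen + take]))
        ctx["pblock_len"] = (pblen + take) % 16   # 16 consumed -> block done
        data = data[take:]
    return out, data
```

Note the two exits, matching the asm: either the partial block fills completely (count resets to zero and the hash can be folded), or the call was too short to fill it and the count just grows.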
[PATCH 08/14] x86/crypto: aesni: Fill in new context data structures
Fill in aadhash, aadlen, pblocklen, curcount with appropriate values. pblocklen, aadhash, and pblockenckey are also updated at the end of each scatter/gather operation, to be carried over to the next operation. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 51 ++- 1 file changed, 39 insertions(+), 12 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 58bbfac..aa82493 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -204,6 +204,21 @@ ALL_F: .octa 0x # GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding. # Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13 .macro GCM_INIT + + mov arg9, %r11 + mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length + xor %r11, %r11 + mov %r11, InLen(%arg2) # ctx_data.in_length = 0 + mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0 + mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0 + mov %arg6, %rax + movdqu (%rax), %xmm0 + movdqu %xmm0, OrigIV(%arg2) # ctx_data.orig_IV = iv + + movdqa SHUF_MASK(%rip), %xmm2 + PSHUFB_XMM %xmm2, %xmm0 + movdqu %xmm0, CurCount(%arg2) # ctx_data.current_counter = iv + mov arg7, %r12 movdqu (%r12), %xmm13 movdqa SHUF_MASK(%rip), %xmm2 @@ -226,13 +241,9 @@ ALL_F: .octa 0x pandPOLY(%rip), %xmm2 pxor%xmm2, %xmm13 movdqa %xmm13, HashKey(%rsp) - mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly) - and $-16, %r13 - mov %r13, %r12 CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \ %xmm5 %xmm6 - mov %r13, %r12 .endm # GCM_ENC_DEC Encodes/Decodes given data. 
Assumes that the passed gcm_context @@ -240,6 +251,12 @@ ALL_F: .octa 0x # Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK # Clobbers rax, r10-r13, and xmm0-xmm15 .macro GCM_ENC_DEC operation + movdqu AadHash(%arg2), %xmm8 + movdqu HashKey(%rsp), %xmm13 + add %arg5, InLen(%arg2) + mov %arg5, %r13 # save the number of bytes + and $-16, %r13 # %r13 = %r13 - (%r13 mod 16) + mov %r13, %r12 # Encrypt/Decrypt first few blocks and $(3<<4), %r12 @@ -284,16 +301,23 @@ _four_cipher_left_\@: GHASH_LAST_4%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \ %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8 _zero_cipher_left_\@: + movdqu %xmm8, AadHash(%arg2) + movdqu %xmm0, CurCount(%arg2) + mov %arg5, %r13 and $15, %r13 # %r13 = arg5 (mod 16) je _multiple_of_16_bytes_\@ + mov %r13, PBlockLen(%arg2) + # Handle the last <16 Byte block separately paddd ONE(%rip), %xmm0# INCR CNT to get Yn + movdqu %xmm0, CurCount(%arg2) movdqa SHUF_MASK(%rip), %xmm10 PSHUFB_XMM %xmm10, %xmm0 ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1# Encrypt(K, Yn) + movdqu %xmm0, PBlockEncKey(%arg2) lea (%arg4,%r11,1), %r10 mov %r13, %r12 @@ -322,6 +346,7 @@ _zero_cipher_left_\@: .endif GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 + movdqu %xmm8, AadHash(%arg2) .ifc \operation, enc # GHASH computation for the last <16 byte block movdqa SHUF_MASK(%rip), %xmm10 @@ -351,11 +376,15 @@ _multiple_of_16_bytes_\@: # Output: Authorization Tag (AUTH_TAG) # Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 .macro GCM_COMPLETE - mov arg9, %r12# %r13 = aadLen (number of bytes) + movdqu AadHash(%arg2), %xmm8 + movdqu HashKey(%rsp), %xmm13 + mov AadLen(%arg2), %r12 # %r13 = aadLen (number of bytes) shl $3, %r12 # convert into number of bits movd%r12d, %xmm15 # len(A) in %xmm15 - shl $3, %arg5 # len(C) in bits (*128) - MOVQ_R64_XMM%arg5, %xmm1 + mov InLen(%arg2), %r12 + shl $3, %r12 # len(C) in bits (*128) + MOVQ_R64_XMM%r12, %xmm1 + pslldq $8, %xmm15# %xmm15 = len(A)||0x pxor%xmm1, %xmm15 # %xmm15 = 
len(A)||len(C) pxor%xmm15, %xmm8 @@ -364,8 +393,7 @@ _multiple_of_16_bytes_\@: movdqa SHUF_MASK(%rip), %xmm10 PSHUFB_XMM %xmm10, %xmm8 - mov %arg6, %rax # %rax = *Y0 - movdqu (%rax), %xmm0 # %xmm0 = Y0 + movdqu OrigIV(%arg2), %xmm0 # %xmm0 = Y0 ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1 # E(K, Y0) pxor%xmm8, %xmm0 _return_T_\@: @@ -553,15
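The context fields GCM_INIT now fills can be mirrored as a plain dictionary. Names are illustrative (taken from the offsets in the diff above), and `iv[::-1]` stands in for the SHUF_MASK byte reflection — this is a sketch of the state layout, not the kernel structure.

```python
def gcm_init_ctx(iv, aad_len):
    """Sketch of the gcm_context_data state GCM_INIT establishes."""
    assert len(iv) == 16
    return {
        "aad_length": aad_len,   # AadLen       = aad_length
        "in_length": 0,          # InLen        = 0 (running ciphertext count)
        "pblock_len": 0,         # PBlockLen    = 0 (no unfinished block yet)
        "pblock_enc_key": 0,     # PBlockEncKey = 0
        "orig_iv": iv,           # OrigIV: kept for the final E(K, Y0)
        "cur_count": iv[::-1],   # CurCount: IV after the SHUF_MASK reflection
    }
```

The point of carrying all of this in the context rather than in registers or on the stack is exactly what the commit message says: each scatter/gather call can pick up where the previous one left off.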
[PATCH 09/14] x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE
Prepare to handle partial blocks between scatter/gather calls. For the last partial block, we only want to calculate the aadhash in GCM_COMPLETE, and a new partial block macro will handle both aadhash update and encrypting partial blocks between calls.

Signed-off-by: Dave Watson <davejwat...@fb.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index aa82493..37b1cee 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -345,7 +345,6 @@ _zero_cipher_left_\@:
 	pxor %xmm0, %xmm8
 .endif

-	GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
 	movdqu %xmm8, AadHash(%arg2)
 .ifc \operation, enc
 	# GHASH computation for the last <16 byte block
@@ -378,6 +377,15 @@ _multiple_of_16_bytes_\@:
 .macro GCM_COMPLETE
 	movdqu AadHash(%arg2), %xmm8
 	movdqu HashKey(%rsp), %xmm13
+
+	mov PBlockLen(%arg2), %r12
+
+	cmp $0, %r12
+	je _partial_done\@
+
+	GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
+
+_partial_done\@:
 	mov AadLen(%arg2), %r12 # %r13 = aadLen (number of bytes)
 	shl $3, %r12 # convert into number of bits
 	movd %r12d, %xmm15 # len(A) in %xmm15
-- 
2.9.5
[PATCH 07/14] x86/crypto: aesni: Split AAD hash calculation to separate macro
AAD hash only needs to be calculated once for each scatter/gather operation. Move it to its own macro, and call it from GCM_INIT instead of INITIAL_BLOCKS. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 71 --- 1 file changed, 43 insertions(+), 28 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 6c5a80d..58bbfac 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -229,6 +229,10 @@ ALL_F: .octa 0x mov %arg5, %r13 # %xmm13 holds HashKey<<1 (mod poly) and $-16, %r13 mov %r13, %r12 + + CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \ + %xmm5 %xmm6 + mov %r13, %r12 .endm # GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context @@ -496,51 +500,62 @@ _read_next_byte_lt8_\@: _done_read_partial_block_\@: .endm -/* -* if a = number of total plaintext bytes -* b = floor(a/16) -* num_initial_blocks = b mod 4 -* encrypt the initial num_initial_blocks blocks and apply ghash on -* the ciphertext -* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers -* are clobbered -* arg1, %arg3, %arg4, %r14 are used as a pointer only, not modified -*/ - - -.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \ -XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation -MOVADQ SHUF_MASK(%rip), %xmm14 - movarg8, %r10 # %r10 = AAD - movarg9, %r11 # %r11 = aadLen - pxor %xmm\i, %xmm\i - pxor \XMM2, \XMM2 +# CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted. 
+# clobbers r10-11, xmm14 +.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \ + TMP6 TMP7 + MOVADQ SHUF_MASK(%rip), %xmm14 + movarg8, %r10 # %r10 = AAD + movarg9, %r11 # %r11 = aadLen + pxor \TMP7, \TMP7 + pxor \TMP6, \TMP6 cmp$16, %r11 jl _get_AAD_rest\@ _get_AAD_blocks\@: - movdqu (%r10), %xmm\i - PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data - pxor %xmm\i, \XMM2 - GHASH_MUL \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 + movdqu (%r10), \TMP7 + PSHUFB_XMM %xmm14, \TMP7 # byte-reflect the AAD data + pxor \TMP7, \TMP6 + GHASH_MUL \TMP6, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5 add$16, %r10 sub$16, %r11 cmp$16, %r11 jge_get_AAD_blocks\@ - movdqu \XMM2, %xmm\i + movdqu \TMP6, \TMP7 /* read the last <16B of AAD */ _get_AAD_rest\@: cmp$0, %r11 je _get_AAD_done\@ - READ_PARTIAL_BLOCK %r10, %r11, \TMP1, %xmm\i - PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data - pxor \XMM2, %xmm\i - GHASH_MUL %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 + READ_PARTIAL_BLOCK %r10, %r11, \TMP1, \TMP7 + PSHUFB_XMM %xmm14, \TMP7 # byte-reflect the AAD data + pxor \TMP6, \TMP7 + GHASH_MUL \TMP7, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5 + movdqu \TMP7, \TMP6 _get_AAD_done\@: + movdqu \TMP6, AadHash(%arg2) +.endm + +/* +* if a = number of total plaintext bytes +* b = floor(a/16) +* num_initial_blocks = b mod 4 +* encrypt the initial num_initial_blocks blocks and apply ghash on +* the ciphertext +* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers +* are clobbered +* arg1, %arg2, %arg3, %r14 are used as a pointer only, not modified +*/ + + +.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \ + XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation + + movdqu AadHash(%arg2), %xmm\i # XMM0 = Y0 + xor%r11, %r11 # initialise the data pointer offset as zero # start AES for num_initial_blocks blocks -- 2.9.5
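The loop structure of CALC_AAD_HASH — whole 16-byte AAD blocks, then one zero-padded tail read via READ_PARTIAL_BLOCK — can be sketched as follows. `ghash_mul` here is a caller-supplied stand-in, not the real GF(2^128) multiply, and the SHUF_MASK byte reflection is omitted.

```python
def calc_aad_hash(aad, hashkey, ghash_mul):
    """Hash the AAD once per scatter/gather operation: full blocks first,
    then the <16-byte tail, zero-padded."""
    h = 0
    while len(aad) >= 16:                       # _get_AAD_blocks loop
        h = ghash_mul(h ^ int.from_bytes(aad[:16], "big"), hashkey)
        aad = aad[16:]
    if aad:                                     # _get_AAD_rest path
        tail = int.from_bytes(aad.ljust(16, b"\x00"), "big")
        h = ghash_mul(h ^ tail, hashkey)
    return h
```

Because this runs exactly once per operation, moving it out of INITIAL_BLOCKS and into GCM_INIT (as the patch does) avoids redoing it on every update call.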
[PATCH 00/14] x86/crypto gcmaes SSE scatter/gather support
This patch set refactors the x86 aes/gcm SSE crypto routines to support true scatter/gather by adding gcm_enc/dec_update methods.

The layout is:

* The first 5 patches refactor the code to use macros, so changes only need to be applied once for encode and decode. There should be no functional changes.
* The next 6 patches introduce a gcm_context structure to be passed between scatter/gather calls to maintain state. The struct is also used as scratch space for the existing enc/dec routines.
* The last 2 set up the asm function entry points for scatter/gather support, and then call the new routines per buffer in the passed-in sglist in aesni-intel_glue.

Testing: the asm itself was fuzz tested against the existing code and the isa-l asm. The libkcapi test suite was run and passes, as do my TLS tests. IPSec testing or testing by other aesni users would be appreciated.

perf of large (16k message) TLS sends, sg vs. no sg:

no-sg
 33287255597 cycles
 53702871176 instructions
 43.47% _crypt_by_4
 17.83% memcpy
 16.36% aes_loop_par_enc_done

sg
 27568944591 cycles
 54580446678 instructions
 49.87% _crypt_by_4
 17.40% aes_loop_par_enc_done
  1.79% aes_loop_initial_5416
  1.52% aes_loop_initial_4974
  1.27% gcmaes_encrypt_sg.constprop.15

Dave Watson (14):
  x86/crypto: aesni: Merge INITIAL_BLOCKS_ENC/DEC
  x86/crypto: aesni: Macro-ify func save/restore
  x86/crypto: aesni: Add GCM_INIT macro
  x86/crypto: aesni: Add GCM_COMPLETE macro
  x86/crypto: aesni: Merge encode and decode to GCM_ENC_DEC macro
  x86/crypto: aesni: Introduce gcm_context_data
  x86/crypto: aesni: Split AAD hash calculation to separate macro
  x86/crypto: aesni: Fill in new context data structures
  x86/crypto: aesni: Move ghash_mul to GCM_COMPLETE
  x86/crypto: aesni: Move HashKey computation from stack to gcm_context
  x86/crypto: aesni: Introduce partial block macro
  x86/crypto: aesni: Add fast path for > 16 byte update
  x86/crypto: aesni: Introduce scatter/gather asm function stubs
  x86/crypto: aesni: Update aesni-intel_glue to use scatter/gather

 arch/x86/crypto/aesni-intel_asm.S  | 1414 ++--
 arch/x86/crypto/aesni-intel_glue.c |  263 ++-
 2 files changed, 932 insertions(+), 745 deletions(-)

-- 
2.9.5
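The property the series is after — a sequence of gcm_enc_update calls over sglist buffers producing the same stream as one monolithic call — can be modeled with a toy positional keystream. This is an XOR/sha256 stand-in, NOT AES-GCM; the point is only that the context must carry the absolute stream position (the role InLen plays) so split points don't matter.

```python
import hashlib

def toy_ctx(key, iv):
    """Minimal stand-in for the per-operation context."""
    return {"key": key, "iv": iv, "offset": 0}

def toy_enc_update(ctx, buf):
    """Encrypt one scatter/gather buffer; chaining over buffers must equal
    one big call because the keystream depends only on absolute position."""
    start = ctx["offset"]
    ctx["offset"] += len(buf)
    def ks(i):  # keystream byte at absolute position i
        return hashlib.sha256(ctx["key"] + ctx["iv"] +
                              i.to_bytes(8, "big")).digest()[0]
    return bytes(b ^ ks(start + j) for j, b in enumerate(buf))
```

Under this model, any partition of the input into buffers concatenates to the same ciphertext, which is exactly the invariant the gcm_context/PARTIAL_BLOCK patches maintain for the real cipher.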
[PATCH 04/14] x86/crypto: aesni: Add GCM_COMPLETE macro
Merge encode and decode tag calculations in GCM_COMPLETE macro. Scatter/gather routines will call this once at the end of encryption or decryption. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 172 ++ 1 file changed, 63 insertions(+), 109 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index b9fe2ab..529c542 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -222,6 +222,67 @@ ALL_F: .octa 0x mov %r13, %r12 .endm +# GCM_COMPLETE Finishes update of tag of last partial block +# Output: Authorization Tag (AUTH_TAG) +# Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15 +.macro GCM_COMPLETE + mov arg8, %r12# %r13 = aadLen (number of bytes) + shl $3, %r12 # convert into number of bits + movd%r12d, %xmm15 # len(A) in %xmm15 + shl $3, %arg4 # len(C) in bits (*128) + MOVQ_R64_XMM%arg4, %xmm1 + pslldq $8, %xmm15# %xmm15 = len(A)||0x + pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C) + pxor%xmm15, %xmm8 + GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 + # final GHASH computation + movdqa SHUF_MASK(%rip), %xmm10 + PSHUFB_XMM %xmm10, %xmm8 + + mov %arg5, %rax # %rax = *Y0 + movdqu (%rax), %xmm0 # %xmm0 = Y0 + ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1 # E(K, Y0) + pxor%xmm8, %xmm0 +_return_T_\@: + mov arg9, %r10 # %r10 = authTag + mov arg10, %r11# %r11 = auth_tag_len + cmp $16, %r11 + je _T_16_\@ + cmp $8, %r11 + jl _T_4_\@ +_T_8_\@: + MOVQ_R64_XMM%xmm0, %rax + mov %rax, (%r10) + add $8, %r10 + sub $8, %r11 + psrldq $8, %xmm0 + cmp $0, %r11 + je _return_T_done_\@ +_T_4_\@: + movd%xmm0, %eax + mov %eax, (%r10) + add $4, %r10 + sub $4, %r11 + psrldq $4, %xmm0 + cmp $0, %r11 + je _return_T_done_\@ +_T_123_\@: + movd%xmm0, %eax + cmp $2, %r11 + jl _T_1_\@ + mov %ax, (%r10) + cmp $2, %r11 + je _return_T_done_\@ + add $2, %r10 + sar $16, %eax +_T_1_\@: + mov %al, (%r10) + jmp _return_T_done_\@ +_T_16_\@: + movdqu %xmm0, (%r10) +_return_T_done_\@: +.endm + 
#ifdef __x86_64__ /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) * @@ -1271,61 +1332,7 @@ _less_than_8_bytes_left_decrypt: sub $1, %r13 jne _less_than_8_bytes_left_decrypt _multiple_of_16_bytes_decrypt: - mov arg8, %r12# %r13 = aadLen (number of bytes) - shl $3, %r12 # convert into number of bits - movd%r12d, %xmm15 # len(A) in %xmm15 - shl $3, %arg4 # len(C) in bits (*128) - MOVQ_R64_XMM%arg4, %xmm1 - pslldq $8, %xmm15# %xmm15 = len(A)||0x - pxor%xmm1, %xmm15 # %xmm15 = len(A)||len(C) - pxor%xmm15, %xmm8 - GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6 -# final GHASH computation -movdqa SHUF_MASK(%rip), %xmm10 - PSHUFB_XMM %xmm10, %xmm8 - - mov %arg5, %rax # %rax = *Y0 - movdqu (%rax), %xmm0 # %xmm0 = Y0 - ENCRYPT_SINGLE_BLOCK%xmm0, %xmm1 # E(K, Y0) - pxor%xmm8, %xmm0 -_return_T_decrypt: - mov arg9, %r10# %r10 = authTag - mov arg10, %r11 # %r11 = auth_tag_len - cmp $16, %r11 - je _T_16_decrypt - cmp $8, %r11 - jl _T_4_decrypt -_T_8_decrypt: - MOVQ_R64_XMM%xmm0, %rax - mov %rax, (%r10) - add $8, %r10 - sub $8, %r11 - psrldq $8, %xmm0 - cmp $0, %r11 - je _return_T_done_decrypt -_T_4_decrypt: - movd%xmm0, %eax - mov %eax, (%r10) - add $4, %r10 - sub $4, %r11 - psrldq $4, %xmm0 - cmp $0, %r11 - je _return_T_done_decrypt -_T_123_decrypt: - movd%xmm0, %eax - cmp $2, %r11 - jl _T_1_decrypt - mov %ax, (%r10) - cmp $2, %r11 - je _return_T_done_decrypt - add $2, %r10 - sar $16, %eax -_T_1_decrypt: - mov %al, (%r10) - jmp _return_T_done_decrypt -_T_16_decrypt: - movdqu %xmm0, (%r10) -_return_T_done_decrypt: + GCM_COMPLETE FUNC_RESTORE ret E
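The len(A)||len(C) block GCM_COMPLETE folds into the hash — aadLen and the ciphertext length converted to bit counts and packed into one 16-byte block — can be written out as below. This is the GCM-spec big-endian view; the asm builds the same value in the byte-reflected domain via pslldq/pxor.

```python
def length_block(aad_len_bytes, ct_len_bytes):
    """64-bit bit-length of the AAD followed by the 64-bit bit-length of
    the ciphertext, as one 16-byte GHASH input block."""
    a_bits = aad_len_bytes * 8       # shl $3, %r12  (len(A) in bits)
    c_bits = ct_len_bytes * 8        # shl $3        (len(C) in bits)
    return ((a_bits << 64) | c_bits).to_bytes(16, "big")
```

This block is XORed into the running hash, multiplied by the hash key once more, and then encrypted with E(K, Y0) to form the tag — which is why the lengths must be totals for the whole operation, not per update call.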
[PATCH 03/14] x86/crypto: aesni: Add GCM_INIT macro
Reduce code duplication by introducing GCM_INIT macro. This macro will also be exposed as a function for implementing scatter/gather support, since INIT only needs to be called once for the full operation. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 84 +++ 1 file changed, 33 insertions(+), 51 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 39b42b1..b9fe2ab 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -191,6 +191,37 @@ ALL_F: .octa 0x pop %r12 .endm + +# GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding. +# Clobbers rax, r10-r13 and xmm0-xmm6, %xmm13 +.macro GCM_INIT + mov %arg6, %r12 + movdqu (%r12), %xmm13 + movdqa SHUF_MASK(%rip), %xmm2 + PSHUFB_XMM %xmm2, %xmm13 + + # precompute HashKey<<1 mod poly from the HashKey (required for GHASH) + + movdqa %xmm13, %xmm2 + psllq $1, %xmm13 + psrlq $63, %xmm2 + movdqa %xmm2, %xmm1 + pslldq $8, %xmm2 + psrldq $8, %xmm1 + por %xmm2, %xmm13 + + # reduce HashKey<<1 + + pshufd $0x24, %xmm1, %xmm2 + pcmpeqd TWOONE(%rip), %xmm2 + pandPOLY(%rip), %xmm2 + pxor%xmm2, %xmm13 + movdqa %xmm13, HashKey(%rsp) + mov %arg4, %r13 # %xmm13 holds HashKey<<1 (mod poly) + and $-16, %r13 + mov %r13, %r12 +.endm + #ifdef __x86_64__ /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) * @@ -1151,36 +1182,11 @@ _esb_loop_\@: */ ENTRY(aesni_gcm_dec) FUNC_SAVE - mov %arg6, %r12 - movdqu (%r12), %xmm13# %xmm13 = HashKey -movdqa SHUF_MASK(%rip), %xmm2 - PSHUFB_XMM %xmm2, %xmm13 - - -# Precompute HashKey<<1 (mod poly) from the hash key (required for GHASH) - - movdqa %xmm13, %xmm2 - psllq $1, %xmm13 - psrlq $63, %xmm2 - movdqa %xmm2, %xmm1 - pslldq $8, %xmm2 - psrldq $8, %xmm1 - por %xmm2, %xmm13 - -# Reduction - - pshufd $0x24, %xmm1, %xmm2 - pcmpeqd TWOONE(%rip), %xmm2 - pandPOLY(%rip), %xmm2 - pxor%xmm2, %xmm13 # %xmm13 holds the HashKey<<1 (mod poly) + GCM_INIT # 
Decrypt first few blocks - movdqa %xmm13, HashKey(%rsp) # store HashKey<<1 (mod poly) - mov %arg4, %r13# save the number of bytes of plaintext/ciphertext - and $-16, %r13 # %r13 = %r13 - (%r13 mod 16) - mov %r13, %r12 and $(3<<4), %r12 jz _initial_num_blocks_is_0_decrypt cmp $(2<<4), %r12 @@ -1402,32 +1408,8 @@ ENDPROC(aesni_gcm_dec) ***/ ENTRY(aesni_gcm_enc) FUNC_SAVE - mov %arg6, %r12 - movdqu (%r12), %xmm13 -movdqa SHUF_MASK(%rip), %xmm2 - PSHUFB_XMM %xmm2, %xmm13 - -# precompute HashKey<<1 mod poly from the HashKey (required for GHASH) - - movdqa %xmm13, %xmm2 - psllq $1, %xmm13 - psrlq $63, %xmm2 - movdqa %xmm2, %xmm1 - pslldq $8, %xmm2 - psrldq $8, %xmm1 - por %xmm2, %xmm13 - -# reduce HashKey<<1 - - pshufd $0x24, %xmm1, %xmm2 - pcmpeqd TWOONE(%rip), %xmm2 - pandPOLY(%rip), %xmm2 - pxor%xmm2, %xmm13 - movdqa %xmm13, HashKey(%rsp) - mov %arg4, %r13# %xmm13 holds HashKey<<1 (mod poly) - and $-16, %r13 - mov %r13, %r12 + GCM_INIT # Encrypt first few blocks and $(3<<4), %r12 -- 2.9.5
[PATCH 01/14] x86/crypto: aesni: Merge INITIAL_BLOCKS_ENC/DEC
Use macro operations to merge implementations of INITIAL_BLOCKS, since they differ by only a small handful of lines. Use macro counter \@ to simplify implementation. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 298 ++ 1 file changed, 48 insertions(+), 250 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 76d8cd4..48911fe 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -275,234 +275,7 @@ _done_read_partial_block_\@: */ -.macro INITIAL_BLOCKS_DEC num_initial_blocks TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \ -XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation -MOVADQ SHUF_MASK(%rip), %xmm14 - movarg7, %r10 # %r10 = AAD - movarg8, %r11 # %r11 = aadLen - pxor %xmm\i, %xmm\i - pxor \XMM2, \XMM2 - - cmp$16, %r11 - jl _get_AAD_rest\num_initial_blocks\operation -_get_AAD_blocks\num_initial_blocks\operation: - movdqu (%r10), %xmm\i - PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data - pxor %xmm\i, \XMM2 - GHASH_MUL \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 - add$16, %r10 - sub$16, %r11 - cmp$16, %r11 - jge_get_AAD_blocks\num_initial_blocks\operation - - movdqu \XMM2, %xmm\i - - /* read the last <16B of AAD */ -_get_AAD_rest\num_initial_blocks\operation: - cmp$0, %r11 - je _get_AAD_done\num_initial_blocks\operation - - READ_PARTIAL_BLOCK %r10, %r11, \TMP1, %xmm\i - PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data - pxor \XMM2, %xmm\i - GHASH_MUL %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 - -_get_AAD_done\num_initial_blocks\operation: - xor%r11, %r11 # initialise the data pointer offset as zero - # start AES for num_initial_blocks blocks - - mov%arg5, %rax # %rax = *Y0 - movdqu (%rax), \XMM0# XMM0 = Y0 - PSHUFB_XMM %xmm14, \XMM0 - -.if (\i == 5) || (\i == 6) || (\i == 7) - MOVADQ ONE(%RIP),\TMP1 - MOVADQ (%arg1),\TMP2 -.irpc index, \i_seq - paddd \TMP1, \XMM0 # INCR Y0 - movdqa \XMM0, %xmm\index - PSHUFB_XMM %xmm14, %xmm\index # 
perform a 16 byte swap - pxor \TMP2, %xmm\index -.endr - lea 0x10(%arg1),%r10 - mov keysize,%eax - shr $2,%eax # 128->4, 192->6, 256->8 - add $5,%eax # 128->9, 192->11, 256->13 - -aes_loop_initial_dec\num_initial_blocks: - MOVADQ (%r10),\TMP1 -.irpc index, \i_seq - AESENC \TMP1, %xmm\index -.endr - add $16,%r10 - sub $1,%eax - jnz aes_loop_initial_dec\num_initial_blocks - - MOVADQ (%r10), \TMP1 -.irpc index, \i_seq - AESENCLAST \TMP1, %xmm\index # Last Round -.endr -.irpc index, \i_seq - movdqu (%arg3 , %r11, 1), \TMP1 - pxor \TMP1, %xmm\index - movdqu %xmm\index, (%arg2 , %r11, 1) - # write back plaintext/ciphertext for num_initial_blocks - add$16, %r11 - - movdqa \TMP1, %xmm\index - PSHUFB_XMM %xmm14, %xmm\index -# prepare plaintext/ciphertext for GHASH computation -.endr -.endif - -# apply GHASH on num_initial_blocks blocks - -.if \i == 5 -pxor %xmm5, %xmm6 - GHASH_MUL %xmm6, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 -pxor %xmm6, %xmm7 - GHASH_MUL %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 -pxor %xmm7, %xmm8 - GHASH_MUL %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 -.elseif \i == 6 -pxor %xmm6, %xmm7 - GHASH_MUL %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 -pxor %xmm7, %xmm8 - GHASH_MUL %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 -.elseif \i == 7 -pxor %xmm7, %xmm8 - GHASH_MUL %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1 -.endif - cmp$64, %r13 - jl _initial_blocks_done\num_initial_blocks\operation - # no need for precomputed values -/* -* -* Precomputations for HashKey parallel with encryption of first 4 blocks. -* Haskey_i_k holds XORed values of the low and high parts of the Haskey_i -*/ - MOVADQ ONE(%rip), \TMP1 - paddd \TMP1, \XMM0 # INCR Y0 - MOVADQ \XMM0, \XMM1 - PSHUFB_XMM %xmm14, \XMM1# perform a 16 byte swap - - paddd \TMP1, \XMM0 # INCR Y0 - MOVADQ \XMM0, \XMM2 - PSHUFB_XMM %xmm14, \XMM2# perform a 16 byte swap - - paddd \TMP1, \XMM0 # INCR Y0 -
[PATCH 02/14] x86/crypto: aesni: Macro-ify func save/restore
Macro-ify function save and restore. These will be used in new functions added for scatter/gather update operations. Signed-off-by: Dave Watson <davejwat...@fb.com> --- arch/x86/crypto/aesni-intel_asm.S | 53 ++- 1 file changed, 24 insertions(+), 29 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 48911fe..39b42b1 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -170,6 +170,26 @@ ALL_F: .octa 0x #define TKEYP T1 #endif +.macro FUNC_SAVE + push%r12 + push%r13 + push%r14 + mov %rsp, %r14 +# +# states of %xmm registers %xmm6:%xmm15 not saved +# all %xmm registers are clobbered +# + sub $VARIABLE_OFFSET, %rsp + and $~63, %rsp +.endm + + +.macro FUNC_RESTORE + mov %r14, %rsp + pop %r14 + pop %r13 + pop %r12 +.endm #ifdef __x86_64__ /* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) @@ -1130,16 +1150,7 @@ _esb_loop_\@: * */ ENTRY(aesni_gcm_dec) - push%r12 - push%r13 - push%r14 - mov %rsp, %r14 -/* -* states of %xmm registers %xmm6:%xmm15 not saved -* all %xmm registers are clobbered -*/ - sub $VARIABLE_OFFSET, %rsp - and $~63, %rsp# align rsp to 64 bytes + FUNC_SAVE mov %arg6, %r12 movdqu (%r12), %xmm13# %xmm13 = HashKey movdqa SHUF_MASK(%rip), %xmm2 @@ -1309,10 +1320,7 @@ _T_1_decrypt: _T_16_decrypt: movdqu %xmm0, (%r10) _return_T_done_decrypt: - mov %r14, %rsp - pop %r14 - pop %r13 - pop %r12 + FUNC_RESTORE ret ENDPROC(aesni_gcm_dec) @@ -1393,22 +1401,12 @@ ENDPROC(aesni_gcm_dec) * poly = x^128 + x^127 + x^126 + x^121 + 1 ***/ ENTRY(aesni_gcm_enc) - push%r12 - push%r13 - push%r14 - mov %rsp, %r14 -# -# states of %xmm registers %xmm6:%xmm15 not saved -# all %xmm registers are clobbered -# - sub $VARIABLE_OFFSET, %rsp - and $~63, %rsp + FUNC_SAVE mov %arg6, %r12 movdqu (%r12), %xmm13 movdqa SHUF_MASK(%rip), %xmm2 PSHUFB_XMM %xmm2, %xmm13 - # precompute HashKey<<1 mod poly from the HashKey (required for GHASH) movdqa %xmm13, %xmm2 @@ -1576,10 +1574,7 @@ 
_T_1_encrypt: _T_16_encrypt: movdqu %xmm0, (%r10) _return_T_done_encrypt: - mov %r14, %rsp - pop %r14 - pop %r13 - pop %r12 + FUNC_RESTORE ret ENDPROC(aesni_gcm_enc) -- 2.9.5
Re: [PATCH v4] membarrier: expedited private command
On 07/28/17 04:40 PM, Mathieu Desnoyers wrote: > Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built > from all runqueues for which current thread's mm is the same as the > thread calling sys_membarrier. It executes faster than the non-expedited > variant (no blocking). It also works on NOHZ_FULL configurations. I tested this with our hazard pointer use case on x86_64, and it seems to work great. We don't currently have any uses needing SHARED. Tested-by: Dave Watson <davejwat...@fb.com> Thanks! https://github.com/facebook/folly/blob/master/folly/experimental/hazptr/hazptr-impl.h#L555 https://github.com/facebook/folly/blob/master/folly/experimental/AsymmetricMemoryBarrier.cpp#L86
Re: Updated sys_membarrier() speedup patch, FYI
Hi Paul, Thanks for looking at this again! On 07/27/17 11:12 AM, Paul E. McKenney wrote: > Hello! > > But my main question is whether the throttling shown below is acceptable > for your use cases, namely only one expedited sys_membarrier() permitted > per scheduling-clock period (1 millisecond on many platforms), with any > excess being silently converted to non-expedited form. The reason for > the throttling is concerns about DoS attacks based on user code with a > tight loop invoking this system call. We've been using sys_membarrier for the last year or so in a handful of places with no issues. Recently we made it an option in our hazard pointers implementation, giving us something with performance between hazard pointers and RCU: https://github.com/facebook/folly/blob/master/folly/experimental/hazptr/hazptr-impl.h#L555 Currently hazard pointers tries to free retired memory on the same thread that did the retire(), so the latency spiked for workloads that were retire() heavy. For the moment we dropped back to using mprotect hacks. I've tested Mathieu's v4 patch; it works great. We currently don't have any cases where we need SHARED. I also tested the rate-limited version; while better than the current non-EXPEDITED SHARED version, we still hit the slow path pretty often. I agree with other commenters that returning an error code instead of silently slowing down might be better. > + case MEMBARRIER_CMD_SHARED_EXPEDITED: > + if (num_online_cpus() > 1) { > + static unsigned long lastexp; > + unsigned long j; > + > + j = jiffies; > + if (READ_ONCE(lastexp) == j) { > + synchronize_sched(); > + WRITE_ONCE(lastexp, j); It looks like this update of lastexp should be in the other branch? > + } else { > + synchronize_sched_expedited(); > + } > + } > + return 0;
Re: [RFC PATCH v8 1/9] Restartable sequences system call
On 08/19/16 02:24 PM, Josh Triplett wrote: > On Fri, Aug 19, 2016 at 01:56:11PM -0700, Andi Kleen wrote: > > > Nobody gets a cpu number just to get a cpu number - it's not a useful > > > thing to benchmark. What does getcpu() so much that we care? > > > > malloc is the primary target I believe. Saves lots of memory to keep > > caches per CPU rather than per thread. > > Also improves locality; that does seem like a good idea. Has anyone > written and tested the corresponding changes to a malloc implementation? > I had modified jemalloc to use rseq instead of per-thread caches, and did some testing on one of our services. Memory usage decreased by ~20% due to fewer caches. Our services generally have lots and lots of idle threads (~400), and we already go through a few hoops to try and flush idle thread caches. Threads are often coming from dependent libraries written by disparate teams, making them harder to reduce to a smaller number. We also have quite a few data structures that are sharded thread-locally only to avoid contention; for example, we have extensive statistics code that would also be a prime candidate for rseq. We often have to prune some stats because they're taking up too much memory; rseq would let us fit a bit more in. jemalloc diff here (pretty stale now): https://github.com/djwatson/jemalloc/commit/51f6e6f61b88eee8de981f0f2d52bc48f85e0d01 Original numbers posted here: https://lkml.org/lkml/2015/10/22/588
Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
Would pairing one rseq_start with two rseq_finish do the trick there ? >>> >>> Yes, two rseq_finish works, as long as the extra rseq management overhead >>> is not substantial. >> >> I've added a commit implementing rseq_finish2() in my rseq volatile >> dev branch. You can fetch it at: >> >> https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback >> >> I also have a separate test and benchmark tree in addition to the >> kernel selftests here: >> >> https://github.com/compudj/rseq-test >> >> I named the first write a "speculative" write, and the second write >> the "final" write. >> >> Would you like to extend the test cases to cover your intended use-case ? >> > >Hi Dave! > >I just pushed a rseq_finish2() test in my rseq-fallback branch. It implements >a per-cpu buffer holding pointers, and pushes/pops items to/from it. > >To use it: > >cd tools/testing/selftests/rseq >./param_test -T b > >(see -h for advanced usage) > >Let me know if I got it right! Hi Mathieu, Thanks, you beat me to it. I commented on the github, that's pretty much it. > In the kernel, if rather than testing for: > > if ((void __user *)instruction_pointer(regs) < post_commit_ip) { > > we could test for both start_ip and post_commit_ip: > > if ((void __user *)instruction_pointer(regs) < post_commit_ip > && (void __user *)instruction_pointer(regs) >= start_ip) { > > We could perform the failure path (storing NULL into the rseq_cs > field of struct rseq) in C rather than being required to do it in > assembly at addresses >= to post_commit_ip, all because the kernel > would test whether we are within the assembly block address range > using both the lower and upper bounds (start_ip and post_commit_ip). Sounds reasonable to me. I agree it would be best to move the failure path out of the asm if possible.
Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
>> +static inline __attribute__((always_inline)) >> +bool rseq_finish(struct rseq_lock *rlock, >> + intptr_t *p, intptr_t to_write, >> + struct rseq_state start_value) >> This ABI looks like it will work fine for our use case. I don't think it >> has been mentioned yet, but we may still need multiple asm blocks >> for differing numbers of writes. For example, an array-based freelist push: >> void push(void *obj) { >> if (index < maxlen) { >> freelist[index++] = obj; >> } >> } >> would be more efficiently implemented with a two-write rseq_finish: >> rseq_finish2(&freelist[index], obj, // first write >> &index, index + 1, // second write >> ...); > Would pairing one rseq_start with two rseq_finish do the trick > there ? Yes, two rseq_finish works, as long as the extra rseq management overhead is not substantial.
Re: [RFC PATCH 2/2] Crypto kernel tls socket
On 11/23/15 02:27 PM, Sowmini Varadhan wrote: > On (11/23/15 09:43), Dave Watson wrote: > > Currently gcm(aes) represents ~80% of our SSL connections. > > > > Userspace interface: > > > > 1) A transform and op socket are created using the userspace crypto > > interface > > 2) Setsockopt ALG_SET_AUTHSIZE is called > > 3) Setsockopt ALG_SET_KEY is called twice, since we need both send/recv keys > > 4) ALG_SET_IV cmsgs are sent twice, since we need both send/recv IVs. > >To support userspace heartbeats, changeciphersuite, etc, we would also > > need > >to get these back out, use them, then reset them via CMSG. > > 5) ALG_SET_OP cmsg is overloaded to mean FD to read/write from. > > [from patch 0/2:] > > If a non application-data TLS record is seen, it is left on the TCP > > socket and an error is returned on the ALG socket, and the record is > > left for userspace to manage. > > I'm trying to see how your approach would fit with the RDS-type of > use-case. RDS-TCP is mostly similar in concept to kcm, > except that rds has its own header for multiplexing, and has no > dependency on BPF for basic things like re-assembling the datagram. > If I were to try to use this for RDS-TCP, the tls_tcp_read_sock() logic > would be merged into the recv_actor callback for RDS, right? Thus a tls > control-plane message could be seen in the middle of the > data-stream, so we really have to freeze the processing of the data > stream till the control-plane message is processed? Correct. > In the tls.c example that you have, the opfd is generated from > the accept() on the AF_ALG socket- how would this work if I wanted > my opfd to be a PF_RDS or a PF_KCM or similar? For kcm, opfd is the fd you would pass along in kcm_attach. For rds, it looks like you'd want to use opfd as the sock instead of the new one created by sock_create_kern in rds_tcp_conn_connect. > One concern is that this patchset provides a solution for the "80%" > case but what about the other 20% (and the non x86 platforms)? 
Almost all the rest are aes sha. The actual encrypt / decrypt code would be similar to this previous patch: http://marc.info/?l=linux-kernel&m=140662647602192&w=2 The software routines in gcm(aes) should work for all platforms without aesni. > E.g., if I get a cipher-suite request outside the aes-ni, what would > happen (punt to uspace?) > > --Sowmini Right, bind() would fail and you would fall back to uspace. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 1/2] Crypto support aesni rfc5288
Support rfc5288 using intel aesni routines. See also rfc5246. AAD length is 13 bytes padded out to 16. Padding bytes currently have to be passed in via the scatterlist, which probably isn't quite the right fix. The assoclen checks were moved to the individual rfc stubs, and the common routines support all assoc lengths. --- arch/x86/crypto/aesni-intel_asm.S | 6 ++ arch/x86/crypto/aesni-intel_avx-x86_64.S | 4 ++ arch/x86/crypto/aesni-intel_glue.c | 105 +++ 3 files changed, 88 insertions(+), 27 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 6bd2c6c..49667c4 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -228,6 +228,9 @@ XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation MOVADQ SHUF_MASK(%rip), %xmm14 mov arg7, %r10 # %r10 = AAD mov arg8, %r12 # %r12 = aadLen + add $3, %r12 + and $~3, %r12 + mov %r12, %r11 pxor %xmm\i, %xmm\i @@ -453,6 +456,9 @@ XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation MOVADQ SHUF_MASK(%rip), %xmm14 mov arg7, %r10 # %r10 = AAD mov arg8, %r12 # %r12 = aadLen + add $3, %r12 + and $~3, %r12 + mov %r12, %r11 pxor %xmm\i, %xmm\i _get_AAD_loop\num_initial_blocks\operation: diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 522ab68..0756e4a 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -360,6 +360,8 @@ VARIABLE_OFFSET = 16*8 mov arg6, %r10 # r10 = AAD mov arg7, %r12 # r12 = aadLen +add $3, %r12 +and $~3, %r12 mov %r12, %r11 @@ -1619,6 +1621,8 @@ ENDPROC(aesni_gcm_dec_avx_gen2) mov arg6, %r10 # r10 = AAD mov arg7, %r12 # r12 = aadLen +add $3, %r12 +and $~3, %r12 mov %r12, %r11 diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index 3633ad6..00a42ca 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -949,12 +949,7 @@ static int helper_rfc4106_encrypt(struct aead_request *req) struct 
scatter_walk src_sg_walk; struct scatter_walk dst_sg_walk; unsigned int i; - - /* Assuming we are supporting rfc4106 64-bit extended */ - /* sequence numbers We need to have the AAD length equal */ - /* to 16 or 20 bytes */ - if (unlikely(req->assoclen != 16 && req->assoclen != 20)) - return -EINVAL; + unsigned int padded_assoclen = (req->assoclen + 3) & ~3; /* IV below built */ for (i = 0; i < 4; i++) @@ -970,21 +965,21 @@ static int helper_rfc4106_encrypt(struct aead_request *req) one_entry_in_sg = 1; scatterwalk_start(&src_sg_walk, req->src); assoc = scatterwalk_map(&src_sg_walk); - src = assoc + req->assoclen; + src = assoc + padded_assoclen; dst = src; if (unlikely(req->src != req->dst)) { scatterwalk_start(&dst_sg_walk, req->dst); - dst = scatterwalk_map(&dst_sg_walk) + req->assoclen; + dst = scatterwalk_map(&dst_sg_walk) + padded_assoclen; } } else { /* Allocate memory for src, dst, assoc */ - assoc = kmalloc(req->cryptlen + auth_tag_len + req->assoclen, + assoc = kmalloc(req->cryptlen + auth_tag_len + padded_assoclen, GFP_ATOMIC); if (unlikely(!assoc)) return -ENOMEM; scatterwalk_map_and_copy(assoc, req->src, 0, - req->assoclen + req->cryptlen, 0); - src = assoc + req->assoclen; + padded_assoclen + req->cryptlen, 0); + src = assoc + padded_assoclen; dst = src; } @@ -998,7 +993,7 @@ static int helper_rfc4106_encrypt(struct aead_request *req) * back to the packet. */ if (one_entry_in_sg) { if (unlikely(req->src != req->dst)) { - scatterwalk_unmap(dst - req->assoclen); + scatterwalk_unmap(dst - padded_assoclen); scatterwalk_advance(&dst_sg_walk, req->dst->length); scatterwalk_done(&dst_sg_walk, 1, 0); } @@ -1006,7 +1001,7 @@ static int helper_rfc4106_encrypt(struct aead_request *req) scatterwalk_advance(&src_sg_walk, req->src->length); scatterwalk_done(&src_sg_walk,
[RFC PATCH 2/2] Crypto kernel tls socket
Userspace crypto interface for TLS. Currently supports gcm(aes) 128-bit only; however, the interface is the same as the rest of the SOCK_ALG interface, so it should be possible to add more without any user interface changes. Currently gcm(aes) represents ~80% of our SSL connections. Userspace interface: 1) A transform and op socket are created using the userspace crypto interface 2) Setsockopt ALG_SET_AUTHSIZE is called 3) Setsockopt ALG_SET_KEY is called twice, since we need both send/recv keys 4) ALG_SET_IV cmsgs are sent twice, since we need both send/recv IVs. To support userspace heartbeats, changeciphersuite, etc, we would also need to get these back out, use them, then reset them via CMSG. 5) ALG_SET_OP cmsg is overloaded to mean FD to read/write from. Example program: https://github.com/djwatson/ktls At a high level, this could be implemented on TCP sockets directly instead with various tradeoffs. The userspace crypto interface might benefit from some interface tweaking to deal with multiple keys / ivs better. The crypto accept() op socket interface isn't a great fit, since there are never multiple parallel operations. There are also some questions around using skbuffs instead of scatterlists for send/recv, and, if we are buffering on recv, when we should be decrypting the data. --- crypto/Kconfig | 12 + crypto/Makefile|1 + crypto/algif_tls.c | 1233 3 files changed, 1246 insertions(+) create mode 100644 crypto/algif_tls.c diff --git a/crypto/Kconfig b/crypto/Kconfig index 7240821..c15638a 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -1639,6 +1639,18 @@ config CRYPTO_USER_API_AEAD This option enables the user-spaces interface for AEAD cipher algorithms. +config CRYPTO_USER_API_TLS + tristate "User-space interface for TLS net sockets" + depends on NET + select CRYPTO_AEAD + select CRYPTO_USER_API + help + This option enables kernel TLS socket framing + cipher algorithms. TLS framing is added/removed and + chained to a TCP socket. 
Handshake is done in + userspace. + + config CRYPTO_HASH_INFO bool diff --git a/crypto/Makefile b/crypto/Makefile index f7aba92..fc26012 100644 --- a/crypto/Makefile +++ b/crypto/Makefile @@ -121,6 +121,7 @@ obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o +obj-$(CONFIG_CRYPTO_USER_API_TLS) += algif_tls.o # # generic algorithms and the async_tx api diff --git a/crypto/algif_tls.c b/crypto/algif_tls.c new file mode 100644 index 000..123ade3 --- /dev/null +++ b/crypto/algif_tls.c @@ -0,0 +1,1233 @@ +/* + * algif_tls: User-space interface for TLS + * + * Copyright (C) 2015, Dave Watson + * + * This file provides the user-space API for AEAD ciphers. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define TLS_HEADER_SIZE 13 +#define TLS_TAG_SIZE 16 +#define TLS_IV_SIZE 8 +#define TLS_PADDED_AADLEN 16 +#define TLS_MAX_MESSAGE_LEN (1 << 14) + +/* Bytes not included in tls msg size field */ +#define TLS_FRAMING_SIZE 5 + +#define TLS_APPLICATION_DATA_MSG 0x17 +#define TLS_VERSION 3 + +struct tls_tfm_pair { + struct crypto_aead *tfm_send; + struct crypto_aead *tfm_recv; + int cur_setkey; +}; + +static struct workqueue_struct *tls_wq; + +struct tls_sg_list { + unsigned int cur; + struct scatterlist sg[ALG_MAX_PAGES]; +}; + +#define RSGL_MAX_ENTRIES ALG_MAX_PAGES + +struct tls_ctx { + /* Send and encrypted transmit buffers */ + struct tls_sg_list tsgl; + struct scatterlist tcsgl[ALG_MAX_PAGES]; + + /* Encrypted receive and receive buffers. 
*/ + struct tls_sg_list rcsgl; + struct af_alg_sgl rsgl[RSGL_MAX_ENTRIES]; + + /* Sequence numbers. */ + int iv_set; + void *iv_send; + void *iv_recv; + + struct af_alg_completion completion; + + /* Bytes to send */ + unsigned long used; + + /* padded */ + size_t aead_assoclen; + /* unpadded */ + size_t assoclen; + struct aead_request aead_req; + struct aead_request aead_resp; + + bool more; + bool merge; + + /* Chained TCP socket */ + struct sock *sock; + struct socket *socket; + + void (*save_data_ready)(struct sock *sk); + void (*save_write_space)(struct sock *sk); + void (*save_state_change)(struct sock *sk); + struct work_struct tx_work
[RFC PATCH 0/2] Crypto kernel TLS socket
An approach for a kernel TLS socket. Only the symmetric encryption / decryption is done in-kernel, as well as minimal framing handling. The handshake is kept in userspace, and the negotiated cipher / keys / IVs are then set on the algif_tls socket, which is then hooked in to a tcp socket using sk_write_space/sk_data_ready hooks. If a non application-data TLS record is seen, it is left on the TCP socket and an error is returned on the ALG socket, and the record is left for userspace to manage. Userspace can't ignore the message, but could just close the socket. TLS could potentially also be done directly on the TCP socket, but seemed a bit harder to work with the OOB data for non application_data messages, and the sockopts / CMSGS already exist for ALG sockets. The flip side is having to manage two fds in userspace. Some reasons we're looking at this: 1) Access to sendfile/splice for CDN-type applications. We were inspired by Netflix exploring this in FreeBSD https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf For perf, this patch is almost on par with userspace OpenSSL. Currently there are some copies and allocs to support scatter/gather in aesni-intel_glue.c, but with some extra work to remove those (not included here), a sendfile() is faster than the equivalent userspace read/SSL_write using a 128k buffer by 2~7%. 2) Access to the unencrypted bytes in kernelspace. For example, Tom Herbert's kcm would need this https://lwn.net/Articles/657999/ 3) NIC offload. To support running aesni routines on the NIC instead of the processor, we would probably need enough of the framing interface put in kernel. 
Dave Watson (2): Crypto support aesni rfc5288 Crypto kernel tls socket arch/x86/crypto/aesni-intel_asm.S|6 + arch/x86/crypto/aesni-intel_avx-x86_64.S |4 + arch/x86/crypto/aesni-intel_glue.c | 105 ++- crypto/Kconfig | 12 + crypto/Makefile |1 + crypto/algif_tls.c | 1233 ++ 6 files changed, 1334 insertions(+), 27 deletions(-) create mode 100644 crypto/algif_tls.c -- 2.4.6
Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
On 10/27/15 04:56 PM, Paul Turner wrote: > This series is a new approach which introduces an alternate ABI that does not > depend on open-coded assembly nor a central 'repository' of rseq sequences. > Sequences may now be inlined and the preparatory[*] work for the sequence can > be written in a higher level language. Very nice, it's definitely much easier to use. > Exactly, for x86_64 this looks like: > movq , rcx [1] > movq $1f, [2] > cmpq , [3] (start is in rcx) > jnz (4) > movq , () (5) > 1: movq $0, > > There has been some related discussion, which I am supportive of, in which > we use fs/gs instead of TLS. This maps naturally to the above and removes > the current requirement for per-thread initialization (this is a good thing!). > > On debugger interactions: > > There are some nice properties about this new style of API which allow it to > actually support safe interactions with a debugger: > a) The event counter is a per-cpu value. This means that we cannot advance > it if no threads from the same process execute on that cpu. This > naturally allows basic single step support with thread-isolation. I think this means multiple processes would no longer be able to use per-cpu variables in shared memory, since they would no longer restart with respect to each other? > b) Single-step can be augmented to evaluate the ABI without incrementing the > event count. > c) A debugger can also be augmented to evaluate this ABI and push restarts > on the kernel's behalf. > > This is also compatible with David's approach of not single stepping between > 2-4 above. However, I think these are ultimately a little stronger since true > single-stepping and breakpoint support would be available. Which would be > nice to allow actual debugging of sequences. 