Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
Hi Martin,

On Tue, Nov 20, 2018 at 5:29 PM Martin Willi wrote:
> Thanks for the offer, no need at this time. But I certainly would
> welcome it if you could do some (Wireguard) benching with that code
> to see if it works for you.

I certainly will test it in a few different network circumstances,
especially since real testing like this is sometimes more telling than
busy-loop benchmarks.

> > Actually, similarly here, a 10nm Cannon Lake machine should be
> > arriving at my house this week, which should make for some
> > interesting testing ground for non-throttled zmm, if you'd like to
> > play with it.
>
> Maybe in a future iteration, thanks. In fact it would be interesting
> to know if Cannon Lake handles that throttling better.

Everything I've read on the Internet seems to indicate that's the case,
so one of the first things I'll be doing is seeing if that's true.
There are also the AVX512 IFMA instructions to play with!

Jason
[PATCH 3/3] crypto: x86/chacha20 - Add a 4-block AVX-512VL variant
This version uses the same principle as the AVX2 version by scheduling the
operations for two block pairs in parallel. It benefits from the AVX-512VL
rotate instructions and the more efficient partial block handling using
"vmovdqu8", resulting in a ~20% speedup of the raw block function.

Signed-off-by: Martin Willi
---
 arch/x86/crypto/chacha20-avx512vl-x86_64.S | 272 ++++++++++++++++++++++
 arch/x86/crypto/chacha20_glue.c            |   7 +
 2 files changed, 279 insertions(+)

diff --git a/arch/x86/crypto/chacha20-avx512vl-x86_64.S b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
index 261097578715..55d34de29e3e 100644
--- a/arch/x86/crypto/chacha20-avx512vl-x86_64.S
+++ b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
@@ -12,6 +12,11 @@
 CTR2BL:	.octa 0x00000000000000000000000000000000
 	.octa 0x00000000000000000000000000000001
 
+.section	.rodata.cst32.CTR4BL, "aM", @progbits, 32
+.align 32
+CTR4BL:	.octa 0x00000000000000000000000000000002
+	.octa 0x00000000000000000000000000000003
+
 .section	.rodata.cst32.CTR8BL, "aM", @progbits, 32
 .align 32
 CTR8BL:	.octa 0x00000003000000020000000100000000
@@ -185,6 +190,273 @@ ENTRY(chacha20_2block_xor_avx512vl)
 ENDPROC(chacha20_2block_xor_avx512vl)
 
+ENTRY(chacha20_4block_xor_avx512vl)
+	# %rdi: Input state matrix, s
+	# %rsi: up to 4 data blocks output, o
+	# %rdx: up to 4 data blocks input, i
+	# %rcx: input/output length in bytes
+
+	# This function encrypts four ChaCha20 blocks by loading the state
+	# matrix four times across eight AVX registers. It performs matrix
+	# operations on four words in two matrices in parallel, sequentially
+	# to the operations on the four words of the other two matrices. As
+	# the required word shuffling has a rather high latency, we can do
+	# the arithmetic on two matrix-pairs without much slowdown.
+
+	vzeroupper
+
+	# x0..3[0-4] = s0..3
+	vbroadcasti128	0x00(%rdi),%ymm0
+	vbroadcasti128	0x10(%rdi),%ymm1
+	vbroadcasti128	0x20(%rdi),%ymm2
+	vbroadcasti128	0x30(%rdi),%ymm3
+
+	vmovdqa		%ymm0,%ymm4
+	vmovdqa		%ymm1,%ymm5
+	vmovdqa		%ymm2,%ymm6
+	vmovdqa		%ymm3,%ymm7
+
+	vpaddd		CTR2BL(%rip),%ymm3,%ymm3
+	vpaddd		CTR4BL(%rip),%ymm7,%ymm7
+
+	vmovdqa		%ymm0,%ymm11
+	vmovdqa		%ymm1,%ymm12
+	vmovdqa		%ymm2,%ymm13
+	vmovdqa		%ymm3,%ymm14
+	vmovdqa		%ymm7,%ymm15
+
+	mov		$10,%rax
+
+.Ldoubleround4:
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$16,%ymm3,%ymm3
+
+	vpaddd		%ymm5,%ymm4,%ymm4
+	vpxord		%ymm4,%ymm7,%ymm7
+	vprold		$16,%ymm7,%ymm7
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$12,%ymm1,%ymm1
+
+	vpaddd		%ymm7,%ymm6,%ymm6
+	vpxord		%ymm6,%ymm5,%ymm5
+	vprold		$12,%ymm5,%ymm5
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$8,%ymm3,%ymm3
+
+	vpaddd		%ymm5,%ymm4,%ymm4
+	vpxord		%ymm4,%ymm7,%ymm7
+	vprold		$8,%ymm7,%ymm7
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$7,%ymm1,%ymm1
+
+	vpaddd		%ymm7,%ymm6,%ymm6
+	vpxord		%ymm6,%ymm5,%ymm5
+	vprold		$7,%ymm5,%ymm5
+
+	# x1 = shuffle32(x1, MASK(0, 3, 2, 1))
+	vpshufd		$0x39,%ymm1,%ymm1
+	vpshufd		$0x39,%ymm5,%ymm5
+	# x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+	vpshufd		$0x4e,%ymm2,%ymm2
+	vpshufd		$0x4e,%ymm6,%ymm6
+	# x3 = shuffle32(x3, MASK(2, 1, 0, 3))
+	vpshufd		$0x93,%ymm3,%ymm3
+	vpshufd		$0x93,%ymm7,%ymm7
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$16,%ymm3,%ymm3
+
+	vpaddd		%ymm5,%ymm4,%ymm4
+	vpxord		%ymm4,%ymm7,%ymm7
+	vprold		$16,%ymm7,%ymm7
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$12,%ymm1,%ymm1
+
+	vpaddd		%ymm7,%ymm6,%ymm6
+	vpxord		%ymm6,%ymm5,%ymm5
+	vprold		$12,%ymm5,%ymm5
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$8,%ymm3,%ymm3
+
+	vpaddd
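For reference, and not part of Martin's patch: the interleaved vpaddd/vpxord/vprold
groups above are the standard ChaCha20 quarter-round from RFC 7539, applied to two
state-matrix pairs at once. A minimal C sketch of the per-word operation each ymm
lane performs:

#include <stdint.h>

static inline uint32_t rotl32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));	/* what vprold does per dword */
}

/* One ChaCha20 quarter-round on four state words. */
static void chacha20_quarterround(uint32_t x[16], int a, int b, int c, int d)
{
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);	/* vpaddd/vpxord/vprold $16 */
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);	/* vpaddd/vpxord/vprold $12 */
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);	/* vpaddd/vpxord/vprold $8  */
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);	/* vpaddd/vpxord/vprold $7  */
}

/* One double round: a column round followed by a diagonal round. */
static void chacha20_doubleround(uint32_t x[16])
{
	chacha20_quarterround(x, 0, 4,  8, 12);
	chacha20_quarterround(x, 1, 5,  9, 13);
	chacha20_quarterround(x, 2, 6, 10, 14);
	chacha20_quarterround(x, 3, 7, 11, 15);
	chacha20_quarterround(x, 0, 5, 10, 15);
	chacha20_quarterround(x, 1, 6, 11, 12);
	chacha20_quarterround(x, 2, 7,  8, 13);
	chacha20_quarterround(x, 3, 4,  9, 14);
}

The vpshufd $0x39/$0x4e/$0x93 shuffles in the assembly are the row rotations that
turn the column round into the diagonal round (and back), which is why the same
quarter-round instruction sequence can be reused for both halves of the double round.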
[PATCH 0/3] crypto: x86/chacha20 - AVX-512VL block functions
In the quest for pushing the limits of chacha20 encryption for both IPsec
and Wireguard, this small series adds AVX-512VL block functions. The VL
variant works on 256-bit ymm registers, but unlike AVX2 it can benefit
from the new AVX-512 instructions.

Compared to the AVX2 version, these block functions bring an overall speed
improvement of ~20% across encryption lengths.

Below are the tcrypt results for additional block sizes in kOps/s, for the
current AVX2 code path, the new AVX-512VL code path, and, for comparison,
Zinc in AVX2 and AVX-512VL. All numbers are from a Xeon Platinum 8168
(2.7GHz). These numbers result in a very nice chart, available at:
  https://download.strongswan.org/misc/chacha-avx-512vl.svg

                      zinc   zinc
  len   avx2  512vl   avx2  512vl
    8   5719   5672   5468   5612
   16   5675   5627   5355   5621
   24   5687   5601   5322   5633
   32   5667   5622   5244   5564
   40   5603   5582   5337   5578
   48   5638   5539   5400   5556
   56   5624   5566   5375   5482
   64   5590   5573   5352   5531
   72   4841   5467   3365   3457
   80   5316   5761   3310   3381
   88   4798   5470   3239   3343
   96   5324   5723   3197   3281
  104   4819   5460   3155   3232
  112   5266   5749   3020   3195
  120   4776   5391   2959   3145
  128   5291   5723   3398   3489
  136   4122   4837   3321   3423
  144   4507   5057   3247   3389
  152   4139   4815   3233   3329
  160   4482   5043   3159   3256
  168   4142   4766   3131   3224
  176   4506   5028   3073   3162
  184   4119   4772   3010   3109
  192   4499   5016   3402   3502
  200   4127   4766   3329   3448
  208   4452   5012   3276   3371
  216   4128   4744   3243   3334
  224   4484   5008   3203   3298
  232   4103   4772   3141   3237
  240   4458   4963   3115   3217
  248   4121   4751   3085   3177
  256   4461   4987   3364   4046
  264   3406   4282   3270   4006
  272   3408   4287   3207   3961
  280   3371   4271   3203   3825
  288   3625   4301   3129   3751
  296   3402   4283   3093   3688
  304   3401   4247   3062   3637
  312   3382   4282   2995   3614
  320   3611   4279   3305   4070
  328   3386   4260   3276   3968
  336   3369   4288   3171   3929
  344   3389   4289   3134   3847
  352   3609   4266   3127   3720
  360   3355   4252   3076   3692
  368   3387   4264   3048   3650
  376   3387   4238   2967   3553
  384   3568   4265   3277   4035
  392   3369   4262   3299   3973
  400   3362   4235   3239   3899
  408   3352   4269   3196   3843
  416   3585   4243   3127   3736
  424   3364   4216   3092   3672
  432   3341   4246   3067   3628
  440   3353   4235   3018   3593
  448   3538   4245   3327   4035
  456   3322   4244   3275   3900
  464   3340   4237   3212   3880
  472   3330   4242   3054   3802
  480   3530   4234   3078   3707
  488   3337   4228   3094   3664
  496   3330   4223   3015   3591
  504   3317   4214   3002   3517
  512   3531   4197   3339   4016
  520   2511   3101   2030   2682
  528   2627   3087   2027   2641
  536   2508   3102   2001   2601
  544   2638   3090   1964   2564
  552   2494   3077   1962   2516
  560   2625   3064   1941   2515
  568   2500   3086   1922   2493
  576   2611   3074   2050   2689
  584   2482   3062   2041   2680
  592   2595   3074   2026   2644
  600   2470   3060   1985   2595
  608   2581   3039   1961   2555
  616   2478   3062   1956   2521
  624   2587   3066   1930   2493
  632   2457   3053   1923   2486
  640   2581   3050   2059   2712
  648   2296   2839   2024   2655
  656   2389   2845   2019   2642
  664   2292   2842   2002   2610
  672   2404   2838   1959   2537
  680   2273   2827   1956   2527
  688   2389   2840   1938   2510
  696   2280   2837   1911   2463
  704   2370   2819   2055   2702
  712   2277   2834   2029   2663
  720   2369   2829   2020   2625
  728   2255   2820   2001   2600
  736   2373   2819   1958   2543
  744   2269   2827   1956   2524
  752   2364   2817   1937   2492
  760   2270   2805   1909   2483
  768   2378   2820   2050   2696
  776   2053   2700   2002   2643
  784   2066   2693   1922   2640
  792   2065   2703   1928   2602
  800   2138   2706   1962   2535
  808   2065   2679   1938   2528
  816   2063   2699   1929   2500
  824   2053   2676   1915   2468
  832   2149   2692   2036   2693
  840   2055   2689   2024   2659
  848   2049   2689   2006   2610
  856   2057   2702   1979   2585
  864   2144   2703   1960   2547
  872   2047   2685   1945   2501
  880   2055   2683   1902   2497
  888   2060   2689   1897   2478
  896   2139   2693   2023   2663
  904   2049   2686   1970   2644
  912   2055   2688   1925   2621
  920   2047   2685   1911   2572
  928   2114   2695   1907   2545
  936   2055   2681   1927   2492
  944   2055   2693   1930   2478
  952   2042   2688   1909   2471
  960   2136   2682   2014   2672
  968   2054   2687   1999   2626
  976   2040   2682   1982   2598
  984   2055   2687   1943   2569
  992   2138   2694   1884   2522
 1000   2036   2681   1929   2506
 1008   2052   2676   1926   2475
 1016   2050   2686   1889   2430
 1024   2125   2670   2039   2656
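For readers interpreting the table: tcrypt reports kOps/s at a fixed request
length, so throughput follows directly from ops times length. A small
illustrative conversion (the helper below is mine, not part of tcrypt):

#include <stdio.h>

/* Convert a tcrypt figure in kOps/s at a given request length to MB/s.
 * The helper name is mine; it is only meant to make the table readable. */
static double kops_to_mbps(double kops, unsigned int len)
{
	return kops * 1000.0 * len / 1e6;
}

int main(void)
{
	/* len 1024 row above: 2125 kOps/s (AVX2) vs 2670 kOps/s (512VL) */
	printf("avx2:  %.0f MB/s\n", kops_to_mbps(2125, 1024));	/* ~2176 */
	printf("512vl: %.0f MB/s\n", kops_to_mbps(2670, 1024));	/* ~2734 */
	return 0;
}

At 1024 bytes this works out to roughly 2.7 GB/s for the AVX-512VL path versus
about 2.2 GB/s for AVX2, i.e. a ~26% gain at that particular length.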
[PATCH 2/3] crypto: x86/chacha20 - Add a 2-block AVX-512VL variant
This version uses the same principle as the AVX2 version. It benefits from
the AVX-512VL rotate instructions and the more efficient partial block
handling using "vmovdqu8", resulting in a speedup of ~20%.

Unlike the AVX2 version, it is also faster than the single-block SSSE3
version at processing a single block. Hence we use this function for
(partial) single-block lengths as well.

Signed-off-by: Martin Willi
---
 arch/x86/crypto/chacha20-avx512vl-x86_64.S | 171 +++++++++++++++++
 arch/x86/crypto/chacha20_glue.c            |   7 +
 2 files changed, 178 insertions(+)

diff --git a/arch/x86/crypto/chacha20-avx512vl-x86_64.S b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
index e1877afcaa73..261097578715 100644
--- a/arch/x86/crypto/chacha20-avx512vl-x86_64.S
+++ b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
@@ -7,6 +7,11 @@
 
 #include <linux/linkage.h>
 
+.section	.rodata.cst32.CTR2BL, "aM", @progbits, 32
+.align 32
+CTR2BL:	.octa 0x00000000000000000000000000000000
+	.octa 0x00000000000000000000000000000001
+
 .section	.rodata.cst32.CTR8BL, "aM", @progbits, 32
 .align 32
 CTR8BL:	.octa 0x00000003000000020000000100000000
@@ -14,6 +19,172 @@ CTR8BL:	.octa 0x00000003000000020000000100000000
 	.octa 0x00000007000000060000000500000004
 
 .text
 
+ENTRY(chacha20_2block_xor_avx512vl)
+	# %rdi: Input state matrix, s
+	# %rsi: up to 2 data blocks output, o
+	# %rdx: up to 2 data blocks input, i
+	# %rcx: input/output length in bytes
+
+	# This function encrypts two ChaCha20 blocks by loading the state
+	# matrix twice across four AVX registers. It performs matrix operations
+	# on four words in each matrix in parallel, but requires shuffling to
+	# rearrange the words after each round.
+
+	vzeroupper
+
+	# x0..3[0-2] = s0..3
+	vbroadcasti128	0x00(%rdi),%ymm0
+	vbroadcasti128	0x10(%rdi),%ymm1
+	vbroadcasti128	0x20(%rdi),%ymm2
+	vbroadcasti128	0x30(%rdi),%ymm3
+
+	vpaddd		CTR2BL(%rip),%ymm3,%ymm3
+
+	vmovdqa		%ymm0,%ymm8
+	vmovdqa		%ymm1,%ymm9
+	vmovdqa		%ymm2,%ymm10
+	vmovdqa		%ymm3,%ymm11
+
+	mov		$10,%rax
+
+.Ldoubleround:
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$16,%ymm3,%ymm3
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$12,%ymm1,%ymm1
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$8,%ymm3,%ymm3
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$7,%ymm1,%ymm1
+
+	# x1 = shuffle32(x1, MASK(0, 3, 2, 1))
+	vpshufd		$0x39,%ymm1,%ymm1
+	# x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+	vpshufd		$0x4e,%ymm2,%ymm2
+	# x3 = shuffle32(x3, MASK(2, 1, 0, 3))
+	vpshufd		$0x93,%ymm3,%ymm3
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$16,%ymm3,%ymm3
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$12,%ymm1,%ymm1
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$8,%ymm3,%ymm3
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$7,%ymm1,%ymm1
+
+	# x1 = shuffle32(x1, MASK(2, 1, 0, 3))
+	vpshufd		$0x93,%ymm1,%ymm1
+	# x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+	vpshufd		$0x4e,%ymm2,%ymm2
+	# x3 = shuffle32(x3, MASK(0, 3, 2, 1))
+	vpshufd		$0x39,%ymm3,%ymm3
+
+	dec		%rax
+	jnz		.Ldoubleround
+
+	# o0 = i0 ^ (x0 + s0)
+	vpaddd		%ymm8,%ymm0,%ymm7
+	cmp		$0x10,%rcx
+	jl		.Lxorpart2
+	vpxord		0x00(%rdx),%xmm7,%xmm6
+	vmovdqu		%xmm6,0x00(%rsi)
+	vextracti128	$1,%ymm7,%xmm0
+	# o1 = i1 ^ (x1 + s1)
+	vpaddd		%ymm9,%ymm1,%ymm7
+	cmp		$0x20,%rcx
+	jl		.Lxorpart2
+	vpxord		0x10(%rdx),%xmm7,%xmm6
+	vmovdqu		%xmm6,0x10(%rsi)
+	vextracti128	$1,%ymm7,%xmm1
+	# o2 = i2 ^ (x2 + s2)
+	vpaddd		%ymm10,%ymm2,%ymm7
+	cmp		$0x30,%rcx
+	jl		.Lxorpart2
+	vpxord		0x20(%rdx),%xmm7,%xmm6
+	vmovdqu		%xmm6,0x20(%rsi)
+	vextracti128	$1,%ymm7,%xmm2
+	# o3 = i3 ^ (x3 +
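The "vmovdqu8" partial block handling mentioned in the changelog works by turning
the remaining length into a byte mask (via kmov) and doing masked loads and stores,
rather than staging the tail through a stack buffer. The .Lxorpart2 path itself is
not shown in full above; as an assumption about what it computes, a byte-level C
model would look like this:

#include <stddef.h>
#include <stdint.h>

/* Byte-level model of the masked tail XOR: with AVX-512BW/VL the remaining
 * length is turned into a byte mask (kmov) and used for masked loads and
 * stores (vmovdqu8), instead of staging the tail through a stack buffer as
 * the AVX2 code does with "rep movsb". Model only, not the kernel code. */
static void chacha20_xor_tail(uint8_t *dst, const uint8_t *src,
			      const uint8_t *keystream, size_t len)
{
	/* Conceptually: mask = (1 << len) - 1 selects the live bytes of the
	 * keystream register; here the same effect, one byte at a time. */
	for (size_t i = 0; i < len; i++)
		dst[i] = src[i] ^ keystream[i];
}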
[PATCH 1/3] crypto: x86/chacha20 - Add a 8-block AVX-512VL variant
This variant is similar to the AVX2 version, but benefits from the AVX-512
rotate instructions and the additional registers, so it can operate without
any data on the stack. It uses ymm registers only to avoid the massive core
throttling on Skylake-X platforms. Nonetheless it brings a ~30% speed
improvement compared to the AVX2 variant for random encryption lengths.

The AVX2 version uses "rep movsb" for partial block XORing via the stack.
With AVX-512, the new "vmovdqu8" can do this much more efficiently. The
associated "kmov" instructions needed to work with dynamic masks are not
part of the AVX-512VL instruction set, hence we depend on AVX-512BW as
well. Given that the major AVX-512VL architectures provide AVX-512BW and
this extension does not affect core clocking, this does not seem to be a
problem, at least for now.

Signed-off-by: Martin Willi
---
 arch/x86/crypto/Makefile                   |   5 +
 arch/x86/crypto/chacha20-avx512vl-x86_64.S | 396 ++++++++++++++++++++++
 arch/x86/crypto/chacha20_glue.c            |  26 ++
 3 files changed, 427 insertions(+)
 create mode 100644 arch/x86/crypto/chacha20-avx512vl-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a4b0007a54e1..ce4e43642984 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -8,6 +8,7 @@ OBJECT_FILES_NON_STANDARD := y
 
 avx_supported := $(call as-instr,vpxor %xmm0$(comma)%xmm0$(comma)%xmm0,yes,no)
 avx2_supported := $(call as-instr,vpgatherdd %ymm0$(comma)(%eax$(comma)%ymm1\
 				$(comma)4)$(comma)%ymm2,yes,no)
+avx512_supported :=$(call as-instr,vpmovm2b %k1$(comma)%zmm5,yes,no)
 sha1_ni_supported :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,yes,no)
 sha256_ni_supported :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,yes,no)
 
@@ -103,6 +104,10 @@ ifeq ($(avx2_supported),yes)
 	morus1280-avx2-y := morus1280-avx2-asm.o morus1280-avx2-glue.o
 endif
 
+ifeq ($(avx512_supported),yes)
+	chacha20-x86_64-y += chacha20-avx512vl-x86_64.o
+endif
+
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
 aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
diff --git a/arch/x86/crypto/chacha20-avx512vl-x86_64.S b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
new file mode 100644
index 000000000000..e1877afcaa73
--- /dev/null
+++ b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
@@ -0,0 +1,396 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, x64 AVX-512VL functions
+ *
+ * Copyright (C) 2018 Martin Willi
+ */
+
+#include <linux/linkage.h>
+
+.section	.rodata.cst32.CTR8BL, "aM", @progbits, 32
+.align 32
+CTR8BL:	.octa 0x00000003000000020000000100000000
+	.octa 0x00000007000000060000000500000004
+
+.text
+
+ENTRY(chacha20_8block_xor_avx512vl)
+	# %rdi: Input state matrix, s
+	# %rsi: up to 8 data blocks output, o
+	# %rdx: up to 8 data blocks input, i
+	# %rcx: input/output length in bytes
+
+	# This function encrypts eight consecutive ChaCha20 blocks by loading
+	# the state matrix in AVX registers eight times. Compared to AVX2, this
+	# mostly benefits from the new rotate instructions in VL and the
+	# additional registers.
+
+	vzeroupper
+
+	# x0..15[0-7] = s[0..15]
+	vpbroadcastd	0x00(%rdi),%ymm0
+	vpbroadcastd	0x04(%rdi),%ymm1
+	vpbroadcastd	0x08(%rdi),%ymm2
+	vpbroadcastd	0x0c(%rdi),%ymm3
+	vpbroadcastd	0x10(%rdi),%ymm4
+	vpbroadcastd	0x14(%rdi),%ymm5
+	vpbroadcastd	0x18(%rdi),%ymm6
+	vpbroadcastd	0x1c(%rdi),%ymm7
+	vpbroadcastd	0x20(%rdi),%ymm8
+	vpbroadcastd	0x24(%rdi),%ymm9
+	vpbroadcastd	0x28(%rdi),%ymm10
+	vpbroadcastd	0x2c(%rdi),%ymm11
+	vpbroadcastd	0x30(%rdi),%ymm12
+	vpbroadcastd	0x34(%rdi),%ymm13
+	vpbroadcastd	0x38(%rdi),%ymm14
+	vpbroadcastd	0x3c(%rdi),%ymm15
+
+	# x12 += counter values 0-3
+	vpaddd		CTR8BL(%rip),%ymm12,%ymm12
+
+	vmovdqa64	%ymm0,%ymm16
+	vmovdqa64	%ymm1,%ymm17
+	vmovdqa64	%ymm2,%ymm18
+	vmovdqa64	%ymm3,%ymm19
+	vmovdqa64	%ymm4,%ymm20
+	vmovdqa64	%ymm5,%ymm21
+	vmovdqa64	%ymm6,%ymm22
+	vmovdqa64	%ymm7,%ymm23
+	vmovdqa64	%ymm8,%ymm24
+	vmovdqa64	%ymm9,%ymm25
+	vmovdqa64	%ymm10,%ymm26
+	vmovdqa64	%ymm11,%ymm27
+	vmovdqa64	%ymm12,%ymm28
+	vmovdqa64	%ymm13,%ymm29
+	vmovdqa64	%ymm14,%ymm30
+	vmovdqa64	%ymm15,%ymm31
+
+	mov		$10,%eax
+
+.Ldoubleround8:
+	# x0 += x4, x12 = rotl32(x12 ^ x0, 16)
+	vpaddd		%ymm0,%ymm4,%ymm0
+	vpxord		%ymm0,%ymm12,%ymm12
+	vprold		$16,%ymm12,%ymm12
+	# x1 += x5, x13 = rotl32(x13 ^ x1, 16)
+
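The chacha20_glue.c hunk of this patch is not reproduced above. As an assumption
based on the existing SSSE3/AVX2 glue conventions (names and details may well
differ from the actual patch), the dispatch it adds looks roughly like this:

#include <linux/linkage.h>
#include <linux/types.h>
#include <asm/cpufeature.h>
#include <crypto/chacha20.h>

asmlinkage void chacha20_8block_xor_avx512vl(u32 *state, u8 *dst,
					     const u8 *src, unsigned int len);

static bool chacha20_use_avx512vl;

/* Hypothetical setup hook: the ymm code only needs AVX-512VL, but the
 * masked tail handling uses kmov, which requires AVX-512BW as well. */
static int __init chacha20_simd_avx512_setup(void)
{
	if (boot_cpu_has(X86_FEATURE_AVX512VL) &&
	    boot_cpu_has(X86_FEATURE_AVX512BW))
		chacha20_use_avx512vl = true;
	return 0;
}

static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
			    unsigned int bytes)
{
	if (chacha20_use_avx512vl) {
		while (bytes >= CHACHA20_BLOCK_SIZE * 8) {
			chacha20_8block_xor_avx512vl(state, dst, src,
						     CHACHA20_BLOCK_SIZE * 8);
			bytes -= CHACHA20_BLOCK_SIZE * 8;
			src += CHACHA20_BLOCK_SIZE * 8;
			dst += CHACHA20_BLOCK_SIZE * 8;
			state[12] += 8;		/* advance block counter */
		}
	}
	/* ... fall through to the existing AVX2/SSSE3 block functions ... */
}

The X86_FEATURE_AVX512VL/X86_FEATURE_AVX512BW check mirrors the changelog: the
block function itself only needs VL, while the dynamic-mask tail path relies on BW.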
Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
Hi Jason,

> [...] I have a massive Xeon Gold 5120 machine that I can give you
> access to if you'd like to do some testing and benching.

Thanks for the offer, no need at this time. But I certainly would
welcome it if you could do some (Wireguard) benching with that code
to see if it works for you.

> Actually, similarly here, a 10nm Cannon Lake machine should be
> arriving at my house this week, which should make for some
> interesting testing ground for non-throttled zmm, if you'd like to
> play with it.

Maybe in a future iteration, thanks. In fact it would be interesting
to know if Cannon Lake handles that throttling better.

Regards
Martin