Re: [PATCH v3] x86, crypto: ported aes-ni implementation to x86
On 13.11.2010, 00:25 Mathias Krause wrote:
> On 12.11.2010, 08:34 Huang Ying wrote:
>> On Fri, 2010-11-12 at 15:30 +0800, Mathias Krause wrote:
>>> On 12.11.2010, 01:33 Huang Ying wrote:
>>>> Why is the improvement of ECB so small? I cannot understand it. It
>>>> should be as big as CBC.
>>>
>>> I don't know why the ECB variant is so slow compared to the other
>>> variants. But that is the case even for the current x86-64 version.
>>> See the above values for x86-64 (old). I set up dm-crypt for this
>>> test like this:
>>>
>>> # cryptsetup -c aes-ecb-plain -d /dev/urandom create cfs /dev/loop0
>>>
>>> What were the numbers you measured in your tests while developing
>>> the x86-64 version?
>>
>> Can't remember the numbers. Do you have interest in digging into the
>> issue?
>
> I looked at /proc/crypto while doing the tests again and noticed that
> ECB isn't handled using cryptd, while all other modes, e.g. CBC and
> CTR, are. The reason for that seems to be that for ECB, and only for
> ECB, the kernel is using the synchronous block algorithm instead of
> the asynchronous one. So the question is: Why is the ECB variant
> handled using the synchronous cipher -- because of the missing IV
> handling in this mode?

Herbert, any idea why this is the case?

Regards,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] x86, crypto: ported aes-ni implementation to x86
On 12.11.2010, 08:34 Huang Ying wrote:
> On Fri, 2010-11-12 at 15:30 +0800, Mathias Krause wrote:
>> On 12.11.2010, 01:33 Huang Ying wrote:
>>> Why is the improvement of ECB so small? I cannot understand it. It
>>> should be as big as CBC.
>>
>> I don't know why the ECB variant is so slow compared to the other
>> variants. But that is the case even for the current x86-64 version.
>> See the above values for x86-64 (old). I set up dm-crypt for this
>> test like this:
>>
>> # cryptsetup -c aes-ecb-plain -d /dev/urandom create cfs /dev/loop0
>>
>> What were the numbers you measured in your tests while developing
>> the x86-64 version?
>
> Can't remember the numbers. Do you have interest in digging into the
> issue?

I looked at /proc/crypto while doing the tests again and noticed that
ECB isn't handled using cryptd, while all other modes, e.g. CBC and
CTR, are. The reason for that seems to be that for ECB, and only for
ECB, the kernel is using the synchronous block algorithm instead of
the asynchronous one. So the question is: Why is the ECB variant
handled using the synchronous cipher -- because of the missing IV
handling in this mode?

Best regards,
Mathias
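For anyone who wants to repeat the /proc/crypto check described above, here is a minimal sketch in Python. It assumes the usual /proc/crypto layout of "key : value" lines with a blank line between entries. The helper names (parse_proc_crypto, implementations_of) and the SAMPLE text are made up for illustration; the sample is not output captured from the test machine, and which fields appear for a given entry may depend on the kernel version.

```python
def parse_proc_crypto(text):
    """Split /proc/crypto-style text into one dict per registered algorithm."""
    entries, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current entry.
            if current:
                entries.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


def implementations_of(entries, name):
    """Return (driver, type, async) for every entry registered under `name`.

    Entries without an "async" field are reported as "n/a".
    """
    return [
        (e.get("driver"), e.get("type"), e.get("async", "n/a"))
        for e in entries
        if e.get("name") == name
    ]


# Hypothetical sample, only to exercise the parser.
SAMPLE = """\
name         : ecb(aes)
driver       : ecb-sync-example
type         : blkcipher

name         : cbc(aes)
driver       : cbc-async-example
type         : ablkcipher
async        : yes
"""

if __name__ == "__main__":
    try:
        with open("/proc/crypto") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the illustrative sample
    for mode in ("ecb(aes)", "cbc(aes)", "ctr(aes)"):
        print(mode, implementations_of(parse_proc_crypto(text), mode))
```

Comparing the "type" and "async" columns for ecb(aes) against cbc(aes) is one way to see whether a mode is being served by a synchronous or an asynchronous implementation.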
Re: [PATCH v3] x86, crypto: ported aes-ni implementation to x86
Hello Huang Ying,

On 03.11.2010, 23:27 Huang Ying wrote:
> On Wed, 2010-11-03 at 14:14 -0700, Mathias Krause wrote:
>> The AES-NI instructions are also available in legacy mode so the
>> 32-bit architecture may profit from those, too.
>>
>> To illustrate the performance gain here's a short summary of the
>> tcrypt speed test on a Core i7 M620 running at 2.67GHz comparing
>> both assembler implementations:
>>
>> x86:                            i586       aes-ni    delta
>> 256 bit, 8kB blocks, ECB:  125.94 MB/s  187.09 MB/s   +48.6%
>
> Which method did you use for speed testing?
>
> modprobe tcrypt mode=200 sec=?
>
> That actually does not work very well for AES-NI, because the AES-NI
> blkcipher is tested in synchronous mode, and in that mode
> kernel_fpu_begin/end() must be called for every block, and
> kernel_fpu_begin/end() is quite slow. At the same time, some further
> optimizations for AES-NI (such as the ecb-aes-aesni driver) cannot be
> tested in that mode, because they are only available in asynchronous
> mode.
>
> When developing AES-NI for x86_64, I used dm-crypt + AES-NI for speed
> testing, where the AES-NI blkcipher is tested in asynchronous mode,
> and kernel_fpu_begin/end() is called for every page. Can you use that
> to test?
>
> Or you can add test_acipher_speed (similar to test_ahash_speed) to
> test the cipher in asynchronous mode.

Here are the numbers for dm-crypt. I ran the test again on the Core i7
M620, 2.67GHz. During the test I noticed that not porting the CBC
variant to x86 was a bad idea, so I did that too and got pretty nice
numbers (see v3 vs. v4 of the patch).

All tests were run five times in a row using a 256 bit key and doing
I/O to the block device in chunks of 1MB. The numbers are MB/s.

x86 (i586 variant):
      1. run  2. run  3. run  4. run  5. run    mean
ECB:    93.9    93.9    94.0    93.5    93.8    93.8
CBC:    84.9    84.8    84.9    84.9    84.8    84.8
XTS:   108.2   108.3   109.6   108.3   108.9   108.6
LRW:   105.0   105.0   105.1   105.1   105.1   105.0

x86 (AES-NI), v3 of the patch:
      1. run  2. run  3. run  4. run  5. run    mean
ECB:   124.8   120.8   124.5   120.6   124.5   123.0
CBC:   112.6   109.6   112.6   110.7   109.4   110.9
XTS:   221.6   221.1   220.9   223.5   224.4   222.3
LRW:   206.2   209.7   207.4   203.7   209.3   207.2

x86 (AES-NI), v4 of the patch:
      1. run  2. run  3. run  4. run  5. run    mean
ECB:   122.5   121.2   121.6   125.7   125.5   123.3
CBC:   259.5   259.2   261.2   264.0   267.6   262.3
XTS:   225.1   230.7   220.6   217.9   216.3   222.1
LRW:   202.7   202.8   210.6   208.9   202.7   205.5

Comparing the values for the CBC variant between v3 and v4 of the patch
shows that porting the CBC variant to x86 more than doubled the
performance, so the slightly ugly #ifdef'ed code is worth the effort.

x86-64 (old):
      1. run  2. run  3. run  4. run  5. run    mean
ECB:   121.4   120.9   121.1   121.2   120.9   121.1
CBC:   282.5   286.3   281.5   282.0   294.5   285.3
XTS:   263.6   260.3   263.0   267.0   264.6   263.7
LRW:   249.6   249.8   250.5   253.4   252.2   251.1

x86-64 (new):
      1. run  2. run  3. run  4. run  5. run    mean
ECB:   122.1   122.0   122.0   127.0   121.9   123.0
CBC:   291.2   286.2   295.6   291.4   289.9   290.8
XTS:   263.3   264.4   264.5   264.2   270.4   265.3
LRW:   254.9   252.3   253.6   258.2   257.5   255.3

Comparing the mean values gives us:

x86:       i586  aes-ni    delta
ECB:       93.8   123.3   +31.4%
CBC:       84.8   262.3  +209.3%
XTS:      108.6   222.1  +104.5%
LRW:      105.0   205.5   +95.7%

x86-64:     old     new    delta
ECB:      121.1   123.0    +1.5%
CBC:      285.3   290.8    +1.9%
XTS:      263.7   265.3    +0.6%
LRW:      251.1   255.3    +1.7%

The improvement for the old vs. the new x86-64 version is not as
drastic as for the synchronous variant (see the tcrypt tests in the
previous email), but it is nevertheless an improvement. The improvement
for the x86 case, however, should be noticeable. It's almost as fast as
the x86-64 version.

I'll post the new version of the patch in a follow-up email.

Regards,
Mathias
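The delta column can be rechecked from the mean values with a few lines of Python. This is only a sketch: the numbers are copied from the mean columns of the per-run tables in this mail (mode labels follow those tables), and the names MEANS, delta_percent and format_report are made up for illustration. Because the means are already rounded, a recomputed delta may differ from the posted one in the last digit.

```python
# Mean dm-crypt throughput in MB/s, copied from the per-run tables.
# Each entry is (baseline, new): i586 vs. AES-NI (v4) on x86,
# old vs. new implementation on x86-64.
MEANS = {
    "x86": {
        "ECB": (93.8, 123.3),
        "CBC": (84.8, 262.3),
        "XTS": (108.6, 222.1),
        "LRW": (105.0, 205.5),
    },
    "x86-64": {
        "ECB": (121.1, 123.0),
        "CBC": (285.3, 290.8),
        "XTS": (263.7, 265.3),
        "LRW": (251.1, 255.3),
    },
}


def delta_percent(baseline, new):
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def format_report(means):
    """Return one text line per (architecture, mode) pair."""
    lines = []
    for arch, modes in means.items():
        for mode, (baseline, new) in modes.items():
            lines.append(
                f"{arch:7s} {mode}: {baseline:6.1f} -> {new:6.1f} MB/s "
                f"({delta_percent(baseline, new):+.1f}%)"
            )
    return lines


if __name__ == "__main__":
    print("\n".join(format_report(MEANS)))
```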
Re: [PATCH v3] x86, crypto: ported aes-ni implementation to x86
Hi, Mathias,

On Fri, 2010-11-12 at 06:18 +0800, Mathias Krause wrote:
> All tests were run five times in a row using a 256 bit key and doing
> I/O to the block device in chunks of 1MB. The numbers are MB/s.
>
> x86 (i586 variant):
>       1. run  2. run  3. run  4. run  5. run    mean
> ECB:    93.9    93.9    94.0    93.5    93.8    93.8
> CBC:    84.9    84.8    84.9    84.9    84.8    84.8
> XTS:   108.2   108.3   109.6   108.3   108.9   108.6
> LRW:   105.0   105.0   105.1   105.1   105.1   105.0
>
> x86 (AES-NI), v3 of the patch:
>       1. run  2. run  3. run  4. run  5. run    mean
> ECB:   124.8   120.8   124.5   120.6   124.5   123.0
> CBC:   112.6   109.6   112.6   110.7   109.4   110.9
> XTS:   221.6   221.1   220.9   223.5   224.4   222.3
> LRW:   206.2   209.7   207.4   203.7   209.3   207.2
>
> x86 (AES-NI), v4 of the patch:
>       1. run  2. run  3. run  4. run  5. run    mean
> ECB:   122.5   121.2   121.6   125.7   125.5   123.3
> CBC:   259.5   259.2   261.2   264.0   267.6   262.3
> XTS:   225.1   230.7   220.6   217.9   216.3   222.1
> LRW:   202.7   202.8   210.6   208.9   202.7   205.5
>
> Comparing the values for the CBC variant between v3 and v4 of the
> patch shows that porting the CBC variant to x86 more than doubled the
> performance, so the slightly ugly #ifdef'ed code is worth the effort.
>
> x86-64 (old):
>       1. run  2. run  3. run  4. run  5. run    mean
> ECB:   121.4   120.9   121.1   121.2   120.9   121.1
> CBC:   282.5   286.3   281.5   282.0   294.5   285.3
> XTS:   263.6   260.3   263.0   267.0   264.6   263.7
> LRW:   249.6   249.8   250.5   253.4   252.2   251.1
>
> x86-64 (new):
>       1. run  2. run  3. run  4. run  5. run    mean
> ECB:   122.1   122.0   122.0   127.0   121.9   123.0
> CBC:   291.2   286.2   295.6   291.4   289.9   290.8
> XTS:   263.3   264.4   264.5   264.2   270.4   265.3
> LRW:   254.9   252.3   253.6   258.2   257.5   255.3
>
> Comparing the mean values gives us:
>
> x86:       i586  aes-ni    delta
> ECB:       93.8   123.3   +31.4%

Why is the improvement of ECB so small? I cannot understand it. It
should be as big as CBC.

Best Regards,
Huang Ying

> CBC:       84.8   262.3  +209.3%
> XTS:      108.6   222.1  +104.5%
> LRW:      105.0   205.5   +95.7%
>
> x86-64:     old     new    delta
> ECB:      121.1   123.0    +1.5%
> CBC:      285.3   290.8    +1.9%
> XTS:      263.7   265.3    +0.6%
> LRW:      251.1   255.3    +1.7%
>
> The improvement for the old vs. the new x86-64 version is not as
> drastic as for the synchronous variant (see the tcrypt tests in the
> previous email), but it is nevertheless an improvement. The
> improvement for the x86 case, however, should be noticeable. It's
> almost as fast as the x86-64 version.
>
> I'll post the new version of the patch in a follow-up email.
>
> Regards,
> Mathias
Re: [PATCH v3] x86, crypto: ported aes-ni implementation to x86
On 12.11.2010, 01:33 Huang Ying wrote:
> Hi, Mathias,
>
> On Fri, 2010-11-12 at 06:18 +0800, Mathias Krause wrote:
>> All tests were run five times in a row using a 256 bit key and doing
>> I/O to the block device in chunks of 1MB. The numbers are MB/s.
>>
>> x86 (i586 variant):
>>       1. run  2. run  3. run  4. run  5. run    mean
>> ECB:    93.9    93.9    94.0    93.5    93.8    93.8
>> CBC:    84.9    84.8    84.9    84.9    84.8    84.8
>> XTS:   108.2   108.3   109.6   108.3   108.9   108.6
>> LRW:   105.0   105.0   105.1   105.1   105.1   105.0
>>
>> x86 (AES-NI), v3 of the patch:
>>       1. run  2. run  3. run  4. run  5. run    mean
>> ECB:   124.8   120.8   124.5   120.6   124.5   123.0
>> CBC:   112.6   109.6   112.6   110.7   109.4   110.9
>> XTS:   221.6   221.1   220.9   223.5   224.4   222.3
>> LRW:   206.2   209.7   207.4   203.7   209.3   207.2
>>
>> x86 (AES-NI), v4 of the patch:
>>       1. run  2. run  3. run  4. run  5. run    mean
>> ECB:   122.5   121.2   121.6   125.7   125.5   123.3
>> CBC:   259.5   259.2   261.2   264.0   267.6   262.3
>> XTS:   225.1   230.7   220.6   217.9   216.3   222.1
>> LRW:   202.7   202.8   210.6   208.9   202.7   205.5
>>
>> Comparing the values for the CBC variant between v3 and v4 of the
>> patch shows that porting the CBC variant to x86 more than doubled the
>> performance, so the slightly ugly #ifdef'ed code is worth the effort.
>>
>> x86-64 (old):
>>       1. run  2. run  3. run  4. run  5. run    mean
>> ECB:   121.4   120.9   121.1   121.2   120.9   121.1
>> CBC:   282.5   286.3   281.5   282.0   294.5   285.3
>> XTS:   263.6   260.3   263.0   267.0   264.6   263.7
>> LRW:   249.6   249.8   250.5   253.4   252.2   251.1
>>
>> x86-64 (new):
>>       1. run  2. run  3. run  4. run  5. run    mean
>> ECB:   122.1   122.0   122.0   127.0   121.9   123.0
>> CBC:   291.2   286.2   295.6   291.4   289.9   290.8
>> XTS:   263.3   264.4   264.5   264.2   270.4   265.3
>> LRW:   254.9   252.3   253.6   258.2   257.5   255.3
>>
>> Comparing the mean values gives us:
>>
>> x86:       i586  aes-ni    delta
>> ECB:       93.8   123.3   +31.4%
>
> Why is the improvement of ECB so small? I cannot understand it. It
> should be as big as CBC.

I don't know why the ECB variant is so slow compared to the other
variants. But that is the case even for the current x86-64 version. See
the above values for x86-64 (old). I set up dm-crypt for this test like
this:

# cryptsetup -c aes-ecb-plain -d /dev/urandom create cfs /dev/loop0

What were the numbers you measured in your tests while developing the
x86-64 version?

Best regards,
Mathias

> Best Regards,
> Huang Ying
>
>> CBC:       84.8   262.3  +209.3%
>> XTS:      108.6   222.1  +104.5%
>> LRW:      105.0   205.5   +95.7%
>>
>> x86-64:     old     new    delta
>> ECB:      121.1   123.0    +1.5%
>> CBC:      285.3   290.8    +1.9%
>> XTS:      263.7   265.3    +0.6%
>> LRW:      251.1   255.3    +1.7%
>>
>> The improvement for the old vs. the new x86-64 version is not as
>> drastic as for the synchronous variant (see the tcrypt tests in the
>> previous email), but it is nevertheless an improvement. The
>> improvement for the x86 case, however, should be noticeable. It's
>> almost as fast as the x86-64 version.
>>
>> I'll post the new version of the patch in a follow-up email.
>>
>> Regards,
>> Mathias
Re: [PATCH v3] x86, crypto: ported aes-ni implementation to x86
On Fri, 2010-11-12 at 15:30 +0800, Mathias Krause wrote:
> On 12.11.2010, 01:33 Huang Ying wrote:
>> Hi, Mathias,
>>
>> On Fri, 2010-11-12 at 06:18 +0800, Mathias Krause wrote:
>>> All tests were run five times in a row using a 256 bit key and doing
>>> I/O to the block device in chunks of 1MB. The numbers are MB/s.
>>>
>>> x86 (i586 variant):
>>>       1. run  2. run  3. run  4. run  5. run    mean
>>> ECB:    93.9    93.9    94.0    93.5    93.8    93.8
>>> CBC:    84.9    84.8    84.9    84.9    84.8    84.8
>>> XTS:   108.2   108.3   109.6   108.3   108.9   108.6
>>> LRW:   105.0   105.0   105.1   105.1   105.1   105.0
>>>
>>> x86 (AES-NI), v3 of the patch:
>>>       1. run  2. run  3. run  4. run  5. run    mean
>>> ECB:   124.8   120.8   124.5   120.6   124.5   123.0
>>> CBC:   112.6   109.6   112.6   110.7   109.4   110.9
>>> XTS:   221.6   221.1   220.9   223.5   224.4   222.3
>>> LRW:   206.2   209.7   207.4   203.7   209.3   207.2
>>>
>>> x86 (AES-NI), v4 of the patch:
>>>       1. run  2. run  3. run  4. run  5. run    mean
>>> ECB:   122.5   121.2   121.6   125.7   125.5   123.3
>>> CBC:   259.5   259.2   261.2   264.0   267.6   262.3
>>> XTS:   225.1   230.7   220.6   217.9   216.3   222.1
>>> LRW:   202.7   202.8   210.6   208.9   202.7   205.5
>>>
>>> Comparing the values for the CBC variant between v3 and v4 of the
>>> patch shows that porting the CBC variant to x86 more than doubled
>>> the performance, so the slightly ugly #ifdef'ed code is worth the
>>> effort.
>>>
>>> x86-64 (old):
>>>       1. run  2. run  3. run  4. run  5. run    mean
>>> ECB:   121.4   120.9   121.1   121.2   120.9   121.1
>>> CBC:   282.5   286.3   281.5   282.0   294.5   285.3
>>> XTS:   263.6   260.3   263.0   267.0   264.6   263.7
>>> LRW:   249.6   249.8   250.5   253.4   252.2   251.1
>>>
>>> x86-64 (new):
>>>       1. run  2. run  3. run  4. run  5. run    mean
>>> ECB:   122.1   122.0   122.0   127.0   121.9   123.0
>>> CBC:   291.2   286.2   295.6   291.4   289.9   290.8
>>> XTS:   263.3   264.4   264.5   264.2   270.4   265.3
>>> LRW:   254.9   252.3   253.6   258.2   257.5   255.3
>>>
>>> Comparing the mean values gives us:
>>>
>>> x86:       i586  aes-ni    delta
>>> ECB:       93.8   123.3   +31.4%
>>
>> Why is the improvement of ECB so small? I cannot understand it. It
>> should be as big as CBC.
>
> I don't know why the ECB variant is so slow compared to the other
> variants. But that is the case even for the current x86-64 version.
> See the above values for x86-64 (old). I set up dm-crypt for this
> test like this:
>
> # cryptsetup -c aes-ecb-plain -d /dev/urandom create cfs /dev/loop0
>
> What were the numbers you measured in your tests while developing the
> x86-64 version?

Can't remember the numbers. Do you have interest in digging into the
issue?

Best Regards,
Huang Ying
Re: [PATCH v3] x86, crypto: ported aes-ni implementation to x86
On 12.11.2010, 08:34 Huang Ying wrote:
> On Fri, 2010-11-12 at 15:30 +0800, Mathias Krause wrote:
>> On 12.11.2010, 01:33 Huang Ying wrote:
>>> Hi, Mathias,
>>>
>>> On Fri, 2010-11-12 at 06:18 +0800, Mathias Krause wrote:
>>>> All tests were run five times in a row using a 256 bit key and
>>>> doing I/O to the block device in chunks of 1MB. The numbers are
>>>> MB/s.
>>>>
>>>> x86 (i586 variant):
>>>>       1. run  2. run  3. run  4. run  5. run    mean
>>>> ECB:    93.9    93.9    94.0    93.5    93.8    93.8
>>>> CBC:    84.9    84.8    84.9    84.9    84.8    84.8
>>>> XTS:   108.2   108.3   109.6   108.3   108.9   108.6
>>>> LRW:   105.0   105.0   105.1   105.1   105.1   105.0
>>>>
>>>> x86 (AES-NI), v3 of the patch:
>>>>       1. run  2. run  3. run  4. run  5. run    mean
>>>> ECB:   124.8   120.8   124.5   120.6   124.5   123.0
>>>> CBC:   112.6   109.6   112.6   110.7   109.4   110.9
>>>> XTS:   221.6   221.1   220.9   223.5   224.4   222.3
>>>> LRW:   206.2   209.7   207.4   203.7   209.3   207.2
>>>>
>>>> x86 (AES-NI), v4 of the patch:
>>>>       1. run  2. run  3. run  4. run  5. run    mean
>>>> ECB:   122.5   121.2   121.6   125.7   125.5   123.3
>>>> CBC:   259.5   259.2   261.2   264.0   267.6   262.3
>>>> XTS:   225.1   230.7   220.6   217.9   216.3   222.1
>>>> LRW:   202.7   202.8   210.6   208.9   202.7   205.5
>>>>
>>>> Comparing the values for the CBC variant between v3 and v4 of the
>>>> patch shows that porting the CBC variant to x86 more than doubled
>>>> the performance, so the slightly ugly #ifdef'ed code is worth the
>>>> effort.
>>>>
>>>> x86-64 (old):
>>>>       1. run  2. run  3. run  4. run  5. run    mean
>>>> ECB:   121.4   120.9   121.1   121.2   120.9   121.1
>>>> CBC:   282.5   286.3   281.5   282.0   294.5   285.3
>>>> XTS:   263.6   260.3   263.0   267.0   264.6   263.7
>>>> LRW:   249.6   249.8   250.5   253.4   252.2   251.1
>>>>
>>>> x86-64 (new):
>>>>       1. run  2. run  3. run  4. run  5. run    mean
>>>> ECB:   122.1   122.0   122.0   127.0   121.9   123.0
>>>> CBC:   291.2   286.2   295.6   291.4   289.9   290.8
>>>> XTS:   263.3   264.4   264.5   264.2   270.4   265.3
>>>> LRW:   254.9   252.3   253.6   258.2   257.5   255.3
>>>>
>>>> Comparing the mean values gives us:
>>>>
>>>> x86:       i586  aes-ni    delta
>>>> ECB:       93.8   123.3   +31.4%
>>>
>>> Why is the improvement of ECB so small? I cannot understand it. It
>>> should be as big as CBC.
>>
>> I don't know why the ECB variant is so slow compared to the other
>> variants. But that is the case even for the current x86-64 version.
>> See the above values for x86-64 (old). I set up dm-crypt for this
>> test like this:
>>
>> # cryptsetup -c aes-ecb-plain -d /dev/urandom create cfs /dev/loop0
>>
>> What were the numbers you measured in your tests while developing
>> the x86-64 version?
>
> Can't remember the numbers. Do you have interest in digging into the
> issue?

Sure. Increasing performance is always a good thing to do. :)

Best regards,
Mathias
Re: [PATCH v3] x86, crypto: ported aes-ni implementation to x86
On 03.11.2010, 23:27 Huang Ying wrote:
> On Wed, 2010-11-03 at 14:14 -0700, Mathias Krause wrote:
>> The AES-NI instructions are also available in legacy mode so the
>> 32-bit architecture may profit from those, too.
>>
>> To illustrate the performance gain here's a short summary of the
>> tcrypt speed test on a Core i7 M620 running at 2.67GHz comparing
>> both assembler implementations:
>>
>> x86:                            i586       aes-ni    delta
>> 256 bit, 8kB blocks, ECB:  125.94 MB/s  187.09 MB/s   +48.6%
>
> Which method did you use for speed testing?
>
> modprobe tcrypt mode=200 sec=?

Yes. I used:

modprobe tcrypt mode=200 sec=1

> That actually does not work very well for AES-NI, because the AES-NI
> blkcipher is tested in synchronous mode, and in that mode
> kernel_fpu_begin/end() must be called for every block, and
> kernel_fpu_begin/end() is quite slow.

That's what I figured, too. Can this slowdown be avoided by saving and
restoring the used FPU registers within the assembler implementation,
or would this be even slower?

> At the same time, some further optimizations for AES-NI (such as the
> ecb-aes-aesni driver) cannot be tested in that mode, because they are
> only available in asynchronous mode.

After finding the bug in the second version of the patch I noticed
this, too.

> When developing AES-NI for x86_64, I used dm-crypt + AES-NI for speed
> testing, where the AES-NI blkcipher is tested in asynchronous mode,
> and kernel_fpu_begin/end() is called for every page. Can you use that
> to test?

But wouldn't this be even slower than the above measurement? I took the
results for 8kB blocks and a page would only be 4kB ... well, depends
on what kind of pages you took. IIRC x86-64 not only supports 2MB but
also 1GB pages ;)

> Or you can add test_acipher_speed (similar to test_ahash_speed) to
> test the cipher in asynchronous mode.

Maybe I'll try this approach, since it looks like just a minor
modification of the tcrypt module.

Thanks for the hints!

Best regards,
Mathias

> Best Regards,
> Huang Ying
Re: [PATCH v3] x86, crypto: ported aes-ni implementation to x86
On Wed, 2010-11-03 at 14:14 -0700, Mathias Krause wrote:
> The AES-NI instructions are also available in legacy mode so the
> 32-bit architecture may profit from those, too.
>
> To illustrate the performance gain here's a short summary of the
> tcrypt speed test on a Core i7 M620 running at 2.67GHz comparing both
> assembler implementations:
>
> x86:                            i586       aes-ni    delta
> 256 bit, 8kB blocks, ECB:  125.94 MB/s  187.09 MB/s   +48.6%

Which method did you use for speed testing?

modprobe tcrypt mode=200 sec=?

That actually does not work very well for AES-NI, because the AES-NI
blkcipher is tested in synchronous mode, and in that mode
kernel_fpu_begin/end() must be called for every block, and
kernel_fpu_begin/end() is quite slow. At the same time, some further
optimizations for AES-NI (such as the ecb-aes-aesni driver) cannot be
tested in that mode, because they are only available in asynchronous
mode.

When developing AES-NI for x86_64, I used dm-crypt + AES-NI for speed
testing, where the AES-NI blkcipher is tested in asynchronous mode, and
kernel_fpu_begin/end() is called for every page. Can you use that to
test?

Or you can add test_acipher_speed (similar to test_ahash_speed) to test
the cipher in asynchronous mode.

Best Regards,
Huang Ying
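To put the per-block versus per-page remark into numbers, here is a rough Python sketch that only counts how many kernel_fpu_begin()/kernel_fpu_end() pairs each granularity would imply for a given amount of data. It rests on assumptions that are not stated in the mail: that "block" means one 16-byte AES block, and that a page is 4 kB. The helper name fpu_sections is made up for illustration, and the sketch says nothing about how long a single begin/end pair takes.

```python
AES_BLOCK_SIZE = 16   # bytes per AES block (fixed by the AES standard)
PAGE_SIZE = 4096      # bytes; assumption: ordinary 4 kB pages


def fpu_sections(total_bytes, bytes_per_section):
    """Count the FPU begin/end pairs needed for `total_bytes` of data
    when one pair covers at most `bytes_per_section` bytes."""
    if total_bytes < 0 or bytes_per_section <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division without floating point.
    return -(-total_bytes // bytes_per_section)


if __name__ == "__main__":
    data = 8 * 1024  # the 8 kB buffer size used in the tcrypt summary
    print("per AES block:", fpu_sections(data, AES_BLOCK_SIZE))
    print("per 4 kB page:", fpu_sections(data, PAGE_SIZE))
```

Under these assumptions the per-block path enters and leaves the FPU section hundreds of times for a buffer that the per-page path covers with a handful of pairs, which is the overhead difference the mail points at.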