Hi Jan,

>what HW engine is this?  I think your best bet is to actually get the
>engine to support GCM; with AES and SHA acceleration in place there is very
>little to stop the HW engine from being able to support GCM..
The HW engine is part of the al314 SoC. It is connected to the A15 CPU via
PCI inside the SoC. The chip vendor will not support GCM, for a variety of
reasons.

>the numbers do suggest some form of cryptodev acceleration - can you
>unload the cryptodev module or block access to it (e.g. chmod 000
>/dev/crypto) ?
In my second set of test numbers, I unloaded the cryptodev module. You can
see that the CCM performance is almost the same.
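
As a sanity check on the figures quoted further down, the per-size
throughput can be recomputed from the "Doing ..." lines as
count * block size / elapsed seconds / 1000. Below is a minimal Python
sketch of that calculation. The function names are hypothetical, the sample
lines are copied from the "With cryptodev" run, and the 1% tolerance is an
arbitrary choice. A count that does not drop as the block size grows would
imply throughput growing linearly with block size; one possible reading (an
assumption, not something confirmed in this thread) is that the benchmark
loop is not really processing the payload in that case.

```python
import re

# Matches lines such as:
#   Doing aes-128-cbc for 3s on 16 size blocks: 252783 aes-128-cbc's in 3.00s
DOING = re.compile(
    r"Doing \S+ for \d+s on (?P<size>\d+) size blocks: "
    r"(?P<count>\d+) \S+ in (?P<secs>[\d.]+)s"
)


def throughput_k(text):
    """Map block size -> (operation count, throughput in 1000s of bytes/s)."""
    results = {}
    for m in DOING.finditer(text):
        size, count, secs = int(m["size"]), int(m["count"]), float(m["secs"])
        results[size] = (count, count * size / secs / 1000.0)
    return results


def count_ignores_block_size(results, tolerance=0.01):
    """True if the operation count varies by less than `tolerance` (relative)
    across all block sizes. The threshold is an assumption for illustration."""
    counts = [count for count, _ in results.values()]
    return (max(counts) - min(counts)) / max(counts) < tolerance


# Sample lines copied from the "With cryptodev" results in this thread.
CBC_WITH_CRYPTODEV = """
Doing aes-128-cbc for 3s on 16 size blocks: 252783 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 253044 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 251746 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 190306 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 122657 aes-128-cbc's in 3.00s
"""

CCM_WITH_CRYPTODEV = """
Doing aes-128-ccm for 3s on 16 size blocks: 10179383 aes-128-ccm's in 3.00s
Doing aes-128-ccm for 3s on 64 size blocks: 10179215 aes-128-ccm's in 3.00s
Doing aes-128-ccm for 3s on 256 size blocks: 10179785 aes-128-ccm's in 3.00s
Doing aes-128-ccm for 3s on 1024 size blocks: 10182095 aes-128-ccm's in 3.00s
Doing aes-128-ccm for 3s on 8192 size blocks: 10179225 aes-128-ccm's in 3.00s
"""

for name, sample in (("aes-128-cbc", CBC_WITH_CRYPTODEV),
                     ("aes-128-ccm", CCM_WITH_CRYPTODEV)):
    results = throughput_k(sample)
    for size, (count, kbytes) in sorted(results.items()):
        print(f"{name} {size:>5} bytes: {count:>9} ops, {kbytes:.2f}k")
    print(f"{name}: count ignores block size? "
          f"{count_ignores_block_size(results)}")
```

On these samples the recomputed 16-byte aes-128-cbc figure is 1348.18k,
matching the reported table. The aes-128-ccm counts stay within about 0.03%
of each other across all block sizes, while the aes-128-cbc counts roughly
halve between 16 and 8192 bytes.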

Tony

Jan Just Keijser <janj...@nikhef.nl> wrote on Fri, Dec 4, 2020 at 5:49 PM:

> Hi Tony,
>
> On 04/12/20 08:41, Tony He wrote:
>
> Hi Jan,
> Yeah, the "-elapsed" option is needed because without it OpenSSL counts
> only user time instead of total time (user + sys time). You can see:
> * aes-128-cbc and sha1 are accelerated by the HW engine. I believe the
> speed is even faster for the openvpn dco module because it uses the HW
> engine in kernel space and bypasses the path between openssl and cryptodev.
>
> that is correct: the openvpn dco module sits in kernel space and does not
> need to cross the userspace<->kernelspace barrier, and thus should have
> better performance
>
> * aes-128-gcm is NOT accelerated by HW engine.
>
> what HW engine is this?  I think your best bet is to actually get the
> engine to support GCM; with AES and SHA acceleration in place there is very
> little to stop the HW engine from being able to support GCM...
>
> * aes-128-ccm is NOT accelerated by the HW engine, but it seems to be
> accelerated by a HW instruction or something else. I didn't know my device
> had such a function. The SoC type is al314.
>
> the numbers do suggest some form of cryptodev acceleration - can you
> unload the cryptodev module or block access to it (e.g. chmod 000
> /dev/crypto) ?
>
> The AL314 is a quad core Cortex A15 CPU @ 1.7 GHz ; the numbers *without*
> cryptodev look about right for that particular CPU.
>
> Most modern crypto packages use AES-GCM or chacha20-poly1305 as they are
> considered more secure. CBC is considered a bit outdated and as far as I
> know no openvpn release supports CCM thus far (which is a shame, really).
>
> HTH,
>
> JJK
>
>
>
> With cryptodev:
>
> # openssl speed -evp aes-128-cbc -elapsed
> You have chosen to measure elapsed time instead of user CPU time.
> Doing aes-128-cbc for 3s on 16 size blocks: 252783 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 64 size blocks: 253044 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 256 size blocks: 251746 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 1024 size blocks: 190306 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 8192 size blocks: 122657 aes-128-cbc's in 3.00s
> [...]
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> aes-128-cbc      1348.18k     5398.27k    21482.33k    64957.78k   334935.38k
>
> # openssl speed -evp aes-128-gcm -elapsed
> You have chosen to measure elapsed time instead of user CPU time.
> Doing aes-128-gcm for 3s on 16 size blocks: 3509485 aes-128-gcm's in 3.00s
> Doing aes-128-gcm for 3s on 64 size blocks: 900678 aes-128-gcm's in 3.00s
> Doing aes-128-gcm for 3s on 256 size blocks: 228961 aes-128-gcm's in 3.00s
> Doing aes-128-gcm for 3s on 1024 size blocks: 57475 aes-128-gcm's in 3.00s
> Doing aes-128-gcm for 3s on 8192 size blocks: 7189 aes-128-gcm's in 3.00s
> [...]
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> aes-128-gcm     18717.25k    19214.46k    19538.01k    19618.13k    19630.76k
>
> # openssl speed -evp aes-128-ccm -elapsed
> You have chosen to measure elapsed time instead of user CPU time.
> Doing aes-128-ccm for 3s on 16 size blocks: 10179383 aes-128-ccm's in 3.00s
> Doing aes-128-ccm for 3s on 64 size blocks: 10179215 aes-128-ccm's in 3.00s
> Doing aes-128-ccm for 3s on 256 size blocks: 10179785 aes-128-ccm's in 3.00s
> Doing aes-128-ccm for 3s on 1024 size blocks: 10182095 aes-128-ccm's in 3.00s
> Doing aes-128-ccm for 3s on 8192 size blocks: 10179225 aes-128-ccm's in 3.00s
> [...]
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> aes-128-ccm     54290.04k   217156.59k   868674.99k  3475488.43k 27796070.40k
>
> # openssl speed -evp sha1 -elapsed
> You have chosen to measure elapsed time instead of user CPU time.
> Doing sha1 for 3s on 16 size blocks: 95252 sha1's in 3.00s
> Doing sha1 for 3s on 64 size blocks: 95166 sha1's in 3.00s
> Doing sha1 for 3s on 256 size blocks: 76177 sha1's in 3.00s
> Doing sha1 for 3s on 1024 size blocks: 68799 sha1's in 3.00s
> Doing sha1 for 3s on 8192 size blocks: 53034 sha1's in 3.00s
> [...]
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> sha1              508.01k     2030.21k     6500.44k    23483.39k   144818.18k
>
> Without cryptodev:
>
> # openssl speed -evp aes-128-cbc -elapsed
> You have chosen to measure elapsed time instead of user CPU time.
> Doing aes-128-cbc for 3s on 16 size blocks: 9235207 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 64 size blocks: 2498066 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 256 size blocks: 645288 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 1024 size blocks: 161372 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 8192 size blocks: 20385 aes-128-cbc's in 3.00s
> [...]
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> aes-128-cbc     49254.44k    53292.07k    55064.58k    55081.64k    55664.64k
>
> # openssl speed -evp aes-128-gcm -elapsed
> You have chosen to measure elapsed time instead of user CPU time.
> Doing aes-128-gcm for 3s on 16 size blocks: 3507422 aes-128-gcm's in 3.00s
> Doing aes-128-gcm for 3s on 64 size blocks: 901036 aes-128-gcm's in 3.00s
> Doing aes-128-gcm for 3s on 256 size blocks: 228857 aes-128-gcm's in 3.00s
> Doing aes-128-gcm for 3s on 1024 size blocks: 57411 aes-128-gcm's in 3.00s
> Doing aes-128-gcm for 3s on 8192 size blocks: 7188 aes-128-gcm's in 3.00s
> [...]
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> aes-128-gcm     18706.25k    19222.10k    19529.13k    19596.29k    19628.03k
>
> # openssl speed -evp aes-128-ccm -elapsed
> You have chosen to measure elapsed time instead of user CPU time.
> Doing aes-128-ccm for 3s on 16 size blocks: 10170897 aes-128-ccm's in 3.00s
> Doing aes-128-ccm for 3s on 64 size blocks: 10167692 aes-128-ccm's in 3.00s
> Doing aes-128-ccm for 3s on 256 size blocks: 10166117 aes-128-ccm's in 3.00s
> Doing aes-128-ccm for 3s on 1024 size blocks: 10167095 aes-128-ccm's in 3.00s
> Doing aes-128-ccm for 3s on 8192 size blocks: 10172046 aes-128-ccm's in 3.00s
> [...]
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> aes-128-ccm     54244.78k   216910.76k   867508.65k  3470368.43k 27776466.94k
>
> # openssl speed -evp sha1 -elapsed
> You have chosen to measure elapsed time instead of user CPU time.
> Doing sha1 for 3s on 16 size blocks: 1877571 sha1's in 3.00s
> Doing sha1 for 3s on 64 size blocks: 1250523 sha1's in 3.00s
> Doing sha1 for 3s on 256 size blocks: 603090 sha1's in 3.00s
> Doing sha1 for 3s on 1024 size blocks: 198963 sha1's in 3.00s
> Doing sha1 for 3s on 8192 size blocks: 27380 sha1's in 3.00s
> [...]
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> sha1            10013.71k    26677.82k    51463.68k    67912.70k    74765.65k
>
> Tony
>
> Jan Just Keijser <janj...@nikhef.nl> wrote on Wed, Dec 2, 2020 at 11:24 PM:
>
>> Hi Tony,
>>
>> On 02/12/20 15:51, Jan Just Keijser wrote:
>>
>>
>> On 02/12/20 15:22, Tony He wrote:
>>
>> Hi Jan,
>>
>> Welcome to the discussion.
>>
>> >the second set of numbers doesn't make sense, and a much better test is
>> to do an actual encryption test
>> I haven't compiled the cryptodev kernel module for my PC, so I cannot
>> reproduce this issue for now.  What you don't understand is why the
>> performance is much worse with the cryptodev module for *big* blocks, right?
>> If so, my guess is that the kernel may assign the work to multiple cores
>> while OpenSSL uses one core. Would you share the output of the command
>> "mpstat -P ALL 2"?
>>
>> sure, while using the cryptodev I see this:
>>
>> 15:28:36     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> 15:28:38     all    1.87    0.00   23.19    0.12    0.00    0.00    0.00    0.00    0.00   74.81
>> 15:28:38       0    0.00    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
>> 15:28:38       1    7.00    0.00   93.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
>> 15:28:38       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>> 15:28:38       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>
>> 15:28:38     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> 15:28:40     all    0.75    0.00   24.19    0.00    0.00    0.00    0.00    0.00    0.00   75.06
>> 15:28:40       0    0.00    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
>> 15:28:40       1    3.50    0.00   96.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00
>> 15:28:40       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>> 15:28:40       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>
>> on a 4 core box; this means that 1 core is used 100% (which is what I
>> expected).
>>
>>
>> I suspect the main reason the cryptodev results on my i5-6800 go off the
>> rails is due to this:
>> (look at the "Doing aes-128-cbc lines")
>>
>> $ ./openssl speed -evp aes-128-cbc
>> Doing aes-128-cbc for 3s on 16 size blocks: 2835368 aes-128-cbc's in 1.14s
>> Doing aes-128-cbc for 3s on 64 size blocks: 2720745 aes-128-cbc's in 1.01s
>> Doing aes-128-cbc for 3s on 256 size blocks: 2377830 aes-128-cbc's in *0.74s*
>> Doing aes-128-cbc for 3s on 1024 size blocks: 1538693 aes-128-cbc's in *0.40s*
>> Doing aes-128-cbc for 3s on 8192 size blocks: 370202 aes-128-cbc's in *0.11s*
>> OpenSSL 1.0.2m  2 Nov 2017
>> built on: reproducible build, date unspecified
>> options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial)
>> idea(int) blowfish(idx)
>> compiler: gcc -I. -I.. -I../include  -DOPENSSL_THREADS -D_REENTRANT
>> -DDSO_DLFCN -DHAVE_DLFCN_H -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS
>> -Wa,--noexecstack -m64 -DL_ENDIAN -O3 -Wall -DOPENSSL_IA32_SSE2
>> -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m
>> -DRC4_ASM -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM
>> -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
>> The 'numbers' are in 1000s of bytes per second processed.
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
>> aes-128-cbc      39794.64k   172403.64k   822600.65k  3939054.08k 27569952.58k
>>
>>
>> The timings reported for these runs are way off and probably just wrong.
>> The openssl speed test is supposed to run for 3 seconds. The actual
>> result returned for 8192 byte blocks is
>>
>> Doing aes-128-cbc for 3s on 8192 size blocks: 370202 aes-128-cbc's in *0.11s*
>>
>> whereas without cryptodev I see
>>
>> Doing aes-128-cbc for 3s on 8192 size blocks: 457255 aes-128-cbc's in *3.00s*
>>
>> So you can see that without cryptodev the i5-6800 actually says it's
>> doing more blocks (457,255 vs 370,202) but with cryptodev it is doing it in
>> WAY less time.  This leads me to believe the openssl speed code when using
>> cryptodev just "goes wrong".
>> It will be very interesting to see what the encryption test will bring -
>> that is a much better real-life-like example than a simple speed test.
>>
>> as a follow-up : someone whispered in my ear (thanks, André ;) ) that one
>> should use the -elapsed option for this, so here are new results:
>>
>> *with* cryptodev:
>>
>> ./openssl speed -evp aes-128-cbc -elapsed
>> You have chosen to measure elapsed time instead of user CPU time.
>> Doing aes-128-cbc for 3s on 16 size blocks: 2825786 aes-128-cbc's in 3.00s
>> Doing aes-128-cbc for 3s on 64 size blocks: 2716822 aes-128-cbc's in 3.00s
>> Doing aes-128-cbc for 3s on 256 size blocks: 2369723 aes-128-cbc's in 3.00s
>> Doing aes-128-cbc for 3s on 1024 size blocks: 1536054 aes-128-cbc's in 3.00s
>> Doing aes-128-cbc for 3s on 8192 size blocks: 369984 aes-128-cbc's in 3.00s
>> [...]
>> aes-128-cbc      15,070.86k    57,958.87k   202,216.36k   524,306.43k 1,010,302.98k
>>
>> *without* cryptodev:
>>
>> $ openssl speed -evp aes-128-cbc -elapsed
>> You have chosen to measure elapsed time instead of user CPU time.
>> Doing aes-128-cbc for 3s on 16 size blocks: 207188725 aes-128-cbc's in 3.00s
>> Doing aes-128-cbc for 3s on 64 size blocks: 56855717 aes-128-cbc's in 3.00s
>> Doing aes-128-cbc for 3s on 256 size blocks: 14382122 aes-128-cbc's in 3.00s
>> Doing aes-128-cbc for 3s on 1024 size blocks: 3618996 aes-128-cbc's in 3.00s
>> Doing aes-128-cbc for 3s on 8192 size blocks: 456727 aes-128-cbc's in 3.00s
>> [...]
>> aes-128-cbc    1,105,006.53k  1,212,921.96k  1,227,274.41k  1,235,283.97k  1,247,169.19k
>>
>> which more or less reflects the encryption test results I posted earlier.
>> The question becomes, what are your results when using the -elapsed flag?
>>
>> JJK
>>
>>
>> >My advice is to rerun your tests *without* the cryptodev module and then
>> >decide whether you really need CBC+CCM hmacs.
>> Yes, I confirm that without cryptodev the performance is very bad on
>> my device. I don't have that device at hand right now, but I saved one
>> aes-256-cbc result in my web notebook, as below:
>>
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
>> aes-256-cbc     19626.95k    24289.71k    25054.46k    25347.75k    25337.86k
>>
>> Please note, there are two ways to accelerate encryption/decryption:
>> 1. HW instructions, as on Intel x86 CPUs.
>> 2. Using a crypto engine.
>> When your device uses method 2 and its CPU is not powerful, speed with
>> cryptodev is normally much faster, at least for big blocks. For small
>> blocks it may be slower, because it takes time to push the work to the
>> kernel and then to the HW engine, and that time may be longer than the
>> time OpenSSL needs to do the encryption/decryption directly.
>> Tony
>>
>> Jan Just Keijser <janj...@nikhef.nl> wrote on Wed, Dec 2, 2020 at 7:24 PM:
>>
>>> hi Tony,
>>>
>>> On 01/12/20 02:50, Tony He wrote:
>>>
>>> Hi Arne,
>>>
>>> openssl speed -evp aes-128-cbc
>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
>>> aes-128-cbc     20035.60k   123261.54k   267081.60k  1094764.09k  9181370.18k
>>> openssl speed -evp aes-128-gcm
>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
>>> aes-128-gcm     18738.76k    19284.91k    19524.44k    19606.87k    19685.46k
>>> openssl speed -evp aes-128-ccm
>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
>>> aes-128-ccm     53859.07k   215581.12k   862070.02k  3460786.43k 27566347.61k
>>> openssl speed -evp sha1
>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
>>> sha1             3108.57k    12177.79k    57325.18k   181610.34k  1207364.27k
>>> openssl speed -evp chacha20-poly1305
>>> chacha20-poly1305 is an unknown cipher or digest
>>> Using old openssl, so chacha20-poly1305 is not supported.
>>>
>>>
>>> these numbers look suspiciously like you're using the linux cryptodev
>>> module. Openssl speed results for the linux cryptodev module are totally
>>> unreliable and I'd even go so far as to say that the *only* numbers I trust
>>> in the output above are for aes-128-gcm
>>>
>>> For example, if I do the same on an i5-6800 I get *without* the
>>> cryptodev module:
>>>   $ openssl speed -evp aes-128-cbc
>>>   aes-128-cbc    1,104,599.38k  1,208,651.07k  1,231,766.70k  1,237,545.64k  1,248,793.94k
>>>
>>> and with the module I get
>>>   aes-128-cbc      45,087.41k   127,822.72k   581,517.17k  2,256,593.19k 27,583,804.51k
>>>
>>> the second set of numbers doesn't make sense, and a much better test is
>>> to do an actual encryption test, e.g.
>>>
>>> *without* the module
>>> cat BIGFILE | openssl aes-256-cbc -e -pass  pass:thisisabadpassword | pv > /dev/null
>>> 2.93GB 0:00:05 [ 549MB/s] [              <=>              ]
>>>
>>> ('pv' aka 'pipeview' is a handy tool to measure the throughput of a UNIX
>>> pipe)
>>>
>>> and with the module:
>>> cat BIGFILE | ./openssl aes-256-cbc -e -pass  pass:thisisabadpassword -engine cryptodev | pv > /dev/null
>>> engine "cryptodev" set.
>>> 2.93GB 0:00:07 [ 426MB/s] [              <=>
>>>
>>> so you see that using the cryptodev module actually slows things down -
>>> which is to be expected, as the application needs to do more work using the
>>> cryptodev module.
>>>
>>> My advice is to rerun your tests *without* the cryptodev module and then
>>> decide whether you really need CBC+CCM hmacs.
>>>
>>> HTH,
>>>
>>> JJK
>>>
>>>
>>> Arne Schwabe <a...@rfc2549.org> wrote on Thu, Nov 26, 2020 at 6:40 PM:
>>>
>>>> On 26.11.20 at 10:41, Tony He wrote:
>>>> > Hi Arne,
>>>> >
>>>> >>Since the original thread was not on the mailing list I am missing your
>>>> >>goal, but if your crypto accelerator already works with OpenSSL, then it
>>>> >>will also work with the "normal" OpenVPN
>>>> >
>>>> > Yes, it works with "normal" OpenVPN (OpenVPN2), but according to the
>>>> > test result, it's still not fast (about 60Mbps).
>>>> > The bottleneck is no longer the encryption operation. It comes from the
>>>> > switching between user space and kernel space in OpenVPN2,
>>>> > which keeps the weak CPU of the embedded device very busy. That's why we
>>>> > need OpenVPN3 running in kernel space.
>>>>
>>>>
>>>> What numbers are we talking about in crypto speed? Could you provide the
>>>> following from your "poor" device:
>>>>
>>>>
>>>> openssl speed -evp aes-128-cbc
>>>> openssl speed -evp aes-128-gcm
>>>> openssl speed -evp aes-128-ccm
>>>> openssl speed -evp sha1
>>>> openssl speed -evp chacha20-poly1305
>>>>
>>>> I want to know what difference/gain in terms of raw crypto speed we are
>>>> talking about here.
>>>>
>>>
>
_______________________________________________
Openvpn-devel mailing list
Openvpn-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openvpn-devel
