On 9/5/2015 2:50 PM, Leyne, Sean wrote:
> Jim and Boris,
>
>> Something you may want to investigate is replacing the "pure C"
>> implementation of ChaCha20 with the rotate step replaced with either a
>> compiler intrinsic (Microsoft) or a bit of assembler (gcc). SHA1 has
>> the same issue. I haven't a clue as to why popular crypto algorithms
>> use a rotate; virtually all microprocessors have rotate instructions,
>> but C lacks a rotate operator and the standard libraries neglect to
>> support it.
>
> Forgive my naïve point of view, but given that the AES instruction set
> has been built into AMD and Intel CPUs since 2011, why do you feel that
> it is necessary to push for ChaCha20***?
The goal is to make encryption so cheap that it becomes standard (without
even an option to turn it off). But first we need to understand the
performance issues. For example, there is a big difference between AES
encryption and decryption speeds, particularly with CBC mode. It may have
only to do with the additional bit juggling necessary to implement CBC,
but it's still something we need to understand to evaluate the cost
tradeoffs. ChaCha20 is symmetric in this respect, so its encryption and
decryption costs are the same.

ChaCha20 -- and virtually every other stream cipher -- is easier to use
than a block cipher like AES, especially when there is a possibility of
messages under the block size (16 bytes for AES). Having to add complexity
to the protocols to handle explicit message padding adds to the cost, even
when it isn't necessary for larger packets.

Sure, it would be nice if AES-NI made AES the clear performance winner
across the board, but I don't think that will prove to be the case. It
might, though...

> To my reading, Boris' numbers have shown that AES performance is more
> than adequate (53.2 AVG seconds to process 256MB = 4+MB/s).
>
> Further, considering that the use case is the encryption of data blocks,
> which would be much smaller than even 1MB, will the performance
> difference really be noticeable?

If the delta cost for encryption is significant, then it pretty much needs
to be made optional, which necessarily increases complexity and further
reduces performance.

I think I explained the lessons from the bit-blt chip in the Sun 3-50. The
abbreviated version is that while the bit-blt chip was much faster for
large operations, most operations were small, single character cell
transfers, where software had an edge over the hardware. But putting in a
test for operation size further slowed down small operations, tilting the
net gain/loss deeper into loss territory. In other words, numbers in situ
count more than abstract performance studies.
> Sean
>
> *** Separately, with Intel HyperThreaded CPUs, and considering that AES
> is "on-chip", wouldn't that allow the core processing the encryption to
> shift focus to the other thread's instructions while the first thread
> waits for the on-chip AES processor to operate? In other words, isn't it
> possible that ChaCha20 is only faster when CPUs are being "single
> minded", and that real world performance on a server dealing with
> several tasks might favor CPUs with native AES instructions?

I don't think so, but I haven't been able to find a definitive answer. My
understanding, and it may be wrong, is that the AES-NI instructions aren't
really separate hardware but just very complex microcoded instructions
that operate on internal processor registers, so threading doesn't come
into it.

By the way, my measurement at NuoDB was that hyper-threading was worth a
plugged nickel -- almost no measurable difference between real threads and
hyperthreads. I'm sure Intel has benchmarks that show otherwise, but
unless you have benchmarks on your own code that show something else, I
wouldn't make any assumptions about the performance benefits of
hyperthreading.

Intel's focus on multi-threading tends to be on parallelizing existing
single-threaded code across cores. In that case, hyperthreading might
actually work. But when you have many more client threads than cores,
which is the case for database systems using a thread-per-client model,
you will get quite different results. At NuoDB, switching from a
thread-per-client model to a (more or less) fixed pool of worker threads
caused performance to take off like a rocket. But in that model, kicking
the worker thread pool up to the number of hyperthreads just increases the
number of stalled threads contending with running threads for the same
resources. Not good.

But in any case, Sean, learning about what's actually happening does have
merit. Surely you aren't against knowledge, are you?
>> Here are numbers:
>>
>>                                                          all       enc
>> AES, BOTAN based code, with AES-NI                     531.1      53.2
>> AES, INTEL based code, with AES-NI                     544.8      76.6
>> AES, Bouncy Castle (Java) based code, without AES-NI  2071.8    1620.6
>> ChaCha20, Bouncy Castle (Java) based code             1712.7    1234.8

------------------------------------------------------------------------------
Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel