Re: 5x speedup for AES using SSE5?
Paul Crowley wrote, On 24/8/08 1:00 AM:
> http://www.ddj.com/hpc-high-performance-computing/201803067
> [...]
> However, glancing through the SSE5 specification, I can't see at all how such a dramatic speedup might be achieved

A commenter on slashdot hinted at the vector permutation instructions, similar to those on Altivec, being useful:

http://developers.slashdot.org/comments.pl?sid=284695&cid=20423869

Altivec is also known as VMX

http://en.wikipedia.org/wiki/AltiVec

That led me to this paper with a section on use of VMX vector operations in an AES implementation:

http://diploma-thesis.siewior.net/html/diplomarbeitch3.html

I didn't see performance comparisons or anything specific to SSE5, but it looks like the kind of thing that AMD might have meant.

-- 
Sidney Markowitz
http://www.sidney.com

- The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
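As an illustration of the vector permutation idea mentioned above, here is a minimal Python sketch of what a 16-byte permute does, applied to AES ShiftRows. It assumes the usual column-major AES state layout and models only a single-source permute; real instructions such as Altivec's vperm select from two source registers and carry more control bits. The names `byte_permute` and `SHIFT_ROWS` are hypothetical, and nothing here is taken from the AMD article or the linked paper.

```python
# Model a 16-byte vector permute: out[i] = src[control[i]].
# Assumption: single-source permute with indices 0..15 only.

def byte_permute(src, control):
    """Return the 16 bytes of src rearranged according to control."""
    assert len(src) == 16 and len(control) == 16
    return bytes(src[i & 0x0F] for i in control)

# AES ShiftRows on a column-major state (byte index = row + 4 * column):
# row r is rotated left by r columns, so output byte (r, c) comes from
# input byte (r, (c + r) mod 4).
SHIFT_ROWS = [(i % 4) + 4 * (((i // 4) + (i % 4)) % 4) for i in range(16)]

state = bytes(range(16))
shifted = byte_permute(state, SHIFT_ROWS)
```

The point of the sketch is only that a whole AES step on all 16 bytes can be expressed as one permute with a constant control vector; whether this accounts for the quoted speedup is not established here.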
Re: [cryptography] 5x speedup for AES using SSE5?
Paul Crowley wrote:
> http://www.ddj.com/hpc-high-performance-computing/201803067
>
> In the above Dr Dobb's article from a little over a year ago, AMD
> Senior Fellow Leendert vanDoorn states "the Advanced Encryption
> Standard (AES) algorithm gets a factor of 5 performance improvement by
> using the new SSE5 extension". However, glancing through the SSE5
> specification, I can't see at all how such a dramatic speedup might be
> achieved. Does anyone know any more, or can anyone see more than I
> can in the spec?
>
> http://developer.amd.com/cpu/SSE5/Pages/default.aspx

I've only just seen this, but I've been playing with the VIA's AES and looking at Intel's AES instructions.

I believe the PPERM instruction will be rather important. Combined with the packed byte rotate and shift, some rather interesting SIMD byte fiddles should be possible. From my initial look, it should be possible to implement AES without tables, doing SIMD operations on all 16 bytes at once.

I've not looked at it enough yet, but currently I'm doing an AES round in about 140 cycles a block (call it 13 per round plus overhead) on an AMD64 (220e6 bytes/sec on a 2ghz cpu) using normal instructions. I don't believe they will be taking 30 instructions, so they probably have 4-8 SSE instructions per round; it then comes down to how many SSE execution units there are to execute in parallel.

As for VIA, on a 1ghz C7 part, cbc mode, 128bit key, for 16byte aligned, I'm getting about 24 cycles per block; for unaligned, about 67 cycles. The chip does ECB mode at 12.6 cycles a block if aligned (2 at a time). It does not handle unaligned ECB, so with manual alignment, 75 cycles. Not bad for a single issue cpu considering the x86 instruction version of AES I have takes 1010 cycles per block.

For the intel AES instructions, from my readings, it will be able to do a single AES (128bit) in a bit more than 60 cycles (10 rounds, 6 cycle latency for the instructions). The good part is that they will pipeline.
So if you, say, do 6 AES ecb blocks at once, you can get a throughput of about 12 cycles a block (intel's figures). This is obviously of relevance for counter mode, cbc decrypt and more recent standards like xts and gcm mode.

Part of the intel justification for the AES instruction seems to be to stop cache timing attacks. If the SSE5 instructions allow AES to be done with SIMD instead of tables, they will achieve the same effect, but without as much parallel upside. It also looks like the GF(2^8) maths will benefit.

eric (who has only been able to play with via hardware :-(
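The pipelining arithmetic above can be sketched with a toy steady-state model in Python. The assumptions are mine, not Intel's: one instruction per round, a fixed latency, one instruction issued per cycle, and no key schedule, load/store or ramp-up cost. The function name `pipelined_cycles_per_block` is hypothetical.

```python
# Toy latency model for a pipelined AES round instruction.
# Assumptions: one instruction per round, fixed latency, one issue per
# cycle, and no overhead outside the round instructions.

def pipelined_cycles_per_block(rounds, latency, blocks_in_flight):
    """Estimate cycles per block when independent blocks are interleaved."""
    # One pass over all blocks for a given round takes at least `latency`
    # cycles (each block waits on its own previous round) and at least
    # `blocks_in_flight` cycles (one issue slot per block).
    cycles_per_round_pass = max(latency, blocks_in_flight)
    return rounds * cycles_per_round_pass / blocks_in_flight

single = pipelined_cycles_per_block(rounds=10, latency=6, blocks_in_flight=1)
six = pipelined_cycles_per_block(rounds=10, latency=6, blocks_in_flight=6)
```

Under these assumptions a single block costs 60 cycles and six interleaved blocks cost 10 cycles each. The figure of about 12 quoted above is higher, presumably because of overhead this model leaves out. The model also shows why chained modes such as cbc encrypt gain nothing: there is only one block in flight.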
Re: [cryptography] 5x speedup for AES using SSE5?
Speaking of CPU-specific optimisations, I've seen a few algorithm proposals from the last few years that assume that an algorithm can be scaled linearly in the number of CPU cores, treating a multicore CPU as some kind of SIMD engine with all cores operating in lock-step, or at least engaging in some kind of rendezvous every couple of cycles (for example the recently-discussed MD6 uses a round of 16 steps, if I read the description correctly) to exchange data. This abstraction seems to be particularly convenient when dealing with things like hash trees.

However, I'm not aware of any multicore CPU that actually works this way: you'd need to have exclusive use of each core by one thread and use incredibly expensive (compared to the other primitive CPU operations used in hashing) barriers or something similar to ensure synchronisation. Is there some feature of multicore CPUs that I'm missing, or is it a case of cryptographers abstracting a bit too much away? And if it's the latter, should someone tell them that multicore CPUs don't actually work that way?

Peter.
Period for public comments on XTS (as standardized by IEEE std 1619-2007) ends Sept 3, 2008
Hi Folks,

Please remember that the 90-day public comment period for XTS ends Sept 3, which is coming up very quickly. If you have any comments you would like to submit to NIST concerning XTS-AES (as specified in IEEE Std 1619-2007), please send an e-mail to [EMAIL PROTECTED].

The excerpt of IEEE 1619-2007 that specifies XTS-AES will be removed after the public review period ends. If you would like to get a free copy of the XTS specification, this will be your last chance! See http://grouper.ieee.org/groups/1619tmp/1619-2007-NIST-Submission.pdf

[Original solicitation from NIST:]

Request for Public Comment on XTS
(See http://csrc.nist.gov/groups/ST/documents/Request-for-Public-Comment-on_XTS.pdf)

The P1619 Task Group of the Security in Storage Working Group (SISWG) of the Institute of Electrical and Electronics Engineers, Inc. (IEEE) has submitted the XTS-AES algorithm (XTS, for short) to NIST as an encryption mode of operation of the Advanced Encryption Standard (AES) block cipher. Although XTS does not provide authentication in order to avoid expansion of the data, it is designed to provide some protection against malicious manipulation of the encrypted data. Subject to the 90-day period of public comment that is described below, NIST proposes to approve XTS for government use under the auspices of FIPS Pub. 140-2.

XTS is specified in IEEE Std 1619-2007. IEEE has agreed to make a relevant extract from this standard available for free during the period of public comment. NIST proposes to approve the specification by reference to IEEE Std 1619-2007, while reserving the right to specify additional requirements/restrictions on XTS for government use. After the period of public comment, the standard would be available for purchase from IEEE for $85 to IEEE members and affiliates, and $105 to non-members.

The chair of the SISWG informed NIST that he is unaware of any patent claims on XTS, but that NeoScale Systems, subsequently acquired by nCipher, submitted a Letter of Assurance of Essential Patents to the IEEE, without elaborating on what aspect of IEEE 1619 was patented.

The period of public comment for this proposal is from June 5, 2008 to September 3, 2008. The extract of IEEE Std 1619-2007 is available for free during this period at http://grouper.ieee.org/groups/1619tmp/1619-2007-NIST-Submission.pdf. Comments may be submitted to [EMAIL PROTECTED].

NIST particularly invites comments on the following topics:

* The XTS algorithm itself;
* The depth of support in the storage industry for which it was designed;
* The appeal of XTS for wider applications;
* The proposal for the approved specification to be available only by purchase from IEEE;
* Concerns of intellectual property rights.

Thanks!
-Matt
Re: 5x speedup for AES using SSE5?
Eric Young wrote:
> I've not looked at it enough yet, but currently I'm doing an AES round
> in about 140 cycles a block (call it 13 per round plus overhead) on an
> AMD64 (220e6 bytes/sec on a 2ghz cpu) using normal instructions.

Urk, correction, I forgot I've recently upgraded from a 2ghz machine to 2.5ghz. So that should read about 182 cycles per block, and 18 cycles per round. I thought the number seemed strange :-(. I tend to always quote numbers from a 2-3 second run encrypting a 4k buffer, not a machine cycle counter over one or two blocks, so I leave myself open to this kind of error :-(

Still, looking further at the various SSE5 instructions, I'm having difficulty seeing how to avoid instruction dependencies when using the SIMD instructions (specifically using PPERM to implement the sbox).

eric
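The correction above is a straightforward unit conversion, which the following Python sketch reproduces from the quoted 220e6 bytes/sec. The function name `cycles_per_block_from_throughput` is hypothetical; the block size and round count are the standard AES-128 values.

```python
# Convert a measured throughput into cycles per 16-byte AES block.

AES_BLOCK_BYTES = 16
AES128_ROUNDS = 10

def cycles_per_block_from_throughput(bytes_per_sec, clock_hz):
    """Cycles spent per block, given bulk throughput and clock frequency."""
    return clock_hz * AES_BLOCK_BYTES / bytes_per_sec

at_2ghz = cycles_per_block_from_throughput(220e6, 2.0e9)    # clock first assumed
at_2_5ghz = cycles_per_block_from_throughput(220e6, 2.5e9)  # clock actually used
per_round = at_2_5ghz / AES128_ROUNDS
```

With the 2ghz assumption the result is roughly 145 cycles per block, in line with the "about 140" first quoted; at 2.5ghz it rounds to 182 cycles per block and 18 per round, matching the corrected figures.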