Re: 5x speedup for AES using SSE5?

2008-08-24 Thread Sidney Markowitz

Paul Crowley wrote, On 24/8/08 1:00 AM:

http://www.ddj.com/hpc-high-performance-computing/201803067
[...] However, glancing through the SSE5 specification, I 
can't see at all how such a dramatic speedup might be achieved


A commenter on slashdot hinted at the vector permutation instructions, 
similar to those on Altivec, being useful:


http://developers.slashdot.org/comments.pl?sid=284695&cid=20423869

Altivec is also known as VMX
http://en.wikipedia.org/wiki/AltiVec

That led me to this paper with a section on use of VMX vector operations 
in an AES implementation:


http://diploma-thesis.siewior.net/html/diplomarbeitch3.html

I didn't see performance comparisons or anything specific to SSE5, but 
it looks like the kind of thing that AMD might have meant.


 -- Sidney Markowitz
http://www.sidney.com

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-24 Thread Eric Young
Paul Crowley wrote:
> http://www.ddj.com/hpc-high-performance-computing/201803067
>
> In the above Dr Dobb's article from a little over a year ago, AMD
> Senior Fellow Leendert vanDoorn states "the Advanced Encryption
> Standard (AES) algorithm gets a factor of 5 performance improvement by
> using the new SSE5 extension".  However, glancing through the SSE5
> specification, I can't see at all how such a dramatic speedup might be
> achieved.  Does anyone know any more, or can anyone see more than I
> can in the spec?
>
> http://developer.amd.com/cpu/SSE5/Pages/default.aspx

I've only just seen this, but I've been playing with VIA's AES and
looking at Intel's AES instructions.

I believe the PPERM instruction will be rather important.  Combined with
the packed byte rotate and shift, some rather interesting SIMD byte
fiddles should be possible.

From my initial look, it should be possible to implement AES without
tables, doing SIMD operations on all 16 bytes at once.
I've not looked at it enough yet, but currently I'm doing an AES
encryption in about 140 cycles a block (call it 13 per round plus
overhead) on an AMD64 (220e6 bytes/sec on a 2 GHz CPU) using normal
instructions.  I don't believe they will be taking 30 instructions, so
they probably have 4-8 SSE instructions per round; it then comes down to
how many SSE execution units there are to execute in parallel.
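To make the "all 16 bytes at once" idea concrete, here is a minimal
Python sketch of the AES ShiftRows step written as a single byte
permutation of the 16-byte state, which is the kind of operation a
byte-permute instruction could perform in one go.  It is only a model:
the names permute_bytes and SHIFT_ROWS are made up for illustration, and
the actual PPERM control-byte encoding and operand layout are not
modelled here.

```python
# Illustrative model only: a plain byte shuffle, not real PPERM semantics.
# The AES state is stored column-major, so byte index i = row + 4 * column.
# ShiftRows rotates row r left by r columns; as a shuffle, output byte i
# is taken from input byte SHIFT_ROWS[i].
SHIFT_ROWS = [0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11]


def permute_bytes(state: bytes, control) -> bytes:
    """Return a new 16-byte state whose byte i is state[control[i]]."""
    if len(state) != 16 or len(control) != 16:
        raise ValueError("expected a 16-byte state and 16 control indices")
    return bytes(state[j] for j in control)


def shift_rows(state: bytes) -> bytes:
    """ShiftRows for one AES block, done as one 16-byte permutation."""
    return permute_bytes(state, SHIFT_ROWS)


if __name__ == "__main__":
    block = bytes(range(16))
    print(shift_rows(block).hex())
```

The point of the sketch is that the whole step is one data-independent
shuffle with no table lookups indexed by secret data.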

As for VIA, on a 1 GHz C7 part, CBC mode, 128-bit key, I'm getting about
24 cycles per block for 16-byte aligned data and about 67 cycles for
unaligned.  The chip does ECB mode at 12.6 cycles a block if aligned (2
at a time).  It does not handle unaligned ECB, so with manual alignment,
75 cycles.  Not bad for a single-issue CPU considering the x86
instruction version of AES I have takes 1010 cycles per block.

For the Intel AES instructions, from my readings, it will be able to do
a single AES (128-bit) in a bit more than 60 cycles (10 rounds, 6 cycle
latency for the instructions).  The good part is that they will
pipeline.  So if you, say, do 6 AES ECB blocks at once, you can get a
throughput of about 12 cycles a block (Intel's figures).  This is
obviously of relevance for counter mode, CBC decrypt and more recent
standards like XTS and GCM mode.
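The latency versus throughput arithmetic above can be sketched with a
very rough model.  The assumptions are mine, not Intel's: one round
instruction can issue per cycle, each round of a block must wait for the
previous round of the same block, and key setup, loads and stores are
ignored.  The function name cycles_per_block is hypothetical.

```python
def cycles_per_block(blocks_in_flight: int, rounds: int = 10,
                     latency: int = 6) -> float:
    """Estimate cycles per block when several independent blocks are
    interleaved through a pipelined AES round instruction.

    Assumes one round instruction issues per cycle.  A round for a given
    block cannot start until that block's previous round has finished,
    so each "wave" of rounds costs max(latency, blocks_in_flight) cycles.
    """
    if blocks_in_flight < 1:
        raise ValueError("need at least one block in flight")
    cycles_per_wave = max(latency, blocks_in_flight)
    return rounds * cycles_per_wave / blocks_in_flight


if __name__ == "__main__":
    for n in (1, 2, 3, 6):
        print(n, "blocks in flight:", cycles_per_block(n), "cycles/block")
```

Under these assumptions one block alone costs 60 cycles and six
interleaved blocks cost 10 cycles each; the quoted figure of about 12
would then include overhead that this model leaves out.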

Part of the Intel justification for the AES instructions seems to be
stopping cache timing attacks.  If the SSE5 instructions allow AES
to be done with SIMD instead of tables, they will achieve the same
effect, but without as much parallel upside.

It looks like the GF(2^8) maths will also benefit.
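For reference, this is the GF(2^8) arithmetic in question, as a small
Python sketch: multiplication modulo the AES polynomial
x^8 + x^4 + x^3 + x + 1 (0x11B), and the S-box computed as a field
inversion followed by the affine transform, with no lookup table.  It is
a scalar, one-byte-at-a-time illustration of the maths from FIPS 197,
not a SIMD implementation and not a claim about how SSE5 code would be
structured; the function names are my own.

```python
def xtime(a: int) -> int:
    """Multiply a by x (i.e. 2) in GF(2^8) modulo 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B  # reduce by the AES polynomial
    return a & 0xFF


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements using shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254; 0 maps to 0 as AES requires."""
    result, power, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        exp >>= 1
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(a: int) -> int:
    """AES S-box: field inverse, then the affine transform."""
    x = gf_inv(a)
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


if __name__ == "__main__":
    print(hex(gf_mul(0x57, 0x83)), hex(sbox(0x53)))
```

Every step here is a fixed sequence of shifts and XORs, so nothing is
indexed by secret data.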


eric (who has only been able to play with VIA hardware :-(



Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-24 Thread Peter Gutmann
Speaking of CPU-specific optimisations, I've seen a few algorithm proposals
from the last few years that assume that an algorithm can be scaled linearly
in the number of CPU cores, treating a multicore CPU as some kind of SIMD
engine with all cores operating in lock-step, or at least engaging in some
kind of rendezvous every couple of cycles (for example the recently-discussed
MD6 uses a round of 16 steps, if I read the description correctly) to exchange
data.  This abstraction seems to be particularly convenient when dealing with
things like hash trees.  However, I'm not aware of any multicore CPU that
actually works this way; you'd need to have exclusive use of each core by one
thread and use incredibly expensive (compared to the other primitive CPU
operations used in hashing) barriers or something similar to ensure
synchronisation.

Is there some feature of multicore CPUs that I'm missing, or is it a case of
cryptographers abstracting a bit too much away?  And if it's the latter,
should someone tell them that multicore CPUs don't actually work that way?

Peter.



Period for public comments on XTS (as standardized by IEEE std 1619-2007) ends Sept 3, 2008

2008-08-24 Thread Matt Ball
Hi Folks,

Please remember that the 90-day public comment period for XTS ends
Sept 3, which is coming up very quickly.  If you have any comments you
would like to submit to NIST concerning XTS-AES (as specified in IEEE
Std 1619-2007), please send an e-mail to [EMAIL PROTECTED]

The excerpt of IEEE 1619-2007 that specifies XTS-AES will be removed
after the public review period ends.  If you would like to get a free
copy of the XTS specification, this will be your last chance!  See
http://grouper.ieee.org/groups/1619tmp/1619-2007-NIST-Submission.pdf

[Original solicitation from NIST:]

Request for Public Comment on XTS (See
http://csrc.nist.gov/groups/ST/documents/Request-for-Public-Comment-on_XTS.pdf)

The P1619 Task Group of the Security in Storage Working Group (SISWG)
of the Institute of Electrical and Electronics Engineers, Inc. (IEEE)
has submitted the XTS-AES algorithm (XTS, for short) to NIST as an
encryption mode of operation of the Advanced Encryption Standard (AES)
block cipher. Although XTS does not provide authentication in order to
avoid expansion of the data, it is designed to provide some protection
against malicious manipulation of the encrypted data. Subject to the
90-day period of public comment that is described below, NIST proposes
to approve XTS for government use under the auspices of FIPS Pub.
140-2.

XTS is specified in IEEE Std 1619-2007. IEEE has agreed to make a
relevant extract from this standard available for free during the
period of public comment. NIST proposes to approve the specification
by reference to IEEE Std 1619-2007, while reserving the right to
specify additional requirements/restrictions on XTS for government
use. After the period of public comment, the standard would be
available for purchase from IEEE for $85 to IEEE members and
affiliates, and $105 to non-members. The chair of the SISWG informed
NIST that he is unaware of any patent claims on XTS, but that NeoScale
Systems, subsequently acquired by nCipher, submitted a Letter of
Assurance of Essential Patents to the IEEE, without elaborating on
what aspect of IEEE 1619 was patented.

The period of public comment for this proposal is from June 5, 2008 to
September 3, 2008. The extract of IEEE Std 1619-2007 is available for
free during this period at
http://grouper.ieee.org/groups/1619tmp/1619-2007-NIST-Submission.pdf.

Comments may be submitted to [EMAIL PROTECTED].  NIST
particularly invites comments on the following topics:
* The XTS algorithm itself;
* The depth of support in the storage industry for which it was designed;
* The appeal of XTS for wider applications;
* The proposal for the approved specification to be available only by
purchase from IEEE;
* Concerns of intellectual property rights.

Thanks!
-Matt



Re: 5x speedup for AES using SSE5?

2008-08-24 Thread Eric Young
Eric Young wrote:
> I've not looked at it enough yet, but currently I'm doing an AES round
> in about 140 cycles a block (call it 13 per round plus overhead) on a
> AMD64, (220e6 bytes/sec on a 2ghz cpu) using normal instructions. 
Urk, correction, I forgot I've recently upgraded from a 2 GHz machine to
2.5 GHz.
So that should read about 182 cycles per block, and 18 cycles per round.
I thought the number seemed strange :-(.  I tend to always quote numbers
from a 2-3 second run encrypting a 4k buffer, not a machine cycle
counter over one or two blocks, so I leave myself open to this kind of
error :-(
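The conversion behind those numbers is easy to check.  A minimal Python
sketch, assuming a 16-byte AES block and 10 rounds for a 128-bit key;
the function name is made up for illustration:

```python
def cycles_per_block_from_throughput(clock_hz: float, bytes_per_sec: float,
                                     block_bytes: int = 16) -> float:
    """Convert a measured throughput into cycles per cipher block."""
    cycles_per_byte = clock_hz / bytes_per_sec
    return cycles_per_byte * block_bytes


if __name__ == "__main__":
    measured = 220e6  # bytes/sec, the figure quoted above
    for clock in (2.0e9, 2.5e9):
        per_block = cycles_per_block_from_throughput(clock, measured)
        # 10 rounds for AES-128; per-round figure includes all overhead
        print(f"{clock / 1e9:.1f} GHz: {per_block:.0f} cycles/block, "
              f"{per_block / 10:.1f} cycles/round")
```

At 2.5 GHz this gives roughly 182 cycles per block and about 18 per
round, matching the corrected figures.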

Still, looking further at the various SSE5 instructions, I'm having
difficulty seeing how to avoid instruction dependencies when using the
SIMD instructions (specifically using PPERM to implement the S-box).

eric
