Re: [PATCH 0/7] Phase one of sparc crypto opcode support.

2012-09-22 Thread Andy Polyakov

 No, before thinking about 32-bit mode, I quickly ask what's with save-s
 without arguments?

 Sorry, I just wrote that code as pseudo-code off the top of my
 head without attending to all of the necessary details.

 We would indeed need to allocate a minimal stack frame in each
 save instruction.

 It's just an oversight in my example code, that's all.


But the main question was about how a context switch is handled between
a save and, say, montmul. I mean the part of my message after "save-s
ought to allocate frames".

__
OpenSSL Project http://www.openssl.org
Development Mailing List   openssl-dev@openssl.org
Automated List Manager   majord...@openssl.org


Re: [PATCH 0/7] Phase one of sparc crypto opcode support.

2012-09-22 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Sat, 22 Sep 2012 19:09:27 +0200

 No, before thinking about 32-bit mode, I quickly ask what's with
 save-s
 without arguments?
 Sorry, I just wrote that code as pseudo-code off the top of my
 head without attending to all of the necessary details.
 We would indeed need to allocate a minimal stack frame in each
 save instruction.
 It's just an oversight in my example code, that's all.
 
 But the main question was about how a context switch is handled between
 a save and, say, montmul. I mean the part of my message after "save-s
 ought to allocate frames".

I'm confused.

The cpu has 8 register windows.

This means that we can save down 7 times and fill all of the
registers in each window with the values we need.

At each save we allocate the minimal stack frame, at least
enough for the spill/fill trap handlers to save the register
window if needed.

The montmul instruction occurs in the deepest register window.

The cpu will force all 7 register windows to be restored, if
needed, if some spills have occurred due to context switches
or similar.


Re: [PATCH 0/7] Phase one of sparc crypto opcode support.

2012-09-22 Thread Andy Polyakov

 But the main question was about how a context switch is handled between
 a save and, say, montmul. I mean the part of my message after "save-s
 ought to allocate frames".


 I'm confused.

 The cpu has 8 register windows.

 This means that we can save down 7 times and fill all of the
 registers in each window with the values we need.

 At each save we allocate the minimal stack frame, at least
 enough for the spill/fill trap handlers to save the register
 window if needed.

 The montmul instruction occurs in the deepest register window.

 The cpu will force all 7 register windows to be restored, if
 needed, if some spills have occurred due to context switches
 or similar.


The question was whether that's actually the case, i.e. that all the
register windows are *in fact* restored. And you say they are. Just wanted
to hear it. I wondered about the specific mechanism by which it's achieved
(does montmul trigger a window trap), but that's more out of curiosity,
i.e. the question is optional and you don't have to answer.



Re: [PATCH 0/7] Phase one of sparc crypto opcode support.

2012-09-22 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Sat, 22 Sep 2012 20:11:11 +0200

 I wondered about the specific mechanism by which it's achieved
 (does montmul trigger a window trap),

Yes, this is exactly what the instruction does.

It issues fill traps until the CANRESTORE register is NWINDOWS-2.
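
A toy model of that rule, for illustration only. It assumes NWINDOWS = 8
(the figure given earlier in the thread) and that each fill trap restores
exactly one window. The function name is made up here; this is not kernel
or hardware code.

```python
NWINDOWS = 8  # register windows on the cpu, as stated in the thread


def fill_traps_needed(canrestore, nwindows=NWINDOWS):
    """Fill traps required before CANRESTORE reaches nwindows - 2.

    Assumes one fill trap restores one register window.
    """
    target = nwindows - 2
    if not 0 <= canrestore <= target:
        raise ValueError("CANRESTORE out of range")
    return target - canrestore
```

Under these assumptions, a thread resumed after a context switch with only
its current window resident (CANRESTORE = 0) would take six fill traps
before the instruction proceeds.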



Re: [PATCH 0/7] Phase one of sparc crypto opcode support.

2012-09-21 Thread Andy Polyakov
 You mentioned Montgomery BN.
 
 Here is how the instructions work.
 
 The basic model is that there is a range of sizes supported by the
 instruction, and all of the data is loaded into a combination of
 the floating point registers and all of the register windows of
 the cpu.

Ouch!

   ...
 
   save
 
   ...
 
   restore
   ...
 
 Of course, you might quickly ask what happens in 32-bit mode?

No, before thinking about 32-bit mode, I quickly ask what's with save-s
without arguments? And what happens if a context switch strikes in the
middle? A save without arguments means that %sp is effectively
uninitialized, so any attempt to reference the stack [during a context
switch or asynchronous signal delivery] is either doomed or corrupts the
stack. So save-s ought to allocate frames.

But even then [and in 64-bit mode], do the instructions in question ensure
that the register windows are loaded prior to execution? Consider a context
switch between a save and, say, montmul. The kernel dumps all windows to the
stack, and when execution resumes it normally brings in only the top window
and lets window traps bring in the remaining ones on demand. So before the
instructions in question can start actual processing, all windows have to
be loaded. Presumably the instructions can trigger a window trap, and then
the kernel would have to see that it was one of these instructions that
triggered it and act accordingly, i.e. bring in all the windows. Does it
work that way? Or do I have it backwards? I assume that the instructions in
question are uninterruptible, so that a trap can be generated only prior to
the calculation...


Re: [PATCH 0/7] Phase one of sparc crypto opcode support.

2012-09-21 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Fri, 21 Sep 2012 11:36:16 +0200

 No, before thinking about 32-bit mode, I quickly ask what's with save-s
 without arguments?

Sorry, I just wrote that code as pseudo-code off the top of my
head without attending to all of the necessary details.

We would indeed need to allocate a minimal stack frame in each
save instruction.

It's just an oversight in my example code, that's all.


Re: [PATCH 0/7] Phase one of sparc crypto opcode support.

2012-09-20 Thread Andy Polyakov
There is no need to send me a personal copy.

 This is the first phase of changes to support the new cryptographic
 opcodes found starting in the SPARC-T4 processor.

Cool.

 Oracle provided me with programmer's manuals that document these
 instructions, and I've been promised that these would be made public
 at some point in the not too distant future.

Could you ask your contact if they could provide a second copy for
OpenSSL? You mentioned Montgomery BN. There will be intersections with
other platforms. I mean there is interest in providing an alternative
framework for exponentiation that would benefit such cases, and having a
look at multiple platforms including T4 would help to choose a better
strategy.

There will be more replies, but not right away.


Re: [PATCH 0/7] Phase one of sparc crypto opcode support.

2012-09-20 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Thu, 20 Sep 2012 11:23:03 +0200

 There is no need to send me a personal copy.

Ok, I was simply acknowledging the author of the code I was
touching :-)

 Could you ask your contact if they could provide a second copy for
 OpenSSL?

I'll see what I can do, it took me more than a year of tireless
work and daily poking to get a copy for myself from people I've
been interacting with for a decade.

 You mentioned Montgomery BN. There will be intersections with
 other platforms. I mean there is interest in providing an alternative
 framework for exponentiation that would benefit such cases, and having a
 look at multiple platforms including T4 would help to choose a better
 strategy.

Here is how the instructions work.

The basic model is that there is a range of sizes supported by the
instruction, and all of the data is loaded into a combination of
the floating point registers and all of the register windows of
the cpu.

For example, the montmul (Montgomery Multiply) instruction simply has
a 5-bit immediate field which indicates the size of the operands.
If it is set to N the operands are (N + 1) * 64 bits in size.
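
The size encoding can be sketched as follows. This is only arithmetic on the
rule just stated; the helper names are made up for illustration.

```python
WORD_BITS = 64
MAX_FIELD = (1 << 5) - 1  # largest value a 5-bit immediate can hold


def operand_bits(n):
    """Operand size in bits for immediate value n: (n + 1) * 64."""
    if not 0 <= n <= MAX_FIELD:
        raise ValueError("immediate must fit in 5 bits")
    return (n + 1) * WORD_BITS


def size_field(bits):
    """Immediate value that selects an operand size of `bits` bits."""
    if bits <= 0 or bits % WORD_BITS:
        raise ValueError("size must be a positive multiple of 64 bits")
    n = bits // WORD_BITS - 1
    if n > MAX_FIELD:
        raise ValueError("operand too large for a 5-bit immediate")
    return n
```

So the field covers operands from 64 bits (N = 0) up to 2048 bits (N = 31).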

Nprime is stored in register %f60.

A[] values are stored in float and integer registers (integers go into
register window 5), in this order:

%l0,  %l1,  %l2,  %l3,  %l4,  %l5,  %l6,  %l7
%o0,  %o1,  %o2,  %o3,  %o4,  %o5,  %f24, %f26
%f28, %f30, %f32, %f34, %f36, %f38, %f40, %f42
%f44, %f46, %f48, %f50, %f52, %f54, %f56, %f58
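
As a cross-check, the A[] order above can be written out programmatically.
It gives 32 slots, which matches the largest operand a 5-bit N can select.
The names below are illustrative, and the idea that a smaller operand
occupies the leading slots of this order is an assumption, not something
stated in this mail.

```python
# A[] slots in the order listed above: %l0-%l7, %o0-%o5, then the
# even-numbered float registers %f24 through %f58.
A_REGS = (
    ["%l" + str(i) for i in range(8)]
    + ["%o" + str(i) for i in range(6)]
    + ["%f" + str(i) for i in range(24, 60, 2)]
)


def a_registers(n):
    """Slots holding an (n + 1)-word A[] operand.

    ASSUMPTION: smaller operands use the first n + 1 slots of the order.
    """
    if not 0 <= n < len(A_REGS):
        raise ValueError("n out of range")
    return A_REGS[: n + 1]
```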

B[] values are stored in integer registers (3 register windows, 2 to 0):

%o0, %o1, %o2, %o3, %o4, %o5            (register window 2)
%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7  (register window 1)
%o0, %o1, %o2, %o3, %o4, %o5            (register window 1)
%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7  (register window 0)
%o0, %o1, %o2, %o3                      (register window 0)

Similarly for the other inputs, you can see the pattern in use here.
The result is left in register window 5.  If an internal ECC error
occurs on the register file during the operation, %fcc3 will be set to
unordered.  This means there needs to be a limited retry loop over
this condition.
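
The shape of such a limited retry loop might look like the sketch below.
Here issue_op is a hypothetical callable standing in for "reload the inputs,
execute the instruction, check %fcc3"; it returns the result together with a
flag saying whether the unordered condition was seen. The attempt limit of 3
is arbitrary.

```python
def run_with_retry(issue_op, max_attempts=3):
    """Run issue_op() until it reports no ECC error or attempts run out.

    issue_op is a hypothetical callable returning (result, ecc_error).
    Returns (result, attempts_used) on success.
    """
    for attempt in range(1, max_attempts + 1):
        result, ecc_error = issue_op()
        if not ecc_error:
            return result, attempt
    raise RuntimeError(
        "ECC error persisted after %d attempts" % max_attempts
    )
```

On persistent failure a real implementation would presumably fall back to
the software path rather than raise, but that choice is not discussed here.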

So basically the implementation starts at register window zero, loads
all the initial values of B[], does a 'save', loads the middle values
of B[], does a 'save', loads the last values of B[].

Then it moves on to N[], which goes into register windows 2, 3, and
4.

Next comes A[], in floating point registers and register window 5.

And finally M[], in floating point registers and register window 6.

Nprime is loaded into %f60 and the montmul instruction is executed.

This instruction can essentially be used directly via the
bn_mul_mont() function signature in OpenSSL.  I don't think
any special changes are necessary to facilitate the use of these
instructions.
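
For reference, the quantity a Montgomery multiply produces is
a * b * R^-1 mod n, with R = 2^(64 * words). Below is a small word-serial
model of that computation on Python integers. It is a sketch for checking
results, not the hardware algorithm, and it assumes Nprime is the
conventional -n^-1 mod 2^64 (this mail does not spell out its definition).
All names are illustrative.

```python
def mont_mul(a, b, n, num_words, word_bits=64):
    """Return a * b * R^-1 mod n, where R = 2^(word_bits * num_words).

    Requires n odd and 0 <= a, b < n < R. Needs Python 3.8+ for pow(x, -1, m).
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    base = 1 << word_bits
    mask = base - 1
    # ASSUMPTION: Nprime = -n^-1 mod 2^word_bits, the usual convention.
    nprime = (-pow(n, -1, base)) % base
    t = 0
    for i in range(num_words):
        b_i = (b >> (word_bits * i)) & mask   # next word of b
        t += a * b_i
        m = ((t & mask) * nprime) & mask      # makes t + m*n divisible by base
        t = (t + m * n) >> word_bits
    if t >= n:                                # one final conditional subtract
        t -= n
    return t
```

A quick sanity check is that multiplying by R mod n returns the other operand
unchanged, since the R and R^-1 factors cancel.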

The 'montsqr' (Montgomery Square) instruction uses the same scheme
and layout as 'montmul' for inputs and outputs.

Finally 'mpmul' (Multiple Precision Multiply) has a similar flavor
to montmul and montsqr, in that multiple register windows and the
floating point registers are used to load the inputs all at once for
the operation.

Again, a 5-bit immediate field 'N' encodes the size of the operands,
as (N + 1) * 64 bits.

The multiplier goes into a mixture of float regs and integer registers
in register window 6.  The multiplicand goes into a mixture of float
regs and integer registers in register window 5, and the product goes
into integer registers in register windows 4, 3, 2, 1, and 0.

For example, to do a 2048 bit multiply given a pointer to the
multiplier in %g1, a pointer to the multiplicand in %g2, and
a pointer to the place to store the product in %g3 one would
go:

/* Register window 6 */
ldd [%g1 + 0x000], %f22
ldd [%g1 + 0x008], %f20
ldd [%g1 + 0x010], %f18
ldd [%g1 + 0x018], %f16
ldd [%g1 + 0x020], %f14
ldd [%g1 + 0x028], %f12
ldd [%g1 + 0x030], %f10
ldd [%g1 + 0x038], %f8
ldd [%g1 + 0x040], %f6
ldd [%g1 + 0x048], %f4
ldx [%g1 + 0x050], %i5
ldx [%g1 + 0x058], %i4
ldx [%g1 + 0x060], %i3
ldx [%g1 + 0x068], %i2
ldx [%g1 + 0x070], %i1
ldx [%g1 + 0x078], %i0
ldx [%g1 + 0x080], %l7
ldx [%g1 + 0x088], %l6
ldx [%g1 + 0x090], %l5
ldx [%g1 + 0x098], %l4
ldx [%g1 + 0x0a0], %l3
ldx [%g1 + 0x0a8], %l2
ldx [%g1 + 0x0b0], %l1
ldx [%g1 + 0x0b8], %l0
ldd [%g1 + 0x0c0], %f2
ldd [%g1 + 0x0c8], %f0
ldx [%g1 + 0x0d0], %o5
ldx [%g1 + 0x0d8], %o4
ldx [%g1 + 0x0e0], %o3
ldx [%g1 + 0x0e8], %o2
ldx [%g1 + 0x0f0], %o1
ldx [%g1 + 0x0f8], %o0

save

/* Register window 5 */
ldd [%g2 + 0x000], %f58
ldd [%g2 + 0x008], %f56
  

[PATCH 0/7] Phase one of sparc crypto opcode support.

2012-09-19 Thread David Miller

This is the first phase of changes to support the new cryptographic
opcodes found starting in the SPARC-T4 processor.

It first builds the infrastructure for feature presence detection,
then adds support for all of the hashing functions implemented in
current cpus (MD5, SHA1, SHA256, SHA512).

Here are some benchmarks on a SPARC T4-2 with these changes applied.

type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              14423.71k    50416.70k   173663.49k   445940.05k   816587.27k
sha1             33231.78k   115492.48k   318273.91k   579320.83k   759701.50k
sha256           46641.41k   157805.85k   419859.54k   708643.16k   889514.67k
sha512           50184.57k   202770.99k   529172.57k  1023763.11k  1405414.06k

These numbers, with crypto-opcode-disabled numbers for comparison,
are duplicated in the relevant patch log messages.

I have cipher patches for AES, DES, and CAMELLIA as well but I would
like to refine them a bit before I make a formal submission.  And once
I get those changes refined I will work on Montgomery multiply,
Montgomery square, etc. which these chips also support directly.

I've tested these changes on all of {static,shared}/linux{,64}-sparcv9

Oracle provided me with programmer's manuals that document these
instructions, and I've been promised that these would be made public
at some point in the not too distant future.  But these instructions
are very straightforward, and when I post the AES changes later on
you will see that the AES instructions are virtually identical to
the AESNI stuff from Intel.

All of the patches are against mainline, but should be backportable
to 1.0.x without much difficulty.