Re: [PATCH 0/7] Phase one of sparc crypto opcode support.
>> No, before thinking about 32-bit mode, I quickly ask what's with save-s without arguments?
>
> Sorry, I just wrote that code as pseudo-code off the top of my head without attending to all of the necessary details. We would indeed need to allocate a minimal stack frame in each save instruction. It's just an oversight in my example code, that's all.

But the main question was about how a context switch is handled between a save and, say, montmul. I mean the part after "save-s ought to allocate frames".

__
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager majord...@openssl.org
Re: [PATCH 0/7] Phase one of sparc crypto opcode support.
From: Andy Polyakov ap...@openssl.org
Date: Sat, 22 Sep 2012 19:09:27 +0200

>>> No, before thinking about 32-bit mode, I quickly ask what's with save-s without arguments?
>>
>> Sorry, I just wrote that code as pseudo-code off the top of my head without attending to all of the necessary details. We would indeed need to allocate a minimal stack frame in each save instruction. It's just an oversight in my example code, that's all.
>
> But the main question was about how a context switch is handled between a save and, say, montmul. I mean the part after "save-s ought to allocate frames".

I'm confused.

The cpu has 8 register windows. This means that we can save down 7 times and fill all of the registers in each window with the values we need. At each save we allocate the minimal stack frame, at least enough for the spill/fill trap handlers to save the register window if needed.

The montmul instruction occurs in the deepest register window. The cpu will force all 7 register windows to be restored, if needed, if some spills have occurred due to context switches or similar.
Re: [PATCH 0/7] Phase one of sparc crypto opcode support.
>> But the main question was about how a context switch is handled between a save and, say, montmul. I mean the part after "save-s ought to allocate frames".
>
> I'm confused. The cpu has 8 register windows. This means that we can save down 7 times and fill all of the registers in each window with the values we need. At each save we allocate the minimal stack frame, at least enough for the spill/fill trap handlers to save the register window if needed. The montmul instruction occurs in the deepest register window. The cpu will force all 7 register windows to be restored, if needed, if some spills have occurred due to context switches or similar.

The question was whether that is actually the case, i.e. that all the register windows are *in fact* restored. And you say they are. I just wanted to hear it. I wondered about the specific mechanism by which it's achieved (does montmul trigger a window trap?), but that's more out of curiosity, i.e. the question is optional and you don't have to answer.
Re: [PATCH 0/7] Phase one of sparc crypto opcode support.
From: Andy Polyakov ap...@openssl.org
Date: Sat, 22 Sep 2012 20:11:11 +0200

> I wondered about the specific mechanism by which it's achieved (does montmul trigger a window trap?),

Yes, this is exactly what the instruction does. It issues fill traps until the CANRESTORE register is NWINDOWS-2.
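The rule in this reply can be illustrated with a toy calculation. This is only a sketch, not the hardware's or the kernel's actual logic: it takes NWINDOWS = 8 from earlier in the thread, and it assumes (purely for illustration) that each fill trap restores exactly one window. The function name is made up.

```python
NWINDOWS = 8  # number of register windows, as stated earlier in the thread


def fill_traps_needed(canrestore: int, nwindows: int = NWINDOWS) -> int:
    """Toy model: fill traps issued before CANRESTORE reaches NWINDOWS-2.

    Assumes, for illustration only, that each fill trap restores exactly
    one window, i.e. raises CANRESTORE by one.
    """
    target = nwindows - 2
    if not 0 <= canrestore <= target:
        raise ValueError("CANRESTORE must be in 0..NWINDOWS-2")
    return target - canrestore
```

Under these assumptions, a thread resumed with no restorable windows (CANRESTORE = 0) would take six fill traps before the instruction proceeds, and one that was never spilled would take none.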
Re: [PATCH 0/7] Phase one of sparc crypto opcode support.
>> You mentioned Montgomery BN.
>
> Here is how the instructions work. The basic model is that there is a range of sizes supported by the instruction, and all of the data is loaded into a combination of the floating point registers and all of the register windows of the cpu.

Ouch!

> ... save ... restore ...
>
> Of course, you might quickly ask what happens in 32-bit mode?

No, before thinking about 32-bit mode, I quickly ask what's with save-s without arguments? I quickly ask what happens if a context switch strikes in the middle? A save without an argument means that %sp will be effectively uninitialized, and attempts to refer to the stack [during a context switch or asynchronous signal delivery] are either doomed or corrupt the stack. So save-s ought to allocate frames.

But even then, [and in 64-bit mode], do the instructions in question ensure that the register windows are loaded prior to execution? I mean, consider a context switch between a save and, say, montmul. The kernel dumps all windows to the stack, and when execution resumes it normally brings in only the one top window and lets window traps bring in the remaining ones on demand. So before the instructions in question can start actual processing, all windows have to be loaded. Presumably the instructions can trigger a window trap; the kernel would then have to see that it's one of these instructions that triggered it and act accordingly, i.e. bring in all the windows. Does it work that way? Or do I have it backwards? I assume that the instructions in question are uninterruptible, so that a trap can be generated only prior to the calculation...
Re: [PATCH 0/7] Phase one of sparc crypto opcode support.
From: Andy Polyakov ap...@openssl.org
Date: Fri, 21 Sep 2012 11:36:16 +0200

> No, before thinking about 32-bit mode, I quickly ask what's with save-s without arguments?

Sorry, I just wrote that code as pseudo-code off the top of my head without attending to all of the necessary details.

We would indeed need to allocate a minimal stack frame in each save instruction. It's just an oversight in my example code, that's all.
Re: [PATCH 0/7] Phase one of sparc crypto opcode support.
There is no need to send me a personal copy.

> This is the first phase of changes to support the new cryptographic opcodes found starting in the SPARC-T4 processor.

Cool.

> Oracle provided me with programmer's manuals that document these instructions, and I've been promised that these would be made public at some point in the not too distant future.

Could you ask your contact if they could provide a second copy for OpenSSL?

You mentioned Montgomery BN. There will be intersections with other platforms. I mean there is interest in providing an alternative framework for exponentiation that would benefit such cases, and having a look at multiple platforms including T4 would help to choose a better strategy.

There will be more replies, but not right away.
Re: [PATCH 0/7] Phase one of sparc crypto opcode support.
From: Andy Polyakov ap...@openssl.org
Date: Thu, 20 Sep 2012 11:23:03 +0200

> There is no need to send me a personal copy.

Ok, I was simply acknowledging the author of the code I was touching :-)

> Could you ask your contact if they could provide a second copy for OpenSSL?

I'll see what I can do, it took me more than a year of tireless work and daily poking to get a copy for myself from people I've been interacting with for a decade.

> You mentioned Montgomery BN. There will be intersections with other platforms. I mean there is interest in providing an alternative framework for exponentiation that would benefit such cases, and having a look at multiple platforms including T4 would help to choose a better strategy.

Here is how the instructions work.

The basic model is that there is a range of sizes supported by the instruction, and all of the data is loaded into a combination of the floating point registers and all of the register windows of the cpu.

For example, the montmul (Montgomery Multiply) instruction simply has a 5-bit immediate field which indicates the size of the operands. If it is set to N the operands are (N + 1) * 64-bits in size.

Nprime is stored in register %f60.

A[] values are stored in float and integer registers (integers go into register window 5), in this order:

	%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7
	%o0, %o1, %o2, %o3, %o4, %o5, %f24, %f26
	%f28, %f30, %f32, %f34, %f36, %f38, %f40, %f42
	%f44, %f46, %f48, %f50, %f52, %f54, %f56, %f58

B[] values are stored in integer registers (3 register windows, 2 to 0):

	%o0, %o1, %o2, %o3, %o4, %o5 (register window 2)
	%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7 (register window 1)
	%o0, %o1, %o2, %o3, %o4, %o5 (register window 1)
	%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7 (register window 0)
	%o0, %o1, %o2, %o3 (register window 0)

Similarly for the other inputs, you can see the pattern in use here. The result is left in register window 5.
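The size encoding described above can be checked with a little arithmetic. A minimal sketch, assuming only what the message states (a 5-bit immediate N, operands of (N + 1) * 64 bits); the function names are made up for illustration and belong to neither OpenSSL nor any assembler.

```python
IMM_BITS = 5    # width of the immediate field, per the description above
WORD_BITS = 64  # each operand word is 64 bits


def operand_bits(n: int) -> int:
    """Operand size in bits for immediate value n: (n + 1) * 64."""
    if not 0 <= n < (1 << IMM_BITS):
        raise ValueError("immediate must fit in 5 bits")
    return (n + 1) * WORD_BITS


def immediate_for_bits(bits: int) -> int:
    """Inverse mapping: immediate value for an operand size in bits."""
    if bits <= 0 or bits % WORD_BITS != 0:
        raise ValueError("size must be a positive multiple of 64 bits")
    n = bits // WORD_BITS - 1
    if n >= (1 << IMM_BITS):
        raise ValueError("size too large for a 5-bit immediate")
    return n
```

So the field can express sizes from 64 bits (N = 0) up to 2048 bits (N = 31) in 64-bit steps, which matches the 32 registers listed for A[] and for B[]. Whether the hardware accepts every value in that range is not stated here.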
If an internal ECC error occurs on the register file during the operation, %fcc3 will be set to unordered. This means there needs to be a limited retry loop over this condition.

So basically the implementation starts at register window zero, loads all the initial values of B[], does a 'save', loads the middle values of B[], does a 'save', loads the last values of B[]. Then it moves on to N[], which goes into register windows 2, 3, and 4. Next comes A[], in floating point registers and register window 5. And finally M[], in floating point registers and register window 6. Nprime is loaded into %f60 and the montmul instruction is executed.

This instruction can essentially be used directly via the bn_mul_mont() function signature in OpenSSL. I don't think any special changes are necessary to facilitate the use of these instructions.

The 'montsqr' (Montgomery Square) instruction uses the same scheme and layout as 'montmul' for inputs and outputs.

Finally 'mpmul' (Multiple Precision Multiply) has a similar flavor to montmul and montsqr, in that multiple register windows and the floating point registers are used to load the inputs all at once for the operation. Again, a 5-bit immediate field 'N' encodes the size of the operands, as (N + 1) * 64-bits.

The multiplier goes into a mixture of float regs and integer registers in register window 6. The multiplicand goes into a mixture of float regs and integer registers in register window 5, and the product goes into integer registers in register windows 4, 3, 2, 1, and 0.
For example, to do a 2048-bit multiply, given a pointer to the multiplier in %g1, a pointer to the multiplicand in %g2, and a pointer to the place to store the product in %g3, one would go:

	/* Register window 6 */
	ldd	[%g1 + 0x000], %f22
	ldd	[%g1 + 0x008], %f20
	ldd	[%g1 + 0x010], %f18
	ldd	[%g1 + 0x018], %f16
	ldd	[%g1 + 0x020], %f14
	ldd	[%g1 + 0x028], %f12
	ldd	[%g1 + 0x030], %f10
	ldd	[%g1 + 0x038], %f8
	ldd	[%g1 + 0x040], %f6
	ldd	[%g1 + 0x048], %f4
	ldx	[%g1 + 0x050], %i5
	ldx	[%g1 + 0x058], %i4
	ldx	[%g1 + 0x060], %i3
	ldx	[%g1 + 0x068], %i2
	ldx	[%g1 + 0x070], %i1
	ldx	[%g1 + 0x078], %i0
	ldx	[%g1 + 0x080], %l7
	ldx	[%g1 + 0x088], %l6
	ldx	[%g1 + 0x090], %l5
	ldx	[%g1 + 0x098], %l4
	ldx	[%g1 + 0x0a0], %l3
	ldx	[%g1 + 0x0a8], %l2
	ldx	[%g1 + 0x0b0], %l1
	ldx	[%g1 + 0x0b8], %l0
	ldd	[%g1 + 0x0c0], %f2
	ldd	[%g1 + 0x0c8], %f0
	ldx	[%g1 + 0x0d0], %o5
	ldx	[%g1 + 0x0d8], %o4
	ldx	[%g1 + 0x0e0], %o3
	ldx	[%g1 + 0x0e8], %o2
	ldx	[%g1 + 0x0f0], %o1
	ldx	[%g1 + 0x0f8], %g1
	save

	/* Register window 5 */
	ldd	[%g2 + 0x000], %f58
	ldd	[%g2 + 0x008], %f56
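The "limited retry loop" mentioned earlier in this message (retry when %fcc3 comes back unordered) might be structured as follows. This is only a sketch of the control flow: `run_once` is a hypothetical callable standing in for "load the operands, execute the instruction, report whether %fcc3 was unordered", and both the retry limit and the idea of returning `None` so the caller can fall back to another code path are assumptions, not something the message specifies.

```python
from typing import Callable, Optional, Tuple


def with_limited_retry(
    run_once: Callable[[], Tuple[bool, int]],
    max_attempts: int = 3,  # assumed limit; the message only says "limited"
) -> Optional[int]:
    """Call run_once() until it succeeds or max_attempts is exhausted.

    run_once() is hypothetical: it returns (ecc_error, result), where
    ecc_error models %fcc3 being set to unordered after the operation.
    Returns the result, or None so the caller can take another path.
    """
    for _ in range(max_attempts):
        ecc_error, result = run_once()
        if not ecc_error:
            return result
    return None
```

Bounding the loop matters because each retry means reloading every register window and floating point register before the instruction can be issued again.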
[PATCH 0/7] Phase one of sparc crypto opcode support.
This is the first phase of changes to support the new cryptographic opcodes found starting in the SPARC-T4 processor.

It first builds the infrastructure for feature presence detection, then adds support for all of the hashing functions implemented in current cpus (MD5, SHA1, SHA256, SHA512).

Here are some benchmarks on a SPARC T4-2 with these changes applied.

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5             14423.71k    50416.70k   173663.49k   445940.05k   816587.27k
sha1            33231.78k   115492.48k   318273.91k   579320.83k   759701.50k
sha256          46641.41k   157805.85k   419859.54k   708643.16k   889514.67k
sha512          50184.57k   202770.99k   529172.57k  1023763.11k  1405414.06k

These numbers, with crypto-opcode-disabled numbers for comparison, are duplicated in the relevant patch log messages.

I have cipher patches for AES, DES, and CAMELLIA as well but I would like to refine them a bit before I make a formal submission. And once I get those changes refined I will work on Montgomery multiply, Montgomery square, etc. which these chips also support directly.

I've tested these changes on all of {static,shared}/linux{,64}-sparcv9.

Oracle provided me with programmer's manuals that document these instructions, and I've been promised that these would be made public at some point in the not too distant future. But these instructions are very straightforward, and when I post the AES changes later on you will see that the AES instructions are virtually identical to the AESNI stuff from Intel.

All of the patches are against mainline, but should be backportable to 1.0.x without much difficulty.