Andy wrote:
> Would it handle [the sub from %o7 being in the delay slot of the call]?

Good idea, but no. This will fail regardless of whether the offset is the 
same as the target of the call. The reason is that this is still using an 
interprocedural trick to pass the called function its own address. Except 
for situations we manage heuristically, we don't carry knowledge like 
"register X contains a certain code address" from caller to callee. That 
isn't to say we couldn't handle it, but we don't. So your patch and the 
new code in aes_sparcv9.pl will also kill today's Purify.

You seem amazed that Purify can stretch code by inserting new 
instructions, and then patch up the resulting mess. You should be amazed: 
it's a hard problem. We've solved it for enough cases over time that we 
have been shipping a valuable bug-finding product to happy customers for 
15+ years. The patterns coming out of supported compilers are varied but 
not infinite, and they don't usually change faster than we can keep up. 
The few cases where we have to manage hand-coded assembly tend to be 
pretty stable. (They come up sometimes in system libraries that many 
customers will have in common, for example.) And if you think SPARC is 
bad, consider that we also do it for x86!

Regarding 13-bit offsets: when necessary, Purify knows how to rewrite 
(recognized) code sequences to turn a simple "add" into the equivalent 
"sethi/or/add" sequence. Transformations like that are old hat for us. 
When we see a call8, we know that %o7 contains a certain text 
address and we can see when it is used to compute other addresses. We 
patch those sequences to refer to the same place (logically) as before, 
even after things have moved around. On SPARC, call8 is unique; other 
calls are presumed to be calls, not "get my own PC into a register" 
pseudo-calls.
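
To illustrate the 13-bit limit: SPARC arithmetic instructions such as "add" 
carry a signed 13-bit immediate, so a constant outside -4096..4095 has to be 
built first with sethi (upper 22 bits) and or (lower 10 bits). The C sketch 
below only models that arithmetic. It is not Purify's code, and the function 
names are invented for the example.

```c
#include <stdint.h>

/* Nonzero if v can be encoded as a signed 13-bit immediate (simm13). */
int fits_simm13(int32_t v)
{
    return v >= -4096 && v <= 4095;
}

/* Value left in the register by "sethi %hi(v), reg":
 * the upper 22 bits of v, with the low 10 bits cleared. */
uint32_t sethi_part(uint32_t v)
{
    return v & 0xfffffc00u;
}

/* Value ORed in by "or reg, %lo(v), reg": the low 10 bits of v. */
uint32_t lo_part(uint32_t v)
{
    return v & 0x3ffu;
}
```

For example, 4095 fits in simm13 and 4096 does not, while 
sethi_part(v) | lo_part(v) rebuilds any 32-bit v.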

As for the SPARC people seeing call8 as a special case and not disrupting 
the retl prediction stack: it seems natural to me. Yes, it would make the 
prediction stack logic more complex, but the stack itself represents a 
choice to use more complexity to get better performance. I checked the 
Sun/Forte compiler and it uses call8 instead of gcc's call/ret stub. In 
light of that, and the fact that a large fraction of PIC functions access 
the GOT and thus will use call8, the prediction stack is virtually useless 
without special-casing call8: it'll very often be wrong for 
Forte-generated code.

Now for some more bad news, separate from the call8 question:

Further digging in des_enc.m4 revealed another problem. Besides the actual 
instructions in .PIC.me.up there is something else Purify doesn't notice 
that it should patch. The data item at .PIC.DES_SPtrans is the 32-bit 
offset from its own location (in .text) to DES_SPtrans (in .rodata). 
Because of code movement, Purify needs to patch this data item, but 
doesn't notice that it should. (Once again, this is a nonstandard way for 
a program to get the address of a data item in a position-independent 
way.)
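
The effect can be modeled in a few lines of C. This is a simplified sketch 
under assumptions of my own: a byte buffer stands in for the program image, 
and the helper names are hypothetical. It shows why a self-relative offset 
goes stale when the item holding it moves and the target does not.

```c
#include <stdint.h>
#include <string.h>

/* Store at `item` the 32-bit offset from `item` itself to `target`. */
void store_self_relative(unsigned char *item, const unsigned char *target)
{
    int32_t off = (int32_t)(target - item);
    memcpy(item, &off, sizeof off);
}

/* Recover the target: the item's own address plus the offset stored in it. */
const unsigned char *resolve_self_relative(const unsigned char *item)
{
    int32_t off;
    memcpy(&off, item, sizeof off);
    return item + off;
}

/* Model of code movement: the item's bytes are copied to a new place
 * but the stored offset is NOT recomputed. */
void move_without_patch(unsigned char *new_item, const unsigned char *old_item)
{
    memcpy(new_item, old_item, sizeof(int32_t));
}
```

If the item moves by k bytes without being patched, the resolved address is 
off by the same k bytes. Calling store_self_relative again at the new 
location is the patch step that is being missed.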

I completely understand the desire to optimize all this PIC nonsense away: 
the streamlined code currently in des_enc.m4 is much shorter and cleaner. 
I'm not saying that anything about des_enc.m4 is bad or wrong, just that 
Purify doesn't recognize it.

My best idea so far is that you should write a C function that does what 
.PIC.me.up does, then compile it to assembly (twice, for 32-bit and 
64-bit) and paste the assembly (with minimal changes) into des_enc.m4. 
This way you know the instruction pattern is exactly as Purify would see 
from the compiler. (The code under #ifdef OPENSSL_PIC almost does this, 
but not quite - it's still nonstandard.)
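
A rough sketch of what such a C function might look like follows. The names 
example_table and get_example_table are hypothetical stand-ins (the real 
table is DES_SPtrans), and the gcc invocation in the comment is an assumption 
about the build. The point is only that the compiler, rather than hand-written 
assembly, emits the position-independent address computation.

```c
/* Hypothetical stand-in for the real table. */
const unsigned int example_table[4] = { 1, 2, 3, 4 };

/* Returns the table's address. Compiled as PIC, the compiler emits its
 * usual address computation for a global symbol.
 *
 * Assumed invocation, once per word size:
 *   gcc -fPIC -S -m32 file.c
 *   gcc -fPIC -S -m64 file.c
 */
const unsigned int *get_example_table(void)
{
    return example_table;
}
```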

Sadly, this will add overhead. In the readme I saw that DES_encrypt1 and 
friends operate on 64 bits of input at a time. This surprises me: it means 
any setup you do is repeated N/8 times (where N=message length in bytes). 
Is it so costly to encrypt or decrypt 64 bits that the repeated setup cost 
is trivial? 

If you cannot absorb the additional cost of being well-behaved from 
Purify's perspective, there are more extreme ideas. One is for you to ship 
a de-optimized version of the library (built with the "no-asm" parameter 
to Configure) for your users to use when they want to use PurifyPlus. Of 
course people can also build this themselves from source, once they learn 
they must. More extreme ideas involve auto-selecting the suitable code at 
run time based on whether Purify is in the picture, but libcrypto really 
isn't set up for that.

-- Allan Pratt, [email protected]
Rational software division of IBM

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
