Andy wrote:
> Would it handle [the sub from %o7 being in the delay slot of the call]?

Good idea, but no. This will fail regardless of whether the offset is
the same as the target of the call. The reason is that this is still
using an interprocedural trick to pass the called function its own
address. Except for situations we manage heuristically, we don't carry
knowledge like "register X contains a certain code address" from caller
to callee. That isn't to say we couldn't handle it, but we don't. So
your patch and the new code in aes_sparcv9.pl will also kill today's
Purify.

You seem amazed that Purify can stretch code by inserting new
instructions, and then patch up the resulting mess. You should be
amazed: it's a hard problem. We've solved it for enough cases over time
that we have been shipping a valuable bug-finding product to happy
customers for 15+ years. The patterns coming out of supported compilers
are varied but not infinite, and they don't usually change faster than
we can keep up. The few cases where we have to manage hand-coded
assembly tend to be pretty stable. (They come up sometimes in system
libraries that many customers will have in common, for example.) And if
you think SPARC is bad, consider that we also do it for x86!

Regarding 13-bit offsets: Purify knows how to rewrite (recognized) code
sequences when necessary to turn a simple "add" into the equivalent
"sethi/or/add" sequence when it has to. Transformations like that are
old hat for us. When we see a call8, we know that o7 contains a certain
text address and we can see when it is used to compute other addresses.
We patch those sequences to refer to the same place (logically) as
before, even after things have moved around. On SPARC, call8 is unique;
other calls are presumed to be calls, not "get my own PC into a
register" pseudo-calls.

As for the SPARC people seeing call8 as a special case and not
disrupting the retl prediction stack: it seems natural to me.
Yes, it would make the prediction stack logic more complex, but the
stack itself represents a choice to use more complexity to get better
performance. I checked the Sun/Forte compiler and it uses call8 instead
of gcc's call/ret stub. In light of that, and the fact that a large
fraction of PIC functions access the GOT and thus will use call8, the
prediction stack is virtually useless without special-casing call8:
it'll very often be wrong for Forte-generated code.

Now for some more bad news, separate from the call8 question: Further
digging in des_enc.m4 revealed another problem. Besides the actual
instructions in .PIC.me.up there is something else Purify doesn't
notice that it should patch. The data item at .PIC.DES_SPtrans is the
32-bit offset from its own location (in .text) to DES_SPtrans (in
.rodata). Because of code movement, Purify needs to patch this data
item, but doesn't notice that it should. (Once again, this is a
nonstandard way for a program to get the address of a data item in a
position-independent way.)

I completely understand the desire to optimize all this PIC nonsense
away: the streamlined code currently in des_enc.m4 is much shorter and
cleaner. I'm not saying that anything about des_enc.m4 is bad or wrong,
just that Purify doesn't recognize it.

My best idea so far is that you should write a C function that does
what .PIC.me.up does, then compile it to assembly (twice, for 32-bit
and 64-bit) and paste the assembly (with minimal changes) into
des_enc.m4. This way you know the instruction pattern is exactly as
Purify would see from the compiler. (The code under #ifdef OPENSSL_PIC
almost does this, but not quite - it's still nonstandard.) Sadly, this
will add overhead.

In the readme I saw that DES_encrypt1 and friends operate on 64 bits of
input at a time. This surprises me: it means any setup you do is
repeated N/8 times (where N=message length in bytes). Is it so costly
to encrypt or decrypt 64 bits that the repeated setup cost is trivial?
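
To make the .PIC.DES_SPtrans problem described above concrete, here is
a minimal Python sketch of a self-relative 32-bit offset going stale
when the word holding it moves. This is an illustration only, not
Purify's actual logic: the addresses, the shift amount, and the helper
names (store_offset, resolve) are all made up, and it assumes for
simplicity that the target in .rodata stays put while only the slot in
.text moves.

```python
import struct

# Hypothetical addresses, chosen only for illustration.
OLD_SLOT = 0x100   # where the 32-bit offset word lives in ".text"
TARGET = 0x2000    # where the table lives in ".rodata" (assumed not to move)
SHIFT = 0x40       # bytes of instrumentation inserted ahead of the slot
IMAGE_SIZE = 0x3000


def store_offset(image, slot, offset):
    """Write a signed 32-bit big-endian offset (SPARC byte order) at slot."""
    struct.pack_into(">i", image, slot, offset)


def resolve(image, slot):
    """Compute the address the slot refers to: its own address + its value."""
    (offset,) = struct.unpack_from(">i", image, slot)
    return slot + offset


# Original layout: the slot stores "distance from me to the table".
original = bytearray(IMAGE_SIZE)
store_offset(original, OLD_SLOT, TARGET - OLD_SLOT)

# Inserting code ahead of the slot pushes it to a higher address.
new_slot = OLD_SLOT + SHIFT

# Case 1: the word is copied verbatim, so the stored distance is stale.
unpatched = bytearray(IMAGE_SIZE)
store_offset(unpatched, new_slot, TARGET - OLD_SLOT)

# Case 2: the word is recomputed relative to its new location.
patched = bytearray(IMAGE_SIZE)
store_offset(patched, new_slot, TARGET - new_slot)

print("original :", hex(resolve(original, OLD_SLOT)))
print("unpatched:", hex(resolve(unpatched, new_slot)))
print("patched  :", hex(resolve(patched, new_slot)))
```

In the unpatched case the computed address is off by exactly the amount
the slot moved, which is why a tool that relocates code has to know
that this particular word is an offset and not ordinary data.
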
If you can not absorb the additional cost of being well-behaved from
Purify's perspective, there are more extreme ideas. One is for you to
ship a de-optimized version of the library (built with the "no-asm"
parameter to Configure) for your users to use when they want to use
PurifyPlus. Of course people can also build this themselves from
source, once they learn they must. More extreme ideas involve
auto-selecting the suitable code at run time based on whether Purify is
in the picture, but libcrypto really isn't set up for that.

--
Allan Pratt, [email protected]
Rational software division of IBM
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                        [email protected]
Automated List Manager                          [email protected]
