Secondarily, since we can end up having to retry (deep window spill on
32-bit, and register ECC errors on both 32-bit and 64-bit),
I'm thinking about dropping the check after *every* montsqr: issue
multiple montsqr back to back and only then check for the retry
condition. One could do this only for inputs shorter than a specific
length. What do you think?
This gets to the issue of outputs aliasing an input.
And? It's just that we don't try to identify which particular montsqr
failed, only that one in a short sequence of them did. And in case of
failure we retry the sequence, not a single instruction. Why not detect
a specific instruction failure? In 32-bit mode detection involves
traversing back through the register windows in order to detect whether
the subroutine has suffered from a window flush. But even in 64-bit mode,
if we were to detect a specific instruction failure, we would have to
traverse the windows back to the window holding the result in order to
save the correct one from the previous instruction. That appears to be
an expensive operation, at least for shorter keys (as mentioned,
vis3-mont delivered a better result on RSA1024 sign). And as failures
are rare, it makes sense to share the costs. So the suggestion is to
fire several montsqr without looking at the result, accumulate
FSR.fcc3, and only then traverse the windows back in order to either
save the result of the sequence or discard it and reload the inputs to
the sequence.
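
To make the control flow concrete, here is a rough C sketch of the
"check once per sequence" idea. It is a simulation only: sim_montsqr,
sim_hw and sqr_batch are hypothetical names, a plain modular square
stands in for the real montsqr, and an OR-ed flag stands in for the
accumulated FSR.fcc3 condition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulated hardware state (not a real T4 interface). */
typedef struct {
    unsigned calls;    /* number of squarings issued so far */
    unsigned fail_at;  /* 1-based call index that "fails"; 0 = never */
} sim_hw;

/* Stand-in for one montsqr: squares *x modulo n in place and returns
 * nonzero if the (simulated) operation has to be retried. */
static int sim_montsqr(sim_hw *hw, uint32_t *x, uint32_t n)
{
    hw->calls++;
    if (hw->calls == hw->fail_at) {
        *x ^= 0x5a5a5a5au;          /* leave garbage behind on failure */
        return 1;
    }
    *x = (uint32_t)(((uint64_t)*x * *x) % n);
    return 0;
}

/* Issue `count` squarings back to back, accumulate the failure flags,
 * and look at them only once at the end of the sequence. On failure
 * the inputs are reloaded and the whole sequence is repeated. */
static uint32_t sqr_batch(sim_hw *hw, uint32_t a, uint32_t n, unsigned count)
{
    assert(n != 0);
    for (;;) {
        uint32_t x = a % n;         /* (re)load the sequence input */
        int failed = 0;

        for (unsigned i = 0; i < count; i++)
            failed |= sim_montsqr(hw, &x, n);
        if (!failed)
            return x;               /* commit the sequence result */
    }
}
```

With a fault injected on the second squaring of a four-squaring
sequence, the sketch issues eight squarings in total: the failed
sequence plus one clean repeat.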
A question in the context of a 32-bit application. My understanding is
that in order to detect whether a multi-window subroutine, such as the
one we have to use here, has suffered from a window flush (as a result
of a context switch or delivery of an asynchronous signal), it's
sufficient to detect whether the current window has been reloaded. I
mean, it never flushes just, say, a couple of the top windows, but all
of them. Is that a correct understanding?
One annoying aspect of all of this is that we need to use
a temporary on-stack location for the result until we know
we don't have to do a retry. Otherwise we might corrupt
one of the inputs.
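
A rough C sketch of that temporary-buffer pattern, for illustration
only. sim_op is a hypothetical fallible operation (not the real
montmul/montsqr), MAX_WORDS is an assumed bound, and the injected fault
counter exists only to exercise the retry path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_WORDS 64   /* assumed upper bound on operand size in words */

/* Hypothetical fallible operation: out[i] = in[num-1-i] + 1, so every
 * output word depends on an input word elsewhere in the vector. While
 * *faults_left > 0 it clobbers `out` and reports failure (returns 0). */
static int sim_op(uint32_t *out, const uint32_t *in, size_t num,
                  int *faults_left)
{
    if (*faults_left > 0) {
        (*faults_left)--;
        memset(out, 0xff, num * sizeof(*out));
        return 0;
    }
    for (size_t i = 0; i < num; i++)
        out[i] = in[num - 1 - i] + 1;
    return 1;
}

/* r may alias a. The result goes to an on-stack temporary first and is
 * copied out only once we know no retry is needed. */
static void op_retry_safe(uint32_t *r, const uint32_t *a, size_t num,
                          int *faults_left)
{
    uint32_t tmp[MAX_WORDS];

    assert(num <= MAX_WORDS);
    while (!sim_op(tmp, a, num, faults_left))
        ;   /* the inputs in `a` are still intact, so just try again */
    memcpy(r, tmp, num * sizeof(*r));
}
```

Writing straight into r would have destroyed the input on the first
failed attempt whenever r and a are the same buffer; the extra copy is
the price of being able to retry.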
Really, the thing to do is to put the whole RSA/DSA/etc. path
into a specially written T4 code block. That way we won't have
to deal with details such as the fact that the words in the
OpenSSL bignum layout are transposed relative to what the T4
engine wants in the registers, etc.
The above suggestion implies that we break the dependence on
bn_mul_mont and do something else. Naturally that allows for amortizing
various overheads, such as transposing the words in the input and
output vectors. So don't worry, it's all considered ;-)
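
As a rough illustration of the amortization, here is a C sketch that
converts the layout once on the way in and once on the way out,
however many steps run in between. It assumes "transposed" means
reversed word order, which is only an interpretation; engine_step is a
hypothetical placeholder (a multi-word increment), not a T4 operation,
and MAX_WORDS is an assumed bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WORDS 64   /* assumed upper bound on operand size in words */

/* Copy src to dst with the word order reversed; dst must not alias src. */
static void reverse_words(uint32_t *dst, const uint32_t *src, size_t num)
{
    for (size_t i = 0; i < num; i++)
        dst[i] = src[num - 1 - i];
}

/* Hypothetical placeholder step working in the "engine" layout (most
 * significant word first): adds 1 to the multi-word value. */
static void engine_step(uint32_t *v, size_t num)
{
    for (size_t i = num; i-- > 0; )
        if (++v[i] != 0)
            break;              /* no carry out of this word */
}

/* bn is least significant word first. The two layout conversions are
 * paid once per chain instead of once per step. */
static void chain_steps(uint32_t *bn, size_t num, unsigned count)
{
    uint32_t eng[MAX_WORDS];

    assert(num <= MAX_WORDS);
    reverse_words(eng, bn, num);        /* convert once on the way in */
    for (unsigned i = 0; i < count; i++)
        engine_step(eng, num);
    reverse_words(bn, eng, num);        /* convert once on the way out */
}
```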
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List [email protected]
Automated List Manager [email protected]