From: Andy Polyakov <[email protected]>
Date: Sat, 20 Oct 2012 20:58:42 +0200

> It's just that we don't try to identify which particular montsqr
> that failed, but short sequence of them. And in case of failure
> retry the sequence, not single instruction. Why not detect specific
> instruction failure? In 32-bit mode detection involves traversing
> back the register windows in order to detect if the subroutine has
> suffered from windows flush.

Right, and time is your worst enemy for this issue.  You want to
minimize, not increase, the amount of time that those top level
register windows are live in the chip and potentially flushed out.

Therefore, for systems that don't have support for a biased 64-bit
stack in 32-bit processes, you should check after every operation.

This reminds me that I wanted to look into doing things like loading
only the floating point registers first, and then the windowed integer
registers.  Again, to minimize the amount of time that the windows
are exposed to potential flushes on 32-bit.

> But even in 64-bit mode if we were to detect specific instruction
> failure, we would have to traverse the windows back to window
> holding the result in order to save the correct one from previous
> instruction.

I think we can consider these restart cases as nearly never happening.
Recovery can be expensive, and that doesn't matter.

> And as failure is seldom, it makes sense to share the
> costs. So suggestion is to fire several montsqr without looking at
> result, accumulate FSR.fcc3, and only then traverse windows back in
> order to either save the result from sequence or discard it and
> reload inputs to sequence.

I agree for the FSR.fcc3 case, that this idea is reasonable.
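
To make sure we mean the same thing, here is a rough sketch in C of
the "fire several, check once" control flow.  Everything in it is a
stand-in: sim_sqr() is a plain modular squaring, not a real montsqr,
and its boolean return value only plays the role of the FSR.fcc3
condition.  The names and the fault injection are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated stand-in for one hardware squaring step.  Squares *x
 * modulo m in place and returns true if the (simulated) hardware
 * flagged a failure.  m fits in 32 bits, so the product cannot
 * overflow 64 bits as long as *x < m. */
static bool sim_sqr(uint64_t *x, uint32_t m, bool inject_fault)
{
    *x = (*x * *x) % m;
    return inject_fault;
}

/* Run n squarings back to back without looking at the result,
 * accumulate a sticky failure flag, and inspect it only once at the
 * end.  On failure the whole sequence is discarded and restarted
 * from the saved input.  fault_at selects the step that fails on
 * the first pass only (use n for "no fault"), so the loop always
 * terminates in this simulation.  *passes counts the passes made. */
static uint64_t sqr_sequence(uint64_t input, uint32_t m, unsigned n,
                             unsigned fault_at, unsigned *passes)
{
    assert(m != 0);
    *passes = 0;
    for (;;) {
        uint64_t x = input % m;      /* (re)load the input */
        bool failed = false;         /* sticky, like an accumulated fcc3 */
        bool first = (*passes == 0);

        for (unsigned i = 0; i < n; i++)
            failed |= sim_sqr(&x, m, first && i == fault_at);

        (*passes)++;
        if (!failed)
            return x;                /* save the result of the sequence */
        /* otherwise: discard x and go around again */
    }
}
```

A fault anywhere in the sequence costs one extra pass over the whole
sequence, which is the shared cost being proposed.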

For the 32-bit register window case, I do not think it is reasonable.

> Question in context of 32-bit application. My understanding is that in
> order to detect if multi-window subroutine such one we have to use
> here has suffered from windows flush (as result of context switch or
> delivery of asynchronous signal) it's sufficient to detect if current
> window is reloaded.

What usually happens is something as simple as a device interrupt, or
the per-cpu timer interrupt, comes in.  That's enough to blow the top
register window and cause a restart.

So it is wise to minimize the exposure time as much as possible.

It is possible (a very easy example is using the cpu's profiling
facilities) to have interrupts arrive at a high enough frequency that
these instructions never complete successfully on 32-bit.

Which means two things:

1) We have to limit the retries and fall back to software if necessary.

2) The only reasonable thing to do longer term is the biased stack
   idea we designed the other day.  I'm almost done with an
   implementation for Linux and I'll let you know when it's running
   on the T4 test system.
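
For point 1, the shape I have in mind is roughly the following.  This
is only a sketch: the callback types, the enum and the demo helpers
are hypothetical names for illustration, not anything that exists in
OpenSSL, and the "hardware" here is just a counter that fails a set
number of times.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks.  The hardware attempt returns true if the
 * result is valid, false if it was lost (e.g. the windows were
 * flushed) and the operation has to be restarted. */
typedef bool (*hw_attempt_fn)(void *ctx);
typedef void (*sw_fallback_fn)(void *ctx);

enum mont_path { MONT_PATH_HW, MONT_PATH_SW };

/* Try the hardware path at most max_attempts times, then give up
 * and run the software implementation instead. */
static enum mont_path run_with_retry_limit(hw_attempt_fn hw,
                                           sw_fallback_fn sw,
                                           void *ctx,
                                           unsigned max_attempts)
{
    assert(hw != NULL && sw != NULL);
    for (unsigned i = 0; i < max_attempts; i++) {
        if (hw(ctx))
            return MONT_PATH_HW;
    }
    sw(ctx);
    return MONT_PATH_SW;
}

/* Demo-only state: a fake hardware path that fails a fixed number
 * of times before succeeding. */
struct demo_ctx {
    unsigned failures_left;
    unsigned attempts;
    bool used_sw;
};

static bool flaky_hw(void *p)
{
    struct demo_ctx *c = p;
    c->attempts++;
    if (c->failures_left > 0) {
        c->failures_left--;
        return false;
    }
    return true;
}

static void plain_sw(void *p)
{
    struct demo_ctx *c = p;
    c->used_sw = true;
}
```

The right value for the limit is an open question.  It only has to be
small enough that a process being profiled still makes forward
progress.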

> The above suggestion implies that we do break dependence from
> bn_mul_mont and do something else. Naturally it allows for amortizing
> various overheads such as transposing the words in input and output
> vectors. So don't worry, it's all considered ;-)

The chip was designed for this, to be honest.  It's meant to be used
such that you only reload the operands you need to rather than go
through the full register loading sequence for every operation.

That means we emit a "script", or programmed sequence, of register
loads/stores and montgomery opcodes.
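
As a toy model of what such a script could look like, here is a small
interpreter in C.  The opcode names, the struct layout and the
register count are all invented for this sketch, and the arithmetic
is ordinary modular multiplication on single words rather than real
Montgomery operations on the T4 register file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical script opcodes. */
enum op_kind { OP_LOAD, OP_SQR, OP_MUL, OP_STORE };

/* dst is always a register index.  arg is a memory index for
 * OP_LOAD/OP_STORE, a register index for OP_MUL, unused for OP_SQR. */
struct op {
    enum op_kind kind;
    unsigned dst;
    unsigned arg;
};

#define NREGS 4

/* Execute the script against a simulated register file.  The caller
 * must size mem to cover every memory index the script uses.  m fits
 * in 32 bits, so products of reduced values fit in 64 bits. */
static void run_script(const struct op *script, size_t len,
                       uint64_t *mem, uint32_t m)
{
    uint64_t reg[NREGS] = { 0 };

    assert(m != 0);
    for (size_t i = 0; i < len; i++) {
        const struct op *o = &script[i];

        assert(o->dst < NREGS);
        switch (o->kind) {
        case OP_LOAD:
            reg[o->dst] = mem[o->arg] % m;
            break;
        case OP_SQR:
            reg[o->dst] = (reg[o->dst] * reg[o->dst]) % m;
            break;
        case OP_MUL:
            assert(o->arg < NREGS);
            reg[o->dst] = (reg[o->dst] * reg[o->arg]) % m;
            break;
        case OP_STORE:
            mem[o->arg] = reg[o->dst];
            break;
        }
    }
}

/* Example script: mem[1] = mem[0]^5 mod m.  The operand is loaded
 * once per register and intermediate values never leave registers. */
static const struct op pow5_script[] = {
    { OP_LOAD,  0, 0 },   /* r0 = x        */
    { OP_LOAD,  1, 0 },   /* r1 = x        */
    { OP_SQR,   0, 0 },   /* r0 = x^2      */
    { OP_SQR,   0, 0 },   /* r0 = x^4      */
    { OP_MUL,   0, 1 },   /* r0 = x^4 * x  */
    { OP_STORE, 0, 1 },   /* mem[1] = r0   */
};
```

The point of the model is only that loads and stores happen at the
edges of the script, not around every operation.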

BTW, we could even create a JIT compiler for this.