* Thomas Gleixner <[email protected]> wrote:
> > Useful also for code that needs AVX-like registers to do things like CRCs.
>
> x86/crypto/ has a lot of AVX optimized code.

Yeah, that's true, but the crypto code is processing fundamentally bigger
blocks of data, which amortizes the cost of using kernel_fpu_begin()/_end().

kernel_fpu_begin()/_end() is a pretty heavy operation because it does a full
FPU save/restore via the XSAVE[S] and XRSTOR[S] instructions, which can easily
copy a thousand bytes around! So kernel_fpu_begin()/_end() is probably a
non-starter for something small, like a single 256-bit or 512-bit word access.
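
To get a rough feel for the "thousand bytes" figure on a given machine, here is
a minimal user-space sketch in C. It is not kernel code: it only asks CPUID
leaf 0xD (sub-leaf 0) how large the XSAVE area would be with every supported
state component enabled. The function name xsave_area_max_bytes() is made up
for this illustration, and the sketch assumes GCC or clang with <cpuid.h>.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Hypothetical helper: size in bytes of the XSAVE area needed for all state
 * components this CPU supports (CPUID.(EAX=0xD,ECX=0):ECX).
 * Returns 0 if the leaf is unavailable or this is not an x86 build.
 */
static uint32_t xsave_area_max_bytes(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 when the requested leaf is unsupported. */
	if (!__get_cpuid_count(0x0d, 0, &eax, &ebx, &ecx, &edx))
		return 0;

	return ecx;
#else
	return 0;
#endif
}
```

When the leaf is populated, the reported size includes the 512-byte legacy
FXSAVE region plus the 64-byte XSAVE header, so anything beyond 576 bytes is
extended state such as the AVX or AVX-512 registers. Compare that with the 32
or 64 bytes a single YMM or ZMM register holds.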

But there's actually a new thing in modern kernels: we got rid of (most of)
the lazy save/restore FPU code; our new x86 FPU model is very "direct", with
no FPU faults taken normally.

So assuming the target driver will only load on modern FPUs I *think* it should
actually be possible to do something like (pseudocode):

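	# vmovdqu, since these stack offsets are not 32-byte aligned
	# (the aligned vmovdqa form would fault on them)
	vmovdqu %ymm0, 40(%rsp)
	vmovdqu %ymm1, 80(%rsp)

	...
	# use ymm0 and ymm1
	...

	# restore in reverse order
	vmovdqu 80(%rsp), %ymm1
	vmovdqu 40(%rsp), %ymm0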

... without using the heavy XSAVE/XRSTOR instructions.

Note that preemption probably still needs to be disabled and possibly there are
other details as well, but there should be no 'heavy' FPU operations.
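
The save/use/restore pattern can be tried out from user space with a small C
sketch using GCC/clang inline assembly. This only demonstrates that a plain
vector store and load round-trips the register contents. It says nothing about
whether the sequence is safe in kernel context, where preemption, interrupts
and the other details mentioned above still apply. The function names are
invented for this example, and the whole sequence is kept inside one asm
statement so the compiler cannot touch %ymm0 in between.

```c
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
/* Hypothetical helper: returns 1 if %ymm0 survives a manual save/restore. */
__attribute__((target("avx")))
static int ymm0_roundtrip_avx(void)
{
	uint8_t pattern[32], save[32], after[32];
	size_t i;

	for (i = 0; i < sizeof(pattern); i++)
		pattern[i] = (uint8_t)(0xa0 + i);
	memset(save, 0, sizeof(save));
	memset(after, 0, sizeof(after));

	__asm__ volatile(
		"vmovdqu (%[pat]), %%ymm0\n\t"		/* stand-in for interrupted state */
		"vmovdqu %%ymm0, (%[save])\n\t"		/* save the register */
		"vxorps %%ymm0, %%ymm0, %%ymm0\n\t"	/* 'use' (overwrite) it */
		"vmovdqu (%[save]), %%ymm0\n\t"		/* restore the register */
		"vmovdqu %%ymm0, (%[out])\n\t"		/* copy out for checking */
		:
		: [pat] "r"(pattern), [save] "r"(save), [out] "r"(after)
		: "xmm0", "memory");

	return memcmp(pattern, after, sizeof(pattern)) == 0;
}
#endif

/*
 * Returns 1 if the restored value matches, 0 on mismatch, and -1 if AVX is
 * not usable on this machine (or this is not an x86-64 build).
 */
static int ymm_save_restore_demo(void)
{
#if defined(__x86_64__)
	if (!__builtin_cpu_supports("avx"))
		return -1;
	return ymm0_roundtrip_avx() ? 1 : 0;
#else
	return -1;
#endif
}
```

The unaligned vmovdqu form is used here so that the buffers need no special
alignment; with 32-byte aligned save slots the aligned vmovdqa form would work
as well.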

I think this should still preserve all user-space FPU state and shouldn't muck
up any 'weird' user-space FPU state (such as pending exceptions, running legacy
x87 code, NaN registers or weird FPU control word settings) we might have
interrupted either.

But I could be wrong: it should be checked whether this sequence is safe.
Worst-case we might have to save/restore the FPU control and tag words - but
those operations should still be much faster than a full XSAVE/XRSTOR pair.

So I do think we could do more in this area to improve driver performance, if
the code is correct and if there are actual benchmarks showing real benefits.

Thanks,

	Ingo