On Tue, 2022-05-03 at 10:41 +0000, Bezdeka, Florian via Xenomai wrote:
> Hi all,
>
> it seems that I'm able to reproduce a register (or stack) corruption on
> x86.
>
> The problem does not appear when running the Xenomai testsuite
> (especially switchtest) without any additional load. Stressing Linux
> with stress-ng makes the test fail.
>
> Kernel: 4.19.231-cip68
> Xenomai: 3.2.1
> Hardware:
> - Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
> - 32 cores
> stress-ng cmdline:
> stress-ng --cpu 16 --io 8 --vm 4 --vm-bytes 128M --fork 8
>
Here is the result of my analysis so far. I'm quite sure I figured it
out, but let's see if you all agree. Seems I'm facing a test issue, so
let's start with the test:
In rtswitch_ktask() we're calling fp_regs_set() which holds:

    for (i = 0; i < 8; i++)
        __asm__ __volatile__("fildl %0": /* no output */ :"m"(val));

We're trying to fill the complete FPU stack. But is there any
guarantee that we will succeed? There is no "fninit" in front, so we
might start from a dirty FPU state.
I modified the FPU trace infrastructure to show the x87 status word and
got:

rtk5/15-209875 [015] 81.038137: x86_fpu_init_state: x86/fpu: 0xffff9368965ecf00 initialized: 0 xfeatures: 0 xcomp_bv: 0 swd(0)
rtk5/15-209875 [015] 81.038137: x86_fpu_activate_state: x86/fpu: 0xffff9368965ecf00 initialized: 0 xfeatures: 0 xcomp_bv: 0 swd(0)
rtk5/15-209875 [015] 81.038143: x86_fpu_xsave: x86/fpu: 0xffff9368965ecf00 initialized: 1 xfeatures: 3 xcomp_bv: 0 swd(0x1929)

So the first FPU state written into the xsave memory has TOP (top of
stack pointer) already set to 3 (bits 11-13 of 0x1929).
When dumping the FPU ST registers after calling fp_regs_set(), right
before the "register corruption detection", I can see that 3 fildl
operations were successful and that the status word changed to 0x1969.
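0x1969 differs from 0x1929 only in bit 6, so the SF (Stack Fault) bit is
set now. According to the Intel Software Developer's Manual, SF reports
an x87 stack overflow or underflow, with C1 telling the two apart. As
fildl only pushes onto the stack, this should be a stack overflow.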
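
To make the bit fiddling easier to double-check, here is a minimal sketch in C that pulls TOP and SF out of a status word. The helper names (x87_sw_top(), x87_sw_stack_fault(), x87_top_after_pushes()) are made up for illustration and are not part of Xenomai or the kernel. The bit positions are the ones of the x87 status word layout: SF is bit 6, TOP is bits 11-13.

```c
#include <assert.h>
#include <stdint.h>

#define X87_SW_SF        0x0040u  /* bit 6: Stack Fault */
#define X87_SW_TOP_SHIFT 11       /* TOP lives in bits 11-13 */
#define X87_SW_TOP_MASK  0x7u

/* Hypothetical helper: extract TOP (0..7) from an x87 status word. */
static inline unsigned int x87_sw_top(uint16_t sw)
{
	return (sw >> X87_SW_TOP_SHIFT) & X87_SW_TOP_MASK;
}

/* Hypothetical helper: 1 if SF is set in the status word, 0 otherwise. */
static inline int x87_sw_stack_fault(uint16_t sw)
{
	return (sw & X87_SW_SF) != 0;
}

/*
 * Hypothetical helper: TOP after n pushes, starting from `top`.
 * Assumption: exceptions are masked, so every push decrements TOP
 * modulo 8, even a push that overflows the stack.
 */
static inline unsigned int x87_top_after_pushes(unsigned int top,
						unsigned int n)
{
	assert(top < 8);
	return (top + 8 - (n & 7)) & 7;
}
```

With the values from the trace, x87_sw_top() gives 3 for both 0x1929 and 0x1969, and x87_sw_stack_fault() gives 0 for the former and 1 for the latter. Starting from a freshly initialized FPU (TOP = 0), x87_top_after_pushes(0, 5) is also 3, which would fit five occupied registers and three free ones. Keep in mind that whether a push overflows is decided by the tag word (is the target register empty?), not by TOP alone, so this only decodes what the trace shows.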
Adding a "fninit" instruction to the test before starting to fill the
FPU stack fixes my issue. No more corruptions detected on my machine.
Does that make sense to you?
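
For anyone who wants to see the effect outside of the test, the following user-space sketch should reproduce it on x86 with GCC-style inline asm. It is not the actual patch: fill_x87_stack() and its parameters are hypothetical, and the dirty state is produced artificially by pushing a few values before the fill loop. It assumes the compiler keeps nothing live on the x87 stack in this function (true for plain integer code like this).

```c
#include <assert.h>
#include <stdint.h>

#define X87_SW_SF 0x0040u  /* bit 6 of the x87 status word: Stack Fault */

/*
 * Hypothetical demo helper (x86 only): leave `dirty` values on the x87
 * stack, optionally run fninit, then push eight values the same way
 * fp_regs_set() does. Returns the status word seen after the fill.
 */
static uint16_t fill_x87_stack(unsigned int dirty, int use_fninit)
{
	uint16_t sw;
	int val = 1;	/* fildl loads a 32-bit integer from memory */
	unsigned int i;

	assert(dirty <= 8);

	/* Known starting point for the demo: empty stack, flags cleared. */
	__asm__ __volatile__("fninit");

	/* Simulate a dirty FPU state inherited from somewhere else. */
	for (i = 0; i < dirty; i++)
		__asm__ __volatile__("fildl %0" : /* no output */ : "m"(val));

	/* The proposed fix: reset the FPU before filling the stack. */
	if (use_fninit)
		__asm__ __volatile__("fninit");

	for (i = 0; i < 8; i++)
		__asm__ __volatile__("fildl %0" : /* no output */ : "m"(val));

	/* Store the status word without checking pending exceptions. */
	__asm__ __volatile__("fnstsw %0" : "=m"(sw));

	/* Leave the x87 stack empty again for the rest of the program. */
	__asm__ __volatile__("fninit");

	return sw;
}
```

Called as fill_x87_stack(5, 0), the returned status word should have SF set, because only three of the eight pushes find an empty register. With fill_x87_stack(5, 1) the fninit empties the stack first and SF should stay clear.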
Btw: Most of the FPU tests are currently disabled on x86 when one of
the following config options is set. In my eyes this doesn't make sense,
and I would like to come up with a patch that removes this limitation.
The options:
- CONFIG_X86_USE_3DNOW
- CONFIG_MD_RAID456
- CONFIG_MD_RAID456_MODULE
Regards,
Florian