I have been getting kernel Oopses with Xenomai 3.1 (and 3.2.1) on x86. In numerous tests, I can't keep a computer running for more than a day before it hard-locks (no kbd/mouse/ping). Frequently the kernel Oopses within 4-6 hours. I have tried 2 identical motherboards, changed RAM, and tried another manufacturer's motherboard on a 3rd computer.
* Can someone supply me with a known successful x86 kernel 4.19.89 config so I can compare and try those settings? I will attach my kernel config to this email, in hopes someone can see something wrong with it.

Specs:
  Intel i5-4590 CPU
  Advantech motherboard with Intel Q87 chipset
  8G RAM
  Moxa 4-port PCI card w/ 16750 UARTs
  2 motherboard 16550 UARTs (in ISA memory range)
  Peak PCI CAN card
  Xenomai 3.1 (also xeno 3.2.1)
  Distro: RHEL8, with xenomai ipipe-patched 4.19.89 kernel from kernel.org source

Sometimes onscreen (in a text terminal) I get this Oops:

  kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
  BUG: unable to handle kernel paging request at ...
  PGD ... P4D ... PUD ... PMD ...
  Oops: 0011 [#1] SMP PTI
  CPU: 1 PID: 3539 Comm: gui Tainted: G OE 4.19.89xeno3.1-i64x3832 #2
  Hardware name: To be filled by O.E.M. To be filled by O.E.M./SHARKBAY, BIOS 4.6.5 08/29/2017
  I-pipe domain: Linux
  RIP: ... : ...
  Code: Bad RIP value.

Which means the Instruction Pointer is in a Data area. That is bad, and I think it is caused by Cobalt code not restoring the stack/registers correctly during a context switch.

Other times I get:

  Kernel Panic - not syncing: stack-protector: Kernel stack is corrupted in: __xnsched_run.part.63
  CPU: 1 PID: 2409 Comm: appnrtB Tainted: G OE 4.19.89xeno3.1-i64x8632 #2
  Hardware name: To be filled by O.E.M. To be filled by OEM, BIOS 4.6.5 04/23/2021
  I-pipe domain: Linux
  Call Trace:
   <IRQ>
   dump_stack+0x95/0x...
   panic+0xe5/0x246
   ? ___xnsched_run.part.63+0x5c4/0x5d8
   __stack_chk_fail+0x19/0x28
   ___xnsched_run.part.63+0x5c4/0x5d8
   ? release_ioapic_irq+0x3f/0x58
   ? __ipipe_end_fasteoi_irq+0x22/0x38
   xnintr_edge_vec_handler+0x.../0x558
   __ipipe_do_sync_pipeline+0x5/0x...
   dispatch_irq_head+0xe6/0x118
   __ipipe_dispatch_irq+0x1bc/0x1e8
   __ipipe_handle_irq+0x198/0x208
   ? common_interrupt+0xf/0x2c
   </IRQ>

The accompanying stack trace seems to implicate an ipipe interrupt handler as causing the problem.
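
As a cross-check on the "Oops: 0011" line quoted above, here is a minimal Python sketch that decodes the x86 page-fault error code the kernel prints (in hex) after "Oops:". The bit meanings are the architectural ones; the names PF_BITS and decode_oops_code are made up for this illustration and do not come from any kernel tool.

```python
# Low bits of the x86 page-fault error code, printed in hex after "Oops:".
# Each entry: (mask, meaning if set, meaning if clear or None to stay silent).
# PF_BITS and decode_oops_code are hypothetical names used only in this sketch.
PF_BITS = [
    (0x01, "protection violation (page was present)", "page not present"),
    (0x02, "write access", "not a write"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in a page-table entry", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_hex):
    """Return a list of human-readable flags for an Oops error code string."""
    code = int(code_hex, 16)
    flags = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            flags.append(if_set)
        elif if_clear is not None:
            flags.append(if_clear)
    return flags


if __name__ == "__main__":
    for flag in decode_oops_code("0011"):
        print(flag)
```

For 0011 this reports a protection violation on a present page, not a write, in kernel mode, during an instruction fetch. That is consistent with the "tried to execute NX-protected page" message and the "Bad RIP value" line: the CPU jumped to an address in non-executable memory.
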
I'm using xeno_16550A.ko interrupts on an isolated interrupt level (IRQ 18). Interestingly, the Cobalt scheduler and my RT userspace app are still running after this, even though the Linux kernel is halted. I proved this on an oscilloscope: I can see serial packets going into and out of the serial ports at the expected periodic time base.

(Note that the text of these kernel faults above is reconstructed with OCR so some addresses are not complete. The computer is hard-locked in a text terminal when these happen. I can supply the full JPG pictures or re-type addresses if you like.)

The application scenario which causes the above problems:

The primary app, “apprt2”, is a 32-bit userspace app (compiled -m32) running on CPU core 1 (by fixed affinity), on 64 bit Xenomai 3.1 with ipipe patch applied for x86 kernel 4.19.89. It has shared memory via mmap() with an RTDM module (“modrt1”) but nothing is happening in “modrt1” at present, no interrupts etc. There are also two non-RT userspace linux apps which have attached to the same shared memory via mmap() but those are doing nothing much during these tests. I have attached several (1-6) RS232 serial devices and one CAN device all communicating with “apprt2”.

The system does not fault (for 48+ hours) when no peripheral connections are present (Serial/CAN). The faults happen with Serial traffic, whether the CAN device is attached or not. The CAN device alone with no Serial does not cause the fault (tested for 48+ hours), and the fault has also happened when the motherboard serial ports were used, so the PCI Moxa code is not implicated.

Note that in order to get 32-bit userspace support to fully work I had to manually patch the 16550A.c serial driver with the 32 bit “compatibility” patch from the xenomai mailing list. That works OK and my apps can communicate fine for hours. The serial packets in my applications have CRC checks so we know if data ever gets corrupted.
Note that my apps have been running OK 32-bit on Xenomai v2.6 for two years. Also I ran my apps compiled as 64 bit on Xenomai v3.0.12 and did not get any faults in a test lasting 21+ hours (serial driver only, no CAN).

Since I imagine Xenomai developers prefer to debug on recent builds, I also tested this on Xenomai 3.2.1 and I recompiled my apps 64 bit. I still get kernel Oopses with Xeno 3.2.1:

  kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
  BUG: unable to handle kernel paging request at ...
  PGD ... P4D ... PUD ... PMD ...
  Oops: 0011 [#1] SMP PTI
  CPU: 1 PID: 3539 Comm: appnrtA Tainted: G OE 4.19.89xeno3.1-i64x3832 #2
  Hardware name: To be filled by O.E.M. To be filled by O.E.M./SHARKBAY, BIOS 4.6.5 08/29/2017
  I-pipe domain: Linux
  RIP: ... : ...
  Code: Bad RIP value.
  ...

* Is there some way to instrument the Cobalt kernel to debug this? It seems impossible to get any debug data from /proc/xenomai because the Linux kernel is Oopsed.

A debugging problem: occasionally with my apps compiled 64 bit on Xeno 3.1 or 3.2 the tests run 24+ hours OK (but would fault eventually, or in another test). So I get 'false positives' and it takes weeks to make progress. It is easiest to generate a kernel Oops rapidly on Xeno 3.1 with my apps compiled 32 bit. So to expedite the testing process may I propose to keep compiling 32 bit and we instrument Xeno-3.1 (k4.19.89), and ultimately port the fix to xeno-3.2 (k4.19.89)?

Thanks.
-C Smith

-------------- next part --------------
A non-text attachment was scrubbed...
Name: config_4.19.89-20211206
Type: application/octet-stream
Size: 190113 bytes
Desc: not available
URL: <http://xenomai.org/pipermail/xenomai/attachments/20220102/c6ee52df/attachment.obj>
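
For the config comparison requested at the top of this message, a minimal Python sketch along these lines could list the options that differ between the attached config and a known-good one. The names parse_config and diff_configs are hypothetical, and an option missing from one file is treated as "n", which is only an approximation (it may simply not exist in that configuration). The kernel source tree also ships scripts/diffconfig for the same job.

```python
import re
import sys

# Lines of interest in a kernel .config file:
#   CONFIG_FOO=y            (set)
#   # CONFIG_FOO is not set (explicitly disabled)
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option name to value; explicitly unset options map to 'n'."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def diff_configs(mine, reference):
    """Return {option: (my value, reference value)} for differing options.

    Assumption: an option absent from one file is reported as 'n'.
    """
    a, b = parse_config(mine), parse_config(reference)
    diffs = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, "n"), b.get(key, "n")
        if va != vb:
            diffs[key] = (va, vb)
    return diffs


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        for key, (va, vb) in diff_configs(f1.read(), f2.read()).items():
            print(f"{key}: mine={va} reference={vb}")
```

Saved under a placeholder name such as cfgdiff.py, it would be run with the attached config_4.19.89-20211206 as the first argument and the reference config as the second. Differences in the IPIPE, XENO, CPU idle/frequency and SMI related options would be the first ones to look at.
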