Re: x86 kernel Oops in Xeno-3.1/3.2
On Sun, Jan 9, 2022 at 8:49 AM Philippe Gerum wrote:
C Smith writes:
On Mon, Jan 3, 2022 at 11:44 PM C Smith wrote:
On Mon, Jan 3, 2022 at 11:05 PM Jan Kiszka wrote:
On 03.01.22 22:12, C Smith wrote:
On Sun, Jan 2, 2022 at 11:38 PM Jan Kiszka wrote:
On 03.01.22 08:29, C Smith wrote:
> I have been getting kernel Oopses with x86 Xenomai 3.1 (and 3.2.1).
> In numerous tests, I can't keep a computer running for more than a day
> before the computer hard-locks (no kbd/mouse/ping). Frequently the
> kernel Oopses within 4-6 hours. I have tried 2 identical motherboards,
> changed RAM, and tried another manufacturer's motherboard on a 3rd
> computer.
>
> * Can someone supply me with a known successful x86 kernel 4.19.89
> config so I can compare and try those settings? I will attach my
> kernel config to this email, in hopes someone can see something wrong
> with it.
>
> Specs: Intel i5-4590 CPU, Advantech motherboard with Q87 Intel
> chipset, 8G RAM, Moxa 4-port PCI card w/ 16750 UARTs, 2 motherboard
> 16550 UARTs (in ISA memory range), Peak PCI CAN card, Xenomai 3.1
> (also xeno 3.2.1), Distro: RHEL8, with xenomai ipipe-patched 4.19.89
> kernel from kernel.org source.
>
> Sometimes onscreen (in a text terminal) I get this Oops:
>
> kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
> BUG: unable to handle kernel paging request at ...
> PGD ... P4D ... PUD .. PMD ...
> Oops: 0011 [#1] SMP PTI
> CPU: 1 PID: 3539 Comm: gui Tainted: G OE 4.19.89xeno3.1-i64x3832 #2
> Hardware name: To be filled by O.E.M. To be filled by O.E.M./SHARKBAY,
> BIOS 4.6.5 08/29/2017
> I-pipe domain: Linux
> RIP: ... : ...
> Code: Bad RIP value.
>
> Which means the Instruction Pointer is in a Data area. That is bad,
> and I think it is caused by Cobalt code not restoring the
> stack/registers correctly during a context switch.
> Other times I get:
>
> Kernel Panic - not syncing: stack-protector: Kernel stack is corrupted
> in: __xnsched_run.part.63 h -
> CPU: 1 PID: 2409 Comm: appnrtB Tainted: G OE 4.19.89Nen03.1-i64x8632 #2
> Hardware name: To be filled by O.E.M. To be filled by OEM, BIOS 4.6.5
> 04/23/2021
> I-pipe domain: Linux
> Call Trace:
>
> dump_stack+8x95/8xna
> panic+8xe§l8x246
> ? ___xnsched_run.part.63+8x5c4/8x4d0
> __stack_chk_fail+8x19x8x28
> ___xnsched_run.part.63+8x§c4/Bx§d8
> ? release_ioapic_irq+8x3f/8x58
> ? __ipipe_end_fasteoi_irq+BNZZ/8x38
> xnintr_edge_vec_handler+BXBIA/8x558
> __ipipe_do_sync_pipeline+8xS/ana
> dispatch_irq_head+8xe6/Bx118
> __ipipe_dispatch_irq+ax1bc/Bx1e8
> __ipipe_handle_irq+8x198/x208
> ? common_interrupt+8xf/Bx2c
>
> The accompanying stack trace seems to implicate an ipipe interrupt
> handler as causing the problem. I'm using xeno_16550A.ko interrupts on
> an isolated interrupt level (IRQ 18).
>
> Interestingly, the Cobalt scheduler and my RT userspace app are still
> running after this, even though the Linux kernel is halted. I proved
> this on an oscilloscope: I can see serial packets going into and out
> of the serial ports at the expected periodic time base.
>
> (Note that the text of these kernel faults above is reconstructed with
> OCR so some addresses are not complete. The computer is hard-locked in
> a text terminal when these happen. I can supply the full JPG pictures
> or re-type addresses if you like.)
>
> The application scenario which causes the above problems: The primary
> app, “apprt2”, is a 32-bit userspace app (compiled -m32) running on
> CPU core 1 (by fixed affinity), on 64-bit Xenomai 3.1 with the ipipe
> patch applied for x86 kernel 4.19.89. It has shared memory via mmap()
> with an RTDM module (“modrt1”) but nothing is happening in “modrt1” at
> present, no interrupts etc. There are also two non-RT userspace Linux
> apps which have attached to the same shared memory via mmap() but
> those are doing nothing much during these tests. I have attached
> several (1-6) RS232 serial devices and one CAN device all
> communicating with “apprt2”.
>
> The system does not fault (for 48+ hours) when no peripheral
> connections are present (Serial/CAN). The faults happen with Serial
> traffic, whether the CAN device is attached or not. The CAN device
> alone with no Serial does not cause the fault (tested for 48+ hours),
> and the fault has also happened when the motherboard serial ports were
> used, so the PCI Moxa code is not implicated.
>
> Note that in order to get 32-bit userspace support to fully work I had
> to manually patch the 16550A.c serial driver with the 32-bit
> “compatibility” patch from the xenomai mailing list. That works OK and
> my apps can communicate fine for hours. The serial packets in my
> applications have CRC checks so we know if data ever gets corrupted.
>
> Note that my apps have been running OK as 32-bit on Xenomai v2.6 for
> two years. Also I ran my apps compiled as 64-bit on Xenomai v3.0.12 and
> did not get any faults in a test lasting 21+ hours (serial driver
> only, no CAN).
>
> Since I imagine Xenomai developers prefer to debug on recent builds, I
> also tested this on Xenomai 3.2.1 and I recompiled my apps as 64-bit. I
> still get kernel Oopses with Xeno 3.2.1:
>
> kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
> BUG: unable to hand
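
As a side note on reading the first report: the number in the quoted "Oops: 0011" line is the x86 page-fault error code, printed in hex, and it can be decoded bit by bit. The sketch below is only an illustration, not something posted in the thread. The bit meanings are the standard x86 ones, while the helper name decode_oops_code and the wording of the descriptions are made up for this example.

```python
# Minimal sketch: decode the hex value from an x86 "Oops: NNNN" line.
# The bit layout is the standard x86 page-fault error code; the function
# name and the description strings are hypothetical, chosen for illustration.

PF_BITS = [
    # (mask, meaning if set, meaning if clear or None to stay silent)
    (0x01, "protection violation (page was present)", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in a page-table entry", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_hex):
    """Return human-readable descriptions for a page-fault error code."""
    code = int(code_hex, 16)
    described = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            described.append(if_set)
        elif if_clear is not None:
            described.append(if_clear)
    return described


if __name__ == "__main__":
    for entry in decode_oops_code("0011"):
        print(entry)
```

For 0011 this yields a protection violation on a present page, during a read-type access, in kernel mode, on an instruction fetch. That is consistent with the "kernel tried to execute NX-protected page" message in the same report. It says nothing by itself about what corrupted the instruction pointer.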