Sorry for the delayed answer, it took us some time to instrument our
setup for broadcasting the kernel output over serial,
and now we have some interesting results.
See below.
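For reference, in case anyone wants to replicate the setup: we routed the kernel log to a UART roughly as follows (a sketch assuming a standard GRUB install and a UART on ttyS0; the device name and baud rate depend on the hardware).

```shell
# /etc/default/grub (sketch): send kernel messages to the first UART at
# 115200 baud, in addition to the local console, with verbose logging.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 loglevel=7"

# Then regenerate the GRUB configuration and reboot:
#   sudo update-grub
# On the capturing machine, record the output, e.g.:
#   picocom -b 115200 /dev/ttyUSB0 | tee kernel.log
```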

> On 05.04.22 15:43, Arturo Laurenzi wrote:
> >> On 04.04.22 15:21, Arturo Laurenzi via Xenomai wrote:
> >
> >>>
> >>> Recently, we have started a transition towards Ubuntu 20.04, and things
> >>> have started to break.
> >>>
> >>> The first attempt was to install kernel 5.4.151 and stick to ipipe. Under
> >>> this setup, we experience issues even before starting our applications. We
> >>> have seen random crashes while compiling with GCC, sporadic "System
> >>> Program Problem Detected" popups by Ubuntu, and others. We even tried to
> >>> re-install the OS and kernel from scratch with no luck.
> >>
> >> A reference setup for this kernel line can be found in xenomai-images
> >> (https://source.denx.de/Xenomai/xenomai-images). Would be good to
> >> understand which deviation from it makes the difference for which
> >> component (see also further questions below).
> >
> > I'm attaching the config we're using (from /boot/config-$(uname -r)).
> > If that makes sense, we're going to try to configure the kernel
> > according to this file
> > (https://source.denx.de/Xenomai/xenomai-images/-/blob/master/recipes-kernel/linux/files/amd64_defconfig).
> > What kernel version do you recommend to try?
> >
>
> Always the latest of the individual kernel series.

We still have to test the reference .config file, as we gave higher
priority to getting the kernel output over serial.

> >>>
> >>> The second attempt was to stick to our old kernel 4.19.140. All the weird
> >>> issues disappear and the system is stable. However, we are unable to have
> >>> the system pass our suite of "stress tests", which basically involves
> >>> starting, running, and killing process B multiple times in a cyclic
> >>> fashion, while process A runs in the background. After a short while
> >>> (minutes), the whole system just hangs, forcing us to do a hard reset.
> >>> Only once did we manage to capture this kernel oops after rebooting
> >>> (journalctl -k -b -1 --no-pager).
> >>>
> >>
> >> For reliably recording crashes, it is highly recommended to use a UART
> >> as kernel debug output.
> >
> > Will do ASAP and let you know.

Done, see below.

> >>> The third attempt was to try out kernel 5.10.89 plus the new dovetail
> >>> patch, and Xenomai v3.2.1. Again, all the weird issues are gone and the
> >>> system is stable. However, we are unable to have the system pass our suite
> >>> of "stress tests". Differently from 4.19-ipipe, the system resists for a
> >>> longer time before hanging (a few hours sometimes), but this also varies a
> >>> lot.
> >>>
> >>> After some more investigation, we found out something interesting. By
> >>> removing the code that interacts with Process A, Process B is then able to
> >>> run "forever" (overnight at least), but *only if Process A is not 
> >>> running*.
> >>> Otherwise, the system will hang. In other words, the mere presence of
> >>> Process A is affecting Process B, even though both IDDP and ZMQ have been
> >>> removed from B and replaced with fake data. Furthermore, the system does
> >>> not freeze if we set B1's scheduling policy to SCHED_OTHER.
> >>
> >> Do you have the Xenomai watchdog enabled, thus will you be able to tell
> >> RT application "hangs" (infinite loop at high prio) apart from real
> >> hangs/crashes?
> >
> > Yes. When we try a while(true) inside a RT context, we see the
> > watchdog killing our application
> > as expected.
> >
> >
> >>>
> >>> From these - rather heuristic - tests, it looks like there could be some
> >>> coupling between unrelated processes which causes some sort of bug that is
> >>> probably related to some interaction with mutexes/condvars when these are
> >>> used from a RT context. This issue shows up (or at least we have seen it)
> >>> only under Ubuntu 20.04 (GCC 9.x), whereas a 18.04 build (GCC 7.x) looks
> >>> fine.
> >>
> >> Ubuntu toolchains are known for aggressively enabling certain security
> >> features. Maybe one that we didn't check yet flipped between 18.04 and
> >> 20.04 - if that switch is the only difference between working and
> >> non-working builds in your case. GCC itself should be fine; we are
> >> testing with gcc-10 via Debian 11 in our CI.
> >>
> >> Can you check whether the toolchain change breaks the kernel (kernel
> >> with old toolchain runs fine with userspace built via new toolchain)?
> >
> > We have tried this, and still the system freezes after a while. We
> > followed this procedure:
> >  1) generate binaries for our "working" kernel 4.19.140-xeno-ipipe-3.1
> > on a Ubuntu 18 machine (make deb-pkg)
> >  2) copy the whole /usr/xenomai directory (compiled with the 18.04
> > toolchain) to the test machine with Ubuntu 20.04
> >  3) install the kernel binaries on the test machine
> >  4) re-compile our application
> > Is this ok?
> >
>
> Wait, these are three variables: kernel, Xenomai application and Ubuntu
> userspace. Does your system also break when using both kernel and
> application binaries from a Ubuntu 18 build? Or will it start to break
> once you recompile the Xenomai application with Ubuntu 20 toolchain?

This also needs further investigation. Right now we're focusing on
5.10-dovetail + Xenomai 3.2 + application, all built under the
default 20.04 toolchain.

> >>>
> >>> The purpose of this message is twofold.
> >>> First, to see if these symptoms might "ring a bell" to anyone in the
> >>> community, who might be able to suggest a fix.
> >>> Second, we'd like to ask what you would do to debug this issue. Which tool
> >>> could we use to trace what's going on, considering that whatever the bug
> >>> is, it leads to a state where the machine is not usable at all. We can
> >>> share our .config files if required, and we are willing to test more
> >>> combinations of kernel and xenomai patch or library versions upon your
> >>> advice. Any help you can give us is greatly appreciated.
> >>>
> >>
> >> Can you simplify your test case to a level that makes it sharable,
> >> executable by third parties? Please also share your kernel .config.
> >
> > Will try. It's not going to be quick though, as any trial we make
> > needs hours of testing to understand if it causes a system freeze.
> >
> > What is the recommended way to trace/debug this kind of problem? Is there
> > anything "fancier" than broadcasting kernel output over a serial port?
> >
>
> Hard to say in general. Full system freezes can be tricky to debug
> unless there are at least some hints provided by the kernel. That's why
> the focus is first on validating that.

In this regard, we managed to produce a stack trace via the serial
port. This was obtained with 5.10-dovetail + Xenomai 3.2 + application,
all built under the default 20.04 toolchain. It happens consistently in
both our scenarios, i.e.
 1) process A interacting with process B via IDDP and ZMQ (i.e. TCP/IP)
 2) process A and a "modified" process B running at the same time, and
not interacting in any way
The stack trace is always the same (I am attaching a few examples):

[  594.117307] kernel tried to execute NX-protected page - exploit
attempt? (uid: 1000)
[  594.117308] BUG: unable to handle page fault for address: ffffa20908ee1b00
[  594.117308] #PF: supervisor instruction fetch in kernel mode
[  594.117308] #PF: error_code(0x0011) - permissions violation
[  594.117309] PGD 44b601067 P4D 44b601067 PUD 80000001c00001e3
[  594.117310] Oops: 0011 [#1] SMP PTI IRQ_PIPELINE
[  594.117310] CPU: 1 PID: 34507 Comm: xbot2-core Not tainted
5.10.89-xeno-ipipe-3.1+ #1
[  594.117311] Hardware name:  /TS175, BIOS BQKLR112 07/04/2017
[  594.117311] IRQ stage: Linux
[  594.117311] RIP: 0010:0xffffa20908ee1b00
[  594.117312] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 <02> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
[  594.117312] RSP: 0018:ffffa37b8003cf80 EFLAGS: 00010202
[  594.117313] RAX: ffffffff84cb1d29 RBX: ffffa37b89e8bd98 RCX: 00000000cd46ea8f
[  594.117313] RDX: ffffa37b89e8bda0 RSI: ffffa20b9fc40000 RDI: ffffa37b89e8bd98
[  594.117314] RBP: ffffffff84d10064 R08: ffffa2084005d800 R09: 0000000000000001
[  594.117314] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000001e
[  594.117314] R13: 000000000000001c R14: 0000000000000000 R15: 0000000000000024
[  594.117315] FS:  00007fef2c191600(0000) GS:ffffa20b9fc40000(0000)
knlGS:0000000000000000
[  594.117315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  594.117316] CR2: ffffa20908ee1b00 CR3: 000000010095e003 CR4: 00000000003706e0
[  594.117316] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  594.117316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  594.117317] Call Trace:
[  594.117317]  <IRQ>
[  594.117317]  ? irq_work_run_list+0x32/0x40
[  594.117317]  ? irq_work_run+0x18/0x30
[  594.117318]  ? inband_work_interrupt+0x9/0x10
[  594.117318]  ? handle_synthetic_irq+0x59/0x80
[  594.117318]  ? asm_call_irq_on_stack+0x12/0x20
[  594.117319]  </IRQ>
[  594.117319]  ? arch_do_IRQ_pipelined+0xc2/0x150
[  594.117319]  ? sync_current_irq_stage+0x1ae/0x230
[  594.117320]  ? __inband_irq_enable+0x47/0x50
[  594.117320]  ? inband_irq_restore+0x21/0x30
[  594.117320]  ? _raw_spin_unlock_irqrestore+0x1d/0x20
[  594.117320]  ? __set_cpus_allowed_ptr+0xa2/0x200
[  594.117321]  ? sched_setaffinity+0x1b7/0x2a0
[  594.117321]  ? __x64_sys_sched_setaffinity+0x4e/0x90
[  594.117321]  ? do_syscall_64+0x44/0xa0
[  594.117322]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  594.117322] Modules linked in: fuse rtpacket binfmt_misc nls_ascii
nls_cp437 vfat fat evdev x86_pkg_temp_thermal intel_powerclamp
rt_e1000e crc32c_intel rtnet i915 i2c_algo_bit video drm_kms_helper
cfa
[  594.117332] CR2: ffffa20908ee1b00
[  597.667838] ---[ end trace 384903a16448d047 ]---
[  597.667839] RIP: 0010:0xffffa20908ee1b00
[  597.667840] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 <02> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
[  597.667840] RSP: 0018:ffffa37b8003cf80 EFLAGS: 00010202
[  597.667841] RAX: ffffffff84cb1d29 RBX: ffffa37b89e8bd98 RCX: 00000000cd46ea8f
[  597.667841] RDX: ffffa37b89e8bda0 RSI: ffffa20b9fc40000 RDI: ffffa37b89e8bd98
[  597.667841] RBP: ffffffff84d10064 R08: ffffa2084005d800 R09: 0000000000000001
[  597.667842] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000001e
[  597.667842] R13: 000000000000001c R14: 0000000000000000 R15: 0000000000000024
[  597.667843] FS:  00007fef2c191600(0000) GS:ffffa20b9fc40000(0000)
knlGS:0000000000000000
[  597.667843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  597.667843] CR2: ffffa20908ee1b00 CR3: 000000010095e003 CR4: 00000000003706e0
[  597.667844] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  597.667844] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  597.667844] Kernel panic - not syncing: Fatal exception in interrupt

More interestingly, we tried to remove all calls to
pthread_setaffinity_np from our code (since it appears in the stack
trace). After this modification, the system does not freeze anymore
(or, at least, it manages to survive overnight). Re-introducing
pthread_setaffinity_np consistently causes the system to freeze again.

Given this, what should our next step be, in your opinion? We could
 - move to latest dovetail and try again with our .config (5.15? which xenomai?)
 - use the reference .config and try again (also 5.15? which xenomai?)
 - other?

We're also trying to produce a minimal reproducible example that can
trigger the crash, but it's not easy as the number of variables
is large and every trial requires hours of validation.

> If you want it fancier: In the past, we used kgdb on real hardware as
> well, but that wasn't tried in a while. More reliable to debug -
> provided the issue is reproducible then - is moving everything into a
> KVM machine and debugging the guest from the host once it locked up.
>
> Jan

Thanks for this piece of advice; we are willing to learn more about
it if the more basic techniques don't fix our issue.

Arturo Laurenzi, Davide Antonucci

> --
> Siemens AG, Technology
> Competence Center Embedded Linux
