Sorry for the delayed answer; it took us some time to instrument our setup for capturing the kernel output over serial, and now we have some interesting results. See below.
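For anyone trying to reproduce our serial-console setup: we route the console to the first UART via the kernel command line, roughly as below. This is only a sketch of what we did; the port, unit, and baud rate are machine-specific, and ignore_loglevel is optional but convenient for catching everything:

```
# /etc/default/grub -- then run update-grub and reboot
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 ignore_loglevel"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1"
```

On the receiving machine, any terminal program (e.g. minicom or picocom at 115200 8N1) can then log the kernel output.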
> On 05.04.22 15:43, Arturo Laurenzi wrote:
> >> On 04.04.22 15:21, Arturo Laurenzi via Xenomai wrote:
> >>>
> >>> Recently, we have started a transition towards Ubuntu 20.04, and things have started to break.
> >>>
> >>> The first attempt was to install kernel 5.4.151 and stick to ipipe. Under this setup, we experience issues even before starting our applications. We have seen random crashes while compiling with GCC, sporadic "System Program Problem Detected" popups by Ubuntu, and others. We even tried to re-install the OS and kernel from scratch with no luck.
> >>
> >> A reference setup for this kernel line can be found in xenomai-images (https://source.denx.de/Xenomai/xenomai-images). Would be good to understand which deviation from it makes the difference for which component (see also further questions below).
>
> > I'm attaching the config we're using (from /boot/config-$(uname -r)). If that makes sense, we're going to try to configure the kernel according to this file (https://source.denx.de/Xenomai/xenomai-images/-/blob/master/recipes-kernel/linux/files/amd64_defconfig). What kernel version do you recommend trying?
>
> Always the latest of the individual kernel series.

We still have to test the reference .config file, as we gave higher priority to getting the kernel output over serial.

> >>> The second attempt was to stick to our old kernel 4.19.140. All the weird issues disappear and the system is stable. However, we are unable to have the system pass our suite of "stress tests", which basically involves starting, running, and killing process B multiple times in a cyclic fashion, while process A runs in the background. After a short while (minutes), the whole system just hangs, forcing us to do a hard reset. Only once did we manage to get this kernel oops after rebooting (journalctl -k -b -1 --no-pager).
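For context, the B-cycling part of the "stress tests" mentioned above boils down to a loop like the simplified sketch below. The binary path and timings here are placeholders, and the real suite also keeps process A (with its IDDP/ZMQ wiring) running in the background, which is all omitted:

```c
/* Simplified sketch of our stress cycle: spawn process B, let it run
 * briefly, kill it, repeat. The binary path stands in for our real
 * process B; the run time is a placeholder as well. */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One spawn/run/kill cycle; returns 0 on success, -1 on error. */
int cycle_once(const char *proc_b, unsigned run_ms)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                     /* child: become process B */
        execl(proc_b, proc_b, (char *)NULL);
        _exit(127);                     /* exec failed */
    }
    usleep(run_ms * 1000u);             /* let B run for a while */
    kill(pid, SIGTERM);                 /* then tear it down */
    int status;
    return waitpid(pid, &status, 0) == pid ? 0 : -1;
}

/* The real suite keeps cycling for hours; n is bounded here. */
int run_cycles(const char *proc_b, int n, unsigned run_ms)
{
    for (int i = 0; i < n; i++)
        if (cycle_once(proc_b, run_ms))
            return -1;
    return 0;
}
```

With our application in place of the placeholder binary, a handful of minutes of this loop is enough to hang the 4.19 machine.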
> >>>
> >>
> >> For reliably recording crashes, it is highly recommended to use a UART as kernel debug output.
>
> > Will do ASAP and let you know.

Done, see below.

> >>> The third attempt was to try out kernel 5.10.89 plus the new dovetail patch, and Xenomai v3.2.1. Again, all the weird issues are gone and the system is stable. However, we are unable to have the system pass our suite of "stress tests". Unlike 4.19-ipipe, the system resists for a longer time before hanging (a few hours sometimes), but this also varies a lot.
> >>>
> >>> After some more investigation, we found out something interesting. By removing the code that interacts with Process A, Process B is then able to run "forever" (overnight at least), but *only if Process A is not running*. Otherwise, the system will hang. In other words, the mere presence of Process A is affecting Process B, even though both IDDP and ZMQ have been removed from B and replaced with fake data. Furthermore, the system does not freeze if we set B1's scheduling policy to SCHED_OTHER.
> >>
> >> Do you have the Xenomai watchdog enabled, thus will you be able to tell RT application "hangs" (infinite loop at high prio) apart from real hangs/crashes?
>
> > Yes. When we try a while(true) inside an RT context, we see the watchdog killing our application as expected.
>
> >>> From these - rather heuristic - tests, it looks like there could be some coupling between unrelated processes which causes some sort of bug, probably related to some interaction with mutexes/condvars when these are used from an RT context. This issue shows up (or at least we have seen it) only under Ubuntu 20.04 (GCC 9.x), whereas an 18.04 build (GCC 7.x) looks fine.
> >>
> >> Ubuntu toolchains are known for aggressively enabling certain security features.
> >> Maybe one that we didn't check yet flipped between 18.04 and 20.04 - if that switch is the only difference between working and non-working builds in your case. GCC itself should be fine, we are testing with gcc-10 via Debian 11 in our CI.
> >>
> >> Can you check whether the toolchain change breaks the kernel (kernel with old toolchain runs fine with userspace built via new toolchain)?
>
> > We have tried this, and still the system freezes after a while. The procedure we followed:
> > 1) generate binaries for our "working" kernel 4.19.140-xeno-ipipe-3.1 on a Ubuntu 18 machine (make deb-pkg)
> > 2) copy the whole /usr/xenomai directory (compiled with the 18.04 toolchain) to the test machine with Ubuntu 20.04
> > 3) install the kernel binaries on the test machine
> > 4) re-compile our application
> > Is this ok?
>
> Wait, these are three variables: kernel, Xenomai application and Ubuntu userspace. Does your system also break when using both kernel and application binaries from a Ubuntu 18 build? Or will it start to break once you recompile the Xenomai application with the Ubuntu 20 toolchain?

This also needs further investigation. Right now we're focusing on 5.10-dovetail + Xenomai 3.2 + application, all built under the default 20.04 toolchain.

> >>> The purpose of this message is twofold. First, to see if these symptoms might "ring a bell" for anyone in the community who might be able to suggest a fix. Second, we'd like to ask what you would do to debug this issue. Which tools could we use to trace what's going on, considering that whatever the bug is, it leads to a state where the machine is not usable at all? We can share our .config files if required, and we are willing to test more combinations of kernel and xenomai patch or library versions upon your advice. Any help you can give us is greatly appreciated.
> >>>
> >>
> >> Can you simplify your test case to a level that makes it sharable, executable by third parties? Please also share your kernel .config.
>
> > Will try. It's not going to be quick though, as any trial we make needs hours of testing to understand if it causes a system freeze.
> >
> > What is the recommended way to trace/debug this kind of problem? Is there anything "fancier" than broadcasting kernel output over a serial port?
>
> Hard to say in general. Full system freezes can be tricky to debug unless there are at least some hints provided by the kernel. That's why the focus is first on validating that.

In this regard, we managed to produce a stack trace via the serial port. It was obtained on 5.10-dovetail + Xenomai 3.2 + application, all built under the default 20.04 toolchain. The crash happens consistently in both our scenarios, i.e.
1) process A interacting with process B via IDDP and ZMQ (i.e. TCP/IP)
2) process A and a "modified" process B running at the same time, not interacting in any way

The stack trace is always the same (I am attaching a few examples):

[ 594.117307] kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
[ 594.117308] BUG: unable to handle page fault for address: ffffa20908ee1b00
[ 594.117308] #PF: supervisor instruction fetch in kernel mode
[ 594.117308] #PF: error_code(0x0011) - permissions violation
[ 594.117309] PGD 44b601067 P4D 44b601067 PUD 80000001c00001e3
[ 594.117310] Oops: 0011 [#1] SMP PTI IRQ_PIPELINE
[ 594.117310] CPU: 1 PID: 34507 Comm: xbot2-core Not tainted 5.10.89-xeno-ipipe-3.1+ #1
[ 594.117311] Hardware name: /TS175, BIOS BQKLR112 07/04/2017
[ 594.117311] IRQ stage: Linux
[ 594.117311] RIP: 0010:0xffffa20908ee1b00
[ 594.117312] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <02> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
[ 594.117312] RSP: 0018:ffffa37b8003cf80 EFLAGS: 00010202
[ 594.117313] RAX: ffffffff84cb1d29 RBX: ffffa37b89e8bd98 RCX: 00000000cd46ea8f
[ 594.117313] RDX: ffffa37b89e8bda0 RSI: ffffa20b9fc40000 RDI: ffffa37b89e8bd98
[ 594.117314] RBP: ffffffff84d10064 R08: ffffa2084005d800 R09: 0000000000000001
[ 594.117314] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000001e
[ 594.117314] R13: 000000000000001c R14: 0000000000000000 R15: 0000000000000024
[ 594.117315] FS: 00007fef2c191600(0000) GS:ffffa20b9fc40000(0000) knlGS:0000000000000000
[ 594.117315] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 594.117316] CR2: ffffa20908ee1b00 CR3: 000000010095e003 CR4: 00000000003706e0
[ 594.117316] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 594.117316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 594.117317] Call Trace:
[ 594.117317] <IRQ>
[ 594.117317] ? irq_work_run_list+0x32/0x40
[ 594.117317] ? irq_work_run+0x18/0x30
[ 594.117318] ? inband_work_interrupt+0x9/0x10
[ 594.117318] ? handle_synthetic_irq+0x59/0x80
[ 594.117318] ? asm_call_irq_on_stack+0x12/0x20
[ 594.117319] </IRQ>
[ 594.117319] ? arch_do_IRQ_pipelined+0xc2/0x150
[ 594.117319] ? sync_current_irq_stage+0x1ae/0x230
[ 594.117320] ? __inband_irq_enable+0x47/0x50
[ 594.117320] ? inband_irq_restore+0x21/0x30
[ 594.117320] ? _raw_spin_unlock_irqrestore+0x1d/0x20
[ 594.117320] ? __set_cpus_allowed_ptr+0xa2/0x200
[ 594.117321] ? sched_setaffinity+0x1b7/0x2a0
[ 594.117321] ? __x64_sys_sched_setaffinity+0x4e/0x90
[ 594.117321] ? do_syscall_64+0x44/0xa0
[ 594.117322] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 594.117322] Modules linked in: fuse rtpacket binfmt_misc nls_ascii nls_cp437 vfat fat evdev x86_pkg_temp_thermal intel_powerclamp rt_e1000e crc32c_intel rtnet i915 i2c_algo_bit video drm_kms_helper cfa
[ 594.117332] CR2: ffffa20908ee1b00
[ 597.667838] ---[ end trace 384903a16448d047 ]---
[ 597.667839] RIP: 0010:0xffffa20908ee1b00
[ 597.667840] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <02> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
[ 597.667840] RSP: 0018:ffffa37b8003cf80 EFLAGS: 00010202
[ 597.667841] RAX: ffffffff84cb1d29 RBX: ffffa37b89e8bd98 RCX: 00000000cd46ea8f
[ 597.667841] RDX: ffffa37b89e8bda0 RSI: ffffa20b9fc40000 RDI: ffffa37b89e8bd98
[ 597.667841] RBP: ffffffff84d10064 R08: ffffa2084005d800 R09: 0000000000000001
[ 597.667842] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000001e
[ 597.667842] R13: 000000000000001c R14: 0000000000000000 R15: 0000000000000024
[ 597.667843] FS: 00007fef2c191600(0000) GS:ffffa20b9fc40000(0000) knlGS:0000000000000000
[ 597.667843] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 597.667843] CR2: ffffa20908ee1b00 CR3: 000000010095e003 CR4: 00000000003706e0
[ 597.667844] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 597.667844] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 597.667844] Kernel panic - not syncing: Fatal exception in interrupt

More interestingly, we tried to remove all calls to pthread_setaffinity_np in our code (since it appears in the
stack trace). After this modification, the system does not freeze anymore (or, at least, it manages to survive overnight). Re-introducing pthread_setaffinity_np consistently causes the system to freeze again.

Given this, what should our next step be, in your opinion? We could
- move to the latest dovetail and try again with our .config (5.15? which xenomai?)
- use the reference .config and try again (also 5.15? which xenomai?)
- other?

We're also trying to produce a minimal reproducible example that can trigger the crash, but it's not easy, as the number of variables is big and every trial requires hours of validation.

> If you want it fancier: In the past, we used kgdb on real hardware as well, but that hasn't been tried in a while. More reliable to debug - provided the issue is reproducible then - is moving everything into a KVM machine and debugging the guest from the host once it has locked up.
>
> Jan

Thanks for this piece of advice; we are willing to learn more about it if more basic techniques don't fix our issue.

Arturo Laurenzi, Davide Antonucci

> --
> Siemens AG, Technology
> Competence Center Embedded Linux
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Kernel_log_3_idle_pos
Type: application/octet-stream
Size: 6123 bytes
Desc: not available
URL: <http://xenomai.org/pipermail/xenomai/attachments/20220419/015bcc1b/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Kernel_log_1_idle_pos
Type: application/octet-stream
Size: 5297 bytes
Desc: not available
URL: <http://xenomai.org/pipermail/xenomai/attachments/20220419/015bcc1b/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Kernel_log_2_idle_pos
Type: application/octet-stream
Size: 6533 bytes
Desc: not available
URL: <http://xenomai.org/pipermail/xenomai/attachments/20220419/015bcc1b/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Kernel_log_1_dummy
Type: application/octet-stream
Size: 6782 bytes
Desc: not available
URL: <http://xenomai.org/pipermail/xenomai/attachments/20220419/015bcc1b/attachment-0003.obj>
