Sorry for the delayed answer, it took us some time to instrument our
setup for broadcasting the kernel output over serial,
and now we have some interesting results.
See below.
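For reference, in case anyone wants to replicate the setup: we routed the kernel log to a UART roughly as follows (a sketch assuming a standard GRUB install and a UART on ttyS0; the device name and baud rate depend on the hardware).

```shell
# /etc/default/grub (sketch): send kernel messages to the first UART at
# 115200 baud, in addition to the local console, with verbose logging.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 loglevel=7"

# Then regenerate the GRUB configuration and reboot:
#   sudo update-grub
# On the capturing machine, record the output, e.g.:
#   picocom -b 115200 /dev/ttyUSB0 | tee kernel.log
```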

> On 05.04.22 15:43, Arturo Laurenzi wrote:
> >> On 04.04.22 15:21, Arturo Laurenzi via Xenomai wrote:
> >
> >>>
> >>> Recently, we have started a transition towards Ubuntu 20.04, and things
> >>> have started to break.
> >>>
> >>> The first attempt was to install kernel 5.4.151 and stick to ipipe. Under
> >>> this setup, we experience issues even before starting our applications. We
> >>> have seen random crashes while compiling with GCC, sporadic "System
> >>> Program Problem Detected" popups by Ubuntu, and others. We even tried to
> >>> re-install the OS and kernel from scratch with no luck.
> >>
> >> A reference setup for this kernel line can be found in xenomai-images
> >> (https://source.denx.de/Xenomai/xenomai-images). Would be good to
> >> understand which deviation from it makes the difference for which
> >> component (see also further questions below).
> >
> > I'm attaching the config we're using (from /boot/config-$(uname -r)).
> > If that makes sense, we're going to try to configure the kernel
> > according to this file
> > (https://source.denx.de/Xenomai/xenomai-images/-/blob/master/recipes-kernel/linux/files/amd64_defconfig).
> > What kernel version do you recommend to try?
> >
>
> Always the latest of the individual kernel series.

We still have to test the reference .config file, as we gave higher
priority to getting the kernel output over serial.

> >>>
> >>> The second attempt was to stick to our old kernel 4.19.140. All the weird
> >>> issues disappear and the system is stable. However, we are unable to have
> >>> the system pass our suite of "stress tests", which basically involves
> >>> starting, running, and killing process B multiple times in a cyclic
> >>> fashion, while process A runs in the background. After a short while
> >>> (minutes), the whole system just hangs, forcing us to do a hard reset.
> >>> Only once did we manage to capture this kernel oops after rebooting
> >>> (journalctl -k -b -1 --no-pager).
> >>>
> >>
> >> For reliably recording crashes, it is highly recommended to use a UART
> >> as kernel debug output.
> >
> > Will do ASAP and let you know.

Done, see below.

> >>> The third attempt was to try out kernel 5.10.89 plus the new dovetail
> >>> patch, and Xenomai v3.2.1. Again, all the weird issues are gone and the
> >>> system is stable. However, we are unable to have the system pass our suite
> >>> of "stress tests". Differently from 4.19-ipipe, the system resists for a
> >>> longer time before hanging (a few hours sometimes), but this also varies a
> >>> lot.
> >>>
> >>> After some more investigation, we found out something interesting. By
> >>> removing the code that interacts with Process A, Process B is then able to
> >>> run "forever" (overnight at least), but *only if Process A is not 
> >>> running*.
> >>> Otherwise, the system will hang. In other words, the mere presence of
> >>> Process A is affecting Process B, even though both IDDP and ZMQ have been
> >>> removed from B and replaced with fake data. Furthermore, the system does
> >>> not freeze if we set B1's scheduling policy to SCHED_OTHER.
> >>
> >> Do you have the Xenomai watchdog enabled, thus will you be able to tell
> >> RT application "hangs" (infinite loop at high prio) apart from real
> >> hangs/crashes?
> >
> > Yes. When we try a while(true) inside a RT context, we see the
> > watchdog killing our application
> > as expected.
> >
> >
> >>>
> >>> From these - rather heuristic - tests, it looks like there could be some
> >>> coupling between unrelated processes which causes some sort of bug that is
> >>> probably related to some interaction with mutexes/condvars when these are
> >>> used from a RT context. This issue shows up (or at least we have seen it)
> >>> only under Ubuntu 20.04 (GCC 9.x), whereas a 18.04 build (GCC 7.x) looks
> >>> fine.
> >>
> >> Ubuntu toolchains are known for aggressively enabling certain security
> >> features. Maybe one that we didn't check yet flipped between 18.04 and
> >> 20.04 - if that switch is the only difference between working and
> >> non-working builds in your case. GCC itself should be fine; we are
> >> testing with gcc-10 via Debian 11 in our CI.
> >>
> >> Can you check whether the toolchain change breaks the kernel (kernel
> >> with old toolchain runs fine with userspace built via new toolchain)?
> >
> > We have tried this, and still the system freezes after a while. We
> > followed this procedure:
> >  1) generate binaries for our "working" kernel 4.19.140-xeno-ipipe-3.1
> > on a Ubuntu 18 machine (make deb-pkg)
> >  2) copy the whole /usr/xenomai directory (compiled with the 18.04
> > toolchain) to the test machine with Ubuntu 20.04
> >  3) install the kernel binaries on the test machine
> >  4) re-compile our application
> > Is this ok?
> >
>
> Wait, these are three variables: kernel, Xenomai application and Ubuntu
> userspace. Does your system also break when using both kernel and
> application binaries from a Ubuntu 18 build? Or will it start to break
> once you recompile the Xenomai application with Ubuntu 20 toolchain?

This also needs further investigation. Right now we're focusing on
5.10-dovetail + Xenomai 3.2 + application, all built under the
default 20.04 toolchain.

> >>>
> >>> The purpose of this message is twofold.
> >>> First, to see if these symptoms might "ring a bell" to anyone in the
> >>> community, who might be able to suggest a fix.
> >>> Second, we'd like to ask what you would do to debug this issue. Which tool
> >>> could we use to trace what's going on, considering that whatever the bug
> >>> is, it leads to a state where the machine is not usable at all. We can
> >>> share our .config files if required, and we are willing to test more
> >>> combinations of kernel and xenomai patch or library versions upon your
> >>> advice. Any help you can give us is greatly appreciated.
> >>>
> >>
> >> Can you simplify your test case to a level that makes it sharable,
> >> executable by third parties? Please also share your kernel .config.
> >
> > Will try. It's not going to be quick though, as any trial we make
> > needs hours of testing to understand if it causes a system freeze.
> >
> > What is the recommended way to trace/debug this kind of problem? Is there
> > anything "fancier" than broadcasting kernel output over a serial port?
> >
>
> Hard to say in general. Full system freezes can be tricky to debug
> unless there are at least some hints provided by the kernel. That's why
> the focus is first on validating that.

In this regard, we managed to produce a stack trace via the serial
port. This was obtained with 5.10-dovetail + Xenomai 3.2 + application,
all built under the default 20.04 toolchain. It happens consistently in
both our scenarios, i.e.
 1) process A interacting with process B via IDDP and ZMQ (i.e. TCP/IP)
 2) process A and a "modified" process B running at the same time, and
not interacting in any way
The stack trace is always the same (I am attaching a few examples):

[  594.117307] kernel tried to execute NX-protected page - exploit
attempt? (uid: 1000)
[  594.117308] BUG: unable to handle page fault for address: ffffa20908ee1b00
[  594.117308] #PF: supervisor instruction fetch in kernel mode
[  594.117308] #PF: error_code(0x0011) - permissions violation
[  594.117309] PGD 44b601067 P4D 44b601067 PUD 80000001c00001e3
[  594.117310] Oops: 0011 [#1] SMP PTI IRQ_PIPELINE
[  594.117310] CPU: 1 PID: 34507 Comm: xbot2-core Not tainted
5.10.89-xeno-ipipe-3.1+ #1
[  594.117311] Hardware name:  /TS175, BIOS BQKLR112 07/04/2017
[  594.117311] IRQ stage: Linux
[  594.117311] RIP: 0010:0xffffa20908ee1b00
[  594.117312] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 <02> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
[  594.117312] RSP: 0018:ffffa37b8003cf80 EFLAGS: 00010202
[  594.117313] RAX: ffffffff84cb1d29 RBX: ffffa37b89e8bd98 RCX: 00000000cd46ea8f
[  594.117313] RDX: ffffa37b89e8bda0 RSI: ffffa20b9fc40000 RDI: ffffa37b89e8bd98
[  594.117314] RBP: ffffffff84d10064 R08: ffffa2084005d800 R09: 0000000000000001
[  594.117314] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000001e
[  594.117314] R13: 000000000000001c R14: 0000000000000000 R15: 0000000000000024
[  594.117315] FS:  00007fef2c191600(0000) GS:ffffa20b9fc40000(0000)
knlGS:0000000000000000
[  594.117315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  594.117316] CR2: ffffa20908ee1b00 CR3: 000000010095e003 CR4: 00000000003706e0
[  594.117316] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  594.117316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  594.117317] Call Trace:
[  594.117317]  <IRQ>
[  594.117317]  ? irq_work_run_list+0x32/0x40
[  594.117317]  ? irq_work_run+0x18/0x30
[  594.117318]  ? inband_work_interrupt+0x9/0x10
[  594.117318]  ? handle_synthetic_irq+0x59/0x80
[  594.117318]  ? asm_call_irq_on_stack+0x12/0x20
[  594.117319]  </IRQ>
[  594.117319]  ? arch_do_IRQ_pipelined+0xc2/0x150
[  594.117319]  ? sync_current_irq_stage+0x1ae/0x230
[  594.117320]  ? __inband_irq_enable+0x47/0x50
[  594.117320]  ? inband_irq_restore+0x21/0x30
[  594.117320]  ? _raw_spin_unlock_irqrestore+0x1d/0x20
[  594.117320]  ? __set_cpus_allowed_ptr+0xa2/0x200
[  594.117321]  ? sched_setaffinity+0x1b7/0x2a0
[  594.117321]  ? __x64_sys_sched_setaffinity+0x4e/0x90
[  594.117321]  ? do_syscall_64+0x44/0xa0
[  594.117322]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  594.117322] Modules linked in: fuse rtpacket binfmt_misc nls_ascii
nls_cp437 vfat fat evdev x86_pkg_temp_thermal intel_powerclamp
rt_e1000e crc32c_intel rtnet i915 i2c_algo_bit video drm_kms_helper
cfa
[  594.117332] CR2: ffffa20908ee1b00
[  597.667838] ---[ end trace 384903a16448d047 ]---
[  597.667839] RIP: 0010:0xffffa20908ee1b00
[  597.667840] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 <02> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
[  597.667840] RSP: 0018:ffffa37b8003cf80 EFLAGS: 00010202
[  597.667841] RAX: ffffffff84cb1d29 RBX: ffffa37b89e8bd98 RCX: 00000000cd46ea8f
[  597.667841] RDX: ffffa37b89e8bda0 RSI: ffffa20b9fc40000 RDI: ffffa37b89e8bd98
[  597.667841] RBP: ffffffff84d10064 R08: ffffa2084005d800 R09: 0000000000000001
[  597.667842] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000001e
[  597.667842] R13: 000000000000001c R14: 0000000000000000 R15: 0000000000000024
[  597.667843] FS:  00007fef2c191600(0000) GS:ffffa20b9fc40000(0000)
knlGS:0000000000000000
[  597.667843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  597.667843] CR2: ffffa20908ee1b00 CR3: 000000010095e003 CR4: 00000000003706e0
[  597.667844] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  597.667844] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  597.667844] Kernel panic - not syncing: Fatal exception in interrupt

More interestingly, we tried to remove all calls to
pthread_setaffinity_np from our code (since it appears in the stack
trace). After this modification, the system does not freeze anymore
(or, at least, it manages to survive overnight). Re-introducing
pthread_setaffinity_np consistently causes the system to freeze again.

Given this, what should our next step be, in your opinion? We could
 - move to latest dovetail and try again with our .config (5.15? which xenomai?)
 - use the reference .config and try again (also 5.15? which xenomai?)
 - other?

We're also trying to produce a minimal reproducible example that can
trigger the crash, but it's not easy as the number of variables
is large and every trial requires hours of validation.

> If you want it fancier: In the past, we used kgdb on real hardware as
> well, but that wasn't tried in a while. More reliable to debug -
> provided the issue is reproducible then - is moving everything into a
> KVM machine and debugging the guest from the host once it locked up.
>
> Jan

Thanks for this piece of advice; we are willing to learn more about
it if the more basic techniques don't fix our issue.

Arturo Laurenzi, Davide Antonucci

> --
> Siemens AG, Technology
> Competence Center Embedded Linux
