Hi Samuel,

Thanks for your insights. I will try to follow them and will report back 
here. 

In the meantime I have built a kernel with dynamic debug enabled, and I can 
see that cpufreq and the associated calls shown in my earlier post run 
millions of times. Then, out of the blue, there is a crash... so some 
hardware flakiness came to my mind, too.

regards
Torsten

sam...@sholland.org wrote on Saturday, 16 July 2022 at 06:16:16 UTC+2:

> Hi Torsten,
>
> On 7/13/22 3:18 AM, Torsten Beyer wrote:
> > Hi all,
> > 
> > I am trying to debug a bug on an open source air navigation box for
> > gliders called openvario <https://www.openvario.org/doku.php>. It is
> > based on a cubieboard (A20) plus some additional serial connections
> > and an optional sensor board for various flight related pressures.
> > 
> > The system runs on kernel 5.18.5 generated using Yocto 4.0 kirkstone. It
> > tends to run for a couple of hours and then freezes/crashes. At the
> > bottom of this post I have pasted typical kernel debug output from when
> > one of these freezes happens. The crash always happens in the cpufreq
> > driver. If I set the CPU to a fixed frequency (setting min = max
> > frequency), those crashes disappear. This seems to be a workaround, but
> > at the cost of fixing the CPU speed.
> > 
> > So it _seems_ the crash is caused by cpufreq trying to change the CPU
> > frequency (at least at some point in time).
> > 
> > To be honest, I am rather clueless about how to go about finding the
> > root of this issue, let alone fixing it. So I thought I'd ask around
> > here whether this bug looks familiar and may have been tackled (or even
> > fixed) previously (I didn't find anything via the search function,
> > though). In other words: I am thankful for any hints people may be able
> > to give me that get me nearer to a fix.
>
> I have not seen anything like this before. It looks like hardware
> flakiness. Can you provide a disassembly of ccu_div_recalc_rate
> from the kernel this splat came from, to confirm my analysis?
>
> > thanks for any pointers
> > Torsten
> > 
> > [26996.004010] Unable to handle kernel paging request at virtual address 
> 08d80050
> > [26996.011337] [08d80050] *pgd=00000000
> > [26996.014952] Internal error: Oops: 5 [#1] SMP ARM
> > [26996.019590] Modules linked in:
> > [26996.022663] CPU: 1 PID: 95 Comm: sugov:0 Not tainted 5.18.5 #1
> > [26996.028509] Hardware name: Allwinner sun7i (A20) Family
> > [26996.033738] PC is at ccu_div_recalc_rate+0x48/0x90
> > [26996.038555] LR is at ccu_mux_helper_apply_prediv+0x18/0x1c
>
> The crash is between the calls to ccu_mux_helper_apply_prediv and
> divider_recalc_rate, so we are loading arguments for the call to
> divider_recalc_rate.
>
> > [26996.044054] pc : [] lr : [] psr: 600b0113
> > [26996.050326] sp : f09e5dc8 ip : 00000000 fp : c1938200
> > [26996.055554] r10: c1867440 r9 : 1f78a400 r8 : c1302d00
> > [26996.060781] r7 : 1312d000 r6 : 1f78a400 r5 : 00000002 r4 : 08d80084
>
> Assuming r4 is "hw", then the faulting address is cd->div.flags.
> This is weird because r5 already contains cd->div.width...
>
> > [26996.067311] r3 : 00000000 r2 : ffffffff r1 : 00000001 r0 : 1f78a400
>
> ...and r3 already contains cd->div.table. So we were already able
> to access parts of the struct both before and after the faulting
> address.
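>
> For reference, the body of ccu_div_recalc_rate looks roughly like this
> (paraphrased from drivers/clk/sunxi-ng/ccu_div.c, not copied verbatim
> from your exact 5.18.5 tree, so please double-check against your source):
>
>     /* Paraphrased from drivers/clk/sunxi-ng/ccu_div.c; verify against
>      * the exact 5.18.5 source before relying on the details. */
>     static unsigned long ccu_div_recalc_rate(struct clk_hw *hw,
>                                              unsigned long parent_rate)
>     {
>             struct ccu_div *cd = hw_to_ccu_div(hw);
>             unsigned long val;
>             u32 reg;
>
>             /* Read the divider value out of the register. */
>             reg = readl(cd->common.base + cd->common.reg);
>             val = reg >> cd->div.shift;
>             val &= (1 << cd->div.width) - 1;
>
>             /* Apply the pre-divider to the parent rate. */
>             parent_rate = ccu_mux_helper_apply_prediv(&cd->common, &cd->mux,
>                                                       -1, parent_rate);
>
>             /* The remaining arguments are loaded next: cd->div.table
>              * (already in r3), cd->div.flags (the faulting load), and
>              * cd->div.width (already in r5). */
>             return divider_recalc_rate(hw, parent_rate, val, cd->div.table,
>                                        cd->div.flags, cd->div.width);
>     }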
>
> > [26996.073843] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment 
> none
> > [26996.080985] Control: 10c5387d Table: 41ff006a DAC: 00000051
> > [26996.086733] Register r0 information: non-paged memory
> > [26996.091799] Register r1 information: non-paged memory
> > [26996.096858] Register r2 information: non-paged memory
> > [26996.101915] Register r3 information: NULL pointer
> > [26996.106627] Register r4 information: non-paged memory
> > [26996.111688] Register r5 information: non-paged memory
> > [26996.116746] Register r6 information: non-paged memory
> > [26996.121805] Register r7 information: non-paged memory
> > [26996.126863] Register r8 information: slab kmalloc-128 start c1302d00 
> pointer offset 0 size 128
> > [26996.135514] Register r9 information: non-paged memory
> > [26996.140574] Register r10 information: slab task_struct start c1867440 
> pointer offset 0
> > [26996.148517] Register r11 information: slab kmalloc-128 start c1938200 
> pointer offset 0 size 128
> > [26996.157244] Register r12 information: NULL pointer
> > [26996.162049] Process sugov:0 (pid: 95, stack limit = 0xf4bf205c)
> > [26996.167985] Stack: (0xf09e5dc8 to 0xf09e6000)
> > [26996.172361] 5dc0: c0d81584 c03db530 00000000 1f78a400 c1355700 
> c03d181c
>
> What I think is happening is that the value in r4 got corrupted from
> 0xc0d81584 (the saved value on the top of the stack) to 0x08d80084.
>
> Can you try increasing the voltage of the lower OPPs by 100 mV? And
> if that doesn't work, try setting all of the OPPs to 1.4 V. That
> should rule out any instability due to an insufficient CPU supply
> voltage, and also due to any delay in slewing the regulator output.
>
> Regards,
> Samuel
>
> > [26996.180547] 5de0: c1355600 c1355700 1f78a400 c03d34ec 00000000 
> c1355600 1f78a400 39387000
> > [26996.188733] 5e00: c1302d00 1f78a400 c1867440 c03d3554 00000000 
> c1302d00 016e3600 39387000
> > [26996.196917] 5e20: c1302d00 1f78a400 c1867440 c03d3554 c1355600 
> 00000000 1f78a400 c1867440
> > [26996.205101] 5e40: c1302d00 1f78a400 c1867440 c03d39f0 1f78a400 
> 00000000 ffffffff 1f78a400
> > [26996.213287] 5e60: c0d81bd0 df7bf617 c193a340 1f78a400 1f78a400 
> c1938300 ef7dc050 1f78a400
> > [26996.221474] 5e80: c1867440 c03d3c28 c18b3b00 c1938500 1f78a400 
> c1938300 ef7dc050 c06122a4
> > [26996.229659] 5ea0: c1938300 00000001 ffffffff df7bf617 c0d81bd0 
> c18b3b00 ef7dc050 1f78a400
> > [26996.237844] 5ec0: 00000007 c1867440 c1938500 c0db652c 00080e80 
> c0612674 00000000 c0db617c
> > [26996.246030] 5ee0: 1f78a400 df7bf617 c1812800 c1812800 00000000 
> c0dfd944 000ea600 00000000
> > [26996.254214] 5f00: 00000002 c0617054 00000001 c1867440 00000000 
> 00000000 f09e5f5c c1812800
> > [26996.262400] 5f20: 000ea600 00080e80 00000024 df7bf617 00000004 
> c184ba00 c184ba14 00000000
> > [26996.270585] 5f40: 00080e80 c184ba2c 00000001 c0a34650 00000000 
> c0159c98 00000000 c184ba28
> > [26996.278770] 5f60: c1867440 c0dea144 c184ba2c c0136954 c193a500 
> c1867440 c01368e0 c184ba28
> > [26996.286955] 5f80: c13c2100 f0891c44 00000000 c0138194 c193a500 
> c01380c4 00000000 00000000
> > [26996.295138] 5fa0: 00000000 00000000 00000000 c0100148 00000000 
> 00000000 00000000 00000000
> > [26996.303321] 5fc0: 00000000 00000000 00000000 00000000 00000000 
> 00000000 00000000 00000000
> > [26996.311505] 5fe0: 00000000 00000000 00000000 00000000 00000013 
> 00000000 00000000 00000000
> > [26996.319695] ccu_div_recalc_rate from clk_recalc+0x34/0x78
> > [26996.325215] clk_recalc from clk_change_rate+0xa4/0x29c
> > [26996.330461] clk_change_rate from clk_change_rate+0x10c/0x29c
> > [26996.336226] clk_change_rate from clk_change_rate+0x10c/0x29c
> > [26996.341991] clk_change_rate from clk_core_set_rate_nolock+0x16c/0x234
> > [26996.348539] clk_core_set_rate_nolock from clk_set_rate+0x30/0x154
> > [26996.354741] clk_set_rate from _set_opp+0x268/0x550
> > [26996.359644] _set_opp from dev_pm_opp_set_rate+0xe8/0x20c
> > [26996.365062] dev_pm_opp_set_rate from 
> __cpufreq_driver_target+0x584/0x6e4
> > [26996.371876] __cpufreq_driver_target from sugov_work+0x48/0x54
> > [26996.377741] sugov_work from kthread_worker_fn+0x74/0x1a4
> > [26996.383167] kthread_worker_fn from kthread+0xd0/0xec
> > [26996.388242] kthread from ret_from_fork+0x14/0x2c
> > [26996.392967] Exception stack(0xf09e5fb0 to 0xf09e5ff8)
> > [26996.398032] 5fa0: 00000000 00000000 00000000 00000000
> > [26996.406216] 5fc0: 00000000 00000000 00000000 00000000 00000000 
> 00000000 00000000 00000000
> > [26996.414398] 5fe0: 00000000 00000000 00000000 00000000 00000013 
> 00000000
> > [26996.421027] Code: e0055231 e244102c e3e02000 eb0001f3 (e5143034)
>
