Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
Hi Rafal,

On Mon, Sep 4, 2023 at 10:34 AM Rafał Miłecki wrote:

> I'm clueless at this point.
> Maybe someone can come up with an idea of actual issue & ideally a
> solution.

Damn this is frustrating.

> 2. Clock (arm,armv7-timer)
>
> While comparing main clock in Broadcom's SDK with upstream one I noticed
> a tiny difference: mask value. I don't know if it makes any sense but
> switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
> arm_arch_timer.c (to match SDK) increases average uptime (time before a
> hang/lockup happens) from 4 minutes to 36 minutes.

This could be related to how often the system goes to idle.

> +	if (cpu_idle_force_poll == 1234)
> +		arch_cpu_idle();
> +	if (cpu_idle_force_poll == 5678)
> +		arch_cpu_idle();
> +	if (cpu_idle_force_poll == 1234)
> +		arch_cpu_idle();
> +	if (cpu_idle_force_poll == 5678)
> +		arch_cpu_idle();
> +	if (cpu_idle_force_poll == 1234)
> +		arch_cpu_idle();
> +	if (cpu_idle_force_poll == 5678)
> +		arch_cpu_idle();
> +	if (cpu_idle_force_poll == 1234)
> +		arch_cpu_idle();

Idle again.

I would have tried to see what arch_cpu_idle() is doing.
arm_pm_idle() or cpu_do_idle()?

What happens if you just put return in arch_cpu_idle() so it does
nothing?

Yours,
Linus Walleij

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
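For readers following the "arm_pm_idle() or cpu_do_idle()?" question: on 32-bit ARM, arch_cpu_idle() chooses between the two roughly as modelled below. This is a userspace sketch, not kernel code. The enum, the stub bodies, example_pm_hook() and model_arch_cpu_idle() are made up for illustration, and the real function in arch/arm/kernel/process.c also handles IRQ state, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Which idle routine the model ended up calling (illustrative only). */
enum idle_path { IDLE_NONE, IDLE_PM_HOOK, IDLE_CPU_DO_IDLE };

enum idle_path last_path = IDLE_NONE;

/* Optional platform hook; stays NULL unless a platform installs one. */
void (*arm_pm_idle)(void);

/* Stand-in for the processor-specific idle routine. */
void cpu_do_idle(void)
{
	last_path = IDLE_CPU_DO_IDLE;
}

/* Stand-in for a hypothetical platform-provided hook. */
void example_pm_hook(void)
{
	last_path = IDLE_PM_HOOK;
}

/* Model of the dispatch: prefer the hook, otherwise fall back. */
enum idle_path model_arch_cpu_idle(void)
{
	last_path = IDLE_NONE;

	if (arm_pm_idle)
		arm_pm_idle();
	else
		cpu_do_idle();

	/* Exactly one of the two routines must have run. */
	assert(last_path != IDLE_NONE);
	return last_path;
}
```

With arm_pm_idle left NULL (the case reported later in this thread), model_arch_cpu_idle() returns IDLE_CPU_DO_IDLE; after assigning example_pm_hook to arm_pm_idle it returns IDLE_PM_HOOK.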
Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
On Wed, Nov 29, 2023 at 10:20 PM Rafał Miłecki wrote:

> Here comes a more interesting experiment though. Putting there:
>
> if (!(foo++ % 1)) {
>	pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
> }
>
> doesn't seem to help.
>
>
> Putting the following however seems to make kernel/device stable:
>
> if (!(foo++ % 100)) {
>	pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
> }

That's just too weird.

> I think I'm just going to assume those chipsets are simply hw broken.

If disabling CPU idle on these altogether stabilizes them, then maybe
that is what we need to do?

Yours,
Linus Walleij
ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
I made a second attempt at debugging some longstanding stability issues
affecting BCM53573 SoCs. Those are single CPU core ARM Cortex-A7 boards
with a pretty slow arch timer running at 36.8 kHz.

After 0 to 20 minutes of close to zero activity I experience hangs and I
need to wait a minute for the watchdog to kick in and reboot the device.

First debugging attempt:
https://lore.kernel.org/netdev/0f9d0cd6-d344-7915-7bc1-7a090b830...@gmail.com/T/
("ARM board lockups/hangs triggered by locks and mutexes")

After a lot of bisecting, testing & hacking I believe there are 3 types
of kernel aspects that affect BCM53573 stability. I'd like to describe
them below to document my debugging work.

I'm clueless at this point.
Maybe someone can come up with an idea of actual issue & ideally a
solution.

# 1. Locking

During my first bisecting attempts I found multiple locking-related
commits that regressed stability.

Bisected commits:
131287ff833d ("once: add DO_ONCE_SLOW() for sleepable contexts")

and a following group:
d0d583484d2e ("locking/refcount: Consolidate implementations of refcount_t")
dab787c73f6e ("locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions")
0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
809554147d60 ("locking/refcount: Improve performance of generic REFCOUNT_FULL code")
9c9269977f03 ("locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the header")
04bff7d7b808 ("locking/refcount: Remove unused refcount_*_checked() variants")
513b19a43bec ("locking/refcount: Ensure integer operands are treated as signed")
68b4ee68e8c8 ("locking/refcount: Define constants for saturation and max refcount values")

I don't believe there is actually anything wrong about above changes.
Maybe it's some tiny timing thing that my board just doesn't like?

# 2. Clock (arm,armv7-timer)

While comparing main clock in Broadcom's SDK with upstream one I noticed
a tiny difference: mask value.
I don't know if it makes any sense but switching from
CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in arm_arch_timer.c (to
match SDK) increases average uptime (time before a hang/lockup happens)
from 4 minutes to 36 minutes.

# 3. Random code changes

During my bisecting attempts I found one commit that regressed kernel
stability but actual changes were meaningless in context of locking.

It was commit ad9b10d1eaad ("mtd: core: introduce of support for dynamic
partitions"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9b10d1eaada169bd764abcab58f08538877e26

I thought that maybe it was all about making add_mtd_device() bigger and
changing addresses of a lot of symbols (looking at System.map).

So I reverted that mtd commit and developed a dummy change relocating as
few symbols (System.map) as possible while still breaking stability:

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -94,6 +94,21 @@ void __cpuidle default_idle_call(void)
 		arch_cpu_idle();
 		start_critical_timings();
 	}
+
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
 }
 
 static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,

The above dummy change didn't relocate thousands of symbols but only about
20 of them. They happened to be lock symbols however.

Does it make any sense for the above diff to regress kernel stability for
me and cause hangs/lockups?
--- System.map.good
+++ System.map.bad
@@ -22214,36 +22214,36 @@
 c062e7e0 T __cpuidle_text_start
 c062e7e0 t cpu_idle_poll
 c062e860 T default_idle_call
-c062e884 T __cpuidle_text_end
-c062e888 T __lock_text_start
-c062e8a0 T _raw_spin_unlock_irqrestore
-c062e8c0 T _raw_spin_trylock
-c062e900 T _raw_write_unlock_irqrestore
-c062e920 T _raw_read_trylock
-c062e960 T _raw_write_trylock
-c062e9a0 T _raw_spin_lock_bh
-c062ea00 T _raw_read_lock_bh
-c062ea40 T _raw_write_lock_bh
-c062ea80 T _raw_spin_trylock_bh
-c062eb00 T _raw_spin_unlock_bh
-c062eb40 T _raw_write_unlock_bh
-c062eb80 T _raw_read_unlock_bh
-c062ebc0 T _raw_read_unlock_irqrestore
-c062ec00 T _raw_write_lock
-c062ec40 T _raw_write_lock_irq
-c062ec80 T _raw_write_lock_irqsave
-c062ecc0 T _raw_read_lock
-c062ed00 T _raw_spin_lock
-c062ed40 T _raw_read_lock_irq
-c062ed80 T _raw_spin_lock_irq
-c062ede0 T _raw_spin_lock_irqsave
-c062ee40 T _raw_read_lock_irqsave
-c062ee70 T __hyp_text_end
-c062ee70 T __hyp_text_start
-c062ee70 T __kprobes_text_end
-c062ee70 T __kprobes_text_start
-c062ee70 T __lock_text_end
-c062ee70 T _etext
+c062e954 T
Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
Hi, it's a late reply but I didn't find enough determination earlier.

On 8.09.2023 10:10, Linus Walleij wrote:
> On Mon, Sep 4, 2023 at 10:34 AM Rafał Miłecki wrote:
>> I'm clueless at this point.
>> Maybe someone can come up with an idea of actual issue & ideally a
>> solution.
>
> Damn this is frustrating.
>
>> 2. Clock (arm,armv7-timer)
>>
>> While comparing main clock in Broadcom's SDK with upstream one I noticed
>> a tiny difference: mask value. I don't know if it makes any sense but
>> switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
>> arm_arch_timer.c (to match SDK) increases average uptime (time before a
>> hang/lockup happens) from 4 minutes to 36 minutes.
>
> This could be related to how often the system goes to idle.
>
>> +	if (cpu_idle_force_poll == 1234)
>> +		arch_cpu_idle();
>> +	if (cpu_idle_force_poll == 5678)
>> +		arch_cpu_idle();
>> +	if (cpu_idle_force_poll == 1234)
>> +		arch_cpu_idle();
>> +	if (cpu_idle_force_poll == 5678)
>> +		arch_cpu_idle();
>> +	if (cpu_idle_force_poll == 1234)
>> +		arch_cpu_idle();
>> +	if (cpu_idle_force_poll == 5678)
>> +		arch_cpu_idle();
>> +	if (cpu_idle_force_poll == 1234)
>> +		arch_cpu_idle();
>
> Idle again.
>
> I would have tried to see what arch_cpu_idle() is doing.
> arm_pm_idle() or cpu_do_idle()?

In my case arm_pm_idle is NULL.

> What happens if you just put return in arch_cpu_idle() so it does
> nothing?

Doesn't help. I also tried putting:
udelay(10);
and
udelay(1000);
at the beginning of arch_cpu_idle(). Neither helped.

Here comes a more interesting experiment though. Putting there:

if (!(foo++ % 1)) {
	pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
}

doesn't seem to help.

Putting the following however seems to make kernel/device stable:

if (!(foo++ % 100)) {
	pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
}

I think I'm just going to assume those chipsets are simply hw broken.
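To make the difference between the two experiments explicit: `foo++ % 1` is always 0, so the first variant calls pr_info() on every pass through arch_cpu_idle(), while the `% 100` variant prints only on every 100th pass. The sketch below counts how often each condition fires. It is plain userspace C; count_prints() is a made-up helper, and treating `foo` as an unsigned int starting at 0 is an assumption, since its declaration is not shown in the thread.

```c
#include <assert.h>

/*
 * Count how often `!(foo++ % modulus)` is true over n_calls calls.
 * Assumes foo is an unsigned int starting at 0 (not shown in the thread).
 */
unsigned int count_prints(unsigned int n_calls, unsigned int modulus)
{
	unsigned int foo = 0;
	unsigned int prints = 0;

	assert(modulus != 0);	/* x % 0 is undefined behaviour */

	for (unsigned int i = 0; i < n_calls; i++) {
		if (!(foo++ % modulus))
			prints++;	/* this is where pr_info() sat */
	}
	return prints;
}
```

Over 1000 idle calls, count_prints(1000, 1) gives 1000 and count_prints(1000, 100) gives 10. So the variant reported as stable is the one that prints rarely, and the one that prints every time is the one that did not help.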
Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
Hi Rafał,

On Mon, Sep 4, 2023 at 10:35 AM Rafał Miłecki wrote:
> 2. Clock (arm,armv7-timer)
>
> While comparing main clock in Broadcom's SDK with upstream one I noticed
> a tiny difference: mask value. I don't know if it makes any sense but
> switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
> arm_arch_timer.c (to match SDK) increases average uptime (time before a
> hang/lockup happens) from 4 minutes to 36 minutes.

That code path is used only for type != ARCH_TIMER_TYPE_CP15, but
your kernel log

    arch_timer: cp15 timer(s) running at 0.03MHz (virt).

suggests that type == ARCH_TIMER_TYPE_CP15?!?

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
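As background for the mask discussion, here is a small userspace sketch of what a clocksource mask does: it limits a cycle delta to the counter's width so that a wrapped counter still yields the right difference. The helper names clocksource_mask(), cycles_delta() and wrap_seconds() are made up for this example and are not kernel API (the kernel uses the CLOCKSOURCE_MASK() macro). The 36.8 kHz rate is the one reported at the start of the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace equivalent of a 64-bit mask with the low `bits` bits set. */
uint64_t clocksource_mask(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/* Cycles elapsed between two counter reads, tolerant of one wrap. */
uint64_t cycles_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}

/* Whole seconds until a counter of `bits` bits wraps at rate_hz. */
uint64_t wrap_seconds(unsigned int bits, uint64_t rate_hz)
{
	assert(rate_hz != 0);
	return clocksource_mask(bits) / rate_hz;
}
```

For example, wrap_seconds(56, 36800) is on the order of 10^12 seconds, so at this timer rate a 56-bit counter does not wrap within any realistic uptime. That arithmetic alone does not explain the uptime change reported above.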