Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 18.08.2023 22:23, Rafał Miłecki wrote:

On 14.08.2023 11:04, Geert Uytterhoeven wrote:

Hi Rafal,

On Mon, Aug 7, 2023 at 1:11 PM Rafał Miłecki  wrote:

On 4.08.2023 13:07, Rafał Miłecki wrote:

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


I performed another experiment. First I dropped mtd_check_of_node() to
bring kernel back to the stable state.

Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
ended up adding just enough to make sure all post-mtd symbols in
System.map got the same offset as in case of backporting
mtd_check_of_node().

I started experiencing lockups/hangs again.

I repeated the same test with adding dumb code to the brcm_nvram_probe()
and verifying symbols offsets following brcm_nvram_probe one.

I believe this confirms that this problem is about offset or alignment
of some specific symbol(s). The remaining question is what symbols and
how to fix or workaround that.


I had similar experiences on other ARM platforms many years ago:
bisection led to something completely bogus, and it turned out
adding a single line of innocent code made the system lock up or crash
unexpectedly.  It was definitely related to alignment, as adding the
right extra amount of innocent code would fix the problem. Until some
later change altered the alignment again...
I never found the real cause, but the problems went away over time.
I am not sure I did enable all required errata config options, so I
may have missed some...


I already experienced some weird performance variations on Broadcom's
Northstar platform that were related to symbol layout & cache hit/miss
ratio. For that reason I use -falign-functions=32 for the whole
OpenWrt "bcm53xx" target (it covers Northstar and BCM53573). So
this aspect should be ruled out already in my case.


Relevant OpenWrt commit with some description and links: b54ef39e0b91 ("bcm53xx: use 
-falign-functions=32 for kernel compilation"):

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=b54ef39e0b910a4b8eaca0497fe9b63e8392262a
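
For anyone who wants to double-check that the alignment flag really took effect in a given build, a minimal Python sketch could scan System.map for text symbols whose address is not a multiple of 32. The sample map and its symbol names are made up for illustration, and the script assumes the usual "address type name" line format. Text symbols also include labels from assembly sources, which the compiler flag is not expected to align, so some hits are probably normal.

```python
def misaligned_text_symbols(system_map_text, alignment=32):
    """Return (name, address) for text symbols not aligned to `alignment` bytes."""
    result = []
    for line in system_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        address, symbol_type, name = fields
        if symbol_type not in ("t", "T"):
            continue  # only look at symbols in the text (code) section
        value = int(address, 16)
        if value % alignment != 0:
            result.append((name, value))
    return result


# Hypothetical excerpt; a real run would read System.map from the build tree.
SAMPLE_MAP = """\
c0400000 T add_mtd_device
c0400044 t hypothetical_asm_helper
c0400060 T brcm_nvram_probe
c0800004 D hypothetical_data
"""

for name, value in misaligned_text_symbols(SAMPLE_MAP):
    print(f"{name} at {value:#x} is not 32-byte aligned")
```

Data symbols are skipped on purpose, since -falign-functions only concerns function entry points.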

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 14.08.2023 11:04, Geert Uytterhoeven wrote:

Hi Rafal,

On Mon, Aug 7, 2023 at 1:11 PM Rafał Miłecki  wrote:

On 4.08.2023 13:07, Rafał Miłecki wrote:

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


I performed another experiment. First I dropped mtd_check_of_node() to
bring kernel back to the stable state.

Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
ended up adding just enough to make sure all post-mtd symbols in
System.map got the same offset as in case of backporting
mtd_check_of_node().

I started experiencing lockups/hangs again.

I repeated the same test with adding dumb code to the brcm_nvram_probe()
and verifying symbols offsets following brcm_nvram_probe one.

I believe this confirms that this problem is about offset or alignment
of some specific symbol(s). The remaining question is what symbols and
how to fix or workaround that.


I had similar experiences on other ARM platforms many years ago:
bisection led to something completely bogus, and it turned out
adding a single line of innocent code made the system lock up or crash
unexpectedly.  It was definitely related to alignment, as adding the
right extra amount of innocent code would fix the problem. Until some
later change altered the alignment again...
I never found the real cause, but the problems went away over time.
I am not sure I did enable all required errata config options, so I
may have missed some...


I already experienced some weird performance variations on Broadcom's
Northstar platform that were related to symbol layout & cache hit/miss
ratio. For that reason I use -falign-functions=32 for the whole
OpenWrt "bcm53xx" target (it covers Northstar and BCM53573). So
this aspect should be ruled out already in my case.



Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 7.08.2023 20:34, Florian Fainelli wrote:

On 8/7/23 04:10, Rafał Miłecki wrote:

On 4.08.2023 13:07, Rafał Miłecki wrote:

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


I performed another experiment. First I dropped mtd_check_of_node() to
bring kernel back to the stable state.

Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
ended up adding just enough to make sure all post-mtd symbols in
System.map got the same offset as in case of backporting
mtd_check_of_node().

I started experiencing lockups/hangs again.

I repeated the same test with adding dumb code to the brcm_nvram_probe()
and verifying symbols offsets following brcm_nvram_probe one.

I believe this confirms that this problem is about offset or alignment
of some specific symbol(s). The remaining question is what symbols and
how to fix or workaround that.


In the config.gz file you attached in your first email, both CONFIG_MTD_* and
CONFIG_NVMEM_* are built in, so it is not like we are reaching into module
space for code and/or data and need veneers or anything. It is all part of the
kernel image, so we can assert the maximum distance between instructions etc.

Now is it just that specific mutex that is an issue, or do other mutexes
throughout the system cause problems as well?


If you mean the mtd mutex, I'm quite sure it's not the one to blame. It just
happened that the modified function was using a mutex. It could be any other.



Do we suspect the toolchain to be possibly problematic?


Maybe, I really don't know much about such low-level stuff.




Following dumb change brings back lockups/hangs:

diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index ee437af41..0a24dec55 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1028,6 +1028,22 @@ static long mtdchar_unlocked_ioctl(struct file *file, u_int cmd, u_long arg)
 {
 	int ret;

+	if (!file)
+		pr_info("Missing\n");
+	WARN_ON(!file);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+
 	mutex_lock(&mtd_mutex);
 	ret = mtdchar_ioctl(file, cmd, arg);
 	mutex_unlock(&mtd_mutex);








Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 4.08.2023 13:07, Rafał Miłecki wrote:

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


I performed another experiment. First I dropped mtd_check_of_node() to
bring kernel back to the stable state.

Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
ended up adding just enough to make sure all post-mtd symbols in
System.map got the same offset as in case of backporting
mtd_check_of_node().

I started experiencing lockups/hangs again.

I repeated the same test with adding dumb code to the brcm_nvram_probe()
and verifying symbols offsets following brcm_nvram_probe one.

I believe this confirms that this problem is about offset or alignment
of some specific symbol(s). The remaining question is what symbols and
how to fix or workaround that.

Following dumb change brings back lockups/hangs:

diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index ee437af41..0a24dec55 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1028,6 +1028,22 @@ static long mtdchar_unlocked_ioctl(struct file *file, u_int cmd, u_long arg)
 {
 	int ret;

+	if (!file)
+		pr_info("Missing\n");
+	WARN_ON(!file);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+	WARN_ON(cmd == 1234);
+	WARN_ON(cmd == 5678);
+
 	mutex_lock(&mtd_mutex);
 	ret = mtdchar_ioctl(file, cmd, arg);
 	mutex_unlock(&mtd_mutex);
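
The System.map comparison used in this experiment can be sketched in a few lines of Python. This is only an illustration under assumptions: the two map excerpts and their addresses are invented, and symbols whose name occurs more than once (for example static functions sharing a name) are skipped because they cannot be matched between builds unambiguously.

```python
from collections import Counter


def parse_system_map(text):
    """Parse System.map text ("<hex address> <type> <name>" per line).

    Returns {name: address} for names that occur exactly once.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        address, _symbol_type, name = fields
        entries.append((name, int(address, 16)))
    counts = Counter(name for name, _ in entries)
    return {name: addr for name, addr in entries if counts[name] == 1}


def address_shifts(before, after):
    """Return {name: after - before} for symbols whose address changed."""
    shifts = {}
    for name in before.keys() & after.keys():
        delta = after[name] - before[name]
        if delta != 0:
            shifts[name] = delta
    return shifts


# Hypothetical excerpts; real maps would come from the two build trees.
STABLE_MAP = """\
c0100000 T stext
c0400000 t mtd_nvmem_add
c0400040 T add_mtd_device
c0400200 T brcm_nvram_probe
"""
UNSTABLE_MAP = """\
c0100000 T stext
c0400000 t mtd_nvmem_add
c0400040 t mtd_check_of_node
c0400080 T add_mtd_device
c0400240 T brcm_nvram_probe
"""

shifts = address_shifts(parse_system_map(STABLE_MAP), parse_system_map(UNSTABLE_MAP))
for name in sorted(shifts):
    print(f"{name}: {shifts[name]:+#x}")
```

Running the same comparison between the backported build and the padded build should yield an empty result for the symbols after the modified function if the padding matched the offsets exactly.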




Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 00:10, Rafał Miłecki wrote:

Unfortunately enabling *any* of following options:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
seems to make lockups/hangs go away. I tried for a few hours.


I decided to find out why enabling CONFIG_DEBUG_MUTEXES "fixes" kernel /
device stability for me. I tried manually enabling code that normally
hides behind the #ifdef CONFIG_DEBUG_MUTEXES.

Attached to this e-mail is a small patch that is enough to make my
kernel stable (mutex-fix-bcm53573.diff).

#

That's not the most interesting part, though. What really doesn't make
sense anymore is that the diff below (on top of the attached one) brings back
hangs/lockups.

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 4fe40910f..c440222a4 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -34,6 +34,8 @@ void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 	INIT_LIST_HEAD(&waiter->list);
 }

+/* Dropping below function brings back hangs/lockups & reboots */
+#if 0
 void debug_mutex_wake_waiter(struct mutex *lock, struct mutex_waiter *waiter)
 {
 	lockdep_assert_held(&lock->wait_lock);
@@ -41,6 +43,7 @@ void debug_mutex_wake_waiter(struct mutex *lock, struct mutex_waiter *waiter)
 	DEBUG_LOCKS_WARN_ON(waiter->magic != waiter);
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 }
+#endif

 void debug_mutex_free_waiter(struct mutex_waiter *waiter)
 {
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 479bc96c3..15bd4691b 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -57,9 +57,7 @@ struct mutex {
 	struct optimistic_spin_queue osq; /* Spinner MCS lock */
 #endif
 	struct list_head	wait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
-#endif
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	dep_map;
 #endif
@@ -73,12 +71,10 @@ struct mutex_waiter {
 	struct list_head	list;
 	struct task_struct	*task;
 	struct ww_acquire_ctx	*ww_ctx;
-#ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
-#endif
 };
 
-#ifdef CONFIG_DEBUG_MUTEXES
+#if 1 //def CONFIG_DEBUG_MUTEXES
 
 #define __DEBUG_MUTEX_INITIALIZER(lockname)\
 	, .magic = &lockname
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d0e639497..8fef4485e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -958,10 +958,8 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
 	/* Mutex deadlock detection: */
 	struct mutex_waiter		*blocked_on;
-#endif
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	int				non_block_count;
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 45452facf..b22e6ecd8 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -12,7 +12,7 @@ CFLAGS_REMOVE_mutex-debug.o = $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_rtmutex-debug.o = $(CC_FLAGS_FTRACE)
 endif
 
-obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
+obj-y += mutex-debug.o
 obj-$(CONFIG_LOCKDEP) += lockdep.o
 ifeq ($(CONFIG_PROC_FS),y)
 obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index b02fff282..6dc3f80a3 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -946,9 +946,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 
 	might_sleep();
 
-#ifdef CONFIG_DEBUG_MUTEXES
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-#endif
 
 	ww = container_of(lock, struct ww_mutex, base);
 	if (ww_ctx) {
@@ -1417,9 +1415,7 @@ int __sched mutex_trylock(struct mutex *lock)
 {
 	bool locked;
 
-#ifdef CONFIG_DEBUG_MUTEXES
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-#endif
 
 	locked = __mutex_trylock(lock);
 	if (locked)


Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 00:10, Rafał Miłecki wrote:

Reverting that extra commit from v5.4.238 allows me to run Linux for
hours again (currently 3 devices x 6 hours and counting). So I need in
total 10+1 reverts from 5.4 branch to get a stable kernel.


I switched back to OpenWrt's kernel 5.4 and applied all those reverts I
found. Nothing. I was still getting hangs / lockups + reboots.

After more bisecting I found out it's because OpenWrt backported
commit ad9b10d1eaad ("mtd: core: introduce of support for dynamic
partitions"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9b10d1eaada169bd764abcab58f08538877e26

It didn't make any sense to me. That patch does nothing on my device and
its code is only executed when booting.

It makes even less sense to me. Why such changes that should not affect
anything actually break stability for BCM53573?

I narrowed the above patch down even further. It's actually enough to apply
the diff below to break kernel stability:

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index f69c5b94e..f10dd3af1 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -590,6 +590,25 @@ static int mtd_nvmem_add(struct mtd_info *mtd)
 	return 0;
 }

+static void mtd_check_of_node(struct mtd_info *mtd)
+{
+	struct device_node *partitions, *parent_dn;
+	struct mtd_info *parent;
+
+	/* Check if MTD already has a device node */
+	if (dev_of_node(&mtd->dev))
+		return;
+
+	/* Check if a partitions node exist */
+	parent = mtd_get_master(mtd);
+	parent_dn = dev_of_node(&parent->dev);
+	pr_info("[%s] mtd->name:%s parent_dn:%pOF\n", __func__, mtd->name, parent_dn);
+	if (!parent_dn)
+		return;
+
+	of_node_put(parent_dn);
+}
+
 /**
  * add_mtd_device - register an MTD device
  * @mtd: pointer to new MTD device info structure
@@ -673,6 +692,7 @@ int add_mtd_device(struct mtd_info *mtd)
 	mtd->dev.devt = MTD_DEVT(i);
 	dev_set_name(&mtd->dev, "mtd%d", i);
 	dev_set_drvdata(&mtd->dev, mtd);
+	mtd_check_of_node(mtd);
 	of_node_get(mtd_get_of_node(mtd));
 	error = device_register(&mtd->dev);
 	if (error) {




Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 00:21, Russell King (Oracle) wrote:

On Wed, Aug 02, 2023 at 12:10:24AM +0200, Rafał Miłecki wrote:

Years ago I added support for Broadcom's BCM53573 SoCs. We released
firmwares based on Linux 4.4 (and later on 4.14) that worked almost
fine. There was one little issue we couldn't debug or fix: random hangs
and reboots. They were too rare to deal with (most devices worked fine
for weeks or months).

Recently I updated my stable kernel 5.4 and I started experiencing
stability issues on my own! After some uptime (usually from 0 to 20
minutes of close to zero activity) serial console hangs. I can't type
anything and I stop getting any messages. I've to wait about a minute
for watchdog to kick in and reboot device.

(...)

I'm clueless at this point. Is it possible that the kernel has some locking
bug I can hit only on this specific SoC? BCM53573s have a single ARM
Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
can think of is a slow arch timer running at 36.8 kHz.
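
To put the slow arch timer into perspective, the arithmetic can be written out as a small Python sketch. The two frequencies come from the paragraph above and the 27918 ns figure is the sched_clock resolution printed in the boot log further down in this message; the function names are only illustrative.

```python
def tick_period_ns(timer_hz):
    """Duration of one timer tick in nanoseconds."""
    return 1e9 / timer_hz


def cpu_cycles_per_tick(cpu_hz, timer_hz):
    """Number of CPU clock cycles that elapse during one timer tick."""
    return cpu_hz / timer_hz


def frequency_from_resolution_hz(resolution_ns):
    """Timer frequency implied by a reported resolution in nanoseconds."""
    return 1e9 / resolution_ns


TIMER_HZ = 36_800          # arch timer frequency quoted in the message
CPU_HZ = 900_000_000       # 900 MHz Cortex-A7
LOG_RESOLUTION_NS = 27918  # "resolution 27918ns" from the boot log

print(f"tick period: {tick_period_ns(TIMER_HZ):.0f} ns")
print(f"CPU cycles per tick: {cpu_cycles_per_tick(CPU_HZ, TIMER_HZ):.0f}")
print(f"frequency implied by log: {frequency_from_resolution_hz(LOG_RESOLUTION_NS):.0f} Hz")
```

So a single timer tick spans roughly 24 thousand CPU cycles, and the resolution in the log corresponds to a slightly lower frequency (about 35.8 kHz) than the quoted one.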

I tried compiling kernel with:
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_WQ_WATCHDOG=y
but it didn't change or report anything.

Unfortunately enabling *any* of following options:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
seems to make lockups/hangs go away. I tried for a few hours.
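
Since these options change whether the problem shows up at all, it can help to confirm which of them a given build really has enabled. Below is a minimal Python sketch that scans .config text for them; the sample config is invented, and the option list is just the one named in this thread.

```python
import re

# Lock-debugging options mentioned in this thread.
LOCK_DEBUG_OPTIONS = (
    "CONFIG_DEBUG_RT_MUTEXES",
    "CONFIG_DEBUG_SPINLOCK",
    "CONFIG_DEBUG_MUTEXES",
    "CONFIG_PROVE_LOCKING",
)


def enabled_options(config_text, names):
    """Return the sorted subset of `names` that is set to 'y' in a .config."""
    enabled = set()
    for line in config_text.splitlines():
        match = re.fullmatch(r"(CONFIG_\w+)=y", line.strip())
        if match and match.group(1) in names:
            enabled.add(match.group(1))
    return sorted(enabled)


# Hypothetical .config excerpt for illustration only.
SAMPLE_CONFIG = """\
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_DEBUG_SPINLOCK is not set
CONFIG_DEBUG_MUTEXES=y
# CONFIG_DEBUG_RT_MUTEXES is not set
"""

masking = enabled_options(SAMPLE_CONFIG, LOCK_DEBUG_OPTIONS)
if masking:
    print("options that may mask the hangs:", ", ".join(masking))
else:
    print("no lock-debugging options enabled")
```

A build for reproducing the hangs would be expected to report none of these options.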

Sadly I don't have access to JTAG or any low level debugging interface.

Does looking at commits I reported above give anyone a hint on what may
be going on maybe?


If you suspect locking issues, make sure you have lockdep enabled which
will detect locking errors. You will want CONFIG_PROVE_LOCKING enabled.

I will say that I use IPv6, and I run 32-bit kernels here both on real
ARMv7 hardware (Armada 388 and iMX6 based stuff) and also in KVM based
VMs, and these have run virtually every release of the kernel (not
stable kernels though) and I haven't ever seen the behaviour that you
describe.

If it is specific to stable kernels, then that would be rather
disappointing.


I wrote above that with any of: CONFIG_DEBUG_RT_MUTEXES,
CONFIG_DEBUG_SPINLOCK or CONFIG_DEBUG_MUTEXES enabled I can't reproduce
the issue anymore. Right? Well I swear it was true for some random 5.4
release I tested before.

With your comment I decided to try CONFIG_PROVE_LOCKING anyway / again
and this time on 1 of my BCM53573 devices I got something very
interesting on the first boot.

FWIW following error:
Broadcom B53 (2) bcma_mdio-0-0:1e: failed to register switch: -517
is caused by invalid DT I sent fixes for just recently.

Please scroll through the first booting lines for the WARNING:

[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 5.4.238 (ubuntu@nat) (gcc version 8.4.0 (OpenWrt 
GCC 8.4.0 r15234+1-d89a7f0120)) #0 SMP Fri Jul 14 12:56:51 2023
[0.00] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[0.00] CPU: div instructions available: patching division code
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
instruction cache
[0.00] OF: fdt: Machine model: Tenda AC9
[0.00] earlycon: ns16550a0 at MMIO 0x18000300 (options '115200n8')
[0.00] printk: bootconsole [ns16550a0] enabled
[0.00] Memory policy: Data cache writealloc
[0.00] Hit pending asynchronous external abort (FSR=0x0c06) during 
first unmask, this is most likely caused by a firmware/bootloader bug.
[0.00] percpu: Embedded 14 pages/cpu s27944 r8192 d21208 u57344
[0.00] Built 1 zonelists, mobility grouping on.  Total pages: 32480
[0.00] Kernel command line: console=ttyS0,115200 earlycon
[0.00] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes, 
linear)
[0.00] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes, 
linear)
[0.00] mem auto-init: stack:off, heap alloc:off, heap free:off
[0.00] Memory: 118164K/131072K available (5531K kernel code, 201K 
rwdata, 1960K rodata, 1024K init, 2106K bss, 12908K reserved, 0K cma-reserved, 
0K highmem)
[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[0.00] rcu: Hierarchical RCU implementation.
[0.00] rcu: RCU restricting CPUs from NR_CPUS=2 to nr_cpu_ids=1.
[0.00] rcu: RCU calculated value of scheduler-enlistment delay is 10 
jiffies.
[0.00] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[0.00] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[0.00] arch_timer: cp15 timer(s) running at 0.03MHz (virt).
[0.00] clocksource: arch_sys_counter: mask: 0xff 
max_cycles: 0x10eb00226, max_idle_ns: 56421785894076 ns
[0.27] sched_clock: 56 bits at 35kHz, resolution 27918ns, wraps every 
70368744165810ns
[0.008654] Ignoring delay timer arch_delay_timer, which has insufficient 
resolution of 27918ns
[0.017951] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., 
Ingo Molnar
[0.025936] ... 

Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 00:25, Florian Fainelli wrote:

Hi Rafal,

On 8/1/23 15:10, Rafał Miłecki wrote:

Hi,

Years ago I added support for Broadcom's BCM53573 SoCs. We released
firmwares based on Linux 4.4 (and later on 4.14) that worked almost
fine. There was one little issue we couldn't debug or fix: random hangs
and reboots. They were too rare to deal with (most devices worked fine
for weeks or months).

Recently I updated my stable kernel 5.4 and I started experiencing
stability issues on my own! After some uptime (usually from 0 to 20
minutes of close to zero activity) serial console hangs. I can't type
anything and I stop getting any messages. I've to wait about a minute
for watchdog to kick in and reboot device.

#

I took that great chance and decided to track the regression.

Linux 5.4 stable branch worked stable up to the release v5.4.197.
Starting with v5.4.198 I started experiencing those stability issues. I
bisected it down to the commit 4460066eb248 ("ipv6: fix locking issues
with loops over idev->addr_list"):
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e

With above commit reverted I was able to use stable 5.4 branch up to the
release v5.4.207. Starting with v5.4.208 it got unstable again. I
bisected it down to:
commit d0d583484d2e ("locking/refcount: Consolidate implementations of
refcount_t")
commit dab787c73f6e ("locking/refcount: Consolidate
REFCOUNT_{MAX,SATURATED} definitions")
commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
commit 809554147d60 ("locking/refcount: Improve performance of generic
REFCOUNT_FULL code")
commit 9c9269977f03 ("locking/refcount: Move the bulk of the
REFCOUNT_FULL implementation into the <linux/refcount.h> header")
commit 04bff7d7b808 ("locking/refcount: Remove unused
refcount_*_checked() variants")
commit 513b19a43bec ("locking/refcount: Ensure integer operands are
treated as signed")
commit 68b4ee68e8c8 ("locking/refcount: Define constants for
saturation and max refcount values")
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
(I didn't actually check above commits individually).

Reverting above locking/refcount commits worked fine for few releases:
up to the v5.4.219. Starting with v5.4.220 I got hangs again. I bisected
that down to the commit 131287ff833d ("once: add DO_ONCE_SLOW() for
sleepable contexts").

Reverting that extra commit from v5.4.238 allows me to run Linux for
hours again (currently 3 devices x 6 hours and counting). So I need in
total 10+1 reverts from 5.4 branch to get a stable kernel.

#

I'm clueless at this point. Is it possible that the kernel has some locking
bug I can hit only on this specific SoC? BCM53573s have a single ARM
Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
can think of is a slow arch timer running at 36.8 kHz.


From the look of it, it seems like the CPU might have bugs with atomics?

Your log indicates that your Cortex-A7 is r0p5 which is described to be 
susceptible to ARM_ERRATA_814220, do you have it enabled by any chance, if not, 
can you enable it and see if makes any difference?
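
For reference, the r0p5 reading follows from the CPU ID value printed in the boot log ("ARMv7 Processor [410fc075] revision 5"). A minimal Python sketch of decoding that register value, assuming the standard ARM MIDR field layout, could look like this; the helper names are illustrative only.

```python
def decode_midr(midr):
    """Split an ARM Main ID Register value into its fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,   # 0x41 is 'A' for Arm
        "variant": (midr >> 20) & 0xF,        # major revision, the N in rNpM
        "architecture": (midr >> 16) & 0xF,
        "part_number": (midr >> 4) & 0xFFF,   # 0xC07 is Cortex-A7
        "revision": midr & 0xF,               # minor revision, the M in rNpM
    }


def release_string(midr):
    """Format the variant/revision pair in the usual rNpM notation."""
    fields = decode_midr(midr)
    return f"r{fields['variant']}p{fields['revision']}"


MIDR_FROM_LOG = 0x410FC075  # value shown in the boot log

fields = decode_midr(MIDR_FROM_LOG)
print(f"part number: {fields['part_number']:#x}, release: {release_string(MIDR_FROM_LOG)}")
```

Errata descriptions are usually stated in terms of these rNpM releases, which is why the decoded value matters when choosing CONFIG_ARM_ERRATA_* options.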


I had it disabled. Unfortunately CONFIG_ARM_ERRATA_814220=y doesn't help.



Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 09:00, Rafał Miłecki wrote:

With your comment I decided to try CONFIG_PROVE_LOCKING anyway / again
and this time on 1 of my BCM53573 devices I got something very
interesting on the first boot.

FWIW following error:
Broadcom B53 (2) bcma_mdio-0-0:1e: failed to register switch: -517
is caused by invalid DT I sent fixes for just recently.

Please scroll through the first booting lines for the WARNING:

(...)
[    1.167234] bgmac_bcma bcma0:5: Found PHY addr: 30 (NOREGS)
[    1.173655] ------------[ cut here ]------------
[    1.178374] WARNING: CPU: 0 PID: 1 at kernel/locking/mutex.c:950 
__mutex_lock+0x6b4/0x8a0
[    1.186721] DEBUG_LOCKS_WARN_ON(lock->magic != lock)


Ah, that mutex WARNING comes from my Tenda AC9 device which happens to
use a hacky OpenWrt downstream b53 driver. That driver uses wrong API
(it behaves as PHY driver instead of MDIO driver). It results in probing
against PHY device which isn't properly initialized.

Long story short: the above WARNING is just noise. Please ignore it. Sorry
for that.

A kernel compiled with CONFIG_PROVE_LOCKING still works fine on other
devices, and on the Tenda AC9 after fixing the PHY<->MDIO thing. That kernel
option hides the actual bug, whatever it is.



Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Geert Uytterhoeven
Hi Rafal,

On Mon, Aug 7, 2023 at 1:11 PM Rafał Miłecki  wrote:
> On 4.08.2023 13:07, Rafał Miłecki wrote:
> > I triple checked that. Dropping a single unused function breaks kernel /
> > device stability on BCM53573!
> >
> > AFAIK the only thing below diff actually affects is location of symbols
> > (I actually verified that by comparing System.map before and after -
> > over 22'000 of relocated symbols).
> >
> > Can some unfortunate location of symbols cause those hangs/lockups?
>
> I performed another experiment. First I dropped mtd_check_of_node() to
> bring kernel back to the stable state.
>
> Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
> ended up adding just enough to make sure all post-mtd symbols in
> System.map got the same offset as in case of backporting
> mtd_check_of_node().
>
> I started experiencing lockups/hangs again.
>
> I repeated the same test with adding dumb code to the brcm_nvram_probe()
> and verifying symbols offsets following brcm_nvram_probe one.
>
> I believe this confirms that this problem is about offset or alignment
> of some specific symbol(s). The remaining question is what symbols and
> how to fix or workaround that.

I had similar experiences on other ARM platforms many years ago:
bisection led to something completely bogus, and it turned out
adding a single line of innocent code made the system lock up or crash
unexpectedly.  It was definitely related to alignment, as adding the
right extra amount of innocent code would fix the problem. Until some
later change altered the alignment again...
I never found the real cause, but the problems went away over time.
I am not sure I did enable all required errata config options, so I
may have missed some...

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
