On Sun, Apr 04, 2021 at 09:30:38PM -0700, Paul E. McKenney wrote: > On Sun, Apr 04, 2021 at 09:01:25PM -0700, Paul E. McKenney wrote: > > On Mon, Apr 05, 2021 at 04:08:55AM +0100, Matthew Wilcox wrote: > > > On Sun, Apr 04, 2021 at 02:40:30PM -0700, Paul E. McKenney wrote: > > > > On Sun, Apr 04, 2021 at 10:38:41PM +0200, Thomas Gleixner wrote: > > > > > On Sun, Apr 04 2021 at 12:05, syzbot wrote: > > > > > > > > > > Cc + ... > > > > > > > > And a couple more... > > > > > > > > > > Hello, > > > > > > > > > > > > syzbot found the following issue on: > > > > > > > > > > > > HEAD commit: 5e46d1b7 reiserfs: update > > > > > > reiserfs_xattrs_initialized() co.. > > > > > > git tree: upstream > > > > > > console output: > > > > > > https://syzkaller.appspot.com/x/log.txt?x=1125f831d00000 > > > > > > kernel config: > > > > > > https://syzkaller.appspot.com/x/.config?x=78ef1d159159890 > > > > > > dashboard link: > > > > > > https://syzkaller.appspot.com/bug?extid=88e4f02896967fe1ab0d > > > > > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to > > > > > > the commit: > > > > > > Reported-by: syzbot+88e4f02896967fe1a...@syzkaller.appspotmail.com > > > > > > > > > > > > ============================= > > > > > > WARNING: suspicious RCU usage > > > > > > 5.12.0-rc5-syzkaller #0 Not tainted > > > > > > ----------------------------- > > > > > > kernel/sched/core.c:8294 Illegal context switch in RCU-sched > > > > > > read-side critical section! > > > > > > > > > > > > other info that might help us debug this: > > > > > > > > > > > > > > > > > > rcu_scheduler_active = 2, debug_locks = 0 > > > > > > 3 locks held by syz-executor.4/8418: > > > > > > #0: > > > > > > ffff8880751d2b28 > > > > > > ( > > > > > > &p->pi_lock > > > > > > ){-.-.}-{2:2} > > > > > > , at: try_to_wake_up+0x98/0x14a0 kernel/sched/core.c:3345 > > > > > > #1: > > > > > > ffff8880b9d35258 > > > > > > ( > > > > > > &rq->lock > > > > > > ){-.-.}-{2:2} > > > > > > , at: rq_lock kernel/sched/sched.h:1321 [inline] > > > > > > , at: ttwu_queue kernel/sched/core.c:3184 [inline] > > > > > > , at: try_to_wake_up+0x5e6/0x14a0 kernel/sched/core.c:3464 > > > > > > #2: ffff8880b9d1f948 (&per_cpu_ptr(group->pcpu, > > > > > > cpu)->seq){-.-.}-{0:0}, at: psi_task_change+0x142/0x220 > > > > > > kernel/sched/psi.c:807 > > > > > > > > This looks similar to > > > > syzbot+dde0cc33951735441...@syzkaller.appspotmail.com > > > > in that rcu_sleep_check() sees an RCU lock held, but the later call to > > > > lockdep_print_held_locks() does not. Did something change recently that > > > > could let the ->lockdep_depth counter get out of sync with the actual > > > > number of locks held? > > > > > > Dmitri had a different theory here: > > > > > > https://groups.google.com/g/syzkaller-bugs/c/FmYvfZCZzqA/m/nc2CXUgsAgAJ > > > > There is always room for more than one bug. ;-) > > > > He says "one-off false positives". I was afraid of that... > > And both the examples I have been copied on today are consistent with > debug_locks getting zeroed (e.g., via a call to __debug_locks_off()) > in the midst of a call to rcu_sleep_check(). But I would expect to see > a panic or another splat if that were to happen. > > Dmitry's example did have an additional splat, but I would expect the > RCU-related one to come second. Again, there is always room for more > than one bug. > > On the other hand, there are a lot more callers to debug_locks_off() > than there were last I looked into this. And both of these splats > are consistent with an interrupt in the middle of rcu_sleep_check(), > and that interrupt's handler invoking debug_locks_off(), but without > printing anything to the console. Does that sequence of events ring a > bell for anyone? > > If this is the new normal, I could make RCU_LOCKDEP_WARN() recheck > debug_lockdep_rcu_enabled() after evaluating the condition, but with > a memory barrier immediately before the recheck. But I am not at all > excited by doing this on speculation. Especially given that doing > so might be covering up some other bug. >
Just check the original console log and find: [ 356.696686][ T8418] ============================= [ 356.696692][ T8418] WARNING: suspicious RCU usage [ 356.700193][T14782] ==================================== [ 356.704548][ T8418] 5.12.0-rc5-syzkaller #0 Not tainted [ 356.729981][ T8418] ----------------------------- [ 356.732473][T14782] WARNING: iou-sqp-14780/14782 still has locks held! , so there are two warnnings here, one is from lockdep_rcu_suspisous() and the other is from print_held_locks_bug(). I think this is what happened: in RCU_LOCKDEP_WARN(): if (debug_lockdep_rcu_enabled() // this is true and at this time debug_locks = 1 <interrupted> // lockdep detects a lock bug, set debug_locks = 0 <swicth back> && !__warned // true && (c)) // "c" is a lock_is_held(), which will always returns true if debug_locks == 0! the cause of the problem is that RCU_LOCKDEP_WARN() in fact read debug_locks twice and get different values. But if you change the ordering of two reads, probably can avoid the problem: First read: lock_is_held(); // true if 1) lock is really held or 2) lockdep is off Second read: debug_lockdep_rcu_enabled(); // if lockdep is not off, we know // that the first read got correct // value, otherwise we just ignore // the first read, because either // there is a bug reported between // two reads, or lockdep is already // off when the first read happens. So maybe something below: Regards, Boqun diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index bd04f722714f..d11bee5d9347 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -315,7 +315,7 @@ static inline int rcu_read_lock_any_held(void) #define RCU_LOCKDEP_WARN(c, s) \ do { \ static bool __section(".data.unlikely") __warned; \ - if (debug_lockdep_rcu_enabled() && !__warned && (c)) { \ + if ((c) && debug_lockdep_rcu_enabled() && !__warned) { \ __warned = true; \ lockdep_rcu_suspicious(__FILE__, __LINE__, s); \ } \ > Thanx, Paul