On Fri, Dec 07, 2018 at 06:11:31AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 07, 2018 at 08:25:09AM +0000, He, Bo wrote:
> > Bad news: the issue is still reproduced after a 61-hour monkey test on
> > 1 of 6 boards with CONFIG_RCU_BOOST=y, and the issue is not seen on
> > kernel 4.14 (CONFIG_RCU_BOOST is also not enabled in our kernel 4.14
> > config).  Enclosed are the logs.
> >
> > > So the question becomes "Why is the grace-period kthread being awakened
> > > so many times, but not actually waking up?"
> >
> > Maybe it's not a scheduling issue; I have two suspects.
> >
> > We can see tons of grace_period traces with 117392312:
> >
> > [219346.919405, 0] showmap-31232 [000] d..1 219346.136035: rcu_future_grace_period: rcu_preempt 117392312 117392316 0 0 3 Startleaf
> > [219346.919417, 0] showmap-31232 [000] d..1 219346.136036: rcu_future_grace_period: rcu_preempt 117392312 117392316 0 0 3 Prestarted
> > [219346.919429, 0] showmap-31232 [000] d..1 219346.136036: rcu_grace_period: rcu_preempt 117392312 AccWaitCB
> >
> > "Startleaf" means start the grace period; "Prestarted" means the grace
> > period is already started or other conditions blocked it.
> > RCU_GP_FLAG_INIT should follow the "Startedroot", and then the kthread
> > can be woken up.
>
> Yes, when "Startleaf" is followed by "Prestarted", that means that we
> reached an rcu_node structure that is already aware that the requested
> grace period is needed.  Breaking down the relevant "if" statement in
> rcu_start_this_gp():
>
> 	if (ULONG_CMP_GE(rnp->gp_seq_needed, gp_seq_req) ||
> 			// A. GP already requested at this rcu_node
> 	    rcu_seq_started(&rnp->gp_seq, gp_seq_req) ||
> 			// B. The requested grace period already started
> 	    (rnp != rnp_start &&
> 	     rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))) {
> 			// C. Leaf rcu_node recorded request, and
> 			//    some grace period is in progress
>
> A:	In this case, the "Startedroot" should be taken care of by some
> 	other thread, or one of B or C held earlier.
>
> B:	This cannot be the case, because your earlier trace showed that
> 	the requested grace period had not started.
>
> C:	This cannot be the case because both traces above are on the
> 	leaf rcu_node structure.  If it were the case, the currently
> 	running grace period would notice the need for the requested
> 	grace period when it ended, and would start the grace period
> 	at that time.
>
> So you are saying that your trace goes back far enough to capture the
> very first "Startleaf" for this new grace period, and you don't ever
> see a "Startedroot"?  This would be OK if the later "Startedleaf"
> showed up at that point.  If you do have such a trace, could you please
> send it to me (or post it somewhere and send me the URL)?
>
> In any case, this code has been reworked recently, so I will take a
> closer look, which will take some time.  Please feel free to continue
> to do so as well, of course!
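As an aside for anyone following along: the ->gp_seq numbers in the
traces above are easier to read with the encoding in mind.  Here is a
simplified sketch, modeled on the helpers in kernel/rcu/rcu.h (not a
verbatim copy): the low two bits hold the grace-period phase and the
remaining bits count grace periods, which is why 117392312 (current)
and 117392316 (needed) differ by exactly one grace period:

	#define RCU_SEQ_CTR_SHIFT	2
	#define RCU_SEQ_STATE_MASK	((1 << RCU_SEQ_CTR_SHIFT) - 1)

	/* Grace-period count portion of a ->gp_seq value. */
	static unsigned long rcu_seq_ctr(unsigned long s)
	{
		return s >> RCU_SEQ_CTR_SHIFT;
	}

	/* Phase portion: nonzero iff a grace period is in progress. */
	static unsigned long rcu_seq_state(unsigned long s)
	{
		return s & RCU_SEQ_STATE_MASK;
	}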
Hmmm...  Could you please build with CONFIG_PROVE_RCU=y and run the
original test (for example, with CONFIG_RCU_BOOST=n)?  I would expect
this to trigger the warning in rcu_check_gp_start_stall().  Of course,
if it does not trigger, that would be valuable information as well.

							Thanx, Paul

> > I did an experiment to dump the backtrace; rcu_quiescent_state_report
> > is called in softirq context:
> >
> > <idle>-0 [000] dNs2 24471.669280: rcu_quiescent_state_report: rcu_preempt 3562401 1>0 0 0 3 0
> > <idle>-0 [000] dNs2 24471.669293: <stack trace>
> > => rcu_report_qs_rnp+0x1e2/0x2a0
> > => rcu_process_callbacks+0x2f1/0x3c0
> > => __do_softirq+0x12a/0x386
> > => irq_exit+0xb1/0xc0
> > => smp_apic_timer_interrupt+0xd4/0x1e0
> > => apic_timer_interrupt+0xf/0x20
> > => cpuidle_enter_state+0xb1/0x340
> > => cpuidle_enter+0x17/0x20
> > => call_cpuidle+0x23/0x40
> > => do_idle+0x1ed/0x250
> > => cpu_startup_entry+0x73/0x80
> > => rest_init+0xf3/0x100
> > => start_kernel+0x46f/0x490
> > => x86_64_start_reservations+0x2a/0x2c
> > => x86_64_start_kernel+0x72/0x75
> > => secondary_startup_64+0xa4/0xb0
> >
> > rcu_report_qs_rnp=>rcu_report_qs_rdp
> >
> > And in rcu_report_qs_rdp(), rcu_report_qs_rnp() follows
> > rcu_accelerate_cbs(); we can see the AccWaitCB log but can't see
> > rcu_quiescent_state_report, so most likely the condition
> > rnp->qsmask & mask blocked it.
> >
> > static void
> > rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
> > {
> > 	...
> > 	if ((rnp->qsmask & mask) == 0) {
> > 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > 	} else {
> > 		rdp->core_needs_qs = false;
> > 		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > 		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> >
> > 		if (needwake)
> > 			rcu_gp_kthread_wake(rsp);
> > 	}
> > }
>
> This is a completely different code path.  The rcu_start_this_gp()
> function is trying to start a new grace period.  In contrast, this
> rcu_report_qs_rdp() function reports a quiescent state for a currently
> running grace period.  In your earlier trace, there was no currently
> running grace period, so rcu_report_qs_rdp() exiting early is expected
> behavior.
>
> 							Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney <paul...@linux.ibm.com>
> > Sent: Friday, December 7, 2018 1:38 AM
> > To: He, Bo <bo...@intel.com>
> > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, Jie A <jie.a....@intel.com>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 06, 2018 at 01:23:01PM +0000, He, Bo wrote:
> > > 1. The test is positive: after setting the kthread priority to
> > >    SCHED_FIFO without CONFIG_RCU_BOOST, the issue has not reproduced
> > >    so far.
> > > 2. Here is the previous log with ftrace_dump enabled, which gives us
> > >    4 seconds of ftrace.  The panic log was triggered with the
> > >    enclosed debug patch, which replaced
> > >    wait_for_completion(&rs_array[i].completion) with
> > >    wait_for_completion_timeout(&rs_array[i].completion, 3*HZ) in
> > >    __wait_rcu_gp().  The logs have lockdep enabled to dump the
> > >    locks, and dump all tasks' backtraces.
> >
> > Thank you for collecting this information!
> >
> > (By the way, the usual downside of the priority increase is increased
> > context-switch rate and thus CPU overhead.)
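> >
> > Also, to make sure that I am reading your debug patch correctly, here
> > is a minimal sketch (my reconstruction, not necessarily your exact
> > patch) of the change you describe in __wait_rcu_gp()'s wait loop,
> > bounding each wait at three seconds and panicking so that state gets
> > dumped instead of sleeping forever:
> >
> > 	if (!wait_for_completion_timeout(&rs_array[i].completion, 3 * HZ))
> > 		panic("__wait_rcu_gp: grace period failed to complete");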
> >
> > And all three grace-period kthreads are blocked apparently in their
> > top-level loops (though inlining and all that).  There are quite a few
> > preemptions ("72738.702815: rcu_preempt_task: rcu_preempt"), but they
> > are all blocking the next grace period (29041008), not the current one
> > (29041004).  And the "rcu_unlock_preempted_task" trace records flag
> > the current grace-period sequence number as 29041004, which means that
> > there is no grace period in progress, that is, RCU is idle.
> >
> > Which explains why there is no RCU CPU stall warning -- after all, if
> > there is no grace period in flight, it is not possible to stall that
> > non-existent grace period.
> >
> > That also could explain why increasing the priority of the
> > grace-period kthreads gets things going again.  There have been a
> > great number of requests for a new grace period (for example,
> > "rcu_future_grace_period: rcu_preempt 29041004 29041008 0 0 3
> > Startleaf"), so as soon as the grace-period kthread wakes up, a new
> > grace period will start.
> >
> > Except that the rcu_preempt task says "I" rather than "R", as you
> > noted in an earlier email.
> >
> > And there should have been multiple attempts to wake up the
> > grace-period kthread, because there are lots of callbacks queued, as
> > in 136,045 of them ("rcu_callback: rcu_preempt rhp=0000000066f735c9
> > func=file_free_rcu 2811/136045").  Which is of course why you are
> > seeing the OOM.
> >
> > So the question becomes "Why is the grace-period kthread being
> > awakened so many times, but not actually waking up?"  In the past,
> > there was a scheduler bug that could cause that, but that was -way-
> > before the v4.19 that you are running.  More recently, there have been
> > timer-related problems, but those only happened while a grace period
> > was active, and were also long before v4.19.
> >
> > Hmmm...  One possibility is that you have somehow managed to invoke
> > call_rcu() with interrupts disabled, which would in turn disable the
> > extra wakeups that RCU sends when it sees excessive numbers of
> > callbacks.  Except that in that case, boosting the priority wouldn't
> > help.  Besides, the scheduling-clock interrupt should also check for
> > this, and should push things forward if need be.
> >
> > If RCU managed to put all of its callbacks into the
> > RCU_NEXT_READY_TAIL bucket on all CPUs, that would defeat the
> > wakeup-if-no-grace-period checks (RCU is supposed to have started the
> > relevant grace period before putting callbacks into that bucket).
> > But that cannot be the case here, because new callbacks are being
> > enqueued throughout, and these would then trigger RCU's
> > start-a-new-grace-period checks.
> >
> > But it would be good to confirm that this is actually working like I
> > would expect it to.  Could you please add scheduler wakeup to your
> > tracing, if possible only displaying those sent to the rcu_preempt
> > task?
> >
> > 							Thanx, Paul
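> >
> > P.S.  One way to capture exactly that, assuming tracefs is mounted at
> > /sys/kernel/debug/tracing (adjust the path for your setup), is to
> > filter the sched_wakeup event on the grace-period kthread's comm:
> >
> > 	echo 'comm == "rcu_preempt"' > /sys/kernel/debug/tracing/events/sched/sched_wakeup/filter
> > 	echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable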
> >
> > > -----Original Message-----
> > > From: Paul E. McKenney <paul...@linux.ibm.com>
> > > Sent: Thursday, December 6, 2018 1:45 AM
> > > To: He, Bo <bo...@intel.com>
> > > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, Jie A <jie.a....@intel.com>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Wed, Dec 05, 2018 at 08:42:54AM +0000, He, Bo wrote:
> > > > I double-checked the .config; we don't enable CONFIG_NO_HZ_FULL.
> > > > Our previous logs can dump all the task backtraces, and the
> > > > kthreads (the rcu_preempt, rcu_sched, and rcu_bh tasks) are all in
> > > > "I" state, not in "R" state.  My understanding is that if it's the
> > > > side effect of causing RCU's kthreads to be run at SCHED_FIFO
> > > > priority 1, the kthreads should be in "R" state.
> > >
> > > Hmmm...  Well, the tasks could in theory be waiting on a blocking
> > > mutex.  But in practice the grace-period kthreads wait on events, so
> > > that makes no sense.
> > >
> > > Is it possible for you to dump out the grace-period kthread's stack,
> > > for example, with sysrq-t?  (Steve might know a better way to do
> > > this.)
> > >
> > > > I will do more experiments and keep you updated once we have more
> > > > findings:
> > > > 1. Set the kthread priority to SCHED_FIFO without CONFIG_RCU_BOOST
> > > >    and see if the issue reproduces.
> > >
> > > That sounds like a most excellent experiment!
> > >
> > > > 2. Check more ftrace to confirm why there is no
> > > >    trace_rcu_quiescent_state_report and why most of the
> > > >    trace_rcu_grace_period events are "AccWaitCB".
> > >
> > > As noted earlier, to see something interesting, you will need to
> > > start the ftrace before the grace period starts.  This would
> > > probably mean having ftrace running before starting the test.
> > > Starting the ftrace after the hang commences is unlikely to produce
> > > useful information.
> > >
> > > 							Thanx, Paul
> > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney <paul...@linux.ibm.com>
> > > > Sent: Wednesday, December 5, 2018 3:50 AM
> > > > To: He, Bo <bo...@intel.com>
> > > > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, Jie A <jie.a....@intel.com>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Tue, Dec 04, 2018 at 07:50:04AM +0000, He, Bo wrote:
> > > > > Hi, Paul:
> > > > > Enclosed is the log that triggers the 120s hung_task_panic
> > > > > without other debug patches.  The hung task is blocked at
> > > > > __wait_rcu_gp, which means the rcu_cpu_stall can't detect the
> > > > > scenario:
> > > > > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > > > > echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> > > >
> > > > Not necessarily.  If there is an RCU CPU stall warning, blocking
> > > > within __wait_rcu_gp() is expected behavior.  It is possible that
> > > > the problem is that although the grace period is completing as
> > > > required, the callbacks are not being invoked in a timely fashion.
> > > > And that could happen if you had CONFIG_NO_HZ_FULL and a bunch of
> > > > nohz_full CPUs, or, alternatively, callback offloading enabled.
> > > > But I don't see these in your previous emails.
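> > > >
> > > > (If you would like to double-check on your side, something like
> > > > the following should do it; this is an illustrative sketch that
> > > > assumes your build tree's .config and a readable /proc/cmdline:
> > > >
> > > > 	grep -E 'CONFIG_NO_HZ_FULL=|CONFIG_RCU_NOCB_CPU=' .config
> > > > 	grep -oE '(nohz_full|rcu_nocbs)=[^ ]*' /proc/cmdline
> > > >
> > > > No output would mean that neither nohz_full operation nor callback
> > > > offloading is in play.)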
> > > > Another possible cause is that the grace-period kthread is being
> > > > delayed, so that the grace period never starts.  This seems
> > > > unlikely, but it is the only thing thus far that matches the
> > > > symptoms.
> > > >
> > > > CONFIG_RCU_BOOST=y has the side effect of causing RCU's kthreads
> > > > to be run at SCHED_FIFO priority 1, and that would help in the
> > > > case where RCU's grace-period kthread (the rcu_preempt, rcu_sched,
> > > > and rcu_bh tasks, all of which execute in the rcu_gp_kthread()
> > > > function) was being starved of CPU time.
> > > >
> > > > Does that sound likely?
> > > >
> > > > 							Thanx, Paul
> > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney <paul...@linux.ibm.com>
> > > > > Sent: Monday, December 3, 2018 9:57 PM
> > > > > To: He, Bo <bo...@intel.com>
> > > > > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Mon, Dec 03, 2018 at 07:44:03AM +0000, He, Bo wrote:
> > > > > > Thanks, we have run the test for the whole weekend and did not
> > > > > > reproduce the issue, so we confirm that CONFIG_RCU_BOOST fixes
> > > > > > the issue.
> > > > >
> > > > > Very good, that is encouraging.  Perhaps I should think about
> > > > > making CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in
> > > > > mainline, at least for architectures for which rt_mutexes are
> > > > > implemented.
> > > > >
> > > > > > We have enabled rcupdate.rcu_cpu_stall_timeout=7 and also set
> > > > > > panic on RCU stall, and will see if we can catch the panic; we
> > > > > > will keep you posted with the test results.
> > > > > > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > > > >
> > > > > Looking forward to seeing what is going on!  Of course, to
> > > > > reproduce, you will need to again build with CONFIG_RCU_BOOST=n.
> > > > >
> > > > > 							Thanx, Paul
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Paul E. McKenney <paul...@linux.ibm.com>
> > > > > > Sent: Saturday, December 1, 2018 12:49 AM
> > > > > > To: He, Bo <bo...@intel.com>
> > > > > > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>
> > > > > > Subject: Re: rcu_preempt caused oom
> > > > > >
> > > > > > On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> > > > > > > Here is the kernel cmdline:
> > > > > >
> > > > > > Thank you!
> > > > > >
> > > > > > > Kernel command line: androidboot.acpio_idx=0 androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt loglevel=4 androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/ pstore.backend=ramoops memmap=0x1400000$0x50000000 ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000 ramoops.record_size=0x4000 ramoops.console_size=0x1000000 ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=
> > > > > >
> > > > > > And no sign of any suppression of RCU CPU stall warnings.
> > > > > > Hmmm...  Does it take more than 21 seconds to OOM?  Or do
> > > > > > things happen faster than that?  If they do happen faster than
> > > > > > that, then one approach would be to add something like this to
> > > > > > the kernel command line:
> > > > > >
> > > > > > 	rcupdate.rcu_cpu_stall_timeout=7
> > > > > >
> > > > > > This would set the stall timeout to seven seconds.  Note that
> > > > > > timeouts less than three seconds are silently interpreted as
> > > > > > three seconds.
> > > > > >
> > > > > > 							Thanx, Paul
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Steven Rostedt <rost...@goodmis.org>
> > > > > > > Sent: Friday, November 30, 2018 11:17 PM
> > > > > > > To: Paul E. McKenney <paul...@linux.ibm.com>
> > > > > > > Cc: He, Bo <bo...@intel.com>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>
> > > > > > > Subject: Re: rcu_preempt caused oom
> > > > > > >
> > > > > > > On Fri, 30 Nov 2018 06:43:17 -0800 "Paul E. McKenney"
> > > > > > > <paul...@linux.ibm.com> wrote:
> > > > > > >
> > > > > > > > Could you please send me your list of kernel boot parameters?
> > > > > > > > They usually appear near the start of your console output.
> > > > > > >
> > > > > > > Or just: cat /proc/cmdline
> > > > > > >
> > > > > > > -- Steve