On Fri, Dec 07, 2018 at 06:11:31AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 07, 2018 at 08:25:09AM +0000, He, Bo wrote:
> > Bad news: the issue is still reproduced after a 61-hour monkey test on 1 of 6
> > boards with CONFIG_RCU_BOOST=y, and the issue is not seen on kernel
> > 4.14, where CONFIG_RCU_BOOST is also not enabled in our config.
> > The logs are enclosed.
> >
> > > So the question becomes "Why is the grace-period kthread being awakened
> > > so many times, but not actually waking up?"
> > Maybe it's not a scheduling issue; I have two suspects.
> > We can see tons of grace-period traces with 117392312:
> > [219346.919405, 0] showmap-31232 [000] d..1 219346.136035:
> > rcu_future_grace_period: rcu_preempt 117392312 117392316 0 0 3 Startleaf
> > [219346.919417, 0] showmap-31232 [000] d..1 219346.136036:
> > rcu_future_grace_period: rcu_preempt 117392312 117392316 0 0 3 Prestarted
> > [219346.919429, 0] showmap-31232 [000] d..1 219346.136036:
> > rcu_grace_period: rcu_preempt 117392312 AccWaitCB
> >
> > "Startleaf" means start the grace period, "Prestarted" means the grace
> > period is already started or other conditions blocked, RCU_GP_FLAG_INIT
> > should follow the "Startedroot", then the kthread can be wakeup.
>
> Yes, when "Startleaf" is followed by "Prestarted", that means that we
> reached an rcu_node structure that is already aware that the requested
> grace period is needed. Breaking down the relevant "if" statement in
> rcu_start_this_gp():
>
>     if (ULONG_CMP_GE(rnp->gp_seq_needed, gp_seq_req) ||
>             // A. GP already requested at this rcu_node
>         rcu_seq_started(&rnp->gp_seq, gp_seq_req) ||
>             // B. The requested grace period already started
>         (rnp != rnp_start &&
>          rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))) {
>             // C. Leaf rcu_node recorded request, and
>             //    some grace period is in progress
>
> A: In this case, the "Startedroot" should be taken care of by some
> other thread, or one of B or C held earlier.
>
> B: This cannot be the case, because your earlier trace showed that
> the requested grace period had not started.
>
> C: This cannot be the case because both traces above are on the
> leaf rcu_node structure. If it were the case, the currently
> running grace period would notice the need for the requested
> grace period when it ended, and would start the grace period
> at that time.
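>
> For reference, the sequence-number helpers behind those three checks look
> roughly like this (paraphrased from kernel/rcu/rcu.h, not verbatim; the low
> two bits of a gp_seq value hold the phase, the remaining bits the counter):
>
>     #define RCU_SEQ_CTR_SHIFT  2
>     #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1)
>
>     /* Phase portion of a gp_seq value; nonzero means a GP is in progress. */
>     static inline int rcu_seq_state(unsigned long s)
>     {
>             return s & RCU_SEQ_STATE_MASK;
>     }
>
>     /* Roughly: has the grace period requested by snapshot s already started? */
>     static inline bool rcu_seq_started(unsigned long *sp, unsigned long s)
>     {
>             return ULONG_CMP_LT((s - 1) & ~RCU_SEQ_STATE_MASK, READ_ONCE(*sp));
>     }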
>
> So you are saying that your trace goes back far enough to capture the
> very first "Startleaf" for this new grace period, and you don't ever see a
> "Startedroot"? This would be OK if the later "Startedleaf" showed up at
> that point. If you do have such a trace, could you please send it to me
> (or post it somewhere and send me the URL)?
>
> In any case, this code has been reworked recently, so I will take a closer
> look, which will take some time. Please feel free to continue to do so
> as well, of course!
Hmmm... Could you please build with CONFIG_PROVE_RCU=y and run the
original configuration (for example, with CONFIG_RCU_BOOST=n)? I would expect this to
trigger the warning in rcu_check_gp_start_stall(). Of course, if it
does not trigger, that would be valuable information as well.
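
(Roughly, that warning fires when some grace period has been requested but
none has started for longer than the stall timeout. A purely illustrative
sketch of the idea, not the kernel's actual rcu_check_gp_start_stall() code:

    /*
     * Illustrative only: a later grace period has been requested
     * (gp_seq_needed is ahead of gp_seq), nothing is in progress (low
     * state bits are zero), and the request has been pending too long.
     */
    static void gp_start_stall_sketch(unsigned long gp_seq,
                                      unsigned long gp_seq_needed,
                                      unsigned long requested_at,
                                      unsigned long timeout)
    {
            if (ULONG_CMP_LT(gp_seq, gp_seq_needed) &&
                !rcu_seq_state(gp_seq) &&
                time_after(jiffies, requested_at + timeout))
                    WARN_ONCE(1, "grace period requested but not started\n");
    }

If the grace-period kthread really is being asked for new grace periods and
never starting them, that check should complain loudly.)
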
Thanx, Paul
> > I did an experiment to dump the backtrace; rcu_quiescent_state_report is
> > called in softirq context:
> > <idle>-0 [000] dNs2 24471.669280: rcu_quiescent_state_report:
> > rcu_preempt 3562401 1>0 0 0 3 0
> > <idle>-0 [000] dNs2 24471.669293: <stack trace>
> > => rcu_report_qs_rnp+0x1e2/0x2a0
> > => rcu_process_callbacks+0x2f1/0x3c0
> > => __do_softirq+0x12a/0x386
> > => irq_exit+0xb1/0xc0
> > => smp_apic_timer_interrupt+0xd4/0x1e0
> > => apic_timer_interrupt+0xf/0x20
> > => cpuidle_enter_state+0xb1/0x340
> > => cpuidle_enter+0x17/0x20
> > => call_cpuidle+0x23/0x40
> > => do_idle+0x1ed/0x250
> > => cpu_startup_entry+0x73/0x80
> > => rest_init+0xf3/0x100
> > => start_kernel+0x46f/0x490
> > => x86_64_start_reservations+0x2a/0x2c
> > => x86_64_start_kernel+0x72/0x75
> > => secondary_startup_64+0xa4/0xb0
> > rcu_report_qs_rnp=>rcu_report_qs_rdp
> >
> > And in rcu_report_qs_rdp(), rcu_report_qs_rnp() follows rcu_accelerate_cbs();
> > we can see the AccWaitCB log, but we can't see rcu_quiescent_state_report,
> > most likely because the rnp->qsmask & mask condition blocks it.
> >
> > static void
> > rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
> > {
> >     ...
> >     if ((rnp->qsmask & mask) == 0) {
> >             raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >     } else {
> >             rdp->core_needs_qs = false;
> >             needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> >             rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> >             if (needwake)
> >                     rcu_gp_kthread_wake(rsp);
> >     }
> > }
>
> This is a completely different code path. The rcu_start_this_gp()
> function is trying to start a new grace period. In contrast, this
> rcu_report_qs_rdp() function reports a quiescent state for a currently
> running grace period. In your earlier trace, there was no currently
> running grace period, so rcu_report_qs_rdp() exiting early is expected
> behavior.
>
> Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney <[email protected]>
> > Sent: Friday, December 7, 2018 1:38 AM
> > To: He, Bo <[email protected]>
> > Cc: Steven Rostedt <[email protected]>; [email protected];
> > [email protected]; [email protected];
> > [email protected]; Zhang, Jun <[email protected]>; Xiao, Jin
> > <[email protected]>; Zhang, Yanmin <[email protected]>; Bai, Jie A
> > <[email protected]>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 06, 2018 at 01:23:01PM +0000, He, Bo wrote:
> > > 1. The test is positive after setting the kthread priority to SCHED_FIFO
> > > without CONFIG_RCU_BOOST; the issue has not been reproduced so far.
> > > 2. Here is the previous log with ftrace_dump enabled, which gives us 4 seconds
> > > of ftrace. The panic log was triggered with the enclosed debug patch, which
> > > replaces wait_for_completion(&rs_array[i].completion) with
> > > wait_for_completion_timeout(&rs_array[i].completion, 3*HZ) in
> > > __wait_rcu_gp(). The logs enable lockdep to dump the held locks and dump
> > > all tasks' backtraces.
> >
> > Thank you for collecting this information!
> >
> > (By the way, the usual downside of the priority increase is increased
> > context-switch rate and thus CPU overhead.)
> >
> > And all three grace-period kthreads are blocked apparently in their
> > top-level loops (though inlining and all that). There are quite a few
> > preemptions ("72738.702815: rcu_preempt_task: rcu_preempt"), but they are
> > all blocking the next grace period (29041008), not the current one
> > (29041004). And the "rcu_unlock_preempted_task" trace records flag the
> > current grace-period sequence number as 29041004, which means that there is
> > no grace period in progress, that is, RCU is idle.
> >
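> > (For anyone decoding those numbers: the low two bits of a gp_seq value
> > encode the phase and the remaining bits the counter, so a quick standalone
> > sketch, assuming that encoding:
> >
> >     /* Decode the gp_seq values from the trace above. */
> >     #include <stdio.h>
> >
> >     int main(void)
> >     {
> >             unsigned long cur = 29041004;   /* current gp_seq */
> >             unsigned long req = 29041008;   /* requested grace period */
> >
> >             printf("current:   ctr %lu state %lu\n", cur >> 2, cur & 0x3);
> >             printf("requested: ctr %lu state %lu\n", req >> 2, req & 0x3);
> >             return 0;
> >     }
> >
> > Both values have state 0, and the requested counter is one ahead of the
> > current one, which is exactly the "grace period needed but none in
> > progress" picture described above.)
> >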
> > Which explains why there is no RCU CPU stall warning -- after all, if there
> > is no grace period in flight, it is not possible to stall that non-existent
> > grace period.
> >
> > That also could explain why increasing the priority of the grace-period
> > kthreads gets things going again. There have been a great number of
> > requests for a new grace period (for example, "rcu_future_grace_period:
> > rcu_preempt 29041004 29041008 0 0 3 Startleaf"), so as soon as the
> > grace-period kthread wakes up, a new grace period will start.
> >
> > Except that the rcu_preempt task says "I" rather than "R", as you noted in
> > an earlier email.
> >
> > And there should have been multiple attempts to wake up the grace-period
> > kthread, because there are lots of callbacks queued as in 136,045 of them
> > ("rcu_callback: rcu_preempt rhp=0000000066f735c9 func=file_free_rcu
> > 2811/136045"). Which is of course why you are seeing the OOM.
> >
> > So the question becomes "Why is the grace-period kthread being awakened so
> > many times, but not actually waking up?" In the past, there was a
> > scheduler bug that could cause that, but that was -way- before the v4.19
> > that you are running. More recently, there have been timer-related
> > problems, but those only happened while a grace period was active, and
> > > were also long before v4.19.
> >
> > Hmmm... One possibility is that you have somehow managed to invoke
> > call_rcu() with interrupts disabled, which would in turn disable the extra
> > wakeups that RCU sends when it sees excessive numbers of callbacks.
> > Except that in that case, boosting the priority wouldn't help. Besides,
> > the scheduling-clock interrupt should also check for this, and should push
> > things forward if need be.
> >
> > If RCU managed to put all of its callbacks into the RCU_NEXT_READY_TAIL
> > bucket on all CPUs, that would defeat the wakeup-if-no-grace-period checks
> > (RCU is supposed to have started the relevant grace period before putting
> > callbacks into that bucket). But that cannot be the case here, because new
> > callbacks are being enqueued throughout, and these would then trigger RCU's
> > start-a-new-grace-period checks.
> >
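> > (For context, each CPU's callback list is segmented into four buckets; a
> > rough sketch of the segment constants, paraphrased from
> > include/linux/rcu_segcblist.h:
> >
> >     #define RCU_DONE_TAIL        0  /* GP done; callbacks ready to invoke. */
> >     #define RCU_WAIT_TAIL        1  /* Waiting for the current GP. */
> >     #define RCU_NEXT_READY_TAIL  2  /* Waiting for the next GP. */
> >     #define RCU_NEXT_TAIL        3  /* Not yet associated with any GP. */
> >
> > Callbacks are only supposed to reach RCU_NEXT_READY_TAIL once their grace
> > period has been requested, which is why that bucket bypasses the check.)
> >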
> > But it would be good to confirm that this is actually working like I would
> > expect it to. Could you please add scheduler wakeup to your tracing, if
> > possible, only displaying those sent to the rcu_preempt task?
> >
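> > (One way to do that with ftrace, assuming the usual tracefs path under
> > /sys/kernel/debug; adjust the path if your mount point differs:
> >
> >     echo 'comm == "rcu_preempt"' > /sys/kernel/debug/tracing/events/sched/sched_wakeup/filter
> >     echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
> >
> > This keeps only wakeup events whose target comm is rcu_preempt.)
> >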
> > Thanx, Paul
> >
> > > -----Original Message-----
> > > From: Paul E. McKenney <[email protected]>
> > > Sent: Thursday, December 6, 2018 1:45 AM
> > > To: He, Bo <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>;
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Zhang, Jun
> > > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > > <[email protected]>; Bai, Jie A <[email protected]>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Wed, Dec 05, 2018 at 08:42:54AM +0000, He, Bo wrote:
> > > > I double-checked the .config; we don't enable CONFIG_NO_HZ_FULL.
> > > > Our previous logs dump all the task backtraces, and the kthreads (the
> > > > rcu_preempt, rcu_sched, and rcu_bh tasks) are all in the "I" state, not the
> > > > "R" state. My understanding is that if this were the side effect of running
> > > > RCU's kthreads at SCHED_FIFO priority 1, the kthreads should be in the
> > > > "R" state.
> > >
> > > Hmmm... Well, the tasks could in theory be waiting on a blocking mutex.
> > > But in practice the grace-period kthreads wait on events, so that makes
> > > no sense.
> > >
> > > Is it possible for you to dump out the grace-period kthread's stack,
> > > for example, with sysrq-t? (Steve might know a better way to do
> > > this.)
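> > >
> > > (For example, assuming sysrq is enabled on your kernel:
> > >
> > >     echo t > /proc/sysrq-trigger
> > >
> > > will dump the state and stack of every task, including the grace-period
> > > kthread.)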
> > >
> > > > I will do more experiments and keep you updated once we have more
> > > > findings:
> > > > 1. set the kthread priority to SCHED_FIFO without CONFIG_RCU_BOOST and
> > > > see if the issue can be reproduced.
> > >
> > > That sounds like a most excellent experiment!
> > >
> > > > 2. check more ftrace to confirm why there is no
> > > > trace_rcu_quiescent_state_report and why most of the trace_rcu_grace_period
> > > > events are in "AccWaitCB".
> > >
> > > As noted earlier, to see something interesting, you will need to start
> > > the ftrace before the grace period starts. This would probably mean
> > > having ftrace running before starting the test. Starting the ftrace
> > > after the hang commences is unlikely to produce useful information.
> > >
> > > Thanx, Paul
> > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney <[email protected]>
> > > > Sent: Wednesday, December 5, 2018 3:50 AM
> > > > To: He, Bo <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>;
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]; Zhang, Jun
> > > > <[email protected]>; Xiao, Jin <[email protected]>; Zhang, Yanmin
> > > > <[email protected]>; Bai, Jie A <[email protected]>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Tue, Dec 04, 2018 at 07:50:04AM +0000, He, Bo wrote:
> > > > > Hi, Paul:
> > > > > the enclosed is the log that triggers the 120s hung_task_panic without
> > > > > other debug patches. The hung task is blocked at __wait_rcu_gp, which
> > > > > means the RCU CPU stall detector can't detect this scenario:
> > > > > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > > > > echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> > > >
> > > > Not necessarily. If there is an RCU CPU stall warning, blocking
> > > > within
> > > > __wait_rcu_gp() is expected behavior. It is possible that the problem
> > > > is that although the grace period is completing as required, the
> > > > callbacks are not being invoked in a timely fashion. And that could
> > > > happen if you had CONFIG_NO_HZ_FULL and a bunch of nohz_full CPUs, or,
> > > > alternatively, callback offloading enabled. But I don't see these in
> > > > your previous emails. Another possible cause is that the grace-period
> > > > kthread is being delayed, so that the grace period never starts. This
> > > > seems unlikely, but it is the only thing thus far that matches the
> > > > symptoms.
> > > >
> > > > CONFIG_RCU_BOOST=y has the side-effect of causing RCU's kthreads to be
> > > > run at SCHED_FIFO priority 1, and that would help in the case where
> > > > RCU's grace-period kthread (the rcu_preempt, rcu_sched, and rcu_bh
> > > > tasks, all of which execute in the rcu_gp_kthread() function) was being
> > > > starved of CPU time.
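> > > >
> > > > (If you want to reproduce that effect by hand without CONFIG_RCU_BOOST,
> > > > something along these lines from a small helper program should work; the
> > > > pid is whatever ps reports for the rcu_preempt kthread:
> > > >
> > > >     #include <sched.h>
> > > >     #include <sys/types.h>
> > > >
> > > >     /* Give a kthread (e.g. rcu_preempt) SCHED_FIFO priority 1. */
> > > >     static int boost_kthread(pid_t pid)
> > > >     {
> > > >             struct sched_param sp = { .sched_priority = 1 };
> > > >
> > > >             return sched_setscheduler(pid, SCHED_FIFO, &sp);
> > > >     }
> > > >
> > > > or, equivalently, "chrt -f -p 1 <pid>" from the shell.)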
> > > >
> > > > Does that sound likely?
> > > >
> > > > Thanx, Paul
> > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney <[email protected]>
> > > > > Sent: Monday, December 3, 2018 9:57 PM
> > > > > To: He, Bo <[email protected]>
> > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > [email protected]; [email protected];
> > > > > [email protected]; [email protected]; Zhang, Jun
> > > > > <[email protected]>; Xiao, Jin <[email protected]>; Zhang,
> > > > > Yanmin <[email protected]>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Mon, Dec 03, 2018 at 07:44:03AM +0000, He, Bo wrote:
> > > > > > Thanks, we have run the test for the whole weekend without
> > > > > > reproducing the issue, so we confirm that CONFIG_RCU_BOOST can fix
> > > > > > the issue.
> > > > >
> > > > > Very good, that is encouraging. Perhaps I should think about making
> > > > > CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in mainline, at
> > > > > least for architectures for which rt_mutexes are implemented.
> > > > >
> > > > > > We have set rcupdate.rcu_cpu_stall_timeout=7 and also enabled
> > > > > > panic on RCU stall; we will see if we can catch the panic and will keep
> > > > > > you posted with the test results.
> > > > > > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > > > >
> > > > > Looking forward to seeing what is going on! Of course, to reproduce,
> > > > > you will need to again build with CONFIG_RCU_BOOST=n.
> > > > >
> > > > > Thanx, Paul
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Paul E. McKenney <[email protected]>
> > > > > > Sent: Saturday, December 1, 2018 12:49 AM
> > > > > > To: He, Bo <[email protected]>
> > > > > > Cc: Steven Rostedt <[email protected]>;
> > > > > > [email protected]; [email protected];
> > > > > > [email protected]; [email protected]; Zhang,
> > > > > > Jun <[email protected]>; Xiao, Jin <[email protected]>;
> > > > > > Zhang, Yanmin <[email protected]>
> > > > > > Subject: Re: rcu_preempt caused oom
> > > > > >
> > > > > > On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> > > > > > > Here is the kernel cmdline:
> > > > > >
> > > > > > Thank you!
> > > > > >
> > > > > > > Kernel command line: androidboot.acpio_idx=0
> > > > > > > androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug
> > > > > > > androidboot.diskbus=00.0 androidboot.verifiedbootstate=green
> > > > > > > androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb
> > > > > > > g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves
> > > > > > > reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr
> > > > > > > nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt
> > > > > > > loglevel=4 androidboot.hardware=gordon_peak
> > > > > > > firmware_class.path=/vendor/firmware relative_sleep_states=1
> > > > > > > enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10
> > > > > > > androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/
> > > > > > > pstore.backend=ramoops memmap=0x1400000$0x50000000
> > > > > > > ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000
> > > > > > > ramoops.record_size=0x4000 ramoops.console_size=0x1000000
> > > > > > > ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current
> > > > > > > i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1
> > > > > > > drm.vblankoffdelay=
> > > > > >
> > > > > > And no sign of any suppression of RCU CPU stall warnings. Hmmm...
> > > > > > Does it take more than 21 seconds to OOM? Or do things happen
> > > > > > faster than that? If they do happen faster than that, then one
> > > > > > approach would be to add something like this to the kernel command
> > > > > > line:
> > > > > >
> > > > > > rcupdate.rcu_cpu_stall_timeout=7
> > > > > >
> > > > > > This would set the stall timeout to seven seconds. Note that
> > > > > > timeouts less than three seconds are silently interpreted as three
> > > > > > seconds.
> > > > > >
> > > > > > Thanx, Paul
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Steven Rostedt <[email protected]>
> > > > > > > Sent: Friday, November 30, 2018 11:17 PM
> > > > > > > To: Paul E. McKenney <[email protected]>
> > > > > > > Cc: He, Bo <[email protected]>; [email protected];
> > > > > > > [email protected]; [email protected];
> > > > > > > [email protected]; Zhang, Jun <[email protected]>;
> > > > > > > Xiao, Jin <[email protected]>; Zhang, Yanmin
> > > > > > > <[email protected]>
> > > > > > > Subject: Re: rcu_preempt caused oom
> > > > > > >
> > > > > > > On Fri, 30 Nov 2018 06:43:17 -0800 "Paul E. McKenney"
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > > Could you please send me your list of kernel boot parameters?
> > > > > > > > They usually appear near the start of your console output.
> > > > > > >
> > > > > > > Or just: cat /proc/cmdline
> > > > > > >
> > > > > > > -- Steve
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>
>