On Fri, Dec 07, 2018 at 06:11:31AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 07, 2018 at 08:25:09AM +0000, He, Bo wrote:
> > Bad news: the issue is still reproduced after a 61-hour monkey test on
> > 1 of 6 boards with CONFIG_RCU_BOOST=y, and the issue is not seen on
> > kernel 4.14 (CONFIG_RCU_BOOST is also not enabled in our kernel 4.14
> > config).  Enclosed are the logs.
> >
> > > So the question becomes "Why is the grace-period kthread being awakened
> > > so many times, but not actually waking up?"
> >
> > Maybe it's not a scheduling issue; I have two suspects.
> >
> > We can see tons of grace_period traces with 117392312:
> >
> > [219346.919405, 0] showmap-31232 [000] d..1 219346.136035: rcu_future_grace_period: rcu_preempt 117392312 117392316 0 0 3 Startleaf
> > [219346.919417, 0] showmap-31232 [000] d..1 219346.136036: rcu_future_grace_period: rcu_preempt 117392312 117392316 0 0 3 Prestarted
> > [219346.919429, 0] showmap-31232 [000] d..1 219346.136036: rcu_grace_period: rcu_preempt 117392312 AccWaitCB
> >
> > "Startleaf" means start the grace period; "Prestarted" means the grace
> > period is already started or other conditions blocked it.
> > RCU_GP_FLAG_INIT should follow the "Startedroot", and then the kthread
> > can be woken up.
>
> Yes, when "Startleaf" is followed by "Prestarted", that means that we
> reached an rcu_node structure that is already aware that the requested
> grace period is needed.  Breaking down the relevant "if" statement in
> rcu_start_this_gp():
>
> 	if (ULONG_CMP_GE(rnp->gp_seq_needed, gp_seq_req) ||
> 			// A. GP already requested at this rcu_node
> 	    rcu_seq_started(&rnp->gp_seq, gp_seq_req) ||
> 			// B. The requested grace period already started
> 	    (rnp != rnp_start &&
> 	     rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))) {
> 			// C. Leaf rcu_node recorded request, and
> 			//    some grace period is in progress
>
> A:	In this case, the "Startedroot" should be taken care of by some
> 	other thread, or one of B or C held earlier.
>
> B:	This cannot be the case, because your earlier trace showed that
> 	the requested grace period had not started.
>
> C:	This cannot be the case because both traces above are on the
> 	leaf rcu_node structure.  If it were the case, the currently
> 	running grace period would notice the need for the requested
> 	grace period when it ended, and would start the grace period
> 	at that time.
>
> So you are saying that your trace goes back far enough to capture the
> very first "Startleaf" for this new grace period, and you don't ever
> see a "Startedroot"?  This would be OK if the later "Startedleaf"
> showed up at that point.  If you do have such a trace, could you please
> send it to me (or post it somewhere and send me the URL)?
>
> In any case, this code has been reworked recently, so I will take a
> closer look, which will take some time.  Please feel free to continue
> to do so as well, of course!
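As an aside for anyone following along: the ->gp_seq numbers in the
traces above are easier to read with the encoding in mind.  Here is a
simplified sketch, modeled on the helpers in kernel/rcu/rcu.h (not a
verbatim copy): the low two bits hold the grace-period phase and the
remaining bits count grace periods, which is why 117392312 (current)
and 117392316 (needed) differ by exactly one grace period:

	#define RCU_SEQ_CTR_SHIFT	2
	#define RCU_SEQ_STATE_MASK	((1 << RCU_SEQ_CTR_SHIFT) - 1)

	/* Grace-period count portion of a ->gp_seq value. */
	static unsigned long rcu_seq_ctr(unsigned long s)
	{
		return s >> RCU_SEQ_CTR_SHIFT;
	}

	/* Phase portion: nonzero iff a grace period is in progress. */
	static unsigned long rcu_seq_state(unsigned long s)
	{
		return s & RCU_SEQ_STATE_MASK;
	}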
Hmmm...  Could you please build with CONFIG_PROVE_RCU=y and run the
original test (for example, with CONFIG_RCU_BOOST=n)?  I would expect
this to trigger the warning in rcu_check_gp_start_stall().  Of course,
if it does not trigger, that would be valuable information as well.

							Thanx, Paul

> > I did an experiment to dump the backtrace; rcu_quiescent_state_report
> > is called in softirq context:
> >
> > <idle>-0 [000] dNs2 24471.669280: rcu_quiescent_state_report: rcu_preempt 3562401 1>0 0 0 3 0
> > <idle>-0 [000] dNs2 24471.669293: <stack trace>
> > => rcu_report_qs_rnp+0x1e2/0x2a0
> > => rcu_process_callbacks+0x2f1/0x3c0
> > => __do_softirq+0x12a/0x386
> > => irq_exit+0xb1/0xc0
> > => smp_apic_timer_interrupt+0xd4/0x1e0
> > => apic_timer_interrupt+0xf/0x20
> > => cpuidle_enter_state+0xb1/0x340
> > => cpuidle_enter+0x17/0x20
> > => call_cpuidle+0x23/0x40
> > => do_idle+0x1ed/0x250
> > => cpu_startup_entry+0x73/0x80
> > => rest_init+0xf3/0x100
> > => start_kernel+0x46f/0x490
> > => x86_64_start_reservations+0x2a/0x2c
> > => x86_64_start_kernel+0x72/0x75
> > => secondary_startup_64+0xa4/0xb0
> >
> > rcu_report_qs_rnp=>rcu_report_qs_rdp
> >
> > And in rcu_report_qs_rdp(), rcu_report_qs_rnp() follows
> > rcu_accelerate_cbs(); we can see the AccWaitCB log but can't see
> > rcu_quiescent_state_report, so most likely the condition
> > rnp->qsmask & mask blocked it.
> >
> > static void
> > rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
> > {
> > 	...
> > 	if ((rnp->qsmask & mask) == 0) {
> > 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > 	} else {
> > 		rdp->core_needs_qs = false;
> > 		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> > 		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> >
> > 		if (needwake)
> > 			rcu_gp_kthread_wake(rsp);
> > 	}
> > }
>
> This is a completely different code path.  The rcu_start_this_gp()
> function is trying to start a new grace period.  In contrast, this
> rcu_report_qs_rdp() function reports a quiescent state for a currently
> running grace period.  In your earlier trace, there was no currently
> running grace period, so rcu_report_qs_rdp() exiting early is expected
> behavior.
>
> 							Thanx, Paul
>
> > -----Original Message-----
> > From: Paul E. McKenney <paul...@linux.ibm.com>
> > Sent: Friday, December 7, 2018 1:38 AM
> > To: He, Bo <bo...@intel.com>
> > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, Jie A <jie.a....@intel.com>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 06, 2018 at 01:23:01PM +0000, He, Bo wrote:
> > > 1. The test is positive: after setting the kthread priority to
> > >    SCHED_FIFO without CONFIG_RCU_BOOST, the issue has not reproduced
> > >    so far.
> > > 2. Here is the previous log with ftrace_dump enabled, which gives us
> > >    4 seconds of ftrace.  The panic log was triggered with the
> > >    enclosed debug patch, which replaced
> > >    wait_for_completion(&rs_array[i].completion) with
> > >    wait_for_completion_timeout(&rs_array[i].completion, 3*HZ) in
> > >    __wait_rcu_gp().  The logs have lockdep enabled to dump the
> > >    locks, and dump all tasks' backtraces.
> >
> > Thank you for collecting this information!
> >
> > (By the way, the usual downside of the priority increase is increased
> > context-switch rate and thus CPU overhead.)
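> >
> > Also, to make sure that I am reading your debug patch correctly, here
> > is a minimal sketch (my reconstruction, not necessarily your exact
> > patch) of the change you describe in __wait_rcu_gp()'s wait loop,
> > bounding each wait at three seconds and panicking so that state gets
> > dumped instead of sleeping forever:
> >
> > 	if (!wait_for_completion_timeout(&rs_array[i].completion, 3 * HZ))
> > 		panic("__wait_rcu_gp: grace period failed to complete");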
> >
> > And all three grace-period kthreads are blocked apparently in their
> > top-level loops (though inlining and all that).  There are quite a few
> > preemptions ("72738.702815: rcu_preempt_task: rcu_preempt"), but they
> > are all blocking the next grace period (29041008), not the current one
> > (29041004).  And the "rcu_unlock_preempted_task" trace records flag
> > the current grace-period sequence number as 29041004, which means that
> > there is no grace period in progress, that is, RCU is idle.
> >
> > Which explains why there is no RCU CPU stall warning -- after all, if
> > there is no grace period in flight, it is not possible to stall that
> > non-existent grace period.
> >
> > That also could explain why increasing the priority of the
> > grace-period kthreads gets things going again.  There have been a
> > great number of requests for a new grace period (for example,
> > "rcu_future_grace_period: rcu_preempt 29041004 29041008 0 0 3
> > Startleaf"), so as soon as the grace-period kthread wakes up, a new
> > grace period will start.
> >
> > Except that the rcu_preempt task says "I" rather than "R", as you
> > noted in an earlier email.
> >
> > And there should have been multiple attempts to wake up the
> > grace-period kthread, because there are lots of callbacks queued, as
> > in 136,045 of them ("rcu_callback: rcu_preempt rhp=0000000066f735c9
> > func=file_free_rcu 2811/136045").  Which is of course why you are
> > seeing the OOM.
> >
> > So the question becomes "Why is the grace-period kthread being
> > awakened so many times, but not actually waking up?"  In the past,
> > there was a scheduler bug that could cause that, but that was -way-
> > before the v4.19 that you are running.  More recently, there have been
> > timer-related problems, but those only happened while a grace period
> > was active, and were also long before v4.19.
> >
> > Hmmm...  One possibility is that you have somehow managed to invoke
> > call_rcu() with interrupts disabled, which would in turn disable the
> > extra wakeups that RCU sends when it sees excessive numbers of
> > callbacks.  Except that in that case, boosting the priority wouldn't
> > help.  Besides, the scheduling-clock interrupt should also check for
> > this, and should push things forward if need be.
> >
> > If RCU managed to put all of its callbacks into the
> > RCU_NEXT_READY_TAIL bucket on all CPUs, that would defeat the
> > wakeup-if-no-grace-period checks (RCU is supposed to have started the
> > relevant grace period before putting callbacks into that bucket).
> > But that cannot be the case here, because new callbacks are being
> > enqueued throughout, and these would then trigger RCU's
> > start-a-new-grace-period checks.
> >
> > But it would be good to confirm that this is actually working like I
> > would expect it to.  Could you please add scheduler wakeup to your
> > tracing, if possible only displaying those sent to the rcu_preempt
> > task?
> >
> > 							Thanx, Paul
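> >
> > P.S.  One way to capture exactly that, assuming tracefs is mounted at
> > /sys/kernel/debug/tracing (adjust the path for your setup), is to
> > filter the sched_wakeup event on the grace-period kthread's comm:
> >
> > 	echo 'comm == "rcu_preempt"' > /sys/kernel/debug/tracing/events/sched/sched_wakeup/filter
> > 	echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable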
> >
> > > -----Original Message-----
> > > From: Paul E. McKenney <paul...@linux.ibm.com>
> > > Sent: Thursday, December 6, 2018 1:45 AM
> > > To: He, Bo <bo...@intel.com>
> > > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, Jie A <jie.a....@intel.com>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Wed, Dec 05, 2018 at 08:42:54AM +0000, He, Bo wrote:
> > > > I double-checked the .config; we don't enable CONFIG_NO_HZ_FULL.
> > > > Our previous logs can dump all the task backtraces, and the
> > > > kthreads (the rcu_preempt, rcu_sched, and rcu_bh tasks) are all in
> > > > "I" state, not in "R" state.  My understanding is that if it's the
> > > > side effect of causing RCU's kthreads to be run at SCHED_FIFO
> > > > priority 1, the kthreads should be in "R" state.
> > >
> > > Hmmm...  Well, the tasks could in theory be waiting on a blocking
> > > mutex.  But in practice the grace-period kthreads wait on events, so
> > > that makes no sense.
> > >
> > > Is it possible for you to dump out the grace-period kthread's stack,
> > > for example, with sysrq-t?  (Steve might know a better way to do
> > > this.)
> > >
> > > > I will do more experiments and keep you updated once we have more
> > > > findings:
> > > > 1. Set the kthread priority to SCHED_FIFO without CONFIG_RCU_BOOST
> > > >    and see if the issue reproduces.
> > >
> > > That sounds like a most excellent experiment!
> > >
> > > > 2. Check more ftrace to confirm why there is no
> > > >    trace_rcu_quiescent_state_report and why most of the
> > > >    trace_rcu_grace_period events are "AccWaitCB".
> > >
> > > As noted earlier, to see something interesting, you will need to
> > > start the ftrace before the grace period starts.  This would
> > > probably mean having ftrace running before starting the test.
> > > Starting the ftrace after the hang commences is unlikely to produce
> > > useful information.
> > >
> > > 							Thanx, Paul
> > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney <paul...@linux.ibm.com>
> > > > Sent: Wednesday, December 5, 2018 3:50 AM
> > > > To: He, Bo <bo...@intel.com>
> > > > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, Jie A <jie.a....@intel.com>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Tue, Dec 04, 2018 at 07:50:04AM +0000, He, Bo wrote:
> > > > > Hi, Paul:
> > > > > Enclosed is the log that triggers the 120s hung_task_panic
> > > > > without other debug patches.  The hung task is blocked at
> > > > > __wait_rcu_gp, which means the rcu_cpu_stall can't detect the
> > > > > scenario:
> > > > > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > > > > echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> > > >
> > > > Not necessarily.  If there is an RCU CPU stall warning, blocking
> > > > within __wait_rcu_gp() is expected behavior.  It is possible that
> > > > the problem is that although the grace period is completing as
> > > > required, the callbacks are not being invoked in a timely fashion.
> > > > And that could happen if you had CONFIG_NO_HZ_FULL and a bunch of
> > > > nohz_full CPUs, or, alternatively, callback offloading enabled.
> > > > But I don't see these in your previous emails.
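> > > >
> > > > (If you would like to double-check on your side, something like
> > > > the following should do it; this is an illustrative sketch that
> > > > assumes your build tree's .config and a readable /proc/cmdline:
> > > >
> > > > 	grep -E 'CONFIG_NO_HZ_FULL=|CONFIG_RCU_NOCB_CPU=' .config
> > > > 	grep -oE '(nohz_full|rcu_nocbs)=[^ ]*' /proc/cmdline
> > > >
> > > > No output would mean that neither nohz_full operation nor callback
> > > > offloading is in play.)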
> > > > Another possible cause is that the grace-period kthread is being
> > > > delayed, so that the grace period never starts.  This seems
> > > > unlikely, but it is the only thing thus far that matches the
> > > > symptoms.
> > > >
> > > > CONFIG_RCU_BOOST=y has the side effect of causing RCU's kthreads
> > > > to be run at SCHED_FIFO priority 1, and that would help in the
> > > > case where RCU's grace-period kthread (the rcu_preempt, rcu_sched,
> > > > and rcu_bh tasks, all of which execute in the rcu_gp_kthread()
> > > > function) was being starved of CPU time.
> > > >
> > > > Does that sound likely?
> > > >
> > > > 							Thanx, Paul
> > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney <paul...@linux.ibm.com>
> > > > > Sent: Monday, December 3, 2018 9:57 PM
> > > > > To: He, Bo <bo...@intel.com>
> > > > > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Mon, Dec 03, 2018 at 07:44:03AM +0000, He, Bo wrote:
> > > > > > Thanks, we have run the test for the whole weekend and did not
> > > > > > reproduce the issue, so we confirm that CONFIG_RCU_BOOST fixes
> > > > > > the issue.
> > > > >
> > > > > Very good, that is encouraging.  Perhaps I should think about
> > > > > making CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in
> > > > > mainline, at least for architectures for which rt_mutexes are
> > > > > implemented.
> > > > >
> > > > > > We have enabled rcupdate.rcu_cpu_stall_timeout=7 and also set
> > > > > > panic on RCU stall, and will see if we can catch the panic; we
> > > > > > will keep you posted with the test results.
> > > > > > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > > > >
> > > > > Looking forward to seeing what is going on!  Of course, to
> > > > > reproduce, you will need to again build with CONFIG_RCU_BOOST=n.
> > > > >
> > > > > 							Thanx, Paul
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Paul E. McKenney <paul...@linux.ibm.com>
> > > > > > Sent: Saturday, December 1, 2018 12:49 AM
> > > > > > To: He, Bo <bo...@intel.com>
> > > > > > Cc: Steven Rostedt <rost...@goodmis.org>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>
> > > > > > Subject: Re: rcu_preempt caused oom
> > > > > >
> > > > > > On Fri, Nov 30, 2018 at 03:18:58PM +0000, He, Bo wrote:
> > > > > > > Here is the kernel cmdline:
> > > > > >
> > > > > > Thank you!
> > > > > >
> > > > > > > Kernel command line: androidboot.acpio_idx=0 androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt loglevel=4 androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/ pstore.backend=ramoops memmap=0x1400000$0x50000000 ramoops.mem_address=0x50000000 ramoops.mem_size=0x1400000 ramoops.record_size=0x4000 ramoops.console_size=0x1000000 ramoops.ftrace_size=0x10000 ramoops.dump_oops=1 vga=current i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=
> > > > > >
> > > > > > And no sign of any suppression of RCU CPU stall warnings.
> > > > > > Hmmm...  Does it take more than 21 seconds to OOM?  Or do
> > > > > > things happen faster than that?  If they do happen faster than
> > > > > > that, then one approach would be to add something like this to
> > > > > > the kernel command line:
> > > > > >
> > > > > > 	rcupdate.rcu_cpu_stall_timeout=7
> > > > > >
> > > > > > This would set the stall timeout to seven seconds.  Note that
> > > > > > timeouts less than three seconds are silently interpreted as
> > > > > > three seconds.
> > > > > >
> > > > > > 							Thanx, Paul
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Steven Rostedt <rost...@goodmis.org>
> > > > > > > Sent: Friday, November 30, 2018 11:17 PM
> > > > > > > To: Paul E. McKenney <paul...@linux.ibm.com>
> > > > > > > Cc: He, Bo <bo...@intel.com>; linux-kernel@vger.kernel.org; j...@joshtriplett.org; mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>
> > > > > > > Subject: Re: rcu_preempt caused oom
> > > > > > >
> > > > > > > On Fri, 30 Nov 2018 06:43:17 -0800 "Paul E. McKenney"
> > > > > > > <paul...@linux.ibm.com> wrote:
> > > > > > >
> > > > > > > > Could you please send me your list of kernel boot parameters?
> > > > > > > > They usually appear near the start of your console output.
> > > > > > >
> > > > > > > Or just: cat /proc/cmdline
> > > > > > >
> > > > > > > -- Steve