On Fri, Dec 14, 2018 at 01:30:04AM +0000, He, Bo wrote: > as you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I > double checked there is no FAST_NO_HZ in .config:
Yes, you are correct, CONFIG_RCU_FAST_NO_HZ. OK, you do not have it set, which means several code paths can be ignored. Also CONFIG_HZ=1000, so 300 second delay. Thanx, Paul > Here is the grep from .config: > egrep "HZ|RCU" .config > CONFIG_NO_HZ_COMMON=y > # CONFIG_HZ_PERIODIC is not set > CONFIG_NO_HZ_IDLE=y > # CONFIG_NO_HZ_FULL is not set > CONFIG_NO_HZ=y > # RCU Subsystem > CONFIG_PREEMPT_RCU=y > # CONFIG_RCU_EXPERT is not set > CONFIG_SRCU=y > CONFIG_TREE_SRCU=y > CONFIG_TASKS_RCU=y > CONFIG_RCU_STALL_COMMON=y > CONFIG_RCU_NEED_SEGCBLIST=y > # CONFIG_HZ_100 is not set > # CONFIG_HZ_250 is not set > # CONFIG_HZ_300 is not set > CONFIG_HZ_1000=y > CONFIG_HZ=1000 > # CONFIG_MACHZ_WDT is not set > # RCU Debugging > CONFIG_PROVE_RCU=y > CONFIG_RCU_PERF_TEST=m > CONFIG_RCU_TORTURE_TEST=m > CONFIG_RCU_CPU_STALL_TIMEOUT=7 > CONFIG_RCU_TRACE=y > CONFIG_RCU_EQS_DEBUG=y > > -----Original Message----- > From: Paul E. McKenney <paul...@linux.ibm.com> > Sent: Friday, December 14, 2018 2:12 AM > To: He, Bo <bo...@intel.com> > Cc: Zhang, Jun <jun.zh...@intel.com>; Steven Rostedt <rost...@goodmis.org>; > linux-kernel@vger.kernel.org; j...@joshtriplett.org; > mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Xiao, Jin > <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, Jie A > <jie.a....@intel.com>; Sun, Yi J <yi.j....@intel.com> > Subject: Re: rcu_preempt caused oom > > On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote: > > one of the board reproduce the issue with the show_rcu_gp_kthreads(), I > > also enclosed the logs as attachment. > > > > [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: > > 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time > > 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed > > 21808196 ->gp_flags 0x1 > > This is quite helpful, thank you! > > The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is > good. 
The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting > for a new grace-period request. The "->state: 0x402" means that it is > sleeping, neither running nor in the process of waking up. > The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time > 308258" means that it has been more than 300,000 jiffies since the > rcu_preempt task did anything or was requested to do anything. > > The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to > awaken the rcu_preempt task happened during the last grace period. > The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone > requested a new grace period. So if the rcu_preempt task were to wake up, it > would process the new grace period. Note again also the ->gp_req_activity > 308256, which indicates that ->gp_flags was set more than 300,000 jiffies > ago, just after the last recorded activity of the rcu_preempt task. > > But this is exactly the situation that rcu_check_gp_start_stall() is designed > to warn about (and does warn about for me when I comment out the wakeup > code). So why is rcu_check_gp_start_stall() not being called? Here are a > couple of possibilities: > > 1. Because rcu_check_gp_start_stall() is only ever invoked from > RCU_SOFTIRQ, it is possible that softirqs are stalled for > whatever reason. > > 2. Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock > interrupt handler, it is possible that the scheduler tick has > somehow been disabled. Traces from earlier runs showed a great > deal of RCU callbacks queued, which would have caused RCU to > refuse to allow the scheduler tick to be disabled, even if the > corresponding CPU was idle. > > 3. You have CONFIG_FAST_NO_HZ=y (which you probably do, given > that you are building for a battery-powered device) and all of the > CPU's callbacks are lazy. Except that your earlier traces showed > lots of non-lazy callbacks. 
Besides, even if all callbacks were > lazy, there would still be a scheduling-clock interrupt every > six seconds, and there are quite a few six-second intervals > in a two-minute watchdog timeout. > > But if we cannot find the problem quickly, I will likely ask > you to try reproducing with CONFIG_FAST_NO_HZ=n. This could > be thought of as bisecting the RCU code looking for the bug. > > The first two of these seem unlikely given that the watchdog timer was still > firing. Still, I don't see how 300,000 jiffies elapsed with a grace period > requested and not started otherwise. Could you please check? > One way to do so would be to enable ftrace on rcu_check_callbacks(), > __rcu_process_callbacks(), and rcu_check_gp_start_stall(). It might be > necessary to no-inline rcu_check_gp_start_stall(). You might have better > ways to collect this information. > > Without this information, the only workaround patch I can give you will > degrade battery lifetime, which might not be what you want. > > You do have a lockdep complaint early at boot. Although I don't immediately > see how this self-deadlock would affect RCU, please do get it fixed. > Sometimes the consequences of this sort of deadlock can propagate to > unexpected places. > > Regardless of why rcu_check_gp_start_stall() failed to complain, it looks > like ->gp_flags was set after the rcu_preempt task slept for the last time, and so > there should have been a wakeup the last time that ->gp_flags was set. > Perhaps there is some code path that drops the wakeup. > I did check this in current -rcu, but you are instead running v4.19, so I > should also check there. > > The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in > rcu_gp_cleanup(). We can eliminate rcu_gp_cleanup() from consideration > because only the rcu_preempt task will execute that code, and we know that > this task was asleep at the last time this bit was set.
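[Editor's note: the ftrace collection suggested above can be set up through tracefs. This is only a sketch, assuming a root shell, tracefs mounted at /sys/kernel/tracing (on older setups, /sys/kernel/debug/tracing), and CONFIG_FUNCTION_TRACER=y; the function names are the ones named in the mail.]

```shell
cd /sys/kernel/tracing

# Restrict the function tracer to the three RCU functions of interest.
# rcu_check_gp_start_stall() may need to be marked noinline in the
# kernel source so that it appears in available_filter_functions.
echo rcu_check_callbacks      >  set_ftrace_filter
echo __rcu_process_callbacks  >> set_ftrace_filter
echo rcu_check_gp_start_stall >> set_ftrace_filter

echo function > current_tracer
echo 1 > tracing_on

# ... reproduce the hang, then capture the ring buffer:
cat trace > /tmp/rcu-ftrace.log
```

If rcu_check_gp_start_stall() never shows up in the log while the other two do, that would point at possibility 1 or 2 above rather than a lost wakeup.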
> Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is > needed, and the caller must do the wakeup once it is safe to do so, that is, > after the various rcu_node locks have been released (doing a wakeup while > holding any of those locks results in deadlock). > > The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs() and > rcu_nocb_wait_gp(). We can eliminate rcu_nocb_wait_gp() because you are > building with CONFIG_RCU_NOCB_CPU=n. Then rcu_accelerate_cbs() is invoked > from: > > o rcu_accelerate_cbs_unlocked(), which does the following, thus > properly awakening the rcu_preempt task when needed: > > needwake = rcu_accelerate_cbs(rsp, rnp, rdp); > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ > if (needwake) > rcu_gp_kthread_wake(rsp); > > o rcu_advance_cbs(), which returns the value returned by > rcu_accelerate_cbs(), thus pushing the problem off to its > callers, which are called out below. > > o __note_gp_changes(), which also returns the value returned by > rcu_accelerate_cbs(), thus pushing the problem off to its callers, > which are called out below. > > o rcu_gp_cleanup(), which is only ever invoked by RCU grace-period > kthreads such as the rcu_preempt task. Therefore, this function > never needs to awaken the rcu_preempt task, because the fact > that this function is executing means that this task is already > awake. (Also, as noted above, we can eliminate this code from > consideration because this task is known to have been sleeping > at the last time that the RCU_GP_FLAG_INIT bit was set.) 
> > o rcu_report_qs_rdp(), which does the following, thus properly > awakening the rcu_preempt task when needed: > > needwake = rcu_accelerate_cbs(rsp, rnp, rdp); > > rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags); > /* ^^^ Released rnp->lock */ > if (needwake) > rcu_gp_kthread_wake(rsp); > > o rcu_prepare_for_idle(), which does the following, thus properly > awakening the rcu_preempt task when needed: > > needwake = rcu_accelerate_cbs(rsp, rnp, rdp); > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ > if (needwake) > rcu_gp_kthread_wake(rsp); > > Now for rcu_advance_cbs(): > > o __note_gp_changes(), which also returns the value returned > by rcu_advance_cbs(), thus pushing the problem off to its callers, > which are called out below. > > o rcu_migrate_callbacks(), which does the following, thus properly > awakening the rcu_preempt task when needed: > > needwake = rcu_advance_cbs(rsp, rnp_root, rdp) || > rcu_advance_cbs(rsp, rnp_root, my_rdp); > rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist); > WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) != > !rcu_segcblist_n_cbs(&my_rdp->cblist)); > raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags); > if (needwake) > rcu_gp_kthread_wake(rsp); > > Now for __note_gp_changes(): > > o note_gp_changes(), which does the following, thus properly > awakening the rcu_preempt task when needed: > > needwake = __note_gp_changes(rsp, rnp, rdp); > raw_spin_unlock_irqrestore_rcu_node(rnp, flags); > if (needwake) > rcu_gp_kthread_wake(rsp); > > o rcu_gp_init(), which is only ever invoked by RCU grace-period > kthreads such as the rcu_preempt task, which makes wakeups > unnecessary, just as for rcu_gp_cleanup() above. > > o rcu_gp_cleanup(), ditto. > > So I am not seeing how I am losing a wakeup, but please do feel free to > double-check my analysis. One way to do that is using event tracing.
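[Editor's note: the event-tracing double-check suggested above can be sketched from tracefs as well. The rcu tracepoint group exists in v4.19 (include/trace/events/rcu.h), and CONFIG_RCU_TRACE=y per the .config earlier in the thread; paths again assume tracefs at /sys/kernel/tracing and a root shell.]

```shell
cd /sys/kernel/tracing

# Enable every tracepoint in the rcu group. rcu_grace_period in
# particular logs a "newreq" event when a new grace period is
# requested and "start" when the grace-period kthread begins it.
echo 1 > events/rcu/enable
echo 1 > tracing_on

# After reproducing the hang, a lost wakeup should appear as a
# "newreq" with no subsequent "start" for that gp_seq:
grep rcu_grace_period trace | tail -n 50
```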
> > Thanx, Paul > > ------------------------------------------------------------------------ > lockdep complaint: > ------------------------------------------------------------------------ > > [ 2.895507] ====================================================== > [ 2.895511] WARNING: possible circular locking dependency detected > [ 2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G U > > [ 2.895521] ------------------------------------------------------ > [ 2.895525] earlyEvs/1839 is trying to acquire lock: > [ 2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: > ipu_isys_subdev_get_ffmt+0x32/0x90 > [ 2.895546] > [ 2.895546] but task is already holding lock: > [ 2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: > media_pipeline_start+0x28/0x50 > [ 2.895561] > [ 2.895561] which lock already depends on the new lock. > [ 2.895561] > [ 2.895566] > [ 2.895566] the existing dependency chain (in reverse order) is: > [ 2.895570] > [ 2.895570] -> #1 (&mdev->graph_mutex){+.+.}: > [ 2.895583] __mutex_lock+0x80/0x9a0 > [ 2.895588] mutex_lock_nested+0x1b/0x20 > [ 2.895593] media_device_register_entity+0x92/0x1e0 > [ 2.895598] v4l2_device_register_subdev+0xc2/0x1b0 > [ 2.895604] ipu_isys_csi2_init+0x22c/0x520 > [ 2.895608] isys_probe+0x6cb/0xed0 > [ 2.895613] ipu_bus_probe+0xfd/0x2e0 > [ 2.895620] really_probe+0x268/0x3d0 > [ 2.895625] driver_probe_device+0x11a/0x130 > [ 2.895630] __device_attach_driver+0x86/0x100 > [ 2.895635] bus_for_each_drv+0x6e/0xb0 > [ 2.895640] __device_attach+0xdf/0x160 > [ 2.895645] device_initial_probe+0x13/0x20 > [ 2.895650] bus_probe_device+0xa6/0xc0 > [ 2.895655] deferred_probe_work_func+0x88/0xe0 > [ 2.895661] process_one_work+0x220/0x5c0 > [ 2.895665] worker_thread+0x1da/0x3b0 > [ 2.895670] kthread+0x12c/0x150 > [ 2.895675] ret_from_fork+0x3a/0x50 > [ 2.895678] > [ 2.895678] -> #0 (&asd->mutex){+.+.}: > [ 2.895688] lock_acquire+0x95/0x1a0 > [ 2.895693] __mutex_lock+0x80/0x9a0 > [ 2.895698] mutex_lock_nested+0x1b/0x20 > [ 
2.895703] ipu_isys_subdev_get_ffmt+0x32/0x90 > [ 2.895708] ipu_isys_csi2_get_fmt+0x14/0x30 > [ 2.895713] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80 > [ 2.895718] v4l2_subdev_link_validate_one+0x67/0x120 > [ 2.895723] v4l2_subdev_link_validate+0x246/0x490 > [ 2.895728] csi2_link_validate+0xc6/0x220 > [ 2.895733] __media_pipeline_start+0x15b/0x2f0 > [ 2.895738] media_pipeline_start+0x33/0x50 > [ 2.895743] ipu_isys_video_prepare_streaming+0x1e0/0x610 > [ 2.895748] start_streaming+0x186/0x3a0 > [ 2.895753] vb2_start_streaming+0x6d/0x130 > [ 2.895758] vb2_core_streamon+0x108/0x140 > [ 2.895762] vb2_streamon+0x29/0x50 > [ 2.895767] vb2_ioctl_streamon+0x42/0x50 > [ 2.895772] v4l_streamon+0x20/0x30 > [ 2.895776] __video_do_ioctl+0x1af/0x3c0 > [ 2.895781] video_usercopy+0x27e/0x7e0 > [ 2.895785] video_ioctl2+0x15/0x20 > [ 2.895789] v4l2_ioctl+0x49/0x50 > [ 2.895794] do_video_ioctl+0x93c/0x2360 > [ 2.895799] v4l2_compat_ioctl32+0x93/0xe0 > [ 2.895806] __ia32_compat_sys_ioctl+0x73a/0x1c90 > [ 2.895813] do_fast_syscall_32+0x9a/0x2d6 > [ 2.895818] entry_SYSENTER_compat+0x6d/0x7c > [ 2.895821] > [ 2.895821] other info that might help us debug this: > [ 2.895821] > [ 2.895826] Possible unsafe locking scenario: > [ 2.895826] > [ 2.895830] CPU0 CPU1 > [ 2.895833] ---- ---- > [ 2.895836] lock(&mdev->graph_mutex); > [ 2.895842] lock(&asd->mutex); > [ 2.895847] lock(&mdev->graph_mutex); > [ 2.895852] lock(&asd->mutex); > [ 2.895857] > [ 2.895857] *** DEADLOCK *** > [ 2.895857] > [ 2.895863] 3 locks held by earlyEvs/1839: > [ 2.895866] #0: 00000000ed860090 (&av->mutex){+.+.}, at: > __video_do_ioctl+0xbf/0x3c0 > [ 2.895876] #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: > start_streaming+0x5c/0x3a0 > [ 2.895886] #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: > media_pipeline_start+0x28/0x50 > [ 2.895896] > [ 2.895896] stack backtrace: > [ 2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G U > 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 > [ 2.895907] Call Trace: > [ 
2.895915] dump_stack+0x70/0xa5 > [ 2.895921] print_circular_bug.isra.35+0x1d8/0x1e6 > [ 2.895927] __lock_acquire+0x1284/0x1340 > [ 2.895931] ? __lock_acquire+0x2b5/0x1340 > [ 2.895940] lock_acquire+0x95/0x1a0 > [ 2.895945] ? lock_acquire+0x95/0x1a0 > [ 2.895950] ? ipu_isys_subdev_get_ffmt+0x32/0x90 > [ 2.895956] ? ipu_isys_subdev_get_ffmt+0x32/0x90 > [ 2.895961] __mutex_lock+0x80/0x9a0 > [ 2.895966] ? ipu_isys_subdev_get_ffmt+0x32/0x90 > [ 2.895971] ? crlmodule_get_format+0x43/0x50 > [ 2.895979] mutex_lock_nested+0x1b/0x20 > [ 2.895984] ? mutex_lock_nested+0x1b/0x20 > [ 2.895989] ipu_isys_subdev_get_ffmt+0x32/0x90 > [ 2.895995] ipu_isys_csi2_get_fmt+0x14/0x30 > [ 2.896001] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80 > [ 2.896006] v4l2_subdev_link_validate_one+0x67/0x120 > [ 2.896011] ? crlmodule_get_format+0x2a/0x50 > [ 2.896018] ? find_held_lock+0x35/0xa0 > [ 2.896023] ? crlmodule_get_format+0x43/0x50 > [ 2.896030] v4l2_subdev_link_validate+0x246/0x490 > [ 2.896035] ? __mutex_unlock_slowpath+0x58/0x2f0 > [ 2.896042] ? mutex_unlock+0x12/0x20 > [ 2.896046] ? crlmodule_get_format+0x43/0x50 > [ 2.896052] ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80 > [ 2.896057] ? v4l2_subdev_link_validate_one+0x67/0x120 > [ 2.896065] ? __is_insn_slot_addr+0xad/0x120 > [ 2.896070] ? kernel_text_address+0xc4/0x100 > [ 2.896078] ? v4l2_subdev_link_validate+0x246/0x490 > [ 2.896085] ? kernel_text_address+0xc4/0x100 > [ 2.896092] ? __lock_acquire+0x1106/0x1340 > [ 2.896096] ? __lock_acquire+0x1169/0x1340 > [ 2.896103] csi2_link_validate+0xc6/0x220 > [ 2.896110] ? __lock_is_held+0x5a/0xa0 > [ 2.896115] ? mark_held_locks+0x58/0x80 > [ 2.896122] ? __kmalloc+0x207/0x2e0 > [ 2.896127] ? __lock_is_held+0x5a/0xa0 > [ 2.896134] ? rcu_read_lock_sched_held+0x81/0x90 > [ 2.896139] ? __kmalloc+0x2a3/0x2e0 > [ 2.896144] ? media_pipeline_start+0x28/0x50 > [ 2.896150] ? __media_entity_enum_init+0x33/0x70 > [ 2.896155] ? csi2_has_route+0x18/0x20 > [ 2.896160] ? 
media_graph_walk_next.part.9+0xac/0x290 > [ 2.896166] __media_pipeline_start+0x15b/0x2f0 > [ 2.896173] ? rcu_read_lock_sched_held+0x81/0x90 > [ 2.896179] media_pipeline_start+0x33/0x50 > [ 2.896186] ipu_isys_video_prepare_streaming+0x1e0/0x610 > [ 2.896191] ? __lock_acquire+0x132e/0x1340 > [ 2.896198] ? __lock_acquire+0x2b5/0x1340 > [ 2.896204] ? lock_acquire+0x95/0x1a0 > [ 2.896209] ? start_streaming+0x5c/0x3a0 > [ 2.896215] ? start_streaming+0x5c/0x3a0 > [ 2.896221] ? __mutex_lock+0x391/0x9a0 > [ 2.896226] ? v4l_enable_media_source+0x2d/0x70 > [ 2.896233] ? find_held_lock+0x35/0xa0 > [ 2.896238] ? v4l_enable_media_source+0x57/0x70 > [ 2.896245] start_streaming+0x186/0x3a0 > [ 2.896250] ? __mutex_unlock_slowpath+0x58/0x2f0 > [ 2.896257] vb2_start_streaming+0x6d/0x130 > [ 2.896262] ? vb2_start_streaming+0x6d/0x130 > [ 2.896267] vb2_core_streamon+0x108/0x140 > [ 2.896273] vb2_streamon+0x29/0x50 > [ 2.896278] vb2_ioctl_streamon+0x42/0x50 > [ 2.896284] v4l_streamon+0x20/0x30 > [ 2.896288] __video_do_ioctl+0x1af/0x3c0 > [ 2.896296] ? __might_fault+0x85/0x90 > [ 2.896302] video_usercopy+0x27e/0x7e0 > [ 2.896307] ? copy_overflow+0x20/0x20 > [ 2.896313] ? find_held_lock+0x35/0xa0 > [ 2.896319] ? __might_fault+0x3e/0x90 > [ 2.896325] video_ioctl2+0x15/0x20 > [ 2.896330] v4l2_ioctl+0x49/0x50 > [ 2.896335] do_video_ioctl+0x93c/0x2360 > [ 2.896343] v4l2_compat_ioctl32+0x93/0xe0 > [ 2.896349] __ia32_compat_sys_ioctl+0x73a/0x1c90 > [ 2.896354] ? lockdep_hardirqs_on+0xef/0x180 > [ 2.896359] ? 
do_fast_syscall_32+0x3b/0x2d6 > [ 2.896364] do_fast_syscall_32+0x9a/0x2d6 > [ 2.896370] entry_SYSENTER_compat+0x6d/0x7c > [ 2.896377] RIP: 0023:0xf7e79b79 > [ 2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c > 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a > 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 > [ 2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: > 0000000000000036 > [ 2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: > 0000000040045612 > [ 2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: > 00000000f768172c > [ 2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: > 0000000000000000 > [ 2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: > 0000000000000000 > [ 2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: > 0000000000000000 > > ------------------------------------------------------------------------ > > > [17818.936039] rcu: rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed > > 21808196 > > [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: > > 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time > > 101730 ->gp_wake_seq 1357 - >gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags > > 0x0 > > > > [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 > > delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time > > 4312486108 - >gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 > > ->gp_flags 0x0 > > > > -----Original Message----- > > From: Paul E. 
McKenney <paul...@linux.ibm.com> > > Sent: Thursday, December 13, 2018 12:40 PM > > To: Zhang, Jun <jun.zh...@intel.com> > > Cc: He, Bo <bo...@intel.com>; Steven Rostedt <rost...@goodmis.org>; > > linux-kernel@vger.kernel.org; j...@joshtriplett.org; > > mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Xiao, Jin > > <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, Jie > > A <jie.a....@intel.com>; Sun, Yi J <yi.j....@intel.com> > > Subject: Re: rcu_preempt caused oom > > > > On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote: > > > Ok, we will test it, thanks! > > > > But please also try the sysrq-y with the earlier patch after a hang! > > > > Thanx, Paul > > > > > -----Original Message----- > > > From: Paul E. McKenney [mailto:paul...@linux.ibm.com] > > > Sent: Thursday, December 13, 2018 10:43 > > > To: Zhang, Jun <jun.zh...@intel.com> > > > Cc: He, Bo <bo...@intel.com>; Steven Rostedt <rost...@goodmis.org>; > > > linux-kernel@vger.kernel.org; j...@joshtriplett.org; > > > mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Xiao, Jin > > > <jin.x...@intel.com>; Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, > > > Jie A <jie.a....@intel.com>; Sun, Yi J <yi.j....@intel.com> > > > Subject: Re: rcu_preempt caused oom > > > > > > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote: > > > > Hello, Paul > > > > > > > > I think the next patch is better. > > > > Because ULONG_CMP_GE could cause double write, which has risk that > > > > write back old value. > > > > Please help review. > > > > I don't test it. If you agree, we will test it. > > > > > > Just to make sure that I understand, you are worried about something like > > > the following, correct? > > > > > > o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed > > > and finds them equal. > > > > > > o At just this time something like rcu_start_this_gp() assigns a new > > > (larger) value to rdp->gp_seq_needed. 
> > > > > > o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the > > > old value. > > > > > > This cannot happen because __note_gp_changes() runs with interrupts > > > disabled on the CPU corresponding to the rcu_data structure referenced by > > > the rdp pointer. So there is no way for rcu_start_this_gp() to be > > > invoked on the same CPU during this "if" statement. > > > > > > Of course, there could be bugs. For example: > > > > > > o __note_gp_changes() might be called on a different CPU than that > > > corresponding to rdp. You can check this with something like: > > > > > > WARN_ON_ONCE(rdp->cpu != smp_processor_id()); > > > > > > o The same things could happen with rcu_start_this_gp(), and the > > > above WARN_ON_ONCE() would work there as well. > > > > > > o rcutree_prepare_cpu() is a special case, but is irrelevant unless > > > you are doing CPU-hotplug operations. (It can run on a CPU other > > > than rdp->cpu, but only at times when rdp->cpu is offline.) > > > > > > o Interrupts might not really be disabled. > > > > > > That said, your patch could reduce overhead slightly, given that the two > > > values will be equal much of the time. So it might be worth testing just > > > for that reason. > > > > > > So why not just test it anyway? If it makes the bug go away, I will > > > be surprised, but it would not be the first surprise for me. ;-) > > > > > > Thanx, Paul > > > > > > > Thanks! > > > > > > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index > > > > 0b760c1..c00f34e 100644 > > > > --- a/kernel/rcu/tree.c > > > > +++ b/kernel/rcu/tree.c > > > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state > > > > *rsp, struct rcu_node *rnp, > > > > zero_cpu_stall_ticks(rdp); > > > > } > > > > rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. 
> > > > */ > > > > - if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || > > > > rdp->gpwrap) > > > > + if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) > > > > + || > > > > + rdp->gpwrap) > > > > rdp->gp_seq_needed = rnp->gp_seq_needed; > > > > WRITE_ONCE(rdp->gpwrap, false); > > > > rcu_gpnum_ovf(rnp, rdp); > > > > > > > > > > > > -----Original Message----- > > > > From: Paul E. McKenney [mailto:paul...@linux.ibm.com] > > > > Sent: Thursday, December 13, 2018 08:12 > > > > To: He, Bo <bo...@intel.com> > > > > Cc: Steven Rostedt <rost...@goodmis.org>; > > > > linux-kernel@vger.kernel.org; j...@joshtriplett.org; > > > > mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun > > > > <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; Zhang, > > > > Yanmin <yanmin.zh...@intel.com>; Bai, Jie A <jie.a....@intel.com>; > > > > Sun, Yi J <yi.j....@intel.com> > > > > Subject: Re: rcu_preempt caused oom > > > > > > > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote: > > > > > I don't see the rcutree.sysrq_rcu parameter in v4.19 kernel, I also > > > > > checked the latest kernel and the latest tag v4.20-rc6, not see the > > > > > sysrq_rcu. > > > > > Please correct me if I have something wrong. > > > > > > > > That would be because I sent you the wrong patch, apologies! :-/ > > > > > > > > Please instead see the one below, which does add sysrq_rcu. > > > > > > > > Thanx, Paul > > > > > > > > > -----Original Message----- > > > > > From: Paul E. 
McKenney <paul...@linux.ibm.com> > > > > > Sent: Thursday, December 13, 2018 5:03 AM > > > > > To: He, Bo <bo...@intel.com> > > > > > Cc: Steven Rostedt <rost...@goodmis.org>; > > > > > linux-kernel@vger.kernel.org; j...@joshtriplett.org; > > > > > mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, > > > > > Jun <jun.zh...@intel.com>; Xiao, Jin <jin.x...@intel.com>; > > > > > Zhang, Yanmin <yanmin.zh...@intel.com>; Bai, Jie A > > > > > <jie.a....@intel.com> > > > > > Subject: Re: rcu_preempt caused oom > > > > > > > > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote: > > > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote: > > > > > > > we reproduce on two boards, but I still not see the > > > > > > > show_rcu_gp_kthreads() dump logs, it seems the patch can't catch > > > > > > > the scenario. > > > > > > > I double confirmed the CONFIG_PROVE_RCU=y is enabled in the > > > > > > > config as it's extracted from the /proc/config.gz. > > > > > > > > > > > > Strange. > > > > > > > > > > > > Are the systems responsive to sysrq keys once failure occurs? > > > > > > If so, I will provide you a sysrq-R or some such to dump out the > > > > > > RCU state. > > > > > > > > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 > > > > > using the patch below. Only lightly tested. > > > > > > > > ------------------------------------------------------------------ > > > > -- > > > > -- > > > > -- > > > > > > > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141 > > > > Author: Paul E. McKenney <paul...@linux.ibm.com> > > > > Date: Wed Dec 12 16:10:09 2018 -0800 > > > > > > > > rcu: Add sysrq rcu_node-dump capability > > > > > > > > Backported from v4.21/v5.0 > > > > > > > > Life is hard if RCU manages to get stuck without triggering RCU CPU > > > > stall warnings or triggering the rcu_check_gp_start_stall() checks > > > > for failing to start a grace period. 
This commit therefore adds a
> > > > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > > > must be set for this sysrq to be available.
> > > >
> > > > Signed-off-by: Paul E. McKenney <paul...@linux.ibm.com>
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 0b760c1369f7..e9392a9d6291 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -61,6 +61,7 @@
> > > >  #include <linux/trace_events.h>
> > > >  #include <linux/suspend.h>
> > > >  #include <linux/ftrace.h>
> > > > +#include <linux/sysrq.h>
> > > >
> > > >  #include "tree.h"
> > > >  #include "rcu.h"
> > > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > >  int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > >  /* panic() on RCU Stall sysctl. */
> > > >  int sysctl_panic_on_rcu_stall __read_mostly;
> > > > +/* Commandeer a sysrq key to dump RCU's tree. */
> > > > +static bool sysrq_rcu;
> > > > +module_param(sysrq_rcu, bool, 0444);
> > > >
> > > >  /*
> > > >   * The rcu_scheduler_active variable is initialized to the value
> > > > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > > >
> > > > +/* Dump grace-period-request information due to commandeered sysrq. */
> > > > +static void sysrq_show_rcu(int key)
> > > > +{
> > > > +	show_rcu_gp_kthreads();
> > > > +}
> > > > +
> > > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > > +	.handler = sysrq_show_rcu,
> > > > +	.help_msg = "show-rcu(y)",
> > > > +	.action_msg = "Show RCU tree",
> > > > +	.enable_mask = SYSRQ_ENABLE_DUMP,
> > > > +};
> > > > +
> > > > +static int __init rcu_sysrq_init(void)
> > > > +{
> > > > +	if (sysrq_rcu)
> > > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > > +	return 0;
> > > > +}
> > > > +early_initcall(rcu_sysrq_init);
> > > > +
> > > >  /*
> > > >   * Send along grace-period-related data for rcutorture diagnostics.
> > > >   */