On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote: > On 5/16/2016 10:34 AM, Paul E. McKenney wrote: > >On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote: > >>On 5/16/2016 5:03 AM, Paul E. McKenney wrote: > >>>On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilim...@oracle.com > >>>wrote: > >>>>On 5/15/16 2:18 PM, Santosh Shilimkar wrote: > >>>>>Hi Paul, > >>>>> > >>>>>I was asking Sasha about [1] since other folks in Oracle > >>>>>also stumbled upon similar RCU stalls with v4.1 kernel in > >>>>>different workloads. I was reported similar issue with > >>>>>RDS as well and looking at [1], [2], [3] and [4], thought > >>>>>of reaching out to see if you can help us to understand > >>>>>this issue better. > >>>>> > >>>>>Have also included RCU specific config used in these > >>>>>test(s). Its very hard to reproduce the issue but one of > >>>>>the data point is, it reproduces on systems with larger > >>>>>CPUs(64+). Same workload with less than 64 CPUs, don't > >>>>>show the issue. Someone also told me, making use of > >>>>>SLAB instead SLUB allocator makes difference but I > >>>>>haven't verified that part for RDS. > >>>>> > >>>>>Let me know your thoughts. Thanks in advance !! > >>>>> > >>>>One of my colleague told me the pastebin server I used > >>>>is Oracle internal only so adding the relevant logs along > >>>>with email. > >>>> > >> > >>[...] > >> > >>>>>[1] https://lkml.org/lkml/2014/12/14/304 > >>>> > >>>> > >>>>[2] Log 1 snippet: > >>>>----------------------------------------------------------------- > >>>>INFO: rcu_sched self-detected stall on CPU > >>>>INFO: rcu_sched self-detected stall on CPU { 54} (t=60000 jiffies > >>>>g=66023 c=66022 q=0) > >>>>Task dump for CPU 54: > >>>>ksoftirqd/54 R running task 0 389 2 0x00000008 > >>>> 0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036 > >>>> ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086 > >>>> ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540 > >>>>Call Trace: > >>>> <IRQ> [<ffffffff810a8621>] sched_show_task+0xb1/0x120 > >>>> [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50 > >>>> [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0 > >>>> [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170 > >>>> [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0 > >>>> [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170 > >>>> [<ffffffff810e9772>] update_process_times+0x42/0x70 > >>>> [<ffffffff810fb589>] tick_sched_handle+0x39/0x80 > >>>> [<ffffffff810fb824>] tick_sched_timer+0x44/0x80 > >>>> [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0 > >>>> [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0 > >>>> [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240 > >>>> [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60 > >>>> [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59 > >>>> [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80 > >>>> <EOI> [<ffffffff8118db64>] ? free_one_page+0x164/0x380 > >>>> [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0 > >>>> [<ffffffff8118e775>] __free_pages+0x25/0x40 > >>>> [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds] > >>>> [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds] > >>>> [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma] > >>>> [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50 > >>>> [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib] > >>>> [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma] > >>>> [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma] > >>> > >>>The most likely possibility is that there is a 60-second-long loop in > >>>one of the above functions. This is within bottom-half execution, so > >>>unfortunately the usual trick of placing cond_resched_rcu_qs() within this > >>>loop, but outside of any RCU read-side critical section does not work. > >>> > >>First of all thanks for explanation. > >> > >>There is no loop which can last for 60 seconds in above code since > >>its just completion queue handler used to free up buffers much like > >>NIC > >>drivers bottom half(NAPI). Its done in tasklet context for latency > >>reasons which RDS care most. Just to get your attention, the RCU > >>stall is also seen with XEN code too. Log for it end of the email. > >> > >>Another important observation is, for RDS if we avoid higher > >>order page(s) allocation, issue is not reproducible so far. > >>In other words, for PAGE_SIZE(4K, get_order(bytes) ==0) allocations, > >>the system continues to run without any issue, so the loop scenario > >>is ruled out more or less. > >> > >>To be specific, with PAGE_SIZE allocations, alloc_pages() > >>is just allocating a page and __free_page() is used > >>instead of __free_pages() from below snippet. > >> > >>------------------ > >>if (bytes >= PAGE_SIZE) > >> page = alloc_pages(gfp, get_order(bytes)); > >> > >>..... > >> > >>(rm->data.op_sg[i].length <= PAGE_SIZE) ? > >>__free_page(sg_page(&rm->data.op_sg[i])) : > >>__free_pages(sg_page(&rm->data.op_sg[i]), > >>get_order(rm->data.op_sg[i].length)); > >>---------------------------- > > > >This sounds like something to take up with the mm folks. > > > Sure. Will do once the link between two issues is established.
Fair enough! > >>>Therefore, if there really is a loop here, one fix would be to > >>>periodically unwind back out to run_ksoftirqd(), but setting up so that > >>>the work would be continued later. Another fix might be to move this > >>>from tasklet context to workqueue context, where cond_resched_rcu_qs() > >>>can be used -- however, this looks a bit like networking code, which > >>>does not always take kindly to being run in process context (though > >>>careful use of local_bh_disable() and local_bh_enable() can sometimes > >>>overcome this issue). A third fix, which works only if this code does > >>>not use RCU and does not invoke any code that does use RCU, is to tell > >>>RCU that it should ignore this code (which will require a little work > >>>on RCU, as it currently does not tolerate this sort of thing aside from > >>>the idle threads). In this last approach, event-tracing calls must use > >>>the _nonidle suffix. > >>> > >>>I am not familiar with the RDS code, so I cannot be more specific. > >> > >>No worries. Since we saw the issue with XEN too, I was suspecting > >>that somehow we didn't have RCU_TREE config setup correctly or > >>some important RCU patch(s) missing in v4.1 which made it in > >>later kernels. > >> > >>The only common denominator I saw between these two different > >>usecases (RDS and XEN), was the 'hrtimer_interrupt()' chain > >>which I believe triggers the rcu_sched() chain. > > > >Exactly! > > > >>I was thinking of enabling "CONFIG_RCU_TRACE" to see if we can > >>get more information out of the system. Do you suggest any other > >>RCU option(s)/patch(s) which we can help us to capture more > >>information to understand the issue better. I wasn't sure about > >>options like "RCU_KTHREAD_PRIO", TASKS_RCU etc. Used RCU config > >>is also mentioned end of the email. > > > >Hmmm... I just now noticed the "All QSes seen" message below. > >That can happen if the load is quite high, and could as you say > >one thing to try would be to set CONFIG_RCU_KTHREAD_PRIO=1. > >This is not free, as it will mean more context switches involving > >the RCU grace-period kthreads. > > > >Another thing to try is to set CONFIG_RCU_CPU_STALL_TIMEOUT to something > >smaller than 60, say, 21. This will cause the stall warnings to leave > >less time before splatting. > > > OK will try with CONFIG_RCU_KTHREAD_PRIO=1 & > CONFIG_RCU_CPU_STALL_TIMEOUT =21 . Indeed the usecases generate very > high load on the systems. > > >Are you running CONFIG_NO_HZ_FULL=y? If so, the problem might be that > >you need more housekeeping CPUs than you currently have configured. > > > Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for > book keeping. Seems like without that clock-event code will just use > CPU0 for things like broadcasting which might become bottleneck. > This could explain connect the hrtimer_interrupt() path getting slowed > down because of book keeping bottleneck. > > $cat .config | grep NO_HZ > CONFIG_NO_HZ_COMMON=y > # CONFIG_NO_HZ_IDLE is not set > CONFIG_NO_HZ_FULL=y > # CONFIG_NO_HZ_FULL_ALL is not set > # CONFIG_NO_HZ_FULL_SYSIDLE is not set > CONFIG_NO_HZ=y > # CONFIG_RCU_FAST_NO_HZ is not set Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all housekeeping tasks, including the RCU grace-period kthreads. So you are booting without any nohz_full boot parameter? You can end up with the same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter that you can with CONFIG_NO_HZ_FULL_ALL=y. Thanx, Paul