On 2017/11/29 18:41, Eduardo Habkost wrote:
> On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
>> On 2017/11/29 5:13, Eduardo Habkost wrote:
>>> [CCing the people who were copied in the original patch that
>>> enabled l3cache]
>>>
>>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
>>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
>>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
>>>>>> introduced exposing the L3 cache to the guest and enabled it by default.
>>>>>>
>>>>>> The motivation behind it was that in the Linux scheduler, when waking up
>>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
>>>>>> directly, without sending a reschedule IPI. The reduction in the IPI
>>>>>> count led to a performance gain.
>>>>>>
>>>>>> However, this isn't the whole story. Once the task is on the target
>>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
>>>>>> it the idle task putting the CPU to sleep or just another running task.
>>>>>> For that a reschedule IPI will have to be issued, too. Only when the
>>>>>> other CPU's current task has run for too short a time do the fairness
>>>>>> constraints prevent the preemption and thus the IPI.
>>>>>>
>>
>> Agree. :)
>>
>> Our testing VM at that time was a SUSE 11 guest with idle=poll, and now I
>> realize that SUSE 11 has a BUG in its scheduler.
>>
>> For RHEL 7.3 or the upstream kernel, in ttwu_queue_remote(), a RES IPI is
>> only issued if rq->idle is not polling:
>> '''
>> static void ttwu_queue_remote(struct task_struct *p, int cpu)
>> {
>>         struct rq *rq = cpu_rq(cpu);
>>
>>         if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>>                 if (!set_nr_if_polling(rq->idle))
>>                         smp_send_reschedule(cpu);
>>                 else
>>                         trace_sched_wake_idle_without_ipi(cpu);
>>         }
>> }
>> '''
>>
>> But SUSE 11 does not do this check; it sends a RES IPI unconditionally.
>
> So, does that mean no Linux guest benefits from the l3-cache=on
> default except SuSE 11 guests?
>

Not only that, there is another scenario:

static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
{
        struct rq *rq = cpu_rq(cpu);
        struct rq_flags rf;

        if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
                /* the two CPUs do NOT share an L3 cache */
                ...
                ttwu_queue_remote(p, cpu, wake_flags);
                return;
        }
        ...
        ttwu_do_activate(rq, p, wake_flags, &rf);   <-- *Here*
        ...
}

In ttwu_do_activate(), there are also some (low-probability) opportunities
to avoid sending a RES IPI even when the target CPU isn't in the idle
polling state.
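
BTW, for reference, as far as I can see cpus_share_cache() in the upstream
kernel (kernel/sched/core.c) simply compares the two CPUs' last-level-cache
domain ids, so whether QEMU exposes an L3 cache directly decides which branch
ttwu_queue() above takes:
'''
bool cpus_share_cache(int this_cpu, int that_cpu)
{
        return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}
'''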
>
>>
>>>>>> This boils down to the improvement being only achievable in workloads
>>>>>> with many actively switching tasks. We had no access to the
>>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
>>>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
>>>>>> on a 1-socket, 8-core vCPU topology; we see indeed:
>>>>>>
>>>>>> l3-cache    #res IPI /s    #time / 10000 loops
>>>>>> off         560K           1.8 sec
>>>>>> on          40K            0.9 sec
>>>>>>
>>>>>> Now there's a downside: with an L3 cache the Linux scheduler is more
>>>>>> eager to wake up tasks on sibling CPUs, resulting in unnecessary
>>>>>> cross-vCPU interactions and therefore excessive halts and IPIs.
>>>>>> E.g. "perf bench sched pipe -i 100000" gives
>>>>>>
>>>>>> l3-cache    #res IPI /s    #HLT /s    #time / 100000 loops
>>>>>> off         200 (no K)     230        0.2 sec
>>>>>> on          400K           330K       0.5 sec
>>>>>>
>>
>> I guess this issue could be resolved by disabling SD_WAKE_AFFINE.
>>
>> As Gonglei said:
>> 1. the L3 cache relates to the user experience.
>
> This is true, in a way: I have seen a fair share of user reports
> where they incorrectly blame the L3 cache absence or the L3 cache
> size for performance problems.
>
>> 2. the glibc would get the cache info by CPUID directly, and this relates
>> to memory performance.
>
> I'm interested in numbers that demonstrate that.
>

Sorry, I have no numbers at hand currently :( I'll do some tests these days,
please give me some time.

>>
>> What's more, the L3 cache relates to the sched_domain, which is important
>> to the (load) balancer when the system is busy.
>>
>> All this doesn't mean the patch is insignificant, I just think we should do
>> more research before deciding. I'll do some tests, thanks. :)
>
> Yes, we need more data. But if we find out that there are no
> cases where the l3-cache=on default actually improves
> performance, I will be willing to apply this patch.
>

Finding out the truth would be a good thing either way, and it's free. :)

OTOH, I think we should keep in mind that Linux is designed for real
hardware, so there may be other problems if QEMU lacks related features.
If we search for 'cpus_share_cache' in the Linux kernel, we can see that
it's also used by the block layer.

> IMO, the long term solution is to make Linux guests not misbehave
> when we stop lying about the L3 cache. Maybe we could provide a
> "IPIs are expensive, please avoid them" hint in the KVM CPUID
> leaf?
>

Good idea. :) Maybe more PV features could be dug up along these lines.

>>
>>>>>> In a more realistic test, we observe a 15% degradation in VM density
>>>>>> (measured as the number of VMs, each running Drupal CMS serving 2 http
>>>>>> requests per second to its main page, with 95th-percentile response
>>>>>> latency under 100 ms) with l3-cache=on.
>>>>>>
>>>>>> We think that the mostly-idle scenario is more common in cloud and
>>>>>> personal usage, and should be optimized for by default; users of highly
>>>>>> loaded VMs should be able to tune them up themselves.
>>>>>>
>>>>> There's one thing I don't understand in your test case: if you
>>>>> just found out that Linux will behave worse if it assumes that
>>>>> the VCPUs are sharing an L3 cache, why are you configuring an
>>>>> 8-core VCPU topology explicitly?
>>>>>
>>>>> Do you still see a difference in the numbers if you use "-smp 8"
>>>>> with no "cores" and "threads" options?
>>>>>
>>>> This is quite simple. A lot of software licenses are bound to the number
>>>> of CPU __sockets__. Thus it is mandatory in a lot of cases to set the
>>>> topology with 1 socket/xx cores to reduce the amount of money that has
>>>> to be paid for the software.
>>>
>>> In this case it looks like we're talking about the expected
>>> meaning of "cores=N". My first interpretation would be that the
>>> user obviously wants the guest to see the multiple cores sharing an
>>> L3 cache, because that's how real CPUs normally work. But I see
>>> why you have different expectations.
>>>
>>> Numbers on dedicated-pCPU scenarios would be helpful to guide the
>>> decision. I wouldn't like to cause a performance regression for
>>> users that fine-tuned the vCPU topology and set up CPU pinning.
>>>
>>
>>
>> --
>> Regards,
>> Longpeng(Mike)
>>
>

--
Regards,
Longpeng(Mike)
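
PS: regarding the glibc point above, a first quick check I plan to do is
simply to read back what glibc derives from CPUID inside the guest with
l3-cache=on vs. off. A minimal probe sketch (not a benchmark; the _SC_LEVEL*
sysconf names are glibc extensions):
'''
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* glibc fills these from CPUID, so they follow what QEMU exposes. */
        long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
        long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);

        printf("L2: %ld bytes, L3: %ld bytes\n", l2, l3);
        return 0;
}
'''
IIRC glibc also uses the detected shared cache size to choose the
non-temporal store threshold in its memcpy/memset implementations, which is
presumably where the "memory performance" part would show up; I'll try to
get real numbers for that as well.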