On 2017/11/29 18:41, Eduardo Habkost wrote:
> On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
>> On 2017/11/29 5:13, Eduardo Habkost wrote:
>>> [CCing the people who were copied in the original patch that
>>> enabled l3cache]
>>>
>>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
>>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
>>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
>>>>>> introduced exposing the L3 cache to the guest and enabled it by default.
>>>>>>
>>>>>> The motivation behind it was that in the Linux scheduler, when waking up
>>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
>>>>>> directly, without sending a reschedule IPI. The reduction in the IPI
>>>>>> count led to a performance gain.
>>>>>>
>>>>>> However, this isn't the whole story. Once the task is on the target
>>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
>>>>>> it the idle task putting the CPU to sleep or just another running task.
>>>>>> For that a reschedule IPI will have to be issued, too. Only when the
>>>>>> other CPU's current task has run for too short a time do the fairness
>>>>>> constraints prevent the preemption and thus the IPI.
>>>>>>
>>
>> Agree. :)
>>
>> Our testing VM at that time was a SUSE 11 guest with idle=poll, and now I
>> realize that SUSE 11 has a BUG in its scheduler.
>>
>> For RHEL 7.3 or the upstream kernel, in ttwu_queue_remote(), a RES IPI is
>> only issued if rq->idle is not polling:
>> '''
>> static void ttwu_queue_remote(struct task_struct *p, int cpu)
>> {
>>         struct rq *rq = cpu_rq(cpu);
>>
>>         if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>>                 if (!set_nr_if_polling(rq->idle))
>>                         smp_send_reschedule(cpu);
>>                 else
>>                         trace_sched_wake_idle_without_ipi(cpu);
>>         }
>> }
>> '''
>>
>> But SUSE 11 does not do this check; it sends a RES IPI unconditionally.
>
> So, does that mean no Linux guest benefits from the l3-cache=on
> default except SuSE 11 guests?
>

Not only that, there is another scenario:

static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
{
        struct rq *rq = cpu_rq(cpu);
        struct rq_flags rf;

        if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
                /* the two CPUs do NOT share an L3 cache */
                ...
                ttwu_queue_remote(p, cpu, wake_flags);
                return;
        }
        ...
        ttwu_do_activate(rq, p, wake_flags, &rf);   <-- *Here*
        ...
}

In ttwu_do_activate(), there are also some (low-probability) opportunities
to avoid sending a RES IPI even when the target CPU isn't in the idle
polling state.
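
BTW, for reference, as far as I can see cpus_share_cache() in the upstream
kernel (kernel/sched/core.c) simply compares the two CPUs' last-level-cache
domain ids, so whether QEMU exposes an L3 cache directly decides which branch
ttwu_queue() above takes:
'''
bool cpus_share_cache(int this_cpu, int that_cpu)
{
        return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}
'''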
>
>>
>>>>>> This boils down to the improvement being only achievable in workloads
>>>>>> with many actively switching tasks. We had no access to the
>>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
>>>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
>>>>>> on a 1-socket, 8-core vCPU topology; we see indeed:
>>>>>>
>>>>>> l3-cache    #res IPI /s    #time / 10000 loops
>>>>>> off         560K           1.8 sec
>>>>>> on          40K            0.9 sec
>>>>>>
>>>>>> Now there's a downside: with an L3 cache the Linux scheduler is more
>>>>>> eager to wake up tasks on sibling CPUs, resulting in unnecessary
>>>>>> cross-vCPU interactions and therefore excessive halts and IPIs.
>>>>>> E.g. "perf bench sched pipe -i 100000" gives
>>>>>>
>>>>>> l3-cache    #res IPI /s    #HLT /s    #time / 100000 loops
>>>>>> off         200 (no K)     230        0.2 sec
>>>>>> on          400K           330K       0.5 sec
>>>>>>
>>
>> I guess this issue could be resolved by disabling SD_WAKE_AFFINE.
>>
>> As Gonglei said:
>> 1. the L3 cache relates to the user experience.
>
> This is true, in a way: I have seen a fair share of user reports
> where they incorrectly blame the L3 cache absence or the L3 cache
> size for performance problems.
>
>> 2. the glibc would get the cache info by CPUID directly, and this relates
>> to memory performance.
>
> I'm interested in numbers that demonstrate that.
>

Sorry, I have no numbers at hand currently :( I'll do some tests these days,
please give me some time.

>>
>> What's more, the L3 cache relates to the sched_domain, which is important
>> to the (load) balancer when the system is busy.
>>
>> All this doesn't mean the patch is insignificant, I just think we should do
>> more research before deciding. I'll do some tests, thanks. :)
>
> Yes, we need more data. But if we find out that there are no
> cases where the l3-cache=on default actually improves
> performance, I will be willing to apply this patch.
>

Finding out the truth would be a good thing either way, and it's free. :)

OTOH, I think we should keep in mind that Linux is designed for real
hardware, so there may be other problems if QEMU lacks related features.
If we search for 'cpus_share_cache' in the Linux kernel, we can see that
it's also used by the block layer.

> IMO, the long term solution is to make Linux guests not misbehave
> when we stop lying about the L3 cache. Maybe we could provide a
> "IPIs are expensive, please avoid them" hint in the KVM CPUID
> leaf?
>

Good idea. :) Maybe more PV features could be dug up along these lines.

>>
>>>>>> In a more realistic test, we observe a 15% degradation in VM density
>>>>>> (measured as the number of VMs, each running Drupal CMS serving 2 http
>>>>>> requests per second to its main page, with 95th-percentile response
>>>>>> latency under 100 ms) with l3-cache=on.
>>>>>>
>>>>>> We think that the mostly-idle scenario is more common in cloud and
>>>>>> personal usage, and should be optimized for by default; users of highly
>>>>>> loaded VMs should be able to tune them up themselves.
>>>>>>
>>>>> There's one thing I don't understand in your test case: if you
>>>>> just found out that Linux will behave worse if it assumes that
>>>>> the VCPUs are sharing an L3 cache, why are you configuring an
>>>>> 8-core VCPU topology explicitly?
>>>>>
>>>>> Do you still see a difference in the numbers if you use "-smp 8"
>>>>> with no "cores" and "threads" options?
>>>>>
>>>> This is quite simple. A lot of software licenses are bound to the number
>>>> of CPU __sockets__. Thus it is mandatory in a lot of cases to set the
>>>> topology with 1 socket/xx cores to reduce the amount of money that has
>>>> to be paid for the software.
>>>
>>> In this case it looks like we're talking about the expected
>>> meaning of "cores=N". My first interpretation would be that the
>>> user obviously wants the guest to see the multiple cores sharing an
>>> L3 cache, because that's how real CPUs normally work. But I see
>>> why you have different expectations.
>>>
>>> Numbers on dedicated-pCPU scenarios would be helpful to guide the
>>> decision. I wouldn't like to cause a performance regression for
>>> users that fine-tuned the vCPU topology and set up CPU pinning.
>>>
>>
>>
>> --
>> Regards,
>> Longpeng(Mike)
>>
>

--
Regards,
Longpeng(Mike)
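
PS: regarding the glibc point above, a first quick check I plan to do is
simply to read back what glibc derives from CPUID inside the guest with
l3-cache=on vs. off. A minimal probe sketch (not a benchmark; the _SC_LEVEL*
sysconf names are glibc extensions):
'''
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* glibc fills these from CPUID, so they follow what QEMU exposes. */
        long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
        long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);

        printf("L2: %ld bytes, L3: %ld bytes\n", l2, l3);
        return 0;
}
'''
IIRC glibc also uses the detected shared cache size to choose the
non-temporal store threshold in its memcpy/memset implementations, which is
presumably where the "memory performance" part would show up; I'll try to
get real numbers for that as well.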