In some special scenarios, such as #vcpu <= #pcpu, the PLE handler may prove very costly: there is no need to iterate over vcpus and attempt yield_to calls that fail and only burn CPU.
Similarly, when we have a large number of small guests, a spinning vcpu may fail to yield_to any vcpu of the same VM and go back to spinning. This is also ineffective when we are overcommitted. Instead, we do a yield() so that other VMs get a chance to run.

This patch series tries to optimize the above scenarios. The first patch optimizes all yield_to calls by bailing out when there is no need to continue (i.e., when there is only one task in both the source and target rq). The second patch uses that in the PLE handler. The third patch uses overall system load knowledge to decide whether to continue in the yield_to handler, and also to yield in overcommit cases. To be precise,

 * loadavg is converted to a scale of 2048 per CPU
 * a load value of less than 1024 is considered undercommit, and we return from the PLE handler in those cases
 * a load value of greater than 3584 (1.75 * 2048) is considered overcommit, and we yield to other VMs in such cases

(let threshold = 2048)

Rationale for using threshold/2 for undercommit limit:
Requiring a load below (0.5 * threshold) avoids (the concern raised by Rik) scenarios where we still have a lock-holder-preempted vcpu waiting to be scheduled. (The scenario arises when rq length is > 1 even when we are undercommitted.)

Rationale for using (1.75 * threshold) for overcommit scenario:
This is a heuristic where we should probably see rq length > 1 and a vcpu of a different VM waiting to be scheduled.

Related future work (independent of this series):
 - Dynamically changing PLE window depending on system load.

Results on a 3.7.0-rc1 kernel show around 146% improvement for ebizzy 1x on a 32-core PLE machine with a 32-vcpu guest. I believe we should get very good improvements for overcommit (especially > 2) on large machines with small vcpu guests.
(Could not test this as I do not have access to a bigger machine)

base = 3.7.0-rc1
machine: 32 core mx3850 x5 PLE mc

--+-----------+-----------+-----------+------------+-----------+
               ebizzy (rec/sec, higher is better)
--+-----------+-----------+-----------+------------+-----------+
    base        stdev       patched     stdev       %improve
--+-----------+-----------+-----------+------------+-----------+
1x  2543.3750    20.2903    6279.3750    82.5226    146.89143
2x  2410.8750    96.4327    2450.7500   207.8136      1.65396
3x  2184.9167   205.5226    2178.3333    97.2034     -0.30131
--+-----------+-----------+-----------+------------+-----------+

--+-----------+-----------+-----------+------------+-----------+
               dbench (throughput in MB/sec, higher is better)
--+-----------+-----------+-----------+------------+-----------+
    base        stdev       patched     stdev       %improve
--+-----------+-----------+-----------+------------+-----------+
1x  5545.4330   596.4344    7042.8510  1012.0924     27.00272
2x  1993.0970    43.6548    1990.6200    75.7837     -0.12428
3x  1295.3867    22.3997    1315.5208    36.0075      1.55429
--+-----------+-----------+-----------+------------+-----------+

Changes since V1:
 - Discard the idea of exporting nrrunning and optimize in core scheduler (Peter)
 - Use yield() instead of schedule in overcommit scenarios (Rik)
 - Use loadavg knowledge to detect undercommit/overcommit

Peter Zijlstra (1):
  Bail out of yield_to when source and target runqueue has one task

Raghavendra K T (2):
  Handle yield_to failure return for potential undercommit case
  Check system load and handle different commit cases accordingly

Please let me know your comments and suggestions.
Link for V1: https://lkml.org/lkml/2012/9/21/168

 kernel/sched/core.c | 25 +++++++++++++++++++------
 virt/kvm/kvm_main.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 65 insertions(+), 16 deletions(-)
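
For readers who want the load thresholds from the cover letter in concrete form, here is a minimal sketch of the classification, not the code from the patches. It assumes the system load arrives as a fixed-point value where 2048 represents a load of 1.0, and that dividing by the CPU count gives the per-CPU figure. It also takes the overcommit limit as exactly 1.75 * threshold. All names (COMMIT_THRESHOLD, per_cpu_load, classify_load, enum commit_state) are hypothetical and chosen only for illustration.

```c
/* Illustrative sketch only: every name here is hypothetical, not from the patches. */
#define COMMIT_THRESHOLD   2048UL                        /* assumed: load of 1.0 per CPU */
#define UNDERCOMMIT_LIMIT  (COMMIT_THRESHOLD / 2)        /* 0.5  * threshold = 1024 */
#define OVERCOMMIT_LIMIT   ((COMMIT_THRESHOLD * 7) / 4)  /* 1.75 * threshold = 3584 */

enum commit_state { UNDERCOMMIT, NORMAL_COMMIT, OVERCOMMIT };

/* Scale a fixed-point system load (2048 == 1.0) down to a per-CPU value. */
static unsigned long per_cpu_load(unsigned long load_fixed, unsigned int ncpus)
{
    return ncpus ? load_fixed / ncpus : load_fixed;
}

/*
 * Classify the per-CPU load:
 *   below 0.5 * threshold   -> undercommit (the handler can return early)
 *   above 1.75 * threshold  -> overcommit  (yield so other VMs can run)
 *   otherwise               -> continue with the normal yield_to path
 */
static enum commit_state classify_load(unsigned long load)
{
    if (load < UNDERCOMMIT_LIMIT)
        return UNDERCOMMIT;
    if (load > OVERCOMMIT_LIMIT)
        return OVERCOMMIT;
    return NORMAL_COMMIT;
}
```

Under these assumptions, a 32-CPU host with a fixed-point load of 16 * 2048 has a per-CPU load of exactly 1024. That value sits on the undercommit boundary and is classified as the normal case, because the comparison is strict.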