Re: [PATCH v5 0/9] numa,sched,mm: pseudo-interleaving for automatic NUMA balancing
On 1/27/2014 2:03 PM, r...@redhat.com wrote:

The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being available to the workload.

In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria:
 1) keep private memory local to each thread
 2) avoid excessive NUMA migration of pages
 3) distribute shared memory across the active nodes, to maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload is actively running, and balances (somewhat lazily) the memory between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it could always benefit from more testing, on other systems.

Changes since v4:
 - remove some code that did not help performance
 - implement all the cleanups suggested by Mel Gorman
 - lots more testing, by Chegu Vinod and myself
 - rebase against -tip instead of -next, to make merging easier

Acked-by: Chegu Vinod

---

The following 1, 2, 4 & 8 socket-wide results on an 8-socket box are an average of 4 runs.

I) Eight 1-socket wide instances (10 warehouse threads/instance)

a) numactl pinning results
throughput = 350720 bops
throughput = 355250 bops
throughput = 350338 bops
throughput = 345963 bops
throughput = 344723 bops
throughput = 347838 bops
throughput = 347623 bops
throughput = 347963 bops

b) Automatic NUMA balancing results (Avg# page migrations : 10317611)
throughput = 319037 bops
throughput = 319612 bops
throughput = 314089 bops
throughput = 317499 bops
throughput = 320516 bops
throughput = 314905 bops
throughput = 315821 bops
throughput = 320575 bops

c) No Automatic NUMA balancing and NO-pinning results
throughput = 175433 bops
throughput = 179470 bops
throughput = 176262 bops
throughput = 162551 bops
throughput = 167874 bops
throughput = 173196 bops
throughput = 172001 bops
throughput = 174332 bops

---

II) Four 2-socket wide instances (20 warehouse threads/instance)

a) numactl pinning results
throughput = 611391 bops
throughput = 618464 bops
throughput = 612350 bops
throughput = 616826 bops

b) Automatic NUMA balancing results (Avg# page migrations : 8643581)
throughput = 523053 bops
throughput = 519375 bops
throughput = 502800 bops
throughput = 528880 bops

c) No Automatic NUMA balancing and NO-pinning results
throughput = 334807 bops
throughput = 330348 bops
throughput = 306250 bops
throughput = 309624 bops

---

III) Two 4-socket wide instances (40 warehouse threads/instance)

a) numactl pinning results
throughput = 946760 bops
throughput = 949712 bops

b) Automatic NUMA balancing results (Avg# page migrations : 5710932)
throughput = 861105 bops
throughput = 879878 bops

c) No Automatic NUMA balancing and NO-pinning results
throughput = 500527 bops
throughput = 450884 bops

---

IV) One 8-socket wide instance (80 warehouse threads/instance)

a) numactl pinning results
throughput = 1199211 bops

b) Automatic NUMA balancing results (Avg# page migrations : 3426618)
throughput = 1119524 bops

c) No Automatic NUMA balancing and NO-pinning results
throughput = 789243 bops

Thanks
Vinod

Changes since v3:
 - various code cleanups suggested by Mel Gorman (some in their own patches)
 - after some testing, switch back to the NUMA specific CPU use stats, since that results in a 1% performance increase for two 8-warehouse specjbb instances on a 4-node system, and reduced page migration across the board

Changes since v2:
 - dropped tracepoint (for now?)
 - implement obvious improvements suggested by Peter
 - use the scheduler maintained CPU use statistics, drop the NUMA specific ones for now. We can add those later if they turn out to be beneficial

Changes since v1:
 - fix divide by zero found by Chegu Vinod
 - improve comment, as suggested by Peter Zijlstra
 - do stats calculations in task_numa_placement in local variables

Some performance numbers, with two 40-warehouse specjbb instances on an 8 node system with 10 CPU cores per node, using a pre-cleanup version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt: throughput = 755900.20 SPECjbb2005 bops
spec2.txt: throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt: throughput = 706439.84 SPECjbb2005 bops
spec2.txt: throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt: throughput = 667988.47 SPECjbb2005 bops
spec2.txt: throughput = 638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt: throughput = 544120.97 SPECjbb2005 bops
spec2.txt: throughput = 453553.41 SPECjbb2005 bops
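To make the three criteria in the cover letter above a little more concrete, here is a minimal sketch of the kind of placement decision they imply. It is purely illustrative and is not code from this series: the numa_group_stats structure, the 1/4-of-the-busiest-node "active node" threshold, and both helper names are invented for this sketch.

#include <stdbool.h>

/*
 * Illustrative sketch of the placement policy implied by the three
 * criteria in the cover letter above -- not the code from this series.
 * The structure, threshold and helper names are all invented here.
 */
#define MAX_NODES 8

struct numa_group_stats {
	unsigned long faults[MAX_NODES];	/* NUMA hinting faults per node */
	unsigned long pages[MAX_NODES];		/* group pages on each node     */
	unsigned long total_pages;
};

/*
 * Criterion 3 needs a notion of "active nodes": treat a node as active
 * when it sees at least a quarter of the faults the busiest node sees,
 * so nodes the workload barely touches are excluded.
 */
static bool node_is_active(const struct numa_group_stats *gs, int nid)
{
	unsigned long max_faults = 0;

	for (int n = 0; n < MAX_NODES; n++)
		if (gs->faults[n] > max_faults)
			max_faults = gs->faults[n];

	return max_faults && gs->faults[nid] >= max_faults / 4;
}

/*
 * Lazy balancing of shared memory: on a hinting fault, move the page to
 * the faulting node only if that node is active and holds less than an
 * equal share of the group's pages.  Refusing the move otherwise is what
 * keeps migrations from becoming excessive (criterion 2); private pages
 * (criterion 1) would simply follow the thread that faults on them.
 */
static bool should_migrate_shared_page(const struct numa_group_stats *gs,
				       int dst_nid, int nr_active_nodes)
{
	if (nr_active_nodes == 0 || !node_is_active(gs, dst_nid))
		return false;

	return gs->pages[dst_nid] < gs->total_pages / nr_active_nodes;
}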
Re: [PATCH v2 0/7] pseudo-interleaving for automatic NUMA balancing
On 1/17/2014 1:12 PM, r...@redhat.com wrote:

The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being available to the workload.

In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria:
 1) keep private memory local to each thread
 2) avoid excessive NUMA migration of pages
 3) distribute shared memory across the active nodes, to maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload is actively running, and balances (somewhat lazily) the memory between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it could always benefit from more testing, on other systems.

Changes since v1:
 - fix divide by zero found by Chegu Vinod
 - improve comment, as suggested by Peter Zijlstra
 - do stats calculations in task_numa_placement in local variables

Some performance numbers, with two 40-warehouse specjbb instances on an 8 node system with 10 CPU cores per node, using a pre-cleanup version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt: throughput = 755900.20 SPECjbb2005 bops
spec2.txt: throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt: throughput = 706439.84 SPECjbb2005 bops
spec2.txt: throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt: throughput = 667988.47 SPECjbb2005 bops
spec2.txt: throughput = 638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt: throughput = 544120.97 SPECjbb2005 bops
spec2.txt: throughput = 453553.41 SPECjbb2005 bops

My own performance numbers are not as relevant, since I have been running with a more hostile workload on purpose, and I have run into a scheduler issue that caused the workload to run on only two of the four NUMA nodes on my test system...
Acked-by: Chegu Vinod

Here are some results using the v2 version of the patches on an 8 socket box using SPECjbb2005 as a workload:

I) Eight 1-socket wide instances (10 warehouse threads each):

                            Without patches    With patches
a) numactl pinning results
spec1.txt: throughput =       270620.04          273675.10
spec2.txt: throughput =       274115.33          272845.17
spec3.txt: throughput =       277830.09          272057.33
spec4.txt: throughput =       270898.52          270670.54
spec5.txt: throughput =       270397.30          270906.82
spec6.txt: throughput =       270451.93          268217.55
spec7.txt: throughput =       269511.07          269354.46
spec8.txt: throughput =       269386.06          270540.00

b) Automatic NUMA balancing results
spec1.txt: throughput =       244333.41          248072.72
spec2.txt: throughput =       252166.99          251818.30
spec3.txt: throughput =       251365.58          258266.24
spec4.txt: throughput =       245247.91          256873.51
spec5.txt: throughput =       245579.68          247743.18
spec6.txt: throughput =       249767.38          256285.86
spec7.txt: throughput =       244570.64          255343.99
spec8.txt: throughput =       245703.60          254434.36

c) No Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput =       132959.73          136957.12
spec2.txt: throughput =       127937.11          129326.23
spec3.txt: throughput =       130697.10          125772.11
spec4.txt: throughput =       134978.49          141607.58
spec5.txt: throughput =       127574.34          126748.18
spec6.txt: throughput =       138699.99          128597.95
spec7.txt: throughput =       133247.25          137344.57
spec8.txt: throughput =       124548.00          139040.98

--

II) Four 2-socket wide instances (20 warehouse threads each):

                            Without patches    With patches
a) numactl pinning results
spec1.txt: throughput =       479931.16          472467.58
spec2.txt: throughput =       466652.15          466237.10
spec3.txt: throughput
Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V8
Mel Gorman <mgorman@suse.de> writes:
>
> Weighing in at 63 patches makes the term "basic" in the series title a
> misnomer.
>
> This series still has roughly the same goals as previous versions. It
> reduces overhead of automatic balancing through scan rate reduction and the
> avoidance of TLB flushes. It selects a preferred node and moves tasks
> towards their memory as well as moving memory toward their task. It handles
> shared pages and groups related tasks together. Some problems such as
> shared page interleaving and properly dealing with processes that are
> larger than a node are being deferred.
>
> It is still based on 3.11 because that's what I was testing against. If
> we can agree this should be merged to -tip for testing I'll rebase to deal
> with any scheduler conflicts but for now I do not want to invalidate other
> peoples testing. The only obvious thing that is missing is hotplug
> handling. Peter is currently working on reducing [get|put]_online_cpus
> overhead so that it can be added to migrate_swap.
>
> Peter, some of your patches are missing signed-offs-by -- 4-5, 43 and 55.
>
> Changelog since V7
> o THP migration race and pagetable insertion fixes
> o Do not handle PMDs in batch
> o Shared page migration throttling
> o Various changes to how last nid/pid information is recorded
> o False pid match sanity checks when joining NUMA task groups
> o Adapt scan rate based on local/remote fault statistics
> o Periodic retry of migration to preferred node
> o Limit scope of system-wide search
> o Schedule threads on the same node as process that created them
> o Cleanup numa_group on exec
>
> Changelog since V6
> o Group tasks that share pages together
> o More scan avoidance of VMAs mapping pages that are not likely to migrate
> o cpunid conversion, system-wide searching of tasks to balance with
>
> Changelog since V6
> o Various TLB flush optimisations
> o Comment updates
> o Sanitise task_numa_fault callsites for consistent semantics
> o Revert some of the scanning adaption stuff
> o Revert patch that defers scanning until task schedules on another node
> o Start delayed scanning properly
> o Avoid the same task always performing the PTE scan
> o Continue PTE scanning even if migration is rate limited
>
> Changelog since V5
> o Add __GFP_NOWARN for numa hinting fault count
> o Use is_huge_zero_page
> o Favour moving tasks towards nodes with higher faults
> o Optionally resist moving tasks towards nodes with lower faults
> o Scan shared THP pages
>
> Changelog since V4
> o Added code that avoids overloading preferred nodes
> o Swap tasks if nodes are overloaded and the swap does not impair locality
>
> Changelog since V3
> o Correct detection of unset last nid/pid information
> o Dropped nr_preferred_running and replaced it with Peter's load balancing
> o Pass in correct node information for THP hinting faults
> o Pressure tasks sharing a THP page to move towards same node
> o Do not set pmd_numa if false sharing is detected
>
> Changelog since V2
> o Reshuffle to match Peter's implied preference for layout
> o Reshuffle to move private/shared split towards end of series to make it
>   easier to evaluate the impact
> o Use PID information to identify private accesses
> o Set the floor for PTE scanning based on virtual address space scan rates
>   instead of time
> o Some locking improvements
> o Do not preempt pinned tasks unless they are kernel threads
>
> Changelog since V1
> o Scan pages with elevated map count (shared pages)
> o Scale scan rates based on the vsz of the process so the sampling of the
>   task is independent of its size
> o Favour moving towards nodes with more faults even if it's not the
>   preferred node
> o Laughably basic accounting of a compute overloaded node when selecting
>   the preferred node.
> o Applied review comments
>
> This series integrates basic scheduler support for automatic NUMA
> balancing. It was initially based on Peter Zijlstra's work in "sched, numa,
> mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> Signed-off-bys. As before, if the relevant authors are ok with it I'll add
> Signed-off-bys (or add them yourselves if you pick the patches up). There
> has been a tonne of additional work from both Peter and Rik van Riel.
>
> Some reports indicate that the performance is getting close to manual
> bindings for some workloads but your mileage will vary. As before, the
> intention is not to complete the work but to incrementally improve mainline
> and preserve bisectability for any bug reports that crop up.
>
> Patch 1 is a monolithic dump of patches that are destined for upstream that
> this series indirectly depends upon.
>
> Patches 2-3 add sysctl documentation and comment fixlets
>
> Patch 4 avoids accounting for a hinting fault if another thread handled the
> fault in parallel
>
> Patches 5-6 avoid races with parallel THP migration and THP splits.
>
> Patch 7 corrects a THP NUMA
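One item worth expanding on from the V2 changelog above is "Use PID information to identify private accesses". The sketch below only illustrates that idea and is not the kernel's actual last nid/pid encoding or API; the structure and function names are invented here. The point is that a page remembers a few low-order pid bits of the task that last faulted on it, and a later hinting fault from a task with the same bits is treated as a private access.

#include <stdbool.h>

/*
 * Illustration of "Use PID information to identify private accesses"
 * from the changelog above -- not the kernel's actual encoding or API.
 * Each page remembers the node and a few low-order pid bits of the task
 * that last took a NUMA hinting fault on it.  If the next fault comes
 * from a task with the same pid bits, the page is most likely private to
 * that task; otherwise the page is treated as shared.
 */
#define LAST_PID_BITS 8
#define LAST_PID_MASK ((1u << LAST_PID_BITS) - 1)

struct page_numa_tag {
	unsigned int last_nid : 8;		/* node of the previous fault    */
	unsigned int last_pid : LAST_PID_BITS;	/* low pid bits of the last task */
};

/* Returns true if this fault looks like a private access. */
static bool numa_fault_is_private(struct page_numa_tag *tag,
				  int this_nid, unsigned int this_pid)
{
	bool same_task = (tag->last_pid == (this_pid & LAST_PID_MASK));

	/* Remember who touched the page, for classifying the next fault. */
	tag->last_nid = (unsigned int)this_nid;
	tag->last_pid = this_pid & LAST_PID_MASK;

	/*
	 * Only a handful of pid bits are stored, so unrelated tasks can
	 * collide -- which is why the changelog above also mentions
	 * "False pid match sanity checks when joining NUMA task groups".
	 */
	return same_task;
}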
Re: kvm_intel: Could not allocate 42 bytes percpu data
On 7/1/2013 10:49 PM, Rusty Russell wrote:
Chegu Vinod writes:
On 6/30/2013 11:22 PM, Rusty Russell wrote:
Chegu Vinod writes:

Hello,

Lots (~700+) of the following messages are showing up in the dmesg of a 3.10-rc1 based kernel (Host OS is running on a large socket count box with HT-on).

[ 82.270682] PERCPU: allocation failed, size=42 align=16, alloc from reserved chunk failed
[ 82.272633] kvm_intel: Could not allocate 42 bytes percpu data

Woah, weird... Oh. Shit. Um, this is embarrassing. Thanks, Rusty.

Thanks for your response!

===
module: do percpu allocation after uniqueness check. No, really!

v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting percpu memory on large machines:

    Now we have a new state MODULE_STATE_UNFORMED, we can insert the
    module into the list (and thus guarantee its uniqueness) before we
    allocate the per-cpu region.

In my defence, it didn't actually say the patch did this. Just that we "can". This patch actually *does* it.

Signed-off-by: Rusty Russell
Tested-by: Noone it seems.

Your following "updated" fix seems to be working fine on the larger socket count machine with HT-on.

OK, did you definitely revert every other workaround?

Yes, no other workarounds were there when your change was tested.

If so, please give me a Tested-by: line...

FYI, the actual verification of your change was done by my esteemed colleague Jim Hull (cc'd), who had access to this larger socket count box.

Tested-by: Jim Hull

Thanks
Vinod

Thanks, Rusty.
Re: kvm_intel: Could not allocate 42 bytes percpu data
On 6/30/2013 11:22 PM, Rusty Russell wrote:
Chegu Vinod writes:

Hello,

Lots (~700+) of the following messages are showing up in the dmesg of a 3.10-rc1 based kernel (Host OS is running on a large socket count box with HT-on).

[ 82.270682] PERCPU: allocation failed, size=42 align=16, alloc from reserved chunk failed
[ 82.272633] kvm_intel: Could not allocate 42 bytes percpu data

Woah, weird... Oh. Shit. Um, this is embarrassing. Thanks, Rusty.

Thanks for your response!

===
module: do percpu allocation after uniqueness check. No, really!

v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting percpu memory on large machines:

    Now we have a new state MODULE_STATE_UNFORMED, we can insert the
    module into the list (and thus guarantee its uniqueness) before we
    allocate the per-cpu region.

In my defence, it didn't actually say the patch did this. Just that we "can". This patch actually *does* it.

Signed-off-by: Rusty Russell
Tested-by: Noone it seems.

Your following "updated" fix seems to be working fine on the larger socket count machine with HT-on.

Thx
Vinod

diff --git a/kernel/module.c b/kernel/module.c
index cab4bce..fa53db8 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2927,7 +2927,6 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
 {
 	/* Module within temporary copy. */
 	struct module *mod;
-	Elf_Shdr *pcpusec;
 	int err;
 
 	mod = setup_load_info(info, flags);
@@ -2942,17 +2941,10 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
 	err = module_frob_arch_sections(info->hdr, info->sechdrs,
 					info->secstrings, mod);
 	if (err < 0)
-		goto out;
+		return ERR_PTR(err);
 
-	pcpusec = &info->sechdrs[info->index.pcpu];
-	if (pcpusec->sh_size) {
-		/* We have a special allocation for this section. */
-		err = percpu_modalloc(mod,
-				      pcpusec->sh_size, pcpusec->sh_addralign);
-		if (err)
-			goto out;
-		pcpusec->sh_flags &= ~(unsigned long)SHF_ALLOC;
-	}
+	/* We will do a special allocation for per-cpu sections later. */
+	info->sechdrs[info->index.pcpu].sh_flags &= ~(unsigned long)SHF_ALLOC;
 
 	/* Determine total sizes, and put offsets in sh_entsize.  For now
 	   this is done generically; there doesn't appear to be any
@@ -2963,17 +2955,22 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
 	/* Allocate and move to the final place */
 	err = move_module(mod, info);
 	if (err)
-		goto free_percpu;
+		return ERR_PTR(err);
 
 	/* Module has been copied to its final place now: return it. */
 	mod = (void *)info->sechdrs[info->index.mod].sh_addr;
 	kmemleak_load_module(mod, info);
 	return mod;
+}
 
-free_percpu:
-	percpu_modfree(mod);
-out:
-	return ERR_PTR(err);
+static int alloc_module_percpu(struct module *mod, struct load_info *info)
+{
+	Elf_Shdr *pcpusec = &info->sechdrs[info->index.pcpu];
+	if (!pcpusec->sh_size)
+		return 0;
+
+	/* We have a special allocation for this section. */
+	return percpu_modalloc(mod, pcpusec->sh_size, pcpusec->sh_addralign);
 }
 
 /* mod is no longer valid after this! */
@@ -3237,6 +3234,11 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	}
 #endif
 
+	/* To avoid stressing percpu allocator, do this once we're unique. */
+	err = alloc_module_percpu(mod, info);
+	if (err)
+		goto unlink_mod;
+
 	/* Now module is in final location, initialize linked lists, etc. */
 	err = module_unload_init(mod);
 	if (err)
kvm_intel: Could not allocate 42 bytes percpu data
Hello,

Lots (~700+) of the following messages are showing up in the dmesg of a 3.10-rc1 based kernel (Host OS is running on a large socket count box with HT-on).

[ 82.270682] PERCPU: allocation failed, size=42 align=16, alloc from reserved chunk failed
[ 82.272633] kvm_intel: Could not allocate 42 bytes percpu data

... also call traces like the following...

[ 101.852136] c901ad5aa090 88084675dd08 81633743 88084675ddc8
[ 101.860889] 81145053 81f3fa78 88084809dd40 8907d1cfd2e8
[ 101.869466] 8907d1cfd280 88087fffdb08 88084675c010 88084675dfd8
[ 101.878190] Call Trace:
[ 101.880953] [] dump_stack+0x19/0x1e
[ 101.886679] [] pcpu_alloc+0x9a3/0xa40
[ 101.892754] [] __alloc_reserved_percpu+0x13/0x20
[ 101.899733] [] load_module+0x35f/0x1a70
[ 101.905835] [] ? do_page_fault+0xe/0x10
[ 101.911953] [] SyS_init_module+0xfb/0x140
[ 101.918287] [] system_call_fastpath+0x16/0x1b
[ 101.924981] kvm_intel: Could not allocate 42 bytes percpu data

Wondering if anyone else has seen this with the recent [3.10] based kernels, esp. on larger boxes?

There was a similar issue that was reported earlier (where modules were being loaded per cpu without checking if an instance was already loaded/being-loaded). That issue seems to have been addressed in the recent past (e.g. https://lkml.org/lkml/2013/1/24/659 along with a couple of follow on cleanups).

Is the above yet another variant of the original issue, or perhaps some race condition that got exposed when there are a lot more threads?

Vinod
Re: [PATCH 3/2] vfio: Provide module option to disable vfio_iommu_type1 hugepage support
On 5/28/2013 9:27 AM, Alex Williamson wrote:

Add a module option to vfio_iommu_type1 to disable IOMMU hugepage support. This causes iommu_map to only be called with single page mappings, disabling the IOMMU driver's ability to use hugepages. This option can be enabled by loading vfio_iommu_type1 with disable_hugepages=1 or dynamically through sysfs. If enabled dynamically, only new mappings are restricted.

Signed-off-by: Alex Williamson
---

As suggested by Konrad. This is cleaner to add as a follow-on

 drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 6654a7e..8a2be4e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -48,6 +48,12 @@ module_param_named(allow_unsafe_interrupts,
 MODULE_PARM_DESC(allow_unsafe_interrupts,
 		 "Enable VFIO IOMMU support for on platforms without interrupt remapping support.");
 
+static bool disable_hugepages;
+module_param_named(disable_hugepages,
+		   disable_hugepages, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_hugepages,
+		 "Disable VFIO IOMMU support for IOMMU hugepages.");
+
 struct vfio_iommu {
 	struct iommu_domain	*domain;
 	struct mutex		lock;
@@ -270,6 +276,11 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 		return -ENOMEM;
 	}
 
+	if (unlikely(disable_hugepages)) {
+		vfio_lock_acct(1);
+		return 1;
+	}
+
 	/* Lock all the consecutive pages from pfn_base */
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;

Tested-by: Chegu Vinod

I was able to verify your changes on a 2 Sandybridge-EP socket platform and observed about ~7-8% improvement in the netperf's TCP_RR performance. The guest size was small (16vcpu/32GB).

Hopefully these changes also have an indirect benefit of avoiding soft lockups on the host side when larger guests (> 256GB) are rebooted. Someone who has ready access to a larger Sandybridge-EP/EX platform can verify this.

FYI
Vinod
[PATCH] KVM: x86: Increase the "hard" max VCPU limit
KVM guests today use 8bit APIC ids, allowing for 256 IDs. Reserving one ID for broadcast interrupts should leave 255 IDs. In the case of KVM there is no need to reserve another ID for the IO-APIC, so the hard max limit for VCPUs can be increased from 254 to 255. (This was confirmed by Gleb Natapov: http://article.gmane.org/gmane.comp.emulators.kvm.devel/99713)

Signed-off-by: Chegu Vinod
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4979778..bc57bfa 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -31,7 +31,7 @@
 #include
 #include
 
-#define KVM_MAX_VCPUS 254
+#define KVM_MAX_VCPUS 255
 #define KVM_SOFT_MAX_VCPUS 160
 #define KVM_USER_MEM_SLOTS 125
 /* memory slots that are not exposed to userspace */
--
1.7.1
Re: Preemptable Ticket Spinlock
On 4/22/2013 1:50 PM, Jiannan Ouyang wrote:
On Mon, Apr 22, 2013 at 4:44 PM, Peter Zijlstra wrote:
On Mon, 2013-04-22 at 16:32 -0400, Rik van Riel wrote:

IIRC one of the reasons was that the performance improvement wasn't as obvious. Rescheduling VCPUs takes a fair amount of time, quite probably more than the typical hold time of a spinlock.

IIRC it would spin for a while before blocking... /me goes re-read some of that thread...

Ah, it's because PLE is curing most of it. !PLE it had huge gains, but apparently nobody cares about !PLE hardware anymore :-)

For now, I don't know how good it can work with PLE. But I think it should save the time of VMEXIT on PLE machine.

Thanks for sharing your patch. 'am waiting for your v2 patch(es) and will then let you know of any review feedback. Hoping to verify your changes on a large box (PLE enabled) and get back to you with some data...

Thanks
Vinod
Re: [PATCH RFC 2/2] kvm: Iterate over only vcpus that are preempted
On 3/4/2013 10:02 AM, Raghavendra K T wrote:

From: Raghavendra K T

This helps in filtering out the eligible candidates further and thus potentially helps in quickly allowing preempted lockholders to run. Note that if a vcpu was spinning during preemption we filter them by checking whether they are preempted due to pause loop exit.

Signed-off-by: Raghavendra K T
---
 virt/kvm/kvm_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 83a804c..60114e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1790,6 +1790,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			} else if (pass && i > last_boosted_vcpu)
 				break;
+			if (!ACCESS_ONCE(vcpu->preempted))
+				continue;
 			if (vcpu == me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))

Reviewed-by: Chegu Vinod
Re: [PATCH RFC 1/2] kvm: Record the preemption status of vcpus using preempt notifiers
On 3/4/2013 10:02 AM, Raghavendra K T wrote:

From: Raghavendra K T

Note that we mark as preempted only when vcpu's task state was Running during preemption.

Thanks Jiannan, Avi for preemption notifier ideas. Thanks Gleb, PeterZ for their precious suggestions. Thanks Srikar for an idea on avoiding rcu lock while checking task state that improved overcommit numbers.

Signed-off-by: Raghavendra K T
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cad77fe..0b31e1c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -252,6 +252,7 @@ struct kvm_vcpu {
 		bool dy_eligible;
 	} spin_loop;
 #endif
+	bool preempted;
 	struct kvm_vcpu_arch arch;
 };
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index adc68fe..83a804c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -244,6 +244,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 
 	kvm_vcpu_set_in_spin_loop(vcpu, false);
 	kvm_vcpu_set_dy_eligible(vcpu, false);
+	vcpu->preempted = false;
 
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
@@ -2902,6 +2903,8 @@ struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
 static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+	if (vcpu->preempted)
+		vcpu->preempted = false;
 
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
@@ -2911,6 +2914,8 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+	if (current->state == TASK_RUNNING)
+		vcpu->preempted = true;
 	kvm_arch_vcpu_put(vcpu);
 }

Reviewed-by: Chegu Vinod
Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On 1/25/2013 11:05 AM, Rik van Riel wrote:

Many spinlocks are embedded in data structures; having many CPUs pounce on the cache line the lock is in will slow down the lock holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference: http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of there not being contention. After all, if there is contention, the data structure can be improved to reduce or eliminate lock contention. Likewise, the spinlock API should remain simple, and the common case of the lock not being contended should remain as fast as ever.

However, since spinlock contention should be fairly uncommon, we can add functionality into the spinlock slow path that keeps system performance from falling off a cliff when there is lock contention.

Proportional delay in ticket locks is delaying the time between checking the ticket based on a delay factor, and the number of CPUs ahead of us in the queue for this lock. Checking the lock less often allows the lock holder to continue running, resulting in better throughput and preventing performance from dropping off a cliff.

The test case has a number of threads locking and unlocking a semaphore. With just one thread, everything sits in the CPU cache and throughput is around 2.6 million operations per second, with a 5-10% variation. Once a second thread gets involved, data structures bounce from CPU to CPU, and performance deteriorates to about 1.25 million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix, performance with the vanilla kernel continues to deteriorate. Once I hit 24 threads, on a 24 CPU, 4 node test system, performance is down to about 290k operations/second.

With a proportional backoff delay added to the spinlock code, performance with 24 threads goes up to about 400k operations/second with a 50x delay, and about 900k operations/second with a 250x delay. However, with a 250x delay, performance with 2-5 threads is worse than with a 50x delay.

Making the code auto-tune the delay factor results in a system that performs well with both light and heavy lock contention, and should also protect against the (likely) case of the fixed delay factor being wrong for other hardware.

The attached graph shows the performance of the multi threaded semaphore lock/unlock test case, with 1-24 threads, on the vanilla kernel, with 10x, 50x, and 250x proportional delay, as well as the v1 patch series with autotuning for 2x and 2.7x spinning before the lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse and Eric Dumazet, which should result in better throughput and nicer behaviour in situations with contention on multiple locks.

For the v3 series, I tried out all the ideas suggested by Michel. They made perfect sense, but in the end it turned out they did not work as well as the simple, aggressive "try to make the delay longer" policy I have now. Several small bug fixes and cleanups have been integrated.

For the v4 series, I added code to keep the maximum spinlock delay to a small value when running in a virtual machine. That should solve the performance regression seen in virtual machines.

The performance issue observed with AIM7 is still a mystery.

Performance is within the margin of error of v2, so the graph has not been updated.

Please let me know if you manage to break this code in any way, so I can fix it...
I got back on the machine and re-ran the AIM7-highsystime microbenchmark with 2000 users and 100 jobs per user on a 20, 40, 80 vcpu guest using a 3.7.5 kernel with and without Rik's latest patches.

Host Platform: 8 socket (80 Core) Westmere with 1TB RAM.

Config 1: 3.7.5 base running on host and in the guest
Config 2: 3.7.5 + Rik's patches running on host and in the guest

(Note: I didn't change the PLE settings on the host... The severe drop at 40 way and 80 way is due to the un-optimized PLE handler. Raghu's PLE fixes should address those.)

            Config 1    Config 2
20vcpu  -    170K        168K
40vcpu  -     37K         37K
80vcpu  -     10.5K       11.5K

Not much difference between the two configs... (need to test it along with Raghu's fixes).

BTW, I noticed that there were results posted using the AIM7-compute workload earlier. The AIM7-highsystime is a lot more kernel intensive.

FYI
Vinod
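For reference, the proportional backoff described in the quoted cover letter above boils down to a small change in the ticket lock slow path. The following is a userspace C sketch of that idea only, not the x86 code from the series; the 50/1000 delay values and the simple "keep lengthening the delay" rule are placeholders (the real series tunes the delay per CPU and caps it when running inside a virtual machine).

#include <stdatomic.h>

/*
 * Userspace sketch of a ticket spinlock with proportional backoff -- an
 * illustration of the idea described in the cover letter above, not the
 * actual patch.  A waiter spins roughly (delay * waiters_ahead) iterations
 * between reads of the lock word, so waiters far back in the queue poke
 * the contended cache line less often.
 */
struct ticket_lock {
	atomic_ushort head;		/* ticket currently being served */
	atomic_ushort tail;		/* next ticket to hand out       */
};

static unsigned int spin_delay = 50;	/* per-CPU in the real series */

static void ticket_lock(struct ticket_lock *lock)
{
	unsigned short ticket =
		atomic_fetch_add_explicit(&lock->tail, 1, memory_order_relaxed);

	for (;;) {
		unsigned short head =
			atomic_load_explicit(&lock->head, memory_order_acquire);
		unsigned short ahead = (unsigned short)(ticket - head);

		if (ahead == 0)
			return;			/* our turn: lock acquired */

		/* Back off in proportion to our place in the queue. */
		for (volatile unsigned long i = 0;
		     i < (unsigned long)spin_delay * ahead; i++)
			;			/* cpu_relax() in kernel code */

		/* Still waiting: lengthen the delay (capped). */
		if (spin_delay < 1000)
			spin_delay++;
	}
}

static void ticket_unlock(struct ticket_lock *lock)
{
	atomic_fetch_add_explicit(&lock->head, 1, memory_order_release);
}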
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On 1/8/2013 2:26 PM, Rik van Riel wrote:

<...>

Performance is within the margin of error of v2, so the graph has not been updated.

Please let me know if you manage to break this code in any way, so I can fix it...

Attached below is some preliminary data with one of the AIM7 micro-benchmark workloads (i.e. high_systime). This is a kernel intensive workload which does tons of forks/execs etc. and stresses quite a few of the same set of spinlocks and semaphores.

Observed a drop in performance as we go to 40way and 80way. Wondering if the back off keeps increasing to such an extent that it actually starts to hurt, given the nature of this workload? Also in the case of 80way observed quite a bit of variation from run to run...

Also ran it inside a single KVM guest. There were some perf. dips but interestingly didn't observe the same level of drop (compared to the drop in the native case) as the guest size was scaled up to 40vcpu or 80vcpu.

FYI
Vinod

---

Platform: 8 socket (80 Core) Westmere with 1TB RAM.
Workload: AIM7-highsystime microbenchmark - 2000 users & 100 jobs per user.
Values reported are Jobs Per Minute (higher is better). The values are an average of 3 runs.

1) Native run:
--------------
Config 1: 3.7 kernel
Config 2: 3.7 + Rik's 1-4 patches

            20way    40way    80way
Config 1    ~179K    ~159K    ~146K
Config 2    ~180K    ~134K    ~21K-43K   <- high variation!

(Note: Used numactl to restrict workload to 2 sockets (20way) and 4 sockets (40way))

2) KVM run:
-----------
Single guest of different sizes (no overcommit, NUMA enabled in the guest).

Note: This kernel intensive micro benchmark exposes the PLE handler issue esp. for large guests. Since Raghu's PLE changes are not yet in upstream, 'have just run with the current PLE handler & then by disabling PLE (ple_gap=0).

Config 1: Host & Guest at 3.7
Config 2: Host & Guest at 3.7 + Rik's 1-4 patches

            20vcpu/128G      40vcpu/256G      80vcpu/512G
            (on 2 sockets)   (on 4 sockets)   (on 8 sockets)
Config 1    ~144K            ~39K             ~10K
Config 2    ~143K            ~37.5K           ~11K

Config 3: Host & Guest at 3.7 AND ple_gap=0
Config 4: Host & Guest at 3.7 + Rik's 1-4 patches AND ple_gap=0

Config 3    ~154K            ~131K            ~116K
Config 4    ~151K            ~130K            ~115K

(Note: Used numactl to restrict qemu to 2 sockets (20way) and 4 sockets (40way))
Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
On 11/28/2012 5:09 PM, Chegu Vinod wrote:
On 11/27/2012 6:23 AM, Chegu Vinod wrote:
On 11/27/2012 2:30 AM, Raghavendra K T wrote:
On 11/26/2012 07:05 PM, Andrew Jones wrote:
On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:

From: Peter Zijlstra

In case of undercomitted scenarios, especially in large guests, yield_to overhead is significantly high. When run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler.

(History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core.

Signed-off-by: Peter Zijlstra
Raghavendra: Checking the rq length of target vcpu condition added. (thanks Avi)
Reviewed-by: Srikar Dronamraju
Signed-off-by: Raghavendra K T
---
 kernel/sched/core.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+	/*
+	 * If we're the only runnable task on the rq and target rq also
+	 * has only one task, there's absolutely no point in yielding.
+	 */
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4329,11 +4341,12 @@ again:
 		resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;

Acked-by: Andrew Jones

Thank you Drew.

Marcelo, Gleb.. Please let me know if you have comments / concerns on the patches..

Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios, especially for large guests where we do have the overhead of vcpu iteration in the ple handler..

Thanks Raghu. Will try to get this latest patch set evaluated and get back to you.

Vinod

< Resending as prev. email to the kvm and lkml email aliases bounced twice... Apologies for any repeats! >

Hi Raghu,

Here is some preliminary data with your latest set of PLE patches (& also with Andrew's throttled yield_to() change).

Ran a single guest on an 80 core Westmere platform. [Note: Host and Guest had the latest kernel from kvm.git and also used the latest qemu from qemu.git as of yesterday morning].

The guest was running an AIM7 high_systime workload. (Note: high_systime is a kernel intensive micro-benchmark, but in this case it was run just as a workload in the guest to trigger spinlock etc. contention in the guest OS and hence PLE, i.e. this is not a real benchmark run.) 'have run this workload with a constant # (i.e. 2000) of users with 100 jobs per user. The numbers below represent the # of jobs per minute (JPM) - higher the better.

                                              40VCPU    60VCPU    80VCPU
a) 3.7.0-rc6+ w/ ple_gap=0                    ~102K     ~88K      ~81K
b) 3.7.0-rc6+                                 ~53K      ~25K      ~18-20K
c) 3.7.0-rc6+ w/ PLE patches                  ~100K     ~81K      ~48K-69K  <- lot of variation from run to run
d) 3.7.0-rc6+ w/ throttled yield_to() change  ~101K     ~87K      ~78K

---

The PLE patches case (c) does show improvements in this non-overcommit large guest case when compared to case (b). However at 80way started to observe quite a bit of variation from run to run, and the JPM was lower when compared with the throttled yield_to() change case (d).

For this 80way in case (c) also noticed that the average time spent in the PLE exit (as reported by a small sampling from perf kvm stat) was varying quite a bit and was at times much greater when compared with the case of the throttled yield_to() change
Re: [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios
On 11/26/2012 4:07 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, the PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to burning CPU.

The first patch optimizes all the yield_to by bailing out when there is no need to continue in yield_to (i.e., when there is only one task in source and target rq).

Second patch uses that in the PLE handler. Further, when a yield_to fails we do not immediately go out of the PLE handler; instead we try thrice to have better statistical possibility of false return. Otherwise that would affect moderate overcommit cases.

Result on 3.7.0-rc6 kernel shows around 140% improvement for ebizzy 1x and around 51% for dbench 1x with 32 core PLE machine with 32 vcpu guest.

base = 3.7.0-rc6
machine: 32 core mx3850 x5 PLE mc

        ebizzy (rec/sec, higher is better)
      base        stdev     patched     stdev     %improve
1x    2511.3000    21.5409  6051.8000   170.2592  140.98276
2x    2679.4000   332.4482  2692.3000   251.4005    0.48145
3x    2253.5000   266.4243  2192.1667   178.9753   -2.72169
4x    1784.3750   102.2699  2018.7500   187.5723   13.13485

        dbench (throughput in MB/sec, higher is better)
      base        stdev      patched      stdev      %improve
1x    6677.4080    638.5048  10098.0060   3449.7026   51.22643
2x    2012.6760     64.7642   2019.0440     62.6702    0.31639
3x    1302.0783     40.8336   1292.7517     27.0515   -0.71629
4x    3043.1725   3243.7281   4664.4662   5946.5741   53.27643

Here is the reference of the no-PLE result:
ebizzy-1x_nople 7592.6000 rec/sec
dbench_1x_nople 7853.6960 MB/sec

The result says we can still improve by 60% for ebizzy, but overall we are getting impressive performance with the patches.

Changes Since V2:
 - Dropped global measures usage patch (Peter Zijlstra)
 - Do not bail out on first failure (Avi Kivity)
 - Try thrice for the failure of yield_to to get statistically more correct behaviour.

Changes since V1:
 - Discard the idea of exporting nr_running and optimize in core scheduler (Peter)
 - Use yield() instead of schedule in overcommit scenarios (Rik)
 - Use loadavg knowledge to detect undercommit/overcommit

Peter Zijlstra (1):
  Bail out of yield_to when source and target runqueue has one task

Raghavendra K T (1):
  Handle yield_to failure return for potential undercommit case

Please let me know your comments and suggestions.

Link for V2: https://lkml.org/lkml/2012/10/29/287
Link for V1: https://lkml.org/lkml/2012/9/21/168

 kernel/sched/core.c | 25 +++--
 virt/kvm/kvm_main.c | 26 --
 2 files changed, 35 insertions(+), 16 deletions(-)

Tested-by: Chegu Vinod
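The second patch is not quoted in this thread; as a rough control-flow illustration of the "try thrice" behaviour described in the cover letter above (not the actual kvm_vcpu_on_spin() change), the idea is along the following lines. pick_candidate() and try_yield_to() are stand-ins invented here for the existing candidate iteration and kvm_vcpu_yield_to(), with the latter assumed to return >0 on a successful boost, 0 on failure, and -ESRCH when both runqueues held a single task (the new bail-out added by patch 1).

#include <errno.h>
#include <stddef.h>

/*
 * Control-flow sketch only -- not the actual kvm_vcpu_on_spin() change.
 * A single -ESRCH could be a transient false reading, so the handler only
 * gives up after a few of them, as the cover letter above describes.
 */
#define PLE_ESRCH_TRIES 3

struct vcpu;				/* opaque for this sketch */

extern struct vcpu *pick_candidate(struct vcpu *me);	/* assumed helper */
extern int try_yield_to(struct vcpu *target);		/* assumed helper */

static void ple_handler_try_boost(struct vcpu *me)
{
	int esrch_count = 0;
	struct vcpu *candidate;

	while ((candidate = pick_candidate(me)) != NULL) {
		int yielded = try_yield_to(candidate);

		if (yielded > 0)
			return;	/* boosted a preempted lock holder: done */

		/*
		 * -ESRCH a few times in a row suggests the undercommitted
		 * case: stop searching and let the vcpu resume spinning
		 * in the guest instead of burning time here.
		 */
		if (yielded == -ESRCH && ++esrch_count >= PLE_ESRCH_TRIES)
			return;

		/* yielded == 0: this candidate could not be boosted, move on */
	}
}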
Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
On 11/27/2012 2:30 AM, Raghavendra K T wrote:
On 11/26/2012 07:05 PM, Andrew Jones wrote:
On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:

From: Peter Zijlstra

In case of undercomitted scenarios, especially in large guests, yield_to overhead is significantly high. When run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler.

(History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core.

Signed-off-by: Peter Zijlstra
Raghavendra: Checking the rq length of target vcpu condition added. (thanks Avi)
Reviewed-by: Srikar Dronamraju
Signed-off-by: Raghavendra K T
---
 kernel/sched/core.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+	/*
+	 * If we're the only runnable task on the rq and target rq also
+	 * has only one task, there's absolutely no point in yielding.
+	 */
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4329,11 +4341,12 @@ again:
 		resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;

Acked-by: Andrew Jones

Thank you Drew.

Marcelo, Gleb.. Please let me know if you have comments / concerns on the patches..

Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios, especially for large guests where we do have the overhead of vcpu iteration in the ple handler..

Thanks Raghu. Will try to get this latest patch set evaluated and get back to you.

Vinod
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 9/21/2012 4:59 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, the PLE handler may prove very costly,

Yes.

because there is no need to iterate over vcpus and do unsuccessful yield_to burning CPU.

An idea to solve this is:
1) As Avi had proposed, we can modify hardware ple_window dynamically to avoid frequent PL-exit.

Yes. We had to do this to get around some scaling issues for large (>20way) guests (with no overcommitment). As part of some experimentation we even tried "switching off" PLE too :(

(IMHO, it is difficult to decide when we have mixed types of VMs).

Agree.

Not sure if the following alternatives have also been looked at:

- Could the behavior associated with the "ple_window" be modified to be a function of some [new] per-guest attribute (which can be conveyed to the host as part of the guest launch sequence)? The user can choose to set this [new] attribute for a given guest. This would help avoid the frequent exits due to PLE (as Avi had mentioned earlier)?

- Can the PLE feature (in VT) be "enhanced" to be made a per guest attribute?

IMHO, the approach of not taking a frequent exit is better than taking an exit and returning back from the handler etc.

Thanks
Vinod

Another idea, proposed in the first patch, is to identify the non-overcommit case and just return from the PLE handler.

There are many ways to identify a non-overcommit scenario:
1) Using loadavg etc (get_avenrun/calc_global_load/this_cpu_load)
2) Explicitly check nr_running()/num_online_cpus()
3) Check source vcpu runqueue length.

Not sure how we can make use of (1) effectively/how to use it. (2) has significant overhead since it iterates over all cpus. So this patch uses the third method. (I feel it is uglier to export runqueue length, but am expecting suggestions on this.)

In the second patch, when we have a large number of small guests, it is possible that a spinning vcpu fails to yield_to any vcpu of the same VM and goes back and spins. This is also not effective when we are over-committed. Instead, we do a schedule() so that we give a chance to other VMs to run.

Raghavendra K T (2):
  Handle undercommitted guest case in PLE handler
  Be courteous to other VMs in overcommitted scenario in PLE handler

Results:
base = 3.6.0-rc5 + ple handler optimization patches from kvm tree.
patched = base + patch1 + patch2
machine: x240 with 16 core with HT enabled (32 cpu thread).
32 vcpu guest with 8GB RAM.

        ebizzy (record/sec, higher is better)
      base         stddev    patched      stdev     %improve
      11293.3750   624.4378  18209.6250   371.7061  61.24166
       3641.8750   468.9400   3725.5000   253.7823   2.29621

        kernbench (time in sec, lower is better)
      base       stddev    patched    stdev     %improve
      30.6020    1.3018    30.8287    1.1517    -0.74080
      64.0825    2.3764    63.4721    5.0191     0.95252
      95.8638    8.7030    94.5988    8.3832     1.31958

Note: on mx3850x5 machine with 32 cores HT disabled I got around ebizzy 209% and kernbench 6% improvement for the 1x scenario.

Thanks Srikar for his active participation in discussing ideas and reviewing the patch.

Please let me know your suggestions and comments.

---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 6 ++++++
 virt/kvm/kvm_main.c   | 7 +++++++
 3 files changed, 14 insertions(+), 0 deletions(-)
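Neither of the two RFC patches listed above is quoted in this message. As a rough illustration of the behaviour the cover letter describes (not the actual kvm_main.c change), the PLE handler would end up doing roughly the following, where vcpu_runqueue_is_idle(), boost_some_vcpu() and yield_cpu() are stand-ins invented for this sketch:

#include <stdbool.h>

/*
 * Rough illustration of the two-patch behaviour described in the cover
 * letter above -- not the actual kvm_main.c change.  vcpu_runqueue_is_idle()
 * stands in for however the source vcpu's runqueue length ends up being
 * checked (option 3 in the list above), boost_some_vcpu() for the existing
 * yield_to candidate iteration, and yield_cpu() for schedule()/yield().
 */
struct vcpu;					/* opaque for this sketch */

extern bool vcpu_runqueue_is_idle(struct vcpu *me);	/* rq length == 1? */
extern bool boost_some_vcpu(struct vcpu *me);		/* existing yield_to loop */
extern void yield_cpu(void);				/* give the pcpu away */

static void ple_handler_sketch(struct vcpu *me)
{
	/*
	 * Patch 1: undercommitted -- nobody else wants this pcpu, so just
	 * go back to the guest instead of hunting for a yield target.
	 */
	if (vcpu_runqueue_is_idle(me))
		return;

	/*
	 * Patch 2: overcommitted, but no suitable vcpu of this VM could be
	 * boosted -- be courteous and let other VMs run instead of spinning.
	 */
	if (!boost_some_vcpu(me))
		yield_cpu();
}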