Re: [PATCH v5 0/9] numa,sched,mm: pseudo-interleaving for automatic NUMA balancing

2014-01-27 Thread Chegu Vinod

On 1/27/2014 2:03 PM, r...@redhat.com wrote:

The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.
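
A minimal, self-contained sketch of the pseudo-interleaving policy described
above. The structure, the names, and the 3/4 fault threshold are illustrative
assumptions, not the actual kernel code: private pages follow the faulting
thread, while shared pages are spread lazily across only the active nodes.

#include <stdbool.h>

#define MAX_NODES 8

struct workload_placement {
	unsigned long active_nodes;		/* bitmask of nodes the workload runs on */
	unsigned long faults[MAX_NODES];	/* NUMA hinting faults seen per node */
};

/* Decide whether a page that just faulted from dst_nid should move there. */
static bool should_migrate_page(const struct workload_placement *wp,
				int src_nid, int dst_nid, bool page_is_private)
{
	/* 1) keep private memory local to the faulting thread */
	if (page_is_private)
		return true;

	/* 3) only spread shared memory across the active nodes */
	if (!(wp->active_nodes & (1UL << dst_nid)))
		return false;

	/*
	 * 2) migrate lazily: move a shared page only when the destination
	 * node sees clearly more faults than the node the page is on now.
	 */
	return 3 * wp->faults[dst_nid] > 4 * wp->faults[src_nid];
}

With a check along these lines, shared memory ends up roughly interleaved
across the nodes the workload actually runs on, which is what maximizes the
memory bandwidth available to it.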

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.

Changes since v4:
  - remove some code that did not help performance
  - implement all the cleanups suggested by Mel Gorman
  - lots more testing, by Chegu Vinod and myself
  - rebase against -tip instead of -next, to make merging easier


Acked-by:  Chegu Vinod 

---

The following 1-, 2-, 4- and 8-socket-wide results on an 8-socket box are an 
average of 4 runs.



I) Eight 1-socket wide instances (10 warehouse threads/instance)

a) numactl pinning results
throughput = 350720  bops
throughput = 355250  bops
throughput = 350338  bops
throughput = 345963  bops
throughput = 344723  bops
throughput = 347838  bops
throughput = 347623  bops
throughput = 347963  bops

b) Automatic NUMA balancing results
  (Avg# page migrations : 10317611)
throughput = 319037  bops
throughput = 319612  bops
throughput = 314089  bops
throughput = 317499  bops
throughput = 320516  bops
throughput = 314905  bops
throughput = 315821  bops
throughput = 320575  bops

c) No Automatic NUMA balancing and NO-pinning results
throughput = 175433  bops
throughput = 179470  bops
throughput = 176262  bops
throughput = 162551  bops
throughput = 167874  bops
throughput = 173196  bops
throughput = 172001  bops
throughput = 174332  bops

---

II) Four 2-socket wide instances (20 warehouse threads/instance)

a) numactl pinning results
throughput = 611391  bops
throughput = 618464  bops
throughput = 612350  bops
throughput = 616826  bops

b) Automatic NUMA balancing results
  (Avg# page migrations : 8643581)
throughput = 523053  bops
throughput = 519375  bops
throughput = 502800  bops
throughput = 528880  bops

c) No Automatic NUMA balancing and NO-pinning results
throughput = 334807  bops
throughput = 330348  bops
throughput = 306250  bops
throughput = 309624  bops

---

III) Two 4-socket wide instances (40 warehouse threads/instance)

a) numactl pinning results
throughput = 946760  bops
throughput = 949712  bops

b) Automatic NUMA balancing results
  (Avg# page migrations : 5710932)
throughput = 861105  bops
throughput = 879878  bops

c) No Automatic NUMA balancing and NO-pinning results
throughput = 500527  bops
throughput = 450884  bops

---

IV) One 8-socket wide instance (80 warehouse threads/instance)

a) numactl pinning results
throughput = 1199211  bops

b) Automatic NUMA balancing results
  (Avg# page migrations : 3426618)
throughput = 1119524  bops

c) No Automatic NUMA balancing and NO-pinning results
throughput = 789243  bops


Thanks
Vinod

Changes since v3:
  - various code cleanups suggested by Mel Gorman (some in their own patches)
  - after some testing, switch back to the NUMA specific CPU use stats,
since that results in a 1% performance increase for two 8-warehouse
specjbb instances on a 4-node system, and reduced page migration across
the board
Changes since v2:
  - dropped tracepoint (for now?)
  - implement obvious improvements suggested by Peter
  - use the scheduler maintained CPU use statistics, drop
the NUMA specific ones for now. We can add those later
if they turn out to be beneficial
Changes since v1:
  - fix divide by zero found by Chegu Vinod
  - improve comment, as suggested by Peter Zijlstra
  - do stats calculations in task_numa_placement in local variables


Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt:   throughput = 755900.20 SPECjbb2005 bops
spec2.txt:   throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt:   throughput = 706439.84 SPECjbb2005 bops
spec2.txt:   throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)

Re: [PATCH v2 0/7] pseudo-interleaving for automatic NUMA balancing

2014-01-18 Thread Chegu Vinod

On 1/17/2014 1:12 PM, r...@redhat.com wrote:

The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.

Changes since v1:
  - fix divide by zero found by Chegu Vinod
  - improve comment, as suggested by Peter Zijlstra
  - do stats calculations in task_numa_placement in local variables


Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt:   throughput = 755900.20 SPECjbb2005 bops
spec2.txt:   throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt:   throughput = 706439.84 SPECjbb2005 bops
spec2.txt:   throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt:   throughput = 667988.47 SPECjbb2005 bops
spec2.txt:   throughput = 638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt:   throughput = 544120.97 SPECjbb2005 bops
spec2.txt:   throughput = 453553.41 SPECjbb2005 bops


My own performance numbers are not as relevant, since I have been
running with a more hostile workload on purpose, and I have run
into a scheduler issue that caused the workload to run on only
two of the four NUMA nodes on my test system...





Acked-by:  Chegu Vinod 



Here are some results using the v2 version of the patches
on an 8 socket box using SPECjbb2005 as a workload :

I) Eight 1-socket wide instances (10 warehouse threads each):

                             Without patches   With patches

a) numactl pinning results
spec1.txt:   throughput = 270620.04 273675.10
spec2.txt:   throughput = 274115.33 272845.17
spec3.txt:   throughput = 277830.09 272057.33
spec4.txt:   throughput = 270898.52 270670.54
spec5.txt:   throughput = 270397.30 270906.82
spec6.txt:   throughput = 270451.93 268217.55
spec7.txt:   throughput = 269511.07 269354.46
spec8.txt:   throughput = 269386.06 270540.00

b) Automatic NUMA balancing results
spec1.txt:   throughput = 244333.41 248072.72
spec2.txt:   throughput = 252166.99 251818.30
spec3.txt:   throughput = 251365.58 258266.24
spec4.txt:   throughput = 245247.91 256873.51
spec5.txt:   throughput = 245579.68 247743.18
spec6.txt:   throughput = 249767.38 256285.86
spec7.txt:   throughput = 244570.64 255343.99
spec8.txt:   throughput = 245703.60 254434.36

c) NO Automatic NUMA balancing and NO-pinning results
spec1.txt:   throughput = 132959.73 136957.12
spec2.txt:   throughput = 127937.11 129326.23
spec3.txt:   throughput = 130697.10 125772.11
spec4.txt:   throughput = 134978.49 141607.58
spec5.txt:   throughput = 127574.34 126748.18
spec6.txt:   throughput = 138699.99 128597.95
spec7.txt:   throughput = 133247.25 137344.57
spec8.txt:   throughput = 124548.00 139040.98

--

II) Four 2-socket wide instances (20 warehouse threads each):

                             Without patches   With patches

a) numactl pinning results
spec1.txt:   throughput = 479931.16 472467.58
spec2.txt:   throughput = 466652.15 466237.10
spec3.txt:   throughput

Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V8

2013-10-23 Thread Chegu Vinod
Mel Gorman  suse.de> writes:

> 
> Weighing in at 63 patches makes the term "basic" in the series title a
> misnomer.
> 
> This series still has roughly the same goals as previous versions. It
> reduces overhead of automatic balancing through scan rate reduction
> and the avoidance of TLB flushes. It selects a preferred node and moves
> tasks towards their memory as well as moving memory toward their task. It
> handles shared pages and groups related tasks together. Some problems such
> as shared page interleaving and properly dealing with processes that are
> larger than a node are being deferred.
> 
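As a hedged illustration of the preferred-node selection mentioned above (a
sketch under assumptions, not the code in this series), the placement step
amounts to picking the node on which the task has recorded the most NUMA
hinting faults:

/* Pick the node with the most recorded hinting faults; -1 if none yet. */
static int pick_preferred_node(const unsigned long *faults, int nr_nodes)
{
	unsigned long max_faults = 0;
	int nid, preferred = -1;

	for (nid = 0; nid < nr_nodes; nid++) {
		if (faults[nid] > max_faults) {
			max_faults = faults[nid];
			preferred = nid;
		}
	}
	return preferred;
}

The scheduler can then try to run the task on that node and periodically
retry migrating the task and its pages toward it, matching the "moves tasks
towards their memory as well as moving memory toward their task" description
above.
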
> It is still based on 3.11 because that's what I was testing against. If
> we can agree this should be merged to -tip for testing I'll rebase to deal
> with any scheduler conflicts but for now I do not want to invalidate other
> people's testing. The only obvious thing that is missing is hotplug handling.
> Peter is currently working on reducing [get|put]_online_cpus overhead so
> that it can be added to migrate_swap.
> 
> Peter, some of your patches are missing signed-offs-by -- 4-5, 43 and 55.
> 
> Changelog since V7
> o THP migration race and pagetable insertion fixes
> o Do no handle PMDs in batch
> o Shared page migration throttling
> o Various changes to how last nid/pid information is recorded
> o False pid match sanity checks when joining NUMA task groups
> o Adapt scan rate based on local/remote fault statistics
> o Period retry of migration to preferred node
> o Limit scope of system-wide search
> o Schedule threads on the same node as process that created them
> o Cleanup numa_group on exec
> 
> Changelog since V6
> o Group tasks that share pages together
> o More scan avoidance of VMAs mapping pages that are not likely to migrate
> o cpunid conversion, system-wide searching of tasks to balance with
> 
> Changelog since V6
> o Various TLB flush optimisations
> o Comment updates
> o Sanitise task_numa_fault callsites for consistent semantics
> o Revert some of the scanning adaption stuff
> o Revert patch that defers scanning until task schedules on another node
> o Start delayed scanning properly
> o Avoid the same task always performing the PTE scan
> o Continue PTE scanning even if migration is rate limited
> 
> Changelog since V5
> o Add __GFP_NOWARN for numa hinting fault count
> o Use is_huge_zero_page
> o Favour moving tasks towards nodes with higher faults
> o Optionally resist moving tasks towards nodes with lower faults
> o Scan shared THP pages
> 
> Changelog since V4
> o Added code that avoids overloading preferred nodes
> o Swap tasks if nodes are overloaded and the swap does not impair locality
> 
> Changelog since V3
> o Correct detection of unset last nid/pid information
> o Dropped nr_preferred_running and replaced it with Peter's load balancing
> o Pass in correct node information for THP hinting faults
> o Pressure tasks sharing a THP page to move towards same node
> o Do not set pmd_numa if false sharing is detected
> 
> Changelog since V2
> o Reshuffle to match Peter's implied preference for layout
> o Reshuffle to move private/shared split towards end of series to make it
>   easier to evaluate the impact
> o Use PID information to identify private accesses
> o Set the floor for PTE scanning based on virtual address space scan rates
>   instead of time
> o Some locking improvements
> o Do not preempt pinned tasks unless they are kernel threads
> 
> Changelog since V1
> o Scan pages with elevated map count (shared pages)
> o Scale scan rates based on the vsz of the process so the sampling of the
>   task is independent of its size
> o Favour moving towards nodes with more faults even if it's not the
>   preferred node
> o Laughably basic accounting of a compute overloaded node when selecting
>   the preferred node.
> o Applied review comments
> 
> This series integrates basic scheduler support for automatic NUMA balancing.
> It was initially based on Peter Ziljstra's work in "sched, numa, mm:
> Add adaptive NUMA affinity support" but deviates too much to preserve
> Signed-off-bys. As before, if the relevant authors are ok with it I'll
> add Signed-off-bys (or add them yourselves if you pick the patches up).
> There has been a tonne of additional work from both Peter and Rik van Riel.
> 
> Some reports indicate that the performance is getting close to manual
> bindings for some workloads but your mileage will vary.  As before, the
> intention is not to complete the work but to incrementally improve mainline
> and preserve bisectability for any bug reports that crop up.
> 
> Patch 1 is a monolithic dump of patches that are destined for upstream that
>   this series indirectly depends upon.
> 
> Patches 2-3 adds sysctl documentation and comment fixlets
> 
> Patch 4 avoids accounting for a hinting fault if another thread handled the
>   fault in parallel
> 
> Patches 5-6 avoid races with parallel THP migration and THP splits.
> 
> Patch 7 corrects a THP NUMA 

Re: kvm_intel: Could not allocate 42 bytes percpu data

2013-07-02 Thread Chegu Vinod

On 7/1/2013 10:49 PM, Rusty Russell wrote:

Chegu Vinod  writes:

On 6/30/2013 11:22 PM, Rusty Russell wrote:

Chegu Vinod  writes:

Hello,

Lots (~700+) of the following messages are showing up in the dmesg of a
3.10-rc1 based kernel (Host OS is running on a large socket count box
with HT-on).

[   82.270682] PERCPU: allocation failed, size=42 align=16, alloc from
reserved chunk failed
[   82.272633] kvm_intel: Could not allocate 42 bytes percpu data

Woah, weird

Oh.  Shit.  Um, this is embarrassing.

Thanks,
Rusty.


Thanks for your response!


===
module: do percpu allocation after uniqueness check.  No, really!

v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting
percpu memory on large machines:

  Now we have a new state MODULE_STATE_UNFORMED, we can insert the
  module into the list (and thus guarantee its uniqueness) before we
  allocate the per-cpu region.

In my defence, it didn't actually say the patch did this.  Just that
we "can".

This patch actually *does* it.

Signed-off-by: Rusty Russell 
Tested-by: Noone it seems.

Your following "updated" fix seems to be working fine on the larger
socket count machine with HT-on.

OK, did you definitely revert every other workaround?


Yes no other workarounds were there when your change was tested.



If so, please give me a Tested-by: line...


FYI, the actual verification of your change was done by my esteemed 
colleague, Jim Hull (cc'd), who had access to this larger socket count box.




Tested-by: Jim Hull 




Thanks
Vinod




Thanks,
Rusty.





Re: kvm_intel: Could not allocate 42 bytes percpu data

2013-07-01 Thread Chegu Vinod

On 6/30/2013 11:22 PM, Rusty Russell wrote:

Chegu Vinod  writes:

Hello,

Lots (~700+) of the following messages are showing up in the dmesg of a
3.10-rc1 based kernel (Host OS is running on a large socket count box
with HT-on).

[   82.270682] PERCPU: allocation failed, size=42 align=16, alloc from
reserved chunk failed
[   82.272633] kvm_intel: Could not allocate 42 bytes percpu data

Woah, weird

Oh.  Shit.  Um, this is embarrassing.

Thanks,
Rusty.



Thanks for your response!


===
module: do percpu allocation after uniqueness check.  No, really!

v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting
percpu memory on large machines:

 Now we have a new state MODULE_STATE_UNFORMED, we can insert the
 module into the list (and thus guarantee its uniqueness) before we
 allocate the per-cpu region.

In my defence, it didn't actually say the patch did this.  Just that
we "can".

This patch actually *does* it.

Signed-off-by: Rusty Russell 
Tested-by: Noone it seems.


Your following "updated" fix seems to be working fine on the larger 
socket count machine with HT-on.


Thx
Vinod


diff --git a/kernel/module.c b/kernel/module.c
index cab4bce..fa53db8 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2927,7 +2927,6 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
  {
/* Module within temporary copy. */
struct module *mod;
-   Elf_Shdr *pcpusec;
int err;
  
  	mod = setup_load_info(info, flags);

@@ -2942,17 +2941,10 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
err = module_frob_arch_sections(info->hdr, info->sechdrs,
info->secstrings, mod);
if (err < 0)
-   goto out;
+   return ERR_PTR(err);
  
-	pcpusec = &info->sechdrs[info->index.pcpu];

-   if (pcpusec->sh_size) {
-   /* We have a special allocation for this section. */
-   err = percpu_modalloc(mod,
- pcpusec->sh_size, pcpusec->sh_addralign);
-   if (err)
-   goto out;
-   pcpusec->sh_flags &= ~(unsigned long)SHF_ALLOC;
-   }
+   /* We will do a special allocation for per-cpu sections later. */
+   info->sechdrs[info->index.pcpu].sh_flags &= ~(unsigned long)SHF_ALLOC;
  
  	/* Determine total sizes, and put offsets in sh_entsize.  For now

   this is done generically; there doesn't appear to be any
@@ -2963,17 +2955,22 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
/* Allocate and move to the final place */
err = move_module(mod, info);
if (err)
-   goto free_percpu;
+   return ERR_PTR(err);
  
  	/* Module has been copied to its final place now: return it. */

mod = (void *)info->sechdrs[info->index.mod].sh_addr;
kmemleak_load_module(mod, info);
return mod;
+}
  
-free_percpu:

-   percpu_modfree(mod);
-out:
-   return ERR_PTR(err);
+static int alloc_module_percpu(struct module *mod, struct load_info *info)
+{
+   Elf_Shdr *pcpusec = &info->sechdrs[info->index.pcpu];
+   if (!pcpusec->sh_size)
+   return 0;
+
+   /* We have a special allocation for this section. */
+   return percpu_modalloc(mod, pcpusec->sh_size, pcpusec->sh_addralign);
  }
  
  /* mod is no longer valid after this! */

@@ -3237,6 +3234,11 @@ static int load_module(struct load_info *info, const char __user *uargs,
}
  #endif
  
+	/* To avoid stressing percpu allocator, do this once we're unique. */

+   err = alloc_module_percpu(mod, info);
+   if (err)
+   goto unlink_mod;
+
/* Now module is in final location, initialize linked lists, etc. */
err = module_unload_init(mod);
if (err)





kvm_intel: Could not allocate 42 bytes percpu data

2013-06-24 Thread Chegu Vinod


Hello,

Lots (~700+) of the following messages are showing up in the dmesg of a 
3.10-rc1 based kernel (Host OS is running on a large socket count box 
with HT-on).


[   82.270682] PERCPU: allocation failed, size=42 align=16, alloc from 
reserved chunk failed

[   82.272633] kvm_intel: Could not allocate 42 bytes percpu data

... also call traces like the following...

[  101.852136]  c901ad5aa090 88084675dd08 81633743 
88084675ddc8
[  101.860889]  81145053 81f3fa78 88084809dd40 
8907d1cfd2e8
[  101.869466]  8907d1cfd280 88087fffdb08 88084675c010 
88084675dfd8

[  101.878190] Call Trace:
[  101.880953]  [] dump_stack+0x19/0x1e
[  101.886679]  [] pcpu_alloc+0x9a3/0xa40
[  101.892754]  [] __alloc_reserved_percpu+0x13/0x20
[  101.899733]  [] load_module+0x35f/0x1a70
[  101.905835]  [] ? do_page_fault+0xe/0x10
[  101.911953]  [] SyS_init_module+0xfb/0x140
[  101.918287]  [] system_call_fastpath+0x16/0x1b
[  101.924981] kvm_intel: Could not allocate 42 bytes percpu data


Wondering if anyone else has seen this with the recent [3.10] based 
kernels esp. on larger boxes?


There was a similar issue that was reported earlier (where modules were 
being loaded per cpu without checking if an instance was already 
loaded/being-loaded). That issue seems to have been addressed in the 
recent past (e.g. https://lkml.org/lkml/2013/1/24/659 along with a 
couple of follow-on cleanups). Is the above yet another variant of the 
original issue, or perhaps some race condition that got exposed when 
there are a lot more threads?


Vinod





Re: [PATCH 3/2] vfio: Provide module option to disable vfio_iommu_type1 hugepage support

2013-05-30 Thread Chegu Vinod

On 5/28/2013 9:27 AM, Alex Williamson wrote:

Add a module option to vfio_iommu_type1 to disable IOMMU hugepage
support.  This causes iommu_map to only be called with single page
mappings, disabling the IOMMU driver's ability to use hugepages.
This option can be enabled by loading vfio_iommu_type1 with
disable_hugepages=1 or dynamically through sysfs.  If enabled
dynamically, only new mappings are restricted.

Signed-off-by: Alex Williamson 
---

As suggested by Konrad.  This is cleaner to add as a follow-on

  drivers/vfio/vfio_iommu_type1.c |   11 +++
  1 file changed, 11 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 6654a7e..8a2be4e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -48,6 +48,12 @@ module_param_named(allow_unsafe_interrupts,
 MODULE_PARM_DESC(allow_unsafe_interrupts,
		 "Enable VFIO IOMMU support for on platforms without interrupt remapping support.");
 
+static bool disable_hugepages;
+module_param_named(disable_hugepages,
+		   disable_hugepages, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_hugepages,
+		 "Disable VFIO IOMMU support for IOMMU hugepages.");
+
 struct vfio_iommu {
	struct iommu_domain	*domain;
	struct mutex		lock;
@@ -270,6 +276,11 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
		return -ENOMEM;
	}
 
+	if (unlikely(disable_hugepages)) {
+		vfio_lock_acct(1);
+		return 1;
+	}
+
	/* Lock all the consecutive pages from pfn_base */
	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
		unsigned long pfn = 0;




Tested-by: Chegu Vinod 

I was able to verify your changes on a 2-socket Sandy Bridge-EP platform 
and observed about a 7-8% improvement in netperf's TCP_RR 
performance.  The guest size was small (16vcpu/32GB).


Hopefully these changes also have an indirect benefit of avoiding soft 
lockups on the host side when larger guests (> 256GB ) are rebooted. 
Someone who has ready access to a larger Sandybridge-EP/EX platform can 
verify this.


FYI
Vinod



[PATCH] KVM: x86: Increase the "hard" max VCPU limit

2013-04-27 Thread Chegu Vinod
KVM guests today use 8-bit APIC IDs, allowing for 256 IDs. Reserving one
ID for broadcast interrupts should leave 255 IDs. In the case of KVM there
is no need to reserve another ID for the IO-APIC, so the hard max limit for
VCPUs can be increased from 254 to 255. (This was confirmed by Gleb Natapov
http://article.gmane.org/gmane.comp.emulators.kvm.devel/99713  )

Signed-off-by: Chegu Vinod 
---
 arch/x86/include/asm/kvm_host.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4979778..bc57bfa 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -31,7 +31,7 @@
 #include 
 #include 
 
-#define KVM_MAX_VCPUS 254
+#define KVM_MAX_VCPUS 255
 #define KVM_SOFT_MAX_VCPUS 160
 #define KVM_USER_MEM_SLOTS 125
 /* memory slots that are not exposed to userspace */
-- 
1.7.1


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Chegu Vinod

On 4/22/2013 1:50 PM, Jiannan Ouyang wrote:

On Mon, Apr 22, 2013 at 4:44 PM, Peter Zijlstra  wrote:

On Mon, 2013-04-22 at 16:32 -0400, Rik van Riel wrote:

IIRC one of the reasons was that the performance improvement wasn't
as obvious.  Rescheduling VCPUs takes a fair amount of time, quite
probably more than the typical hold time of a spinlock.

IIRC it would spin for a while before blocking..

/me goes re-read some of that thread...

Ah, its because PLE is curing most of it.. !PLE it had huge gains but
apparently nobody cares about !PLE hardware anymore :-)


For now, I don't know how good it can work with PLE. But I think it
should save the time of VMEXIT on PLE machine.

Thanks for sharing your patch. I am waiting for your v2 patch(es) and 
will then send you any review feedback. Hoping to verify your changes on a 
large box (PLE enabled) and get back to you with some data...


Thanks
Vinod


Re: [PATCH RFC 2/2] kvm: Iterate over only vcpus that are preempted

2013-03-05 Thread Chegu Vinod

On 3/4/2013 10:02 AM, Raghavendra K T wrote:

From: Raghavendra K T 

This helps in filtering out the eligible candidates further and
thus potentially helps in quickly allowing preempted lockholders to run.
Note that if a vcpu was spinning during preemption we filter them
by checking whether they are preempted due to pause loop exit.

Signed-off-by: Raghavendra K T 
---
  virt/kvm/kvm_main.c |2 ++
  1 file changed, 2 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 83a804c..60114e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1790,6 +1790,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 		} else if (pass && i > last_boosted_vcpu)
 			break;
+		if (!ACCESS_ONCE(vcpu->preempted))
+			continue;
 		if (vcpu == me)
 			continue;
 		if (waitqueue_active(&vcpu->wq))



Reviewed-by: Chegu Vinod 


Re: [PATCH RFC 1/2] kvm: Record the preemption status of vcpus using preempt notifiers

2013-03-05 Thread Chegu Vinod

On 3/4/2013 10:02 AM, Raghavendra K T wrote:

From: Raghavendra K T 

Note that we mark as preempted only when vcpu's task state was
Running during preemption.

Thanks Jiannan, Avi for preemption notifier ideas. Thanks Gleb, PeterZ
for their precious suggestions. Thanks Srikar for an idea on avoiding
rcu lock while checking task state that improved overcommit numbers.

Signed-off-by: Raghavendra K T 
---
  include/linux/kvm_host.h |1 +
  virt/kvm/kvm_main.c  |5 +
  2 files changed, 6 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cad77fe..0b31e1c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -252,6 +252,7 @@ struct kvm_vcpu {
 		bool dy_eligible;
 	} spin_loop;
 #endif
+	bool preempted;
 	struct kvm_vcpu_arch arch;
 };
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index adc68fe..83a804c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -244,6 +244,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 
 	kvm_vcpu_set_in_spin_loop(vcpu, false);
 	kvm_vcpu_set_dy_eligible(vcpu, false);
+	vcpu->preempted = false;
 
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
@@ -2902,6 +2903,8 @@ struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
 static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+	if (vcpu->preempted)
+		vcpu->preempted = false;
 
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
@@ -2911,6 +2914,8 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+	if (current->state == TASK_RUNNING)
+		vcpu->preempted = true;
 	kvm_arch_vcpu_put(vcpu);
 }
 




Reviewed-by: Chegu Vinod 


Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-02-03 Thread Chegu Vinod

On 1/25/2013 11:05 AM, Rik van Riel wrote:

Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks is delaying the time between
checking the ticket based on a delay factor, and the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
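
A rough userspace sketch of the proportional delay just described, assuming a
fixed delay factor and illustrative names (struct ticket_lock, DELAY_FACTOR,
cpu_relax_hint); the real patches auto-tune the factor, as explained below:

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

#define DELAY_FACTOR 50u	/* e.g. the "50x" delay mentioned above */

static inline void cpu_relax_hint(void)
{
	__asm__ __volatile__("" ::: "memory");	/* compiler barrier as a pause stand-in */
}

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);	/* grab a ticket */
	unsigned int cur;

	while ((cur = atomic_load(&l->owner)) != me) {
		/*
		 * Spin proportionally to the number of CPUs queued ahead of
		 * us, so the lock holder's cache line is checked less often
		 * and the holder can make forward progress.
		 */
		unsigned int ahead = me - cur;
		unsigned int loops = ahead * DELAY_FACTOR;

		while (loops--)
			cpu_relax_hint();
	}
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);	/* serve the next waiter in ticket order */
}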

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.

Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.

With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.
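
A purely illustrative tuning rule, assuming a simple grow-fast/shrink-slow
policy and hypothetical bounds (not necessarily the exact policy in these
patches):

#define MIN_DELAY_FACTOR 1u
#define MAX_DELAY_FACTOR 16000u		/* hypothetical cap */

static unsigned int delay_factor = 50u;	/* hypothetical starting point */

static inline void tune_delay_factor(int had_to_spin)
{
	if (had_to_spin) {
		/* the aggressive "try to make the delay longer" policy */
		unsigned int grown = delay_factor + (delay_factor >> 3) + 1;

		if (grown < MAX_DELAY_FACTOR)
			delay_factor = grown;
	} else if (delay_factor > MIN_DELAY_FACTOR) {
		/* contention was light: drift back down slowly */
		delay_factor--;
	}
}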

The attached graph shows the performance of the multi threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.

For the v3 series, I tried out all the ideas suggested by
Michel. They made perfect sense, but in the end it turned
out they did not work as well as the simple, aggressive
"try to make the delay longer" policy I have now. Several
small bug fixes and cleanups have been integrated.

For the v4 series, I added code to keep the maximum spinlock
delay to a small value when running in a virtual machine. That
should solve the performance regression seen in virtual machines.

The performance issue observed with AIM7 is still a mystery.

Performance is within the margin of error of v2, so the graph
has not been update.

Please let me know if you manage to break this code in any way,
so I can fix it...

I got back on the machine and re-ran the AIM7-highsystime microbenchmark 
with 2000 users and 100 jobs per user
on a 20, 40, and 80 vcpu guest using a 3.7.5 kernel with and without Rik's 
latest patches.


Host Platform : 8 socket (80 Core) Westmere with 1TB RAM.

Config 1 : 3.7.5 base running on host and in the guest

Config 2 : 3.7.5 + Rik's patches running on host and in the guest

(Note: I didn't change the PLE settings on the host... The severe drop 
at 40-way and 80-way is due to the un-optimized PLE handler. 
Raghu's PLE fixes should address those.)


            Config 1    Config 2
20vcpu  -   170K        168K
40vcpu  -   37K         37K
80vcpu  -   10.5K       11.5K


Not much difference between the two configs (need to test it along 
with Raghu's fixes).


BTW, I noticed that there were results posted using AIM7-compute 
workload earlier.  The AIM7-highsystime is a lot more kernel intensive.


FYI
Vinod


Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-10 Thread Chegu Vinod

On 1/8/2013 2:26 PM, Rik van Riel wrote:
<...>

Performance is within the margin of error of v2, so the graph
has not been update.

Please let me know if you manage to break this code in any way,
so I can fix it...



Attached below is some preliminary data with one of the AIM7 micro-benchmark
workloads (i.e. high_systime). This is a kernel-intensive workload which
does tons of forks/execs etc. and stresses quite a few of the same set
of spinlocks and semaphores.

Observed a drop in performance as we go to 40-way and 80-way. Wondering
if the backoff keeps increasing to such an extent that it actually starts
to hurt, given the nature of this workload? Also, in the case of 80-way,
observed quite a bit of variation from run to run...

Also ran it inside a single KVM guest. There were some perf. dips but
interestingly didn't observe the same level of drop (compared to the
drop in the native case) as the guest size was scaled up to 40vcpu or
80vcpu.

FYI
Vinod



---

Platform : 8 socket (80 Core) Westmere with 1TB RAM.

Workload: AIM7-highsystime microbenchmark - 2000 users & 100 jobs per user.  

Values reported are Jobs Per Minute (Higher is better).  The values
are average of 3 runs.

1) Native run:
--

Config 1:  3.7 kernel
Config 2:  3.7 + Rik's 1-4 patches


            20way    40way    80way

Config 1    ~179K    ~159K    ~146K

Config 2    ~180K    ~134K    ~21K-43K  <- high variation!


(Note: Used numactl to restrict workload to 
2 sockets (20way) and 4 sockets(40way))

--

2) KVM run : 


Single guest of different sizes (No over commit, NUMA enabled in the guest).

Note: This kernel-intensive micro-benchmark exposes the PLE handler issue, 
  esp. for large guests. Since Raghu's PLE changes are not yet upstream, 
  I have just run with the current PLE handler and then with PLE 
  disabled (ple_gap=0).

Config 1 : Host & Guest at 3.7
Config 2 : Host & Guest are at 3.7 + Rik's 1-4 patches

--
 20vcpu/128G  40vcpu/256G  80vcpu/512G
(on 2 sockets)   (on 4 sockets)   (on 8 sockets)
--
Config 1   ~144K ~39K ~10K
--
Config 2   ~143K ~37.5K   ~11K
--

Config 3 : Host & Guest at 3.7 AND ple_gap=0
Config 4 : Host & Guest are at 3.7 + Rik's 1-4 patches AND ple_gap=0

--
Config 3   ~154K~131K~116K 
--
Config 4   ~151K~130K~115K
--


(Note: Used numactl to restrict qemu to 
2 sockets (20way) and 4 sockets(40way))


Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-28 Thread Chegu Vinod

On 11/28/2012 5:09 PM, Chegu Vinod wrote:

On 11/27/2012 6:23 AM, Chegu Vinod wrote:

On 11/27/2012 2:30 AM, Raghavendra K T wrote:

On 11/26/2012 07:05 PM, Andrew Jones wrote:

On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:

From: Peter Zijlstra 

In case of undercomitted scenarios, especially in large guests
yield_to overhead is significantly high. When run queue length of
source and target is one, take an opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of PLE handler.

(History: Raghavendra initially worked on break out of kvm ple handler upon
  seeing source runqueue length = 1, but it had to export rq length).
  Peter came up with the elegant idea of return -ESRCH in scheduler core.


Signed-off-by: Peter Zijlstra 
Raghavendra, Checking the rq length of target vcpu condition 
added.(thanks Avi)

Reviewed-by: Srikar Dronamraju 
Signed-off-by: Raghavendra K T 
---

  kernel/sched/core.c |   25 +++--
  1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
   * It's the caller's job to ensure that the target task struct
   * can't go away on us before we can do any checks.
   *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)

 again:
	p_rq = task_rq(p);
+	/*
+	 * If we're the only runnable task on the rq and target rq also
+	 * has only one task, there's absolutely no point in yielding.
+	 */
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
	double_rq_lock(rq, p_rq);
	while (task_rq(p) != p_rq) {
		double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
	}

	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;

	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;

	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;

	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
	if (yielded) {
@@ -4329,11 +4341,12 @@ again:
		resched_task(p_rq->curr);
	}

-out:
+out_unlock:
	double_rq_unlock(rq, p_rq);
+out_irq:
	local_irq_restore(flags);

-	if (yielded)
+	if (yielded > 0)
		schedule();

	return yielded;



Acked-by: Andrew Jones 



Thank you Drew.

Marcelo Gleb.. Please let me know if you have comments / concerns on 
the patches..


Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
especially for large guests where we do have overhead of vcpu iteration
of ple handler..


Thanks Raghu. Will try to get this latest patch set evaluated and get 
back to you.


Vinod





< Resending as prev. email to the kvm and lkml email aliases bounced 
twice... Apologies for any repeats! >



Hi Raghu,

Here is some preliminary data with your latest set of PLE patches (and 
also with Andrew's throttled yield_to() change).


Ran a single guest on an 80-core Westmere platform. [Note: Host and 
guest had the latest kernel from kvm.git and also used the latest qemu 
from qemu.git as of yesterday morning.]


The guest was running an AIM7 high_systime workload. (Note: 
high_systime is a kernel-intensive micro-benchmark, but in this case it 
was run just as a workload in the guest to trigger spinlock etc. 
contention in the guest OS and hence PLE; i.e. this is not a real 
benchmark run.) I have run this workload with a constant number (2000) 
of users with 100 jobs per user. The numbers below represent jobs per 
minute (JPM); higher is better.


                                                40VCPU     60VCPU     80VCPU

a) 3.7.0-rc6+ w/ ple_gap=0                      ~102K      ~88K       ~81K

b) 3.7.0-rc6+                                   ~53K       ~25K       ~18-20K

c) 3.7.0-rc6+ w/ PLE patches                    ~100K      ~81K       ~48K-69K  <- lot of variation from run to run

d) 3.7.0-rc6+ w/ throttled yield_to() change    ~101K      ~87K       ~78K

---

The PLE patches case (c) does show improvements in this non-overcommit 
large-guest case when compared to case (b). However, at 80-way I started 
to observe quite a bit of variation from run to run, and the JPM was 
lower when compared with the throttled yield_to() change, case (d).


For this 80-way case (c) I also noticed that the average time spent in the 
PLE exit (as reported by small samplings from perf kvm stat) varied 
quite a bit and was at times much greater when compared with 
the case of throttled yield_to() c

Re: [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios

2012-11-28 Thread Chegu Vinod

On 11/26/2012 4:07 AM, Raghavendra K T wrote:

  In some special scenarios like #vcpu <= #pcpu, PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.

  The first patch optimizes all the yield_to by bailing out when there
  is no need to continue in yield_to (i.e., when there is only one task
  in source and target rq).

  The second patch uses that in the PLE handler. Further, when a yield_to fails
  we do not immediately go out of the PLE handler; instead we try thrice
  to have a better statistical possibility of a false return. Otherwise that
  would affect moderate overcommit cases.
  
  Results on a 3.7.0-rc6 kernel show around 140% improvement for ebizzy 1x and
  around 51% for dbench 1x with a 32-core PLE machine with a 32-vcpu guest.


base = 3.7.0-rc6
machine: 32 core mx3850 x5 PLE mc

----+------------+-----------+------------+-----------+-----------+
     ebizzy (rec/sec, higher is better)
----+------------+-----------+------------+-----------+-----------+
     base         stdev       patched      stdev       %improve
----+------------+-----------+------------+-----------+-----------+
1x   2511.3000      21.5409    6051.8000    170.2592   140.98276
2x   2679.4000     332.4482    2692.3000    251.4005     0.48145
3x   2253.5000     266.4243    2192.1667    178.9753    -2.72169
4x   1784.3750     102.2699    2018.7500    187.5723    13.13485
----+------------+-----------+------------+-----------+-----------+

----+------------+-----------+------------+-----------+-----------+
     dbench (throughput in MB/sec, higher is better)
----+------------+-----------+------------+-----------+-----------+
     base         stdev       patched      stdev       %improve
----+------------+-----------+------------+-----------+-----------+
1x   6677.4080     638.5048   10098.0060   3449.7026    51.22643
2x   2012.6760      64.7642    2019.0440     62.6702     0.31639
3x   1302.0783      40.8336    1292.7517     27.0515    -0.71629
4x   3043.1725    3243.7281    4664.4662   5946.5741    53.27643
----+------------+-----------+------------+-----------+-----------+

Here is the reference no-PLE result.
  ebizzy-1x_nople 7592.6000 rec/sec
  dbench_1x_nople 7853.6960 MB/sec

The result says we can still improve by 60% for ebizzy, but overall we are
getting impressive performance with the patches.

  Changes Since V2:
  - Dropped global measures usage patch (Peter Zilstra)
  - Do not bail out on first failure (Avi Kivity)
  - Try thrice for the failure of yield_to to get statistically more correct
behaviour.

  Changes since V1:
  - Discard the idea of exporting nr_running and optimize in core scheduler (Peter)
  - Use yield() instead of schedule in overcommit scenarios (Rik)
  - Use loadavg knowledge to detect undercommit/overcommit

  Peter Zijlstra (1):
   Bail out of yield_to when source and target runqueue has one task

  Raghavendra K T (1):
   Handle yield_to failure return for potential undercommit case

  Please let me know your comments and suggestions.

  Link for V2:
  https://lkml.org/lkml/2012/10/29/287

  Link for V1:
  https://lkml.org/lkml/2012/9/21/168

  kernel/sched/core.c | 25 +++--
  virt/kvm/kvm_main.c | 26 --
  2 files changed, 35 insertions(+), 16 deletions(-)



Tested-by: Chegu Vinod 


Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-27 Thread Chegu Vinod

On 11/27/2012 2:30 AM, Raghavendra K T wrote:

On 11/26/2012 07:05 PM, Andrew Jones wrote:

On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:

From: Peter Zijlstra 

In case of undercomitted scenarios, especially in large guests
yield_to overhead is significantly high. When run queue length of
source and target is one, take an opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of PLE handler.

(History: Raghavendra initially worked on break out of kvm ple handler upon
  seeing source runqueue length = 1, but it had to export rq length).
  Peter came up with the elegant idea of return -ESRCH in scheduler core.


Signed-off-by: Peter Zijlstra 
Raghavendra, Checking the rq length of target vcpu condition 
added.(thanks Avi)

Reviewed-by: Srikar Dronamraju 
Signed-off-by: Raghavendra K T 
---

  kernel/sched/core.c |   25 +++--
  1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
   * It's the caller's job to ensure that the target task struct
   * can't go away on us before we can do any checks.
   *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)

 again:
	p_rq = task_rq(p);
+	/*
+	 * If we're the only runnable task on the rq and target rq also
+	 * has only one task, there's absolutely no point in yielding.
+	 */
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
	double_rq_lock(rq, p_rq);
	while (task_rq(p) != p_rq) {
		double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
	}

	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;

	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;

	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;

	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
	if (yielded) {
@@ -4329,11 +4341,12 @@ again:
		resched_task(p_rq->curr);
	}

-out:
+out_unlock:
	double_rq_unlock(rq, p_rq);
+out_irq:
	local_irq_restore(flags);

-	if (yielded)
+	if (yielded > 0)
		schedule();

	return yielded;



Acked-by: Andrew Jones 



Thank you Drew.

Marcelo Gleb.. Please let me know if you have comments / concerns on 
the patches..


Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
especially for large guests where we do have overhead of vcpu iteration
of ple handler..


Thanks Raghu. Will try to get this latest patch set evaluated and get 
back to you.


Vinod



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-21 Thread Chegu Vinod

On 9/21/2012 4:59 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may
prove very costly,


Yes.

  because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.

An idea to solve this is:
1) As Avi had proposed we can modify hardware ple_window
dynamically to avoid frequent PL-exit.


Yes. We had to do this to get around some scaling issues for large 
(>20way) guests (with no overcommitment).


As part of some experimentation we even tried "switching off"  PLE too :(




(IMHO, it is difficult to
decide when we have mixed type of VMs).


Agree.

Not sure if the following alternatives have also been looked at :

- Could the behavior associated with the "ple_window" be modified to 
be a function of some [new] per-guest attribute (which can be conveyed 
to the host as part of the guest launch sequence)? The user can choose 
to set this [new] attribute for a given guest. This would help avoid the 
frequent exits due to PLE (as Avi had mentioned earlier).


- Can the PLE feature (in VT) be "enhanced" to be made a per-guest 
attribute?



IMHO, the approach of not taking a frequent exit is better than taking 
an exit and returning from the handler, etc.


Thanks
Vinod






Another idea, proposed in the first patch, is to identify
non-overcommit case and just return from the PLE handler.

There are are many ways to identify non-overcommit scenario.
1) Using loadavg etc (get_avenrun/calc_global_load
  /this_cpu_load)

2) Explicitly check nr_running()/num_online_cpus()

3) Check source vcpu runqueue length.

Not sure how we can make use of (1) effectively / how to use it.
(2) has significant overhead since it iterates over all cpus,
so this patch uses the third method. (I feel it is ugly to export the
runqueue length, but am expecting suggestions on this.)

In the second patch, when we have a large number of small guests, it is
possible that a spinning vcpu fails to yield_to any vcpu of the same
VM and goes back and spins. This is also not effective when we are
over-committed. Instead, we do a schedule() so that we give a chance
to other VMs to run.
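
A hedged, kernel-style sketch of the combined idea (not the actual patches):
it reuses the in-tree kvm_for_each_vcpu() and kvm_vcpu_yield_to() helpers and
assumes a single_task_running()-style check for the undercommit case.

static void ple_handler_sketch(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int boosted = 0;
	int i;

	/*
	 * Undercommitted: the spinning vcpu is alone on its runqueue, so
	 * iterating over the other vcpus is pointless; return right away.
	 */
	if (single_task_running())
		return;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu == me)
			continue;
		if (kvm_vcpu_yield_to(vcpu) > 0) {
			boosted = 1;	/* gave a (likely) lock holder a chance to run */
			break;
		}
	}

	/* Overcommitted but nothing to boost: be courteous to other VMs. */
	if (!boosted)
		schedule();
}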

Raghavendra K T(2):
  Handle undercommitted guest case in PLE handler
  Be courteous to other VMs in overcommitted scenario in PLE handler

Results:
base = 3.6.0-rc5 + ple handler optimization patches from kvm tree.
patched = base + patch1 + patch2
machine: x240 with 16 core with HT enabled (32 cpu thread).
32 vcpu guest with 8GB RAM.

+------------+-----------+------------+-----------+----------+
       ebizzy (record/sec, higher is better)
+------------+-----------+------------+-----------+----------+
  base         stddev      patched      stdev       %improve
+------------+-----------+------------+-----------+----------+
  11293.3750    624.4378   18209.6250    371.7061   61.24166
   3641.8750    468.9400    3725.5000    253.7823    2.29621
+------------+-----------+------------+-----------+----------+

+------------+-----------+------------+-----------+----------+
       kernbench (time in sec, lower is better)
+------------+-----------+------------+-----------+----------+
  base         stddev      patched      stdev       %improve
+------------+-----------+------------+-----------+----------+
     30.6020     1.3018      30.8287      1.1517   -0.74080
     64.0825     2.3764      63.4721      5.0191    0.95252
     95.8638     8.7030      94.5988      8.3832    1.31958
+------------+-----------+------------+-----------+----------+

Note:
on mx3850x5 machine with 32 cores HT disabled I got around
ebizzy  209%
kernbench   6%
improvement for 1x scenario.

Thanks Srikar for his active participation in discussing ideas and
reviewing the patch.

Please let me know your suggestions and comments.
---
  include/linux/sched.h |1 +
  kernel/sched/core.c   |6 ++
  virt/kvm/kvm_main.c   |7 +++
  3 files changed, 14 insertions(+), 0 deletions(-)



