Re: [Qemu-devel] Windows VM slow boot

2012-09-13 Thread Mel Gorman
On Wed, Sep 12, 2012 at 05:46:15PM +0100, Richard Davies wrote:
 Hi Mel - thanks for replying to my underhand bcc!
 
 Mel Gorman wrote:
  I see that this is an old-ish bug but I did not read the full history.
  Is it now booting faster than 3.5.0 was? I'm asking because I'm
  interested to see if commit c67fe375 helped your particular case.
 
 Yes, I think 3.6.0-rc5 is already better than 3.5.x but can still be
 improved, as discussed.
 

What are the boot times for each kernel?

 PATCH SNIPPED
 
 I have applied and tested again - perf results below.
 
 isolate_migratepages_range is indeed much reduced.
 
 There is now a lot of time in isolate_freepages_block and still quite a lot
 of lock contention, although in a different place.
 

This on top please.

---8<---
From: Shaohua Li <s...@fusionio.com>
Subject: compaction: abort compaction loop if lock is contended or run too long

isolate_migratepages_range() might isolate no pages, for example when
zone->lru_lock is contended and compaction is async. In this case we should
abort compaction; otherwise compact_zone will run a useless loop and make
zone->lru_lock even more contended.

V2:
only abort the compaction if the lock is contended or it has run too long
code rearranged by Andrea Arcangeli

[minc...@kernel.org: Putback pages isolated for migration if aborting]
[a...@linux-foundation.org: Fixup one contended usage site]
Signed-off-by: Andrea Arcangeli <aarca...@redhat.com>
Signed-off-by: Shaohua Li <s...@fusionio.com>
Signed-off-by: Mel Gorman <mgor...@suse.de>
---
 mm/compaction.c |   17 ++++++++++++-----
 mm/internal.h   |    2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 7fcd3a5..a8de20d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -70,8 +70,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 
/* async aborts if taking too long or contended */
if (!cc->sync) {
-   if (cc->contended)
-   *cc->contended = true;
+   cc->contended = true;
return false;
}
 
@@ -634,7 +633,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 
/* Perform the isolation */
low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-   if (!low_pfn)
+   if (!low_pfn || cc->contended)
return ISOLATE_ABORT;
 
cc->migrate_pfn = low_pfn;
@@ -787,6 +786,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
switch (isolate_migratepages(zone, cc)) {
case ISOLATE_ABORT:
ret = COMPACT_PARTIAL;
+   putback_lru_pages(&cc->migratepages);
+   cc->nr_migratepages = 0;
goto out;
case ISOLATE_NONE:
continue;
@@ -831,6 +832,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 int order, gfp_t gfp_mask,
 bool sync, bool *contended)
 {
+   unsigned long ret;
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
@@ -838,12 +840,17 @@ static unsigned long compact_zone_order(struct zone *zone,
.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
-   .contended = contended,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
 
-   return compact_zone(zone, &cc);
+   ret = compact_zone(zone, &cc);
+
+   VM_BUG_ON(!list_empty(&cc.freepages));
+   VM_BUG_ON(!list_empty(&cc.migratepages));
+
+   *contended = cc.contended;
+   return ret;
 }
 
 int sysctl_extfrag_threshold = 500;
diff --git a/mm/internal.h b/mm/internal.h
index b8c91b3..4bd7c0e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -130,7 +130,7 @@ struct compact_control {
int order;  /* order a direct compactor needs */
int migratetype;    /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
-   bool *contended;    /* True if a lock was contended */
+   bool contended;     /* True if a lock was contended */
 };
 
 unsigned long

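For readers not steeped in mm/ internals, the control flow the patch gives
compact_zone() can be sketched in plain userspace C. Everything below is a
simplified stand-in (the names mirror the kernel's, but this is not kernel
code, and the contention trigger is simulated rather than a real lock):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the compaction outcome codes. */
enum isolate_result { ISOLATE_ABORT, ISOLATE_NONE, ISOLATE_SUCCESS };

/* Miniature compact_control: 'contended' is stored by value, as the
 * patch changes it from bool * to bool. */
struct cc {
    bool contended;
    int nr_migratepages;
    int putback_calls;   /* counts simulated putback_lru_pages() */
};

/* Pretend isolation: succeeds twice, then hits lock contention. */
static enum isolate_result isolate_migratepages(struct cc *cc, int pass)
{
    if (pass >= 2) {
        cc->contended = true;    /* zone->lru_lock was contended */
        return ISOLATE_ABORT;
    }
    cc->nr_migratepages = 4;
    return ISOLATE_SUCCESS;
}

/* The point of the patch: on ISOLATE_ABORT the loop puts back any
 * isolated pages and bails out instead of spinning on the lock.
 * Returns the number of isolation attempts made. */
static int compact_zone(struct cc *cc)
{
    int pass = 0;
    for (;;) {
        switch (isolate_migratepages(cc, pass++)) {
        case ISOLATE_ABORT:
            cc->putback_calls++;      /* putback_lru_pages() */
            cc->nr_migratepages = 0;
            return pass;
        case ISOLATE_NONE:
            continue;
        case ISOLATE_SUCCESS:
            cc->nr_migratepages = 0;  /* pages handed to migration */
            break;
        }
    }
}
```

Without the ISOLATE_ABORT exit, the loop above would keep re-running
isolation against a contended lock; with it, async compaction gives up
after the first contended attempt.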


Re: [Qemu-devel] Windows VM slow boot

2012-09-12 Thread Richard Davies
[ adding linux-mm - previously at http://marc.info/?t=13451150943 ]

Hi Rik,

Since qemu-kvm 1.2.0 and Linux 3.6.0-rc5 came out, I thought that I would
retest with these.

The typical symptom now appears to be that the Windows VMs boot reasonably
fast, but then there is high CPU use and load for many minutes afterwards -
the high CPU use is both for the qemu-kvm processes themselves and also for
% sys.

I attach a perf report which seems to show that the high CPU use is in the
memory manager.

Cheers,

Richard.


# 
# captured on: Wed Sep 12 10:25:43 2012
# os release : 3.6.0-rc5-elastic
# perf version : 3.5.2
# arch : x86_64
# nrcpus online : 16
# nrcpus avail : 16
# cpudesc : AMD Opteron(tm) Processor 6128
# cpuid : AuthenticAMD,16,9,1
# total memory : 131973280 kB
# cmdline : /home/root/bin/perf record -g -a 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, 
excl_usr = 0, excl_kern = 0, id = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
14, 15, 16 }
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# 
#
# Samples: 870K of event 'cycles'
# Event count (approx.): 432968175910
#
# Overhead  Command   Shared Object          Symbol
# ........  .......   .................      ..........................
#
89.14% qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave   
 
   |
   --- _raw_spin_lock_irqsave
  |  
  |--95.47%-- isolate_migratepages_range
  |  compact_zone
  |  compact_zone_order
  |  try_to_compact_pages
  |  __alloc_pages_direct_compact
  |  __alloc_pages_nodemask
  |  alloc_pages_vma
  |  do_huge_pmd_anonymous_page
  |  handle_mm_fault
  |  __get_user_pages
  |  get_user_page_nowait
  |  hva_to_pfn.isra.17
  |  __gfn_to_pfn
  |  gfn_to_pfn_async
  |  try_async_pf
  |  tdp_page_fault
  |  kvm_mmu_page_fault
  |  pf_interception
  |  handle_exit
  |  kvm_arch_vcpu_ioctl_run
  |  kvm_vcpu_ioctl
  |  do_vfs_ioctl
  |  sys_ioctl
  |  system_call_fastpath
  |  ioctl
  |  |  
  |  |--55.64%-- 0x1010002
  |  |  
  |   --44.36%-- 0x1010006
  |  
  |--4.53%-- compact_zone
  |  compact_zone_order
  |  try_to_compact_pages
  |  __alloc_pages_direct_compact
  |  __alloc_pages_nodemask
  |  alloc_pages_vma
  |  do_huge_pmd_anonymous_page
  |  handle_mm_fault
  |  __get_user_pages
  |  get_user_page_nowait
  |  hva_to_pfn.isra.17
  |  __gfn_to_pfn
  |  gfn_to_pfn_async
  |  try_async_pf
  |  tdp_page_fault
  |  kvm_mmu_page_fault
  |  pf_interception
  |  handle_exit
  |  kvm_arch_vcpu_ioctl_run
  |  kvm_vcpu_ioctl
  |  do_vfs_ioctl
  |  sys_ioctl
  |  system_call_fastpath
  |  ioctl
  |  |  
  |  |--55.36%-- 0x1010002
  |  |  
  |   --44.64%-- 0x1010006
   --0.00%-- [...]
 4.92% qemu-kvm  [kernel.kallsyms] [k] migrate_pages
 
   |
   --- migrate_pages
  |  
  |--99.74%-- compact_zone
  |  compact_zone_order
  |  try_to_compact_pages
  |  __alloc_pages_direct_compact

Re: [Qemu-devel] Windows VM slow boot

2012-09-12 Thread Mel Gorman
On Wed, Sep 12, 2012 at 11:56:59AM +0100, Richard Davies wrote:
 [ adding linux-mm - previously at http://marc.info/?t=13451150943 ]
 
 Hi Rik,
 

I'm not Rik but hi anyway.

 Since qemu-kvm 1.2.0 and Linux 3.6.0-rc5 came out, I thought that I would
 retest with these.
 

Ok. 3.6.0-rc5 contains [c67fe375: mm: compaction: Abort async compaction
if locks are contended or taking too long], which should have mitigated
some of the lock contention, but not all of it, as we'll see later.

 The typical symptom now appears to be that the Windows VMs boot reasonably
 fast,

I see that this is an old-ish bug but I did not read the full history.
Is it now booting faster than 3.5.0 was? I'm asking because I'm
interested to see if commit c67fe375 helped your particular case.

 but then there is high CPU use and load for many minutes afterwards -
 the high CPU use is both for the qemu-kvm processes themselves and also for
 % sys.
 

Ok, I cannot comment on the userspace portion of things but the kernel
portion still indicates that there is a high percentage of time on what
appears to be lock contention.

 I attach a perf report which seems to show that the high CPU use is in the
 memory manager.
 

A follow-on from commit c67fe375 was the following patch (author cc'd)
which addresses lock contention in isolate_migratepages_range where your
perf report indicates that we're spending 95% of the time. Would you be
willing to test it please?

---8<---
From: Shaohua Li <s...@kernel.org>
Subject: mm: compaction: check lock contention first before taking lock

isolate_migratepages_range will take zone->lru_lock first and check if the
lock is contended; if it is, it will release the lock.  This isn't
efficient.  If the lock is truly contended, a lock/unlock pair will
increase the lock contention.  We'd better check if the lock is contended
first.  compact_trylock_irqsave perfectly meets the requirement.

Signed-off-by: Shaohua Li <s...@fusionio.com>
Acked-by: Mel Gorman <mgor...@suse.de>
Acked-by: Minchan Kim <minc...@kernel.org>
Signed-off-by: Andrew Morton <a...@linux-foundation.org>
---

 mm/compaction.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff -puN mm/compaction.c~mm-compaction-check-lock-contention-first-before-taking-lock mm/compaction.c
--- a/mm/compaction.c~mm-compaction-check-lock-contention-first-before-taking-lock
+++ a/mm/compaction.c
@@ -349,8 +349,9 @@ isolate_migratepages_range(struct zone *
 
/* Time to isolate some pages for migration */
cond_resched();
-   spin_lock_irqsave(&zone->lru_lock, flags);
-   locked = true;
+   locked = compact_trylock_irqsave(&zone->lru_lock, &flags, cc);
+   if (!locked)
+   return 0;
for (; low_pfn < end_pfn; low_pfn++) {
struct page *page;
 

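To make the before/after difference concrete: the old code queued on the
lock unconditionally and only then noticed contention; the new code probes
first. Here is a userspace sketch of that probe-first pattern, with a C11
atomic_flag standing in for zone->lru_lock (compact_trylock_sim and the
lock helpers are made-up names, not kernel symbols):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_flag lock_t;          /* stand-in for the kernel spinlock */

static bool trylock_sim(lock_t *l) { return !atomic_flag_test_and_set(l); }
static void lock_sim(lock_t *l)    { while (atomic_flag_test_and_set(l)) ; }
static void unlock_sim(lock_t *l)  { atomic_flag_clear(l); }

/* Miniature compact_control: just the fields the sketch needs. */
struct cc {
    bool sync;       /* synchronous compaction? */
    bool contended;  /* records that the lock was contended */
};

/* Probe first, as compact_trylock_irqsave does after this patch: an
 * async caller backs off on contention instead of adding a lock/unlock
 * pair to an already busy lock. */
static bool compact_trylock_sim(lock_t *lock, struct cc *cc)
{
    if (trylock_sim(lock))
        return true;             /* uncontended: go isolate pages */

    if (!cc->sync) {
        cc->contended = true;    /* async: give up, remember why */
        return false;
    }
    lock_sim(lock);              /* sync: willing to wait */
    return true;
}
```

When this returns false, isolate_migratepages_range returns 0 and the
caller treats it as an abort, rather than contending for the lock only to
drop it again.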


Re: [Qemu-devel] Windows VM slow boot

2012-09-12 Thread Richard Davies
Hi Mel - thanks for replying to my underhand bcc!

Mel Gorman wrote:
 I see that this is an old-ish bug but I did not read the full history.
 Is it now booting faster than 3.5.0 was? I'm asking because I'm
 interested to see if commit c67fe375 helped your particular case.

Yes, I think 3.6.0-rc5 is already better than 3.5.x but can still be
improved, as discussed.

 A follow-on from commit c67fe375 was the following patch (author cc'd)
 which addresses lock contention in isolate_migratepages_range where your
 perf report indicates that we're spending 95% of the time. Would you be
 willing to test it please?

 ---8<---
 From: Shaohua Li <s...@kernel.org>
 Subject: mm: compaction: check lock contention first before taking lock

 isolate_migratepages_range will take zone->lru_lock first and check if the
 lock is contended; if it is, it will release the lock.  This isn't
 efficient.  If the lock is truly contended, a lock/unlock pair will
 increase the lock contention.  We'd better check if the lock is contended
 first.  compact_trylock_irqsave perfectly meets the requirement.

 Signed-off-by: Shaohua Li <s...@fusionio.com>
 Acked-by: Mel Gorman <mgor...@suse.de>
 Acked-by: Minchan Kim <minc...@kernel.org>
 Signed-off-by: Andrew Morton <a...@linux-foundation.org>
 ---

  mm/compaction.c |    5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

 diff -puN mm/compaction.c~mm-compaction-check-lock-contention-first-before-taking-lock mm/compaction.c
 --- a/mm/compaction.c~mm-compaction-check-lock-contention-first-before-taking-lock
 +++ a/mm/compaction.c
 @@ -349,8 +349,9 @@ isolate_migratepages_range(struct zone *
 
   /* Time to isolate some pages for migration */
   cond_resched();
 - spin_lock_irqsave(&zone->lru_lock, flags);
 - locked = true;
 + locked = compact_trylock_irqsave(&zone->lru_lock, &flags, cc);
 + if (!locked)
 + return 0;
   for (; low_pfn < end_pfn; low_pfn++) {
   struct page *page;

I have applied and tested again - perf results below.

isolate_migratepages_range is indeed much reduced.

There is now a lot of time in isolate_freepages_block and still quite a lot
of lock contention, although in a different place.


# 
# captured on: Wed Sep 12 16:00:52 2012
# os release : 3.6.0-rc5-elastic+
# perf version : 3.5.2
# arch : x86_64
# nrcpus online : 16
# nrcpus avail : 16
# cpudesc : AMD Opteron(tm) Processor 6128
# cpuid : AuthenticAMD,16,9,1
# total memory : 131973280 kB
# cmdline : /home/root/bin/perf record -g -a 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, 
excl_usr = 0, excl_kern = 0, id = { 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 
76, 77, 78, 79, 80 }
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# 
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 560365005583
#
# Overhead  Command   Shared Object          Symbol
# ........  .......   .................      ..........................
#
43.95% qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block  
 
   |
   --- isolate_freepages_block
  |  
  |--99.99%-- compaction_alloc
  |  migrate_pages
  |  compact_zone
  |  compact_zone_order
  |  try_to_compact_pages
  |  __alloc_pages_direct_compact
  |  __alloc_pages_nodemask
  |  alloc_pages_vma
  |  do_huge_pmd_anonymous_page
  |  handle_mm_fault
  |  __get_user_pages
  |  get_user_page_nowait
  |  hva_to_pfn.isra.17
  |  __gfn_to_pfn
  |  gfn_to_pfn_async
  |  try_async_pf
  |  tdp_page_fault
  |  kvm_mmu_page_fault
  |  pf_interception
  |  handle_exit
  |  kvm_arch_vcpu_ioctl_run
  |  kvm_vcpu_ioctl
  |  do_vfs_ioctl
  |  sys_ioctl
  |  system_call_fastpath
  |  ioctl
  |  |  
  |  |--95.17%-- 0x1010006
  |  |  
  |   --4.83%-- 0x1010002
   --0.01%-- [...]
15.98% qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave   
 
   |
   ---