Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-06-01 Thread Marcelo Tosatti
On Mon, May 24, 2010 at 04:05:29PM +0900, Takuya Yoshikawa wrote:
 (2010/05/17 18:06), Takuya Yoshikawa wrote:
 
 User-allocated bitmaps have the advantage of reducing pinned memory.
 However, we have plenty more pinned memory allocated in memory slots,
 so by themselves, user-allocated bitmaps don't justify this change.
 
 Sorry for pinging several times.
 
 
 In that sense, what do you think about the question I sent last week?
 
 === REPOST 1 ===
  
   mark_page_dirty is called with the mmu_lock spinlock held in set_spte.
   Must find a way to move it outside of the spinlock section.
 
 I am now trying to do something to solve this spinlock problem, but
 the spinlock section looks too wide to solve with a simple workaround.
 
 Sorry, but I have to say that the mmu_lock spinlock problem had
 completely slipped my mind. Although I looked through the code, it
 seems hard to move set_bit_user outside of the spinlock section
 without breaking the semantics of its protection.
 
 So this may take some time to solve.
 
 But personally, I want to do something about x86's
 vmalloc()-on-every-call problem even though moving dirty bitmaps to
 user space cannot be achieved soon.
 
 In that sense, do you mind if we do double buffering without moving
 the dirty bitmaps to user space?
 
 I would be happy to get your comments on this kind of alternative
 option.

What if you pin the bitmaps? 

The alternative to that is to move mark_page_dirty(gfn) before the
acquisition of mmu_lock, in the page fault paths. The downside of that
is a potentially (large?) number of false positives in the dirty bitmap.
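
To make the trade-off concrete, here is a minimal sketch of that
reordering in a simplified fault path; the function shape is
illustrative, not the actual KVM code:

/*
 * Sketch only: mark the gfn dirty before taking mmu_lock, so the
 * bitmap update no longer runs inside the spinlock section. If the
 * spte update below is then lost to a race, the page is reported
 * dirty although no write hit it: a false positive, safe for
 * migration but costing extra transfer bandwidth.
 */
static int example_page_fault(struct kvm_vcpu *vcpu, gfn_t gfn)
{
        mark_page_dirty(vcpu->kvm, gfn);        /* moved out of the lock */

        spin_lock(&vcpu->kvm->mmu_lock);
        /* ... walk the shadow page table and call set_spte() ... */
        spin_unlock(&vcpu->kvm->mmu_lock);
        return 0;
}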



Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-06-01 Thread Takuya Yoshikawa

(2010/06/01 19:55), Marcelo Tosatti wrote:


 Sorry, but I have to say that the mmu_lock spinlock problem had
 completely slipped my mind. Although I looked through the code, it
 seems hard to move set_bit_user outside of the spinlock section
 without breaking the semantics of its protection.
 
 So this may take some time to solve.
 
 But personally, I want to do something about x86's
 vmalloc()-on-every-call problem even though moving dirty bitmaps to
 user space cannot be achieved soon.
 
 In that sense, do you mind if we do double buffering without moving
 the dirty bitmaps to user space?
 
 I would be happy to get your comments on this kind of alternative
 option.
 
 What if you pin the bitmaps?


Yes, pinning the bitmaps works. The small problems are that we need to
hold a dirty_bitmap_pages[] array for every slot, that the size of this
array depends on the slot length, and of course the pinning itself.

From the performance point of view, having a double-sized vmalloc'ed
area may be better.
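
For reference, the pinning variant would presumably look something like
the sketch below; dirty_bitmap_pages is the per-slot array mentioned
above, a hypothetical field that does not exist in the real
kvm_memory_slot:

/*
 * Illustrative only: pin the pages backing a user-space dirty bitmap
 * so the kernel can write them while holding locks.
 */
static int pin_user_bitmap(struct kvm_memory_slot *slot,
                           unsigned long __user *user_bitmap,
                           unsigned long bitmap_bytes)
{
        int npages = DIV_ROUND_UP(bitmap_bytes, PAGE_SIZE);

        /* one struct page pointer per bitmap page, so the array
         * length scales with the slot length */
        slot->dirty_bitmap_pages = kcalloc(npages, sizeof(struct page *),
                                           GFP_KERNEL);
        if (!slot->dirty_bitmap_pages)
                return -ENOMEM;

        /* pinning proper: the bitmap pages stay resident afterwards */
        return get_user_pages_fast((unsigned long)user_bitmap, npages,
                                   1 /* write */, slot->dirty_bitmap_pages);
}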



 The alternative to that is to move mark_page_dirty(gfn) before the
 acquisition of mmu_lock, in the page fault paths. The downside of that
 is a potentially (large?) number of false positives in the dirty bitmap.



Interesting, but probably dangerous.


From my experience, though this reflects my personal view, removing the
vmalloc() currently done by x86 is the simplest and most effective change.

So if you don't mind, I want to double the size of the vmalloc'ed area
for x86 without changing other parts.

 == if this one extra bitmap were problematic, dirty logging itself
 would already be in danger of failure: we need an allocation of the
 same size at the time of the switch.

Make sense?
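
A rough sketch of how I understand the doubled area would work (my own
illustration, not the actual patch):

/*
 * Sketch: one vmalloc'ed area of twice the bitmap size; at
 * GET_DIRTY_LOG time we flip between its two halves instead of
 * vmalloc'ing a fresh bitmap on every call.
 */
struct dirty_log_buffers {
        unsigned long *mem;             /* vmalloc'ed, 2 * bitmap_bytes */
        unsigned long *active;          /* half the mmu currently writes */
        unsigned long bitmap_bytes;
};

static unsigned long *switch_dirty_bitmap(struct dirty_log_buffers *b)
{
        unsigned long *old = b->active;
        unsigned long *fresh = (old == b->mem) ?
                (unsigned long *)((char *)b->mem + b->bitmap_bytes) : b->mem;

        memset(fresh, 0, b->bitmap_bytes);      /* reuse, no new vmalloc() */
        b->active = fresh;                      /* new dirty bits land here */
        return old;     /* caller copies this half out to user space */
}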


We can consider moving dirty bitmaps to user space later.


Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-06-01 Thread Marcelo Tosatti
On Tue, Jun 01, 2010 at 09:05:38PM +0900, Takuya Yoshikawa wrote:
 (2010/06/01 19:55), Marcelo Tosatti wrote:
 
 Sorry, but I have to say that the mmu_lock spinlock problem had
 completely slipped my mind. Although I looked through the code, it
 seems hard to move set_bit_user outside of the spinlock section
 without breaking the semantics of its protection.
 
 So this may take some time to solve.
 
 But personally, I want to do something about x86's
 vmalloc()-on-every-call problem even though moving dirty bitmaps to
 user space cannot be achieved soon.
 
 In that sense, do you mind if we do double buffering without moving
 the dirty bitmaps to user space?
 
 I would be happy to get your comments on this kind of alternative
 option.
 
 What if you pin the bitmaps?
 
 Yes, pinning the bitmaps works. The small problems are that we need to
 hold a dirty_bitmap_pages[] array for every slot, that the size of this
 array depends on the slot length, and of course the pinning itself.
 
 From the performance point of view, having a double-sized vmalloc'ed
 area may be better.
 
 
 The alternative to that is to move mark_page_dirty(gfn) before the
 acquisition of mmu_lock, in the page fault paths. The downside of that
 is a potentially (large?) number of false positives in the dirty bitmap.
 
 
 Interesting, but probably dangerous.
 
 
 From my experience, though this reflects my personal view, removing the
 vmalloc() currently done by x86 is the simplest and most effective change.
 
 So if you don't mind, I want to double the size of the vmalloc'ed area
 for x86 without changing other parts.
 
  == if this one extra bitmap were problematic, dirty logging itself
  would already be in danger of failure: we need an allocation of the
  same size at the time of the switch.
 
 Make sense?

That seems the most sensible approach.

 
 We can consider moving dirty bitmaps to user space later.


Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-24 Thread Takuya Yoshikawa

(2010/05/17 18:06), Takuya Yoshikawa wrote:



User-allocated bitmaps have the advantage of reducing pinned memory.
However, we have plenty more pinned memory allocated in memory slots,
so by themselves, user-allocated bitmaps don't justify this change.


Sorry for pinging several times.



In that sense, what do you think about the question I sent last week?

=== REPOST 1 ===
 
  mark_page_dirty is called with the mmu_lock spinlock held in set_spte.
  Must find a way to move it outside of the spinlock section.


I am now trying to do something to solve this spinlock problem, but the
spinlock section looks too wide to solve with a simple workaround.


Sorry, but I have to say that the mmu_lock spinlock problem had
completely slipped my mind. Although I looked through the code, it
seems hard to move set_bit_user outside of the spinlock section without
breaking the semantics of its protection.

So this may take some time to solve.

But personally, I want to do something about x86's vmalloc()-on-every-call
problem even though moving dirty bitmaps to user space cannot be
achieved soon.

In that sense, do you mind if we do double buffering without moving the
dirty bitmaps to user space?

I would be happy to get your comments on this kind of alternative
option.

Thanks,
  Takuya




I know that vmalloc() space is precious on x86, but even now, at
get_dirty_log time, we use the same amount of memory as double
buffering would.
=== 1 END ===




Perhaps if we optimize memory slot write protection (I have some ideas
about this) we can make the performance improvement more pronounced.



That would be really nice!

Even now we can measure a performance improvement from introducing the
switch ioctl when the guest is relatively idle, so the combination will
be really effective!

=== REPOST 2 ===
 
  Can you post such a test, for an idle large guest?
 
  OK, I'll do!


Result of low workload test (running top during migration) first,

4GB guest
picked up slots[1](len=3757047808) only
*
get.org get.opt switch.opt

1060875 310292 190335
1076754 301295 188600
655504 318284 196029
529769 301471 325
694796 70216 221172
651868 353073 196184
543339 312865 213236
1061938 72785 203090
689527 323901 249519
621364 323881 473
1063671 70703 192958
915903 336318 174008
1046462 332384 782
1037942 72783 190655
680122 318305 243544
688156 314935 193526
558658 265934 190550
652454 372135 196270
660140 68613 352
1101947 378642 186575
... ... ...
*

As expected, we see the difference more clearly here.

In this case, switch.opt reduced the time by about 1/3 (~0.1 msec)
compared to get.opt for each iteration.

And the cleaner the slot is, the bigger the reduction ratio gets.
=== 2 END ===


Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-17 Thread Takuya Yoshikawa



User-allocated bitmaps have the advantage of reducing pinned memory.
However, we have plenty more pinned memory allocated in memory slots,
so by themselves, user-allocated bitmaps don't justify this change.


In that sense, what do you think about the question I sent last week?

=== REPOST 1 ===

 mark_page_dirty is called with the mmu_lock spinlock held in set_spte.
 Must find a way to move it outside of the spinlock section.


 Oh, it's a serious problem. I have to consider it.

Avi, Marcelo,

Sorry, but I have to say that the mmu_lock spinlock problem had
completely slipped my mind. Although I looked through the code, it
seems hard to move set_bit_user outside of the spinlock section without
breaking the semantics of its protection.

So this may take some time to solve.

But personally, I want to do something about x86's vmalloc()-on-every-call
problem even though moving dirty bitmaps to user space cannot be
achieved soon.

In that sense, do you mind if we do double buffering without moving the
dirty bitmaps to user space?

I know that vmalloc() space is precious on x86, but even now, at
get_dirty_log time, we use the same amount of memory as double
buffering would.
=== 1 END ===
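
For reference, the pattern I mean is roughly the following paraphrase
(not an exact copy of the x86 code): on every GET_DIRTY_LOG call that
finds dirty pages, a fresh bitmap is vmalloc'ed and swapped in, so the
peak memory use already equals that of double buffering.

static int example_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
                                 struct kvm_memory_slot *memslot)
{
        unsigned long n = kvm_dirty_bitmap_bytes(memslot);
        unsigned long *fresh, *old;

        fresh = vmalloc(n);                     /* the per-call allocation */
        if (!fresh)
                return -ENOMEM;
        memset(fresh, 0, n);

        spin_lock(&kvm->mmu_lock);
        kvm_mmu_slot_remove_write_access(kvm, log->slot);
        spin_unlock(&kvm->mmu_lock);

        old = memslot->dirty_bitmap;
        memslot->dirty_bitmap = fresh;          /* swap in the fresh bitmap */

        if (copy_to_user(log->dirty_bitmap, old, n)) {
                vfree(old);
                return -EFAULT;
        }
        vfree(old);
        return 0;
}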




Perhaps if we optimize memory slot write protection (I have some ideas
about this) we can make the performance improvement more pronounced.



That would be really nice!

Even now we can measure a performance improvement from introducing the
switch ioctl when the guest is relatively idle, so the combination will
be really effective!

=== REPOST 2 ===

 Can you post such a test, for an idle large guest?

 OK, I'll do!


Result of low workload test (running top during migration) first,

4GB guest
picked up slots[1](len=3757047808) only
*
get.org  get.opt  switch.opt

1060875   310292   190335
1076754   301295   188600
 655504   318284   196029
 529769   301471      325
 694796    70216   221172
 651868   353073   196184
 543339   312865   213236
1061938    72785   203090
 689527   323901   249519
 621364   323881      473
1063671    70703   192958
 915903   336318   174008
1046462   332384      782
1037942    72783   190655
 680122   318305   243544
 688156   314935   193526
 558658   265934   190550
 652454   372135   196270
 660140    68613      352
1101947   378642   186575
    ...      ...      ...
*

As expected, we see the difference more clearly here.

In this case, switch.opt reduced the time by about 1/3 (~0.1 msec)
compared to get.opt for each iteration.

And the cleaner the slot is, the bigger the reduction ratio gets.
=== 2 END ===


Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-13 Thread Avi Kivity

On 05/10/2010 03:26 PM, Takuya Yoshikawa wrote:

No doubt get.org -> get.opt is measurable, but get.opt -> switch.opt is
problematic. Have you tried profiling to see where the time is spent
(well, I can guess: clearing the write access from the sptes).



Sorry, but no, and I agree with your guess.
Anyway, I want to do some profiling to confirm it.


BTW, if we only think about improving the elapsed time, the optimized
get (get.opt) may be enough at this stage.

But if we consider future extensions like using user-allocated bitmaps,
the new APIs introduced for switch.opt won't be wasted, I think,
because we need a structure to get and export bitmap addresses.


User-allocated bitmaps have the advantage of reducing pinned memory.
However, we have plenty more pinned memory allocated in memory slots,
so by themselves, user-allocated bitmaps don't justify this change.


Perhaps if we optimize memory slot write protection (I have some ideas 
about this) we can make the performance improvement more pronounced.


--
error compiling committee.c: too many arguments to function



Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-12 Thread Takuya Yoshikawa



[To ppc people]

Hi, Benjamin, Paul, Alex,

Please see patches 6,7/12. I must first apologize that I have not
tested these yet, so they may not be of sufficient quality for a
precise review. But I would be happy to get any comments from you.

Alex, could you help me? Though I plan to get a PPC box in the future,
I currently cannot test these.



Could you please point me to a git tree where everything's readily
applied? That would make testing a lot easier.


OK, I'll prepare one. Probably on sourceforge or somewhere like Kemari.

Thanks,
  Takuya




Alex





Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-11 Thread Takuya Yoshikawa




In a usual workload, the number of dirty pages varies a lot from
iteration to iteration, and we should gain a lot for relatively clean
cases.


Can you post such a test, for an idle large guest?


OK, I'll do!



Result of low workload test (running top during migration) first,

4GB guest
picked up slots[1](len=3757047808) only
*
get.org  get.opt  switch.opt

1060875   310292   190335
1076754   301295   188600
 655504   318284   196029
 529769   301471      325
 694796    70216   221172
 651868   353073   196184
 543339   312865   213236
1061938    72785   203090
 689527   323901   249519
 621364   323881      473
1063671    70703   192958
 915903   336318   174008
1046462   332384      782
1037942    72783   190655
 680122   318305   243544
 688156   314935   193526
 558658   265934   190550
 652454   372135   196270
 660140    68613      352
1101947   378642   186575
    ...      ...      ...
*

As expected, we see the difference more clearly here.

In this case, switch.opt reduced the time by about 1/3 (~0.1 msec)
compared to get.opt for each iteration.

And the cleaner the slot is, the bigger the reduction ratio gets.


Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-11 Thread Alexander Graf
Takuya Yoshikawa wrote:
 Hi, sorry for sending from my personal account.
 The following series are all from me:

   From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

   The 3rd version of moving dirty bitmaps to user space.

 From this version, we add x86 and ppc and asm-generic people to CC lists.


 [To KVM people]

 Sorry for being late in replying to your comments.

 Avi,
 - I've written an answer to your question in patch 5/12:
   drivers/vhost/vhost.c.

 - I considered changing set_bit_user_non_atomic to an inline function,
   but did not, because the other helpers in uaccess.h are written as
   macros. Anyway, I hope the x86 people will give us appropriate
   suggestions about this.

 - I thought the documentation about making bitmaps 64-bit aligned
   should be written when we add an API to register user-allocated
   bitmaps, so probably in the next series.

 Avi, Alex,
 - Could you check the ia64 and ppc parts, please? I tried to keep the
   logical changes as small as possible.

   I personally tried to build these with cross compilers. For ia64, I
   could verify that the build succeeds with my patch series. But
   book3s failed, even without my patch series, with the following
   errors:

   arch/powerpc/kvm/book3s_paired_singles.c: In function 'kvmppc_emulate_paired_single':
   arch/powerpc/kvm/book3s_paired_singles.c:1289: error: the frame size of 2288 bytes is larger than 2048 bytes
   make[1]: *** [arch/powerpc/kvm/book3s_paired_singles.o] Error 1
   make: *** [arch/powerpc/kvm] Error 2

This is bad. I haven't encountered that one at all so far, but I guess
my compiler version is different from yours. Sigh.


 About changelog: there are two main changes from the 2nd version:
   1. I changed the treatment of clean slots (see patch 1/12).
  This was already applied today, thanks!
   2. I changed the switch API. (see patch 11/12).

 To show this API's advantage, I also did a test (see the end of this mail).


 [To x86 people]

 Hi, Thomas, Ingo, Peter,

 Please review patches 4,5/12. Because this is my first time sending
 patches to x86, please tell me if anything is missing.


 [To ppc people]

 Hi, Benjamin, Paul, Alex,

 Please see patches 6,7/12. I must first apologize that I have not
 tested these yet, so they may not be of sufficient quality for a
 precise review. But I would be happy to get any comments from you.

 Alex, could you help me? Though I plan to get a PPC box in the future,
 I currently cannot test these.

Could you please point me to a git tree where everything's readily
applied? That would make testing a lot easier.

Alex



Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-10 Thread Avi Kivity

On 05/04/2010 03:56 PM, Takuya Yoshikawa wrote:

[Performance test]

We measured the TSC cycles needed by the ioctl()s for getting dirty
logs in the kernel.

Test environment

   AMD Phenom(tm) 9850 Quad-Core Processor with 8GB memory


1. GUI test (running Ubuntu guest in graphical mode)

   sudo qemu-system-x86_64 -hda dirtylog_test.img -boot c -m 4192 -net ...

We show a relatively stable part to compare how much time is needed
for the basic parts of dirty log ioctl.

get.org   get.opt  switch.opt

slots[7].len=32768  278379 66398 64024
slots[8].len=32768  181246   270   160
slots[7].len=32768  263961 64673 64494
slots[8].len=32768  181655   265   160
slots[7].len=32768  263736 64701 64610
slots[8].len=32768  182785   267   160
slots[7].len=32768  260925 65360 65042
slots[8].len=32768  182579   264   160
slots[7].len=32768  267823 65915 65682
slots[8].len=32768  186350   271   160

At a glance, we can see that our optimization is a significant
improvement over the original get-dirty-log ioctl. This is true for
both get.opt and switch.opt. It has a really big impact for personal
KVM users who run KVM in GUI mode on their usual PCs.

Next, we notice that switch.opt improved these slots by a hundred
nanoseconds or so. Although this may sound like a tiny improvement, it
is perceptible as a difference in the GUI's responsiveness, e.g. mouse
reactions.
   


100 ns... this is a bit on the low side (and if you can measure it 
interactively you have much better reflexes than I).



To feel the difference, please try GUI on your PC with our patch series!
   


No doubt get.org -> get.opt is measurable, but get.opt -> switch.opt is
problematic. Have you tried profiling to see where the time is spent
(well, I can guess: clearing the write access from the sptes).
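
For context, the pass I am guessing at, kvm_mmu_slot_remove_write_access(),
looks roughly like this (paraphrased from the current code, details
elided):

/* Drop the writable bit from every shadow pte that maps the slot, so
 * the next guest write faults and re-marks the page dirty. The cost
 * scales with the number of shadow pages, hence the usecs-to-msecs. */
void example_slot_remove_write_access(struct kvm *kvm, int slot)
{
        struct kvm_mmu_page *sp;

        list_for_each_entry(sp, &kvm->arch.active_mmu_pages, link) {
                u64 *pt = sp->spt;
                int i;

                if (!test_bit(slot, sp->slot_bitmap))
                        continue;

                for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
                        if (is_shadow_present_pte(pt[i]))
                                pt[i] &= ~PT_WRITABLE_MASK;
        }
        kvm_flush_remote_tlbs(kvm);
}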




2. Live-migration test (4GB guest, write loop with 1GB buf)

We also did a live-migration test.

get.org   get.opt  switch.opt

slots[0].len=655360      797383    261144    222181
slots[1].len=3757047808 2186721   1965244   1842824
slots[2].len=637534208  1433562   1012723   1031213
slots[3].len=131072      216858       331       331
slots[4].len=131072      121635       225       164
slots[5].len=131072      120863       356       164
slots[6].len=16777216    121746      1133       156
slots[7].len=32768       120415       230       278
slots[8].len=32768       120368       216       149
slots[0].len=655360      806497    194710    223582
slots[1].len=3757047808 2142922   1878025   1895369
slots[2].len=637534208  1386512   1021309   1000345
slots[3].len=131072      221118       459       296
slots[4].len=131072      121516       272       166
slots[5].len=131072      122652       244       173
slots[6].len=16777216    123226     99185       149
slots[7].len=32768       121803       457       505
slots[8].len=32768       121586       216       155
slots[0].len=655360      766113    211317    213179
slots[1].len=3757047808 2155662   1974790   1842361
slots[2].len=637534208  1481411   1020004   1031352
slots[3].len=131072      223100       351       295
slots[4].len=131072      122982       436       164
slots[5].len=131072      122100       300       503
slots[6].len=16777216    123653       779       151
slots[7].len=32768       122617       284       157
slots[8].len=32768       122737       253       149

For slots other than 0, 1, and 2 we can see a similar improvement.

Considering that switch.opt does not depend on the bitmap length except
in kvm_mmu_slot_remove_write_access(), that function is the likely
cause of the usec-to-msec time consumption: there might also be some
context switches.

But note that this was done with a workload that dirtied the memory
endlessly during the live migration.

In a usual workload, the number of dirty pages varies a lot from
iteration to iteration, and we should gain a lot for relatively clean
cases.
   


Can you post such a test, for an idle large guest?

--
error compiling committee.c: too many arguments to function



Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-10 Thread Takuya Yoshikawa




get.org get.opt switch.opt

slots[7].len=32768 278379 66398 64024
slots[8].len=32768 181246 270 160
slots[7].len=32768 263961 64673 64494
slots[8].len=32768 181655 265 160
slots[7].len=32768 263736 64701 64610
slots[8].len=32768 182785 267 160
slots[7].len=32768 260925 65360 65042
slots[8].len=32768 182579 264 160
slots[7].len=32768 267823 65915 65682
slots[8].len=32768 186350 271 160

At a glance, we can see that our optimization is a significant
improvement over the original get-dirty-log ioctl. This is true for
both get.opt and switch.opt. It has a really big impact for personal
KVM users who run KVM in GUI mode on their usual PCs.

Next, we notice that switch.opt improved these slots by a hundred
nanoseconds or so. Although this may sound like a tiny improvement, it
is perceptible as a difference in the GUI's responsiveness, e.g. mouse
reactions.


100 ns... this is a bit on the low side (and if you can measure it
interactively you have much better reflexes than I).


To feel the difference, please try GUI on your PC with our patch series!


No doubt get.org -> get.opt is measurable, but get.opt -> switch.opt is
problematic. Have you tried profiling to see where the time is spent
(well, I can guess: clearing the write access from the sptes).


Sorry, but no, and I agree with your guess.
Anyway, I want to do some profiling to confirm it.


BTW, if we only think about improving the elapsed time, the optimized
get (get.opt) may be enough at this stage.

But if we consider future extensions like using user-allocated bitmaps,
the new APIs introduced for switch.opt won't be wasted, I think,
because we need a structure to get and export bitmap addresses.
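
To illustrate what I mean by such a structure, something in this spirit
(a hypothetical layout for illustration, not the actual uapi of patch
11/12):

/* Hypothetical "switch" ioctl argument: user space names a slot and
 * receives the address of the bitmap half that has just stopped
 * collecting dirty bits, which it may then read at leisure. */
struct kvm_switch_dirty_log {
        __u32 slot;
        __u32 padding;
        __u64 dirty_bitmap_addr;        /* out: half to be read by user */
};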




In a usual workload, the number of dirty pages varies a lot from
iteration to iteration, and we should gain a lot for relatively clean
cases.


Can you post such a test, for an idle large guest?


OK, I'll do!







[RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-04 Thread Takuya Yoshikawa
Hi, sorry for sending from my personal account.
The following series are all from me:

  From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

  The 3rd version of moving dirty bitmaps to user space.

From this version, we add x86 and ppc and asm-generic people to CC lists.


[To KVM people]

Sorry for being late in replying to your comments.

Avi,
 - I've written an answer to your question in patch 5/12: drivers/vhost/vhost.c.

 - I considered changing set_bit_user_non_atomic to an inline function,
   but did not, because the other helpers in uaccess.h are written as
   macros. Anyway, I hope the x86 people will give us appropriate
   suggestions about this (see the sketch after this list).

 - I thought the documentation about making bitmaps 64-bit aligned
   should be written when we add an API to register user-allocated
   bitmaps, so probably in the next series.
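
For reference, the inline form I considered would look roughly like the
sketch below (my illustration here, not the exact code in the series):

/*
 * Sketch of set_bit_user_non_atomic() as an inline: set one bit in a
 * user-space bitmap with a plain read-modify-write. Only safe where
 * a racing update of the same byte cannot happen (or is tolerable).
 */
static inline int set_bit_user_non_atomic(int nr, void __user *addr)
{
        u8 __user *byte = (u8 __user *)addr + nr / 8;
        u8 val;

        if (__get_user(val, byte))
                return -EFAULT;
        val |= 1 << (nr % 8);
        return __put_user(val, byte) ? -EFAULT : 0;
}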

Avi, Alex,
 - Could you check the ia64 and ppc parts, please? I tried to keep the logical
   changes as small as possible.

   I personally tried to build these with cross compilers. For ia64, I
   could verify that the build succeeds with my patch series. But
   book3s failed, even without my patch series, with the following
   errors:

  arch/powerpc/kvm/book3s_paired_singles.c: In function 'kvmppc_emulate_paired_single':
  arch/powerpc/kvm/book3s_paired_singles.c:1289: error: the frame size of 2288 bytes is larger than 2048 bytes
  make[1]: *** [arch/powerpc/kvm/book3s_paired_singles.o] Error 1
  make: *** [arch/powerpc/kvm] Error 2


About changelog: there are two main changes from the 2nd version:
  1. I changed the treatment of clean slots (see patch 1/12).
 This was already applied today, thanks!
  2. I changed the switch API. (see patch 11/12).

To show this API's advantage, I also did a test (see the end of this mail).


[To x86 people]

Hi, Thomas, Ingo, Peter,

Please review patches 4,5/12. Because this is my first time sending
patches to x86, please tell me if anything is missing.


[To ppc people]

Hi, Benjamin, Paul, Alex,

Please see patches 6,7/12. I must first apologize that I have not
tested these yet, so they may not be of sufficient quality for a
precise review. But I would be happy to get any comments from you.

Alex, could you help me? Though I plan to get a PPC box in the future,
I currently cannot test these.



[To asm-generic people]

Hi, Arnd,

Please review patch 8/12. Is this kind of macro acceptable?





[Performance test]

We measured the TSC cycles needed by the ioctl()s for getting dirty
logs in the kernel.
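
(The numbers below were taken inside the kernel; as a rough user-space
approximation of the same measurement, assuming a KVM vm fd and the
standard KVM_GET_DIRTY_LOG ioctl, the harness has this shape:)

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

/* Time one KVM_GET_DIRTY_LOG call for the given slot, in TSC cycles. */
static uint64_t time_get_dirty_log(int vm_fd, int slot, void *bitmap)
{
        struct kvm_dirty_log log;
        uint64_t t0, t1;

        memset(&log, 0, sizeof(log));
        log.slot = slot;
        log.dirty_bitmap = bitmap;

        t0 = rdtsc();
        ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
        t1 = rdtsc();
        return t1 - t0;
}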

Test environment

  AMD Phenom(tm) 9850 Quad-Core Processor with 8GB memory


1. GUI test (running Ubuntu guest in graphical mode)

  sudo qemu-system-x86_64 -hda dirtylog_test.img -boot c -m 4192 -net ...

We show a relatively stable part to compare how much time is needed
for the basic parts of dirty log ioctl.

   get.org   get.opt  switch.opt

slots[7].len=32768  278379 66398 64024
slots[8].len=32768  181246   270   160
slots[7].len=32768  263961 64673 64494
slots[8].len=32768  181655   265   160
slots[7].len=32768  263736 64701 64610
slots[8].len=32768  182785   267   160
slots[7].len=32768  260925 65360 65042
slots[8].len=32768  182579   264   160
slots[7].len=32768  267823 65915 65682
slots[8].len=32768  186350   271   160

At a glance, we can see that our optimization is a significant
improvement over the original get-dirty-log ioctl. This is true for
both get.opt and switch.opt. It has a really big impact for personal
KVM users who run KVM in GUI mode on their usual PCs.

Next, we notice that switch.opt improved these slots by a hundred
nanoseconds or so. Although this may sound like a tiny improvement, it
is perceptible as a difference in the GUI's responsiveness, e.g. mouse
reactions.

To feel the difference, please try GUI on your PC with our patch series!


2. Live-migration test (4GB guest, write loop with 1GB buf)

We also did a live-migration test.

   get.org   get.opt  switch.opt

slots[0].len=655360      797383    261144    222181
slots[1].len=3757047808 2186721   1965244   1842824
slots[2].len=637534208  1433562   1012723   1031213
slots[3].len=131072      216858       331       331
slots[4].len=131072      121635       225       164
slots[5].len=131072      120863       356       164
slots[6].len=16777216    121746      1133       156
slots[7].len=32768       120415       230       278
slots[8].len=32768       120368       216       149
slots[0].len=655360      806497    194710    223582
slots[1].len=3757047808 2142922   1878025   1895369
slots[2].len=637534208  1386512   1021309   1000345
slots[3].len=131072      221118       459       296
slots[4].len=131072      121516       272       166
slots[5].len=131072      122652       244       173