Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
On Mon, May 24, 2010 at 04:05:29PM +0900, Takuya Yoshikawa wrote:
> (2010/05/17 18:06), Takuya Yoshikawa wrote:
> > > User allocated bitmaps have the advantage of reducing pinned memory.
> > > However we have plenty more pinned memory allocated in memory slots, so
> > > by itself, user allocated bitmaps don't justify this change.
>
> Sorry for pinging several times.
>
> > In that sense, what do you think about the question I sent last week?
> >
> > === REPOST 1 ===
> >
> > > > mark_page_dirty is called with the mmu_lock spinlock held in set_spte.
> > > > Must find a way to move it outside of the spinlock section.
>
> I am now trying to do something to solve this spinlock problem. But the
> spinlock section looks too wide to solve with a simple workaround.
>
> > Sorry, but I have to say that the mmu_lock spin_lock problem was completely
> > out of my mind. Although I looked through the code, it seems not easy to
> > move the set_bit_user to outside of the spinlock section without breaking
> > the semantics of its protection. So this may take some time to solve.
> >
> > But personally, I want to do something about x86's vmalloc()-every-time
> > problem even if moving dirty bitmaps to user space cannot be achieved soon.
> >
> > In that sense, do you mind if we do double buffering without moving dirty
> > bitmaps to user space?
> >
> > So I would be happy if you could give me any comments about these kinds of
> > other options.

What if you pin the bitmaps?

The alternative to that is to move mark_page_dirty(gfn) before acquisition of
mmu_lock, in the page fault paths. The downside of that is a potentially
(large?) number of false positives in the dirty bitmap.
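To make the "pin the bitmaps" option concrete, a rough sketch follows. The
struct and helper names are hypothetical, it assumes the single-argument
kmap_atomic(), and error unwinding is omitted; the only point is that a pinned
page can be written from kernel context without faulting, so setting the bit
while mmu_lock is held becomes legal.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/slab.h>

struct pinned_bitmap {                 /* hypothetical bookkeeping struct */
	struct page **pages;           /* one entry per pinned bitmap page */
	unsigned long nr_pages;
};

/* Pin the user pages backing the bitmap once, at registration time. */
static int pin_user_bitmap(unsigned long uaddr, unsigned long bits,
			   struct pinned_bitmap *pb)
{
	pb->nr_pages = DIV_ROUND_UP(BITS_TO_LONGS(bits) * sizeof(long),
				    PAGE_SIZE);
	pb->pages = kcalloc(pb->nr_pages, sizeof(*pb->pages), GFP_KERNEL);
	if (!pb->pages)
		return -ENOMEM;
	/* write access: we will set bits from kernel context;
	 * unwinding of partially pinned pages is omitted here */
	if (get_user_pages_fast(uaddr, pb->nr_pages, 1, pb->pages) !=
	    pb->nr_pages)
		return -EFAULT;
	return 0;
}

/* Safe under a spinlock: the page is pinned and kmap_atomic never sleeps. */
static void pinned_bitmap_set(struct pinned_bitmap *pb, unsigned long nr)
{
	void *kaddr = kmap_atomic(pb->pages[nr / (PAGE_SIZE * 8)]);

	set_bit(nr % (PAGE_SIZE * 8), (unsigned long *)kaddr);
	kunmap_atomic(kaddr);
}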
Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
(2010/06/01 19:55), Marcelo Tosatti wrote:
> > Sorry, but I have to say that the mmu_lock spin_lock problem was completely
> > out of my mind. Although I looked through the code, it seems not easy to
> > move the set_bit_user to outside of the spinlock section without breaking
> > the semantics of its protection. So this may take some time to solve.
> >
> > But personally, I want to do something about x86's vmalloc()-every-time
> > problem even if moving dirty bitmaps to user space cannot be achieved soon.
> >
> > In that sense, do you mind if we do double buffering without moving dirty
> > bitmaps to user space?
> >
> > So I would be happy if you could give me any comments about these kinds of
> > other options.
>
> What if you pin the bitmaps?

Yes, pinning bitmaps works. The small problem is that we need to hold a
dirty_bitmap_pages[] array for every slot, the size of this array depends on
the slot length, and of course the pinning itself. From the performance point
of view, having a double-sized vmalloc'ed area may be better.

> The alternative to that is to move mark_page_dirty(gfn) before acquisition
> of mmu_lock, in the page fault paths. The downside of that is a potentially
> (large?) number of false positives in the dirty bitmap.

Interesting, but probably dangerous.

From my experience, though this includes my personal view, removing the
vmalloc() currently done by x86 is the simplest and most effective change.

So if you don't mind, I want to double the size of the vmalloc'ed area for x86
without changing other parts -- if this one extra bitmap is problematic, dirty
logging itself would be in danger of failure: we need to have the same size at
the timing of the switch.

Make sense? We can consider moving dirty bitmaps to user space later.
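The "double-sized vmalloc'ed area" idea amounts to keeping two equally sized
kernel bitmaps per slot and flipping between them when the log is fetched. A
minimal sketch with hypothetical names follows; this is not the code from
patch 11/12, and synchronization against concurrent mark_page_dirty() callers
is omitted.

/* Hypothetical sketch of double buffering the per-slot dirty bitmap. */
struct slot_dirty_log {                 /* hypothetical; not kvm_memory_slot */
	unsigned long *bitmaps[2];      /* two same-sized vmalloc'ed buffers */
	int active;                     /* index currently written by the MMU */
	unsigned long nbits;
};

/*
 * Called from the get-dirty-log / switch path: hand the currently active
 * bitmap to the caller and make the spare (already cleared) one active.
 * No vmalloc() and no large copy inside the ioctl any more.
 */
static unsigned long *switch_dirty_bitmap(struct slot_dirty_log *log)
{
	unsigned long *full = log->bitmaps[log->active];

	log->active ^= 1;
	/* the caller copies 'full' out to user space and then clears it,
	 * so it can serve as the spare buffer for the next switch */
	return full;
}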
Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
On Tue, Jun 01, 2010 at 09:05:38PM +0900, Takuya Yoshikawa wrote:
> (2010/06/01 19:55), Marcelo Tosatti wrote:
> > > Sorry, but I have to say that the mmu_lock spin_lock problem was
> > > completely out of my mind. Although I looked through the code, it seems
> > > not easy to move the set_bit_user to outside of the spinlock section
> > > without breaking the semantics of its protection. So this may take some
> > > time to solve.
> > >
> > > But personally, I want to do something about x86's vmalloc()-every-time
> > > problem even if moving dirty bitmaps to user space cannot be achieved
> > > soon.
> > >
> > > In that sense, do you mind if we do double buffering without moving
> > > dirty bitmaps to user space?
> > >
> > > So I would be happy if you could give me any comments about these kinds
> > > of other options.
> >
> > What if you pin the bitmaps?
>
> Yes, pinning bitmaps works. The small problem is that we need to hold a
> dirty_bitmap_pages[] array for every slot, the size of this array depends on
> the slot length, and of course the pinning itself. From the performance
> point of view, having a double-sized vmalloc'ed area may be better.
>
> > The alternative to that is to move mark_page_dirty(gfn) before acquisition
> > of mmu_lock, in the page fault paths. The downside of that is a
> > potentially (large?) number of false positives in the dirty bitmap.
>
> Interesting, but probably dangerous.
>
> From my experience, though this includes my personal view, removing the
> vmalloc() currently done by x86 is the simplest and most effective change.
>
> So if you don't mind, I want to double the size of the vmalloc'ed area for
> x86 without changing other parts -- if this one extra bitmap is problematic,
> dirty logging itself would be in danger of failure: we need to have the same
> size at the timing of the switch.
>
> Make sense?

That seems the most sensible approach.

> We can consider moving dirty bitmaps to user space later.
Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
(2010/05/17 18:06), Takuya Yoshikawa wrote:
> > User allocated bitmaps have the advantage of reducing pinned memory.
> > However we have plenty more pinned memory allocated in memory slots, so by
> > itself, user allocated bitmaps don't justify this change.

Sorry for pinging several times.

> In that sense, what do you think about the question I sent last week?
>
> === REPOST 1 ===
>
> > > mark_page_dirty is called with the mmu_lock spinlock held in set_spte.
> > > Must find a way to move it outside of the spinlock section.

I am now trying to do something to solve this spinlock problem. But the
spinlock section looks too wide to solve with a simple workaround.

> Sorry, but I have to say that the mmu_lock spin_lock problem was completely
> out of my mind. Although I looked through the code, it seems not easy to
> move the set_bit_user to outside of the spinlock section without breaking
> the semantics of its protection. So this may take some time to solve.
>
> But personally, I want to do something about x86's vmalloc()-every-time
> problem even if moving dirty bitmaps to user space cannot be achieved soon.
>
> In that sense, do you mind if we do double buffering without moving dirty
> bitmaps to user space?
>
> So I would be happy if you could give me any comments about these kinds of
> other options.
>
> Thanks,
>   Takuya
>
> I know that the resource for vmalloc() is precious for x86, but even now, at
> the timing of get_dirty_log, we use the same amount of memory as double
> buffering.
>
> === 1 END ===
>
> > Perhaps if we optimize memory slot write protection (I have some ideas
> > about this) we can make the performance improvement more pronounced.
>
> It's really nice! Even now we can measure the performance improvement by
> introducing the switch ioctl when the guest is relatively idle, so the
> combination will be really effective!
>
> === REPOST 2 ===
>
> > Can you post such a test, for an idle large guest?
>
> OK, I'll do!
>
> Result of low workload test (running top during migration):
> first, 4GB guest, picked up slots[1] (len=3757047808) only
>
>     get.org    get.opt    switch.opt
>     1060875     310292        190335
>     1076754     301295        188600
>      655504     318284        196029
>      529769     301471           325
>      694796      70216        221172
>      651868     353073        196184
>      543339     312865        213236
>     1061938      72785        203090
>      689527     323901        249519
>      621364     323881           473
>     1063671      70703        192958
>      915903     336318        174008
>     1046462     332384           782
>     1037942      72783        190655
>      680122     318305        243544
>      688156     314935        193526
>      558658     265934        190550
>      652454     372135        196270
>      660140      68613           352
>     1101947     378642        186575
>         ...        ...           ...
>
> As expected, we've got the difference more clearly. In this case, switch.opt
> reduced the time by about 1/3 (0.1 msec) compared to get.opt for each
> iteration. And when the slot is cleaner, the ratio is bigger.
>
> === 2 END ===
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
> User allocated bitmaps have the advantage of reducing pinned memory. However
> we have plenty more pinned memory allocated in memory slots, so by itself,
> user allocated bitmaps don't justify this change.

In that sense, what do you think about the question I sent last week?

=== REPOST 1 ===

> > mark_page_dirty is called with the mmu_lock spinlock held in set_spte.
> > Must find a way to move it outside of the spinlock section.
>
> Oh, it's a serious problem. I have to consider it.

Avi, Marcelo,

Sorry, but I have to say that the mmu_lock spin_lock problem was completely
out of my mind. Although I looked through the code, it seems not easy to move
the set_bit_user to outside of the spinlock section without breaking the
semantics of its protection. So this may take some time to solve.

But personally, I want to do something about x86's vmalloc()-every-time
problem even if moving dirty bitmaps to user space cannot be achieved soon.

In that sense, do you mind if we do double buffering without moving dirty
bitmaps to user space?

I know that the resource for vmalloc() is precious for x86, but even now, at
the timing of get_dirty_log, we use the same amount of memory as double
buffering.

=== 1 END ===

> Perhaps if we optimize memory slot write protection (I have some ideas about
> this) we can make the performance improvement more pronounced.

It's really nice! Even now we can measure the performance improvement by
introducing the switch ioctl when the guest is relatively idle, so the
combination will be really effective!

=== REPOST 2 ===

> Can you post such a test, for an idle large guest?

OK, I'll do!

Result of low workload test (running top during migration):
first, 4GB guest, picked up slots[1] (len=3757047808) only

    get.org    get.opt    switch.opt
    1060875     310292        190335
    1076754     301295        188600
     655504     318284        196029
     529769     301471           325
     694796      70216        221172
     651868     353073        196184
     543339     312865        213236
    1061938      72785        203090
     689527     323901        249519
     621364     323881           473
    1063671      70703        192958
     915903     336318        174008
    1046462     332384           782
    1037942      72783        190655
     680122     318305        243544
     688156     314935        193526
     558658     265934        190550
     652454     372135        196270
     660140      68613           352
    1101947     378642        186575
        ...        ...           ...

As expected, we've got the difference more clearly. In this case, switch.opt
reduced the time by about 1/3 (0.1 msec) compared to get.opt for each
iteration. And when the slot is cleaner, the ratio is bigger.

=== 2 END ===
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
On 05/10/2010 03:26 PM, Takuya Yoshikawa wrote:
> > No doubt get.org -> get.opt is measurable, but get.opt -> switch.opt is
> > problematic. Have you tried profiling to see where the time is spent
> > (well, I can guess: clearing the write access from the sptes)?
>
> Sorry, but no, and I agree with your guess. Anyway, I want to do some
> profiling to confirm this guess.
>
> BTW, if we only think about the performance improvement in time, the
> optimized get (get.opt) may be enough at this stage. But if we consider
> future expansion, like using user-allocated bitmaps, the new APIs introduced
> for switch.opt won't become waste, I think, because we need a structure to
> get and export bitmap addresses.

User allocated bitmaps have the advantage of reducing pinned memory. However
we have plenty more pinned memory allocated in memory slots, so by itself,
user allocated bitmaps don't justify this change.

Perhaps if we optimize memory slot write protection (I have some ideas about
this) we can make the performance improvement more pronounced.
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
> > [To ppc people]
> >
> > Hi, Benjamin, Paul, Alex,
> >
> > Please see the patches 6,7/12. I must first say sorry that I have not
> > tested these yet. In that sense, these may not be of the quality needed
> > for precise reviews. But I will be happy if you would give me any
> > comments.
> >
> > Alex, could you help me? Though I have a plan to get a PPC box in the
> > future, currently I cannot test these.
>
> Could you please point me to a git tree where everything's readily applied?
> That would make testing a lot easier.

OK, I'll prepare one. Probably on sourceforge or somewhere, like Kemari.

Thanks,
  Takuya

> Alex
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
> > > In usual workload, the number of dirty pages varies a lot for each
> > > iteration and we should gain really a lot for relatively clean cases.
> >
> > Can you post such a test, for an idle large guest?
>
> OK, I'll do!

Result of low workload test (running top during migration):
first, 4GB guest, picked up slots[1] (len=3757047808) only

    get.org    get.opt    switch.opt
    1060875     310292        190335
    1076754     301295        188600
     655504     318284        196029
     529769     301471           325
     694796      70216        221172
     651868     353073        196184
     543339     312865        213236
    1061938      72785        203090
     689527     323901        249519
     621364     323881           473
    1063671      70703        192958
     915903     336318        174008
    1046462     332384           782
    1037942      72783        190655
     680122     318305        243544
     688156     314935        193526
     558658     265934        190550
     652454     372135        196270
     660140      68613           352
    1101947     378642        186575
        ...        ...           ...

As expected, we've got the difference more clearly. In this case, switch.opt
reduced the time by about 1/3 (0.1 msec) compared to get.opt for each
iteration. And when the slot is cleaner, the ratio is bigger.
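For reference, numbers of this kind can be approximated from user space by
timing the KVM_GET_DIRTY_LOG ioctl with rdtsc. The snippet below is only a
sketch of such a measurement (the figures above were taken inside the kernel);
it assumes a VM file descriptor obtained elsewhere, 4K pages, and trims most
error checks.

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/* Time one KVM_GET_DIRTY_LOG call for the given slot. */
uint64_t time_get_dirty_log(int vm_fd, int slot, unsigned long slot_bytes)
{
	struct kvm_dirty_log log = { .slot = slot };
	uint64_t t0, t1;

	/* one bit per 4K page, rounded up to whole bytes */
	log.dirty_bitmap = calloc(1, (slot_bytes / 4096 + 7) / 8);

	t0 = rdtsc();
	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
		perror("KVM_GET_DIRTY_LOG");
	t1 = rdtsc();

	free(log.dirty_bitmap);
	return t1 - t0;
}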
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
Takuya Yoshikawa wrote:
> Hi, sorry for sending from my personal account. The following series are all
> from me:
>
>   From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
>
> The 3rd version of moving dirty bitmaps to user space. From this version, we
> add x86, ppc and asm-generic people to the CC lists.
>
> [To KVM people]
>
> Sorry for being late to reply to your comments.
>
> Avi,
>  - I've written an answer to your question in patch 5/12:
>    drivers/vhost/vhost.c.
>  - I considered changing set_bit_user_non_atomic to an inline function, but
>    did not, because the other helpers in uaccess.h are written as macros.
>    Anyway, I hope that the x86 people will give us appropriate suggestions
>    about this.
>  - I thought that documentation about making bitmaps 64-bit aligned should
>    be written when we add an API to register user-allocated bitmaps, so
>    probably in the next series.
>
> Avi, Alex,
>  - Could you check the ia64 and ppc parts, please? I tried to keep the
>    logical changes as small as possible.
>
> I personally tried to build these with cross compilers. For ia64, I could
> confirm a successful build with my patch series. But book3s, even without my
> patch series, failed with the following errors:
>
>   arch/powerpc/kvm/book3s_paired_singles.c: In function 'kvmppc_emulate_paired_single':
>   arch/powerpc/kvm/book3s_paired_singles.c:1289: error: the frame size of 2288 bytes is larger than 2048 bytes
>   make[1]: *** [arch/powerpc/kvm/book3s_paired_singles.o] Error 1
>   make: *** [arch/powerpc/kvm] Error 2

This is bad. I haven't encountered that one at all so far, but I guess my
compiler version is different from yours. Sigh.

> About the changelog: there are two main changes from the 2nd version:
>  1. I changed the treatment of clean slots (see patch 1/12).
>     This was already applied today, thanks!
>  2. I changed the switch API (see patch 11/12).
>     To show this API's advantage, I also did a test (see the end of this
>     mail).
>
> [To x86 people]
>
> Hi, Thomas, Ingo, Peter,
>
> Please review the patches 4,5/12. Because this is my first experience
> sending patches to x86, please tell me if anything is lacking.
>
> [To ppc people]
>
> Hi, Benjamin, Paul, Alex,
>
> Please see the patches 6,7/12. I must first say sorry that I have not tested
> these yet. In that sense, these may not be of the quality needed for precise
> reviews. But I will be happy if you would give me any comments.
>
> Alex, could you help me? Though I have a plan to get a PPC box in the
> future, currently I cannot test these.

Could you please point me to a git tree where everything's readily applied?
That would make testing a lot easier.

Alex
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
On 05/04/2010 03:56 PM, Takuya Yoshikawa wrote:
> [Performance test]
>
> We measured the tsc needed for the ioctl()s for getting the dirty logs in
> the kernel.
>
> Test environment:
>   AMD Phenom(tm) 9850 Quad-Core Processor with 8GB memory
>
> 1. GUI test (running an Ubuntu guest in graphical mode)
>
>   sudo qemu-system-x86_64 -hda dirtylog_test.img -boot c -m 4192 -net ...
>
> We show a relatively stable part to compare how much time is needed for the
> basic parts of the dirty log ioctl.
>
>                        get.org    get.opt    switch.opt
>   slots[7].len=32768    278379      66398         64024
>   slots[8].len=32768    181246        270           160
>   slots[7].len=32768    263961      64673         64494
>   slots[8].len=32768    181655        265           160
>   slots[7].len=32768    263736      64701         64610
>   slots[8].len=32768    182785        267           160
>   slots[7].len=32768    260925      65360         65042
>   slots[8].len=32768    182579        264           160
>   slots[7].len=32768    267823      65915         65682
>   slots[8].len=32768    186350        271           160
>
> At a glance, we see that our optimization improved significantly compared to
> the original get dirty log ioctl. This is true for both get.opt and
> switch.opt. This has a really big impact for personal KVM users who drive
> KVM in GUI mode on their usual PCs.
>
> Next, we notice that switch.opt improved by a hundred nanoseconds or so for
> these slots. Although this may sound like a tiny improvement, we can feel it
> as a difference in the GUI's responses, like mouse reactions.

100 ns... this is a bit on the low side (and if you can measure it
interactively you have much better reflexes than I).

> To feel the difference, please try the GUI on your PC with our patch series!

No doubt get.org -> get.opt is measurable, but get.opt -> switch.opt is
problematic. Have you tried profiling to see where the time is spent (well, I
can guess: clearing the write access from the sptes)?

> 2. Live-migration test (4GB guest, write loop with 1GB buf)
>
> We also did a live-migration test.
>
>                             get.org    get.opt    switch.opt
>   slots[0].len=655360        797383     261144        222181
>   slots[1].len=3757047808   2186721    1965244       1842824
>   slots[2].len=637534208    1433562    1012723       1031213
>   slots[3].len=131072        216858        331           331
>   slots[4].len=131072        121635        225           164
>   slots[5].len=131072        120863        356           164
>   slots[6].len=16777216      121746       1133           156
>   slots[7].len=32768         120415        230           278
>   slots[8].len=32768         120368        216           149
>   slots[0].len=655360        806497     194710        223582
>   slots[1].len=3757047808   2142922    1878025       1895369
>   slots[2].len=637534208    1386512    1021309       1000345
>   slots[3].len=131072        221118        459           296
>   slots[4].len=131072        121516        272           166
>   slots[5].len=131072        122652        244           173
>   slots[6].len=16777216      123226      99185           149
>   slots[7].len=32768         121803        457           505
>   slots[8].len=32768         121586        216           155
>   slots[0].len=655360        766113     211317        213179
>   slots[1].len=3757047808   2155662    1974790       1842361
>   slots[2].len=637534208    1481411    1020004       1031352
>   slots[3].len=131072        223100        351           295
>   slots[4].len=131072        122982        436           164
>   slots[5].len=131072        122100        300           503
>   slots[6].len=16777216      123653        779           151
>   slots[7].len=32768         122617        284           157
>   slots[8].len=32768         122737        253           149
>
> For slots other than 0, 1, 2 we can see a similar improvement. Considering
> the fact that switch.opt does not depend on the bitmap length except for
> kvm_mmu_slot_remove_write_access(), that is the cause of the usec-to-msec
> time consumption; there might also be some context switches.
>
> But note that this was done with a workload which dirtied the memory
> endlessly during the live migration. In a usual workload, the number of
> dirty pages varies a lot for each iteration and we should gain really a lot
> for relatively clean cases.

Can you post such a test, for an idle large guest?
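Avi's guess about where the remaining time goes can be pictured with a purely
illustrative loop: whichever way the bitmap is handed to user space, the
slot's pages still have to be write-protected again so that new writes fault
and get logged. The types and the per-gfn helper below are stand-ins, not the
real kvm_mmu_slot_remove_write_access(); the cost being proportional to the
slot size is why get.opt and switch.opt would converge on large slots.

/* Illustrative pseudo-KVM types; not the real kernel structures. */
struct sketch_slot {
	unsigned long base_gfn;
	unsigned long npages;
};

/* Hypothetical per-gfn helper: would walk the rmap for this gfn and clear
 * the writable bit in every spte that maps it (body omitted). */
static void write_protect_gfn(struct sketch_slot *slot, unsigned long gfn)
{
	(void)slot;
	(void)gfn;
}

/* Conceptual shape of the write-protection pass over one memory slot. */
static void remove_write_access_sketch(struct sketch_slot *slot)
{
	unsigned long i;

	for (i = 0; i < slot->npages; ++i)
		write_protect_gfn(slot, slot->base_gfn + i);
}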
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
> >                        get.org    get.opt    switch.opt
> >   slots[7].len=32768    278379      66398         64024
> >   slots[8].len=32768    181246        270           160
> >   slots[7].len=32768    263961      64673         64494
> >   slots[8].len=32768    181655        265           160
> >   slots[7].len=32768    263736      64701         64610
> >   slots[8].len=32768    182785        267           160
> >   slots[7].len=32768    260925      65360         65042
> >   slots[8].len=32768    182579        264           160
> >   slots[7].len=32768    267823      65915         65682
> >   slots[8].len=32768    186350        271           160
> >
> > At a glance, we see that our optimization improved significantly compared
> > to the original get dirty log ioctl. This is true for both get.opt and
> > switch.opt. This has a really big impact for personal KVM users who drive
> > KVM in GUI mode on their usual PCs.
> >
> > Next, we notice that switch.opt improved by a hundred nanoseconds or so
> > for these slots. Although this may sound like a tiny improvement, we can
> > feel it as a difference in the GUI's responses, like mouse reactions.
>
> 100 ns... this is a bit on the low side (and if you can measure it
> interactively you have much better reflexes than I).
>
> > To feel the difference, please try the GUI on your PC with our patch
> > series!
>
> No doubt get.org -> get.opt is measurable, but get.opt -> switch.opt is
> problematic. Have you tried profiling to see where the time is spent (well,
> I can guess: clearing the write access from the sptes)?

Sorry, but no, and I agree with your guess. Anyway, I want to do some
profiling to confirm this guess.

BTW, if we only think about the performance improvement in time, the optimized
get (get.opt) may be enough at this stage. But if we consider future
expansion, like using user-allocated bitmaps, the new APIs introduced for
switch.opt won't become waste, I think, because we need a structure to get and
export bitmap addresses.

> > In usual workload, the number of dirty pages varies a lot for each
> > iteration and we should gain really a lot for relatively clean cases.
>
> Can you post such a test, for an idle large guest?

OK, I'll do!
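The "structure to get and export bitmap addresses" could look roughly like the
ioctl argument below. The name and layout are only a guess for illustration;
this is not the structure introduced in patch 11/12.

#include <linux/types.h>

/* Hypothetical argument for a switch-style dirty log ioctl. */
struct kvm_switched_dirty_log {
	__u32 slot;                /* memory slot to fetch/switch */
	__u32 flags;
	__u64 dirty_bitmap;        /* out: address of the filled bitmap */
	__u64 dirty_bitmap_old;    /* out: buffer now used for further logging;
	                            * could later carry a user-allocated
	                            * bitmap address instead */
};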
[RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
Hi, sorry for sending from my personal account. The following series are all
from me:

  From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

The 3rd version of moving dirty bitmaps to user space. From this version, we
add x86, ppc and asm-generic people to the CC lists.

[To KVM people]

Sorry for being late to reply to your comments.

Avi,
 - I've written an answer to your question in patch 5/12:
   drivers/vhost/vhost.c.
 - I considered changing set_bit_user_non_atomic to an inline function, but
   did not, because the other helpers in uaccess.h are written as macros.
   Anyway, I hope that the x86 people will give us appropriate suggestions
   about this.
 - I thought that documentation about making bitmaps 64-bit aligned should be
   written when we add an API to register user-allocated bitmaps, so probably
   in the next series.

Avi, Alex,
 - Could you check the ia64 and ppc parts, please? I tried to keep the logical
   changes as small as possible.

I personally tried to build these with cross compilers. For ia64, I could
confirm a successful build with my patch series. But book3s, even without my
patch series, failed with the following errors:

  arch/powerpc/kvm/book3s_paired_singles.c: In function 'kvmppc_emulate_paired_single':
  arch/powerpc/kvm/book3s_paired_singles.c:1289: error: the frame size of 2288 bytes is larger than 2048 bytes
  make[1]: *** [arch/powerpc/kvm/book3s_paired_singles.o] Error 1
  make: *** [arch/powerpc/kvm] Error 2

About the changelog: there are two main changes from the 2nd version:
 1. I changed the treatment of clean slots (see patch 1/12).
    This was already applied today, thanks!
 2. I changed the switch API (see patch 11/12).
    To show this API's advantage, I also did a test (see the end of this
    mail).

[To x86 people]

Hi, Thomas, Ingo, Peter,

Please review the patches 4,5/12. Because this is my first experience sending
patches to x86, please tell me if anything is lacking.

[To ppc people]

Hi, Benjamin, Paul, Alex,

Please see the patches 6,7/12. I must first say sorry that I have not tested
these yet. In that sense, these may not be of the quality needed for precise
reviews. But I will be happy if you would give me any comments.

Alex, could you help me? Though I have a plan to get a PPC box in the future,
currently I cannot test these.

[To asm-generic people]

Hi, Arnd,

Please review the patch 8/12. Is this kind of macro acceptable?

[Performance test]

We measured the tsc needed for the ioctl()s for getting the dirty logs in the
kernel.

Test environment:
  AMD Phenom(tm) 9850 Quad-Core Processor with 8GB memory

1. GUI test (running an Ubuntu guest in graphical mode)

  sudo qemu-system-x86_64 -hda dirtylog_test.img -boot c -m 4192 -net ...

We show a relatively stable part to compare how much time is needed for the
basic parts of the dirty log ioctl.

                       get.org    get.opt    switch.opt
  slots[7].len=32768    278379      66398         64024
  slots[8].len=32768    181246        270           160
  slots[7].len=32768    263961      64673         64494
  slots[8].len=32768    181655        265           160
  slots[7].len=32768    263736      64701         64610
  slots[8].len=32768    182785        267           160
  slots[7].len=32768    260925      65360         65042
  slots[8].len=32768    182579        264           160
  slots[7].len=32768    267823      65915         65682
  slots[8].len=32768    186350        271           160

At a glance, we see that our optimization improved significantly compared to
the original get dirty log ioctl. This is true for both get.opt and
switch.opt. This has a really big impact for personal KVM users who drive KVM
in GUI mode on their usual PCs.

Next, we notice that switch.opt improved by a hundred nanoseconds or so for
these slots. Although this may sound like a tiny improvement, we can feel it
as a difference in the GUI's responses, like mouse reactions.

To feel the difference, please try the GUI on your PC with our patch series!

2. Live-migration test (4GB guest, write loop with 1GB buf)

We also did a live-migration test.

                            get.org    get.opt    switch.opt
  slots[0].len=655360        797383     261144        222181
  slots[1].len=3757047808   2186721    1965244       1842824
  slots[2].len=637534208    1433562    1012723       1031213
  slots[3].len=131072        216858        331           331
  slots[4].len=131072        121635        225           164
  slots[5].len=131072        120863        356           164
  slots[6].len=16777216      121746       1133           156
  slots[7].len=32768         120415        230           278
  slots[8].len=32768         120368        216           149
  slots[0].len=655360        806497     194710        223582
  slots[1].len=3757047808   2142922    1878025       1895369
  slots[2].len=637534208    1386512    1021309       1000345
  slots[3].len=131072        221118        459           296
  slots[4].len=131072        121516        272           166
  slots[5].len=131072        122652        244           173
  slots[6].len=16777216      123226      99185           149
  slots[7].len=32768         121803        457           505
  slots[8].len=32768         121586        216           155
  slots[0].len=655360        766113     211317        213179
  slots[1].len=3757047808   2155662    1974790       1842361
  slots[2].len=637534208    1481411    1020004       1031352
  slots[3].len=131072        223100        351           295
  slots[4].len=131072        122982        436           164
  slots[5].len=131072        122100        300           503
  slots[6].len=16777216      123653        779           151
  slots[7].len=32768         122617        284           157
  slots[8].len=32768         122737        253           149

For slots other than 0, 1, 2 we can see a similar improvement. Considering the
fact that switch.opt does not depend on the bitmap length except for
kvm_mmu_slot_remove_write_access(), that is the cause of the usec-to-msec time
consumption; there might also be some context switches.

But note that this was done with a workload which dirtied the memory endlessly
during the live migration. In a usual workload, the number of dirty pages
varies a lot for each iteration and we should gain really a lot for relatively
clean cases.
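As a footnote on the set_bit_user_non_atomic helper discussed in the cover
letter, the following is a rough sketch of what such a helper amounts to,
assuming the standard __get_user()/__put_user() primitives. It is not the
patch's actual code and skips the access_ok()/alignment handling a real
version would need.

#include <linux/types.h>
#include <linux/uaccess.h>

/*
 * Hypothetical non-atomic "set bit in user-space memory" helper,
 * byte-granular for simplicity.  It may fault on the user page, so it must
 * only be called where sleeping is allowed, i.e. not under a spinlock such
 * as mmu_lock -- exactly the constraint discussed in this thread.
 */
static inline int set_bit_user_non_atomic(int nr, void __user *addr)
{
	u8 __user *byte = (u8 __user *)addr + nr / 8;
	u8 val;

	if (__get_user(val, byte))
		return -EFAULT;
	val |= 1U << (nr % 8);
	if (__put_user(val, byte))
		return -EFAULT;
	return 0;
}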