Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 18.02.2010, at 06:57, OHMURA Kei wrote:

>>>>>> "We think"? I mean - yes, I think so too. But have you actually
>>>>>> measured it?
>>>>>> How much improvement are we talking here?
>>>>>> Is it still faster when a bswap is involved?
>>>>> Thanks for pointing that out.
>>>>> I will post the data for x86 later.
>>>>> However, I don't have a test environment to check the impact of bswap.
>>>>> Would you please measure the run time of the following section if
>>>>> possible?
>>>> It'd make more sense to have a real stand-alone test program, no?
>>>> I can try to write one today, but I have some really nasty important
>>>> bugs to fix first.
>>> OK. I will prepare a test code with sample data. Since I found a ppc
>>> machine around, I will run the code and post the results of x86 and ppc.
>>>
>>> By the way, the following data is a result of x86 measured in QEMU/KVM.
>>> This data shows how many times the function is called (#called), the
>>> runtime of the original function (orig.), the runtime of this patch
>>> (patch), and the speedup ratio (ratio).
>> That does indeed look promising!
>> Thanks for doing this micro-benchmark. I just want to be 100% sure that
>> it doesn't affect performance for big endian badly.
>
> I measured the runtime of the test code with sample data. My test
> environment and results are described below.
>
> x86 Test Environment:
>   CPU: 4x Intel Xeon Quad Core 2.66GHz
>   Mem size: 6GB
>
> ppc Test Environment:
>   CPU: 2x Dual Core PPC970MP
>   Mem size: 2GB
>
> The sample data of the dirty bitmap was produced by QEMU/KVM while the
> guest OS was live migrating. To measure the runtime I copied
> cpu_get_real_ticks() of QEMU into my test program.
>
> Experimental results:
>
> Test1: Guest OS reads a 3GB file, which is bigger than memory.
>        orig.(msec)  patch(msec)  ratio
>   x86  0.3          0.1          6.4
>   ppc  7.9          2.7          3.0
>
> Test2: Guest OS reads/writes a 3GB file, which is bigger than memory.
>        orig.(msec)  patch(msec)  ratio
>   x86  12.0         3.2          3.7
>   ppc  251.1        123          2.0
>
> I also measured the runtime of bswap itself on ppc, and found it was only
> 0.3% ~ 0.7% of the runtime described above.

Awesome! Thank you so much for giving actual data to make me feel
comfortable with it :-).

Alex
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
"We think"? I mean - yes, I think so too. But have you actually measured it? How much improvement are we talking here? Is it still faster when a bswap is involved? Thanks for pointing out. I will post the data for x86 later. However, I don't have a test environment to check the impact of bswap. Would you please measure the run time between the following section if possible? It'd make more sense to have a real stand alone test program, no? I can try to write one today, but I have some really nasty important bugs to fix first. OK. I will prepare a test code with sample data. Since I found a ppc machine around, I will run the code and post the results of x86 and ppc. By the way, the following data is a result of x86 measured in QEMU/KVM. This data shows, how many times the function is called (#called), runtime of original function(orig.), runtime of this patch(patch), speedup ratio (ratio). That does indeed look promising! Thanks for doing this micro-benchmark. I just want to be 100% sure that it doesn't affect performance for big endian badly. I measured runtime of the test code with sample data. My test environment and results are described below. x86 Test Environment: CPU: 4x Intel Xeon Quad Core 2.66GHz Mem size: 6GB ppc Test Environment: CPU: 2x Dual Core PPC970MP Mem size: 2GB The sample data of dirty bitmap was produced by QEMU/KVM while the guest OS was live migrating. To measure the runtime I copied cpu_get_real_ticks() of QEMU to my test program. Experimental results: Test1: Guest OS read 3GB file, which is bigger than memory. orig.(msec)patch(msec)ratio x860.30.16.4 ppc7.92.73.0 Test2: Guest OS read/write 3GB file, which is bigger than memory. orig.(msec)patch(msec)ratio x8612.0 3.23.7 ppc251.1 1232.0 I also measured the runtime of bswap itself on ppc, and I found it was only just 0.3% ~ 0.7 % of the runtime described above.
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 17.02.2010, at 10:47, Avi Kivity wrote:

> On 02/17/2010 11:42 AM, OHMURA Kei wrote:
>>>>> "We think"? I mean - yes, I think so too. But have you actually
>>>>> measured it?
>>>>> How much improvement are we talking here?
>>>>> Is it still faster when a bswap is involved?
>>>> Thanks for pointing that out.
>>>> I will post the data for x86 later.
>>>> However, I don't have a test environment to check the impact of bswap.
>>>> Would you please measure the run time of the following section if
>>>> possible?
>>> It'd make more sense to have a real stand-alone test program, no?
>>> I can try to write one today, but I have some really nasty important
>>> bugs to fix first.
>> OK. I will prepare a test code with sample data. Since I found a ppc
>> machine around, I will run the code and post the results of x86 and ppc.
>
> I've applied the patch - I think the x86 results justify it, and I'll be
> very surprised if ppc doesn't show a similar gain. Skipping 7 memory
> accesses and 7 tests must be a win.

Sounds good to me. I don't assume bswap to be horribly slow either. Just
want to be sure.

Alex
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 02/17/2010 11:42 AM, OHMURA Kei wrote:
>>>> "We think"? I mean - yes, I think so too. But have you actually
>>>> measured it?
>>>> How much improvement are we talking here?
>>>> Is it still faster when a bswap is involved?
>>> Thanks for pointing that out.
>>> I will post the data for x86 later.
>>> However, I don't have a test environment to check the impact of bswap.
>>> Would you please measure the run time of the following section if
>>> possible?
>> It'd make more sense to have a real stand-alone test program, no?
>> I can try to write one today, but I have some really nasty important
>> bugs to fix first.
> OK. I will prepare a test code with sample data. Since I found a ppc
> machine around, I will run the code and post the results of x86 and ppc.

I've applied the patch - I think the x86 results justify it, and I'll be
very surprised if ppc doesn't show a similar gain. Skipping 7 memory
accesses and 7 tests must be a win.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 17.02.2010, at 10:42, OHMURA Kei wrote:

>>>> "We think"? I mean - yes, I think so too. But have you actually
>>>> measured it?
>>>> How much improvement are we talking here?
>>>> Is it still faster when a bswap is involved?
>>> Thanks for pointing that out.
>>> I will post the data for x86 later.
>>> However, I don't have a test environment to check the impact of bswap.
>>> Would you please measure the run time of the following section if
>>> possible?
>> It'd make more sense to have a real stand-alone test program, no?
>> I can try to write one today, but I have some really nasty important
>> bugs to fix first.
> OK. I will prepare a test code with sample data. Since I found a ppc
> machine around, I will run the code and post the results of x86 and ppc.
>
> By the way, the following data is a result of x86 measured in QEMU/KVM.
> This data shows how many times the function is called (#called), the
> runtime of the original function (orig.), the runtime of this patch
> (patch), and the speedup ratio (ratio).

That does indeed look promising!
Thanks for doing this micro-benchmark. I just want to be 100% sure that it
doesn't affect performance for big endian badly.

Alex
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
"We think"? I mean - yes, I think so too. But have you actually measured it? How much improvement are we talking here? Is it still faster when a bswap is involved? Thanks for pointing out. I will post the data for x86 later. However, I don't have a test environment to check the impact of bswap. Would you please measure the run time between the following section if possible? It'd make more sense to have a real stand alone test program, no? I can try to write one today, but I have some really nasty important bugs to fix first. OK. I will prepare a test code with sample data. Since I found a ppc machine around, I will run the code and post the results of x86 and ppc. By the way, the following data is a result of x86 measured in QEMU/KVM. This data shows, how many times the function is called (#called), runtime of original function(orig.), runtime of this patch(patch), speedup ratio (ratio). Test1: Guest OS read 3GB file, which is bigger than memory. #called orig.(msec) patch(msec) ratio 108 1.1 0.1 7.6 102 1.0 0.1 6.8 132 1.6 0.2 7.1 Test2: Guest OS read/write 3GB file, which is bigger than memory. #called orig.(msec) patch(msec) ratio 239433 7.7 4.3 210029 7.1 4.1 283240 9.9 4.0
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 16.02.2010, at 12:16, OHMURA Kei wrote:

>> "We think"? I mean - yes, I think so too. But have you actually measured
>> it?
>> How much improvement are we talking here?
>> Is it still faster when a bswap is involved?
>
> Thanks for pointing that out.
> I will post the data for x86 later.
> However, I don't have a test environment to check the impact of bswap.
> Would you please measure the run time of the following section if
> possible?

It'd make more sense to have a real stand-alone test program, no?
I can try to write one today, but I have some really nasty important bugs to
fix first.

Alex
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
"We think"? I mean - yes, I think so too. But have you actually measured it? How much improvement are we talking here? Is it still faster when a bswap is involved? Thanks for pointing out. I will post the data for x86 later. However, I don't have a test environment to check the impact of bswap. Would you please measure the run time between the following section if possible? start -> qemu-kvm.c: static int kvm_get_dirty_bitmap_cb(unsigned long start, unsigned long len, void *bitmap, void *opaque) { /* warm up each function */ kvm_get_dirty_pages_log_range(start, bitmap, start, len); kvm_get_dirty_pages_log_range_new(start, bitmap, start, len); /* measurement */ int64_t t1, t2; t1 = cpu_get_real_ticks(); kvm_get_dirty_pages_log_range(start, bitmap, start, len); t1 = cpu_get_real_ticks() - t1; t2 = cpu_get_real_ticks(); kvm_get_dirty_pages_log_range_new(start, bitmap, start, len); t2 = cpu_get_real_ticks() - t2; printf("## %zd, %zd\n", t1, t2); fflush(stdout); return kvm_get_dirty_pages_log_range_new(start, bitmap, start, len); } end ->
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 15.02.2010, at 07:12, OHMURA Kei wrote:

> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
> But We think that dirty-bitmap-traveling by long size is faster than by byte

"We think"? I mean - yes, I think so too. But have you actually measured it?
How much improvement are we talking here?
Is it still faster when a bswap is involved?

Alex
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
Avi Kivity wrote:
> On 02/10/2010 06:47 PM, Alexander Graf wrote:
>>>> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
>>>> Unlike with x86, there's no real benefit in using 64 bit userspace.
>>> btw, does 32-bit ppc qemu support large memory guests? It doesn't on
>>> x86, and I don't remember any hacks to support large memory guests
>>> elsewhere.
>> It doesn't :-). In fact, the guest we virtualize wouldn't work with > 2
>> GB anyways, because that needs an iommu implementation.
>
> Oh, so you may want to revisit the "there's no real benefit in using 64
> bit userspace".

Well, for normal users there isn't. SLES11 is 64-bit only, so we're good on
that. But openSUSE uses 32-bit userland.

> Seriously, that looks like a big deficiency. What would it take to
> implement an iommu?
>
> I imagine Anthony's latest patches are a first step in that journey.

All reads/writes from PCI devices would need to go through a wrapper. Maybe
we could also define a per-device offset for memory accesses. That way the
overhead might be less.

Yes, Anthony's patches look like they are a really big step in that
direction.

Alex
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 02/10/2010 06:47 PM, Alexander Graf wrote:
>>> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
>>> Unlike with x86, there's no real benefit in using 64 bit userspace.
>> btw, does 32-bit ppc qemu support large memory guests? It doesn't on
>> x86, and I don't remember any hacks to support large memory guests
>> elsewhere.
>
> It doesn't :-). In fact, the guest we virtualize wouldn't work with > 2
> GB anyways, because that needs an iommu implementation.

Oh, so you may want to revisit the "there's no real benefit in using 64 bit
userspace".

Seriously, that looks like a big deficiency. What would it take to implement
an iommu?

I imagine Anthony's latest patches are a first step in that journey.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
Avi Kivity wrote:
> On 02/10/2010 06:43 PM, Alexander Graf wrote:
>>> Out of curiosity, why? It seems like an odd interface.
>> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
>> Unlike with x86, there's no real benefit in using 64 bit userspace.
>
> btw, does 32-bit ppc qemu support large memory guests? It doesn't on
> x86, and I don't remember any hacks to support large memory guests
> elsewhere.

It doesn't :-). In fact, the guest we virtualize wouldn't work with > 2 GB
anyways, because that needs an iommu implementation.

Alex
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 02/10/2010 06:43 PM, Alexander Graf wrote:
>> Out of curiosity, why? It seems like an odd interface.
> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
> Unlike with x86, there's no real benefit in using 64 bit userspace.

btw, does 32-bit ppc qemu support large memory guests? It doesn't on x86,
and I don't remember any hacks to support large memory guests elsewhere.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 02/10/2010 06:35 PM, Anthony Liguori wrote:
> On 02/10/2010 10:00 AM, Alexander Graf wrote:
>> On PPC the bitmap is Little Endian.
>
> Out of curiosity, why? It seems like an odd interface.

Exactly this issue. If you specify it as unsigned long native endian, there
is ambiguity between 32-bit and 64-bit userspace. If you specify it as
uint64_t native endian, you have an inefficient implementation on 32-bit
userspace. So we went for unsigned byte native endian, which is the same as
little endian of any word size.

(Well, I think the real reason is that it just grew that way out of x86, but
the above is quite plausible.)

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
Anthony Liguori wrote:
> On 02/10/2010 10:00 AM, Alexander Graf wrote:
>> On PPC the bitmap is Little Endian.
>
> Out of curiosity, why? It seems like an odd interface.

Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
Unlike with x86, there's no real benefit in using 64 bit userspace.

So thanks to the nature of big endianness, that breaks our set_bit helpers,
because they assume you're using "long" data types for the bits. While
that's no real issue on little endian, since the next int is just the high
part of a u64, it messes everything up on ppc.

For more details, please just look in the archives for my patches to make it
little endian.

Alex
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
On 02/10/2010 10:00 AM, Alexander Graf wrote:
> On PPC the bitmap is Little Endian.

Out of curiosity, why? It seems like an odd interface.

Regards,

Anthony Liguori
Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling
Anthony Liguori wrote:
> On 02/10/2010 07:20 AM, Avi Kivity wrote:
>> On 02/10/2010 12:52 PM, OHMURA Kei wrote:
>>> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
>>> But We think that dirty-bitmap-traveling by long size is faster than by
>>> byte size especially when most of memory is not dirty.
>>>
>>> --- a/bswap.h
>>> +++ b/bswap.h
>>> @@ -209,7 +209,6 @@ static inline void cpu_to_be32wu(uint32_t *p, uint32_t v)
>>>  #define cpu_to_32wu cpu_to_le32wu
>>>  #endif
>>>
>>> -#undef le_bswap
>>>  #undef be_bswap
>>>  #undef le_bswaps
>> Anthony, is it okay to export le_bswap this way, or will you want
>> leul_to_cpu()?
> kvm_get_dirty_pages_log_range() is kvm-specific code. We're guaranteed
> that when we're using kvm, target byte order == host byte order.
>
> So is it really necessary to use a byte swapping function at all?

On PPC the bitmap is Little Endian.

Alex