Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-18 Thread Alexander Graf

On 18.02.2010, at 06:57, OHMURA Kei wrote:

>> "We think"? I mean - yes, I think so too. But have you actually measured 
>> it?
>> How much improvement are we talking here?
>> Is it still faster when a bswap is involved?
> Thanks for pointing out.
> I will post the data for x86 later.
> However, I don't have a test environment to check the impact of bswap.
> Would you please measure the run time between the following section if 
> possible?
 It'd make more sense to have a real stand alone test program, no?
 I can try to write one today, but I have some really nasty important bugs 
 to fix first.
>>> 
>>> OK.  I will prepare a test code with sample data.  Since I found a ppc 
>>> machine around, I will run the code and post the results of
>>> x86 and ppc.
>>> 
>>> 
>>> By the way, the following data is a result for x86 measured in QEMU/KVM.
>>> This data shows how many times the function is called (#called), the runtime
>>> of the original function (orig.), the runtime with this patch (patch), and
>>> the speedup ratio (ratio).
>> That does indeed look promising!
>> Thanks for doing this micro-benchmark. I just want to be 100% sure that it 
>> doesn't affect performance for big endian badly.
> 
> 
> I measured the runtime of the test code with sample data.  My test environment
> and results are described below.
> 
> x86 Test Environment:
> CPU: 4x Intel Xeon Quad Core 2.66GHz
> Mem size: 6GB
> 
> ppc Test Environment:
> CPU: 2x Dual Core PPC970MP
> Mem size: 2GB
> 
> The dirty bitmap sample data was produced by QEMU/KVM while the guest OS was
> being live-migrated.  To measure the runtime, I copied cpu_get_real_ticks()
> from QEMU into my test program.
> 
> 
> Experimental results:
> Test1: Guest OS read 3GB file, which is bigger than memory.
>        orig.(msec)   patch(msec)   ratio
> x86    0.3           0.1           6.4
> ppc    7.9           2.7           3.0
>
> Test2: Guest OS read/write 3GB file, which is bigger than memory.
>        orig.(msec)   patch(msec)   ratio
> x86    12.0          3.2           3.7
> ppc    251.1         123           2.0
> 
> I also measured the runtime of bswap itself on ppc, and found it was only
> 0.3% to 0.7% of the runtime described above.

Awesome! Thank you so much for giving actual data to make me feel comfortable 
with it :-).


Alex



Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-17 Thread OHMURA Kei

"We think"? I mean - yes, I think so too. But have you actually measured it?
How much improvement are we talking here?
Is it still faster when a bswap is involved?

>>>> Thanks for pointing out.
>>>> I will post the data for x86 later.
>>>> However, I don't have a test environment to check the impact of bswap.
>>>> Would you please measure the run time between the following section if possible?

>>> It'd make more sense to have a real stand alone test program, no?
>>> I can try to write one today, but I have some really nasty important bugs to
>>> fix first.


>> OK.  I will prepare a test code with sample data.  Since I found a ppc machine
>> around, I will run the code and post the results of x86 and ppc.


>> By the way, the following data is a result for x86 measured in QEMU/KVM.
>> This data shows how many times the function is called (#called), the runtime of
>> the original function (orig.), the runtime with this patch (patch), and the
>> speedup ratio (ratio).


> That does indeed look promising!
>
> Thanks for doing this micro-benchmark. I just want to be 100% sure that it
> doesn't affect performance for big endian badly.



I measured the runtime of the test code with sample data.  My test environment
and results are described below.


x86 Test Environment:
CPU: 4x Intel Xeon Quad Core 2.66GHz
Mem size: 6GB

ppc Test Environment:
CPU: 2x Dual Core PPC970MP
Mem size: 2GB

The dirty bitmap sample data was produced by QEMU/KVM while the guest OS was
being live-migrated.  To measure the runtime, I copied cpu_get_real_ticks()
from QEMU into my test program.


Experimental results:
Test1: Guest OS read 3GB file, which is bigger than memory.
       orig.(msec)   patch(msec)   ratio
x86    0.3           0.1           6.4
ppc    7.9           2.7           3.0

Test2: Guest OS read/write 3GB file, which is bigger than memory.
       orig.(msec)   patch(msec)   ratio
x86    12.0          3.2           3.7
ppc    251.1         123           2.0



I also measured the runtime of bswap itself on ppc, and found it was only
0.3% to 0.7% of the runtime described above.
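
A minimal standalone harness along these lines could look like the sketch below;
clock_gettime() is only a stand-in for QEMU's cpu_get_real_ticks(), and the sample
file name and the trivial traverse() function are placeholders for the real dump
and the byte-wise / word-wise functions under test, not code from the thread.

#include <stdio.h>
#include <time.h>

/* placeholder for the byte-wise / word-wise functions being compared */
static unsigned long traverse(const unsigned char *bitmap, size_t len)
{
    unsigned long dirty = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (bitmap[i]) {
            dirty++;
        }
    }
    return dirty;
}

static double now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    static unsigned char bitmap[1 << 20];
    FILE *f = fopen("dirty-bitmap.sample", "rb");   /* hypothetical sample dump */
    size_t len;
    unsigned long dirty;
    double t0, t1;

    if (!f) {
        perror("fopen");
        return 1;
    }
    len = fread(bitmap, 1, sizeof(bitmap), f);
    fclose(f);

    traverse(bitmap, len);            /* warm up caches first, as in the thread */

    t0 = now_ms();
    dirty = traverse(bitmap, len);
    t1 = now_ms();

    printf("%lu dirty bytes, runtime %.3f msec\n", dirty, t1 - t0);
    return 0;
}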






Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-17 Thread Alexander Graf

On 17.02.2010, at 10:47, Avi Kivity wrote:

> On 02/17/2010 11:42 AM, OHMURA Kei wrote:
> "We think"? I mean - yes, I think so too. But have you actually measured 
> it?
> How much improvement are we talking here?
> Is it still faster when a bswap is involved?
 Thanks for pointing out.
 I will post the data for x86 later.
 However, I don't have a test environment to check the impact of bswap.
 Would you please measure the run time between the following section if 
 possible?
>>> 
>>> It'd make more sense to have a real stand alone test program, no?
>>> I can try to write one today, but I have some really nasty important bugs 
>>> to fix first.
>> 
>> 
>> OK.  I will prepare a test code with sample data.  Since I found a ppc 
>> machine around, I will run the code and post the results of
>> x86 and ppc.
>> 
> 
> I've applied the patch - I think the x86 results justify it, and I'll be very 
> surprised if ppc doesn't show a similar gain.  Skipping 7 memory accesses and 
> 7 tests must be a win.

Sounds good to me. I don't assume bswap to be horribly slow either. Just want 
to be sure.


Alex



Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-17 Thread Avi Kivity

On 02/17/2010 11:42 AM, OHMURA Kei wrote:
"We think"? I mean - yes, I think so too. But have you actually 
measured it?

How much improvement are we talking here?
Is it still faster when a bswap is involved?

>>> Thanks for pointing out.
>>> I will post the data for x86 later.
>>> However, I don't have a test environment to check the impact of bswap.
>>> Would you please measure the run time between the following section
>>> if possible?


>> It'd make more sense to have a real stand alone test program, no?
>> I can try to write one today, but I have some really nasty important
>> bugs to fix first.



> OK.  I will prepare a test code with sample data.  Since I found a ppc
> machine around, I will run the code and post the results of x86 and ppc.



I've applied the patch - I think the x86 results justify it, and I'll be 
very surprised if ppc doesn't show a similar gain.  Skipping 7 memory 
accesses and 7 tests must be a win.



--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-17 Thread Alexander Graf

On 17.02.2010, at 10:42, OHMURA Kei wrote:

 "We think"? I mean - yes, I think so too. But have you actually measured 
 it?
 How much improvement are we talking here?
 Is it still faster when a bswap is involved?
>>> Thanks for pointing out.
>>> I will post the data for x86 later.
>>> However, I don't have a test environment to check the impact of bswap.
>>> Would you please measure the run time between the following section if 
>>> possible?
>> It'd make more sense to have a real stand alone test program, no?
>> I can try to write one today, but I have some really nasty important bugs to 
>> fix first.
> 
> 
> OK.  I will prepare a test code with sample data.  Since I found a ppc 
> machine around, I will run the code and post the results of
> x86 and ppc.
> 
> 
> By the way, the following data is a result for x86 measured in QEMU/KVM.
> This data shows how many times the function is called (#called), the runtime of
> the original function (orig.), the runtime with this patch (patch), and the
> speedup ratio (ratio).

That does indeed look promising!

Thanks for doing this micro-benchmark. I just want to be 100% sure that it 
doesn't affect performance for big endian badly.


Alex



Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-17 Thread OHMURA Kei

"We think"? I mean - yes, I think so too. But have you actually measured it?
How much improvement are we talking here?
Is it still faster when a bswap is involved?

>> Thanks for pointing out.
>> I will post the data for x86 later.
>> However, I don't have a test environment to check the impact of bswap.
>> Would you please measure the run time between the following section if possible?


> It'd make more sense to have a real stand alone test program, no?
> I can try to write one today, but I have some really nasty important bugs to
> fix first.



OK.  I will prepare a test code with sample data.
Since I found a ppc machine around, I will run the code and post the results of
x86 and ppc.


By the way, the following data is a result for x86 measured in QEMU/KVM.
This data shows how many times the function is called (#called), the runtime of
the original function (orig.), the runtime with this patch (patch), and the
speedup ratio (ratio).


Test1: Guest OS read 3GB file, which is bigger than memory.
#called   orig.(msec)   patch(msec)   ratio
108       1.1           0.1           7.6
102       1.0           0.1           6.8
132       1.6           0.2           7.1

Test2: Guest OS read/write 3GB file, which is bigger than memory.
#called   orig.(msec)   patch(msec)   ratio
2394      33            7.7           4.3
2100      29            7.1           4.1
2832      40            9.9           4.0





Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-16 Thread Alexander Graf

On 16.02.2010, at 12:16, OHMURA Kei wrote:

>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>> How much improvement are we talking here?
>> Is it still faster when a bswap is involved?
> 
> Thanks for pointing out.
> I will post the data for x86 later.
> However, I don't have a test environment to check the impact of bswap.
> Would you please measure the run time between the following section if 
> possible?

It'd make more sense to have a real stand alone test program, no?
I can try to write one today, but I have some really nasty important bugs to 
fix first.


Alex



Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-16 Thread OHMURA Kei

"We think"? I mean - yes, I think so too. But have you actually measured it?
How much improvement are we talking here?
Is it still faster when a bswap is involved?


Thanks for pointing out.
I will post the data for x86 later.
However, I don't have a test environment to check the impact of bswap.
Would you please measure the run time between the following section if possible?

start ->
qemu-kvm.c:

static int kvm_get_dirty_bitmap_cb(unsigned long start, unsigned long len,
                                   void *bitmap, void *opaque)
{
    /* warm up each function */
    kvm_get_dirty_pages_log_range(start, bitmap, start, len);
    kvm_get_dirty_pages_log_range_new(start, bitmap, start, len);

    /* measurement */
    int64_t t1, t2;
    t1 = cpu_get_real_ticks();
    kvm_get_dirty_pages_log_range(start, bitmap, start, len);
    t1 = cpu_get_real_ticks() - t1;
    t2 = cpu_get_real_ticks();
    kvm_get_dirty_pages_log_range_new(start, bitmap, start, len);
    t2 = cpu_get_real_ticks() - t2;

    printf("## %" PRId64 ", %" PRId64 "\n", t1, t2);
    fflush(stdout);

    /* run the new version once more so migration still gets its result */
    return kvm_get_dirty_pages_log_range_new(start, bitmap, start, len);
}
end ->
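
The v2 patch body is not quoted in this part of the thread, so the following is
only a self-contained sketch of the byte-wise versus word-wise traversal being
compared. All names here (traverse_by_byte, traverse_by_long, mark_page_dirty,
TARGET_PAGE_BITS assumed to mean 4K pages) are illustrative stand-ins for the
kvm_get_dirty_pages_log_range() pair above, not the merged code; the point is
only that testing one unsigned long at a time lets a mostly-clean bitmap be
skipped with one load and one branch per sizeof(long) bytes, which is the
"7 memory accesses and 7 tests" Avi mentions.

#include <stdint.h>
#include <string.h>

#define TARGET_PAGE_BITS 12                 /* assume 4K pages for the sketch */

static void mark_page_dirty(unsigned long addr)
{
    (void)addr;                             /* stand-in for the real dirty-page hook */
}

/* Byte-wise traversal: one load and one test per byte of the bitmap. */
void traverse_by_byte(const uint8_t *bitmap, unsigned long start,
                      unsigned long pages)
{
    unsigned long i, j;

    for (i = 0; i < pages / 8; i++) {
        if (bitmap[i]) {
            for (j = 0; j < 8; j++) {
                if (bitmap[i] & (1u << j)) {
                    mark_page_dirty(start + ((i * 8 + j) << TARGET_PAGE_BITS));
                }
            }
        }
    }
}

/* Word-wise traversal: one load and one test covers sizeof(long) bytes,
 * so long clean stretches are skipped with far fewer accesses. */
void traverse_by_long(const uint8_t *bitmap, unsigned long start,
                      unsigned long pages)
{
    unsigned long i, j, bits = sizeof(unsigned long) * 8;

    for (i = 0; i < pages / bits; i++) {
        unsigned long w;

        memcpy(&w, bitmap + i * sizeof(w), sizeof(w));
        /* On a big-endian host the byte-oriented bitmap would have to be
         * byte-swapped here (the bswap the thread is worried about) so that
         * bit j of w still means page i * bits + j. */
        if (w) {
            for (j = 0; j < bits; j++) {
                if (w & (1ul << j)) {
                    mark_page_dirty(start + ((i * bits + j) << TARGET_PAGE_BITS));
                }
            }
        }
    }
}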





Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-15 Thread Alexander Graf

On 15.02.2010, at 07:12, OHMURA Kei wrote:

> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
> But We think that dirty-bitmap-traveling by long size is faster than by byte

"We think"? I mean - yes, I think so too. But have you actually measured it? 
How much improvement are we talking here?
Is it still faster when a bswap is involved?

Alex



Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-10 Thread Alexander Graf
Avi Kivity wrote:
> On 02/10/2010 06:47 PM, Alexander Graf wrote:
>   
>>>> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
>>>> Unlike with x86, there's no real benefit in using 64 bit userspace.
>>> btw, does 32-bit ppc qemu support large memory guests? It doesn't on
>>> x86, and I don't remember any hacks to support large memory guests
>>> elsewhere.
>>>   
>>> 
>>>   
>> It doesn't :-). In fact, the guest we virtualize wouldn't work with > 2
>> GB anyways, because that needs an iommu implementation.
>>   
>> 
>
> Oh, so you may want to revisit the "there's no real benefit in using 64
> bit userspace".
>   

Well, for normal users they don't. SLES11 is 64-bit only, so we're good
on that. But openSUSE uses 32-bit userland.

> Seriously, that looks like a big deficiency. What would it take to
> implement an iommu?
>
> I imagine Anthony's latest patches are a first step in that journey.
>   

All reads/writes from PCI devices would need to go through a wrapper.
Maybe we could also define a per-device offset for memory accesses. That
way the overhead might be less.

Yes, Anthony's patches look like they are a really big step in that
direction.
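
To make that a bit more concrete, here is a purely hypothetical sketch of such a
wrapper with a per-device offset; none of this is Anthony's code or an existing
QEMU interface, it only illustrates where an iommu translation hook would sit
once every device access funnels through one function.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define GUEST_RAM_SIZE 4096

static uint8_t guest_ram[GUEST_RAM_SIZE];   /* stand-in for guest memory */

typedef struct DeviceDMA {
    uint64_t offset;        /* per-device translation; a real iommu model
                               would replace this with a table lookup */
} DeviceDMA;

static void device_dma_write(DeviceDMA *dev, uint64_t addr,
                             const void *buf, size_t len)
{
    memcpy(guest_ram + dev->offset + addr, buf, len);
}

static void device_dma_read(DeviceDMA *dev, uint64_t addr,
                            void *buf, size_t len)
{
    memcpy(buf, guest_ram + dev->offset + addr, len);
}

int main(void)
{
    DeviceDMA nic = { .offset = 0x100 };
    uint8_t out[4] = { 1, 2, 3, 4 }, in[4];

    /* the device thinks it accesses address 0; the wrapper relocates it */
    device_dma_write(&nic, 0, out, sizeof(out));
    device_dma_read(&nic, 0, in, sizeof(in));
    printf("read back: %d %d %d %d\n", in[0], in[1], in[2], in[3]);
    return 0;
}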


Alex




Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-10 Thread Avi Kivity
On 02/10/2010 06:47 PM, Alexander Graf wrote:
>>> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
>>> Unlike with x86, there's no real benefit in using 64 bit userspace.
>>>   
>>> 
>>>   
>> btw, does 32-bit ppc qemu support large memory guests? It doesn't on
>> x86, and I don't remember any hacks to support large memory guests
>> elsewhere.
>>   
>> 
>
> It doesn't :-). In fact, the guest we virtualize wouldn't work with > 2
> GB anyways, because that needs an iommu implementation.
>   

Oh, so you may want to revisit the "there's no real benefit in using 64
bit userspace".

Seriously, that looks like a big deficiency. What would it take to
implement an iommu?

I imagine Anthony's latest patches are a first step in that journey.

-- 
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-10 Thread Alexander Graf
Avi Kivity wrote:
> On 02/10/2010 06:43 PM, Alexander Graf wrote:
>   
>>> Out of curiosity, why? It seems like an odd interface.
>>>   
>>> 
>>>   
>> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
>> Unlike with x86, there's no real benefit in using 64 bit userspace.
>>   
>> 
>
> btw, does 32-bit ppc qemu support large memory guests? It doesn't on
> x86, and I don't remember any hacks to support large memory guests
> elsewhere.
>   


It doesn't :-). In fact, the guest we virtualize wouldn't work with > 2
GB anyways, because that needs an iommu implementation.


Alex




Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-10 Thread Avi Kivity
On 02/10/2010 06:43 PM, Alexander Graf wrote:
>
>> Out of curiosity, why? It seems like an odd interface.
>>   
>> 
> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
> Unlike with x86, there's no real benefit in using 64 bit userspace.
>   

btw, does 32-bit ppc qemu support large memory guests? It doesn't on
x86, and I don't remember any hacks to support large memory guests
elsewhere.

-- 
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-10 Thread Avi Kivity
On 02/10/2010 06:35 PM, Anthony Liguori wrote:
> On 02/10/2010 10:00 AM, Alexander Graf wrote:
>   
>> On PPC the bitmap is Little Endian.
>>   
>> 
> Out of curiosity, why? It seems like an odd interface.
>
>   

Exactly this issue. If you specify it as unsigned long native endian,
there is ambiguity between 32-bit and 64-bit userspace. If you specify
it as uint64_t native endian, you have an inefficient implementation on
32-bit userspace. So we went for unsigned byte native endian, which is
the same as any size little endian.

(well I think the real reason is that it just grew that way out of x86,
but the above is quite plausible).
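
A tiny self-contained example of why "byte-oriented, native endian" is the same
as "any width, little endian": page N's bit always sits in byte N/8, bit N%8,
and only a host that loads the bitmap an unsigned long at a time on big-endian
has to byte-swap. The helper below is a hypothetical stand-in modelled on the
leul_to_cpu() idea mentioned elsewhere in the thread, not an existing QEMU
function.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Byte-swap the word on big-endian hosts, no-op on little-endian ones.
 * (Assumes the GCC/Clang __BYTE_ORDER__ macro; sketch code only.) */
static unsigned long leul_to_cpu_sketch(unsigned long v)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    unsigned long r = 0;
    size_t i;

    for (i = 0; i < sizeof(v); i++) {
        r = (r << 8) | ((v >> (i * 8)) & 0xff);
    }
    return r;
#else
    return v;
#endif
}

int main(void)
{
    uint8_t bitmap[sizeof(unsigned long)] = { 0 };
    unsigned long word;

    bitmap[0] = 0x01;                       /* page 0 dirty: byte 0, bit 0 */

    memcpy(&word, bitmap, sizeof(word));
    word = leul_to_cpu_sketch(word);

    /* After the fixup, bit 0 of the word means page 0 on either host. */
    printf("page 0 dirty: %lu\n", word & 1);
    return 0;
}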

-- 
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-10 Thread Alexander Graf
Anthony Liguori wrote:
> On 02/10/2010 10:00 AM, Alexander Graf wrote:
>   
>> On PPC the bitmap is Little Endian.
>>   
>> 
>
> Out of curiosity, why? It seems like an odd interface.
>   

Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
Unlike with x86, there's no real benefit in using 64 bit userspace.

So thanks to the nature of big endianness, that breaks our set_bit
helpers, because they assume you're using "long" data types for the
bits. While that's no real issue on little endian, since the next int is
just the high part of a u64, it messes everything up on ppc.

For more details, please just look in the archives for my patches to make
it little endian.
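
As a stand-alone illustration (this is not the actual set_bit helper from Linux
or QEMU), the snippet below shows where a long-based set_bit puts "bit 0" in
memory, and why a byte-oriented reader of the same buffer disagrees with it on
a big-endian host:

#include <stdio.h>
#include <string.h>

/* Simplified long-based set_bit, for illustration only. */
static void set_bit_long(unsigned long *addr, unsigned int nr)
{
    addr[nr / (sizeof(unsigned long) * 8)] |=
        1ul << (nr % (sizeof(unsigned long) * 8));
}

int main(void)
{
    unsigned long words[1] = { 0 };
    unsigned char bytes[sizeof(unsigned long)];

    set_bit_long(words, 0);
    memcpy(bytes, words, sizeof(bytes));

    /* Little-endian host: bytes[0] == 0x01, matching a byte-oriented bitmap.
     * Big-endian host: bytes[0] == 0x00 and the bit lands in the last byte,
     * which is exactly the mismatch that led to defining the PPC dirty
     * bitmap as byte-oriented little endian. */
    printf("first byte in memory: 0x%02x\n", bytes[0]);
    return 0;
}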


Alex




Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-10 Thread Anthony Liguori
On 02/10/2010 10:00 AM, Alexander Graf wrote:
> On PPC the bitmap is Little Endian.
>   

Out of curiosity, why? It seems like an odd interface.

Regards,

Anthony Liguori





Re: [Qemu-devel] Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

2010-02-10 Thread Alexander Graf
Anthony Liguori wrote:
> On 02/10/2010 07:20 AM, Avi Kivity wrote:
>   
>> On 02/10/2010 12:52 PM, OHMURA Kei wrote:
>>   
>> 
>>> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
>>> But We think that dirty-bitmap-traveling by long size is faster than by byte
>>> size especially when most of memory is not dirty.
>>>
>>> --- a/bswap.h
>>> +++ b/bswap.h
>>> @@ -209,7 +209,6 @@ static inline void cpu_to_be32wu(uint32_t *p, uint32_t 
>>> v)
>>>  #define cpu_to_32wu cpu_to_le32wu
>>>  #endif
>>>  
>>> -#undef le_bswap
>>>  #undef be_bswap
>>>  #undef le_bswaps
>>>   
>>> 
>>>   
>> Anthony, is it okay to export le_bswap this way, or will you want
>> leul_to_cpu()?
>>   
>> 
>
> kvm_get_dirty_pages_log_range() is kvm-specific code. We're guaranteed
> that when we're using kvm, target byte order == host byte order.
>
> So is it really necessary to use a byte swapping function at all?
>   

On PPC the bitmap is Little Endian.


Alex