Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-12 Thread Stefan Hajnoczi
On Tue, Mar 12, 2013 at 09:41:04AM +0100, Peter Lieven wrote:
> 
> On 12.03.2013 at 09:35, Stefan Hajnoczi wrote:
> 
> > On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> >> I have long had a few VMs which are very hard to migrate because of a
> >> lot of memory I/O. I found that finding the next dirty bit seemed to
> >> be one of the culprits (apart from locking, which Paolo is working on
> >> removing).
> >> 
> >> I have the following proposal, which seems to help a lot in my case.
> >> Just wanted to get some feedback.
> > 
> > Hi Peter,
> > Do you have any performance numbers for this patch?  I'm just curious
> > how big the win is.
> 
> Hi Stefan,
> 
> please see my recent email to the list with the final patch.
> The win is up to 100%. Worst-case execution time (whole
> array is zero) is halved on x86_64.

Thanks!

Stefan



Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-12 Thread Peter Lieven

On 12.03.2013 at 09:35, Stefan Hajnoczi wrote:

> On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
>> I have long had a few VMs which are very hard to migrate because of a
>> lot of memory I/O. I found that finding the next dirty bit seemed to
>> be one of the culprits (apart from locking, which Paolo is working on
>> removing).
>> 
>> I have the following proposal, which seems to help a lot in my case.
>> Just wanted to get some feedback.
> 
> Hi Peter,
> Do you have any performance numbers for this patch?  I'm just curious
> how big the win is.

Hi Stefan,

please see my recent email to the list with the final patch.
The win is up to 100%. Worst-case execution time (whole
array is zero) is halved on x86_64.

Peter

> 
> Stefan




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-12 Thread Stefan Hajnoczi
On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> I have long had a few VMs which are very hard to migrate because of a
> lot of memory I/O. I found that finding the next dirty bit seemed to
> be one of the culprits (apart from locking, which Paolo is working on
> removing).
> 
> I have the following proposal, which seems to help a lot in my case.
> Just wanted to get some feedback.

Hi Peter,
Do you have any performance numbers for this patch?  I'm just curious
how big the win is.

Stefan



Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Lieven
On 11.03.2013 18:07, Paolo Bonzini wrote:
> On 11/03/2013 18:06, ronnie sahlberg wrote:
>> Even more efficient might be to do a bitwise instead of a logical or:
>>
>>     if (tmp | d1 | d2 | d3) {
>>
>> That should remove 3 of the 4 conditional jumps
>> and should become 3 bitwise ORs and one conditional jump.
> 
> Without any serious profiling, please let the compiler do that.

Paolo is right, I ran some tests with gcc 4.6.3 on x86_64 (with -O3) and
tried the various ideas. They all made no significant difference. Even
unrolling to 8 unsigned longs didn't change anything.

What I tried was running 2^20 iterations of find_next_bit(bitfield, 4194304, 0);
I chose the bitfield to be 4 MByte, which equals a 16 GB VM. The bitfield was
completely zeroed, so find_next_bit had to run through the whole bitfield.

The original version took 1 minute and 10 seconds, whereas all others took
approx. 37-38 seconds, which is almost a 100% boost ;-)
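
A minimal sketch of such a micro-benchmark (illustrative only, not the
harness actually used; it assumes a 4 MByte field scanned in full, with
find_next_bit taking its size argument in bits, and links against
util/bitops.c):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* provided by util/bitops.c */
    unsigned long find_next_bit(const unsigned long *addr,
                                unsigned long size, unsigned long offset);

    int main(void)
    {
        const unsigned long nbytes = 4194304;        /* 4 MByte bitfield */
        const unsigned long nbits = nbytes * 8;
        unsigned long *bitfield = calloc(nbytes, 1); /* completely zeroed */
        volatile unsigned long sink = 0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < (1L << 20); i++) {
            /* the field is all zero, so every call scans to the end */
            sink = find_next_bit(bitfield, nbits, 0);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("2^20 iterations took %.2f s (last result %lu)\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
               sink);
        free(bitfield);
        return 0;
    }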

So I think this is the final version:

    while (size >= 4*BITS_PER_LONG) {
        unsigned long d1, d2, d3;
        tmp = *p;
        d1 = *(p+1);
        d2 = *(p+2);
        d3 = *(p+3);
        if (tmp) {
            goto found_middle;
        }
        if (d1 || d2 || d3) {
            break;
        }
        p += 4;
        result += 4*BITS_PER_LONG;
        size -= 4*BITS_PER_LONG;
    }

Peter




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Paolo Bonzini
On 11/03/2013 18:06, ronnie sahlberg wrote:
> Even more efficient might be to do a bitwise instead of a logical or:
> 
>     if (tmp | d1 | d2 | d3) {
> 
> That should remove 3 of the 4 conditional jumps
> and should become 3 bitwise ORs and one conditional jump.

Without any serious profiling, please let the compiler do that.

Paolo



Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread ronnie sahlberg
Even more efficient might be to do a bitwise instead of a logical or:

    if (tmp | d1 | d2 | d3) {

That should remove 3 of the 4 conditional jumps
and should become 3 bitwise ORs and one conditional jump.
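
A self-contained sketch of the idea (illustrative; the helper name is an
assumption, not code from the thread):

    /* Return the number of leading all-zero words, counted in groups of
     * four.  The bitwise | evaluates all four loads unconditionally, so
     * the hot path has one conditional branch per group instead of up to
     * four short-circuit branches. */
    static unsigned long skip_zero_quads(const unsigned long *p,
                                         unsigned long nwords)
    {
        unsigned long i = 0;
        while (i + 4 <= nwords) {
            if (p[i] | p[i+1] | p[i+2] | p[i+3]) {
                break;  /* some word in this group has a bit set */
            }
            i += 4;
        }
        return i;
    }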


On Mon, Mar 11, 2013 at 8:58 AM, Paolo Bonzini  wrote:
> On 11/03/2013 16:37, Peter Lieven wrote:
>>
>> On 11.03.2013 at 16:29, Paolo Bonzini wrote:
>>
>>> On 11/03/2013 16:24, Peter Lieven wrote:

> How would that be different in your patch?  But you can solve it by
> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
> BITS_PER_LONG.

 This is what I have now:

 diff --git a/util/bitops.c b/util/bitops.c
 index e72237a..b0dc93f 100644
 --- a/util/bitops.c
 +++ b/util/bitops.c
 @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, 
 unsigned long size,
 const unsigned long *p = addr + BITOP_WORD(offset);
 unsigned long result = offset & ~(BITS_PER_LONG-1);
 unsigned long tmp;
 +unsigned long d0,d1,d2,d3;

 if (offset >= size) {
 return size;
 }
 size -= result;
 -offset %= BITS_PER_LONG;
 +offset &= (BITS_PER_LONG-1);
 if (offset) {
 tmp = *(p++);
 tmp &= (~0UL << offset);
 @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, 
 unsigned long size,
 size -= BITS_PER_LONG;
 result += BITS_PER_LONG;
 }
 -while (size & ~(BITS_PER_LONG-1)) {
 +while (size >= 4*BITS_PER_LONG) {
 +d0 = *p;
 +d1 = *(p+1);
 +d2 = *(p+2);
 +d3 = *(p+3);
 +if (d0 || d1 || d2 || d3) {
 +break;
 +}
 +p+=4;
 +result += 4*BITS_PER_LONG;
 +size -= 4*BITS_PER_LONG;
 +}
 +while (size >= BITS_PER_LONG) {
 if ((tmp = *(p++))) {
 goto found_middle;
 }

>>>
>>> Minus the %= vs. &=,
>>>
>>> Reviewed-by: Paolo Bonzini 
>>>
>>> Perhaps:
>>>
>>>    tmp = *p;
>>>    d1 = *(p+1);
>>>    d2 = *(p+2);
>>>    d3 = *(p+3);
>>>    if (tmp) {
>>>        goto found_middle;
>>>    }
>>>    if (d1 || d2 || d3) {
>>>        break;
>>>    }
>>
>> I do not know what gcc internally makes of d0 || d1 || d2 || d3?
>
> It depends on the target and how expensive branches are.
>
>> I would guess it's something like one addition with carry and one test?
>
> It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
> followed by a compare-and-jump.
>
> That is, either:
>
>  test %r8, %r8
>  jz   second_loop
>  test %r9, %r9
>  jz   second_loop
>  test %r10, %r10
>  jz   second_loop
>  test %r11, %r11
>  jz   second_loop
>
> or
>
>  or %r9, %r8
>  or %r11, %r10
>  or %r8, %r10
>  jz   second_loop
>
> Don't let the length of the code fool you.  The processor knows how to
> optimize all of these, and GCC knows too.
>
>> Your proposed change would introduce two tests (maybe)?
>
> Yes, but I expect them to be fairly well predicted.
>
>> What about this, to be sure?
>>
>>    tmp = *p;
>>    d1 = *(p+1);
>>    d2 = *(p+2);
>>    d3 = *(p+3);
>>    if (tmp || d1 || d2 || d3) {
>>        if (tmp) {
>>            goto found_middle;
>
> I suspect that GCC would rewrite it to my version (definitely if it
> produces 4 compare-and-jumps; but possibly it does so even if it goes
> for bitwise ORs, I haven't checked).
>
> Regarding your other question ("one last thought. would it make sense to
> update only `size` in the while loops and compute the `result` at the end
> as `orgsize` - `size`?"), again the compiler knows better and might even
> do this for you.  It will likely drop the p increases and use p[result],
> so if you do that change you may even get the same code, only this time
> p is increased and you get an extra subtraction at the end. :)
>
> Bottom line: don't try to outsmart an optimizing C compiler on
> micro-optimization, unless you have benchmarked it and it shows there is
> a problem.
>
> Paolo
>
>>        }
>>        break;
>>    }
>>
>> Peter
>>
>
>



Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Paolo Bonzini
On 11/03/2013 16:37, Peter Lieven wrote:
> 
> On 11.03.2013 at 16:29, Paolo Bonzini wrote:
> 
>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>
 How would that be different in your patch?  But you can solve it by
 making two >= loops, one checking for 4*BITS_PER_LONG and one checking
 BITS_PER_LONG.
>>>
>>> This is what I have now:
>>>
>>> diff --git a/util/bitops.c b/util/bitops.c
>>> index e72237a..b0dc93f 100644
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, 
>>> unsigned long size,
>>> const unsigned long *p = addr + BITOP_WORD(offset);
>>> unsigned long result = offset & ~(BITS_PER_LONG-1);
>>> unsigned long tmp;
>>> +unsigned long d0,d1,d2,d3;
>>>
>>> if (offset >= size) {
>>> return size;
>>> }
>>> size -= result;
>>> -offset %= BITS_PER_LONG;
>>> +offset &= (BITS_PER_LONG-1);
>>> if (offset) {
>>> tmp = *(p++);
>>> tmp &= (~0UL << offset);
>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, 
>>> unsigned long size,
>>> size -= BITS_PER_LONG;
>>> result += BITS_PER_LONG;
>>> }
>>> -while (size & ~(BITS_PER_LONG-1)) {
>>> +while (size >= 4*BITS_PER_LONG) {
>>> +d0 = *p;
>>> +d1 = *(p+1);
>>> +d2 = *(p+2);
>>> +d3 = *(p+3);
>>> +if (d0 || d1 || d2 || d3) {
>>> +break;
>>> +}
>>> +p+=4;
>>> +result += 4*BITS_PER_LONG;
>>> +size -= 4*BITS_PER_LONG;
>>> +}
>>> +while (size >= BITS_PER_LONG) {
>>> if ((tmp = *(p++))) {
>>> goto found_middle;
>>> }
>>>
>>
>> Minus the %= vs. &=,
>>
>> Reviewed-by: Paolo Bonzini 
>>
>> Perhaps:
>>
>>    tmp = *p;
>>    d1 = *(p+1);
>>    d2 = *(p+2);
>>    d3 = *(p+3);
>>    if (tmp) {
>>        goto found_middle;
>>    }
>>    if (d1 || d2 || d3) {
>>        break;
>>    }
> 
> I do not know what gcc internally makes of d0 || d1 || d2 || d3?

It depends on the target and how expensive branches are.

> I would guess it's something like one addition with carry and one test?

It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
followed by a compare-and-jump.

That is, either:

 test %r8, %r8
 jz   second_loop
 test %r9, %r9
 jz   second_loop
 test %r10, %r10
 jz   second_loop
 test %r11, %r11
 jz   second_loop

or

 or %r9, %r8
 or %r11, %r10
 or %r8, %r10
 jz   second_loop

Don't let the length of the code fool you.  The processor knows how to
optimize all of these, and GCC knows too.
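
One way to see which form a given compiler picks (an illustrative experiment,
not something from the thread) is to compile a reduced test function with
"gcc -O3 -S" and read the assembly:

    /* Does the compiler emit four compare-and-jump sequences, or three
     * bitwise ORs followed by a single test?  Inspect the .s output. */
    int any_nonzero(unsigned long d0, unsigned long d1,
                    unsigned long d2, unsigned long d3)
    {
        return d0 || d1 || d2 || d3;
    }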

> Your proposed change would introduce two tests (maybe)?

Yes, but I expect them to be fairly well predicted.

> What about this, to be sure?
> 
>    tmp = *p;
>    d1 = *(p+1);
>    d2 = *(p+2);
>    d3 = *(p+3);
>    if (tmp || d1 || d2 || d3) {
>        if (tmp) {
>            goto found_middle;

I suspect that GCC would rewrite it to my version (definitely if it
produces 4 compare-and-jumps; but possibly it does so even if it goes
for bitwise ORs, I haven't checked).

Regarding your other question ("one last thought. would it make sense to
update only `size` in the while loops and compute the `result` at the end
as `orgsize` - `size`?"), again the compiler knows better and might even
do this for you.  It will likely drop the p increases and use p[result],
so if you do that change you may even get the same code, only this time
p is increased and you get an extra subtraction at the end. :)
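
As a concrete (and purely illustrative) sketch of that variant, with an
assumed helper name and a simplified BITS_PER_LONG definition:

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    /* Scan groups of four words, tracking only "size" in the loop and
     * deriving the number of skipped bits at the end: one subtraction
     * replaces the per-iteration additions to "result". */
    static unsigned long skipped_bits(const unsigned long *p,
                                      unsigned long size)
    {
        const unsigned long orgsize = size;
        while (size >= 4 * BITS_PER_LONG) {
            if (p[0] || p[1] || p[2] || p[3]) {
                break;          /* a set bit lies within these four words */
            }
            p += 4;
            size -= 4 * BITS_PER_LONG;
        }
        return orgsize - size;
    }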

Bottom line: don't try to outsmart an optimizing C compiler on
micro-optimization, unless you have benchmarked it and it shows there is
a problem.

Paolo

>        }
>        break;
>    }
> 
> Peter
> 




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Lieven

On 11.03.2013 at 16:42, Paolo Bonzini wrote:

> On 11/03/2013 16:41, Peter Lieven wrote:
>> Can you verify that this does not make a difference in the generated
>> object code? In buffer_is_zero() it's outside the loop.
> 
> No, it doesn't.

OK, I will send the final patch tomorrow.

One last thought: would it make sense to update only `size` in the while
loops and compute the `result` at the end as `orgsize` - `size`?

Peter


Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Paolo Bonzini
On 11/03/2013 16:41, Peter Lieven wrote:
> Can you verify that this does not make a difference in the generated
> object code? In buffer_is_zero() it's outside the loop.

No, it doesn't.

Paolo



Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Lieven

On 11.03.2013 at 16:37, Peter Maydell wrote:

> On 11 March 2013 15:24, Peter Lieven  wrote:
>> +unsigned long d0,d1,d2,d3;
> 
> These commas should have spaces after them. Also, since
> the variables are only used inside the scope of your
> newly added while loop:
> 
>> -while (size & ~(BITS_PER_LONG-1)) {
>> +while (size >= 4*BITS_PER_LONG) {
> 
> it would be better to declare them here.

Can you verify that this does not make a difference in the generated
object code? In buffer_is_zero() it's outside the loop.

Thanks
Peter



Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Maydell
On 11 March 2013 15:24, Peter Lieven  wrote:
> +unsigned long d0,d1,d2,d3;

These commas should have spaces after them. Also, since
the variables are only used inside the scope of your
newly added while loop:

> -while (size & ~(BITS_PER_LONG-1)) {
> +while (size >= 4*BITS_PER_LONG) {

it would be better to declare them here.

> +d0 = *p;
> +d1 = *(p+1);
> +d2 = *(p+2);
> +d3 = *(p+3);
> +if (d0 || d1 || d2 || d3) {
> +break;
> +}
> +p+=4;
> +result += 4*BITS_PER_LONG;
> +size -= 4*BITS_PER_LONG;
> +}
> +while (size >= BITS_PER_LONG) {
>  if ((tmp = *(p++))) {
>  goto found_middle;
>  }

thanks
-- PMM



Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Lieven

On 11.03.2013 at 16:29, Paolo Bonzini wrote:

> On 11/03/2013 16:24, Peter Lieven wrote:
>> 
>>> How would that be different in your patch?  But you can solve it by
>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>> BITS_PER_LONG.
>> 
>> This is what I have now:
>> 
>> diff --git a/util/bitops.c b/util/bitops.c
>> index e72237a..b0dc93f 100644
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, 
>> unsigned long size,
>> const unsigned long *p = addr + BITOP_WORD(offset);
>> unsigned long result = offset & ~(BITS_PER_LONG-1);
>> unsigned long tmp;
>> +unsigned long d0,d1,d2,d3;
>> 
>> if (offset >= size) {
>> return size;
>> }
>> size -= result;
>> -offset %= BITS_PER_LONG;
>> +offset &= (BITS_PER_LONG-1);
>> if (offset) {
>> tmp = *(p++);
>> tmp &= (~0UL << offset);
>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, 
>> unsigned long size,
>> size -= BITS_PER_LONG;
>> result += BITS_PER_LONG;
>> }
>> -while (size & ~(BITS_PER_LONG-1)) {
>> +while (size >= 4*BITS_PER_LONG) {
>> +d0 = *p;
>> +d1 = *(p+1);
>> +d2 = *(p+2);
>> +d3 = *(p+3);
>> +if (d0 || d1 || d2 || d3) {
>> +break;
>> +}
>> +p+=4;
>> +result += 4*BITS_PER_LONG;
>> +size -= 4*BITS_PER_LONG;
>> +}
>> +while (size >= BITS_PER_LONG) {
>> if ((tmp = *(p++))) {
>> goto found_middle;
>> }
>> 
> 
> Minus the %= vs. &=,
> 
> Reviewed-by: Paolo Bonzini 
> 
> Perhaps:
> 
>    tmp = *p;
>    d1 = *(p+1);
>    d2 = *(p+2);
>    d3 = *(p+3);
>    if (tmp) {
>        goto found_middle;
>    }
>    if (d1 || d2 || d3) {
>        break;
>    }

I do not know what gcc internally makes of d0 || d1 || d2 || d3?
I would guess it's something like one addition with carry and one test?

Your proposed change would introduce two tests (maybe)?

What about this, to be sure?

   tmp = *p;
   d1 = *(p+1);
   d2 = *(p+2);
   d3 = *(p+3);
   if (tmp || d1 || d2 || d3) {
   if (tmp) {
   goto found_middle;
   }
   break;
   }

Peter




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Maydell
On 11 March 2013 15:24, Peter Lieven  wrote:
> -offset %= BITS_PER_LONG;
> +offset &= (BITS_PER_LONG-1);

Still pointless.

-- PMM



Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Paolo Bonzini
On 11/03/2013 16:24, Peter Lieven wrote:
> 
>> How would that be different in your patch?  But you can solve it by
>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>> BITS_PER_LONG.
> 
> This is what I have now:
> 
> diff --git a/util/bitops.c b/util/bitops.c
> index e72237a..b0dc93f 100644
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, 
> unsigned long size,
>  const unsigned long *p = addr + BITOP_WORD(offset);
>  unsigned long result = offset & ~(BITS_PER_LONG-1);
>  unsigned long tmp;
> +unsigned long d0,d1,d2,d3;
>  
>  if (offset >= size) {
>  return size;
>  }
>  size -= result;
> -offset %= BITS_PER_LONG;
> +offset &= (BITS_PER_LONG-1);
>  if (offset) {
>  tmp = *(p++);
>  tmp &= (~0UL << offset);
> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, 
> unsigned long size,
>  size -= BITS_PER_LONG;
>  result += BITS_PER_LONG;
>  }
> -while (size & ~(BITS_PER_LONG-1)) {
> +while (size >= 4*BITS_PER_LONG) {
> +d0 = *p;
> +d1 = *(p+1);
> +d2 = *(p+2);
> +d3 = *(p+3);
> +if (d0 || d1 || d2 || d3) {
> +break;
> +}
> +p+=4;
> +result += 4*BITS_PER_LONG;
> +size -= 4*BITS_PER_LONG;
> +}
> +while (size >= BITS_PER_LONG) {
>  if ((tmp = *(p++))) {
>  goto found_middle;
>  }
> 

Minus the %= vs. &=,

Reviewed-by: Paolo Bonzini 

Perhaps:

    tmp = *p;
    d1 = *(p+1);
    d2 = *(p+2);
    d3 = *(p+3);
    if (tmp) {
        goto found_middle;
    }
    if (d1 || d2 || d3) {
        break;
    }

Paolo



Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Lieven

> How would that be different in your patch?  But you can solve it by
> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
> BITS_PER_LONG.

This is what I have now:

diff --git a/util/bitops.c b/util/bitops.c
index e72237a..b0dc93f 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, 
unsigned long size,
 const unsigned long *p = addr + BITOP_WORD(offset);
 unsigned long result = offset & ~(BITS_PER_LONG-1);
 unsigned long tmp;
+unsigned long d0,d1,d2,d3;
 
 if (offset >= size) {
 return size;
 }
 size -= result;
-offset %= BITS_PER_LONG;
+offset &= (BITS_PER_LONG-1);
 if (offset) {
 tmp = *(p++);
 tmp &= (~0UL << offset);
@@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, 
unsigned long size,
 size -= BITS_PER_LONG;
 result += BITS_PER_LONG;
 }
-while (size & ~(BITS_PER_LONG-1)) {
+while (size >= 4*BITS_PER_LONG) {
+d0 = *p;
+d1 = *(p+1);
+d2 = *(p+2);
+d3 = *(p+3);
+if (d0 || d1 || d2 || d3) {
+break;
+}
+p+=4;
+result += 4*BITS_PER_LONG;
+size -= 4*BITS_PER_LONG;
+}
+while (size >= BITS_PER_LONG) {
 if ((tmp = *(p++))) {
 goto found_middle;
 }




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Paolo Bonzini
On 11/03/2013 15:22, Peter Lieven wrote:
> 
> On 11.03.2013 at 15:14, Paolo Bonzini wrote:
> 
>> On 11/03/2013 14:44, Peter Lieven wrote:
>>> Hi,
>>>
>>> I have long had a few VMs which are very hard to migrate because of a
>>> lot of memory I/O. I found that finding the next dirty bit seemed to
>>> be one of the culprits (apart from locking, which Paolo is working on
>>> removing).
>>>
>>> I have the following proposal, which seems to help a lot in my case.
>>> Just wanted to get some feedback.
>>> I applied the same unrolling idea as in buffer_is_zero().
>>>
>>> Peter
>>>
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>> const unsigned long *p = addr + BITOP_WORD(offset);
>>> unsigned long result = offset & ~(BITS_PER_LONG-1);
>>> unsigned long tmp;
>>> +unsigned long d0,d1,d2,d3;
>>>
>>> if (offset >= size) {
>>> return size;
>>> }
>>> size -= result;
>>> -offset %= BITS_PER_LONG;
>>> +offset &= (BITS_PER_LONG-1);
>>> if (offset) {
>>> tmp = *(p++);
>>> tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>> result += BITS_PER_LONG;
>>> }
>>> while (size & ~(BITS_PER_LONG-1)) {
>>> +while (!(size & (4*BITS_PER_LONG-1))) {
>>
>> This really means
>>
>>    if (!(size & (4*BITS_PER_LONG-1))) {
>>        while (1) {
>>            ...
>>        }
>>    }
>>
>> because the subtraction will not change the result of the "while" loop
>> condition.
> 
> Are you sure? The above is working nicely for me (wondering why ;-))

   while (!(size & (4*BITS_PER_LONG-1)))=>
   while (!(size % (4*BITS_PER_LONG))   =>
   while ((size % (4*BITS_PER_LONG)) == 0)

Subtracting 4*BITS_PER_LONG doesn't change the modulus.

> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero it's not divisible
> by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.

In fact I'm not really sure why it works for you. :)

>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG)".
>>
>> Please change both while loops to use a ">=" condition, it's easier to read.
> 
> Good idea, it's easier to understand.
> 
>>> Please change both while loops to use a ">=" condition, it's easier to read.
> 
> Thinking again, in case a bit is found, this might lead to unnecessary
> iterations in the while loop if the bit is in d1, d2 or d3.

How would that be different in your patch?  But you can solve it by
making two >= loops, one checking for 4*BITS_PER_LONG and one checking
BITS_PER_LONG.

Paolo

> 
> Peter
> 
>>
>> Paolo
>>
>>> +d0 = *p;
>>> +d1 = *(p+1);
>>> +d2 = *(p+2);
>>> +d3 = *(p+3);
>>> +if (d0 || d1 || d2 || d3) {
>>> +break;
>>> +}
>>> +p+=4;
>>> +result += 4*BITS_PER_LONG;
>>> +size -= 4*BITS_PER_LONG;
>>> +}
>>> if ((tmp = *(p++))) {
>>> goto found_middle;
>>> }
>>
> 




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Lieven

On 11.03.2013 at 15:22, Peter Lieven wrote:

> 
> On 11.03.2013 at 15:14, Paolo Bonzini wrote:
> 
>> On 11/03/2013 14:44, Peter Lieven wrote:
>>> Hi,
>>> 
>>> I have long had a few VMs which are very hard to migrate because of a
>>> lot of memory I/O. I found that finding the next dirty bit seemed to
>>> be one of the culprits (apart from locking, which Paolo is working on
>>> removing).
>>> 
>>> I have the following proposal, which seems to help a lot in my case.
>>> Just wanted to get some feedback.
>>> I applied the same unrolling idea as in buffer_is_zero().
>>> 
>>> Peter
>>> 
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>const unsigned long *p = addr + BITOP_WORD(offset);
>>>unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>unsigned long tmp;
>>> +unsigned long d0,d1,d2,d3;
>>> 
>>>if (offset >= size) {
>>>return size;
>>>}
>>>size -= result;
>>> -offset %= BITS_PER_LONG;
>>> +offset &= (BITS_PER_LONG-1);
>>>if (offset) {
>>>tmp = *(p++);
>>>tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>result += BITS_PER_LONG;
>>>}
>>>while (size & ~(BITS_PER_LONG-1)) {
>>> +while (!(size & (4*BITS_PER_LONG-1))) {
>> 
>> This really means
>> 
>>    if (!(size & (4*BITS_PER_LONG-1))) {
>>        while (1) {
>>            ...
>>        }
>>    }
>> 
>> because the subtraction will not change the result of the "while" loop
>> condition.
> 
> Are you sure? The above is working nicely for me (wondering why ;-))
> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero it's not divisible
> by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.
> 
>> 
>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG)".
>> 
>> Please change both while loops to use a ">=" condition, it's easier to read.

Thinking again, in case a bit is found, this might lead to unnecessary
iterations in the while loop if the bit is in d1, d2 or d3.


> 
> Good idea, it's easier to understand.
> 
> Does anyone have evidence on whether unrolling 4 longs is optimal on
> today's processors? I just chose 4 longs because it was the same in
> buffer_is_zero().
> 
> Peter
> 
>> 
>> Paolo
>> 
>>> +d0 = *p;
>>> +d1 = *(p+1);
>>> +d2 = *(p+2);
>>> +d3 = *(p+3);
>>> +if (d0 || d1 || d2 || d3) {
>>> +break;
>>> +}
>>> +p+=4;
>>> +result += 4*BITS_PER_LONG;
>>> +size -= 4*BITS_PER_LONG;
>>> +}
>>>if ((tmp = *(p++))) {
>>>goto found_middle;
>>>}
>> 
> 




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Lieven

On 11.03.2013 at 15:14, Paolo Bonzini wrote:

> On 11/03/2013 14:44, Peter Lieven wrote:
>> Hi,
>> 
>> I have long had a few VMs which are very hard to migrate because of a
>> lot of memory I/O. I found that finding the next dirty bit seemed to
>> be one of the culprits (apart from locking, which Paolo is working on
>> removing).
>> 
>> I have the following proposal, which seems to help a lot in my case.
>> Just wanted to get some feedback.
>> I applied the same unrolling idea as in buffer_is_zero().
>> 
>> Peter
>> 
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>> const unsigned long *p = addr + BITOP_WORD(offset);
>> unsigned long result = offset & ~(BITS_PER_LONG-1);
>> unsigned long tmp;
>> +unsigned long d0,d1,d2,d3;
>> 
>> if (offset >= size) {
>> return size;
>> }
>> size -= result;
>> -offset %= BITS_PER_LONG;
>> +offset &= (BITS_PER_LONG-1);
>> if (offset) {
>> tmp = *(p++);
>> tmp &= (~0UL << offset);
>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>> result += BITS_PER_LONG;
>> }
>> while (size & ~(BITS_PER_LONG-1)) {
>> +while (!(size & (4*BITS_PER_LONG-1))) {
> 
> This really means
> 
>    if (!(size & (4*BITS_PER_LONG-1))) {
>        while (1) {
>            ...
>        }
>    }
> 
> because the subtraction will not change the result of the "while" loop
> condition.

Are you sure? The above is working nicely for me (wondering why ;-))
I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
propose. If size & (4*BITS_PER_LONG-1) is not zero it's not divisible
by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.

> 
> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
> in turn means "while (size >= 4*BITS_PER_LONG)".
> 
> Please change both while loops to use a ">=" condition, it's easier to read.

Good idea, it's easier to understand.

Does anyone have evidence on whether unrolling 4 longs is optimal on
today's processors? I just chose 4 longs because it was the same in
buffer_is_zero().

Peter

> 
> Paolo
> 
>> +d0 = *p;
>> +d1 = *(p+1);
>> +d2 = *(p+2);
>> +d3 = *(p+3);
>> +if (d0 || d1 || d2 || d3) {
>> +break;
>> +}
>> +p+=4;
>> +result += 4*BITS_PER_LONG;
>> +size -= 4*BITS_PER_LONG;
>> +}
>> if ((tmp = *(p++))) {
>> goto found_middle;
>> }
> 




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Paolo Bonzini
On 11/03/2013 14:44, Peter Lieven wrote:
> Hi,
> 
> I have long had a few VMs which are very hard to migrate because of a
> lot of memory I/O. I found that finding the next dirty bit seemed to
> be one of the culprits (apart from locking, which Paolo is working on
> removing).
> 
> I have the following proposal, which seems to help a lot in my case.
> Just wanted to get some feedback.
> I applied the same unrolling idea as in buffer_is_zero().
> 
> Peter
> 
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
>  const unsigned long *p = addr + BITOP_WORD(offset);
>  unsigned long result = offset & ~(BITS_PER_LONG-1);
>  unsigned long tmp;
> +unsigned long d0,d1,d2,d3;
> 
>  if (offset >= size) {
>  return size;
>  }
>  size -= result;
> -offset %= BITS_PER_LONG;
> +offset &= (BITS_PER_LONG-1);
>  if (offset) {
>  tmp = *(p++);
>  tmp &= (~0UL << offset);
> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
>  result += BITS_PER_LONG;
>  }
>  while (size & ~(BITS_PER_LONG-1)) {
> +while (!(size & (4*BITS_PER_LONG-1))) {

This really means

   if (!(size & (4*BITS_PER_LONG-1))) {
   while (1) {
   ...
   }
   }

because the subtraction will not change the result of the "while" loop
condition.

What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
in turn means "while (size >= 4*BITS_PER_LONG)".

Please change both while loops to use a ">=" condition, it's easier to read.

Paolo

> +d0 = *p;
> +d1 = *(p+1);
> +d2 = *(p+2);
> +d3 = *(p+3);
> +if (d0 || d1 || d2 || d3) {
> +break;
> +}
> +p+=4;
> +result += 4*BITS_PER_LONG;
> +size -= 4*BITS_PER_LONG;
> +}
>  if ((tmp = *(p++))) {
>  goto found_middle;
>  }




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Maydell
On 11 March 2013 13:44, Peter Lieven  wrote:

> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr,
> unsigned long size,
>  const unsigned long *p = addr + BITOP_WORD(offset);
>  unsigned long result = offset & ~(BITS_PER_LONG-1);
>  unsigned long tmp;
> +unsigned long d0,d1,d2,d3;
>
>  if (offset >= size) {
>  return size;
>  }
>  size -= result;
> -offset %= BITS_PER_LONG;
> +offset &= (BITS_PER_LONG-1);

This change at least is unnecessary -- I just checked, and gcc
is already smart enough to turn the % operation into a logical
and. The generated object files are identical.
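
A quick way to reproduce the check (illustrative snippet, not from the
thread): compile both forms with "gcc -O2 -S", or compare the object files
with objdump:

    /* With unsigned operands and a power-of-two divisor, gcc turns the
     * % into an AND, so both functions compile to identical code. */
    unsigned long mod_op(unsigned long offset) { return offset % 64; }
    unsigned long and_op(unsigned long offset) { return offset & 63; }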

-- PMM



[Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread Peter Lieven

Hi,

I have long had a few VMs which are very hard to migrate because of a lot of
memory I/O. I found that finding the next dirty bit seemed to be one of the
culprits (apart from locking, which Paolo is working on removing).

I have the following proposal, which seems to help a lot in my case. Just
wanted to get some feedback.
I applied the same unrolling idea as in buffer_is_zero().

Peter

--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, 
unsigned long size,
 const unsigned long *p = addr + BITOP_WORD(offset);
 unsigned long result = offset & ~(BITS_PER_LONG-1);
 unsigned long tmp;
+unsigned long d0,d1,d2,d3;

 if (offset >= size) {
 return size;
 }
 size -= result;
-offset %= BITS_PER_LONG;
+offset &= (BITS_PER_LONG-1);
 if (offset) {
 tmp = *(p++);
 tmp &= (~0UL << offset);
@@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long *addr, 
unsigned long size,
 result += BITS_PER_LONG;
 }
 while (size & ~(BITS_PER_LONG-1)) {
+while (!(size & (4*BITS_PER_LONG-1))) {
+d0 = *p;
+d1 = *(p+1);
+d2 = *(p+2);
+d3 = *(p+3);
+if (d0 || d1 || d2 || d3) {
+break;
+}
+p+=4;
+result += 4*BITS_PER_LONG;
+size -= 4*BITS_PER_LONG;
+}
 if ((tmp = *(p++))) {
 goto found_middle;
 }