Re: [Qemu-devel] [RFC] find_next_bit optimizations
On Tue, Mar 12, 2013 at 09:41:04AM +0100, Peter Lieven wrote:
>
> On 12.03.2013 at 09:35, Stefan Hajnoczi wrote:
>
> > On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> >> For a long time I have had a few VMs which are very hard to migrate
> >> because of a lot of memory I/O. I found that finding the next dirty bit
> >> seemed to be one of the culprits (apart from the locking, whose removal
> >> Paolo is working on).
> >>
> >> I have the following proposal, which seems to help a lot in my case. I
> >> just wanted to get some feedback.
> >
> > Hi Peter,
> > Do you have any performance numbers for this patch? I'm just curious
> > how big the win is.
>
> Hi Stefan,
>
> please see my recent email to the list with the final patch.
> The win is up to 100%. The worst-case execution time (whole
> array is zero) is halved on x86_64.

Thanks!

Stefan
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 12.03.2013 at 09:35, Stefan Hajnoczi wrote:

> On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
>> For a long time I have had a few VMs which are very hard to migrate
>> because of a lot of memory I/O. I found that finding the next dirty bit
>> seemed to be one of the culprits (apart from the locking, whose removal
>> Paolo is working on).
>>
>> I have the following proposal, which seems to help a lot in my case. I
>> just wanted to get some feedback.
>
> Hi Peter,
> Do you have any performance numbers for this patch? I'm just curious
> how big the win is.

Hi Stefan,

please see my recent email to the list with the final patch.
The win is up to 100%. The worst-case execution time (whole
array is zero) is halved on x86_64.

Peter

>
> Stefan
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> For a long time I have had a few VMs which are very hard to migrate
> because of a lot of memory I/O. I found that finding the next dirty bit
> seemed to be one of the culprits (apart from the locking, whose removal
> Paolo is working on).
>
> I have the following proposal, which seems to help a lot in my case. I
> just wanted to get some feedback.

Hi Peter,
Do you have any performance numbers for this patch? I'm just curious
how big the win is.

Stefan
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11.03.2013 18:07, Paolo Bonzini wrote:
> On 11/03/2013 18:06, ronnie sahlberg wrote:
>> Even more efficient might be to do a bitwise instead of a logical or:
>>
>>     if (tmp | d1 | d2 | d3) {
>>
>> That should remove 3 of the 4 conditional jumps
>> and should become 3 bitwise ors and one conditional jump.
>
> Without any serious profiling, please let the compiler do that.

Paolo is right. I ran some tests with gcc 4.6.3 on x86_64 (with -O3) and
tried the various ideas. They all made no significant difference. Even
unrolling to 8 unsigned longs didn't change anything.

What I tried is running 1^20 iterations of

    find_next_bit(bitfield, 4194304, 0);

I chose the bitfield to be 4 MByte, which equals a 16 GB VM. The bitfield
was completely zeroed, so find_next_bit had to run through the whole
bitfield.

The original version took 1 minute and 10 seconds, whereas all the others
took approx. 37-38 seconds, which is almost a 100% boost ;-)

So I think this here is the final version:

    while (size >= 4*BITS_PER_LONG) {
        unsigned long d1, d2, d3;
        tmp = *p;
        d1 = *(p+1);
        d2 = *(p+2);
        d3 = *(p+3);
        if (tmp) {
            goto found_middle;
        }
        if (d1 || d2 || d3) {
            break;
        }
        p += 4;
        result += 4*BITS_PER_LONG;
        size -= 4*BITS_PER_LONG;
    }

Peter
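For reference, the final loop above can be dropped into a self-contained sketch of the whole function (my reconstruction, not the committed QEMU patch: the name `find_next_bit0` is mine, the tail handling for partial words is omitted, offset is fixed at 0, and `__builtin_ctzl` stands in for the bit search behind QEMU's `found_middle` label; assumes GCC or Clang):

```c
#include <limits.h>

#define BITS_PER_LONG ((int)(CHAR_BIT * sizeof(unsigned long)))

/* Sketch of find_next_bit with the unrolled fast path from the thread.
 * Simplified: offset is fixed at 0 and size must be a multiple of
 * BITS_PER_LONG; returns size if no bit is set. */
unsigned long find_next_bit0(const unsigned long *p, unsigned long size)
{
    unsigned long result = 0;
    unsigned long tmp;

    /* Fast path: skip four zero words per iteration. */
    while (size >= 4 * BITS_PER_LONG) {
        unsigned long d1, d2, d3;
        tmp = *p;
        d1 = *(p + 1);
        d2 = *(p + 2);
        d3 = *(p + 3);
        if (tmp) {
            goto found_middle;
        }
        if (d1 || d2 || d3) {
            break;  /* set bit is in d1..d3; fall through to slow path */
        }
        p += 4;
        result += 4 * BITS_PER_LONG;
        size -= 4 * BITS_PER_LONG;
    }
    /* Slow path: one word at a time. */
    while (size >= BITS_PER_LONG) {
        if ((tmp = *(p++))) {
            goto found_middle;
        }
        result += BITS_PER_LONG;
        size -= BITS_PER_LONG;
    }
    return result;  /* nothing found: result equals the original size */
found_middle:
    return result + __builtin_ctzl(tmp);  /* lowest set bit of tmp */
}
```

Note that after the `break` the slow path re-reads the word `tmp` came from (it is zero, so it is skipped in one iteration); this is the small redundancy the "two >= loops" discussion below accepts for simplicity.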
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11/03/2013 18:06, ronnie sahlberg wrote:
> Even more efficient might be to do a bitwise instead of a logical or:
>
>>> if (tmp | d1 | d2 | d3) {
>
> that should remove 3 of the 4 conditional jumps
> and should become 3 bitwise ors and one conditional jump

Without any serious profiling, please let the compiler do that.

Paolo
Re: [Qemu-devel] [RFC] find_next_bit optimizations
Even more efficient might be to do a bitwise instead of a logical or:

> if (tmp | d1 | d2 | d3) {

That should remove 3 of the 4 conditional jumps
and should become 3 bitwise ors and one conditional jump.

On Mon, Mar 11, 2013 at 8:58 AM, Paolo Bonzini wrote:
> On 11/03/2013 16:37, Peter Lieven wrote:
>>
>> On 11.03.2013 at 16:29, Paolo Bonzini wrote:
>>
>>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>>> How would that be different in your patch? But you can solve it by
>>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>>> BITS_PER_LONG.
>>>>
>>>> This is what I have now:
>>>>
>>>> diff --git a/util/bitops.c b/util/bitops.c
>>>> index e72237a..b0dc93f 100644
>>>> --- a/util/bitops.c
>>>> +++ b/util/bitops.c
>>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr,
>>>> unsigned long size,
>>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>>     unsigned long tmp;
>>>> +    unsigned long d0,d1,d2,d3;
>>>>
>>>>     if (offset >= size) {
>>>>         return size;
>>>>     }
>>>>     size -= result;
>>>> -    offset %= BITS_PER_LONG;
>>>> +    offset &= (BITS_PER_LONG-1);
>>>>     if (offset) {
>>>>         tmp = *(p++);
>>>>         tmp &= (~0UL << offset);
>>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr,
>>>> unsigned long size,
>>>>         size -= BITS_PER_LONG;
>>>>         result += BITS_PER_LONG;
>>>>     }
>>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>>> +    while (size >= 4*BITS_PER_LONG) {
>>>> +        d0 = *p;
>>>> +        d1 = *(p+1);
>>>> +        d2 = *(p+2);
>>>> +        d3 = *(p+3);
>>>> +        if (d0 || d1 || d2 || d3) {
>>>> +            break;
>>>> +        }
>>>> +        p+=4;
>>>> +        result += 4*BITS_PER_LONG;
>>>> +        size -= 4*BITS_PER_LONG;
>>>> +    }
>>>> +    while (size >= BITS_PER_LONG) {
>>>>         if ((tmp = *(p++))) {
>>>>             goto found_middle;
>>>>         }
>>>
>>> Minus the %= vs. &=,
>>>
>>> Reviewed-by: Paolo Bonzini
>>>
>>> Perhaps:
>>>
>>>     tmp = *p;
>>>     d1 = *(p+1);
>>>     d2 = *(p+2);
>>>     d3 = *(p+3);
>>>     if (tmp) {
>>>         goto found_middle;
>>>     }
>>>     if (d1 || d2 || d3) {
>>>         break;
>>>     }
>>
>> i do not know what gcc internally makes of the d0 || d1 || d2 || d3 ?
>
> It depends on the target and how expensive branches are.
>
>> i would guess it's sth like one addition w/ carry and 1 test?
>
> It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
> followed by a compare-and-jump.
>
> That is, either:
>
>     test %r8, %r8
>     jz second_loop
>     test %r9, %r9
>     jz second_loop
>     test %r10, %r10
>     jz second_loop
>     test %r11, %r11
>     jz second_loop
>
> or
>
>     or %r9, %r8
>     or %r11, %r10
>     or %r8, %r10
>     jz second_loop
>
> Don't let the length of the code fool you. The processor knows how to
> optimize all of these, and GCC knows too.
>
>> your proposed change would introduce 2 tests (maybe)?
>
> Yes, but I expect them to be fairly well predicted.
>
>> what about this to be sure?
>>
>>     tmp = *p;
>>     d1 = *(p+1);
>>     d2 = *(p+2);
>>     d3 = *(p+3);
>>     if (tmp || d1 || d2 || d3) {
>>         if (tmp) {
>>             goto found_middle;
>
> I suspect that GCC would rewrite it to my version (definitely if it
> produces 4 compare-and-jumps; but possibly it does so even if it goes
> for bitwise ORs, I haven't checked).
>
> Regarding your other question ("one last thought: would it make sense to
> update only `size` in the while loops and compute the `result` at the end
> as `orgsize` - `size`?"), again the compiler knows better and might even
> do this for you. It will likely drop the p increases and use p[result],
> so if you do that change you may even get the same code, only this time
> p is increased and you get an extra subtraction at the end. :)
>
> Bottom line: don't try to outsmart an optimizing C compiler on
> micro-optimization, unless you have benchmarked it and it shows there is
> a problem.
>
> Paolo
>
>>         }
>>         break;
>>     }
>>
>> Peter
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11/03/2013 16:37, Peter Lieven wrote:
>
> On 11.03.2013 at 16:29, Paolo Bonzini wrote:
>
>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>> How would that be different in your patch? But you can solve it by
>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>> BITS_PER_LONG.
>>>
>>> This is what I have now:
>>>
>>> diff --git a/util/bitops.c b/util/bitops.c
>>> index e72237a..b0dc93f 100644
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr,
>>> unsigned long size,
>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>     unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>>
>>>     if (offset >= size) {
>>>         return size;
>>>     }
>>>     size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>     if (offset) {
>>>         tmp = *(p++);
>>>         tmp &= (~0UL << offset);
>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr,
>>> unsigned long size,
>>>         size -= BITS_PER_LONG;
>>>         result += BITS_PER_LONG;
>>>     }
>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>> +    while (size >= 4*BITS_PER_LONG) {
>>> +        d0 = *p;
>>> +        d1 = *(p+1);
>>> +        d2 = *(p+2);
>>> +        d3 = *(p+3);
>>> +        if (d0 || d1 || d2 || d3) {
>>> +            break;
>>> +        }
>>> +        p+=4;
>>> +        result += 4*BITS_PER_LONG;
>>> +        size -= 4*BITS_PER_LONG;
>>> +    }
>>> +    while (size >= BITS_PER_LONG) {
>>>         if ((tmp = *(p++))) {
>>>             goto found_middle;
>>>         }
>>
>> Minus the %= vs. &=,
>>
>> Reviewed-by: Paolo Bonzini
>>
>> Perhaps:
>>
>>     tmp = *p;
>>     d1 = *(p+1);
>>     d2 = *(p+2);
>>     d3 = *(p+3);
>>     if (tmp) {
>>         goto found_middle;
>>     }
>>     if (d1 || d2 || d3) {
>>         break;
>>     }
>
> i do not know what gcc internally makes of the d0 || d1 || d2 || d3 ?

It depends on the target and how expensive branches are.

> i would guess it's sth like one addition w/ carry and 1 test?

It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
followed by a compare-and-jump.

That is, either:

    test %r8, %r8
    jz second_loop
    test %r9, %r9
    jz second_loop
    test %r10, %r10
    jz second_loop
    test %r11, %r11
    jz second_loop

or

    or %r9, %r8
    or %r11, %r10
    or %r8, %r10
    jz second_loop

Don't let the length of the code fool you. The processor knows how to
optimize all of these, and GCC knows too.

> your proposed change would introduce 2 tests (maybe)?

Yes, but I expect them to be fairly well predicted.

> what about this to be sure?
>
>     tmp = *p;
>     d1 = *(p+1);
>     d2 = *(p+2);
>     d3 = *(p+3);
>     if (tmp || d1 || d2 || d3) {
>         if (tmp) {
>             goto found_middle;

I suspect that GCC would rewrite it to my version (definitely if it
produces 4 compare-and-jumps; but possibly it does so even if it goes
for bitwise ORs, I haven't checked).

Regarding your other question ("one last thought: would it make sense to
update only `size` in the while loops and compute the `result` at the end
as `orgsize` - `size`?"), again the compiler knows better and might even
do this for you. It will likely drop the p increases and use p[result],
so if you do that change you may even get the same code, only this time
p is increased and you get an extra subtraction at the end. :)

Bottom line: don't try to outsmart an optimizing C compiler on
micro-optimization, unless you have benchmarked it and it shows there is
a problem.

Paolo

>         }
>         break;
>     }
>
> Peter
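The two source forms Paolo and Ronnie are weighing differ only in code generation; at the source level they agree whenever all four words are loaded anyway, which the unrolled loop does. A minimal check of that equivalence (function names are mine, for illustration):

```c
/* Logical form: || short-circuits, so the compiler may emit up to four
 * conditional branches (but is free to merge them). */
static int any_set_logical(unsigned long d0, unsigned long d1,
                           unsigned long d2, unsigned long d3)
{
    return d0 || d1 || d2 || d3;
}

/* Bitwise form: always combines all four words, leaving a single
 * conditional branch on the final result. */
static int any_set_bitwise(unsigned long d0, unsigned long d1,
                           unsigned long d2, unsigned long d3)
{
    return (d0 | d1 | d2 | d3) != 0;
}
```

Since `d1..d3` are read unconditionally in the loop, `|` costs nothing extra here; and as Paolo points out, GCC may produce either instruction sequence from either source form, which is the argument for not hand-tuning it.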
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11.03.2013 at 16:42, Paolo Bonzini wrote:

> On 11/03/2013 16:41, Peter Lieven wrote:
>> can you verify that this does not make a difference in the generated
>> object code? in buffer_is_zero() it is outside the loop.
>
> No, it doesn't.

ok, i will send the final patch tomorrow.

one last thought: would it make sense to update only `size` in the while
loops and compute the `result` at the end as `orgsize` - `size`?

Peter
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11/03/2013 16:41, Peter Lieven wrote:
> can you verify that this does not make a difference in the generated
> object code? in buffer_is_zero() it is outside the loop.

No, it doesn't.

Paolo
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11.03.2013 at 16:37, Peter Maydell wrote:

> On 11 March 2013 15:24, Peter Lieven wrote:
>> +    unsigned long d0,d1,d2,d3;
>
> These commas should have spaces after them. Also, since
> the variables are only used inside the scope of your
> newly added while loop:
>
>> -    while (size & ~(BITS_PER_LONG-1)) {
>> +    while (size >= 4*BITS_PER_LONG) {
>
> it would be better to declare them here.

can you verify that this does not make a difference in the generated
object code? in buffer_is_zero() it is outside the loop.

thanks
peter
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11 March 2013 15:24, Peter Lieven wrote:
> +    unsigned long d0,d1,d2,d3;

These commas should have spaces after them. Also, since
the variables are only used inside the scope of your
newly added while loop:

> -    while (size & ~(BITS_PER_LONG-1)) {
> +    while (size >= 4*BITS_PER_LONG) {

it would be better to declare them here.

> +        d0 = *p;
> +        d1 = *(p+1);
> +        d2 = *(p+2);
> +        d3 = *(p+3);
> +        if (d0 || d1 || d2 || d3) {
> +            break;
> +        }
> +        p+=4;
> +        result += 4*BITS_PER_LONG;
> +        size -= 4*BITS_PER_LONG;
> +    }
> +    while (size >= BITS_PER_LONG) {
>         if ((tmp = *(p++))) {
>             goto found_middle;
>         }

thanks
-- PMM
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11.03.2013 at 16:29, Paolo Bonzini wrote:

> On 11/03/2013 16:24, Peter Lieven wrote:
>>
>>> How would that be different in your patch? But you can solve it by
>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>> BITS_PER_LONG.
>>
>> This is what I have now:
>>
>> diff --git a/util/bitops.c b/util/bitops.c
>> index e72237a..b0dc93f 100644
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr,
>> unsigned long size,
>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>     unsigned long tmp;
>> +    unsigned long d0,d1,d2,d3;
>>
>>     if (offset >= size) {
>>         return size;
>>     }
>>     size -= result;
>> -    offset %= BITS_PER_LONG;
>> +    offset &= (BITS_PER_LONG-1);
>>     if (offset) {
>>         tmp = *(p++);
>>         tmp &= (~0UL << offset);
>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr,
>> unsigned long size,
>>         size -= BITS_PER_LONG;
>>         result += BITS_PER_LONG;
>>     }
>> -    while (size & ~(BITS_PER_LONG-1)) {
>> +    while (size >= 4*BITS_PER_LONG) {
>> +        d0 = *p;
>> +        d1 = *(p+1);
>> +        d2 = *(p+2);
>> +        d3 = *(p+3);
>> +        if (d0 || d1 || d2 || d3) {
>> +            break;
>> +        }
>> +        p+=4;
>> +        result += 4*BITS_PER_LONG;
>> +        size -= 4*BITS_PER_LONG;
>> +    }
>> +    while (size >= BITS_PER_LONG) {
>>         if ((tmp = *(p++))) {
>>             goto found_middle;
>>         }
>
> Minus the %= vs. &=,
>
> Reviewed-by: Paolo Bonzini
>
> Perhaps:
>
>     tmp = *p;
>     d1 = *(p+1);
>     d2 = *(p+2);
>     d3 = *(p+3);
>     if (tmp) {
>         goto found_middle;
>     }
>     if (d1 || d2 || d3) {
>         break;
>     }

i do not know what gcc internally makes of the d0 || d1 || d2 || d3 ?
i would guess it's sth like one addition w/ carry and 1 test?
your proposed change would introduce 2 tests (maybe)?

what about this to be sure?

    tmp = *p;
    d1 = *(p+1);
    d2 = *(p+2);
    d3 = *(p+3);
    if (tmp || d1 || d2 || d3) {
        if (tmp) {
            goto found_middle;
        }
        break;
    }

Peter
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11 March 2013 15:24, Peter Lieven wrote:
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);

Still pointless.

-- PMM
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11/03/2013 16:24, Peter Lieven wrote:
>
>> How would that be different in your patch? But you can solve it by
>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>> BITS_PER_LONG.
>
> This is what I have now:
>
> diff --git a/util/bitops.c b/util/bitops.c
> index e72237a..b0dc93f 100644
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr,
> unsigned long size,
>     const unsigned long *p = addr + BITOP_WORD(offset);
>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>     unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>
>     if (offset >= size) {
>         return size;
>     }
>     size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);
>     if (offset) {
>         tmp = *(p++);
>         tmp &= (~0UL << offset);
> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr,
> unsigned long size,
>         size -= BITS_PER_LONG;
>         result += BITS_PER_LONG;
>     }
> -    while (size & ~(BITS_PER_LONG-1)) {
> +    while (size >= 4*BITS_PER_LONG) {
> +        d0 = *p;
> +        d1 = *(p+1);
> +        d2 = *(p+2);
> +        d3 = *(p+3);
> +        if (d0 || d1 || d2 || d3) {
> +            break;
> +        }
> +        p+=4;
> +        result += 4*BITS_PER_LONG;
> +        size -= 4*BITS_PER_LONG;
> +    }
> +    while (size >= BITS_PER_LONG) {
>         if ((tmp = *(p++))) {
>             goto found_middle;
>         }

Minus the %= vs. &=,

Reviewed-by: Paolo Bonzini

Perhaps:

    tmp = *p;
    d1 = *(p+1);
    d2 = *(p+2);
    d3 = *(p+3);
    if (tmp) {
        goto found_middle;
    }
    if (d1 || d2 || d3) {
        break;
    }

Paolo
Re: [Qemu-devel] [RFC] find_next_bit optimizations
> How would that be different in your patch? But you can solve it by
> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
> BITS_PER_LONG.

This is what I have now:

diff --git a/util/bitops.c b/util/bitops.c
index e72237a..b0dc93f 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
     const unsigned long *p = addr + BITOP_WORD(offset);
     unsigned long result = offset & ~(BITS_PER_LONG-1);
     unsigned long tmp;
+    unsigned long d0,d1,d2,d3;
 
     if (offset >= size) {
         return size;
     }
     size -= result;
-    offset %= BITS_PER_LONG;
+    offset &= (BITS_PER_LONG-1);
     if (offset) {
         tmp = *(p++);
         tmp &= (~0UL << offset);
@@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         size -= BITS_PER_LONG;
         result += BITS_PER_LONG;
     }
-    while (size & ~(BITS_PER_LONG-1)) {
+    while (size >= 4*BITS_PER_LONG) {
+        d0 = *p;
+        d1 = *(p+1);
+        d2 = *(p+2);
+        d3 = *(p+3);
+        if (d0 || d1 || d2 || d3) {
+            break;
+        }
+        p+=4;
+        result += 4*BITS_PER_LONG;
+        size -= 4*BITS_PER_LONG;
+    }
+    while (size >= BITS_PER_LONG) {
         if ((tmp = *(p++))) {
             goto found_middle;
         }
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11/03/2013 15:22, Peter Lieven wrote:
>
> On 11.03.2013 at 15:14, Paolo Bonzini wrote:
>
>> On 11/03/2013 14:44, Peter Lieven wrote:
>>> Hi,
>>>
>>> For a long time I have had a few VMs which are very hard to migrate
>>> because of a lot of memory I/O. I found that finding the next dirty bit
>>> seemed to be one of the culprits (apart from the locking, whose removal
>>> Paolo is working on).
>>>
>>> I have the following proposal, which seems to help a lot in my case. I
>>> just wanted to get some feedback.
>>> I applied the same unrolling idea as in buffer_is_zero().
>>>
>>> Peter
>>>
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>     unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>>
>>>     if (offset >= size) {
>>>         return size;
>>>     }
>>>     size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>     if (offset) {
>>>         tmp = *(p++);
>>>         tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>         result += BITS_PER_LONG;
>>>     }
>>>     while (size & ~(BITS_PER_LONG-1)) {
>>> +        while (!(size & (4*BITS_PER_LONG-1))) {
>>
>> This really means
>>
>>     if (!(size & (4*BITS_PER_LONG-1))) {
>>         while (1) {
>>             ...
>>         }
>>     }
>>
>> because the subtraction will not change the result of the "while" loop
>> condition.
>
> Are you sure? The above is working nicely for me (wondering why ;-))

    while (!(size & (4*BITS_PER_LONG-1)))
=>  while (!(size % (4*BITS_PER_LONG)))
=>  while ((size % (4*BITS_PER_LONG)) == 0)

Subtracting 4*BITS_PER_LONG doesn't change the modulus.

> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero, it is not divisible
> by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.

In fact I'm not really sure why it works for you. :)

>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG)".
>>
>> Please change both while loops to use a ">=" condition, it's easier to read.
>
> Good idea, it's easier to understand.
>
> Thinking again, in case a bit is found, this might lead to unnecessary
> iterations in the while loop if the bit is in d1, d2 or d3.

How would that be different in your patch? But you can solve it by
making two >= loops, one checking for 4*BITS_PER_LONG and one checking
BITS_PER_LONG.

Paolo

>
> Peter
>
>>
>> Paolo
>>
>>> +            d0 = *p;
>>> +            d1 = *(p+1);
>>> +            d2 = *(p+2);
>>> +            d3 = *(p+3);
>>> +            if (d0 || d1 || d2 || d3) {
>>> +                break;
>>> +            }
>>> +            p+=4;
>>> +            result += 4*BITS_PER_LONG;
>>> +            size -= 4*BITS_PER_LONG;
>>> +        }
>>>         if ((tmp = *(p++))) {
>>>             goto found_middle;
>>>         }
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11.03.2013 at 15:22, Peter Lieven wrote:

>
> On 11.03.2013 at 15:14, Paolo Bonzini wrote:
>
>> On 11/03/2013 14:44, Peter Lieven wrote:
>>> Hi,
>>>
>>> For a long time I have had a few VMs which are very hard to migrate
>>> because of a lot of memory I/O. I found that finding the next dirty bit
>>> seemed to be one of the culprits (apart from the locking, whose removal
>>> Paolo is working on).
>>>
>>> I have the following proposal, which seems to help a lot in my case. I
>>> just wanted to get some feedback.
>>> I applied the same unrolling idea as in buffer_is_zero().
>>>
>>> Peter
>>>
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>     unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>>
>>>     if (offset >= size) {
>>>         return size;
>>>     }
>>>     size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>     if (offset) {
>>>         tmp = *(p++);
>>>         tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>         result += BITS_PER_LONG;
>>>     }
>>>     while (size & ~(BITS_PER_LONG-1)) {
>>> +        while (!(size & (4*BITS_PER_LONG-1))) {
>>
>> This really means
>>
>>     if (!(size & (4*BITS_PER_LONG-1))) {
>>         while (1) {
>>             ...
>>         }
>>     }
>>
>> because the subtraction will not change the result of the "while" loop
>> condition.
>
> Are you sure? The above is working nicely for me (wondering why ;-))
> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero, it is not divisible
> by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.
>
>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG)".
>>
>> Please change both while loops to use a ">=" condition, it's easier to read.

Thinking again, in case a bit is found, this might lead to unnecessary
iterations in the while loop if the bit is in d1, d2 or d3.

> Good idea, it's easier to understand.
>
> Has anyone evidence whether unrolling 4 longs is optimal on today's
> processors? I just chose 4 longs because it was the same in
> buffer_is_zero().
>
> Peter
>
>>
>> Paolo
>>
>>> +            d0 = *p;
>>> +            d1 = *(p+1);
>>> +            d2 = *(p+2);
>>> +            d3 = *(p+3);
>>> +            if (d0 || d1 || d2 || d3) {
>>> +                break;
>>> +            }
>>> +            p+=4;
>>> +            result += 4*BITS_PER_LONG;
>>> +            size -= 4*BITS_PER_LONG;
>>> +        }
>>>         if ((tmp = *(p++))) {
>>>             goto found_middle;
>>>         }
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11.03.2013 at 15:14, Paolo Bonzini wrote:

> On 11/03/2013 14:44, Peter Lieven wrote:
>> Hi,
>>
>> For a long time I have had a few VMs which are very hard to migrate
>> because of a lot of memory I/O. I found that finding the next dirty bit
>> seemed to be one of the culprits (apart from the locking, whose removal
>> Paolo is working on).
>>
>> I have the following proposal, which seems to help a lot in my case. I
>> just wanted to get some feedback.
>> I applied the same unrolling idea as in buffer_is_zero().
>>
>> Peter
>>
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>     unsigned long tmp;
>> +    unsigned long d0,d1,d2,d3;
>>
>>     if (offset >= size) {
>>         return size;
>>     }
>>     size -= result;
>> -    offset %= BITS_PER_LONG;
>> +    offset &= (BITS_PER_LONG-1);
>>     if (offset) {
>>         tmp = *(p++);
>>         tmp &= (~0UL << offset);
>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>>         result += BITS_PER_LONG;
>>     }
>>     while (size & ~(BITS_PER_LONG-1)) {
>> +        while (!(size & (4*BITS_PER_LONG-1))) {
>
> This really means
>
>     if (!(size & (4*BITS_PER_LONG-1))) {
>         while (1) {
>             ...
>         }
>     }
>
> because the subtraction will not change the result of the "while" loop
> condition.

Are you sure? The above is working nicely for me (wondering why ;-))
I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
propose. If size & (4*BITS_PER_LONG-1) is not zero, it is not divisible
by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.

>
> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
> in turn means "while (size >= 4*BITS_PER_LONG)".
>
> Please change both while loops to use a ">=" condition, it's easier to read.

Good idea, it's easier to understand.

Has anyone evidence whether unrolling 4 longs is optimal on today's
processors? I just chose 4 longs because it was the same in
buffer_is_zero().

Peter

>
> Paolo
>
>> +            d0 = *p;
>> +            d1 = *(p+1);
>> +            d2 = *(p+2);
>> +            d3 = *(p+3);
>> +            if (d0 || d1 || d2 || d3) {
>> +                break;
>> +            }
>> +            p+=4;
>> +            result += 4*BITS_PER_LONG;
>> +            size -= 4*BITS_PER_LONG;
>> +        }
>>         if ((tmp = *(p++))) {
>>             goto found_middle;
>>         }
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11/03/2013 14:44, Peter Lieven wrote:
> Hi,
>
> For a long time I have had a few VMs which are very hard to migrate
> because of a lot of memory I/O. I found that finding the next dirty bit
> seemed to be one of the culprits (apart from the locking, whose removal
> Paolo is working on).
>
> I have the following proposal, which seems to help a lot in my case. I
> just wanted to get some feedback.
> I applied the same unrolling idea as in buffer_is_zero().
>
> Peter
>
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
>     const unsigned long *p = addr + BITOP_WORD(offset);
>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>     unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>
>     if (offset >= size) {
>         return size;
>     }
>     size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);
>     if (offset) {
>         tmp = *(p++);
>         tmp &= (~0UL << offset);
> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
>         result += BITS_PER_LONG;
>     }
>     while (size & ~(BITS_PER_LONG-1)) {
> +        while (!(size & (4*BITS_PER_LONG-1))) {

This really means

    if (!(size & (4*BITS_PER_LONG-1))) {
        while (1) {
            ...
        }
    }

because the subtraction will not change the result of the "while" loop
condition.

What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
in turn means "while (size >= 4*BITS_PER_LONG)".

Please change both while loops to use a ">=" condition, it's easier to read.

Paolo

> +            d0 = *p;
> +            d1 = *(p+1);
> +            d2 = *(p+2);
> +            d3 = *(p+3);
> +            if (d0 || d1 || d2 || d3) {
> +                break;
> +            }
> +            p+=4;
> +            result += 4*BITS_PER_LONG;
> +            size -= 4*BITS_PER_LONG;
> +        }
>         if ((tmp = *(p++))) {
>             goto found_middle;
>         }
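Paolo's objection can be demonstrated in isolation: the RFC's inner condition tests whether `size` is a multiple of `4*BITS_PER_LONG`, and subtracting that amount never changes the answer, so it is a loop invariant rather than a "words remaining" test (a small sketch of my reading of the argument; the helper name is mine):

```c
#include <limits.h>

#define BITS_PER_LONG ((unsigned long)(CHAR_BIT * sizeof(unsigned long)))
#define CHUNK (4 * BITS_PER_LONG)

/* The RFC's inner loop condition: true iff size is a multiple of
 * 4*BITS_PER_LONG. Subtracting CHUNK from size never changes it. */
static int rfc_inner_cond(unsigned long size)
{
    return !(size & (CHUNK - 1));
}
```

If `size` starts as an exact multiple of `CHUNK`, the condition stays true all the way down to and including `size == 0`, so the inner loop can only stop via the `break`; on an all-zero buffer it would keep reading past the end. If `size` is not a multiple, the inner loop never runs at all. That would explain why the patch appeared to work in the all-zero benchmark while still being wrong in general, which is the point of the `size >= 4*BITS_PER_LONG` fix below.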
Re: [Qemu-devel] [RFC] find_next_bit optimizations
On 11 March 2013 13:44, Peter Lieven wrote:
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr,
> unsigned long size,
>     const unsigned long *p = addr + BITOP_WORD(offset);
>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>     unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>
>     if (offset >= size) {
>         return size;
>     }
>     size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);

This change at least is unnecessary -- I just checked, and gcc is
already smart enough to turn the % operation into a logical and. The
generated object files are identical.

-- PMM
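The identity behind this observation holds for any unsigned value and any power-of-two modulus, which is exactly the strength reduction gcc applies on its own (a small illustration; the function names are mine):

```c
#include <limits.h>

#define BITS_PER_LONG ((unsigned long)(CHAR_BIT * sizeof(unsigned long)))

/* For unsigned x and a power-of-two modulus m, x % m == x & (m - 1);
 * compilers reduce the former to the latter automatically. */
static unsigned long offset_mod(unsigned long x)
{
    return x % BITS_PER_LONG;
}

static unsigned long offset_mask(unsigned long x)
{
    return x & (BITS_PER_LONG - 1);
}
```

For signed operands the two are not interchangeable (C's `%` can yield negative results), which is why the identity only applies cleanly here because `offset` is an `unsigned long`.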
[Qemu-devel] [RFC] find_next_bit optimizations
Hi,

For a long time I have had a few VMs which are very hard to migrate
because of a lot of memory I/O. I found that finding the next dirty bit
seemed to be one of the culprits (apart from the locking, whose removal
Paolo is working on).

I have the following proposal, which seems to help a lot in my case. I
just wanted to get some feedback.
I applied the same unrolling idea as in buffer_is_zero().

Peter

--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
     const unsigned long *p = addr + BITOP_WORD(offset);
     unsigned long result = offset & ~(BITS_PER_LONG-1);
     unsigned long tmp;
+    unsigned long d0,d1,d2,d3;
 
     if (offset >= size) {
         return size;
     }
     size -= result;
-    offset %= BITS_PER_LONG;
+    offset &= (BITS_PER_LONG-1);
     if (offset) {
         tmp = *(p++);
         tmp &= (~0UL << offset);
@@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         result += BITS_PER_LONG;
     }
     while (size & ~(BITS_PER_LONG-1)) {
+        while (!(size & (4*BITS_PER_LONG-1))) {
+            d0 = *p;
+            d1 = *(p+1);
+            d2 = *(p+2);
+            d3 = *(p+3);
+            if (d0 || d1 || d2 || d3) {
+                break;
+            }
+            p+=4;
+            result += 4*BITS_PER_LONG;
+            size -= 4*BITS_PER_LONG;
+        }
         if ((tmp = *(p++))) {
             goto found_middle;
         }