Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Thu, 2017-07-20 at 12:10 +0200, Peter Zijlstra wrote: > On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote: > > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > > > > > Thanks. (Is there a good way to tell gcc that avg*avg is actually a > > > 32x32->64 multiplication?) > > > > If avg is 32bit, compiler does that for you. > > > > u32 avg = ... > > > > u64 result = (u64)avg * avg; > > It does not in fact do that :/ See commit: > > 9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends") Interesting :/ Thanks for letting me know !
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Thu, 2017-07-20 at 12:10 +0200, Peter Zijlstra wrote: > On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote: > > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > > > > > Thanks. (Is there a good way to tell gcc that avg*avg is actually a > > > 32x32->64 multiplication?) > > > > If avg is 32bit, compiler does that for you. > > > > u32 avg = ... > > > > u64 result = (u64)avg * avg; > > It does not in fact do that :/ See commit: > > 9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends") Interesting :/ Thanks for letting me know !
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote: > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > > > Thanks. (Is there a good way to tell gcc that avg*avg is actually a > > 32x32->64 multiplication?) > > If avg is 32bit, compiler does that for you. > > u32 avg = ... > > u64 result = (u64)avg * avg; It does not in fact do that :/ See commit: 9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote: > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > > > Thanks. (Is there a good way to tell gcc that avg*avg is actually a > > 32x32->64 multiplication?) > > If avg is 32bit, compiler does that for you. > > u32 avg = ... > > u64 result = (u64)avg * avg; It does not in fact do that :/ See commit: 9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, Feb 09, 2016 at 12:44:00PM -0800, Andi Kleen wrote: > On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote: > > On Mon, Feb 01 2016, Andi Kleen wrote: > > > > > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: > > >> On Thu, Jan 28 2016, Andi Kleen wrote: > > >> > > >> > From: Andi Kleen > > >> > > > >> > The menu cpuidle governor does at least two int_sqrt() each time > > >> > we go into idle in get_typical_interval to compute stddev > > >> > > > >> > int_sqrts take 100-120 cycles each. Short idle latency is important > > >> > for many workloads. > > >> > > > >> > > >> If you want to optimize get_typical_interval(), why not just take the > > >> square root out of the equation (literally)? > > >> > > >> Something like > > > > > > Looks good. Yes that's a better fix. > > > > > > > Andi, did you have a way to measure the impact, and if so, could I get > > you to run the numbers again with my patch? > > I got the numbers from the 0day runs (AIM7 gets faster) > In theory if you post the patch that should happen automatically > (checking with Fengguang) Yes we bisect and report a lot of runtime performance changes. On the other hand, some few will be missed, too, due to various unstableness issues. Anyway if you would like to study performance impacts of a patch, feel free to send us requests to gather the numbers and do comparison. Thanks, Fengguang
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, Feb 09, 2016 at 12:44:00PM -0800, Andi Kleen wrote: > On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote: > > On Mon, Feb 01 2016, Andi Kleenwrote: > > > > > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: > > >> On Thu, Jan 28 2016, Andi Kleen wrote: > > >> > > >> > From: Andi Kleen > > >> > > > >> > The menu cpuidle governor does at least two int_sqrt() each time > > >> > we go into idle in get_typical_interval to compute stddev > > >> > > > >> > int_sqrts take 100-120 cycles each. Short idle latency is important > > >> > for many workloads. > > >> > > > >> > > >> If you want to optimize get_typical_interval(), why not just take the > > >> square root out of the equation (literally)? > > >> > > >> Something like > > > > > > Looks good. Yes that's a better fix. > > > > > > > Andi, did you have a way to measure the impact, and if so, could I get > > you to run the numbers again with my patch? > > I got the numbers from the 0day runs (AIM7 gets faster) > In theory if you post the patch that should happen automatically > (checking with Fengguang) Yes we bisect and report a lot of runtime performance changes. On the other hand, some few will be missed, too, due to various unstableness issues. Anyway if you would like to study performance impacts of a patch, feel free to send us requests to gather the numbers and do comparison. Thanks, Fengguang
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote: > On Mon, Feb 01 2016, Andi Kleen wrote: > > > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: > >> On Thu, Jan 28 2016, Andi Kleen wrote: > >> > >> > From: Andi Kleen > >> > > >> > The menu cpuidle governor does at least two int_sqrt() each time > >> > we go into idle in get_typical_interval to compute stddev > >> > > >> > int_sqrts take 100-120 cycles each. Short idle latency is important > >> > for many workloads. > >> > > >> > >> If you want to optimize get_typical_interval(), why not just take the > >> square root out of the equation (literally)? > >> > >> Something like > > > > Looks good. Yes that's a better fix. > > > > Andi, did you have a way to measure the impact, and if so, could I get > you to run the numbers again with my patch? I got the numbers from the 0day runs (AIM7 gets faster) In theory if you post the patch that should happen automatically (checking with Fengguang) -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote: > On Mon, Feb 01 2016, Andi Kleenwrote: > > > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: > >> On Thu, Jan 28 2016, Andi Kleen wrote: > >> > >> > From: Andi Kleen > >> > > >> > The menu cpuidle governor does at least two int_sqrt() each time > >> > we go into idle in get_typical_interval to compute stddev > >> > > >> > int_sqrts take 100-120 cycles each. Short idle latency is important > >> > for many workloads. > >> > > >> > >> If you want to optimize get_typical_interval(), why not just take the > >> square root out of the equation (literally)? > >> > >> Something like > > > > Looks good. Yes that's a better fix. > > > > Andi, did you have a way to measure the impact, and if so, could I get > you to run the numbers again with my patch? I got the numbers from the 0day runs (AIM7 gets faster) In theory if you post the patch that should happen automatically (checking with Fengguang) -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Mon, Feb 01 2016, Andi Kleen wrote: > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: >> On Thu, Jan 28 2016, Andi Kleen wrote: >> >> > From: Andi Kleen >> > >> > The menu cpuidle governor does at least two int_sqrt() each time >> > we go into idle in get_typical_interval to compute stddev >> > >> > int_sqrts take 100-120 cycles each. Short idle latency is important >> > for many workloads. >> > >> >> If you want to optimize get_typical_interval(), why not just take the >> square root out of the equation (literally)? >> >> Something like > > Looks good. Yes that's a better fix. > Andi, did you have a way to measure the impact, and if so, could I get you to run the numbers again with my patch? Thanks, Rasmus
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Mon, Feb 01 2016, Andi Kleenwrote: > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: >> On Thu, Jan 28 2016, Andi Kleen wrote: >> >> > From: Andi Kleen >> > >> > The menu cpuidle governor does at least two int_sqrt() each time >> > we go into idle in get_typical_interval to compute stddev >> > >> > int_sqrts take 100-120 cycles each. Short idle latency is important >> > for many workloads. >> > >> >> If you want to optimize get_typical_interval(), why not just take the >> square root out of the equation (literally)? >> >> Something like > > Looks good. Yes that's a better fix. > Andi, did you have a way to measure the impact, and if so, could I get you to run the numbers again with my patch? Thanks, Rasmus
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, 2016-02-02 at 21:46 +0100, Rasmus Villemoes wrote: > On Tue, Feb 02 2016, Eric Dumazet wrote: > > > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > > > >> Thanks. (Is there a good way to tell gcc that avg*avg is actually a > >> 32x32->64 multiplication?) > > > > If avg is 32bit, compiler does that for you. > > > > u32 avg = ... > > > > u64 result = (u64)avg * avg; > > Yeah, but in this case avg is u64 because it is used to temporarily > contain the sum of a bunch of u32s, before being divided by #bunch. So > I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very > readable :-/ > > I just thought the scenario of a u64 known to be holding a value < 2^32 > was common enough that some utility macros already existed. > > Rasmus crypto/vmac.c has this, you could make it generic maybe. #define MUL32(i1, i2) ((u64)(u32)(i1)*(u32)(i2))
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, Feb 02 2016, Eric Dumazet wrote: > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > >> Thanks. (Is there a good way to tell gcc that avg*avg is actually a >> 32x32->64 multiplication?) > > If avg is 32bit, compiler does that for you. > > u32 avg = ... > > u64 result = (u64)avg * avg; Yeah, but in this case avg is u64 because it is used to temporarily contain the sum of a bunch of u32s, before being divided by #bunch. So I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very readable :-/ I just thought the scenario of a u64 known to be holding a value < 2^32 was common enough that some utility macros already existed. Rasmus
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, Feb 02 2016, Eric Dumazetwrote: > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > >> Thanks. (Is there a good way to tell gcc that avg*avg is actually a >> 32x32->64 multiplication?) > > If avg is 32bit, compiler does that for you. > > u32 avg = ... > > u64 result = (u64)avg * avg; Yeah, but in this case avg is u64 because it is used to temporarily contain the sum of a bunch of u32s, before being divided by #bunch. So I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very readable :-/ I just thought the scenario of a u64 known to be holding a value < 2^32 was common enough that some utility macros already existed. Rasmus
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, 2016-02-02 at 21:46 +0100, Rasmus Villemoes wrote: > On Tue, Feb 02 2016, Eric Dumazetwrote: > > > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > > > >> Thanks. (Is there a good way to tell gcc that avg*avg is actually a > >> 32x32->64 multiplication?) > > > > If avg is 32bit, compiler does that for you. > > > > u32 avg = ... > > > > u64 result = (u64)avg * avg; > > Yeah, but in this case avg is u64 because it is used to temporarily > contain the sum of a bunch of u32s, before being divided by #bunch. So > I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very > readable :-/ > > I just thought the scenario of a u64 known to be holding a value < 2^32 > was common enough that some utility macros already existed. > > Rasmus crypto/vmac.c has this, you could make it generic maybe. #define MUL32(i1, i2) ((u64)(u32)(i1)*(u32)(i2))
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > Thanks. (Is there a good way to tell gcc that avg*avg is actually a > 32x32->64 multiplication?) If avg is 32bit, compiler does that for you. u32 avg = ... u64 result = (u64)avg * avg;
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, Feb 02, 2016 at 12:08:46AM +0100, Rasmus Villemoes wrote: > On Mon, Feb 01 2016, Andi Kleen wrote: > > > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: > >> On Thu, Jan 28 2016, Andi Kleen wrote: > >> > >> > From: Andi Kleen > >> > > >> > The menu cpuidle governor does at least two int_sqrt() each time > >> > we go into idle in get_typical_interval to compute stddev > >> > > >> > int_sqrts take 100-120 cycles each. Short idle latency is important > >> > for many workloads. > >> > > >> > >> If you want to optimize get_typical_interval(), why not just take the > >> square root out of the equation (literally)? > >> > >> Something like > > > > Looks good. Yes that's a better fix. > > > > Thanks. (Is there a good way to tell gcc that avg*avg is actually a > 32x32->64 multiplication?) I don't think there is, but you could define a custom macro with a fallback on pure 64x64->64. -Andi
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Mon, Feb 01 2016, Andi Kleen wrote: > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: >> On Thu, Jan 28 2016, Andi Kleen wrote: >> >> > From: Andi Kleen >> > >> > The menu cpuidle governor does at least two int_sqrt() each time >> > we go into idle in get_typical_interval to compute stddev >> > >> > int_sqrts take 100-120 cycles each. Short idle latency is important >> > for many workloads. >> > >> >> If you want to optimize get_typical_interval(), why not just take the >> square root out of the equation (literally)? >> >> Something like > > Looks good. Yes that's a better fix. > Thanks. (Is there a good way to tell gcc that avg*avg is actually a 32x32->64 multiplication?) While there and doing the math, I noticed that the variance computation may _theoretically_ overflow (if half the observations are 0, half C, the variance before the division should be around INTERVALS*C^2/4, which is around 2^65 for C=UINT_MAX and INTERVALS=8). I have no idea if it actually matters, but it can be fixed by lowering the initial threshold from UINT_MAX to sqrt(4*U64_MAX/INTERVALS) ~~ 3e9. However, this would make it possible that all observations are larger than the initial threshold, so we'd have to protect against a division by zero... Rasmus
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: > On Thu, Jan 28 2016, Andi Kleen wrote: > > > From: Andi Kleen > > > > The menu cpuidle governor does at least two int_sqrt() each time > > we go into idle in get_typical_interval to compute stddev > > > > int_sqrts take 100-120 cycles each. Short idle latency is important > > for many workloads. > > > > If you want to optimize get_typical_interval(), why not just take the > square root out of the equation (literally)? > > Something like Looks good. Yes that's a better fix. -Andi
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Thu, Jan 28 2016, Andi Kleen wrote: > From: Andi Kleen > > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > If you want to optimize get_typical_interval(), why not just take the square root out of the equation (literally)? Something like From: Rasmus Villemoes Date: Mon, 1 Feb 2016 21:43:18 +0100 Subject: [PATCH] cpuidle: menu: avoid expensive square root computation Computing the integer square root is a rather expensive operation, at least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64 bit platforms, doing an extra comparison to a constant (variance <= U64_MAX/36). On 64 bit platforms, this does mean that we add a restriction on the range of the variance where we end up using the estimate (since previously the stddev <= ULONG_MAX was a tautology), but on the other hand, we extend the range quite substantially on 32 bit platforms - in both cases, we now allow standard deviations up to 715 seconds, which is for example guaranteed if all observations are less than 1430 seconds. Signed-off-by: Rasmus Villemoes --- drivers/cpuidle/governors/menu.c | 35 +-- 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index 0742b3296673..beef7ae123ba 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -200,7 +200,7 @@ static void get_typical_interval(struct menu_device *data) { int i, divisor; unsigned int max, thresh; - uint64_t avg, stddev; + uint64_t avg, variance; thresh = UINT_MAX; /* Discard outliers above this value */ @@ -224,36 +224,35 @@ again: else do_div(avg, divisor); - /* Then try to determine standard deviation */ - stddev = 0; + /* Then try to determine variance */ + variance = 0; for (i = 0; i < INTERVALS; i++) { unsigned int value = data->intervals[i]; if (value <= thresh) { int64_t diff = value - avg; - stddev += diff * diff; + variance += diff * diff; } } if (divisor == INTERVALS) - stddev >>= INTERVAL_SHIFT; + variance >>= INTERVAL_SHIFT; else - do_div(stddev, divisor); + do_div(variance, divisor); /* -* The typical interval is obtained when standard deviation is small -* or standard deviation is small compared to the average interval. -* -* int_sqrt() formal parameter type is unsigned long. When the -* greatest difference to an outlier exceeds ~65 ms * sqrt(divisor) -* the resulting squared standard deviation exceeds the input domain -* of int_sqrt on platforms where unsigned long is 32 bits in size. -* In such case reject the candidate average. +* The typical interval is obtained when standard deviation is +* small (stddev <= 20 us, variance <= 400 us^2) or standard +* deviation is small compared to the average interval (avg > +* 6*stddev, avg^2 > 36*variance). The average is smaller than +* UINT_MAX aka U32_MAX, so computing its square does not +* overflow a u64. We simply reject this candidate average if +* the standard deviation is greater than 715 s (which is +* rather unlikely). * * Use this result only if there is no timer to wake us up sooner. */ - if (likely(stddev <= ULONG_MAX)) { - stddev = int_sqrt(stddev); - if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3)) - || stddev <= 20) { + if (likely(variance <= U64_MAX/36)) { + if (((avg*avg > variance*36) && (divisor * 4 >= INTERVALS * 3)) + || variance <= 400) { if (data->next_timer_us > avg) data->predicted_us = avg; return; -- 2.6.1
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Thu, Jan 28 2016, Andi Kleenwrote: > From: Andi Kleen > > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > If you want to optimize get_typical_interval(), why not just take the square root out of the equation (literally)? Something like From: Rasmus Villemoes Date: Mon, 1 Feb 2016 21:43:18 +0100 Subject: [PATCH] cpuidle: menu: avoid expensive square root computation Computing the integer square root is a rather expensive operation, at least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64 bit platforms, doing an extra comparison to a constant (variance <= U64_MAX/36). On 64 bit platforms, this does mean that we add a restriction on the range of the variance where we end up using the estimate (since previously the stddev <= ULONG_MAX was a tautology), but on the other hand, we extend the range quite substantially on 32 bit platforms - in both cases, we now allow standard deviations up to 715 seconds, which is for example guaranteed if all observations are less than 1430 seconds. Signed-off-by: Rasmus Villemoes --- drivers/cpuidle/governors/menu.c | 35 +-- 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index 0742b3296673..beef7ae123ba 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -200,7 +200,7 @@ static void get_typical_interval(struct menu_device *data) { int i, divisor; unsigned int max, thresh; - uint64_t avg, stddev; + uint64_t avg, variance; thresh = UINT_MAX; /* Discard outliers above this value */ @@ -224,36 +224,35 @@ again: else do_div(avg, divisor); - /* Then try to determine standard deviation */ - stddev = 0; + /* Then try to determine variance */ + variance = 0; for (i = 0; i < INTERVALS; i++) { unsigned int value = data->intervals[i]; if (value <= thresh) { int64_t diff = value - avg; - stddev += diff * diff; + variance += diff * diff; } } if (divisor == INTERVALS) - stddev >>= INTERVAL_SHIFT; + variance >>= INTERVAL_SHIFT; else - do_div(stddev, divisor); + do_div(variance, divisor); /* -* The typical interval is obtained when standard deviation is small -* or standard deviation is small compared to the average interval. -* -* int_sqrt() formal parameter type is unsigned long. When the -* greatest difference to an outlier exceeds ~65 ms * sqrt(divisor) -* the resulting squared standard deviation exceeds the input domain -* of int_sqrt on platforms where unsigned long is 32 bits in size. -* In such case reject the candidate average. +* The typical interval is obtained when standard deviation is +* small (stddev <= 20 us, variance <= 400 us^2) or standard +* deviation is small compared to the average interval (avg > +* 6*stddev, avg^2 > 36*variance). The average is smaller than +* UINT_MAX aka U32_MAX, so computing its square does not +* overflow a u64. We simply reject this candidate average if +* the standard deviation is greater than 715 s (which is +* rather unlikely). * * Use this result only if there is no timer to wake us up sooner. */ - if (likely(stddev <= ULONG_MAX)) { - stddev = int_sqrt(stddev); - if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3)) - || stddev <= 20) { + if (likely(variance <= U64_MAX/36)) { + if (((avg*avg > variance*36) && (divisor * 4 >= INTERVALS * 3)) + || variance <= 400) { if (data->next_timer_us > avg) data->predicted_us = avg; return; -- 2.6.1
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Mon, Feb 01 2016, Andi Kleenwrote: > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: >> On Thu, Jan 28 2016, Andi Kleen wrote: >> >> > From: Andi Kleen >> > >> > The menu cpuidle governor does at least two int_sqrt() each time >> > we go into idle in get_typical_interval to compute stddev >> > >> > int_sqrts take 100-120 cycles each. Short idle latency is important >> > for many workloads. >> > >> >> If you want to optimize get_typical_interval(), why not just take the >> square root out of the equation (literally)? >> >> Something like > > Looks good. Yes that's a better fix. > Thanks. (Is there a good way to tell gcc that avg*avg is actually a 32x32->64 multiplication?) While there and doing the math, I noticed that the variance computation may _theoretically_ overflow (if half the observations are 0, half C, the variance before the division should be around INTERVALS*C^2/4, which is around 2^65 for C=UINT_MAX and INTERVALS=8). I have no idea if it actually matters, but it can be fixed by lowering the initial threshold from UINT_MAX to sqrt(4*U64_MAX/INTERVALS) ~~ 3e9. However, this would make it possible that all observations are larger than the initial threshold, so we'd have to protect against a division by zero... Rasmus
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: > On Thu, Jan 28 2016, Andi Kleenwrote: > > > From: Andi Kleen > > > > The menu cpuidle governor does at least two int_sqrt() each time > > we go into idle in get_typical_interval to compute stddev > > > > int_sqrts take 100-120 cycles each. Short idle latency is important > > for many workloads. > > > > If you want to optimize get_typical_interval(), why not just take the > square root out of the equation (literally)? > > Something like Looks good. Yes that's a better fix. -Andi
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, Feb 02, 2016 at 12:08:46AM +0100, Rasmus Villemoes wrote: > On Mon, Feb 01 2016, Andi Kleenwrote: > > > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote: > >> On Thu, Jan 28 2016, Andi Kleen wrote: > >> > >> > From: Andi Kleen > >> > > >> > The menu cpuidle governor does at least two int_sqrt() each time > >> > we go into idle in get_typical_interval to compute stddev > >> > > >> > int_sqrts take 100-120 cycles each. Short idle latency is important > >> > for many workloads. > >> > > >> > >> If you want to optimize get_typical_interval(), why not just take the > >> square root out of the equation (literally)? > >> > >> Something like > > > > Looks good. Yes that's a better fix. > > > > Thanks. (Is there a good way to tell gcc that avg*avg is actually a > 32x32->64 multiplication?) I don't think there is, but you could define a custom macro with a fallback on pure 64x64->64. -Andi
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote: > Thanks. (Is there a good way to tell gcc that avg*avg is actually a > 32x32->64 multiplication?) If avg is 32bit, compiler does that for you. u32 avg = ... u64 result = (u64)avg * avg;
Re: [PATCH] Optimize int_sqrt for small values for faster idle
Hello, > - m = 1UL << (BITS_PER_LONG - 2); > + if (x <= 0x) { > + if (m <= 0xff) > + m = 1UL << (8 - 2); > + else > + m = 1UL << (16 - 2); > + } else if (x <= 0x) > + m = 1UL << (32 - 2); > + else > + m = 1UL << (BITS_PER_LONG - 2); >while (m != 0) { >b = y + m; >y >>= 1; > I think, m can be initialized with 1 << (greatest multiple of 2 less than or equal to (position of most significant bit of x)) i.e. 1 << ((position of most significant bit of x) & 62) without changing the outcome of the original algorithm (as long as x>= 2). I believe, that for (position of most significant bit of x) there is an efficient macro, and some processors directly have an instruction for it. So this would probably be faster than your suggestion for an initial starting value and give an even better starting value (cutting in some cases further on the number of while loop interations). If one just wants to achieve a result with a certain relative error in terms of the fraction of the input, one can probably only look at the most significant bit and a few following bits of x. Sincerely, Thomas
Re: [PATCH] Optimize int_sqrt for small values for faster idle
Hello, > - m = 1UL << (BITS_PER_LONG - 2); > + if (x <= 0x) { > + if (m <= 0xff) > + m = 1UL << (8 - 2); > + else > + m = 1UL << (16 - 2); > + } else if (x <= 0x) > + m = 1UL << (32 - 2); > + else > + m = 1UL << (BITS_PER_LONG - 2); >while (m != 0) { >b = y + m; >y >>= 1; > I think, m can be initialized with 1 << (greatest multiple of 2 less than or equal to (position of most significant bit of x)) i.e. 1 << ((position of most significant bit of x) & 62) without changing the outcome of the original algorithm (as long as x>= 2). I believe, that for (position of most significant bit of x) there is an efficient macro, and some processors directly have an instruction for it. So this would probably be faster than your suggestion for an initial starting value and give an even better starting value (cutting in some cases further on the number of while loop interations). If one just wants to achieve a result with a certain relative error in terms of the fraction of the input, one can probably only look at the most significant bit and a few following bits of x. Sincerely, Thomas
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Thursday, January 28, 2016 01:42:45 PM Andi Kleen wrote: > From: Andi Kleen > > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > > I instrumented the function on my workstation and most values are > 16bit only and most others 32bit (50% percentile is 122094, > 75% is 3699533). > > sqrt is implemented by starting with an initial estimation, > and then iterating. int_sqrt currently only uses a fixed > estimating which is good for 64bits worth of input. > > This patch adds some checks at the beginning to start with > a better estimate for values fitting in 8, 16bit and 32bit. > This makes int_sqrt between 60+% faster for values in 16bit, > and still somewhat faster (between 10 and 30%) for larger values > upto 32bit. Full 64bit is slightly slower. > > This optimizes the short idle calls and does not hurt the > long sleep (which probably do not care) much. > > An alternative would be a full table drive approach, or > trying some inverted sqrt optimization, but this simple change > already seems to have a good payoff. I'm wondering if you have any numbers on how much of a difference this makes in practice in terms of energy consumption, performance, latency etc. Thanks, Rafael
Re: [PATCH] Optimize int_sqrt for small values for faster idle
> This thread might be relevant: > > https://lkml.org/lkml/2015/2/2/600 > > and perhaps using fls might still be a good approach. Linus wrote: >>> We *probably* have some argument range that we care more about, which is why I'd like to know what the profile is that triggered this optimization, and what the common argument range is. <<< That's exactly what I did. Used perf probe to get the common range and optimize for that. -Andi
Re: [PATCH] Optimize int_sqrt for small values for faster idle
Andi Kleen writes: > From: Andi Kleen > > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev Added a stupid typo in the last minute. I'll post a new version. -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote: > From: Andi Kleen > > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > > I instrumented the function on my workstation and most values are > 16bit only and most others 32bit (50% percentile is 122094, > 75% is 3699533). > > sqrt is implemented by starting with an initial estimation, > and then iterating. int_sqrt currently only uses a fixed > estimating which is good for 64bits worth of input. > > This patch adds some checks at the beginning to start with > a better estimate for values fitting in 8, 16bit and 32bit. > This makes int_sqrt between 60+% faster for values in 16bit, > and still somewhat faster (between 10 and 30%) for larger values > upto 32bit. Full 64bit is slightly slower. > > This optimizes the short idle calls and does not hurt the > long sleep (which probably do not care) much. > > An alternative would be a full table drive approach, or > trying some inverted sqrt optimization, but this simple change > already seems to have a good payoff. > > Signed-off-by: Andi Kleen > --- > lib/int_sqrt.c | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c > index 1ef4cc3..2479ccf 100644 > --- a/lib/int_sqrt.c > +++ b/lib/int_sqrt.c > @@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x) > if (x <= 1) > return x; The above test (x <= 1) should also be moved > > - m = 1UL << (BITS_PER_LONG - 2); > + if (x <= 0x) { > + if (m <= 0xff) m or x ? if (x <= 0xff) looks more correct. > + m = 1UL << (8 - 2); > + else > + m = 1UL << (16 - 2); > + } else if (x <= 0x) > + m = 1UL << (32 - 2); > + else > + m = 1UL << (BITS_PER_LONG - 2); > while (m != 0) { > b = y + m; > y >>= 1;
Re: [PATCH] Optimize int_sqrt for small values for faster idle
(resending with email addresses that shouldn't bounce) (adding Anshul Garg) (fixed Davidlohr Bueso's address) On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote: > From: Andi Kleen > > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > > I instrumented the function on my workstation and most values are > 16bit only and most others 32bit (50% percentile is 122094, > 75% is 3699533). > > sqrt is implemented by starting with an initial estimation, > and then iterating. int_sqrt currently only uses a fixed > estimating which is good for 64bits worth of input. > > This patch adds some checks at the beginning to start with > a better estimate for values fitting in 8, 16bit and 32bit. > This makes int_sqrt between 60+% faster for values in 16bit, > and still somewhat faster (between 10 and 30%) for larger values > upto 32bit. Full 64bit is slightly slower. > > This optimizes the short idle calls and does not hurt the > long sleep (which probably do not care) much. > > An alternative would be a full table drive approach, or > trying some inverted sqrt optimization, but this simple change > already seems to have a good payoff. This thread might be relevant: https://lkml.org/lkml/2015/2/2/600 and perhaps using fls might still be a good approach.
Re: [PATCH] Optimize int_sqrt for small values for faster idle
(adding Anshul Garg) On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote: > From: Andi Kleen > > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > > I instrumented the function on my workstation and most values are > 16bit only and most others 32bit (50% percentile is 122094, > 75% is 3699533). > > sqrt is implemented by starting with an initial estimation, > and then iterating. int_sqrt currently only uses a fixed > estimating which is good for 64bits worth of input. > > This patch adds some checks at the beginning to start with > a better estimate for values fitting in 8, 16bit and 32bit. > This makes int_sqrt between 60+% faster for values in 16bit, > and still somewhat faster (between 10 and 30%) for larger values > upto 32bit. Full 64bit is slightly slower. > > This optimizes the short idle calls and does not hurt the > long sleep (which probably do not care) much. > > An alternative would be a full table drive approach, or > trying some inverted sqrt optimization, but this simple change > already seems to have a good payoff. This thread might be relevant: https://lkml.org/lkml/2015/2/2/600 and perhaps using fls might still be a good approach.
Re: [PATCH] Optimize int_sqrt for small values for faster idle
Hi Andi, [auto build test WARNING on v4.5-rc1] [also build test WARNING on next-20160128] [if your patch is applied to the wrong git tree, please drop us a note to help improving the system] url: https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629 config: avr32-atngw100_defconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=avr32 All warnings (new ones prefixed by >>): lib/int_sqrt.c: In function 'int_sqrt': >> lib/int_sqrt.c:25: warning: 'm' is used uninitialized in this function vim +/m +25 lib/int_sqrt.c 9 #include 10 11 /** 12 * int_sqrt - rough approximation to sqrt 13 * @x: integer of which to calculate the sqrt 14 * 15 * A very rough approximation to the sqrt() function. 16 */ 17 unsigned long int_sqrt(unsigned long x) 18 { 19 unsigned long b, m, y = 0; 20 21 if (x <= 1) 22 return x; 23 24 if (x <= 0x) { > 25 if (m <= 0xff) 26 m = 1UL << (8 - 2); 27 else 28 m = 1UL << (16 - 2); 29 } else if (x <= 0x) 30 m = 1UL << (32 - 2); 31 else 32 m = 1UL << (BITS_PER_LONG - 2); 33 while (m != 0) { --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH] Optimize int_sqrt for small values for faster idle
Hi Andi, [auto build test WARNING on v4.5-rc1] [also build test WARNING on next-20160128] [if your patch is applied to the wrong git tree, please drop us a note to help improving the system] url: https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629 config: x86_64-randconfig-x015-01270835 (attached as .config) reproduce: # save the attached .config to linux build tree make ARCH=x86_64 Note: it may well be a FALSE warning. FWIW you are at least aware of it now. http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings All warnings (new ones prefixed by >>): lib/int_sqrt.c: In function 'int_sqrt': >> lib/int_sqrt.c:25:6: warning: 'm' may be used uninitialized in this function >> [-Wmaybe-uninitialized] if (m <= 0xff) ^ vim +/m +25 lib/int_sqrt.c 9 #include 10 11 /** 12 * int_sqrt - rough approximation to sqrt 13 * @x: integer of which to calculate the sqrt 14 * 15 * A very rough approximation to the sqrt() function. 16 */ 17 unsigned long int_sqrt(unsigned long x) 18 { 19 unsigned long b, m, y = 0; 20 21 if (x <= 1) 22 return x; 23 24 if (x <= 0x) { > 25 if (m <= 0xff) 26 m = 1UL << (8 - 2); 27 else 28 m = 1UL << (16 - 2); 29 } else if (x <= 0x) 30 m = 1UL << (32 - 2); 31 else 32 m = 1UL << (BITS_PER_LONG - 2); 33 while (m != 0) { --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
[PATCH] Optimize int_sqrt for small values for faster idle
From: Andi Kleen The menu cpuidle governor does at least two int_sqrt() each time we go into idle in get_typical_interval to compute stddev int_sqrts take 100-120 cycles each. Short idle latency is important for many workloads. I instrumented the function on my workstation and most values are 16bit only and most others 32bit (50% percentile is 122094, 75% is 3699533). sqrt is implemented by starting with an initial estimation, and then iterating. int_sqrt currently only uses a fixed estimating which is good for 64bits worth of input. This patch adds some checks at the beginning to start with a better estimate for values fitting in 8, 16bit and 32bit. This makes int_sqrt between 60+% faster for values in 16bit, and still somewhat faster (between 10 and 30%) for larger values upto 32bit. Full 64bit is slightly slower. This optimizes the short idle calls and does not hurt the long sleep (which probably do not care) much. An alternative would be a full table drive approach, or trying some inverted sqrt optimization, but this simple change already seems to have a good payoff. Signed-off-by: Andi Kleen --- lib/int_sqrt.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c index 1ef4cc3..2479ccf 100644 --- a/lib/int_sqrt.c +++ b/lib/int_sqrt.c @@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x) if (x <= 1) return x; - m = 1UL << (BITS_PER_LONG - 2); + if (x <= 0x) { + if (m <= 0xff) + m = 1UL << (8 - 2); + else + m = 1UL << (16 - 2); + } else if (x <= 0x) + m = 1UL << (32 - 2); + else + m = 1UL << (BITS_PER_LONG - 2); while (m != 0) { b = y + m; y >>= 1; -- 2.4.3
Re: [PATCH] Optimize int_sqrt for small values for faster idle
(resending with email addresses that shouldn't bounce) (adding Anshul Garg) (fixed Davidlohr Bueso's address) On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote: > From: Andi Kleen> > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > > I instrumented the function on my workstation and most values are > 16bit only and most others 32bit (50% percentile is 122094, > 75% is 3699533). > > sqrt is implemented by starting with an initial estimation, > and then iterating. int_sqrt currently only uses a fixed > estimating which is good for 64bits worth of input. > > This patch adds some checks at the beginning to start with > a better estimate for values fitting in 8, 16bit and 32bit. > This makes int_sqrt between 60+% faster for values in 16bit, > and still somewhat faster (between 10 and 30%) for larger values > upto 32bit. Full 64bit is slightly slower. > > This optimizes the short idle calls and does not hurt the > long sleep (which probably do not care) much. > > An alternative would be a full table drive approach, or > trying some inverted sqrt optimization, but this simple change > already seems to have a good payoff. This thread might be relevant: https://lkml.org/lkml/2015/2/2/600 and perhaps using fls might still be a good approach.
Re: [PATCH] Optimize int_sqrt for small values for faster idle
Andi Kleenwrites: > From: Andi Kleen > > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev Added a stupid typo in the last minute. I'll post a new version. -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] Optimize int_sqrt for small values for faster idle
Hi Andi, [auto build test WARNING on v4.5-rc1] [also build test WARNING on next-20160128] [if your patch is applied to the wrong git tree, please drop us a note to help improving the system] url: https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629 config: x86_64-randconfig-x015-01270835 (attached as .config) reproduce: # save the attached .config to linux build tree make ARCH=x86_64 Note: it may well be a FALSE warning. FWIW you are at least aware of it now. http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings All warnings (new ones prefixed by >>): lib/int_sqrt.c: In function 'int_sqrt': >> lib/int_sqrt.c:25:6: warning: 'm' may be used uninitialized in this function >> [-Wmaybe-uninitialized] if (m <= 0xff) ^ vim +/m +25 lib/int_sqrt.c 9 #include 10 11 /** 12 * int_sqrt - rough approximation to sqrt 13 * @x: integer of which to calculate the sqrt 14 * 15 * A very rough approximation to the sqrt() function. 16 */ 17 unsigned long int_sqrt(unsigned long x) 18 { 19 unsigned long b, m, y = 0; 20 21 if (x <= 1) 22 return x; 23 24 if (x <= 0x) { > 25 if (m <= 0xff) 26 m = 1UL << (8 - 2); 27 else 28 m = 1UL << (16 - 2); 29 } else if (x <= 0x) 30 m = 1UL << (32 - 2); 31 else 32 m = 1UL << (BITS_PER_LONG - 2); 33 while (m != 0) { --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH] Optimize int_sqrt for small values for faster idle
Hi Andi, [auto build test WARNING on v4.5-rc1] [also build test WARNING on next-20160128] [if your patch is applied to the wrong git tree, please drop us a note to help improving the system] url: https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629 config: avr32-atngw100_defconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=avr32 All warnings (new ones prefixed by >>): lib/int_sqrt.c: In function 'int_sqrt': >> lib/int_sqrt.c:25: warning: 'm' is used uninitialized in this function vim +/m +25 lib/int_sqrt.c 9 #include 10 11 /** 12 * int_sqrt - rough approximation to sqrt 13 * @x: integer of which to calculate the sqrt 14 * 15 * A very rough approximation to the sqrt() function. 16 */ 17 unsigned long int_sqrt(unsigned long x) 18 { 19 unsigned long b, m, y = 0; 20 21 if (x <= 1) 22 return x; 23 24 if (x <= 0x) { > 25 if (m <= 0xff) 26 m = 1UL << (8 - 2); 27 else 28 m = 1UL << (16 - 2); 29 } else if (x <= 0x) 30 m = 1UL << (32 - 2); 31 else 32 m = 1UL << (BITS_PER_LONG - 2); 33 while (m != 0) { --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH] Optimize int_sqrt for small values for faster idle
(adding Anshul Garg) On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote: > From: Andi Kleen> > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > > I instrumented the function on my workstation and most values are > 16bit only and most others 32bit (50% percentile is 122094, > 75% is 3699533). > > sqrt is implemented by starting with an initial estimation, > and then iterating. int_sqrt currently only uses a fixed > estimating which is good for 64bits worth of input. > > This patch adds some checks at the beginning to start with > a better estimate for values fitting in 8, 16bit and 32bit. > This makes int_sqrt between 60+% faster for values in 16bit, > and still somewhat faster (between 10 and 30%) for larger values > upto 32bit. Full 64bit is slightly slower. > > This optimizes the short idle calls and does not hurt the > long sleep (which probably do not care) much. > > An alternative would be a full table drive approach, or > trying some inverted sqrt optimization, but this simple change > already seems to have a good payoff. This thread might be relevant: https://lkml.org/lkml/2015/2/2/600 and perhaps using fls might still be a good approach.
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote: > From: Andi Kleen> > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > > I instrumented the function on my workstation and most values are > 16bit only and most others 32bit (50% percentile is 122094, > 75% is 3699533). > > sqrt is implemented by starting with an initial estimation, > and then iterating. int_sqrt currently only uses a fixed > estimating which is good for 64bits worth of input. > > This patch adds some checks at the beginning to start with > a better estimate for values fitting in 8, 16bit and 32bit. > This makes int_sqrt between 60+% faster for values in 16bit, > and still somewhat faster (between 10 and 30%) for larger values > upto 32bit. Full 64bit is slightly slower. > > This optimizes the short idle calls and does not hurt the > long sleep (which probably do not care) much. > > An alternative would be a full table drive approach, or > trying some inverted sqrt optimization, but this simple change > already seems to have a good payoff. > > Signed-off-by: Andi Kleen > --- > lib/int_sqrt.c | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c > index 1ef4cc3..2479ccf 100644 > --- a/lib/int_sqrt.c > +++ b/lib/int_sqrt.c > @@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x) > if (x <= 1) > return x; The above test (x <= 1) should also be moved > > - m = 1UL << (BITS_PER_LONG - 2); > + if (x <= 0x) { > + if (m <= 0xff) m or x ? if (x <= 0xff) looks more correct. > + m = 1UL << (8 - 2); > + else > + m = 1UL << (16 - 2); > + } else if (x <= 0x) > + m = 1UL << (32 - 2); > + else > + m = 1UL << (BITS_PER_LONG - 2); > while (m != 0) { > b = y + m; > y >>= 1;
Re: [PATCH] Optimize int_sqrt for small values for faster idle
> This thread might be relevant: > > https://lkml.org/lkml/2015/2/2/600 > > and perhaps using fls might still be a good approach. Linus wrote: >>> We *probably* have some argument range that we care more about, which is why I'd like to know what the profile is that triggered this optimization, and what the common argument range is. <<< That's exactly what I did. Used perf probe to get the common range and optimize for that. -Andi
[PATCH] Optimize int_sqrt for small values for faster idle
From: Andi KleenThe menu cpuidle governor does at least two int_sqrt() each time we go into idle in get_typical_interval to compute stddev int_sqrts take 100-120 cycles each. Short idle latency is important for many workloads. I instrumented the function on my workstation and most values are 16bit only and most others 32bit (50% percentile is 122094, 75% is 3699533). sqrt is implemented by starting with an initial estimation, and then iterating. int_sqrt currently only uses a fixed estimating which is good for 64bits worth of input. This patch adds some checks at the beginning to start with a better estimate for values fitting in 8, 16bit and 32bit. This makes int_sqrt between 60+% faster for values in 16bit, and still somewhat faster (between 10 and 30%) for larger values upto 32bit. Full 64bit is slightly slower. This optimizes the short idle calls and does not hurt the long sleep (which probably do not care) much. An alternative would be a full table drive approach, or trying some inverted sqrt optimization, but this simple change already seems to have a good payoff. Signed-off-by: Andi Kleen --- lib/int_sqrt.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c index 1ef4cc3..2479ccf 100644 --- a/lib/int_sqrt.c +++ b/lib/int_sqrt.c @@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x) if (x <= 1) return x; - m = 1UL << (BITS_PER_LONG - 2); + if (x <= 0x) { + if (m <= 0xff) + m = 1UL << (8 - 2); + else + m = 1UL << (16 - 2); + } else if (x <= 0x) + m = 1UL << (32 - 2); + else + m = 1UL << (BITS_PER_LONG - 2); while (m != 0) { b = y + m; y >>= 1; -- 2.4.3
Re: [PATCH] Optimize int_sqrt for small values for faster idle
On Thursday, January 28, 2016 01:42:45 PM Andi Kleen wrote: > From: Andi Kleen> > The menu cpuidle governor does at least two int_sqrt() each time > we go into idle in get_typical_interval to compute stddev > > int_sqrts take 100-120 cycles each. Short idle latency is important > for many workloads. > > I instrumented the function on my workstation and most values are > 16bit only and most others 32bit (50% percentile is 122094, > 75% is 3699533). > > sqrt is implemented by starting with an initial estimation, > and then iterating. int_sqrt currently only uses a fixed > estimating which is good for 64bits worth of input. > > This patch adds some checks at the beginning to start with > a better estimate for values fitting in 8, 16bit and 32bit. > This makes int_sqrt between 60+% faster for values in 16bit, > and still somewhat faster (between 10 and 30%) for larger values > upto 32bit. Full 64bit is slightly slower. > > This optimizes the short idle calls and does not hurt the > long sleep (which probably do not care) much. > > An alternative would be a full table drive approach, or > trying some inverted sqrt optimization, but this simple change > already seems to have a good payoff. I'm wondering if you have any numbers on how much of a difference this makes in practice in terms of energy consumption, performance, latency etc. Thanks, Rafael