Re: [PATCH] Optimize int_sqrt for small values for faster idle

2017-07-24 Thread Eric Dumazet
On Thu, 2017-07-20 at 12:10 +0200, Peter Zijlstra wrote:
> On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote:
> > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
> > 
> > > Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> > > 32x32->64 multiplication?)
> > 
> > If avg is 32bit, compiler does that for you.
> > 
> > u32 avg = ...
> > 
> > u64 result = (u64)avg * avg;
> 
> It does not in fact do that :/ See commit:
> 
>   9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")

Interesting :/

Thanks for letting me know !




Re: [PATCH] Optimize int_sqrt for small values for faster idle

2017-07-24 Thread Eric Dumazet
On Thu, 2017-07-20 at 12:10 +0200, Peter Zijlstra wrote:
> On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote:
> > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
> > 
> > > Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> > > 32x32->64 multiplication?)
> > 
> > If avg is 32bit, compiler does that for you.
> > 
> > u32 avg = ...
> > 
> > u64 result = (u64)avg * avg;
> 
> It does not in fact do that :/ See commit:
> 
>   9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")

Interesting :/

Thanks for letting me know !




Re: [PATCH] Optimize int_sqrt for small values for faster idle

2017-07-20 Thread Peter Zijlstra
On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote:
> On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
> 
> > Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> > 32x32->64 multiplication?)
> 
> If avg is 32bit, compiler does that for you.
> 
> u32 avg = ...
> 
> u64 result = (u64)avg * avg;

It does not in fact do that :/ See commit:

  9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2017-07-20 Thread Peter Zijlstra
On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote:
> On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
> 
> > Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> > 32x32->64 multiplication?)
> 
> If avg is 32bit, compiler does that for you.
> 
> u32 avg = ...
> 
> u64 result = (u64)avg * avg;

It does not in fact do that :/ See commit:

  9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-10 Thread Fengguang Wu
On Tue, Feb 09, 2016 at 12:44:00PM -0800, Andi Kleen wrote:
> On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote:
> > On Mon, Feb 01 2016, Andi Kleen  wrote:
> > 
> > > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> > >> On Thu, Jan 28 2016, Andi Kleen  wrote:
> > >> 
> > >> > From: Andi Kleen 
> > >> >
> > >> > The menu cpuidle governor does at least two int_sqrt() each time
> > >> > we go into idle in get_typical_interval to compute stddev
> > >> >
> > >> > int_sqrts take 100-120 cycles each. Short idle latency is important
> > >> > for many workloads.
> > >> >
> > >> 
> > >> If you want to optimize get_typical_interval(), why not just take the
> > >> square root out of the equation (literally)?
> > >> 
> > >> Something like
> > >
> > > Looks good. Yes that's a better fix.
> > >
> > 
> > Andi, did you have a way to measure the impact, and if so, could I get
> > you to run the numbers again with my patch?
> 
> I got the numbers from the 0day runs (AIM7 gets faster)
> In theory if you post the patch that should happen automatically
> (checking with Fengguang)

Yes we bisect and report a lot of runtime performance changes.
On the other hand, some few will be missed, too, due to various
unstableness issues. Anyway if you would like to study performance
impacts of a patch, feel free to send us requests to gather the
numbers and do comparison.

Thanks,
Fengguang


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-10 Thread Fengguang Wu
On Tue, Feb 09, 2016 at 12:44:00PM -0800, Andi Kleen wrote:
> On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote:
> > On Mon, Feb 01 2016, Andi Kleen  wrote:
> > 
> > > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> > >> On Thu, Jan 28 2016, Andi Kleen  wrote:
> > >> 
> > >> > From: Andi Kleen 
> > >> >
> > >> > The menu cpuidle governor does at least two int_sqrt() each time
> > >> > we go into idle in get_typical_interval to compute stddev
> > >> >
> > >> > int_sqrts take 100-120 cycles each. Short idle latency is important
> > >> > for many workloads.
> > >> >
> > >> 
> > >> If you want to optimize get_typical_interval(), why not just take the
> > >> square root out of the equation (literally)?
> > >> 
> > >> Something like
> > >
> > > Looks good. Yes that's a better fix.
> > >
> > 
> > Andi, did you have a way to measure the impact, and if so, could I get
> > you to run the numbers again with my patch?
> 
> I got the numbers from the 0day runs (AIM7 gets faster)
> In theory if you post the patch that should happen automatically
> (checking with Fengguang)

Yes we bisect and report a lot of runtime performance changes.
On the other hand, some few will be missed, too, due to various
unstableness issues. Anyway if you would like to study performance
impacts of a patch, feel free to send us requests to gather the
numbers and do comparison.

Thanks,
Fengguang


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-09 Thread Andi Kleen
On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote:
> On Mon, Feb 01 2016, Andi Kleen  wrote:
> 
> > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> >> On Thu, Jan 28 2016, Andi Kleen  wrote:
> >> 
> >> > From: Andi Kleen 
> >> >
> >> > The menu cpuidle governor does at least two int_sqrt() each time
> >> > we go into idle in get_typical_interval to compute stddev
> >> >
> >> > int_sqrts take 100-120 cycles each. Short idle latency is important
> >> > for many workloads.
> >> >
> >> 
> >> If you want to optimize get_typical_interval(), why not just take the
> >> square root out of the equation (literally)?
> >> 
> >> Something like
> >
> > Looks good. Yes that's a better fix.
> >
> 
> Andi, did you have a way to measure the impact, and if so, could I get
> you to run the numbers again with my patch?

I got the numbers from the 0day runs (AIM7 gets faster)
In theory if you post the patch that should happen automatically
(checking with Fengguang)

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-09 Thread Andi Kleen
On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote:
> On Mon, Feb 01 2016, Andi Kleen  wrote:
> 
> > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> >> On Thu, Jan 28 2016, Andi Kleen  wrote:
> >> 
> >> > From: Andi Kleen 
> >> >
> >> > The menu cpuidle governor does at least two int_sqrt() each time
> >> > we go into idle in get_typical_interval to compute stddev
> >> >
> >> > int_sqrts take 100-120 cycles each. Short idle latency is important
> >> > for many workloads.
> >> >
> >> 
> >> If you want to optimize get_typical_interval(), why not just take the
> >> square root out of the equation (literally)?
> >> 
> >> Something like
> >
> > Looks good. Yes that's a better fix.
> >
> 
> Andi, did you have a way to measure the impact, and if so, could I get
> you to run the numbers again with my patch?

I got the numbers from the 0day runs (AIM7 gets faster)
In theory if you post the patch that should happen automatically
(checking with Fengguang)

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-07 Thread Rasmus Villemoes
On Mon, Feb 01 2016, Andi Kleen  wrote:

> On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
>> On Thu, Jan 28 2016, Andi Kleen  wrote:
>> 
>> > From: Andi Kleen 
>> >
>> > The menu cpuidle governor does at least two int_sqrt() each time
>> > we go into idle in get_typical_interval to compute stddev
>> >
>> > int_sqrts take 100-120 cycles each. Short idle latency is important
>> > for many workloads.
>> >
>> 
>> If you want to optimize get_typical_interval(), why not just take the
>> square root out of the equation (literally)?
>> 
>> Something like
>
> Looks good. Yes that's a better fix.
>

Andi, did you have a way to measure the impact, and if so, could I get
you to run the numbers again with my patch?

Thanks,
Rasmus


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-07 Thread Rasmus Villemoes
On Mon, Feb 01 2016, Andi Kleen  wrote:

> On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
>> On Thu, Jan 28 2016, Andi Kleen  wrote:
>> 
>> > From: Andi Kleen 
>> >
>> > The menu cpuidle governor does at least two int_sqrt() each time
>> > we go into idle in get_typical_interval to compute stddev
>> >
>> > int_sqrts take 100-120 cycles each. Short idle latency is important
>> > for many workloads.
>> >
>> 
>> If you want to optimize get_typical_interval(), why not just take the
>> square root out of the equation (literally)?
>> 
>> Something like
>
> Looks good. Yes that's a better fix.
>

Andi, did you have a way to measure the impact, and if so, could I get
you to run the numbers again with my patch?

Thanks,
Rasmus


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-02 Thread Eric Dumazet
On Tue, 2016-02-02 at 21:46 +0100, Rasmus Villemoes wrote:
> On Tue, Feb 02 2016, Eric Dumazet  wrote:
> 
> > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
> >
> >> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> >> 32x32->64 multiplication?)
> >
> > If avg is 32bit, compiler does that for you.
> >
> > u32 avg = ...
> >
> > u64 result = (u64)avg * avg;
> 
> Yeah, but in this case avg is u64 because it is used to temporarily
> contain the sum of a bunch of u32s, before being divided by #bunch. So
> I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very
> readable :-/
> 
> I just thought the scenario of a u64 known to be holding a value < 2^32
> was common enough that some utility macros already existed.
> 
> Rasmus

crypto/vmac.c has this, you could make it generic maybe.

#define MUL32(i1, i2)   ((u64)(u32)(i1)*(u32)(i2))




Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-02 Thread Rasmus Villemoes
On Tue, Feb 02 2016, Eric Dumazet  wrote:

> On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
>
>> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
>> 32x32->64 multiplication?)
>
> If avg is 32bit, compiler does that for you.
>
> u32 avg = ...
>
> u64 result = (u64)avg * avg;

Yeah, but in this case avg is u64 because it is used to temporarily
contain the sum of a bunch of u32s, before being divided by #bunch. So
I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very
readable :-/

I just thought the scenario of a u64 known to be holding a value < 2^32
was common enough that some utility macros already existed.

Rasmus


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-02 Thread Rasmus Villemoes
On Tue, Feb 02 2016, Eric Dumazet  wrote:

> On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
>
>> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
>> 32x32->64 multiplication?)
>
> If avg is 32bit, compiler does that for you.
>
> u32 avg = ...
>
> u64 result = (u64)avg * avg;

Yeah, but in this case avg is u64 because it is used to temporarily
contain the sum of a bunch of u32s, before being divided by #bunch. So
I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very
readable :-/

I just thought the scenario of a u64 known to be holding a value < 2^32
was common enough that some utility macros already existed.

Rasmus


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-02 Thread Eric Dumazet
On Tue, 2016-02-02 at 21:46 +0100, Rasmus Villemoes wrote:
> On Tue, Feb 02 2016, Eric Dumazet  wrote:
> 
> > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
> >
> >> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> >> 32x32->64 multiplication?)
> >
> > If avg is 32bit, compiler does that for you.
> >
> > u32 avg = ...
> >
> > u64 result = (u64)avg * avg;
> 
> Yeah, but in this case avg is u64 because it is used to temporarily
> contain the sum of a bunch of u32s, before being divided by #bunch. So
> I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very
> readable :-/
> 
> I just thought the scenario of a u64 known to be holding a value < 2^32
> was common enough that some utility macros already existed.
> 
> Rasmus

crypto/vmac.c has this, you could make it generic maybe.

#define MUL32(i1, i2)   ((u64)(u32)(i1)*(u32)(i2))




Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Eric Dumazet
On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:

> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> 32x32->64 multiplication?)

If avg is 32bit, compiler does that for you.

u32 avg = ...

u64 result = (u64)avg * avg;






Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Andi Kleen
On Tue, Feb 02, 2016 at 12:08:46AM +0100, Rasmus Villemoes wrote:
> On Mon, Feb 01 2016, Andi Kleen  wrote:
> 
> > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> >> On Thu, Jan 28 2016, Andi Kleen  wrote:
> >> 
> >> > From: Andi Kleen 
> >> >
> >> > The menu cpuidle governor does at least two int_sqrt() each time
> >> > we go into idle in get_typical_interval to compute stddev
> >> >
> >> > int_sqrts take 100-120 cycles each. Short idle latency is important
> >> > for many workloads.
> >> >
> >> 
> >> If you want to optimize get_typical_interval(), why not just take the
> >> square root out of the equation (literally)?
> >> 
> >> Something like
> >
> > Looks good. Yes that's a better fix.
> >
> 
> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> 32x32->64 multiplication?)

I don't think there is, but you could define a custom macro with a fallback
on pure 64x64->64.

-Andi


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Rasmus Villemoes
On Mon, Feb 01 2016, Andi Kleen  wrote:

> On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
>> On Thu, Jan 28 2016, Andi Kleen  wrote:
>> 
>> > From: Andi Kleen 
>> >
>> > The menu cpuidle governor does at least two int_sqrt() each time
>> > we go into idle in get_typical_interval to compute stddev
>> >
>> > int_sqrts take 100-120 cycles each. Short idle latency is important
>> > for many workloads.
>> >
>> 
>> If you want to optimize get_typical_interval(), why not just take the
>> square root out of the equation (literally)?
>> 
>> Something like
>
> Looks good. Yes that's a better fix.
>

Thanks. (Is there a good way to tell gcc that avg*avg is actually a
32x32->64 multiplication?)

While there and doing the math, I noticed that the variance computation
may _theoretically_ overflow (if half the observations are 0, half C,
the variance before the division should be around INTERVALS*C^2/4, which
is around 2^65 for C=UINT_MAX and INTERVALS=8). I have no idea if it
actually matters, but it can be fixed by lowering the initial threshold
from UINT_MAX to sqrt(4*U64_MAX/INTERVALS) ~~ 3e9. However, this would
make it possible that all observations are larger than the initial
threshold, so we'd have to protect against a division by zero...

Rasmus


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Andi Kleen
On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> On Thu, Jan 28 2016, Andi Kleen  wrote:
> 
> > From: Andi Kleen 
> >
> > The menu cpuidle governor does at least two int_sqrt() each time
> > we go into idle in get_typical_interval to compute stddev
> >
> > int_sqrts take 100-120 cycles each. Short idle latency is important
> > for many workloads.
> >
> 
> If you want to optimize get_typical_interval(), why not just take the
> square root out of the equation (literally)?
> 
> Something like

Looks good. Yes that's a better fix.

-Andi


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Rasmus Villemoes
On Thu, Jan 28 2016, Andi Kleen  wrote:

> From: Andi Kleen 
>
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
>
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
>

If you want to optimize get_typical_interval(), why not just take the
square root out of the equation (literally)?

Something like

From: Rasmus Villemoes 
Date: Mon, 1 Feb 2016 21:43:18 +0100
Subject: [PATCH] cpuidle: menu: avoid expensive square root computation

Computing the integer square root is a rather expensive operation, at
least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64
bit platforms, doing an extra comparison to a constant (variance <=
U64_MAX/36).

On 64 bit platforms, this does mean that we add a restriction on the
range of the variance where we end up using the estimate (since
previously the stddev <= ULONG_MAX was a tautology), but on the other
hand, we extend the range quite substantially on 32 bit platforms - in
both cases, we now allow standard deviations up to 715 seconds, which
is for example guaranteed if all observations are less than 1430
seconds.

Signed-off-by: Rasmus Villemoes 
---
 drivers/cpuidle/governors/menu.c | 35 +--
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 0742b3296673..beef7ae123ba 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -200,7 +200,7 @@ static void get_typical_interval(struct menu_device *data)
 {
int i, divisor;
unsigned int max, thresh;
-   uint64_t avg, stddev;
+   uint64_t avg, variance;
 
thresh = UINT_MAX; /* Discard outliers above this value */
 
@@ -224,36 +224,35 @@ again:
else
do_div(avg, divisor);
 
-   /* Then try to determine standard deviation */
-   stddev = 0;
+   /* Then try to determine variance */
+   variance = 0;
for (i = 0; i < INTERVALS; i++) {
unsigned int value = data->intervals[i];
if (value <= thresh) {
int64_t diff = value - avg;
-   stddev += diff * diff;
+   variance += diff * diff;
}
}
if (divisor == INTERVALS)
-   stddev >>= INTERVAL_SHIFT;
+   variance >>= INTERVAL_SHIFT;
else
-   do_div(stddev, divisor);
+   do_div(variance, divisor);
 
/*
-* The typical interval is obtained when standard deviation is small
-* or standard deviation is small compared to the average interval.
-*
-* int_sqrt() formal parameter type is unsigned long. When the
-* greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
-* the resulting squared standard deviation exceeds the input domain
-* of int_sqrt on platforms where unsigned long is 32 bits in size.
-* In such case reject the candidate average.
+* The typical interval is obtained when standard deviation is
+* small (stddev <= 20 us, variance <= 400 us^2) or standard
+* deviation is small compared to the average interval (avg >
+* 6*stddev, avg^2 > 36*variance). The average is smaller than
+* UINT_MAX aka U32_MAX, so computing its square does not
+* overflow a u64. We simply reject this candidate average if
+* the standard deviation is greater than 715 s (which is
+* rather unlikely).
 *
 * Use this result only if there is no timer to wake us up sooner.
 */
-   if (likely(stddev <= ULONG_MAX)) {
-   stddev = int_sqrt(stddev);
-   if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
-   || stddev <= 20) {
+   if (likely(variance <= U64_MAX/36)) {
+   if (((avg*avg > variance*36) && (divisor * 4 >= INTERVALS * 3))
+   || variance <= 400) {
if (data->next_timer_us > avg)
data->predicted_us = avg;
return;
-- 
2.6.1



Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Rasmus Villemoes
On Thu, Jan 28 2016, Andi Kleen  wrote:

> From: Andi Kleen 
>
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
>
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
>

If you want to optimize get_typical_interval(), why not just take the
square root out of the equation (literally)?

Something like

From: Rasmus Villemoes 
Date: Mon, 1 Feb 2016 21:43:18 +0100
Subject: [PATCH] cpuidle: menu: avoid expensive square root computation

Computing the integer square root is a rather expensive operation, at
least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64
bit platforms, doing an extra comparison to a constant (variance <=
U64_MAX/36).

On 64 bit platforms, this does mean that we add a restriction on the
range of the variance where we end up using the estimate (since
previously the stddev <= ULONG_MAX was a tautology), but on the other
hand, we extend the range quite substantially on 32 bit platforms - in
both cases, we now allow standard deviations up to 715 seconds, which
is for example guaranteed if all observations are less than 1430
seconds.

Signed-off-by: Rasmus Villemoes 
---
 drivers/cpuidle/governors/menu.c | 35 +--
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 0742b3296673..beef7ae123ba 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -200,7 +200,7 @@ static void get_typical_interval(struct menu_device *data)
 {
int i, divisor;
unsigned int max, thresh;
-   uint64_t avg, stddev;
+   uint64_t avg, variance;
 
thresh = UINT_MAX; /* Discard outliers above this value */
 
@@ -224,36 +224,35 @@ again:
else
do_div(avg, divisor);
 
-   /* Then try to determine standard deviation */
-   stddev = 0;
+   /* Then try to determine variance */
+   variance = 0;
for (i = 0; i < INTERVALS; i++) {
unsigned int value = data->intervals[i];
if (value <= thresh) {
int64_t diff = value - avg;
-   stddev += diff * diff;
+   variance += diff * diff;
}
}
if (divisor == INTERVALS)
-   stddev >>= INTERVAL_SHIFT;
+   variance >>= INTERVAL_SHIFT;
else
-   do_div(stddev, divisor);
+   do_div(variance, divisor);
 
/*
-* The typical interval is obtained when standard deviation is small
-* or standard deviation is small compared to the average interval.
-*
-* int_sqrt() formal parameter type is unsigned long. When the
-* greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
-* the resulting squared standard deviation exceeds the input domain
-* of int_sqrt on platforms where unsigned long is 32 bits in size.
-* In such case reject the candidate average.
+* The typical interval is obtained when standard deviation is
+* small (stddev <= 20 us, variance <= 400 us^2) or standard
+* deviation is small compared to the average interval (avg >
+* 6*stddev, avg^2 > 36*variance). The average is smaller than
+* UINT_MAX aka U32_MAX, so computing its square does not
+* overflow a u64. We simply reject this candidate average if
+* the standard deviation is greater than 715 s (which is
+* rather unlikely).
 *
 * Use this result only if there is no timer to wake us up sooner.
 */
-   if (likely(stddev <= ULONG_MAX)) {
-   stddev = int_sqrt(stddev);
-   if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
-   || stddev <= 20) {
+   if (likely(variance <= U64_MAX/36)) {
+   if (((avg*avg > variance*36) && (divisor * 4 >= INTERVALS * 3))
+   || variance <= 400) {
if (data->next_timer_us > avg)
data->predicted_us = avg;
return;
-- 
2.6.1



Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Rasmus Villemoes
On Mon, Feb 01 2016, Andi Kleen  wrote:

> On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
>> On Thu, Jan 28 2016, Andi Kleen  wrote:
>> 
>> > From: Andi Kleen 
>> >
>> > The menu cpuidle governor does at least two int_sqrt() each time
>> > we go into idle in get_typical_interval to compute stddev
>> >
>> > int_sqrts take 100-120 cycles each. Short idle latency is important
>> > for many workloads.
>> >
>> 
>> If you want to optimize get_typical_interval(), why not just take the
>> square root out of the equation (literally)?
>> 
>> Something like
>
> Looks good. Yes that's a better fix.
>

Thanks. (Is there a good way to tell gcc that avg*avg is actually a
32x32->64 multiplication?)

While there and doing the math, I noticed that the variance computation
may _theoretically_ overflow (if half the observations are 0, half C,
the variance before the division should be around INTERVALS*C^2/4, which
is around 2^65 for C=UINT_MAX and INTERVALS=8). I have no idea if it
actually matters, but it can be fixed by lowering the initial threshold
from UINT_MAX to sqrt(4*U64_MAX/INTERVALS) ~~ 3e9. However, this would
make it possible that all observations are larger than the initial
threshold, so we'd have to protect against a division by zero...

Rasmus


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Andi Kleen
On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> On Thu, Jan 28 2016, Andi Kleen  wrote:
> 
> > From: Andi Kleen 
> >
> > The menu cpuidle governor does at least two int_sqrt() each time
> > we go into idle in get_typical_interval to compute stddev
> >
> > int_sqrts take 100-120 cycles each. Short idle latency is important
> > for many workloads.
> >
> 
> If you want to optimize get_typical_interval(), why not just take the
> square root out of the equation (literally)?
> 
> Something like

Looks good. Yes that's a better fix.

-Andi


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Andi Kleen
On Tue, Feb 02, 2016 at 12:08:46AM +0100, Rasmus Villemoes wrote:
> On Mon, Feb 01 2016, Andi Kleen  wrote:
> 
> > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> >> On Thu, Jan 28 2016, Andi Kleen  wrote:
> >> 
> >> > From: Andi Kleen 
> >> >
> >> > The menu cpuidle governor does at least two int_sqrt() each time
> >> > we go into idle in get_typical_interval to compute stddev
> >> >
> >> > int_sqrts take 100-120 cycles each. Short idle latency is important
> >> > for many workloads.
> >> >
> >> 
> >> If you want to optimize get_typical_interval(), why not just take the
> >> square root out of the equation (literally)?
> >> 
> >> Something like
> >
> > Looks good. Yes that's a better fix.
> >
> 
> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> 32x32->64 multiplication?)

I don't think there is, but you could define a custom macro with a fallback
on pure 64x64->64.

-Andi


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-02-01 Thread Eric Dumazet
On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:

> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> 32x32->64 multiplication?)

If avg is 32bit, compiler does that for you.

u32 avg = ...

u64 result = (u64)avg * avg;






Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-30 Thread Thomas Rohwer

Hello,

> -  m = 1UL << (BITS_PER_LONG - 2);
> +  if (x <= 0x) {
> +  if (m <= 0xff)
> +  m = 1UL << (8 - 2);
> +  else
> +  m = 1UL << (16 - 2);
> +  } else if (x <= 0x)
> +  m = 1UL << (32 - 2);
> +  else
> +  m = 1UL << (BITS_PER_LONG - 2);
>while (m != 0) {
>b = y + m;
>y >>= 1;
>


I think, m can be initialized with

1 << (greatest multiple of 2 less than or equal to (position of most 
significant bit of x))

i.e. 1 << ((position of most significant bit of x) & 62)

without changing the outcome of the original algorithm (as long as x>= 2).

I believe, that for (position of most significant bit of x) there is an 
efficient macro, and
some processors directly have an instruction for it. So this would probably be 
faster than your suggestion
for an initial starting value and give an even better starting value (cutting 
in some cases further on the number of
while loop interations).

If one just wants to achieve a result with a certain relative error in terms of 
the fraction of the input, one can
probably only look at the most significant bit and a few following bits of x.

Sincerely,

Thomas




Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-30 Thread Thomas Rohwer

Hello,

> -  m = 1UL << (BITS_PER_LONG - 2);
> +  if (x <= 0x) {
> +  if (m <= 0xff)
> +  m = 1UL << (8 - 2);
> +  else
> +  m = 1UL << (16 - 2);
> +  } else if (x <= 0x)
> +  m = 1UL << (32 - 2);
> +  else
> +  m = 1UL << (BITS_PER_LONG - 2);
>while (m != 0) {
>b = y + m;
>y >>= 1;
>


I think, m can be initialized with

1 << (greatest multiple of 2 less than or equal to (position of most 
significant bit of x))

i.e. 1 << ((position of most significant bit of x) & 62)

without changing the outcome of the original algorithm (as long as x>= 2).

I believe, that for (position of most significant bit of x) there is an 
efficient macro, and
some processors directly have an instruction for it. So this would probably be 
faster than your suggestion
for an initial starting value and give an even better starting value (cutting 
in some cases further on the number of
while loop interations).

If one just wants to achieve a result with a certain relative error in terms of 
the fraction of the input, one can
probably only look at the most significant bit and a few following bits of x.

Sincerely,

Thomas




Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Rafael J. Wysocki
On Thursday, January 28, 2016 01:42:45 PM Andi Kleen wrote:
> From: Andi Kleen 
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.

I'm wondering if you have any numbers on how much of a difference this
makes in practice in terms of energy consumption, performance, latency etc.

Thanks,
Rafael



Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Andi Kleen
> This thread might be relevant:
> 
> https://lkml.org/lkml/2015/2/2/600
> 
> and perhaps using fls might still be a good approach.

Linus wrote:


>>>
We *probably* have some argument range that we care more about, which
is why I'd like to know what the profile is that triggered this
optimization, and what the common argument range is.
<<<

That's exactly what I did. Used perf probe to get the common
range and optimize for that.

-Andi


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Andi Kleen
Andi Kleen  writes:

> From: Andi Kleen 
>
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev

Added a stupid typo in the last minute. I'll post a new version.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Eric Dumazet
On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote:
> From: Andi Kleen 
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.
> 
> Signed-off-by: Andi Kleen 
> ---
>  lib/int_sqrt.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
> index 1ef4cc3..2479ccf 100644
> --- a/lib/int_sqrt.c
> +++ b/lib/int_sqrt.c
> @@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x)
>   if (x <= 1)
>   return x;

The above test (x <= 1) should also be moved

>  
> - m = 1UL << (BITS_PER_LONG - 2);
> + if (x <= 0x) {
> + if (m <= 0xff)

m or x ?  if (x <= 0xff) looks more correct.

> + m = 1UL << (8 - 2);
> + else
> + m = 1UL << (16 - 2);
> + } else if (x <= 0x)
> + m = 1UL << (32 - 2);
> + else
> + m = 1UL << (BITS_PER_LONG - 2);
>   while (m != 0) {
>   b = y + m;
>   y >>= 1;




Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Joe Perches
(resending with email addresses that shouldn't bounce)
(adding Anshul Garg)
(fixed Davidlohr Bueso's address)

On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote:
> From: Andi Kleen 
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.

This thread might be relevant:

https://lkml.org/lkml/2015/2/2/600

and perhaps using fls might still be a good approach.


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Joe Perches
(adding Anshul Garg)

On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote:
> From: Andi Kleen 
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.

This thread might be relevant:

https://lkml.org/lkml/2015/2/2/600

and perhaps using fls might still be a good approach.



Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread kbuild test robot
Hi Andi,

[auto build test WARNING on v4.5-rc1]
[also build test WARNING on next-20160128]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629
config: avr32-atngw100_defconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=avr32 

All warnings (new ones prefixed by >>):

   lib/int_sqrt.c: In function 'int_sqrt':
>> lib/int_sqrt.c:25: warning: 'm' is used uninitialized in this function

vim +/m +25 lib/int_sqrt.c

 9  #include 
10  
11  /**
12   * int_sqrt - rough approximation to sqrt
13   * @x: integer of which to calculate the sqrt
14   *
15   * A very rough approximation to the sqrt() function.
16   */
17  unsigned long int_sqrt(unsigned long x)
18  {
19  unsigned long b, m, y = 0;
20  
21  if (x <= 1)
22  return x;
23  
24  if (x <= 0x) {
  > 25  if (m <= 0xff)
26  m = 1UL << (8 - 2);
27  else
28  m = 1UL << (16 - 2);
29  } else if (x <= 0x)
30  m = 1UL << (32 - 2);
31  else
32  m = 1UL << (BITS_PER_LONG - 2);
33  while (m != 0) {

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread kbuild test robot
Hi Andi,

[auto build test WARNING on v4.5-rc1]
[also build test WARNING on next-20160128]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629
config: x86_64-randconfig-x015-01270835 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   lib/int_sqrt.c: In function 'int_sqrt':
>> lib/int_sqrt.c:25:6: warning: 'm' may be used uninitialized in this function 
>> [-Wmaybe-uninitialized]
  if (m <= 0xff)
 ^

vim +/m +25 lib/int_sqrt.c

 9  #include 
10  
11  /**
12   * int_sqrt - rough approximation to sqrt
13   * @x: integer of which to calculate the sqrt
14   *
15   * A very rough approximation to the sqrt() function.
16   */
17  unsigned long int_sqrt(unsigned long x)
18  {
19  unsigned long b, m, y = 0;
20  
21  if (x <= 1)
22  return x;
23  
24  if (x <= 0x) {
  > 25  if (m <= 0xff)
26  m = 1UL << (8 - 2);
27  else
28  m = 1UL << (16 - 2);
29  } else if (x <= 0x)
30  m = 1UL << (32 - 2);
31  else
32  m = 1UL << (BITS_PER_LONG - 2);
33  while (m != 0) {

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


[PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Andi Kleen
From: Andi Kleen 

The menu cpuidle governor does at least two int_sqrt() each time
we go into idle in get_typical_interval to compute stddev

int_sqrts take 100-120 cycles each. Short idle latency is important
for many workloads.

I instrumented the function on my workstation and most values are
16bit only and most others 32bit (50% percentile is 122094,
75% is 3699533).

sqrt is implemented by starting with an initial estimation,
and then iterating. int_sqrt currently only uses a fixed
estimating which is good for 64bits worth of input.

This patch adds some checks at the beginning to start with
a better estimate for values fitting in 8, 16bit and 32bit.
This makes int_sqrt between 60+% faster for values in 16bit,
and still somewhat faster (between 10 and 30%) for larger values
upto 32bit. Full 64bit is slightly slower.

This optimizes the short idle calls and does not hurt the
long sleep (which probably do not care) much.

An alternative would be a full table drive approach, or
trying some inverted sqrt optimization, but this simple change
already seems to have a good payoff.

Signed-off-by: Andi Kleen 
---
 lib/int_sqrt.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc3..2479ccf 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x)
if (x <= 1)
return x;
 
-   m = 1UL << (BITS_PER_LONG - 2);
+   if (x <= 0x) {
+   if (m <= 0xff)
+   m = 1UL << (8 - 2);
+   else
+   m = 1UL << (16 - 2);
+   } else if (x <= 0x)
+   m = 1UL << (32 - 2);
+   else
+   m = 1UL << (BITS_PER_LONG - 2);
while (m != 0) {
b = y + m;
y >>= 1;
-- 
2.4.3



Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Joe Perches
(resending with email addresses that shouldn't bounce)
(adding Anshul Garg)
(fixed Davidlohr Bueso's address)

On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote:
> From: Andi Kleen 
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.

This thread might be relevant:

https://lkml.org/lkml/2015/2/2/600

and perhaps using fls might still be a good approach.


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Andi Kleen
Andi Kleen  writes:

> From: Andi Kleen 
>
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev

Added a stupid typo in the last minute. I'll post a new version.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread kbuild test robot
Hi Andi,

[auto build test WARNING on v4.5-rc1]
[also build test WARNING on next-20160128]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629
config: x86_64-randconfig-x015-01270835 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   lib/int_sqrt.c: In function 'int_sqrt':
>> lib/int_sqrt.c:25:6: warning: 'm' may be used uninitialized in this function 
>> [-Wmaybe-uninitialized]
  if (m <= 0xff)
 ^

vim +/m +25 lib/int_sqrt.c

 9  #include 
10  
11  /**
12   * int_sqrt - rough approximation to sqrt
13   * @x: integer of which to calculate the sqrt
14   *
15   * A very rough approximation to the sqrt() function.
16   */
17  unsigned long int_sqrt(unsigned long x)
18  {
19  unsigned long b, m, y = 0;
20  
21  if (x <= 1)
22  return x;
23  
24  if (x <= 0x) {
  > 25  if (m <= 0xff)
26  m = 1UL << (8 - 2);
27  else
28  m = 1UL << (16 - 2);
29  } else if (x <= 0x)
30  m = 1UL << (32 - 2);
31  else
32  m = 1UL << (BITS_PER_LONG - 2);
33  while (m != 0) {

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread kbuild test robot
Hi Andi,

[auto build test WARNING on v4.5-rc1]
[also build test WARNING on next-20160128]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629
config: avr32-atngw100_defconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=avr32 

All warnings (new ones prefixed by >>):

   lib/int_sqrt.c: In function 'int_sqrt':
>> lib/int_sqrt.c:25: warning: 'm' is used uninitialized in this function

vim +/m +25 lib/int_sqrt.c

 9  #include 
10  
11  /**
12   * int_sqrt - rough approximation to sqrt
13   * @x: integer of which to calculate the sqrt
14   *
15   * A very rough approximation to the sqrt() function.
16   */
17  unsigned long int_sqrt(unsigned long x)
18  {
19  unsigned long b, m, y = 0;
20  
21  if (x <= 1)
22  return x;
23  
24  if (x <= 0x) {
  > 25  if (m <= 0xff)
26  m = 1UL << (8 - 2);
27  else
28  m = 1UL << (16 - 2);
29  } else if (x <= 0x)
30  m = 1UL << (32 - 2);
31  else
32  m = 1UL << (BITS_PER_LONG - 2);
33  while (m != 0) {

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Joe Perches
(adding Anshul Garg)

On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote:
> From: Andi Kleen 
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.

This thread might be relevant:

https://lkml.org/lkml/2015/2/2/600

and perhaps using fls might still be a good approach.



Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Eric Dumazet
On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote:
> From: Andi Kleen 
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.
> 
> Signed-off-by: Andi Kleen 
> ---
>  lib/int_sqrt.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
> index 1ef4cc3..2479ccf 100644
> --- a/lib/int_sqrt.c
> +++ b/lib/int_sqrt.c
> @@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x)
>   if (x <= 1)
>   return x;

The above test (x <= 1) should also be moved

>  
> - m = 1UL << (BITS_PER_LONG - 2);
> + if (x <= 0x) {
> + if (m <= 0xff)

m or x ?  if (x <= 0xff) looks more correct.

> + m = 1UL << (8 - 2);
> + else
> + m = 1UL << (16 - 2);
> + } else if (x <= 0x)
> + m = 1UL << (32 - 2);
> + else
> + m = 1UL << (BITS_PER_LONG - 2);
>   while (m != 0) {
>   b = y + m;
>   y >>= 1;




Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Andi Kleen
> This thread might be relevant:
> 
> https://lkml.org/lkml/2015/2/2/600
> 
> and perhaps using fls might still be a good approach.

Linus wrote:


>>>
We *probably* have some argument range that we care more about, which
is why I'd like to know what the profile is that triggered this
optimization, and what the common argument range is.
<<<

That's exactly what I did. Used perf probe to get the common
range and optimize for that.

-Andi


[PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Andi Kleen
From: Andi Kleen 

The menu cpuidle governor does at least two int_sqrt() each time
we go into idle in get_typical_interval to compute stddev

int_sqrts take 100-120 cycles each. Short idle latency is important
for many workloads.

I instrumented the function on my workstation and most values are
16bit only and most others 32bit (50% percentile is 122094,
75% is 3699533).

sqrt is implemented by starting with an initial estimation,
and then iterating. int_sqrt currently only uses a fixed
estimating which is good for 64bits worth of input.

This patch adds some checks at the beginning to start with
a better estimate for values fitting in 8, 16bit and 32bit.
This makes int_sqrt between 60+% faster for values in 16bit,
and still somewhat faster (between 10 and 30%) for larger values
upto 32bit. Full 64bit is slightly slower.

This optimizes the short idle calls and does not hurt the
long sleep (which probably do not care) much.

An alternative would be a full table drive approach, or
trying some inverted sqrt optimization, but this simple change
already seems to have a good payoff.

Signed-off-by: Andi Kleen 
---
 lib/int_sqrt.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc3..2479ccf 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x)
if (x <= 1)
return x;
 
-   m = 1UL << (BITS_PER_LONG - 2);
+   if (x <= 0x) {
+   if (m <= 0xff)
+   m = 1UL << (8 - 2);
+   else
+   m = 1UL << (16 - 2);
+   } else if (x <= 0x)
+   m = 1UL << (32 - 2);
+   else
+   m = 1UL << (BITS_PER_LONG - 2);
while (m != 0) {
b = y + m;
y >>= 1;
-- 
2.4.3



Re: [PATCH] Optimize int_sqrt for small values for faster idle

2016-01-28 Thread Rafael J. Wysocki
On Thursday, January 28, 2016 01:42:45 PM Andi Kleen wrote:
> From: Andi Kleen 
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.

I'm wondering if you have any numbers on how much of a difference this
makes in practice in terms of energy consumption, performance, latency etc.

Thanks,
Rafael