Re: [PATCH v4] cpuidle: Fix last_residency division

2016-07-01 Thread Shreyas B Prabhu


On 07/01/2016 01:36 PM, Daniel Lezcano wrote:
> On 06/30/2016 05:37 PM, Nicolas Pitre wrote:
>> On Thu, 30 Jun 2016, Daniel Lezcano wrote:
>>> +}
 +}
>>>
>>>
>>> What bothers me with this division is the benefit of adding an extra
>>> ultra
>>> optimized division by 1000 in cpuidle.h while we have already
>>> ktime_divns
>>> which is optimized in ktime.h.
>>
>> It is "optimized" but still much heavier than what is presented above as
>> it provides maximum precision.
>>
>> It all depends on how important the performance gain from the original
>> shift by 10 was in the first place.
> 
> Actually the original shift was there because it was convenient as a
> simple ~div1000 operation. But against all odds, the approximation
> introduced a regression on a very specific use case on PowerPC.
> 
> We are not in the hot path and I think we can live with a ktime_divns
> without problem. That would simplify the fix I believe.
> 

I agree too. I'll post next version with this.

Thanks,
Shreyas

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v4] cpuidle: Fix last_residency division

2016-07-01 Thread Nicolas Pitre
On Fri, 1 Jul 2016, Balbir Singh wrote:

> On Fri, 2016-07-01 at 10:06 +0200, Daniel Lezcano wrote:
> > On 06/30/2016 05:37 PM, Nicolas Pitre wrote:
> > > 
> > > On Thu, 30 Jun 2016, Daniel Lezcano wrote:
> > [ ... ]
> > 
> > > 
> > > > 
> > > > > 
> > > > > + if (likely(nsec < DIV_APPROXIMATION_THRESHOLD)) {
> > > > > + u32 usec = nsec;
> > > > > +
> > > > > + usec += usec >> 5;
> > > > > + usec = usec >> 10;
> > > > > +
> > > > > + /* Can safely cast to int since usec is < INT_MAX */
> > > > > + return usec;
> > > > > + } else {
> > > > > + u64 usec = div_u64(nsec, 1000);
> > > > > +
> > > > > + if (usec > INT_MAX)
> > > > > + usec = INT_MAX;
> > > > > +
> > > > > + /* Can safely cast to int since usec is < INT_MAX */
> > > > > + return usec;
> > > > > + }
> > > > > +}
> > > > 
> > > > What bothers me with this division is the benefit of adding an extra 
> > > > ultra
> > > > optimized division by 1000 in cpuidle.h while we have already 
> > > > ktime_divns
> > > > which is optimized in ktime.h.
> > > It is "optimized" but still much heavier than what is presented above as
> > > it provides maximum precision.
> > > 
> > > It all depends on how important the performance gain from the original
> > > shift by 10 was in the first place.
> > Actually the original shift was there because it was convenient as a 
> > simple ~div1000 operation. But against all odds, the approximation 
> > introduced a regression on a very specific use case on PowerPC.
> > 
> > We are not in the hot path and I think we can live with a ktime_divns 
> > without problem. That would simplify the fix I believe.
> > 
> 
> I would tend to agree with this and there are better ways to do
> multiplicative inverses if we care

ktime_divns already does it automatically when applicable.


Nicolas___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v4] cpuidle: Fix last_residency division

2016-07-01 Thread Nicolas Pitre
On Fri, 1 Jul 2016, Daniel Lezcano wrote:

> On 06/30/2016 05:37 PM, Nicolas Pitre wrote:
> > On Thu, 30 Jun 2016, Daniel Lezcano wrote:
> 
> [ ... ]
> 
> > > > +   if (likely(nsec < DIV_APPROXIMATION_THRESHOLD)) {
> > > > +   u32 usec = nsec;
> > > > +
> > > > +   usec += usec >> 5;
> > > > +   usec = usec >> 10;
> > > > +
> > > > +   /* Can safely cast to int since usec is < INT_MAX */
> > > > +   return usec;
> > > > +   } else {
> > > > +   u64 usec = div_u64(nsec, 1000);
> > > > +
> > > > +   if (usec > INT_MAX)
> > > > +   usec = INT_MAX;
> > > > +
> > > > +   /* Can safely cast to int since usec is < INT_MAX */
> > > > +   return usec;
> > > > +   }
> > > > +}
> > >
> > >
> > > What bothers me with this division is the benefit of adding an extra ultra
> > > optimized division by 1000 in cpuidle.h while we have already ktime_divns
> > > which is optimized in ktime.h.
> >
> > It is "optimized" but still much heavier than what is presented above as
> > it provides maximum precision.
> >
> > It all depends on how important the performance gain from the original
> > shift by 10 was in the first place.
> 
> Actually the original shift was there because it was convenient as a simple
> ~div1000 operation. But against all odds, the approximation introduced a
> regression on a very specific use case on PowerPC.
> 
> We are not in the hot path and I think we can live with a ktime_divns without
> problem. That would simplify the fix I believe.

I agree.

> Perhaps the div1000 routine could be moved in ktime.h to be used as a helper
> for ktime_divns when we divide by the 1000 constant and submitted in a
> separate patch as an optimization.

The proposed patch here still provides an approximation so it wouldn't 
be appropriate for ktime_divns.


Nicolas
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v4] cpuidle: Fix last_residency division

2016-07-01 Thread Balbir Singh
On Fri, 2016-07-01 at 10:06 +0200, Daniel Lezcano wrote:
> On 06/30/2016 05:37 PM, Nicolas Pitre wrote:
> > 
> > On Thu, 30 Jun 2016, Daniel Lezcano wrote:
> [ ... ]
> 
> > 
> > > 
> > > > 
> > > > +   if (likely(nsec < DIV_APPROXIMATION_THRESHOLD)) {
> > > > +   u32 usec = nsec;
> > > > +
> > > > +   usec += usec >> 5;
> > > > +   usec = usec >> 10;
> > > > +
> > > > +   /* Can safely cast to int since usec is < INT_MAX */
> > > > +   return usec;
> > > > +   } else {
> > > > +   u64 usec = div_u64(nsec, 1000);
> > > > +
> > > > +   if (usec > INT_MAX)
> > > > +   usec = INT_MAX;
> > > > +
> > > > +   /* Can safely cast to int since usec is < INT_MAX */
> > > > +   return usec;
> > > > +   }
> > > > +}
> > > 
> > > What bothers me with this division is the benefit of adding an extra ultra
> > > optimized division by 1000 in cpuidle.h while we have already ktime_divns
> > > which is optimized in ktime.h.
> > It is "optimized" but still much heavier than what is presented above as
> > it provides maximum precision.
> > 
> > It all depends on how important the performance gain from the original
> > shift by 10 was in the first place.
> Actually the original shift was there because it was convenient as a 
> simple ~div1000 operation. But against all odds, the approximation 
> introduced a regression on a very specific use case on PowerPC.
> 
> We are not in the hot path and I think we can live with a ktime_divns 
> without problem. That would simplify the fix I believe.
> 

I would tend to agree with this and there are better ways to do
multiplicative inverses if we care

Balbir Singh.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v4] cpuidle: Fix last_residency division

2016-07-01 Thread Daniel Lezcano

On 06/30/2016 05:37 PM, Nicolas Pitre wrote:

On Thu, 30 Jun 2016, Daniel Lezcano wrote:


[ ... ]


+   if (likely(nsec < DIV_APPROXIMATION_THRESHOLD)) {
+   u32 usec = nsec;
+
+   usec += usec >> 5;
+   usec = usec >> 10;
+
+   /* Can safely cast to int since usec is < INT_MAX */
+   return usec;
+   } else {
+   u64 usec = div_u64(nsec, 1000);
+
+   if (usec > INT_MAX)
+   usec = INT_MAX;
+
+   /* Can safely cast to int since usec is < INT_MAX */
+   return usec;
+   }
+}



What bothers me with this division is the benefit of adding an extra ultra
optimized division by 1000 in cpuidle.h while we have already ktime_divns
which is optimized in ktime.h.


It is "optimized" but still much heavier than what is presented above as
it provides maximum precision.

It all depends on how important the performance gain from the original
shift by 10 was in the first place.


Actually the original shift was there because it was convenient as a 
simple ~div1000 operation. But against all odds, the approximation 
introduced a regression on a very specific use case on PowerPC.


We are not in the hot path and I think we can live with a ktime_divns 
without problem. That would simplify the fix I believe.


Perhaps the div1000 routine could be moved in ktime.h to be used as a 
helper for ktime_divns when we divide by the 1000 constant and submitted 
in a separate patch as an optimization.


--
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v4] cpuidle: Fix last_residency division

2016-06-30 Thread Nicolas Pitre
On Thu, 30 Jun 2016, Daniel Lezcano wrote:

> On 06/30/2016 04:34 PM, Shreyas B. Prabhu wrote:
> > Snooze is a poll idle state in powernv and pseries platforms. Snooze
> > has a timeout so that if a cpu stays in snooze for more than target
> > residency of the next available idle state, then it would exit thereby
> > giving chance to the cpuidle governor to re-evaluate and
> > promote the cpu to a deeper idle state. Therefore whenever snooze exits
> > due to this timeout, its last_residency will be target_residency of next
> > deeper state.
> >
> > commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
> > changed the math around last_residency calculation. Specifically, while
> > converting last_residency value from nanoseconds to microseconds it does
> > right shift by 10. Due to this, in snooze timeout exit scenarios
> > last_residency calculated is roughly 2.3% less than target_residency of
> > next available state. This pattern is picked up get_typical_interval()
> > in the menu governor and therefore expected_interval in menu_select() is
> > frequently less than the target_residency of any state but snooze.
> >
> > Due to this we are entering snooze at a higher rate, thereby affecting
> > the single thread performance.
> >
> > Fix this by using a better approximation for division by 1000.
> >
> > Reported-by: Anton Blanchard 
> > Bisected-by: Shilpasri G Bhat 
> > Suggested-by David Laight 
> > Signed-off-by: Shreyas B. Prabhu 
> > ---
> > Changes in v4
> > =
> >   - Increasing the threshold upto which approximation can be used.
> >   - Removed explicit cast. Instead added a comment saying why cast
> > is safe.
> >
> > Changes in v3
> > =
> >   - Using approximation suggested by David
> >
> > Changes in v2
> > =
> >   - Fixing it in the cpuidle core code instead of driver code.
> >
> >   drivers/cpuidle/cpuidle.c | 11 +++
> >   drivers/cpuidle/cpuidle.h | 38 ++
> >   2 files changed, 41 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> > index a4d0059..f55ad01 100644
> > --- a/drivers/cpuidle/cpuidle.c
> > +++ b/drivers/cpuidle/cpuidle.c
> > @@ -174,7 +174,6 @@ int cpuidle_enter_state(struct cpuidle_device *dev,
> > struct cpuidle_driver *drv,
> >struct cpuidle_state *target_state = &drv->states[index];
> >bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
> >u64 time_start, time_end;
> > -   s64 diff;
> >
> >/*
> >  * Tell the time framework to switch to a broadcast timer because our
> > @@ -218,14 +217,10 @@ int cpuidle_enter_state(struct cpuidle_device *dev,
> > struct cpuidle_driver *drv,
> > local_irq_enable();
> >
> > /*
> > -* local_clock() returns the time in nanosecond, let's shift
> > -* by 10 (divide by 1024) to have microsecond based time.
> > +* local_clock() returns the time in nanoseconds, convert it to
> > +* microsecond based time.
> >  */
> > -   diff = (time_end - time_start) >> 10;
> > -   if (diff > INT_MAX)
> > -   diff = INT_MAX;
> > -
> > -   dev->last_residency = (int) diff;
> > +   dev->last_residency = convert_nsec_to_usec(time_end - time_start);
> >
> >if (entered_state >= 0) {
> > /* Update cpuidle counters */
> > diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
> > index f87f399..a027b35 100644
> > --- a/drivers/cpuidle/cpuidle.h
> > +++ b/drivers/cpuidle/cpuidle.h
> > @@ -68,4 +68,42 @@ static inline void
> > cpuidle_coupled_unregister_device(struct cpuidle_device *dev)
> >   }
> >   #endif
> >
> > +/*
> > + * To ensure that there is no overflow while approximation
> > + * for dividing val by 1000, we must respect -
> > + * val + (val >> 5) <= 0x
> > + * val + val/32 <= 0x
> > + * val <= (0x * 32) / 33
> > + * val <= 0xF83E0F82
> > + * Hence the threshold for val below which we can use the
> > + * approximation is 0xF83E0F82
> > + */
> > +#define DIV_APPROXIMATION_THRESHOLD 0xF83E0F82UL
> > +
> > +/*
> > + * Used for calculating last_residency in usec. Optimized for case
> > + * where last_residency in nsecs is < DIV_APPROXIMATION_THRESHOLD
> > + * Approximated value has less than 1% error.
> > + */
> > +static inline int convert_nsec_to_usec(u64 nsec)
> > +{
> > +   if (likely(nsec < DIV_APPROXIMATION_THRESHOLD)) {
> > +   u32 usec = nsec;
> > +
> > +   usec += usec >> 5;
> > +   usec = usec >> 10;
> > +
> > +   /* Can safely cast to int since usec is < INT_MAX */
> > +   return usec;
> > +   } else {
> > +   u64 usec = div_u64(nsec, 1000);
> > +
> > +   if (usec > INT_MAX)
> > +   usec = INT_MAX;
> > +
> > +   /* Can safely cast to int since usec is < INT_MAX */
> > +   return usec;
> > +   }
> > +}
> 
> 
> What bothers me with this division is the benefit of adding an extra ultra
> optimized divis

Re: [PATCH v4] cpuidle: Fix last_residency division

2016-06-30 Thread Daniel Lezcano

On 06/30/2016 04:34 PM, Shreyas B. Prabhu wrote:

Snooze is a poll idle state in powernv and pseries platforms. Snooze
has a timeout so that if a cpu stays in snooze for more than target
residency of the next available idle state, then it would exit thereby
giving chance to the cpuidle governor to re-evaluate and
promote the cpu to a deeper idle state. Therefore whenever snooze exits
due to this timeout, its last_residency will be target_residency of next
deeper state.

commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
changed the math around last_residency calculation. Specifically, while
converting last_residency value from nanoseconds to microseconds it does
right shift by 10. Due to this, in snooze timeout exit scenarios
last_residency calculated is roughly 2.3% less than target_residency of
next available state. This pattern is picked up get_typical_interval()
in the menu governor and therefore expected_interval in menu_select() is
frequently less than the target_residency of any state but snooze.

Due to this we are entering snooze at a higher rate, thereby affecting
the single thread performance.

Fix this by using a better approximation for division by 1000.

Reported-by: Anton Blanchard 
Bisected-by: Shilpasri G Bhat 
Suggested-by David Laight 
Signed-off-by: Shreyas B. Prabhu 
---
Changes in v4
=
  - Increasing the threshold upto which approximation can be used.
  - Removed explicit cast. Instead added a comment saying why cast
is safe.

Changes in v3
=
  - Using approximation suggested by David

Changes in v2
=
  - Fixing it in the cpuidle core code instead of driver code.

  drivers/cpuidle/cpuidle.c | 11 +++
  drivers/cpuidle/cpuidle.h | 38 ++
  2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a4d0059..f55ad01 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -174,7 +174,6 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct 
cpuidle_driver *drv,
struct cpuidle_state *target_state = &drv->states[index];
bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
u64 time_start, time_end;
-   s64 diff;

/*
 * Tell the time framework to switch to a broadcast timer because our
@@ -218,14 +217,10 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
local_irq_enable();

/*
-* local_clock() returns the time in nanosecond, let's shift
-* by 10 (divide by 1024) to have microsecond based time.
+* local_clock() returns the time in nanoseconds, convert it to
+* microsecond based time.
 */
-   diff = (time_end - time_start) >> 10;
-   if (diff > INT_MAX)
-   diff = INT_MAX;
-
-   dev->last_residency = (int) diff;
+   dev->last_residency = convert_nsec_to_usec(time_end - time_start);

if (entered_state >= 0) {
/* Update cpuidle counters */
diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
index f87f399..a027b35 100644
--- a/drivers/cpuidle/cpuidle.h
+++ b/drivers/cpuidle/cpuidle.h
@@ -68,4 +68,42 @@ static inline void cpuidle_coupled_unregister_device(struct 
cpuidle_device *dev)
  }
  #endif

+/*
+ * To ensure that there is no overflow while approximation
+ * for dividing val by 1000, we must respect -
+ * val + (val >> 5) <= 0x
+ * val + val/32 <= 0x
+ * val <= (0x * 32) / 33
+ * val <= 0xF83E0F82
+ * Hence the threshold for val below which we can use the
+ * approximation is 0xF83E0F82
+ */
+#define DIV_APPROXIMATION_THRESHOLD 0xF83E0F82UL
+
+/*
+ * Used for calculating last_residency in usec. Optimized for case
+ * where last_residency in nsecs is < DIV_APPROXIMATION_THRESHOLD
+ * Approximated value has less than 1% error.
+ */
+static inline int convert_nsec_to_usec(u64 nsec)
+{
+   if (likely(nsec < DIV_APPROXIMATION_THRESHOLD)) {
+   u32 usec = nsec;
+
+   usec += usec >> 5;
+   usec = usec >> 10;
+
+   /* Can safely cast to int since usec is < INT_MAX */
+   return usec;
+   } else {
+   u64 usec = div_u64(nsec, 1000);
+
+   if (usec > INT_MAX)
+   usec = INT_MAX;
+
+   /* Can safely cast to int since usec is < INT_MAX */
+   return usec;
+   }
+}



What bothers me with this division is the benefit of adding an extra 
ultra optimized division by 1000 in cpuidle.h while we have already 
ktime_divns which is optimized in ktime.h.


Why not:

ts = ns_to_ktime(local_clock());

...

te = ns_to_ktime(local_clock());


diff = ktime_us_delta(te, ts);





--
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |