Re: PM related performance degradation on OMAP3

2012-05-09 Thread Jean Pihet
Hi Kevin,

On Mon, May 7, 2012 at 7:31 PM, Kevin Hilman khil...@ti.com wrote:
 Jean Pihet jean.pi...@newoldbits.com writes:

 On Tue, May 1, 2012 at 7:27 PM, Kevin Hilman khil...@ti.com wrote:
 Jean Pihet jean.pi...@newoldbits.com writes:

 HI Kevin, Grazvydas,

 On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:
 Jean Pihet jean.pi...@newoldbits.com writes:

 Hi Grazvydas, Kevin,

 I gathered some performance measurements and statistics using
 custom tracepoints in __omap3_enter_idle.
 I posted the patches for the power domains registers cache, cf.
 http://marc.info/?l=linux-omap&m=133587781712039&w=2.

 All the details are at
 http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
 I updated the page with the measurements results with Kevin's patches
 and the registers cache patches.

 The results are showing that:
 - the registers cache optimizes the low power mode transitions, but is
 not sufficient to obtain a big gain. A few unused domains are
 transitioning, which causes a big penalty in the idle path.

 PER is the one that seems to be causing the most latency.

 Can you try do your measurements using hack below which makes sure that
 PER isn't any deeper than CORE?

 Indeed your patch brings significant improvements, cf. wiki page at
 http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
 for detailed information.
 Here below is the reworked patch, more suited for inclusion in mainline [1]

 I have another optimisation -in proof of concept state- that brings
 another significant improvement. It is about allowing/disabling idle
 for only 1 clkdm in a pwrdm and not iterate through all the clkdms.
 This still needs some rework though. Cf. patch [2]

 That should work since disabling idle for any clkdm will have the same
 effect.  Can you send this as a separate patch with a descriptive
 changelog.
I just sent 2 patches which optimize the C1 state latency:
 . [PATCH 1/2] ARM: OMAP3: PM: cpuidle: optimize the PER latency in C1 state
 . [PATCH 2/2] ARM: OMAP3: PM: cpuidle: optimize the clkdm idle latency in C1 state

Note: those patches apply on top of your pre/post_transition
optimization patches.

The performance results are close to the !PM case (No IDLE, no
omap_sram_idle, all pwrdms to ON), i.e. 3.1MB/s on Beagleboard.
I will update the wiki page as soon as possible.

Regards,
Jean


 Kevin


 Patches [1] and [2] on top of the registers cache and the
 optimisations in pre/post_transition bring the performance close to
 the performance for the non cpuidle case (3.0MB/s compared to 3.1MB/s
 on Beagleboard).

 What do you think?

 Regards,
 Jean

 ---
 [1]
 diff --git a/arch/arm/mach-omap2/cpuidle34xx.c
 b/arch/arm/mach-omap2/cpuidle34xx.c
 index e406d7b..572b605 100644
 +++ b/arch/arm/mach-omap2/cpuidle34xx.c
 @@ -279,32 +279,36 @@ static int omap3_enter_idle_bm(struct cpuidle_device 
 *dev,
       int ret;

       /*
 -      * Prevent idle completely if CAM is active.
 +      * Use only C1 if CAM is active.
        * CAM does not have wakeup capability in OMAP3.
        */
 -     if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON) {
 +     if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON)
               new_state_idx = drv->safe_state_index;
 -             goto select_state;
 -     }
 -
 -     new_state_idx = next_valid_state(dev, drv, index);
 +     else
 +             new_state_idx = next_valid_state(dev, drv, index);

 -     /*
 -      * Prevent PER off if CORE is not in retention or off as this
 -      * would disable PER wakeups completely.
 -      */
 +     /* Program PER state */
       cx = cpuidle_get_statedata(&dev->states_usage[new_state_idx]);
       core_next_state = cx->core_state;
 -     per_next_state = per_saved_state = pwrdm_read_next_func_pwrst(per_pd);
 -     if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
 -         (core_next_state > PWRDM_FUNC_PWRST_CSWR))
 -             per_next_state = PWRDM_FUNC_PWRST_CSWR;
 +     if (new_state_idx == 0) {
 +             /* In C1 do not allow PER state lower than CORE state */
 +             per_next_state = core_next_state;
 +     } else {
 +             /*
 +              * Prevent PER off if CORE is not in RETention or OFF as this
 +              * would disable PER wakeups completely.
 +              */
 +             per_next_state = per_saved_state =
 +                             pwrdm_read_next_func_pwrst(per_pd);
 +             if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
 +                 (core_next_state > PWRDM_FUNC_PWRST_CSWR))
 +                     per_next_state = PWRDM_FUNC_PWRST_CSWR;
 +     }

       /* Are we changing PER target state? */
       if (per_next_state != per_saved_state)
               omap_set_pwrdm_state(per_pd, per_next_state);

 -select_state:
       ret = omap3_enter_idle(dev, drv, new_state_idx);

       /* Restore original PER state if it was modified */
 @@ -390,7 +394,6 @@ int 

Re: PM related performance degradation on OMAP3

2012-05-07 Thread Kevin Hilman
Jean Pihet jean.pi...@newoldbits.com writes:

 On Tue, May 1, 2012 at 7:27 PM, Kevin Hilman khil...@ti.com wrote:
 Jean Pihet jean.pi...@newoldbits.com writes:

 HI Kevin, Grazvydas,

 On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:
 Jean Pihet jean.pi...@newoldbits.com writes:

 Hi Grazvydas, Kevin,

 I gathered some performance measurements and statistics using
 custom tracepoints in __omap3_enter_idle.
 I posted the patches for the power domains registers cache, cf.
 http://marc.info/?l=linux-omap&m=133587781712039&w=2.

 All the details are at
 http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
 I updated the page with the measurements results with Kevin's patches
 and the registers cache patches.

 The results are showing that:
 - the registers cache optimizes the low power mode transitions, but is
 not sufficient to obtain a big gain. A few unused domains are
 transitioning, which causes a big penalty in the idle path.

 PER is the one that seems to be causing the most latency.

 Can you try do your measurements using hack below which makes sure that
 PER isn't any deeper than CORE?

 Indeed your patch brings significant improvements, cf. wiki page at
 http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
 for detailed information.
 Here below is the reworked patch, more suited for inclusion in mainline [1]

 I have another optimisation -in proof of concept state- that brings
 another significant improvement. It is about allowing/disabling idle
 for only 1 clkdm in a pwrdm and not iterate through all the clkdms.
 This still needs some rework though. Cf. patch [2]

That should work since disabling idle for any clkdm will have the same
effect.  Can you send this as a separate patch with a descriptive
changelog.

Kevin


 Patches [1] and [2] on top of the registers cache and the
 optimisations in pre/post_transition bring the performance close to
 the performance for the non cpuidle case (3.0MB/s compared to 3.1MB/s
 on Beagleboard).

 What do you think?

 Regards,
 Jean

 ---
 [1]
 diff --git a/arch/arm/mach-omap2/cpuidle34xx.c
 b/arch/arm/mach-omap2/cpuidle34xx.c
 index e406d7b..572b605 100644
 +++ b/arch/arm/mach-omap2/cpuidle34xx.c
 @@ -279,32 +279,36 @@ static int omap3_enter_idle_bm(struct cpuidle_device 
 *dev,
   int ret;

   /*
 -  * Prevent idle completely if CAM is active.
 +  * Use only C1 if CAM is active.
* CAM does not have wakeup capability in OMAP3.
*/
 - if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON) {
 + if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON)
   new_state_idx = drv->safe_state_index;
 - goto select_state;
 - }
 -
 - new_state_idx = next_valid_state(dev, drv, index);
 + else
 + new_state_idx = next_valid_state(dev, drv, index);

 - /*
 -  * Prevent PER off if CORE is not in retention or off as this
 -  * would disable PER wakeups completely.
 -  */
 + /* Program PER state */
   cx = cpuidle_get_statedata(&dev->states_usage[new_state_idx]);
   core_next_state = cx->core_state;
 - per_next_state = per_saved_state = pwrdm_read_next_func_pwrst(per_pd);
 - if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
 - (core_next_state > PWRDM_FUNC_PWRST_CSWR))
 - per_next_state = PWRDM_FUNC_PWRST_CSWR;
 + if (new_state_idx == 0) {
 + /* In C1 do not allow PER state lower than CORE state */
 + per_next_state = core_next_state;
 + } else {
 + /*
 +  * Prevent PER off if CORE is not in RETention or OFF as this
 +  * would disable PER wakeups completely.
 +  */
 + per_next_state = per_saved_state =
 + pwrdm_read_next_func_pwrst(per_pd);
 + if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
 + (core_next_state > PWRDM_FUNC_PWRST_CSWR))
 + per_next_state = PWRDM_FUNC_PWRST_CSWR;
 + }

   /* Are we changing PER target state? */
   if (per_next_state != per_saved_state)
   omap_set_pwrdm_state(per_pd, per_next_state);

 -select_state:
   ret = omap3_enter_idle(dev, drv, new_state_idx);

   /* Restore original PER state if it was modified */
 @@ -390,7 +394,6 @@ int __init omap3_idle_init(void)

   /* C1 . MPU WFI + Core active */
   _fill_cstate(drv, 0, "MPU ON + CORE ON");
 - (&drv->states[0])->enter = omap3_enter_idle;
   drv->safe_state_index = 0;
   cx = _fill_cstate_usage(dev, 0);
   cx->valid = 1;  /* C1 is always valid */

 [2]
 diff --git a/arch/arm/mach-omap2/cpuidle34xx.c
 b/arch/arm/mach-omap2/cpuidle34xx.c
 index e406d7b..6aa3c75 100644
 --- a/arch/arm/mach-omap2/cpuidle34xx.c
 +++ b/arch/arm/mach-omap2/cpuidle34xx.c
 @@ -118,8 +118,10 @@ static int __omap3_enter_idle(struct cpuidle_device 

Re: PM related performance degradation on OMAP3

2012-05-02 Thread Jean Pihet
On Tue, May 1, 2012 at 7:27 PM, Kevin Hilman khil...@ti.com wrote:
 Jean Pihet jean.pi...@newoldbits.com writes:

 HI Kevin, Grazvydas,

 On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:
 Jean Pihet jean.pi...@newoldbits.com writes:

 Hi Grazvydas, Kevin,

 I gathered some performance measurements and statistics using
 custom tracepoints in __omap3_enter_idle.
 I posted the patches for the power domains registers cache, cf.
 http://marc.info/?l=linux-omap&m=133587781712039&w=2.

 All the details are at
 http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
 I updated the page with the measurements results with Kevin's patches
 and the registers cache patches.

 The results are showing that:
 - the registers cache optimizes the low power mode transitions, but is
 not sufficient to obtain a big gain. A few unused domains are
 transitioning, which causes a big penalty in the idle path.

 PER is the one that seems to be causing the most latency.

 Can you try do your measurements using hack below which makes sure that
 PER isn't any deeper than CORE?

Indeed your patch brings significant improvements, cf. wiki page at
http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
for detailed information.
Here below is the reworked patch, more suited for inclusion in mainline [1]

I have another optimisation -in proof of concept state- that brings
another significant improvement. It is about allowing/disabling idle
for only 1 clkdm in a pwrdm and not iterate through all the clkdms.
This still needs some rework though. Cf. patch [2]
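
As a rough illustration of the clkdm optimisation described above, here is a minimal standalone C sketch. The struct and function names are hypothetical stand-ins, not the kernel's clockdomain API, and the register access is only simulated by a counter. It relies on the assumption stated in the thread that a power domain cannot transition while any one of its clock domains is denied idle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's clockdomain/powerdomain types. */
struct clkdm {
	int idle_allowed;		/* 1: hardware may idle this clock domain */
};

struct pwrdm {
	struct clkdm *clkdms;		/* clock domains in this power domain */
	size_t nr_clkdms;
	unsigned int reg_writes;	/* simulated (slow) PRCM accesses */
};

/* Baseline: walk every clock domain, one register access each. */
static void pwrdm_deny_idle_all(struct pwrdm *pd)
{
	for (size_t i = 0; i < pd->nr_clkdms; i++) {
		pd->clkdms[i].idle_allowed = 0;
		pd->reg_writes++;
	}
}

/* Optimisation: deny idle on a single clock domain only. */
static void pwrdm_deny_idle_one(struct pwrdm *pd)
{
	assert(pd->nr_clkdms > 0);
	pd->clkdms[0].idle_allowed = 0;
	pd->reg_writes++;
}

/* The power domain can idle only if all of its clock domains may idle. */
static int pwrdm_can_idle(const struct pwrdm *pd)
{
	for (size_t i = 0; i < pd->nr_clkdms; i++)
		if (!pd->clkdms[i].idle_allowed)
			return 0;
	return 1;
}
```

In this model both variants keep the power domain from idling, but the single-clkdm variant costs one simulated access instead of one per clock domain.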

Patches [1] and [2] on top of the registers cache and the
optimisations in pre/post_transition bring the performance close to
the performance for the non cpuidle case (3.0MB/s compared to 3.1MB/s
on Beagleboard).

What do you think?

Regards,
Jean

---
[1]
diff --git a/arch/arm/mach-omap2/cpuidle34xx.c
b/arch/arm/mach-omap2/cpuidle34xx.c
index e406d7b..572b605 100644
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -279,32 +279,36 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
 	int ret;

 	/*
-	 * Prevent idle completely if CAM is active.
+	 * Use only C1 if CAM is active.
 	 * CAM does not have wakeup capability in OMAP3.
 	 */
-	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON) {
+	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON)
 		new_state_idx = drv->safe_state_index;
-		goto select_state;
-	}
-
-	new_state_idx = next_valid_state(dev, drv, index);
+	else
+		new_state_idx = next_valid_state(dev, drv, index);

-	/*
-	 * Prevent PER off if CORE is not in retention or off as this
-	 * would disable PER wakeups completely.
-	 */
+	/* Program PER state */
 	cx = cpuidle_get_statedata(&dev->states_usage[new_state_idx]);
 	core_next_state = cx->core_state;
-	per_next_state = per_saved_state = pwrdm_read_next_func_pwrst(per_pd);
-	if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
-	    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
-		per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	if (new_state_idx == 0) {
+		/* In C1 do not allow PER state lower than CORE state */
+		per_next_state = core_next_state;
+	} else {
+		/*
+		 * Prevent PER off if CORE is not in RETention or OFF as this
+		 * would disable PER wakeups completely.
+		 */
+		per_next_state = per_saved_state =
+				pwrdm_read_next_func_pwrst(per_pd);
+		if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
+		    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
+			per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	}

 	/* Are we changing PER target state? */
 	if (per_next_state != per_saved_state)
 		omap_set_pwrdm_state(per_pd, per_next_state);

-select_state:
 	ret = omap3_enter_idle(dev, drv, new_state_idx);

 	/* Restore original PER state if it was modified */
@@ -390,7 +394,6 @@ int __init omap3_idle_init(void)

 	/* C1 . MPU WFI + Core active */
 	_fill_cstate(drv, 0, "MPU ON + CORE ON");
-	(&drv->states[0])->enter = omap3_enter_idle;
 	drv->safe_state_index = 0;
 	cx = _fill_cstate_usage(dev, 0);
 	cx->valid = 1;	/* C1 is always valid */
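
The PER selection logic of patch [1] can be modeled outside the kernel as a pure function, which makes the two branches easier to check. This is only a sketch: the enum below is a hypothetical ordering of functional power states (deeper states have lower values), not the actual PWRDM_FUNC_PWRST_* definitions, and the function name is invented for illustration.

```c
#include <assert.h>

/* Hypothetical ordering of functional power states, deepest first. */
enum func_pwrst {
	FUNC_OFF,
	FUNC_OSWR,
	FUNC_CSWR,
	FUNC_INACTIVE,
	FUNC_ON,
};

/*
 * Return the state to program for PER before entering idle.
 * state_idx:       index of the chosen C-state (0 is C1)
 * core_next:       CORE target state for that C-state
 * per_programmed:  PER next state currently programmed
 */
static enum func_pwrst select_per_state(int state_idx,
					enum func_pwrst core_next,
					enum func_pwrst per_programmed)
{
	assert(state_idx >= 0);

	/* In C1, PER simply follows CORE. */
	if (state_idx == 0)
		return core_next;

	/* Deeper C-states: no PER OFF while CORE stays above retention. */
	if (per_programmed == FUNC_OFF && core_next > FUNC_CSWR)
		return FUNC_CSWR;

	return per_programmed;
}
```

For example, under these assumptions C1 with CORE ON always yields PER ON, while a deeper C-state with CORE ON and PER programmed to OFF yields CSWR.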

[2]
diff --git a/arch/arm/mach-omap2/cpuidle34xx.c
b/arch/arm/mach-omap2/cpuidle34xx.c
index e406d7b..6aa3c75 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -118,8 +118,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,

/* Deny idle for C1 */
if (index == 0) {
-   pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
-   pwrdm_for_each_clkdm(core_pd, 

Re: PM related performance degradation on OMAP3

2012-05-01 Thread Jean Pihet
Hi Kevin, Grazvydas,

On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:
 Jean Pihet jean.pi...@newoldbits.com writes:

 Hi Grazvydas, Kevin,

 I gathered some performance measurements and statistics using
 custom tracepoints in __omap3_enter_idle.
I posted the patches for the power domains registers cache, cf.
http://marc.info/?l=linux-omap&m=133587781712039&w=2.

 All the details are at
 http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
I updated the page with the measurements results with Kevin's patches
and the registers cache patches.

The results are showing that:
- the registers cache optimizes the low power mode transitions, but is
not sufficient to obtain a big gain. A few unused domains are
transitioning, which causes a big penalty in the idle path.
- khilman's optimizations are really helpful. Furthermore, they further
optimize the statistics accesses of the registers cache.
- the average time in idle now drops to 246us, which is still very
large for a CPU-intensive C-state. For reference, with PM disabled
the average time in idle is 113us.
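
The idea behind a registers cache can be sketched in a few lines of standalone C. Everything here is hypothetical (the struct, the function names, and the fixed value returned by the simulated slow read); it is not the code from the posted patches, only an illustration of why repeated reads of the same power state register become cheap once the value is cached.

```c
#include <stdint.h>

/* One cached register; slow_reads counts simulated PRCM accesses. */
struct cached_reg {
	uint32_t value;
	int valid;
	unsigned int slow_reads;
};

/* Hypothetical slow hardware read; returns a fixed value for the demo. */
static uint32_t slow_prcm_read(struct cached_reg *r)
{
	r->slow_reads++;
	return 0x3;
}

/* Read through the cache: only the first read touches the hardware. */
static uint32_t cached_read(struct cached_reg *r)
{
	if (!r->valid) {
		r->value = slow_prcm_read(r);
		r->valid = 1;
	}
	return r->value;
}

/* Must be called whenever the hardware state may have changed. */
static void cache_invalidate(struct cached_reg *r)
{
	r->valid = 0;
}
```

The cache only helps for registers that are read more often than they change, and each domain that really transitions still forces an invalidate and a fresh slow read.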

Regards,
Jean

 .

 This is great, thanks.

 [...]

 Here are the results (BW in MB/s) on Beagleboard:
 - 4.7: without using DMA,

 - Using DMA
   2.1: [0]
   2.1: [1] only C1
   2.6: [1]+[2] no pre_ post_
   2.3: [1]+[5] no pwrdm_for_each_clkdm
   2.8: [1]+[5]+[2]
   3.1: [1]+[5]+[6] no omap_sram_idle
   3.1: No IDLE, no omap_sram_idle, all pwrdms to ON

 So indeed this shows there is some serious performance issue with the
 C1 C-state.

 Yes, this confirms what both Grazvytas and I are seeing as well.

 [...]

 From the list of contributors, the main ones are:
     (140us) pwrdm_pre_transition and pwrdm_post_transition,

 See the series I just posted to address this one:
 [PATCH/RFT 0/3] ARM: OMAP: PM: reduce overhead of pwrdm pre/post transitions

     (105us) omap2_gpio_prepare_for_idle and
 omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
 the latency-critical C-states,
     (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
     (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
     (11 us) clkdm_allow_idle(mpu). Is this needed?

 In that same series, I removed this as it appears to be a remnant of a
 code move (c.f. patch 3 in above series.)

 Here are a few questions and suggestions:
 - In case of latency critical C-states could the high-latency code be
 bypassed in favor of a much simpler version? Pushing the concept a bit
 farther one could have a C1 state that just relaxes the cpu (no WFI),
 a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
 rest of the C-states as we have today,

 I was thinking a WFI only state, with *all* powerdomains staying on is
 probably sufficient for C1.  Do you see the enter/exit latency from that
 as even being too high?

 - Is it needed to iterate through all the power and clock domains in
 order to keep them active?

 No.  My series above starts to address this, but I think Tero's
 use-counting series is the final solution since this should really be
 done when we know the powerdomains are transitioning.

 Kevin
--
To unsubscribe from this list: send the line unsubscribe linux-omap in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PM related performance degradation on OMAP3

2012-05-01 Thread Kevin Hilman
Jean Pihet jean.pi...@newoldbits.com writes:

 HI Kevin, Grazvydas,

 On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:
 Jean Pihet jean.pi...@newoldbits.com writes:

 Hi Grazvydas, Kevin,

 I gathered some performance measurements and statistics using
 custom tracepoints in __omap3_enter_idle.
 I posted the patches for the power domains registers cache, cf.
 http://marc.info/?l=linux-omap&m=133587781712039&w=2.

 All the details are at
 http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
 I updated the page with the measurements results with Kevin's patches
 and the registers cache patches.

 The results are showing that:
 - the registers cache optimizes the low power mode transitions, but is
 not sufficient to obtain a big gain. A few unused domains are
 transitioning, which causes a big penalty in the idle path.

PER is the one that seems to be causing the most latency.  

Can you try doing your measurements with the hack below, which makes sure
that PER isn't any deeper than CORE?

Kevin

From bb2f67ed93dc83c645080e293d315d383c23c0c6 Mon Sep 17 00:00:00 2001
From: Kevin Hilman khil...@ti.com
Date: Mon, 16 Apr 2012 17:53:14 -0700
Subject: [PATCH] cpuidle34xx: per follows core, C1 use _bm

---
 arch/arm/mach-omap2/cpuidle34xx.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm/mach-omap2/cpuidle34xx.c 
b/arch/arm/mach-omap2/cpuidle34xx.c
index 374708d..00400ad 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -278,9 +278,11 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
 	cx = cpuidle_get_statedata(&dev->states_usage[index]);
 	core_next_state = cx->core_state;
 	per_next_state = per_saved_state = pwrdm_read_next_pwrst(per_pd);
-	if ((per_next_state == PWRDM_POWER_OFF) &&
-	    (core_next_state > PWRDM_POWER_RET))
-		per_next_state = PWRDM_POWER_RET;
+	/* if ((per_next_state == PWRDM_POWER_OFF) && */
+	/*     (core_next_state > PWRDM_POWER_RET)) */
+	/*	per_next_state = PWRDM_POWER_RET; */
+	if (per_next_state < core_next_state)
+		per_next_state = core_next_state;
 
 	/* Are we changing PER target state? */
 	if (per_next_state != per_saved_state)
@@ -374,7 +376,6 @@ int __init omap3_idle_init(void)
 
 	/* C1 . MPU WFI + Core active */
 	_fill_cstate(drv, 0, "MPU ON + CORE ON");
-	(&drv->states[0])->enter = omap3_enter_idle;
 	drv->safe_state_index = 0;
 	cx = _fill_cstate_usage(dev, 0);
 	cx->valid = 1;	/* C1 is always valid */
-- 
1.7.9.2
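
To see where the hack above differs from the rule it comments out, both can be written as small pure functions and compared. This is a sketch under assumptions: the enum mirrors the usual ordering of the PWRDM_POWER_* constants (OFF lowest, ON highest) as I understand it, and the function names are made up for illustration.

```c
/* Assumed ordering, mirroring PWRDM_POWER_*: deeper states are lower. */
enum pwrst {
	PWR_OFF = 0,
	PWR_RET = 1,
	PWR_INACTIVE = 2,
	PWR_ON = 3,
};

/* Previous rule: only prevent PER OFF while CORE stays above RET. */
static int per_state_old_rule(int per_next, int core_next)
{
	if (per_next == PWR_OFF && core_next > PWR_RET)
		return PWR_RET;
	return per_next;
}

/* Hack: PER is never programmed deeper than CORE. */
static int per_state_follows_core(int per_next, int core_next)
{
	if (per_next < core_next)
		return core_next;
	return per_next;
}
```

With CORE ON and PER programmed to RET, the previous rule leaves PER in RET, whereas the hack raises it to ON, so PER no longer transitions when only C1 is entered.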



Re: PM related performance degradation on OMAP3

2012-05-01 Thread Paul Walmsley
On Tue, 1 May 2012, Kevin Hilman wrote:

 PER is the one that seems to be causing the most latency.  
 
 Can you try do your measurements using hack below which makes sure that
 PER isn't any deeper than CORE?

It might be the relock time for DPLL4, the PER DPLL.  You might also 
try disabling DPLL4 autoidle for the shallow C-states...
 

- Paul


Re: PM related performance degradation on OMAP3

2012-04-24 Thread Jean Pihet
Hi Grazvydas, Kevin,

I gathered some performance measurements and statistics using
custom tracepoints in __omap3_enter_idle.
All the details are at
http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
.

The setup is:
- Beagleboard (OMAP3530) at 500MHz,
- l-o master kernel + functional power states + per-device PM QoS. It
has been checked that the changes from l-o master do not have an
impact on the performance.
- The data transfer is performed using dd from a file in JFFS2 to
/dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.

On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman khil...@ti.com wrote:
 Grazvydas Ignotas nota...@gmail.com writes:

 On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote:
 It would be helpful now to narrow down what are the big contributors to
 the overhead in omap_sram_idle().  Most of the code there is skipped for
 C1 because the next states for MPU and CORE are both ON.

 Ok I did some tests, all in mostly idle system with just init, busybox
 shell and dd doing a NAND read to /dev/null .

...

 MB/s is the throughput that
 dd reports, mA is the approx. current draw during the transfer, read from
 the onboard fuel gauge.

 MB/s| mA|comment
  3.7|218|mainline f549e088b80
  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
  4.4|220|[1] + pwrdm_p*_transition commented [2]
  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
  4.0|224|[1] + 'Deny idle' [5]
  5.1|210|[2] + [4] + [5]
  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
  5.5|243|!CONFIG_PM
  6.1|282|busywait DMA end (for reference)

Here are the results (BW in MB/s) on Beagleboard:
- 4.7: without using DMA,

- Using DMA
  2.1: [0]
  2.1: [1] only C1
  2.6: [1]+[2] no pre_ post_
  2.3: [1]+[5] no pwrdm_for_each_clkdm
  2.8: [1]+[5]+[2]
  3.1: [1]+[5]+[6] no omap_sram_idle
  3.1: No IDLE, no omap_sram_idle, all pwrdms to ON

So indeed this shows there is some serious performance issue with the
C1 C-state.
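
To put the Beagleboard numbers above in relative terms, a small helper can express each measurement as a percentage of the "all pwrdms to ON" baseline. The function name is hypothetical; the inputs are the MB/s figures from the list.

```c
#include <assert.h>

/* Throughput as a percentage of a baseline throughput. */
static double percent_of_baseline(double measured_mbps, double baseline_mbps)
{
	assert(baseline_mbps > 0.0);
	return 100.0 * measured_mbps / baseline_mbps;
}
```

For instance, percent_of_baseline(2.1, 3.1) is roughly 68, i.e. the default C1 path loses about a third of the DMA throughput, while percent_of_baseline(2.8, 3.1) is roughly 90.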

 Thanks for the detailed experiments.  This definitely confirms we have
 some serious unwanted overhead for C1, and our C-state latency values
 are clearly way off base, since they only account HW latency and not any
 of the SW latency introduced in omap_sram_idle().

 There are 2 primary differences that I see as possible causes.  I list
 them here with a couple more experiments for you to try to help us
 narrow this down.

 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()

 Could you try using omap_sram_idle() and just commenting out those
 calls?  Does that help performance?  Those iterate over all the
 powerdomains, so definitely add some overhead, but I don't think it
 would be as significant as what you're seeing.

 Seems to be taking good part of it.

    Much more likely is...

 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds

 Could not notice any difference.

 To me it looks like this results from many small things adding up..
 Idle is called so often that pwrdm_p*_transition() and those
 pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
 because they access lots of registers on slow buses?

From the list of contributors, the main ones are:
(140us) pwrdm_pre_transition and pwrdm_post_transition,
(105us) omap2_gpio_prepare_for_idle and
omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
the latency-critical C-states,
(78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
(33us estimated) omap_set_pwrdm_state(mpu, core, neon),
(11 us) clkdm_allow_idle(mpu). Is this needed?
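
Adding up the contributors listed above gives a feel for the total software overhead per idle entry. The sketch below simply tabulates the figures from the list (the table and helper names are hypothetical) and computes the sum and each item's share.

```c
#include <assert.h>
#include <stddef.h>

/* Per-idle-call overheads quoted in the list above, in microseconds. */
struct contributor {
	const char *name;
	unsigned int us;
};

static const struct contributor c1_contributors[] = {
	{ "pwrdm_pre_transition + pwrdm_post_transition", 140 },
	{ "omap2_gpio_prepare_for_idle + resume_after_idle", 105 },
	{ "pwrdm_for_each_clkdm(deny_idle/allow_idle)", 78 },
	{ "omap_set_pwrdm_state (estimated)", 33 },
	{ "clkdm_allow_idle(mpu)", 11 },
};

#define NR_CONTRIBUTORS \
	(sizeof(c1_contributors) / sizeof(c1_contributors[0]))

/* Sum of all listed overheads. */
static unsigned int total_overhead_us(const struct contributor *c, size_t n)
{
	unsigned int sum = 0;

	assert(c != NULL);
	for (size_t i = 0; i < n; i++)
		sum += c[i].us;
	return sum;
}

/* Share of one contributor in the total, in whole percent (rounded down). */
static unsigned int share_percent(unsigned int part_us, unsigned int total_us)
{
	assert(total_us > 0);
	return 100u * part_us / total_us;
}
```

The listed items add up to 367 us, of which the pre/post transition accounting alone is about 38%.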

Here are a few questions and suggestions:
- In case of latency-critical C-states, could the high-latency code be
bypassed in favor of a much simpler version? Pushing the concept a bit
further, one could have a C1 state that just relaxes the cpu (no WFI),
a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
rest of the C-states as we have today,
- Is it necessary to iterate through all the power and clock domains in
order to keep them active?
- Trying to idle some unrelated power domains (e.g. PER) causes a
performance hit. How should the power domain states be linked to the
cpuidle C-state? The per-device PM QoS framework could be used to
constrain some power domains, but this is highly dependent on the use
case.

 Yes PRCM register accesses are unfortunately rather slow, and we've
 known that for some time, but haven't done any detailed analysis of the
 overhead.
It would be worth doing that analysis. A lot of read accesses to the
current, next and previous power states are performed in the idle
code.

 Using the function_graph tracer, I was able to see that the pre/post
 transition are taking an enormous amount of time:

  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
  - pwrdm post-transition: 1600+ us at 600MHz (6000+ us at 125MHz)

 Notice the big difference between 600MHz 

Re: PM related performance degradation on OMAP3

2012-04-24 Thread Santosh Shilimkar
+ Tero

On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
 Hi Grazvydas, Kevin,
 
 I gathered some performance measurements and statistics using
 custom tracepoints in __omap3_enter_idle.
 All the details are at
 http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
 .
 
Nice data.

 The setup is:
 - Beagleboard (OMAP3530) at 500MHz,
 - l-o master kernel + functional power states + per-device PM QoS. It
 has been checked that the changes from l-o master do not have an
 impact on the performance.
 - The data transfer is performed using dd from a file in JFFS2 to
 /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.
 
 On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman khil...@ti.com wrote:
 Grazvydas Ignotas nota...@gmail.com writes:

 On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote:
 It would be helpful now to narrow down what are the big contributors to
 the overhead in omap_sram_idle().  Most of the code there is skipped for
 C1 because the next states for MPU and CORE are both ON.

 Ok I did some tests, all in mostly idle system with just init, busybox
 shell and dd doing a NAND read to /dev/null .

 ...

 MB/s is throughput that
 dd reports, mA and approx. current draw during the transfer, read from
 fuel gauge that's onboard.

 MB/s| mA|comment
  3.7|218|mainline f549e088b80
  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
  4.4|220|[1] + pwrdm_p*_transition commented [2]
  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
  4.0|224|[1] + 'Deny idle' [5]
  5.1|210|[2] + [4] + [5]
  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
  5.5|243|!CONFIG_PM
  6.1|282|busywait DMA end (for reference)
 
 Here are the results (BW in MB/s) on Beagleboard:
 - 4.7: without using DMA,
 
 - Using DMA
   2.1: [0]
   2.1: [1] only C1
   2.6: [1]+[2] no pre_ post_
   2.3: [1]+[5] no pwrdm_for_each_clkdm
   2.8: [1]+[5]+[2]
   3.1: [1]+[5]+[6] no omap_sram_idle
   3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
 
 So indeed this shows there is some serious performance issue with the
 C1 C-state.

Looks like other clock domains (notably l4, per, AON) should be denied
idle in C1 to avoid the huge penalties. It might just do the trick.


 Thanks for the detailed experiments.  This definitely confirms we have
 some serious unwanted overhead for C1, and our C-state latency values
 are clearly way off base, since they only account HW latency and not any
 of the SW latency introduced in omap_sram_idle().

 There are 2 primary differences that I see as possible causes.  I list
 them here with a couple more experiments for you to try to help us
 narrow this down.

 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()

 Could you try using omap_sram_idle() and just commenting out those
 calls?  Does that help performance?  Those iterate over all the
 powerdomains, so definitely add some overhead, but I don't think it
 would be as significant as what you're seeing.

 Seems to be taking good part of it.

Much more likely is...

 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds

 Could not notice any difference.

 To me it looks like this results from many small things adding up..
 Idle is called so often that pwrdm_p*_transition() and those
 pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
 because they access lots of registers on slow buses?
 
 From the list of contributors, the main ones are:
 (140us) pwrdm_pre_transition and pwrdm_post_transition,

I have observed this one on OMAP4 too. There was a plan to remove
this as part of Tero's PD/CD use-counting series.

 (105us) omap2_gpio_prepare_for_idle and
 omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
 the latency-critical C-states,
Yes. In C1, when you deny idle for PER, there should be no need to
call this. But even when it is called, why is it taking
105 us? This needs further digging.

 (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
Depending on the OPP, a PRCM read can take up to ~12-14 us, so the above
shouldn't be surprising.

 (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
This is again dominated by PRCM read

 (11 us) clkdm_allow_idle(mpu). Is this needed?
 
I guess yes, otherwise when C2+ is attempted the MPU CD can't idle.

 Here are a few questions and suggestions:
 - In case of latency critical C-states could the high-latency code be
 bypassed in favor of a much simpler version? Pushing the concept a bit
 farther one could have a C1 state that just relaxes the cpu (no WFI),
 a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
 rest of the C-states as we have today,
We should do that. In fact, the C1 state should be as light as possible,
like WFI or so.

 - Is it needed to iterate through all the power and clock domains in
 order to keep them active?
That iteration should be removed.

 - Trying to 

Re: PM related performance degradation on OMAP3

2012-04-24 Thread Tero Kristo
On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote:
 + Tero
 
 On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
  Hi Grazvydas, Kevin,
  
  I gathered some performance measurements and statistics using
  custom tracepoints in __omap3_enter_idle.
  All the details are at
  http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
  .
  
 Nice data.
 
  The setup is:
  - Beagleboard (OMAP3530) at 500MHz,
  - l-o master kernel + functional power states + per-device PM QoS. It
  has been checked that the changes from l-o master do not have an
  impact on the performance.
  - The data transfer is performed using dd from a file in JFFS2 to
  /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.

Question: what is used for gathering the latency values?

  
  On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman khil...@ti.com wrote:
  Grazvydas Ignotas nota...@gmail.com writes:
 
  On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote:
  It would be helpful now to narrow down what are the big contributors to
  the overhead in omap_sram_idle().  Most of the code there is skipped for
  C1 because the next states for MPU and CORE are both ON.
 
  Ok I did some tests, all in mostly idle system with just init, busybox
  shell and dd doing a NAND read to /dev/null .
 
  ...
 
  MB/s is throughput that
  dd reports, mA and approx. current draw during the transfer, read from
  fuel gauge that's onboard.
 
  MB/s| mA|comment
   3.7|218|mainline f549e088b80
   3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
   4.4|220|[1] + pwrdm_p*_transition commented [2]
   3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
   4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
   4.0|224|[1] + 'Deny idle' [5]
   5.1|210|[2] + [4] + [5]
   5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
   5.5|243|!CONFIG_PM
   6.1|282|busywait DMA end (for reference)
  
  Here are the results (BW in MB/s) on Beagleboard:
  - 4.7: without using DMA,
  
  - Using DMA
    2.1: [0]
    2.1: [1] only C1
    2.6: [1]+[2] no pre_ post_
    2.3: [1]+[5] no pwrdm_for_each_clkdm
    2.8: [1]+[5]+[2]
    3.1: [1]+[5]+[6] no omap_sram_idle
    3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
  
  So indeed this shows there is some serious performance issue with the
  C1 C-state.
 
 Looks like other clock-domain (notably l4, per, AON) should be denied
 idle in C1 to avoid the huge penalties. It might just do the trick.
 
 
  Thanks for the detailed experiments.  This definitely confirms we have
  some serious unwanted overhead for C1, and our C-state latency values
  are clearly way off base, since they only account HW latency and not any
  of the SW latency introduced in omap_sram_idle().
 
  There are 2 primary differences that I see as possible causes.  I list
  them here with a couple more experiments for you to try to help us
  narrow this down.
 
  1) powerdomain accounting: pwrdm_pre_transition(), 
  pwrdm_post_transition()
 
  Could you try using omap_sram_idle() and just commenting out those
  calls?  Does that help performance?  Those iterate over all the
  powerdomains, so definitely add some overhead, but I don't think it
  would be as significant as what you're seeing.
 
  Seems to be taking good part of it.
 
 Much more likely is...
 
  2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
 
  Could not notice any difference.
 
  To me it looks like this results from many small things adding up..
  Idle is called so often that pwrdm_p*_transition() and those
  pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
  because they access lots of registers on slow buses?
  
  From the list of contributors, the main ones are:
  (140us) pwrdm_pre_transition and pwrdm_post_transition,
 
 I have observed this one on OMAP4 too. There was a plan to remove
 this as part of Tero's PD/CD use-counting series.

The pwrdm pre/post transitions could already be optimized a bit now. They
should only need to be called for the mpu, core and per domains, but
currently they scan through everything.

 
  (105us) omap2_gpio_prepare_for_idle and
  omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
  the latency-critical C-states,
 Yes. In C1, when you deny idle for PER, there should be no need to
 call this. But even when it is called, why is it taking 105 us? This
 needs further digging.
 
  (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
 Depending on OPP, a PRCM read can take up to ~12-14 us, so the above
 shouldn't be surprising.
 
  (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
 This is again dominated by PRCM read
 
  (11 us) clkdm_allow_idle(mpu). Is this needed?
  
 I guess yes; otherwise, when C2+ is attempted, the MPU CD can't idle.
 
  Here are a few questions and suggestions:
  - In case of latency critical C-states could the high-latency code be
  bypassed in favor of a much simpler version? Pushing 

Re: PM related performance degradation on OMAP3

2012-04-24 Thread Jean Pihet
Hi Tero,

On Tue, Apr 24, 2012 at 2:21 PM, Tero Kristo t-kri...@ti.com wrote:
 On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote:
 + Tero

 On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
  Hi Grazvydas, Kevin,
 
  I gathered some performance measurements and statistics using
  custom tracepoints in __omap3_enter_idle.
  All the details are at
  http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
  .
 
 Nice data.

  The setup is:
  - Beagleboard (OMAP3530) at 500MHz,
  - l-o master kernel + functional power states + per-device PM QoS. It
  has been checked that the changes from l-o master do not have an
  impact on the performance.
  - The data transfer is performed using dd from a file in JFFS2 to
  /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.

 Question: what is used for gathering the latency values?
I used ftrace tracepoints which are supposed to be low overhead. I
checked that the overhead cannot be measured on the measurement
interval (400us), given the fact that the time base is 31us (32 KHz
clock).
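
As an illustration of the time-base argument above (a sketch, not part of
the posted measurements): assuming the 32 KHz clock is the usual 32768 Hz
source, one tick lasts about 30.5us, which matches the 31us figure. The
function names are made up for the example.

```python
# Sketch: granularity of a 32 kHz time base relative to a measured interval.
# Assumption: "32 KHz" means a 32768 Hz clock; all names are illustrative.

CLOCK_HZ = 32768


def tick_us(clock_hz: int = CLOCK_HZ) -> float:
    """Duration of one timer tick in microseconds."""
    return 1_000_000 / clock_hz


def ticks_in(interval_us: float, clock_hz: int = CLOCK_HZ) -> float:
    """Number of timer ticks that fit into the given interval."""
    return interval_us / tick_us(clock_hz)


if __name__ == "__main__":
    print(f"tick period: {tick_us():.1f} us")
    print(f"ticks in a 400 us interval: {ticks_in(400):.1f}")
    # A single timestamp can be off by up to one tick.
    print(f"one tick is {100 * tick_us() / 400:.1f}% of 400 us")
```

Under that assumption a 400us interval spans only about 13 ticks, so any
tracepoint overhead well below one tick would not be visible.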

 
  On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman khil...@ti.com wrote:
  Grazvydas Ignotas nota...@gmail.com writes:
 
  On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote:
  It would be helpful now to narrow down what are the big contributors to
  the overhead in omap_sram_idle().  Most of the code there is skipped for
  C1 because the next states for MPU and CORE are both ON.
 
  Ok I did some tests, all in mostly idle system with just init, busybox
  shell and dd doing a NAND read to /dev/null .
 
  ...
 
  MB/s is throughput that
  dd reports, mA and approx. current draw during the transfer, read from
  fuel gauge that's onboard.
 
  MB/s| mA|comment
   3.7|218|mainline f549e088b80
   3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
   4.4|220|[1] + pwrdm_p*_transition commented [2]
   3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
   4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
   4.0|224|[1] + 'Deny idle' [5]
   5.1|210|[2] + [4] + [5]
   5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
   5.5|243|!CONFIG_PM
   6.1|282|busywait DMA end (for reference)
 
  Here are the results (BW in MB/s) on Beagleboard:
  - 4.7: without using DMA,
 
  - Using DMA
    2.1: [0]
    2.1: [1] only C1
    2.6: [1]+[2] no pre_ post_
    2.3: [1]+[5] no pwrdm_for_each_clkdm
    2.8: [1]+[5]+[2]
    3.1: [1]+[5]+[6] no omap_sram_idle
    3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
 
  So indeed this shows there is some serious performance issue with the
  C1 C-state.
 
 Looks like other clock-domain (notably l4, per, AON) should be denied
 idle in C1 to avoid the huge penalties. It might just do the trick.


  Thanks for the detailed experiments.  This definitely confirms we have
  some serious unwanted overhead for C1, and our C-state latency values
  are clearly way off base, since they only account HW latency and not any
  of the SW latency introduced in omap_sram_idle().
 
  There are 2 primary differences that I see as possible causes.  I list
  them here with a couple more experiments for you to try to help us
  narrow this down.
 
  1) powerdomain accounting: pwrdm_pre_transition(), 
  pwrdm_post_transition()
 
  Could you try using omap_sram_idle() and just commenting out those
  calls?  Does that help performance?  Those iterate over all the
  powerdomains, so definitely add some overhead, but I don't think it
  would be as significant as what you're seeing.
 
  Seems to be taking good part of it.
 
     Much more likely is...
 
  2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
 
  Could not notice any difference.
 
  To me it looks like this results from many small things adding up..
  Idle is called so often that pwrdm_p*_transition() and those
  pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
  because they access lots of registers on slow buses?
 
  From the list of contributors, the main ones are:
      (140us) pwrdm_pre_transition and pwrdm_post_transition,

 I have observed this one on OMAP4 too. There was a plan to remove
 this as part of Tero's PD/CD use-counting series.

 pwrdm_pre / post transitions could be optimized a bit already now. They
 only should need to be called for mpu / core and per domains, but
 currently they scan through everything.


      (105us) omap2_gpio_prepare_for_idle and
  omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
  the latency-critical C-states,
 Yes. In C1, when you deny idle for PER, there should be no need to
 call this. But even when it is called, why is it taking 105 us? This
 needs further digging.

      (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
 Depending on OPP, a PRCM read can take up to ~12-14 us, so the above
 shouldn't be surprising.

      (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
 This is again dominated by PRCM read

      (11 us) 

Re: PM related performance degradation on OMAP3

2012-04-24 Thread Tero Kristo
On Tue, 2012-04-24 at 14:50 +0200, Jean Pihet wrote:
 Hi Tero,
 
 On Tue, Apr 24, 2012 at 2:21 PM, Tero Kristo t-kri...@ti.com wrote:
  On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote:
  + Tero
 
  On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
   Hi Grazvydas, Kevin,
  
   I gathered some performance measurements and statistics using
   custom tracepoints in __omap3_enter_idle.
   All the details are at
   http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
   .
  
  Nice data.
 
   The setup is:
   - Beagleboard (OMAP3530) at 500MHz,
   - l-o master kernel + functional power states + per-device PM QoS. It
   has been checked that the changes from l-o master do not have an
   impact on the performance.
   - The data transfer is performed using dd from a file in JFFS2 to
   /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.
 
  Question: what is used for gathering the latency values?
 I used ftrace tracepoints which are supposed to be low overhead. I
 checked that the overhead cannot be measured on the measurement
 interval (400us), given the fact that the time base is 31us (32 KHz
 clock).

If you want to get accurate measurements, you could use the ARM performance
counters, namely the cycle counter. I have a couple of patches that I've
used for that purpose, if you are interested.
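
For reference, turning cycle-counter deltas into time is simple arithmetic.
The sketch below is only an illustration and is not taken from the patches
mentioned above; it assumes the counter advances once per CPU cycle and that
the CPU clock stays fixed (500MHz, as in the quoted Beagleboard setup). The
function names are hypothetical.

```python
# Sketch: converting CPU cycle-counter deltas to microseconds.
# Assumptions: one count per CPU cycle, fixed CPU clock (500 MHz default).


def cycles_to_us(cycles: int, cpu_hz: int = 500_000_000) -> float:
    """Elapsed time in microseconds for a cycle-count delta."""
    return cycles * 1_000_000 / cpu_hz


def resolution_ns(cpu_hz: int = 500_000_000) -> float:
    """Time represented by a single cycle, in nanoseconds."""
    return 1_000_000_000 / cpu_hz


if __name__ == "__main__":
    print(f"resolution: {resolution_ns():.1f} ns per cycle")
    print(f"70000 cycles = {cycles_to_us(70_000):.1f} us")
```

If cpufreq changes the OPP during a measurement, the fixed-clock assumption
no longer holds and the conversion would be off.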

-Tero

--
To unsubscribe from this list: send the line unsubscribe linux-omap in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PM related performance degradation on OMAP3

2012-04-24 Thread Kevin Hilman
Jean Pihet jean.pi...@newoldbits.com writes:

 Hi Grazvydas, Kevin,

 I gathered some performance measurements and statistics using
 custom tracepoints in __omap3_enter_idle.
 All the details are at
 http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
 .

This is great, thanks.

[...]

 Here are the results (BW in MB/s) on Beagleboard:
 - 4.7: without using DMA,

 - Using DMA
   2.1: [0]
   2.1: [1] only C1
   2.6: [1]+[2] no pre_ post_
   2.3: [1]+[5] no pwrdm_for_each_clkdm
   2.8: [1]+[5]+[2]
   3.1: [1]+[5]+[6] no omap_sram_idle
   3.1: No IDLE, no omap_sram_idle, all pwrdms to ON

 So indeed this shows there is some serious performance issue with the
 C1 C-state.

Yes, this confirms what both Grazvydas and I are seeing as well.

[...]

 From the list of contributors, the main ones are:
 (140us) pwrdm_pre_transition and pwrdm_post_transition,

See the series I just posted to address this one:
[PATCH/RFT 0/3] ARM: OMAP: PM: reduce overhead of pwrdm pre/post transitions

 (105us) omap2_gpio_prepare_for_idle and
 omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
 the latency-critical C-states,
 (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
 (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
 (11 us) clkdm_allow_idle(mpu). Is this needed?

In that same series, I removed this as it appears to be a remnant of a
code move (c.f. patch 3 in above series.)

 Here are a few questions and suggestions:
 - In case of latency critical C-states could the high-latency code be
 bypassed in favor of a much simpler version? Pushing the concept a bit
 farther one could have a C1 state that just relaxes the cpu (no WFI),
 a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
 rest of the C-states as we have today,

I was thinking a WFI-only state, with *all* powerdomains staying on, is
probably sufficient for C1.  Do you see the enter/exit latency from that
as even being too high?

 - Is it needed to iterate through all the power and clock domains in
 order to keep them active?

No.  My series above starts to address this, but I think Tero's
use-counting series is the final solution since this should really be
done when we know the powerdomains are transitioning.

Kevin


Re: PM related performance degradation on OMAP3

2012-04-17 Thread Kevin Hilman
Grazvydas Ignotas nota...@gmail.com writes:

 On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote:
 It would be helpful now to narrow down what are the big contributors to
 the overhead in omap_sram_idle().  Most of the code there is skipped for
 C1 because the next states for MPU and CORE are both ON.

 Ok I did some tests, all in mostly idle system with just init, busybox
 shell and dd doing a NAND read to /dev/null . 

Hmm, I seem to get a hang using dd to read from NAND /dev/mtdX on my
Overo.  I saw your patch 'mtd: omap2: fix resource leak in prefetch-busy
path' but that didn't seem to help my crash.

 MB/s is throughput that
 dd reports, mA and approx. current draw during the transfer, read from
 fuel gauge that's onboard.

 MB/s| mA|comment
  3.7|218|mainline f549e088b80
  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
  4.4|220|[1] + pwrdm_p*_transition commented [2]
  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
  4.0|224|[1] + 'Deny idle' [5]
  5.1|210|[2] + [4] + [5]
  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
  5.5|243|!CONFIG_PM
  6.1|282|busywait DMA end (for reference)

Thanks for the detailed experiments.  This definitely confirms we have
some serious unwanted overhead for C1, and our C-state latency values
are clearly way off base, since they only account HW latency and not any
of the SW latency introduced in omap_sram_idle().

 There are 2 primary differences that I see as possible causes.  I list
 them here with a couple more experiments for you to try to help us
 narrow this down.

 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()

 Could you try using omap_sram_idle() and just commenting out those
 calls?  Does that help performance?  Those iterate over all the
 powerdomains, so definitely add some overhead, but I don't think it
 would be as significant as what you're seeing.

 Seems to be taking good part of it.

    Much more likely is...

 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds

 Could not notice any difference.

 To me it looks like this results from many small things adding up..
 Idle is called so often that pwrdm_p*_transition() and those
 pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
 because they access lots of registers on slow buses? 

Yes PRCM register accesses are unfortunately rather slow, and we've
known that for some time, but haven't done any detailed analysis of the
overhead.

Using the function_graph tracer, I was able to see that the pre/post
transitions are taking an enormous amount of time:

  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
  - pwrdm post-transition: 1600+ us at 600MHz (6000+ us at 125MHz)

Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
using CPUfreq at all in your tests?  If using cpufreq + ondemand
governor, you're probably running at low OPP due to lack of CPU activity
which will also affect the latencies in the idle path.

 Maybe some register cache would help us there, or are those registers
 expected to be changed by hardware often?

Yes, we've known that some sort of register cache here would be useful
for some time, but haven't got to implementing it.
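
To make the register cache idea concrete, here is a minimal sketch of a
read-through cache. It is not kernel code and not the design of any posted
patch: read_hw and write_hw are hypothetical stand-ins for the slow PRCM
accessors, and invalidate() covers registers that the hardware may change
on its own.

```python
# Sketch of a read-through register cache (hypothetical, not kernel code).
from typing import Callable, Dict


class RegisterCache:
    def __init__(self, read_hw: Callable[[int], int],
                 write_hw: Callable[[int, int], None]) -> None:
        self._read_hw = read_hw
        self._write_hw = write_hw
        self._cache: Dict[int, int] = {}

    def read(self, addr: int) -> int:
        # Only touch the (slow) hardware on a cache miss.
        if addr not in self._cache:
            self._cache[addr] = self._read_hw(addr)
        return self._cache[addr]

    def write(self, addr: int, value: int) -> None:
        # Write through, and remember the value for later reads.
        self._write_hw(addr, value)
        self._cache[addr] = value

    def invalidate(self, addr: int) -> None:
        # Drop a cached value that the hardware may have changed.
        self._cache.pop(addr, None)


if __name__ == "__main__":
    hw = {0x10: 3}
    hw_reads = []

    def read_hw(addr: int) -> int:
        hw_reads.append(addr)
        return hw[addr]

    def write_hw(addr: int, value: int) -> None:
        hw[addr] = value

    cache = RegisterCache(read_hw, write_hw)
    cache.read(0x10)
    cache.read(0x10)
    print("hardware reads:", len(hw_reads))
```

Such a cache only pays off for registers that software alone modifies;
status registers updated by hardware would need an invalidate before each
read, which removes the benefit.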

 Also trying to idle PER while transfer is ongoing (as reported in
 previous mail) doesn't sound like a good idea and is one of the
 reasons for slowdown. Seems to also causing more current drain,
 ironically.

Agreed.  Again, using the function_graph tracer, I get some pretty big
latencies from the GPIO pre/post idling process:

  - gpio_prepare_for_idle(): 2400+ us at 600MHz (8200+ us at 125MHz)
  - gpio_resume_from_idle(): 2200+ us at 600MHz (7600+ us at 125MHz)

Removing PER transitions as you did will get rid of those.
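
A quick tally of the function_graph figures quoted in this mail, as a
sketch only. The numbers are lower bounds ("N+ us") and were taken with the
tracer enabled; the dictionary layout and function names are made up for
the example.

```python
# Sketch: totalling the function_graph latencies from this mail, per OPP.
# Values are the "N+ us" lower bounds at 600 MHz and 125 MHz.

LATENCY_US = {
    "pwrdm pre-transition":  {600: 1400, 125: 4000},
    "pwrdm post-transition": {600: 1600, 125: 6000},
    "gpio_prepare_for_idle": {600: 2400, 125: 8200},
    "gpio_resume_from_idle": {600: 2200, 125: 7600},
}


def total_us(mhz: int) -> int:
    """Sum of all listed contributors at the given OPP frequency."""
    return sum(per_opp[mhz] for per_opp in LATENCY_US.values())


def slowdown(name: str) -> float:
    """How much slower a contributor is at 125 MHz than at 600 MHz."""
    return LATENCY_US[name][125] / LATENCY_US[name][600]


if __name__ == "__main__":
    for mhz in (600, 125):
        print(f"{mhz} MHz: at least {total_us(mhz)} us per idle cycle")
    for name in LATENCY_US:
        print(f"{name}: {slowdown(name):.1f}x slower at 125 MHz")
```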

I'm looking into this in more detail now, and will likely have a few
patches for you to experiment with.

Thanks again for digging into this with us,

Kevin



 changes (again, sorry for corrupted diffs, but they should be easy to
 reproduce):
 [2]:
 --- a/arch/arm/mach-omap2/pm34xx.c
 +++ b/arch/arm/mach-omap2/pm34xx.c
 @@ -307,7 +307,7 @@ void omap_sram_idle(void)
 omap3_enable_io_chain();
 }

 -   pwrdm_pre_transition();
 +// pwrdm_pre_transition();

 /* PER */
 if (per_next_state < PWRDM_POWER_ON) {
 @@ -372,7 +373,7 @@ void omap_sram_idle(void)
 }
 omap3_intc_resume_idle();

 -   pwrdm_post_transition();
 +// pwrdm_post_transition();

 /* PER */
 if (per_next_state < PWRDM_POWER_ON) {
 [3]:
 --- a/arch/arm/mach-omap2/pm34xx.c
 +++ b/arch/arm/mach-omap2/pm34xx.c
 @@ -347,7 +347,7 @@ void omap_sram_idle(void)
 if (save_state == 1 || save_state == 3)
 cpu_suspend(save_state, omap34xx_do_sram_idle);
 else
 -   omap34xx_do_sram_idle(save_state);
 +   cpu_do_idle();

 /* Restore normal SDRC POWER settings */
 if 

Re: PM related performance degradation on OMAP3

2012-04-17 Thread Grazvydas Ignotas
On Tue, Apr 17, 2012 at 5:30 PM, Kevin Hilman khil...@ti.com wrote:
 Grazvydas Ignotas nota...@gmail.com writes:

 Ok I did some tests, all in mostly idle system with just init, busybox
 shell and dd doing a NAND read to /dev/null .

 Hmm, I seem to get a hang using dd to read from NAND /dev/mtdX on my
 Overo.  I saw your patch 'mtd: omap2: fix resource leak in prefetch-busy
 path' but that didn't seem to help my crash.

I see Overo doesn't set the 16-bit flag; I think it has NAND on a 16-bit bus?
Perhaps try this:

--- a/arch/arm/mach-omap2/board-overo.c
+++ b/arch/arm/mach-omap2/board-overo.c
@@ -517,7 +517,7 @@ static void __init overo_init(void)
omap_serial_init();
omap_sdrc_init(mt46h32m32lf6_sdrc_params,
  mt46h32m32lf6_sdrc_params);
-   omap_nand_flash_init(0, overo_nand_partitions,
+   omap_nand_flash_init(NAND_BUSWIDTH_16, overo_nand_partitions,
 ARRAY_SIZE(overo_nand_partitions));
usb_musb_init(NULL);
usbhs_init(usbhs_bdata);

Also, only pandora is using NAND DMA mode in mainline right now. The
default polling mode won't exhibit the latency problem (though it has all
the other polling consequences, like high CPU usage), so this is needed
too for the test:

--- a/arch/arm/mach-omap2/common-board-devices.c
+++ b/arch/arm/mach-omap2/common-board-devices.c
@@ -127,6 +127,7 @@ void __init omap_nand_flash_init(int options,
struct mtd_partition *parts,
nand_data.parts = parts;
nand_data.nr_parts = nr_parts;
nand_data.devsize = options;
+   nand_data.xfer_type = NAND_OMAP_PREFETCH_DMA;

printk(KERN_INFO "Registering NAND on CS%d\n", nandcs);
if (gpmc_nand_init(&nand_data) < 0)

I also forgot to mention that I was using ubifs in my test (dd'ing a large
file from it). I don't think it has much effect, but if you want to
try with that:
.config
CONFIG_MTD_UBI=y
CONFIG_UBIFS_FS=y
--
ubiformat /dev/mtdX -s 512
ubiattach /dev/ubi_ctrl -m X # X from mtdX
ubimkvol /dev/ubi0 -m -N somename
mount -t ubifs ubi0:somename /mnt

 To me it looks like this results from many small things adding up..
 Idle is called so often that pwrdm_p*_transition() and those
 pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
 because they access lots of registers on slow buses?

 Yes PRCM register accesses are unfortunately rather slow, and we've
 known that for some time, but haven't done any detailed analysis of the
 overhead.

 Using the function_graph tracer, I was able to see that the pre/post
 transitions are taking an enormous amount of time:

  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
  - pwrdm post-transition: 1600+ us at 600MHz (6000+ us at 125MHz)

Hmm, with this it wouldn't be able to do the ~500+ calls/sec I was seeing,
so the tracer overhead is probably quite large too.
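
The arithmetic behind that, as a small sketch (function names are made up
for the example): with the quoted 600MHz figures, the pre and post
transitions alone would cap the idle path well below 500 calls/sec.

```python
# Sketch: upper bound on idle calls per second for a given per-call cost.


def max_calls_per_sec(per_call_us: float) -> float:
    """Calls per second if every call costs per_call_us microseconds."""
    return 1_000_000 / per_call_us


def budget_us(calls_per_sec: float) -> float:
    """Time available per call at the given call rate, in microseconds."""
    return 1_000_000 / calls_per_sec


if __name__ == "__main__":
    pre_post_us = 1400 + 1600  # traced pre + post transition at 600 MHz
    print(f"max rate with traced cost: {max_calls_per_sec(pre_post_us):.0f}/s")
    print(f"budget per call at 500/s: {budget_us(500):.0f} us")
```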

 Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
 using CPUfreq at all in your tests?  If using cpufreq + ondemand
 governor, you're probably running at low OPP due to lack of CPU activity
 which will also affect the latencies in the idle path.

I used the performance governor in my tests, so it was all at 600MHz.

 I'm looking into this in more detail now, and will likely have a few
 patches for you to experiment with.

Sounds good,


-- 
Gražvydas


Re: PM related performance degradation on OMAP3

2012-04-13 Thread Felipe Balbi
Hi,

On Thu, Apr 12, 2012 at 09:57:32AM -0700, Kevin Hilman wrote:
 +Felipe for EHCI question
 
 Gary Thomas g...@mlbassoc.com writes:
 
 [...]
 
  This worked a treat, thanks.  My network performance is better
  now, but still not what it was.  The same TFTP transfer now takes
  71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
  second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no 
  difference.
 
 And does a CONFIG_PM=n kernel get you back to your v3.0 performance?
 
  I am interested in having PM working as I'm designing a battery powered
  portable unit, so I need to keep pursuing this.
 
 So do I. :)

we all are :-p

  Note: I noticed that when I built with CONFIG_PM off and no other
  changes, my EHCI USB didn't work properly.  Should this be the case?
 
 Probably not, but haven't tested EHCI USB.  I've Cc'd Felipe to see if
 he has any ideas why EHCI wouldn't work with CONFIG_PM=n.

Govind, Keshava... can you look into this at some point next week? Or
maybe give us a good reason why it doesn't work without PM ;-)

-- 
balbi




Re: PM related performance degradation on OMAP3

2012-04-13 Thread Grazvydas Ignotas
On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote:
 Grazvydas Ignotas nota...@gmail.com writes:

 On Mon, Apr 9, 2012 at 10:03 PM, Kevin Hilman khil...@ti.com wrote:
 Grazvydas Ignotas nota...@gmail.com writes:
 While SD card performance loss is not that bad (~7%), NAND one is
 worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
 cpuidle states over sysfs, it did not have any significant effect. Is
 there something else to try?

 Looks like we might need a PM QoS constraint when there is DMA activity
 in progress.

 You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when
 DMA transfers are active and I suspect that will help.

 I've tried it and it didn't help much. It looks like the only thing it
 does is limiting cpuidle c-states, I tried to set qos dma latency to 0
 and it made it stay in C1 while transfer was ongoing (I watched
 /sys/devices/system/cpu/cpu0/cpuidle/state*/usage), but performance
 was still poor.

 Great, thanks for doing this experiment.

 Assuming we get to a C1 that's low-latency enough, we will still need
 this constraint to ensure C1 during transfers.  But first we have to
 figure out what's going on with C1...

I've been working on this to collect more data, and noticed that PER
is often being put to RET even at C1. Is that expected? There is some
additional work being done in that case, like putting GPIOs to sleep,
and it seems to be the source of part of the performance loss here, as
it happens often during NAND transfers.

This can be reproduced while doing mmc transfers too and detected with this:
(not a valid patch, sorry, sending through gmail web)

--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -87,6 +87,8 @@ static int _cpuidle_deny_idle(struct powerdomain *pwrdm,
return 0;
 }

+int is_c1;
+
 static int __omap3_enter_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
@@ -117,6 +120,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
cpu_pm_enter();

/* Execute ARM wfi */
+   is_c1 = (index == 0);
omap_sram_idle();

/*
diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
index 703bd10..519ce9d 100644
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -275,6 +275,7 @@ void omap_sram_idle(void)
int per_going_off;
int core_prev_state, per_prev_state;
u32 sdrc_pwr = 0;
+   extern int is_c1;

mpu_next_state = pwrdm_read_next_pwrst(mpu_pwrdm);
switch (mpu_next_state) {
@@ -299,6 +300,8 @@ void omap_sram_idle(void)
/* Enable IO-PAD and IO-CHAIN wakeups */
per_next_state = pwrdm_read_next_pwrst(per_pwrdm);
core_next_state = pwrdm_read_next_pwrst(core_pwrdm);
+if (is_c1 && (per_next_state != PWRDM_POWER_ON || core_next_state !=
PWRDM_POWER_ON))
+ printk(KERN_ERR "c1 core %d, per %d\n", core_next_state, per_next_state);
if (omap3_has_io_wakeup() &&
(per_next_state < PWRDM_POWER_ON ||
 core_next_state < PWRDM_POWER_ON)) {


-- 
Gražvydas


Re: PM related performance degradation on OMAP3

2012-04-13 Thread Grazvydas Ignotas
On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote:
 It would be helpful now to narrow down what are the big contributors to
 the overhead in omap_sram_idle().  Most of the code there is skipped for
 C1 because the next states for MPU and CORE are both ON.

Ok, I did some tests, all on a mostly idle system with just init, a busybox
shell and dd doing a NAND read to /dev/null. MB/s is the throughput that
dd reports; mA is the approx. current draw during the transfer, read from
the onboard fuel gauge.

MB/s| mA|comment
 3.7|218|mainline f549e088b80
 3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
 4.4|220|[1] + pwrdm_p*_transition commented [2]
 3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
 4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
 4.0|224|[1] + 'Deny idle' [5]
 5.1|210|[2] + [4] + [5]
 5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
 5.5|243|!CONFIG_PM
 6.1|282|busywait DMA end (for reference)
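
To compare the rows more easily, the table can be normalised. This is only
a sketch: the choice of !CONFIG_PM as the reference and the "charge per MB"
metric (mA divided by MB/s) are illustrative and not part of the
measurements themselves.

```python
# Sketch: normalising the dd results from the table above.
# The reference row and the derived metrics are illustrative choices.

RESULTS = [
    # (MB/s, mA, comment)
    (3.7, 218, "mainline f549e088b80"),
    (3.8, 224, "[1] nand qos PM_QOS_CPU_DMA_LATENCY 0"),
    (4.4, 220, "[1] + [2]"),
    (3.8, 225, "[1] + [3]"),
    (4.2, 210, "[1] + [4]"),
    (4.0, 224, "[1] + [5]"),
    (5.1, 210, "[2] + [4] + [5]"),
    (5.2, 202, "[5] + [6]"),
    (5.5, 243, "!CONFIG_PM"),
    (6.1, 282, "busywait DMA end"),
]


def percent_of(mbps: float, reference_mbps: float = 5.5) -> float:
    """Throughput as a percentage of the reference (!CONFIG_PM) row."""
    return 100 * mbps / reference_mbps


def charge_per_mb(mbps: float, ma: float) -> float:
    """Charge drawn per megabyte transferred, in mA*s per MB."""
    return ma / mbps


if __name__ == "__main__":
    for mbps, ma, comment in RESULTS:
        print(f"{percent_of(mbps):5.1f}% {charge_per_mb(mbps, ma):5.1f} mAs/MB"
              f"  {comment}")
```

On these figures mainline reaches roughly two thirds of the !CONFIG_PM
throughput.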

 There are 2 primary differences that I see as possible causes.  I list
 them here with a couple more experiments for you to try to help us
 narrow this down.

 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()

 Could you try using omap_sram_idle() and just commenting out those
 calls?  Does that help performance?  Those iterate over all the
 powerdomains, so definitely add some overhead, but I don't think it
 would be as significant as what you're seeing.

Seems to be taking a good part of it.

    Much more likely is...

 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds

Could not notice any difference.

To me it looks like this results from many small things adding up.
Idle is called so often that pwrdm_p*_transition() and those
pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
because they access lots of registers on slow buses? Maybe some
register cache would help us there, or are those registers expected to
be changed by hardware often?
Also, trying to idle PER while a transfer is ongoing (as reported in the
previous mail) doesn't sound like a good idea and is one of the
reasons for the slowdown. It also seems to be causing more current drain,
ironically.


changes (again, sorry for corrupted diffs, but they should be easy to
reproduce):
[2]:
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -307,7 +307,7 @@ void omap_sram_idle(void)
omap3_enable_io_chain();
}

-   pwrdm_pre_transition();
+// pwrdm_pre_transition();

/* PER */
if (per_next_state < PWRDM_POWER_ON) {
@@ -372,7 +373,7 @@ void omap_sram_idle(void)
}
omap3_intc_resume_idle();

-   pwrdm_post_transition();
+// pwrdm_post_transition();

/* PER */
if (per_next_state < PWRDM_POWER_ON) {
[3]:
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -347,7 +347,7 @@ void omap_sram_idle(void)
if (save_state == 1 || save_state == 3)
cpu_suspend(save_state, omap34xx_do_sram_idle);
else
-   omap34xx_do_sram_idle(save_state);
+   cpu_do_idle();

/* Restore normal SDRC POWER settings */
if (cpu_is_omap3430() && omap_rev() >= OMAP3430_REV_ES3_0 &&
[4]:
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -107,6 +107,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
if (index == 0) {
pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
+   pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON);
}

/*
[5]:
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -105,8 +105,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,

/* Deny idle for C1 */
if (index == 0) {
-   pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
-   pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
+   clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
}

/*
@@ -128,8 +128,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,

/* Re-allow idle for C1 */
if (index == 0) {
-   pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
-   pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
+   clkdm_allow_idle(mpu_pd->pwrdm_clkdms[0]);
}

 return_sleep_time:
[6]:
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -117,7 +116,8 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
cpu_pm_enter();

/* Execute ARM wfi */
-   omap_sram_idle();
+   //omap_sram_idle();
+   cpu_do_idle();

/*
 * Call idle CPU PM enter notifier chain to restore


-- 
Gražvydas

Re: PM related performance degradation on OMAP3

2012-04-12 Thread Gary Thomas

On 2012-04-11 13:17, Kevin Hilman wrote:

Gary Thomas g...@mlbassoc.com writes:

[...]


I fear I'm seeing similar problems with 3.3.  I have my board (similar
to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
performance on 3.3.  For example, if I use TFTP to download a large file
(~35MB), I get this:
   3.0:  42.5 sec
   3.3: 625.0 sec
That's a factor of 15 worse!


This might not be the same problem.  What is the NIC being used, and
does it have GPIO interrupts?


My board uses SMSC911x with GPIO interrupt signal.



If it's using GPIO interrupts, then you likely need this patch from
mainline (v3.4-rc1)


I tried to just pick up the patch you [sort of] quoted below, but had
a hard time applying it to my kernel. I've tried to just pick up the
latest files from the mainline kernel, but so far I've nothing that
builds - too many dependencies.  These are the files I've pulled in
#   modified:   arch/arm/mach-omap2/cpuidle34xx.c
#   modified:   arch/arm/mach-omap2/gpio.c
#   modified:   arch/arm/mach-omap2/pm34xx.c
#   modified:   arch/arm/plat-omap/include/plat/gpio.h
#   modified:   drivers/gpio/gpio-omap.c
but it fails with these errors:
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:34:29: error: asm/system_misc.h: No such file or directory
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c: In function 'omap3_pm_init':
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:744: error: 'omap_pm_clkdms_setup' undeclared (first use in this function)
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:744: error: (Each undeclared identifier is reported only once
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:744: error: for each function it appears in.)
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:767: error: 'arm_pm_idle' undeclared (first use in this function)

Is this a viable path towards getting the GPIO changes into my kernel?
It's hard for me to update the whole kernel as there are some other
dependencies (OMAP3ISP and video in particular), so I'd like to stay
with this 3.3-ish base.

Thanks for any ideas



If that doesn't work, or you're not using GPIO interrupts, could you
confirm if the patch below[2] (based on idea from Grazvydas) increases
performance for you when CONFIG_PM=y.

Kevin

[1]
Author: Kevin Hilman khil...@ti.com  2012-03-05 15:10:04
Committer: Grant Likely grant.lik...@secretlab.ca  2012-03-12 09:16:11
Parent: 25db711df3258d125dc1209800317e5c0ef3c870 (gpio/omap: Fix IRQ handling for SPARSE_IRQ)
Child:  8805f410e4fb88a56552c1af42d61b38837a38fd (gpio/omap: Fix section warning for omap_mpuio_alloc_gc())
Branches: many (66)
Follows: v3.3-rc7
Precedes: v3.4-rc1

 gpio/omap: fix wakeups on level-triggered GPIOs

 While both level- and edge-triggered GPIOs are capable of generating
 interrupts, only edge-triggered GPIOs are capable of generating a
 module-level wakeup to the PRCM (c.f. 34xx NDA TRM section 25.5.3.2.)

 In order to ensure that devices using level-triggered GPIOs as
 interrupts can also cause wakeups (e.g. from idle), this patch enables
 edge-triggering for wakeup-enabled, level-triggered GPIOs when a GPIO
 bank is runtime-suspended (which also happens during idle.)

 This fixes a problem found in GPMC-connected network cards with GPIO
 interrupts (e.g. smsc911x on Zoom3, Overo, ...) where network booting
 with NFSroot was very slow since the GPIO IRQs used by the NIC were
 not generating PRCM wakeups, and thus not waking the system from idle.
 NOTE: until v3.3, this boot-time problem was somewhat masked because
 the UART init prevented WFI during boot until the full serial driver
 was available.  Preventing WFI allowed regular GPIO interrupts to fire
 and this problem was not seen.  After the UART runtime PM cleanups, we
 no longer avoid WFI during boot, so GPIO IRQs that were not causing
 wakeups resulted in very slow IRQ response times.

 Tested on platforms using level-triggered GPIOs for network IRQs using
 the SMSC911x NIC: 3530/Overo and 3630/Zoom3.

 Reported-by: Tony Lindgren t...@atomide.com
 Tested-by: Tarun Kanti DebBarma tarun.ka...@ti.com
 Tested-by: Tony Lindgren t...@atomide.com
 Signed-off-by: Kevin Hilman khil...@ti.com
 Signed-off-by: Grant Likely grant.lik...@secretlab.ca

[2]
diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index 413aac4..ace4bf6 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -120,7 +120,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 		cpu_pm_enter();
 
 	/* Execute ARM wfi */
-	omap_sram_idle();
+	if (index == 0)
+		cpu_do_idle();
+	else
+		omap_sram_idle();
 
 	/*
 	 * Call idle CPU PM enter notifier chain to restore

Re: PM related performance degradation on OMAP3

2012-04-12 Thread Kevin Hilman
Gary Thomas g...@mlbassoc.com writes:

 On 2012-04-11 13:17, Kevin Hilman wrote:
 Gary Thomas g...@mlbassoc.com writes:

 [...]

 I fear I'm seeing similar problems with 3.3.  I have my board (similar
 to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
 performance on 3.3.  For example, if I use TFTP to download a large file
 (~35MB), I get this:
3.0:  42.5 sec
3.3: 625.0 sec
 That's a factor of 15 worse!

 This might not be the same problem.  What is the NIC being used, and
 does it have GPIO interrupts?

 My board uses SMSC911x with GPIO interrupt signal.

OK, then your problem is almost certainly solved by my GPIO triggering
fix, and not related to Grazvydas' problem.


 If it's using GPIO interrupts, then you likely need this patch from
 mainline (v3.4-rc1)

 I tried to just pick up the patch you [sort of] quoted below, but had
 a hard time applying it to my kernel. I've tried to just pick up the
 latest files from the mainline kernel, but so far I've nothing that
 builds

Oh, right.  Sorry about that.  Yeah, that patch actually has
dependencies on other GPIO changes that were queued for v3.4 (and not in
v3.3.)

If you're on v3.3, just pull the branch below[1] which is based on
v3.3-rc2.  Pulling that into a v3.3 should build just fine.

Kevin

[1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap-pm.git 
for_3.4/fixes/gpio



Re: PM related performance degradation on OMAP3

2012-04-12 Thread Gary Thomas

On 2012-04-12 08:14, Kevin Hilman wrote:

Gary Thomas g...@mlbassoc.com writes:


On 2012-04-11 13:17, Kevin Hilman wrote:

Gary Thomas g...@mlbassoc.com writes:

[...]


I fear I'm seeing similar problems with 3.3.  I have my board (similar
to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
performance on 3.3.  For example, if I use TFTP to download a large file
(~35MB), I get this:
3.0:  42.5 sec
3.3: 625.0 sec
That's a factor of 15 worse!


This might not be the same problem.  What is the NIC being used, and
does it have GPIO interrupts?


My board uses SMSC911x with GPIO interrupt signal.


OK, then your problem is almost certainly solved by my GPIO triggering
fix, and not related to Grazvydas' problem.



If it's using GPIO interrupts, then you likely need this patch from
mainline (v3.4-rc1)


I tried to just pick up the patch you [sort of] quoted below, but had
a hard time applying it to my kernel. I've tried to just pick up the
latest files from the mainline kernel, but so far I've nothing that
builds


Oh, right.  Sorry about that.  Yeah, that patch actually has
dependencies on other GPIO changes that were queued for v3.4 (and not in
v3.3.)

If you're on v3.3, just pull the branch below[1] which is based on
v3.3-rc2.  Pulling that into a v3.3 should build just fine.

Kevin

[1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap-pm.git 
for_3.4/fixes/gpio


This worked a treat, thanks.  My network performance is better
now, but still not what it was.  The same TFTP transfer now takes
71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.
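
For reference, the three TFTP timings reported so far can be compared
directly.  This is only a sketch using the numbers quoted in this thread
(the helper names are illustrative); it shows that 71 seconds against 42.5
seconds is roughly two thirds more wall-clock time, or about 40% less
throughput, so "about 50%" sits between the two ways of stating it.

```python
# Elapsed times for the same ~35MB TFTP transfer, taken from this thread.
T_V30 = 42.5        # v3.0 baseline, seconds
T_V33 = 625.0       # v3.3 before the GPIO wakeup fix
T_V33_FIXED = 71.0  # v3.3 with the for_3.4/fixes/gpio branch pulled in


def extra_time(t_new, t_ref):
    """Fraction of additional wall-clock time relative to the reference."""
    return t_new / t_ref - 1.0


def throughput_drop(t_new, t_ref):
    """Fractional throughput loss for a fixed transfer size."""
    return 1.0 - t_ref / t_new


print(f"before fix: {T_V33 / T_V30:.1f}x slower")
print(f"after fix: {extra_time(T_V33_FIXED, T_V30):.0%} more time, "
      f"{throughput_drop(T_V33_FIXED, T_V30):.0%} less throughput")
```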

I am interested in having PM working as I'm designing a battery powered
portable unit, so I need to keep pursuing this.

Note: I noticed that when I built with CONFIG_PM off and no other
changes, my EHCI USB didn't work properly.  Should this be the case?

Thanks again for your help


--

Gary Thomas |  Consulting for the
MLB Associates  |Embedded world



Re: PM related performance degradation on OMAP3

2012-04-12 Thread Kevin Hilman
+Felipe for EHCI question

Gary Thomas g...@mlbassoc.com writes:

[...]

 This worked a treat, thanks.  My network performance is better
 now, but still not what it was.  The same TFTP transfer now takes
 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
 second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.

And does a CONFIG_PM=n kernel get you back to your v3.0 performance?

 I am interested in having PM working as I'm designing a battery powered
 portable unit, so I need to keep pursuing this.

So do I. :)

 Note: I noticed that when I built with CONFIG_PM off and no other
 changes, my EHCI USB didn't work properly.  Should this be the case?

Probably not, but haven't tested EHCI USB.  I've Cc'd Felipe to see if
he has any ideas why EHCI wouldn't work with CONFIG_PM=n.

Kevin


Re: PM related performance degradation on OMAP3

2012-04-12 Thread Gary Thomas

On 2012-04-12 10:57, Kevin Hilman wrote:

+Felipe for EHCI question

Gary Thomas g...@mlbassoc.com writes:

[...]


This worked a treat, thanks.  My network performance is better
now, but still not what it was.  The same TFTP transfer now takes
71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.


And does a CONFIG_PM=n kernel get you back to your v3.0 performance?


Correct.




I am interested in having PM working as I'm designing a battery powered
portable unit, so I need to keep pursuing this.


So do I. :)


Note: I noticed that when I built with CONFIG_PM off and no other
changes, my EHCI USB didn't work properly.  Should this be the case?


Probably not, but haven't tested EHCI USB.  I've Cc'd Felipe to see if
he has any ideas why EHCI wouldn't work with CONFIG_PM=n.


Thanks

--

Gary Thomas |  Consulting for the
MLB Associates  |Embedded world



Re: PM related performance degradation on OMAP3

2012-04-12 Thread Kevin Hilman
Gary Thomas g...@mlbassoc.com writes:

 On 2012-04-12 10:57, Kevin Hilman wrote:
 +Felipe for EHCI question

 Gary Thomas g...@mlbassoc.com writes:

 [...]

 This worked a treat, thanks.  My network performance is better
 now, but still not what it was.  The same TFTP transfer now takes
 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
 second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no 
 difference.

 And does a CONFIG_PM=n kernel get you back to your v3.0 performance?

 Correct.


OK, I just tried your TFTP experiment on a 3530/Overo board with the
same smsc911x NIC that has GPIO interrupts, and I don't see much
difference between a PM-enabled v3.0 and a PM-enabled v3.3.

Are you TFTP'ing the file to an MMC filesystem?  Can you try to a
ramdisk[1]?  If you're using MMC, it could be MMC driver changes since
v3.0 that are actually causing your performance hit.

In my experiment, I TFTP'd a 24MB file to a ramdisk, to make sure no
other drivers were involved, and didn't see any major differences
between v3.0, v3.3, and v3.3 CONFIG_PM disabled.

Below are my results.  As you can see, all the results seem to be pretty
close to the same.  This test was not on a controlled, isolated network,
so the differences are probably explained by other network activity:

- v3.0 vanilla: PM enabled, CPUidle enabled
  - Received 25362406 bytes in 35.5 seconds
  - Received 25362406 bytes in 44.9 seconds
  - Received 25362406 bytes in 49.0 seconds
  - Received 25362406 bytes in 36.2 seconds
  - Received 25362406 bytes in 56.3 seconds
  - Received 25362406 bytes in 65.2 seconds
  - Received 25362406 bytes in 37.0 seconds

- v3.3: PM enabled, CPUidle enabled
 + GPIO fix (my for_3.4/fixes/gpio branch)
 + smsc911x regulator boot fix (Tony's omap/fix-smsc911x-regulator branch)
  - Received 25362406 bytes in 32.1 seconds
  - Received 25362406 bytes in 29.8 seconds
  - Received 25362406 bytes in 33.5 seconds
  - Received 25362406 bytes in 44.5 seconds
  - Received 25362406 bytes in 39.2 seconds
  - Received 25362406 bytes in 57.0 seconds
  - Received 25362406 bytes in 49.6 seconds

- v3.3: CONFIG_PM=n + branches above 
 + fix from Grazvydas for !CONFIG_PM case: [PATCH] ARM: OMAP: sram: fix BUG in dpll code for !PM case
 + disable CONFIG_OMAP_WATCHDOG which fails to boot when CONFIG_PM=n
  - Received 25362406 bytes in 34.1 seconds
  - Received 25362406 bytes in 33.9 seconds
  - Received 25362406 bytes in 34.9 seconds
  - Received 25362406 bytes in 37.8 seconds
  - Received 25362406 bytes in 40.0 seconds
  - Received 25362406 bytes in 37.6 seconds
  - Received 25362406 bytes in 34.4 seconds
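
To make the three sets of runs easier to compare, here is a small sketch
that summarizes them.  The timings and the byte count are copied from the
lists above; the labels and the choice of median as the summary figure are
only illustrative.

```python
from statistics import mean, median

# Seconds per transfer of the same 25362406-byte file, as listed above.
runs = {
    "v3.0, PM enabled": [35.5, 44.9, 49.0, 36.2, 56.3, 65.2, 37.0],
    "v3.3, PM enabled": [32.1, 29.8, 33.5, 44.5, 39.2, 57.0, 49.6],
    "v3.3, CONFIG_PM=n": [34.1, 33.9, 34.9, 37.8, 40.0, 37.6, 34.4],
}
SIZE_BYTES = 25362406

for name, times in runs.items():
    # Throughput is derived from the median run to damp network noise.
    mb_per_s = SIZE_BYTES / median(times) / 1e6
    print(f"{name}: median {median(times):.1f}s, mean {mean(times):.1f}s, "
          f"range {min(times):.1f}-{max(times):.1f}s, ~{mb_per_s:.2f} MB/s")
```

The ranges overlap heavily, which is consistent with the run-to-run spread
being larger than any difference between the three configurations.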


Kevin

[1] simple steps to make a ramdisk
mkfs.ext2 /dev/ram0
mkdir /tmp/rd
mount /dev/ram0 /tmp/rd
cd /tmp/rd
then TFTP file here


Re: PM related performance degradation on OMAP3

2012-04-12 Thread Gary Thomas

On 2012-04-12 12:08, Kevin Hilman wrote:

Gary Thomas g...@mlbassoc.com writes:


On 2012-04-12 10:57, Kevin Hilman wrote:

+Felipe for EHCI question

Gary Thomas g...@mlbassoc.com writes:

[...]


This worked a treat, thanks.  My network performance is better
now, but still not what it was.  The same TFTP transfer now takes
71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.


And does a CONFIG_PM=n kernel get you back to your v3.0 performance?


Correct.



OK, I just tried your TFTP experiment on a 3530/Overo board with the
same smsc911x NIC that has GPIO interrupts, and I don't see much
difference between a PM-enabled v3.0 and a PM-enabled v3.3.

Are you TFTP'ing the file to an MMC filesystem?  Can you try to a
ramdisk[1]?  If you're using MMC, it could be MMC driver changes since
v3.0 that are actually causing your performance hit.


I'm testing to a ramdisk, so we're on the same page.

Could you send me your config file so I can compare?  Maybe I have something
dumb in my settings that aggravates things.

Also, what's your performance on 3.4-rc2?  The linux-media tree I started
from is a bit post v3.3, so there might be something else causing this.



In my experiment, I TFTP'd a 24MB file to a ramdisk, to make sure no
other drivers were involved, and didn't see any major differences
between v3.0, v3.3, and v3.3 CONFIG_PM disabled.

Below are my results.  As you can see, all the results seem to be pretty
close to the same.  This test was not on a controlled, isolated network,
so the differences are probably explained by other network activity:

- v3.0 vanilla: PM enabled, CPUidle enabled
   - Received 25362406 bytes in 35.5 seconds
   - Received 25362406 bytes in 44.9 seconds
   - Received 25362406 bytes in 49.0 seconds
   - Received 25362406 bytes in 36.2 seconds
   - Received 25362406 bytes in 56.3 seconds
   - Received 25362406 bytes in 65.2 seconds
   - Received 25362406 bytes in 37.0 seconds

- v3.3: PM enabled, CPUidle enabled
  + GPIO fix (my for_3.4/fixes/gpio branch)
  + smsc911x regulator boot fix (Tony's omap/fix-smsc911x-regulator branch)
   - Received 25362406 bytes in 32.1 seconds
   - Received 25362406 bytes in 29.8 seconds
   - Received 25362406 bytes in 33.5 seconds
   - Received 25362406 bytes in 44.5 seconds
   - Received 25362406 bytes in 39.2 seconds
   - Received 25362406 bytes in 57.0 seconds
   - Received 25362406 bytes in 49.6 seconds

- v3.3: CONFIG_PM=n + branches above
  + fix from Grazvydas for !CONFIG_PM case: [PATCH] ARM: OMAP: sram: fix BUG in dpll code for !PM case
  + disable CONFIG_OMAP_WATCHDOG which fails to boot when CONFIG_PM=n
   - Received 25362406 bytes in 34.1 seconds
   - Received 25362406 bytes in 33.9 seconds
   - Received 25362406 bytes in 34.9 seconds
   - Received 25362406 bytes in 37.8 seconds
   - Received 25362406 bytes in 40.0 seconds
   - Received 25362406 bytes in 37.6 seconds
   - Received 25362406 bytes in 34.4 seconds


Kevin

[1] simple steps to make a ramdisk
mkfs.ext2 /dev/ram0
mkdir /tmp/rd
mount /dev/ram0 /tmp/rd
cd /tmp/rd
then TFTP file here


--

Gary Thomas |  Consulting for the
MLB Associates  |Embedded world



Re: PM related performance degradation on OMAP3

2012-04-12 Thread Kevin Hilman
Gary Thomas g...@mlbassoc.com writes:

 On 2012-04-12 12:08, Kevin Hilman wrote:
 Gary Thomas g...@mlbassoc.com writes:

 On 2012-04-12 10:57, Kevin Hilman wrote:
 +Felipe for EHCI question

 Gary Thomas g...@mlbassoc.com writes:

 [...]

 This worked a treat, thanks.  My network performance is better
 now, but still not what it was.  The same TFTP transfer now takes
 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
 second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no 
 difference.

 And does a CONFIG_PM=n kernel get you back to your v3.0 performance?

 Correct.


 OK, I just tried your TFTP experiment on a 3530/Overo board with the
 same smsc911x NIC that has GPIO interrupts, and I don't see much
 difference between a PM-enabled v3.0 and a PM-enabled v3.3.

 Are you TFTP'ing the file to an MMC filesystem?  Can you try to a
 ramdisk[1]?  If you're using MMC, it could be MMC driver changes since
 v3.0 that are actually causing your performance hit.

 I'm testing to a ramdisk, so we're on the same page.

 Could you send me your config file so I can compare?  Maybe I have something
 dumb in my settings that aggravates things.

Below is the Kconfig snippet[1] I append to a default
omap2plus_defconfig to enable CPUidle, CPUfreq and some debug.  Rebuild
with that appended and these settings override the default ones.  I used
omap2plus_defconfig plus this snippet for v3.0, v3.3 and v3.4-rc2 tests.
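
The "append and override" behaviour relies on the last assignment of a
symbol winning when the config is processed.  A rough sketch of that rule,
assuming plain "CONFIG_X=value" and "# CONFIG_X is not set" lines; the
merge_config() helper and the sample base fragment are hypothetical and
are not taken from omap2plus_defconfig:

```python
def merge_config(*fragments):
    """Merge Kconfig-style fragments; later fragments override earlier ones."""
    merged = {}
    for fragment in fragments:
        for line in fragment.splitlines():
            line = line.strip()
            if line.startswith("# CONFIG_") and line.endswith(" is not set"):
                # "# CONFIG_FOO is not set" records the symbol as disabled.
                merged[line[2:-len(" is not set")]] = "n"
            elif line.startswith("CONFIG_") and "=" in line:
                name, value = line.split("=", 1)
                merged[name] = value
    return merged


# Hypothetical base settings followed by two lines from the snippet below.
base = "# CONFIG_CPU_IDLE is not set\nCONFIG_DEBUG_LL=y\n"
snippet = "CONFIG_CPU_IDLE=y\nCONFIG_CPU_FREQ=y\n"
result = merge_config(base, snippet)
print(result)
```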

 Also, what's your performance on 3.4-rc2?  The linux-media tree I started
 from is a bit post v3.3, so there might be something else causing this.

I just tried with vanilla v3.4-rc2, and I see basically the same
results.  Between 35 and 50 seconds for the 24MB file transfer, which is
similar to the v3.0 and v3.3 results.

Kevin

[1] 
CONFIG_CPU_IDLE=y
CONFIG_PM_ADVANCED_DEBUG=y
CONFIG_PM_SLEEP_ADVANCED_DEBUG=y
CONFIG_PM_GENERIC_DOMAINS=y
CONFIG_OMAP_SMARTREFLEX=y
CONFIG_OMAP_SMARTREFLEX_CLASS3=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_ARM_OMAP2PLUS_CPUFREQ=y
CONFIG_REGULATOR_OMAP_SMPS=y

CONFIG_DEBUG_LL=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_USER=y
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_SECTION_MISMATCH=y



RE: PM related performance degradation on OMAP3

2012-04-12 Thread Woodruff, Richard
 From: linux-omap-ow...@vger.kernel.org [mailto:linux-omap-
 ow...@vger.kernel.org] On Behalf Of Grazvydas Ignotas
 Sent: Tuesday, April 10, 2012 7:30 PM

 What I think is going on here is that omap_sram_idle() is taking too
 much time because its overhead is too large. I've added a counter
 there and it seems to be called ~530 times per megabyte (DMA operates
 in ~2K chunks so it makes sense), that's over 2000 calls per second.
 Some quick measurement code shows ~243us spent for setting up in
 omap_sram_idle() (before and after omap34xx_do_sram_idle()).
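
A back-of-the-envelope check of these figures, assuming they apply to the
32MB NAND read test reported at the start of the thread (9.088714 s with
CONFIG_PM, 5.653534 s without).  The constant and function names are
illustrative only:

```python
CALLS_PER_MB = 530            # omap_sram_idle() calls per megabyte (quoted above)
OVERHEAD_PER_CALL_S = 243e-6  # time spent around omap34xx_do_sram_idle()


def idle_overhead_seconds(megabytes):
    """Total setup/teardown time spent in omap_sram_idle() for a transfer."""
    return megabytes * CALLS_PER_MB * OVERHEAD_PER_CALL_S


def calls_per_second(mb_per_s):
    """Idle entries per second at a given transfer rate."""
    return mb_per_s * CALLS_PER_MB


estimated = idle_overhead_seconds(32)
observed = 9.088714 - 5.653534  # NAND dd time with CONFIG_PM minus without

print(f"estimated idle overhead: {estimated:.2f} s")
print(f"observed slowdown:       {observed:.2f} s")
print(f"calls/s at 3.5 MB/s: {calls_per_second(3.5):.0f}, "
      f"at 5.7 MB/s: {calls_per_second(5.7):.0f}")
```

The estimate and the observed slowdown land within the same few seconds,
which supports the idea that the per-call overhead accounts for most of
the loss, though the two measurements were not taken in the same run.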

243us is really a long time for C1. For some reason it has grown a lot
since the last time I captured the path in ETM.

Your analysis correlates well with reports from a couple of years back.
The N900 folks did report that the non-clock-gated C1 was needed (as
exists in the code today). IIRC the NAND stack had small-us spins on NAND
status or something similar, and a higher clock-stop penalty there
resulted in a big performance dip. You needed something like 10us for C1
or you took a big hit.

Regards,
Richard W.


Re: PM related performance degradation on OMAP3

2012-04-12 Thread Gary Thomas

On 2012-04-12 16:03, Kevin Hilman wrote:

Gary Thomas g...@mlbassoc.com writes:


On 2012-04-12 12:08, Kevin Hilman wrote:

Gary Thomas g...@mlbassoc.com writes:


On 2012-04-12 10:57, Kevin Hilman wrote:

+Felipe for EHCI question

Gary Thomas g...@mlbassoc.com writes:

[...]


This worked a treat, thanks.  My network performance is better
now, but still not what it was.  The same TFTP transfer now takes
71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.


And does a CONFIG_PM=n kernel get you back to your v3.0 performance?


Correct.



OK, I just tried your TFTP experiment on a 3530/Overo board with the
same smsc911x NIC that has GPIO interrupts, and I don't see much
difference between a PM-enabled v3.0 and a PM-enabled v3.3.

Are you TFTP'ing the file to an MMC filesystem?  Can you try to a
ramdisk[1]?  If you're using MMC, it could be MMC driver changes since
v3.0 that are actually causing your performance hit.


I'm testing to a ramdisk, so we're on the same page.

Could you send me your config file so I can compare?  Maybe I have something
dumb in my settings that aggravates things.


Below is the Kconfig snippet[1] I append to a default
omap2plus_defconfig to enable CPUidle, CPUfreq and some debug.  Rebuild
with that appended and these settings override the default ones.  I used
omap2plus_defconfig plus this snippet for v3.0, v3.3 and v3.4-rc2 tests.


Also, what's your performance on 3.4-rc2?  The linux-media tree I started
from is a bit post v3.3, so there might be something else causing this.


I just tried with vanilla v3.4-rc2, and I see basically the same
results.  Between 35 and 50 seconds for the 24MB file transfer, which is
similar to the v3.0 and v3.3 results.

Kevin

[1]
CONFIG_CPU_IDLE=y
CONFIG_PM_ADVANCED_DEBUG=y
CONFIG_PM_SLEEP_ADVANCED_DEBUG=y
CONFIG_PM_GENERIC_DOMAINS=y
CONFIG_OMAP_SMARTREFLEX=y
CONFIG_OMAP_SMARTREFLEX_CLASS3=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_ARM_OMAP2PLUS_CPUFREQ=y
CONFIG_REGULATOR_OMAP_SMPS=y

CONFIG_DEBUG_LL=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_USER=y
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_SECTION_MISMATCH=y


These settings made no difference.

I just reverified my results to xfer a 39MB file to ramdisk:
  3.0 + PM = 39sec
  3.3 + PM = 70sec
  3.3 - PM = 48sec
so it's not quite the same as 3.0 was, but closer.  BTW, your
results normalized to mine would be
  3.3 + PM = 56sec

I wish I knew why I'm seeing a big difference between +PM/-PM
and you don't.  Is there some way to compare your source tree
(the one you built for v3.3) and mine?  I'm not very good with
GIT so I'm not quite sure how to do it.

Sorry for being so much trouble, I'm just in search of all the
performance I can get out of my system :-)

Thanks


--

Gary Thomas |  Consulting for the
MLB Associates  |Embedded world



Re: PM related performance degradation on OMAP3

2012-04-11 Thread Gary Thomas

On 2012-04-06 16:50, Grazvydas Ignotas wrote:

Hello,

I'm seeing DMA performance loss related to CONFIG_PM on OMAP3.

# CONFIG_PM is set:
echo 3 > /proc/sys/vm/drop_caches
# file copy from NAND (using NAND driver in DMA mode)
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 9.088714 seconds, 3.5MB/s
# file read from SD (hsmmc uses DMA)
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 2.065460 seconds, 15.5MB/s

# CONFIG_PM not set:
# NAND
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 5.653534 seconds, 5.7MB/s
# SD
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 1.919007 seconds, 16.7MB/s

While SD card performance loss is not that bad (~7%), NAND one is
worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
cpuidle states over sysfs, it did not have any significant effect. Is
there something else to try?
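
The loss figures can be recomputed from the elapsed times that dd printed.
A minimal sketch (the function name is illustrative); working from the
times rather than the rounded MB/s values gives roughly 38% for NAND and
7% for SD, in line with the approximate figures above:

```python
def throughput_loss(with_pm_s, without_pm_s):
    """Fractional throughput loss for a fixed-size transfer.

    Throughput is size/time, so the size cancels out and the loss is
    1 - (time without PM) / (time with PM).
    """
    return 1.0 - without_pm_s / with_pm_s


nand = throughput_loss(9.088714, 5.653534)  # NAND read, 32MB
sd = throughput_loss(2.065460, 1.919007)    # SD read, 32MB

print(f"NAND: {nand:.1%} loss, SD: {sd:.1%} loss")
```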

I'm guessing this is caused by CPU wakeup latency to service DMA
interrupts? I've noticed that if I keep CPU busy, the loss is reduced
almost completely.
Talking about cpuidle, what's the difference between C1 and C2 states?
They look mostly the same.
Then there is omap3_do_wfi, it seems to be unconditionally putting
SDRC on self-refresh, would it make sense to just do wfi in higher
power states, like OMAP4 seems to be doing?



I fear I'm seeing similar problems with 3.3.  I have my board (similar
to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
performance on 3.3.  For example, if I use TFTP to download a large file
(~35MB), I get this:
  3.0:  42.5 sec
  3.3: 625.0 sec
That's a factor of 15 worse!

I'd like to try building without CONFIG_PM, but when I disabled this, my
kernel fails to come up.  Can someone point me to the magic to build without
CONFIG_PM, or possibly send me a working config file?

Thanks

--

Gary Thomas |  Consulting for the
MLB Associates  |Embedded world



Re: PM related performance degradation on OMAP3

2012-04-11 Thread Grazvydas Ignotas
On Wed, Apr 11, 2012 at 5:59 PM, Gary Thomas g...@mlbassoc.com wrote:
 I'd like to try building without CONFIG_PM, but when I disabled this, my
 kernel fails to come up.  Can someone point me to the magic to build without
 CONFIG_PM, or possibly send me a working config file?

You probably need this patch:
http://marc.info/?l=linux-omap&m=133374930011086&w=2
If it still won't boot, you'll need to enable earlyprintk both in
.config and as kernel argument to see where it dies.


-- 
Gražvydas


Re: PM related performance degradation on OMAP3

2012-04-11 Thread Gary Thomas

On 2012-04-11 11:23, Grazvydas Ignotas wrote:

On Wed, Apr 11, 2012 at 5:59 PM, Gary Thomas g...@mlbassoc.com wrote:

I'd like to try building without CONFIG_PM, but when I disabled this, my
kernel fails to come up.  Can someone point me to the magic to build without
CONFIG_PM, or possibly send me a working config file?


You probably need this patch:
http://marc.info/?l=linux-omap&m=133374930011086&w=2
If it still won't boot, you'll need to enable earlyprintk both in
.config and as kernel argument to see where it dies.


That does help, but there are lots of tracebacks like these:
[    0.588500] ------------[ cut here ]------------
[    0.588531] WARNING: at drivers/video/omap2/dss/dispc.c:404 dss_driver_probe+0x44/0xd8()
[    0.588562] Modules linked in:
[    0.588592] [<c0012204>] (unwind_backtrace+0x0/0xf8) from [<c002b81c>] (warn_slowpath_common+0x4c/0x64)
[    0.588623] [<c002b81c>] (warn_slowpath_common+0x4c/0x64) from [<c002b850>] (warn_slowpath_null+0x1c/0x24)
[    0.588623] [<c002b850>] (warn_slowpath_null+0x1c/0x24) from [<c022609c>] (dss_driver_probe+0x44/0xd8)
[    0.588653] [<c022609c>] (dss_driver_probe+0x44/0xd8) from [<c0273e10>] (driver_probe_device+0x70/0x1e4)
[    0.588684] [<c0273e10>] (driver_probe_device+0x70/0x1e4) from [<c0274018>] (__driver_attach+0x94/0x98)
[    0.588714] [<c0274018>] (__driver_attach+0x94/0x98) from [<c027270c>] (bus_for_each_dev+0x50/0x7c)
[    0.588745] [<c027270c>] (bus_for_each_dev+0x50/0x7c) from [<c0273664>] (bus_add_driver+0x184/0x244)
[    0.588775] [<c0273664>] (bus_add_driver+0x184/0x244) from [<c02742bc>] (driver_register+0x78/0x12c)
[    0.588775] [<c02742bc>] (driver_register+0x78/0x12c) from [<c00085a0>] (do_one_initcall+0x34/0x178)
[    0.588806] [<c00085a0>] (do_one_initcall+0x34/0x178) from [<c061d7dc>] (kernel_init+0x78/0x114)
[    0.588836] [<c061d7dc>] (kernel_init+0x78/0x114) from [<c000e0d0>] (kernel_thread_exit+0x0/0x8)
[    0.588867] ---[ end trace 1b75b31a2719ed24 ]---

I also had to disable the watchdog to get it up.

That said, with CONFIG_PM disabled, my network performance is
back to what it was in 3.0 :-)  Note: I also had CONFIG_PM disabled
in that kernel build, so I don't know for sure what the performance
might be with that version if it were enabled.

--

Gary Thomas |  Consulting for the
MLB Associates  |Embedded world



Re: PM related performance degradation on OMAP3

2012-04-11 Thread Kevin Hilman
Gary Thomas g...@mlbassoc.com writes:

[...]

 I fear I'm seeing similar problems with 3.3.  I have my board (similar
 to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
 performance on 3.3.  For example, if I use TFTP to download a large file
 (~35MB), I get this:
   3.0:  42.5 sec
   3.3: 625.0 sec
 That's a factor of 15 worse!

This might not be the same problem.  What is the NIC being used, and
does it have GPIO interrupts?

If it's using GPIO interrupts, then you likely need this patch from
mainline (v3.4-rc1)

If that doesn't work, or you're not using GPIO interrupts, could you
confirm if the patch below[2] (based on idea from Grazvydas) increases
performance for you when CONFIG_PM=y.

Kevin

[1]
Author: Kevin Hilman khil...@ti.com  2012-03-05 15:10:04
Committer: Grant Likely grant.lik...@secretlab.ca  2012-03-12 09:16:11
Parent: 25db711df3258d125dc1209800317e5c0ef3c870 (gpio/omap: Fix IRQ handling for SPARSE_IRQ)
Child:  8805f410e4fb88a56552c1af42d61b38837a38fd (gpio/omap: Fix section warning for omap_mpuio_alloc_gc())
Branches: many (66)
Follows: v3.3-rc7
Precedes: v3.4-rc1

gpio/omap: fix wakeups on level-triggered GPIOs

While both level- and edge-triggered GPIOs are capable of generating
interrupts, only edge-triggered GPIOs are capable of generating a
module-level wakeup to the PRCM (c.f. 34xx NDA TRM section 25.5.3.2.)

In order to ensure that devices using level-triggered GPIOs as
interrupts can also cause wakeups (e.g. from idle), this patch enables
edge-triggering for wakeup-enabled, level-triggered GPIOs when a GPIO
bank is runtime-suspended (which also happens during idle.)

This fixes a problem found in GPMC-connected network cards with GPIO
interrupts (e.g. smsc911x on Zoom3, Overo, ...) where network booting
with NFSroot was very slow since the GPIO IRQs used by the NIC were
not generating PRCM wakeups, and thus not waking the system from idle.
NOTE: until v3.3, this boot-time problem was somewhat masked because
the UART init prevented WFI during boot until the full serial driver
was available.  Preventing WFI allowed regular GPIO interrupts to fire
and this problem was not seen.  After the UART runtime PM cleanups, we
no longer avoid WFI during boot, so GPIO IRQs that were not causing
wakeups resulted in very slow IRQ response times.

Tested on platforms using level-triggered GPIOs for network IRQs using
the SMSC911x NIC: 3530/Overo and 3630/Zoom3.

Reported-by: Tony Lindgren t...@atomide.com
Tested-by: Tarun Kanti DebBarma tarun.ka...@ti.com
Tested-by: Tony Lindgren t...@atomide.com
Signed-off-by: Kevin Hilman khil...@ti.com
Signed-off-by: Grant Likely grant.lik...@secretlab.ca

[2]
diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index 413aac4..ace4bf6 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -120,7 +120,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 	cpu_pm_enter();
 
 	/* Execute ARM wfi */
-	omap_sram_idle();
+	if (index == 0)
+		cpu_do_idle();
+	else
+		omap_sram_idle();
 
 	/*
 	 * Call idle CPU PM enter notifier chain to restore
--
To unsubscribe from this list: send the line unsubscribe linux-omap in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PM related performance degradation on OMAP3

2012-04-11 Thread Kevin Hilman
Grazvydas Ignotas nota...@gmail.com writes:

 On Mon, Apr 9, 2012 at 10:03 PM, Kevin Hilman khil...@ti.com wrote:
 Grazvydas Ignotas nota...@gmail.com writes:
 While SD card performance loss is not that bad (~7%), NAND one is
 worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
 cpuidle states over sysfs, it did not have any significant effect. Is
 there something else to try?

 Looks like we might need a PM QoS constraint when there is DMA activity
 in progress.

 You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when
 DMA transfers are active and I suspect that will help.

 I've tried it and it didn't help much. It looks like the only thing it
 does is limiting cpuidle c-states, I tried to set qos dma latency to 0
 and it made it stay in C1 while transfer was ongoing (I watched
 /sys/devices/system/cpu/cpu0/cpuidle/state*/usage), but performance
 was still poor.

Great, thanks for doing this experiment.

Assuming we get to a C1 that's low-latency enough, we will still need
this constraint to ensure C1 during transfers.  But first we have to
figure out what's going on with C1...

 What I think is going on here is that omap_sram_idle() is taking too
 much time because its overhead is too large. I've added a counter
 there and it seems to be called ~530 times per megabyte (DMA operates
 in ~2K chunks so it makes sense), that's over 2000 calls per second.
 Some quick measurement code shows ~243us spent for setting up in
 omap_sram_idle() (before and after omap34xx_do_sram_idle()).

 Could we perhaps have a lighter idle function for C1 that doesn't try
 to switch all powerdomain states and maybe not enable RAM
 self-refresh? 

Yes, but first let's try to uncover exactly what makes the current C1 so
heavy.  

 As a quick test I've tried this in omap3_enter_idle():

 /* Execute ARM wfi */
 if (index == 0) {
         clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
         cpu_do_idle();
 } else
         omap_sram_idle();

 ..and it brought performance close to !CONFIG_PM case (cpu_do_idle()
 is used as pm_idle on !CONFIG_PM). 

OK, I see now.   I think you're right about the overhead.

It would be helpful now to narrow down what are the big contributors to
the overhead in omap_sram_idle().  Most of the code there is skipped for
C1 because the next states for MPU and CORE are both ON.

There are 2 primary differences that I see as possible causes.  I list
them here with a couple more experiments for you to try to help us
narrow this down.

1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()

Could you try using omap_sram_idle() and just commenting out those
calls?  Does that help performance?  Those iterate over all the
powerdomains, so definitely add some overhead, but I don't think it
would be as significant as what you're seeing.  Much more likely is...

2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds

This is more likely the culprit of most of the overhead.  Specifically,
when returning from idle there are some errata to work around that
require waiting for DPLL3 to lock.  I suspect this is the most likely
source of the problem.

Can you try the hack below[1], which basically does the cpu_do_idle() hack
that you've already done, but inside omap_sram_idle() and only
eliminates the jump to SRAM, SDRC self-refresh and SDRC errata
workarounds?

I assume that will get performance back to what you expect.  Then it
remains to be seen if it's the SDRC self-refresh that's causing the
delay, or the errata workarounds.

To add the self-refresh back, but eliminate the SDRC errata workaround,
you could try something like I hacked up in the (untested) branch here[2].
If performance is still good, that will tell us that it's the errata
workaround waiting that's causing the extra overhead.

I need to clarify for myself if SDRC self-refresh is even entered in C1.
When the CORE powerdomain is left on, I don't think the PRCM would
send IDLEREQ to the SDRC, so it should not enter self-refresh, but I
need to verify that.

 I don't know what side effects something like this might have though.

There are some other errata workarounds that you miss by not calling
omap_sram_idle().  Specifically, the call to omap3_intc_prepare_idle()
is important.

Kevin




[1]
diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
index 3e6b564..0fb3942 100644
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -313,7 +313,7 @@ void omap_sram_idle(void)
 	if (save_state == 1 || save_state == 3)
 		cpu_suspend(save_state, omap34xx_do_sram_idle);
 	else
-		omap34xx_do_sram_idle(save_state);
+		cpu_do_idle();
 
 	/* Restore normal SDRC POWER settings */
 	if (cpu_is_omap3430() && omap_rev() >= OMAP3430_REV_ES3_0 &&


[2] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap-pm.git tmp/sdrc-hacks

Re: PM related performance degradation on OMAP3

2012-04-10 Thread Grazvydas Ignotas
On Mon, Apr 9, 2012 at 10:03 PM, Kevin Hilman khil...@ti.com wrote:
 Grazvydas Ignotas nota...@gmail.com writes:
 While SD card performance loss is not that bad (~7%), NAND one is
 worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
 cpuidle states over sysfs, it did not have any significant effect. Is
 there something else to try?

 Looks like we might need a PM QoS constraint when there is DMA activity
 in progress.

 You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when
 DMA transfers are active and I suspect that will help.

I've tried it and it didn't help much. It looks like the only thing it
does is limit cpuidle C-states: I set the QoS DMA latency to 0
and it stayed in C1 while the transfer was ongoing (I watched
/sys/devices/system/cpu/cpu0/cpuidle/state*/usage), but performance
was still poor.
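
For anyone repeating this check, here is a rough sketch of how the per-state
usage counters can be sampled before and after a transfer. It assumes the
usual cpuidle sysfs layout (one "usage" file per state* directory); the
helper names read_usage() and usage_delta() are made up for illustration.

```python
import glob
import os

# Assumed location of the cpuidle counters for CPU0 (path quoted above).
CPUIDLE_DIR = "/sys/devices/system/cpu/cpu0/cpuidle"


def read_usage(base=CPUIDLE_DIR):
    """Return {state directory name: usage count} for every state* entry."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(base, "state*", "usage"))):
        state = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            counts[state] = int(f.read().strip())
    return counts


def usage_delta(before, after):
    """Number of times each state was entered between two snapshots."""
    return {state: after[state] - before.get(state, 0) for state in after}
```

Taking one snapshot before the dd run and one after, then calling
usage_delta(), shows which C-states were entered during the transfer.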

What I think is going on here is that omap_sram_idle() is taking too
much time because its overhead is too large. I've added a counter
there and it seems to be called ~530 times per megabyte (DMA operates
in ~2K chunks so it makes sense), that's over 2000 calls per second.
Some quick measurement code shows ~243us spent for setting up in
omap_sram_idle() (before and after omap34xx_do_sram_idle()).
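
As a rough cross-check of these figures, the sketch below multiplies the
call count by the per-call setup time and compares the result with the gap
between the CONFIG_PM and !CONFIG_PM NAND timings reported earlier in this
thread. It uses only numbers quoted in the thread; the function names are
illustrative, and it assumes every call pays the full ~243us.

```python
# Figures quoted in this thread.
CALLS_PER_MB = 530            # omap_sram_idle() calls per megabyte
OVERHEAD_PER_CALL_S = 243e-6  # ~243us of setup per call
TRANSFER_MB = 32              # dd bs=1M count=32


def idle_overhead_seconds(megabytes,
                          calls_per_mb=CALLS_PER_MB,
                          per_call_s=OVERHEAD_PER_CALL_S):
    """Total time spent in omap_sram_idle() setup for a transfer."""
    return megabytes * calls_per_mb * per_call_s


def observed_gap_seconds(with_pm_s, without_pm_s):
    """Extra wall-clock time of the CONFIG_PM run."""
    return with_pm_s - without_pm_s


if __name__ == "__main__":
    est = idle_overhead_seconds(TRANSFER_MB)
    gap = observed_gap_seconds(9.088714, 5.653534)  # NAND dd timings
    print(f"estimated idle overhead: {est:.2f} s")
    print(f"observed NAND gap:       {gap:.2f} s")
```

The estimate (about 4.1 s for 32MB) is the same order as the observed gap
(about 3.4 s), so the setup overhead alone looks large enough to explain
the slowdown.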

Could we perhaps have a lighter idle function for C1 that doesn't try
to switch all powerdomain states and maybe not enable RAM
self-refresh? As a quick test I've tried this in omap3_enter_idle():

/* Execute ARM wfi */
if (index == 0) {
        clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
        cpu_do_idle();
} else
        omap_sram_idle();

..and it brought performance close to !CONFIG_PM case (cpu_do_idle()
is used as pm_idle on !CONFIG_PM). I don't know what side effects
something like this might have though.

 Then there is omap3_do_wfi, it seems to be unconditionally putting
 SDRC on self-refresh, would it make sense to just do wfi in higher
 power states, like OMAP4 seems to be doing?

 Not sure what you're referring to in OMAP4.  There we do WFI in every
 idle state.

What I meant is that OMAP3 idle code always tries to enable RAM
self-refresh (regardless of c-state) before doing wfi while OMAP4 can
do wfi without suspending RAM (although I might be misunderstanding
all that asm code).

-- 
Gražvydas


Re: PM related performance degradation on OMAP3

2012-04-09 Thread Kevin Hilman
Grazvydas Ignotas nota...@gmail.com writes:

 Hello,

 I'm seeing DMA performance loss related to CONFIG_PM on OMAP3.

 # CONFIG_PM is set:
 echo 3 > /proc/sys/vm/drop_caches
 # file copy from NAND (using NAND driver in DMA mode)
 dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
 33554432 bytes (32.0MB) copied, 9.088714 seconds, 3.5MB/s
 # file read from SD (hsmmc uses DMA)
 dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
 33554432 bytes (32.0MB) copied, 2.065460 seconds, 15.5MB/s

 # CONFIG_PM not set:
 # NAND
 dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
 33554432 bytes (32.0MB) copied, 5.653534 seconds, 5.7MB/s
 # SD
 dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
 33554432 bytes (32.0MB) copied, 1.919007 seconds, 16.7MB/s

 While SD card performance loss is not that bad (~7%), NAND one is
 worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
 cpuidle states over sysfs, it did not have any significant effect. Is
 there something else to try?

Looks like we might need a PM QoS constraint when there is DMA activity
in progress.  

You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when
DMA transfers are active and I suspect that will help.
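
One way to try such a constraint without patching a driver is the
user-space PM QoS interface, /dev/cpu_dma_latency: writing a 32-bit value
(in microseconds) registers a request that stays active for as long as the
file is held open. Below is a minimal sketch, assuming that device node
exists on the running kernel and the script runs as root. The helper names
are hypothetical, and this is the user-space route, not the in-kernel
pm_qos_add_request() call itself.

```python
import contextlib
import os
import struct

# User-space PM QoS device node (assumed to be present on the target).
CPU_DMA_LATENCY_DEV = "/dev/cpu_dma_latency"


def encode_latency(usec):
    """Pack the latency bound as the native 32-bit signed int the node expects."""
    return struct.pack("=i", usec)


@contextlib.contextmanager
def hold_cpu_dma_latency(usec, path=CPU_DMA_LATENCY_DEV):
    """Hold a CPU DMA latency request for the duration of the with-block."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, encode_latency(usec))
        yield
    finally:
        # Closing the descriptor drops the request again.
        os.close(fd)
```

Running the dd test inside "with hold_cpu_dma_latency(0):" would keep the
constraint in place only while the transfer is in progress.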

 I'm guessing this is caused by CPU wakeup latency to service DMA
 interrupts? I've noticed that if I keep CPU busy, the loss is reduced
 almost completely.

Yeah, that suggests a QoS constraint is what's needed here.

 Talking about cpuidle, what's the difference between C1 and C2 states?
 They look mostly the same.

Except that clockdomains are not allowed to idle in C1, which results in
much shorter wakeup latency.

 Then there is omap3_do_wfi, it seems to be unconditionally putting
 SDRC on self-refresh, would it make sense to just do wfi in higher
 power states, like OMAP4 seems to be doing?

Not sure what you're referring to in OMAP4.  There we do WFI in every
idle state.

Kevin


PM related performance degradation on OMAP3

2012-04-06 Thread Grazvydas Ignotas
Hello,

I'm seeing DMA performance loss related to CONFIG_PM on OMAP3.

# CONFIG_PM is set:
echo 3 > /proc/sys/vm/drop_caches
# file copy from NAND (using NAND driver in DMA mode)
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 9.088714 seconds, 3.5MB/s
# file read from SD (hsmmc uses DMA)
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 2.065460 seconds, 15.5MB/s

# CONFIG_PM not set:
# NAND
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 5.653534 seconds, 5.7MB/s
# SD
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 1.919007 seconds, 16.7MB/s

While SD card performance loss is not that bad (~7%), NAND one is
worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
cpuidle states over sysfs, it did not have any significant effect. Is
there something else to try?
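
The percentages can be re-derived from the dd timings above. A small sketch
follows (the helper name is illustrative). With the unrounded times the NAND
loss comes out nearer 38%; the ~39% figure follows from the rounded 3.5 and
5.7 MB/s values.

```python
def throughput_loss(time_with_pm_s, time_without_pm_s):
    """Fractional throughput loss for the same amount of data.

    Throughput is inversely proportional to elapsed time, so the loss is
    1 - (time without PM / time with PM).
    """
    return 1.0 - time_without_pm_s / time_with_pm_s


if __name__ == "__main__":
    nand = throughput_loss(9.088714, 5.653534)
    sd = throughput_loss(2.065460, 1.919007)
    print(f"NAND loss: {nand:.1%}")
    print(f"SD loss:   {sd:.1%}")
```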

I'm guessing this is caused by CPU wakeup latency to service DMA
interrupts? I've noticed that if I keep CPU busy, the loss is reduced
almost completely.

Talking about cpuidle, what's the difference between C1 and C2 states?
They look mostly the same.

Then there is omap3_do_wfi, it seems to be unconditionally putting
SDRC on self-refresh, would it make sense to just do wfi in higher
power states, like OMAP4 seems to be doing?

-- 
Gražvydas