Re: PM related performance degradation on OMAP3
Hi Kevin,

On Mon, May 7, 2012 at 7:31 PM, Kevin Hilman khil...@ti.com wrote:

Jean Pihet jean.pi...@newoldbits.com writes:

On Tue, May 1, 2012 at 7:27 PM, Kevin Hilman khil...@ti.com wrote:

Jean Pihet jean.pi...@newoldbits.com writes:

Hi Kevin, Grazvydas,

On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:

Jean Pihet jean.pi...@newoldbits.com writes:

Hi Grazvydas, Kevin,

I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. I posted the patches for the power domain registers cache, cf. http://marc.info/?l=linux-omap&m=133587781712039&w=2. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis

I updated the page with the measurement results with Kevin's patches and the registers cache patches. The results show that the registers cache optimizes the low power mode transitions but is not sufficient to obtain a big gain: a few unused domains are transitioning, which causes a big penalty in the idle path.

PER is the one that seems to be causing the most latency. Can you try to do your measurements using the hack below, which makes sure that PER isn't any deeper than CORE?

Indeed your patch brings significant improvements, cf. the wiki page at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis for detailed information. Below is the reworked patch, more suited for inclusion in mainline [1].

I have another optimisation, still at the proof-of-concept stage, that brings another significant improvement: allowing/disabling idle for only one clkdm in a pwrdm instead of iterating through all the clkdms. This still needs some rework though. Cf. patch [2].

That should work, since disabling idle for any clkdm will have the same effect. Can you send this as a separate patch with a descriptive changelog?

I just sent 2 patches which optimize the C1 state latency:

[PATCH 1/2] ARM: OMAP3: PM: cpuidle: optimize the PER latency in C1 state
[PATCH 2/2] ARM: OMAP3: PM: cpuidle: optimize the clkdm idle latency in C1 state

Note: those patches apply on top of your pre/post_transition optimization patches. The performance results are close to the !PM case (no idle, no omap_sram_idle, all pwrdms to ON), i.e. 3.1MB/s on Beagleboard. The wiki page update comes asap.

Regards,
Jean

Kevin

Patches [1] and [2] on top of the registers cache and the optimisations in pre/post_transition bring the performance close to that of the non-cpuidle case (3.0MB/s compared to 3.1MB/s on Beagleboard). What do you think?

Regards,
Jean

---
[1]

diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index e406d7b..572b605 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -279,32 +279,36 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
 	int ret;

 	/*
-	 * Prevent idle completely if CAM is active.
+	 * Use only C1 if CAM is active.
 	 * CAM does not have wakeup capability in OMAP3.
 	 */
-	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON) {
+	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON)
 		new_state_idx = drv->safe_state_index;
-		goto select_state;
-	}
-
-	new_state_idx = next_valid_state(dev, drv, index);
+	else
+		new_state_idx = next_valid_state(dev, drv, index);

-	/*
-	 * Prevent PER off if CORE is not in retention or off as this
-	 * would disable PER wakeups completely.
-	 */
+	/* Program PER state */
 	cx = cpuidle_get_statedata(&dev->states_usage[new_state_idx]);
 	core_next_state = cx->core_state;
-	per_next_state = per_saved_state = pwrdm_read_next_func_pwrst(per_pd);
-	if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
-	    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
-		per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	if (new_state_idx == 0) {
+		/* In C1 do not allow PER state lower than CORE state */
+		per_next_state = core_next_state;
+	} else {
+		/*
+		 * Prevent PER off if CORE is not in RETention or OFF as this
+		 * would disable PER wakeups completely.
+		 */
+		per_next_state = per_saved_state =
+			pwrdm_read_next_func_pwrst(per_pd);
+		if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
+		    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
+			per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	}

 	/* Are we changing PER target state? */
 	if (per_next_state != per_saved_state)
 		omap_set_pwrdm_state(per_pd, per_next_state);

-select_state:
 	ret = omap3_enter_idle(dev, drv, new_state_idx);

 	/* Restore original PER state if it was modified */
@@ -390,7 +394,6 @@ int
Re: PM related performance degradation on OMAP3
Jean Pihet jean.pi...@newoldbits.com writes:

On Tue, May 1, 2012 at 7:27 PM, Kevin Hilman khil...@ti.com wrote:

Jean Pihet jean.pi...@newoldbits.com writes:

Hi Kevin, Grazvydas,

On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:

Jean Pihet jean.pi...@newoldbits.com writes:

Hi Grazvydas, Kevin,

I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. I posted the patches for the power domain registers cache, cf. http://marc.info/?l=linux-omap&m=133587781712039&w=2. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis

I updated the page with the measurement results with Kevin's patches and the registers cache patches. The results show that the registers cache optimizes the low power mode transitions but is not sufficient to obtain a big gain: a few unused domains are transitioning, which causes a big penalty in the idle path.

PER is the one that seems to be causing the most latency. Can you try to do your measurements using the hack below, which makes sure that PER isn't any deeper than CORE?

Indeed your patch brings significant improvements, cf. the wiki page at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis for detailed information. Below is the reworked patch, more suited for inclusion in mainline [1].

I have another optimisation, still at the proof-of-concept stage, that brings another significant improvement: allowing/disabling idle for only one clkdm in a pwrdm instead of iterating through all the clkdms. This still needs some rework though. Cf. patch [2].

That should work, since disabling idle for any clkdm will have the same effect. Can you send this as a separate patch with a descriptive changelog?

Kevin

Patches [1] and [2] on top of the registers cache and the optimisations in pre/post_transition bring the performance close to that of the non-cpuidle case (3.0MB/s compared to 3.1MB/s on Beagleboard). What do you think?

Regards,
Jean

---
[1]

diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index e406d7b..572b605 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -279,32 +279,36 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
 	int ret;

 	/*
-	 * Prevent idle completely if CAM is active.
+	 * Use only C1 if CAM is active.
 	 * CAM does not have wakeup capability in OMAP3.
 	 */
-	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON) {
+	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON)
 		new_state_idx = drv->safe_state_index;
-		goto select_state;
-	}
-
-	new_state_idx = next_valid_state(dev, drv, index);
+	else
+		new_state_idx = next_valid_state(dev, drv, index);

-	/*
-	 * Prevent PER off if CORE is not in retention or off as this
-	 * would disable PER wakeups completely.
-	 */
+	/* Program PER state */
 	cx = cpuidle_get_statedata(&dev->states_usage[new_state_idx]);
 	core_next_state = cx->core_state;
-	per_next_state = per_saved_state = pwrdm_read_next_func_pwrst(per_pd);
-	if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
-	    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
-		per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	if (new_state_idx == 0) {
+		/* In C1 do not allow PER state lower than CORE state */
+		per_next_state = core_next_state;
+	} else {
+		/*
+		 * Prevent PER off if CORE is not in RETention or OFF as this
+		 * would disable PER wakeups completely.
+		 */
+		per_next_state = per_saved_state =
+			pwrdm_read_next_func_pwrst(per_pd);
+		if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
+		    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
+			per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	}

 	/* Are we changing PER target state? */
 	if (per_next_state != per_saved_state)
 		omap_set_pwrdm_state(per_pd, per_next_state);

-select_state:
 	ret = omap3_enter_idle(dev, drv, new_state_idx);

 	/* Restore original PER state if it was modified */
@@ -390,7 +394,6 @@ int __init omap3_idle_init(void)
 	/* C1 . MPU WFI + Core active */
 	_fill_cstate(drv, 0, "MPU ON + CORE ON");
-	(&drv->states[0])->enter = omap3_enter_idle;
 	drv->safe_state_index = 0;
 	cx = _fill_cstate_usage(dev, 0);
 	cx->valid = 1;	/* C1 is always valid */

[2]

diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index e406d7b..6aa3c75 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -118,8 +118,10 @@ static int __omap3_enter_idle(struct cpuidle_device
Re: PM related performance degradation on OMAP3
On Tue, May 1, 2012 at 7:27 PM, Kevin Hilman khil...@ti.com wrote:

Jean Pihet jean.pi...@newoldbits.com writes:

Hi Kevin, Grazvydas,

On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:

Jean Pihet jean.pi...@newoldbits.com writes:

Hi Grazvydas, Kevin,

I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. I posted the patches for the power domain registers cache, cf. http://marc.info/?l=linux-omap&m=133587781712039&w=2. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis

I updated the page with the measurement results with Kevin's patches and the registers cache patches. The results show that the registers cache optimizes the low power mode transitions but is not sufficient to obtain a big gain: a few unused domains are transitioning, which causes a big penalty in the idle path.

PER is the one that seems to be causing the most latency. Can you try to do your measurements using the hack below, which makes sure that PER isn't any deeper than CORE?

Indeed your patch brings significant improvements, cf. the wiki page at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis for detailed information. Below is the reworked patch, more suited for inclusion in mainline [1].

I have another optimisation, still at the proof-of-concept stage, that brings another significant improvement: allowing/disabling idle for only one clkdm in a pwrdm instead of iterating through all the clkdms. This still needs some rework though. Cf. patch [2].

Patches [1] and [2] on top of the registers cache and the optimisations in pre/post_transition bring the performance close to that of the non-cpuidle case (3.0MB/s compared to 3.1MB/s on Beagleboard). What do you think?

Regards,
Jean

---
[1]

diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index e406d7b..572b605 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -279,32 +279,36 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
 	int ret;

 	/*
-	 * Prevent idle completely if CAM is active.
+	 * Use only C1 if CAM is active.
 	 * CAM does not have wakeup capability in OMAP3.
 	 */
-	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON) {
+	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON)
 		new_state_idx = drv->safe_state_index;
-		goto select_state;
-	}
-
-	new_state_idx = next_valid_state(dev, drv, index);
+	else
+		new_state_idx = next_valid_state(dev, drv, index);

-	/*
-	 * Prevent PER off if CORE is not in retention or off as this
-	 * would disable PER wakeups completely.
-	 */
+	/* Program PER state */
 	cx = cpuidle_get_statedata(&dev->states_usage[new_state_idx]);
 	core_next_state = cx->core_state;
-	per_next_state = per_saved_state = pwrdm_read_next_func_pwrst(per_pd);
-	if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
-	    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
-		per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	if (new_state_idx == 0) {
+		/* In C1 do not allow PER state lower than CORE state */
+		per_next_state = core_next_state;
+	} else {
+		/*
+		 * Prevent PER off if CORE is not in RETention or OFF as this
+		 * would disable PER wakeups completely.
+		 */
+		per_next_state = per_saved_state =
+			pwrdm_read_next_func_pwrst(per_pd);
+		if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
+		    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
+			per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	}

 	/* Are we changing PER target state? */
 	if (per_next_state != per_saved_state)
 		omap_set_pwrdm_state(per_pd, per_next_state);

-select_state:
 	ret = omap3_enter_idle(dev, drv, new_state_idx);

 	/* Restore original PER state if it was modified */
@@ -390,7 +394,6 @@ int __init omap3_idle_init(void)
 	/* C1 . MPU WFI + Core active */
 	_fill_cstate(drv, 0, "MPU ON + CORE ON");
-	(&drv->states[0])->enter = omap3_enter_idle;
 	drv->safe_state_index = 0;
 	cx = _fill_cstate_usage(dev, 0);
 	cx->valid = 1;	/* C1 is always valid */

[2]

diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index e406d7b..6aa3c75 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -118,8 +118,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 	/* Deny idle for C1 */
 	if (index == 0) {
-		pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
-		pwrdm_for_each_clkdm(core_pd,
Re: PM related performance degradation on OMAP3
Hi Kevin, Grazvydas,

On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:

Jean Pihet jean.pi...@newoldbits.com writes:

Hi Grazvydas, Kevin,

I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. I posted the patches for the power domain registers cache, cf. http://marc.info/?l=linux-omap&m=133587781712039&w=2. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis

I updated the page with the measurement results with Kevin's patches and the registers cache patches. The results show that:
- the registers cache optimizes the low power mode transitions, but is not sufficient to obtain a big gain. A few unused domains are transitioning, which causes a big penalty in the idle path.
- khilman's optimizations are really helpful. Furthermore, they further optimize the registers cache statistics accesses.
- the average time in idle now drops to 246us, which is still very large for a CPU-intensive C-state. For information, with PM disabled the average time in idle is 113us.

Regards,
Jean

This is great, thanks.

[...]

Here are the results (BW in MB/s) on Beagleboard:
- 4.7: without using DMA
- Using DMA:
  2.1: [0]
  2.1: [1] only C1
  2.6: [1]+[2] no pre_ post_
  2.3: [1]+[5] no pwrdm_for_each_clkdm
  2.8: [1]+[5]+[2]
  3.1: [1]+[5]+[6] no omap_sram_idle
  3.1: No IDLE, no omap_sram_idle, all pwrdms to ON

So indeed this shows there is some serious performance issue with the C1 C-state.

Yes, this confirms what both Grazvydas and I are seeing as well.

[...]

From the list of contributors, the main ones are:
- (140us) pwrdm_pre_transition and pwrdm_post_transition,

See the series I just posted to address this one: [PATCH/RFT 0/3] ARM: OMAP: PM: reduce overhead of pwrdm pre/post transitions

- (105us) omap2_gpio_prepare_for_idle and omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in the latency-critical C-states,
- (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
- (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
- (11us) clkdm_allow_idle(mpu). Is this needed?

In that same series, I removed this as it appears to be a remnant of a code move (cf. patch 3 in the above series.)

Here are a few questions and suggestions:
- In case of latency-critical C-states, could the high-latency code be bypassed in favor of a much simpler version? Pushing the concept a bit further, one could have a C1 state that just relaxes the cpu (no WFI), a C2 state which bypasses a lot of code in __omap3_enter_idle, and the rest of the C-states as we have today.

I was thinking a WFI-only state, with *all* powerdomains staying on, is probably sufficient for C1. Do you see the enter/exit latency from that as even being too high?

- Is it needed to iterate through all the power and clock domains in order to keep them active?

No. My series above starts to address this, but I think Tero's use-counting series is the final solution, since this should really be done only when we know the powerdomains are transitioning.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PM related performance degradation on OMAP3
Jean Pihet jean.pi...@newoldbits.com writes:

Hi Kevin, Grazvydas,

On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman khil...@ti.com wrote:

Jean Pihet jean.pi...@newoldbits.com writes:

Hi Grazvydas, Kevin,

I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. I posted the patches for the power domain registers cache, cf. http://marc.info/?l=linux-omap&m=133587781712039&w=2. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis

I updated the page with the measurement results with Kevin's patches and the registers cache patches. The results show that the registers cache optimizes the low power mode transitions but is not sufficient to obtain a big gain: a few unused domains are transitioning, which causes a big penalty in the idle path.

PER is the one that seems to be causing the most latency. Can you try to do your measurements using the hack below, which makes sure that PER isn't any deeper than CORE?

Kevin

From bb2f67ed93dc83c645080e293d315d383c23c0c6 Mon Sep 17 00:00:00 2001
From: Kevin Hilman khil...@ti.com
Date: Mon, 16 Apr 2012 17:53:14 -0700
Subject: [PATCH] cpuidle34xx: per follows core, C1 use _bm

---
 arch/arm/mach-omap2/cpuidle34xx.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index 374708d..00400ad 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -278,9 +278,11 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
 	cx = cpuidle_get_statedata(&dev->states_usage[index]);
 	core_next_state = cx->core_state;
 	per_next_state = per_saved_state = pwrdm_read_next_pwrst(per_pd);
-	if ((per_next_state == PWRDM_POWER_OFF) &&
-	    (core_next_state > PWRDM_POWER_RET))
-		per_next_state = PWRDM_POWER_RET;
+	/* if ((per_next_state == PWRDM_POWER_OFF) && */
+	/*     (core_next_state > PWRDM_POWER_RET)) */
+	/*	per_next_state = PWRDM_POWER_RET; */
+	if (per_next_state < core_next_state)
+		per_next_state = core_next_state;

 	/* Are we changing PER target state? */
 	if (per_next_state != per_saved_state)
@@ -374,7 +376,6 @@ int __init omap3_idle_init(void)
 	/* C1 . MPU WFI + Core active */
 	_fill_cstate(drv, 0, "MPU ON + CORE ON");
-	(&drv->states[0])->enter = omap3_enter_idle;
 	drv->safe_state_index = 0;
 	cx = _fill_cstate_usage(dev, 0);
 	cx->valid = 1;	/* C1 is always valid */
--
1.7.9.2
Re: PM related performance degradation on OMAP3
On Tue, 1 May 2012, Kevin Hilman wrote:

PER is the one that seems to be causing the most latency. Can you try to do your measurements using the hack below, which makes sure that PER isn't any deeper than CORE?

It might be the relock time for DPLL4, the PER DPLL. You might also try disabling DPLL4 autoidle for the shallow C-states...

- Paul
Re: PM related performance degradation on OMAP3
Hi Grazvydas, Kevin,

I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis

The setup is:
- Beagleboard (OMAP3530) at 500MHz,
- l-o master kernel + functional power states + per-device PM QoS. It has been checked that the changes from l-o master do not have an impact on the performance.
- The data transfer is performed using dd from a file in JFFS2 to /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.

On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman khil...@ti.com wrote:

Grazvydas Ignotas nota...@gmail.com writes:

On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote:

It would be helpful now to narrow down what are the big contributors to the overhead in omap_sram_idle(). Most of the code there is skipped for C1 because the next states for MPU and CORE are both ON.

Ok, I did some tests, all in a mostly idle system with just init, a busybox shell and dd doing a NAND read to /dev/null.

...

MB/s is the throughput that dd reports; mA is the approximate current draw during the transfer, read from the fuel gauge that's onboard.

MB/s| mA|comment
 3.7|218|mainline f549e088b80
 3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
 4.4|220|[1] + pwrdm_p*_transition commented [2]
 3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
 4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
 4.0|224|[1] + 'Deny idle' [5]
 5.1|210|[2] + [4] + [5]
 5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
 5.5|243|!CONFIG_PM
 6.1|282|busywait DMA end (for reference)

Here are the results (BW in MB/s) on Beagleboard:
- 4.7: without using DMA
- Using DMA:
  2.1: [0]
  2.1: [1] only C1
  2.6: [1]+[2] no pre_ post_
  2.3: [1]+[5] no pwrdm_for_each_clkdm
  2.8: [1]+[5]+[2]
  3.1: [1]+[5]+[6] no omap_sram_idle
  3.1: No IDLE, no omap_sram_idle, all pwrdms to ON

So indeed this shows there is some serious performance issue with the C1 C-state.

Thanks for the detailed experiments. This definitely confirms we have some serious unwanted overhead for C1, and our C-state latency values are clearly way off base, since they only account for HW latency and not any of the SW latency introduced in omap_sram_idle().

There are 2 primary differences that I see as possible causes. I list them here with a couple more experiments for you to try to help us narrow this down.

1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition(). Could you try using omap_sram_idle() and just commenting out those calls? Does that help performance? Those iterate over all the powerdomains, so they definitely add some overhead, but I don't think it would be as significant as what you're seeing.

Seems to be taking a good part of it.

Much more likely is...

2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds

Could not notice any difference. To me it looks like this results from many small things adding up. Idle is called so often that pwrdm_p*_transition() and those pwrdm_for_each_clkdm() walks start slowing everything down, perhaps because they access lots of registers on slow buses?

From the list of contributors, the main ones are:
- (140us) pwrdm_pre_transition and pwrdm_post_transition,
- (105us) omap2_gpio_prepare_for_idle and omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in the latency-critical C-states,
- (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
- (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
- (11us) clkdm_allow_idle(mpu). Is this needed?

Here are a few questions and suggestions:
- In case of latency-critical C-states, could the high-latency code be bypassed in favor of a much simpler version? Pushing the concept a bit further, one could have a C1 state that just relaxes the cpu (no WFI), a C2 state which bypasses a lot of code in __omap3_enter_idle, and the rest of the C-states as we have today.
- Is it needed to iterate through all the power and clock domains in order to keep them active?
- Trying to idle some unrelated power domains (e.g. PER) causes a performance hit. How to link all the power domain states to the cpuidle C-state? The per-device PM QoS framework could be used to constrain some power domains, but this is highly dependent on the use case.

Yes, PRCM register accesses are unfortunately rather slow, and we've known that for some time, but haven't done any detailed analysis of the overhead.

The analysis would be worth doing. A lot of read accesses to the current, next and previous power states are performed in the idle code. Using the function_graph tracer, I was able to see that the pre/post transitions are taking an enormous amount of time:
- pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
- pwrdm post-transition: 1600+ us at 600MHz (6000+ us at 125MHz)

Notice the big difference between 600MHz
Re: PM related performance degradation on OMAP3
+ Tero

On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:

Hi Grazvydas, Kevin,

I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis

Nice data.

The setup is:
- Beagleboard (OMAP3530) at 500MHz,
- l-o master kernel + functional power states + per-device PM QoS. It has been checked that the changes from l-o master do not have an impact on the performance.
- The data transfer is performed using dd from a file in JFFS2 to /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.

On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman khil...@ti.com wrote:

Grazvydas Ignotas nota...@gmail.com writes:

On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote:

It would be helpful now to narrow down what are the big contributors to the overhead in omap_sram_idle(). Most of the code there is skipped for C1 because the next states for MPU and CORE are both ON.

Ok, I did some tests, all in a mostly idle system with just init, a busybox shell and dd doing a NAND read to /dev/null.

...

MB/s is the throughput that dd reports; mA is the approximate current draw during the transfer, read from the fuel gauge that's onboard.

MB/s| mA|comment
 3.7|218|mainline f549e088b80
 3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
 4.4|220|[1] + pwrdm_p*_transition commented [2]
 3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
 4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
 4.0|224|[1] + 'Deny idle' [5]
 5.1|210|[2] + [4] + [5]
 5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
 5.5|243|!CONFIG_PM
 6.1|282|busywait DMA end (for reference)

Here are the results (BW in MB/s) on Beagleboard:
- 4.7: without using DMA
- Using DMA:
  2.1: [0]
  2.1: [1] only C1
  2.6: [1]+[2] no pre_ post_
  2.3: [1]+[5] no pwrdm_for_each_clkdm
  2.8: [1]+[5]+[2]
  3.1: [1]+[5]+[6] no omap_sram_idle
  3.1: No IDLE, no omap_sram_idle, all pwrdms to ON

So indeed this shows there is some serious performance issue with the C1 C-state.

Looks like the other clock domains (notably l4, per, AON) should be denied idle in C1 to avoid the huge penalties. It might just do the trick.

Thanks for the detailed experiments. This definitely confirms we have some serious unwanted overhead for C1, and our C-state latency values are clearly way off base, since they only account for HW latency and not any of the SW latency introduced in omap_sram_idle().

There are 2 primary differences that I see as possible causes. I list them here with a couple more experiments for you to try to help us narrow this down.

1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition(). Could you try using omap_sram_idle() and just commenting out those calls? Does that help performance? Those iterate over all the powerdomains, so they definitely add some overhead, but I don't think it would be as significant as what you're seeing.

Seems to be taking a good part of it.

Much more likely is...

2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds

Could not notice any difference. To me it looks like this results from many small things adding up. Idle is called so often that pwrdm_p*_transition() and those pwrdm_for_each_clkdm() walks start slowing everything down, perhaps because they access lots of registers on slow buses?

From the list of contributors, the main ones are:
- (140us) pwrdm_pre_transition and pwrdm_post_transition,

I have observed this one on OMAP4 too. There was a plan to remove this as part of Tero's PD/CD use-counting series.

- (105us) omap2_gpio_prepare_for_idle and omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in the latency-critical C-states,

Yes. In C1, when you deny idle for per, there should be no need to call this. But even in the case when it is called, why is it taking 105 us? Needs further digging.

- (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),

Depending on the OPP, a PRCM read can take up to ~12-14 us, so the above shouldn't be surprising.

- (33us estimated) omap_set_pwrdm_state(mpu, core, neon),

This is again dominated by PRCM reads.

- (11us) clkdm_allow_idle(mpu). Is this needed?

I guess yes, otherwise when C2+ is attempted the MPU CD can't idle.

Here are a few questions and suggestions:
- In case of latency-critical C-states, could the high-latency code be bypassed in favor of a much simpler version? Pushing the concept a bit further, one could have a C1 state that just relaxes the cpu (no WFI), a C2 state which bypasses a lot of code in __omap3_enter_idle, and the rest of the C-states as we have today.

We should do that. In fact the C1 state should be as light as possible, like WFI or so.

- Is it needed to iterate through all the power and clock domains in order to keep them active?

That iteration should be removed.

- Trying to
Re: PM related performance degradation on OMAP3
On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote: + Tero On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote: Hi Grazvydas, Kevin, I did some gather some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis . Nice data. The setup is: - Beagleboard (OMAP3530) at 500MHz, - l-o master kernel + functional power states + per-device PM QoS. It has been checked that the changes from l-o master do not have an impact on the performance. - The data transfer is performed using dd from a file in JFFS2 to /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'. Question: what is used for gathering the latency values? On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman khil...@ti.com wrote: Grazvydas Ignotas nota...@gmail.com writes: On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote: It would be helpful now to narrow down what are the big contributors to the overhead in omap_sram_idle(). Most of the code there is skipped for C1 because the next states for MPU and CORE are both ON. Ok I did some tests, all in mostly idle system with just init, busybox shell and dd doing a NAND read to /dev/null . ... MB/s is throughput that dd reports, mA and approx. current draw during the transfer, read from fuel gauge that's onboard. 
MB/s| mA|comment 3.7|218|mainline f549e088b80 3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1] 4.4|220|[1] + pwrdm_p*_transition commented [2] 3.8|225|[1] + omap34xx_do_sram_idle-cpu_do_idle [3] 4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4] 4.0|224|[1] + 'Deny idle' [5] 5.1|210|[2] + [4] + [5] 5.2|202|[5] + omap_sram_idle-cpu_do_idle [6] 5.5|243|!CONFIG_PM 6.1|282|busywait DMA end (for reference) Here are the results (BW in MB/s) on Beagleboard: - 4.7: without using DMA, - Using DMA 2.1: [0] 2.1: [1] only C1 2.6: [1]+[2] no pre_ post_ 2.3: [1]+[5] no pwrdm_for_each_clkdm 2.8: [1]+[5]+[2] 3.1: [1]+[5]+[6] no omap_sram_idle 3.1: No IDLE, no omap_sram_idle, all pwrdms to ON So indeed this shows there is some serious performance issue with the C1 C-state. Looks like other clock-domain (notably l4, per, AON) should be denied idle in C1 to avoid the huge penalties. It might just do the trick. Thanks for the detailed experiments. This definitely confirms we have some serious unwanted overhead for C1, and our C-state latency values are clearly way off base, since they only account HW latency and not any of the SW latency introduced in omap_sram_idle(). There are 2 primary differences that I see as possible causes. I list them here with a couple more experiments for you to try to help us narrow this down. 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition() Could you try using omap_sram_idle() and just commenting out those calls? Does that help performance? Those iterate over all the powerdomains, so defintely add some overhead, but I don't think it would be as significant as what you're seeing. Seems to be taking good part of it. Much more likely is... 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds Could not notice any difference. To me it looks like this results from many small things adding up.. 
Idle is called so often that pwrdm_p*_transition() and those pwrdm_for_each_clkdm() walks start slowing everything down, perhaps because they access lots of registers on slow buses? From the list of contributors, the main ones are:
- (140us) pwrdm_pre_transition and pwrdm_post_transition. I have observed this one on OMAP4 too. There was a plan to remove this as part of Tero's PD/CD use-counting series. pwrdm_pre/post transitions could be optimized a bit already now. They only need to be called for the mpu, core and per domains, but currently they scan through everything.
- (105us) omap2_gpio_prepare_for_idle and omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in the latency-critical C-states. Yes. In C1, when you deny idle for per, there should be no need to call this. But even in the case when it is called, why is it taking 105 us? Needs further digging.
- (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle). Depending on OPP, a PRCM read can take up to ~12-14 us, so the above shouldn't be surprising.
- (33us estimated) omap_set_pwrdm_state(mpu, core, neon). This is again dominated by PRCM reads.
- (11 us) clkdm_allow_idle(mpu). Is this needed? I guess yes, otherwise when C2+ is attempted the MPU CD can't idle.
Here are a few questions and suggestions: - In case of latency critical C-states could the high-latency code be bypassed in favor of a much simpler version? Pushing the concept a bit farther, one could have a C1 state that just relaxes the cpu (no WFI), a C2 state which bypasses a lot of code in __omap3_enter_idle, and the rest of the C-states as we have today.
Re: PM related performance degradation on OMAP3
Hi Tero, On Tue, Apr 24, 2012 at 2:21 PM, Tero Kristo t-kri...@ti.com wrote: On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote: + Tero On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote: Hi Grazvydas, Kevin, I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis . Nice data. The setup is:
- Beagleboard (OMAP3530) at 500MHz,
- l-o master kernel + functional power states + per-device PM QoS. It has been checked that the changes from l-o master do not have an impact on the performance.
- The data transfer is performed using dd from a file in JFFS2 to /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.
Question: what is used for gathering the latency values? I used ftrace tracepoints, which are supposed to be low overhead. I checked that the overhead cannot be measured on the measurement interval (400us), given the fact that the time base is 31us (32 KHz clock). On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman khil...@ti.com wrote: Grazvydas Ignotas nota...@gmail.com writes: On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote: It would be helpful now to narrow down what are the big contributors to the overhead in omap_sram_idle(). Most of the code there is skipped for C1 because the next states for MPU and CORE are both ON. Ok I did some tests, all in a mostly idle system with just init, busybox shell and dd doing a NAND read to /dev/null. ... MB/s is the throughput that dd reports, mA is the approx. current draw during the transfer, read from the onboard fuel gauge.
MB/s| mA|comment
 3.7|218|mainline f549e088b80
 3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
 4.4|220|[1] + pwrdm_p*_transition commented [2]
 3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
 4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
 4.0|224|[1] + 'Deny idle' [5]
 5.1|210|[2] + [4] + [5]
 5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
 5.5|243|!CONFIG_PM
 6.1|282|busywait DMA end (for reference)
Here are the results (BW in MB/s) on Beagleboard:
- 4.7: without using DMA,
- Using DMA:
  2.1: [0]
  2.1: [1] only C1
  2.6: [1]+[2] no pre_ post_
  2.3: [1]+[5] no pwrdm_for_each_clkdm
  2.8: [1]+[5]+[2]
  3.1: [1]+[5]+[6] no omap_sram_idle
  3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
So indeed this shows there is some serious performance issue with the C1 C-state. Looks like other clock domains (notably l4, per, AON) should be denied idle in C1 to avoid the huge penalties. It might just do the trick. Thanks for the detailed experiments. This definitely confirms we have some serious unwanted overhead for C1, and our C-state latency values are clearly way off base, since they only account for HW latency and not any of the SW latency introduced in omap_sram_idle(). There are 2 primary differences that I see as possible causes. I list them here with a couple more experiments for you to try to help us narrow this down. 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition() Could you try using omap_sram_idle() and just commenting out those calls? Does that help performance? Those iterate over all the powerdomains, so definitely add some overhead, but I don't think it would be as significant as what you're seeing. Seems to be taking a good part of it. Much more likely is... 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds Could not notice any difference. To me it looks like this results from many small things adding up..
Idle is called so often that pwrdm_p*_transition() and those pwrdm_for_each_clkdm() walks start slowing everything down, perhaps because they access lots of registers on slow buses? From the list of contributors, the main ones are:
- (140us) pwrdm_pre_transition and pwrdm_post_transition. I have observed this one on OMAP4 too. There was a plan to remove this as part of Tero's PD/CD use-counting series. pwrdm_pre/post transitions could be optimized a bit already now. They only need to be called for the mpu, core and per domains, but currently they scan through everything.
- (105us) omap2_gpio_prepare_for_idle and omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in the latency-critical C-states. Yes. In C1, when you deny idle for per, there should be no need to call this. But even in the case when it is called, why is it taking 105 us? Needs further digging.
- (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle). Depending on OPP, a PRCM read can take up to ~12-14 us, so the above shouldn't be surprising.
- (33us estimated) omap_set_pwrdm_state(mpu, core, neon). This is again dominated by PRCM reads.
- (11 us) clkdm_allow_idle(mpu).
Re: PM related performance degradation on OMAP3
On Tue, 2012-04-24 at 14:50 +0200, Jean Pihet wrote: Hi Tero, On Tue, Apr 24, 2012 at 2:21 PM, Tero Kristo t-kri...@ti.com wrote: On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote: + Tero On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote: Hi Grazvydas, Kevin, I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis . Nice data. The setup is:
- Beagleboard (OMAP3530) at 500MHz,
- l-o master kernel + functional power states + per-device PM QoS. It has been checked that the changes from l-o master do not have an impact on the performance.
- The data transfer is performed using dd from a file in JFFS2 to /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.
Question: what is used for gathering the latency values? I used ftrace tracepoints, which are supposed to be low overhead. I checked that the overhead cannot be measured on the measurement interval (400us), given the fact that the time base is 31us (32 KHz clock). If you want to get accurate measurements, you could use ARM performance counters, namely the cycle counter. I have a couple of patches for that purpose I've used, if you are interested. -Tero -- To unsubscribe from this list: send the line unsubscribe linux-omap in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PM related performance degradation on OMAP3
Jean Pihet jean.pi...@newoldbits.com writes: Hi Grazvydas, Kevin, I gathered some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis . This is great, thanks. [...] Here are the results (BW in MB/s) on Beagleboard:
- 4.7: without using DMA,
- Using DMA:
  2.1: [0]
  2.1: [1] only C1
  2.6: [1]+[2] no pre_ post_
  2.3: [1]+[5] no pwrdm_for_each_clkdm
  2.8: [1]+[5]+[2]
  3.1: [1]+[5]+[6] no omap_sram_idle
  3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
So indeed this shows there is some serious performance issue with the C1 C-state. Yes, this confirms what both Grazvydas and I are seeing as well. [...] From the list of contributors, the main ones are:
- (140us) pwrdm_pre_transition and pwrdm_post_transition. See the series I just posted to address this one: [PATCH/RFT 0/3] ARM: OMAP: PM: reduce overhead of pwrdm pre/post transitions
- (105us) omap2_gpio_prepare_for_idle and omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in the latency-critical C-states.
- (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle).
- (33us estimated) omap_set_pwrdm_state(mpu, core, neon).
- (11 us) clkdm_allow_idle(mpu). Is this needed? In that same series, I removed this as it appears to be a remnant of a code move (c.f. patch 3 in above series.)
Here are a few questions and suggestions: - In case of latency critical C-states could the high-latency code be bypassed in favor of a much simpler version? Pushing the concept a bit farther, one could have a C1 state that just relaxes the cpu (no WFI), a C2 state which bypasses a lot of code in __omap3_enter_idle, and the rest of the C-states as we have today. I was thinking a WFI-only state, with *all* powerdomains staying on, is probably sufficient for C1. Do you see the enter/exit latency from that as even being too high?
- Is it needed to iterate through all the power and clock domains in order to keep them active? No. My series above starts to address this, but I think Tero's use-counting series is the final solution, since this should really be done only when we know the powerdomains are transitioning. Kevin
Re: PM related performance degradation on OMAP3
Grazvydas Ignotas nota...@gmail.com writes: On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote: It would be helpful now to narrow down what are the big contributors to the overhead in omap_sram_idle(). Most of the code there is skipped for C1 because the next states for MPU and CORE are both ON. Ok I did some tests, all in a mostly idle system with just init, busybox shell and dd doing a NAND read to /dev/null. Hmm, I seem to get a hang using dd to read from NAND /dev/mtdX on my Overo. I saw your patch 'mtd: omap2: fix resource leak in prefetch-busy path' but that didn't seem to help my crash. MB/s is the throughput that dd reports, mA is the approx. current draw during the transfer, read from the onboard fuel gauge.
MB/s| mA|comment
 3.7|218|mainline f549e088b80
 3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
 4.4|220|[1] + pwrdm_p*_transition commented [2]
 3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
 4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
 4.0|224|[1] + 'Deny idle' [5]
 5.1|210|[2] + [4] + [5]
 5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
 5.5|243|!CONFIG_PM
 6.1|282|busywait DMA end (for reference)
Thanks for the detailed experiments. This definitely confirms we have some serious unwanted overhead for C1, and our C-state latency values are clearly way off base, since they only account for HW latency and not any of the SW latency introduced in omap_sram_idle(). There are 2 primary differences that I see as possible causes. I list them here with a couple more experiments for you to try to help us narrow this down. 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition() Could you try using omap_sram_idle() and just commenting out those calls? Does that help performance? Those iterate over all the powerdomains, so definitely add some overhead, but I don't think it would be as significant as what you're seeing. Seems to be taking a good part of it. Much more likely is...
2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds Could not notice any difference. To me it looks like this results from many small things adding up.. Idle is called so often that pwrdm_p*_transition() and those pwrdm_for_each_clkdm() walks start slowing everything down, perhaps because they access lots of registers on slow buses? Yes, PRCM register accesses are unfortunately rather slow, and we've known that for some time, but haven't done any detailed analysis of the overhead. Using the function_graph tracer, I was able to see that the pre/post transitions are taking an enormous amount of time:
- pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
- pwrdm post-transition: 1600+ us at 600MHz (6000+ us at 125MHz)
Notice the big difference between the 600MHz OPP and the 125MHz OPP. Are you using CPUfreq at all in your tests? If using cpufreq + the ondemand governor, you're probably running at a low OPP due to lack of CPU activity, which will also affect the latencies in the idle path. Maybe some register cache would help us there, or are those registers expected to be changed by hardware often? Yes, we've known that some sort of register cache here would be useful for some time, but haven't got to implementing it. Also trying to idle PER while a transfer is ongoing (as reported in previous mail) doesn't sound like a good idea and is one of the reasons for the slowdown. Seems to also be causing more current drain, ironically. Agreed. Again, using the function_graph tracer, I get some pretty big latencies from the GPIO pre/post idling process:
- gpio_prepare_for_idle(): 2400+ us at 600MHz (8200+ us at 125MHz)
- gpio_resume_from_idle(): 2200+ us at 600MHz (7600+ us at 125MHz)
Removing PER transitions as you did will get rid of those. I'm looking into this in more detail now, and will likely have a few patches for you to experiment with.
Thanks again for digging into this with us, Kevin
changes (again, sorry for corrupted diffs, but they should be easy to reproduce):
[2]:
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -307,7 +307,7 @@ void omap_sram_idle(void)
 		omap3_enable_io_chain();
 	}
-	pwrdm_pre_transition();
+//	pwrdm_pre_transition();
 	/* PER */
 	if (per_next_state < PWRDM_POWER_ON) {
@@ -372,7 +373,7 @@ void omap_sram_idle(void)
 	}
 	omap3_intc_resume_idle();
-	pwrdm_post_transition();
+//	pwrdm_post_transition();
 	/* PER */
 	if (per_next_state < PWRDM_POWER_ON) {
[3]:
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -347,7 +347,7 @@ void omap_sram_idle(void)
 	if (save_state == 1 || save_state == 3)
 		cpu_suspend(save_state, omap34xx_do_sram_idle);
 	else
-		omap34xx_do_sram_idle(save_state);
+		cpu_do_idle();
 	/* Restore normal SDRC POWER settings */
 	if
Re: PM related performance degradation on OMAP3
On Tue, Apr 17, 2012 at 5:30 PM, Kevin Hilman khil...@ti.com wrote: Grazvydas Ignotas nota...@gmail.com writes: Ok I did some tests, all in a mostly idle system with just init, busybox shell and dd doing a NAND read to /dev/null. Hmm, I seem to get a hang using dd to read from NAND /dev/mtdX on my Overo. I saw your patch 'mtd: omap2: fix resource leak in prefetch-busy path' but that didn't seem to help my crash. I see overo doesn't set the 16bit flag, I think it has NAND on a 16bit bus? Perhaps try this:
--- a/arch/arm/mach-omap2/board-overo.c
+++ b/arch/arm/mach-omap2/board-overo.c
@@ -517,7 +517,7 @@ static void __init overo_init(void)
 	omap_serial_init();
 	omap_sdrc_init(mt46h32m32lf6_sdrc_params, mt46h32m32lf6_sdrc_params);
-	omap_nand_flash_init(0, overo_nand_partitions,
+	omap_nand_flash_init(NAND_BUSWIDTH_16, overo_nand_partitions,
 			     ARRAY_SIZE(overo_nand_partitions));
 	usb_musb_init(NULL);
 	usbhs_init(usbhs_bdata);
Also only pandora is using NAND DMA mode right now in mainline, the default polling mode won't exhibit the latency problem (with all other polling consequences like high CPU usage), so this is needed too for the test:
--- a/arch/arm/mach-omap2/common-board-devices.c
+++ b/arch/arm/mach-omap2/common-board-devices.c
@@ -127,6 +127,7 @@ void __init omap_nand_flash_init(int options, struct mtd_partition *parts,
 	nand_data.parts = parts;
 	nand_data.nr_parts = nr_parts;
 	nand_data.devsize = options;
+	nand_data.xfer_type = NAND_OMAP_PREFETCH_DMA;
 	printk(KERN_INFO "Registering NAND on CS%d\n", nandcs);
 	if (gpmc_nand_init(&nand_data) < 0)
I also forgot to mention I was using ubifs in my test (dd'ing a large file from it), I don't think it has much effect, but if you want to try with that:
.config:
CONFIG_MTD_UBI=y
CONFIG_UBIFS_FS=y
--
ubiformat /dev/mtdX -s 512
ubiattach /dev/ubi_ctrl -m X  # X from mtdX
ubimkvol /dev/ubi0 -m -N somename
mount -t ubifs ubi0:somename /mnt
To me it looks like this results from many small things adding up..
Idle is called so often that pwrdm_p*_transition() and those pwrdm_for_each_clkdm() walks start slowing everything down, perhaps because they access lots of registers on slow buses? Yes, PRCM register accesses are unfortunately rather slow, and we've known that for some time, but haven't done any detailed analysis of the overhead. Using the function_graph tracer, I was able to see that the pre/post transitions are taking an enormous amount of time:
- pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
- pwrdm post-transition: 1600+ us at 600MHz (6000+ us at 125MHz)
Hmm, with this it wouldn't be able to do ~500+ calls/sec I was seeing, so the tracer overhead is probably quite large too.. Notice the big difference between the 600MHz OPP and the 125MHz OPP. Are you using CPUfreq at all in your tests? If using cpufreq + the ondemand governor, you're probably running at a low OPP due to lack of CPU activity, which will also affect the latencies in the idle path. I used the performance governor in my tests, so it all was at 600MHz. I'm looking into this in more detail now, and will likely have a few patches for you to experiment with. Sounds good, -- Gražvydas
Re: PM related performance degradation on OMAP3
Hi, On Thu, Apr 12, 2012 at 09:57:32AM -0700, Kevin Hilman wrote: +Felipe for EHCI question Gary Thomas g...@mlbassoc.com writes: [...] This worked a treat, thanks. My network performance is better now, but still not what it was. The same TFTP transfer now takes 71 seconds, so about 50% slower than on the 3.0 kernel. Applying the second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference. And does a CONFIG_PM=n kernel get you back to your v3.0 performance? I am interested in having PM working as I'm designing a battery powered portable unit, so I need to keep pursuing this. So do I. :) we all are :-p Note: I noticed that when I built with CONFIG_PM off and no other changes, my EHCI USB didn't work properly. Should this be the case? Probably not, but haven't tested EHCI USB. I've Cc'd Felipe to see if he has any ideas why EHCI wouldn't work with CONFIG_PM=n. Govind, Keshava... can you look into this at some point next week ? Or maybe give us a good reason why it doesn't work without PM ;-) -- balbi
Re: PM related performance degradation on OMAP3
On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote: Grazvydas Ignotas nota...@gmail.com writes: On Mon, Apr 9, 2012 at 10:03 PM, Kevin Hilman khil...@ti.com wrote: Grazvydas Ignotas nota...@gmail.com writes: While SD card performance loss is not that bad (~7%), NAND one is worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also cpuidle states over sysfs, it did not have any significant effect. Is there something else to try? Looks like we might need a PM QoS constraint when there is DMA activity in progress. You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when DMA transfers are active and I suspect that will help. I've tried it and it didn't help much. It looks like the only thing it does is limiting cpuidle c-states, I tried to set qos dma latency to 0 and it made it stay in C1 while transfer was ongoing (I watched /sys/devices/system/cpu/cpu0/cpuidle/state*/usage), but performance was still poor. Great, thanks for doing this experiment. Assuming we get to a C1 that's low-latency enough, we will still need this constraint to ensure C1 during transfers. But first we have to figure out what's going on with C1... I've been working on this to collect more data, and noticed that PER is often being put to RET even at C1, is that expected? There is some additional work being done in that case, like putting GPIOs to sleep, and it seems to be source of part of performance loss here as it happens often during NAND transfers. 
This can be reproduced while doing mmc transfers too and detected with this: (not a valid patch, sorry, sending through gmail web)
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -87,6 +87,8 @@ static int _cpuidle_deny_idle(struct powerdomain *pwrdm,
 	return 0;
 }
+int is_c1;
+
 static int __omap3_enter_idle(struct cpuidle_device *dev,
 			      struct cpuidle_driver *drv,
 			      int index)
@@ -117,6 +120,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 	cpu_pm_enter();
 	/* Execute ARM wfi */
+	is_c1 = (index == 0);
 	omap_sram_idle();
 	/*
diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
index 703bd10..519ce9d 100644
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -275,6 +275,7 @@ void omap_sram_idle(void)
 	int per_going_off;
 	int core_prev_state, per_prev_state;
 	u32 sdrc_pwr = 0;
+	extern int is_c1;
 	mpu_next_state = pwrdm_read_next_pwrst(mpu_pwrdm);
 	switch (mpu_next_state) {
@@ -299,6 +300,8 @@ void omap_sram_idle(void)
 	/* Enable IO-PAD and IO-CHAIN wakeups */
 	per_next_state = pwrdm_read_next_pwrst(per_pwrdm);
 	core_next_state = pwrdm_read_next_pwrst(core_pwrdm);
+	if (is_c1 && (per_next_state != PWRDM_POWER_ON || core_next_state != PWRDM_POWER_ON))
+		printk(KERN_ERR "c1 per %d, core %d\n", per_next_state, core_next_state);
 	if (omap3_has_io_wakeup() && (per_next_state < PWRDM_POWER_ON ||
 				      core_next_state < PWRDM_POWER_ON)) {
-- Gražvydas
Re: PM related performance degradation on OMAP3
On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman khil...@ti.com wrote: It would be helpful now to narrow down what are the big contributors to the overhead in omap_sram_idle(). Most of the code there is skipped for C1 because the next states for MPU and CORE are both ON. Ok I did some tests, all in a mostly idle system with just init, busybox shell and dd doing a NAND read to /dev/null. MB/s is the throughput that dd reports, mA is the approx. current draw during the transfer, read from the onboard fuel gauge.
MB/s| mA|comment
 3.7|218|mainline f549e088b80
 3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
 4.4|220|[1] + pwrdm_p*_transition commented [2]
 3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
 4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
 4.0|224|[1] + 'Deny idle' [5]
 5.1|210|[2] + [4] + [5]
 5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
 5.5|243|!CONFIG_PM
 6.1|282|busywait DMA end (for reference)
There are 2 primary differences that I see as possible causes. I list them here with a couple more experiments for you to try to help us narrow this down. 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition() Could you try using omap_sram_idle() and just commenting out those calls? Does that help performance? Those iterate over all the powerdomains, so definitely add some overhead, but I don't think it would be as significant as what you're seeing. Seems to be taking a good part of it. Much more likely is... 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds Could not notice any difference. To me it looks like this results from many small things adding up.. Idle is called so often that pwrdm_p*_transition() and those pwrdm_for_each_clkdm() walks start slowing everything down, perhaps because they access lots of registers on slow buses? Maybe some register cache would help us there, or are those registers expected to be changed by hardware often?
Also trying to idle PER while a transfer is ongoing (as reported in previous mail) doesn't sound like a good idea and is one of the reasons for the slowdown. Seems to also be causing more current drain, ironically. changes (again, sorry for corrupted diffs, but they should be easy to reproduce):
[2]:
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -307,7 +307,7 @@ void omap_sram_idle(void)
 		omap3_enable_io_chain();
 	}
-	pwrdm_pre_transition();
+//	pwrdm_pre_transition();
 	/* PER */
 	if (per_next_state < PWRDM_POWER_ON) {
@@ -372,7 +373,7 @@ void omap_sram_idle(void)
 	}
 	omap3_intc_resume_idle();
-	pwrdm_post_transition();
+//	pwrdm_post_transition();
 	/* PER */
 	if (per_next_state < PWRDM_POWER_ON) {
[3]:
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -347,7 +347,7 @@ void omap_sram_idle(void)
 	if (save_state == 1 || save_state == 3)
 		cpu_suspend(save_state, omap34xx_do_sram_idle);
 	else
-		omap34xx_do_sram_idle(save_state);
+		cpu_do_idle();
 	/* Restore normal SDRC POWER settings */
 	if (cpu_is_omap3430() && omap_rev() >= OMAP3430_REV_ES3_0
[4]:
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -107,6 +107,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 	if (index == 0) {
 		pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
 		pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
+		pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON);
 	}
 	/*
[5]:
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -105,8 +105,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 	/* Deny idle for C1 */
 	if (index == 0) {
-		pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
-		pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
+		clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
 	}
 	/*
@@ -128,8 +128,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 	/* Re-allow idle for C1 */
 	if (index == 0) {
-		pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
-		pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
+		clkdm_allow_idle(mpu_pd->pwrdm_clkdms[0]);
 	}
return_sleep_time:
[6]:
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -117,7 +116,8 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 	cpu_pm_enter();
 	/* Execute ARM wfi */
-	omap_sram_idle();
+	//omap_sram_idle();
+	cpu_do_idle();
 	/*
 	 * Call idle CPU PM enter notifier chain to restore
-- Gražvydas
Re: PM related performance degradation on OMAP3
On 2012-04-11 13:17, Kevin Hilman wrote: Gary Thomas g...@mlbassoc.com writes: [...] I fear I'm seeing similar problems with 3.3. I have my board (similar to the BeagleBoard) ported to 3.0 and 3.3. I'm seeing terrible network performance on 3.3. For example, if I use TFTP to download a large file (~35MB), I get this: 3.0: 42.5 sec 3.3: 625.0 sec That's a factor of 15 worse! This might not be the same problem. What is the NIC being used, and does it have GPIO interrupts? My board uses SMSC911x with a GPIO interrupt signal. If it's using GPIO interrupts, then you likely need this patch from mainline (v3.4-rc1) I tried to just pick up the patch you [sort of] quoted below, but had a hard time applying it to my kernel. I've tried to just pick up the latest files from the mainline kernel, but so far I've nothing that builds - too many dependencies. These are the files I've pulled in:
# modified: arch/arm/mach-omap2/cpuidle34xx.c
# modified: arch/arm/mach-omap2/gpio.c
# modified: arch/arm/mach-omap2/pm34xx.c
# modified: arch/arm/plat-omap/include/plat/gpio.h
# modified: drivers/gpio/gpio-omap.c
but it fails with these errors:
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:34:29: error: asm/system_misc.h: No such file or directory
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c: In function 'omap3_pm_init':
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:744: error: 'omap_pm_clkdms_setup' undeclared (first use in this function)
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:744: error: (Each undeclared identifier is reported only once
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:744: error: for each function it appears in.)
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:767: error: 'arm_pm_idle' undeclared (first use in this function)
Is this a viable path towards getting the GPIO changes into my kernel? It's hard for me to update the whole kernel as there are some other dependencies (OMAP3ISP and video in particular), so I'd like to stay with this 3.3-ish base.
Thanks for any ideas If that doesn't work, or you're not using GPIO interrupts, could you confirm if the patch below[2] (based on an idea from Grazvydas) increases performance for you when CONFIG_PM=y. Kevin
[1]
Author: Kevin Hilman khil...@ti.com 2012-03-05 15:10:04
Committer: Grant Likely grant.lik...@secretlab.ca 2012-03-12 09:16:11
Parent: 25db711df3258d125dc1209800317e5c0ef3c870 (gpio/omap: Fix IRQ handling for SPARSE_IRQ)
Child: 8805f410e4fb88a56552c1af42d61b38837a38fd (gpio/omap: Fix section warning for omap_mpuio_alloc_gc())
Branches: many (66)
Follows: v3.3-rc7
Precedes: v3.4-rc1
gpio/omap: fix wakeups on level-triggered GPIOs
While both level- and edge-triggered GPIOs are capable of generating interrupts, only edge-triggered GPIOs are capable of generating a module-level wakeup to the PRCM (c.f. 34xx NDA TRM section 25.5.3.2.) In order to ensure that devices using level-triggered GPIOs as interrupts can also cause wakeups (e.g. from idle), this patch enables edge-triggering for wakeup-enabled, level-triggered GPIOs when a GPIO bank is runtime-suspended (which also happens during idle.) This fixes a problem found in GPMC-connected network cards with GPIO interrupts (e.g. smsc911x on Zoom3, Overo, ...) where network booting with NFSroot was very slow, since the GPIO IRQs used by the NIC were not generating PRCM wakeups, and thus not waking the system from idle. NOTE: until v3.3, this boot-time problem was somewhat masked because the UART init prevented WFI during boot until the full serial driver was available. Preventing WFI allowed regular GPIO interrupts to fire and this problem was not seen. After the UART runtime PM cleanups, we no longer avoid WFI during boot, so GPIO IRQs that were not causing wakeups resulted in very slow IRQ response times. Tested on platforms using level-triggered GPIOs for network IRQs using the SMSC911x NIC: 3530/Overo and 3630/Zoom3.
Reported-by: Tony Lindgren t...@atomide.com
Tested-by: Tarun Kanti DebBarma tarun.ka...@ti.com
Tested-by: Tony Lindgren t...@atomide.com
Signed-off-by: Kevin Hilman khil...@ti.com
Signed-off-by: Grant Likely grant.lik...@secretlab.ca
[2]
diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index 413aac4..ace4bf6 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -120,7 +120,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 	cpu_pm_enter();
 	/* Execute ARM wfi */
-	omap_sram_idle();
+	if (index == 0)
+		cpu_do_idle();
+	else
+		omap_sram_idle();
 	/*
 	 * Call idle CPU PM enter notifier chain to restore
Re: PM related performance degradation on OMAP3
Gary Thomas g...@mlbassoc.com writes: On 2012-04-11 13:17, Kevin Hilman wrote: Gary Thomas g...@mlbassoc.com writes: [...] I fear I'm seeing similar problems with 3.3. I have my board (similar to the BeagleBoard) ported to 3.0 and 3.3. I'm seeing terrible network performance on 3.3. For example, if I use TFTP to download a large file (~35MB), I get this: 3.0: 42.5 sec 3.3: 625.0 sec That's a factor of 15 worse! This might not be the same problem. What is the NIC being used, and does it have GPIO interrupts? My board uses SMSC911x with a GPIO interrupt signal. OK, then your problem is almost certainly solved by my GPIO triggering fix, and not related to Grazvydas' problem. If it's using GPIO interrupts, then you likely need this patch from mainline (v3.4-rc1) I tried to just pick up the patch you [sort of] quoted below, but had a hard time applying it to my kernel. I've tried to just pick up the latest files from the mainline kernel, but so far I've nothing that builds Oh, right. Sorry about that. Yeah, that patch actually has dependencies on other GPIO changes that were queued for v3.4 (and not in v3.3.) If you're on v3.3, just pull the branch below[1] which is based on v3.3-rc2. Pulling that into a v3.3 should build just fine. Kevin [1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap-pm.git for_3.4/fixes/gpio
Re: PM related performance degradation on OMAP3
On 2012-04-12 08:14, Kevin Hilman wrote: Gary Thomasg...@mlbassoc.com writes: On 2012-04-11 13:17, Kevin Hilman wrote: Gary Thomasg...@mlbassoc.com writes: [...] I fear I'm seeing similar problems with 3.3. I have my board (similar to the BeagleBoard) ported to 3.0 and 3.3. I'm seeing terrible network performance on 3.3. For example, if I use TFTP to download a large file (~35MB), I get this: 3.0: 42.5 sec 3.3: 625.0 sec That's a factor of 15 worse! This might not be the same problem. What is the NIC being used, and does it have GPIO interrupts? My board uses SMSC911x with GPIO interrupt signal. OK, then your problem is almost certainly solved by my GPIO triggering fix, and not related to Grazvytas' problem. If it's using GPIO interrupts, then you likely need this patch from mainline (v3.4-rc1) I tried to just pick up the patch you [sort of] quoted below, but had a hard time applying it to my kernel. I've tried to just pick up the latest files from the mainline kernel, but so far I've nothing that builds Oh, right. Sorry about that. Yeah, that patch actually has dependencies on other GPIO changes that were queued for v3.4 (and not in v3.3.) If you're on v3.3, just pull the branch below[1] which is based on v3.3-rc2. Pulling that into a v3.3 should build just fine. Kevin [1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap-pm.git for_3.4/fixes/gpio This worked a treat, thanks. My network performance is better now, but still not what it was. The same TFTP transfer now takes 71 seconds, so about 50% slower than on the 3.0 kernel. Applying the second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference. I am interested in having PM working as I'm designing a battery powered portable unit, so I need to keep pursuing this. Note: I noticed that when I built with CONFIG_PM off and no other changes, my EHCI USB didn't work properly. Should this be the case? 
Thanks again for your help -- Gary Thomas | Consulting for the MLB Associates | Embedded world
Re: PM related performance degradation on OMAP3
+Felipe for EHCI question Gary Thomas g...@mlbassoc.com writes: [...] This worked a treat, thanks. My network performance is better now, but still not what it was. The same TFTP transfer now takes 71 seconds, so about 50% slower than on the 3.0 kernel. Applying the second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference. And does a CONFIG_PM=n kernel get you back to your v3.0 performance? I am interested in having PM working as I'm designing a battery powered portable unit, so I need to keep pursuing this. So do I. :) Note: I noticed that when I built with CONFIG_PM off and no other changes, my EHCI USB didn't work properly. Should this be the case? Probably not, but haven't tested EHCI USB. I've Cc'd Felipe to see if he has any ideas why EHCI wouldn't work with CONFIG_PM=n. Kevin
Re: PM related performance degradation on OMAP3
On 2012-04-12 10:57, Kevin Hilman wrote: +Felipe for EHCI question Gary Thomas g...@mlbassoc.com writes: [...] This worked a treat, thanks. My network performance is better now, but still not what it was. The same TFTP transfer now takes 71 seconds, so about 50% slower than on the 3.0 kernel. Applying the second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference. And does a CONFIG_PM=n kernel get you back to your v3.0 performance? Correct. I am interested in having PM working as I'm designing a battery powered portable unit, so I need to keep pursuing this. So do I. :) Note: I noticed that when I built with CONFIG_PM off and no other changes, my EHCI USB didn't work properly. Should this be the case? Probably not, but haven't tested EHCI USB. I've Cc'd Felipe to see if he has any ideas why EHCI wouldn't work with CONFIG_PM=n. Thanks
Re: PM related performance degradation on OMAP3
Gary Thomas g...@mlbassoc.com writes: On 2012-04-12 10:57, Kevin Hilman wrote: +Felipe for EHCI question Gary Thomas g...@mlbassoc.com writes: [...] This worked a treat, thanks. My network performance is better now, but still not what it was. The same TFTP transfer now takes 71 seconds, so about 50% slower than on the 3.0 kernel. Applying the second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference. And does a CONFIG_PM=n kernel get you back to your v3.0 performance? Correct. OK, I just tried your TFTP experiment on a 3530/Overo board with the same smsc911x NIC that has GPIO interrupts, and I don't see much difference between a PM-enabled v3.0 and a PM-enabled v3.3. Are you TFTP'ing the file to an MMC filesystem? Can you try to a ramdisk[1]? If you're using MMC, it could be MMC driver changes since v3.0 that are actually causing your performance hit. In my experiment, I TFTP'd a 24Mb file to a ramdisk, to make sure no other drivers were involved, and didn't see any major differences between v3.0, v3.3, and v3.3 with CONFIG_PM disabled. Below are my results. As you can see, all the results seem to be pretty close to the same.
This test was not on a controlled, isolated network, so the differences are probably explained by other network activity:

- v3.0 vanilla: PM enabled, CPUidle enabled
  - Received 25362406 bytes in 35.5 seconds
  - Received 25362406 bytes in 44.9 seconds
  - Received 25362406 bytes in 49.0 seconds
  - Received 25362406 bytes in 36.2 seconds
  - Received 25362406 bytes in 56.3 seconds
  - Received 25362406 bytes in 65.2 seconds
  - Received 25362406 bytes in 37.0 seconds
- v3.3: PM enabled, CPUidle enabled + GPIO fix (my for_3.4/fixes/gpio branch) + smsc911x regulator boot fix (Tony's omap/fix-smsc911x-regulator branch)
  - Received 25362406 bytes in 32.1 seconds
  - Received 25362406 bytes in 29.8 seconds
  - Received 25362406 bytes in 33.5 seconds
  - Received 25362406 bytes in 44.5 seconds
  - Received 25362406 bytes in 39.2 seconds
  - Received 25362406 bytes in 57.0 seconds
  - Received 25362406 bytes in 49.6 seconds
- v3.3: CONFIG_PM=n + branches above + fix from Grazvydas for !CONFIG_PM case: [PATCH] ARM: OMAP: sram: fix BUG in dpll code for !PM case + disable CONFIG_OMAP_WATCHDOG which fails to boot when CONFIG_PM=y
  - Received 25362406 bytes in 34.1 seconds
  - Received 25362406 bytes in 33.9 seconds
  - Received 25362406 bytes in 34.9 seconds
  - Received 25362406 bytes in 37.8 seconds
  - Received 25362406 bytes in 40.0 seconds
  - Received 25362406 bytes in 37.6 seconds
  - Received 25362406 bytes in 34.4 seconds

Kevin

[1] simple steps to make a ramdisk:
    mkfs.ext2 /dev/ram0
    mkdir /tmp/rd
    mount /dev/ram0 /tmp/rd
    cd /tmp/rd
then TFTP file here
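For easier comparison across the different file sizes in this thread, the "bytes in seconds" results above can be converted to throughput. A quick sketch (the awk one-liner is mine; the byte count and time are from the first result above):

```shell
# Convert one of the TFTP results above to throughput,
# in binary megabytes per second: 25362406 bytes in 35.5 seconds.
awk 'BEGIN { printf "%.2f MB/s\n", 25362406 / 35.5 / (1024 * 1024) }'
# prints 0.68 MB/s
```

The same one-liner with the other times plugged in puts all three kernels in the 0.37-0.81 MB/s range, which is why the runs look "pretty close to the same".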
Re: PM related performance degradation on OMAP3
On 2012-04-12 12:08, Kevin Hilman wrote: Gary Thomasg...@mlbassoc.com writes: On 2012-04-12 10:57, Kevin Hilman wrote: +Felipe for EHCI question Gary Thomasg...@mlbassoc.com writes: [...] This worked a treat, thanks. My network performance is better now, but still not what it was. The same TFTP transfer now takes 71 seconds, so about 50% slower than on the 3.0 kernel. Applying the second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference. And does a CONFIG_PM=n kernel get you back to your v3.0 performance? Correct. OK, I just tried your TFTP experiment on a 3530/Overo board with the same smsc911x NIC that has GPIO interrupts, and I don't see much difference between a PM-enabled v3.0 and a PM-enabled v3.3. Are you TFTP'ing the file to an MMC filesystem?Can you try to a ramdisk[1]? If you're using MMC, it could be MMC driver changes since v3.0 that are actually causing your performance hit. I'm testing to a ramdisk, so we're on the same page. Could you send me your config file so I can compare? Maybe I have something dumb in my settings that aggravates things. Also, what's your performance on 3.4-rc2? The linux-media tree I started from is a bit post v3.3, so there might be something else causing this. In my experiment, I TFTP'd a 24Mb file to a ramdisk, to make sure no other drivers were invovled, and didn't see any major differences between v3.0, v3.3, and v3.3 CONFIG_PM disabled. Below are my results. As you can see, all the results seem to be pretty close to the same. 
Re: PM related performance degradation on OMAP3
Gary Thomas g...@mlbassoc.com writes: On 2012-04-12 12:08, Kevin Hilman wrote: Gary Thomas g...@mlbassoc.com writes: On 2012-04-12 10:57, Kevin Hilman wrote: +Felipe for EHCI question Gary Thomas g...@mlbassoc.com writes: [...] This worked a treat, thanks. My network performance is better now, but still not what it was. The same TFTP transfer now takes 71 seconds, so about 50% slower than on the 3.0 kernel. Applying the second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference. And does a CONFIG_PM=n kernel get you back to your v3.0 performance? Correct. OK, I just tried your TFTP experiment on a 3530/Overo board with the same smsc911x NIC that has GPIO interrupts, and I don't see much difference between a PM-enabled v3.0 and a PM-enabled v3.3. Are you TFTP'ing the file to an MMC filesystem? Can you try to a ramdisk[1]? If you're using MMC, it could be MMC driver changes since v3.0 that are actually causing your performance hit. I'm testing to a ramdisk, so we're on the same page. Could you send me your config file so I can compare? Maybe I have something dumb in my settings that aggravates things. Below is the Kconfig snippet[1] I append to a default omap2plus_defconfig to enable CPUidle, CPUfreq and some debug. Rebuild with that appended and these settings override the default ones. I used omap2plus_defconfig plus this snippet for v3.0, v3.3 and v3.4-rc2 tests. Also, what's your performance on 3.4-rc2? The linux-media tree I started from is a bit post v3.3, so there might be something else causing this. I just tried with vanilla v3.4-rc2, and I see basically the same results. Between 35 and 50 seconds for the 24Mb file transfer, which is similar to the v3.0 and v3.3 results.
Kevin

[1]
CONFIG_CPU_IDLE=y
CONFIG_PM_ADVANCED_DEBUG=y
CONFIG_PM_SLEEP_ADVANCED_DEBUG=y
CONFIG_PM_GENERIC_DOMAINS=y
CONFIG_OMAP_SMARTREFLEX=y
CONFIG_OMAP_SMARTREFLEX_CLASS3=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_ARM_OMAP2PLUS_CPUFREQ=y
CONFIG_REGULATOR_OMAP_SMPS=y
CONFIG_DEBUG_LL=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_USER=y
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_SECTION_MISMATCH=y
RE: PM related performance degradation on OMAP3
From: linux-omap-ow...@vger.kernel.org [mailto:linux-omap-ow...@vger.kernel.org] On Behalf Of Grazvydas Ignotas Sent: Tuesday, April 10, 2012 7:30 PM What I think is going on here is that omap_sram_idle() is taking too much time because its overhead is too large. I've added a counter there and it seems to be called ~530 times per megabyte (DMA operates in ~2K chunks so it makes sense), that's over 2000 calls per second. Some quick measurement code shows ~243us spent for setting up in omap_sram_idle() (before and after omap34xx_do_sram_idle()). 243uS is really a long time for C1. For some reason it has grown a lot since the last time I captured the path in ETM. Your analysis correlates well with reports from a couple of years back. N900 folks did report that the non-clock-gated C1 was needed (as exists in code today). IIRC the NAND stack did have small-uS spins on NAND status or something, which with a higher clock-stop penalty resulted in a big performance dip. You needed something like 10uS for C1 or you'd take a big hit. Regards, Richard W.
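A back-of-envelope check of those figures (~530 idle entries per MB and ~243us of setup each are the measurements quoted above; the arithmetic below is only my sketch) shows why this dominates DMA throughput:

```shell
# ~530 omap_sram_idle() entries per MB of DMA traffic,
# each paying ~243 us of setup/teardown overhead.
awk 'BEGIN { printf "%.0f ms of idle-path overhead per MB\n", 530 * 243 / 1000 }'
# prints 129 ms of idle-path overhead per MB
```

At the ~2000 calls per second quoted above, that is roughly half of every second spent in idle-entry setup rather than in wfi or useful work, which is consistent with the ~39% NAND throughput drop reported earlier in the thread.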
Re: PM related performance degradation on OMAP3
On 2012-04-12 16:03, Kevin Hilman wrote: Gary Thomasg...@mlbassoc.com writes: On 2012-04-12 12:08, Kevin Hilman wrote: Gary Thomasg...@mlbassoc.com writes: On 2012-04-12 10:57, Kevin Hilman wrote: +Felipe for EHCI question Gary Thomasg...@mlbassoc.comwrites: [...] This worked a treat, thanks. My network performance is better now, but still not what it was. The same TFTP transfer now takes 71 seconds, so about 50% slower than on the 3.0 kernel. Applying the second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference. And does a CONFIG_PM=n kernel get you back to your v3.0 performance? Correct. OK, I just tried your TFTP experiment on a 3530/Overo board with the same smsc911x NIC that has GPIO interrupts, and I don't see much difference between a PM-enabled v3.0 and a PM-enabled v3.3. Are you TFTP'ing the file to an MMC filesystem?Can you try to a ramdisk[1]? If you're using MMC, it could be MMC driver changes since v3.0 that are actually causing your performance hit. I'm testing to a ramdisk, so we're on the same page. Could you send me your config file so I can compare? Maybe I have something dumb in my settings that aggravates things. Below is the Kconfig snippet[1] I append to a default omap2plus_defconfig to enable CPUidle, CPUfreq and some debug. Rebuild with that appended and these settings override the default ones. I used omap2plus_defcnfig plus this snippit for v3.0, v3.3 and v3.4-rc2 tests. Also, what's your performance on 3.4-rc2? The linux-media tree I started from is a bit post v3.3, so there might be something else causing this. I just tried with vanilla v3.4-rc2, and I see basically the same results. Between 35 and 50 seconds for the 24Mb file transfer, which is similar to the v3.0 and v3.3 results. 
These settings made no difference. I just reverified my results to xfer a 39MB file to ramdisk:
3.0 + PM = 39sec
3.3 + PM = 70sec
3.3 - PM = 48sec
so it's not quite the same as 3.0 was, but closer. BTW, your results normalized to mine would be 3.3 + PM = 56sec. I wish I knew why I'm seeing a big difference between +PM/-PM and you don't. Is there some way to compare your source tree (the one you built for v3.3) and mine? I'm not very good with GIT so I'm not quite sure how to do it. Sorry for being so much trouble, I'm just in search of all the performance I can get out of my system :-)
Thanks
Re: PM related performance degradation on OMAP3
On 2012-04-06 16:50, Grazvydas Ignotas wrote: Hello, I'm seeing DMA performance loss related to CONFIG_PM on OMAP3.

# CONFIG_PM is set:
echo 3 > /proc/sys/vm/drop_caches
# file copy from NAND (using NAND driver in DMA mode)
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 9.088714 seconds, 3.5MB/s
# file read from SD (hsmmc uses DMA)
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 2.065460 seconds, 15.5MB/s

# CONFIG_PM not set:
# NAND
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 5.653534 seconds, 5.7MB/s
# SD
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 1.919007 seconds, 16.7MB/s

While the SD card performance loss is not that bad (~7%), the NAND one is worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also cpuidle states over sysfs, it did not have any significant effect. Is there something else to try? I'm guessing this is caused by CPU wakeup latency to service DMA interrupts? I've noticed that if I keep the CPU busy, the loss is reduced almost completely. Talking about cpuidle, what's the difference between C1 and C2 states? They look mostly the same. Then there is omap3_do_wfi; it seems to be unconditionally putting SDRC into self-refresh. Would it make sense to just do wfi in higher power states, like OMAP4 seems to be doing? I fear I'm seeing similar problems with 3.3. I have my board (similar to the BeagleBoard) ported to 3.0 and 3.3. I'm seeing terrible network performance on 3.3. For example, if I use TFTP to download a large file (~35MB), I get this: 3.0: 42.5 sec 3.3: 625.0 sec That's a factor of 15 worse! I'd like to try building without CONFIG_PM, but when I disabled this, my kernel fails to come up. Can someone point me to the magic to build without CONFIG_PM, or possibly send me a working config file?
Thanks
Re: PM related performance degradation on OMAP3
On Wed, Apr 11, 2012 at 5:59 PM, Gary Thomas g...@mlbassoc.com wrote: I'd like to try building without CONFIG_PM, but when I disabled this, my kernel fails to come up. Can someone point me to the magic to build without CONFIG_PM, or possibly send me a working config file? You probably need this patch: http://marc.info/?l=linux-omap&m=133374930011086&w=2 If it still won't boot, you'll need to enable earlyprintk both in .config and as a kernel argument to see where it dies. -- Gražvydas
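For reference, enabling earlyprintk on an ARM kernel of this era typically looks like the following (a sketch only; the bootloader command assumes U-Boot, and the exact console device is board-specific):

```shell
# .config additions (option names as in mainline ARM kernels of this era):
#   CONFIG_DEBUG_LL=y
#   CONFIG_EARLY_PRINTK=y
#
# Then append "earlyprintk" to the kernel command line, e.g. from U-Boot:
#   setenv bootargs "${bootargs} earlyprintk"
```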
Re: PM related performance degradation on OMAP3
On 2012-04-11 11:23, Grazvydas Ignotas wrote: On Wed, Apr 11, 2012 at 5:59 PM, Gary Thomas g...@mlbassoc.com wrote: I'd like to try building without CONFIG_PM, but when I disabled this, my kernel fails to come up. Can someone point me to the magic to build without CONFIG_PM, or possibly send me a working config file? You probably need this patch: http://marc.info/?l=linux-omapm=133374930011086w=2 If it still won't boot, you'll need to enable earlyprintk both in .config and as kernel argument to see where it dies. That does help, but there are lots of tracebacks like these:

[0.588500] [ cut here ]
[0.588531] WARNING: at drivers/video/omap2/dss/dispc.c:404 dss_driver_probe+0x44/0xd8()
[0.588562] Modules linked in:
[0.588592] [<c0012204>] (unwind_backtrace+0x0/0xf8) from [<c002b81c>] (warn_slowpath_common+0x4c/0x64)
[0.588623] [<c002b81c>] (warn_slowpath_common+0x4c/0x64) from [<c002b850>] (warn_slowpath_null+0x1c/0x24)
[0.588623] [<c002b850>] (warn_slowpath_null+0x1c/0x24) from [<c022609c>] (dss_driver_probe+0x44/0xd8)
[0.588653] [<c022609c>] (dss_driver_probe+0x44/0xd8) from [<c0273e10>] (driver_probe_device+0x70/0x1e4)
[0.588684] [<c0273e10>] (driver_probe_device+0x70/0x1e4) from [<c0274018>] (__driver_attach+0x94/0x98)
[0.588714] [<c0274018>] (__driver_attach+0x94/0x98) from [<c027270c>] (bus_for_each_dev+0x50/0x7c)
[0.588745] [<c027270c>] (bus_for_each_dev+0x50/0x7c) from [<c0273664>] (bus_add_driver+0x184/0x244)
[0.588775] [<c0273664>] (bus_add_driver+0x184/0x244) from [<c02742bc>] (driver_register+0x78/0x12c)
[0.588775] [<c02742bc>] (driver_register+0x78/0x12c) from [<c00085a0>] (do_one_initcall+0x34/0x178)
[0.588806] [<c00085a0>] (do_one_initcall+0x34/0x178) from [<c061d7dc>] (kernel_init+0x78/0x114)
[0.588836] [<c061d7dc>] (kernel_init+0x78/0x114) from [<c000e0d0>] (kernel_thread_exit+0x0/0x8)
[0.588867] ---[ end trace 1b75b31a2719ed24 ]---

I also had to disable the watchdog to get it up.
That said, with CONFIG_PM disabled, my network performance is back to what it was in 3.0 :-) Note: I also had CONFIG_PM disabled in that kernel build, so I don't know for sure what the performance might be with that version if it were enabled.
Re: PM related performance degradation on OMAP3
Gary Thomas g...@mlbassoc.com writes: [...] I fear I'm seeing similar problems with 3.3. I have my board (similar to the BeagleBoard) ported to 3.0 and 3.3. I'm seeing terrible network performance on 3.3. For example, if I use TFTP to download a large file (~35MB), I get this: 3.0: 42.5 sec 3.3: 625.0 sec That's a factor of 15 worse! This might not be the same problem. What is the NIC being used, and does it have GPIO interrupts? If it's using GPIO interrupts, then you likely need this patch from mainline (v3.4-rc1). If that doesn't work, or you're not using GPIO interrupts, could you confirm whether the patch below[2] (based on an idea from Grazvydas) increases performance for you when CONFIG_PM=y. Kevin

[1]
Author: Kevin Hilman khil...@ti.com 2012-03-05 15:10:04
Committer: Grant Likely grant.lik...@secretlab.ca 2012-03-12 09:16:11
Parent: 25db711df3258d125dc1209800317e5c0ef3c870 (gpio/omap: Fix IRQ handling for SPARSE_IRQ)
Child: 8805f410e4fb88a56552c1af42d61b38837a38fd (gpio/omap: Fix section warning for omap_mpuio_alloc_gc())
Branches: many (66)
Follows: v3.3-rc7
Precedes: v3.4-rc1

gpio/omap: fix wakeups on level-triggered GPIOs

While both level- and edge-triggered GPIOs are capable of generating interrupts, only edge-triggered GPIOs are capable of generating a module-level wakeup to the PRCM (c.f. 34xx NDA TRM section 25.5.3.2.) In order to ensure that devices using level-triggered GPIOs as interrupts can also cause wakeups (e.g. from idle), this patch enables edge-triggering for wakeup-enabled, level-triggered GPIOs when a GPIO bank is runtime-suspended (which also happens during idle.) This fixes a problem found in GPMC-connected network cards with GPIO interrupts (e.g. smsc911x on Zoom3, Overo, ...) where network booting with NFSroot was very slow since the GPIO IRQs used by the NIC were not generating PRCM wakeups, and thus not waking the system from idle.
NOTE: until v3.3, this boot-time problem was somewhat masked because the UART init prevented WFI during boot until the full serial driver was available. Preventing WFI allowed regular GPIO interrupts to fire and this problem was not seen. After the UART runtime PM cleanups, we no longer avoid WFI during boot, so GPIO IRQs that were not causing wakeups resulted in very slow IRQ response times. Tested on platforms using level-triggered GPIOs for network IRQs using the SMSC911x NIC: 3530/Overo and 3630/Zoom3.

Reported-by: Tony Lindgren t...@atomide.com
Tested-by: Tarun Kanti DebBarma tarun.ka...@ti.com
Tested-by: Tony Lindgren t...@atomide.com
Signed-off-by: Kevin Hilman khil...@ti.com
Signed-off-by: Grant Likely grant.lik...@secretlab.ca

[2]

diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index 413aac4..ace4bf6 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -120,7 +120,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 	cpu_pm_enter();
 
 	/* Execute ARM wfi */
-	omap_sram_idle();
+	if (index == 0)
+		cpu_do_idle();
+	else
+		omap_sram_idle();
 
 	/*
 	 * Call idle CPU PM enter notifier chain to restore
Re: PM related performance degradation on OMAP3
Grazvydas Ignotas nota...@gmail.com writes: On Mon, Apr 9, 2012 at 10:03 PM, Kevin Hilman khil...@ti.com wrote: Grazvydas Ignotas nota...@gmail.com writes: While SD card performance loss is not that bad (~7%), NAND one is worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also cpuidle states over sysfs, it did not have any significant effect. Is there something else to try? Looks like we might need a PM QoS constraint when there is DMA activity in progress. You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when DMA transfers are active and I suspect that will help. I've tried it and it didn't help much. It looks like the only thing it does is limiting cpuidle c-states, I tried to set qos dma latency to 0 and it made it stay in C1 while transfer was ongoing (I watched /sys/devices/system/cpu/cpu0/cpuidle/state*/usage), but performance was still poor. Great, thanks for doing this experiment. Assuming we get to a C1 that's low-latency enough, we will still need this constraint to ensure C1 during transfers. But first we have to figure out what's going on with C1... What I think is going on here is that omap_sram_idle() is taking too much time because it's overhead is too large. I've added a counter there and it seems to be called ~530 times per megabyte (DMA operates in ~2K chunks so it makes sense), that's over 2000 calls per second. Some quick measurement code shows ~243us spent for setting up in omap_sram_idle() (before and after omap34xx_do_sram_idle()). Could we perhaps have a lighter idle function for C1 that doesn't try to switch all powerdomain states and maybe not enable RAM self-refresh? Yes, but first let's try to uncover exactly what makes the current C1 so heavy. 
As a quick test I've tried this in omap3_enter_idle():

	/* Execute ARM wfi */
	if (index == 0) {
		clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
		cpu_do_idle();
	} else
		omap_sram_idle();

..and it brought performance close to the !CONFIG_PM case (cpu_do_idle() is used as pm_idle when !CONFIG_PM). OK, I see now. I think you're right about the overhead. It would be helpful now to narrow down what are the big contributors to the overhead in omap_sram_idle(). Most of the code there is skipped for C1 because the next states for MPU and CORE are both ON. There are 2 primary differences that I see as possible causes. I list them here with a couple more experiments for you to try to help us narrow this down. 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition() Could you try using omap_sram_idle() and just commenting out those calls? Does that help performance? Those iterate over all the powerdomains, so they definitely add some overhead, but I don't think it would be as significant as what you're seeing. Much more likely is... 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds This is more likely the culprit of most of the overhead. Specifically, when returning from idle there are some errata to work around that require waiting for DPLL3 to lock. I suspect this is more likely to be the source of the problem. Can you try the hack below[1], which basically does the cpu_do_idle() hack that you've already done, but inside omap_sram_idle(), and only eliminates the jump to SRAM, SDRC self-refresh and the SDRC errata workarounds? I assume that will get performance back to what you expect. Then it remains to be seen if it's the SDRC self-refresh that's causing the delay, or the errata workarounds. To add the self-refresh back, but eliminate the SDRC errata workaround, you could try something like I hacked up in the (untested) branch here[2]. If performance is still good, that will tell us that it's the errata workaround waiting that's causing the extra overhead.
I need to clarify for myself if SDRC self-refresh is even entered in C1. When the CORE powerdomain is left on, I don't think the PRCM would send IDLEREQ to the SDRC, so it should not enter self-refresh, but I need to verify that. I don't know what side effects something like this might have though. There are some other errata workarounds that you miss by not calling omap_sram_idle(). Specifically, the call to omap3_intc_prepare_idle() is important. Kevin

[1]

diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
index 3e6b564..0fb3942 100644
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -313,7 +313,7 @@ void omap_sram_idle(void)
 	if (save_state == 1 || save_state == 3)
 		cpu_suspend(save_state, omap34xx_do_sram_idle);
 	else
-		omap34xx_do_sram_idle(save_state);
+		cpu_do_idle();
 
 	/* Restore normal SDRC POWER settings */
 	if (cpu_is_omap3430() && omap_rev() >= OMAP3430_REV_ES3_0

[2] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap-pm.git tmp/sdrc-hacks
Re: PM related performance degradation on OMAP3
On Mon, Apr 9, 2012 at 10:03 PM, Kevin Hilman khil...@ti.com wrote: Grazvydas Ignotas nota...@gmail.com writes: While the SD card performance loss is not that bad (~7%), the NAND one is worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also cpuidle states over sysfs, it did not have any significant effect. Is there something else to try? Looks like we might need a PM QoS constraint when there is DMA activity in progress. You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when DMA transfers are active and I suspect that will help. I've tried it and it didn't help much. It looks like the only thing it does is limiting cpuidle c-states; I tried to set the qos dma latency to 0 and it made it stay in C1 while the transfer was ongoing (I watched /sys/devices/system/cpu/cpu0/cpuidle/state*/usage), but performance was still poor. What I think is going on here is that omap_sram_idle() is taking too much time because its overhead is too large. I've added a counter there and it seems to be called ~530 times per megabyte (DMA operates in ~2K chunks so it makes sense), that's over 2000 calls per second. Some quick measurement code shows ~243us spent for setting up in omap_sram_idle() (before and after omap34xx_do_sram_idle()). Could we perhaps have a lighter idle function for C1 that doesn't try to switch all powerdomain states and maybe not enable RAM self-refresh? As a quick test I've tried this in omap3_enter_idle():

	/* Execute ARM wfi */
	if (index == 0) {
		clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
		cpu_do_idle();
	} else
		omap_sram_idle();

..and it brought performance close to the !CONFIG_PM case (cpu_do_idle() is used as pm_idle on !CONFIG_PM). I don't know what side effects something like this might have though. Then there is omap3_do_wfi; it seems to be unconditionally putting SDRC into self-refresh. Would it make sense to just do wfi in higher power states, like OMAP4 seems to be doing? Not sure what you're referring to in OMAP4.
> There we do WFI in every idle state.

What I meant is that the OMAP3 idle code always tries to enable RAM self-refresh (regardless of C-state) before doing wfi, while OMAP4 can do wfi without suspending RAM (although I might be misunderstanding all that asm code).

--
Gražvydas
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PM related performance degradation on OMAP3
Grazvydas Ignotas <nota...@gmail.com> writes:

> Hello,
>
> I'm seeing DMA-related performance loss with CONFIG_PM on OMAP3.
>
> # CONFIG_PM is set:
> echo 3 > /proc/sys/vm/drop_caches
> # file copy from NAND (using NAND driver in DMA mode)
> dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 9.088714 seconds, 3.5MB/s
> # file read from SD (hsmmc uses DMA)
> dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 2.065460 seconds, 15.5MB/s
>
> # CONFIG_PM not set:
> # NAND
> dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 5.653534 seconds, 5.7MB/s
> # SD
> dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 1.919007 seconds, 16.7MB/s
>
> While the SD card performance loss is not that bad (~7%), the NAND one is worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also toggling cpuidle states over sysfs; it did not have any significant effect. Is there something else to try?

Looks like we might need a PM QoS constraint when there is DMA activity in progress. You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when DMA transfers are active, and I suspect that will help.

> I'm guessing this is caused by CPU wakeup latency to service DMA interrupts? I've noticed that if I keep the CPU busy, the loss is reduced almost completely.

Yeah, that suggests a QoS constraint is what's needed here.

> Talking about cpuidle, what's the difference between the C1 and C2 states? They look mostly the same.

Except that clockdomains are not allowed to idle in C1, which results in much shorter wakeup latency.

> Then there is omap3_do_wfi, it seems to be unconditionally putting the SDRC into self-refresh; would it make sense to just do wfi in higher power states, like OMAP4 seems to be doing?

Not sure what you're referring to in OMAP4. There we do WFI in every idle state.
Kevin
PM related performance degradation on OMAP3
Hello,

I'm seeing DMA-related performance loss with CONFIG_PM on OMAP3.

# CONFIG_PM is set:
echo 3 > /proc/sys/vm/drop_caches
# file copy from NAND (using NAND driver in DMA mode)
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 9.088714 seconds, 3.5MB/s
# file read from SD (hsmmc uses DMA)
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 2.065460 seconds, 15.5MB/s

# CONFIG_PM not set:
# NAND
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 5.653534 seconds, 5.7MB/s
# SD
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 1.919007 seconds, 16.7MB/s

While the SD card performance loss is not that bad (~7%), the NAND one is worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also toggling cpuidle states over sysfs; it did not have any significant effect. Is there something else to try?

I'm guessing this is caused by CPU wakeup latency to service DMA interrupts? I've noticed that if I keep the CPU busy, the loss is reduced almost completely.

Talking about cpuidle, what's the difference between the C1 and C2 states? They look mostly the same.

Then there is omap3_do_wfi: it seems to unconditionally put the SDRC into self-refresh. Would it make sense to just do wfi in higher power states, like OMAP4 seems to be doing?

--
Gražvydas