Re: [PATCH] gpio/omap: fix invalid context restore of gpio bank-0
On 06/29/2012 10:22 AM, Jon Hunter wrote:

Currently the gpio _runtime_resume/suspend functions call the get_context_loss_count() platform function if the function is populated for a gpio bank. This function is used to determine whether the gpio bank logic state needs to be restored due to a power transition. It will be populated for all banks, but it should only be called for banks that have the "loses_context" variable set. It is pointless to call it if loses_context is false, as we know the context will never be lost and will not need restoring.

For all OMAP2+ devices gpio bank-0 is in an always-on power domain and so will never lose context. We found that get_context_loss_count() was being called for bank-0 during the probe and returning 1 instead of 0, indicating that the context had been lost. This caused the context restore function to be called at probe time for this bank and, because the context had never been saved, it restored an invalid state. This ultimately resulted in a crash [1].

There are multiple bugs here that need to be addressed ...

1. Why does the always-on power domain return a context loss count of 1? This needs to be fixed in the power domain code. However, the gpio driver should not assume the loss count is 0 to begin with.

2. The OMAP gpio driver should never call get_context_loss_count() for a gpio bank in an always-on domain. This is pointless and adds unnecessary overhead.

3. The OMAP gpio driver assumes that the initial power domain context loss count will be 0 at the time the gpio driver is probed. However, this may not be the case, and an invalid context restore could be performed during the probe. To avoid this, only populate the get_context_loss_count() function pointer after the initial call to pm_runtime_get() has occurred. This will ensure that the first pm_runtime_put() initialises the loss count correctly.

This patch addresses issues 2 and 3 above.

[1] http://marc.info/?l=linux-omap&m=134065775323775&w=2

Cc: Grant Likely
Cc: Linus Walleij
Cc: Kevin Hilman
Cc: Tarun Kanti DebBarma
Cc: Franky Lin
Reported-by: Franky Lin
Signed-off-by: Jon Hunter
---

Tested-by: Franky Lin

--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
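For anyone following along, the rule in issue 2 (only consult the loss count for banks that can actually lose context) can be sketched as a small user-space model. This is only an illustration under stated assumptions: struct bank_model, model_runtime_suspend(), model_runtime_resume() and fake_get_loss_count() are hypothetical names invented here, not the gpio-omap driver code, and the context restore is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the driver's per-bank state. */
struct bank_model {
	bool loses_context;                  /* false for an always-on bank */
	int (*get_context_loss_count)(void); /* platform hook, may be NULL */
	int context_loss_count;              /* count recorded at suspend */
	int restores;                        /* how often a restore ran */
};

/* Fake platform hook: returns whatever the caller stored here. */
static int fake_loss_count;
static int fake_get_loss_count(void)
{
	return fake_loss_count;
}

/* Record the loss count on suspend, but only for banks that can lose context. */
static void model_runtime_suspend(struct bank_model *b)
{
	if (b->loses_context && b->get_context_loss_count)
		b->context_loss_count = b->get_context_loss_count();
}

/* Restore on resume only if the bank can lose context and the count moved. */
static void model_runtime_resume(struct bank_model *b)
{
	if (!b->loses_context || !b->get_context_loss_count)
		return;
	if (b->get_context_loss_count() != b->context_loss_count)
		b->restores++;	/* stands in for the real context restore */
}
```

In this model an always-on bank never restores, even when the platform hook reports a non-zero count, which is the behaviour the patch description asks for.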
Re: Panda ES board hang when using GPIO as interrupt
On 06/28/2012 04:54 PM, Jon Hunter wrote:

I am wondering if this could be the bug ... on start-up I see that we do a context restore on bank1 during the probe which is before we have done the first suspend! In other words, we could restore a bad/uninitialised context for bank1. In the case of bank1, the loss count starts at 1 and not 0 and so we falsely think we need to perform a restore :-(

[0.176269] omap_gpio_runtime_resume: bank @ 0xfc31
[0.177276] omap_gpio_runtime_resume: count 0, now 1
[0.177276] gpiochip_add: registered GPIOs 0 to 31 on device: gpio
[0.177642] omap_gpio_runtime_suspend: bank @ 0xfc31

Can you try ...

diff --git a/drivers/gpio/gpio-omap.c b/drivers/gpio/gpio-omap.c
index c4ed172..9623408 100644
--- a/drivers/gpio/gpio-omap.c
+++ b/drivers/gpio/gpio-omap.c
@@ -1086,6 +1086,9 @@ static int __devinit omap_gpio_probe(struct platform_device *pdev)
 #ifdef CONFIG_OF_GPIO
 	bank->chip.of_node = of_node_get(node);
 #endif
+	if (bank->get_context_loss_count)
+		bank->context_loss_count =
+			bank->get_context_loss_count(bank->dev);
 
 	bank->irq_base = irq_alloc_descs(-1, 0, bank->width, 0);
 	if (bank->irq_base < 0) {

Looks like you found the culprit. :) It does fix the problem.

Franky
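The "count 0, now 1" line in the log above is the whole problem in miniature, and a tiny user-space model makes the effect of the probe-time snapshot easy to see. This is a sketch only: model_needs_restore() and model_first_resume_restores() are hypothetical helpers written for illustration, and they assume the saved count defaults to zero when nothing initialises it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: true when a resume would trigger a context restore. */
static bool model_needs_restore(int saved_count, int current_count)
{
	return saved_count != current_count;
}

/*
 * Model of the first resume after probe. With snapshot_at_probe set, the
 * saved count is initialised from the current count (as in the diff above);
 * otherwise it keeps its zero default.
 */
static bool model_first_resume_restores(int count_at_probe, bool snapshot_at_probe)
{
	int saved_count = 0;	/* zero-initialised bank state (assumption) */

	if (snapshot_at_probe)
		saved_count = count_at_probe;
	return model_needs_restore(saved_count, count_at_probe);
}
```

With a platform count of 1 and no snapshot, the model reports a restore on the very first resume; with the snapshot it does not.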
Re: Panda ES board hang when using GPIO as interrupt
On 06/28/2012 03:59 PM, Jon Hunter wrote:
On 06/28/2012 05:53 PM, Franky Lin wrote:

I found one interesting thing. When I added the print info to see when runtime_suspend/resume get called, it seems like the suspend/resume is unbalanced during boot. Resume got called more often than suspend. So I hacked the code to make sure suspend and resume are called in pairs. A resume without a suspend will do nothing and return immediately. This also makes the hang vanish.

I am not 100% sure I follow. On boot I would expect to see a resume/suspend due to the probe on the irq bank and then I would expect to see another resume from the acquisition of the gpio, however, I would not expect a suspend until the gpio is freed, which I don't believe you are doing.

Can you share your hack? Just paste the diff? This may help me understand more.

OK. This is what I saw in the log:

[0.171844] dummy:
[0.172912] NET: Registered protocol family 16
[0.173431] GPMC revision 6.0
[0.173492] gpmc: irq-52 could not claim: err -22
[0.177551] ??omap_gpio_runtime_resume
[0.178619] OMAP GPIO hardware version 0.1
[0.178649] !omap_gpio_runtime_suspend
[0.178771] ??omap_gpio_runtime_resume
[0.179351] !omap_gpio_runtime_suspend
[0.179504] ??omap_gpio_runtime_resume
[0.180023] !omap_gpio_runtime_suspend
[0.180145] ??omap_gpio_runtime_resume
[0.180694] !omap_gpio_runtime_suspend
[0.180847] ??omap_gpio_runtime_resume
[0.181365] !omap_gpio_runtime_suspend
[0.181518] ??omap_gpio_runtime_resume
[0.182037] !omap_gpio_runtime_suspend
[0.185089] omap_mux_init: Add partition: #1: core, flags: 2
[0.186462] omap_mux_init: Add partition: #2: wkup, flags: 2
[0.186584] error setting wl12xx data: -38
[0.189788] _omap_mux_get_by_name: Could not find signal uart1_rx.uart1_rx
[0.189788] _omap_mux_get_by_name: Could not find signal uart1_rx.uart1_rx
[0.239501] ??omap_gpio_runtime_resume
[0.239532] ??omap_gpio_runtime_resume
[0.241058] usbhs_omap: alias fck already exists
[0.244781] ??omap_gpio_runtime_resume

diff --git a/drivers/gpio/gpio-omap.c b/drivers/gpio/gpio-omap.c
index c4ed172..bca3985 100644
--- a/drivers/gpio/gpio-omap.c
+++ b/drivers/gpio/gpio-omap.c
@@ -1146,7 +1146,7 @@ static int __devinit omap_gpio_probe(struct platform_device *pdev)
 
 #if defined(CONFIG_PM_RUNTIME)
 static void omap_gpio_restore_context(struct gpio_bank *bank);
-
+static int flag = 0;
 static int omap_gpio_runtime_suspend(struct device *dev)
 {
 	struct platform_device *pdev = to_platform_device(dev);
@@ -1155,6 +1155,8 @@ static int omap_gpio_runtime_suspend(struct device *dev)
 	unsigned long flags;
 	u32 wake_low, wake_hi;
 
+	flag ++;
+
 	spin_lock_irqsave(&bank->lock, flags);
 
 	/*
@@ -1221,6 +1223,11 @@ static int omap_gpio_runtime_resume(struct device *dev)
 	u32 l = 0, gen, gen0, gen1;
 	unsigned long flags;
 
+	if (flag)
+		flag--;
+	else
+		return 0;
+
 	spin_lock_irqsave(&bank->lock, flags);
 	_gpio_dbck_enable(bank);

Regards,
Franky
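The pairing idea behind the hack above can be modelled in user space as follows. Note the differences, which are assumptions of this sketch rather than anything from the thread: the diff uses a single global flag shared by all banks, while this model keeps the counter per instance, and struct pairing_model, model_suspend() and model_resume() are hypothetical names.

```c
#include <assert.h>

/* Hypothetical per-instance model of the pairing hack. */
struct pairing_model {
	int pending_suspends;	/* suspends not yet matched by a resume */
	int resumes_done;	/* resumes whose body actually ran */
};

/* Every suspend makes one later resume eligible to run. */
static void model_suspend(struct pairing_model *m)
{
	m->pending_suspends++;
}

/* Returns 1 if the resume body ran, 0 if it was skipped as unpaired. */
static int model_resume(struct pairing_model *m)
{
	if (m->pending_suspends == 0)
		return 0;	/* no prior suspend: nothing to undo */
	m->pending_suspends--;
	m->resumes_done++;
	return 1;
}
```

A resume with no preceding suspend is skipped, so in this model the probe-time resume never reaches the restore path. That would hide the invalid restore rather than explain it, which fits the hang disappearing with the hack applied.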
Re: Panda ES board hang when using GPIO as interrupt
On 06/28/2012 02:55 PM, Jon Hunter wrote:

Ok. Any way to manually reset the wlan module to deactivate the gpio when it is hung? I am wondering if the gpio is deactivated if the board comes back to life, indicating it is stuck in the interrupt somewhere.

The only way I can think of is removing the module manually. But it didn't bring the board back to life.

Well, at least that is consistent with what I see, but also perplexing that it takes some time to fail.

Can you try the following as a debug patch to see if it is the context restore that is the problem? From your testing and bisect, the only possible difference in the current kernel is that it could perform the context restore when acquiring the gpio.

diff --git a/drivers/gpio/gpio-omap.c b/drivers/gpio/gpio-omap.c
index c4ed172..a2401bd 100644
--- a/drivers/gpio/gpio-omap.c
+++ b/drivers/gpio/gpio-omap.c
@@ -1341,6 +1341,8 @@ void omap2_gpio_resume_after_idle(void)
 #if defined(CONFIG_PM_RUNTIME)
 static void omap_gpio_restore_context(struct gpio_bank *bank)
 {
+	return;
+
 	__raw_writel(bank->context.wake_en,
 			bank->base + bank->regs->wkup_en);
 	__raw_writel(bank->context.ctrl, bank->base + bank->regs->ctrl);

This one works! It can run for more than 20 mins.

I found one interesting thing. When I added the print info to see when runtime_suspend/resume get called, it seems like the suspend/resume is unbalanced during boot. Resume got called more often than suspend. So I hacked the code to make sure suspend and resume are called in pairs. A resume without a suspend will do nothing and return immediately. This also makes the hang vanish.

Regards,
Franky
Re: Panda ES board hang when using GPIO as interrupt
On 06/28/2012 08:42 AM, Jon Hunter wrote:
On 06/27/2012 07:41 PM, Franky Lin wrote:
On 06/26/2012 08:37 PM, Kevin Hilman wrote:
"Franky Lin" writes:

I noticed Kevin raised some similar cases on other platforms and also provided two patches in the patch mail thread. But unfortunately those two patches don't help in our case. I tested the driver with 3.5-rc3 mainline kernel and the issue is still there. I can only "fix" the hang by either reverting the commit or disabling CONFIG_PM_RUNTIME. Also, the hang only happens on the Panda ES board. The old Panda with 4430 works fine. Any thoughts and suggestions?

If reverting the patch fixes your problem, can you isolate down to which part of that patch causes the problem? IOW, can you fix your problem if you undo just the hunk added in runtime_suspend or undo just the moved hunk in runtime_resume? Or is reverting both required?

I suspect the added runtime_suspend hunk is causing the problems, so can you see if just undoing that part works[1]. If that works, I will give it a bit more thought tomorrow.

The runtime_suspend hunk is fine. The hang still exists after reverting it. The culprit is the moved hunk in runtime_resume. Reverting it makes the hang disappear.

Thanks. From reviewing the code the only thing that appears suspect based upon your findings is the return if we find the context has not been lost. We are not checking if "workaround_enabled" is set before we return. Could you try the following change on top of v3.5-rc3?

The patch doesn't help. And I also managed to probe the signal. It's active when it hangs.

Also, could you add a print in the runtime_suspend/resume() functions so we can see how often these are being called. In my case, I really don't see these being exercised and I am wondering how often you see suspend/resume being called in your setup.

Well, the runtime_suspend/resume never get called during the test.

Thanks,
Franky
Re: Panda ES board hang when using GPIO as interrupt
On 06/27/2012 04:43 PM, Jon Hunter wrote:

Hi Franky,

On 06/25/2012 03:52 PM, Franky Lin wrote:

Hi Kevin, Tarun,

We are using the expansion connector A on the Panda board to mount an SDIO WiFi dongle on MMC2 with a level triggered interrupt signal connected to GPIO 138. It's been working fine until 3.5 rc1. The board hangs randomly within 5 mins during a network traffic test. After bisecting we found the culprit is "[PATCH 8/8] gpio/omap: fix missing check in *_runtime_suspend()" [1].

I have been looking into this today to see if I can replicate the problem that you have reported. However, so far I have not had any luck. Please note that my test setup is not exactly the same as yours as I don't have your wlan module. However, I have been using a 2nd board to generate gpio events to a panda-es to see if I can make it lock up. I have tried mainline kernel 3.5-rc1 and 3.5-rc3 but I have not seen any problems after sending 100k gpio events (over many minutes).

My setup is as follows ...

- OMAP4460 panda-es with gpio-138 connected to OMAP3430 beagle gpio-11.
- Mainline kernel 3.5-rc1/3 using omap2plus_defconfig (no changes)
- Created a simple kernel module that acquires gpio-138 and sets up an IRQ with flag IRQF_TRIGGER_HIGH (for active high level interrupt).
- GPIO events are triggered roughly every 1ms

Don't know if it's related, but we also mux several other pins on connector A:

	/* MMC2 Mux for extension board */
	/* MMC2 CMD */
	OMAP4_MUX(GPMC_NWE, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
	/* MMC2 CLK */
	OMAP4_MUX(GPMC_NOE, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
	/* MMC2 DAT 0-3 */
	OMAP4_MUX(GPMC_AD0, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
	OMAP4_MUX(GPMC_AD1, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
	OMAP4_MUX(GPMC_AD2, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
	OMAP4_MUX(GPMC_AD3, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
	/* GPIO MUX for OOB interrupt of dongle */
	OMAP4_MUX(MCSPI1_CS1, OMAP_MUX_MODE3 | OMAP_PIN_INPUT_PULLDOWN),
	/* GPIO MUX for WLAN_ENABLE for dongle */
	OMAP4_MUX(MCSPI1_CLK, OMAP_MUX_MODE3 | OMAP_PIN_OUTPUT),

Can you confirm ...

1. You are just using omap2plus_defconfig with no changes?

No, we enable the following options:

CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_USB_OHCI_HCD=y

2. Rough frequency of gpio events?

3367 interrupts were triggered during a 10 sec throughput test.

3. Is the gpio configured for active low or high?

active high

4. When the hang occurs, what is the state of the gpio? Active or inactive? Can you probe it with a scope? If it was always active I could see that this would lock the device up, but I am not sure how that would relate to the results from your bisect???

I don't have a scope nearby. Let me see if I can find one tomorrow.

I noticed Kevin raised some similar cases on other platforms and also provided two patches in the patch mail thread. But unfortunately those two patches don't help in our case. I tested the driver with 3.5-rc3 mainline kernel and the issue is still there. I can only "fix" the hang by either reverting the commit or disabling CONFIG_PM_RUNTIME. Also, the hang only happens on the Panda ES board. The old Panda with 4430 works fine.

It does not make sense to me yet why this would only impact 4460, but I will keep this in mind.

In your wlan driver are you acquiring and freeing the gpio often? Or are you only acquiring the gpio on boot?

The reason I ask is because for omap4, it seems that we are not currently calling omap2_gpio_prepare_for_idle() during idle and so the only time I see us call the runtime_suspend/resume handlers for omap4 is during probe and when we acquire and free the gpio. So if you were not acquiring and freeing the gpio and are using the stock kernel, then as far as I can tell, the runtime pm code is not being exercised much. My test is not acquiring and releasing the gpio and so I am wondering if that is the secret to reproducing this problem :-)

We only request the irq once during initialization. But we do frequently disable and re-enable it, since we need to access the module through SDIO to clear the interrupt. Apparently we can't finish all this in the irq handler.

Hope this helps.

Regards,
Franky
Re: Panda ES board hang when using GPIO as interrupt
On 06/26/2012 08:37 PM, Kevin Hilman wrote:
"Franky Lin" writes:

I noticed Kevin raised some similar cases on other platforms and also provided two patches in the patch mail thread. But unfortunately those two patches don't help in our case. I tested the driver with 3.5-rc3 mainline kernel and the issue is still there. I can only "fix" the hang by either reverting the commit or disabling CONFIG_PM_RUNTIME. Also, the hang only happens on the Panda ES board. The old Panda with 4430 works fine. Any thoughts and suggestions?

If reverting the patch fixes your problem, can you isolate down to which part of that patch causes the problem? IOW, can you fix your problem if you undo just the hunk added in runtime_suspend or undo just the moved hunk in runtime_resume? Or is reverting both required?

I suspect the added runtime_suspend hunk is causing the problems, so can you see if just undoing that part works[1]. If that works, I will give it a bit more thought tomorrow.

The runtime_suspend hunk is fine. The hang still exists after reverting it. The culprit is the moved hunk in runtime_resume. Reverting it makes the hang disappear.

Thanks for reporting the problem! Bug reports like this that have clearly been thoroughly researched and bisected are greatly appreciated!

Kevin

You are welcome.

Regards,
Franky
Re: Panda ES board hang when using GPIO as interrupt
On 06/26/2012 12:21 AM, DebBarma, Tarun Kanti wrote:
On Tue, Jun 26, 2012 at 2:22 AM, Franky Lin wrote:

Hi Kevin, Tarun,

We are using the expansion connector A on the Panda board to mount an SDIO WiFi dongle on MMC2 with a level triggered interrupt signal connected to GPIO 138. It's been working fine until 3.5 rc1. The board hangs randomly within 5 mins during a network traffic test. After bisecting we found the culprit is "[PATCH 8/8] gpio/omap: fix missing check in *_runtime_suspend()" [1].

I noticed Kevin raised some similar cases on other platforms and also provided two patches in the patch mail thread. But unfortunately those two patches don't help in our case. I tested the driver with 3.5-rc3 mainline kernel and the issue is still there. I can only "fix" the hang by either reverting the commit or disabling CONFIG_PM_RUNTIME. Also, the hang only happens on the Panda ES board. The old Panda with 4430 works fine. Any thoughts and suggestions?

I just had a quick look at the code. Can you please check if the attached patch solves the issue? I just boot tested on Panda and Blaze.

--
Tarun

Thanks for the prompt reply. Booting is fine even without the patch and revert. The wifi dongle generates an interrupt whenever there is a data packet available for the host to read. So during a traffic test a significant number of interrupts will be triggered through the GPIO. So I assume it has something to do with the interrupt GPIO.

With the patch, the kernel still crashes. But the symptom is slightly different. Now it has a panic log every time. See attachment.

Regards,
Franky

[ 636.143585] Internal error: Oops - undefined instruction: 0 [#1] SMP ARM
[ 636.150634] Modules linked in: brcmfmac brcmutil cfg80211
[ 636.156311] CPU: 0 Not tainted (3.5.0-rc4+ #3)
[ 636.161346] PC is at __lock_acquire+0x65c/0x1d88
[ 636.166198] LR is at 0x6093
[ 636.169494] pc : [] lr : [<6093>] psr: 2093
[ 636.169494] sp : c06b1e18 ip : 9e370001 fp : c0724f70
[ 636.181549] r10: c06b r9 : 001e r8 : c0b92998
[ 636.187042] r7 : c06d2cc8 r6 : r5 : c0746d64 r4 : c06d2868
[ 636.193908] r3 : 3b0e r2 : ec3b001d r1 : 0001d870 r0 : 001d
[ 636.200744] Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 636.208526] Control: 10c53c7d Table: ae39c04a DAC: 0017
[ 636.214569] Process swapper/0 (pid: 0, stack limit = 0xc06b02f8)
[ 636.220855] Stack: (0xc06b1e18 to 0xc06b2000)
[ 636.225433] 1e00: c06d00f8 0002
[ 636.234039] 1e20: c0807968 0001 0002 001d 0001 0001d870
[ 636.242614] 1e40: c08070e8 0001 0002 0002 c00903e4
[ 636.251220] 1e60: 0002 0080 c0066838 6093
[ 636.259796] 1e80: 6093 c06b4324 c06b
Panda ES board hang when using GPIO as interrupt
Hi Kevin, Tarun,

We are using the expansion connector A on the Panda board to mount an SDIO WiFi dongle on MMC2 with a level triggered interrupt signal connected to GPIO 138. It's been working fine until 3.5 rc1. The board hangs randomly within 5 mins during a network traffic test. After bisecting we found the culprit is "[PATCH 8/8] gpio/omap: fix missing check in *_runtime_suspend()" [1].

I noticed Kevin raised some similar cases on other platforms and also provided two patches in the patch mail thread. But unfortunately those two patches don't help in our case. I tested the driver with 3.5-rc3 mainline kernel and the issue is still there. I can only "fix" the hang by either reverting the commit or disabling CONFIG_PM_RUNTIME. Also, the hang only happens on the Panda ES board. The old Panda with 4430 works fine.

Any thoughts and suggestions?

Thanks,
Franky

[1] http://article.gmane.org/gmane.linux.ports.arm.omap/75708/