Re: [PATCH] gpio/omap: fix invalid context restore of gpio bank-0

2012-06-29 Thread Franky Lin

On 06/29/2012 10:22 AM, Jon Hunter wrote:

Currently the gpio _runtime_resume/suspend functions are calling the
get_context_loss_count() platform function if the function is populated for
a gpio bank. This function is used to determine if the gpio bank logic state
needs to be restored due to a power transition. This function will be populated
for all banks, but it should only be called for banks that have the
"loses_context" variable set. It is pointless to call this if loses_context is
false as we know the context will never be lost and will not need restoring.

For all OMAP2+ devices gpio bank-0 is in an always-on power domain and so will
never lose context. We found that the get_context_loss_count() was being called
for bank-0 during the probe and returning 1 instead of 0 indicating that the
context had been lost. This was causing the context restore function to be
called at probe time for this bank and because the context had never been saved,
was restoring an invalid state. This ultimately resulted in a crash [1].

There are multiple bugs here that need to be addressed ...

1. Why does the always-on power domain return a context loss count of 1? This
needs to be fixed in the power domain code. However, the gpio driver should
not assume the loss count is 0 to begin with.
2. The omap gpio driver should never be calling get_context_loss_count for a
gpio bank in an always-on domain. This is pointless and adds unnecessary
overhead.
3. The OMAP gpio driver assumes that the initial power domain context loss count
will be 0 at the time the gpio driver is probed. However, it could be
possible that this is not the case and an invalid context restore could be
performed during the probe. To avoid this, only populate the
get_context_loss_count() function pointer after the initial call to
pm_runtime_get() has occurred. This will ensure that the first
pm_runtime_put() initialises the loss count correctly.

This patch addresses issues 2 and 3 above.
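
Issues 2 and 3 can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the driver code: struct bank_model, model_probe() and fake_loss_count() are hypothetical names, and the model_runtime_resume()/model_runtime_suspend() calls stand in for what pm_runtime_get()/pm_runtime_put() would trigger during probe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the driver's per-bank state. */
struct bank_model {
	bool loses_context;                  /* false for always-on banks */
	int (*get_context_loss_count)(void); /* platform hook, may be NULL */
	int context_loss_count;              /* baseline recorded at suspend */
	int restores;                        /* context restores performed */
};

/* Record the current loss count so that resume has a valid baseline. */
static void model_runtime_suspend(struct bank_model *b)
{
	if (b->loses_context && b->get_context_loss_count)
		b->context_loss_count = b->get_context_loss_count();
}

/* Restore only if the bank can lose context and the count has moved. */
static void model_runtime_resume(struct bank_model *b)
{
	if (!b->loses_context || !b->get_context_loss_count)
		return;
	if (b->get_context_loss_count() != b->context_loss_count)
		b->restores++;
}

/* A power domain whose loss count already reads 1 at probe time. */
static int fake_loss_count(void)
{
	return 1;
}

/* Probe: the hook is populated only after the first resume has run. */
static void model_probe(struct bank_model *b, bool loses_context,
			int (*hook)(void))
{
	assert(b != NULL);
	b->loses_context = loses_context;
	b->get_context_loss_count = NULL;  /* not populated yet */
	b->context_loss_count = 0;
	b->restores = 0;

	model_runtime_resume(b);           /* stands in for pm_runtime_get() */
	if (loses_context)
		b->get_context_loss_count = hook;
	model_runtime_suspend(b);          /* stands in for pm_runtime_put() */
}
```

In this model a bank with loses_context == false never queries the hook, and a bank that can lose context gets its baseline from the first suspend, so a non-zero initial count does not trigger a restore.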

[1] http://marc.info/?l=linux-omap&m=134065775323775&w=2

Cc: Grant Likely 
Cc: Linus Walleij 
Cc: Kevin Hilman 
Cc: Tarun Kanti DebBarma 
Cc: Franky Lin 

Reported-by: Franky Lin 
Signed-off-by: Jon Hunter 
---


Tested-by: Franky Lin 

--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Panda ES board hang when using GPIO as interrupt

2012-06-28 Thread Franky Lin

On 06/28/2012 04:54 PM, Jon Hunter wrote:

I am wondering if this could be the bug ... on start-up I see that we do
a context restore on bank1 during the probe which is before we have done
the first suspend! In other words, we could restore a bad/uninitialised
context for bank1. In the case of bank1, the loss count starts at 1 and
not 0 and so we falsely think we need to perform a restore :-(

[0.176269] omap_gpio_runtime_resume: bank @ 0xfc31
[0.177276] omap_gpio_runtime_resume: count 0, now 1
[0.177276] gpiochip_add: registered GPIOs 0 to 31 on device: gpio
[0.177642] omap_gpio_runtime_suspend: bank @ 0xfc31

Can you try ...

diff --git a/drivers/gpio/gpio-omap.c b/drivers/gpio/gpio-omap.c
index c4ed172..9623408 100644
--- a/drivers/gpio/gpio-omap.c
+++ b/drivers/gpio/gpio-omap.c
@@ -1086,6 +1086,9 @@ static int __devinit omap_gpio_probe(struct platform_device *pdev)
 #ifdef CONFIG_OF_GPIO
 	bank->chip.of_node = of_node_get(node);
 #endif
+	if (bank->get_context_loss_count)
+		bank->context_loss_count =
+			bank->get_context_loss_count(bank->dev);

 	bank->irq_base = irq_alloc_descs(-1, 0, bank->width, 0);
 	if (bank->irq_base < 0) {



Looks like you found the culprit. :) It does fix the problem.
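
For reference, the effect of seeding the count at probe can be sketched in a few lines of user-space C. This is a simplified model, not the driver: needs_restore() and first_resume_restores() are hypothetical helpers, and the assumption is that the bank structure starts zero-initialised while the power domain already reports a non-zero count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: the comparison the resume path effectively makes. */
static bool needs_restore(int saved_count, int current_count)
{
	return saved_count != current_count;
}

/*
 * Simulate the first resume at probe time.
 * seed_at_probe models the hunk above, which copies the domain's current
 * loss count into the bank before the first resume runs.
 */
static bool first_resume_restores(int domain_count, bool seed_at_probe)
{
	int saved_count = 0;	/* zero-initialised bank structure */

	if (seed_at_probe)
		saved_count = domain_count;

	return needs_restore(saved_count, domain_count);
}
```

With a domain count of 1, the unseeded case reports that a restore is needed even though nothing was ever saved, while the seeded case does not.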

Franky



Re: Panda ES board hang when using GPIO as interrupt

2012-06-28 Thread Franky Lin

On 06/28/2012 03:59 PM, Jon Hunter wrote:


On 06/28/2012 05:53 PM, Franky Lin wrote:

I found one interesting thing. When I added prints to see when
runtime_suspend/resume get called, it seems like the suspend/resume calls are
unbalanced during boot. Resume got called more often than suspend. So I hacked
the code to make sure suspend and resume are called in pairs. A resume
without a preceding suspend will do nothing and return immediately. This also
makes the hang vanish.


I am not 100% sure I follow. On boot I would expect to see a
resume/suspend due to the probe on the irq bank and then I would expect
to see another resume from the acquisition of the gpio, however, I would
not expect a suspend until the gpio is freed, which I don't believe you
are doing.

Can you share your hack? Just paste the diff? This may help me
understand more.



OK.
This is what I saw in the log:
[0.171844] dummy:
[0.172912] NET: Registered protocol family 16
[0.173431] GPMC revision 6.0
[0.173492] gpmc: irq-52 could not claim: err -22
[0.177551] ??omap_gpio_runtime_resume
[0.178619] OMAP GPIO hardware version 0.1
[0.178649] !omap_gpio_runtime_suspend
[0.178771] ??omap_gpio_runtime_resume
[0.179351] !omap_gpio_runtime_suspend
[0.179504] ??omap_gpio_runtime_resume
[0.180023] !omap_gpio_runtime_suspend
[0.180145] ??omap_gpio_runtime_resume
[0.180694] !omap_gpio_runtime_suspend
[0.180847] ??omap_gpio_runtime_resume
[0.181365] !omap_gpio_runtime_suspend
[0.181518] ??omap_gpio_runtime_resume
[0.182037] !omap_gpio_runtime_suspend
[0.185089] omap_mux_init: Add partition: #1: core, flags: 2
[0.186462] omap_mux_init: Add partition: #2: wkup, flags: 2
[0.186584] error setting wl12xx data: -38
[0.189788] _omap_mux_get_by_name: Could not find signal uart1_rx.uart1_rx
[0.189788] _omap_mux_get_by_name: Could not find signal uart1_rx.uart1_rx

[0.239501] ??omap_gpio_runtime_resume
[0.239532] ??omap_gpio_runtime_resume
[0.241058]  usbhs_omap: alias fck already exists
[0.244781] ??omap_gpio_runtime_resume

diff --git a/drivers/gpio/gpio-omap.c b/drivers/gpio/gpio-omap.c
index c4ed172..bca3985 100644
--- a/drivers/gpio/gpio-omap.c
+++ b/drivers/gpio/gpio-omap.c
@@ -1146,7 +1146,7 @@ static int __devinit omap_gpio_probe(struct platform_device *pdev)

 #if defined(CONFIG_PM_RUNTIME)
 static void omap_gpio_restore_context(struct gpio_bank *bank);
-
+static int flag = 0;
 static int omap_gpio_runtime_suspend(struct device *dev)
 {
 	struct platform_device *pdev = to_platform_device(dev);
@@ -1155,6 +1155,8 @@ static int omap_gpio_runtime_suspend(struct device *dev)
 	unsigned long flags;
 	u32 wake_low, wake_hi;

+	flag ++;
+
 	spin_lock_irqsave(&bank->lock, flags);

 	/*
@@ -1221,6 +1223,11 @@ static int omap_gpio_runtime_resume(struct device *dev)
 	u32 l = 0, gen, gen0, gen1;
 	unsigned long flags;

+	if (flag)
+		flag--;
+	else
+		return 0;
+
 	spin_lock_irqsave(&bank->lock, flags);
 	_gpio_dbck_enable(bank);
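
The pairing logic of the hack can be modelled outside the kernel as follows. This is only a sketch of the counter behaviour: hack_suspend(), hack_resume(), pending_suspends and resume_bodies_run are hypothetical names, and the return value here reports whether the resume body would have run, which is not what the real callback returns.

```c
#include <assert.h>

static int pending_suspends;   /* mirrors the 'flag' counter in the diff above */
static int resume_bodies_run;  /* how often the real resume work would run */

/* Every suspend makes one later resume eligible to do real work. */
static void hack_suspend(void)
{
	pending_suspends++;
}

/* Returns 1 if the resume body would run, 0 if the call was skipped. */
static int hack_resume(void)
{
	if (pending_suspends)
		pending_suspends--;
	else
		return 0;	/* resume without a preceding suspend: do nothing */

	resume_bodies_run++;
	return 1;
}
```

Note that, as in the diff, the counter is a single file-scope variable, so it is shared by all gpio banks rather than tracked per bank.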

Regards,
Franky



Re: Panda ES board hang when using GPIO as interrupt

2012-06-28 Thread Franky Lin

On 06/28/2012 02:55 PM, Jon Hunter wrote:

Ok. Any way to manually reset the wlan module to deactivate the gpio
when it is hung? I am wondering if the gpio is deactivated if the board
comes back to life, indicating it is stuck in the interrupt somewhere.


The only way I can think of is removing the module manually. But it
didn't bring the board back to life.



Well, at least that is consistent with what I see, but it is also perplexing
that it takes some time to fail. Can you try the following debug patch to
see if the context restore is the problem? From your testing and bisect, the
only possible difference in the current kernel is that it could perform the
context restore when acquiring the gpio.

diff --git a/drivers/gpio/gpio-omap.c b/drivers/gpio/gpio-omap.c
index c4ed172..a2401bd 100644
--- a/drivers/gpio/gpio-omap.c
+++ b/drivers/gpio/gpio-omap.c
@@ -1341,6 +1341,8 @@ void omap2_gpio_resume_after_idle(void)
 #if defined(CONFIG_PM_RUNTIME)
 static void omap_gpio_restore_context(struct gpio_bank *bank)
 {
+	return;
+
 	__raw_writel(bank->context.wake_en,
 				bank->base + bank->regs->wkup_en);
 	__raw_writel(bank->context.ctrl, bank->base + bank->regs->ctrl);



This one works! It can run more than 20 mins.

I found one interesting thing. When I added prints to see when
runtime_suspend/resume get called, it seems like the suspend/resume calls are
unbalanced during boot. Resume got called more often than suspend. So I hacked
the code to make sure suspend and resume are called in pairs. A resume
without a preceding suspend will do nothing and return immediately. This also
makes the hang vanish.


Regards,
Franky




Re: Panda ES board hang when using GPIO as interrupt

2012-06-28 Thread Franky Lin

On 06/28/2012 08:42 AM, Jon Hunter wrote:


On 06/27/2012 07:41 PM, Franky Lin wrote:

On 06/26/2012 08:37 PM, Kevin Hilman wrote:

"Franky Lin"  writes:

I noticed Kevin raised some similar cases on other platforms and also
provided two patches in the patch mail thread. But unfortunately those
two patches don't help in our case. I tested the driver with 3.5-rc3
mainline kernel and the issue is still there. I can only "fix" the
hang by either reverting the commit or disabling
CONFIG_PM_RUNTIME. Also, the hang only happens on Panda ES board. Old
Panda with 4430 works fine.

Any thoughts and suggestions?


If reverting the patch fixes your problem, can you isolate down to which
part of that patch causes the problem?  IOW, can you fix your problem if
you undo just the hunk added in runtime_suspend or undo just the moved
hunk runtime_resume?  Or is reverting both required?

I suspect the added runtime_suspend hunk is causing the problems, so can
you see if just undoing that part works[1].  If that works, I will give
it a bit more thought tomorrow.


The runtime_suspend hunk is fine. The hang still exists after reverting it.
The culprit is the moved hunk in runtime_resume. Reverting it makes the
hang disappear.


Thanks. From reviewing the code the only thing that appears suspect based
upon your findings is the return if we find the context has not been lost.
We are not checking if "workaround_enabled" is set before we return.

Could you try the following change on top of v3.5-rc3?



The patch doesn't help. I also managed to probe the signal: it was
active when the board hung.



Also, could you add a print in the runtime_suspend/resume() functions so
we can see how often these are being called. In my case, I really don't see
these being exercised and I am wondering how often you see suspend/resume
being called in your setup.


Well, the runtime_suspend/resume never get called during the test.

Thanks,
Franky



Re: Panda ES board hang when using GPIO as interrupt

2012-06-27 Thread Franky Lin

On 06/27/2012 04:43 PM, Jon Hunter wrote:

Hi Franky,

On 06/25/2012 03:52 PM, Franky Lin wrote:

Hi Kevin, Tarun,

We are using the expansion connector A on Panda board to mount a SDIO
WiFi dongle on MMC2 with a level triggered interrupt signal connected to
GPIO 138. It had been working fine until 3.5-rc1. The board hangs randomly
within 5 mins during a network traffic test. After bisecting we found
the culprit is "[PATCH 8/8] gpio/omap: fix missing check in
*_runtime_suspend()" [1].


I have been looking into this today to see if I can replicate the
problem that you have reported. However, so far I have not had any luck.
Please note that my test setup is not exactly the same as yours as I
don't have your wlan module. However, I have been using a 2nd board to
generate gpio events to a panda-es to see if I can make it lock up. I have
tried mainline kernel 3.5-rc1 and 3.5-rc3 but I have not seen any
problems after sending 100k gpio events (over many minutes). My setup is
as follows ...

- OMAP4460 panda-es with gpio-138 connected to OMAP3430 beagle gpio-11.
- Mainline kernel 3.5-rc1/3 using omap2plus_defconfig (no changes)
- Created a simple kernel module that acquires gpio-138 and sets up a
   IRQ with flag IRQF_TRIGGER_HIGH (for active high level interrupt).
- GPIO events are triggered roughly every 1ms


Don't know if it's related, but we also mux several other pins on 
connector A:

/* MMC2 Mux for extension board */
/* MMC2 CMD */
OMAP4_MUX(GPMC_NWE, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
/* MMC2 CLK */
OMAP4_MUX(GPMC_NOE, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
/* MMC2 DAT 0-3 */
OMAP4_MUX(GPMC_AD0, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
OMAP4_MUX(GPMC_AD1, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
OMAP4_MUX(GPMC_AD2, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
OMAP4_MUX(GPMC_AD3, OMAP_MUX_MODE1 | OMAP_PIN_INPUT_PULLUP),
/* GPIO MUX for OOB interrupt of dongle */
OMAP4_MUX(MCSPI1_CS1, OMAP_MUX_MODE3 | OMAP_PIN_INPUT_PULLDOWN),
/* GPIO MUX for WLAN_ENABLE for dongle */
OMAP4_MUX(MCSPI1_CLK, OMAP_MUX_MODE3 | OMAP_PIN_OUTPUT),


Can you confirm ...
1. You are just using omap2plus_defconfig with no changes?

No, we enable the following options:
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_USB_OHCI_HCD=y


2. Rough frequency of gpio events?

3367 interrupts were triggered during a 10 secs throughput test.


3. Is the gpio configured for active low or high?

active high


4. When the hang occurs, what is the state of the gpio? Active or
inactive? Can you probe it with a scope? If it was always active I
could see that this would lock the device up, but I am not sure how
that would relate to the results from your bisect???


I don't have a scope nearby. Let me see if I can find one tomorrow.


I noticed Kevin raised some similar cases on other platforms and also
provided two patches in the patch mail thread. But unfortunately those
two patches don't help in our case. I tested the driver with 3.5-rc3
mainline kernel and the issue is still there. I can only "fix" the hang
by either reverting the commit or disabling CONFIG_PM_RUNTIME. Also, the
hang only happens on Panda ES board. Old Panda with 4430 works fine.


It does not make sense to me yet why this would only impact 4460, but I
will keep this in mind.

In your wlan driver are you acquiring and freeing the gpio often? Or are
you only acquiring the gpio on boot?

The reason I ask is because for omap4, it seems that we are not
currently calling omap2_gpio_prepare_for_idle() during idle and so the
only time I see us call the runtime_suspend/resume handlers for omap4 is
during probe and when we acquire and free the gpio.

So if you were not acquiring and freeing the gpio and are using the
stock kernel, then as far as I can tell, the runtime pm code is not
being exercised much. My test is not acquiring and releasing the gpio
and so I am wondering if that is the secret to reproducing this problem :-)


We only request the irq once during initialization. But we do frequently
disable and re-enable it, since we need to access the module through
SDIO to clear the interrupt, and we can't do all of that in the irq
handler.
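
To make the pattern concrete, here is a rough user-space model of it. It is a sketch only and does not reflect the actual wlan driver code: struct line_model and the model_*() functions are hypothetical, and the comments name disable_irq_nosync()/enable_irq() merely as the kernel calls that such a pattern would typically use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a level-triggered interrupt line. */
struct line_model {
	bool asserted;     /* level driven by the device */
	bool irq_enabled;  /* whether the host has the IRQ unmasked */
	int handled;       /* times the hard handler ran */
};

/* Hard IRQ handler: the slow bus cannot be used here, so mask and defer. */
static void model_irq_handler(struct line_model *l)
{
	l->irq_enabled = false;  /* stands in for disable_irq_nosync() */
	l->handled++;
}

/*
 * Deliver the interrupt while the line is high and unmasked.
 * max_iterations bounds the loop in case the handler never masks.
 */
static int model_deliver(struct line_model *l, int max_iterations)
{
	int fired = 0;

	while (l->asserted && l->irq_enabled && fired < max_iterations) {
		model_irq_handler(l);
		fired++;
	}
	return fired;
}

/* Deferred work: clear the source over the slow bus, then unmask. */
static void model_deferred_work(struct line_model *l)
{
	l->asserted = false;    /* device deasserts once serviced */
	l->irq_enabled = true;  /* stands in for enable_irq() */
}
```

In this model the handler runs exactly once per event even though the line stays high until the deferred work has cleared the source.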


Hope this helps.

Regards,
Franky



Re: Panda ES board hang when using GPIO as interrupt

2012-06-27 Thread Franky Lin

On 06/26/2012 08:37 PM, Kevin Hilman wrote:

"Franky Lin"  writes:

I noticed Kevin raised some similar cases on other platforms and also
provided two patches in the patch mail thread. But unfortunately those
two patches don't help in our case. I tested the driver with 3.5-rc3
mainline kernel and the issue is still there. I can only "fix" the
hang by either reverting the commit or disabling
CONFIG_PM_RUNTIME. Also, the hang only happens on Panda ES board. Old
Panda with 4430 works fine.

Any thoughts and suggestions?


If reverting the patch fixes your problem, can you isolate down to which
part of that patch causes the problem?  IOW, can you fix your problem if
you undo just the hunk added in runtime_suspend or undo just the moved
hunk runtime_resume?  Or is reverting both required?

I suspect the added runtime_suspend hunk is causing the problems, so can
you see if just undoing that part works[1].  If that works, I will give
it a bit more thought tomorrow.


The runtime_suspend hunk is fine. The hang still exists after reverting it.
The culprit is the moved hunk in runtime_resume. Reverting it makes the
hang disappear.




Thanks for reporting the problem!   Bug reports like this that have
clearly been thoroughly researched and bisected are greatly appreciated!

Kevin



You are welcome.

Regards,
Franky



Re: Panda ES board hang when using GPIO as interrupt

2012-06-26 Thread Franky Lin

On 06/26/2012 12:21 AM, DebBarma, Tarun Kanti wrote:

On Tue, Jun 26, 2012 at 2:22 AM, Franky Lin  wrote:

Hi Kevin, Tarun,

We are using the expansion connector A on Panda board to mount a SDIO WiFi
dongle on MMC2 with a level triggered interrupt signal connected to GPIO
138. It had been working fine until 3.5-rc1. The board hangs randomly within 5
mins during a network traffic test. After bisecting we found the culprit is
"[PATCH 8/8] gpio/omap: fix missing check in *_runtime_suspend()" [1].

I noticed Kevin raised some similar cases on other platforms and also
provided two patches in the patch mail thread. But unfortunately those two
patches don't help in our case. I tested the driver with 3.5-rc3 mainline
kernel and the issue is still there. I can only "fix" the hang by either
reverting the commit or disabling CONFIG_PM_RUNTIME. Also, the hang only
happens on Panda ES board. Old Panda with 4430 works fine.

Any thoughts and suggestions?

I just had a quick look at the code. Can you please check if the
attached patch solves
the issue? I just boot tested on Panda and Blaze.
--
Tarun



Thanks for the prompt reply.

Booting is fine even without the patch and revert. The wifi dongle
generates an interrupt whenever there is a data packet available for the
host to read. So during a traffic test a significant number of interrupts
will be triggered through the GPIO. So I assume it has something to do with
the interrupt GPIO.


With the patch, the kernel still crashes. But the symptom is slightly
different. Now it has a panic log every time. See attachment.


Regards,
Franky
[  636.143585] Internal error: Oops - undefined instruction: 0 [#1] SMP ARM
[  636.150634] Modules linked in: brcmfmac brcmutil cfg80211
[  636.156311] CPU: 0    Not tainted  (3.5.0-rc4+ #3)
[  636.161346] PC is at __lock_acquire+0x65c/0x1d88
[  636.166198] LR is at 0x6093
[  636.169494] pc : []lr : [<6093>]psr: 2093
[  636.169494] sp : c06b1e18  ip : 9e370001  fp : c0724f70
[  636.181549] r10: c06b  r9 : 001e  r8 : c0b92998
[  636.187042] r7 : c06d2cc8  r6 :   r5 : c0746d64  r4 : c06d2868
[  636.193908] r3 : 3b0e  r2 : ec3b001d  r1 : 0001d870  r0 : 001d
[  636.200744] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[  636.208526] Control: 10c53c7d  Table: ae39c04a  DAC: 0017
[  636.214569] Process swapper/0 (pid: 0, stack limit = 0xc06b02f8)
[  636.220855] Stack: (0xc06b1e18 to 0xc06b2000)
[  636.225433] 1e00:   c06d00f8 0002
[  636.234039] 1e20: c0807968 0001  0002 001d  0001 0001d870
[  636.242614] 1e40: c08070e8 0001  0002 0002   c00903e4
[  636.251220] 1e60: 0002 0080  c0066838   6093
[  636.259796] 1e80: 6093  c06b4324 c06b

Panda ES board hang when using GPIO as interrupt

2012-06-25 Thread Franky Lin

Hi Kevin, Tarun,

We are using the expansion connector A on Panda board to mount a SDIO
WiFi dongle on MMC2 with a level triggered interrupt signal connected to
GPIO 138. It had been working fine until 3.5-rc1. The board hangs randomly
within 5 mins during a network traffic test. After bisecting we found
the culprit is "[PATCH 8/8] gpio/omap: fix missing check in
*_runtime_suspend()" [1].


I noticed Kevin raised some similar cases on other platforms and also
provided two patches in the patch mail thread. But unfortunately those
two patches don't help in our case. I tested the driver with 3.5-rc3
mainline kernel and the issue is still there. I can only "fix" the hang
by either reverting the commit or disabling CONFIG_PM_RUNTIME. Also, the
hang only happens on Panda ES board. Old Panda with 4430 works fine.


Any thoughts and suggestions?

Thanks,
Franky

[1] http://article.gmane.org/gmane.linux.ports.arm.omap/75708/
