Re: Deadlock in do_page_fault() on ARM (old kernel)

2014-01-20 Thread Alan Ott

On 01/17/2014 08:20 PM, Russell King - ARM Linux wrote:

On Fri, Jan 17, 2014 at 07:57:16PM -0500, Alan Ott wrote:

On 01/17/2014 08:46 AM, Russell King - ARM Linux wrote:

My suspicion therefore is that some other thread must have died while
holding the mmap_sem, so there's probably a kernel oops earlier...
that's my best guess at the moment without seeing the full backtrace.

There's no oops that I'm able to see.

Each of the tasks which lockdep reports as holding mmap_sem are
blocking for it. If some other task had taken it and then crashed, I
assume lockdep would list the crashed task as also holding the resource
in the printout.

My point is this:

- the five (or six) threads which are trying to take the mmap_sem in
   read-mode in the fault handler are all blocked on it - they haven't
   taken the lock, which will only happen because there's a pending writer.
- of these in your original post, there are two which faulted from
   __copy_to_user_std().  __copy_to_user_std() doesn't take the mmap_sem -
   this is the non-uaccess-with-memcpy path.
- the pending writers are the two threads in sys_mmap_pgoff(), both of
   which are blocked waiting to gain the write lock.
- there are no *other* threads holding the mmap_sem lock.


Yes, all true. I don't remember why I started looking at the memcpy() case.


So... there's a question here how we got into this state - and frankly
I don't know.  What I do see from your latest dump is that there's two
unknown modules there - something called rcu2m and another called
buttoms, and there are two threads inside ioctls there.  Both have
faulted from the function at 0xc0d2a394 (which won't appear in the
backtrace, but is most likely __copy_to_user_std.)


Yes, there are a handful of out-of-tree modules.


So, in the absence of you saying anything about there being any preceding
oopses, my conclusion now is that one of those modules is taking the
mmap_sem itself, and is the culpret inducing this deadlock.


Yes, I came to that as well. I had checked for the presence of mmap_sem 
in the sources of the out-of-tree modules and didn't see it. However, 
upon closer inspection, my grep-fu failed me as there were some backward 
symlinks I didn't account for. TI's cmemk module _is_ taking out 
mmap_sem. I wish I had seen this days ago. That's my new investigation path.



Note that your dump ([2]) in your reply was just the hung task detector
printing out the stacktrace for a few tasks, not the full all-threads
stack dump which I was expecting.


Yes, in a misguided attempt to keep the SNR high, I didn't include the 
full dump, but only what I thought was the interesting part. I did 
another capture and the full dump is at [1] .



So I'm pulling out these conclusions from the very little information
you're supplying.


I appreciate it. Thank you for taking the time to reply.

Alan.

[1] 
http://www.signal11.us/~alan/stack_dump_all_tasks_with_frame_pointers.txt


--
To unsubscribe from this list: send the line unsubscribe linux-omap in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Deadlock in do_page_fault() on ARM (old kernel)

2014-01-17 Thread Alan Ott

On 01/17/2014 08:46 AM, Russell King - ARM Linux wrote:

On Wed, Jan 15, 2014 at 08:13:04PM -0500, Alan Ott wrote:

So my questions are:
1. Why don't I see a full backtrace beyond the exception stack? It's the
same when dump_stack() is called manually.

No idea - it looks like you're not using frame pointers, but are using
the unwinder.  Full backtraces can always be created with frame pointers,
it's just that unwinding seems unreliable.


Hi Russell,

I managed to get frame pointers turned on, which in this kernel version, 
seems like it requires the unwinder to be turned off[1].


When I built with frame pointers, the backtraces show differently. It 
only shows the frames which were _not_ shown with the unwinder. 
Backtrace at [2]. As you can see, it shows the non-exception stack dumps 
ending at places which can't page fault (and if it did page fault in 
those locations, it wouldn't try to take the mmap_sem lock). Its as 
though it shows what's in the modules but not what's in the image, where 
with the unwinder, it showed what's in the image and not what's in the 
modules[3].



I think we do need to see the full backtrace here - from looking at the
full state dump, I don't see any sign of the mmap_sem being held except
by an attempt to process a fault, and two threads trying to do a
sys_mmap_pgoff().


The lockdep printout at the end of [4] (which is the original log I sent 
the other day) shows do_page_fault() holding mmap_sem in six tasks and 
sys_mmap_pgoff() holding it in two. (I don't like the term held in the 
lockdep printout and believe it to mean held or waiting for, based on 
my analysis of the rw_semaphore code.)


Each of those tasks seems to be blocking for it.


My suspicion therefore is that some other thread must have died while
holding the mmap_sem, so there's probably a kernel oops earlier...
that's my best guess at the moment without seeing the full backtrace.


There's no oops that I'm able to see.

Each of the tasks which lockdep reports as holding mmap_sem are 
blocking for it. If some other task had taken it and then crashed, I 
assume lockdep would list the crashed task as also holding the resource 
in the printout.


Thanks for having a look at this,

Alan.

[1] It seems a little strange to me even in the newest kernel:
1: lib/Kconfig.debug specifies FRAME_POINTER . The text of this instance 
shows up when one searches for FRAME_POINTER in menuconfig. It can 
become selected even though the dependencies listed here are not met.
2: arch/arm/Kconfig.debug also specifies FRAME_POINTER, with a 
dependency only on !THUMB2_KERNEL, defaulting to yes on !ARM_UNWIND

3: ARM doens't set ARCH_WANT_FRAME_POINTERS

[2] http://www.signal11.us/~alan/stack_dump_with_frame_pointers.txt
Note: This is a sysrq dump of the locks held at lockup time and then the 
automatic hung task detection.


[3] The modules are being built in the standard way for out-of-tree:
$(MAKE) -C $(PRJDIR)/linux M=$(CWD) modules

[4] http://www.signal11.us/~alan/show-all-tasks-deadlock.txt


--
To unsubscribe from this list: send the line unsubscribe linux-omap in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Deadlock in do_page_fault() on ARM (old kernel)

2014-01-15 Thread Alan Ott

Hello,

I have a deadlock that I'm trying to understand. The symptom is multiple 
tasks trying to acquire a read lock (down_read()) on mm-mmap_sem in 
do_page_fault(). I'll be right up front and say that this is a fairly 
old kernel (2.6.37 TI PSP kernel) on a fairly old processor DaVinci 6446.


At the time of the deadlock, sysrq's show-all-tasks shows the following 
for three of the tasks which are deadlocked (there are more, but I just 
picked the interesting ones; the full output is at [1]):


uiD c0ea8208 0  1405   1293 0x
[c0ea8208] (schedule+0x33c/0x3c4) from [c0eaa3b4] 
(__down_read+0xbc/0xd4)
[c0eaa3b4] (__down_read+0xbc/0xd4) from [c0c0b378] 
(do_page_fault+0x94/0x248)
[c0c0b378] (do_page_fault+0x94/0x248) from [c0c052e0] 
(do_DataAbort+0x34/0x94)
[c0c052e0] (do_DataAbort+0x34/0x94) from [c0c05b0c] 
(__dabt_svc+0x4c/0x60)

Exception stack(0xc048dce8 to 0xc048dd30)
dce0:   400e9a94 c048ddb0 ffec  c048c000 
c048dda4
dd00: 400e9a94  ff92 c048c000  0001 0014 
c048dd34

dd20:  c0d1f68c 0013 
[c0c05b0c] (__dabt_svc+0x4c/0x60) from [c0d1f68c] 
(__copy_to_user_std+0xcc/0x3a8)


uiD c0ea8208 0  1406   1293 0x
[c0ea8208] (schedule+0x33c/0x3c4) from [c0eaa3b4] 
(__down_read+0xbc/0xd4)
[c0eaa3b4] (__down_read+0xbc/0xd4) from [c0c0b378] 
(do_page_fault+0x94/0x248)
[c0c0b378] (do_page_fault+0x94/0x248) from [c0c052e0] 
(do_DataAbort+0x34/0x94)
[c0c052e0] (do_DataAbort+0x34/0x94) from [c0c05f0c] 
(ret_from_exception+0x0/0x10)

Exception stack(0xc048ffb0 to 0xc048fff8)
ffa0: 0060 000a 00a8 
0010d000
ffc0: 00c23d80 00c23de8 405af06c  405af03c 405af074 0050 
01ff

ffe0: 405ae000 40185748 404f5c4c 404f393c 8010 

uiD c0ea8208 0  1411   1293 0x
[c0ea8208] (schedule+0x33c/0x3c4) from [c0eaa3b4] 
(__down_read+0xbc/0xd4)
[c0eaa3b4] (__down_read+0xbc/0xd4) from [c0c0b378] 
(do_page_fault+0x94/0x248)
[c0c0b378] (do_page_fault+0x94/0x248) from [c0c052e0] 
(do_DataAbort+0x34/0x94)
[c0c052e0] (do_DataAbort+0x34/0x94) from [c0c05f0c] 
(ret_from_exception+0x0/0x10)

Exception stack(0xc053bfb0 to 0xc053bff8)
bfa0:  0001 00ba3610 

bfc0:  00ba3610 00bb6020 00ba3610 40074000 00b91024 415e4930 
0583

bfe0: 00b611a0 415e38e0 4005f3e4 0fc0 6010 

 [snip] 

Showing all locks held in the system:
1 lock held by getty/1294:
 #0:  (tty-atomic_read_lock){+.+...}, at: [c0d45bf0] 
n_tty_read+0x21c/0x670

1 lock held by ui/1405:
 #0:  (mm-mmap_sem){++}, at: [c0c0b378] do_page_fault+0x94/0x248
1 lock held by ui/1406:
 #0:  (mm-mmap_sem){++}, at: [c0c0b378] do_page_fault+0x94/0x248
1 lock held by ui/1408:
 #0:  (mm-mmap_sem){++}, at: [c0c0b378] do_page_fault+0x94/0x248
1 lock held by ui/1409:
 #0:  (mm-mmap_sem){++}, at: [c0c0b378] do_page_fault+0x94/0x248
1 lock held by ui/1411:
 #0:  (mm-mmap_sem){++}, at: [c0c0b378] do_page_fault+0x94/0x248
1 lock held by ui/1416:
 #0:  (mm-mmap_sem){++}, at: [c0c6e604] sys_mmap_pgoff+0x70/0xc0
1 lock held by ui/1418:
 #0:  (mm-mmap_sem){++}, at: [c0c0b378] do_page_fault+0x94/0x248
1 lock held by ui/1420:
 #0:  (mm-mmap_sem){++}, at: [c0c6e604] sys_mmap_pgoff+0x70/0xc0
1 lock held by ui/1434:
 #0:  (tty-atomic_read_lock){+.+...}, at: [c0d45bf0] 
n_tty_read+0x21c/0x670



Note that above, do_page_fault() takes out a read lock (down_read()) and 
sys_mmap_pgoff() takes out a write lock (down_write()).


I've searched for this kind of problem and found two patches which seem 
to be related to this issue[2]. I have applied both with no better results.


So my questions are:
1. Why don't I see a full backtrace beyond the exception stack? It's the 
same when dump_stack() is called manually.
2. __copy_to_user_memcpy() takes a read lock (down_read()) on 
mm-mmap_sem. While that lock is held, __copy_to_user_memcpy() can 
generate a page fault, causing do_page_fault() to get called, which will 
also try to get a read lock (down_read()) on mm-mmap_sem. Multiple read 
locks can be taken on an rw_semaphore, but deadlock will occur if 
another thread tries to get a write lock (down_write()) in between. For 
example:

Task 1: Task 2:
down_read(sem)
down_write(sem)-- Goes to sleep
down_read(sem) -- Goes to sleep

There is a thread from 2005[3] which seems to discuss the same concept 
of recursive rw_semaphores, but for futexes.


Other comments:
1. My analysis of this probably wrong. Otherwise it seems many others 
would have the same problem, and they don't seem to. I'm hoping this 
email will help to correct my understanding.
2. I looked through the git logs for recent (since 2.6.37 time frame) 
and nothing else jumped out at me as being an obvious fix for this 
situation.


Thanks for any insight you can give,


Re: [Regression?] Removed regulator support in ehci-omap

2011-04-25 Thread Alan Ott
On 04/24/2011 02:37 AM, Keshava Munegowda wrote:
 -Original Message-
 From: Alan Ott [mailto:a...@signal11.us]
 Sent: Friday, April 22, 2011 6:41 AM
 To: Keshava Munegowda
 Cc: Dmitry Artamonow; Steve Sakoman; Steve Calfee; Felipe Balbi;
 linux-...@vger.kernel.org; linux-
 o...@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: [Regression?] Removed regulator support in ehci-omap

 On 04/12/2011 12:20 PM, Keshava Munegowda wrote:
 Ah, so EHCI/OHCI common code was moved into drivers/mfd... Good.
 But seems regulators support was lost somewhere during transition -
 the
 only mentioning about regulators in omap-usb-host.c is #include :

 mad@macmini:~/kernel-hack/linux-2.6(master)$ grep -i regulator
 drivers/mfd/omap-usb-host.c
 Thanks ! I will add some time next week!
 I have confirmed this on my BeagleBoard-xM. 2.6.38 works, but the
 2.6.39-rc4+ head from yesterday did not. It seems that the regulator
 does not turn the power on for the USB hub which the ethernet is
 attached to.

 I connected this issue to bug 33092 in bugzilla[1].

 Alan.

 [1] https://bugzilla.kernel.org/show_bug.cgi?id=33092
 I have posted the patches on April 22, 2011.
 If possible I request you to test this.

Hi Keshava,

If you mean this patch[1], then I tested it against the head and it
didn't fix my problem. I couldn't convince myself that your patch was
supposed to be the whole fix to my problem (since it wasn't mentioned on
this thread) so I didn't reply with my findings. Sorry about that.

Thanks for taking a look,

Alan.

[1] http://www.spinics.net/lists/linux-usb/msg45618.html




signature.asc
Description: OpenPGP digital signature


Re: [Regression?] Removed regulator support in ehci-omap

2011-04-21 Thread Alan Ott
On 04/12/2011 12:20 PM, Keshava Munegowda wrote:
 Ah, so EHCI/OHCI common code was moved into drivers/mfd... Good.
 But seems regulators support was lost somewhere during transition - the
 only mentioning about regulators in omap-usb-host.c is #include :

 mad@macmini:~/kernel-hack/linux-2.6(master)$ grep -i regulator
 drivers/mfd/omap-usb-host.c
 Thanks ! I will add some time next week!

I have confirmed this on my BeagleBoard-xM. 2.6.38 works, but the
2.6.39-rc4+ head from yesterday did not. It seems that the regulator
does not turn the power on for the USB hub which the ethernet is
attached to.

I connected this issue to bug 33092 in bugzilla[1].

Alan.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=33092



signature.asc
Description: OpenPGP digital signature