Hi!
We are developing an application that runs on ARM Linux on the SA1100. In
rare cases it gets killed by a segmentation fault for no apparent reason.
Here is one example:
lpn: memory violation at pc=0x020140c0, lr=0x020282f0
(bad address=0x020140c0, code 1)
pc : [<020140c0>] lr : [<020282f0>]
sp : befffa74 ip : befffa3c fp : befffa8c
r10: 00000000 r9 : 0208ec9c r8 : 00000000
r7 : 01fa582c r6 : 020f4090 r5 : 020adf88 r4 : 00001800
r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : 00001800
Flags: nzCv IRQs on FIQs on Mode USER_32 Segment user
Control: C018D17D Table: C018D17D DAC: 00000015
The code around the fault is:
...
201407c: e5950000 ldr r0, [r5]
2014080: e595100c ldr r1, [r5, #12]
2014084: e3a02000 mov r2, #0
2014088: eb005062 bl 2028218 <__lseek>
201408c: e5950000 ldr r0, [r5]
2014090: e2861004 add r1, r6, #4
2014094: e1a02004 mov r2, r4
2014098: eb005082 bl 20282a8 <__read>
201409c: e2504000 subs r4, r0, #0
20140a0: aa000005 bge 20140bc <GetBuffer+0xac>
20140a4: e59f000c ldr r0, 20140b8 <GetBuffer+0xa8>
20140a8: eb008de2 bl 2037838 <perror>
20140ac: e1a00006 mov r0, r6
20140b0: eb009f83 bl 203bec4 <__libc_free>
20140b4: eaffffec b 201406c <GetBuffer+0x5c>
20140b8: 0207a778 andeq sl, r7, #31457280
20140bc: 0a000005 beq 20140d8 <GetBuffer+0xc8>
20140c0: e5864000 str r4, [r6]
20140c4: e595300c ldr r3, [r5, #12]
20140c8: e0833004 add r3, r3, r4
20140cc: e585300c str r3, [r5, #12]
20140d0: e1a00006 mov r0, r6
20140d4: e91ba870 ldmdb fp, {r4, r5, r6, fp, sp, pc}
20140d8: e1a00006 mov r0, r6
20140dc: eb009f78 bl 203bec4 <__libc_free>
20140e0: e1a00004 mov r0, r4
20140e4: e91ba870 ldmdb fp, {r4, r5, r6, fp, sp, pc}
What is really weird is that the faulting address is equal to the pc.
However, r6, which the str uses as its base register, actually contains a
valid address that is quite different from the pc.
This code is heavily used but crashes randomly once in a while, and not
always at the same place. The dump above is only one example. I tried to
reproduce the behavior with simple test code, without success. However,
the code above is part of an audio player, and how soon it crashes
depends on the sample rate of the input data. For example, if the audio
is sampled at 44100 Hz, the program crashes within a few minutes.
Otherwise it may take hours before it happens. In all cases the program
allocates the same amount of memory, etc. So it looks as if specific
timing is required.
I also experienced one egcs segfault for no apparent reason. I just
restarted make and it didn't crash again. Is it related? I don't know,
since I didn't see the register dump. But in every crash of my program the
faulting address was the same as the pc, while the actual code and
register dump had nothing to do with a memory access to the pc value.
The application for which I can easily reproduce the bug makes heavy use
of threads and malloc()/free(), and runs in a low-memory environment (4 MB).
But even though memory is low, there is always around 1 to 2 MB of free RAM.
If I run it on a board with 16 MB of RAM, I'm not able to reproduce the bug
anymore. So it seems to have something to do with heavy
paging/scheduling. Also, 75% of all crashes happen in glibc's chunk_alloc,
where page faults are likely to take place.
In all cases, do_page_fault() was always called from do_DataAbort(), which
is surprising since the offending address is actually the pc address.
So I started to imagine a CPU bug. Two possibilities:
1) In some situations, the CPU generates a data abort exception instead of
a prefetch abort exception as it should. This would explain why the
faulting address is equal to the pc. And since this happens in the middle
of a page and there is no way to jump exactly there from another page, it
would have to happen right after a context switch. However, the data abort
handler gets the offending memory address from the FAR register, and the
documentation says that it is used only for data abort exceptions. So is
the FAR updated for prefetch abort exceptions too? If not, this might not
be a misidentified prefetch abort but really a data abort
exception. And since the data abort handler subtracts 8 from the pc
instead of 4, the pc and faulting address shouldn't match.
2) In some situations, maybe when the process is restarted after a context
switch or similar, the str opcode uses the pc register instead of r6 (in
this case) as the base address for the store. This
would fault since the text segment is mapped read-only. But if the
pc register were actually used, its value would be 8 bytes ahead of the
instruction's address, which isn't the case.
Kernel bug? Unlikely. After looking at the code over and over I can't
see where it could make the faulting address equal to the pc value. The
same problem shows up with a 2.0.35 kernel too.
Application bug? Still possible, but I'm unable to explain how
it could happen randomly all over the place, always with a faulting
address == pc, but without any code or register values corresponding to
such an access.
So I tried this patch:
--- ../2.2.9-rmk2_o/linux/arch/arm/mm/fault-common.c Fri May 28 13:47:18 1999
+++ linux/arch/arm/mm/fault-common.c Fri Jun 11 21:45:52 1999
@@ -151,6 +151,14 @@
/* User mode accesses just cause a SIGSEGV */
if (mode & FAULT_CODE_USER) {
+ if( addr == regs->ARM_pc ) {
+ /* HACK: convert write fault into read fault */
+ down(&mm->mmap_sem);
+ printk( "do_page_fault(): HACK TRIGGERED\n" );
+ mode |= FAULT_CODE_READ;
+ goto good_area;
+ }
+
tsk->tss.error_code = mode;
tsk->tss.trap_no = 14;
#ifdef CONFIG_DEBUG_USER
Tadam! My application now runs reliably for hours where it used to crash
within about one minute. The hack is triggered once in a while, but not
often. So if possibility 1 is true, then the missing text page is loaded
into memory and execution happily resumes, with the possibility that the
pc might be one opcode before where it should be. If it's rather
possibility 2, then simply ignoring the fault and resuming execution might
do it, and the fault would have happened because of a very subtle
bug that shows up only in some circumstances, but I haven't verified this
yet. I haven't tried to find more similarities between the crash cases
either, e.g. opcode, registers involved, etc. But there is surely something
to pinpoint there.
I don't know if this has any similarity with the obscure bug some people
experienced on the Netwinder. It might be an SA1100-only issue.
But if anyone has any experience, comments, ideas, or an explanation for
this problem and what the best fix for it could be, please speak up.
Nicolas Pitre, B. ing.
[EMAIL PROTECTED]