SA110/SA1100 possible bug or kernel bug? (long) ...

Nicolas Pitre Sat, 12 Jun 1999 05:34:24 -0700
Hi!

We develop an application that runs on ARM Linux on the SA1100.  In some
rare cases, it gets killed with a segmentation fault for no apparent
reason.  See the following:

lpn: memory violation at pc=0x020140c0, lr=0x020282f0 
(bad address=0x020140c0, code 1)
pc : [<020140c0>]    lr : [<020282f0>]
sp : befffa74  ip : befffa3c  fp : befffa8c
r10: 00000000  r9 : 0208ec9c  r8 : 00000000
r7 : 01fa582c  r6 : 020f4090  r5 : 020adf88  r4 : 00001800
r3 : 00000000  r2 : 00000000  r1 : 00000000  r0 : 00001800
Flags: nzCv  IRQs on  FIQs on  Mode USER_32  Segment user
Control: C018D17D  Table: C018D17D  DAC: 00000015


The code around the fault is:

...
 201407c:       e5950000        ldr     r0, [r5]
 2014080:       e595100c        ldr     r1, [r5, #12]
 2014084:       e3a02000        mov     r2, #0
 2014088:       eb005062        bl      2028218 <__lseek>
 201408c:       e5950000        ldr     r0, [r5]
 2014090:       e2861004        add     r1, r6, #4
 2014094:       e1a02004        mov     r2, r4
 2014098:       eb005082        bl      20282a8 <__read>
 201409c:       e2504000        subs    r4, r0, #0
 20140a0:       aa000005        bge     20140bc <GetBuffer+0xac>
 20140a4:       e59f000c        ldr     r0, 20140b8 <GetBuffer+0xa8>
 20140a8:       eb008de2        bl      2037838 <perror>
 20140ac:       e1a00006        mov     r0, r6
 20140b0:       eb009f83        bl      203bec4 <__libc_free>
 20140b4:       eaffffec        b       201406c <GetBuffer+0x5c>
 20140b8:       0207a778        andeq   sl, r7, #31457280
 20140bc:       0a000005        beq     20140d8 <GetBuffer+0xc8>
 20140c0:       e5864000        str     r4, [r6]
 20140c4:       e595300c        ldr     r3, [r5, #12]
 20140c8:       e0833004        add     r3, r3, r4
 20140cc:       e585300c        str     r3, [r5, #12]
 20140d0:       e1a00006        mov     r0, r6
 20140d4:       e91ba870        ldmdb   fp, {r4, r5, r6, fp, sp, pc}
 20140d8:       e1a00006        mov     r0, r6
 20140dc:       eb009f78        bl      203bec4 <__libc_free>
 20140e0:       e1a00004        mov     r0, r4
 20140e4:       e91ba870        ldmdb   fp, {r4, r5, r6, fp, sp, pc}


What is really weird about it is that the faulting address is equal
to the pc.  However, r6, which the str uses as its base register,
actually contains a valid address that is quite different from the pc.
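
As a cross-check on the disassembly, here is a minimal C sketch that
pulls the register fields out of an ARM single data transfer (LDR/STR)
instruction word.  The struct and function names are hypothetical and
only for illustration; the bit positions are the standard ARM encoding.
Applied to the word at 20140c0 (e5864000), it gives base register 6 and
source register 4 with a zero offset, i.e. str r4, [r6].

```c
#include <stdint.h>

/* Fields of an ARM single data transfer (LDR/STR) instruction word.
 * Names are hypothetical, chosen for this illustration only. */
struct arm_ldst {
    int is_ldst;     /* 1 if bits 27..26 == 01 (LDR/STR class) */
    int is_load;     /* L bit (bit 20): 1 = LDR, 0 = STR */
    unsigned rn;     /* base register, bits 19..16 */
    unsigned rd;     /* source/destination register, bits 15..12 */
    unsigned imm12;  /* bits 11..0; an immediate offset when bit 25 is 0 */
};

/* Extract the fields; the caller should check is_ldst before trusting
 * the rest, since other instruction classes reuse these bit positions. */
static struct arm_ldst decode_ldst(uint32_t insn)
{
    struct arm_ldst d;

    d.is_ldst = ((insn >> 26) & 0x3u) == 0x1u;
    d.is_load = (int)((insn >> 20) & 0x1u);
    d.rn      = (insn >> 16) & 0xfu;
    d.rd      = (insn >> 12) & 0xfu;
    d.imm12   = insn & 0xfffu;
    return d;
}
```

Calling decode_ldst(0xe5864000) and inspecting rn is enough to confirm
that the encoded base register is r6 and not r15 (the pc).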

This code is heavily used but crashes once in a while, randomly and not
always at the same place.  The dump above is only one example.  I tried to
reproduce the same behavior with simple test code, but without any
success.  However, the code above is part of an audio player, and how
quickly it crashes depends on the sample rate of the input data.  For
example: if the audio is sampled at 44100 Hz, the program crashes within
a few minutes.  Otherwise it may take hours before it happens.  In all
cases the program allocates the same amount of memory, etc.  So it looks
as if the right timing is required.

I also experienced one egcs segfault for no apparent reason -- I
just restarted make and it didn't crash again.  Is it related?  I don't
know, since I didn't see the register dump.  But in all cases in my
program, the faulting address was always the same as the pc, while the
actual code and register dump have nothing to do with a memory access to
the pc value.

The application for which I can easily reproduce the bug makes heavy use
of threads and malloc()/free(), and runs in a low-memory environment
(4 MB).  But even if memory is low, there is always around 1 to 2 MB of
free RAM.  If I run it on a board with 16 MB of RAM, I'm not able to
reproduce the bug anymore.  So it seems to have something to do with heavy
paging/scheduling.  Also, 75% of all crashes happen in glibc's chunk_alloc,
where page faults are likely to take place.

In all cases, do_page_fault() was always called from do_DataAbort(), which
is surprising since the offending address is actually the pc address.

So I started to imagine a CPU bug.  Two possibilities:  

1) In some situations, the CPU generates a data abort exception instead of
the prefetch abort exception it should raise.  This would explain why the
faulting address is equal to the pc.  And since this happens in the middle
of a page and there is no way to jump exactly there from another page, this
should happen right after a context switch.  However, the data abort
handler gets the offending memory address from the FAR register, but the
documentation says that it is used only for data abort exceptions.  So is
the FAR updated for prefetch abort exceptions too?  If not, this might not
be a wrongly identified prefetch exception but really a data abort
exception.  And since the data abort handler subtracts 8 from the pc
instead of 4, the pc and faulting address shouldn't match.

2) In some situations, maybe when the process is restarted after a context
switch or similar, the str opcode uses the pc register instead of r6 as
the base register for the store.  This would fault since the text segment
is mapped read-only.  But if the pc register had actually been used, its
value would have been 8 bytes ahead of the instruction's address, which
isn't the case here.
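
The offset arguments in both possibilities can be written out as a small
C sketch.  The function names are hypothetical; the constants are the
architectural ones (on a data abort the banked lr holds the aborted
instruction's address + 8, on a prefetch abort it holds address + 4, and
an instruction that reads the pc as an operand sees its own address + 8).

```c
#include <stdint.h>

/* Hypothetical helpers illustrating the pc offsets discussed above. */

/* lr as the CPU sets it when a prefetch abort hits the instruction at insn. */
static uint32_t lr_after_prefetch_abort(uint32_t insn)
{
    return insn + 4u;
}

/* What a data abort handler reports as the faulting pc: it assumes
 * lr == insn + 8 and therefore subtracts 8. */
static uint32_t pc_reported_by_data_abort(uint32_t lr_abt)
{
    return lr_abt - 8u;
}

/* Value an instruction at insn sees if it reads the pc as a base register. */
static uint32_t pc_as_operand(uint32_t insn)
{
    return insn + 8u;
}
```

With the instruction at 0x020140c0, a prefetch abort misreported as a
data abort would be logged with pc 0x020140bc, one opcode early, and a
store that wrongly used the pc as its base would target 0x020140c8.
Neither matches the observed pc == bad address == 0x020140c0.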

Kernel bug?  Unlikely.  After looking at the code over and over I can't
see where it could make the faulting address equal to the pc value.  The
same problem shows up with a 2.0.35 kernel too.

Application bug?  It's still possible, but I'm unable to explain how
it could happen randomly all over the place and always with a faulting
address == pc but without any code/register values corresponding to such
an access.

So I tried this patch:

--- ../2.2.9-rmk2_o/linux/arch/arm/mm/fault-common.c    Fri May 28 13:47:18 1999
+++ linux/arch/arm/mm/fault-common.c  Fri Jun 11 21:45:52 1999
@@ -151,6 +151,14 @@

        /* User mode accesses just cause a SIGSEGV */
        if (mode & FAULT_CODE_USER) {
+               if( addr == regs->ARM_pc ) {
+                       /* HACK: convert write fault into read fault */
+                       down(&mm->mmap_sem);
+                       printk( "do_page_fault(): HACK TRIGGERED\n" );
+                       mode |= FAULT_CODE_READ;
+                       goto good_area;
+               }
+
                tsk->tss.error_code = mode;
                tsk->tss.trap_no = 14;
 #ifdef CONFIG_DEBUG_USER


Tadam!  My application now runs reliably for hours where it crashed in
about one minute before.  The hack is triggered once in a while, but not
often.  So if possibility 1 is true, then the missing text page is loaded
into memory and execution is happily resumed, with the possibility that
the pc might be one opcode before where it should be.  If it's
possibility 2 instead, then simply ignoring the fault and resuming
execution might be enough, and the fault would have been caused by a very
subtle bug that shows up in some circumstances only, but I haven't
verified this yet.  I also haven't tried to find more similarities between
the crash cases, i.e. opcode, registers involved, etc.  But there is
surely something to pinpoint there.
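
For sorting through a pile of crash logs, a rough C sketch like the
following can flag the dumps that show this pattern.  It assumes the
text contains "pc=0x..." and "bad address=0x..." as in the dump at the
top of this message; the helper names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Find `key` in `text` and parse the hex number right after it.
 * Returns 1 and stores the value in *out on success, 0 otherwise. */
static int hex_after(const char *text, const char *key, unsigned long *out)
{
    const char *p = strstr(text, key);
    char *end;

    if (p == NULL)
        return 0;
    p += strlen(key);
    *out = strtoul(p, &end, 16);   /* base 16 accepts an optional 0x prefix */
    return end != p;
}

/* Returns 1 if the dump's bad address equals its pc, 0 if they differ,
 * and -1 if either field could not be found. */
static int fault_addr_equals_pc(const char *dump)
{
    unsigned long pc, bad;

    if (!hex_after(dump, "pc=", &pc) ||
        !hex_after(dump, "bad address=", &bad))
        return -1;
    return pc == bad;
}
```

Feeding it the two "memory violation" lines from each crash would
separate the pc == bad address cases from ordinary segfaults before
comparing opcodes and registers by hand.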

I don't know if this has some similarities with the obscure bug some
people experienced on the Netwinder.  It might be an SA1100 issue only.
But if anyone has any experience/comments/ideas/explanation for this
problem and what could be the best fix for it, please speak up.



Nicolas Pitre, B. ing.
[EMAIL PROTECTED]


