Curious guy not knowing much about radeon/video drivers but with a can-
debug approach trying to take a stab at this issue based on the
radeon.ko objdump disassembly provided by the bug reporter who happens
to be my friend : (please try the recommendation suggested at the end of
this report in r600.c, r600_cp_resume to see if it resolves the OOPs)

The kernel panic is a result of an invalid ring write pointer while updating a 
value to the radeon ring buffer.
The write pointer read from the radeon control register (r100_mm_rreq function 
in radeon.h) is returning an incorrect (or seemingly negative value on RESUME). 
Looks like we may have to add a retry on r600_cp_resume to make it work.

RCA enclosed below:
------------------

Mapping the Oops to the disassembly, its clear that the kernel panic was 
triggered by this instruction:
static inline void radeon_ring_write(struct radeon_device *rdev, uint32_t v)
{
#if DRM_DEBUG_CODE
        if (rdev->cp.count_dw <= 0) {
                DRM_ERROR("radeon: writting more dword to ring than expected 
!\n");
        }
#endif
        rdev->cp.ring[rdev->cp.wptr++] = v; ->PANICs here as rdev->cp.wptr 
seems to be negative
        rdev->cp.wptr &= rdev->cp.ptr_mask;
        rdev->cp.count_dw--;
        rdev->cp.ring_free_dw--;
}

Lets now map the above to the EIP first:

EIP at the time of the kernel panic was r600_cp_start+0x48
>From objdump disassembly, it maps to:

r600_cp_start: (0x709e8):
  709e8:   c7 02 00 44 05 c0       movl   $0xc0054400,(%edx)

Also it exactly matches the OOPs hex dump:
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] Code: c6 0f 85 fd 02 00 00 8b 
bb a4 07 00 00 85 ff 0f 8e d6 02 00 00 8b 83 94 07 00 00 8d 14 85 00 00 00 00 
83 c0 01 03 93 8c 07 00 00 <c7> 02 00 44 05 c0 8b 93 a4 07 00 00 23 83 b4 07 00 
00 83 ab a0

Refer to the instruction <c7> in angular brackets which represents the
faulting instruction and hexcodes also match the above EIP. (c7 02 00 44
05 c0 )

If you reverse engineer the code to the disassembly, the panic EIP is evident. 
>From the objdump, EBX holds the radeon_device pointer *rdev.

The ring buffer remaining count cp.dw_count is 16 and held in register
EDI.

The write pointer or rdev->cp.wptr index for the radeon ring buffer is stored 
in EAX. 
>From the panic this value is shown as 0 as EAX holds 0.

The ring buffer pointer that triggered the OOps is in EDX as seen also from the 
above target of the store.
EDX value from the OOPs is: 0xfa501ffc. And this is also the PTE entry that 
took the page fault as seen from the OOPs:

Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354151] BUG: unable to handle
kernel paging request at fa501ffc

movl $0xc0054400, (%edx)

The "C" call for the above store is

    radeon_ring_write(rdev, PACKET3(PACKET3_ME_INITIALIZE, 5));

from r600_cp_start. PACKET3(PACKET3_ME_INITIALIZE, 5) macro evaluates to 
0xc0054400.
So now we are dead sure that we had an incorrect radeon ring write pointer read 
from the register in r600_cp_resume:
before calling r600_cp_start:

        rdev->cp.rptr = RREG32(CP_RB_RPTR);
        rdev->cp.wptr = RREG32(CP_RB_WPTR);

Now from the assembly, the value of the write pointer is stored in EAX
at the time of the panic:

Taking a few instructions above the faulting instruction:

   709d2:       8b 83 94 07 00 00       mov    0x794(%ebx),%eax
   709d8:       8d 14 85 00 00 00 00    lea    0x0(,%eax,4),%edx
   709df:       83 c0 01                add    $0x1,%eax
   709e2:       03 93 8c 07 00 00       add    0x78c(%ebx),%edx
   709e8:       c7 02 00 44 05 c0       movl   $0xc0054400,(%edx)

0x794 offset of EBX (rdev pointer) is rdev->cp.wptr or the write index for the 
ring buffer.
We can see this is being moved to EAX. And the indexed absolute address is 
stored in EDX (the address that took the page fault as mentioned above)

Now we can see that:
add $0x1, %eax
happens BEFORE the movl or the faulting instruction.
In other words:
         rdev->cp.wptr++ after use of EAX. Unless we missed a speculative 
execution, there is no chance to miss execution of this increment. So for all 
intents and purposes, EAX cannot hold 0 since it was incremented before the 
fault.

Which implies that EAX was negative on reading from the radeon register
on a resume. And as a result, we indexed an invalid location into the
ring buffer (rdev->cp.ring) which expectedly triggered the kernel panic.

I am not sure if we have to re-read the values on resume if the value
returned is negative from the radeon register for the read and write
pointers.

You can try the following changes to see if it works:

In r600_cp_resume before calling r600_cp_start : (r600.c)

/*
 * Re-read the read and write if the value returned isn't sane. before calling 
r600_cp_start
*/
 do {
        rdev->cp.rptr = RREG32(CP_RB_RPTR);
        mdelay(15);
 } while((int)rdev->cp.rptr < 0);
 
 do {
        rdev->cp.wptr = RREG32(CP_RB_WPTR);
        mdelay(15);
 } while( (int)rdev->cp.wptr < 0 );


        r600_cp_start(rdev);

If the above still triggers the panic, then force a hackish reset for the write 
pointer in:
radeon.h,radeon_ring_write before updating the value in the ring buffer.

        if((int)rdev->cp.wptr < 0) rdev->cp.wptr = 0;
        rdev->cp.ring[rdev->cp.wptr++] = v; 

But I believe, the retry of the read and write register in
r600_cp_resume might make it work.

Regards,
-Karthick

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/820746

Title:
  Black screen on boot - Oops: 0002 [#1] SMP on Lenovo T400 laptop with
  radeon

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/820746/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to