[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-03-01 Thread che...@lemote.com
Status update:
In r600.c I found for RS780, num_*_threads are like this:
sq_thread_resource_mgmt = (NUM_PS_THREADS(79) |
   NUM_VS_THREADS(78) |
   NUM_GS_THREADS(4) |
   NUM_ES_THREADS(31));

But in the documentation, each of them should be a multiple of 4. And in
r600_blit_kms.c, they are 136, 48, 4, 4. I want to know why
79, 78, 4 and 31 are used here.
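(For reference, a standalone sketch of how these four counts pack into one register value; the field shifts 0/8/16/24 are assumed from the usual r600d.h-style NUM_*_THREADS macros, not quoted from the documentation.)

#include <stdio.h>

#define NUM_PS_THREADS(x) ((x) << 0)
#define NUM_VS_THREADS(x) ((x) << 8)
#define NUM_GS_THREADS(x) ((x) << 16)
#define NUM_ES_THREADS(x) ((x) << 24)

int main(void)
{
    unsigned int v = NUM_PS_THREADS(79) | NUM_VS_THREADS(78) |
                     NUM_GS_THREADS(4)  | NUM_ES_THREADS(31);

    /* prints 0x1f044e4f for the RS780 values quoted above */
    printf("sq_thread_resource_mgmt = 0x%08x\n", v);
    return 0;
}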

Huacai Chen

> On Wed, 2012-02-29 at 12:49 +0800, chenhc at lemote.com wrote:
>> > On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>> >> On 2012-02-17, at 5:27 PM, Chen Jie wrote:
>> >> >> One good way to test gart is to go over GPU gart table and write a
>> >> >> dword using the GPU at end of each page something like 0xCAFEDEAD
>> >> >> or somevalue that is unlikely to be already set. And then go over
>> >> >> all the page and check that GPU write succeed. Abusing the scratch
>> >> >> register write back feature is the easiest way to try that.
>> >> > I'm planning to add a GART table check procedure when resume, which
>> >> > will go over GPU gart table:
>> >> > 1. read(backup) a dword at end of each GPU page
>> >> > 2. write a mark by GPU and check it
>> >> > 3. restore the original dword
>> >> Attachment validateGART.patch do the job:
>> >> * It current only works for mips64 platform.
>> >> * To use it, apply all_in_vram.patch first, which will allocate CP
>> >> ring, ih, ib in VRAM and hard code no_wb=1.
>> >>
>> >> The gart test routine will be invoked in r600_resume. We've tried it,
>> >> and find that when lockup happened the gart table was good before
>> >> userspace restarting. The related dmesg follows:
>> >> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>> >> at 90004004, 32768 entries, Dummy
>> >> Page[0x0e004000-0x0e007fff]
>> >> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>> >> entries(valid=8544, invalid=24224, total=32768).
>> >> ...
>> >> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>> >> [ 1532.152343] Restarting tasks ... done.
>> >> [ 1544.468750] radeon :01:05.0: GPU lockup CP stall for more than
>> >> 10003msec
>> >> [ 1544.472656] [ cut here ]
>> >> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>> >> radeon_fence_wait+0x25c/0x314()
>> >> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>> >> 0x0002136A)
>> >> ...
>> >> [ 1544.886718] radeon :01:05.0: Wait for MC idle timedout !
>> >> [ 1545.046875] radeon :01:05.0: Wait for MC idle timedout !
>> >> [ 1545.062500] radeon :01:05.0: WB disabled
>> >> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>> >> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>> >> [ 1545.109375] [drm] Enabling audio support
>> >> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>> >> at 90004004, 32768 entries, Dummy
>> >> Page[0x0e004000-0x0e007fff]
>> >> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>> >> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>> >> entry=0x0e008067, orignal=0x745aaad1
>> >> ...
>> >> /* System blocked here. */
>> >>
>> >> Any idea?
>> >
>> > I know lockup are frustrating, my only idea is the memory controller
>> > is lockup because of some failing pci <-> system ram transaction.
>> >
>> >>
>> >> BTW, we find the following in r600_pcie_gart_enable()
>> >> (drivers/gpu/drm/radeon/r600.c):
>> >> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>> >> (u32)(rdev->dummy_page.addr >> 12));
>> >>
>> >> On our platform, PAGE_SIZE is 16K, does it have any problem?
>> >
>> > No this should be handled properly.
>> >
>> >> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>> >> should change to:
>> >>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>> >>   radeon_gart_set_page(rdev, t, page_base);
>> >> - page_base += RADEON_GPU_PAGE_SIZE;
>> >> + if (page_base != rdev->dummy_page.addr)
>> >> + page_base += RADEON_GPU_PAGE_SIZE;
>> >>   }
>> >> ???
>> >
>> > No need to do so, dummy page will be 16K too, so it's fine.
>> Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy
>> page is 0x8e004000; then there are four kinds of addresses in the GART:
>> 0x8e004000, 0x8e005000, 0x8e006000, 0x8e007000. The value written into
>> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I
>> don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
>> think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
>
> When radeon_gart_unbind initialize the gart entry to point to the dummy
> page it's just to have something safe in the GART table.
>
> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when
> there is a fault happening. It's like a sandbox for the mc. It doesn't
> conflict in anyway to have gart table entry to point to the same page.

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-03-01 Thread Alex Deucher
2012/3/1 che...@lemote.com:
> Status update:
> In r600.c I found for RS780, num_*_threads are like this:
>sq_thread_resource_mgmt = (NUM_PS_THREADS(79) |
>   NUM_VS_THREADS(78) |
>   NUM_GS_THREADS(4) |
>   NUM_ES_THREADS(31));
>
> But in the documentation, each of them should be a multiple of 4. And in
> r600_blit_kms.c, they are 136, 48, 4, 4. I want to know why
> 79, 78, 4 and 31 are used here.

You can try changing them, but I don't think it will make a difference.
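(If anyone does want to experiment along those lines, a minimal standalone sketch of clamping each count down to a multiple of 4; the resulting numbers are purely illustrative, not a recommendation for RS780.)

#include <stdio.h>

#define ROUND_DOWN_4(x) ((x) & ~3u)

int main(void)
{
    unsigned int ps = ROUND_DOWN_4(79);  /* 76 */
    unsigned int vs = ROUND_DOWN_4(78);  /* 76 */
    unsigned int gs = ROUND_DOWN_4(4);   /*  4 */
    unsigned int es = ROUND_DOWN_4(31);  /* 28 */

    printf("PS=%u VS=%u GS=%u ES=%u\n", ps, vs, gs, es);
    return 0;
}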

Alex

>
> Huacai Chen
>
>> On Wed, 2012-02-29 at 12:49 +0800, chenhc at lemote.com wrote:
>>> > On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>>> >> On 2012-02-17, at 5:27 PM, Chen Jie wrote:
>>> >> >> One good way to test gart is to go over GPU gart table and write a
>>> >> >> dword using the GPU at end of each page something like 0xCAFEDEAD
>>> >> >> or somevalue that is unlikely to be already set. And then go over
>>> >> >> all the page and check that GPU write succeed. Abusing the scratch
>>> >> >> register write back feature is the easiest way to try that.
>>> >> > I'm planning to add a GART table check procedure when resume, which
>>> >> > will go over GPU gart table:
>>> >> > 1. read(backup) a dword at end of each GPU page
>>> >> > 2. write a mark by GPU and check it
>>> >> > 3. restore the original dword
>>> >> Attachment validateGART.patch do the job:
>>> >> * It current only works for mips64 platform.
>>> >> * To use it, apply all_in_vram.patch first, which will allocate CP
>>> >> ring, ih, ib in VRAM and hard code no_wb=1.
>>> >>
>>> >> The gart test routine will be invoked in r600_resume. We've tried it,
>>> >> and find that when lockup happened the gart table was good before
>>> >> userspace restarting. The related dmesg follows:
>>> >> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>>> >> at 90004004, 32768 entries, Dummy
>>> >> Page[0x0e004000-0x0e007fff]
>>> >> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>>> >> entries(valid=8544, invalid=24224, total=32768).
>>> >> ...
>>> >> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>>> >> [ 1532.152343] Restarting tasks ... done.
>>> >> [ 1544.468750] radeon :01:05.0: GPU lockup CP stall for more than
>>> >> 10003msec
>>> >> [ 1544.472656] [ cut here ]
>>> >> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>>> >> radeon_fence_wait+0x25c/0x314()
>>> >> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>>> >> 0x0002136A)
>>> >> ...
>>> >> [ 1544.886718] radeon :01:05.0: Wait for MC idle timedout !
>>> >> [ 1545.046875] radeon :01:05.0: Wait for MC idle timedout !
>>> >> [ 1545.062500] radeon :01:05.0: WB disabled
>>> >> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>>> >> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>>> >> [ 1545.109375] [drm] Enabling audio support
>>> >> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>>> >> at 90004004, 32768 entries, Dummy
>>> >> Page[0x0e004000-0x0e007fff]
>>> >> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>>> >> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>>> >> entry=0x0e008067, orignal=0x745aaad1
>>> >> ...
>>> >> /* System blocked here. */
>>> >>
>>> >> Any idea?
>>> >
>>> > I know lockup are frustrating, my only idea is the memory controller
>>> > is lockup because of some failing pci <-> system ram transaction.
>>> >
>>> >>
>>> >> BTW, we find the following in r600_pcie_gart_enable()
>>> >> (drivers/gpu/drm/radeon/r600.c):
>>> >> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>>> >> (u32)(rdev->dummy_page.addr >> 12));
>>> >>
>>> >> On our platform, PAGE_SIZE is 16K, does it have any problem?
>>> >
>>> > No this should be handled properly.
>>> >
>>> >> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>>> >> should change to:
>>> >>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>>> >>   radeon_gart_set_page(rdev, t, page_base);
>>> >> - page_base += RADEON_GPU_PAGE_SIZE;
>>> >> + if (page_base != rdev->dummy_page.addr)
>>> >> + page_base += RADEON_GPU_PAGE_SIZE;
>>> >>   }
>>> >> ???
>>> >
>>> > No need to do so, dummy page will be 16K too, so it's fine.
>>> Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy
>>> page is 0x8e004000; then there are four kinds of addresses in the GART:
>>> 0x8e004000, 0x8e005000, 0x8e006000, 0x8e007000. The value written into
>>> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I
>>> don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
>>> think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
>>
>> When radeon_gart_unbind initialize the gart entry to point to the dummy
>> page it's just to have something safe in the GART table.
>>
>> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when
>> there is a fault happening. It's like a sandbox for the mc. It doesn't
>> conflict in anyway to have gart table entry to point to the same page.

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-29 Thread che...@lemote.com
> On Mon, 2012-02-27 at 10:44 +0800, Chen Jie wrote:
>> Hi,
>>
>> For this occasional GPU lockup when returns from STR/STD, I find
>> followings(when the problem happens):
>>
>> The value of SRBM_STATUS is whether 0x20002040 or 0x20003040.
>> Which means:
>> * HI_RQ_PENDING(There is a HI/BIF request pending in the SRBM)
>> * MCDW_BUSY(Memory Controller Block is Busy)
>> * BIF_BUSY(Bus Interface is Busy)
>> * MCDX_BUSY(Memory Controller Block is Busy) if is 0x20003040
>> Are MCDW_BUSY and MCDX_BUSY two memory channels? What is the
>> relationship among GART mapped memory, On-board video memory and MCDX,
>> MCDW?
>>
>> CP_STAT: the CSF_RING_BUSY is always set.
>
> Once the memory controller fails to do a pci transaction the CP
> will be stuck. At least if ring is in system memory, if ring is
> in vram CP might be stuck too because anyway everything goes
> through the MC.
>
I've tried the rs600 method for GPU reset (using rs600_bm_disable() to
disable the PCI MASTER bit and re-enable it after reset), but it doesn't solve
the problem. Then I found that r100_bm_disable() does more things,
e.g. writing the GPU register R_30_BUS_CNTL. In r600_reg.h there is
a register R600_BUS_CNTL; does this register have a similar function?
But I don't know how to use it...
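(A minimal sketch, assuming the generic PCI-core helpers pci_clear_master()/pci_set_master() as a stand-in for the config-space write that rs600_bm_disable() does by hand; the wrapper name is hypothetical, and whether this is enough on RS780 is exactly the open question above.)

#include <linux/pci.h>
#include "radeon.h"

/* Hypothetical wrapper: stop device DMA while the GPU is being reset,
 * then re-enable bus mastering afterwards.  Untested sketch. */
static void rs780_reset_with_bm_fenced(struct radeon_device *rdev)
{
    pci_clear_master(rdev->pdev);   /* clear the PCI bus-master bit */

    /* ... run the existing r600 soft-reset sequence here ... */

    pci_set_master(rdev->pdev);     /* turn bus mastering back on */
}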

Huacai Chen

>>
>> There are many CP_PACKET2(0x8000) in CP ring(more than three
>> hundreds). e.g.
>> r[131800]=0x00028000
>> r[131801]=0xc0016800
>> r[131802]=0x0140
>> r[131803]=0x79c5
>> r[131804]=0x304a
>> r[131805] ... r[132143]=0x8000
>> r[132144]=0x
>> After the first reset, GPU will lockup again, this time, typically
>> there are 320 dwords in CP ring -- with 319 CP_PACKET2 and 0xc0033d00
>> in the end.
>> Are these normal?
>>
>> BTW, is there any way for X to switch to NOACCEL mode when the problem
>> happens? Thus users will have a chance to save their documents and
>> then reboot machine.
>
> I have been meaning to patch the ddx to fallback to sw after GPU lockup.
> But this is useless in today world, where everything is composited ie
> the screen is updated using the 3D driver for which there is no easy
> way to suddenly migrate to  software rendering. I will still probably
> do the ddx patch at one point.
>
> Cheers,
> Jerome
>
>




[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-29 Thread Jerome Glisse
On Wed, 2012-02-29 at 12:49 +0800, chenhc at lemote.com wrote:
> > On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
> >> On 2012-02-17, at 5:27 PM, Chen Jie wrote:
> >> >> One good way to test gart is to go over GPU gart table and write a
> >> >> dword using the GPU at end of each page something like 0xCAFEDEAD
> >> >> or somevalue that is unlikely to be already set. And then go over
> >> >> all the page and check that GPU write succeed. Abusing the scratch
> >> >> register write back feature is the easiest way to try that.
> >> > I'm planning to add a GART table check procedure when resume, which
> >> > will go over GPU gart table:
> >> > 1. read(backup) a dword at end of each GPU page
> >> > 2. write a mark by GPU and check it
> >> > 3. restore the original dword
> >> Attachment validateGART.patch do the job:
> >> * It current only works for mips64 platform.
> >> * To use it, apply all_in_vram.patch first, which will allocate CP
> >> ring, ih, ib in VRAM and hard code no_wb=1.
> >>
> >> The gart test routine will be invoked in r600_resume. We've tried it,
> >> and find that when lockup happened the gart table was good before
> >> userspace restarting. The related dmesg follows:
> >> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
> >> at 90004004, 32768 entries, Dummy
> >> Page[0x0e004000-0x0e007fff]
> >> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
> >> entries(valid=8544, invalid=24224, total=32768).
> >> ...
> >> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
> >> [ 1532.152343] Restarting tasks ... done.
> >> [ 1544.468750] radeon :01:05.0: GPU lockup CP stall for more than
> >> 10003msec
> >> [ 1544.472656] [ cut here ]
> >> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
> >> radeon_fence_wait+0x25c/0x314()
> >> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
> >> 0x0002136A)
> >> ...
> >> [ 1544.886718] radeon :01:05.0: Wait for MC idle timedout !
> >> [ 1545.046875] radeon :01:05.0: Wait for MC idle timedout !
> >> [ 1545.062500] radeon :01:05.0: WB disabled
> >> [ 1545.097656] [drm] ring test succeeded in 0 usecs
> >> [ 1545.105468] [drm] ib test succeeded in 0 usecs
> >> [ 1545.109375] [drm] Enabling audio support
> >> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
> >> at 90004004, 32768 entries, Dummy
> >> Page[0x0e004000-0x0e007fff]
> >> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
> >> unexpected value 0x745aaad1(expect 0xDEADBEEF)
> >> entry=0x0e008067, orignal=0x745aaad1
> >> ...
> >> /* System blocked here. */
> >>
> >> Any idea?
> >
> > I know lockup are frustrating, my only idea is the memory controller
> > is lockup because of some failing pci <-> system ram transaction.
> >
> >>
> >> BTW, we find the following in r600_pcie_gart_enable()
> >> (drivers/gpu/drm/radeon/r600.c):
> >> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
> >> (u32)(rdev->dummy_page.addr >> 12));
> >>
> >> On our platform, PAGE_SIZE is 16K, does it have any problem?
> >
> > No this should be handled properly.
> >
> >> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
> >> should change to:
> >>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
> >>   radeon_gart_set_page(rdev, t, page_base);
> >> - page_base += RADEON_GPU_PAGE_SIZE;
> >> + if (page_base != rdev->dummy_page.addr)
> >> + page_base += RADEON_GPU_PAGE_SIZE;
> >>   }
> >> ???
> >
> > No need to do so, dummy page will be 16K too, so it's fine.
> Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy
> page is 0x8e004000; then there are four kinds of addresses in the GART:
> 0x8e004000, 0x8e005000, 0x8e006000, 0x8e007000. The value written into
> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I
> don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
> think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.

When radeon_gart_unbind initializes the gart entries to point to the dummy
page, it's just to have something safe in the GART table.

VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when
a fault happens. It's like a sandbox for the mc. It doesn't
conflict in any way to have gart table entries point to the same page.
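(To put numbers on that, a small standalone illustration using the addresses from this thread: a 16 KiB CPU page split into four 4 KiB GPU pages, with the dummy page assumed at 0x8e004000 as in the mail above.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t dummy = 0x8e004000ULL;        /* example rdev->dummy_page.addr */
    int i;

    /* The four GART entries cover the four 4 KiB sub-pages of the one
     * 16 KiB dummy page, all of which are real, safe system memory. */
    for (i = 0; i < 4; i++)
        printf("GART entry %d -> 0x%08llx\n", i,
               (unsigned long long)(dummy + i * 0x1000));

    /* The fault-default register only needs one 4 KiB-aligned page,
     * hence the single ">> 12" value written in r600_pcie_gart_enable(). */
    printf("VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR = 0x%05llx\n",
           (unsigned long long)(dummy >> 12));
    return 0;
}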

Cheers,
Jerome



[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-29 Thread che...@lemote.com
> On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>> On 2012-02-17, at 5:27 PM, Chen Jie wrote:
>> >> One good way to test gart is to go over GPU gart table and write a
>> >> dword using the GPU at end of each page something like 0xCAFEDEAD
>> >> or somevalue that is unlikely to be already set. And then go over
>> >> all the page and check that GPU write succeed. Abusing the scratch
>> >> register write back feature is the easiest way to try that.
>> > I'm planning to add a GART table check procedure when resume, which
>> > will go over GPU gart table:
>> > 1. read(backup) a dword at end of each GPU page
>> > 2. write a mark by GPU and check it
>> > 3. restore the original dword
>> Attachment validateGART.patch do the job:
>> * It current only works for mips64 platform.
>> * To use it, apply all_in_vram.patch first, which will allocate CP
>> ring, ih, ib in VRAM and hard code no_wb=1.
>>
>> The gart test routine will be invoked in r600_resume. We've tried it,
>> and find that when lockup happened the gart table was good before
>> userspace restarting. The related dmesg follows:
>> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>> at 90004004, 32768 entries, Dummy
>> Page[0x0e004000-0x0e007fff]
>> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>> entries(valid=8544, invalid=24224, total=32768).
>> ...
>> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>> [ 1532.152343] Restarting tasks ... done.
>> [ 1544.468750] radeon :01:05.0: GPU lockup CP stall for more than
>> 10003msec
>> [ 1544.472656] [ cut here ]
>> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>> radeon_fence_wait+0x25c/0x314()
>> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>> 0x0002136A)
>> ...
>> [ 1544.886718] radeon :01:05.0: Wait for MC idle timedout !
>> [ 1545.046875] radeon :01:05.0: Wait for MC idle timedout !
>> [ 1545.062500] radeon :01:05.0: WB disabled
>> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>> [ 1545.109375] [drm] Enabling audio support
>> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>> at 90004004, 32768 entries, Dummy
>> Page[0x0e004000-0x0e007fff]
>> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>> entry=0x0e008067, orignal=0x745aaad1
>> ...
>> /* System blocked here. */
>>
>> Any idea?
>
> I know lockup are frustrating, my only idea is the memory controller
> is lockup because of some failing pci <-> system ram transaction.
>
>>
>> BTW, we find the following in r600_pcie_gart_enable()
>> (drivers/gpu/drm/radeon/r600.c):
>> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>> (u32)(rdev->dummy_page.addr >> 12));
>>
>> On our platform, PAGE_SIZE is 16K, does it have any problem?
>
> No this should be handled properly.
>
>> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>> should change to:
>>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>>   radeon_gart_set_page(rdev, t, page_base);
>> - page_base += RADEON_GPU_PAGE_SIZE;
>> + if (page_base != rdev->dummy_page.addr)
>> + page_base += RADEON_GPU_PAGE_SIZE;
>>   }
>> ???
>
> No need to do so, dummy page will be 16K too, so it's fine.
Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy
page is 0x8e004000; then there are four kinds of addresses in the GART:
0x8e004000, 0x8e005000, 0x8e006000, 0x8e007000. The value written into
VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I
don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.

>
> Cheers,
> Jerome
>
>

Huacai Chen



[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-27 Thread Jerome Glisse
On Mon, 2012-02-27 at 10:44 +0800, Chen Jie wrote:
> Hi,
> 
> For this occasional GPU lockup when returns from STR/STD, I find
> followings(when the problem happens):
> 
> The value of SRBM_STATUS is whether 0x20002040 or 0x20003040.
> Which means:
> * HI_RQ_PENDING(There is a HI/BIF request pending in the SRBM)
> * MCDW_BUSY(Memory Controller Block is Busy)
> * BIF_BUSY(Bus Interface is Busy)
> * MCDX_BUSY(Memory Controller Block is Busy) if is 0x20003040
> Are MCDW_BUSY and MCDX_BUSY two memory channels? What is the
> relationship among GART mapped memory, On-board video memory and MCDX,
> MCDW?
> 
> CP_STAT: the CSF_RING_BUSY is always set.

Once the memory controller fails to do a PCI transaction, the CP
will be stuck. At least if the ring is in system memory; if the ring is
in VRAM the CP might be stuck too, because everything goes
through the MC anyway.

> 
> There are many CP_PACKET2(0x8000) in CP ring(more than three hundreds). 
> e.g.
> r[131800]=0x00028000
> r[131801]=0xc0016800
> r[131802]=0x0140
> r[131803]=0x79c5
> r[131804]=0x304a
> r[131805] ... r[132143]=0x8000
> r[132144]=0x
> After the first reset, GPU will lockup again, this time, typically
> there are 320 dwords in CP ring -- with 319 CP_PACKET2 and 0xc0033d00
> in the end.
> Are these normal?
> 
> BTW, is there any way for X to switch to NOACCEL mode when the problem
> happens? Thus users will have a chance to save their documents and
> then reboot machine.

I have been meaning to patch the ddx to fall back to software after a GPU
lockup. But this is of little use in today's world, where everything is
composited, i.e. the screen is updated using the 3D driver, for which there
is no easy way to suddenly migrate to software rendering. I will still
probably do the ddx patch at some point.

Cheers,
Jerome



[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-27 Thread Jerome Glisse
On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
> On 2012-02-17, at 5:27 PM, Chen Jie wrote:
> >> One good way to test gart is to go over GPU gart table and write a
> >> dword using the GPU at end of each page something like 0xCAFEDEAD
> >> or somevalue that is unlikely to be already set. And then go over
> >> all the page and check that GPU write succeed. Abusing the scratch
> >> register write back feature is the easiest way to try that.
> > I'm planning to add a GART table check procedure when resume, which
> > will go over GPU gart table:
> > 1. read(backup) a dword at end of each GPU page
> > 2. write a mark by GPU and check it
> > 3. restore the original dword
> Attachment validateGART.patch do the job:
> * It current only works for mips64 platform.
> * To use it, apply all_in_vram.patch first, which will allocate CP
> ring, ih, ib in VRAM and hard code no_wb=1.
> 
> The gart test routine will be invoked in r600_resume. We've tried it,
> and find that when lockup happened the gart table was good before
> userspace restarting. The related dmesg follows:
> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
> at 90004004, 32768 entries, Dummy
> Page[0x0e004000-0x0e007fff]
> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
> entries(valid=8544, invalid=24224, total=32768).
> ...
> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
> [ 1532.152343] Restarting tasks ... done.
> [ 1544.468750] radeon :01:05.0: GPU lockup CP stall for more than 
> 10003msec
> [ 1544.472656] [ cut here ]
> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
> radeon_fence_wait+0x25c/0x314()
> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id 0x0002136A)
> ...
> [ 1544.886718] radeon :01:05.0: Wait for MC idle timedout !
> [ 1545.046875] radeon :01:05.0: Wait for MC idle timedout !
> [ 1545.062500] radeon :01:05.0: WB disabled
> [ 1545.097656] [drm] ring test succeeded in 0 usecs
> [ 1545.105468] [drm] ib test succeeded in 0 usecs
> [ 1545.109375] [drm] Enabling audio support
> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
> at 90004004, 32768 entries, Dummy
> Page[0x0e004000-0x0e007fff]
> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
> unexpected value 0x745aaad1(expect 0xDEADBEEF)
> entry=0x0e008067, orignal=0x745aaad1
> ...
> /* System blocked here. */
> 
> Any idea?

I know lockups are frustrating; my only idea is that the memory controller
is locked up because of some failing PCI <-> system RAM transaction.

> 
> BTW, we find the following in r600_pcie_gart_enable()
> (drivers/gpu/drm/radeon/r600.c):
> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
> (u32)(rdev->dummy_page.addr >> 12));
> 
> On our platform, PAGE_SIZE is 16K, does it have any problem?

No this should be handled properly.

> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
> should change to:
>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>   radeon_gart_set_page(rdev, t, page_base);
> - page_base += RADEON_GPU_PAGE_SIZE;
> + if (page_base != rdev->dummy_page.addr)
> + page_base += RADEON_GPU_PAGE_SIZE;
>   }
> ???

No need to do so, dummy page will be 16K too, so it's fine.
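(Spelling that out: an annotated sketch of the existing mapping loop for the 16 KiB CPU page / 4 KiB GPU page case. It describes current behaviour, not a proposed change; the guard from the diff above is unnecessary precisely because the dummy page spans a whole CPU page.)

/* One CPU page covers PAGE_SIZE / RADEON_GPU_PAGE_SIZE = 4 GPU pages on a
 * 16 KiB-page kernel.  In the dummy-page case page_base starts at
 * rdev->dummy_page.addr, and each increment below stays inside that same
 * 16 KiB allocation, so every entry still points at safe memory. */
for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
    radeon_gart_set_page(rdev, t, page_base);   /* j = 0, 1, 2, 3 */
    page_base += RADEON_GPU_PAGE_SIZE;          /* advance 4 KiB per entry */
}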

Cheers,
Jerome



[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-27 Thread Chen Jie
Hi,

For this occasional GPU lockup when returning from STR/STD, I found the
following (when the problem happens):

The value of SRBM_STATUS is either 0x20002040 or 0x20003040.
Which means:
* HI_RQ_PENDING(There is a HI/BIF request pending in the SRBM)
* MCDW_BUSY(Memory Controller Block is Busy)
* BIF_BUSY(Bus Interface is Busy)
* MCDX_BUSY(Memory Controller Block is Busy), if the value is 0x20003040
Are MCDW_BUSY and MCDX_BUSY two memory channels? What is the
relationship among GART-mapped memory, on-board video memory, and
MCDX/MCDW?
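(As a quick sanity check on that decoding, the two observed values differ in exactly one bit, which lines up with the extra MCDX_BUSY flag listed above; standalone illustration only, the bit name is taken from the list, not re-derived from the register spec.)

#include <stdio.h>

int main(void)
{
    unsigned int a = 0x20002040;   /* SRBM_STATUS, first observed value  */
    unsigned int b = 0x20003040;   /* SRBM_STATUS, second observed value */

    /* Prints 0x00001000: a single extra busy bit, consistent with
     * MCDX_BUSY being the only additional flag in the second case. */
    printf("a ^ b = 0x%08x\n", a ^ b);
    return 0;
}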

CP_STAT: the CSF_RING_BUSY is always set.

There are many CP_PACKET2 (0x8000) entries in the CP ring (more than three hundred), e.g.:
r[131800]=0x00028000
r[131801]=0xc0016800
r[131802]=0x0140
r[131803]=0x79c5
r[131804]=0x304a
r[131805] ... r[132143]=0x8000
r[132144]=0x
After the first reset, the GPU will lock up again; this time, typically
there are 320 dwords in the CP ring -- with 319 CP_PACKET2 entries and
0xc0033d00 at the end.
Are these normal?

BTW, is there any way for X to switch to NOACCEL mode when the problem
happens? That way users would have a chance to save their documents and
then reboot the machine.


Regards,
-- Chen Jie


[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-21 Thread Chen Jie
On 2012-02-17, at 5:27 PM, Chen Jie wrote:
>> One good way to test gart is to go over GPU gart table and write a
>> dword using the GPU at end of each page something like 0xCAFEDEAD
>> or somevalue that is unlikely to be already set. And then go over
>> all the page and check that GPU write succeed. Abusing the scratch
>> register write back feature is the easiest way to try that.
> I'm planning to add a GART table check procedure when resume, which
> will go over GPU gart table:
> 1. read(backup) a dword at end of each GPU page
> 2. write a mark by GPU and check it
> 3. restore the original dword
Attachment validateGART.patch does the job:
* It currently only works on the mips64 platform.
* To use it, apply all_in_vram.patch first, which will allocate the CP
ring, ih and ib in VRAM and hard-code no_wb=1.
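(A hedged sketch of what the per-page check described in the quoted plan can look like. gart_page_cpu_ptr() and gpu_write_dword_via_gart() are hypothetical stand-ins -- the latter for the scratch-register write-back trick suggested earlier in the thread -- and this is not the actual validateGART.patch.)

#define GART_CHECK_MARK 0xCAFEDEAD

/* Check one GART slot: back up the last dword of the GPU page, have the
 * GPU write a marker there, verify it through a CPU mapping, restore. */
static int gart_check_one_page(struct radeon_device *rdev, unsigned i)
{
    u64 gpu_addr = rdev->mc.gtt_start +
                   (u64)i * RADEON_GPU_PAGE_SIZE + RADEON_GPU_PAGE_SIZE - 4;
    u32 __iomem *last = gart_page_cpu_ptr(rdev, i) +            /* hypothetical */
                        RADEON_GPU_PAGE_SIZE / 4 - 1;
    u32 saved = readl(last);
    int ret = 0;

    gpu_write_dword_via_gart(rdev, gpu_addr, GART_CHECK_MARK);  /* hypothetical */
    if (readl(last) != GART_CHECK_MARK)
        ret = -EIO;                 /* the GPU write did not land */
    writel(saved, last);            /* step 3: restore the original dword */
    return ret;
}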

The GART test routine is invoked in r600_resume. We've tried it,
and found that when the lockup happened, the GART table was still good before
userspace restarted. The related dmesg follows:
[ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
at 90004004, 32768 entries, Dummy
Page[0x0e004000-0x0e007fff]
[ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
entries(valid=8544, invalid=24224, total=32768).
...
[ 1531.156250] PM: resume of devices complete after 9396.588 msecs
[ 1532.152343] Restarting tasks ... done.
[ 1544.468750] radeon :01:05.0: GPU lockup CP stall for more than 10003msec
[ 1544.472656] [ cut here ]
[ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
radeon_fence_wait+0x25c/0x314()
[ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id 0x0002136A)
...
[ 1544.886718] radeon :01:05.0: Wait for MC idle timedout !
[ 1545.046875] radeon :01:05.0: Wait for MC idle timedout !
[ 1545.062500] radeon :01:05.0: WB disabled
[ 1545.097656] [drm] ring test succeeded in 0 usecs
[ 1545.105468] [drm] ib test succeeded in 0 usecs
[ 1545.109375] [drm] Enabling audio support
[ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
at 90004004, 32768 entries, Dummy
Page[0x0e004000-0x0e007fff]
[ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
unexpected value 0x745aaad1(expect 0xDEADBEEF)
entry=0x0e008067, orignal=0x745aaad1
...
/* System blocked here. */

Any idea?

BTW, we found the following in r600_pcie_gart_enable()
(drivers/gpu/drm/radeon/r600.c):
WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
(u32)(rdev->dummy_page.addr >> 12));

On our platform, PAGE_SIZE is 16K; does that cause any problem?

Also, in radeon_gart_unbind() and radeon_gart_restore(), should the logic
change to:
  for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
  radeon_gart_set_page(rdev, t, page_base);
- page_base += RADEON_GPU_PAGE_SIZE;
+ if (page_base != rdev->dummy_page.addr)
+ page_base += RADEON_GPU_PAGE_SIZE;
  }
???



Regards,
-- Chen Jie
-- next part --
A non-text attachment was scrubbed...
Name: all_in_vram.patch
Type: text/x-patch
Size: 3971 bytes
Desc: not available
URL: 

-- next part --
A non-text attachment was scrubbed...
Name: validateGART.patch
Type: text/x-patch
Size: 3947 bytes
Desc: not available
URL: 



Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-21 Thread Chen Jie
On Feb 17, 2012 at 5:27 PM, Chen Jie wrote:
>> One good way to test gart is to go over GPU gart table and write a
>> dword using the GPU at end of each page something like 0xCAFEDEAD
>> or somevalue that is unlikely to be already set. And then go over
>> all the page and check that GPU write succeed. Abusing the scratch
>> register write back feature is the easiest way to try that.
> I'm planning to add a GART table check procedure when resume, which
> will go over GPU gart table:
> 1. read(backup) a dword at end of each GPU page
> 2. write a mark by GPU and check it
> 3. restore the original dword
Attachment validateGART.patch does the job:
* It currently only works on the mips64 platform.
* To use it, apply all_in_vram.patch first, which allocates the CP
ring, ih and ib in VRAM and hard-codes no_wb=1.

The GART test routine is invoked in r600_resume. We've tried it,
and found that when the lockup happened the GART table was still good
before userspace restarted. The related dmesg follows:
[ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
at 90004004, 32768 entries, Dummy
Page[0x0e004000-0x0e007fff]
[ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
entries(valid=8544, invalid=24224, total=32768).
...
[ 1531.156250] PM: resume of devices complete after 9396.588 msecs
[ 1532.152343] Restarting tasks ... done.
[ 1544.468750] radeon :01:05.0: GPU lockup CP stall for more than 10003msec
[ 1544.472656] [ cut here ]
[ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
radeon_fence_wait+0x25c/0x314()
[ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id 0x0002136A)
...
[ 1544.886718] radeon :01:05.0: Wait for MC idle timedout !
[ 1545.046875] radeon :01:05.0: Wait for MC idle timedout !
[ 1545.062500] radeon :01:05.0: WB disabled
[ 1545.097656] [drm] ring test succeeded in 0 usecs
[ 1545.105468] [drm] ib test succeeded in 0 usecs
[ 1545.109375] [drm] Enabling audio support
[ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
at 90004004, 32768 entries, Dummy
Page[0x0e004000-0x0e007fff]
[ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
unexpected value 0x745aaad1(expect 0xDEADBEEF)
entry=0x0e008067, orignal=0x745aaad1
...
/* System blocked here. */

Any idea?

BTW, we found the following in r600_pcie_gart_enable()
(drivers/gpu/drm/radeon/r600.c):
WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
(u32)(rdev->dummy_page.addr >> 12));

On our platform, PAGE_SIZE is 16K; does that cause any problem?

Also, in radeon_gart_unbind() and radeon_gart_restore(), should the
logic be changed as follows (see the sketch after the diff)?
  for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
  radeon_gart_set_page(rdev, t, page_base);
- page_base += RADEON_GPU_PAGE_SIZE;
+ if (page_base != rdev->dummy_page.addr)
+ page_base += RADEON_GPU_PAGE_SIZE;
  }
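(To make the 16K-page concern concrete, here is a tiny standalone C sketch
of the idea behind the change above: with PAGE_SIZE = 16K and
RADEON_GPU_PAGE_SIZE = 4K, one CPU page fills four GART entries, and the
proposed guard keeps all four entries of an unbound slot pointing at the
single 4K dummy page instead of stepping past it. The dummy-page address is
taken from the dmesg above; the "normal" page address is just an
illustrative value.)

#include <stdio.h>
#include <stdint.h>

#define CPU_PAGE_SIZE        16384       /* PAGE_SIZE on this platform */
#define RADEON_GPU_PAGE_SIZE 4096        /* GART granularity */

#define DUMMY_PAGE_ADDR  0x0e004000ull   /* single 4K dummy page (see dmesg) */
#define NORMAL_PAGE_ADDR 0x10000000ull   /* an arbitrary real 16K page */

static void fill_entries(const char *what, uint64_t page_base, int is_dummy)
{
	/* One CPU page covers PAGE_SIZE / RADEON_GPU_PAGE_SIZE GART entries. */
	for (int j = 0; j < CPU_PAGE_SIZE / RADEON_GPU_PAGE_SIZE; j++) {
		printf("%s entry %d -> 0x%08llx\n", what, j,
		       (unsigned long long)page_base);
		/* Proposed guard: never step past the single 4K dummy page. */
		if (!is_dummy)
			page_base += RADEON_GPU_PAGE_SIZE;
	}
}

int main(void)
{
	fill_entries("normal", NORMAL_PAGE_ADDR, 0);
	fill_entries("dummy ", DUMMY_PAGE_ADDR, 1);
	return 0;
}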



Regards,
-- Chen Jie
diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index 53dbf50..e5961ed 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -2215,6 +2218,8 @@ int r600_cp_resume(struct radeon_device *rdev)
 
 void r600_cp_commit(struct radeon_device *rdev)
 {
+	if ((rdev->cp.ring_obj->tbo.mem.placement &  TTM_PL_MASK_MEM) == TTM_PL_FLAG_VRAM)
+		WREG32(R_005480_HDP_MEM_COHERENCY_FLUSH_CNTL, 0x1);
 	WREG32(CP_RB_WPTR, rdev->cp.wptr);
 	(void)RREG32(CP_RB_WPTR);
 }
@@ -2754,7 +2764,7 @@ static int r600_ih_ring_alloc(struct radeon_device *rdev)
 	if (rdev->ih.ring_obj == NULL) {
 		r = radeon_bo_create(rdev, NULL, rdev->ih.ring_size,
  true,
- RADEON_GEM_DOMAIN_GTT,
+ RADEON_GEM_DOMAIN_VRAM,
  &rdev->ih.ring_obj);
 		if (r) {
 			DRM_ERROR("radeon: failed to create ih ring buffer (%d).\n", r);
@@ -2764,7 +2774,7 @@ static int r600_ih_ring_alloc(struct radeon_device *rdev)
 		if (unlikely(r != 0))
 			return r;
 		r = radeon_bo_pin(rdev->ih.ring_obj,
-  RADEON_GEM_DOMAIN_GTT,
+  RADEON_GEM_DOMAIN_VRAM,
   &rdev->ih.gpu_addr);
 		if (r) {
 			radeon_bo_unreserve(rdev->ih.ring_obj);
@@ -3444,6 +3454,8 @@ restart_ih:
 	if (queue_hotplug)
 		queue_work(rdev->wq, &rdev->hotplug_work);
 	rdev->ih.rptr = rptr;
+	if ((rdev->ih.ring_obj->tbo.mem.placement &  TTM_PL_MASK_MEM) == TTM_PL_FLAG_VRAM)
+		WREG32(R_005480_HDP_MEM_COHERENCY_FLUSH_CNTL, 0x1);
 	WREG32(IH_RB_RPTR, rdev->ih.rptr);
 	spin_unlock_irqrestore(&rdev->ih.lock, flags);
 	return IRQ_HANDLED;
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 795403b..c5326e0 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -82,13 +82,13 @@ void radeon_debugfs_cleanup(struct drm_minor *minor);
 #endif
 
 
-int radeon_no_wb;
+int radeon_no_wb = 1;
 int radeon_modeset = -1;
 int radeon_dynclks = -1;
 int radeon_r4xx_atom = 0;
 int radeon_agpmode = 0;
 int radeon_vram_limit = 0;
-int radeon_gart_size =

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-17 Thread Chen Jie
>> On Feb 15, 2012 at 11:53 PM, Jerome Glisse wrote:
>>> To me it looks like the CP is trying to fetch memory but the
>>> GPU memory controller fail to fullfill cp request. Did you
>>> check the PCI configuration before & after (when things don't
>>> work) My best guest is PCI bus mastering is no properly working
>>> or the PCIE GPU gart table as wrong data.
>>>
>>> Maybe one need to drop bus master and reenable bus master to
>>> work around some bug...
>> Thanks for your suggestion. We've tried the 'drop and reenable master'
>> trick, unfortunately doesn't work.
>> The PCI configuration compare will be done later.
> Update: We've checked the first 64 bytes of PCI configuration space
> before & after, and didn't find any difference.
Hi,

Status update:
We tried to analyze the GPU instruction stream during the lockup today.
The lockup always occurs after tasks are restarted, so the related
instructions should reside in the IB, as pointed to by dmesg:
[ 2456.585937] GPU lockup (waiting for 0x0002F98B last fence id 0x0002F98A)

Printing the instructions in the related IB:
[ 2462.492187] PM4 block 10 has 115 instructions, with fence seq 2f98b

[ 2462.976562] Type3:PACKET3_SET_CONTEXT_REG ref_addr  
[ 2462.984375] Type3:PACKET3_SET_CONTEXT_REG ref_addr  
[ 2462.988281] Type3:PACKET3_SET_CONTEXT_REG ref_addr  
[ 2462.992187] Type3:PACKET3_SET_ALU_CONST ref_addr  
[ 2462.996093] Type3:PACKET3_SURFACE_SYNC ref_addr 18c880
[ 2463.003906] Type3:PACKET3_SET_RESOURCE ref_addr  
[ 2463.007812] Type3:PACKET3_SET_CONFIG_REG ref_addr  
[ 2463.011718] Type3:PACKET3_INDEX_TYPE ref_addr  
[ 2463.015625] Type3:PACKET3_NUM_INSTANCES ref_addr  
[ 2463.019531] Type3:PACKET3_DRAW_INDEX_AUTO ref_addr  
[ 2463.027343] Type3:PACKET3_EVENT_WRITE ref_addr  
[ 2463.031250] Type3:PACKET3_SET_CONFIG_REG ref_addr  
[ 2463.035156] Type3:PACKET3_SURFACE_SYNC ref_addr 10f680
[ 2463.039062] Type3:PACKET3_SET_CONTEXT_REG ref_addr  
[ 2463.046875] Type3:PACKET3_SET_CONTEXT_REG ref_addr  
[ 2463.050781] Type3:PACKET3_SET_CONTEXT_REG ref_addr  
[ 2463.054687] Type3:PACKET3_SET_BOOL_CONST ref_addr  
[ 2463.062500] Type3:PACKET3_SURFACE_SYNC ref_addr 10668e

CP_COHER_BASE was 0x0018C880, so the instruction that caused the lockup
should be somewhere between:
[ 2462.996093] Type3:PACKET3_SURFACE_SYNC ref_addr 18c880
...
[ 2463.035156] Type3:PACKET3_SURFACE_SYNC ref_addr 10f680

Here, only SURFACE_SYNC, SET_RESOURCE and EVENT_WRITE access GPU memory.
We guess it may be the SURFACE_SYNC?
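(For context: on r600 a SURFACE_SYNC is emitted as a five-dword type-3
packet whose CP_COHER_BASE dword is in 256-byte units -- at least that is
how the r600 blit code emits it -- which is why the ref_addr above lines up
with the CP_COHER_BASE register value. A standalone sketch follows; the
CP_COHER_CNTL and CP_COHER_SIZE values here are placeholders, not the ones
from the locked-up IB.)

#include <stdio.h>
#include <stdint.h>

/* PM4 type-3 header and the SURFACE_SYNC opcode (r600d.h). */
#define PACKET3(op, n) ((3u << 30) | (((n) & 0x3fff) << 16) | (((op) & 0xff) << 8))
#define PACKET3_SURFACE_SYNC 0x43

int main(void)
{
	uint32_t pkt[5] = {
		PACKET3(PACKET3_SURFACE_SYNC, 3),
		0x80000000, /* CP_COHER_CNTL: placeholder sync flags */
		0x00ffffff, /* CP_COHER_SIZE, in 256-byte units (placeholder) */
		0x0018c880, /* CP_COHER_BASE, in 256-byte units: the ref_addr above */
		10,         /* poll interval */
	};
	uint64_t base = (uint64_t)pkt[3] << 8; /* back to a byte address */

	printf("header 0x%08x, CP_COHER_BASE dword 0x%08x -> GPU address 0x%llx\n",
	       pkt[0], pkt[3], (unsigned long long)base);
	return 0;
}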

BTW, when the lockup happens, if the CP ring is placed in VRAM, ring_test
passes but ib_test fails -- which suggests the ME fails to feed the CP
during the lockup? Could an earlier SURFACE_SYNC block the MC?

P.S. We hacked the driver to place the CP ring, ib and ih in VRAM and
disable writeback (radeon_no_wb=1) in today's debugging.

Any idea?



Regards,
-- Chen Jie


[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-17 Thread Chen Jie
On Feb 17, 2012 at 12:32 AM, Jerome Glisse wrote:
> Ok let's start from the begining, i convince it's related to GPU
> memory controller failing to full fill some request that hit system
> memory. So in another mail you wrote :
>
>> BTW, I found radeon_gart_bind() will call pci_map_page(), it hooks
>> to swiotlb_map_page on our platform, which seems allocates and returns
>> dma_addr_t of a new page from pool if not meet dma_mask. Seems a bug, since
>> the BO backed by one set of pages, but mapped to GART was another set of
>> pages?
>
> Is this still the case ? As this is obviously wrong, we fixed that
> recently. What drm code are you using. rs780 dma mask is something
> like 40bits iirc so you should never have issue on your system with
> 1G of memory right ?
Right.

>
> If you have an iommu what happens on resume ? Are all page previously
> mapped with pci map page still valid ?
Physical addresses are directly mapped to bus addresses, so the IOMMU
does nothing on resume; the pages should be valid?

>
> One good way to test gart is to go over GPU gart table and write a
> dword using the GPU at end of each page something like 0xCAFEDEAD
> or somevalue that is unlikely to be already set. And then go over
> all the page and check that GPU write succeed. Abusing the scratch
> register write back feature is the easiest way to try that.
I'm planning to add a GART table check procedure when resume, which
will go over GPU gart table:
1. read(backup) a dword at end of each GPU page
2. write a mark by GPU and check it
3. restore the original dword

Hopefully, this can help.
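(A minimal standalone sketch of the three-step check above, just to pin
down the loop structure; gpu_write_dword() here is a hypothetical stand-in
for "make the GPU write a dword at the end of the GART page", e.g. via the
scratch write-back trick, and the real patch of course works on the live
GART table rather than a local array.)

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define GPU_PAGE_SIZE 4096
#define NUM_PAGES 4   /* tiny stand-in for the 32768-entry table */

/* Simulated backing memory; in the driver the GPU write goes through the
 * GART mapping and the CPU read through the kernel mapping of the page. */
static uint8_t pages[NUM_PAGES][GPU_PAGE_SIZE];

/* Hypothetical: make the GPU write a dword at the end of GART page i. */
static void gpu_write_dword(unsigned i, uint32_t val)
{
	memcpy(&pages[i][GPU_PAGE_SIZE - 4], &val, 4);
}

static uint32_t cpu_read_dword(unsigned i)
{
	uint32_t val;

	memcpy(&val, &pages[i][GPU_PAGE_SIZE - 4], 4);
	return val;
}

int main(void)
{
	unsigned i, bad = 0;

	for (i = 0; i < NUM_PAGES; i++) {
		uint32_t saved = cpu_read_dword(i);  /* 1. back up the dword   */
		gpu_write_dword(i, 0xDEADBEEF);      /* 2. GPU writes a mark   */
		if (cpu_read_dword(i) != 0xDEADBEEF) /*    ... and we check it */
			bad++;
		gpu_write_dword(i, saved);           /* 3. restore             */
	}
	printf("checked %u entries, %u bad\n", i, bad);
	return 0;
}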




[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-16 Thread Chen Jie
On Feb 16, 2012 at 5:21 PM, Chen Jie wrote:
> Hi,
>
> On Feb 15, 2012 at 11:53 PM, Jerome Glisse wrote:
>> To me it looks like the CP is trying to fetch memory but the
>> GPU memory controller fail to fullfill cp request. Did you
>> check the PCI configuration before & after (when things don't
>> work) My best guest is PCI bus mastering is no properly working
>> or the PCIE GPU gart table as wrong data.
>>
>> Maybe one need to drop bus master and reenable bus master to
>> work around some bug...
> Thanks for your suggestion. We've tried the 'drop and reenable master'
> trick, unfortunately doesn't work.
> The PCI configuration compare will be done later.
Update: We've checked the first 64 bytes of PCI configuration space
before & after, and didn't find any difference.



Regards,
-- Chen Jie


[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-16 Thread Chen Jie
Hi,

On Feb 15, 2012 at 11:53 PM, Jerome Glisse wrote:
> To me it looks like the CP is trying to fetch memory but the
> GPU memory controller fail to fullfill cp request. Did you
> check the PCI configuration before & after (when things don't
> work) My best guest is PCI bus mastering is no properly working
> or the PCIE GPU gart table as wrong data.
>
> Maybe one need to drop bus master and reenable bus master to
> work around some bug...
Thanks for your suggestion. We've tried the 'drop and reenable master'
trick; unfortunately it doesn't work.
The PCI configuration comparison will be done later.

Some additional information:
The "GPU Lockup" seems always occur after tasks be restarting -- We
inserted more ring tests , non of them failed before restarting tasks.
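(For reference, the "ring test" here is essentially r600_ring_test(): put a
tiny type-3 packet on the ring that makes the CP write 0xDEADBEEF into a
scratch register, bump CP_RB_WPTR, then poll the scratch register from the
CPU. A standalone sketch of the packet; the register offsets below are
quoted from memory of r600d.h, so double-check them against the tree.)

#include <stdio.h>
#include <stdint.h>

#define PACKET3(op, n) ((3u << 30) | (((n) & 0x3fff) << 16) | (((op) & 0xff) << 8))
#define PACKET3_SET_CONFIG_REG        0x68
#define PACKET3_SET_CONFIG_REG_OFFSET 0x8000   /* config-reg space base */
#define SCRATCH_REG0                  0x8500   /* assumed; verify in r600d.h */

int main(void)
{
	uint32_t pkt[3] = {
		PACKET3(PACKET3_SET_CONFIG_REG, 1),
		(SCRATCH_REG0 - PACKET3_SET_CONFIG_REG_OFFSET) >> 2,
		0xDEADBEEF,
	};

	/* The driver writes these three dwords to the ring, bumps CP_RB_WPTR,
	 * and then polls the scratch register until it reads 0xDEADBEEF. */
	for (int i = 0; i < 3; i++)
		printf("ring dword %d: 0x%08x\n", i, pkt[i]);
	return 0;
}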

BTW, I hacked the GART table to try to simulate the problem:
1. Change the system memory address (bus address) of ring_obj to an
arbitrary value, e.g. 0 or 128M.
2. Change the system memory address of a BO in radeon_test to an
arbitrary value, e.g. 0.

Neither of the above led to a GPU lockup:
Point 1 rendered a black screen;
Point 2 only made the test itself fail.

Any idea?


Regards,
-- Chen Jie


[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-16 Thread Jerome Glisse
On Thu, Feb 16, 2012 at 05:21:10PM +0800, Chen Jie wrote:
> Hi,
> 
> On Feb 15, 2012 at 11:53 PM, Jerome Glisse wrote:
> > To me it looks like the CP is trying to fetch memory but the
> > GPU memory controller fail to fullfill cp request. Did you
> > check the PCI configuration before & after (when things don't
> > work) My best guest is PCI bus mastering is no properly working
> > or the PCIE GPU gart table as wrong data.
> >
> > Maybe one need to drop bus master and reenable bus master to
> > work around some bug...
> Thanks for your suggestion. We've tried the 'drop and reenable master'
> trick, unfortunately doesn't work.
> The PCI configuration compare will be done later.
> 
> Some additional information:
> The "GPU Lockup" seems always occur after tasks be restarting -- We
> inserted more ring tests , non of them failed before restarting tasks.
> 
> BTW, I hacked GART  table to try to simulate the problem:
> 1. Changes the system memory address(bus address) of ring_obj to an
> arbitrary value, e.g. 0 or 128M.
> 2. Changes the system memory address of a BO in radeon_test to an
> arbitrary value, e.g. 0
> 
> Non of above leaded to a GPU Lockup:
> Point 1 rendered a black screen;
> Point 2 only the test itself failed
> 
> Any idea?
> 

OK, let's start from the beginning: I'm convinced it's related to the GPU
memory controller failing to fulfill some request that hits system
memory. So in another mail you wrote:

> BTW, I found radeon_gart_bind() will call pci_map_page(), it hooks
> to swiotlb_map_page on our platform, which seems allocates and returns
> dma_addr_t of a new page from pool if not meet dma_mask. Seems a bug, since
> the BO backed by one set of pages, but mapped to GART was another set of
> pages?

Is this still the case? As this is obviously wrong, we fixed it
recently. What drm code are you using? The rs780 DMA mask is something
like 40 bits IIRC, so you should never have an issue on your system with
1G of memory, right?

If you have an IOMMU, what happens on resume? Are all pages previously
mapped with pci_map_page() still valid?

One good way to test gart is to go over GPU gart table and write a
dword using the GPU at end of each page something like 0xCAFEDEAD
or somevalue that is unlikely to be already set. And then go over
all the page and check that GPU write succeed. Abusing the scratch
register write back feature is the easiest way to try that.

Cheers,
Jerome




[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-15 Thread Chen Jie
Hi,

Status update about the problem 'Occasionally "GPU lockup" after
resuming from suspend.'

First, this can happen when the system returns from STR (suspend to
RAM) or STD (suspend to disk, aka hibernation).
When returning from STD, the initialization process is most similar to
a normal boot.
Standby is OK; it is similar to STR, except that standby does not
shut down the power of the CPU, GPU, etc.

We've dumped and compared the registers, and found something:
CP_STAT
normal value: 0x
value when this problem occurred: 0x802100C1 or 0x802300C1

CP_ME_CNTL
normal value: 0x00FF
value when this problem occurred: always 0x20FF in our test

Questions:
According to the manual, CP_STAT = 0x802100C1 means:
CSF_RING_BUSY (bit 0):
	The Ring fetcher still has command buffer data to fetch, or the PFP
	still has data left to process from the reorder queue.
CSF_BUSY (bit 6):
	The input FIFOs have command buffers to fetch, or one or more of the
	fetchers are busy, or the arbiter has a request to send to the MIU.
MIU_RDREQ_BUSY (bit 7):
	The read path logic inside the MIU is busy.
MEQ_BUSY (bit 16):
	The PFP-to-ME queue has valid data in it.
SURFACE_SYNC_BUSY (bit 21):
	The Surface Sync unit is busy.
CP_BUSY (bit 31):
	Any block in the CP is busy.
What does it suggest?
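(A standalone decoder for the bits named in the manual excerpt above,
applied to the two observed values; note that 0x802300C1 also has bit 17
set, which the excerpt does not cover.)

#include <stdio.h>
#include <stdint.h>

/* Only the CP_STAT bits named in the manual excerpt above. */
static const struct {
	int bit;
	const char *name;
} cp_stat_bits[] = {
	{  0, "CSF_RING_BUSY" },
	{  6, "CSF_BUSY" },
	{  7, "MIU_RDREQ_BUSY" },
	{ 16, "MEQ_BUSY" },
	{ 21, "SURFACE_SYNC_BUSY" },
	{ 31, "CP_BUSY" },
};

static void decode_cp_stat(uint32_t val)
{
	unsigned i;

	printf("CP_STAT = 0x%08X:\n", val);
	for (i = 0; i < sizeof(cp_stat_bits) / sizeof(cp_stat_bits[0]); i++)
		if (val & (1u << cp_stat_bits[i].bit))
			printf("  bit %2d: %s\n",
			       cp_stat_bits[i].bit, cp_stat_bits[i].name);
}

int main(void)
{
	decode_cp_stat(0x802100C1); /* value seen after the failed resume */
	decode_cp_stat(0x802300C1); /* the other observed value */
	return 0;
}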

What does it mean if bit 29 of CP_ME_CNTL is set?

BTW, how does the dummy page work in GART?


Regards,
-- Chen Jie

On Dec 7, 2011 at 10:21 PM, Alex Deucher wrote:
> 2011/12/7  :
>> When "MC timeout" happens at GPU reset, we found the 12th and 13th
>> bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
>> two bits are like this:
>> #define G_000E50_MCDX_BUSY(x)  (((x) >> 12) & 1)
>> #define G_000E50_MCDW_BUSY(x)  (((x) >> 13) & 1)
>>
>> Could you please tell me what does they mean? And if possible,
>
> They refer to sub-blocks in the memory controller.  I don't really
> know off hand what the name mean.
>
>> I want to know the functionalities of these 5 registers in detail:
>> #define R_000E60_SRBM_SOFT_RESET   0x0E60
>> #define R_000E50_SRBM_STATUS   0x0E50
>> #define R_008020_GRBM_SOFT_RESET0x8020
>> #define R_008010_GRBM_STATUS0x8010
>> #define R_008014_GRBM_STATUS2   0x8014
>>
>> A bit more info: If I reset the MC after resetting CP (this is what
>> Linux-2.6.34 does, but removed since 2.6.35), then "MC timeout" will
>> disappear, but there is still "ring test failed".
>
> The bits are defined in r600d.h.  As to the acronyms:
> BIF - Bus InterFace
> CG - clocks
> DC - Display Controller
> GRBM - Graphics block (3D engine)
> HDP - Host Data Path (CPU access to vram via the PCI BAR)
> IH, RLC - Interrupt controller
> MC - Memory controller
> ROM - ROM
> SEM - semaphore controller
>
> When you reset the MC, you will probably have to reset just about
> everything else since most blocks depend on the MC for access to
> memory.  If you do reset the MC, you should do it at prior to calling
> asic_init so you make sure all the hw gets re-initialized properly.
> Additionally, you should probably reset the GRBM either via
> SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
>
> Alex
>
>>
>> Huacai Chen
>>
>>> 2011/11/8  :
 And, I want to know something:
 1, Does GPU use MC to access GTT?
>>>
>>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>>> memory (vram or gart).
>>>
 2, What can cause MC timeout?
>>>
>>> Lots of things.  Some GPU client still active, some GPU client hung or
>>> not properly initialized.
>>>
>>> Alex
>>>

> Hi,
>
> Some status update.
> On Sep 29, 2011 at 5:17 PM, Chen Jie wrote:
>> Hi,
>> Add more information.
>> We got occasionally "GPU lockup" after resuming from suspend(on mipsel
>> platform with a mips64 compatible CPU and rs780e, the kernel is
>> 3.1.0-rc8
>> 64bit).  Related kernel message:
>> /* return from STR */
>> [  156.152343] radeon :01:05.0: WB enabled
>> [  156.187500] [drm] ring test succeeded in 0 usecs
>> [  156.187500] [drm] ib test succeeded in 0 usecs
>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> [  156.597656] ata1.00: configured for UDMA/133
>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>> ehci_hcd
>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>> ohci_hcd
>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>> ohci_hcd
>> [  157.683593] r8169 :02:00.0: eth0: link up
>> [  165.621093] PM: resume 

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-15 Thread Jerome Glisse
On Wed, Feb 15, 2012 at 05:32:35PM +0800, Chen Jie wrote:
> Hi,
> 
> Status update about the problem 'Occasionally "GPU lockup" after
> resuming from suspend.'
> 
> First, this could happen when system returns from a STR(suspend to
> ram) or STD(suspend to disk, aka hibernation).
> When returns from STD, the initialization process is most similar to
> the normal boot.
> The standby is ok, which is similar to STR, except that standby will
> not shutdown the power of CPU,GPU etc.
> 
> We've dumped and compared the registers, and found something:
> CP_STAT
> normal value: 0x
> value when this problem occurred: 0x802100C1 or 0x802300C1
> 
> CP_ME_CNTL
> normal value: 0x00FF
> value when this problem occurred: always 0x20FF in our test
> 
> Questions:
> According to the manual,
> CP_STAT = 0x802100C1 means
>   CSF_RING_BUSY(bit 0):
>   The Ring fetcher still has command buffer data to fetch, or the 
> PFP
> still has data left to process from the reorder queue.
>   CSF_BUSY(bit 6):
>   The input FIFOs have command buffers to fetch, or one or more 
> of the
> fetchers are busy, or the arbiter has a request to send to the MIU.
>   MIU_RDREQ_BUSY(bit 7):
>   The read path logic inside the MIU is busy.
>   MEQ_BUSY(bit 16):
>   The PFP-to-ME queue has valid data in it.
>   SURFACE_SYNC_BUSY(bit 21):
>   The Surface Sync unit is busy.
>   CP_BUSY(bit 31):
>   Any block in the CP is busy.
> What does it suggest?
> 
> What does it mean if bit 29 of CP_ME_CNTL is set?
> 
> BTW, how does the dummy page work in GART?
> 
> 
> Regards,
> -- Chen Jie

To me it looks like the CP is trying to fetch memory but the
GPU memory controller fails to fulfill the CP's request. Did you
check the PCI configuration before & after (when things don't
work)? My best guess is that PCI bus mastering is not working
properly, or the PCIe GPU GART table has wrong data.

Maybe one needs to drop bus mastering and re-enable it to
work around some bug...

Cheers,
Jerome



[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-16 Thread che...@lemote.com
> On Don, 2011-12-08 at 19:35 +0800, chenhc at lemote.com wrote:
>>
>> I found CP_RB_WPTR has changed when "ring test failed", so I think CP is
>> active, but what it get from ring buffer is wrong.
>
> CP_RB_WPTR is normally only changed by the CPU after adding commands to
> the ring buffer, so I'm afraid that may not be a valid conclusion.
>
>
I'm sorry, I've made a spelling mistake. In fact, CP_RB_RPTR and
CP_RB_WPTR both changed, so I think CP is active.

>> Then, I want to know whether there is a way to check the content that
>> GPU get from ring buffer.
>
> See the r100_debugfs_cp_csq_fifo() function, which generates the output
> for /sys/kernel/debug/dri/0/r100_cp_csq_fifo.
>
Hmmm, I don't think this function can be used on r600 (or that a similar
one can be written for r600), because I haven't found the CSQ registers in the r600 code.

>
>> BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
>> /sys/power/state" to do a hibernation, there will be occasionally "GPU
>> reset" just like suspend. However, if I use "echo reboot >
>> /sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
>> wakeup automatically, there is no "GPU reset" after hundreds of tests.
>> What does this imply? Power loss cause something break?
>
> Yeah, it sounds like the resume code doesn't properly re-initialize
> something that's preserved on a warm boot but lost on a cold boot.
>
>
> --
> Earthling Michel D




[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-16 Thread Michel Dänzer
On Fre, 2011-12-16 at 16:42 +0800, chenhc at lemote.com wrote: 
> > On Don, 2011-12-08 at 19:35 +0800, chenhc at lemote.com wrote:
> >>
> >> I found CP_RB_WPTR has changed when "ring test failed", so I think CP is
> >> active, but what it get from ring buffer is wrong.
> >
> > CP_RB_WPTR is normally only changed by the CPU after adding commands to
> > the ring buffer, so I'm afraid that may not be a valid conclusion.
> >
> >
> I'm sorry, I've made a spelling mistake. In fact, CP_RB_RPTR and
> CP_RB_WPTR both changed, so I think CP is active.

I see. However, I think this actually makes it unlikely that the problem
is the CP reading wrong values from the ring, as otherwise the CP itself
would likely get stuck sooner or later.


> >> Then, I want to know whether there is a way to check the content that
> >> GPU get from ring buffer.
> >
> > See the r100_debugfs_cp_csq_fifo() function, which generates the output
> > for /sys/kernel/debug/dri/0/r100_cp_csq_fifo.
> >
> Hmmm, I don't think this function can be used by r600 (or write a similar
> one for R600), because I haven't found CSQ registers in r600 code.

Hmm yeah, looks like the registers for this have changed.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer


[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-16 Thread Jerome Glisse
2011/12/16  :
>> On Don, 2011-12-08 at 19:35 +0800, chenhc at lemote.com wrote:
>>>
>>> I found CP_RB_WPTR has changed when "ring test failed", so I think CP is
>>> active, but what it get from ring buffer is wrong.
>>
>> CP_RB_WPTR is normally only changed by the CPU after adding commands to
>> the ring buffer, so I'm afraid that may not be a valid conclusion.
>>
>>
> I'm sorry, I've made a spelling mistake. In fact, CP_RB_RPTR and
> CP_RB_WPTR both changed, so I think CP is active.
>
>>> Then, I want to know whether there is a way to check the content that
>>> GPU get from ring buffer.
>>
>> See the r100_debugfs_cp_csq_fifo() function, which generates the output
>> for /sys/kernel/debug/dri/0/r100_cp_csq_fifo.
>>
> Hmmm, I don't think this function can be used by r600 (or write a similar
> one for R600), because I haven't found CSQ registers in r600 code.
>
>>
>>> BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
>>> /sys/power/state" to do a hibernation, there will be occasionally "GPU
>>> reset" just like suspend. However, if I use "echo reboot >
>>> /sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
>>> wakeup automatically, there is no "GPU reset" after hundreds of tests.
>>> What does this imply? Power loss cause something break?
>>
>> Yeah, it sounds like the resume code doesn't properly re-initialize
>> something that's preserved on a warm boot but lost on a cold boot.
>>
>>
>> --
>> Earthling Michel D
>

It might be a PCI issue; you should check the PCI configuration before and
after suspend, though the kernel should properly restore things. The CP
might still be reading random data that doesn't lock up the CP (I've seen
it happen more than once).
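(One easy way to do that comparison from userspace is to snapshot the
config space via sysfs before suspend and after resume and diff the two
dumps; a minimal sketch, using the 01:05.0 address of the rs780e IGP seen
in the dmesg above -- lspci -s 01:05.0 -xxx gives the same data. An
unprivileged read returns only the first 64 bytes, which matches the
64-byte comparison mentioned elsewhere in this thread.)

#include <stdio.h>

int main(void)
{
	/* PCI address of the rs780e IGP as seen in the dmesg above. */
	const char *path = "/sys/bus/pci/devices/0000:01:05.0/config";
	unsigned char buf[256];
	size_t n, i;
	FILE *f = fopen(path, "rb");

	if (!f) {
		perror(path);
		return 1;
	}
	/* Non-root reads return only the 64-byte standard header. */
	n = fread(buf, 1, sizeof(buf), f);
	fclose(f);

	for (i = 0; i < n; i++)
		printf("%02x%c", buf[i], (i % 16 == 15) ? '\n' : ' ');
	printf("\n");
	return 0;
}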

Cheers,
Jerome




[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-15 Thread Michel Dänzer
On Don, 2011-12-08 at 19:35 +0800, chenhc at lemote.com wrote:
> 
> I found CP_RB_WPTR has changed when "ring test failed", so I think CP is
> active, but what it get from ring buffer is wrong.

CP_RB_WPTR is normally only changed by the CPU after adding commands to
the ring buffer, so I'm afraid that may not be a valid conclusion. 


> Then, I want to know whether there is a way to check the content that
> GPU get from ring buffer. 

See the r100_debugfs_cp_csq_fifo() function, which generates the output
for /sys/kernel/debug/dri/0/r100_cp_csq_fifo. 


> BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
> /sys/power/state" to do a hibernation, there will be occasionally "GPU
> reset" just like suspend. However, if I use "echo reboot >
> /sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
> wakeup automatically, there is no "GPU reset" after hundreds of tests.
> What does this imply? Power loss cause something break?

Yeah, it sounds like the resume code doesn't properly re-initialize
something that's preserved on a warm boot but lost on a cold boot. 


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer




Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-10 Thread chenhc
Thank you for your reply.

I found that CP_RB_WPTR had changed when the "ring test failed", so I think
the CP is active, but what it gets from the ring buffer is wrong. So I want
to know whether there is a way to check the content that the GPU gets from
the ring buffer.
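(One thing that can be checked from the CPU side is the CPU copy of the
ring around the read pointer; a standalone sketch below, with a toy ring
and a made-up rptr. It only shows what the CPU wrote, of course -- the open
question in this thread is whether the GPU fetches the same data through
the GART.)

#include <stdio.h>
#include <stdint.h>

/*
 * Print the dwords just behind the CP read pointer, given a CPU-visible
 * copy of the ring.  ring_dw is the ring size in dwords (a power of two,
 * so masking wraps the index).  rptr would come from CP_RB_RPTR; the
 * values below are made up for illustration.
 */
static void dump_before_rptr(const uint32_t *ring, uint32_t ring_dw,
			     uint32_t rptr, uint32_t count)
{
	uint32_t mask = ring_dw - 1;
	uint32_t i;

	for (i = 0; i < count; i++) {
		uint32_t idx = (rptr - count + i) & mask;

		printf("r[%u] = 0x%08x\n", idx, ring[idx]);
	}
}

int main(void)
{
	uint32_t toy_ring[8] = {
		0x80000000, 0x80000000, 0xc0016800, 0x00000140,
		0xdeadbeef, 0x80000000, 0x80000000, 0x80000000,
	};

	dump_before_rptr(toy_ring, 8, 5, 4); /* pretend rptr == 5 */
	return 0;
}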

BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
/sys/power/state" to do a hibernation, there will be occasionally "GPU
reset" just like suspend. However, if I use "echo reboot >
/sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
wakeup automatically, there is no "GPU reset" after hundreds of tests.
What does this imply? Power loss cause something break?

Best regards,

Huacai Chen


> 2011/12/7  :
>> When "MC timeout" happens at GPU reset, we found the 12th and 13th
>> bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
>> two bits are like this:
>> #define G_000E50_MCDX_BUSY(x)  (((x) >> 12) & 1)
>> #define G_000E50_MCDW_BUSY(x)  (((x) >> 13) & 1)
>>
>> Could you please tell me what does they mean? And if possible,
>
> They refer to sub-blocks in the memory controller.  I don't really
> know off hand what the name mean.
>
>> I want to know the functionalities of these 5 registers in detail:
>> #define R_000E60_SRBM_SOFT_RESET   0x0E60
>> #define R_000E50_SRBM_STATUS   0x0E50
>> #define R_008020_GRBM_SOFT_RESET0x8020
>> #define R_008010_GRBM_STATUS0x8010
>> #define R_008014_GRBM_STATUS2   0x8014
>>
>> A bit more info: If I reset the MC after resetting CP (this is what
>> Linux-2.6.34 does, but removed since 2.6.35), then "MC timeout" will
>> disappear, but there is still "ring test failed".
>
> The bits are defined in r600d.h.  As to the acronyms:
> BIF - Bus InterFace
> CG - clocks
> DC - Display Controller
> GRBM - Graphics block (3D engine)
> HDP - Host Data Path (CPU access to vram via the PCI BAR)
> IH, RLC - Interrupt controller
> MC - Memory controller
> ROM - ROM
> SEM - semaphore controller
>
> When you reset the MC, you will probably have to reset just about
> everything else since most blocks depend on the MC for access to
> memory.  If you do reset the MC, you should do it at prior to calling
> asic_init so you make sure all the hw gets re-initialized properly.
> Additionally, you should probably reset the GRBM either via
> SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
>
> Alex
>
>>
>> Huacai Chen
>>
>>> 2011/11/8  :
 And, I want to know something:
 1, Does GPU use MC to access GTT?
>>>
>>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>>> memory (vram or gart).
>>>
 2, What can cause MC timeout?
>>>
>>> Lots of things.  Some GPU client still active, some GPU client hung or
>>> not properly initialized.
>>>
>>> Alex
>>>

> Hi,
>
> Some status update.
> On Sep 29, 2011 at 5:17 PM, Chen Jie wrote:
>> Hi,
>> Add more information.
>> We got occasionally "GPU lockup" after resuming from suspend(on
>> mipsel
>> platform with a mips64 compatible CPU and rs780e, the kernel is
>> 3.1.0-rc8
>> 64bit).  Related kernel message:
>> /* return from STR */
>> [  156.152343] radeon :01:05.0: WB enabled
>> [  156.187500] [drm] ring test succeeded in 0 usecs
>> [  156.187500] [drm] ib test succeeded in 0 usecs
>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl
>> 300)
>> [  156.597656] ata1.00: configured for UDMA/133
>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>> ehci_hcd
>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>> ohci_hcd
>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>> ohci_hcd
>> [  157.683593] r8169 :02:00.0: eth0: link up
>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>> [  165.628906] Restarting tasks ... done.
>> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more
>> than
>> 10019msec
>> [  177.089843] [ cut here ]
>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>> radeon_fence_wait+0x25c/0x33c()
>> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
>> 0x13AD)
>> [  177.113281] Modules linked in: psmouse serio_raw
>> [  177.117187] Call Trace:
>> [  177.121093] [] dump_stack+0x8/0x34
>> [  177.125000] [] warn_slowpath_common+0x78/0xa0
>> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
>> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
>> [  177.144531] [] ttm_bo_wait+0x108/0x220
>> [  177.148437] []
>> radeon_gem_wait_idle_ioctl+0x80/0x114
>> [  177.156250]

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-08 Thread che...@lemote.com
Thank you for your reply.

I found CP_RB_WPTR has changed when "ring test failed", so I think CP is
active, but what it get from ring buffer is wrong. Then, I want to know
whether there is a way to check the content that GPU get from ring buffer.

BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
/sys/power/state" to do a hibernation, there will be occasionally "GPU
reset" just like suspend. However, if I use "echo reboot >
/sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
wakeup automatically, there is no "GPU reset" after hundreds of tests.
What does this imply? Power loss cause something break?

Best regards,

Huacai Chen


> 2011/12/7  :
>> When "MC timeout" happens at GPU reset, we found the 12th and 13th
>> bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
>> two bits are like this:
>> #define G_000E50_MCDX_BUSY(x)  (((x) >> 12) & 1)
>> #define G_000E50_MCDW_BUSY(x)  (((x) >> 13) & 1)
>>
>> Could you please tell me what does they mean? And if possible,
>
> They refer to sub-blocks in the memory controller.  I don't really
> know off hand what the name mean.
>
>> I want to know the functionalities of these 5 registers in detail:
>> #define R_000E60_SRBM_SOFT_RESET   0x0E60
>> #define R_000E50_SRBM_STATUS   0x0E50
>> #define R_008020_GRBM_SOFT_RESET0x8020
>> #define R_008010_GRBM_STATUS0x8010
>> #define R_008014_GRBM_STATUS2   0x8014
>>
>> A bit more info: if I reset the MC after resetting the CP (this is what
>> Linux 2.6.34 did, but it was removed in 2.6.35), then the "MC timeout"
>> disappears, but the "ring test failed" error remains.
>
> The bits are defined in r600d.h.  As to the acronyms:
> BIF - Bus InterFace
> CG - clocks
> DC - Display Controller
> GRBM - Graphics block (3D engine)
> HDP - Host Data Path (CPU access to vram via the PCI BAR)
> IH, RLC - Interrupt controller
> MC - Memory controller
> ROM - ROM
> SEM - semaphore controller
>
> When you reset the MC, you will probably have to reset just about
> everything else since most blocks depend on the MC for access to
> memory.  If you do reset the MC, you should do it prior to calling
> asic_init so that all the hw gets re-initialized properly.
> Additionally, you should probably reset the GRBM either via
> SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
>
> Alex
>
>>
>> Huacai Chen
>>
>>> 2011/11/8  :
 And, I want to know something:
 1, Does GPU use MC to access GTT?
>>>
>>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>>> memory (vram or gart).
>>>
 2, What can cause MC timeout?
>>>
>>> Lots of things.  Some GPU client still active, some GPU client hung or
>>> not properly initialized.
>>>
>>> Alex
>>>

> Hi,
>
> Some status update.
> On 2011-09-29 at 5:17 PM, Chen Jie wrote:
>> Hi,
>> Add more information.
>> We got occasionally "GPU lockup" after resuming from suspend(on
>> mipsel
>> platform with a mips64 compatible CPU and rs780e, the kernel is
>> 3.1.0-rc8
>> 64bit).  Related kernel message:
>> /* return from STR */
>> [  156.152343] radeon :01:05.0: WB enabled
>> [  156.187500] [drm] ring test succeeded in 0 usecs
>> [  156.187500] [drm] ib test succeeded in 0 usecs
>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl
>> 300)
>> [  156.597656] ata1.00: configured for UDMA/133
>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>> ehci_hcd
>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>> ohci_hcd
>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>> ohci_hcd
>> [  157.683593] r8169 :02:00.0: eth0: link up
>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>> [  165.628906] Restarting tasks ... done.
>> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more
>> than
>> 10019msec
>> [  177.089843] [ cut here ]
>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>> radeon_fence_wait+0x25c/0x33c()
>> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
>> 0x13AD)
>> [  177.113281] Modules linked in: psmouse serio_raw
>> [  177.117187] Call Trace:
>> [  177.121093] [] dump_stack+0x8/0x34
>> [  177.125000] [] warn_slowpath_common+0x78/0xa0
>> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
>> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
>> [  177.144531] [] ttm_bo_wait+0x108/0x220
>> [  177.148437] []
>> radeon_gem_wait_idle_ioctl+0x80/0x114
>> [  177.156250]

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-07 Thread che...@lemote.com
When "MC timeout" happens at GPU reset, we found the 12th and 13th
bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
two bits are like this:
#define G_000E50_MCDX_BUSY(x)  (((x) >> 12) & 1)
#define G_000E50_MCDW_BUSY(x)  (((x) >> 13) & 1)

Could you please tell me what they mean? And if possible,
I want to know the functionalities of these 5 registers in detail:
#define R_000E60_SRBM_SOFT_RESET   0x0E60
#define R_000E50_SRBM_STATUS   0x0E50
#define R_008020_GRBM_SOFT_RESET0x8020
#define R_008010_GRBM_STATUS0x8010
#define R_008014_GRBM_STATUS2   0x8014

A bit more info: if I reset the MC after resetting the CP (this is what
Linux 2.6.34 did, but it was removed in 2.6.35), then the "MC timeout"
disappears, but the "ring test failed" error remains.

Huacai Chen

> 2011/11/8  :
>> And, I want to know something:
>> 1, Does GPU use MC to access GTT?
>
> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
> memory (vram or gart).
>
>> 2, What can cause MC timeout?
>
> Lots of things.  Some GPU client still active, some GPU client hung or
> not properly initialized.
>
> Alex
>
>>
>>> Hi,
>>>
>>> Some status update.
>>> On 2011-09-29 at 5:17 PM, Chen Jie wrote:
 Hi,
 Add more information.
 We got occasionally "GPU lockup" after resuming from suspend(on mipsel
 platform with a mips64 compatible CPU and rs780e, the kernel is
 3.1.0-rc8
 64bit).  Related kernel message:
 /* return from STR */
 [  156.152343] radeon :01:05.0: WB enabled
 [  156.187500] [drm] ring test succeeded in 0 usecs
 [  156.187500] [drm] ib test succeeded in 0 usecs
 [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
 [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
 [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
 [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 [  156.597656] ata1.00: configured for UDMA/133
 [  156.613281] usb 1-5: reset high speed USB device number 4 using
 ehci_hcd
 [  157.027343] usb 3-2: reset low speed USB device number 2 using
 ohci_hcd
 [  157.609375] usb 3-3: reset low speed USB device number 3 using
 ohci_hcd
 [  157.683593] r8169 :02:00.0: eth0: link up
 [  165.621093] PM: resume of devices complete after 9679.556 msecs
 [  165.628906] Restarting tasks ... done.
 [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
 10019msec
 [  177.089843] [ cut here ]
 [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
 radeon_fence_wait+0x25c/0x33c()
 [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
 0x13AD)
 [  177.113281] Modules linked in: psmouse serio_raw
 [  177.117187] Call Trace:
 [  177.121093] [] dump_stack+0x8/0x34
 [  177.125000] [] warn_slowpath_common+0x78/0xa0
 [  177.132812] [] warn_slowpath_fmt+0x38/0x44
 [  177.136718] [] radeon_fence_wait+0x25c/0x33c
 [  177.144531] [] ttm_bo_wait+0x108/0x220
 [  177.148437] []
 radeon_gem_wait_idle_ioctl+0x80/0x114
 [  177.156250] [] drm_ioctl+0x2e4/0x3fc
 [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
 [  177.167968] [] compat_sys_ioctl+0x120/0x35c
 [  177.171875] [] handle_sys+0x118/0x138
 [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
 [  177.187500] radeon :01:05.0: GPU softreset
 [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
 [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
 [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
 [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
 [  177.367187] radeon :01:05.0:
 R_008020_GRBM_SOFT_RESET=0x7FEE
 [  177.390625] radeon :01:05.0:
 R_008020_GRBM_SOFT_RESET=0x0001
 [  177.414062] radeon :01:05.0:   R_008010_GRBM_STATUS=0xA0003030
 [  177.417968] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
 [  177.425781] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
 [  177.433593] radeon :01:05.0: GPU reset succeed
 [  177.605468] radeon :01:05.0: Wait for MC idle timedout !
 [  177.761718] radeon :01:05.0: Wait for MC idle timedout !
 [  177.804687] radeon :01:05.0: WB enabled
 [  178.00] [drm:r600_ring_test] *ERROR* radeon: ring test failed
 (scratch(0x8504)=0xCAFEDEAD)
>>> After pinning the ring in VRAM, it warned about an ib test failure. It
>>> seems something is wrong with accessing memory through GTT.
>>>
>>> We dumped the gart table just after the CP was stopped and compared it
>>> with the one dumped just after r600_pcie_gart_enable, and found no
>>> difference.
>>>
>>> Any idea?
>>>
 [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
 [ 

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-07 Thread Alex Deucher
2011/12/7  :
> When "MC timeout" happens at GPU reset, we found the 12th and 13th
> bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
> two bits are like this:
> #define G_000E50_MCDX_BUSY(x)  (((x) >> 12) & 1)
> #define G_000E50_MCDW_BUSY(x)  (((x) >> 13) & 1)
>
> Could you please tell me what they mean? And if possible,

They refer to sub-blocks in the memory controller.  I don't really
know offhand what the names mean.

> I want to know the functionalities of these 5 registers in detail:
> #define R_000E60_SRBM_SOFT_RESET   0x0E60
> #define R_000E50_SRBM_STATUS   0x0E50
> #define R_008020_GRBM_SOFT_RESET0x8020
> #define R_008010_GRBM_STATUS0x8010
> #define R_008014_GRBM_STATUS2   0x8014
>
> A bit more info: if I reset the MC after resetting the CP (this is what
> Linux 2.6.34 did, but it was removed in 2.6.35), then the "MC timeout"
> disappears, but the "ring test failed" error remains.

The bits are defined in r600d.h.  As to the acronyms:
BIF - Bus InterFace
CG - clocks
DC - Display Controller
GRBM - Graphics block (3D engine)
HDP - Host Data Path (CPU access to vram via the PCI BAR)
IH, RLC - Interrupt controller
MC - Memory controller
ROM - ROM
SEM - semaphore controller

When you reset the MC, you will probably have to reset just about
everything else since most blocks depend on the MC for access to
memory.  If you do reset the MC, you should do it prior to calling
asic_init so that all the hw gets re-initialized properly.
Additionally, you should probably reset the GRBM either via
SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
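
In rough code form, that ordering would look something like the sketch
below (illustrative only; S_000E60_SOFT_RESET_MC is an assumed name for the
MC bit in SRBM_SOFT_RESET, and the reset values and delays are placeholders,
not taken from r600d.h):

/* Illustrative ordering only, not tested code. */
static void r600_mc_reset_sketch(struct radeon_device *rdev)
{
        /* quiesce and soft-reset the graphics block first */
        WREG32(R_008020_GRBM_SOFT_RESET, 0x7FFF);   /* placeholder value */
        RREG32(R_008020_GRBM_SOFT_RESET);
        mdelay(15);
        WREG32(R_008020_GRBM_SOFT_RESET, 0);

        /* then soft-reset the MC itself via SRBM */
        WREG32(R_000E60_SRBM_SOFT_RESET, S_000E60_SOFT_RESET_MC(1));
        RREG32(R_000E60_SRBM_SOFT_RESET);
        mdelay(15);
        WREG32(R_000E60_SRBM_SOFT_RESET, 0);

        /* only after this, re-run the asic init / startup path so that
         * every block that depends on the MC gets reprogrammed */
}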

Alex

>
> Huacai Chen
>
>> 2011/11/8  :
>>> And, I want to know something:
>>> 1, Does GPU use MC to access GTT?
>>
>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>> memory (vram or gart).
>>
>>> 2, What can cause MC timeout?
>>
>> Lots of things.  Some GPU client still active, some GPU client hung or
>> not properly initialized.
>>
>> Alex
>>
>>>
 Hi,

 Some status update.
 On 2011-09-29 at 5:17 PM, Chen Jie wrote:
> Hi,
> Add more information.
> We got occasionally "GPU lockup" after resuming from suspend(on mipsel
> platform with a mips64 compatible CPU and rs780e, the kernel is
> 3.1.0-rc8
> 64bit).  Related kernel message:
> /* return from STR */
> [  156.152343] radeon :01:05.0: WB enabled
> [  156.187500] [drm] ring test succeeded in 0 usecs
> [  156.187500] [drm] ib test succeeded in 0 usecs
> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [  156.597656] ata1.00: configured for UDMA/133
> [  156.613281] usb 1-5: reset high speed USB device number 4 using
> ehci_hcd
> [  157.027343] usb 3-2: reset low speed USB device number 2 using
> ohci_hcd
> [  157.609375] usb 3-3: reset low speed USB device number 3 using
> ohci_hcd
> [  157.683593] r8169 :02:00.0: eth0: link up
> [  165.621093] PM: resume of devices complete after 9679.556 msecs
> [  165.628906] Restarting tasks ... done.
> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
> 10019msec
> [  177.089843] [ cut here ]
> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
> radeon_fence_wait+0x25c/0x33c()
> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
> 0x13AD)
> [  177.113281] Modules linked in: psmouse serio_raw
> [  177.117187] Call Trace:
> [  177.121093] [] dump_stack+0x8/0x34
> [  177.125000] [] warn_slowpath_common+0x78/0xa0
> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
> [  177.144531] [] ttm_bo_wait+0x108/0x220
> [  177.148437] []
> radeon_gem_wait_idle_ioctl+0x80/0x114
> [  177.156250] [] drm_ioctl+0x2e4/0x3fc
> [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
> [  177.167968] [] compat_sys_ioctl+0x120/0x35c
> [  177.171875] [] handle_sys+0x118/0x138
> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
> [  177.187500] radeon :01:05.0: GPU softreset
> [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
> [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
> [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
> [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
> [  177.367187] radeon :01:05.0:
> R_008020_GRBM_SOFT_RESET=0x7FEE
> [  177.390625] radeon :01:05.0:
> R_008020_GRBM_SOFT_RESET=0x0001
> [  177.414062] radeon :01:05.0:   R_0080

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-11-08 Thread che...@lemote.com
And, I want to know something:
1. Does the GPU use the MC to access the GTT?
2. What can cause an MC timeout?

> Hi,
>
> Some status update.
> On 2011-09-29 at 5:17 PM, Chen Jie wrote:
>> Hi,
>> Add more information.
>> We got occasionally "GPU lockup" after resuming from suspend(on mipsel
>> platform with a mips64 compatible CPU and rs780e, the kernel is
>> 3.1.0-rc8
>> 64bit).  Related kernel message:
>> /* return from STR */
>> [  156.152343] radeon :01:05.0: WB enabled
>> [  156.187500] [drm] ring test succeeded in 0 usecs
>> [  156.187500] [drm] ib test succeeded in 0 usecs
>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> [  156.597656] ata1.00: configured for UDMA/133
>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>> ehci_hcd
>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>> ohci_hcd
>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>> ohci_hcd
>> [  157.683593] r8169 :02:00.0: eth0: link up
>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>> [  165.628906] Restarting tasks ... done.
>> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
>> 10019msec
>> [  177.089843] [ cut here ]
>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>> radeon_fence_wait+0x25c/0x33c()
>> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
>> 0x13AD)
>> [  177.113281] Modules linked in: psmouse serio_raw
>> [  177.117187] Call Trace:
>> [  177.121093] [] dump_stack+0x8/0x34
>> [  177.125000] [] warn_slowpath_common+0x78/0xa0
>> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
>> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
>> [  177.144531] [] ttm_bo_wait+0x108/0x220
>> [  177.148437] []
>> radeon_gem_wait_idle_ioctl+0x80/0x114
>> [  177.156250] [] drm_ioctl+0x2e4/0x3fc
>> [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
>> [  177.167968] [] compat_sys_ioctl+0x120/0x35c
>> [  177.171875] [] handle_sys+0x118/0x138
>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>> [  177.187500] radeon :01:05.0: GPU softreset
>> [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>> [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
>> [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>> [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
>> [  177.367187] radeon :01:05.0:
>> R_008020_GRBM_SOFT_RESET=0x7FEE
>> [  177.390625] radeon :01:05.0: R_008020_GRBM_SOFT_RESET=0x0001
>> [  177.414062] radeon :01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>> [  177.417968] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
>> [  177.425781] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>> [  177.433593] radeon :01:05.0: GPU reset succeed
>> [  177.605468] radeon :01:05.0: Wait for MC idle timedout !
>> [  177.761718] radeon :01:05.0: Wait for MC idle timedout !
>> [  177.804687] radeon :01:05.0: WB enabled
>> [  178.00] [drm:r600_ring_test] *ERROR* radeon: ring test failed
>> (scratch(0x8504)=0xCAFEDEAD)
> After pinning the ring in VRAM, it warned about an ib test failure. It
> seems something is wrong with accessing memory through GTT.
>
> We dumped the gart table just after the CP was stopped and compared it
> with the one dumped just after r600_pcie_gart_enable, and found no
> difference.
>
> Any idea?
>
>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>> schedule
>> IB(5).
>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>> schedule
>> IB(6).
>> ...
>
>
>
> Regards,
> -- Chen Jie
>




[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-11-08 Thread Chen Jie
Hi,

Some status update.
On 2011-09-29 at 5:17 PM, Chen Jie wrote:
> Hi,
> Add more information.
> We got occasionally "GPU lockup" after resuming from suspend(on mipsel
> platform with a mips64 compatible CPU and rs780e, the kernel is 3.1.0-rc8
> 64bit).  Related kernel message:
> /* return from STR */
> [  156.152343] radeon :01:05.0: WB enabled
> [  156.187500] [drm] ring test succeeded in 0 usecs
> [  156.187500] [drm] ib test succeeded in 0 usecs
> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [  156.597656] ata1.00: configured for UDMA/133
> [  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
> [  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
> [  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
> [  157.683593] r8169 :02:00.0: eth0: link up
> [  165.621093] PM: resume of devices complete after 9679.556 msecs
> [  165.628906] Restarting tasks ... done.
> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
> 10019msec
> [  177.089843] [ cut here ]
> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
> radeon_fence_wait+0x25c/0x33c()
> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id 0x13AD)
> [  177.113281] Modules linked in: psmouse serio_raw
> [  177.117187] Call Trace:
> [  177.121093] [] dump_stack+0x8/0x34
> [  177.125000] [] warn_slowpath_common+0x78/0xa0
> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
> [  177.144531] [] ttm_bo_wait+0x108/0x220
> [  177.148437] [] radeon_gem_wait_idle_ioctl+0x80/0x114
> [  177.156250] [] drm_ioctl+0x2e4/0x3fc
> [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
> [  177.167968] [] compat_sys_ioctl+0x120/0x35c
> [  177.171875] [] handle_sys+0x118/0x138
> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
> [  177.187500] radeon :01:05.0: GPU softreset
> [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
> [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
> [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
> [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
> [  177.367187] radeon :01:05.0:   R_008020_GRBM_SOFT_RESET=0x7FEE
> [  177.390625] radeon :01:05.0: R_008020_GRBM_SOFT_RESET=0x0001
> [  177.414062] radeon :01:05.0:   R_008010_GRBM_STATUS=0xA0003030
> [  177.417968] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
> [  177.425781] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
> [  177.433593] radeon :01:05.0: GPU reset succeed
> [  177.605468] radeon :01:05.0: Wait for MC idle timedout !
> [  177.761718] radeon :01:05.0: Wait for MC idle timedout !
> [  177.804687] radeon :01:05.0: WB enabled
> [  178.00] [drm:r600_ring_test] *ERROR* radeon: ring test failed
> (scratch(0x8504)=0xCAFEDEAD)
After pinning the ring in VRAM, it warned about an ib test failure. It seems
something is wrong with accessing memory through GTT.

We dumped the gart table just after the CP was stopped and compared it with
the one dumped just after r600_pcie_gart_enable, and found no difference.

Any idea?
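
For reference, the compare was done roughly like the sketch below (the
rdev->gart.ptr and rdev->gart.table_size names are assumptions -- older
kernels keep the CPU mapping in gart.table.vram.ptr -- and __iomem
annotations are ignored for brevity):

#include <linux/vmalloc.h>
#include <linux/string.h>

static void *gart_snapshot;

/* call right after r600_pcie_gart_enable() */
static void gart_table_snapshot(struct radeon_device *rdev)
{
        gart_snapshot = vmalloc(rdev->gart.table_size);
        if (gart_snapshot)
                memcpy(gart_snapshot, rdev->gart.ptr, rdev->gart.table_size);
}

/* call right after the CP has been stopped */
static bool gart_table_changed(struct radeon_device *rdev)
{
        if (!gart_snapshot)
                return false;
        return memcmp(gart_snapshot, rdev->gart.ptr,
                      rdev->gart.table_size) != 0;
}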

> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> IB(5).
> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> IB(6).
> ...



Regards,
-- Chen Jie


[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-11-08 Thread Jerome Glisse
On Tue, Nov 08, 2011 at 03:33:03PM +0800, Chen Jie wrote:
> Hi,
> 
> Some status update.
> On 2011-09-29 at 5:17 PM, Chen Jie wrote:
> > Hi,
> > Add more information.
> > We got occasionally "GPU lockup" after resuming from suspend(on mipsel
> > platform with a mips64 compatible CPU and rs780e, the kernel is 3.1.0-rc8
> > 64bit).  Related kernel message:
> > /* return from STR */
> > [  156.152343] radeon :01:05.0: WB enabled
> > [  156.187500] [drm] ring test succeeded in 0 usecs
> > [  156.187500] [drm] ib test succeeded in 0 usecs
> > [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
> > [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
> > [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
> > [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > [  156.597656] ata1.00: configured for UDMA/133
> > [  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
> > [  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
> > [  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
> > [  157.683593] r8169 :02:00.0: eth0: link up
> > [  165.621093] PM: resume of devices complete after 9679.556 msecs
> > [  165.628906] Restarting tasks ... done.
> > [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
> > 10019msec
> > [  177.089843] [ cut here ]
> > [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
> > radeon_fence_wait+0x25c/0x33c()
> > [  177.105468] GPU lockup (waiting for 0x13C3 last fence id 0x13AD)
> > [  177.113281] Modules linked in: psmouse serio_raw
> > [  177.117187] Call Trace:
> > [  177.121093] [] dump_stack+0x8/0x34
> > [  177.125000] [] warn_slowpath_common+0x78/0xa0
> > [  177.132812] [] warn_slowpath_fmt+0x38/0x44
> > [  177.136718] [] radeon_fence_wait+0x25c/0x33c
> > [  177.144531] [] ttm_bo_wait+0x108/0x220
> > [  177.148437] [] radeon_gem_wait_idle_ioctl+0x80/0x114
> > [  177.156250] [] drm_ioctl+0x2e4/0x3fc
> > [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
> > [  177.167968] [] compat_sys_ioctl+0x120/0x35c
> > [  177.171875] [] handle_sys+0x118/0x138
> > [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
> > [  177.187500] radeon :01:05.0: GPU softreset
> > [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
> > [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
> > [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
> > [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
> > [  177.367187] radeon :01:05.0:   R_008020_GRBM_SOFT_RESET=0x7FEE
> > [  177.390625] radeon :01:05.0: R_008020_GRBM_SOFT_RESET=0x0001
> > [  177.414062] radeon :01:05.0:   R_008010_GRBM_STATUS=0xA0003030
> > [  177.417968] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
> > [  177.425781] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
> > [  177.433593] radeon :01:05.0: GPU reset succeed
> > [  177.605468] radeon :01:05.0: Wait for MC idle timedout !
> > [  177.761718] radeon :01:05.0: Wait for MC idle timedout !
> > [  177.804687] radeon :01:05.0: WB enabled
> > [  178.00] [drm:r600_ring_test] *ERROR* radeon: ring test failed
> > (scratch(0x8504)=0xCAFEDEAD)
> After pinning the ring in VRAM, it warned about an ib test failure. It
> seems something is wrong with accessing memory through GTT.
> 
> We dumped the gart table just after the CP was stopped and compared it
> with the one dumped just after r600_pcie_gart_enable, and found no
> difference.
> 
> Any idea?
> 
> > [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
> > [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> > IB(5).
> > [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
> > [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> > IB(6).
> > ...
> 
> 

Do you have any kind of iommu ? Is the gart table programmed with the proper
physical address for each page ? Is the GPU a PCI master (iirc a PCI device
needs to be a master to be able to initiate requests to memory) ? Then there
could be a lot of other PCI things getting in the way.
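
On the bus-master point, a quick check from the driver side could look like
this (a sketch; it assumes rdev->pdev is the GPU's struct pci_dev as in the
radeon driver):

#include <linux/pci.h>

/* Verify the GPU kept PCI bus mastering across resume; without it the GPU
 * cannot initiate DMA to system RAM, so no GART traffic would work. */
static void check_gpu_bus_master(struct radeon_device *rdev)
{
        u16 cmd;

        pci_read_config_word(rdev->pdev, PCI_COMMAND, &cmd);
        if (!(cmd & PCI_COMMAND_MASTER)) {
                dev_warn(&rdev->pdev->dev,
                         "bus mastering disabled, re-enabling\n");
                pci_set_master(rdev->pdev);
        }
}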

Cheers,
Jerome


[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-11-08 Thread Alex Deucher
2011/11/8  :
> And, I want to know something:
> 1, Does GPU use MC to access GTT?

Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
memory (vram or gart).

> 2, What can cause MC timeout?

Lots of things.  Some GPU client still active, some GPU client hung or
not properly initialized.

Alex

>
>> Hi,
>>
>> Some status update.
>> On 2011-09-29 at 5:17 PM, Chen Jie wrote:
>>> Hi,
>>> Add more information.
>>> We got occasionally "GPU lockup" after resuming from suspend(on mipsel
>>> platform with a mips64 compatible CPU and rs780e, the kernel is
>>> 3.1.0-rc8
>>> 64bit).  Related kernel message:
>>> /* return from STR */
>>> [  156.152343] radeon :01:05.0: WB enabled
>>> [  156.187500] [drm] ring test succeeded in 0 usecs
>>> [  156.187500] [drm] ib test succeeded in 0 usecs
>>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>> [  156.597656] ata1.00: configured for UDMA/133
>>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>>> ehci_hcd
>>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>>> ohci_hcd
>>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>>> ohci_hcd
>>> [  157.683593] r8169 :02:00.0: eth0: link up
>>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>>> [  165.628906] Restarting tasks ... done.
>>> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
>>> 10019msec
>>> [  177.089843] [ cut here ]
>>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>>> radeon_fence_wait+0x25c/0x33c()
>>> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
>>> 0x13AD)
>>> [  177.113281] Modules linked in: psmouse serio_raw
>>> [  177.117187] Call Trace:
>>> [  177.121093] [] dump_stack+0x8/0x34
>>> [  177.125000] [] warn_slowpath_common+0x78/0xa0
>>> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
>>> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
>>> [  177.144531] [] ttm_bo_wait+0x108/0x220
>>> [  177.148437] []
>>> radeon_gem_wait_idle_ioctl+0x80/0x114
>>> [  177.156250] [] drm_ioctl+0x2e4/0x3fc
>>> [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
>>> [  177.167968] [] compat_sys_ioctl+0x120/0x35c
>>> [  177.171875] [] handle_sys+0x118/0x138
>>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>> [  177.187500] radeon :01:05.0: GPU softreset
>>> [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>>> [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
>>> [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>>> [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
>>> [  177.367187] radeon :01:05.0:
>>> R_008020_GRBM_SOFT_RESET=0x7FEE
>>> [  177.390625] radeon :01:05.0: R_008020_GRBM_SOFT_RESET=0x0001
>>> [  177.414062] radeon :01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>>> [  177.417968] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
>>> [  177.425781] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>>> [  177.433593] radeon :01:05.0: GPU reset succeed
>>> [  177.605468] radeon :01:05.0: Wait for MC idle timedout !
>>> [  177.761718] radeon :01:05.0: Wait for MC idle timedout !
>>> [  177.804687] radeon :01:05.0: WB enabled
>>> [  178.00] [drm:r600_ring_test] *ERROR* radeon: ring test failed
>>> (scratch(0x8504)=0xCAFEDEAD)
>> After pinning the ring in VRAM, it warned about an ib test failure. It
>> seems something is wrong with accessing memory through GTT.
>>
>> We dumped the gart table just after the CP was stopped and compared it
>> with the one dumped just after r600_pcie_gart_enable, and found no
>> difference.
>>
>> Any idea?
>>
>>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>> schedule
>>> IB(5).
>>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>> schedule
>>> IB(6).
>>> ...
>>
>>
>>
>> Regards,
>> -- Chen Jie
>>
>
>
>


[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-10-20 Thread Michel Dänzer
On Die, 2011-10-18 at 16:35 +0800, Chen Jie wrote:
> 
> On 2011-10-17 at 2:34 PM,  wrote:
> If I start X but switch to the console, then do suspend & resume, "GPU
> reset" hardly happens. But there is a new problem: the IRQ of the radeon
> card is disabled. Maybe "GPU reset" has something to do with "IRQ
> disabled"?
> 
> I have tried "irqpoll"; it doesn't fix the problem.
> 
> [  571.914062] irq 6: nobody cared (try booting with the "irqpoll" option)
> [  571.914062] Call Trace:
> [  571.914062] [] dump_stack+0x8/0x34
> [  571.914062] [] __report_bad_irq.clone.6+0x44/0x15c
> [  571.914062] [] note_interrupt+0x204/0x2a0
> [  571.914062] [] handle_irq_event_percpu+0x19c/0x1f8
> [  571.914062] [] handle_irq_event+0x68/0xa8
> [  571.914062] [] handle_level_irq+0xd8/0x13c
> [  571.914062] [] generic_handle_irq+0x48/0x58
> [  571.914062] [] do_IRQ+0x18/0x24
> [  571.914062] [] mach_irq_dispatch+0xf0/0x194
> [  571.914062] [] ret_from_irq+0x0/0x4
> [  571.914062]
> [  571.914062] handlers:
> [  571.914062] [] radeon_driver_irq_handler_kms
> 
> P.S.: use the latest kernel from git, and irq6 is not shared by other
> devices.
> 
> Does fence_wait depend on the GPU's interrupt? If yes, can I say the
> "GPU lockup" is caused by the unexpected disabling of the GPU's irq?

No, if the GPU didn't actually lock up, the fences should still signal
eventually, as radeon_fence_signaled()->radeon_fence_poll_locked() is
called after the wait for the SW interrupt times out. 
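
Schematically, that fallback has the following shape (not the actual
radeon_fence_wait code; read_fence() here is a stand-in for reading the
fence sequence number from the scratch register or writeback page):

#include <linux/delay.h>

/* Illustrative only: even with the fence IRQ dead, each timed-out wait
 * re-reads the fence value, so the wait can still complete unless the GPU
 * really stopped advancing. */
static bool wait_fence_sketch(u32 wanted_seq, u32 (*read_fence)(void),
                              unsigned tries)
{
        while (tries--) {
                /* normally we sleep until the fence IRQ wakes us up; if
                 * the IRQ is lost, the sleep simply times out ... */
                msleep(500);

                /* ... and the timeout path polls the fence value anyway */
                if (read_fence() >= wanted_seq)
                        return true;    /* signalled without the IRQ */
        }
        return false;   /* never signalled -> reported as a GPU lockup */
}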


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer


Re: Re:[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-10-20 Thread Michel Dänzer
On Die, 2011-10-18 at 16:35 +0800, Chen Jie wrote:
> 
> 在 2011年10月17日 下午2:34, 写道:
> If I start X but switch to the console, then do suspend &
> resume, "GPU
> reset" hardly happen. but there is a new problem that the IRQ
> of radeon
> card is disabled. Maybe "GPU reset" has something to do with
> "IRQ
> disabled"?
> 
> I have tried "irqpoll", it doesn't fix this problem.
> 
> [  571.914062] irq 6: nobody cared (try booting with the "irqpoll" option)
> [  571.914062] Call Trace:
> [  571.914062] [] dump_stack+0x8/0x34
> [  571.914062] [] __report_bad_irq.clone.6+0x44/0x15c
> [  571.914062] [] note_interrupt+0x204/0x2a0
> [  571.914062] [] handle_irq_event_percpu+0x19c/0x1f8
> [  571.914062] [] handle_irq_event+0x68/0xa8
> [  571.914062] [] handle_level_irq+0xd8/0x13c
> [  571.914062] [] generic_handle_irq+0x48/0x58
> [  571.914062] [] do_IRQ+0x18/0x24
> [  571.914062] [] mach_irq_dispatch+0xf0/0x194
> [  571.914062] [] ret_from_irq+0x0/0x4
> [  571.914062]
> [  571.914062] handlers:
> [  571.914062] [] radeon_driver_irq_handler_kms
> 
> P.S.: we use the latest kernel from git, and IRQ 6 is not shared by other
> devices.
> 
> Does fence_wait depend on the GPU's interrupt? If yes, can I say the "GPU
> lockup" is caused by the unexpected disabling of the GPU's IRQ?

No, if the GPU didn't actually lock up, the fences should still signal
eventually, as radeon_fence_signaled()->radeon_fence_poll_locked() is
called after the wait for the SW interrupt times out. 
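
To illustrate the fallback described here (a generic sketch, not the actual
radeon code): block on the interrupt-driven wait with a timeout, and on
timeout re-read the fence value directly, so the wait still completes even if
the IRQ never fires. The "GPU" thread and fence counter below are simulated:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t irq_cond = PTHREAD_COND_INITIALIZER; /* the "fence IRQ" */
static volatile uint32_t last_signaled;                    /* the fence value */

/* Simulated GPU: retires fences but never raises the interrupt. */
static void *gpu_thread(void *arg)
{
    (void)arg;
    for (uint32_t seq = 1; seq <= 5; seq++) {
        usleep(100 * 1000);
        last_signaled = seq;   /* fence value advances, but no cond_signal() */
    }
    return NULL;
}

/* Wait for a fence: sleep on the "IRQ" with a timeout, then poll the value. */
static void fence_wait(uint32_t seq)
{
    while (last_signaled < seq) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 1;                    /* 1 s interrupt-wait timeout */
        pthread_mutex_lock(&lock);
        pthread_cond_timedwait(&irq_cond, &lock, &deadline);
        pthread_mutex_unlock(&lock);
        /* timed out or woke up: the loop re-reads last_signaled, i.e. polls */
    }
}

int main(void)
{
    pthread_t gpu;
    pthread_create(&gpu, NULL, gpu_thread, NULL);
    fence_wait(3);
    printf("fence 3 signaled (last=%u) although the IRQ never fired\n",
           (unsigned)last_signaled);
    pthread_join(gpu, NULL);
    return 0;
}

(Build with: cc -pthread sketch.c)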


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Re:[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-10-19 Thread Chen Jie
Hi,

On 2011-10-17 at 14:34,  wrote:

> If I start X but switch to the console, then do suspend & resume, "GPU
> reset" hardly happens, but there is a new problem: the IRQ of the radeon
> card is disabled. Maybe "GPU reset" has something to do with "IRQ
> disabled"?
>
> I have tried "irqpoll", it doesn't fix this problem.
>
> [  571.914062] irq 6: nobody cared (try booting with the "irqpoll" option)
> [  571.914062] Call Trace:
> [  571.914062] [] dump_stack+0x8/0x34
> [  571.914062] [] __report_bad_irq.clone.6+0x44/0x15c
> [  571.914062] [] note_interrupt+0x204/0x2a0
> [  571.914062] [] handle_irq_event_percpu+0x19c/0x1f8
> [  571.914062] [] handle_irq_event+0x68/0xa8
> [  571.914062] [] handle_level_irq+0xd8/0x13c
> [  571.914062] [] generic_handle_irq+0x48/0x58
> [  571.914062] [] do_IRQ+0x18/0x24
> [  571.914062] [] mach_irq_dispatch+0xf0/0x194
> [  571.914062] [] ret_from_irq+0x0/0x4
> [  571.914062]
> [  571.914062] handlers:
> [  571.914062] [] radeon_driver_irq_handler_kms
>
> P.S.: we use the latest kernel from git, and IRQ 6 is not shared by other
> devices.
>
> Does fence_wait depend on the GPU's interrupt? If yes, can I say the "GPU
> lockup" is caused by the unexpected disabling of the GPU's IRQ?


> > Hi Alex, Michel
> >
> > 2011/10/5 Alex Deucher 
> >
> >> 2011/10/5 Michel Dänzer :
> >> > On Don, 2011-09-29 at 17:17 +0800, Chen Jie wrote:
> >> >>
> >> >> We got occasionally "GPU lockup" after resuming from suspend(on
> >> mipsel
> >> >> platform with a mips64 compatible CPU and rs780e, the kernel is
> >> >> 3.1.0-rc8 64bit).  Related kernel message:
> >> >
> >> > [...]
> >> >
> >> >> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
> >> >> 10019msec
> >> >> [  177.089843] [ cut here ]
> >> >> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
> >> >> radeon_fence_wait+0x25c/0x33c()
> >> >> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
> >> >> 0x13AD)
> >> >> [  177.113281] Modules linked in: psmouse serio_raw
> >> >> [  177.117187] Call Trace:
> >> >> [  177.121093] [] dump_stack+0x8/0x34
> >> >> [  177.125000] [] warn_slowpath_common+0x78/0xa0
> >> >> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
> >> >> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
> >> >> [  177.144531] [] ttm_bo_wait+0x108/0x220
> >> >> [  177.148437] [] radeon_gem_wait_idle_ioctl
> >> >> +0x80/0x114
> >> >> [  177.156250] [] drm_ioctl+0x2e4/0x3fc
> >> >> [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
> >> >> [  177.167968] [] compat_sys_ioctl+0x120/0x35c
> >> >> [  177.171875] [] handle_sys+0x118/0x138
> >> >> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
> >> >> [  177.187500] radeon :01:05.0: GPU softreset
> >> >> [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
> >> >> [  177.195312] radeon :01:05.0:
> >> R_008014_GRBM_STATUS2=0x0003
> >> >> [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
> >> >> [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
> >> >
> >> > [...]
> >> >
> >> >> What may cause a "GPU lockup"?
> >> >
> >> > Lots of things... The most common cause is an incorrect command stream
> >> > sent to the GPU by userspace or the kernel.
> >> >
> >> >> Why reset didn't work?
> >> >
> >> > Might be related to 'Wait for MC idle timedout !', but I don't know
> >> > offhand what could be up with that.
> >> >
> >> >
> >> >> BTW,  one question:
> >> >> I got 'RADEON_IS_PCI | RADEON_IS_IGP' in rdev->flags, which causes
> >> >> need_dma32 was set.
> >> >> Is it correct? (drivers/char/agp is not available on mips, could that
> >> >> be the reason?)
> >> >
> >> > Not sure, Alex?
> >>
> >> You don't AGP for newer IGP cards (rs4xx+).  It gets set by default if
> >> the card is not AGP or PCIE.  That should be changed as only the
> >> legacy r1xx PCI GART block has that limitation.  I'll send a patch out
> >> shortly.
> >>
> >> Got it, thanks for the reply.
> >
>
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Re:[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-10-05 Thread Alex Deucher
2011/10/5 Michel Dänzer :
> On Don, 2011-09-29 at 17:17 +0800, Chen Jie wrote:
>>
>> We got occasionally "GPU lockup" after resuming from suspend(on mipsel
>> platform with a mips64 compatible CPU and rs780e, the kernel is
>> 3.1.0-rc8 64bit).  Related kernel message:
>
> [...]
>
>> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
>> 10019msec
>> [  177.089843] [ cut here ]
>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>> radeon_fence_wait+0x25c/0x33c()
>> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
>> 0x13AD)
>> [  177.113281] Modules linked in: psmouse serio_raw
>> [  177.117187] Call Trace:
>> [  177.121093] [] dump_stack+0x8/0x34
>> [  177.125000] [] warn_slowpath_common+0x78/0xa0
>> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
>> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
>> [  177.144531] [] ttm_bo_wait+0x108/0x220
>> [  177.148437] [] radeon_gem_wait_idle_ioctl
>> +0x80/0x114
>> [  177.156250] [] drm_ioctl+0x2e4/0x3fc
>> [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
>> [  177.167968] [] compat_sys_ioctl+0x120/0x35c
>> [  177.171875] [] handle_sys+0x118/0x138
>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>> [  177.187500] radeon :01:05.0: GPU softreset
>> [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>> [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
>> [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>> [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
>
> [...]
>
>> What may cause a "GPU lockup"?
>
> Lots of things... The most common cause is an incorrect command stream
> sent to the GPU by userspace or the kernel.
>
>> Why reset didn't work?
>
> Might be related to 'Wait for MC idle timedout !', but I don't know
> offhand what could be up with that.
>
>
>> BTW,  one question:
>> I got 'RADEON_IS_PCI | RADEON_IS_IGP' in rdev->flags, which causes
>> need_dma32 was set.
>> Is it correct? (drivers/char/agp is not available on mips, could that
>> be the reason?)
>
> Not sure, Alex?

You don't need AGP for newer IGP cards (rs4xx+).  need_dma32 gets set by
default if the card is not AGP or PCIE.  That should be changed, as only the
legacy r1xx PCI GART block has that limitation.  I'll send a patch out
shortly.

Alex
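
To make the current vs. intended behaviour concrete, a tiny sketch; the flag
names and helpers below are illustrative only, not the driver's actual
definitions:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative flag bits, mirroring the names used in this thread. */
enum {
    FLAG_IS_AGP  = 1 << 0,
    FLAG_IS_PCIE = 1 << 1,
    FLAG_IS_PCI  = 1 << 2,
    FLAG_IS_IGP  = 1 << 3,
};

/* Current behaviour as described: anything that is neither AGP nor PCIE
 * (which includes PCI IGPs such as the RS780E here) gets the 32-bit mask. */
static bool need_dma32_today(unsigned flags)
{
    return !(flags & (FLAG_IS_AGP | FLAG_IS_PCIE));
}

/* Intended behaviour: only the legacy r1xx PCI GART block is limited to
 * 32-bit addresses, so only that case should force the 32-bit mask. */
static bool need_dma32_after_fix(bool uses_legacy_r1xx_pci_gart)
{
    return uses_legacy_r1xx_pci_gart;
}

int main(void)
{
    unsigned rs780e_flags = FLAG_IS_PCI | FLAG_IS_IGP; /* as reported above */
    printf("today: %d, after fix: %d\n",
           need_dma32_today(rs780e_flags),
           need_dma32_after_fix(false));
    return 0;
}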
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Re:[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-10-05 Thread Michel Dänzer
On Don, 2011-09-29 at 17:17 +0800, Chen Jie wrote:
> 
> We got occasionally "GPU lockup" after resuming from suspend(on mipsel
> platform with a mips64 compatible CPU and rs780e, the kernel is
> 3.1.0-rc8 64bit).  Related kernel message:

[...]

> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
> 10019msec
> [  177.089843] [ cut here ]
> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
> radeon_fence_wait+0x25c/0x33c()
> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
> 0x13AD)
> [  177.113281] Modules linked in: psmouse serio_raw
> [  177.117187] Call Trace:
> [  177.121093] [] dump_stack+0x8/0x34
> [  177.125000] [] warn_slowpath_common+0x78/0xa0
> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
> [  177.144531] [] ttm_bo_wait+0x108/0x220
> [  177.148437] [] radeon_gem_wait_idle_ioctl
> +0x80/0x114
> [  177.156250] [] drm_ioctl+0x2e4/0x3fc
> [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
> [  177.167968] [] compat_sys_ioctl+0x120/0x35c
> [  177.171875] [] handle_sys+0x118/0x138
> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
> [  177.187500] radeon :01:05.0: GPU softreset
> [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
> [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
> [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
> [  177.363281] radeon :01:05.0: Wait for MC idle timedout !

[...]

> What may cause a "GPU lockup"?

Lots of things... The most common cause is an incorrect command stream
sent to the GPU by userspace or the kernel. 

> Why reset didn't work?

Might be related to 'Wait for MC idle timedout !', but I don't know
offhand what could be up with that. 


> BTW,  one question:
> I got 'RADEON_IS_PCI | RADEON_IS_IGP' in rdev->flags, which causes
> need_dma32 was set.
> Is it correct? (drivers/char/agp is not available on mips, could that
> be the reason?)

Not sure, Alex?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re:[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-10-01 Thread Chen Jie
Hi,

Add more information.

We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
platform with a mips64-compatible CPU and an RS780E; the kernel is 3.1.0-rc8,
64-bit).  Related kernel message:
/* return from STR */
[  156.152343] radeon :01:05.0: WB enabled
[  156.187500] [drm] ring test succeeded in 0 usecs
[  156.187500] [drm] ib test succeeded in 0 usecs
[  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
[  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
[  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
[  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  156.597656] ata1.00: configured for UDMA/133
[  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
[  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
[  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
[  157.683593] r8169 :02:00.0: eth0: link up
[  165.621093] PM: resume of devices complete after 9679.556 msecs
[  165.628906] Restarting tasks ... done.
[  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
10019msec
[  177.089843] [ cut here ]
[  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
radeon_fence_wait+0x25c/0x33c()
[  177.105468] GPU lockup (waiting for 0x13C3 last fence id 0x13AD)
[  177.113281] Modules linked in: psmouse serio_raw
[  177.117187] Call Trace:
[  177.121093] [] dump_stack+0x8/0x34
[  177.125000] [] warn_slowpath_common+0x78/0xa0
[  177.132812] [] warn_slowpath_fmt+0x38/0x44
[  177.136718] [] radeon_fence_wait+0x25c/0x33c
[  177.144531] [] ttm_bo_wait+0x108/0x220
[  177.148437] [] radeon_gem_wait_idle_ioctl+0x80/0x114
[  177.156250] [] drm_ioctl+0x2e4/0x3fc
[  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
[  177.167968] [] compat_sys_ioctl+0x120/0x35c
[  177.171875] [] handle_sys+0x118/0x138
[  177.179687] ---[ end trace 92f63d998efe4c6d ]---
[  177.187500] radeon :01:05.0: GPU softreset
[  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
[  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
[  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
[  177.363281] radeon :01:05.0: Wait for MC idle timedout !
[  177.367187] radeon :01:05.0:   R_008020_GRBM_SOFT_RESET=0x7FEE
[  177.390625] radeon :01:05.0: R_008020_GRBM_SOFT_RESET=0x0001
[  177.414062] radeon :01:05.0:   R_008010_GRBM_STATUS=0xA0003030
[  177.417968] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
[  177.425781] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
[  177.433593] radeon :01:05.0: GPU reset succeed
[  177.605468] radeon :01:05.0: Wait for MC idle timedout !
[  177.761718] radeon :01:05.0: Wait for MC idle timedout !
[  177.804687] radeon :01:05.0: WB enabled
[  178.00] [drm:r600_ring_test] *ERROR* radeon: ring test failed
(scratch(0x8504)=0xCAFEDEAD)
[  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
[  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
IB(5).
[  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
[  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
IB(6).
...

What may cause a "GPU lockup"? Why didn't the reset work? Any ideas?
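
The failed ring test above ("scratch(0x8504)=0xCAFEDEAD") presumably follows
the usual scratch-register pattern: the CPU seeds a scratch register with
0xCAFEDEAD, asks the CP via the ring to overwrite it, and then polls; if the
register still holds the seed after the timeout, the ring was never executed.
A self-contained sketch of that pattern, with the "CP" simulated as hung (the
register, value, and loop bound below are illustrative):

#include <stdint.h>
#include <stdio.h>

#define TEST_PATTERN 0xCAFEDEADu   /* seed written by the CPU            */
#define RING_VALUE   0xDEADBEEFu   /* value the ring write should land   */

static volatile uint32_t scratch_reg;  /* stand-in for the scratch register */

/* In the driver this would queue a CP packet that writes 'val' to the
 * scratch register; here the "CP" is hung, so nothing ever happens. */
static void ring_emit_scratch_write(uint32_t val)
{
    (void)val;
}

static int ring_test(void)
{
    scratch_reg = TEST_PATTERN;           /* CPU seeds the register directly */
    ring_emit_scratch_write(RING_VALUE);  /* ask the GPU to overwrite it     */

    for (long i = 0; i < 1000000; i++) {  /* poll with a bounded timeout     */
        if (scratch_reg == RING_VALUE)
            return 0;                     /* GPU executed the ring: pass     */
    }
    printf("ring test failed (scratch=0x%08X)\n", (unsigned)scratch_reg);
    return -1;                            /* still the seed: CP never ran    */
}

int main(void)
{
    return ring_test() ? 1 : 0;
}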

BTW, one question:
I got 'RADEON_IS_PCI | RADEON_IS_IGP' in rdev->flags, which causes
need_dma32 to be set.
Is this correct? (drivers/char/agp is not available on mips; could that be the
reason?)


On 2011-09-28 at 15:23,  wrote:

> Hi Alex,
>
> When we do STR (S3) with an RS780E radeon card on a MIPS platform, a "GPU
> reset" may happen after resume (the probability is about 5%). After that,
> X is unusable.
>
> We know there is a "ring test" at system resume time and at GPU reset time.
> Whether or not the GPU reset happens, the "ring test" at system resume time
> always succeeds, but the "ring test" at GPU reset time usually fails.
>
> We use the latest kernel (3.1.0-RC8 from git) and X.org is 7.6.
>
> Any ideas?
>
> Best regards,
> Huacai Chen
>
>

Regards,
- Chen Jie
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

