Hi Christian,
        I noticed that you said 'amdgpu is known to not work on arm64 until 
very recently'. The CPU-related commit I found in drm is "drm: disable 
uncached DMA optimization for ARM and arm64":

@@ -47,6 +47,24 @@ static inline bool drm_arch_can_wc_memory(void)
        return false;
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
        return false;
+#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+       /*
+        * The DRM driver stack is designed to work with cache coherent devices
+        * only, but permits an optimization to be enabled in some cases, where
+        * for some buffers, both the CPU and the GPU use uncached mappings,
+        * removing the need for DMA snooping and allocation in the CPU caches.
+        *
+        * The use of uncached GPU mappings relies on the correct implementation
+        * of the PCIe NoSnoop TLP attribute by the platform, otherwise the GPU
+        * will use cached mappings nonetheless. On x86 platforms, this does not
+        * seem to matter, as uncached CPU mappings will snoop the caches in any
+        * case. However, on ARM and arm64, enabling this optimization on a
+        * platform where NoSnoop is ignored results in loss of coherency, which
+        * breaks correct operation of the device. Since we have no way of
+        * detecting whether NoSnoop works or not, just disable this
+        * optimization entirely for ARM and arm64.
+        */
+       return false;
 #else
        return true;
 #endif


The real effect of that commit is in amdgpu_object.c:



   if (!drm_arch_can_wc_memory())
                bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;



We already have AMDGPU_GEM_CREATE_CPU_GTT_USWC turned off in our 4.19.36 kernel, 
so I think this is not the cause of my bug. Is there anything I have missed?


I suggested that the machine supplier move to a newer kernel such as 5.2.2, but 
they failed to do so after some tries. We also backported a series of patches 
from newer kernels, but we still get the ring timeout.


We have been digging into the amdgpu drm driver for a long time, but it is really 
difficult for me, especially the hardware-related ring timeout.


------------------
Yanhua



------------------ Original Message ------------------
From: "Koenig, Christian"<christian.koe...@amd.com>;
Date: Tuesday, September 3, 2019, 9:19 PM
To: "yanhua"<78666...@qq.com>;"amd-gfx"<amd-gfx@lists.freedesktop.org>;
Cc: "Deucher, Alexander"<alexander.deuc...@amd.com>;
Subject: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep 
state



 This is just a GPU lockup. Please open up a bug report on freedesktop.org and 
attach the full dmesg and the version of Mesa you are using.
 
 Regards,
 Christian.
 
 Am 03.09.19 um 15:16 schrieb 78666679:
 
  Yes, with dmesg | grep drm, I get the following.
 
 
 [348571.880718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, 
signaled seq=24423862, emitted seq=24423865
  
 
 
 
  ------------------ Original Message ------------------
  From: "Koenig, Christian"<christian.koe...@amd.com>;
 Date: Tuesday, September 3, 2019, 9:07 PM
 To: ""<78666...@qq.com>;"amd-gfx"<amd-gfx@lists.freedesktop.org>;
 Cc: "Deucher, Alexander"<alexander.deuc...@amd.com>;
 Subject: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
 
 
 
 Well that looks like the hardware got stuck.
 
 Do you get something in the logs about a timeout on the SDMA ring?
 
 Regards,
 Christian.
 
 Am 03.09.19 um 14:50 schrieb 78666679:
 
  Hi Christian,
        Sometimes a thread gets blocked in disk sleep in a call to 
amdgpu_sa_bo_new; the stack trace follows. It seems the SA BO pool is used up, 
so the caller blocks waiting for someone to free SA resources.
 
 
 
 D 206833 227656 [surfaceflinger] <defunct> Binder:45_5
 cat /proc/206833/task/227656/stack
 
 
 [<0>] __switch_to+0x94/0xe8
 [<0>] dma_fence_wait_any_timeout+0x234/0x2d0
 [<0>] amdgpu_sa_bo_new+0x468/0x540 [amdgpu]
 [<0>] amdgpu_ib_get+0x60/0xc8 [amdgpu]
 [<0>] amdgpu_job_alloc_with_ib+0x70/0xb0 [amdgpu]
 [<0>] amdgpu_vm_bo_update_mapping+0x2e0/0x3d8 [amdgpu]
 [<0>] amdgpu_vm_bo_update+0x2a0/0x710 [amdgpu]
 [<0>] amdgpu_gem_va_ioctl+0x46c/0x4c8 [amdgpu]
 [<0>] drm_ioctl_kernel+0x94/0x118 [drm]
 [<0>] drm_ioctl+0x1f0/0x438 [drm]
 [<0>] amdgpu_drm_ioctl+0x58/0x90 [amdgpu]
 [<0>] do_vfs_ioctl+0xc4/0x8c0
 [<0>] ksys_ioctl+0x8c/0xa0
 [<0>] __arm64_sys_ioctl+0x28/0x38
 [<0>] el0_svc_common+0xa0/0x180
 [<0>] el0_svc_handler+0x38/0x78
 [<0>] el0_svc+0x8/0xc
 [<0>] 0xffffffffffffffff
 
 
 
 
  --------------------
 YanHua
 
 
 
  ------------------ Original Message ------------------
  From: "Koenig, Christian"<christian.koe...@amd.com>;
 Date: Tuesday, September 3, 2019, 4:21 PM
 To: ""<78666...@qq.com>;"amd-gfx"<amd-gfx@lists.freedesktop.org>;
 Cc: "Deucher, Alexander"<alexander.deuc...@amd.com>;
 Subject: Re: Bug: amdgpu drm driver cause process into Disk sleep state
 
 
 
 Hi Yanhua,
 
 please update your kernel first, cause that looks like a known issue 
 which was recently fixed by patch "drm/scheduler: use job count instead 
 of peek".
 
 Probably best to try the latest bleeding edge kernel and if that doesn't 
 help please open up a bug report on  https://bugs.freedesktop.org/.
 
 Regards,
 Christian.
 
 Am 03.09.19 um 09:35 schrieb 78666679:
 > Hi Sirs,
 >         I have a WX5100 amdgpu card, and it randomly runs into failures.
 > Sometimes it puts processes into an uninterruptible wait state.
 >
 >
 > cps-new-ondemand-0587:~ # ps aux|grep -w D
 > root      11268  0.0  0.0 260628  3516 ?        Ssl  Aug26   0:00 
 > /usr/sbin/gssproxy -D
 > root     136482  0.0  0.0 212500   572 pts/0    S+   15:25   0:00 grep 
 > --color=auto -w D
 > root     370684  0.0  0.0  17972  7428 ?        Ss   Sep02   0:04 
 > /usr/sbin/sshd -D
 > 10066    432951  0.0  0.0      0     0 ?        D    Sep02   0:00 
 > [FakeFinalizerDa]
 > root     496774  0.0  0.0      0     0 ?        D    Sep02   0:17 
 > [kworker/8:1+eve]
 > cps-new-ondemand-0587:~ # cat /proc/496774/stack
 > [<0>] __switch_to+0x94/0xe8
 > [<0>] drm_sched_entity_flush+0xf8/0x248 [gpu_sched]
 > [<0>] amdgpu_ctx_mgr_entity_flush+0xac/0x148 [amdgpu]
 > [<0>] amdgpu_flush+0x2c/0x50 [amdgpu]
 > [<0>] filp_close+0x40/0xa0
 > [<0>] put_files_struct+0x118/0x120
 > [<0>] put_files_struct+0x30/0x68 [binder_linux]
 > [<0>] binder_deferred_func+0x4d4/0x658 [binder_linux]
 > [<0>] process_one_work+0x1b4/0x3f8
 > [<0>] worker_thread+0x54/0x470
 > [<0>] kthread+0x134/0x138
 > [<0>] ret_from_fork+0x10/0x18
 > [<0>] 0xffffffffffffffff
 >
 >
 >
 > This issue has troubled me for a long time. I am eagerly looking forward to your help!
 >
 >
 > -----
 > Yanhua
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
