Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
I will be away for a few days. Thank you very much for your detailed answer.

Best Regards.
Yanhua

From: "Koenig, Christian"
Sent: 2019-09-06 7:23
To: "yanhua"<78666...@qq.com>; "amd-gfx"
Cc: "Deucher, Alexander"
Subject: Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state

> Is there anything I have missed?

Yeah, unfortunately quite a bunch of things. The fact that arm64 doesn't support the PCIe NoSnoop TLP attribute is only the tip of the iceberg.

You need a full "recent" driver stack, e.g. not older than a few months to a year, for this to work. And not only the kernel, but also recent userspace components.

Maybe that's something you could try first, e.g. install a recent version of Mesa and/or tell Mesa to not use the SDMA at all. But since you are running into an SDMA lockup with a kernel-triggered page table update, I see little chance that this will work.

The only other alternative I can see is the DKMS package of the pro-driver. With that one you might be able to compile the recent driver for an older kernel version. But I can't guarantee at all that this actually works on ARM64.

Sorry that I don't have better news for you,
Christian.

On 05.09.19 at 03:36, yanhua wrote:

Hi Christian,

I noticed that you said 'amdgpu is known to not work on arm64 until very recently'. The CPU-related DRM commit I found is "drm: disable uncached DMA optimization for ARM and arm64".

@@ -47,6 +47,24 @@ static inline bool drm_arch_can_wc_memory(void)
 	return false;
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
+#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+	/*
+	 * The DRM driver stack is designed to work with cache coherent devices
+	 * only, but permits an optimization to be enabled in some cases, where
+	 * for some buffers, both the CPU and the GPU use uncached mappings,
+	 * removing the need for DMA snooping and allocation in the CPU caches.
+	 *
+	 * The use of uncached GPU mappings relies on the correct implementation
+	 * of the PCIe NoSnoop TLP attribute by the platform, otherwise the GPU
+	 * will use cached mappings nonetheless. On x86 platforms, this does not
+	 * seem to matter, as uncached CPU mappings will snoop the caches in any
+	 * case. However, on ARM and arm64, enabling this optimization on a
+	 * platform where NoSnoop is ignored results in loss of coherency, which
+	 * breaks correct operation of the device. Since we have no way of
+	 * detecting whether NoSnoop works or not, just disable this
+	 * optimization entirely for ARM and arm64.
+	 */
+	return false;
 #else
 	return true;
 #endif

Its real effect is in amdgpu_object.c:

	if (!drm_arch_can_wc_memory())
		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;

We already have AMDGPU_GEM_CREATE_CPU_GTT_USWC turned off in our 4.19.36 kernel, so I think this is not the cause of my bug. Is there anything I have missed?

I had suggested that the machine supplier use a newer kernel such as 5.2.2, but they failed to do so after several attempts. We also backported a series of patches from newer kernels, but we still get the ring timeout.

We have been digging into the amdgpu DRM driver for a long time, but it is really difficult for me, especially the hardware-related ring timeout.

--
Yanhua

From: "Koenig, Christian"
Sent: 2019-09-03 9:19
To: "yanhua"<78666...@qq.com>; "amd-gfx"
Cc: "Deucher, Alexander"
Subject: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state

This is just a GPU lockup. Please open a bug report on freedesktop.org and attach the full dmesg and the version of Mesa you are using.

Regards,
Christian.

On 03.09.19 at 15:16, 7879 wrote:

Yes, with dmesg|grep drm I get the following:

[348571.880718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865

From: "Koenig, Christian"
Sent: 2019-09-03 9:07
To: ""<78666...@qq.com>; "amd-gfx"
Cc: "Deucher, Alexander"
Subject: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state

Well, that looks like the hardware got stuck. Do you get something in the logs about a timeout on the SDMA ring?

Regards,
Christian.

On 03.09.19 at 14:50, 7879 wrote:

Hi Christian,

Sometimes the thread blocked disk sleeping in cal
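The ring timeout line quoted in this thread reports two sequence numbers, signaled seq=24423862 and emitted seq=24423865. Read as fence counters, the gap between them (3 here) would be the number of jobs submitted to the sdma1 ring that had not completed when the timeout fired. The following is a minimal user-space sketch for pulling those numbers out of a dmesg line. The struct and function names (`ring_timeout`, `parse_ring_timeout`, `pending_jobs`) are hypothetical helpers made up for this illustration and are not part of the kernel or Mesa; the sketch also ignores counter wraparound.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one ring timeout report. */
struct ring_timeout {
    char ring[32];               /* ring name, e.g. "sdma1" */
    unsigned long long signaled; /* last sequence number reported as signaled */
    unsigned long long emitted;  /* last sequence number reported as emitted */
};

/*
 * Parse a dmesg line of the form seen in this thread:
 *   "... ring sdma1 timeout, signaled seq=N, emitted seq=M"
 * Returns 1 on success and 0 if the line does not match.
 */
static int parse_ring_timeout(const char *line, struct ring_timeout *out)
{
    const char *p;

    assert(line != NULL && out != NULL);

    /* Skip the timestamp and function prefix, start at "ring ". */
    p = strstr(line, "ring ");
    if (p == NULL)
        return 0;

    if (sscanf(p, "ring %31s timeout, signaled seq=%llu, emitted seq=%llu",
               out->ring, &out->signaled, &out->emitted) != 3)
        return 0;

    return 1;
}

/* Jobs emitted but not yet signaled (assumes no counter wraparound). */
static unsigned long long pending_jobs(const struct ring_timeout *t)
{
    return t->emitted - t->signaled;
}
```

Feeding the quoted line to `parse_ring_timeout` and then calling `pending_jobs` gives the ring name and the outstanding-job count, which is handy when comparing several timeouts across a long dmesg capture.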
Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
Hi Christian,

I noticed that you said 'amdgpu is known to not work on arm64 until very recently'. The CPU-related DRM commit I found is "drm: disable uncached DMA optimization for ARM and arm64".

@@ -47,6 +47,24 @@ static inline bool drm_arch_can_wc_memory(void)
 	return false;
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
+#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+	/*
+	 * The DRM driver stack is designed to work with cache coherent devices
+	 * only, but permits an optimization to be enabled in some cases, where
+	 * for some buffers, both the CPU and the GPU use uncached mappings,
+	 * removing the need for DMA snooping and allocation in the CPU caches.
+	 *
+	 * The use of uncached GPU mappings relies on the correct implementation
+	 * of the PCIe NoSnoop TLP attribute by the platform, otherwise the GPU
+	 * will use cached mappings nonetheless. On x86 platforms, this does not
+	 * seem to matter, as uncached CPU mappings will snoop the caches in any
+	 * case. However, on ARM and arm64, enabling this optimization on a
+	 * platform where NoSnoop is ignored results in loss of coherency, which
+	 * breaks correct operation of the device. Since we have no way of
+	 * detecting whether NoSnoop works or not, just disable this
+	 * optimization entirely for ARM and arm64.
+	 */
+	return false;
 #else
 	return true;
 #endif

Its real effect is in amdgpu_object.c:

	if (!drm_arch_can_wc_memory())
		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;

We already have AMDGPU_GEM_CREATE_CPU_GTT_USWC turned off in our 4.19.36 kernel, so I think this is not the cause of my bug. Is there anything I have missed?

I had suggested that the machine supplier use a newer kernel such as 5.2.2, but they failed to do so after several attempts. We also backported a series of patches from newer kernels, but we still get the ring timeout.

We have been digging into the amdgpu DRM driver for a long time, but it is really difficult for me, especially the hardware-related ring timeout.
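To make the effect of that check concrete, here is a minimal user-space sketch of the masking step, not the kernel code itself. The flag names and bit positions (`BO_FLAG_CPU_ACCESS_REQUIRED`, `BO_FLAG_CPU_GTT_USWC`) and the helpers `sanitize_bo_flags` and `uswc_example` are stand-ins invented for illustration; only the `flags &= ~FLAG` expression mirrors the amdgpu_object.c snippet quoted in the message. The result of drm_arch_can_wc_memory() is passed in as a plain boolean so the sketch can model both an x86-like and an arm64-like platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits; hypothetical stand-ins for the AMDGPU_GEM_CREATE_* flags. */
#define BO_FLAG_CPU_ACCESS_REQUIRED (1u << 0)
#define BO_FLAG_CPU_GTT_USWC        (1u << 2)

/*
 * Clear the write-combined request when the architecture cannot honour it.
 * arch_can_wc stands in for the return value of drm_arch_can_wc_memory().
 */
static uint64_t sanitize_bo_flags(uint64_t flags, bool arch_can_wc)
{
    if (!arch_can_wc)
        flags &= ~(uint64_t)BO_FLAG_CPU_GTT_USWC;
    return flags;
}

/* Small self-check showing both platform cases. */
static void uswc_example(void)
{
    uint64_t requested = BO_FLAG_CPU_ACCESS_REQUIRED | BO_FLAG_CPU_GTT_USWC;

    /* arm64-like case: the helper reports false, so USWC is stripped. */
    assert((sanitize_bo_flags(requested, false) & BO_FLAG_CPU_GTT_USWC) == 0);

    /* x86-like case: the helper reports true, so the request is kept. */
    assert(sanitize_bo_flags(requested, true) == requested);
}
```

With the patch applied, every buffer allocation on ARM and arm64 takes the first path, so userspace may still request the flag but never gets an uncached write-combined mapping. That matches the observation in the message that the flag is already effectively off on the 4.19.36 kernel.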
--
Yanhua

From: "Koenig, Christian"
Sent: 2019-09-03 9:19
To: "yanhua"<78666...@qq.com>; "amd-gfx"
Cc: "Deucher, Alexander"
Subject: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state

This is just a GPU lockup. Please open a bug report on freedesktop.org and attach the full dmesg and the version of Mesa you are using.

Regards,
Christian.

On 03.09.19 at 15:16, 7879 wrote:

Yes, with dmesg|grep drm I get the following:

[348571.880718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865

From: "Koenig, Christian"
Sent: 2019-09-03 9:07
To: ""<78666...@qq.com>; "amd-gfx"
Cc: "Deucher, Alexander"
Subject: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state

Well, that looks like the hardware got stuck. Do you get something in the logs about a timeout on the SDMA ring?

Regards,
Christian.

On 03.09.19 at 14:50, 7879 wrote:

Hi Christian,

Sometimes the thread blocks in disk sleep in a call to amdgpu_sa_bo_new. The stack trace follows. It seems the SA BOs are used up, so the caller blocks waiting for someone to free SA resources.

D 206833 227656 [surfaceflinger] Binder:45_5

cat /proc/206833/task/227656/stack
[<0>] __switch_to+0x94/0xe8
[<0>] dma_fence_wait_any_timeout+0x234/0x2d0
[<0>] amdgpu_sa_bo_new+0x468/0x540 [amdgpu]
[<0>] amdgpu_ib_get+0x60/0xc8 [amdgpu]
[<0>] amdgpu_job_alloc_with_ib+0x70/0xb0 [amdgpu]
[<0>] amdgpu_vm_bo_update_mapping+0x2e0/0x3d8 [amdgpu]
[<0>] amdgpu_vm_bo_update+0x2a0/0x710 [amdgpu]
[<0>] amdgpu_gem_va_ioctl+0x46c/0x4c8 [amdgpu]
[<0>] drm_ioctl_kernel+0x94/0x118 [drm]
[<0>] drm_ioctl+0x1f0/0x438 [drm]
[<0>] amdgpu_drm_ioctl+0x58/0x90 [amdgpu]
[<0>] do_vfs_ioctl+0xc4/0x8c0
[<0>] ksys_ioctl+0x8c/0xa0
[<0>] __arm64_sys_ioctl+0x28/0x38
[<0>] el0_svc_common+0xa0/0x180
[<0>] el0_svc_handler+0x38/0x78
[<0>] el0_svc+0x8/0xc
[<0>] 0x

YanHua

From: "Koenig, Christian"
Sent: 2019-09-03 4:21
To: ""<78666...@qq.com>; "amd-gfx"
Cc: "Deucher, Alexander"
Subject: Re: Bug: amdgpu drm driver cause process into Disk sleep state

Hi Yanhua,

please update your kernel first, cause t