Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state

2019-09-11 Thread yanhua
I will be away for a few days. Thank you very much for your detailed answer.


Best Regards.
Yanhua





------------------ Original Message ------------------
From: "Koenig, Christian";
Sent: 2019-09-06 (Fri) 7:23
To: "yanhua"<78666...@qq.com>;"amd-gfx";
Cc: "Deucher, Alexander";
Subject: Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state



  Is there anything I have missed?
 Yeah, unfortunately quite a bunch of things. The fact that arm64 doesn't 
support the PCIe NoSnoop TLP attribute is only the tip of the iceberg.
 
 You need a full "recent" driver stack, i.e. not older than a few months to a 
year, for this to work. And not only the kernel, but also recent userspace 
components.
 
 Maybe that's something you could try first, e.g. install a recent version of 
Mesa and/or tell Mesa not to use the SDMA at all. But since you are running 
into an SDMA lockup with a kernel-triggered page table update, I see little 
chance that this will work.
 
 The only other alternative I can see is the DKMS package of the pro-driver. 
With that one you might be able to compile the recent driver for an older 
kernel version.
 
 But I can't guarantee at all that this actually works on ARM64.
 
 Sorry that I don't have better news for you,
 Christian.
 

Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state

2019-09-05 Thread yanhua
Hi Christian,
I noticed that you said 'amdgpu is known to not work on arm64 until 
very recently'. The CPU-related DRM commit I found is "drm: disable 
uncached DMA optimization for ARM and arm64".

@@ -47,6 +47,24 @@ static inline bool drm_arch_can_wc_memory(void)
 	return false;
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
+#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+	/*
+	 * The DRM driver stack is designed to work with cache coherent devices
+	 * only, but permits an optimization to be enabled in some cases, where
+	 * for some buffers, both the CPU and the GPU use uncached mappings,
+	 * removing the need for DMA snooping and allocation in the CPU caches.
+	 *
+	 * The use of uncached GPU mappings relies on the correct implementation
+	 * of the PCIe NoSnoop TLP attribute by the platform, otherwise the GPU
+	 * will use cached mappings nonetheless. On x86 platforms, this does not
+	 * seem to matter, as uncached CPU mappings will snoop the caches in any
+	 * case. However, on ARM and arm64, enabling this optimization on a
+	 * platform where NoSnoop is ignored results in loss of coherency, which
+	 * breaks correct operation of the device. Since we have no way of
+	 * detecting whether NoSnoop works or not, just disable this
+	 * optimization entirely for ARM and arm64.
+	 */
+	return false;
 #else
 	return true;
 #endif


The real effect of this is in amdgpu_object.c:

	if (!drm_arch_can_wc_memory())
		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;



And we have AMDGPU_GEM_CREATE_CPU_GTT_USWC turned off in our 4.19.36 kernel, so 
I think this is not the cause of my bug. Is there anything I have missed?
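
For illustration, the effect of that check can be reproduced outside the kernel. The following is a minimal sketch, not the driver code: `struct fake_bo` and `sanitize_bo_flags` are hypothetical names, and the flag value `(1u << 2)` is assumed to match the definition in the uapi header amdgpu_drm.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: same bit as the uapi definition in amdgpu_drm.h. */
#define AMDGPU_GEM_CREATE_CPU_GTT_USWC (1u << 2)

/* Hypothetical stand-in for struct amdgpu_bo; only the flags matter here. */
struct fake_bo {
	uint64_t flags;
};

/*
 * Hypothetical helper mirroring the quoted check: when the architecture
 * cannot use write-combined memory, the USWC request is silently dropped
 * and every other flag is left untouched.
 */
static void sanitize_bo_flags(struct fake_bo *bo, bool arch_can_wc)
{
	assert(bo != NULL);
	if (!arch_can_wc)
		/* Widen before inverting so the upper 32 bits stay set. */
		bo->flags &= ~(uint64_t)AMDGPU_GEM_CREATE_CPU_GTT_USWC;
}
```

With `arch_can_wc` false, as on ARM and arm64 after the commit above, a buffer that asked for USWC ends up with an ordinary cached GTT mapping, which is why this flag alone would not be expected to explain the timeout.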


I suggested that the machine supplier use a newer kernel such as 5.2.2, 
but they failed to do so after several attempts. We also backported a series 
of patches from newer kernels, but we still get the bad ring timeout.


We have been digging into the amdgpu drm driver for a long time, but it is 
really difficult for me, especially the hardware-related ring timeout.


--
Yanhua



------------------ Original Message ------------------
From: "Koenig, Christian";
Sent: 2019-09-03 (Tue) 9:19
To: "yanhua"<78666...@qq.com>;"amd-gfx";
Cc: "Deucher, Alexander";
Subject: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state



 This is just a GPU lockup, please open up a bug report on freedesktop.org, 
attach the full dmesg and state which version of Mesa you are using.
 
 Regards,
 Christian.
 
 On 03.09.19 at 15:16, 7879 wrote:
 
  Yes, with dmesg | grep drm, I get the following.
 
 
 [348571.880718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865
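
When many such lines have to be compared, the two sequence numbers are the interesting part: the gap between the emitted and the signaled value suggests how many submissions the ring has not completed. The following is a minimal sketch of extracting them from one log line; `struct ring_timeout` and `parse_ring_timeout` are hypothetical names, and the message layout is assumed to be exactly the one shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one "ring ... timeout" line. */
struct ring_timeout {
	char ring[32];           /* ring name, e.g. "sdma1" */
	unsigned long signaled;  /* last sequence number reported as signaled */
	unsigned long emitted;   /* last sequence number reported as emitted */
};

/*
 * Parse one dmesg line of the form
 *   "... ring <name> timeout, signaled seq=<n>, emitted seq=<m>"
 * Returns 0 on success and -1 if the line does not match that layout.
 */
static int parse_ring_timeout(const char *line, struct ring_timeout *out)
{
	const char *p;

	assert(line != NULL && out != NULL);

	p = strstr(line, "ring ");
	if (p == NULL)
		return -1;

	if (sscanf(p, "ring %31s timeout, signaled seq=%lu, emitted seq=%lu",
		   out->ring, &out->signaled, &out->emitted) != 3)
		return -1;

	return 0;
}
```

For the line quoted above this yields ring `sdma1` with `emitted - signaled` equal to 3, i.e. three sequence numbers were emitted on sdma1 that had not been signaled when the timeout fired.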
  
 
 
 
  ------------------ Original Message ------------------
  From: "Koenig, Christian";
 Sent: 2019-09-03 (Tue) 9:07
 To: ""<78666...@qq.com>;"amd-gfx";
 Cc: "Deucher, Alexander";
 Subject: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
 
 
 
 Well that looks like the hardware got stuck.
 
 Do you get something in the logs about a timeout on the SDMA ring?
 
 Regards,
 Christian.
 
 On 03.09.19 at 14:50, 7879 wrote:
 
  Hi Christian,
Sometimes the thread blocks in disk sleep in a call to 
amdgpu_sa_bo_new. The stack trace follows. It seems the sa bo pool is used up, 
so the caller blocks waiting for someone to free sa resources.
 
 
 
 D 206833 227656 [surfaceflinger]  Binder:45_5
 cat /proc/206833/task/227656/stack
 
 
 [<0>] __switch_to+0x94/0xe8
 [<0>] dma_fence_wait_any_timeout+0x234/0x2d0
 [<0>] amdgpu_sa_bo_new+0x468/0x540 [amdgpu]
 [<0>] amdgpu_ib_get+0x60/0xc8 [amdgpu]
 [<0>] amdgpu_job_alloc_with_ib+0x70/0xb0 [amdgpu]
 [<0>] amdgpu_vm_bo_update_mapping+0x2e0/0x3d8 [amdgpu]
 [<0>] amdgpu_vm_bo_update+0x2a0/0x710 [amdgpu]
 [<0>] amdgpu_gem_va_ioctl+0x46c/0x4c8 [amdgpu]
 [<0>] drm_ioctl_kernel+0x94/0x118 [drm]
 [<0>] drm_ioctl+0x1f0/0x438 [drm]
 [<0>] amdgpu_drm_ioctl+0x58/0x90 [amdgpu]
 [<0>] do_vfs_ioctl+0xc4/0x8c0
 [<0>] ksys_ioctl+0x8c/0xa0
 [<0>] __arm64_sys_ioctl+0x28/0x38
 [<0>] el0_svc_common+0xa0/0x180
 [<0>] el0_svc_handler+0x38/0x78
 [<0>] el0_svc+0x8/0xc
 [<0>] 0x
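
The leading "D" in the task listing above is the uninterruptible ("disk") sleep state. As a minimal sketch of how that state can be read programmatically, the helper below extracts the state character from the text of a /proc/<pid>/task/<tid>/stat line, where it is the first field after the parenthesised command name. The function name `task_state_from_stat` is hypothetical, and any stat line passed to it in examples is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the state character (e.g. 'R', 'S', 'D') from the text of a
 * /proc/<pid>/task/<tid>/stat line, or '\0' if the line is malformed.
 * The command name may itself contain spaces or parentheses, so the
 * search starts from the last ')' in the line.
 */
static char task_state_from_stat(const char *stat_line)
{
	const char *p;

	assert(stat_line != NULL);

	p = strrchr(stat_line, ')');
	if (p == NULL || p[1] != ' ' || p[2] == '\0')
		return '\0';

	return p[2];
}
```

Fed the contents of the stat file of thread 227656 while it is stuck in amdgpu_sa_bo_new, this would be expected to return 'D'.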
 
 
 
 
  
 YanHua
 
 
 
  ------------------ Original Message ------------------
  From: "Koenig, Christian";
 Sent: 2019-09-03 (Tue) 4:21
 To: ""<78666...@qq.com>;"amd-gfx";
 Cc: "Deucher, Alexander";
 Subject: Re: Bug: amdgpu drm driver cause process into Disk sleep state
 
 
 
 Hi Yanhua,
 
 please update your kernel first, cause t