Re: [PATCH 1/6] drm/ttm: Add unmapping of the entire device address space

2020-06-11 Thread Daniel Vetter
On Thu, Jun 11, 2020 at 11:06:16AM -0400, Andrey Grodzovsky wrote:
> 
> On 6/11/20 2:35 AM, Thomas Hellström (Intel) wrote:
> > 
> > On 6/10/20 11:19 PM, Andrey Grodzovsky wrote:
> > > 
> > > On 6/10/20 4:30 PM, Thomas Hellström (Intel) wrote:
> > > > 
> > > > On 6/10/20 5:30 PM, Daniel Vetter wrote:
> > > > > On Wed, Jun 10, 2020 at 04:05:04PM +0200, Christian König wrote:
> > > > > > On 10.06.20 15:54, Andrey Grodzovsky wrote:
> > > > > > > 
> > > > > > > On 6/10/20 6:15 AM, Thomas Hellström (Intel) wrote:
> > > > > > > > 
> > > > > > > > On 6/9/20 7:21 PM, Koenig, Christian wrote:
> > > > > > > > > 
> > > > > > > > > On 09.06.2020 18:37, "Grodzovsky, Andrey" wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >  On 6/5/20 2:40 PM, Christian König wrote:
> > > > > > > > >  > On 05.06.20 16:29, Andrey Grodzovsky wrote:
> > > > > > > > >  >>
> > > > > > > > >  >> On 5/11/20 2:45 AM, Christian König wrote:
> > > > > > > > >  >>> On 09.05.20 20:51, Andrey Grodzovsky wrote:
> > > > > > > > >   Signed-off-by: Andrey Grodzovsky
> > > > > > > > >   ---
> > > > > > > > >    drivers/gpu/drm/ttm/ttm_bo.c    | 22 +-
> > > > > > > > >    include/drm/ttm/ttm_bo_driver.h |  2 ++
> > > > > > > > >    2 files changed, 23 insertions(+), 1 deletion(-)
> > > > > > > > > 
> > > > > > > > >   diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > > >   index c5b516f..eae61cc 100644
> > > > > > > > >   --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > > >   +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > > >   @@ -1750,9 +1750,29 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
> > > > > > > > >    ttm_bo_unmap_virtual_locked(bo);
> > > > > > > > >    ttm_mem_io_unlock(man);
> > > > > > > > >    }
> > > > > > > > >   +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
> > > > > > > > >   +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
> > > > > > > > >   +{
> > > > > > > > >   +    struct ttm_mem_type_manager *man;
> > > > > > > > >   +    int i;
> > > > > > > > >   -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
> > > > > > > > >  >>>
> > > > > > > > >   +    for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
> > > > > > > > >   +        man = &bdev->man[i];
> > > > > > > > >   +        if (man->has_type && man->use_type)
> > > > > > > > >   +            ttm_mem_io_lock(man, false);
> > > > > > > > >   +    }
> > > > > > > > >  >>>
> > > > > > > > >  >>> You should drop that, it will just result in a deadlock warning
> > > > > > > > >  >>> for Nouveau and has no effect at all.
> > > > > > > > >  >>>
> > > > > > > > >  >>> Apart from that looks good to me,
> > > > > > > > >  >>> Christian.
> > > > > > > > >  >>
> > > > > > > > >  >>
> > > > > > > > >  >> As I am considering re-including this in V2 of the patchset,
> > > > > > > > >  >> can you please clarify why this will have no effect at all?
> > > > > > > > >  >
> > > > > > > > >  > The locks are exclusive for Nouveau to allocate/free the io
> > > > > > > > >  > address space.
> > > > > > > > >  >
> > > > > > > > >  > Since we don't do this here we don't need the locks.
> > > > > > > > >  >
> > > > > > > > >  > Christian.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >  So basically calling unmap_mapping_range doesn't require any extra
> > > > > > > > >  locking around it and whatever locks are taken within the function
> > > > > > > > >  should be enough?
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I think so, yes.
> > > > > > > > > 
> > > > > > > > > Christian.
> > > > > > > > Yes, that's true. However, without the bo reservation, nothing stops
> > > > > > > > a PTE from being immediately re-faulted back again. Even while
> > > > > > > > unmap_mapping_range() is running.
> > > > > > > > 
> > > > > > > Can you explain more on this - specifically, which function reserves
> > > > > > > the BO, and why would BO reservation prevent a re-fault of the PTE?
> > > > > > > 
> > > > > > Thomas is talking about ttm_bo_reserve()/ttm_bo_unreserve(), but we
> > > > > > don't need this here because we unmap everything, since the whole
> > > > > > device is gone, and don't just manipulate a single BO.
> > > > > > 
> > > > > > > > So the device removed flag needs to be advertised before this
> > > > > > > > function is run,

Re: [PATCH 1/6] drm/ttm: Add unmapping of the entire device address space

2020-06-11 Thread Daniel Vetter
On Thu, Jun 11, 2020 at 11:15:42AM -0400, Andrey Grodzovsky wrote:
> 
> On 6/10/20 5:16 PM, Daniel Vetter wrote:
> > On Wed, Jun 10, 2020 at 10:30 PM Thomas Hellström (Intel)
> >  wrote:
> > > 
> > > On 6/10/20 5:30 PM, Daniel Vetter wrote:
> > > > On Wed, Jun 10, 2020 at 04:05:04PM +0200, Christian König wrote:
> > > > > On 10.06.20 15:54, Andrey Grodzovsky wrote:
> > > > > > On 6/10/20 6:15 AM, Thomas Hellström (Intel) wrote:
> > > > > > > On 6/9/20 7:21 PM, Koenig, Christian wrote:
> > > > > > > > On 09.06.2020 18:37, "Grodzovsky, Andrey" wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > >   On 6/5/20 2:40 PM, Christian König wrote:
> > > > > > > >   > On 05.06.20 16:29, Andrey Grodzovsky wrote:
> > > > > > > >   >>
> > > > > > > >   >> On 5/11/20 2:45 AM, Christian König wrote:
> > > > > > > >   >>> On 09.05.20 20:51, Andrey Grodzovsky wrote:
> > > > > > > >    Signed-off-by: Andrey Grodzovsky 
> > > > > > > >    ---
> > > > > > > >     drivers/gpu/drm/ttm/ttm_bo.c    | 22 +-
> > > > > > > >     include/drm/ttm/ttm_bo_driver.h |  2 ++
> > > > > > > >     2 files changed, 23 insertions(+), 1 deletion(-)
> > > > > > > >   
> > > > > > > >    diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > >    index c5b516f..eae61cc 100644
> > > > > > > >    --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > >    +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > >    @@ -1750,9 +1750,29 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
> > > > > > > >    ttm_bo_unmap_virtual_locked(bo);
> > > > > > > >    ttm_mem_io_unlock(man);
> > > > > > > >     }
> > > > > > > >    +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
> > > > > > > >    +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
> > > > > > > >    +{
> > > > > > > >    +    struct ttm_mem_type_manager *man;
> > > > > > > >    +    int i;
> > > > > > > >    -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
> > > > > > > >   >>>
> > > > > > > >    +    for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
> > > > > > > >    +        man = &bdev->man[i];
> > > > > > > >    +        if (man->has_type && man->use_type)
> > > > > > > >    +            ttm_mem_io_lock(man, false);
> > > > > > > >    +    }
> > > > > > > >   >>>
> > > > > > > >   >>> You should drop that, it will just result in a deadlock warning
> > > > > > > >   >>> for Nouveau and has no effect at all.
> > > > > > > >   >>>
> > > > > > > >   >>> Apart from that looks good to me,
> > > > > > > >   >>> Christian.
> > > > > > > >   >>
> > > > > > > >   >>
> > > > > > > >   >> As I am considering re-including this in V2 of the patchset,
> > > > > > > >   >> can you please clarify why this will have no effect at all?
> > > > > > > >   >
> > > > > > > >   > The locks are exclusive for Nouveau to allocate/free the io
> > > > > > > >   > address space.
> > > > > > > >   >
> > > > > > > >   > Since we don't do this here we don't need the locks.
> > > > > > > >   >
> > > > > > > >   > Christian.
> > > > > > > > 
> > > > > > > > 
> > > > > > > >   So basically calling unmap_mapping_range doesn't require any extra
> > > > > > > >   locking around it and whatever locks are taken within the function
> > > > > > > >   should be enough?
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > I think so, yes.
> > > > > > > > 
> > > > > > > > Christian.
> > > > > > > Yes, that's true. However, without the bo reservation, nothing stops
> > > > > > > a PTE from being immediately re-faulted back again. Even while
> > > > > > > unmap_mapping_range() is running.
> > > > > > > 
> > > > > > Can you explain more on this - specifically, which function reserves
> > > > > > the BO, and why would BO reservation prevent a re-fault of the PTE?
> > > > > > 
> > > > > Thomas is talking about ttm_bo_reserve()/ttm_bo_unreserve(), but we
> > > > > don't need this here because we unmap everything, since the whole
> > > > > device is gone, and don't just manipulate a single BO.
> > > > > 
> > > > > > > So the device removed flag needs to be advertised before this
> > > > > > > function is run,
> > > > > > > 
> > > > > > I indeed intend to call this right after calling drm_dev_unplug from
> > > > > > amdgpu_pci_remove while adding drm_dev_enter/exit in ttm_bo_vm_fault (or
> > > > > > in an amdgpu specific wrapper since I don't see how I can access struct
> > > > > > drm_device from ttm_bo_vm_fault)

Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-11 Thread Daniel Vetter
On Fri, Jun 12, 2020 at 1:35 AM Felix Kuehling  wrote:
>
> On 2020-06-11 10:15 a.m., Jason Gunthorpe wrote:
> > On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
> >>> I still have my doubts about allowing fence waiting from within shrinkers.
> >>> IMO ideally they should use a trywait approach, in order to allow memory
> >>> allocation during command submission for drivers that
> >>> publish fences before command submission. (Since early reservation object
> >>> release requires that).
> >> Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up
> >> with a mempool to make sure it can handle its allocations.
> >>
> >>> But since drivers are already waiting from within shrinkers and I take your
> >>> word for HMM requiring this,
> >> Yeah the big trouble is HMM and mmu notifiers. That's the really awkward
> >> one, the shrinker one is a lot less established.
> > I really question if HW that needs something like DMA fence should
> > even be using mmu notifiers - the best use is HW that can fence the
> > DMA directly without having to get involved with some command stream
> > processing.
> >
> > Or at the very least it should not be a generic DMA fence but a
> > narrowed completion tied only into the same GPU driver's command
> > completion processing which should be able to progress without
> > blocking.
> >
> > The intent of notifiers was never to endlessly block while vast
> > amounts of SW does work.
> >
> > Going around and switching everything in a GPU to GFP_ATOMIC seems
> > like a bad idea.
> >
> >> I've pinged a bunch of armsoc gpu driver people and asked them how much this
> >> hurts, so that we have a clear answer. On x86 I don't think we have much
> >> of a choice on this, with userptr in amd and i915 and hmm work in nouveau
> >> (but nouveau I think doesn't use dma_fence in there).
>
> Soon nouveau will get company. We're working on a recoverable page fault
> implementation for HMM in amdgpu where we'll need to update page tables
> using the GPUs SDMA engine and wait for corresponding fences in MMU
> notifiers.

Well amdgpu already has dma_fence waits in the hmm callbacks, so
nothing new. But since you start using these in amdkfd ... perfect
opportunity to annotate the amdkfd paths for fence signalling critical
sections? Especially the preempt-ctx fence should be an interesting
case to annotate and see whether lockdep finds anything. Not sure what
else there is.
-Daniel

>
> Regards,
>   Felix
>
>
> > Right, nor will RDMA ODP.
> >
> > Jason



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
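
For reference, the shape of the annotation Daniel is suggesting, as a
sketch only: dma_fence_begin/end_signalling() is the API added in this
patch series, and the amdkfd call site below is hypothetical.

	/* sketch: annotating an amdkfd path that ends in signalling the
	 * eviction fence; kfd_process_restore_queues() and p->ef exist in
	 * amdkfd, but this placement is illustrative only */
	bool cookie = dma_fence_begin_signalling();

	/* from here to the signal is a fence-signalling critical section:
	 * no GFP_KERNEL allocations, no locks that are also held while
	 * waiting on a dma_fence */
	r = kfd_process_restore_queues(p);
	if (!r)
		dma_fence_signal(p->ef);

	dma_fence_end_signalling(cookie);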


[PATCH 1/1] drm/amdkfd: Add eviction debug messages

2020-06-11 Thread Felix Kuehling
Use WARN to print messages with a backtrace when evictions are triggered.
This can help determine the root cause of evictions and help spot driver
bugs triggering evictions unintentionally, or help with performance tuning
by avoiding conditions that cause evictions in a specific workload.

The messages are controlled by a new module parameter that can be changed
at runtime:

  echo Y > /sys/module/amdgpu/parameters/debug_evictions
  echo N > /sys/module/amdgpu/parameters/debug_evictions

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h  | 2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  | 8 
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 ++
 drivers/gpu/drm/amd/amdkfd/kfd_device.c  | 3 +++
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 5 +
 5 files changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 10ae92e835f6..6c7dd0a707c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -186,8 +186,10 @@ extern int amdgpu_noretry;
 extern int amdgpu_force_asic_type;
 #ifdef CONFIG_HSA_AMD
 extern int sched_policy;
+extern bool debug_evictions;
 #else
 static const int sched_policy = KFD_SCHED_POLICY_HWS;
+static const bool debug_evictions; /* = false */
 #endif
 
 extern int amdgpu_tmz;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d4d7cca1cc72..fdf350d5e7b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -705,6 +705,14 @@ MODULE_PARM_DESC(hws_gws_support, "Assume MEC2 FW supports GWS barriers (false =
 int queue_preemption_timeout_ms = 9000;
 module_param(queue_preemption_timeout_ms, int, 0644);
 MODULE_PARM_DESC(queue_preemption_timeout_ms, "queue preemption timeout in ms (1 = Minimum, 9000 = default)");
+
+/**
+ * DOC: debug_evictions(bool)
+ * Enable extra debug messages to help determine the cause of evictions
+ */
+bool debug_evictions;
+module_param(debug_evictions, bool, 0644);
+MODULE_PARM_DESC(debug_evictions, "enable eviction debug messages (false = default)");
 #endif
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index b87ca171986a..072f0e1185a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -275,6 +275,8 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync,
continue;
}
 
+   WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
+"Adding eviction fence to sync obj");
r = amdgpu_sync_fence(sync, f, false);
if (r)
break;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 22348cebaf36..80393e0583bb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -942,6 +942,7 @@ int kgd2kfd_quiesce_mm(struct mm_struct *mm)
if (!p)
return -ESRCH;
 
+   WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
r = kfd_process_evict_queues(p);
 
kfd_unref_process(p);
@@ -1009,6 +1010,8 @@ int kgd2kfd_schedule_evict_and_restore_process(struct mm_struct *mm,
/* During process initialization eviction_work.dwork is initialized
 * to kfd_evict_bo_worker
 */
+   WARN(debug_evictions, "Scheduling eviction of pid %d in %ld jiffies",
+p->lead_thread->pid, delay_jiffies);
schedule_delayed_work(&p->eviction_work, delay_jiffies);
 out:
kfd_unref_process(p);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 173d58b2d81f..51ba2020732e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -177,6 +177,11 @@ extern bool hws_gws_support;
  */
 extern int queue_preemption_timeout_ms;
 
+/*
+ * Enable eviction debug messages
+ */
+extern bool debug_evictions;
+
 enum cache_policy {
cache_policy_coherent,
cache_policy_noncoherent
-- 
2.17.1



[PATCH] drm/amdkfd: Use correct major in devcgroup check

2020-06-11 Thread Lorenz Brun
The existing code used the major version number of the DRM driver
instead of the device major number of the DRM subsystem for
validating access for a devices cgroup.

This meant that accesses allowed by the devices cgroup weren't
permitted and certain accesses denied by the devices cgroup were
permitted (if they matched the wrong major device number).

Signed-off-by: Lorenz Brun 
Fixes: 6b855f7b83d2f ("drm/amdkfd: Check against device cgroup")
---
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index f0587d94294d..fee60921fccf 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -1076,7 +1077,7 @@ static inline int kfd_devcgroup_check_permission(struct kfd_dev *kfd)
 #if defined(CONFIG_CGROUP_DEVICE) || defined(CONFIG_CGROUP_BPF)
struct drm_device *ddev = kfd->ddev;
 
-   return devcgroup_check_permission(DEVCG_DEV_CHAR, ddev->driver->major,
+   return devcgroup_check_permission(DEVCG_DEV_CHAR, DRM_MAJOR,
  ddev->render->index,
  DEVCG_ACC_WRITE | DEVCG_ACC_READ);
 #else
-- 
2.25.1



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-11 Thread Felix Kuehling
On 2020-06-11 10:15 a.m., Jason Gunthorpe wrote:
> On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
>>> I still have my doubts about allowing fence waiting from within shrinkers.
>>> IMO ideally they should use a trywait approach, in order to allow memory
>>> allocation during command submission for drivers that
>>> publish fences before command submission. (Since early reservation object
>>> release requires that).
>> Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up
>> with a mempool to make sure it can handle its allocations.
>>
>>> But since drivers are already waiting from within shrinkers and I take your
>>> word for HMM requiring this,
>> Yeah the big trouble is HMM and mmu notifiers. That's the really awkward
>> one, the shrinker one is a lot less established.
> I really question if HW that needs something like DMA fence should
> even be using mmu notifiers - the best use is HW that can fence the
> DMA directly without having to get involved with some command stream
> processing.
>
> Or at the very least it should not be a generic DMA fence but a
> narrowed completion tied only into the same GPU driver's command
> completion processing which should be able to progress without
> blocking.
>
> The intent of notifiers was never to endlessly block while vast
> amounts of SW does work.
>
> Going around and switching everything in a GPU to GFP_ATOMIC seems
> like a bad idea.
>
>> I've pinged a bunch of armsoc gpu driver people and asked them how much this
>> hurts, so that we have a clear answer. On x86 I don't think we have much
>> of a choice on this, with userptr in amd and i915 and hmm work in nouveau
>> (but nouveau I think doesn't use dma_fence in there). 

Soon nouveau will get company. We're working on a recoverable page fault
implementation for HMM in amdgpu where we'll need to update page tables
using the GPUs SDMA engine and wait for corresponding fences in MMU
notifiers.

Regards,
  Felix


> Right, nor will RDMA ODP. 
>
> Jason


Re: [PATCH] mm: Track mmu notifiers in fs_reclaim_acquire/release

2020-06-11 Thread Jason Gunthorpe
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
> fs_reclaim_acquire/release nicely catch recursion issues when
> allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend
> to use to keep their excessive caches in check). For mmu notifier
> recursions we do have lockdep annotations since 23b68395c7c7
> ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
> 
> But these only fire if a path actually results in some pte
> invalidation - for most small allocations that's very rarely the case.
> The other trouble is that pte invalidation can happen any time when
> __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe
> choice, GFP_NOIO isn't good enough to avoid potential mmu notifier
> recursion.
> 
> I was pondering whether we should just do the general annotation, but
> there's always the risk for false positives. Plus I'm assuming that
> the core fs and io code is a lot better reviewed and tested than
> random mmu notifier code in drivers. Hence why I decide to only
> annotate for that specific case.
> 
> Furthermore even if we'd create a lockdep map for direct reclaim, we'd
> still need to explicitly pull in the mmu notifier map - there are a lot
> more places that do pte invalidation than just direct reclaim, these
> two contexts aren't the same.
> 
> Note that the mmu notifiers needing their own independent lockdep map
> is also the reason we can't hold them from fs_reclaim_acquire to
> fs_reclaim_release - it would nest with the acquisition in the pte
> invalidation code, causing a lockdep splat. And we can't remove the
> annotations from pte invalidation and all the other places since
> they're called from many other places than page reclaim. Hence we can
> only do the equivalent of might_lock, but on the raw lockdep map.
> 
> With this we can also remove the lockdep priming added in 66204f1d2d1b
> ("mm/mmu_notifiers: prime lockdep") since the new annotations are
> strictly more powerful.
> 
> v2: Review from Thomas Hellstrom:
> - unbotch the fs_reclaim context check, I accidentally inverted it,
>   but it didn't blow up because I inverted it immediately
> - fix compiling for !CONFIG_MMU_NOTIFIER
> 
> Cc: Thomas Hellström (Intel) 
> Cc: Andrew Morton 
> Cc: Jason Gunthorpe 
> Cc: linux...@kvack.org
> Cc: linux-r...@vger.kernel.org
> Cc: Maarten Lankhorst 
> Cc: Christian König 
> Signed-off-by: Daniel Vetter 
> ---
> This is part of a gpu lockdep annotation series simply because it
> really helps to catch issues where gpu subsystem locks and primitives
> can deadlock with themselves through allocations and mmu notifiers.
> But aside from that motivation it should be completely free-standing,
> and can land through -mm/-rdma/-hmm or any other tree really whenever.
> -Daniel

I'm still not totally clear on how all the GFP flags map to
different behaviors, but this seems plausible to me

At this point it should go through Andrew's tree, thanks

Acked-by: Jason Gunthorpe  # For mmu_notifiers

Jason
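
For reference, the resulting annotation in fs_reclaim_acquire() looks
roughly like this (a sketch reconstructed from the description above;
the lockdep map name is the one added in 23b68395c7c7):

	void fs_reclaim_acquire(gfp_t gfp_mask)
	{
		gfp_mask = current_gfp_context(gfp_mask);

		if (__need_fs_reclaim(gfp_mask)) {
			__fs_reclaim_acquire();

	#ifdef CONFIG_MMU_NOTIFIER
			/* the might_lock equivalent on the raw lockdep map:
			 * acquire and release immediately, so it doesn't nest
			 * with the acquisition in the pte invalidation code */
			lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
			lock_map_release(&__mmu_notifier_invalidate_range_start_map);
	#endif
		}
	}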


Re: [PATCH] drm/amdkfd: Use correct major in devcgroup check

2020-06-11 Thread Felix Kuehling
On 2020-06-11 4:11 p.m., Lorenz Brun wrote:
> The existing code used the major version number of the DRM driver
> instead of the device major number of the DRM subsystem for
> validating access for a devices cgroup.
>
> This meant that accesses allowed by the devices cgroup weren't
> permitted and certain accesses denied by the devices cgroup were
> permitted (if they matched the wrong major device number).
>
> Signed-off-by: Lorenz Brun 

Reviewed-by: Felix Kuehling 

Thanks for this fix. I will apply it to our amd-staging-drm-next branch.

Regards,
  Felix


> Fixes: 6b855f7b83d2f ("drm/amdkfd: Check against device cgroup")
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index f0587d94294d..fee60921fccf 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -40,6 +40,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -1076,7 +1077,7 @@ static inline int kfd_devcgroup_check_permission(struct kfd_dev *kfd)
>  #if defined(CONFIG_CGROUP_DEVICE) || defined(CONFIG_CGROUP_BPF)
>   struct drm_device *ddev = kfd->ddev;
>  
> - return devcgroup_check_permission(DEVCG_DEV_CHAR, ddev->driver->major,
> + return devcgroup_check_permission(DEVCG_DEV_CHAR, DRM_MAJOR,
> ddev->render->index,
> DEVCG_ACC_WRITE | DEVCG_ACC_READ);
>  #else


Re: [PATCH 1/6] drm/ttm: Add unmapping of the entire device address space

2020-06-11 Thread Andrey Grodzovsky


On 6/10/20 5:16 PM, Daniel Vetter wrote:

On Wed, Jun 10, 2020 at 10:30 PM Thomas Hellström (Intel)
 wrote:


On 6/10/20 5:30 PM, Daniel Vetter wrote:

On Wed, Jun 10, 2020 at 04:05:04PM +0200, Christian König wrote:

On 10.06.20 15:54, Andrey Grodzovsky wrote:

On 6/10/20 6:15 AM, Thomas Hellström (Intel) wrote:

On 6/9/20 7:21 PM, Koenig, Christian wrote:

On 09.06.2020 18:37, "Grodzovsky, Andrey" wrote:


  On 6/5/20 2:40 PM, Christian König wrote:
  > On 05.06.20 16:29, Andrey Grodzovsky wrote:
  >>
  >> On 5/11/20 2:45 AM, Christian König wrote:
  >>> On 09.05.20 20:51, Andrey Grodzovsky wrote:
   Signed-off-by: Andrey Grodzovsky 
   ---
    drivers/gpu/drm/ttm/ttm_bo.c    | 22 +-
    include/drm/ttm/ttm_bo_driver.h |  2 ++
    2 files changed, 23 insertions(+), 1 deletion(-)
  
   diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
   index c5b516f..eae61cc 100644
   --- a/drivers/gpu/drm/ttm/ttm_bo.c
   +++ b/drivers/gpu/drm/ttm/ttm_bo.c
   @@ -1750,9 +1750,29 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
    ttm_bo_unmap_virtual_locked(bo);
    ttm_mem_io_unlock(man);
    }
   +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
   +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
   +{
   +    struct ttm_mem_type_manager *man;
   +    int i;
   -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
  >>>
   +    for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
   +        man = &bdev->man[i];
   +        if (man->has_type && man->use_type)
   +            ttm_mem_io_lock(man, false);
   +    }
  >>>
  >>> You should drop that, it will just result in a deadlock warning
  >>> for Nouveau and has no effect at all.
  >>>
  >>> Apart from that looks good to me,
  >>> Christian.
  >>
  >>
  >> As I am considering re-including this in V2 of the patchset,
  >> can you please clarify why this will have no effect at all?
  >
  > The locks are exclusive for Nouveau to allocate/free the io
  > address space.
  >
  > Since we don't do this here we don't need the locks.
  >
  > Christian.


  So basically calling unmap_mapping_range doesn't require any extra
  locking around it and whatever locks are taken within the function
 should be enough?



I think so, yes.

Christian.

Yes, that's true. However, without the bo reservation, nothing stops
a PTE from being immediately re-faulted back again. Even while
unmap_mapping_range() is running.


Can you explain more on this - specifically, which function reserves
the BO, and why would BO reservation prevent a re-fault of the PTE?


Thomas is talking about ttm_bo_reserve()/ttm_bo_unreserve(), but we don't
need this here because we unmap everything, since the whole device is gone,
and don't just manipulate a single BO.


So the device removed flag needs to be advertised before this
function is run,


I indeed intend to call this right after calling drm_dev_unplug from
amdgpu_pci_remove while adding drm_dev_enter/exit in ttm_bo_vm_fault (or
in an amdgpu specific wrapper since I don't see how I can access struct
drm_device from ttm_bo_vm_fault) and this in my understanding should
stop a PTE from being re-faulted back as you pointed out - so again I
don't see how bo reservation would prevent it, so it looks like I am
missing something...



(perhaps with a memory barrier pair).


drm_dev_unplug and drm_dev_enter/exit are RCU synchronized, so I don't
think they require any extra memory barriers for visibility of the
removed flag being set.


As far as I can see that should be perfectly sufficient.

Only if you have a drm_dev_enter/exit pair in your fault handler.
Otherwise you're still open to the races Thomas described. But aside from
that the drm_dev_unplug stuff has all the barriers and stuff to make sure
nothing escapes.

Failure to drm_dev_enter could then also trigger the special case where we
put a dummy page in place.
-Daniel

Hmm, yes, indeed advertising the flag before the call to
unmap_mapping_range isn't enough, since there might be fault handlers
running that haven't picked up the flag when unmap_mapping_range is
launched.

Hm ... Now I'm not sure drm_dev_enter/exit is actually good enough. I
guess if you use vmf_insert_pfn within the drm_dev_enter/exit critical
section, it should be fine. But I think you can also do fault handlers
that just return the struct page and then let the core handle the pte
wrangling; those would indeed race, and we can't have that, I think.

I think we should try and make sure (as much as possible) that this is
all done in helpers and not in some open-coded stuff in drivers, or
we'll just get it all wrong in the details.



Can you please clarify this last paragraph?
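
A minimal sketch of the amdgpu-specific fault wrapper being discussed
(the function name is hypothetical, the drm_device is reached through the
BO's GEM base as an assumption, and the dummy-page fallback is only
indicated, not implemented):

	static vm_fault_t amdgpu_ttm_fault(struct vm_fault *vmf)
	{
		struct ttm_buffer_object *bo = vmf->vma->vm_private_data;
		struct drm_device *ddev = bo->base.dev;
		vm_fault_t ret;
		int idx;

		if (!drm_dev_enter(ddev, &idx))
			return VM_FAULT_SIGBUS;	/* or map a dummy page here */

		/* the PTE is (re-)inserted only inside the drm_dev_enter/exit
		 * critical section, so it cannot race with drm_dev_unplug() */
		ret = ttm_bo_vm_fault(vmf);

		drm_dev_exit(idx);
		return ret;
	}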

Re: [PATCH 1/6] drm/ttm: Add unmapping of the entire device address space

2020-06-11 Thread Andrey Grodzovsky


On 6/11/20 2:35 AM, Thomas Hellström (Intel) wrote:


On 6/10/20 11:19 PM, Andrey Grodzovsky wrote:


On 6/10/20 4:30 PM, Thomas Hellström (Intel) wrote:


On 6/10/20 5:30 PM, Daniel Vetter wrote:

On Wed, Jun 10, 2020 at 04:05:04PM +0200, Christian König wrote:

On 10.06.20 15:54, Andrey Grodzovsky wrote:


On 6/10/20 6:15 AM, Thomas Hellström (Intel) wrote:


On 6/9/20 7:21 PM, Koenig, Christian wrote:


On 09.06.2020 18:37, "Grodzovsky, Andrey" wrote:


 On 6/5/20 2:40 PM, Christian König wrote:
 > On 05.06.20 16:29, Andrey Grodzovsky wrote:
 >>
 >> On 5/11/20 2:45 AM, Christian König wrote:
 >>> On 09.05.20 20:51, Andrey Grodzovsky wrote:
  Signed-off-by: Andrey Grodzovsky 
  ---
   drivers/gpu/drm/ttm/ttm_bo.c    | 22 +-
   include/drm/ttm/ttm_bo_driver.h |  2 ++
   2 files changed, 23 insertions(+), 1 deletion(-)
 
  diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
  index c5b516f..eae61cc 100644
  --- a/drivers/gpu/drm/ttm/ttm_bo.c
  +++ b/drivers/gpu/drm/ttm/ttm_bo.c
  @@ -1750,9 +1750,29 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
  ttm_bo_unmap_virtual_locked(bo);
  ttm_mem_io_unlock(man);
   }
  +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
  +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
  +{
  +    struct ttm_mem_type_manager *man;
  +    int i;
  -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
 >>>
  +    for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
  +        man = &bdev->man[i];
  +        if (man->has_type && man->use_type)
  +            ttm_mem_io_lock(man, false);
  +    }
 >>>
 >>> You should drop that, it will just result in a deadlock warning
 >>> for Nouveau and has no effect at all.
 >>>
 >>> Apart from that looks good to me,
 >>> Christian.
 >>
 >>
 >> As I am considering re-including this in V2 of the patchset,
 >> can you please clarify why this will have no effect at all?
 >
 > The locks are exclusive for Nouveau to allocate/free the io
 > address space.
 >
 > Since we don't do this here we don't need the locks.
 >
 > Christian.


 So basically calling unmap_mapping_range doesn't require any extra
 locking around it and whatever locks are taken within the function
 should be enough?



I think so, yes.

Christian.
Yes, that's true. However, without the bo reservation, nothing stops
a PTE from being immediately re-faulted back again. Even while
unmap_mapping_range() is running.

Can you explain more on this - specifically, which function reserves
the BO, and why would BO reservation prevent a re-fault of the PTE?

Thomas is talking about ttm_bo_reserve()/ttm_bo_unreserve(), but we don't
need this here because we unmap everything, since the whole device is gone,
and don't just manipulate a single BO.


So the device removed flag needs to be advertised before this
function is run,

I indeed intend to call this right after calling drm_dev_unplug from
amdgpu_pci_remove while adding drm_dev_enter/exit in ttm_bo_vm_fault (or
in an amdgpu specific wrapper since I don't see how I can access struct
drm_device from ttm_bo_vm_fault) and this in my understanding should
stop a PTE from being re-faulted back as you pointed out - so again I
don't see how bo reservation would prevent it, so it looks like I am
missing something...



(perhaps with a memory barrier pair).


drm_dev_unplug and drm_dev_enter/exit are RCU synchronized, so I don't
think they require any extra memory barriers for visibility of the
removed flag being set.


As far as I can see that should be perfectly sufficient.

Only if you have a drm_dev_enter/exit pair in your fault handler.
Otherwise you're still open to the races Thomas described. But aside from
that the drm_dev_unplug stuff has all the barriers and stuff to make sure
nothing escapes.

Failure to drm_dev_enter could then also trigger the special case where we
put a dummy page in place.
-Daniel


Hmm, yes, indeed advertising the flag before the call to
unmap_mapping_range isn't enough, since there might be fault handlers
running that haven't picked up the flag when unmap_mapping_range is
launched.



If you mean those fault handlers that were in progress when the flag 
(drm_dev_unplug) was set in amdgpu_pci_remove, then as long as I wrap 
the entire fault handler (probably using an amdgpu specific .fault hook 
around ttm_bo_vm_fault) with a drm_dev_enter/exit pair, then 
drm_dev_unplug->synchronize_srcu will block until those in-progress 
faults have completed, and only after this I will call 
unmap_mapping_range. Should this be enough?


Andrey


Yes, I believe so. Although I suspect you might trip lockdep with 
reverse locking order against the mmap_sem wh
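
The teardown ordering Andrey describes, as a sketch (a hypothetical
amdgpu_pci_remove() excerpt; only the ordering matters here):

	static void amdgpu_pci_remove(struct pci_dev *pdev)
	{
		struct drm_device *ddev = pci_get_drvdata(pdev);
		struct amdgpu_device *adev = ddev->dev_private;

		/* sets the unplugged flag and does synchronize_srcu(), so it
		 * returns only after all in-flight drm_dev_enter/exit
		 * sections - i.e. all running fault handlers - have finished */
		drm_dev_unplug(ddev);

		/* no new PTEs can be faulted in now, so zap them all */
		ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);

		/* ... rest of the teardown ... */
	}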

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Daniel Vetter
On Thu, Jun 11, 2020 at 4:29 PM Tvrtko Ursulin
 wrote:
>
>
> On 11/06/2020 12:29, Daniel Vetter wrote:
> > On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin
> >  wrote:
> >> On 10/06/2020 16:17, Daniel Vetter wrote:
> >>> On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
> >>>  wrote:
> 
> 
>  On 04/06/2020 09:12, Daniel Vetter wrote:
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> >
> > - We use a read-lock for the execution/worker/completion side, so that
> >  this explicit annotation can be more liberally sprinkled around.
> >  With read locks lockdep isn't going to complain if the read-side
> >  isn't nested the same way under all circumstances, so ABBA deadlocks
> >  are ok. Which they are, since this is an annotation only.
> >
> > - We're using non-recursive lockdep read lock mode, since in recursive
> >  read lock mode lockdep does not catch read side hazards. And we
> >  _very_ much want read side hazards to be caught. For full details of
> >  this limitation see
> >
> >  commit e91498589746065e3ae95d9a00b068e525eec34f
> >  Author: Peter Zijlstra 
> >  Date:   Wed Aug 23 13:13:11 2017 +0200
> >
> >  locking/lockdep/selftests: Add mixed read-write ABBA tests
> >
> > - To allow nesting of the read-side explicit annotations we explicitly
> >  keep track of the nesting. lock_is_held() allows us to do that.
> >
> > - The wait-side annotation is a write lock, and entirely done within
> >  dma_fence_wait() for everyone by default.
> >
> > - To be able to freely annotate helper functions I want to make it ok
> >  to call dma_fence_begin/end_signalling from soft/hardirq context.
> >  First attempt was using the hardirq locking context for the write
> >  side in lockdep, but this forces all normal spinlocks nested within
> >  dma_fence_begin/end_signalling to be spinlocks. That bollocks.
> >
> >  The approach now is to simply check in_atomic(), and for these cases
> >  entirely rely on the might_sleep() check in dma_fence_wait(). That
> >  will catch any wrong nesting against spinlocks from soft/hardirq
> >  contexts.
> >
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> >
> > The main class of deadlocks this is supposed to catch are:
> >
> > Thread A:
> >
> > mutex_lock(A);
> > mutex_unlock(A);
> >
> > dma_fence_signal();
> >
> > Thread B:
> >
> > mutex_lock(A);
> > dma_fence_wait();
> > mutex_unlock(A);
> >
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> >
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindless to avoid false
> > positives.
> >
> > v2: handle soft/hardirq ctx better against write side and don't forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> >
> > v3: Kerneldoc.
> >
> > v4: Some spelling fixes from Mika
> >
> > Cc: Mika Kuoppala 
> > Cc: Thomas Hellstrom 
> > Cc: linux-me...@vger.kernel.org
> > Cc: linaro-mm-...@lists.linaro.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-...@lists.freedesktop.org
> > Cc: Chris Wilson 
> > Cc: Maarten Lankhorst 
> > Cc: Christian König 
> > Signed-off-by: Daniel Vetter 
> > ---
> > Documentation/driver-api/dm
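
The deadlock class from the commit message, spelled out (a sketch; with
the annotations in place lockdep can report this ordering without the
deadlock ever having to fire):

	/* Thread A: signalling side */
	mutex_lock(&A);
	/* ... */
	mutex_unlock(&A);
	dma_fence_signal(fence);

	/* Thread B: waits with A held */
	mutex_lock(&A);
	dma_fence_wait(fence, false);
	mutex_unlock(&A);

Thread B blocks in dma_fence_wait() while holding A, and thread A never
reaches dma_fence_signal() because it is stuck on mutex_lock(A).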

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Tvrtko Ursulin

On 11/06/2020 12:29, Daniel Vetter wrote:
> On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin
>  wrote:
>> On 10/06/2020 16:17, Daniel Vetter wrote:
>>> On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
>>>  wrote:


 On 04/06/2020 09:12, Daniel Vetter wrote:
> Design is similar to the lockdep annotations for workers, but with
> some twists:
>
> - We use a read-lock for the execution/worker/completion side, so that
>  this explicit annotation can be more liberally sprinkled around.
>  With read locks lockdep isn't going to complain if the read-side
>  isn't nested the same way under all circumstances, so ABBA deadlocks
>  are ok. Which they are, since this is an annotation only.
>
> - We're using non-recursive lockdep read lock mode, since in recursive
>  read lock mode lockdep does not catch read side hazards. And we
>  _very_ much want read side hazards to be caught. For full details of
>  this limitation see
>
>  commit e91498589746065e3ae95d9a00b068e525eec34f
>  Author: Peter Zijlstra 
>  Date:   Wed Aug 23 13:13:11 2017 +0200
>
>  locking/lockdep/selftests: Add mixed read-write ABBA tests
>
> - To allow nesting of the read-side explicit annotations we explicitly
>  keep track of the nesting. lock_is_held() allows us to do that.
>
> - The wait-side annotation is a write lock, and entirely done within
>  dma_fence_wait() for everyone by default.
>
> - To be able to freely annotate helper functions I want to make it ok
>  to call dma_fence_begin/end_signalling from soft/hardirq context.
>  First attempt was using the hardirq locking context for the write
>  side in lockdep, but this forces all normal spinlocks nested within
>  dma_fence_begin/end_signalling to be spinlocks. That bollocks.
>
>  The approach now is to simply check in_atomic(), and for these cases
>  entirely rely on the might_sleep() check in dma_fence_wait(). That
>  will catch any wrong nesting against spinlocks from soft/hardirq
>  contexts.
>
> The idea here is that every code path that's critical for eventually
> signalling a dma_fence should be annotated with
> dma_fence_begin/end_signalling. The annotation ideally starts right
> after a dma_fence is published (added to a dma_resv, exposed as a
> sync_file fd, attached to a drm_syncobj fd, or anything else that
> makes the dma_fence visible to other kernel threads), up to and
> including the dma_fence_wait(). Examples are irq handlers, the
> scheduler rt threads, the tail of execbuf (after the corresponding
> fences are visible), any workers that end up signalling dma_fences and
> really anything else. Not annotated should be code paths that only
> complete fences opportunistically as the gpu progresses, like e.g.
> shrinker/eviction code.
>
> The main class of deadlocks this is supposed to catch are:
>
> Thread A:
>
> mutex_lock(A);
> mutex_unlock(A);
>
> dma_fence_signal();
>
> Thread B:
>
> mutex_lock(A);
> dma_fence_wait();
> mutex_unlock(A);
>
> Thread B is blocked on A signalling the fence, but A never gets around
> to that because it cannot acquire the lock A.
>
> Note that dma_fence_wait() is allowed to be nested within
> dma_fence_begin/end_signalling sections. To allow this to happen the
> read lock needs to be upgraded to a write lock, which means that any
> other lock is acquired between the dma_fence_begin_signalling() call and
> the call to dma_fence_wait(), and still held, this will result in an
> immediate lockdep complaint. The only other option would be to not
> annotate such calls, defeating the point. Therefore these annotations
> cannot be sprinkled over the code entirely mindless to avoid false
> positives.
>
> v2: handle soft/hardirq ctx better against write side and don't forget
> EXPORT_SYMBOL, drivers can't use this otherwise.
>
> v3: Kerneldoc.
>
> v4: Some spelling fixes from Mika
>
> Cc: Mika Kuoppala 
> Cc: Thomas Hellstrom 
> Cc: linux-me...@vger.kernel.org
> Cc: linaro-mm-...@lists.linaro.org
> Cc: linux-r...@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-...@lists.freedesktop.org
> Cc: Chris Wilson 
> Cc: Maarten Lankhorst 
> Cc: Christian König 
> Signed-off-by: Daniel Vetter 
> ---
> Documentation/driver-api/dma-buf.rst |  12 +-
> drivers/dma-buf/dma-fence.c  | 161 +++
> include/linux/dma-fence.h|  12 ++
> 3 files changed, 182 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/driver-api/dma-buf.rst 
> b/Documentati

Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-11 Thread Jason Gunthorpe
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
> > I still have my doubts about allowing fence waiting from within shrinkers.
> > IMO ideally they should use a trywait approach, in order to allow memory
> > allocation during command submission for drivers that
> > publish fences before command submission. (Since early reservation object
> > release requires that).
> 
> Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up
> with a mempool to make sure it can handle its allocations.
> 
> > But since drivers are already waiting from within shrinkers and I take your
> > word for HMM requiring this,
> 
> Yeah the big trouble is HMM and mmu notifiers. That's the really awkward
> one, the shrinker one is a lot less established.

I really question if HW that needs something like DMA fence should
even be using mmu notifiers - the best use is HW that can fence the
DMA directly without having to get involved with some command stream
processing.

Or at the very least it should not be a generic DMA fence but a
narrowed completion tied only into the same GPU driver's command
completion processing which should be able to progress without
blocking.

The intent of notifiers was never to endlessly block while vast
amounts of SW does work.

Going around and switching everything in a GPU to GFP_ATOMIC seems
like a bad idea.

> I've pinged a bunch of armsoc gpu driver people and asked them how much this
> hurts, so that we have a clear answer. On x86 I don't think we have much
> of a choice on this, with userptr in amd and i915 and hmm work in nouveau
> (but nouveau I think doesn't use dma_fence in there). 

Right, nor will RDMA ODP. 

Jason


RE: [PATCH] drm/amdgpu: correct ras query as part of ctx query

2020-06-11 Thread Chen, Guchun
[AMD Public Use]

Hi Dennis,

Sorry for the confusion brought by the commit message. I will send an updated patch v2 later.

Regards,
Guchun

-Original Message-
From: Li, Dennis  
Sent: Thursday, June 11, 2020 6:57 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Zhang, 
Hawking ; Zhou1, Tao ; Pan, Xinhui 
; Clements, John 
Subject: RE: [PATCH] drm/amdgpu: correct ras query as part of ctx query

[AMD Official Use Only - Internal Distribution Only]

Hi, Guchun,
 The ras_manager obj will save the error counters in every querying, 
therefore the previous querying shouldn't affect the result of current 
querying. Please check the function: amdgpu_ras_error_query. 
 
Best Regards
Dennis Li
-Original Message-
From: Chen, Guchun  
Sent: Thursday, June 11, 2020 6:24 PM
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
Zhou1, Tao ; Pan, Xinhui ; Li, Dennis 
; Clements, John 
Cc: Chen, Guchun 
Subject: [PATCH] drm/amdgpu: correct ras query as part of ctx query

Almost all error count registers are automatically cleared after being read
once, so both the CE and UE counts need to be read in one loop.

Signed-off-by: Guchun Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 16 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 ++--
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index c06cb06398b1..29fa6b6b9d3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -335,7 +335,7 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 {
struct amdgpu_ctx *ctx;
struct amdgpu_ctx_mgr *mgr;
-   unsigned long ras_counter;
+   unsigned long ras_counter_ue, ras_counter_ce;
 
if (!fpriv)
return -EINVAL;
@@ -360,19 +360,17 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
if (atomic_read(&ctx->guilty))
out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
-   /*query ue count*/
-   ras_counter = amdgpu_ras_query_error_count(adev, false);
+   /*query both ue and ce count*/
+   amdgpu_ras_query_error_count(adev, &ras_counter_ue, &ras_counter_ce);
/*ras counter is monotonic increasing*/
-   if (ras_counter != ctx->ras_counter_ue) {
+   if (ras_counter_ue != ctx->ras_counter_ue) {
out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
-   ctx->ras_counter_ue = ras_counter;
+   ctx->ras_counter_ue = ras_counter_ue;
}
 
-   /*query ce count*/
-   ras_counter = amdgpu_ras_query_error_count(adev, true);
-   if (ras_counter != ctx->ras_counter_ce) {
+   if (ras_counter_ce != ctx->ras_counter_ce) {
out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
-   ctx->ras_counter_ce = ras_counter;
+   ctx->ras_counter_ce = ras_counter_ce;
}
 
mutex_unlock(&mgr->lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 337bf2da7bdc..109eff2869b9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -861,15 +861,18 @@ int amdgpu_ras_error_cure(struct amdgpu_device *adev,
 }
 
 /* get the total error counts on all IPs */
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-   bool is_ce)
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+   unsigned long *ue_cnt, unsigned long *ce_cnt)
 {
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
struct ras_manager *obj;
struct ras_err_data data = {0, 0};
 
+   *ue_cnt = 0;
+   *ce_cnt = 0;
+
if (!con)
-   return 0;
+   return;
 
list_for_each_entry(obj, &con->head, node) {
struct ras_query_if info = {
@@ -877,13 +880,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
 		};
 
if (amdgpu_ras_error_query(adev, &info))
-   return 0;
+   continue;
 
data.ce_count += info.ce_count;
data.ue_count += info.ue_count;
}
 
-   return is_ce ? data.ce_count : data.ue_count;
+   *ue_cnt = data.ue_count;
+   *ce_cnt = data.ce_count;
 }
 /* query/inject/cure end */
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index e7df5d8429f8..733eab5bc512 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -487,8 +487,8 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
 void amdgpu_ras_resume(struct amdgpu_device *adev);
 void amdgpu_ras_suspend(struct amdgpu_device *adev);
 
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-   bool is_c
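
The failure mode being fixed, as a sketch: the counters are read-clear,
so two separate query passes lose whichever counts the first pass
consumed.

	/* old API: two passes over the same read-clear registers */
	ue = amdgpu_ras_query_error_count(adev, false);	/* reads and clears */
	ce = amdgpu_ras_query_error_count(adev, true);	/* already cleared */

	/* new API: one pass fills both counters */
	unsigned long ue_cnt, ce_cnt;
	amdgpu_ras_query_error_count(adev, &ue_cnt, &ce_cnt);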

Re: [PATCH] drm/amd/amdgpu: Add SQ_DEBUG_STS_GLOBAL* registers/bits

2020-06-11 Thread Alex Deucher
On Thu, Jun 11, 2020 at 7:58 AM Tom St Denis  wrote:
>
> Even though they are technically MMIO registers, I put the bits in the sqind
> block for organizational purposes.
>
> Requested for UMR debugging.
>
> Signed-off-by: Tom St Denis 

Reviewed-by: Alex Deucher 

> ---
>  .../include/asic_reg/gc/gc_10_1_0_offset.h|  3 ++-
>  .../include/asic_reg/gc/gc_10_1_0_sh_mask.h   | 16 ++
>  .../include/asic_reg/gc/gc_10_3_0_offset.h|  3 ++-
>  .../include/asic_reg/gc/gc_10_3_0_sh_mask.h   | 16 ++
>  .../amd/include/asic_reg/gc/gc_9_0_offset.h   |  4 +++-
>  .../amd/include/asic_reg/gc/gc_9_0_sh_mask.h  | 22 +++
>  .../amd/include/asic_reg/gc/gc_9_1_offset.h   |  4 +++-
>  .../amd/include/asic_reg/gc/gc_9_1_sh_mask.h  | 21 ++
>  .../amd/include/asic_reg/gc/gc_9_2_1_offset.h |  4 +++-
>  .../include/asic_reg/gc/gc_9_2_1_sh_mask.h| 21 ++
>  10 files changed, 109 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_offset.h 
> b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_offset.h
> index 791dc2b3d74a..aab3d22c3b0f 100644
> --- a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_offset.h
> +++ b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_offset.h
> @@ -21,7 +21,8 @@
>  #ifndef _gc_10_1_0_OFFSET_HEADER
>  #define _gc_10_1_0_OFFSET_HEADER
>
> -
> +#define mmSQ_DEBUG_STS_GLOBAL                                    0x2309
> +#define mmSQ_DEBUG_STS_GLOBAL2                                   0x2310
>
>  // addressBlock: gc_sdma0_sdma0dec
>  // base address: 0x4980
> diff --git a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_sh_mask.h 
> b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_sh_mask.h
> index 355e61bed291..4127896ffcdf 100644
> --- a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_sh_mask.h
> +++ b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_sh_mask.h
> @@ -42546,6 +42546,22 @@
>
>
>  // addressBlock: sqind
> +//SQ_DEBUG_STS_GLOBAL
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX0_MASK 0x00ffL
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX0__SHIFT 0x
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX1_MASK 0xff00L
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX1__SHIFT 0x0008
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_COMPUTE_MASK 0xffL
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_COMPUTE__SHIFT 0x0010
> +#define SQ_DEBUG_STS_GLOBAL__BUSY_MASK 0x0001L
> +#define SQ_DEBUG_STS_GLOBAL__BUSY__SHIFT 0x
> +#define SQ_DEBUG_STS_GLOBAL__INTERRUPT_MSG_BUSY_MASK 0x0002L
> +#define SQ_DEBUG_STS_GLOBAL__INTERRUPT_MSG_BUSY__SHIFT 0x0001
> +#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0_MASK 0xfff0L
> +#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0__SHIFT 0x0004
> +#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA1_MASK 0x0fffL
> +#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA1__SHIFT 0x0010
> +
>  //SQ_DEBUG_STS_LOCAL
>  #define SQ_DEBUG_STS_LOCAL__BUSY_MASK                            0x0001L
>  #define SQ_DEBUG_STS_LOCAL__BUSY__SHIFT                          0x
> diff --git a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_offset.h 
> b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_offset.h
> index a9a66371b75e..16c7f6f2467e 100644
> --- a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_offset.h
> +++ b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_offset.h
> @@ -22,7 +22,8 @@
>  #ifndef _gc_10_3_0_OFFSET_HEADER
>  #define _gc_10_3_0_OFFSET_HEADER
>
> -
> +#define mmSQ_DEBUG_STS_GLOBAL                                    0x2309
> +#define mmSQ_DEBUG_STS_GLOBAL2                                   0x2310
>
>  // addressBlock: gc_sdma0_sdma0dec
>  // base address: 0x4980
> diff --git a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_sh_mask.h 
> b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_sh_mask.h
> index 499a8c3c2693..aac57f714cf1 100644
> --- a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_sh_mask.h
> +++ b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_sh_mask.h
> @@ -46269,6 +46269,22 @@
>
>
>  // addressBlock: sqind
> +//SQ_DEBUG_STS_GLOBAL
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX0_MASK 0x00ffL
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX0__SHIFT 0x
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX1_MASK 0xff00L
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX1__SHIFT 0x0008
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_COMPUTE_MASK 0xffL
> +#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_COMPUTE__SHIFT 0x0010
> +#define SQ_DEBUG_STS_GLOBAL__BUSY_MASK 0x0001L
> +#define SQ_DEBUG_STS_GLOBAL__BUSY__SHIFT 0x
> +#define SQ_DEBUG_STS_GLOBAL__INTERRUPT_MSG_BUSY_MASK 0x0002L
> +#define SQ_DEBUG_STS
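
Decoding follows the usual mask/shift pattern (a sketch; RREG32 is the
standard amdgpu register-read helper):

	u32 val = RREG32(mmSQ_DEBUG_STS_GLOBAL);
	u32 busy = (val & SQ_DEBUG_STS_GLOBAL__BUSY_MASK) >>
		   SQ_DEBUG_STS_GLOBAL__BUSY__SHIFT;
	u32 waves_sa0 = (val & SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0_MASK) >>
			SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0__SHIFT;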

Re: [PATCH] drm/amdgpu: remove distinction between explicit and implicit sync (v2)

2020-06-11 Thread Chunming Zhou
I didn't check the patch details; if it is for the existing implicit sync of
shared buffers, feel free to go ahead.


But if you add some description of its usage, that will be clearer
to others.


-David

On 2020/6/11 15:19, Marek Olšák wrote:

Hi David,

Explicit sync has nothing to do with this. This is for implicit sync, 
which is required by DRI3. This fix allows removing existing 
inefficiencies from drivers, so it's a good thing.


Marek

On Wed., Jun. 10, 2020, 03:56 Chunming Zhou wrote:



On 2020/6/10 15:41, Christian König wrote:

That's true, but for now we are stuck with the implicit sync for
quite a number of use cases.

My problem is rather that we already tried this and it backfired
immediately.

I do remember that it was your patch that introduced the pipeline
sync flag handling and I warned that this could be problematic.
You then came back with a QA result saying that this is indeed
causing a huge performance drop in one test case and we need to
do something else. Together we then came up with the different
handling between implicit and explicit sync.


Isn't the pipeline sync flag meant to fix an issue caused by parallel
execution between jobs in one pipeline? I really don't remember why
that's related to this. Or do you mean the extra sync hides many other
potential issues?

Anyway, when I go through the Vulkan WSI code, the synchronization
with the OS window system isn't so smooth. And when I saw Jason
driving explicit sync through the whole Linux ecosystem, like the
Android window system does, I felt that's really a good direction.

-David



But I can't find that stupid mail thread any more. I knew that it
was a couple of years ago when we started with the explicit sync
for Vulkan.

Christian.

On 10.06.20 08:29, Zhou, David(ChunMing) wrote:


[AMD Official Use Only - Internal Distribution Only]

Not sure if this is the right direction; I think usermode wants all
synchronization to be explicit. Implicit sync often confuses
people who don't know its history. I remember Jason from Intel
is driving explicit synchronization through the Linux
ecosystem, which even removes the implicit sync of shared buffers.

-David

*From:* amd-gfx *On Behalf Of* Marek Olšák
*Sent:* Tuesday, June 9, 2020 6:58 PM
*To:* amd-gfx mailing list

*Subject:* [PATCH] drm/amdgpu: remove distinction between
explicit and implicit sync (v2)

Hi,

This enables a full pipeline sync for implicit sync. It's
Christian's patch with the driver version bumped. With this,
user mode drivers don't have to wait for idle at the end of gfx IBs.

Any concerns?

Thanks,

Marek




[PATCH] drm/amd/amdgpu: Add SQ_DEBUG_STS_GLOBAL* registers/bits

2020-06-11 Thread Tom St Denis
Even though they are technically MMIO registers, I put the bits with the 
sqind block for organizational purposes.

Requested for UMR debugging.

Signed-off-by: Tom St Denis 
---
 .../include/asic_reg/gc/gc_10_1_0_offset.h|  3 ++-
 .../include/asic_reg/gc/gc_10_1_0_sh_mask.h   | 16 ++
 .../include/asic_reg/gc/gc_10_3_0_offset.h|  3 ++-
 .../include/asic_reg/gc/gc_10_3_0_sh_mask.h   | 16 ++
 .../amd/include/asic_reg/gc/gc_9_0_offset.h   |  4 +++-
 .../amd/include/asic_reg/gc/gc_9_0_sh_mask.h  | 22 +++
 .../amd/include/asic_reg/gc/gc_9_1_offset.h   |  4 +++-
 .../amd/include/asic_reg/gc/gc_9_1_sh_mask.h  | 21 ++
 .../amd/include/asic_reg/gc/gc_9_2_1_offset.h |  4 +++-
 .../include/asic_reg/gc/gc_9_2_1_sh_mask.h| 21 ++
 10 files changed, 109 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_offset.h 
b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_offset.h
index 791dc2b3d74a..aab3d22c3b0f 100644
--- a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_offset.h
+++ b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_offset.h
@@ -21,7 +21,8 @@
 #ifndef _gc_10_1_0_OFFSET_HEADER
 #define _gc_10_1_0_OFFSET_HEADER
 
-
+#define mmSQ_DEBUG_STS_GLOBAL  0x2309
+#define mmSQ_DEBUG_STS_GLOBAL2 0x2310
 
 // addressBlock: gc_sdma0_sdma0dec
 // base address: 0x4980
diff --git a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_sh_mask.h 
b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_sh_mask.h
index 355e61bed291..4127896ffcdf 100644
--- a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_sh_mask.h
+++ b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_1_0_sh_mask.h
@@ -42546,6 +42546,22 @@
 
 
 // addressBlock: sqind
+//SQ_DEBUG_STS_GLOBAL
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX0_MASK 0x000000ffL
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX0__SHIFT 0x00000000
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX1_MASK 0x0000ff00L
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX1__SHIFT 0x00000008
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_COMPUTE_MASK 0xffff0000L
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_COMPUTE__SHIFT 0x00000010
+#define SQ_DEBUG_STS_GLOBAL__BUSY_MASK 0x00000001L
+#define SQ_DEBUG_STS_GLOBAL__BUSY__SHIFT 0x00000000
+#define SQ_DEBUG_STS_GLOBAL__INTERRUPT_MSG_BUSY_MASK 0x00000002L
+#define SQ_DEBUG_STS_GLOBAL__INTERRUPT_MSG_BUSY__SHIFT 0x00000001
+#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0_MASK 0x0000fff0L
+#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0__SHIFT 0x00000004
+#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA1_MASK 0x0fff0000L
+#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA1__SHIFT 0x00000010
+
 //SQ_DEBUG_STS_LOCAL
 #define SQ_DEBUG_STS_LOCAL__BUSY_MASK 0x00000001L
 #define SQ_DEBUG_STS_LOCAL__BUSY__SHIFT 0x00000000
diff --git a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_offset.h 
b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_offset.h
index a9a66371b75e..16c7f6f2467e 100644
--- a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_offset.h
+++ b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_offset.h
@@ -22,7 +22,8 @@
 #ifndef _gc_10_3_0_OFFSET_HEADER
 #define _gc_10_3_0_OFFSET_HEADER
 
-
+#define mmSQ_DEBUG_STS_GLOBAL  0x2309
+#define mmSQ_DEBUG_STS_GLOBAL2 0x2310
 
 // addressBlock: gc_sdma0_sdma0dec
 // base address: 0x4980
diff --git a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_sh_mask.h 
b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_sh_mask.h
index 499a8c3c2693..aac57f714cf1 100644
--- a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_sh_mask.h
+++ b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_10_3_0_sh_mask.h
@@ -46269,6 +46269,22 @@
 
 
 // addressBlock: sqind
+//SQ_DEBUG_STS_GLOBAL
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX0_MASK 0x000000ffL
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX0__SHIFT 0x00000000
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX1_MASK 0x0000ff00L
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_GFX1__SHIFT 0x00000008
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_COMPUTE_MASK 0xffff0000L
+#define SQ_DEBUG_STS_GLOBAL2__FIFO_LEVEL_COMPUTE__SHIFT 0x00000010
+#define SQ_DEBUG_STS_GLOBAL__BUSY_MASK 0x00000001L
+#define SQ_DEBUG_STS_GLOBAL__BUSY__SHIFT 0x00000000
+#define SQ_DEBUG_STS_GLOBAL__INTERRUPT_MSG_BUSY_MASK 0x00000002L
+#define SQ_DEBUG_STS_GLOBAL__INTERRUPT_MSG_BUSY__SHIFT 0x00000001
+#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0_MASK 0x0000fff0L
+#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0__SHIFT 0x00000004
+#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA1_MASK 0x0fff0000L
+#define SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA1__SHIFT 0x00000010
+
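
For illustration, a UMR-style consumer of the new defines would read the
MMIO register and decode the fields roughly like this (a sketch only; the
RREG32() helper and the surrounding code are assumed, not part of this
patch):

    u32 v = RREG32(mmSQ_DEBUG_STS_GLOBAL);
    u32 busy = (v & SQ_DEBUG_STS_GLOBAL__BUSY_MASK) >>
               SQ_DEBUG_STS_GLOBAL__BUSY__SHIFT;
    u32 waves_sa0 = (v & SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0_MASK) >>
                    SQ_DEBUG_STS_GLOBAL__WAVE_LEVEL_SA0__SHIFT;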

Re: [PATCH] drm/amdgpu/jpeg: fix race condition issue for jpeg start

2020-06-11 Thread Leo Liu

Reviewed-by: Leo Liu 

On 2020-06-10 12:36 p.m., James Zhu wrote:

Fix a race condition when multiple jpeg starts are called.

Signed-off-by: James Zhu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c | 16 ++++++++++++----
  drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.h |  2 ++
  2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
index d31d65e..8996cb4e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
@@ -37,6 +37,8 @@ static void amdgpu_jpeg_idle_work_handler(struct work_struct *work);
 int amdgpu_jpeg_sw_init(struct amdgpu_device *adev)
 {
 	INIT_DELAYED_WORK(&adev->jpeg.idle_work, amdgpu_jpeg_idle_work_handler);
+	mutex_init(&adev->jpeg.jpeg_pg_lock);
+	atomic_set(&adev->jpeg.total_submission_cnt, 0);
 
 	return 0;
 }
@@ -54,6 +56,8 @@ int amdgpu_jpeg_sw_fini(struct amdgpu_device *adev)
 		amdgpu_ring_fini(&adev->jpeg.inst[i].ring_dec);
 	}
 
+	mutex_destroy(&adev->jpeg.jpeg_pg_lock);
+
 	return 0;
 }
 
@@ -83,7 +87,7 @@ static void amdgpu_jpeg_idle_work_handler(struct work_struct *work)
 		fences += amdgpu_fence_count_emitted(&adev->jpeg.inst[i].ring_dec);
 	}
 
-	if (fences == 0)
+	if (!fences && !atomic_read(&adev->jpeg.total_submission_cnt))
 		amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_JPEG,
 						       AMD_PG_STATE_GATE);
 	else
@@ -93,15 +97,19 @@ static void amdgpu_jpeg_idle_work_handler(struct work_struct *work)
 void amdgpu_jpeg_ring_begin_use(struct amdgpu_ring *ring)
 {
 	struct amdgpu_device *adev = ring->adev;
-	bool set_clocks = !cancel_delayed_work_sync(&adev->jpeg.idle_work);
 
-	if (set_clocks)
-		amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_JPEG,
+	atomic_inc(&adev->jpeg.total_submission_cnt);
+	cancel_delayed_work_sync(&adev->jpeg.idle_work);
+
+	mutex_lock(&adev->jpeg.jpeg_pg_lock);
+	amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_JPEG,
 					       AMD_PG_STATE_UNGATE);
+	mutex_unlock(&adev->jpeg.jpeg_pg_lock);
 }
 
 void amdgpu_jpeg_ring_end_use(struct amdgpu_ring *ring)
 {
+	atomic_dec(&ring->adev->jpeg.total_submission_cnt);
 	schedule_delayed_work(&ring->adev->jpeg.idle_work, JPEG_IDLE_TIMEOUT);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.h
index 5131a0a..55fbff2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.h
@@ -46,6 +46,8 @@ struct amdgpu_jpeg {
 	unsigned harvest_config;
 	struct delayed_work idle_work;
 	enum amd_powergating_state cur_state;
+	struct mutex jpeg_pg_lock;
+	atomic_t total_submission_cnt;
 };
 
 int amdgpu_jpeg_sw_init(struct amdgpu_device *adev);
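
For context, the shape of the fix is a common power-gating pattern,
sketched below with hypothetical names (not the actual driver code): the
submission count is published before the idle work is cancelled, so the
idle handler can never gate the block after a new submission has
committed to using it.

    void block_begin_use(struct block *b)
    {
            atomic_inc(&b->total_submission_cnt);    /* publish intent first */
            cancel_delayed_work_sync(&b->idle_work); /* idle work can't race */

            mutex_lock(&b->pg_lock);                 /* serialize gate/ungate */
            power_ungate(b);
            mutex_unlock(&b->pg_lock);
    }

    void block_end_use(struct block *b)
    {
            atomic_dec(&b->total_submission_cnt);
            schedule_delayed_work(&b->idle_work, IDLE_TIMEOUT);
    }

The idle worker then only gates when both the emitted-fence count and
total_submission_cnt are zero.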

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Daniel Vetter
On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin
 wrote:
>
>
> On 10/06/2020 16:17, Daniel Vetter wrote:
> > On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
> >  wrote:
> >>
> >>
> >> On 04/06/2020 09:12, Daniel Vetter wrote:
> >>> Design is similar to the lockdep annotations for workers, but with
> >>> some twists:
> >>>
> >>> - We use a read-lock for the execution/worker/completion side, so that
> >>> this explicit annotation can be more liberally sprinkled around.
> >>> With read locks lockdep isn't going to complain if the read-side
> >>> isn't nested the same way under all circumstances, so ABBA deadlocks
> >>> are ok. Which they are, since this is an annotation only.
> >>>
> >>> - We're using non-recursive lockdep read lock mode, since in recursive
> >>> read lock mode lockdep does not catch read side hazards. And we
> >>> _very_ much want read side hazards to be caught. For full details of
> >>> this limitation see
> >>>
> >>> commit e91498589746065e3ae95d9a00b068e525eec34f
> >>> Author: Peter Zijlstra 
> >>> Date:   Wed Aug 23 13:13:11 2017 +0200
> >>>
> >>> locking/lockdep/selftests: Add mixed read-write ABBA tests
> >>>
> >>> - To allow nesting of the read-side explicit annotations we explicitly
> >>> keep track of the nesting. lock_is_held() allows us to do that.
> >>>
> >>> - The wait-side annotation is a write lock, and entirely done within
> >>> dma_fence_wait() for everyone by default.
> >>>
> >>> - To be able to freely annotate helper functions I want to make it ok
> >>> to call dma_fence_begin/end_signalling from soft/hardirq context.
> >>> First attempt was using the hardirq locking context for the write
> >>> side in lockdep, but this forces all normal spinlocks nested within
> >>> dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> >>>
> >>> The approach now is to simply check in_atomic(), and for these cases
> >>> entirely rely on the might_sleep() check in dma_fence_wait(). That
> >>> will catch any wrong nesting against spinlocks from soft/hardirq
> >>> contexts.
> >>>
> >>> The idea here is that every code path that's critical for eventually
> >>> signalling a dma_fence should be annotated with
> >>> dma_fence_begin/end_signalling. The annotation ideally starts right
> >>> after a dma_fence is published (added to a dma_resv, exposed as a
> >>> sync_file fd, attached to a drm_syncobj fd, or anything else that
> >>> makes the dma_fence visible to other kernel threads), up to and
> >>> including the dma_fence_wait(). Examples are irq handlers, the
> >>> scheduler rt threads, the tail of execbuf (after the corresponding
> >>> fences are visible), any workers that end up signalling dma_fences and
> >>> really anything else. Not annotated should be code paths that only
> >>> complete fences opportunistically as the gpu progresses, like e.g.
> >>> shrinker/eviction code.
> >>>
> >>> The main class of deadlocks this is supposed to catch are:
> >>>
> >>> Thread A:
> >>>
> >>>mutex_lock(A);
> >>>mutex_unlock(A);
> >>>
> >>>dma_fence_signal();
> >>>
> >>> Thread B:
> >>>
> >>>mutex_lock(A);
> >>>dma_fence_wait();
> >>>mutex_unlock(A);
> >>>
> >>> Thread B is blocked on A signalling the fence, but A never gets around
> >>> to that because it cannot acquire the lock A.
> >>>
> >>> Note that dma_fence_wait() is allowed to be nested within
> >>> dma_fence_begin/end_signalling sections. To allow this to happen the
> >>> read lock needs to be upgraded to a write lock, which means that any
> >>> other lock is acquired between the dma_fence_begin_signalling() call and
> >>> the call to dma_fence_wait(), and still held, this will result in an
> >>> immediate lockdep complaint. The only other option would be to not
> >>> annotate such calls, defeating the point. Therefore these annotations
> >>> cannot be sprinkled over the code entirely mindless to avoid false
> >>> positives.
> >>>
> >>> v2: handle soft/hardirq ctx better against write side and dont forget
> >>> EXPORT_SYMBOL, drivers can't use this otherwise.
> >>>
> >>> v3: Kerneldoc.
> >>>
> >>> v4: Some spelling fixes from Mika
> >>>
> >>> Cc: Mika Kuoppala 
> >>> Cc: Thomas Hellstrom 
> >>> Cc: linux-me...@vger.kernel.org
> >>> Cc: linaro-mm-...@lists.linaro.org
> >>> Cc: linux-r...@vger.kernel.org
> >>> Cc: amd-gfx@lists.freedesktop.org
> >>> Cc: intel-...@lists.freedesktop.org
> >>> Cc: Chris Wilson 
> >>> Cc: Maarten Lankhorst 
> >>> Cc: Christian König 
> >>> Signed-off-by: Daniel Vetter 
> >>> ---
> >>>Documentation/driver-api/dma-buf.rst |  12 +-
> >>>drivers/dma-buf/dma-fence.c  | 161 +++
> >>>include/linux/dma-fence.h|  12 ++
> >>>3 files changed, 182 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/Documentation/driver-api/dma-buf.rst 
> >>> b/Documentation/driver-api/dma-buf.rst
> >>> index 63dec76d1d8d..05d856131140 100644
> >

RE: [PATCH] drm/amdgpu: correct ras query as part of ctx query

2020-06-11 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Guchun,
 The ras_manager object saves the error counters on every query,
therefore a previous query shouldn't affect the result of the current
one. Please check the function amdgpu_ras_error_query.
 
Best Regards
Dennis Li
-Original Message-
From: Chen, Guchun  
Sent: Thursday, June 11, 2020 6:24 PM
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
Zhou1, Tao ; Pan, Xinhui ; Li, Dennis 
; Clements, John 
Cc: Chen, Guchun 
Subject: [PATCH] drm/amdgpu: correct ras query as part of ctx query

Most error count registers are automatically cleared after being read once, so 
both the CE and UE counts need to be read in one loop.

Signed-off-by: Guchun Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 16 +++++++---------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 +++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 ++--
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index c06cb06398b1..29fa6b6b9d3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -335,7 +335,7 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 {
struct amdgpu_ctx *ctx;
struct amdgpu_ctx_mgr *mgr;
-   unsigned long ras_counter;
+   unsigned long ras_counter_ue, ras_counter_ce;
 
if (!fpriv)
return -EINVAL;
@@ -360,19 +360,17 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
if (atomic_read(&ctx->guilty))
out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
-   /*query ue count*/
-   ras_counter = amdgpu_ras_query_error_count(adev, false);
+   /*query both ue and ce count*/
+   amdgpu_ras_query_error_count(adev, &ras_counter_ue, &ras_counter_ce);
/*ras counter is monotonic increasing*/
-   if (ras_counter != ctx->ras_counter_ue) {
+   if (ras_counter_ue != ctx->ras_counter_ue) {
out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
-   ctx->ras_counter_ue = ras_counter;
+   ctx->ras_counter_ue = ras_counter_ue;
}
 
-   /*query ce count*/
-   ras_counter = amdgpu_ras_query_error_count(adev, true);
-   if (ras_counter != ctx->ras_counter_ce) {
+   if (ras_counter_ce != ctx->ras_counter_ce) {
out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
-   ctx->ras_counter_ce = ras_counter;
+   ctx->ras_counter_ce = ras_counter_ce;
}
 
mutex_unlock(&mgr->lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 337bf2da7bdc..109eff2869b9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -861,15 +861,18 @@ int amdgpu_ras_error_cure(struct amdgpu_device *adev,
 }
 
 /* get the total error counts on all IPs */
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce)
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+		unsigned long *ue_cnt, unsigned long *ce_cnt)
 {
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
struct ras_manager *obj;
struct ras_err_data data = {0, 0};
 
+   *ue_cnt = 0;
+   *ce_cnt = 0;
+
if (!con)
-   return 0;
+   return;
 
list_for_each_entry(obj, &con->head, node) {
struct ras_query_if info = {
@@ -877,13 +880,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
};
 
if (amdgpu_ras_error_query(adev, &info))
-   return 0;
+   continue;
 
data.ce_count += info.ce_count;
data.ue_count += info.ue_count;
}
 
-   return is_ce ? data.ce_count : data.ue_count;
+   *ue_cnt = data.ue_count;
+   *ce_cnt = data.ce_count;
 }
 /* query/inject/cure end */
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index e7df5d8429f8..733eab5bc512 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -487,8 +487,8 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
 void amdgpu_ras_resume(struct amdgpu_device *adev);
 void amdgpu_ras_suspend(struct amdgpu_device *adev);
 
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-   bool is_ce);
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+   unsigned long *ue_cnt, unsigned long *ce_cnt);
 
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
--
2.17.1


Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Tvrtko Ursulin

On 10/06/2020 16:17, Daniel Vetter wrote:
> On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
>  wrote:
>>
>>
>> On 04/06/2020 09:12, Daniel Vetter wrote:
>>> Design is similar to the lockdep annotations for workers, but with
>>> some twists:
>>>
>>> - We use a read-lock for the execution/worker/completion side, so that
>>> this explicit annotation can be more liberally sprinkled around.
>>> With read locks lockdep isn't going to complain if the read-side
>>> isn't nested the same way under all circumstances, so ABBA deadlocks
>>> are ok. Which they are, since this is an annotation only.
>>>
>>> - We're using non-recursive lockdep read lock mode, since in recursive
>>> read lock mode lockdep does not catch read side hazards. And we
>>> _very_ much want read side hazards to be caught. For full details of
>>> this limitation see
>>>
>>> commit e91498589746065e3ae95d9a00b068e525eec34f
>>> Author: Peter Zijlstra 
>>> Date:   Wed Aug 23 13:13:11 2017 +0200
>>>
>>> locking/lockdep/selftests: Add mixed read-write ABBA tests
>>>
>>> - To allow nesting of the read-side explicit annotations we explicitly
>>> keep track of the nesting. lock_is_held() allows us to do that.
>>>
>>> - The wait-side annotation is a write lock, and entirely done within
>>> dma_fence_wait() for everyone by default.
>>>
>>> - To be able to freely annotate helper functions I want to make it ok
>>> to call dma_fence_begin/end_signalling from soft/hardirq context.
>>> First attempt was using the hardirq locking context for the write
>>> side in lockdep, but this forces all normal spinlocks nested within
>>> dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
>>>
>>> The approach now is to simply check in_atomic(), and for these cases
>>> entirely rely on the might_sleep() check in dma_fence_wait(). That
>>> will catch any wrong nesting against spinlocks from soft/hardirq
>>> contexts.
>>>
>>> The idea here is that every code path that's critical for eventually
>>> signalling a dma_fence should be annotated with
>>> dma_fence_begin/end_signalling. The annotation ideally starts right
>>> after a dma_fence is published (added to a dma_resv, exposed as a
>>> sync_file fd, attached to a drm_syncobj fd, or anything else that
>>> makes the dma_fence visible to other kernel threads), up to and
>>> including the dma_fence_wait(). Examples are irq handlers, the
>>> scheduler rt threads, the tail of execbuf (after the corresponding
>>> fences are visible), any workers that end up signalling dma_fences and
>>> really anything else. Not annotated should be code paths that only
>>> complete fences opportunistically as the gpu progresses, like e.g.
>>> shrinker/eviction code.
>>>
>>> The main class of deadlocks this is supposed to catch are:
>>>
>>> Thread A:
>>>
>>>mutex_lock(A);
>>>mutex_unlock(A);
>>>
>>>dma_fence_signal();
>>>
>>> Thread B:
>>>
>>>mutex_lock(A);
>>>dma_fence_wait();
>>>mutex_unlock(A);
>>>
>>> Thread B is blocked on A signalling the fence, but A never gets around
>>> to that because it cannot acquire the lock A.
>>>
>>> Note that dma_fence_wait() is allowed to be nested within
>>> dma_fence_begin/end_signalling sections. To allow this to happen the
>>> read lock needs to be upgraded to a write lock, which means that any
>>> other lock is acquired between the dma_fence_begin_signalling() call and
>>> the call to dma_fence_wait(), and still held, this will result in an
>>> immediate lockdep complaint. The only other option would be to not
>>> annotate such calls, defeating the point. Therefore these annotations
>>> cannot be sprinkled over the code entirely mindless to avoid false
>>> positives.
>>>
>>> v2: handle soft/hardirq ctx better against write side and dont forget
>>> EXPORT_SYMBOL, drivers can't use this otherwise.
>>>
>>> v3: Kerneldoc.
>>>
>>> v4: Some spelling fixes from Mika
>>>
>>> Cc: Mika Kuoppala 
>>> Cc: Thomas Hellstrom 
>>> Cc: linux-me...@vger.kernel.org
>>> Cc: linaro-mm-...@lists.linaro.org
>>> Cc: linux-r...@vger.kernel.org
>>> Cc: amd-gfx@lists.freedesktop.org
>>> Cc: intel-...@lists.freedesktop.org
>>> Cc: Chris Wilson 
>>> Cc: Maarten Lankhorst 
>>> Cc: Christian König 
>>> Signed-off-by: Daniel Vetter 
>>> ---
>>>Documentation/driver-api/dma-buf.rst |  12 +-
>>>drivers/dma-buf/dma-fence.c  | 161 +++
>>>include/linux/dma-fence.h|  12 ++
>>>3 files changed, 182 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/driver-api/dma-buf.rst 
>>> b/Documentation/driver-api/dma-buf.rst
>>> index 63dec76d1d8d..05d856131140 100644
>>> --- a/Documentation/driver-api/dma-buf.rst
>>> +++ b/Documentation/driver-api/dma-buf.rst
>>> @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
>>>.. kernel-doc:: drivers/dma-buf/dma-buf.c
>>>   :doc: cpu access
>>>
>>> -Fence Poll Support
>>> -~~
>>>

[PATCH] drm/amdgpu: correct ras query as part of ctx query

2020-06-11 Thread Guchun Chen
Most error count registers are automatically cleared
after being read once, so both the CE and UE counts need
to be read in one loop.

Signed-off-by: Guchun Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 16 +++++++---------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 +++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 ++--
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index c06cb06398b1..29fa6b6b9d3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -335,7 +335,7 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 {
struct amdgpu_ctx *ctx;
struct amdgpu_ctx_mgr *mgr;
-   unsigned long ras_counter;
+   unsigned long ras_counter_ue, ras_counter_ce;
 
if (!fpriv)
return -EINVAL;
@@ -360,19 +360,17 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
if (atomic_read(&ctx->guilty))
out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
-   /*query ue count*/
-   ras_counter = amdgpu_ras_query_error_count(adev, false);
+   /*query both ue and ce count*/
+   amdgpu_ras_query_error_count(adev, &ras_counter_ue, &ras_counter_ce);
/*ras counter is monotonic increasing*/
-   if (ras_counter != ctx->ras_counter_ue) {
+   if (ras_counter_ue != ctx->ras_counter_ue) {
out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
-   ctx->ras_counter_ue = ras_counter;
+   ctx->ras_counter_ue = ras_counter_ue;
}
 
-   /*query ce count*/
-   ras_counter = amdgpu_ras_query_error_count(adev, true);
-   if (ras_counter != ctx->ras_counter_ce) {
+   if (ras_counter_ce != ctx->ras_counter_ce) {
out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
-   ctx->ras_counter_ce = ras_counter;
+   ctx->ras_counter_ce = ras_counter_ce;
}
 
mutex_unlock(&mgr->lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 337bf2da7bdc..109eff2869b9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -861,15 +861,18 @@ int amdgpu_ras_error_cure(struct amdgpu_device *adev,
 }
 
 /* get the total error counts on all IPs */
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-   bool is_ce)
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+   unsigned long *ue_cnt, unsigned long *ce_cnt)
 {
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
struct ras_manager *obj;
struct ras_err_data data = {0, 0};
 
+   *ue_cnt = 0;
+   *ce_cnt = 0;
+
if (!con)
-   return 0;
+   return;
 
list_for_each_entry(obj, &con->head, node) {
struct ras_query_if info = {
@@ -877,13 +880,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
};
 
if (amdgpu_ras_error_query(adev, &info))
-   return 0;
+   continue;
 
data.ce_count += info.ce_count;
data.ue_count += info.ue_count;
}
 
-   return is_ce ? data.ce_count : data.ue_count;
+   *ue_cnt = data.ue_count;
+   *ce_cnt = data.ce_count;
 }
 /* query/inject/cure end */
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index e7df5d8429f8..733eab5bc512 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -487,8 +487,8 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
 void amdgpu_ras_resume(struct amdgpu_device *adev);
 void amdgpu_ras_suspend(struct amdgpu_device *adev);
 
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-   bool is_ce);
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+   unsigned long *ue_cnt, unsigned long *ce_cnt);
 
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
-- 
2.17.1
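
For illustration, a caller under the reworked interface now gets both
counters from a single pass (a sketch; the variable names are mine):

    unsigned long ue_cnt, ce_cnt;

    amdgpu_ras_query_error_count(adev, &ue_cnt, &ce_cnt);
    /* one loop over the IP blocks: the read-to-clear registers are
     * consumed exactly once, so neither counter can be lost */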



Re: [PATCH] dma-fence: basic lockdep annotations

2020-06-11 Thread Maarten Lankhorst
Op 05-06-2020 om 15:29 schreef Daniel Vetter:
> Design is similar to the lockdep annotations for workers, but with
> some twists:
>
> - We use a read-lock for the execution/worker/completion side, so that
>   this explicit annotation can be more liberally sprinkled around.
>   With read locks lockdep isn't going to complain if the read-side
>   isn't nested the same way under all circumstances, so ABBA deadlocks
>   are ok. Which they are, since this is an annotation only.
>
> - We're using non-recursive lockdep read lock mode, since in recursive
>   read lock mode lockdep does not catch read side hazards. And we
>   _very_ much want read side hazards to be caught. For full details of
>   this limitation see
>
>   commit e91498589746065e3ae95d9a00b068e525eec34f
>   Author: Peter Zijlstra 
>   Date:   Wed Aug 23 13:13:11 2017 +0200
>
>   locking/lockdep/selftests: Add mixed read-write ABBA tests
>
> - To allow nesting of the read-side explicit annotations we explicitly
>   keep track of the nesting. lock_is_held() allows us to do that.
>
> - The wait-side annotation is a write lock, and entirely done within
>   dma_fence_wait() for everyone by default.
>
> - To be able to freely annotate helper functions I want to make it ok
>   to call dma_fence_begin/end_signalling from soft/hardirq context.
>   First attempt was using the hardirq locking context for the write
>   side in lockdep, but this forces all normal spinlocks nested within
>   dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
>
>   The approach now is to simply check in_atomic(), and for these cases
>   entirely rely on the might_sleep() check in dma_fence_wait(). That
>   will catch any wrong nesting against spinlocks from soft/hardirq
>   contexts.
>
> The idea here is that every code path that's critical for eventually
> signalling a dma_fence should be annotated with
> dma_fence_begin/end_signalling. The annotation ideally starts right
> after a dma_fence is published (added to a dma_resv, exposed as a
> sync_file fd, attached to a drm_syncobj fd, or anything else that
> makes the dma_fence visible to other kernel threads), up to and
> including the dma_fence_wait(). Examples are irq handlers, the
> scheduler rt threads, the tail of execbuf (after the corresponding
> fences are visible), any workers that end up signalling dma_fences and
> really anything else. Not annotated should be code paths that only
> complete fences opportunistically as the gpu progresses, like e.g.
> shrinker/eviction code.
>
> The main class of deadlocks this is supposed to catch are:
>
> Thread A:
>
>   mutex_lock(A);
>   mutex_unlock(A);
>
>   dma_fence_signal();
>
> Thread B:
>
>   mutex_lock(A);
>   dma_fence_wait();
>   mutex_unlock(A);
>
> Thread B is blocked on A signalling the fence, but A never gets around
> to that because it cannot acquire the lock A.
>
> Note that dma_fence_wait() is allowed to be nested within
> dma_fence_begin/end_signalling sections. To allow this to happen the
> read lock needs to be upgraded to a write lock, which means that any
> other lock is acquired between the dma_fence_begin_signalling() call and
> the call to dma_fence_wait(), and still held, this will result in an
> immediate lockdep complaint. The only other option would be to not
> annotate such calls, defeating the point. Therefore these annotations
> cannot be sprinkled over the code entirely mindless to avoid false
> positives.
>
> Originally I hope that the cross-release lockdep extensions would
> alleviate the need for explicit annotations:
>
> https://lwn.net/Articles/709849/
>
> But there's a few reasons why that's not an option:
>
> - It's not happening in upstream, since it got reverted due to too
>   many false positives:
>
>   commit e966eaeeb623f09975ef362c2866fae6f86844f9
>   Author: Ingo Molnar 
>   Date:   Tue Dec 12 12:31:16 2017 +0100
>
>   locking/lockdep: Remove the cross-release locking checks
>
>   This code (CONFIG_LOCKDEP_CROSSRELEASE=y and 
> CONFIG_LOCKDEP_COMPLETIONS=y),
>   while it found a number of old bugs initially, was also causing too 
> many
>   false positives that caused people to disable lockdep - which is 
> arguably
>   a worse overall outcome.
>
> - cross-release uses the complete() call to annotate the end of
>   critical sections, for dma_fence that would be dma_fence_signal().
>   But we do not want all dma_fence_signal() calls to be treated as
>   critical, since many are opportunistic cleanup of gpu requests. If
>   these get stuck there's still the main completion interrupt and
>   workers who can unblock everyone. Automatically annotating all
>   dma_fence_signal() calls would hence cause false positives.
>
> - cross-release had some educated guesses for when a critical section
>   starts, like fresh syscall or fresh work callback. This would again
>   cause false positives without explicit annotations, since for
>   dma_fence the critica

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Daniel Stone
Hi,

On Thu, 11 Jun 2020 at 09:44, Dave Airlie  wrote:
> On Thu, 11 Jun 2020 at 18:01, Chris Wilson  wrote:
> > Introducing a global lockmap that cannot capture the rules correctly,
>
> Can you document the rules all drivers should be following then,
> because from here it looks to get refactored every version of i915,
> and it would be nice if we could all aim for the same set of things
> roughly. We've already had enough problems with amdgpu vs i915 vs
> everyone else with fences, if this stops that in the future then I'd
> rather we have that than just some unwritten rules per driver and
> untestable.

As someone who has sunk a bunch of work into explicit-fencing
awareness in my compositor so I can never be blocked, I'd be
disappointed if the infrastructure was ultimately pointless because
the documented fencing rules were \_o_/ or thereabouts. Lockdep
definitely isn't my area of expertise so I can't comment on the patch
per se, but having something to ensure we don't hit deadlocks sure
seems a lot better than nothing.

Cheers,
Daniel


Re: [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Dave Airlie
On Thu, 11 Jun 2020 at 18:01, Chris Wilson  wrote:
>
> Quoting Daniel Vetter (2020-06-04 09:12:09)
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> >
> > - We use a read-lock for the execution/worker/completion side, so that
> >   this explicit annotation can be more liberally sprinkled around.
> >   With read locks lockdep isn't going to complain if the read-side
> >   isn't nested the same way under all circumstances, so ABBA deadlocks
> >   are ok. Which they are, since this is an annotation only.
> >
> > - We're using non-recursive lockdep read lock mode, since in recursive
> >   read lock mode lockdep does not catch read side hazards. And we
> >   _very_ much want read side hazards to be caught. For full details of
> >   this limitation see
> >
> >   commit e91498589746065e3ae95d9a00b068e525eec34f
> >   Author: Peter Zijlstra 
> >   Date:   Wed Aug 23 13:13:11 2017 +0200
> >
> >   locking/lockdep/selftests: Add mixed read-write ABBA tests
> >
> > - To allow nesting of the read-side explicit annotations we explicitly
> >   keep track of the nesting. lock_is_held() allows us to do that.
> >
> > - The wait-side annotation is a write lock, and entirely done within
> >   dma_fence_wait() for everyone by default.
> >
> > - To be able to freely annotate helper functions I want to make it ok
> >   to call dma_fence_begin/end_signalling from soft/hardirq context.
> >   First attempt was using the hardirq locking context for the write
> >   side in lockdep, but this forces all normal spinlocks nested within
> >   dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> >
> >   The approach now is to simply check in_atomic(), and for these cases
> >   entirely rely on the might_sleep() check in dma_fence_wait(). That
> >   will catch any wrong nesting against spinlocks from soft/hardirq
> >   contexts.
> >
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> >
> > The main class of deadlocks this is supposed to catch are:
> >
> > Thread A:
> >
> > mutex_lock(A);
> > mutex_unlock(A);
> >
> > dma_fence_signal();
> >
> > Thread B:
> >
> > mutex_lock(A);
> > dma_fence_wait();
> > mutex_unlock(A);
> >
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> >
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindless to avoid false
> > positives.
> >
> > v2: handle soft/hardirq ctx better against write side and dont forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> >
> > v3: Kerneldoc.
> >
> > v4: Some spelling fixes from Mika
> >
> > Cc: Mika Kuoppala 
> > Cc: Thomas Hellstrom 
> > Cc: linux-me...@vger.kernel.org
> > Cc: linaro-mm-...@lists.linaro.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-...@lists.freedesktop.org
> > Cc: Chris Wilson 
> > Cc: Maarten Lankhorst 
> > Cc: Christian König 
> > Signed-off-by: Daniel Vetter 
>
> Introducing a global lockmap that cannot capture the rules correctly,

Can you document the rules all drivers should be following then,
because from here it looks to get refactored every version of i915,
and it would be nice if we could all aim for the same set of things
roughly. We've already had enough problems with amdgpu vs i915 vs
everyone else with fences, if this stops that in the future then I'd
rather we have that than just some unwritten rules per driver and
untestable.

Dave.


Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-11 Thread Daniel Vetter
On Thu, Jun 11, 2020 at 09:30:12AM +0200, Thomas Hellström (Intel) wrote:
> 
> On 6/4/20 10:12 AM, Daniel Vetter wrote:
> > Two in one go:
> > - it is allowed to call dma_fence_wait() while holding a
> >dma_resv_lock(). This is fundamental to how eviction works with ttm,
> >so required.
> > 
> > - it is allowed to call dma_fence_wait() from memory reclaim contexts,
> >specifically from shrinker callbacks (which i915 does), and from mmu
> >notifier callbacks (which amdgpu does, and which i915 sometimes also
> >does, and probably always should, but that's kinda a debate). Also
> >for stuff like HMM we really need to be able to do this, or things
> >get real dicey.
> > 
> > Consequence is that any critical path necessary to get to a
> > dma_fence_signal for a fence must never a) call dma_resv_lock nor b)
> > allocate memory with GFP_KERNEL. Also by implication of
> > dma_resv_lock(), no userspace faulting allowed. That's some supremely
> > obnoxious limitations, which is why we need to sprinkle the right
> > annotations to all relevant paths.
> > 
> > The one big locking context we're leaving out here is mmu notifiers,
> > added in
> > 
> > commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
> > Author: Daniel Vetter 
> > Date:   Mon Aug 26 22:14:21 2019 +0200
> > 
> >  mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
> > 
> > that one covers a lot of other callsites, and it's also allowed to
> > wait on dma-fences from mmu notifiers. But there's no ready-made
> > functions exposed to prime this, so I've left it out for now.
> > 
> > v2: Also track against mmu notifier context.
> > 
> > v3: kerneldoc to spec the cross-driver contract. Note that currently
> > i915 throws in a hard-coded 10s timeout on foreign fences (not sure
> > why that was done, but it's there), which is why that rule is worded
> > with SHOULD instead of MUST.
> > 
> > Also some of the mmu_notifier/shrinker rules might surprise SoC
> > drivers, I haven't fully audited them all. Which is infeasible anyway,
> > we'll need to run them with lockdep and dma-fence annotations and see
> > what goes boom.
> > 
> > v4: A spelling fix from Mika
> > 
> > Cc: Mika Kuoppala 
> > Cc: Thomas Hellstrom 
> > Cc: linux-me...@vger.kernel.org
> > Cc: linaro-mm-...@lists.linaro.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-...@lists.freedesktop.org
> > Cc: Chris Wilson 
> > Cc: Maarten Lankhorst 
> > Cc: Christian König 
> > Signed-off-by: Daniel Vetter 
> > ---
> >   Documentation/driver-api/dma-buf.rst |  6 
> >   drivers/dma-buf/dma-fence.c  | 41 
> >   drivers/dma-buf/dma-resv.c   |  4 +++
> >   include/linux/dma-fence.h|  1 +
> >   4 files changed, 52 insertions(+)
> 
> I still have my doubts about allowing fence waiting from within shrinkers.
> IMO ideally they should use a trywait approach, in order to allow memory
> allocation during command submission for drivers that
> publish fences before command submission. (Since early reservation object
> release requires that).

Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up
with a mempool to make sure it can handle its allocations.
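
For illustration, that mempool idea reduces to something like this
sketch (struct my_job and the function names are hypothetical, not
actual drm/scheduler code):

    #include <linux/mempool.h>

    struct my_job { /* ... */ };
    static mempool_t *job_pool;

    static int my_sched_init(void)
    {
            /* preallocate so the signalling path never enters reclaim */
            job_pool = mempool_create_kmalloc_pool(16, sizeof(struct my_job));
            return job_pool ? 0 : -ENOMEM;
    }

    static struct my_job *my_job_alloc(void)
    {
            /* GFP_NOWAIT: no GFP_KERNEL reclaim on the fence critical path */
            return mempool_alloc(job_pool, GFP_NOWAIT);
    }

The pool refills from regular context, outside the signalling-critical
section.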

> But since drivers are already waiting from within shrinkers and I take your
> word for HMM requiring this,

Yeah the big trouble is HMM and mmu notifiers. That's the really awkward
one, the shrinker one is a lot less established.

I do wonder whether the mmu notifier constraint should only be set when
mmu notifiers are enabled, since on a bunch of arm-soc gpu drivers that
stuff just doesn't matter. But I expect that sooner or later these arm
gpus will show up in bigger arm cores, where you might want to have kvm
and maybe device virtualization and stuff, and then you need mmu
notifiers.

Plus having a very clear and consistent cross-driver api contract is imo
better than leaving this up to drivers and then having incompatible
assumptions.

I've pinged a bunch of armsoc gpu driver people and asked them how much this
hurts, so that we have a clear answer. On x86 I don't think we have much
of a choice on this, with userptr in amd and i915 and hmm work in nouveau
(but nouveau I think doesn't use dma_fence in there). I think it'll take
us a while to really bottom out on this specific question here.
-Daniel


> 
> Reviewed-by: Thomas Hellström 
> 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 1/6] drm/ttm: Add unampping of the entire device address space

2020-06-11 Thread Daniel Vetter
On Thu, Jun 11, 2020 at 08:12:37AM +0200, Thomas Hellström (Intel) wrote:
> 
> On 6/10/20 11:16 PM, Daniel Vetter wrote:
> > On Wed, Jun 10, 2020 at 10:30 PM Thomas Hellström (Intel)
> >  wrote:
> > > 
> > > On 6/10/20 5:30 PM, Daniel Vetter wrote:
> > > > On Wed, Jun 10, 2020 at 04:05:04PM +0200, Christian König wrote:
> > > > > Am 10.06.20 um 15:54 schrieb Andrey Grodzovsky:
> > > > > > On 6/10/20 6:15 AM, Thomas Hellström (Intel) wrote:
> > > > > > > On 6/9/20 7:21 PM, Koenig, Christian wrote:
> > > > > > > > Am 09.06.2020 18:37 schrieb "Grodzovsky, Andrey"
> > > > > > > > :
> > > > > > > > 
> > > > > > > > 
> > > > > > > >   On 6/5/20 2:40 PM, Christian König wrote:
> > > > > > > >   > Am 05.06.20 um 16:29 schrieb Andrey Grodzovsky:
> > > > > > > >   >>
> > > > > > > >   >> On 5/11/20 2:45 AM, Christian König wrote:
> > > > > > > >   >>> Am 09.05.20 um 20:51 schrieb Andrey Grodzovsky:
> > > > > > > >    Signed-off-by: Andrey Grodzovsky 
> > > > > > > > 
> > > > > > > >    ---
> > > > > > > >    drivers/gpu/drm/ttm/ttm_bo.c| 22 +++++++++++++++++++++-
> > > > > > > >    include/drm/ttm/ttm_bo_driver.h |  2 ++
> > > > > > > >      2 files changed, 23 insertions(+), 1 deletion(-)
> > > > > > > >   
> > > > > > > >    diff --git a/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > >    b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > >    index c5b516f..eae61cc 100644
> > > > > > > >    --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > >    +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > > > >    @@ -1750,9 +1750,29 @@ void 
> > > > > > > > ttm_bo_unmap_virtual(struct
> > > > > > > >    ttm_buffer_object *bo)
> > > > > > > >    ttm_bo_unmap_virtual_locked(bo);
> > > > > > > >    ttm_mem_io_unlock(man);
> > > > > > > >      }
> > > > > > > >    +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
> > > > > > > >      +void ttm_bo_unmap_virtual_address_space(struct
> > > > > > > >   ttm_bo_device *bdev)
> > > > > > > >    +{
> > > > > > > >    +struct ttm_mem_type_manager *man;
> > > > > > > >    +int i;
> > > > > > > >    -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
> > > > > > > >   >>>
> > > > > > > >    +for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
> > > > > > > >    +man = &bdev->man[i];
> > > > > > > >    +if (man->has_type && man->use_type)
> > > > > > > >    + ttm_mem_io_lock(man, false);
> > > > > > > >    +}
> > > > > > > >   >>>
> > > > > > > >   >>> You should drop that it will just result in a deadlock
> > > > > > > >   warning for
> > > > > > > >   >>> Nouveau and has no effect at all.
> > > > > > > >   >>>
> > > > > > > >   >>> Apart from that looks good to me,
> > > > > > > >   >>> Christian.
> > > > > > > >   >>
> > > > > > > >   >>
> > > > > > > >   >> As I am considering to re-include this in V2 of the
> > > > > > > >   patchsets, can
> > > > > > > >   >> you clarify please why this will have no effect at all 
> > > > > > > > ?
> > > > > > > >   >
> > > > > > > >   > The locks are exclusive for Nouveau to allocate/free 
> > > > > > > > the io
> > > > > > > >   address
> > > > > > > >   > space.
> > > > > > > >   >
> > > > > > > >   > Since we don't do this here we don't need the locks.
> > > > > > > >   >
> > > > > > > >   > Christian.
> > > > > > > > 
> > > > > > > > 
> > > > > > > >   So basically calling unmap_mapping_range doesn't require 
> > > > > > > > any extra
> > > > > > > >   locking around it and whatever locks are taken within the 
> > > > > > > > function
> > > > > > > >   should be enough ?
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > I think so, yes.
> > > > > > > > 
> > > > > > > > Christian.
> > > > > > > Yes, that's true. However, without the bo reservation, nothing 
> > > > > > > stops
> > > > > > > a PTE from being immediately re-faulted back again. Even while
> > > > > > > unmap_mapping_range() is running.
> > > > > > > 
> > > > > > Can you explain more on this - specifically, which function to 
> > > > > > reserve
> > > > > > the BO, why BO reservation would prevent re-fault of the PTE ?
> > > > > > 
> > > > > Thomas is talking about ttm_bo_reserver()/ttm_bo_unreserve(), but we 
> > > > > don't
> > > > > need this because we unmap everything because the whole device is 
> > > > > gone and
> > > > > not just manipulate a single BO.
> > > > > 
> > > > > > > So the device removed flag needs to be advertized before this
> > > > > > > function is run,
> > > > > > > 
> > > > > > I indeed intend to call this  right after calling drm_dev_unplug 
> > > > > > from
> > > > > > amdgpu_pci_remove while adding drm_dev_enter/exit in 
> > > > > > ttm_bo_vm_fault (or
> > > > > > in amdgpu specific wrapper since I don't se
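
For reference, the sweep under discussion reduces to one call per device
(a sketch; it relies on the bdev->dev_mapping address_space that TTM
already keeps for BO mmaps):

    /* zap every CPU PTE of every BO on this device; concurrent faults
     * must see the device-removed state first, or they can simply
     * re-establish the mappings while this runs */
    unmap_mapping_range(bdev->dev_mapping, 0, 0, 1);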

Re: [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Chris Wilson
Quoting Daniel Vetter (2020-06-04 09:12:09)
> Design is similar to the lockdep annotations for workers, but with
> some twists:
> 
> - We use a read-lock for the execution/worker/completion side, so that
>   this explicit annotation can be more liberally sprinkled around.
>   With read locks lockdep isn't going to complain if the read-side
>   isn't nested the same way under all circumstances, so ABBA deadlocks
>   are ok. Which they are, since this is an annotation only.
> 
> - We're using non-recursive lockdep read lock mode, since in recursive
>   read lock mode lockdep does not catch read side hazards. And we
>   _very_ much want read side hazards to be caught. For full details of
>   this limitation see
> 
>   commit e91498589746065e3ae95d9a00b068e525eec34f
>   Author: Peter Zijlstra 
>   Date:   Wed Aug 23 13:13:11 2017 +0200
> 
>   locking/lockdep/selftests: Add mixed read-write ABBA tests
> 
> - To allow nesting of the read-side explicit annotations we explicitly
>   keep track of the nesting. lock_is_held() allows us to do that.
> 
> - The wait-side annotation is a write lock, and entirely done within
>   dma_fence_wait() for everyone by default.
> 
> - To be able to freely annotate helper functions I want to make it ok
>   to call dma_fence_begin/end_signalling from soft/hardirq context.
>   First attempt was using the hardirq locking context for the write
>   side in lockdep, but this forces all normal spinlocks nested within
>   dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> 
>   The approach now is to simply check in_atomic(), and for these cases
>   entirely rely on the might_sleep() check in dma_fence_wait(). That
>   will catch any wrong nesting against spinlocks from soft/hardirq
>   contexts.
> 
> The idea here is that every code path that's critical for eventually
> signalling a dma_fence should be annotated with
> dma_fence_begin/end_signalling. The annotation ideally starts right
> after a dma_fence is published (added to a dma_resv, exposed as a
> sync_file fd, attached to a drm_syncobj fd, or anything else that
> makes the dma_fence visible to other kernel threads), up to and
> including the dma_fence_wait(). Examples are irq handlers, the
> scheduler rt threads, the tail of execbuf (after the corresponding
> fences are visible), any workers that end up signalling dma_fences and
> really anything else. Not annotated should be code paths that only
> complete fences opportunistically as the gpu progresses, like e.g.
> shrinker/eviction code.
> 
> The main class of deadlocks this is supposed to catch are:
> 
> Thread A:
> 
> mutex_lock(A);
> mutex_unlock(A);
> 
> dma_fence_signal();
> 
> Thread B:
> 
> mutex_lock(A);
> dma_fence_wait();
> mutex_unlock(A);
> 
> Thread B is blocked on A signalling the fence, but A never gets around
> to that because it cannot acquire the lock A.
> 
> Note that dma_fence_wait() is allowed to be nested within
> dma_fence_begin/end_signalling sections. To allow this to happen the
> read lock needs to be upgraded to a write lock, which means that any
> other lock is acquired between the dma_fence_begin_signalling() call and
> the call to dma_fence_wait(), and still held, this will result in an
> immediate lockdep complaint. The only other option would be to not
> annotate such calls, defeating the point. Therefore these annotations
> cannot be sprinkled over the code entirely mindless to avoid false
> positives.
> 
> v2: handle soft/hardirq ctx better against write side and dont forget
> EXPORT_SYMBOL, drivers can't use this otherwise.
> 
> v3: Kerneldoc.
> 
> v4: Some spelling fixes from Mika
> 
> Cc: Mika Kuoppala 
> Cc: Thomas Hellstrom 
> Cc: linux-me...@vger.kernel.org
> Cc: linaro-mm-...@lists.linaro.org
> Cc: linux-r...@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-...@lists.freedesktop.org
> Cc: Chris Wilson 
> Cc: Maarten Lankhorst 
> Cc: Christian König 
> Signed-off-by: Daniel Vetter 

Introducing a global lockmap that cannot capture the rules correctly,
Nacked-by: Chris Wilson 
-Chris
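
For readers following the objection: the mechanism at issue is one
driver-wide lockdep map shared by all fences, roughly like this
(a simplified sketch of the patch, omitting the irq and nesting details):

    static struct lockdep_map dma_fence_lockdep_map =
            STATIC_LOCKDEP_MAP_INIT("dma_fence_map", &dma_fence_lockdep_map);

    bool dma_fence_begin_signalling(void)
    {
            if (in_atomic())        /* covered by might_sleep() instead */
                    return true;
            if (lock_is_held(&dma_fence_lockdep_map))       /* nested */
                    return true;
            lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
            return false;
    }

    void dma_fence_end_signalling(bool cookie)
    {
            if (cookie)
                    return;
            lock_release(&dma_fence_lockdep_map, _RET_IP_);
    }

Because the map is global, per-driver fencing rules cannot be expressed
on it, which is the substance of the nack.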


Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-11 Thread Intel


On 6/4/20 10:12 AM, Daniel Vetter wrote:

Two in one go:
- it is allowed to call dma_fence_wait() while holding a
   dma_resv_lock(). This is fundamental to how eviction works with ttm,
   so required.

- it is allowed to call dma_fence_wait() from memory reclaim contexts,
   specifically from shrinker callbacks (which i915 does), and from mmu
   notifier callbacks (which amdgpu does, and which i915 sometimes also
   does, and probably always should, but that's kinda a debate). Also
   for stuff like HMM we really need to be able to do this, or things
   get real dicey.

Consequence is that any critical path necessary to get to a
dma_fence_signal for a fence must never a) call dma_resv_lock nor b)
allocate memory with GFP_KERNEL. Also by implication of
dma_resv_lock(), no userspace faulting allowed. That's some supremely
obnoxious limitations, which is why we need to sprinkle the right
annotations to all relevant paths.

The one big locking context we're leaving out here is mmu notifiers,
added in

commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter 
Date:   Mon Aug 26 22:14:21 2019 +0200

 mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end

that one covers a lot of other callsites, and it's also allowed to
wait on dma-fences from mmu notifiers. But there's no ready-made
functions exposed to prime this, so I've left it out for now.

v2: Also track against mmu notifier context.

v3: kerneldoc to spec the cross-driver contract. Note that currently
i915 throws in a hard-coded 10s timeout on foreign fences (not sure
why that was done, but it's there), which is why that rule is worded
with SHOULD instead of MUST.

Also some of the mmu_notifier/shrinker rules might surprise SoC
drivers, I haven't fully audited them all. Which is infeasible anyway,
we'll need to run them with lockdep and dma-fence annotations and see
what goes boom.

v4: A spelling fix from Mika

Cc: Mika Kuoppala 
Cc: Thomas Hellstrom 
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---
  Documentation/driver-api/dma-buf.rst |  6 
  drivers/dma-buf/dma-fence.c  | 41 
  drivers/dma-buf/dma-resv.c   |  4 +++
  include/linux/dma-fence.h|  1 +
  4 files changed, 52 insertions(+)


I still have my doubts about allowing fence waiting from within 
shrinkers. IMO ideally they should use a trywait approach, in order to 
allow memory allocation during command submission for drivers that
publish fences before command submission. (Since early reservation 
object release requires that).


But since drivers are already waiting from within shrinkers and I take 
your word for HMM requiring this,


Reviewed-by: Thomas Hellström 
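
For illustration, "priming" means taking the allowed lock combinations
once at init time so lockdep learns them up front. A sketch under
assumed names (the actual patch does the equivalent inside dma-resv and
dma-fence):

    static int __init prime_fence_lockdep(void)
    {
            struct ww_acquire_ctx ctx;
            struct dma_resv obj;

            dma_resv_init(&obj);
            ww_acquire_init(&ctx, &reservation_ww_class);
            dma_resv_lock(&obj, &ctx);      /* waits allowed under resv */
            fs_reclaim_acquire(GFP_KERNEL); /* and from reclaim contexts */
            fs_reclaim_release(GFP_KERNEL);
            dma_resv_unlock(&obj);
            ww_acquire_fini(&ctx);
            dma_resv_fini(&obj);
            return 0;
    }
    subsys_initcall(prime_fence_lockdep);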




Re: [PATCH] drm/amdgpu: remove distinction between explicit and implicit sync (v2)

2020-06-11 Thread Marek Olšák
Hi David,

Explicit sync has nothing to do with this. This is for implicit sync, which
is required by DRI3. This fix allows removing existing inefficiencies from
drivers, so it's a good thing.

Marek

On Wed., Jun. 10, 2020, 03:56 Chunming Zhou wrote:

>
> On 2020/6/10 15:41, Christian König wrote:
>
> That's true, but for now we are stuck with the implicit sync for quite a
> number of use cases.
>
> My problem is rather that we already tried this and it backfired
> immediately.
>
> I do remember that it was your patch that introduced the pipeline sync flag
> handling and I warned that this could be problematic. You then came back
> with a QA result saying that this is indeed causing a huge performance drop
> in one test case and we need to do something else. Together we then came up
> with the different handling between implicit and explicit sync.
>
> Isn't the pipeline sync flag to fix some issue caused by parallel execution
> between jobs in one pipeline?  I really don't remember why
> that's related to this. Or do you mean the extra sync hides many other
> potential issues?
>
> Anyway, when I go through the Vulkan WSI code, the synchronization with
> the OS window system isn't so smooth. And when I saw Jason driving
> explicit sync through the whole Linux ecosystem like the Android window
> system does, I felt that's really a good direction.
>
> -David
>
>
> But I can't find that stupid mail thread any more. I knew that it was a
> couple of years ago when we started with the explicit sync for Vulkan.
>
> Christian.
>
> On 10.06.20 08:29, Zhou, David(ChunMing) wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
>
> Not sure if this is the right direction; I think usermode wants all
> synchronizations to be explicit. Implicit sync often confuses people who
> don’t know its history. I remember Jason from Intel is driving explicit
> synchronization through the Linux ecosystem, which even removes implicit
> sync of shared buffers.
>
>
>
> -David
>
>
>
> *From:* amd-gfx *On Behalf Of* Marek Olšák
> *Sent:* Tuesday, June 9, 2020 6:58 PM
> *To:* amd-gfx mailing list
>
> *Subject:* [PATCH] drm/amdgpu: remove distinction between explicit and
> implicit sync (v2)
>
>
>
> Hi,
>
>
>
> This enables a full pipeline sync for implicit sync. It's Christian's
> patch with the driver version bumped. With this, user mode drivers don't
> have to wait for idle at the end of gfx IBs.
>
>
>
> Any concerns?
>
>
>
> Thanks,
>
> Marek
>