RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Liu, Monk
[AMD Official Use Only]

>>>
I'm not sure I can add much to help this along, I'm sure Alex has some internal 
training,
Once your driver is upstream, it belongs to upstream, you can maintain it, but 
you no longer control it 100%, it's a tradeoff, it's not one companies always 
understand.
Usually people are fine developing away internally, but once interaction with 
other parts of the kernel/subsystem is required they have the realisation that 
they needed to work upstream 6 months earlier.
The best time to interact with upstream was 6 months ago, the second best time 
is now.
<<<

Daniel/AlexD

I didn't mean that your changes to the AMD driver need my personal approval or 
review ... and I'm already used to the fact that our driver is not 100% under 
AMD's control, 
but presumably anyone from the community (including you) who intends to change 
AMD's driver should at least get approval from someone at AMD, e.g. AlexD or 
Christian. Isn't that reasonable?
Just like we need your approval if we try to modify DRM-sched, or need 
panfrost's approval if we need to change panfrost code ...

Only CC'ing AMD's engineers doesn't seem quite proper; how do you know whether 
your changes (to the AMD code) conflict with AMD's ongoing internal 
features/refactoring or not?

Thanks 

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: Dave Airlie  
Sent: Thursday, September 2, 2021 2:51 AM
To: Alex Deucher 
Cc: Liu, Monk ; Daniel Vetter ; Koenig, 
Christian ; Grodzovsky, Andrey 
; Chen, JingWen ; DRI 
Development ; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

On Thu, 2 Sept 2021 at 01:20, Alex Deucher  wrote:
>
> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk  wrote:
> >
> > [AMD Official Use Only]
> >
> > Daniel
> >
> > From the link you shared it looks like you (or someone else) have quite a bunch of 
> > patches that change DRM_SCHED or even amdgpu; in that case, before they are 
> > merged to the kernel tree, I'm wondering whether any AMD developer reviewed them?
> >
> > They look to me to conflict somewhat with what we changed in our repo.
> >
> > It is really chaotic for AMDers if someone outside of AMD changes our 
> > kernel driver (and/or scheduler) without review by an AMDer, just as we 
> > require your review if we intend to change the scheduler's logic here. 
> >
> > This one changes AMD's code: 
> > https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
> > And I didn't see any reviewed-by from AMDers ...
> >
> > This one also touches AMD's code: 
> > https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
> > Which conflicts with one patch we submitted (in our repo 
> > right now), and I don't see that an AMDer gave a Reviewed-by on this one 
> > either (let me know if I missed it)
> >
>
> Monk, this is not how upstream works.  You need to participate.
> That's how communities work.  There's a reason all these discussions 
> happen on public mailing lists.  The patch author can't be expected to 
> know every person on every vendor team to CC with a patch.  If you 
> have concerns, you need to raise them when the patches are being 
> discussed.
>

I'm not sure I can add much to help this along, I'm sure Alex has some internal 
training,

Once your driver is upstream, it belongs to upstream, you can maintain it, but 
you no longer control it 100%, it's a tradeoff, it's not one companies always 
understand.

Usually people are fine developing away internally, but once interaction with 
other parts of the kernel/subsystem is required they have the realisation that 
they needed to work upstream 6 months earlier.

The best time to interact with upstream was 6 months ago, the second best time 
is now.

Dave.


Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Lazar, Lijo

Thanks Felix for the detailed explanation.

Thanks,
Lijo

On 9/1/2021 10:17 PM, Felix Kuehling wrote:

Am 2021-09-01 um 12:30 p.m. schrieb Lazar, Lijo:


[Public]


What I wanted to ask was -

Whether user mode application relies only on link properties alone to
assume atomic ops are supported? If they check only link properties
and if the firmware doesn't work fine, should it be still marked as
supported?


Let's be clear what "firmware doesn't work fine" means in this context.
It means "firmware requires PCIe atomics". If firmware requires PCIe
atomics and the system doesn't support PCIe atomics, KFD will not use
the GPU and will not report the GPU to user mode.

If firmware does not require PCIe atomics, or if PCIe atomics work on
the system, KFD will use the GPU and will report the atomic capability
to user mode in the IO link attribute.




Basically, what is the purpose of exposing atomic capability in link
properties and whether that can be utilised by upper mode applications
just based on PCIe atomics support?


Applications can use PCIe atomics by using atomic shader instructions
when accessing system memory in GPU shader code. If the system doesn't
support PCIe atomics, these atomic operations are silently dropped.
Therefore the application must check the atomic capability in the IO
link properties before relying on these instructions for system memory.
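
For illustration (not part of this thread), a minimal user-space sketch of such
a check could look like the following. It assumes the usual KFD topology sysfs
layout under /sys/class/kfd/kfd/topology/nodes/*/io_links/*/properties and the
NO_ATOMICS flag bit positions from kfd_crat.h; the path, node/link indices and
bit values are assumptions to verify against the kernel in use.

/*
 * Sketch: read one KFD IO link "properties" file and report whether the link
 * advertises PCIe atomics. Flag bit positions below are assumptions taken
 * from kfd_crat.h (CRAT_IOLINK_FLAGS_NO_ATOMICS_32_BIT / _64_BIT).
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define IOLINK_FLAGS_NO_ATOMICS_32_BIT (1u << 2)	/* assumed bit */
#define IOLINK_FLAGS_NO_ATOMICS_64_BIT (1u << 3)	/* assumed bit */

static int iolink_supports_atomics(const char *props_path)
{
	FILE *f = fopen(props_path, "r");
	char name[64];
	unsigned long long val;
	uint32_t flags = 0;

	if (!f)
		return -1;
	/* The properties file is a list of "name value" pairs, one per line. */
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (strcmp(name, "flags") == 0)
			flags = (uint32_t)val;
	}
	fclose(f);
	/* Atomics are usable only if neither NO_ATOMICS bit is set. */
	return !(flags & (IOLINK_FLAGS_NO_ATOMICS_32_BIT |
			  IOLINK_FLAGS_NO_ATOMICS_64_BIT));
}

int main(void)
{
	/* Hypothetical node/link indices; a real application enumerates them. */
	const char *path =
		"/sys/class/kfd/kfd/topology/nodes/1/io_links/0/properties";
	int ok = iolink_supports_atomics(path);

	if (ok < 0)
		perror("open properties");
	else
		printf("PCIe atomics %s on this IO link\n",
		       ok ? "usable" : "not usable");
	return 0;
}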

Regards,
   Felix




Thanks,
Lijo

*From:* Kuehling, Felix 
*Sent:* Wednesday, September 1, 2021 8:24:56 PM
*To:* Lazar, Lijo ; amd-gfx@lists.freedesktop.org

*Subject:* Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics
FW-version dependent
  
Am 2021-09-01 um 7:04 a.m. schrieb Lazar, Lijo:



On 9/1/2021 3:26 AM, Felix Kuehling wrote:

On some GPUs the PCIe atomic requirement for KFD depends on the MEC
firmware version. Add a firmware version check for this. The minimum
firmware version that works without atomics can be updated in the
device_info structure for each GPU type.

Signed-off-by: Felix Kuehling 
---
   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++--
   drivers/gpu/drm/amd/amdkfd/kfd_priv.h   | 1 +
   2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 16a57b70cc1a..655ee5733229 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -688,6 +688,7 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
   struct kfd_dev *kfd;
   const struct kfd_device_info *device_info;
   const struct kfd2kgd_calls *f2g;
+    uint32_t fw_version;
     if (asic_type >= sizeof(kfd_supported_devices) / (sizeof(void
*) * 2)
   || asic_type >= sizeof(kfd2kgd_funcs) / sizeof(void *)) {
@@ -713,8 +714,12 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
    * supported.
    */
   kfd->pci_atomic_requested =
amdgpu_amdkfd_have_atomics_support(kgd);


Should the check be grouped inside amdgpu_amdkfd_have_atomics_support?

This flag is used for setting some link properties. If there is HW
support but comes with incompatible firmware, should the link be still
marked as atomic?


Our GPU HW always supports PCIe atomics (it's part of the PCIe 3 spec).
But some mainboards with older PCIe chipsets do not. Sometimes even
different ports on the same mainboard differ in their PCIe version and
atomic support.

amdgpu_device_init always tries to enable atomics on the root port and
all the bridges leading to the GPU by calling
pci_enable_atomic_ops_to_root. The result is saved in
adev->have_atomics_support, which is returned to KFD by
amdgpu_amdkfd_have_atomics_support.
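
For reference, a hedged sketch of that enable-and-remember pattern (not the
actual amdgpu code): pci_enable_atomic_ops_to_root() walks the bridges up to
the root port and fails if any hop lacks AtomicOp routing. The helper name and
where the result is cached are illustrative only.

#include <linux/pci.h>

/* Sketch: request 32- and 64-bit AtomicOp completer support end to end and
 * remember whether it worked (amdgpu keeps the result in
 * adev->have_atomics_support, as described above). */
static bool try_enable_pcie_atomics(struct pci_dev *pdev)
{
	int r = pci_enable_atomic_ops_to_root(pdev,
					      PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
					      PCI_EXP_DEVCAP2_ATOMIC_COMP64);

	return r == 0;
}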

The firmware change here does not affect whether atomics are
_supported_. It changes whether atomics are _required_ for the basic
operation of AQL user mode queues. The coming firmware update will
remove that requirement, which allows us to enable KFD for these GPUs+FW
on systems without PCIe atomics.

Enabling PCIe atomics with the updated FW is still beneficial because
shader programs can use a subset of atomic instructions for accessing
system memory atomically on supported systems.

Regards,
   Felix




Thanks,
Lijo


-    if (device_info->needs_pci_atomics &&
-    !kfd->pci_atomic_requested) {
+    fw_version = amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1);
+    if (!kfd->pci_atomic_requested &&
+    device_info->needs_pci_atomics &&
+    (!device_info->no_atomic_fw_version ||
+  amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1) <
+    device_info->no_atomic_fw_version)) {
   dev_info(kfd_device,
    "skipped device %x:%x, PCI rejects atomics\n",
    pdev->vendor, pdev->device);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index ab83b0de6b22..6d8f9bb2d905 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/am

Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Dave Chinner
On Wed, Sep 01, 2021 at 07:07:34PM -0400, Felix Kuehling wrote:
> On 2021-09-01 6:03 p.m., Dave Chinner wrote:
> > On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> > > Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
> > > > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> > > > > > > driver code is not really involved in updating the CPU mappings. 
> > > > > > > Maybe
> > > > > > > it's something we need to do in the migration helpers.
> > > > > > It looks like I'm totally misunderstanding what you are adding here
> > > > > > then.  Why do we need any special treatment at all for memory that
> > > > > > has normal struct pages and is part of the direct kernel map?
> > > > > The pages are like normal memory for purposes of mapping them in CPU
> > > > > page tables and for coherent access from the CPU.
> > > > That's the user page tables.  What about the kernel direct map?
> > > > If there is a normal kernel struct page backing there really should
> > > > be no need for the pgmap.
> > > I'm not sure. The physical address ranges are in the UEFI system address
> > > map as special-purpose memory. Does Linux create the struct pages and
> > > kernel direct map for that without a pgmap call? I didn't see that last
> > > time I went digging through that code.
> > > 
> > > 
> > > > >  From an application
> > > > > perspective, we want file-backed and anonymous mappings to be able to
> > > > > use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> > > > > optimize performance for GPU heavy workloads while minimizing the need
> > > > > to migrate data back-and-forth between system memory and device 
> > > > > memory.
> > > > I don't really understand that part.  file backed pages are always
> > > > allocated by the file system using the pagecache helpers, that is
> > > > using the page allocator.  Anonymous memory also always comes from
> > > > the page allocator.
> > > I'm coming at this from my experience with DEVICE_PRIVATE. Both
> > > anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> > > memory by the migrate_vma_* helpers for more efficient access by our
> > > GPU. (*) It's part of the basic premise of HMM as I understand it. I
> > > would expect the same thing to work for DEVICE_PUBLIC memory.
> > > 
> > > (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
> > > currently work, but that's something I'm hoping to fix at some point.
> > FWIW, I'd love to see the architecture documents that define how
> > filesystems are supposed to interact with this device private
> > memory. This whole "hand filesystem controlled memory to other
> > devices" is a minefield that is trivial to get wrong iand very
> > difficult to fix - just look at the historical mess that RDMA
> > to/from file backed and/or DAX pages has been.
> > 
> > So, really, from my perspective as a filesystem engineer, I want to
> > see an actual specification for how this new memory type is going to
> > interact with filesystem and the page cache so everyone has some
> > idea of how this is going to work and can point out how it doesn't
> > work before code that simply doesn't work is pushed out into
> > production systems and then merged
> 
> OK. To be clear, that's not part of this patch series. And I have no
> authority to push anything in this part of the kernel, so you have nothing
> to fear. ;)

I know this isn't part of the series. but this patchset is laying
the foundation for future work that will impact subsystems that
currently have zero visibility and/or knowledge of these changes.
There must be an overall architectural plan for this functionality,
regardless of the current state of implementation. It's that overall
architectural plan I'm asking about here, because I need to
understand that before I can sanely comment on the page
cache/filesystem aspect of the proposed functionality...

> FWIW, we already have the ability to map file-backed system memory pages
> into device page tables with HMM and interval notifiers. But we cannot
> currently migrate them to ZONE_DEVICE pages.

Sure, but sharing page cache pages allocated and managed by the
filesystem is not what you are talking about. You're talking about
migrating page cache data to completely different memory allocated
by a different memory manager that the filesystems currently have no
knowledge of or have any way of interfacing with.

So I'm asking basic, fundamental questions about how these special
device based pages are going to work.  How are these pages different
to normal pages - does page_lock() still guarantee exclusive access
to the page state across all hardware that can access it? If not,
what provides access serialisation for pages that are allocated in
device memory rather than CPU memory (e.g. for truncate
serialisation)?  Does the hardware that own these pages raise page
faults on the CPU when those pages are accessed/dirtied? How does
demand paging in and out of device memor

Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Dave Chinner
On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> 
> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
> > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
>  driver code is not really involved in updating the CPU mappings. Maybe
>  it's something we need to do in the migration helpers.
> >>> It looks like I'm totally misunderstanding what you are adding here
> >>> then.  Why do we need any special treatment at all for memory that
> >>> has normal struct pages and is part of the direct kernel map?
> >> The pages are like normal memory for purposes of mapping them in CPU
> >> page tables and for coherent access from the CPU.
> > That's the user page tables.  What about the kernel direct map?
> > If there is a normal kernel struct page backing there really should
> > be no need for the pgmap.
> 
> I'm not sure. The physical address ranges are in the UEFI system address
> map as special-purpose memory. Does Linux create the struct pages and
> kernel direct map for that without a pgmap call? I didn't see that last
> time I went digging through that code.
> 
> 
> >
> >> From an application
> >> perspective, we want file-backed and anonymous mappings to be able to
> >> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> >> optimize performance for GPU heavy workloads while minimizing the need
> >> to migrate data back-and-forth between system memory and device memory.
> > I don't really understand that part.  file backed pages are always
> > allocated by the file system using the pagecache helpers, that is
> > using the page allocator.  Anonymous memory also always comes from
> > the page allocator.
> 
> I'm coming at this from my experience with DEVICE_PRIVATE. Both
> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> memory by the migrate_vma_* helpers for more efficient access by our
> GPU. (*) It's part of the basic premise of HMM as I understand it. I
> would expect the same thing to work for DEVICE_PUBLIC memory.
> 
> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
> currently work, but that's something I'm hoping to fix at some point.

FWIW, I'd love to see the architecture documents that define how
filesystems are supposed to interact with this device private
memory. This whole "hand filesystem controlled memory to other
devices" is a minefield that is trivial to get wrong iand very
difficult to fix - just look at the historical mess that RDMA
to/from file backed and/or DAX pages has been.

So, really, from my perspective as a filesystem engineer, I want to
see an actual specification for how this new memory type is going to
interact with filesystem and the page cache so everyone has some
idea of how this is going to work and can point out how it doesn't
work before code that simply doesn't work is pushed out into
production systems and then merged

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Felix Kuehling

On 2021-09-01 6:03 p.m., Dave Chinner wrote:

On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:

Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:

On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:

driver code is not really involved in updating the CPU mappings. Maybe
it's something we need to do in the migration helpers.

It looks like I'm totally misunderstanding what you are adding here
then.  Why do we need any special treatment at all for memory that
has normal struct pages and is part of the direct kernel map?

The pages are like normal memory for purposes of mapping them in CPU
page tables and for coherent access from the CPU.

That's the user page tables.  What about the kernel direct map?
If there is a normal kernel struct page backing there really should
be no need for the pgmap.

I'm not sure. The physical address ranges are in the UEFI system address
map as special-purpose memory. Does Linux create the struct pages and
kernel direct map for that without a pgmap call? I didn't see that last
time I went digging through that code.



 From an application
perspective, we want file-backed and anonymous mappings to be able to
use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
optimize performance for GPU heavy workloads while minimizing the need
to migrate data back-and-forth between system memory and device memory.

I don't really understand that part.  file backed pages are always
allocated by the file system using the pagecache helpers, that is
using the page allocator.  Anonymous memory also always comes from
the page allocator.

I'm coming at this from my experience with DEVICE_PRIVATE. Both
anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
memory by the migrate_vma_* helpers for more efficient access by our
GPU. (*) It's part of the basic premise of HMM as I understand it. I
would expect the same thing to work for DEVICE_PUBLIC memory.

(*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
currently work, but that's something I'm hoping to fix at some point.

FWIW, I'd love to see the architecture documents that define how
filesystems are supposed to interact with this device private
memory. This whole "hand filesystem controlled memory to other
devices" is a minefield that is trivial to get wrong iand very
difficult to fix - just look at the historical mess that RDMA
to/from file backed and/or DAX pages has been.

So, really, from my perspective as a filesystem engineer, I want to
see an actual specification for how this new memory type is going to
interact with filesystem and the page cache so everyone has some
idea of how this is going to work and can point out how it doesn't
work before code that simply doesn't work is pushed out into
production systems and then merged


OK. To be clear, that's not part of this patch series. And I have no 
authority to push anything in this part of the kernel, so you have 
nothing to fear. ;)


FWIW, we already have the ability to map file-backed system memory pages 
into device page tables with HMM and interval notifiers. But we cannot 
currently migrate them to ZONE_DEVICE pages. Beyond that, my 
understanding of how filesystems and page cache work is rather 
superficial at this point. I'll keep your name in mind for when I am 
ready to discuss this in more detail.


Cheers,
  Felix




Cheers,

Dave.


[pull] amdgpu, amdkfd drm-next-5.15

2021-09-01 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 5.15.

The following changes since commit 8f0284f190e6a0aa09015090568c03f18288231a:

  Merge tag 'amd-drm-next-5.15-2021-08-27' of 
https://gitlab.freedesktop.org/agd5f/linux into drm-next (2021-08-30 09:06:03 
+1000)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-next-5.15-2021-09-01

for you to fetch changes up to d6043581e1d9d0507a8413a302db0e35c8506e0e:

  drm/amdkfd: drop process ref count when xnack disable (2021-09-01 16:55:09 
-0400)


amd-drm-next-5.15-2021-09-01:

amdgpu:
- Misc cleanups, typo fixes
- EEPROM fix
- Add some new PCI IDs
- Scatter/Gather display support for Yellow Carp
- PCIe DPM fix for RKL platforms
- RAS fix

amdkfd:
- SVM fix


Aaron Liu (1):
  drm/amd/display: setup system context for APUs

Alex Deucher (1):
  drm/amdgpu: add some additional RDNA2 PCI IDs

Alex Sierra (1):
  drm/amdkfd: drop process ref count when xnack disable

Angus Wang (1):
  drm/amd/display: cleanup idents after a revert

Anson Jacob (1):
  drm/amd/display: Fix memory leak reported by coverity

Colin Ian King (1):
  drm/amdgpu/swsmu: fix spelling mistake "minimun" -> "minimum"

Evan Quan (1):
  drm/amdgpu: reenable BACO support for 699F:C7 polaris12 SKU

Guchun Chen (1):
  drm/amdgpu: stop scheduler when calling hw_fini (v2)

Jiawei Gu (1):
  drm/amdgpu: enable more pm sysfs under SRIOV 1-VF mode

Jing Yangyang (1):
  drm:dcn31: fix boolreturn.cocci warnings

John Clements (1):
  drm/amdgpu: Clear RAS interrupt status on aldebaran

Kees Cook (1):
  drm/amd/pm: And destination bounds checking to struct copy

Koba Ko (1):
  drm/amdgpu: Disable PCIE_DPM on Intel RKL Platform

Lang Yu (1):
  drm/amdgpu: show both cmd id and name when psp cmd failed

Luben Tuikov (2):
  drm/amdgpu: Fixes to returning VBIOS RAS EEPROM address
  drm/amdgpu: Process any VBIOS RAS EEPROM address

Michael Strauss (1):
  drm/amd/display: Initialize lt_settings on instantiation

Nicholas Kazlauskas (1):
  drm/amdgpu: Enable S/G for Yellow Carp

Philip Yang (1):
  drm/amdgpu: fix fdinfo race with process exit

Yifan Zhang (1):
  drm/amdgpu: correct comments in memory type managers

YuBiao Wang (1):
  drm/amd/amdgpu: Add ready_to_reset resp for vega10

xinhui pan (1):
  drm/amdgpu: Fix a deadlock if previous GEM object allocation fails

 drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c   | 50 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 17 
 drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 11 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  8 
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c| 23 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c|  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  8 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 25 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c   |  6 +--
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  2 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 30 ++---
 drivers/gpu/drm/amd/amdgpu/vi.c|  9 +---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c   |  3 +-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c  |  2 +-
 drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c   |  6 +--
 .../drm/amd/display/dc/dcn303/dcn303_resource.c|  6 ++-
 .../drm/amd/display/dc/dcn31/dcn31_panel_cntl.c|  4 +-
 .../display/dc/dml/dcn20/display_mode_vba_20v2.c   |  2 +-
 drivers/gpu/drm/amd/pm/amdgpu_pm.c |  8 ++--
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 24 +++
 .../gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c| 17 +++-
 drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c  |  6 +--
 drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c|  8 ++--
 drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c   |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c|  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c |  5 +--
 .../gpu/drm/amd/pm/swsmu/smu13/yellow_carp_ppt.c   |  2 +-
 29 files changed, 198 insertions(+), 94 deletions(-)


Re: [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

2021-09-01 Thread Alex Deucher
On Wed, Sep 1, 2021 at 2:50 AM Christian König
 wrote:
>
> Am 01.09.21 um 02:46 schrieb Monk Liu:
> > issue:
> > in cleanup_job, cancel_delayed_work will cancel a TO timer
> > even though its corresponding job is still running.
> >
> > fix:
> > do not cancel the timer in cleanup_job; instead do the cancelling
> > only when the head job is signaled, and if there is a "next" job
> > we start the timeout again.
> >
> > v2:
> > further cleanup the logic, and do the TDR timer cancelling if the signaled 
> > job
> > is the last one in its scheduler.
> >
> > v3:
> > change the issue description
> > remove the cancel_delayed_work at the beginning of cleanup_job
> > restore the implementation of drm_sched_job_begin.
> >
> > v4:
> > remove the kthread_should_park() checking in cleanup_job routine,
> > we should cleanup the signaled job asap
> >
> > TODO:
> > 1) introduce pause/resume of the scheduler in job_timeout to serialize the
> > handling of the scheduler and job_timeout.
> > 2)drop the bad job's del and insert in scheduler due to above serialization
> > (no race issue anymore with the serialization)
> >
> > tested-by: jingwen 
> > Signed-off-by: Monk Liu 
>
> Reviewed-by: Christian König 
>

Are you planning to push this to drm-misc?

Alex


> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 26 +-
> >   1 file changed, 9 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index a2a9536..3e0bbc7 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler 
> > *sched)
> >   {
> >   struct drm_sched_job *job, *next;
> >
> > - /*
> > -  * Don't destroy jobs while the timeout worker is running  OR thread
> > -  * is being parked and hence assumed to not touch pending_list
> > -  */
> > - if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> > - !cancel_delayed_work(&sched->work_tdr)) ||
> > - kthread_should_park())
> > - return NULL;
> > -
> >   spin_lock(&sched->job_list_lock);
> >
> >   job = list_first_entry_or_null(&sched->pending_list,
> > @@ -693,17 +684,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler 
> > *sched)
> >   if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
> >   /* remove job from pending_list */
> >   list_del_init(&job->list);
> > +
> > + /* cancel this job's TO timer */
> > + cancel_delayed_work(&sched->work_tdr);
> >   /* make the scheduled timestamp more accurate */
> >   next = list_first_entry_or_null(&sched->pending_list,
> >   typeof(*next), list);
> > - if (next)
> > +
> > + if (next) {
> >   next->s_fence->scheduled.timestamp =
> >   job->s_fence->finished.timestamp;
> > -
> > + /* start TO timer for next job */
> > + drm_sched_start_timeout(sched);
> > + }
> >   } else {
> >   job = NULL;
> > - /* queue timeout for next job */
> > - drm_sched_start_timeout(sched);
> >   }
> >
> >   spin_unlock(&sched->job_list_lock);
> > @@ -791,11 +786,8 @@ static int drm_sched_main(void *param)
> > (entity = 
> > drm_sched_select_entity(sched))) ||
> >kthread_should_stop());
> >
> > - if (cleanup_job) {
> > + if (cleanup_job)
> >   sched->ops->free_job(cleanup_job);
> > - /* queue timeout for next job */
> > - drm_sched_start_timeout(sched);
> > - }
> >
> >   if (!entity)
> >   continue;
>


Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread Felix Kuehling
If it's not too late, please add

Fixes: 2383f56bbe4a ("drm/amdkfd: page table restore through svm API")

Thanks,
  Felix


Am 2021-09-01 um 1:54 p.m. schrieb Felix Kuehling:
> Am 2021-09-01 um 12:59 p.m. schrieb Kim, Jonathan:
>> [Public]
>>
>>
>> [Public]
>>
>>
>> I wouldn’t know if it was another bug elsewhere.
>>
>> From what I was seeing, the leak was coming from !p->xnack_enable on
>> the svm_range_restore_pages call.
>>
>> If it helps, I saw this on Aldebaran where a shader does some bad
>> memory access on purpose on a debugged ptraced child process.
>>
> On Aldebaran the XNACK mode can be changed per process. But the page
> fault interrupts are retry faults (until they get turned into no-retry
> faults by updating the PTE in amdgpu_vm_handle_fault). The retry faults
> go into svm_range_restore_pages before they realize that the process in
> question doesn't use XNACK.
>
> The patch is
>
> Reviewed-by: Felix Kuehling 
>
>
>> The vm fault prompt pops up in dmesgs and a stale KFD process appends
>> per run without this fix.
>>
>> I’m just assuming at this point that the IV retry bit is set but I
>> never confirmed that.
>>
>>  
>>
>> Thanks,
>>
>>  
>>
>> Jon
>>
>> *From:* Yang, Philip 
>> *Sent:* Wednesday, September 1, 2021 12:30 PM
>> *To:* Kim, Jonathan ; Yang, Philip
>> ; Sierra Guiza, Alejandro (Alex)
>> ; amd-gfx@lists.freedesktop.org
>> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when xnack
>> disable
>>
>>  
>>
>>  
>>
>> On 2021-09-01 9:45 a.m., Kim, Jonathan wrote:
>>
>> [AMD Official Use Only]
>>
>>  
>>
>> We were seeing process leaks on a couple of machines running
>> certain tests that triggered vm faults on purpose.
>>
>> I think svm_range_restore_pages gets called unconditionally on vm
>> fault handling (unless the retry interrupt payload bit is supposed
>> to be clear with xnack off)?
>>
>>  
>>
>> yes, with xnack off, sh_mem_config retry should be off, retry bit is
>> supposed to be clear in fault interrupt vector, we should not try to
>> recover vm fault, just report the vm fault back to application and
>> evict user queues. Maybe it is another bug cause p->xnack_enabled and
>> sh_mem_config retry mismatch under specific condition?
>>
>> Regards,
>>
>> Philip
>>
>> Either way, this patch prevents the process leaks we seeing and is
>> also:
>>
>> Reviewed-by: Jonathan Kim 
>> 
>>
>>  
>>
>> Thanks,
>>
>>  
>>
>> Jon
>>
>>  
>>
>>  
>>
>> *From:* amd-gfx 
>>  *On Behalf Of
>> *philip yang
>> *Sent:* Wednesday, September 1, 2021 7:30 AM
>> *To:* Sierra Guiza, Alejandro (Alex) 
>> ; amd-gfx@lists.freedesktop.org
>> 
>> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when
>> xnack disable
>>
>>  
>>
>> [CAUTION: External Email]
>>
>>  
>>
>> On 2021-08-31 10:41 p.m., Alex Sierra wrote:
>>
>> During svm restore pages interrupt handler, kfd_process ref count was
>>
>> never dropped when xnack was disabled. Therefore, the object was 
>> never
>>
>> released.
>>
>> Good catch, but if xnack is off, we should not get here to recover
>> fault.
>>
>> The fix looks good to me.
>>
>> Reviewed-by: Philip Yang 
>> 
>>
>>  
>>
>> Signed-off-by: Alex Sierra  
>> 
>>
>> ---
>>
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-
>>
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>>  
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>
>> index 8f9b5b53dab5..110c46cd7fac 100644
>>
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>
>> @@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device 
>> *adev, unsigned int pasid,
>>
>>      }
>>
>>      if (!p->xnack_enabled) {
>>
>>     pr_debug("XNACK not enabled for pasid 0x%x\n", 
>> pasid);
>>
>> -   return -EFAULT;
>>
>> +   r = -EFAULT;
>>
>> +   goto out;
>>
>>      }
>>
>>      svms = &p->svms;
>>
>>  
>>


Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Dave Airlie
On Thu, 2 Sept 2021 at 01:20, Alex Deucher  wrote:
>
> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk  wrote:
> >
> > [AMD Official Use Only]
> >
> > Daniel
> >
> > From the link you shared it looks like you (or someone else) have quite a bunch of 
> > patches that change DRM_SCHED or even amdgpu; in that case, before they are 
> > merged to the kernel tree, I'm wondering whether any AMD developer reviewed them?
> >
> > They look to me to conflict somewhat with what we changed in our repo.
> >
> > It is really chaotic for AMDers if someone outside of AMD changes our 
> > kernel driver (and/or scheduler) without review by an AMDer, just as we 
> > require your review if we intend to change the scheduler's logic here. 
> >
> > This one changes AMD's code: 
> > https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
> > And I didn't see any reviewed-by from AMDers ...
> >
> > This one also touches AMD's code: 
> > https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/
> > Which conflicts with one patch we submitted (in our repo right now), 
> > and I don't see that an AMDer gave a Reviewed-by on this one either (let me 
> > know if I missed it)
> >
>
> Monk, this is not how upstream works.  You need to participate.
> That's how communities work.  There's a reason all these discussions
> happen on public mailing lists.  The patch author can't be expected to
> know every person on every vendor team to CC with a patch.  If you
> have concerns, you need to raise them when the patches are being
> discussed.
>

I'm not sure I can add much to help this along, I'm sure Alex has some
internal training,

Once your driver is upstream, it belongs to upstream, you can maintain
it, but you no longer control it 100%, it's a tradeoff, it's not one
companies always understand.

Usually people are fine developing away internally, but once
interaction with other parts of the kernel/subsystem is required they
have the realisation that they needed to work upstream 6 months
earlier.

The best time to interact with upstream was 6 months ago, the second
best time is now.

Dave.


Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread Felix Kuehling
Am 2021-09-01 um 12:59 p.m. schrieb Kim, Jonathan:
>
> [Public]
>
>
> [Public]
>
>
> I wouldn’t know if it was another bug elsewhere.
>
> From what I was seeing, the leak was coming from !p->xnack_enable on
> the svm_range_restore_pages call.
>
> If it helps, I saw this on Aldebaran where a shader does some bad
> memory access on purpose on a debugged ptraced child process.
>
On Aldebaran the XNACK mode can be changed per process. But the page
fault interrupts are retry faults (until they get turned into no-retry
faults by updating the PTE in amdgpu_vm_handle_fault). The retry faults
go into svm_range_restore_pages before they realize that the process in
question doesn't use XNACK.

The patch is

Reviewed-by: Felix Kuehling 


> The vm fault prompt pops up in dmesgs and a stale KFD process appends
> per run without this fix.
>
> I’m just assuming at this point that the IV retry bit is set but I
> never confirmed that.
>
>  
>
> Thanks,
>
>  
>
> Jon
>
> *From:* Yang, Philip 
> *Sent:* Wednesday, September 1, 2021 12:30 PM
> *To:* Kim, Jonathan ; Yang, Philip
> ; Sierra Guiza, Alejandro (Alex)
> ; amd-gfx@lists.freedesktop.org
> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when xnack
> disable
>
>  
>
>  
>
> On 2021-09-01 9:45 a.m., Kim, Jonathan wrote:
>
> [AMD Official Use Only]
>
>  
>
> We were seeing process leaks on a couple of machines running
> certain tests that triggered vm faults on purpose.
>
> I think svm_range_restore_pages gets called unconditionally on vm
> fault handling (unless the retry interrupt payload bit is supposed
> to be clear with xnack off)?
>
>  
>
> yes, with xnack off, sh_mem_config retry should be off, retry bit is
> supposed to be clear in fault interrupt vector, we should not try to
> recover vm fault, just report the vm fault back to application and
> evict user queues. Maybe it is another bug cause p->xnack_enabled and
> sh_mem_config retry mismatch under specific condition?
>
> Regards,
>
> Philip
>
> Either way, this patch prevents the process leaks we seeing and is
> also:
>
> Reviewed-by: Jonathan Kim 
> 
>
>  
>
> Thanks,
>
>  
>
> Jon
>
>  
>
>  
>
> *From:* amd-gfx 
>  *On Behalf Of
> *philip yang
> *Sent:* Wednesday, September 1, 2021 7:30 AM
> *To:* Sierra Guiza, Alejandro (Alex) 
> ; amd-gfx@lists.freedesktop.org
> 
> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when
> xnack disable
>
>  
>
> [CAUTION: External Email]
>
>  
>
> On 2021-08-31 10:41 p.m., Alex Sierra wrote:
>
> During svm restore pages interrupt handler, kfd_process ref count was
>
> never dropped when xnack was disabled. Therefore, the object was never
>
> released.
>
> Good catch, but if xnack is off, we should not get here to recover
> fault.
>
> The fix looks good to me.
>
> Reviewed-by: Philip Yang 
> 
>
>  
>
> Signed-off-by: Alex Sierra  
> 
>
> ---
>
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-
>
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
>  
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>
> index 8f9b5b53dab5..110c46cd7fac 100644
>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>
> @@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device 
> *adev, unsigned int pasid,
>
>      }
>
>      if (!p->xnack_enabled) {
>
>     pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
>
> -   return -EFAULT;
>
> +   r = -EFAULT;
>
> +   goto out;
>
>      }
>
>      svms = &p->svms;
>
>  
>


RE: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread Kim, Jonathan
[Public]

I wouldn’t know if it was another bug elsewhere.
From what I was seeing, the leak was coming from !p->xnack_enable on the 
svm_range_restore_pages call.

If it helps, I saw this on Aldebaran where a shader does some bad memory access 
on purpose on a debugged ptraced child process.
The vm fault prompt pops up in dmesgs and a stale KFD process appends per run 
without this fix.
I’m just assuming at this point that the IV retry bit is set but I never 
confirmed that.

Thanks,

Jon
From: Yang, Philip 
Sent: Wednesday, September 1, 2021 12:30 PM
To: Kim, Jonathan ; Yang, Philip ; 
Sierra Guiza, Alejandro (Alex) ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable



On 2021-09-01 9:45 a.m., Kim, Jonathan wrote:

[AMD Official Use Only]

We were seeing process leaks on a couple of machines running certain tests that 
triggered vm faults on purpose.
I think svm_range_restore_pages gets called unconditionally on vm fault 
handling (unless the retry interrupt payload bit is supposed to be clear with 
xnack off)?


yes, with xnack off, sh_mem_config retry should be off, retry bit is supposed 
to be clear in fault interrupt vector, we should not try to recover vm fault, 
just report the vm fault back to application and evict user queues. Maybe it is 
another bug cause p->xnack_enabled and sh_mem_config retry mismatch under 
specific condition?

Regards,

Philip
Either way, this patch prevents the process leaks we seeing and is also:
Reviewed-by: Jonathan Kim 

Thanks,

Jon


From: amd-gfx 

 On Behalf Of philip yang
Sent: Wednesday, September 1, 2021 7:30 AM
To: Sierra Guiza, Alejandro (Alex) 
; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

[CAUTION: External Email]


On 2021-08-31 10:41 p.m., Alex Sierra wrote:

During svm restore pages interrupt handler, kfd_process ref count was

never dropped when xnack was disabled. Therefore, the object was never

released.

Good catch, but if xnack is off, we should not get here to recover fault.

The fix looks good to me.

Reviewed-by: Philip Yang 



Signed-off-by: Alex Sierra 

---

 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-

 1 file changed, 2 insertions(+), 1 deletion(-)



diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

index 8f9b5b53dab5..110c46cd7fac 100644

--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

@@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,

 }

 if (!p->xnack_enabled) {

pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);

-   return -EFAULT;

+   r = -EFAULT;

+   goto out;

 }

 svms = &p->svms;




Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Felix Kuehling
Am 2021-09-01 um 12:30 p.m. schrieb Lazar, Lijo:
>
> [Public]
>
>
> What I wanted to ask was -
>
> Whether user mode application relies only on link properties alone to
> assume atomic ops are supported? If they check only link properties
> and if the firmware doesn't work fine, should it be still marked as
> supported?

Let's be clear what "firmware doesn't work fine" means in this context.
It means "firmware requires PCIe atomics". If firmware requires PCIe
atomics and the system doesn't support PCIe atomics, KFD will not use
the GPU and will not report the GPU to user mode.

If firmware does not require PCIe atomics, or if PCIe atomics work on
the system, KFD will use the GPU and will report the atomic capability
to user mode in the IO link attribute.


>
> Basically, what is the purpose of exposing atomic capability in link
> properties and whether that can be utilised by upper mode applications
> just based on PCIe atomics support?

Applications can use PCIe atomics by using atomic shader instructions
when accessing system memory in GPU shader code. If the system doesn't
support PCIe atomics, these atomic operations are silently dropped.
Therefore the application must check the atomic capability in the IO
link properties before relying on these instructions for system memory.

Regards,
  Felix


>
> Thanks,
> Lijo
> 
> *From:* Kuehling, Felix 
> *Sent:* Wednesday, September 1, 2021 8:24:56 PM
> *To:* Lazar, Lijo ; amd-gfx@lists.freedesktop.org
> 
> *Subject:* Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics
> FW-version dependent
>  
> Am 2021-09-01 um 7:04 a.m. schrieb Lazar, Lijo:
> >
> >
> > On 9/1/2021 3:26 AM, Felix Kuehling wrote:
> >> On some GPUs the PCIe atomic requirement for KFD depends on the MEC
> >> firmware version. Add a firmware version check for this. The minimum
> >> firmware version that works without atomics can be updated in the
> >> device_info structure for each GPU type.
> >>
> >> Signed-off-by: Felix Kuehling 
> >> ---
> >>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++--
> >>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h   | 1 +
> >>   2 files changed, 8 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >> index 16a57b70cc1a..655ee5733229 100644
> >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >> @@ -688,6 +688,7 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
> >>   struct kfd_dev *kfd;
> >>   const struct kfd_device_info *device_info;
> >>   const struct kfd2kgd_calls *f2g;
> >> +    uint32_t fw_version;
> >>     if (asic_type >= sizeof(kfd_supported_devices) / (sizeof(void
> >> *) * 2)
> >>   || asic_type >= sizeof(kfd2kgd_funcs) / sizeof(void *)) {
> >> @@ -713,8 +714,12 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
> >>    * supported.
> >>    */
> >>   kfd->pci_atomic_requested =
> >> amdgpu_amdkfd_have_atomics_support(kgd);
> >
> > Should the check be grouped inside amdgpu_amdkfd_have_atomics_support?
> >
> > This flag is used for setting some link properties. If there is HW
> > support but comes with incompatible firmware, should the link be still
> > marked as atomic?
>
> Our GPU HW always supports PCIe atomics (it's part of the PCIe 3 spec).
> But some mainboards with older PCIe chipsets do not. Sometimes even
> different ports on the same mainboard differ in their PCIe version and
> atomic support.
>
> amdgpu_device_init always tries to enable atomics on the root port and
> all the bridges leading to the GPU by calling
> pci_enable_atomic_ops_to_root. The result is saved in
> adev->have_atomics_support, which is returned to KFD by
> amdgpu_amdkfd_have_atomics_support.
>
> The firmware change here does not affect whether atomics are
> _supported_. It changes whether atomics are _required_ for the basic
> operation of AQL user mode queues. The coming firmware update will
> remove that requirement, which allows us to enable KFD for these GPUs+FW
> on systems without PCIe atomics.
>
> Enabling PCIe atomics with the updated FW is still beneficial because
> shader programs can use a subset of atomic instructions for accessing
> system memory atomically on supported systems.
>
> Regards,
>   Felix
>
>
> >
> > Thanks,
> > Lijo
> >
> >> -    if (device_info->needs_pci_atomics &&
> >> -    !kfd->pci_atomic_requested) {
> >> +    fw_version = amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1);
> >> +    if (!kfd->pci_atomic_requested &&
> >> +    device_info->needs_pci_atomics &&
> >> +    (!device_info->no_atomic_fw_version ||
> >> +  amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1) <
> >> +    device_info->no_atomic_fw_version)) {
> >>   dev_info(kfd_device,
> >>    "skipped device %x:%x, PCI rejects atomics\n",
> >>    pdev->vendor, pdev->devic

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Alex Deucher
On Wed, Sep 1, 2021 at 12:30 PM Lazar, Lijo  wrote:
>
> [Public]
>
>
> What I wanted to ask was -
>
> Whether user mode application relies only on link properties alone to assume 
> atomic ops are supported? If they check only link properties and if the 
> firmware doesn't work fine, should it be still marked as supported?
>
> Basically, what is the purpose of exposing atomic capability in link 
> properties and whether that can be utilised by upper mode applications just 
> based on PCIe atomics support?
>

PCI atomics in general and the requirement for PCI atomics in the CP
firmware are independent.  The firmware can operate either with
atomics or without.  The operation of the firmware does not affect
user processes that might want to use atomics for other things.

Alex


> Thanks,
> Lijo
> 
> From: Kuehling, Felix 
> Sent: Wednesday, September 1, 2021 8:24:56 PM
> To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org 
> 
> Subject: Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version 
> dependent
>
> Am 2021-09-01 um 7:04 a.m. schrieb Lazar, Lijo:
> >
> >
> > On 9/1/2021 3:26 AM, Felix Kuehling wrote:
> >> On some GPUs the PCIe atomic requirement for KFD depends on the MEC
> >> firmware version. Add a firmware version check for this. The minimum
> >> firmware version that works without atomics can be updated in the
> >> device_info structure for each GPU type.
> >>
> >> Signed-off-by: Felix Kuehling 
> >> ---
> >>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++--
> >>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h   | 1 +
> >>   2 files changed, 8 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >> index 16a57b70cc1a..655ee5733229 100644
> >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >> @@ -688,6 +688,7 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
> >>   struct kfd_dev *kfd;
> >>   const struct kfd_device_info *device_info;
> >>   const struct kfd2kgd_calls *f2g;
> >> +uint32_t fw_version;
> >> if (asic_type >= sizeof(kfd_supported_devices) / (sizeof(void
> >> *) * 2)
> >>   || asic_type >= sizeof(kfd2kgd_funcs) / sizeof(void *)) {
> >> @@ -713,8 +714,12 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
> >>* supported.
> >>*/
> >>   kfd->pci_atomic_requested =
> >> amdgpu_amdkfd_have_atomics_support(kgd);
> >
> > Should the check be grouped inside amdgpu_amdkfd_have_atomics_support?
> >
> > This flag is used for setting some link properties. If there is HW
> > support but comes with incompatible firmware, should the link be still
> > marked as atomic?
>
> Our GPU HW always supports PCIe atomics (it's part of the PCIe 3 spec).
> But some mainboards with older PCIe chipsets do not. Sometimes even
> different ports on the same mainboard differ in their PCIe version and
> atomic support.
>
> amdgpu_device_init always tries to enable atomics on the root port and
> all the bridges leading to the GPU by calling
> pci_enable_atomic_ops_to_root. The result is saved in
> adev->have_atomics_support, which is returned to KFD by
> amdgpu_amdkfd_have_atomics_support.
>
> The firmware change here does not affect whether atomics are
> _supported_. It changes whether atomics are _required_ for the basic
> operation of AQL user mode queues. The coming firmware update will
> remove that requirement, which allows us to enable KFD for these GPUs+FW
> on systems without PCIe atomics.
>
> Enabling PCIe atomics with the updated FW is still beneficial because
> shader programs can use a subset of atomic instructions for accessing
> system memory atomically on supported systems.
>
> Regards,
>   Felix
>
>
> >
> > Thanks,
> > Lijo
> >
> >> -if (device_info->needs_pci_atomics &&
> >> -!kfd->pci_atomic_requested) {
> >> +fw_version = amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1);
> >> +if (!kfd->pci_atomic_requested &&
> >> +device_info->needs_pci_atomics &&
> >> +(!device_info->no_atomic_fw_version ||
> >> +  amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1) <
> >> +device_info->no_atomic_fw_version)) {
> >>   dev_info(kfd_device,
> >>"skipped device %x:%x, PCI rejects atomics\n",
> >>pdev->vendor, pdev->device);
> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> >> index ab83b0de6b22..6d8f9bb2d905 100644
> >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> >> @@ -207,6 +207,7 @@ struct kfd_device_info {
> >>   bool supports_cwsr;
> >>   bool needs_iommu_device;
> >>   bool needs_pci_atomics;
> >> +uint32_t no_atomic_fw_version;
> >>   unsigned int num_sdma_engines;
> >>   unsigned int num_xgmi_sdma_engines;
> >>   unsigned int num_sdma_queues_per_engine;
> >>

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Lazar, Lijo
[Public]

What I wanted to ask was -

Whether user mode application relies only on link properties alone to assume 
atomic ops are supported? If they check only link properties and if the 
firmware doesn't work fine, should it be still marked as supported?

Basically, what is the purpose of exposing atomic capability in link properties 
and whether that can be utilised by upper mode applications just based on PCIe 
atomics support?

Thanks,
Lijo

From: Kuehling, Felix 
Sent: Wednesday, September 1, 2021 8:24:56 PM
To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org 

Subject: Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version 
dependent

Am 2021-09-01 um 7:04 a.m. schrieb Lazar, Lijo:
>
>
> On 9/1/2021 3:26 AM, Felix Kuehling wrote:
>> On some GPUs the PCIe atomic requirement for KFD depends on the MEC
>> firmware version. Add a firmware version check for this. The minimum
>> firmware version that works without atomics can be updated in the
>> device_info structure for each GPU type.
>>
>> Signed-off-by: Felix Kuehling 
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++--
>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h   | 1 +
>>   2 files changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 16a57b70cc1a..655ee5733229 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -688,6 +688,7 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
>>   struct kfd_dev *kfd;
>>   const struct kfd_device_info *device_info;
>>   const struct kfd2kgd_calls *f2g;
>> +uint32_t fw_version;
>> if (asic_type >= sizeof(kfd_supported_devices) / (sizeof(void
>> *) * 2)
>>   || asic_type >= sizeof(kfd2kgd_funcs) / sizeof(void *)) {
>> @@ -713,8 +714,12 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
>>* supported.
>>*/
>>   kfd->pci_atomic_requested =
>> amdgpu_amdkfd_have_atomics_support(kgd);
>
> Should the check be grouped inside amdgpu_amdkfd_have_atomics_support?
>
> This flag is used for setting some link properties. If there is HW
> support but comes with incompatible firmware, should the link be still
> marked as atomic?

Our GPU HW always supports PCIe atomics (it's part of the PCIe 3 spec).
But some mainboards with older PCIe chipsets do not. Sometimes even
different ports on the same mainboard differ in their PCIe version and
atomic support.

amdgpu_device_init always tries to enable atomics on the root port and
all the bridges leading to the GPU by calling
pci_enable_atomic_ops_to_root. The result is saved in
adev->have_atomics_support, which is returned to KFD by
amdgpu_amdkfd_have_atomics_support.

The firmware change here does not affect whether atomics are
_supported_. It changes whether atomics are _required_ for the basic
operation of AQL user mode queues. The coming firmware update will
remove that requirement, which allows us to enable KFD for these GPUs+FW
on systems without PCIe atomics.

Enabling PCIe atomics with the updated FW is still beneficial because
shader programs can use a subset of atomic instructions for accessing
system memory atomically on supported systems.

Regards,
  Felix


>
> Thanks,
> Lijo
>
>> -if (device_info->needs_pci_atomics &&
>> -!kfd->pci_atomic_requested) {
>> +fw_version = amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1);
>> +if (!kfd->pci_atomic_requested &&
>> +device_info->needs_pci_atomics &&
>> +(!device_info->no_atomic_fw_version ||
>> +  amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1) <
>> +device_info->no_atomic_fw_version)) {
>>   dev_info(kfd_device,
>>"skipped device %x:%x, PCI rejects atomics\n",
>>pdev->vendor, pdev->device);
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index ab83b0de6b22..6d8f9bb2d905 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -207,6 +207,7 @@ struct kfd_device_info {
>>   bool supports_cwsr;
>>   bool needs_iommu_device;
>>   bool needs_pci_atomics;
>> +uint32_t no_atomic_fw_version;
>>   unsigned int num_sdma_engines;
>>   unsigned int num_xgmi_sdma_engines;
>>   unsigned int num_sdma_queues_per_engine;
>>
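
For what it's worth, the condition the patch implements can be restated as a
small standalone helper (a sketch for illustration only, not kernel code; the
function and parameter names are made up):

#include <stdbool.h>
#include <stdint.h>

/* Skip the device only if PCIe atomics could not be enabled, this ASIC's
 * firmware historically requires them, and the MEC firmware is older than the
 * first version that drops the requirement (0 means no such version exists). */
static bool should_skip_device(bool pci_atomics_enabled,
			       bool needs_pci_atomics,
			       uint32_t no_atomic_fw_version,
			       uint32_t mec_fw_version)
{
	if (pci_atomics_enabled)
		return false;	/* atomics available, nothing to worry about */
	if (!needs_pci_atomics)
		return false;	/* this ASIC never required them */
	if (no_atomic_fw_version && mec_fw_version >= no_atomic_fw_version)
		return false;	/* new enough firmware drops the requirement */
	return true;		/* old firmware and no atomics: skip the device */
}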


Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread philip yang

  


On 2021-09-01 9:45 a.m., Kim, Jonathan wrote:

[AMD Official Use Only]

We were seeing process leaks on a couple of machines running certain tests that 
triggered vm faults on purpose.
I think svm_range_restore_pages gets called unconditionally on vm fault handling 
(unless the retry interrupt payload bit is supposed to be clear with xnack off)?

yes, with xnack off, sh_mem_config retry should be off, retry bit is supposed to 
be clear in fault interrupt vector, we should not try to recover vm fault, just 
report the vm fault back to application and evict user queues. Maybe it is 
another bug cause p->xnack_enabled and sh_mem_config retry mismatch under 
specific condition?

Regards,
Philip

Either way, this patch prevents the process leaks we seeing and is also:
Reviewed-by: Jonathan Kim 

Thanks,

Jon

From: amd-gfx  On Behalf Of philip yang
Sent: Wednesday, September 1, 2021 7:30 AM
To: Sierra Guiza, Alejandro (Alex) ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

[CAUTION: External Email]

On 2021-08-31 10:41 p.m., Alex Sierra wrote:

During svm restore pages interrupt handler, kfd_process ref count was
never dropped when xnack was disabled. Therefore, the object was never
released.

Good catch, but if xnack is off, we should not get here to recover fault.
The fix looks good to me.
Reviewed-by: Philip Yang 

Signed-off-by: Alex Sierra 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 8f9b5b53dab5..110c46cd7fac 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
     }
     if (!p->xnack_enabled) {
    pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
-   return -EFAULT;
+   r = -EFAULT;
+   goto out;
     }
     svms = &p->svms;
 
  

  

  

  



Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Felix Kuehling


Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
> On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
 driver code is not really involved in updating the CPU mappings. Maybe
 it's something we need to do in the migration helpers.
>>> It looks like I'm totally misunderstanding what you are adding here
>>> then.  Why do we need any special treatment at all for memory that
>>> has normal struct pages and is part of the direct kernel map?
>> The pages are like normal memory for purposes of mapping them in CPU
>> page tables and for coherent access from the CPU.
> That's the user page tables.  What about the kernel direct map?
> If there is a normal kernel struct page backing there really should
> be no need for the pgmap.

I'm not sure. The physical address ranges are in the UEFI system address
map as special-purpose memory. Does Linux create the struct pages and
kernel direct map for that without a pgmap call? I didn't see that last
time I went digging through that code.
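
For reference, the DEVICE_PRIVATE registration is roughly the sketch below;
placeholder names throughout (my_page_free, my_migrate_to_ram,
register_device_pages), not the actual kfd_migrate code:

#include <linux/memremap.h>
#include <linux/mm.h>

static void my_page_free(struct page *page)
{
	/* Placeholder: hand the freed page back to the driver's allocator. */
}

static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
{
	/* Placeholder: migrate the faulting page back to system memory. */
	return VM_FAULT_SIGBUS;
}

static const struct dev_pagemap_ops my_pgmap_ops = {
	.page_free	= my_page_free,
	.migrate_to_ram	= my_migrate_to_ram,
};

/* Create struct pages for a device memory range. For MEMORY_DEVICE_PRIVATE
 * the range is normally a free physical range reserved with
 * request_free_mem_region(); a coherent/"public" type would instead cover the
 * real device address range from the system address map. */
static void *register_device_pages(struct device *dev, struct dev_pagemap *pgmap,
				   u64 start, u64 size)
{
	pgmap->type = MEMORY_DEVICE_PRIVATE;
	pgmap->nr_range = 1;
	pgmap->range.start = start;
	pgmap->range.end = start + size - 1;
	pgmap->ops = &my_pgmap_ops;
	pgmap->owner = dev;	/* matched later by the migrate_vma helpers */

	/* This is the pgmap call that creates the struct pages. */
	return devm_memremap_pages(dev, pgmap);
}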


>
>> From an application
>> perspective, we want file-backed and anonymous mappings to be able to
>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>> optimize performance for GPU heavy workloads while minimizing the need
>> to migrate data back-and-forth between system memory and device memory.
> I don't really understand that part.  file backed pages are always
> allocated by the file system using the pagecache helpers, that is
> using the page allocator.  Anonymous memory also always comes from
> the page allocator.

I'm coming at this from my experience with DEVICE_PRIVATE. Both
anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
memory by the migrate_vma_* helpers for more efficient access by our
GPU. (*) It's part of the basic premise of HMM as I understand it. I
would expect the same thing to work for DEVICE_PUBLIC memory.

(*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
currently work, but that's something I'm hoping to fix at some point.
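
Roughly, that helper flow looks like the sketch below (caller-allocated
src/dst arrays assumed, device page allocation and the data copy elided; this
is not the actual kfd_migrate code):

#include <linux/migrate.h>
#include <linux/mm.h>

/* Move the pages backing [start, end) in a VMA towards device memory.
 * Caller must hold mmap_read_lock() on vma->vm_mm and provide src/dst
 * arrays with one entry per page in the range. */
static int migrate_range_to_device(struct vm_area_struct *vma,
				   unsigned long start, unsigned long end,
				   unsigned long *src, unsigned long *dst,
				   void *pgmap_owner)
{
	struct migrate_vma migrate = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int r;

	r = migrate_vma_setup(&migrate);	/* collect and isolate source pages */
	if (r)
		return r;

	if (migrate.cpages) {
		/* The driver would allocate device pages, fill dst[] with
		 * migrate_pfn() values and copy the data here. */
		migrate_vma_pages(&migrate);	/* replace the CPU mappings */
	}
	migrate_vma_finalize(&migrate);		/* drop references, finish up */
	return 0;
}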


>
>> The pages are special in two ways:
>>
>>  1. The memory is managed not by the Linux buddy allocator, but by the
>> GPU driver's TTM memory manager
> Why?

It's a system architecture decision: given the access latency of this memory
and the expected use cases, we do not want the GPU driver and the Linux buddy
allocator/VM subsystem competing for the same device memory.


>
>>  2. We want to migrate data in response to GPU page faults and
>> application hints using the migrate_vma helpers
> Why? 

Device memory has much higher bandwidth and much lower latency than
regular system memory for the GPU to access. It's essential for enabling
good GPU application performance. Page-based memory migration enables
good performance with more intuitive programming models such as
managed/unified memory in HIP or unified shared memory in OpenMP. We do
this on our discrete GPUs with DEVICE_PRIVATE memory.

I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
the CPU to map the device memory coherently to minimize the need for
migrations when CPU and GPU access the same memory concurrently or
alternatingly. But we're not going as far as putting that memory
entirely under the management of the Linux memory manager and VM
subsystem. Our (and HPE's) system architects decided that this memory is
not suitable to be used like regular NUMA system memory by the Linux
memory manager.

Regards,
  Felix




Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Alex Deucher
On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk  wrote:
>
> [AMD Official Use Only]
>
> Daniel
>
> From the link you share it looks you(or someone else) have quite a bunch 
> patches that changes DRM_SCHED or even amdgpu, by that case before they are 
> merged to kernel tree I'm wondering if any AMD develop reviewed them ?
>
> They looks to me somehow conflicting with what we changed in our repo
>
> It is really a chaos for AMDer if someone else out side of AMD changes our 
> kernel driver (or/and scheduler) without reviewed by AMDer, just like we are 
> requiring your review if we tend to change scheduler's logic here 
>
> This one changes AMD's code: 
> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
> And I didn't see any reviewed-by from AMDers ...
>
> This one also touches AMD's code: 
> https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/
> Which is conflicting with one patch we submitted (in our repo rightnow), and 
> neither see AMDder gave a review-by on this one (let me know if I missed it)
>

Monk, this is not how upstream works.  You need to participate.
That's how communities work.  There's a reason all these discussions
happen on public mailing lists.  The patch author can't be expected to
know every person on every vendor team to CC with a patch.  If you
have concerns, you need to raise them when the patches are being
discussed.

Alex


> Thanks
>
> --
> Monk Liu | Cloud-GPU Core team
> --
>
> -Original Message-
> From: amd-gfx  On Behalf Of Daniel 
> Vetter
> Sent: Wednesday, September 1, 2021 4:18 PM
> To: Liu, Monk 
> Cc: Koenig, Christian ; Grodzovsky, Andrey 
> ; Chen, JingWen ; DRI 
> Development ; amd-gfx@lists.freedesktop.org
> Subject: Re: [diagnostic TDR mode patches] unify our solution 
> opinions/suggestions in one thread
>
> Hi Monk,
>
> On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk  wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > Hi Daniel/Christian/Andrey
> >
> >
> >
> > It looks the voice from you three are spread over those email floods to me, 
> > the feature we are working on (diagnostic TDR scheme) is pending there for 
> > more than 6 month (we started it from feb 2021).
>
> For me your project exists since a few weeks at most, because that is when 
> your team showed up on dri-devel. That you already spent 6 months on this 
> within amd, on a code area that very much affects shared code, without 
> kicking of any thread on dri-devel isn't great, but also not something we can 
> fix, since time machines don't exist.
>
> So we have to make the best out of the situation and move ahead where we are. 
> From my understanding you've done a bunch of changes to the scheduler code. 
> As far as I can see there's been two related things your team has done:
>
> - remove some allocations from scheduler code, because that can lead to 
> deadlocks. I've kicked up this topic quite a while ago here
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20200604081224.863494-10-daniel.vetter%40ffwll.ch%2F&data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=pG5sG5pjVXEAMaahvfNS11VwbHkYWRuWrtHFXM9mEyo%3D&reserved=0
>
> This is just one patch of the entire series. This is an area where we really 
> need a consistent solution across all drm/sched drivers, not something that 
> individual drivers just fix in their own way.
>
> - the other one is the timeout issue for the patches you cite here.
> Again there's been discussion on this on dri-devel with Boris from panfrost 
> about how we can handle at least some of the races in tdr.
> That resulted in lots of discussions and documentation improvements.
> Those patches are merged now, link
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210625133327.2598825-2-boris.brezillon%40collabora.com%2F&data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=m6U6tJbX2x38xiwQXE1oV0sz2BxXZfPlcouyqIqPZNU%3D&reserved=0
>
> There's been more than just this, also quite some doc patches from Boris that 
> explain how it's all supposed to work and be race-free.
> Again your driver isn't the only one with interesting TDR races.
>
> Your team hasn't been active in any of these discussions, but now suddenly 
> pops up out of nowhere and demands that your approach needs to land asap. 
> That's really not how upstream works.
>
> The other thing where I'm struggling is that there's a lot of missing context 
> for outsiders. The patches sometimes come with ze

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Felix Kuehling
Am 2021-09-01 um 7:04 a.m. schrieb Lazar, Lijo:
>
>
> On 9/1/2021 3:26 AM, Felix Kuehling wrote:
>> On some GPUs the PCIe atomic requirement for KFD depends on the MEC
>> firmware version. Add a firmware version check for this. The minimum
>> firmware version that works without atomics can be updated in the
>> device_info structure for each GPU type.
>>
>> Signed-off-by: Felix Kuehling 
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++--
>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h   | 1 +
>>   2 files changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 16a57b70cc1a..655ee5733229 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -688,6 +688,7 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
>>   struct kfd_dev *kfd;
>>   const struct kfd_device_info *device_info;
>>   const struct kfd2kgd_calls *f2g;
>> +    uint32_t fw_version;
>>     if (asic_type >= sizeof(kfd_supported_devices) / (sizeof(void
>> *) * 2)
>>   || asic_type >= sizeof(kfd2kgd_funcs) / sizeof(void *)) {
>> @@ -713,8 +714,12 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
>>    * supported.
>>    */
>>   kfd->pci_atomic_requested =
>> amdgpu_amdkfd_have_atomics_support(kgd);
>
> Should the check be grouped inside amdgpu_amdkfd_have_atomics_support?
>
> This flag is used for setting some link properties. If there is HW
> support but comes with incompatible firmware, should the link be still
> marked as atomic?

Our GPU HW always supports PCIe atomics (it's part of the PCIe 3 spec).
But some mainboards with older PCIe chipsets do not. Sometimes even
different ports on the same mainboard differ in their PCIe version and
atomic support.

amdgpu_device_init always tries to enable atomics on the root port and
all the bridges leading to the GPU by calling
pci_enable_atomic_ops_to_root. The result is saved in
adev->have_atomics_support, which is returned to KFD by
amdgpu_amdkfd_have_atomics_support.

The firmware change here does not affect whether atomics are
_supported_. It changes whether atomics are _required_ for the basic
operation of AQL user mode queues. The coming firmware update will
remove that requirement, which allows us to enable KFD for these GPUs+FW
on systems without PCIe atomics.

Enabling PCIe atomics with the updated FW is still beneficial because
shader programs can use a subset of atomic instructions for accessing
system memory atomically on supported systems.

Regards,
  Felix


>
> Thanks,
> Lijo
>
>> -    if (device_info->needs_pci_atomics &&
>> -    !kfd->pci_atomic_requested) {
>> +    fw_version = amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1);
>> +    if (!kfd->pci_atomic_requested &&
>> +    device_info->needs_pci_atomics &&
>> +    (!device_info->no_atomic_fw_version ||
>> +  amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1) <
>> +    device_info->no_atomic_fw_version)) {
>>   dev_info(kfd_device,
>>    "skipped device %x:%x, PCI rejects atomics\n",
>>    pdev->vendor, pdev->device);
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index ab83b0de6b22..6d8f9bb2d905 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -207,6 +207,7 @@ struct kfd_device_info {
>>   bool supports_cwsr;
>>   bool needs_iommu_device;
>>   bool needs_pci_atomics;
>> +    uint32_t no_atomic_fw_version;
>>   unsigned int num_sdma_engines;
>>   unsigned int num_xgmi_sdma_engines;
>>   unsigned int num_sdma_queues_per_engine;
>>
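
As an illustration, a device_info entry could wire the new field up like this
(hypothetical example: the ASIC, firmware number and SDMA values are made up;
the real tables live in kfd_device.c):

/* Hypothetical entry; assumes the kfd_priv.h/amd_shared.h definitions above.
 * Only the two atomics-related fields matter for this patch. */
static const struct kfd_device_info example_device_info = {
	.asic_family = CHIP_NAVI10,		/* placeholder ASIC */
	.needs_pci_atomics = true,		/* PCIe atomics normally required... */
	.no_atomic_fw_version = 145,		/* ...unless MEC1 firmware >= 145 (made-up value) */
	.num_sdma_engines = 2,
	.num_xgmi_sdma_engines = 0,
	.num_sdma_queues_per_engine = 8,
};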


RE: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread Kim, Jonathan
[AMD Official Use Only]

We were seeing process leaks on a couple of machines running certain tests that 
triggered vm faults on purpose.
I think svm_range_restore_pages gets called unconditionally on vm fault 
handling (unless the retry interrupt payload bit is supposed to be clear with 
xnack off)?

Either way, this patch prevents the process leaks we were seeing and is also:
Reviewed-by: Jonathan Kim 

Thanks,

Jon


From: amd-gfx  On Behalf Of philip yang
Sent: Wednesday, September 1, 2021 7:30 AM
To: Sierra Guiza, Alejandro (Alex) ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

[CAUTION: External Email]


On 2021-08-31 10:41 p.m., Alex Sierra wrote:

During svm restore pages interrupt handler, kfd_process ref count was

never dropped when xnack was disabled. Therefore, the object was never

released.

Good catch, but if xnack is off, we should not get here to recover the fault.

The fix looks good to me.

Reviewed-by: Philip Yang 



Signed-off-by: Alex Sierra 

---

 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-

 1 file changed, 2 insertions(+), 1 deletion(-)



diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

index 8f9b5b53dab5..110c46cd7fac 100644

--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

@@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,

 }

 if (!p->xnack_enabled) {

pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);

-   return -EFAULT;

+   r = -EFAULT;

+   goto out;

 }

 svms = &p->svms;
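
For context, the reference handling this fixes looks roughly like the sketch
below (svm_range_restore_pages does much more; only the refcount flow is
shown, helper names as declared in kfd_priv.h):

static int restore_pages_sketch(unsigned int pasid)
{
	struct kfd_process *p;
	int r = 0;

	p = kfd_lookup_process_by_pasid(pasid);	/* takes a process reference */
	if (!p)
		return -ESRCH;

	if (!p->xnack_enabled) {
		/* A bare "return -EFAULT" here leaks the reference taken above. */
		r = -EFAULT;
		goto out;
	}

	/* ... fault recovery / migration work elided ... */

out:
	kfd_unref_process(p);	/* dropped on every exit path */
	return r;
}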




Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Christoph Hellwig
On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> >> driver code is not really involved in updating the CPU mappings. Maybe
> >> it's something we need to do in the migration helpers.
> > It looks like I'm totally misunderstanding what you are adding here
> > then.  Why do we need any special treatment at all for memory that
> > has normal struct pages and is part of the direct kernel map?
> 
> The pages are like normal memory for purposes of mapping them in CPU
> page tables and for coherent access from the CPU.

That's the user page tables.  What about the kernel direct map?
If there is a normal kernel struct page backing there really should
be no need for the pgmap.

> From an application
> perspective, we want file-backed and anonymous mappings to be able to
> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> optimize performance for GPU heavy workloads while minimizing the need
> to migrate data back-and-forth between system memory and device memory.

I don't really understand that part.  file backed pages are always
allocated by the file system using the pagecache helpers, that is
using the page allocator.  Anonymous memory also always comes from
the page allocator.

> The pages are special in two ways:
> 
>  1. The memory is managed not by the Linux buddy allocator, but by the
> GPU driver's TTM memory manager

Why?

>  2. We want to migrate data in response to GPU page faults and
> application hints using the migrate_vma helpers

Why? 


Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread philip yang

  


On 2021-08-31 10:41 p.m., Alex Sierra wrote:

During svm restore pages interrupt handler, kfd_process ref count was
never dropped when xnack was disabled. Therefore, the object was never
released.

Good catch, but if xnack is off, we should not get here to recover the fault.
The fix looks good to me.
Reviewed-by: Philip Yang 

Signed-off-by: Alex Sierra 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 8f9b5b53dab5..110c46cd7fac 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 	}
 	if (!p->xnack_enabled) {
 		pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
-		return -EFAULT;
+		r = -EFAULT;
+		goto out;
 	}
 	svms = &p->svms;
 


  



Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Lazar, Lijo




On 9/1/2021 3:26 AM, Felix Kuehling wrote:

On some GPUs the PCIe atomic requirement for KFD depends on the MEC
firmware version. Add a firmware version check for this. The minimum
firmware version that works without atomics can be updated in the
device_info structure for each GPU type.

Signed-off-by: Felix Kuehling 
---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++--
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h   | 1 +
  2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 16a57b70cc1a..655ee5733229 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -688,6 +688,7 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
struct kfd_dev *kfd;
const struct kfd_device_info *device_info;
const struct kfd2kgd_calls *f2g;
+   uint32_t fw_version;
  
  	if (asic_type >= sizeof(kfd_supported_devices) / (sizeof(void *) * 2)

|| asic_type >= sizeof(kfd2kgd_funcs) / sizeof(void *)) {
@@ -713,8 +714,12 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
 * supported.
 */
kfd->pci_atomic_requested = amdgpu_amdkfd_have_atomics_support(kgd);


Should the check be grouped inside amdgpu_amdkfd_have_atomics_support?

This flag is used for setting some link properties. If there is HW 
support but comes with incompatible firmware, should the link be still 
marked as atomic?


Thanks,
Lijo


-   if (device_info->needs_pci_atomics &&
-   !kfd->pci_atomic_requested) {
+   fw_version = amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1);
+   if (!kfd->pci_atomic_requested &&
+   device_info->needs_pci_atomics &&
+   (!device_info->no_atomic_fw_version ||
+ amdgpu_amdkfd_get_fw_version(kgd, KGD_ENGINE_MEC1) <
+   device_info->no_atomic_fw_version)) {
dev_info(kfd_device,
 "skipped device %x:%x, PCI rejects atomics\n",
 pdev->vendor, pdev->device);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index ab83b0de6b22..6d8f9bb2d905 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -207,6 +207,7 @@ struct kfd_device_info {
bool supports_cwsr;
bool needs_iommu_device;
bool needs_pci_atomics;
+   uint32_t no_atomic_fw_version;
unsigned int num_sdma_engines;
unsigned int num_xgmi_sdma_engines;
unsigned int num_sdma_queues_per_engine;



RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Liu, Monk
[AMD Official Use Only]

>> For me your project exists since a few weeks at most, because that is when 
>> your team showed up on dri-devel. That you already spent 6 months on this 
>> within amd, on a code area that very much affects shared code, without 
>> kicking of any thread on dri-devel isn't great, but also not something we 
>> can fix, since time machines don't exist.
This is partially true: in the past months our changes only resided in the AMD 
driver, and it is only now that we found we had to make changes at the SCHED level.

>> Your team hasn't been active in any of these discussions, but now suddenly 
>> pops up out of nowhere and demands that your approach needs to land asap. 
>> That's really not how upstream works.
If our changes at the DRI level cannot get merged soon, that's fine, we can 
discuss more; but this did not suddenly pop up from nowhere, we already worked 
on it for months inside the AMD driver.

>> I think the best way forward is to go through the above process again and 
>> essentially restart. So submit a complete patch series with problem 
>> descriptions, solution you picked, why you picked that, all the amdgpu 
>> patches to get there and the core patches too. Since it sounds like a bunch 
>> of this has all landed already you probably need a patch 1 that goes back to 
>> 6 months ago so that we can see the overall direction, and review whether 
>> that's the right one or not.

We are not objecting to this approach; we didn't do that before because 
previously everything we needed to do resided inside the AMD driver ... we were 
trying to avoid changing the DRI/DRM interface part ... 

For the patches you showed us via links, I'm sorry that due to some IT 
infrastructure reasons my team and I didn't see them before (we mostly work in 
the AMD repo ... the patches you showed haven't been merged into our repo yet ...).
One thing I also want to emphasize here: if any code needs to change inside the 
AMD driver, please always let us know and let us review.

Thanks 

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: amd-gfx  On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk 
Cc: Koenig, Christian ; Grodzovsky, Andrey 
; Chen, JingWen ; DRI 
Development ; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk  wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, 
> the feature we are working on (diagnostic TDR scheme) is pending there for 
> more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is when your 
team showed up on dri-devel. That you already spent 6 months on this within 
amd, on a code area that very much affects shared code, without kicking of any 
thread on dri-devel isn't great, but also not something we can fix, since time 
machines don't exist.

So we have to make the best out of the situation and move ahead where we are. 
From my understanding you've done a bunch of changes to the scheduler code. As 
far as I can see there's been two related things your team has done:

- remove some allocations from scheduler code, because that can lead to 
deadlocks. I've kicked up this topic quite a while ago here

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20200604081224.863494-10-daniel.vetter%40ffwll.ch%2F&data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=pG5sG5pjVXEAMaahvfNS11VwbHkYWRuWrtHFXM9mEyo%3D&reserved=0

This is just one patch of the entire series. This is an area where we really 
need a consistent solution across all drm/sched drivers, not something that 
individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from panfrost 
about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210625133327.2598825-2-boris.brezillon%40collabora.com%2F&data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=m6U6tJbX2x38xiwQXE1oV0sz2BxXZfPlcouyqIqPZNU%3D&reserved=0

There's been more than just this, also quite some doc patches from Boris that 
explain how it's all supposed to w

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Liu, Monk
[AMD Official Use Only]

Daniel 

>From the link you share it looks you(or someone else) have quite a bunch 
>patches that changes DRM_SCHED or even amdgpu, by that case before they are 
>merged to kernel tree I'm wondering if any AMD develop reviewed them ?

They looks to me somehow conflicting with what we changed in our repo 

It is really a chaos for AMDer if someone else out side of AMD changes our 
kernel driver (or/and scheduler) without reviewed by AMDer, just like we are 
requiring your review if we tend to change scheduler's logic here  

This one changes AMD's code: 
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
And I didn't see any reviewed-by from AMDers ...

This one also touches AMD's code: 
https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/
Which is conflicting with one patch we submitted (in our repo rightnow), and 
neither see AMDder gave a review-by on this one (let me know if I missed it)

Thanks 

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: amd-gfx  On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk 
Cc: Koenig, Christian ; Grodzovsky, Andrey 
; Chen, JingWen ; DRI 
Development ; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk  wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, 
> the feature we are working on (diagnostic TDR scheme) is pending there for 
> more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is when your 
team showed up on dri-devel. That you already spent 6 months on this within 
amd, on a code area that very much affects shared code, without kicking of any 
thread on dri-devel isn't great, but also not something we can fix, since time 
machines don't exist.

So we have to make the best out of the situation and move ahead where we are. 
From my understanding you've done a bunch of changes to the scheduler code. As 
far as I can see there's been two related things your team has done:

- remove some allocations from scheduler code, because that can lead to 
deadlocks. I've kicked up this topic quite a while ago here

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20200604081224.863494-10-daniel.vetter%40ffwll.ch%2F&data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=pG5sG5pjVXEAMaahvfNS11VwbHkYWRuWrtHFXM9mEyo%3D&reserved=0

This is just one patch of the entire series. This is an area where we really 
need a consistent solution across all drm/sched drivers, not something that 
individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from panfrost 
about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210625133327.2598825-2-boris.brezillon%40collabora.com%2F&data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=m6U6tJbX2x38xiwQXE1oV0sz2BxXZfPlcouyqIqPZNU%3D&reserved=0

There's been more than just this, also quite some doc patches from Boris that 
explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.

Your team hasn't been active in any of these discussions, but now suddenly pops 
up out of nowhere and demands that your approach needs to land asap. That's 
really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing context 
for outsiders. The patches sometimes come with zero commit message, for tricky 
concurrency bugs. And there's no context with what you've done already on the 
amdgpu side (since that never showed up on dri-devel), which makes constructive 
discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed to work 
when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on dri-devel (or 
whatever the topic really is about) about the problem you have and how your 
trying to solve it. This can be just text if it's a big thing, but 

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Daniel Vetter
Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk  wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, 
> the feature we are working on (diagnostic TDR scheme) is pending there for 
> more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is
when your team showed up on dri-devel. That you already spent 6 months
on this within amd, on a code area that very much affects shared code,
without kicking off any thread on dri-devel isn't great, but also not
something we can fix, since time machines don't exist.

So we have to make the best out of the situation and move ahead where
we are. From my understanding you've done a bunch of changes to the
scheduler code. As far as I can see there's been two related things
your team has done:

- remove some allocations from scheduler code, because that can lead
to deadlocks. I've kicked up this topic quite a while ago here

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vet...@ffwll.ch/

This is just one patch of the entire series. This is an area where we
really need a consistent solution across all drm/sched drivers, not
something that individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from
panfrost about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/

There's been more than just this, also quite some doc patches from
Boris that explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.

Your team hasn't been active in any of these discussions, but now
suddenly pops up out of nowhere and demands that your approach needs
to land asap. That's really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing
context for outsiders. The patches sometimes come with zero commit
message, for tricky concurrency bugs. And there's no context with what
you've done already on the amdgpu side (since that never showed up on
dri-devel), which makes constructive discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed
to work when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on
dri-devel (or whatever the topic really is about) about the problem
you have and how your trying to solve it. This can be just text if
it's a big thing, but it can also already include some proof of
concept solution in the form of patches.

- Then we iterate on the solution, across drivers and shared code
_together_. Not "merge amdgpu code first, then get annoyed when the
core changes don't land immediately after you've practically finished
the project".

- This might mean changes to other drivers if we need to adjust interfaces.

On the plus side you can plan much better, because you know you have
upstream buy-in before you start to put in real work on the project.

> Honestly speaking the email ways that we are using now is not friendly and 
> quite painful to me ….

Yes this is painful :-(

I think the best way forward is to go through the above process again
and essentially restart. So submit a complete patch series with
problem descriptions, solution you picked, why you picked that, all
the amdgpu patches to get there and the core patches too. Since it
sounds like a bunch of this has all landed already you probably need a
patch 1 that goes back to 6 months ago so that we can see the overall
direction, and review whether that's the right one or not.

The not-so-painful approach would have been to do this from the start,
6 months ago. It would definitely have helped if the tdr discussion
we've had just a few months ago would have involved your team too, I'm
sure there would have been some good insights from amd's side. I'd
really want you and your engineers involved here, so let's do this
properly!

Cheers, Daniel

> Can we try to put all our opinions, suggestions, or even objects here 
> together, let’s go through them one by one, it’s too hard for us to reply 
> each email on different questions .
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fixing patch on the timeout timer in scheduler, can we complete 
> this one first ? it should already resolved all the questions and suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already explained the questions raised by Daniel in other thread , 
> regarding why I use __kthread_should_park()
>
> For other aspects, can we put all our opinion synthesized here ?
>
>
>
> Thanks !
>
>
>
> -