RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Liu, Monk Wed, 01 Sep 2021 03:27:56 -0700

[AMD Official Use Only]

>> For me your project exists since a few weeks at most, because that is when 
>> your team showed up on dri-devel. That you already spent 6 months on this 
>> within amd, on a code area that very much affects shared code, without 
>> kicking off any thread on dri-devel isn't great, but also not something we 
>> can fix, since time machines don't exist.
This is partially true: over the past months our changes resided only in the 
AMD driver, and it is only now that we found we had to make changes at the 
SCHED level.

>> Your team hasn't been active in any of these discussions, but now suddenly 
>> pops up out of nowhere and demands that your approach needs to land asap. 
>> That's really not how upstream works.
If our changes to the DRI-level part cannot get merged soon, that's fine and we 
can discuss more. But this did not suddenly pop up out of nowhere: we have 
already worked on it for months inside the AMD driver.

>> I think the best way forward is to go through the above process again and 
>> essentially restart. So submit a complete patch series with problem 
>> descriptions, solution you picked, why you picked that, all the amdgpu 
>> patches to get there and the core patches too. Since it sounds like a bunch 
>> of this has all landed already you probably need a patch 1 that goes back to 
>> 6 months ago so that we can see the overall direction, and review whether 
>> that's the right one or not.

We are not objecting to this approach. We didn't do that because previously 
everything we needed to do resided inside the AMD driver, and because we tried 
to avoid changing the DRI/DRM interface part.

As for the patches you showed us with links, I'm sorry that for some IT 
infrastructure reason my team and I didn't see them before (we mostly work in 
the AMD repo, and the patches you showed haven't been merged into our repo yet).
One thing I also want to emphasize here: if any code needs to change inside the 
AMD driver, please always let us know so we can review it.

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk <monk....@amd.com>
Cc: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey 
<andrey.grodzov...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; DRI 
Development <dri-de...@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <monk....@amd.com> wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks to me like the feedback from the three of you is spread across a 
> flood of emails. The feature we are working on (diagnostic TDR scheme) has 
> been pending for more than 6 months (we started it in Feb 2021).

For me your project exists since a few weeks at most, because that is when your 
team showed up on dri-devel. That you already spent 6 months on this within 
amd, on a code area that very much affects shared code, without kicking off any 
thread on dri-devel isn't great, but also not something we can fix, since time 
machines don't exist.

So we have to make the best out of the situation and move ahead where we are. 
From my understanding you've done a bunch of changes to the scheduler code. As 
far as I can see there's been two related things your team has done:

- remove some allocations from scheduler code, because that can lead to 
deadlocks. I've kicked up this topic quite a while ago here

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/

This is just one patch of the entire series. This is an area where we really 
need a consistent solution across all drm/sched drivers, not something that 
individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from panfrost 
about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/

There's been more than just this, also quite some doc patches from Boris that 
explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.

Your team hasn't been active in any of these discussions, but now suddenly pops 
up out of nowhere and demands that your approach needs to land asap. That's 
really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing context 
for outsiders. The patches sometimes come with zero commit message, for tricky 
concurrency bugs. And there's no context with what you've done already on the 
amdgpu side (since that never showed up on dri-devel), which makes constructive 
discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed to work 
when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on dri-devel (or 
whatever the topic really is about) about the problem you have and how you're 
trying to solve it. This can be just text if it's a big thing, but it can also 
already include some proof of concept solution in the form of patches.

- Then we iterate on the solution, across drivers and shared code _together_. 
Not "merge amdgpu code first, then get annoyed when the core changes don't land 
immediately after you've practically finished the project".

- This might mean changes to other drivers if we need to adjust interfaces.

On the plus side you can plan much better, because you know you have upstream 
buy-in before you start to put in real work on the project.

> Honestly speaking, the way we are using email now is not friendly and is 
> quite painful to me ....

Yes this is painful :-(

I think the best way forward is to go through the above process again and 
essentially restart. So submit a complete patch series with problem 
descriptions, solution you picked, why you picked that, all the amdgpu patches 
to get there and the core patches too. Since it sounds like a bunch of this has 
all landed already you probably need a patch 1 that goes back to 6 months ago 
so that we can see the overall direction, and review whether that's the right 
one or not.

The not-so-painful approach would have been to do this from the start,
6 months ago. It would definitely have helped if the tdr discussion we've had 
just a few months ago would have involved your team too, I'm sure there would 
have been some good insights from amd's side. I'd really want you and your 
engineers involved here, so let's do this properly!

Cheers, Daniel

> Can we try to put all our opinions, suggestions, or even objections here 
> together and go through them one by one? It's too hard for us to reply to 
> each email on different questions.
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fix for the timeout timer in the scheduler. Can we complete 
> this one first? It should already have resolved all the questions and 
> suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already answered the questions raised by Daniel in the other 
> thread regarding why I use __kthread_should_park().
>
> For the other aspects, can we gather all our opinions here?
>
>
>
> Thanks !
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/
