Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Alex Deucher Wed, 01 Sep 2021 08:20:13 -0700

On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk <monk....@amd.com> wrote:
>
> [AMD Official Use Only]
>
> Daniel
>
> From the link you share it looks you(or someone else) have quite a bunch 
> patches that changes DRM_SCHED or even amdgpu, by that case before they are 
> merged to kernel tree I'm wondering if any AMD develop reviewed them ?
>
> They looks to me somehow conflicting with what we changed in our repo....
>
> It is really a chaos for AMDer if someone else out side of AMD changes our 
> kernel driver (or/and scheduler) without reviewed by AMDer, just like we are 
> requiring your review if we tend to change scheduler's logic here ....
>
> This one changes AMD's code: 
> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
> And I didn't see any reviewed-by from AMDers ...
>
> This one also touches AMD's code: 
> https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/
> Which is conflicting with one patch we submitted (in our repo rightnow), and 
> neither see AMDder gave a review-by on this one (let me know if I missed it)
>


Monk, this is not how upstream works.  You need to participate.
That's how communities work.  There's a reason all these discussions
happen on public mailing lists.  The patch author can't be expected to
know every person on every vendor team to CC with a patch.  If you
have concerns, you need to raise them when the patches are being
discussed.

Alex


> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Daniel 
> Vetter
> Sent: Wednesday, September 1, 2021 4:18 PM
> To: Liu, Monk <monk....@amd.com>
> Cc: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey 
> <andrey.grodzov...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; DRI 
> Development <dri-de...@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
> Subject: Re: [diagnostic TDR mode patches] unify our solution 
> opinions/suggestions in one thread
>
> Hi Monk,
>
> On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <monk....@amd.com> wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > Hi Daniel/Christian/Andrey
> >
> >
> >
> > It looks the voice from you three are spread over those email floods to me, 
> > the feature we are working on (diagnostic TDR scheme) is pending there for 
> > more than 6 month (we started it from feb 2021).
>
> For me your project exists since a few weeks at most, because that is when 
> your team showed up on dri-devel. That you already spent 6 months on this 
> within amd, on a code area that very much affects shared code, without 
> kicking of any thread on dri-devel isn't great, but also not something we can 
> fix, since time machines don't exist.
>
> So we have to make the best out of the situation and move ahead where we are. 
> From my understanding you've done a bunch of changes to the scheduler code. 
> As far as I can see there's been two related things your team has done:
>
> - remove some allocations from scheduler code, because that can lead to 
> deadlocks. I've kicked up this topic quite a while ago here
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20200604081224.863494-10-daniel.vetter%40ffwll.ch%2F&amp;data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=pG5sG5pjVXEAMaahvfNS11VwbHkYWRuWrtHFXM9mEyo%3D&amp;reserved=0
>
> This is just one patch of the entire series. This is an area where we really 
> need a consistent solution across all drm/sched drivers, not something that 
> individual drivers just fix in their own way.
>
> - the other one is the timeout issue for the patches you cite here.
> Again there's been discussion on this on dri-devel with Boris from panfrost 
> about how we can handle at least some of the races in tdr.
> That resulted in lots of discussions and documentation improvements.
> Those patches are merged now, link
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210625133327.2598825-2-boris.brezillon%40collabora.com%2F&amp;data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=m6U6tJbX2x38xiwQXE1oV0sz2BxXZfPlcouyqIqPZNU%3D&amp;reserved=0
>
> There's been more than just this, also quite some doc patches from Boris that 
> explain how it's all supposed to work and be race-free.
> Again your driver isn't the only one with interesting TDR races.
>
> Your team hasn't been active in any of these discussions, but now suddenly 
> pops up out of nowhere and demands that your approach needs to land asap. 
> That's really not how upstream works.
>
> The other thing where I'm struggling is that there's a lot of missing context 
> for outsiders. The patches sometimes come with zero commit message, for 
> tricky concurrency bugs. And there's no context with what you've done already 
> on the amdgpu side (since that never showed up on dri-devel), which makes 
> constructive discussions here really hard.
>
> Now fixing these bugs is obviously good, but the way this is supposed to work 
> when touching shared infrastructure is:
>
> - Before you start merging anything kick off an RFC thread on dri-devel (or 
> whatever the topic really is about) about the problem you have and how your 
> trying to solve it. This can be just text if it's a big thing, but it can 
> also already include some proof of concept solution in the form of patches.
>
> - Then we iterate on the solution, across drivers and shared code _together_. 
> Not "merge amdgpu code first, then get annoyed when the core changes don't 
> land immediately after you've practially finished the project".
>
> - This might mean changes to other drivers if we need to adjust interfaces.
>
> On the plus side you can plan much better, because you know you have upstream 
> buy-in before you start to put in real work on the project.
>
> > Honestly speaking the email ways that we are using now is not friendly and 
> > quite painful to me ....
>
> Yes this is painful :-(
>
> I think the best way forward is to go through the above process again and 
> essentially restart. So submit a complete patch series with problem 
> descriptions, solution you picked, why you picked that, all the amdgpu 
> patches to get there and the core patches too. Since it sounds like a bunch 
> of this has all landed already you probably need a patch 1 that goes back to 
> 6 months ago so that we can see the overall direction, and review whether 
> that's the right one or not.
>
> The not-so-painful approach would have been to do this from the start,
> 6 months ago. It would definitely have helped if the tdr discussion we've had 
> just a few months ago would have involved your team too, I'm sure there would 
> have been some good insights from amd's side. I'd really want you and your 
> engineers involved here, so let's do this properly!
>
> Cheers, Daniel
>
> > Can we try to put all our opinions, suggestions, or even objects here 
> > together, let's go through them one by one, it's too hard for us to reply 
> > each email on different questions .
> >
> >
> >
> > For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
> >
> >
> >
> > This is a fixing patch on the timeout timer in scheduler, can we complete 
> > this one first ? it should already resolved all the questions and 
> > suggestions.
> >
> >
> >
> > For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> >
> >
> >
> > I think I already explained the questions raised by Daniel in other
> > thread , regarding why I use __kthread_should_park()
> >
> > For other aspects, can we put all our opinion synthesized here ?
> >
> >
> >
> > Thanks !
> >
> >
> >
> > ------------------------------------------
> >
> > Monk Liu | Cloud-GPU Core team
> >
> > ------------------------------------------
> >
> >
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&amp;data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=NA3iopIUYFOuTokczRA%2BNBcwVrvMMMHGPM96%2B%2Bm0nEg%3D&amp;reserved=0

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Reply via email to