Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-06 Thread Jingwen Chen
Hi Christian/Andrey/Daniel,

I read Boris's patch about ordered workqueue and I think maybe we can
leverage this change.
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/

The TDR race condition we are talking about is caused by a bailing
job being deleted from the pending list. If we instead use an ordered
workqueue for the timeout work in the driver, there will be no bailing job.
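
To make this concrete, a minimal sketch of what I have in mind, assuming the
drm_sched_init() signature from Boris's patch above (the one that adds a
timeout_wq parameter); the driver-side names here are hypothetical:

#include <linux/jiffies.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/*
 * Hypothetical driver init path: one ordered workqueue shared by all rings,
 * so at most one timeout handler runs at a time and a job can no longer be
 * freed by another ring's handler ("bailing") while ours is still using it.
 */
static struct workqueue_struct *tdr_wq;

int example_ring_sched_init(struct drm_gpu_scheduler *sched,
			    const struct drm_sched_backend_ops *ops,
			    unsigned int hw_submission, unsigned int hang_limit,
			    long timeout_ms, const char *name)
{
	if (!tdr_wq) {
		tdr_wq = alloc_ordered_workqueue("example-tdr", 0);
		if (!tdr_wq)
			return -ENOMEM;
	}

	/* Parameter list per Boris's patch as I read it; NULL score keeps
	 * the default. */
	return drm_sched_init(sched, ops, hw_submission, hang_limit,
			      msecs_to_jiffies(timeout_ms), tdr_wq,
			      NULL, name);
}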

Do you have any suggestions?

Best Regards,
JingWen Chen

On Mon Sep 06, 2021 at 02:36:52PM +0800, Liu, Monk wrote:
> [AMD Official Use Only]
> 
> > I'm fearing that just repeating what Alex said, but to make it clear 
> > once more: That is *not* necessary!
> >
> > The shared repository is owned by upstream maintainers and they are 
> > usually free to do restructuring work without getting acknowledge from 
> > every single driver maintainer.
> 
> Hi Daniel
> 
> Anyway, thanks for officially confirming the community working model and 
> policy to me. I won't push my own opinion here, since that is not my call to 
> change either way.
> I only want this diagnostic TDR scheme to reach a good conclusion for AMD, or 
> even for all DRM vendors.
> 
> How about this: we still have a final patch that has not landed in the DRM 
> scheduler, and I would like Jingwen to present it to you and 
> AlexD/Christian/Andrey. I expect you will have concerns or objections 
> regarding this patch, but that's fine; let us figure out together how to make 
> it acceptable to you and the other vendors working with the DRM scheduler.
> 
> P.S.: I have to repeat myself again: we are not suddenly popping up with a 
> new idea, this is a disconnection issue. We didn't have changes (or plan to 
> have changes) in the DRM scheduler before, but eventually we found that we 
> must make job_timeout and sched_main work in a serialized fashion, otherwise 
> it won't work with the current scheduler's code structure.
> 
> Thanks 
> 
> --
> Monk Liu | Cloud-GPU Core team
> --
> 
> -Original Message-
> From: Daniel Vetter  
> Sent: Friday, September 3, 2021 12:11 AM
> To: Koenig, Christian 
> Cc: Liu, Monk ; Dave Airlie ; Alex 
> Deucher ; Grodzovsky, Andrey 
> ; Chen, JingWen ; DRI 
> Development ; amd-gfx@lists.freedesktop.org
> Subject: Re: [diagnostic TDR mode patches] unify our solution 
> opinions/suggestions in one thread
> 
> On Thu, Sep 2, 2021 at 1:00 PM Christian König  
> wrote:
> >
> > Hi Monk,
> >
> > Am 02.09.21 um 07:52 schrieb Liu, Monk:
> > > [AMD Official Use Only]
> > >
> > > I'm not sure I can add much to help this along, I'm sure Alex has 
> > > some internal training, Once your driver is upstream, it belongs to 
> > > upstream, you can maintain it, but you no longer control it 100%, it's a 
> > > tradeoff, it's not one companies always understand.
> > > Usually people are fine developing away internally, but once interaction 
> > > with other parts of the kernel/subsystem is required they have the 
> > > realisation that they needed to work upstream 6 months earlier.
> > > The best time to interact with upstream was 6 months ago, the second best 
> > > time is now.
> > > <<<
> > >
> > > Daniel/AlexD
> > >
> > > I didn't mean your changes on AMD driver need my personal approval 
> > > or review ... and  I'm totally already get used that our driver is not 
> > > 100% under control by AMDers, but supposedly any one from community 
> > > (including you) who tend to change AMD's driver need at least to get 
> > > approvement from someone in AMD, e.g.: AlexD or Christian, doesn't that 
> > > reasonable?
> >
> > I'm fearing that just repeating what Alex said, but to make it clear 
> > once more: That is *not* necessary!
> >
> > The shared repository is owned by upstream maintainers and they are 
> > usually free to do restructuring work without getting acknowledge from 
> > every single driver maintainer.
> >
> > Anybody can of course technically object to upstream design decisions, 
> > but that means that you need to pay attention to the mailing lists in 
> > the first place.
> >
> > > just like we need your approve if we try to modify DRM-sched, or need 
> > > panfrost's approval if we need to change panfrost code ...
> > >
> > > by only CC AMD's engineers looks not quite properly, how do you know if 
> > > your changes (on AMD code part) are conflicting with AMD's on-going 
> > > internal features/refactoring or not ?
> >
> > Well because AMD 

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-06 Thread Liu, Monk
[AMD Official Use Only]

> I'm fearing that just repeating what Alex said, but to make it clear 
> once more: That is *not* necessary!
>
> The shared repository is owned by upstream maintainers and they are 
> usually free to do restructuring work without getting acknowledge from 
> every single driver maintainer.

Hi Daniel

Anyway, thanks for officially confirming the community working model and policy 
to me. I won't push my own opinion here, since that is not my call to change 
either way.
I only want this diagnostic TDR scheme to reach a good conclusion for AMD, or 
even for all DRM vendors.

How about this: we still have a final patch that has not landed in the DRM 
scheduler, and I would like Jingwen to present it to you and 
AlexD/Christian/Andrey. I expect you will have concerns or objections regarding 
this patch, but that's fine; let us figure out together how to make it 
acceptable to you and the other vendors working with the DRM scheduler.

P.S.: I have to repeat myself again: we are not suddenly popping up with a new 
idea, this is a disconnection issue. We didn't have changes (or plan to have 
changes) in the DRM scheduler before, but eventually we found that we must make 
job_timeout and sched_main work in a serialized fashion, otherwise it won't 
work with the current scheduler's code structure.

Thanks 

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: Daniel Vetter  
Sent: Friday, September 3, 2021 12:11 AM
To: Koenig, Christian 
Cc: Liu, Monk ; Dave Airlie ; Alex Deucher 
; Grodzovsky, Andrey ; Chen, 
JingWen ; DRI Development 
; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

On Thu, Sep 2, 2021 at 1:00 PM Christian König  wrote:
>
> Hi Monk,
>
> Am 02.09.21 um 07:52 schrieb Liu, Monk:
> > [AMD Official Use Only]
> >
> > I'm not sure I can add much to help this along, I'm sure Alex has 
> > some internal training, Once your driver is upstream, it belongs to 
> > upstream, you can maintain it, but you no longer control it 100%, it's a 
> > tradeoff, it's not one companies always understand.
> > Usually people are fine developing away internally, but once interaction 
> > with other parts of the kernel/subsystem is required they have the 
> > realisation that they needed to work upstream 6 months earlier.
> > The best time to interact with upstream was 6 months ago, the second best 
> > time is now.
> > <<<
> >
> > Daniel/AlexD
> >
> > I didn't mean your changes on AMD driver need my personal approval 
> > or review ... and  I'm totally already get used that our driver is not 100% 
> > under control by AMDers, but supposedly any one from community (including 
> > you) who tend to change AMD's driver need at least to get approvement from 
> > someone in AMD, e.g.: AlexD or Christian, doesn't that reasonable?
>
> I'm fearing that just repeating what Alex said, but to make it clear 
> once more: That is *not* necessary!
>
> The shared repository is owned by upstream maintainers and they are 
> usually free to do restructuring work without getting acknowledge from 
> every single driver maintainer.
>
> Anybody can of course technically object to upstream design decisions, 
> but that means that you need to pay attention to the mailing lists in 
> the first place.
>
> > just like we need your approve if we try to modify DRM-sched, or need 
> > panfrost's approval if we need to change panfrost code ...
> >
> > by only CC AMD's engineers looks not quite properly, how do you know if 
> > your changes (on AMD code part) are conflicting with AMD's on-going 
> > internal features/refactoring or not ?
>
> Well because AMD is supposed to work in public as much as possible and 
> ask upstream before doing changes to the code base.
>
> Additional to that design decisions are supposed to be discussed on 
> the mailing list and *not* internally.

Yeah I'm honestly really surprised about the course of this discussion here. 
With Alex, Christian and others amd has a lot of folks with years/decades of 
experience in how to collaborate in upstream, when to pull in others 
proactively and when that's not needed, and in general how to plan upstream 
> work with the least amount of risk and surprises.

I think step zero here needs to be some training at amd and then re-planning 
this effort, before we get back to technical stuff.
Otherwise we'll just get bogged down in pain because expectations about the 
process don't pan out.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Thanks
> >
> > --
> > Monk Liu | Cloud-GPU Core team
> > --

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-02 Thread Daniel Stone
Hi Monk,

On Thu, 2 Sept 2021 at 06:52, Liu, Monk  wrote:
> I didn't mean your changes on AMD driver need my personal approval or review 
> ... and  I'm totally already get used that our driver is not 100% under 
> control by AMDers,
> but supposedly any one from community (including you) who tend to change 
> AMD's driver need at least to get approvement from someone in AMD, e.g.: 
> AlexD or Christian, doesn't that reasonable?
> just like we need your approve if we try to modify DRM-sched, or need 
> panfrost's approval if we need to change panfrost code ...
>
> by only CC AMD's engineers looks not quite properly, how do you know if your 
> changes (on AMD code part) are conflicting with AMD's on-going internal 
> features/refactoring or not ?

Looking at the patches in question, they were (at least mostly) CCed
both to the amd-gfx@ mailing list and also to ckoenig. Unfortunately
it is not possible for every single patch to get mandatory signoff
from every single stakeholder - e.g. if every AMD patch which touched
the scheduler required explicit approval from Etnaviv, Freedreno,
Lima, Panfrost, and V3D teams, it would become very difficult for AMD
to merge any code.

So the approach is that patches are sent for approval, they are CCed
to people who should be interested, and after some time with no
comments, they may be merged if it seems like a reasonable thing to
do.

The problem with internal work is that, well, it's internal. If the
community sends patches to amd-gfx@, there is no comment from AMD, and
then months later we are told that it should not have happened because
it conflicts with development that AMD has been doing - how should the
rest of the community have known about this? So unfortunately this is
the compromise: if you decide to do private development, not inform
anyone about your plans, and not join in any common discussion, then
it is your responsibility to deal with any changes or conflicts that
happen whilst you are developing privately.

The only way we can successfully have support in the same ecosystem
for AMD, Arm, Broadcom, Intel, NVIDIA, Qualcomm, and VeriSilicon, is
that we are all working together openly. If community development had
to stop because each of these vendors had been doing internal
development for several months without even informing the community of
their plans, any kind of shared development is clearly impossible.

Cheers,
Daniel


Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-02 Thread Daniel Vetter
On Thu, Sep 2, 2021 at 1:00 PM Christian König  wrote:
>
> Hi Monk,
>
> Am 02.09.21 um 07:52 schrieb Liu, Monk:
> > [AMD Official Use Only]
> >
> > I'm not sure I can add much to help this along, I'm sure Alex has some 
> > internal training,
> > Once your driver is upstream, it belongs to upstream, you can maintain it, 
> > but you no longer control it 100%, it's a tradeoff, it's not one companies 
> > always understand.
> > Usually people are fine developing away internally, but once interaction 
> > with other parts of the kernel/subsystem is required they have the 
> > realisation that they needed to work upstream 6 months earlier.
> > The best time to interact with upstream was 6 months ago, the second best 
> > time is now.
> > <<<
> >
> > Daniel/AlexD
> >
> > I didn't mean your changes on AMD driver need my personal approval or 
> > review ... and  I'm totally already get used that our driver is not 100% 
> > under control by AMDers,
> > but supposedly any one from community (including you) who tend to change 
> > AMD's driver need at least to get approvement from someone in AMD, e.g.: 
> > AlexD or Christian, doesn't that reasonable?
>
> I'm fearing that just repeating what Alex said, but to make it clear
> once more: That is *not* necessary!
>
> The shared repository is owned by upstream maintainers and they are
> usually free to do restructuring work without getting acknowledge from
> every single driver maintainer.
>
> Anybody can of course technically object to upstream design decisions,
> but that means that you need to pay attention to the mailing lists in
> the first place.
>
> > just like we need your approve if we try to modify DRM-sched, or need 
> > panfrost's approval if we need to change panfrost code ...
> >
> > by only CC AMD's engineers looks not quite properly, how do you know if 
> > your changes (on AMD code part) are conflicting with AMD's on-going 
> > internal features/refactoring or not ?
>
> Well because AMD is supposed to work in public as much as possible and
> ask upstream before doing changes to the code base.
>
> Additional to that design decisions are supposed to be discussed on the
> mailing list and *not* internally.

Yeah I'm honestly really surprised about the course of this discussion
here. With Alex, Christian and others amd has a lot of folks with
years/decades of experience in how to collaborate in upstream, when to
pull in others proactively and when that's not needed, and in general
how to plan upstream work with the least amount of risk and surprises.

I think step zero here needs to be some training at amd and then
re-planning this effort, before we get back to technical stuff.
Otherwise we'll just get bogged down in pain because expectations
about the process don't pan out.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Thanks
> >
> > --
> > Monk Liu | Cloud-GPU Core team
> > --
> >
> > -----Original Message-----
> > From: Dave Airlie 
> > Sent: Thursday, September 2, 2021 2:51 AM
> > To: Alex Deucher 
> > Cc: Liu, Monk ; Daniel Vetter ; Koenig, 
> > Christian ; Grodzovsky, Andrey 
> > ; Chen, JingWen ; DRI 
> > Development ; amd-gfx@lists.freedesktop.org
> > Subject: Re: [diagnostic TDR mode patches] unify our solution 
> > opinions/suggestions in one thread
> >
> > On Thu, 2 Sept 2021 at 01:20, Alex Deucher  wrote:
> >> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk  wrote:
> >>> [AMD Official Use Only]
> >>>
> >>> Daniel
> >>>
> >>>  From the link you share it looks you(or someone else) have quite a bunch 
> >>> patches that changes DRM_SCHED or even amdgpu, by that case before they 
> >>> are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
> >>>
> >>> They looks to me somehow conflicting with what we changed in our repo
> >>>
> >>> It is really a chaos for AMDer if someone else out side of AMD changes 
> >>> our kernel driver (or/and scheduler) without reviewed by AMDer, just like 
> >>> we are requiring your review if we tend to change scheduler's logic here 
> >>> 
> >>>
> >>> This one changes AMD's code:
> >>> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-02 Thread Alex Deucher
On Thu, Sep 2, 2021 at 1:52 AM Liu, Monk  wrote:
>
> [AMD Official Use Only]
>
> >>>
> I'm not sure I can add much to help this along, I'm sure Alex has some 
> internal training,
> Once your driver is upstream, it belongs to upstream, you can maintain it, 
> but you no longer control it 100%, it's a tradeoff, it's not one companies 
> always understand.
> Usually people are fine developing away internally, but once interaction with 
> other parts of the kernel/subsystem is required they have the realisation 
> that they needed to work upstream 6 months earlier.
> The best time to interact with upstream was 6 months ago, the second best 
> time is now.
> <<<
>
> Daniel/AlexD
>
> I didn't mean your changes on AMD driver need my personal approval or review 
> ... and  I'm totally already get used that our driver is not 100% under 
> control by AMDers,
> but supposedly any one from community (including you) who tend to change 
> AMD's driver need at least to get approvement from someone in AMD, e.g.: 
> AlexD or Christian, doesn't that reasonable?
> just like we need your approve if we try to modify DRM-sched, or need 
> panfrost's approval if we need to change panfrost code ...
>
> by only CC AMD's engineers looks not quite properly, how do you know if your 
> changes (on AMD code part) are conflicting with AMD's on-going internal 
> features/refactoring or not ?
>

We keep as up to date as possible with upstream.  I don't have the
bandwidth to verify every patch, but in most cases I try and look at
them.  In your first example, the patch basically just adds a new
parameter to some common functions.  Drivers that don't need that
parameter don't use it.  It shouldn't really affect the functionality.
There are lots of changes that touch our driver that we are largely
not aware of.  E.g., APIs that we may use may change the function
signatures with no intended functional changes.  If problems are found
they are reported and resolved.  It is a collective effort.  If there
are changes that would conflict with stuff we are doing in our tree we
should bring them up when the relevant patches are being discussed.
We can also make changes to core functionality like scheduler, ttm,
etc. that would affect other drivers.  When we send out the patches we
cc the relevant maintainers, but ultimately the ones who participate
in the discussion set the direction.  That's why participation is
important.
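
As a rough sketch of that "new parameter" pattern (hypothetical names; based on
my reading of the linked Boris patch, where a NULL workqueue simply falls back
to the default, so drivers that don't care are unaffected):

#include <linux/workqueue.h>

/* Hypothetical scheduler state, for illustration only. */
struct example_sched {
	struct workqueue_struct *timeout_wq;
};

/*
 * A new optional parameter: passing NULL keeps the old default behaviour, so
 * existing callers need no change; a driver that wants, say, an ordered TDR
 * workqueue passes its own.
 */
static int example_sched_init(struct example_sched *sched,
			      struct workqueue_struct *timeout_wq)
{
	sched->timeout_wq = timeout_wq ?: system_wq;
	return 0;
}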

Alex


> Thanks
>
> --
> Monk Liu | Cloud-GPU Core team
> --
>
> -Original Message-
> From: Dave Airlie 
> Sent: Thursday, September 2, 2021 2:51 AM
> To: Alex Deucher 
> Cc: Liu, Monk ; Daniel Vetter ; Koenig, 
> Christian ; Grodzovsky, Andrey 
> ; Chen, JingWen ; DRI 
> Development ; amd-gfx@lists.freedesktop.org
> Subject: Re: [diagnostic TDR mode patches] unify our solution 
> opinions/suggestions in one thread
>
> On Thu, 2 Sept 2021 at 01:20, Alex Deucher  wrote:
> >
> > On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk  wrote:
> > >
> > > [AMD Official Use Only]
> > >
> > > Daniel
> > >
> > > From the link you share it looks you(or someone else) have quite a bunch 
> > > patches that changes DRM_SCHED or even amdgpu, by that case before they 
> > > are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
> > >
> > > They looks to me somehow conflicting with what we changed in our repo
> > >
> > > It is really a chaos for AMDer if someone else out side of AMD changes 
> > > our kernel driver (or/and scheduler) without reviewed by AMDer, just like 
> > > we are requiring your review if we tend to change scheduler's logic here 
> > > 
> > >
> > > This one changes AMD's code:
> > > https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
> > > And I didn't see any reviewed-by from AMDers ...
> > >
> > > This one also touches AMD's code:
> > > https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-02 Thread Christian König

Hi Monk,

Am 02.09.21 um 07:52 schrieb Liu, Monk:

[AMD Official Use Only]

I'm not sure I can add much to help this along, I'm sure Alex has some internal 
training,
Once your driver is upstream, it belongs to upstream, you can maintain it, but 
you no longer control it 100%, it's a tradeoff, it's not one companies always 
understand.
Usually people are fine developing away internally, but once interaction with 
other parts of the kernel/subsystem is required they have the realisation that 
they needed to work upstream 6 months earlier.
The best time to interact with upstream was 6 months ago, the second best time 
is now.
<<<

Daniel/AlexD

I didn't mean your changes on AMD driver need my personal approval or review 
... and  I'm totally already get used that our driver is not 100% under control 
by AMDers,
but supposedly any one from community (including you) who tend to change AMD's 
driver need at least to get approvement from someone in AMD, e.g.: AlexD or 
Christian, doesn't that reasonable?


I'm fearing that just repeating what Alex said, but to make it clear 
once more: That is *not* necessary!


The shared repository is owned by upstream maintainers and they are 
usually free to do restructuring work without getting acknowledge from 
every single driver maintainer.


Anybody can of course technically object to upstream design decisions, 
but that means that you need to pay attention to the mailing lists in 
the first place.



just like we need your approve if we try to modify DRM-sched, or need 
panfrost's approval if we need to change panfrost code ...

by only CC AMD's engineers looks not quite properly, how do you know if your 
changes (on AMD code part) are conflicting with AMD's on-going internal 
features/refactoring or not ?


Well because AMD is supposed to work in public as much as possible and 
ask upstream before doing changes to the code base.


Additional to that design decisions are supposed to be discussed on the 
mailing list and *not* internally.


Regards,
Christian.



Thanks

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: Dave Airlie 
Sent: Thursday, September 2, 2021 2:51 AM
To: Alex Deucher 
Cc: Liu, Monk ; Daniel Vetter ; Koenig, Christian 
; Grodzovsky, Andrey ; Chen, JingWen 
; DRI Development ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

On Thu, 2 Sept 2021 at 01:20, Alex Deucher  wrote:

On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk  wrote:

[AMD Official Use Only]

Daniel

 From the link you share it looks you(or someone else) have quite a bunch 
patches that changes DRM_SCHED or even amdgpu, by that case before they are 
merged to kernel tree I'm wondering if any AMD develop reviewed them ?

They looks to me somehow conflicting with what we changed in our repo

It is really a chaos for AMDer if someone else out side of AMD changes our 
kernel driver (or/and scheduler) without reviewed by AMDer, just like we are 
requiring your review if we tend to change scheduler's logic here 

This one changes AMD's code:
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
And I didn't see any reviewed-by from AMDers ...

This one also touches AMD's code:
https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/
Which is conflicting with one patch we submitted (in our repo
rightnow), and neither see AMDder gave a review-by on this one (let
me know if I missed it)


Monk, this is not how upstream works.  You need to participate.
That's how communities work.  There's a reason all these discussions
happen on public mailing lists.  The patch author can't be expected to
know every person on every vendor team to CC with a patch.  If you
have concerns, you need to raise them when the patches are being
discussed.


I'm not sure I can add much to help this along, I'm sure Alex has some internal 
training,

Once your driver is upstream, it belongs to upstream, you can maintain it, but 
you no longer control it 100%, it's a tradeoff, it's not one companies always 
understand.

Usually people are fine develo

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Liu, Monk
[AMD Official Use Only]

>>>
I'm not sure I can add much to help this along, I'm sure Alex has some internal 
training,
Once your driver is upstream, it belongs to upstream, you can maintain it, but 
you no longer control it 100%, it's a tradeoff, it's not one companies always 
understand.
Usually people are fine developing away internally, but once interaction with 
other parts of the kernel/subsystem is required they have the realisation that 
they needed to work upstream 6 months earlier.
The best time to interact with upstream was 6 months ago, the second best time 
is now.
<<<

Daniel/AlexD

I didn't mean that your changes to the AMD driver need my personal approval or 
review ... and I'm already quite used to the fact that our driver is not 100% 
under AMD's control.
But supposedly anyone from the community (including you) who intends to change 
AMD's driver needs at least to get approval from someone at AMD, e.g. AlexD or 
Christian; isn't that reasonable?
Just like we need your approval if we try to modify DRM-sched, or need 
panfrost's approval if we need to change panfrost code ...

Only CCing AMD's engineers doesn't look quite proper; how do you know whether 
your changes (to the AMD code) conflict with AMD's ongoing internal 
features/refactoring or not?

Thanks 

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: Dave Airlie  
Sent: Thursday, September 2, 2021 2:51 AM
To: Alex Deucher 
Cc: Liu, Monk ; Daniel Vetter ; Koenig, 
Christian ; Grodzovsky, Andrey 
; Chen, JingWen ; DRI 
Development ; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

On Thu, 2 Sept 2021 at 01:20, Alex Deucher  wrote:
>
> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk  wrote:
> >
> > [AMD Official Use Only]
> >
> > Daniel
> >
> > From the link you share it looks you(or someone else) have quite a bunch 
> > patches that changes DRM_SCHED or even amdgpu, by that case before they are 
> > merged to kernel tree I'm wondering if any AMD develop reviewed them ?
> >
> > They looks to me somehow conflicting with what we changed in our repo
> >
> > It is really a chaos for AMDer if someone else out side of AMD changes our 
> > kernel driver (or/and scheduler) without reviewed by AMDer, just like we 
> > are requiring your review if we tend to change scheduler's logic here 
> >
> > This one changes AMD's code: 
> > https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
> > And I didn't see any reviewed-by from AMDers ...
> >
> > This one also touches AMD's code: 
> > https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/
> > Which is conflicting with one patch we submitted (in our repo 
> > rightnow), and neither see AMDder gave a review-by on this one (let 
> > me know if I missed it)
> >
>
> Monk, this is not how upstream works.  You need to participate.
> That's how communities work.  There's a reason all these discussions 
> happen on public mailing lists.  The patch author can't be expected to 
> know every person on every vendor team to CC with a patch.  If you 
> have concerns, you need to raise them when the patches are being 
> discussed.
>

I'm not sure I can add much to help this along, I'm sure Alex has some internal 
training,

Once your driver is upstream, it belongs to upstream, you can maintain it, but 
you no longer control it 100%, it's a tradeoff, it's not one companies always 
understand.

Usually people are fine developing away internally, but once interaction with 
other parts of the kernel/subsystem is required they have the realisation that 
they needed to work upstream 6 months earlier.

The best time to interact with upstream was 6 months ago, the second best time 
is now.

Dave.


Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Dave Airlie
On Thu, 2 Sept 2021 at 01:20, Alex Deucher  wrote:
>
> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk  wrote:
> >
> > [AMD Official Use Only]
> >
> > Daniel
> >
> > From the link you share it looks you(or someone else) have quite a bunch 
> > patches that changes DRM_SCHED or even amdgpu, by that case before they are 
> > merged to kernel tree I'm wondering if any AMD develop reviewed them ?
> >
> > They looks to me somehow conflicting with what we changed in our repo
> >
> > It is really a chaos for AMDer if someone else out side of AMD changes our 
> > kernel driver (or/and scheduler) without reviewed by AMDer, just like we 
> > are requiring your review if we tend to change scheduler's logic here 
> >
> > This one changes AMD's code: 
> > https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
> > And I didn't see any reviewed-by from AMDers ...
> >
> > This one also touches AMD's code: 
> > https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/
> > Which is conflicting with one patch we submitted (in our repo rightnow), 
> > and neither see AMDder gave a review-by on this one (let me know if I 
> > missed it)
> >
>
> Monk, this is not how upstream works.  You need to participate.
> That's how communities work.  There's a reason all these discussions
> happen on public mailing lists.  The patch author can't be expected to
> know every person on every vendor team to CC with a patch.  If you
> have concerns, you need to raise them when the patches are being
> discussed.
>

I'm not sure I can add much to help this along, I'm sure Alex has some
internal training,

Once your driver is upstream, it belongs to upstream, you can maintain
it, but you no longer control it 100%, it's a tradeoff, it's not one
companies always understand.

Usually people are fine developing away internally, but once
interaction with other parts of the kernel/subsystem is required they
have the realisation that they needed to work upstream 6 months
earlier.

The best time to interact with upstream was 6 months ago, the second
best time is now.

Dave.


Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Alex Deucher
On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk  wrote:
>
> [AMD Official Use Only]
>
> Daniel
>
> From the link you share it looks you(or someone else) have quite a bunch 
> patches that changes DRM_SCHED or even amdgpu, by that case before they are 
> merged to kernel tree I'm wondering if any AMD develop reviewed them ?
>
> They looks to me somehow conflicting with what we changed in our repo
>
> It is really a chaos for AMDer if someone else out side of AMD changes our 
> kernel driver (or/and scheduler) without reviewed by AMDer, just like we are 
> requiring your review if we tend to change scheduler's logic here 
>
> This one changes AMD's code: 
> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
> And I didn't see any reviewed-by from AMDers ...
>
> This one also touches AMD's code: 
> https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/
> Which is conflicting with one patch we submitted (in our repo rightnow), and 
> neither see AMDder gave a review-by on this one (let me know if I missed it)
>

Monk, this is not how upstream works.  You need to participate.
That's how communities work.  There's a reason all these discussions
happen on public mailing lists.  The patch author can't be expected to
know every person on every vendor team to CC with a patch.  If you
have concerns, you need to raise them when the patches are being
discussed.

Alex


> Thanks
>
> --
> Monk Liu | Cloud-GPU Core team
> --
>
> -Original Message-
> From: amd-gfx  On Behalf Of Daniel 
> Vetter
> Sent: Wednesday, September 1, 2021 4:18 PM
> To: Liu, Monk 
> Cc: Koenig, Christian ; Grodzovsky, Andrey 
> ; Chen, JingWen ; DRI 
> Development ; amd-gfx@lists.freedesktop.org
> Subject: Re: [diagnostic TDR mode patches] unify our solution 
> opinions/suggestions in one thread
>
> Hi Monk,
>
> On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk  wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > Hi Daniel/Christian/Andrey
> >
> >
> >
> > It looks the voice from you three are spread over those email floods to me, 
> > the feature we are working on (diagnostic TDR scheme) is pending there for 
> > more than 6 month (we started it from feb 2021).
>
> For me your project exists since a few weeks at most, because that is when 
> your team showed up on dri-devel. That you already spent 6 months on this 
> within amd, on a code area that very much affects shared code, without 
> kicking of any thread on dri-devel isn't great, but also not something we can 
> fix, since time machines don't exist.
>
> So we have to make the best out of the situation and move ahead where we are. 
> From my understanding you've done a bunch of changes to the scheduler code. 
> As far as I can see there's been two related things your team has done:
>
> - remove some allocations from scheduler code, because that can lead to 
> deadlocks. I've kicked up this topic quite a while ago here
>
> https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vet...@ffwll.ch/
>
> This is just one patch of the entire series. This is an area where we really 
> need a consistent solution across all drm/sched drivers, not something that 
> individual drivers just fix in their own way.
>
> - the other one is the timeout issue for the patches you cite here.
> Again there's been discussion on this on dri-devel with Boris from panfrost 
> about how we can handle at least some of the races in tdr.
> That resulted in lots of discussions and documentation improvements.
> Those patches are merged now, link
> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
>
> There's been more than just this, also quite some doc patches from Boris that 
> explain how it's all supposed to work and be race-free.
> Again your driver isn't the only one with interesting TDR races.
>
> Your team hasn't been active in any

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Liu, Monk
[AMD Official Use Only]

>> For me your project exists since a few weeks at most, because that is when 
>> your team showed up on dri-devel. That you already spent 6 months on this 
>> within amd, on a code area that very much affects shared code, without 
>> kicking of any thread on dri-devel isn't great, but also not something we 
>> can fix, since time machines don't exist.
This is partially true: in the past months our changes lived only in the AMD 
driver, and it is only now that we have found we also have to make changes at 
the SCHED level.

>> Your team hasn't been active in any of these discussions, but now suddenly 
>> pops up out of nowhere and demands that your approach needs to land asap. 
>> That's really not how upstream works.
If our changes to the DRI-level part cannot get merged soon, that's fine, we 
can discuss more. But this didn't suddenly pop up from nowhere; we have already 
worked on it for months inside the AMD driver.

>> I think the best way forward is to go through the above process again and 
>> essentially restart. So submit a complete patch series with problem 
>> descriptions, solution you picked, why you picked that, all the amdgpu 
>> patches to get there and the core patches too. Since it sounds like a bunch 
>> of this has all landed already you probably need a patch 1 that goes back to 
>> 6 months ago so that we can see the overall direction, and review whether 
>> that's the right one or not.

We are not objecting to this approach. We didn't do it that way before because 
previously everything we needed to do lived inside the AMD driver ... we were 
trying to avoid changing the DRI/DRM interface part ...

For the patches you showed us in the links, I'm sorry that due to some IT 
infrastructure reasons my team and I didn't see them before (we mostly work in 
the AMD repo ... the patches you showed haven't been merged into our repo 
yet ...).
One thing I also want to emphasize here: if any code needs to change inside the 
AMD driver, please always let us know and have us review it.

Thanks 

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: amd-gfx  On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk 
Cc: Koenig, Christian ; Grodzovsky, Andrey 
; Chen, JingWen ; DRI 
Development ; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk  wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, 
> the feature we are working on (diagnostic TDR scheme) is pending there for 
> more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is when your 
team showed up on dri-devel. That you already spent 6 months on this within 
amd, on a code area that very much affects shared code, without kicking of any 
thread on dri-devel isn't great, but also not something we can fix, since time 
machines don't exist.

So we have to make the best out of the situation and move ahead where we are. 
From my understanding you've done a bunch of changes to the scheduler code. As 
far as I can see there's been two related things your team has done:

- remove some allocations from scheduler code, because that can lead to 
deadlocks. I've kicked up this topic quite a while ago here

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vet...@ffwll.ch/

This is just one patch of the entire series. This is an area where we really 
need a consistent solution across all drm/sched drivers, not something that 
individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from panfrost 
about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Liu, Monk
[AMD Official Use Only]

Daniel 

From the link you shared, it looks like you (or someone else) have quite a 
bunch of patches that change DRM_SCHED or even amdgpu. In that case, before 
they are merged to the kernel tree, I'm wondering whether any AMD developer 
reviewed them?

They look to me as if they somehow conflict with what we changed in our repo.

It is really chaotic for AMD folks if someone outside of AMD changes our kernel 
driver (and/or the scheduler) without it being reviewed by an AMD developer, 
just like we are required to get your review if we intend to change the 
scheduler's logic here.

This one changes AMD's code: 
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/
And I didn't see any Reviewed-by from AMD developers ...

This one also touches AMD's code: 
https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vet...@ffwll.ch/
It conflicts with one patch we have submitted (in our repo right now), and I 
don't see an AMD developer giving a Reviewed-by on this one either (let me know 
if I missed it).

Thanks 

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: amd-gfx  On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk 
Cc: Koenig, Christian ; Grodzovsky, Andrey 
; Chen, JingWen ; DRI 
Development ; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk  wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, 
> the feature we are working on (diagnostic TDR scheme) is pending there for 
> more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is when your 
team showed up on dri-devel. That you already spent 6 months on this within 
amd, on a code area that very much affects shared code, without kicking of any 
thread on dri-devel isn't great, but also not something we can fix, since time 
machines don't exist.

So we have to make the best out of the situation and move ahead where we are. 
From my understanding you've done a bunch of changes to the scheduler code. As 
far as I can see there's been two related things your team has done:

- remove some allocations from scheduler code, because that can lead to 
deadlocks. I've kicked up this topic quite a while ago here

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vet...@ffwll.ch/

This is just one patch of the entire series. This is an area where we really 
need a consistent solution across all drm/sched drivers, not something that 
individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from panfrost 
about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/

There's been more than just this, also quite some doc patches from Boris that 
explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.

Your team hasn't been active in any of these discussions, but now suddenly pops 
up out of nowhere and demands that your approach needs to land asap. That's 
really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing context 
for outsiders. The patches sometimes come with zero commit message, for tricky 
concurrency bugs. And there's no context with what you've done already on the 
amdgpu side (since that never showed up on dri-devel), which makes constructive 
discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed to work 
when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on dri-devel (or 
whatever the topic really is about) about the problem you have and how your 
trying to solve it.

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Daniel Vetter
Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk  wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, 
> the feature we are working on (diagnostic TDR scheme) is pending there for 
> more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is
when your team showed up on dri-devel. That you already spent 6 months
on this within amd, on a code area that very much affects shared code,
without kicking off any thread on dri-devel isn't great, but also not
something we can fix, since time machines don't exist.

So we have to make the best out of the situation and move ahead where
we are. From my understanding you've done a bunch of changes to the
scheduler code. As far as I can see there's been two related things
your team has done:

- remove some allocations from scheduler code, because that can lead
to deadlocks. I've kicked up this topic quite a while ago here

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vet...@ffwll.ch/

This is just one patch of the entire series. This is an area where we
really need a consistent solution across all drm/sched drivers, not
something that individual drivers just fix in their own way.
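
To make the deadlock concern concrete, a hypothetical sketch (made-up names,
not code from the series): allocating with GFP_KERNEL on the completion or
timeout path can recurse into direct reclaim, and reclaim may end up waiting
on dma-fences that only signal once this very path makes progress.

#include <linux/slab.h>

struct recovery_data {
	int reason;
};

struct example_job {
	struct recovery_data *recovery;	/* pre-allocated at submit time */
};

/* Problematic: allocation on the timeout/completion path. */
static int bad_timeout_handler(struct example_job *job)
{
	struct recovery_data *rd = kzalloc(sizeof(*rd), GFP_KERNEL);

	if (!rd)
		return -ENOMEM;
	rd->reason = 1;
	/* ... use rd for recovery ... */
	kfree(rd);
	return 0;
}

/* Safer: allocate at submit time, where blocking in reclaim is fine, and only
 * reuse the memory from the timeout path. */
static int good_submit(struct example_job *job)
{
	job->recovery = kzalloc(sizeof(*job->recovery), GFP_KERNEL);
	return job->recovery ? 0 : -ENOMEM;
}

static void good_timeout_handler(struct example_job *job)
{
	job->recovery->reason = 1;	/* no allocation here */
}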

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from
panfrost about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezil...@collabora.com/

There's been more than just this, also quite some doc patches from
Boris that explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.

Your team hasn't been active in any of these discussions, but now
suddenly pops up out of nowhere and demands that your approach needs
to land asap. That's really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing
context for outsiders. The patches sometimes come with zero commit
message, for tricky concurrency bugs. And there's no context with what
you've done already on the amdgpu side (since that never showed up on
dri-devel), which makes constructive discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed
to work when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on
dri-devel (or whatever the topic really is about) about the problem
you have and how your trying to solve it. This can be just text if
it's a big thing, but it can also already include some proof of
concept solution in the form of patches.

- Then we iterate on the solution, across drivers and shared code
_together_. Not "merge amdgpu code first, then get annoyed when the
core changes don't land immediately after you've practically finished
the project".

- This might mean changes to other drivers if we need to adjust interfaces.

On the plus side you can plan much better, because you know you have
upstream buy-in before you start to put in real work on the project.

> Honestly speaking the email ways that we are using now is not friendly and 
> quite painful to me ….

Yes this is painful :-(

I think the best way forward is to go through the above process again
and essentially restart. So submit a complete patch series with
problem descriptions, solution you picked, why you picked that, all
the amdgpu patches to get there and the core patches too. Since it
sounds like a bunch of this has all landed already you probably need a
patch 1 that goes back to 6 months ago so that we can see the overall
direction, and review whether that's the right one or not.

The not-so-painful approach would have been to do this from the start,
6 months ago. It would definitely have helped if the tdr discussion
we've had just a few months ago would have involved your team too, I'm
sure there would have been some good insights from amd's side. I'd
really want you and your engineers involved here, so let's do this
properly!

Cheers, Daniel

> Can we try to put all our opinions, suggestions, or even objects here 
> together, let’s go through them one by one, it’s too hard for us to reply 
> each email on different questions .
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fixing patch on the timeout timer in scheduler, can we complete 
> this one first ? it should already resolved all the questions and suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already explained the questions raised by Daniel in other thread , 
> regarding why I use __kthread_should_park()
>
> For other aspects, can we put all our opinion synthesized here ?
>
>
>
> Thanks !
>
>
>
> 

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Andrey Grodzovsky



On 2021-09-01 12:40 a.m., Jingwen Chen wrote:

On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote:

On 2021-09-01 12:25 a.m., Jingwen Chen wrote:

On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:

I will answer everything here -

On 2021-08-31 9:58 p.m., Liu, Monk wrote:


  [AMD Official Use Only]


  In the previous discussion, you guys stated that we should drop the
  “kthread_should_park” in cleanup_job.


  @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
   {
   	struct drm_sched_job *job, *next;
  
  -	/*
  -	 * Don't destroy jobs while the timeout worker is running OR thread
  -	 * is being parked and hence assumed to not touch pending_list
  -	 */
  -	if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
  -	    !cancel_delayed_work(&sched->work_tdr)) ||
  -	    kthread_should_park())
  -		return NULL;


  But I suddenly have a question here: if return the timedout job no matter
  kthread_should_park() or not, then we are backing to the original problem
  again: that the timedout_job is suddenly signaling and cleanup_job still
  returns it to sched_main and job is freed while it is still handling by
  vendor’s timeout callback


  If we return NULL when kthread_should_park() in cleanup_job, we can 
prevent
  above scenario from happening: once a job is processed by job_timedout we
  can stop its scheduler, and after that even this job suddenly signaled the
  cleanup_job won’t return it so sched_main won’t free it in parallel …


  What do you think ?


Does your analysis above take into account that you also submit
'[PATCH 2/2] drm/sched: serialize job_timeout and scheduler'? Then I don't see a
problem -

Hi Andrey,
Monk has talked to me, and we agreed that since there are multiple opinions 
about '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler', and patch 1 
is an independent patch that fixes a separate error, we should not take patch 2 
into the analysis.


I think that as long as you put kthread_park(sched->thread) BEFORE
fetching the next bad job from the pending list (under the spinlock), there is
no such issue as in the case you describe, because the potential bad job
that became signaled will have been removed from the pending list before you
even fetch the next job, and by the time you fetch it the scheduler
thread is already stopped anyway.

If you don't submit it and we keep the removal hack for now, then there is also
no problem, because we temporarily remove the job we fetch for TDR from the
pending list under the spinlock exactly to avoid this race.
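
For illustration, a minimal sketch of that ordering (hypothetical function,
not the actual drm_sched code):

#include <linux/kthread.h>
#include <drm/gpu_scheduler.h>

static void example_job_timedout(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *bad;

	/* 1. Park the scheduler thread first, so drm_sched_get_cleanup_job()
	 *    (which runs from that thread) cannot free jobs concurrently. */
	kthread_park(sched->thread);

	/* 2. Only now fetch the bad job, under the pending-list lock.  A job
	 *    that already signaled was removed from the list by the scheduler
	 *    thread before it was parked, so we cannot pick it up and race
	 *    with its free. */
	spin_lock(&sched->job_list_lock);
	bad = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	spin_unlock(&sched->job_list_lock);

	if (bad) {
		/* ... run the driver's timeout/reset handling on 'bad' ... */
	}

	kthread_unpark(sched->thread);
}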


So can you help review [PATCH 1/2] drm/sched: fix the bug of time out 
calculation(v3)?
patch v3 keeps this kthread_should_park check.


But since in both cases looks like there is no danger of use after free
then I see no reason to keep kthread_should_park check.

Andrey

OK, I get it. So patch v4 has removed this check, can you help review
[PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)?



Sure

Andrey





Best Regards,
JingWen

  Thanks


  --

  Monk Liu | Cloud-GPU Core team

  --


  From: Liu, Monk
  Sent: Wednesday, September 1, 2021 9:23 AM
  To: Koenig, Christian ; Grodzovsky, Andrey
  ; Daniel Vetter ; Chen, 
JingWen
  
  Cc: DRI Development ;
  amd-gfx@lists.freedesktop.org
  Subject: [diagnostic TDR mode patches] unify our solution opinions/
  suggestions in one thread


  [AMD Official Use Only]


  Hi Daniel/Christian/Andrey


  It looks the voice from you three are spread over those email floods to 
me,
  the feature we are working on (diagnostic TDR scheme) is pending there for
  more than 6 month (we started it from feb 2021).


  Honestly speaking the email ways that we are using now is not friendly and
  quite painful to me ….

  Can we try to put all our opinions, suggestions, or even objects here
  together, let’s go through them one by one, it’s too hard for us to reply
  each email on different questions .


  For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)


  This is a fixing patch on the timeout timer in scheduler, can we complete
  this one first ? it should already resolved all the questions and
  suggestions.


I have no objections to this one, besides getting rid of the
kthread_should_park() return-NULL part;
if my answer above is not wrong, then that check seems superfluous to me.



  For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler


  I think I already answered the questions raised by Daniel in the other
  thread regarding why I use __kthread_should_park().


Is this race free ? Can't the other thread execute kthread_park after the check
?
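
To illustrate the window being asked about (a hedged sketch of the concern,
not actual driver code):

        /*
         * scheduler thread (sched_main)      another thread (TDR / reset)
         *
         * if (!__kthread_should_park())
         *     ... observes "no park requested" ...
         *                                      kthread_park(sched->thread);
         *                                      (sets KTHREAD_SHOULD_PARK and
         *                                       waits for the thread to park)
         *     ... keeps acting on the now-stale check result until it
         *         reaches its next parking point ...
         */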


  For other aspects, can we put all our opinions together here?

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Andrey Grodzovsky

I will answer everything here -

On 2021-08-31 9:58 p.m., Liu, Monk wrote:


[AMD Official Use Only]

In the previous discussion, you guys stated that we should drop the 
“kthread_should_park” in cleanup_job.


@@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 {
        struct drm_sched_job *job, *next;

-       /*
-        * Don't destroy jobs while the timeout worker is running OR thread
-        * is being parked and hence assumed to not touch pending_list
-        */
-       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
-           !cancel_delayed_work(&sched->work_tdr)) ||
-           kthread_should_park())
-               return NULL;

But I suddenly have a question here: if we return the timedout job
regardless of kthread_should_park(), then we are back to the original
problem: the timedout job signals unexpectedly, cleanup_job still returns
it to sched_main, and the job is freed while the vendor's timeout callback
is still handling it.

If we return NULL when kthread_should_park() is true in cleanup_job, we
can prevent the above scenario: once a job is being processed by
job_timedout we can stop its scheduler, and after that, even if the job
suddenly signals, cleanup_job won't return it, so sched_main won't free it
in parallel.


What do you think ?



Does your analysis above take into account that you also submit
'[PATCH 2/2] drm/sched: serialize job_timeout and scheduler'? If so, I
don't see a problem -

I think that as long as you put kthread_park(sched->thread) BEFORE
fetching the next bad job from the pending list (under the spinlock),
there is no such issue as in the case you describe: the potential bad job
that became signaled will be removed from the pending list before you
even fetch the next job, and by the time you fetch it the scheduler
thread is already stopped anyway.

If you don't submit it and we keep the removal hack for now, there is
also no problem, because we temporarily remove the job we fetch for TDR
from the pending list under the spinlock exactly to avoid this race.



Thanks

--

Monk Liu | Cloud-GPU Core team

--

*From:* Liu, Monk
*Sent:* Wednesday, September 1, 2021 9:23 AM
*To:* Koenig, Christian ; Grodzovsky, Andrey 
; Daniel Vetter ; Chen, 
JingWen 
*Cc:* DRI Development ; 
amd-gfx@lists.freedesktop.org
*Subject:* [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread


[AMD Official Use Only]

Hi Daniel/Christian/Andrey

It looks to me like the feedback from you three is spread across those
email floods. The feature we are working on (the diagnostic TDR scheme)
has been pending for more than 6 months (we started it in Feb 2021).

Honestly speaking, the way we are using email now is not friendly and is
quite painful to me.

Can we try to put all our opinions, suggestions, or even objections here
together and go through them one by one? It's too hard for us to reply to
each email on different questions.

For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

This is a fix for the timeout timer in the scheduler; can we complete
this one first? It should already have addressed all the questions and
suggestions.




I have no objections to this one besides getting rid of the
kthread_should_park() return-NULL part;

if my answer above is not wrong then it seems superfluous to me.



For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

I think I already answered the questions raised by Daniel in the other
thread regarding why I use __kthread_should_park().




Is this race free ? Can't the other thread execute kthread_park after 
the check ?




For other aspects, can we put all our opinion synthesized here ?



So, to summarize from previous threads: I think the best solution to the
problem this patch is solving is to do the HW fence embedding at the
drm_sched_job level instead of only for amdgpu, and to modify all the
drivers to support it. Then we can both remove this hack and solve the
race against concurrent drm_sched_cleanup_jobs job freeing simply by
taking a reference to the job's HW fence at the beginning of
drm_sched_job_timedout.
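
A hedged sketch of that idea (it assumes the job's HW fence is reachable
as job->s_fence->parent and that, with the embedding in place, the job's
lifetime is tied to that fence's refcount; the function name is made up
and this is an illustration, not the proposed patch):

        static void sched_job_timedout_sketch(struct drm_sched_job *job)
        {
                struct dma_fence *hw_fence;

                /*
                 * Pin the HW fence at the very start so a concurrent
                 * drm_sched_get_cleanup_job()/free cannot pull the job
                 * out from under the timeout handling.
                 */
                hw_fence = dma_fence_get(job->s_fence->parent);

                /* ... park the scheduler, run the driver's timedout_job
                 * handling, restart the scheduler ... */

                dma_fence_put(hw_fence);
        }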

If doing this refactoring for all the drivers is not an option now and
you need a quick solution such as the serialization you do here, then
take into account the other concurrent TDR handlers that might run. As
mentioned before, without serializing against TDR handlers from other
schedulers you just race against them: e.g. you parked the thread now,
but another handler in progress will unpark it as part of calling
drm_sched_start for other rings. So you either need a global lock or a
dedicated single-threaded queue, as Daniel suggested.

At minimum we should change cancel_delayed_work in drm_sched_stop to
cancel_delayed_work_sync, to cancel and flush all concurrent TDRs, and
probably move it to the beginning of the function, after kthread_park

and before
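
A hedged sketch of that last suggestion (illustrative only, function name
made up, field names as in the drm_sched of that era):

        void drm_sched_stop_sketch(struct drm_gpu_scheduler *sched)
        {
                kthread_park(sched->thread);

                /*
                 * Synchronously cancel and flush the timeout work so no
                 * TDR handler for this scheduler is still running
                 * concurrently.
                 */
                cancel_delayed_work_sync(&sched->work_tdr);

                /* ... rest of the stop sequence ... */
        }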

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Liu, Monk
[AMD Official Use Only]

In the previous discussion, you guys stated that we should drop the 
"kthread_should_park" in cleanup_job.

@@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 {
        struct drm_sched_job *job, *next;

-       /*
-        * Don't destroy jobs while the timeout worker is running OR thread
-        * is being parked and hence assumed to not touch pending_list
-        */
-       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
-           !cancel_delayed_work(&sched->work_tdr)) ||
-           kthread_should_park())
-               return NULL;

But I suddenly have a question here: if we return the timedout job regardless
of kthread_should_park(), then we are back to the original problem: the
timedout job signals unexpectedly, cleanup_job still returns it to sched_main,
and the job is freed while the vendor's timeout callback is still handling it.

If we return NULL when kthread_should_park() is true in cleanup_job, we can
prevent the above scenario: once a job is being processed by job_timedout we
can stop its scheduler, and after that, even if the job suddenly signals,
cleanup_job won't return it, so sched_main won't free it in parallel.

What do you think ?
Thanks

--
Monk Liu | Cloud-GPU Core team
--

From: Liu, Monk
Sent: Wednesday, September 1, 2021 9:23 AM
To: Koenig, Christian ; Grodzovsky, Andrey 
; Daniel Vetter ; Chen, JingWen 

Cc: DRI Development ; 
amd-gfx@lists.freedesktop.org
Subject: [diagnostic TDR mode patches] unify our solution opinions/suggestions 
in one thread


[AMD Official Use Only]

Hi Daniel/Christian/Andrey

It looks to me like the feedback from you three is spread across those email
floods. The feature we are working on (the diagnostic TDR scheme) has been
pending for more than 6 months (we started it in Feb 2021).

Honestly speaking, the way we are using email now is not friendly and is quite
painful to me.
Can we try to put all our opinions, suggestions, or even objections here
together and go through them one by one? It's too hard for us to reply to each
email on different questions.

For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

This is a fix for the timeout timer in the scheduler; can we complete this one
first? It should already have addressed all the questions and suggestions.

For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

I think I already answered the questions raised by Daniel in the other thread
regarding why I use __kthread_should_park().
For other aspects, can we put all our opinions together here?

Thanks !

--
Monk Liu | Cloud-GPU Core team
--



[diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Liu, Monk
[AMD Official Use Only]

Hi Daniel/Christian/Andrey

It looks to me like the feedback from you three is spread across those email
floods. The feature we are working on (the diagnostic TDR scheme) has been
pending for more than 6 months (we started it in Feb 2021).

Honestly speaking, the way we are using email now is not friendly and is quite
painful to me.
Can we try to put all our opinions, suggestions, or even objections here
together and go through them one by one? It's too hard for us to reply to each
email on different questions.

For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

This is a fix for the timeout timer in the scheduler; can we complete this one
first? It should already have addressed all the questions and suggestions.

For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

I think I already answered the questions raised by Daniel in the other thread
regarding why I use __kthread_should_park().
For other aspects, can we put all our opinions together here?

Thanks !

--
Monk Liu | Cloud-GPU Core team
--