Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-08-10 Thread Michel Dänzer
On 8/9/23 21:15, Marek Olšák wrote:
> On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer  
> wrote:
>> On 8/8/23 19:03, Marek Olšák wrote:
>>> It's the same situation as SIGSEGV. A process can catch the signal,
>>> but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
>>> to catch the GPU error and prevent the process termination. If you
>>> don't use the API, you'll get undefined behavior, which means anything
>>> can happen, including process termination.
>>
>> Got a spec reference for that?
>>
>> I know the spec allows process termination in response to e.g. out of bounds 
>> buffer access by the application (which corresponds to SIGSEGV). There are 
>> other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec 
>> says:
>>
>> If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
>> then the implementation will never deliver notification of reset
>> events, and GetGraphicsResetStatusARB will always return
>> NO_ERROR[fn1].
>>[fn1: In this case it is recommended that implementations should
>> not allow loss of context state no matter what events occur.
>> However, this is only a recommendation, and cannot be relied
>> upon by applications.]
>>
>> No mention of process termination, that rather sounds to me like the GL 
>> implementation should do its best to keep the application running.
> 
> It basically says that we can do anything.

Not really? If program termination is a possible outcome, the spec otherwise 
mentions that explicitly, ala "including program termination".


> A frozen window or flipping between 2 random frames can't be described
> as "keeping the application running".

This assumes that an application which uses OpenGL cannot have any other 
purpose than using OpenGL.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-08-09 Thread Marek Olšák
On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer  wrote:
>
> On 8/8/23 19:03, Marek Olšák wrote:
> > It's the same situation as SIGSEGV. A process can catch the signal,
> > but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
> > to catch the GPU error and prevent the process termination. If you
> > don't use the API, you'll get undefined behavior, which means anything
> > can happen, including process termination.
>
> Got a spec reference for that?
>
> I know the spec allows process termination in response to e.g. out of bounds 
> buffer access by the application (which corresponds to SIGSEGV). There are 
> other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec 
> says:
>
> If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
> then the implementation will never deliver notification of reset
> events, and GetGraphicsResetStatusARB will always return
> NO_ERROR[fn1].
>[fn1: In this case it is recommended that implementations should
> not allow loss of context state no matter what events occur.
> However, this is only a recommendation, and cannot be relied
> upon by applications.]
>
> No mention of process termination, that rather sounds to me like the GL 
> implementation should do its best to keep the application running.

It basically says that we can do anything.

A frozen window or flipping between 2 random frames can't be described
as "keeping the application running". That's the worst user
experience. I will not accept it.

A window system can force-enable robustness for its non-robust apps
and control that. That's the best possible user experience and it's
achievable everywhere. Everything else doesn't matter.

Marek




Marek


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-08-09 Thread Michel Dänzer
On 8/8/23 19:03, Marek Olšák wrote:
> It's the same situation as SIGSEGV. A process can catch the signal,
> but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
> to catch the GPU error and prevent the process termination. If you
> don't use the API, you'll get undefined behavior, which means anything
> can happen, including process termination.

Got a spec reference for that?

I know the spec allows process termination in response to e.g. out of bounds 
buffer access by the application (which corresponds to SIGSEGV). There are 
other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec 
says:

If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
then the implementation will never deliver notification of reset
events, and GetGraphicsResetStatusARB will always return
NO_ERROR[fn1].
   [fn1: In this case it is recommended that implementations should
not allow loss of context state no matter what events occur.
However, this is only a recommendation, and cannot be relied
upon by applications.]

No mention of process termination, that rather sounds to me like the GL 
implementation should do its best to keep the application running.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-08-08 Thread Marek Olšák
It's the same situation as SIGSEGV. A process can catch the signal,
but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
to catch the GPU error and prevent the process termination. If you
don't use the API, you'll get undefined behavior, which means anything
can happen, including process termination.



Marek

On Tue, Aug 8, 2023 at 8:14 AM Sebastian Wick  wrote:
>
> On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter  wrote:
> >
> > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen 
> > > Signed-off-by: André Almeida 
> > > ---
> > >
> > > v4: 
> > > https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
> > >
> > > Changes:
> > >  - Grammar fixes (Randy)
> > >
> > >  Documentation/gpu/drm-uapi.rst | 68 ++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst 
> > > b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a 
> > > third handler for
> > >  mmapped regular files. Threads cause additional pain with signal
> > >  handling as well.
> > >
> > > +Device reset
> > > +
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware 
> > > bugs,
> > > +faulty applications and everything in between the many layers. Some 
> > > errors
> > > +require resetting the device in order to make the device usable again. 
> > > This
> > > +sections describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +--
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to 
> > > perform
> > > +it as needed. Usually a hang is detected when a job gets stuck 
> > > executing. KMD
> > > +should keep track of resets, because userspace can query any time about 
> > > the
> > > +reset stats for an specific context. This is needed to propagate to the 
> > > rest of
> > > +the stack that a reset has happened. Currently, this is implemented by 
> > > each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the 
> > > device has
> > > +been reset, and this can be checked more often if the UMD requires it. 
> > > After
> > > +detecting a reset, UMD will then proceed to report it to the application 
> > > using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +--
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is 
> > > using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. 
> > > However,
> > > +there is no guarantee that the app will use such features correctly, and 
> > > the
> > > +UMD can implement policies to close the app if it is a repeating 
> > > offender,
> >
> > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> > offender killing is more a policy where the kernel enforces policy, and no
> > longer up to userspace to dtrt (because very clearly userspace is not
> > really doing the right thing anymore when it's just hanging the gpu in an
> > endless loop). Also maybe tune it down further to something like "the
> > kernel driver may implemnent ..."
> >
> > In my opinion the umd shouldn't implement these kind of magic guesses, the
> > entire point of robustness apis is to delegate responsibility for
> > correctly recovering to the application. And the kernel is left with
> > enforcing fair resource usage policies (which eventually might be a
> > cgroups limit on how much gpu time you're allowed to waste with gpu
> > resets).
>
> Killing apps that the kernel thinks are misbehaving really doesn't
> seem like a good idea to me. What if the process is a service getting
> restarted after getting killed? What if killing that process leaves
> the system in a bad state?
>
> Can't the kernel provide some information to user space so that e.g.
> systemd can handle those situations?
>
> > > +likely in a broken loop. This is done to ensure that it does not keep 
> > > blocking
> > > +the user interface from being correctly displayed. This should be done 
> > > even if
> > > +the app is correct but happens to trigger some bug in the 
> > > hardware/driver.
> > > +
> > > +OpenGL
> > > +~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` 

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-08-08 Thread Sebastian Wick
On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter  wrote:
>
> On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen 
> > Signed-off-by: André Almeida 
> > ---
> >
> > v4: 
> > https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
> >
> > Changes:
> >  - Grammar fixes (Randy)
> >
> >  Documentation/gpu/drm-uapi.rst | 68 ++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third 
> > handler for
> >  mmapped regular files. Threads cause additional pain with signal
> >  handling as well.
> >
> > +Device reset
> > +
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +--
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to 
> > perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. 
> > KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the 
> > rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +
> > +
> > +The UMD should check before submitting new commands to the KMD if the 
> > device has
> > +been reset, and this can be checked more often if the UMD requires it. 
> > After
> > +detecting a reset, UMD will then proceed to report it to the application 
> > using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +--
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. 
> > However,
> > +there is no guarantee that the app will use such features correctly, and 
> > the
> > +UMD can implement policies to close the app if it is a repeating offender,
>
> Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> offender killing is more a policy where the kernel enforces policy, and no
> longer up to userspace to dtrt (because very clearly userspace is not
> really doing the right thing anymore when it's just hanging the gpu in an
> endless loop). Also maybe tune it down further to something like "the
> kernel driver may implemnent ..."
>
> In my opinion the umd shouldn't implement these kind of magic guesses, the
> entire point of robustness apis is to delegate responsibility for
> correctly recovering to the application. And the kernel is left with
> enforcing fair resource usage policies (which eventually might be a
> cgroups limit on how much gpu time you're allowed to waste with gpu
> resets).

Killing apps that the kernel thinks are misbehaving really doesn't
seem like a good idea to me. What if the process is a service getting
restarted after getting killed? What if killing that process leaves
the system in a bad state?

Can't the kernel provide some information to user space so that e.g.
systemd can handle those situations?

> > +likely in a broken loop. This is done to ensure that it does not keep 
> > blocking
> > +the user interface from being correctly displayed. This should be done 
> > even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
> > +
> > +OpenGL
> > +~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). 
> > This
> > +interface tells if a reset has happened, and if so, all the context state 
> > is
> > +considered lost and the app proceeds by creating new ones. If it is 
> > possible to
> > +determine that robustness is not in use, the UMD will terminate the app 
> > when a
> > +reset is detected, giving that the contexts are lost and the app won't be 
> > able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for 
> > submissions.
> > +This error code means, among other things, that a device reset has 
> > 

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-08-04 Thread Daniel Vetter
On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
> 
> Acked-by: Pekka Paalanen 
> Signed-off-by: André Almeida 
> ---
> 
> v4: 
> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
> 
> Changes:
>  - Grammar fixes (Randy)
> 
>  Documentation/gpu/drm-uapi.rst | 68 ++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third 
> handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>  
> +Device reset
> +
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +--
> +
> +The KMD is responsible for checking if the device needs a reset, and to 
> perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest 
> of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +
> +
> +The UMD should check before submitting new commands to the KMD if the device 
> has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application 
> using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +--
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. 
> However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,

Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
offender killing is more a policy where the kernel enforces policy, and no
longer up to userspace to dtrt (because very clearly userspace is not
really doing the right thing anymore when it's just hanging the gpu in an
endless loop). Also maybe tune it down further to something like "the
kernel driver may implemnent ..."

In my opinion the umd shouldn't implement these kind of magic guesses, the
entire point of robustness apis is to delegate responsibility for
correctly recovering to the application. And the kernel is left with
enforcing fair resource usage policies (which eventually might be a
cgroups limit on how much gpu time you're allowed to waste with gpu
resets).

> +likely in a broken loop. This is done to ensure that it does not keep 
> blocking
> +the user interface from being correctly displayed. This should be done even 
> if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). 
> This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible 
> to
> +determine that robustness is not in use, the UMD will terminate the app when 
> a
> +reset is detected, giving that the contexts are lost and the app won't be 
> able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened 
> and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the 
> reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.

Since we do not seem to have a solid consensus in the community about
non-robust userspace, maybe we could just document that lack of consensus
to unblock this 

Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-08-02 Thread Michel Dänzer
On 8/2/23 09:38, Marek Olšák wrote:
> 
> The precedent from the CPU land is pretty strong here. There is
> SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
> instructions, yet we do nothing for invalid GPU memory access and
> invalid GPU instructions. Sending a terminating signal from the kernel
> would be the most natural thing to do.

After an unhandled SIGSEGV or SIGILL, the process is in an inconsistent state 
and cannot safely continue executing. That's why the process is terminated by 
default in those cases.

The same is not true when an OpenGL context stops working. Any threads / other 
parts of the process not using that OpenGL context continue working normally. 
And any attempts to use that OpenGL context can be safely ignored (or the 
OpenGL implementation couldn't support the robustness extensions).


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-08-02 Thread Marek Olšák
A screen that doesn't update isn't usable. Killing the window system
and returning to the login screen is one option. Killing the window
system manually from a terminal or over ssh and then returning to the
login screen is another option, but 99% of users either hard-reset the
machine or do sysrq+REISUB anyway because it's faster that way. Those
are all your options. If we don't do the kill, users might decide to
do a hard reset with an unsync'd file system, which can cause more
damage.

The precedent from the CPU land is pretty strong here. There is
SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
instructions, yet we do nothing for invalid GPU memory access and
invalid GPU instructions. Sending a terminating signal from the kernel
would be the most natural thing to do. Instead, we just keep a frozen
GUI to keep users helpless, or we continue command submission and then
the hanging app can cause an infinite cycle of GPU hangs and resets,
making the GPU unusable until somebody kills the app over ssh.

That's why GL/Vulkan robustness is required - either robust apps, or a
robust compositor that greys out lost windows and pops up a diagnostic
message with a list of actions to choose from. That's the direction we
should be taking. Non-robust apps under a non-robust compositor should
just be killed if they crash the GPU.


Marek

On Wed, Jul 26, 2023 at 4:07 AM Michel Dänzer
 wrote:
>
> On 7/25/23 15:02, André Almeida wrote:
> > Em 25/07/2023 05:03, Michel Dänzer escreveu:
> >> On 7/25/23 04:55, André Almeida wrote:
> >>> Hi everyone,
> >>>
> >>> It's not clear what we should do about non-robust OpenGL apps after GPU 
> >>> resets, so I'll try to summarize the topic, show some options and my 
> >>> proposal to move forward on that.
> >>>
> >>> Em 27/06/2023 10:23, André Almeida escreveu:
>  +Robustness
>  +--
>  +
>  +The only way to try to keep an application working after a reset is if 
>  it
>  +complies with the robustness aspects of the graphical API that it is 
>  using.
>  +
>  +Graphical APIs provide ways to applications to deal with device resets. 
>  However,
>  +there is no guarantee that the app will use such features correctly, 
>  and the
>  +UMD can implement policies to close the app if it is a repeating 
>  offender,
>  +likely in a broken loop. This is done to ensure that it does not keep 
>  blocking
>  +the user interface from being correctly displayed. This should be done 
>  even if
>  +the app is correct but happens to trigger some bug in the 
>  hardware/driver.
>  +
> >>> Depending on the OpenGL version, there are different robustness API 
> >>> available:
> >>>
> >>> - OpenGL ABR extension [0]
> >>> - OpenGL KHR extension [1]
> >>> - OpenGL ES extension  [2]
> >>>
> >>> Apps written in OpenGL should use whatever version is available for them 
> >>> to make the app robust for GPU resets. That usually means calling 
> >>> GetGraphicsResetStatusARB(), checking the status, and if it encounter 
> >>> something different from NO_ERROR, that means that a reset has happened, 
> >>> the context is considered lost and should be recreated. If an app follow 
> >>> this, it will likely succeed recovering a reset.
> >>>
> >>> What should non-robustness apps do then? They certainly will not be 
> >>> notified if a reset happens, and thus can't recover if their context is 
> >>> lost. OpenGL specification does not explicitly define what should be done 
> >>> in such situations[3], and I believe that usually when the spec mandates 
> >>> to close the app, it would explicitly note it.
> >>>
> >>> However, in reality there are different types of device resets, causing 
> >>> different results. A reset can be precise enough to damage only the 
> >>> guilty context, and keep others alive.
> >>>
> >>> Given that, I believe drivers have the following options:
> >>>
> >>> a) Kill all non-robust apps after a reset. This may lead to lose work 
> >>> from innocent applications.
> >>>
> >>> b) Ignore all non-robust apps OpenGL calls. That means that applications 
> >>> would still be alive, but the user interface would be freeze. The user 
> >>> would need to close it manually anyway, but in some corner cases, the app 
> >>> could autosave some work or the user might be able to interact with it 
> >>> using some alternative method (command line?).
> >>>
> >>> c) Kill just the affected non-robust applications. To do that, the driver 
> >>> need to be 100% sure on the impact of its resets.
> >>>
> >>> RadeonSI currently implements a), as can be seen at [4], while Iris 
> >>> implements what I think it's c)[5].
> >>>
> >>> For the user experience point-of-view, c) is clearly the best option, but 
> >>> it's the hardest to archive. There's not much gain on having b) over a), 
> >>> perhaps it could be an optional env var for such corner case applications.
> >>
> >> I disagree on these conclusions.
> >>
> >> 

Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-26 Thread Michel Dänzer
On 7/25/23 15:02, André Almeida wrote:
> Em 25/07/2023 05:03, Michel Dänzer escreveu:
>> On 7/25/23 04:55, André Almeida wrote:
>>> Hi everyone,
>>>
>>> It's not clear what we should do about non-robust OpenGL apps after GPU 
>>> resets, so I'll try to summarize the topic, show some options and my 
>>> proposal to move forward on that.
>>>
>>> Em 27/06/2023 10:23, André Almeida escreveu:
 +Robustness
 +--
 +
 +The only way to try to keep an application working after a reset is if it
 +complies with the robustness aspects of the graphical API that it is 
 using.
 +
 +Graphical APIs provide ways to applications to deal with device resets. 
 However,
 +there is no guarantee that the app will use such features correctly, and 
 the
 +UMD can implement policies to close the app if it is a repeating offender,
 +likely in a broken loop. This is done to ensure that it does not keep 
 blocking
 +the user interface from being correctly displayed. This should be done 
 even if
 +the app is correct but happens to trigger some bug in the hardware/driver.
 +
>>> Depending on the OpenGL version, there are different robustness API 
>>> available:
>>>
>>> - OpenGL ABR extension [0]
>>> - OpenGL KHR extension [1]
>>> - OpenGL ES extension  [2]
>>>
>>> Apps written in OpenGL should use whatever version is available for them to 
>>> make the app robust for GPU resets. That usually means calling 
>>> GetGraphicsResetStatusARB(), checking the status, and if it encounter 
>>> something different from NO_ERROR, that means that a reset has happened, 
>>> the context is considered lost and should be recreated. If an app follow 
>>> this, it will likely succeed recovering a reset.
>>>
>>> What should non-robustness apps do then? They certainly will not be 
>>> notified if a reset happens, and thus can't recover if their context is 
>>> lost. OpenGL specification does not explicitly define what should be done 
>>> in such situations[3], and I believe that usually when the spec mandates to 
>>> close the app, it would explicitly note it.
>>>
>>> However, in reality there are different types of device resets, causing 
>>> different results. A reset can be precise enough to damage only the guilty 
>>> context, and keep others alive.
>>>
>>> Given that, I believe drivers have the following options:
>>>
>>> a) Kill all non-robust apps after a reset. This may lead to lose work from 
>>> innocent applications.
>>>
>>> b) Ignore all non-robust apps OpenGL calls. That means that applications 
>>> would still be alive, but the user interface would be freeze. The user 
>>> would need to close it manually anyway, but in some corner cases, the app 
>>> could autosave some work or the user might be able to interact with it 
>>> using some alternative method (command line?).
>>>
>>> c) Kill just the affected non-robust applications. To do that, the driver 
>>> need to be 100% sure on the impact of its resets.
>>>
>>> RadeonSI currently implements a), as can be seen at [4], while Iris 
>>> implements what I think it's c)[5].
>>>
>>> For the user experience point-of-view, c) is clearly the best option, but 
>>> it's the hardest to archive. There's not much gain on having b) over a), 
>>> perhaps it could be an optional env var for such corner case applications.
>>
>> I disagree on these conclusions.
>>
>> c) is certainly better than a), but it's not "clearly the best" in all 
>> cases. The OpenGL UMD is not a privileged/special component and is in no 
>> position to decide whether or not the process as a whole (only some 
>> thread(s) of which may use OpenGL at all) gets to continue running or not.
>>
> 
> Thank you for the feedback. How do you think the documentation should look 
> like for this part?

The initial paragraph about robustness should say "keep OpenGL working" instead 
of "keep an application working". If an OpenGL context stops working, it 
doesn't necessarily mean the application stops working altogether.


If the application doesn't use the robustness extensions, your option b) is 
what should happen by default whenever possible. And it really has to be 
possible if the robustness extensions are supported.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-26 Thread Timur Kristóf
On Tue, 2023-07-25 at 19:00 +0200, Michel Dänzer wrote:
> On 7/25/23 17:05, Marek Olšák wrote:
> > On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
> >  wrote:
> > > On 7/25/23 04:55, André Almeida wrote:
> > > > Hi everyone,
> > > > 
> > > > It's not clear what we should do about non-robust OpenGL apps
> > > > after GPU resets, so I'll try to summarize the topic, show some
> > > > options and my proposal to move forward on that.
> > > > 
> > > > Em 27/06/2023 10:23, André Almeida escreveu:
> > > > > +Robustness
> > > > > +--
> > > > > +
> > > > > +The only way to try to keep an application working after a
> > > > > reset is if it
> > > > > +complies with the robustness aspects of the graphical API
> > > > > that it is using.
> > > > > +
> > > > > +Graphical APIs provide ways to applications to deal with
> > > > > device resets. However,
> > > > > +there is no guarantee that the app will use such features
> > > > > correctly, and the
> > > > > +UMD can implement policies to close the app if it is a
> > > > > repeating offender,
> > > > > +likely in a broken loop. This is done to ensure that it does
> > > > > not keep blocking
> > > > > +the user interface from being correctly displayed. This
> > > > > should be done even if
> > > > > +the app is correct but happens to trigger some bug in the
> > > > > hardware/driver.
> > > > > +
> > > > Depending on the OpenGL version, there are different robustness
> > > > API available:
> > > > 
> > > > - OpenGL ABR extension [0]
> > > > - OpenGL KHR extension [1]
> > > > - OpenGL ES extension  [2]
> > > > 
> > > > Apps written in OpenGL should use whatever version is available
> > > > for them to make the app robust for GPU resets. That usually
> > > > means calling GetGraphicsResetStatusARB(), checking the status,
> > > > and if it encounter something different from NO_ERROR, that
> > > > means that a reset has happened, the context is considered lost
> > > > and should be recreated. If an app follow this, it will likely
> > > > succeed recovering a reset.
> > > > 
> > > > What should non-robustness apps do then? They certainly will
> > > > not be notified if a reset happens, and thus can't recover if
> > > > their context is lost. OpenGL specification does not explicitly
> > > > define what should be done in such situations[3], and I believe
> > > > that usually when the spec mandates to close the app, it would
> > > > explicitly note it.
> > > > 
> > > > However, in reality there are different types of device resets,
> > > > causing different results. A reset can be precise enough to
> > > > damage only the guilty context, and keep others alive.
> > > > 
> > > > Given that, I believe drivers have the following options:
> > > > 
> > > > a) Kill all non-robust apps after a reset. This may lead to
> > > > lose work from innocent applications.
> > > > 
> > > > b) Ignore all non-robust apps OpenGL calls. That means that
> > > > applications would still be alive, but the user interface would
> > > > be freeze. The user would need to close it manually anyway, but
> > > > in some corner cases, the app could autosave some work or the
> > > > user might be able to interact with it using some alternative
> > > > method (command line?).
> > > > 
> > > > c) Kill just the affected non-robust applications. To do that,
> > > > the driver need to be 100% sure on the impact of its resets.
> > > > 
> > > > RadeonSI currently implements a), as can be seen at [4], while
> > > > Iris implements what I think it's c)[5].
> > > > 
> > > > For the user experience point-of-view, c) is clearly the best
> > > > option, but it's the hardest to archive. There's not much gain
> > > > on having b) over a), perhaps it could be an optional env var
> > > > for such corner case applications.
> > > 
> > > I disagree on these conclusions.
> > > 
> > > c) is certainly better than a), but it's not "clearly the best"
> > > in all cases. The OpenGL UMD is not a privileged/special
> > > component and is in no position to decide whether or not the
> > > process as a whole (only some thread(s) of which may use OpenGL
> > > at all) gets to continue running or not.
> > 
> > That's not true.
> 
> Which part of what I wrote are you referring to?
> 
> 
> > I recommend that you enable b) with your driver and then hang the
> > GPU under different scenarios and see the result.
> 
> I've been doing GPU driver development for over two decades, I'm
> perfectly aware what it means. It doesn't change what I wrote above.
> 

Michel, I understand that you disagree with the proposed solutions in
this email thread but from your mails it is unclear to me what exactly
is the solution that you would actually recommend, can you please
clarify?

Thanks,
Timur



Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-25 Thread Michel Dänzer
On 7/25/23 17:05, Marek Olšák wrote:
> On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
>  wrote:
>> On 7/25/23 04:55, André Almeida wrote:
>>> Hi everyone,
>>>
>>> It's not clear what we should do about non-robust OpenGL apps after GPU 
>>> resets, so I'll try to summarize the topic, show some options and my 
>>> proposal to move forward on that.
>>>
>>> Em 27/06/2023 10:23, André Almeida escreveu:
 +Robustness
 +--
 +
 +The only way to try to keep an application working after a reset is if it
 +complies with the robustness aspects of the graphical API that it is 
 using.
 +
 +Graphical APIs provide ways to applications to deal with device resets. 
 However,
 +there is no guarantee that the app will use such features correctly, and 
 the
 +UMD can implement policies to close the app if it is a repeating offender,
 +likely in a broken loop. This is done to ensure that it does not keep 
 blocking
 +the user interface from being correctly displayed. This should be done 
 even if
 +the app is correct but happens to trigger some bug in the hardware/driver.
 +
>>> Depending on the OpenGL version, there are different robustness API 
>>> available:
>>>
>>> - OpenGL ABR extension [0]
>>> - OpenGL KHR extension [1]
>>> - OpenGL ES extension  [2]
>>>
>>> Apps written in OpenGL should use whatever version is available for them to 
>>> make the app robust for GPU resets. That usually means calling 
>>> GetGraphicsResetStatusARB(), checking the status, and if it encounter 
>>> something different from NO_ERROR, that means that a reset has happened, 
>>> the context is considered lost and should be recreated. If an app follow 
>>> this, it will likely succeed recovering a reset.
>>>
>>> What should non-robustness apps do then? They certainly will not be 
>>> notified if a reset happens, and thus can't recover if their context is 
>>> lost. OpenGL specification does not explicitly define what should be done 
>>> in such situations[3], and I believe that usually when the spec mandates to 
>>> close the app, it would explicitly note it.
>>>
>>> However, in reality there are different types of device resets, causing 
>>> different results. A reset can be precise enough to damage only the guilty 
>>> context, and keep others alive.
>>>
>>> Given that, I believe drivers have the following options:
>>>
>>> a) Kill all non-robust apps after a reset. This may lead to lose work from 
>>> innocent applications.
>>>
>>> b) Ignore all non-robust apps OpenGL calls. That means that applications 
>>> would still be alive, but the user interface would be freeze. The user 
>>> would need to close it manually anyway, but in some corner cases, the app 
>>> could autosave some work or the user might be able to interact with it 
>>> using some alternative method (command line?).
>>>
>>> c) Kill just the affected non-robust applications. To do that, the driver 
>>> need to be 100% sure on the impact of its resets.
>>>
>>> RadeonSI currently implements a), as can be seen at [4], while Iris 
>>> implements what I think it's c)[5].
>>>
>>> For the user experience point-of-view, c) is clearly the best option, but 
>>> it's the hardest to archive. There's not much gain on having b) over a), 
>>> perhaps it could be an optional env var for such corner case applications.
>>
>> I disagree on these conclusions.
>>
>> c) is certainly better than a), but it's not "clearly the best" in all 
>> cases. The OpenGL UMD is not a privileged/special component and is in no 
>> position to decide whether or not the process as a whole (only some 
>> thread(s) of which may use OpenGL at all) gets to continue running or not.
> 
> That's not true.

Which part of what I wrote are you referring to?


> I recommend that you enable b) with your driver and then hang the GPU under 
> different scenarios and see the result.

I've been doing GPU driver development for over two decades, I'm perfectly 
aware what it means. It doesn't change what I wrote above.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-25 Thread Marek Olšák
On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
 wrote:
>
> On 7/25/23 04:55, André Almeida wrote:
> > Hi everyone,
> >
> > It's not clear what we should do about non-robust OpenGL apps after GPU 
> > resets, so I'll try to summarize the topic, show some options and my 
> > proposal to move forward on that.
> >
> > Em 27/06/2023 10:23, André Almeida escreveu:
> >> +Robustness
> >> +--
> >> +
> >> +The only way to try to keep an application working after a reset is if it
> >> +complies with the robustness aspects of the graphical API that it is 
> >> using.
> >> +
> >> +Graphical APIs provide ways to applications to deal with device resets. 
> >> However,
> >> +there is no guarantee that the app will use such features correctly, and 
> >> the
> >> +UMD can implement policies to close the app if it is a repeating offender,
> >> +likely in a broken loop. This is done to ensure that it does not keep 
> >> blocking
> >> +the user interface from being correctly displayed. This should be done 
> >> even if
> >> +the app is correct but happens to trigger some bug in the hardware/driver.
> >> +
> > Depending on the OpenGL version, there are different robustness API 
> > available:
> >
> > - OpenGL ABR extension [0]
> > - OpenGL KHR extension [1]
> > - OpenGL ES extension  [2]
> >
> > Apps written in OpenGL should use whatever version is available for them to 
> > make the app robust for GPU resets. That usually means calling 
> > GetGraphicsResetStatusARB(), checking the status, and if it encounter 
> > something different from NO_ERROR, that means that a reset has happened, 
> > the context is considered lost and should be recreated. If an app follow 
> > this, it will likely succeed recovering a reset.
> >
> > What should non-robustness apps do then? They certainly will not be 
> > notified if a reset happens, and thus can't recover if their context is 
> > lost. OpenGL specification does not explicitly define what should be done 
> > in such situations[3], and I believe that usually when the spec mandates to 
> > close the app, it would explicitly note it.
> >
> > However, in reality there are different types of device resets, causing 
> > different results. A reset can be precise enough to damage only the guilty 
> > context, and keep others alive.
> >
> > Given that, I believe drivers have the following options:
> >
> > a) Kill all non-robust apps after a reset. This may lead to lose work from 
> > innocent applications.
> >
> > b) Ignore all non-robust apps OpenGL calls. That means that applications 
> > would still be alive, but the user interface would be freeze. The user 
> > would need to close it manually anyway, but in some corner cases, the app 
> > could autosave some work or the user might be able to interact with it 
> > using some alternative method (command line?).
> >
> > c) Kill just the affected non-robust applications. To do that, the driver 
> > need to be 100% sure on the impact of its resets.
> >
> > RadeonSI currently implements a), as can be seen at [4], while Iris 
> > implements what I think it's c)[5].
> >
> > For the user experience point-of-view, c) is clearly the best option, but 
> > it's the hardest to archive. There's not much gain on having b) over a), 
> > perhaps it could be an optional env var for such corner case applications.
>
> I disagree on these conclusions.
>
> c) is certainly better than a), but it's not "clearly the best" in all cases. 
> The OpenGL UMD is not a privileged/special component and is in no position to 
> decide whether or not the process as a whole (only some thread(s) of which 
> may use OpenGL at all) gets to continue running or not.

That's not true. I recommend that you enable b) with your driver and
then hang the GPU under different scenarios and see the result. Then
enable a) and do the same and compare.

Options a) and c) can be merged into one because they are not separate
options to choose from.

If Wayland wanted to grey out lost apps, they would appear as robust
contexts in gallium, but the reset status would be piped through the
Wayland protocol instead of the GL API.

Marek



Marek


Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-25 Thread André Almeida

Hi Michel,

Em 25/07/2023 05:03, Michel Dänzer escreveu:

On 7/25/23 04:55, André Almeida wrote:

Hi everyone,

It's not clear what we should do about non-robust OpenGL apps after GPU resets, 
so I'll try to summarize the topic, show some options and my proposal to move 
forward on that.

Em 27/06/2023 10:23, André Almeida escreveu:

+Robustness
+--
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways to applications to deal with device resets. 
However,
+there is no guarantee that the app will use such features correctly, and the
+UMD can implement policies to close the app if it is a repeating offender,
+likely in a broken loop. This is done to ensure that it does not keep blocking
+the user interface from being correctly displayed. This should be done even if
+the app is correct but happens to trigger some bug in the hardware/driver.
+

Depending on the OpenGL version, there are different robustness API available:

- OpenGL ABR extension [0]
- OpenGL KHR extension [1]
- OpenGL ES extension  [2]

Apps written in OpenGL should use whatever version is available for them to 
make the app robust for GPU resets. That usually means calling 
GetGraphicsResetStatusARB(), checking the status, and if it encounter something 
different from NO_ERROR, that means that a reset has happened, the context is 
considered lost and should be recreated. If an app follow this, it will likely 
succeed recovering a reset.

What should non-robustness apps do then? They certainly will not be notified if 
a reset happens, and thus can't recover if their context is lost. OpenGL 
specification does not explicitly define what should be done in such 
situations[3], and I believe that usually when the spec mandates to close the 
app, it would explicitly note it.

However, in reality there are different types of device resets, causing 
different results. A reset can be precise enough to damage only the guilty 
context, and keep others alive.

Given that, I believe drivers have the following options:

a) Kill all non-robust apps after a reset. This may lead to lose work from 
innocent applications.

b) Ignore all non-robust apps OpenGL calls. That means that applications would 
still be alive, but the user interface would be freeze. The user would need to 
close it manually anyway, but in some corner cases, the app could autosave some 
work or the user might be able to interact with it using some alternative 
method (command line?).

c) Kill just the affected non-robust applications. To do that, the driver need 
to be 100% sure on the impact of its resets.

RadeonSI currently implements a), as can be seen at [4], while Iris implements 
what I think it's c)[5].

For the user experience point-of-view, c) is clearly the best option, but it's 
the hardest to archive. There's not much gain on having b) over a), perhaps it 
could be an optional env var for such corner case applications.


I disagree on these conclusions.

c) is certainly better than a), but it's not "clearly the best" in all cases. 
The OpenGL UMD is not a privileged/special component and is in no position to decide 
whether or not the process as a whole (only some thread(s) of which may use OpenGL at 
all) gets to continue running or not.



Thank you for the feedback. How do you think the documentation should 
look like for this part?


Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-25 Thread Michel Dänzer
On 7/25/23 04:55, André Almeida wrote:
> Hi everyone,
> 
> It's not clear what we should do about non-robust OpenGL apps after GPU 
> resets, so I'll try to summarize the topic, show some options and my proposal 
> to move forward on that.
> 
> Em 27/06/2023 10:23, André Almeida escreveu:
>> +Robustness
>> +--
>> +
>> +The only way to try to keep an application working after a reset is if it
>> +complies with the robustness aspects of the graphical API that it is using.
>> +
>> +Graphical APIs provide ways to applications to deal with device resets. 
>> However,
>> +there is no guarantee that the app will use such features correctly, and the
>> +UMD can implement policies to close the app if it is a repeating offender,
>> +likely in a broken loop. This is done to ensure that it does not keep 
>> blocking
>> +the user interface from being correctly displayed. This should be done even 
>> if
>> +the app is correct but happens to trigger some bug in the hardware/driver.
>> +
> Depending on the OpenGL version, there are different robustness API available:
> 
> - OpenGL ABR extension [0]
> - OpenGL KHR extension [1]
> - OpenGL ES extension  [2]
> 
> Apps written in OpenGL should use whatever version is available for them to 
> make the app robust for GPU resets. That usually means calling 
> GetGraphicsResetStatusARB(), checking the status, and if it encounter 
> something different from NO_ERROR, that means that a reset has happened, the 
> context is considered lost and should be recreated. If an app follow this, it 
> will likely succeed recovering a reset.
> 
> What should non-robustness apps do then? They certainly will not be notified 
> if a reset happens, and thus can't recover if their context is lost. OpenGL 
> specification does not explicitly define what should be done in such 
> situations[3], and I believe that usually when the spec mandates to close the 
> app, it would explicitly note it.
> 
> However, in reality there are different types of device resets, causing 
> different results. A reset can be precise enough to damage only the guilty 
> context, and keep others alive.
> 
> Given that, I believe drivers have the following options:
> 
> a) Kill all non-robust apps after a reset. This may lead to lose work from 
> innocent applications.
> 
> b) Ignore all non-robust apps OpenGL calls. That means that applications 
> would still be alive, but the user interface would be freeze. The user would 
> need to close it manually anyway, but in some corner cases, the app could 
> autosave some work or the user might be able to interact with it using some 
> alternative method (command line?).
> 
> c) Kill just the affected non-robust applications. To do that, the driver 
> need to be 100% sure on the impact of its resets.
> 
> RadeonSI currently implements a), as can be seen at [4], while Iris 
> implements what I think it's c)[5].
> 
> For the user experience point-of-view, c) is clearly the best option, but 
> it's the hardest to archive. There's not much gain on having b) over a), 
> perhaps it could be an optional env var for such corner case applications.

I disagree on these conclusions.

c) is certainly better than a), but it's not "clearly the best" in all cases. 
The OpenGL UMD is not a privileged/special component and is in no position to 
decide whether or not the process as a whole (only some thread(s) of which may 
use OpenGL at all) gets to continue running or not.


> [0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
> [1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt
> [2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
> [3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf
> [4] 
> https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657
> [5] 
> https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842

-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-25 Thread Simon Ser
On Tuesday, July 25th, 2023 at 04:55, André Almeida  
wrote:

> It's not clear what we should do about non-robust OpenGL apps after GPU
> resets, so I'll try to summarize the topic, show some options and my
> proposal to move forward on that.
> 
> Em 27/06/2023 10:23, André Almeida escreveu:
> 
> > +Robustness
> > +--
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. 
> > However,
> > +there is no guarantee that the app will use such features correctly, and 
> > the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep 
> > blocking
> > +the user interface from being correctly displayed. This should be done 
> > even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
> > +
> 
> Depending on the OpenGL version, there are different robustness API
> available:
> 
> - OpenGL ABR extension [0]
> - OpenGL KHR extension [1]
> - OpenGL ES extension [2]
> 
> Apps written in OpenGL should use whatever version is available for them
> to make the app robust for GPU resets. That usually means calling
> GetGraphicsResetStatusARB(), checking the status, and if it encounter
> something different from NO_ERROR, that means that a reset has happened,
> the context is considered lost and should be recreated. If an app follow
> this, it will likely succeed recovering a reset.
> 
> What should non-robustness apps do then? They certainly will not be
> notified if a reset happens, and thus can't recover if their context is
> lost. OpenGL specification does not explicitly define what should be
> done in such situations[3], and I believe that usually when the spec
> mandates to close the app, it would explicitly note it.
> 
> However, in reality there are different types of device resets, causing
> different results. A reset can be precise enough to damage only the
> guilty context, and keep others alive.
> 
> Given that, I believe drivers have the following options:
> 
> a) Kill all non-robust apps after a reset. This may lead to lose work
> from innocent applications.
> 
> b) Ignore all non-robust apps OpenGL calls. That means that applications
> would still be alive, but the user interface would be freeze. The user
> would need to close it manually anyway, but in some corner cases, the
> app could autosave some work or the user might be able to interact with
> it using some alternative method (command line?).
> 
> c) Kill just the affected non-robust applications. To do that, the
> driver need to be 100% sure on the impact of its resets.

We've discussed this a while back on #dri-devel IIRC. I think the best
experience would be for the Wayland compositor to gray out apps which
lost their GL context, and display an information dialog to explain
what happened to the user and a button to kill the app. I'm not exactly
sure how that would translate to a kernel or Mesa uAPI, and if there's
appetite to do a lot of work to get "the best GPU reset UX" (IOW: maybe
it's not worth all of the trouble).


Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-24 Thread André Almeida

Hi everyone,

It's not clear what we should do about non-robust OpenGL apps after GPU 
resets, so I'll try to summarize the topic, show some options and my 
proposal to move forward on that.


Em 27/06/2023 10:23, André Almeida escreveu:

+Robustness
+--
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways to applications to deal with device resets. 
However,
+there is no guarantee that the app will use such features correctly, and the
+UMD can implement policies to close the app if it is a repeating offender,
+likely in a broken loop. This is done to ensure that it does not keep blocking
+the user interface from being correctly displayed. This should be done even if
+the app is correct but happens to trigger some bug in the hardware/driver.
+
Depending on the OpenGL version, there are different robustness API 
available:


- OpenGL ABR extension [0]
- OpenGL KHR extension [1]
- OpenGL ES extension  [2]

Apps written in OpenGL should use whatever version is available for them 
to make the app robust for GPU resets. That usually means calling 
GetGraphicsResetStatusARB(), checking the status, and if it encounter 
something different from NO_ERROR, that means that a reset has happened, 
the context is considered lost and should be recreated. If an app follow 
this, it will likely succeed recovering a reset.


What should non-robustness apps do then? They certainly will not be 
notified if a reset happens, and thus can't recover if their context is 
lost. OpenGL specification does not explicitly define what should be 
done in such situations[3], and I believe that usually when the spec 
mandates to close the app, it would explicitly note it.


However, in reality there are different types of device resets, causing 
different results. A reset can be precise enough to damage only the 
guilty context, and keep others alive.


Given that, I believe drivers have the following options:

a) Kill all non-robust apps after a reset. This may lead to lose work 
from innocent applications.


b) Ignore all non-robust apps OpenGL calls. That means that applications 
would still be alive, but the user interface would be freeze. The user 
would need to close it manually anyway, but in some corner cases, the 
app could autosave some work or the user might be able to interact with 
it using some alternative method (command line?).


c) Kill just the affected non-robust applications. To do that, the 
driver need to be 100% sure on the impact of its resets.


RadeonSI currently implements a), as can be seen at [4], while Iris 
implements what I think it's c)[5].


For the user experience point-of-view, c) is clearly the best option, 
but it's the hardest to archive. There's not much gain on having b) over 
a), perhaps it could be an optional env var for such corner case 
applications.


My proposal for the documentation is: implement a) if nothing else is 
available, have a MESA_NO_RESET_KILL for people that want b), ideally 
implement c) if the driver is able to know for sure that the non-guilty 
apps can still work after a reset.


Thanks,
André

[0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
[1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt
[2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
[3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf
[4] 
https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657
[5] 
https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-05 Thread Marek Olšák
On Wed, Jul 5, 2023 at 3:32 AM Michel Dänzer  wrote:
>
> On 7/5/23 08:30, Marek Olšák wrote:
> > On Tue, Jul 4, 2023, 03:55 Michel Dänzer  wrote:
> > On 7/4/23 04:34, Marek Olšák wrote:
> > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer  > > wrote:
> > > On 6/30/23 22:32, Marek Olšák wrote:
> > > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer 
> >  wrote:
> > > >> On 6/30/23 16:59, Alex Deucher wrote:
> > > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> > > >>> mailto:sebastian.w...@redhat.com> 
> > wrote:
> > >  On Tue, Jun 27, 2023 at 3:23 PM André Almeida 
> >  wrote:
> > > >
> > > > +Robustness
> > > > +--
> > > > +
> > > > +The only way to try to keep an application working after a 
> > reset is if it
> > > > +complies with the robustness aspects of the graphical API 
> > that it is using.
> > > > +
> > > > +Graphical APIs provide ways to applications to deal with 
> > device resets. However,
> > > > +there is no guarantee that the app will use such features 
> > correctly, and the
> > > > +UMD can implement policies to close the app if it is a 
> > repeating offender,
> > > > +likely in a broken loop. This is done to ensure that it 
> > does not keep blocking
> > > > +the user interface from being correctly displayed. This 
> > should be done even if
> > > > +the app is correct but happens to trigger some bug in the 
> > hardware/driver.
> > > 
> > >  I still don't think it's good to let the kernel arbitrarily 
> > kill
> > >  processes that it thinks are not well-behaved based on some 
> > heuristics
> > >  and policy.
> > > 
> > >  Can't this be outsourced to user space? Expose the 
> > information about
> > >  processes causing a device and let e.g. systemd deal with 
> > coming up
> > >  with a policy and with killing stuff.
> > > >>>
> > > >>> I don't think it's the kernel doing the killing, it would be 
> > the UMD.
> > > >>> E.g., if the app is guilty and doesn't support robustness the 
> > UMD can
> > > >>> just call exit().
> > > >>
> > > >> It would be safer to just ignore API calls[0], similarly to 
> > what is done until the application destroys the context with robustness. 
> > Calling exit() likely results in losing any unsaved work, whereas at least 
> > some applications might otherwise allow saving the work by other means.
> > > >
> > > > That's a terrible idea. Ignoring API calls would be identical 
> > to a freeze. You might as well disable GPU recovery because the result 
> > would be the same.
> > >
> > > No GPU recovery would affect everything using the GPU, whereas 
> > this affects only non-robust applications.
> > >
> > > which is currently the majority.
> >
> > Not sure where you're going with this. Applications need to use 
> > robustness to be able to recover from a GPU hang, and the GPU needs to be 
> > reset for that. So disabling GPU reset is not the same as what we're 
> > discussing here.
> >
> >
> > > > - non-robust contexts: call exit(1) immediately, which is the 
> > best way to recover
> > >
> > > That's not the UMD's call to make.
> > >
> > > That's absolutely the UMD's call to make because that's mandated by 
> > the hw and API design
> >
> > Can you point us to a spec which mandates that the process must be 
> > killed in this case?
> >
> >
> > > and only driver devs know this, which this thread is a proof of. The 
> > default behavior is to skip all command submission if a non-robust context 
> > is lost, which looks like a freeze. That's required to prevent infinite 
> > hangs from the same context and can be caused by the side effects of the 
> > GPU reset itself, not by the cause of the previous hang. The only way out 
> > of that is killing the process.
> >
> > The UMD killing the process is not the only way out of that, and doing 
> > so is overreach on its part. The UMD is but one out of many components in a 
> > process, not the main one or a special one. It doesn't get to decide when 
> > the process must die, certainly not under circumstances where it must be 
> > able to continue while ignoring API calls (that's required for robustness).
> >
> >
> > You're mixing things up. Robust apps don't any special action from a UMD. 
> > Only non-robust apps need to be killed for proper recovery with the only 
> > other alternative being not updating the window/screen,
>
> I'm saying they don't "need" to be killed, since the UMD must be able to keep 
> going while ignoring API calls (or it couldn't support robustness). It's a 
> choice, one which is not for the UMD to make.
>
>
> > Also it's already used and 

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-05 Thread Michel Dänzer
On 7/5/23 08:30, Marek Olšák wrote:
> On Tue, Jul 4, 2023, 03:55 Michel Dänzer  wrote:
> On 7/4/23 04:34, Marek Olšák wrote:
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer  
> wrote:
> >     On 6/30/23 22:32, Marek Olšák wrote:
> >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer 
>  wrote:
> >     >> On 6/30/23 16:59, Alex Deucher wrote:
> >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >     >>> mailto:sebastian.w...@redhat.com> 
> wrote:
> >      On Tue, Jun 27, 2023 at 3:23 PM André Almeida 
>  wrote:
> >     >
> >     > +Robustness
> >     > +--
> >     > +
> >     > +The only way to try to keep an application working after a 
> reset is if it
> >     > +complies with the robustness aspects of the graphical API 
> that it is using.
> >     > +
> >     > +Graphical APIs provide ways to applications to deal with 
> device resets. However,
> >     > +there is no guarantee that the app will use such features 
> correctly, and the
> >     > +UMD can implement policies to close the app if it is a 
> repeating offender,
> >     > +likely in a broken loop. This is done to ensure that it does 
> not keep blocking
> >     > +the user interface from being correctly displayed. This 
> should be done even if
> >     > +the app is correct but happens to trigger some bug in the 
> hardware/driver.
> >     
> >      I still don't think it's good to let the kernel arbitrarily 
> kill
> >      processes that it thinks are not well-behaved based on some 
> heuristics
> >      and policy.
> >     
> >      Can't this be outsourced to user space? Expose the information 
> about
> >      processes causing a device and let e.g. systemd deal with 
> coming up
> >      with a policy and with killing stuff.
> >     >>>
> >     >>> I don't think it's the kernel doing the killing, it would be 
> the UMD.
> >     >>> E.g., if the app is guilty and doesn't support robustness the 
> UMD can
> >     >>> just call exit().
> >     >>
> >     >> It would be safer to just ignore API calls[0], similarly to what 
> is done until the application destroys the context with robustness. Calling 
> exit() likely results in losing any unsaved work, whereas at least some 
> applications might otherwise allow saving the work by other means.
> >     >
> >     > That's a terrible idea. Ignoring API calls would be identical to 
> a freeze. You might as well disable GPU recovery because the result would be 
> the same.
> >
> >     No GPU recovery would affect everything using the GPU, whereas this 
> affects only non-robust applications.
> >
> > which is currently the majority.
> 
> Not sure where you're going with this. Applications need to use 
> robustness to be able to recover from a GPU hang, and the GPU needs to be 
> reset for that. So disabling GPU reset is not the same as what we're 
> discussing here.
> 
> 
> >     > - non-robust contexts: call exit(1) immediately, which is the 
> best way to recover
> >
> >     That's not the UMD's call to make.
> >
> > That's absolutely the UMD's call to make because that's mandated by the 
> hw and API design
> 
> Can you point us to a spec which mandates that the process must be killed 
> in this case?
> 
> 
> > and only driver devs know this, which this thread is a proof of. The 
> default behavior is to skip all command submission if a non-robust context is 
> lost, which looks like a freeze. That's required to prevent infinite hangs 
> from the same context and can be caused by the side effects of the GPU reset 
> itself, not by the cause of the previous hang. The only way out of that is 
> killing the process.
> 
> The UMD killing the process is not the only way out of that, and doing so 
> is overreach on its part. The UMD is but one out of many components in a 
> process, not the main one or a special one. It doesn't get to decide when the 
> process must die, certainly not under circumstances where it must be able to 
> continue while ignoring API calls (that's required for robustness).
> 
> 
> You're mixing things up. Robust apps don't any special action from a UMD. 
> Only non-robust apps need to be killed for proper recovery with the only 
> other alternative being not updating the window/screen,

I'm saying they don't "need" to be killed, since the UMD must be able to keep 
going while ignoring API calls (or it couldn't support robustness). It's a 
choice, one which is not for the UMD to make.


> Also it's already used and required by our customers on Android because 
> killing a process returns the user to the desktop screen and can generate a 
> crash dump instead of keeping the app output frozen, and they agree that this 
> is the best user experience given the circumstances.

Then 

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-05 Thread Marek Olšák
On Tue, Jul 4, 2023, 03:55 Michel Dänzer  wrote:

> On 7/4/23 04:34, Marek Olšák wrote:
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer  > wrote:
> > On 6/30/23 22:32, Marek Olšák wrote:
> > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <
> michel.daen...@mailbox.org   michel.daen...@mailbox.org >> wrote:
> > >> On 6/30/23 16:59, Alex Deucher wrote:
> > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> > >>> mailto:sebastian.w...@redhat.com>
> >>
> wrote:
> >  On Tue, Jun 27, 2023 at 3:23 PM André Almeida <
> andrealm...@igalia.com   andrealm...@igalia.com >> wrote:
> > >
> > > +Robustness
> > > +--
> > > +
> > > +The only way to try to keep an application working after a
> reset is if it
> > > +complies with the robustness aspects of the graphical API
> that it is using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with
> device resets. However,
> > > +there is no guarantee that the app will use such features
> correctly, and the
> > > +UMD can implement policies to close the app if it is a
> repeating offender,
> > > +likely in a broken loop. This is done to ensure that it does
> not keep blocking
> > > +the user interface from being correctly displayed. This
> should be done even if
> > > +the app is correct but happens to trigger some bug in the
> hardware/driver.
> > 
> >  I still don't think it's good to let the kernel arbitrarily kill
> >  processes that it thinks are not well-behaved based on some
> heuristics
> >  and policy.
> > 
> >  Can't this be outsourced to user space? Expose the information
> about
> >  processes causing a device and let e.g. systemd deal with
> coming up
> >  with a policy and with killing stuff.
> > >>>
> > >>> I don't think it's the kernel doing the killing, it would be the
> UMD.
> > >>> E.g., if the app is guilty and doesn't support robustness the
> UMD can
> > >>> just call exit().
> > >>
> > >> It would be safer to just ignore API calls[0], similarly to what
> is done until the application destroys the context with robustness. Calling
> exit() likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
> > >
> > > That's a terrible idea. Ignoring API calls would be identical to a
> freeze. You might as well disable GPU recovery because the result would be
> the same.
> >
> > No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
> >
> > which is currently the majority.
>
> Not sure where you're going with this. Applications need to use robustness
> to be able to recover from a GPU hang, and the GPU needs to be reset for
> that. So disabling GPU reset is not the same as what we're discussing here.
>
>
> > > - non-robust contexts: call exit(1) immediately, which is the best
> way to recover
> >
> > That's not the UMD's call to make.
> >
> > That's absolutely the UMD's call to make because that's mandated by the
> hw and API design
>
> Can you point us to a spec which mandates that the process must be killed
> in this case?
>
>
> > and only driver devs know this, which this thread is a proof of. The
> default behavior is to skip all command submission if a non-robust context
> is lost, which looks like a freeze. That's required to prevent infinite
> hangs from the same context and can be caused by the side effects of the
> GPU reset itself, not by the cause of the previous hang. The only way out
> of that is killing the process.
>
> The UMD killing the process is not the only way out of that, and doing so
> is overreach on its part. The UMD is but one out of many components in a
> process, not the main one or a special one. It doesn't get to decide when
> the process must die, certainly not under circumstances where it must be
> able to continue while ignoring API calls (that's required for robustness).
>

You're mixing things up. Robust apps don't any special action from a UMD.
Only non-robust apps need to be killed for proper recovery with the only
other alternative being not updating the window/screen, which is not
user-friendly because the user who's never heard of GPU hangs has no
fucking idea why it's frozen and what do with it. It doesn't matter that
you can debug it because you're not the average user. Also it's already
used and required by our customers on Android because killing a process
returns the user to the desktop screen and can generate a crash dump
instead of keeping the app output frozen, and they agree that this is the
best 

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-04 Thread Michel Dänzer
On 7/4/23 04:34, Marek Olšák wrote:
> On Mon, Jul 3, 2023, 03:12 Michel Dänzer  > wrote:
> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer 
> mailto:michel.daen...@mailbox.org> 
> >> 
> wrote:
> >> On 6/30/23 16:59, Alex Deucher wrote:
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >>> mailto:sebastian.w...@redhat.com> 
> >> wrote:
>  On Tue, Jun 27, 2023 at 3:23 PM André Almeida 
> mailto:andrealm...@igalia.com> 
> >> wrote:
> >
> > +Robustness
> > +--
> > +
> > +The only way to try to keep an application working after a reset 
> is if it
> > +complies with the robustness aspects of the graphical API that it 
> is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device 
> resets. However,
> > +there is no guarantee that the app will use such features 
> correctly, and the
> > +UMD can implement policies to close the app if it is a repeating 
> offender,
> > +likely in a broken loop. This is done to ensure that it does not 
> keep blocking
> > +the user interface from being correctly displayed. This should be 
> done even if
> > +the app is correct but happens to trigger some bug in the 
> hardware/driver.
> 
>  I still don't think it's good to let the kernel arbitrarily kill
>  processes that it thinks are not well-behaved based on some 
> heuristics
>  and policy.
> 
>  Can't this be outsourced to user space? Expose the information about
>  processes causing a device and let e.g. systemd deal with coming up
>  with a policy and with killing stuff.
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().
> >>
> >> It would be safer to just ignore API calls[0], similarly to what is 
> done until the application destroys the context with robustness. Calling 
> exit() likely results in losing any unsaved work, whereas at least some 
> applications might otherwise allow saving the work by other means.
> >
> > That's a terrible idea. Ignoring API calls would be identical to a 
> freeze. You might as well disable GPU recovery because the result would be 
> the same.
> 
> No GPU recovery would affect everything using the GPU, whereas this 
> affects only non-robust applications.
> 
> which is currently the majority.

Not sure where you're going with this. Applications need to use robustness to 
be able to recover from a GPU hang, and the GPU needs to be reset for that. So 
disabling GPU reset is not the same as what we're discussing here.


> > - non-robust contexts: call exit(1) immediately, which is the best way 
> to recover
> 
> That's not the UMD's call to make.
> 
> That's absolutely the UMD's call to make because that's mandated by the hw 
> and API design

Can you point us to a spec which mandates that the process must be killed in 
this case?


> and only driver devs know this, which this thread is a proof of. The default 
> behavior is to skip all command submission if a non-robust context is lost, 
> which looks like a freeze. That's required to prevent infinite hangs from the 
> same context and can be caused by the side effects of the GPU reset itself, 
> not by the cause of the previous hang. The only way out of that is killing 
> the process.

The UMD killing the process is not the only way out of that, and doing so is 
overreach on its part. The UMD is but one out of many components in a process, 
not the main one or a special one. It doesn't get to decide when the process 
must die, certainly not under circumstances where it must be able to continue 
while ignoring API calls (that's required for robustness).


> >>     [0] Possibly accompanied by a one-time message to stderr along the 
> lines of "GPU reset detected but robustness not enabled in context, ignoring 
> OpenGL API calls".


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-04 Thread Pekka Paalanen
On Mon, 3 Jul 2023 12:00:22 -0300
André Almeida  wrote:

> Em 03/07/2023 05:49, Pekka Paalanen escreveu:

> > If an application freezes, that's "no problem"; the end user can just
> > continue using everything else. Alt-tab away etc. if the app was
> > fullscreen. I do that already with games on even Xorg.
> > 
> > If a display server freezes, that's a desktop-wide problem, but so is
> > killing it.
> >   
> 
> Interesting, what GPU do you use? In my experience (AMD RX 5600 XT), 
> hanging the GPU usually means that the rest of applications/compositor 
> can't use the GPU either, freezing all user interactions. So killing the 
> guilty app is one effective solution currently, but ignoring calls may 
> help as well.

I don't know if what I'm seeing is a GPU hang or just e.g. Proton
getting somehow stuck, all I see is a game freezing. I just Alt+tab
back to Steam, force-stop it, and then all is fine again. This is how
it should work regardless of why a game freezes.

However, even if it was a GPU hang, if I am on a display server that
actually handles GPU resets, I don't see why the rest of the desktop
would not be able to recover. Individual apps are each to their own,
but at the very least non-GPU apps and the DE itself should not have
any problem (DE components can simply be restarted automatically).


Thanks,
pq


pgpwiHIPewZ4g.pgp
Description: OpenPGP digital signature


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-03 Thread Randy Dunlap



On 7/3/23 19:44, Marek Olšák wrote:
> 
> 
> On Mon, Jul 3, 2023, 22:38 Randy Dunlap  > wrote:
> 
> 
> 
> On 7/3/23 19:34, Marek Olšák wrote:
> >
> >
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer    >> wrote:
> >
> 
> Marek,
> Please stop sending html emails to the mailing lists.
> The mailing list software drops them.
> 
> Please set your email interface to use plain text mode instead.
> Thanks.
> 
> 
> The mobile Gmail app doesn't support plain text, which I use frequently.

Perhaps you should consider some other mobile app for kernel discussions.

E.g., it looks like the K-9 mail app works with gmail.

-- 
~Randy


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-03 Thread Marek Olšák
On Mon, Jul 3, 2023, 22:38 Randy Dunlap  wrote:

>
>
> On 7/3/23 19:34, Marek Olšák wrote:
> >
> >
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer  > wrote:
> >
>
> Marek,
> Please stop sending html emails to the mailing lists.
> The mailing list software drops them.
>
> Please set your email interface to use plain text mode instead.
> Thanks.
>

The mobile Gmail app doesn't support plain text, which I use frequently.

Marek


> --
> ~Randy
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-03 Thread Randy Dunlap



On 7/3/23 19:34, Marek Olšák wrote:
> 
> 
> On Mon, Jul 3, 2023, 03:12 Michel Dänzer  > wrote:
> 

Marek,
Please stop sending html emails to the mailing lists.
The mailing list software drops them.

Please set your email interface to use plain text mode instead.
Thanks.

-- 
~Randy


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-03 Thread Marek Olšák
On Mon, Jul 3, 2023, 03:12 Michel Dänzer  wrote:

> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <
> michel.daen...@mailbox.org > wrote:
> >> On 6/30/23 16:59, Alex Deucher wrote:
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >>> mailto:sebastian.w...@redhat.com>> wrote:
>  On Tue, Jun 27, 2023 at 3:23 PM André Almeida  > wrote:
> >
> > +Robustness
> > +--
> > +
> > +The only way to try to keep an application working after a reset is
> if it
> > +complies with the robustness aspects of the graphical API that it
> is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device
> resets. However,
> > +there is no guarantee that the app will use such features
> correctly, and the
> > +UMD can implement policies to close the app if it is a repeating
> offender,
> > +likely in a broken loop. This is done to ensure that it does not
> keep blocking
> > +the user interface from being correctly displayed. This should be
> done even if
> > +the app is correct but happens to trigger some bug in the
> hardware/driver.
> 
>  I still don't think it's good to let the kernel arbitrarily kill
>  processes that it thinks are not well-behaved based on some heuristics
>  and policy.
> 
>  Can't this be outsourced to user space? Expose the information about
>  processes causing a device and let e.g. systemd deal with coming up
>  with a policy and with killing stuff.
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().
> >>
> >> It would be safer to just ignore API calls[0], similarly to what is
> done until the application destroys the context with robustness. Calling
> exit() likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
> >
> > That's a terrible idea. Ignoring API calls would be identical to a
> freeze. You might as well disable GPU recovery because the result would be
> the same.
>
> No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
>

which is currently the majority.


>
> > - non-robust contexts: call exit(1) immediately, which is the best way
> to recover
>
> That's not the UMD's call to make.
>

That's absolutely the UMD's call to make because that's mandated by the hw
and API design and only driver devs know this, which this thread is a proof
of. The default behavior is to skip all command submission if a non-robust
context is lost, which looks like a freeze. That's required to prevent
infinite hangs from the same context and can be caused by the side effects
of the GPU reset itself, not by the cause of the previous hang. The only
way out of that is killing the process.

Marek


>
> >> [0] Possibly accompanied by a one-time message to stderr along the
> lines of "GPU reset detected but robustness not enabled in context,
> ignoring OpenGL API calls".
>
>
> --
> Earthling Michel Dänzer|  https://redhat.com
> Libre software enthusiast  | Mesa and Xwayland developer
>
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-03 Thread André Almeida




Em 03/07/2023 05:49, Pekka Paalanen escreveu:

On Mon, 3 Jul 2023 09:12:29 +0200
Michel Dänzer  wrote:


On 6/30/23 22:32, Marek Olšák wrote:

On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer mailto:michel.daen...@mailbox.org>> wrote:

On 6/30/23 16:59, Alex Deucher wrote:

On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
mailto:sebastian.w...@redhat.com>> wrote:

On Tue, Jun 27, 2023 at 3:23 PM André Almeida mailto:andrealm...@igalia.com>> wrote:


+Robustness
+--
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways to applications to deal with device resets. 
However,
+there is no guarantee that the app will use such features correctly, and the
+UMD can implement policies to close the app if it is a repeating offender,
+likely in a broken loop. This is done to ensure that it does not keep blocking
+the user interface from being correctly displayed. This should be done even if
+the app is correct but happens to trigger some bug in the hardware/driver.


I still don't think it's good to let the kernel arbitrarily kill
processes that it thinks are not well-behaved based on some heuristics
and policy.

Can't this be outsourced to user space? Expose the information about
processes causing a device and let e.g. systemd deal with coming up
with a policy and with killing stuff.


I don't think it's the kernel doing the killing, it would be the UMD.
E.g., if the app is guilty and doesn't support robustness the UMD can
just call exit().


It would be safer to just ignore API calls[0], similarly to what
is done until the application destroys the context with
robustness. Calling exit() likely results in losing any unsaved
work, whereas at least some applications might otherwise allow
saving the work by other means.


That's a terrible idea. Ignoring API calls would be identical to a
freeze. You might as well disable GPU recovery because the result
would be the same.


No GPU recovery would affect everything using the GPU, whereas this
affects only non-robust applications.



- non-robust contexts: call exit(1) immediately, which is the best
way to recover


That's not the UMD's call to make.



 [0] Possibly accompanied by a one-time message to stderr along
the lines of "GPU reset detected but robustness not enabled in
context, ignoring OpenGL API calls".




Hi,

Michel does have a point. It's not just games and display servers that
use GPU, but productivity tools as well. They may have periodic
autosave in anticipation of crashes, but being able to do the final
save before quitting would be nice. UMD killing the process would be
new behaviour, right? Previously either application's GPU thread hangs
or various API calls return errors, but it didn't kill the process, did
it?



In Intel's Iris, UMD may call abort() for the reset guilty application:

https://elixir.bootlin.com/mesa/mesa-23.0.4/source/src/gallium/drivers/iris/iris_batch.c#L1063

I was pretty sure this was the same for RadeonSI, but I failed to find 
the code for this, so I might be wrong.



If an application freezes, that's "no problem"; the end user can just
continue using everything else. Alt-tab away etc. if the app was
fullscreen. I do that already with games on even Xorg.

If a display server freezes, that's a desktop-wide problem, but so is
killing it.



Interesting, what GPU do you use? In my experience (AMD RX 5600 XT), 
hanging the GPU usually means that the rest of applications/compositor 
can't use the GPU either, freezing all user interactions. So killing the 
guilty app is one effective solution currently, but ignoring calls may 
help as well.



OTOH, if UMD really does need to terminate the process, then please do
it in a way that causes a crash report to be recorded. _exit() with an
error code is not it.



In the "Reporting causes of resets" subsection of this document I can 
add something for UMD as well.




Thanks,
pq


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-03 Thread Pekka Paalanen
On Mon, 3 Jul 2023 09:12:29 +0200
Michel Dänzer  wrote:

> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer  > > wrote:  
> >> On 6/30/23 16:59, Alex Deucher wrote:  
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >>> mailto:sebastian.w...@redhat.com>> wrote:  
>  On Tue, Jun 27, 2023 at 3:23 PM André Almeida   > wrote:  
> >
> > +Robustness
> > +--
> > +
> > +The only way to try to keep an application working after a reset is if 
> > it
> > +complies with the robustness aspects of the graphical API that it is 
> > using.
> > +
> > +Graphical APIs provide ways to applications to deal with device 
> > resets. However,
> > +there is no guarantee that the app will use such features correctly, 
> > and the
> > +UMD can implement policies to close the app if it is a repeating 
> > offender,
> > +likely in a broken loop. This is done to ensure that it does not keep 
> > blocking
> > +the user interface from being correctly displayed. This should be done 
> > even if
> > +the app is correct but happens to trigger some bug in the 
> > hardware/driver.  
> 
>  I still don't think it's good to let the kernel arbitrarily kill
>  processes that it thinks are not well-behaved based on some heuristics
>  and policy.
> 
>  Can't this be outsourced to user space? Expose the information about
>  processes causing a device and let e.g. systemd deal with coming up
>  with a policy and with killing stuff.  
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().  
> >>
> >> It would be safer to just ignore API calls[0], similarly to what
> >> is done until the application destroys the context with
> >> robustness. Calling exit() likely results in losing any unsaved
> >> work, whereas at least some applications might otherwise allow
> >> saving the work by other means.  
> > 
> > That's a terrible idea. Ignoring API calls would be identical to a
> > freeze. You might as well disable GPU recovery because the result
> > would be the same.  
> 
> No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
> 
> 
> > - non-robust contexts: call exit(1) immediately, which is the best
> > way to recover  
> 
> That's not the UMD's call to make.
> 
> 
> >> [0] Possibly accompanied by a one-time message to stderr along
> >> the lines of "GPU reset detected but robustness not enabled in
> >> context, ignoring OpenGL API calls".  
> 

Hi,

Michel does have a point. It's not just games and display servers that
use GPU, but productivity tools as well. They may have periodic
autosave in anticipation of crashes, but being able to do the final
save before quitting would be nice. UMD killing the process would be
new behaviour, right? Previously either application's GPU thread hangs
or various API calls return errors, but it didn't kill the process, did
it?

If an application freezes, that's "no problem"; the end user can just
continue using everything else. Alt-tab away etc. if the app was
fullscreen. I do that already with games on even Xorg.

If a display server freezes, that's a desktop-wide problem, but so is
killing it.

OTOH, if UMD really does need to terminate the process, then please do
it in a way that causes a crash report to be recorded. _exit() with an
error code is not it.


Thanks,
pq


pgpsZkuqTFX9g.pgp
Description: OpenPGP digital signature


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-03 Thread Michel Dänzer
On 6/30/23 22:32, Marek Olšák wrote:
> On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer  > wrote:
>> On 6/30/23 16:59, Alex Deucher wrote:
>>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>>> mailto:sebastian.w...@redhat.com>> wrote:
 On Tue, Jun 27, 2023 at 3:23 PM André Almeida >>> > wrote:
>
> +Robustness
> +--
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is 
> using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. 
> However,
> +there is no guarantee that the app will use such features correctly, and 
> the
> +UMD can implement policies to close the app if it is a repeating 
> offender,
> +likely in a broken loop. This is done to ensure that it does not keep 
> blocking
> +the user interface from being correctly displayed. This should be done 
> even if
> +the app is correct but happens to trigger some bug in the 
> hardware/driver.

 I still don't think it's good to let the kernel arbitrarily kill
 processes that it thinks are not well-behaved based on some heuristics
 and policy.

 Can't this be outsourced to user space? Expose the information about
 processes causing a device and let e.g. systemd deal with coming up
 with a policy and with killing stuff.
>>>
>>> I don't think it's the kernel doing the killing, it would be the UMD.
>>> E.g., if the app is guilty and doesn't support robustness the UMD can
>>> just call exit().
>>
>> It would be safer to just ignore API calls[0], similarly to what is done 
>> until the application destroys the context with robustness. Calling exit() 
>> likely results in losing any unsaved work, whereas at least some 
>> applications might otherwise allow saving the work by other means.
> 
> That's a terrible idea. Ignoring API calls would be identical to a freeze. 
> You might as well disable GPU recovery because the result would be the same.

No GPU recovery would affect everything using the GPU, whereas this affects 
only non-robust applications.


> - non-robust contexts: call exit(1) immediately, which is the best way to 
> recover

That's not the UMD's call to make.


>> [0] Possibly accompanied by a one-time message to stderr along the lines 
>> of "GPU reset detected but robustness not enabled in context, ignoring 
>> OpenGL API calls".


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-30 Thread Marek Olšák
That's a terrible idea. Ignoring API calls would be identical to a freeze.
You might as well disable GPU recovery because the result would be the same.

There are 2 scenarios:
- robust contexts: report the GPU reset status and skip API calls; let the
app recreate the context to recover
- non-robust contexts: call exit(1) immediately, which is the best way to
recover

Marek

On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer 
wrote:

> On 6/30/23 16:59, Alex Deucher wrote:
> > On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >  wrote:
> >> On Tue, Jun 27, 2023 at 3:23 PM André Almeida 
> wrote:
> >>>
> >>> +Robustness
> >>> +--
> >>> +
> >>> +The only way to try to keep an application working after a reset is
> if it
> >>> +complies with the robustness aspects of the graphical API that it is
> using.
> >>> +
> >>> +Graphical APIs provide ways to applications to deal with device
> resets. However,
> >>> +there is no guarantee that the app will use such features correctly,
> and the
> >>> +UMD can implement policies to close the app if it is a repeating
> offender,
> >>> +likely in a broken loop. This is done to ensure that it does not keep
> blocking
> >>> +the user interface from being correctly displayed. This should be
> done even if
> >>> +the app is correct but happens to trigger some bug in the
> hardware/driver.
> >>
> >> I still don't think it's good to let the kernel arbitrarily kill
> >> processes that it thinks are not well-behaved based on some heuristics
> >> and policy.
> >>
> >> Can't this be outsourced to user space? Expose the information about
> >> processes causing a device and let e.g. systemd deal with coming up
> >> with a policy and with killing stuff.
> >
> > I don't think it's the kernel doing the killing, it would be the UMD.
> > E.g., if the app is guilty and doesn't support robustness the UMD can
> > just call exit().
>
> It would be safer to just ignore API calls[0], similarly to what is done
> until the application destroys the context with robustness. Calling exit()
> likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
>
>
> [0] Possibly accompanied by a one-time message to stderr along the lines
> of "GPU reset detected but robustness not enabled in context, ignoring
> OpenGL API calls".
>
> --
> Earthling Michel Dänzer|  https://redhat.com
> Libre software enthusiast  | Mesa and Xwayland developer
>
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-30 Thread Sebastian Wick
On Fri, Jun 30, 2023 at 4:59 PM Alex Deucher  wrote:
>
> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>  wrote:
> >
> > On Tue, Jun 27, 2023 at 3:23 PM André Almeida  
> > wrote:
> > >
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen 
> > > Signed-off-by: André Almeida 
> > > ---
> > >
> > > v4: 
> > > https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
> > >
> > > Changes:
> > >  - Grammar fixes (Randy)
> > >
> > >  Documentation/gpu/drm-uapi.rst | 68 ++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst 
> > > b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a 
> > > third handler for
> > >  mmapped regular files. Threads cause additional pain with signal
> > >  handling as well.
> > >
> > > +Device reset
> > > +
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware 
> > > bugs,
> > > +faulty applications and everything in between the many layers. Some 
> > > errors
> > > +require resetting the device in order to make the device usable again. 
> > > This
> > > +sections describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +--
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to 
> > > perform
> > > +it as needed. Usually a hang is detected when a job gets stuck 
> > > executing. KMD
> > > +should keep track of resets, because userspace can query any time about 
> > > the
> > > +reset stats for an specific context. This is needed to propagate to the 
> > > rest of
> > > +the stack that a reset has happened. Currently, this is implemented by 
> > > each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the 
> > > device has
> > > +been reset, and this can be checked more often if the UMD requires it. 
> > > After
> > > +detecting a reset, UMD will then proceed to report it to the application 
> > > using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +--
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is 
> > > using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. 
> > > However,
> > > +there is no guarantee that the app will use such features correctly, and 
> > > the
> > > +UMD can implement policies to close the app if it is a repeating 
> > > offender,
> > > +likely in a broken loop. This is done to ensure that it does not keep 
> > > blocking
> > > +the user interface from being correctly displayed. This should be done 
> > > even if
> > > +the app is correct but happens to trigger some bug in the 
> > > hardware/driver.
> >
> > I still don't think it's good to let the kernel arbitrarily kill
> > processes that it thinks are not well-behaved based on some heuristics
> > and policy.
> >
> > Can't this be outsourced to user space? Expose the information about
> > processes causing a device and let e.g. systemd deal with coming up
> > with a policy and with killing stuff.
>
> I don't think it's the kernel doing the killing, it would be the UMD.
> E.g., if the app is guilty and doesn't support robustness the UMD can
> just call exit().

Ah, right, completely skipped over the UMD part. That makes more sense.
>
> Alex
>
> >
> > > +
> > > +OpenGL
> > > +~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL 
> > > ES). This
> > > +interface tells if a reset has happened, and if so, all the context 
> > > state is
> > > +considered lost and the app proceeds by creating new ones. If it is 
> > > possible to
> > > +determine that robustness is not in use, the UMD will terminate the app 
> > > when a
> > > +reset is detected, giving that the contexts are lost and the app won't 
> > > be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for 
> > > submissions.
> > > +This error code means, among other things, that a device reset has 
> > > happened and
> > > +it needs to recreate the contexts to keep going.
> > > +
> > > +Reporting causes of resets
> > > +--
> > > +
> > > +Apart from 

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-30 Thread Michel Dänzer
On 6/30/23 16:59, Alex Deucher wrote:
> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>  wrote:
>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida  wrote:
>>>
>>> +Robustness
>>> +--
>>> +
>>> +The only way to try to keep an application working after a reset is if it
>>> +complies with the robustness aspects of the graphical API that it is using.
>>> +
>>> +Graphical APIs provide ways to applications to deal with device resets. 
>>> However,
>>> +there is no guarantee that the app will use such features correctly, and 
>>> the
>>> +UMD can implement policies to close the app if it is a repeating offender,
>>> +likely in a broken loop. This is done to ensure that it does not keep 
>>> blocking
>>> +the user interface from being correctly displayed. This should be done 
>>> even if
>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>
>> I still don't think it's good to let the kernel arbitrarily kill
>> processes that it thinks are not well-behaved based on some heuristics
>> and policy.
>>
>> Can't this be outsourced to user space? Expose the information about
>> processes causing a device and let e.g. systemd deal with coming up
>> with a policy and with killing stuff.
> 
> I don't think it's the kernel doing the killing, it would be the UMD.
> E.g., if the app is guilty and doesn't support robustness the UMD can
> just call exit().

It would be safer to just ignore API calls[0], similarly to what is done until 
the application destroys the context with robustness. Calling exit() likely 
results in losing any unsaved work, whereas at least some applications might 
otherwise allow saving the work by other means.


[0] Possibly accompanied by a one-time message to stderr along the lines of 
"GPU reset detected but robustness not enabled in context, ignoring OpenGL API 
calls".

-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-30 Thread Alex Deucher
On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
 wrote:
>
> On Tue, Jun 27, 2023 at 3:23 PM André Almeida  wrote:
> >
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen 
> > Signed-off-by: André Almeida 
> > ---
> >
> > v4: 
> > https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
> >
> > Changes:
> >  - Grammar fixes (Randy)
> >
> >  Documentation/gpu/drm-uapi.rst | 68 ++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third 
> > handler for
> >  mmapped regular files. Threads cause additional pain with signal
> >  handling as well.
> >
> > +Device reset
> > +
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +--
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to 
> > perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. 
> > KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the 
> > rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +
> > +
> > +The UMD should check before submitting new commands to the KMD if the 
> > device has
> > +been reset, and this can be checked more often if the UMD requires it. 
> > After
> > +detecting a reset, UMD will then proceed to report it to the application 
> > using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +--
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. 
> > However,
> > +there is no guarantee that the app will use such features correctly, and 
> > the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep 
> > blocking
> > +the user interface from being correctly displayed. This should be done 
> > even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
>
> I still don't think it's good to let the kernel arbitrarily kill
> processes that it thinks are not well-behaved based on some heuristics
> and policy.
>
> Can't this be outsourced to user space? Expose the information about
> processes causing a device and let e.g. systemd deal with coming up
> with a policy and with killing stuff.

I don't think it's the kernel doing the killing, it would be the UMD.
E.g., if the app is guilty and doesn't support robustness the UMD can
just call exit().

Alex

>
> > +
> > +OpenGL
> > +~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). 
> > This
> > +interface tells if a reset has happened, and if so, all the context state 
> > is
> > +considered lost and the app proceeds by creating new ones. If it is 
> > possible to
> > +determine that robustness is not in use, the UMD will terminate the app 
> > when a
> > +reset is detected, giving that the contexts are lost and the app won't be 
> > able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for 
> > submissions.
> > +This error code means, among other things, that a device reset has 
> > happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--
> > +
> > +Apart from propagating the reset through the stack so apps can recover, 
> > it's
> > +really useful for driver developers to learn more about what caused the 
> > reset in
> > +first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
> > +
> >  .. _drm_driver_ioctl:
> >
> >  IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-30 Thread Sebastian Wick
On Tue, Jun 27, 2023 at 3:23 PM André Almeida  wrote:
>
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen 
> Signed-off-by: André Almeida 
> ---
>
> v4: 
> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
>
> Changes:
>  - Grammar fixes (Randy)
>
>  Documentation/gpu/drm-uapi.rst | 68 ++
>  1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third 
> handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +--
> +
> +The KMD is responsible for checking if the device needs a reset, and to 
> perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest 
> of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +
> +
> +The UMD should check before submitting new commands to the KMD if the device 
> has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application 
> using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +--
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. 
> However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep 
> blocking
> +the user interface from being correctly displayed. This should be done even 
> if
> +the app is correct but happens to trigger some bug in the hardware/driver.

I still don't think it's good to let the kernel arbitrarily kill
processes that it thinks are not well-behaved based on some heuristics
and policy.

Can't this be outsourced to user space? Expose the information about
processes causing a device and let e.g. systemd deal with coming up
with a policy and with killing stuff.

> +
> +OpenGL
> +~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). 
> This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible 
> to
> +determine that robustness is not in use, the UMD will terminate the app when 
> a
> +reset is detected, giving that the contexts are lost and the app won't be 
> able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened 
> and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the 
> reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes
> --
> 2.41.0
>



Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-29 Thread André Almeida




Em 27/06/2023 18:17, André Almeida escreveu:

Em 27/06/2023 14:47, Christian König escreveu:

Am 27.06.23 um 15:23 schrieb André Almeida:

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Acked-by: Pekka Paalanen 
Signed-off-by: André Almeida 
---

v4: 
https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/


Changes:
  - Grammar fixes (Randy)

  Documentation/gpu/drm-uapi.rst | 68 ++
  1 file changed, 68 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst 
b/Documentation/gpu/drm-uapi.rst

index 65fb3036a580..3cbffa25ed93 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a 
third handler for

  mmapped regular files. Threads cause additional pain with signal
  handling as well.
+Device reset
+
+
+The GPU stack is really complex and is prone to errors, from 
hardware bugs,
+faulty applications and everything in between the many layers. Some 
errors
+require resetting the device in order to make the device usable 
again. This

+sections describes the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Kernel Mode Driver
+--
+
+The KMD is responsible for checking if the device needs a reset, and 
to perform
+it as needed. Usually a hang is detected when a job gets stuck 
executing. KMD
+should keep track of resets, because userspace can query any time 
about the

+reset stats for an specific context.


Maybe drop the part "for a specific context". Essentially the reset 
query could use global counters instead and we won't need the context 
any more here.




Right, I wrote like this to reflect how it's currently implemented.

If follow correctly what you meant, KMD could always notify the global 
count for UMD, and we would move to the UMD the responsibility to manage 
the reset counters, right? This would also simplify my 
DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc 
version.




Actually, if we drop the context identifier we would lose the ability to 
track which is the guilty context. Vulkan API doesn't seem to care about 
this, but OpenGL does.



Apart from that this sounds good to me, feel free to add my rb.

Regards,
Christian.




Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-27 Thread Marek Olšák
On Tue, Jun 27, 2023 at 5:31 PM André Almeida 
wrote:

> Hi Marek,
>
> Em 27/06/2023 15:57, Marek Olšák escreveu:
> > On Tue, Jun 27, 2023, 09:23 André Almeida  > > wrote:
> >
> > +User Mode Driver
> > +
> > +
> > +The UMD should check before submitting new commands to the KMD if
> > the device has
> > +been reset, and this can be checked more often if the UMD requires
> > it. After
> > +detecting a reset, UMD will then proceed to report it to the
> > application using
> > +the appropriate API error code, as explained in the section below
> about
> > +robustness.
> >
> >
> > The UMD won't check the device status before every command submission
> > due to ioctl overhead. Instead, the KMD should skip command submission
> > and return an error that it was skipped.
>
> I wrote like this because when reading the source code for
> vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under
> the impression that UMD checks the reset status before every
> submission/flush.
>

It only does that before every command submission when the context is
robust. When it's not robust, radeonsi doesn't do anything.


>
> Is your comment about of how things are currently implemented, or how
> they would ideally work? Either way I can apply your suggestion, I just
> want to make it clear.
>

Yes. Ideally, we would get the reply whether the context is lost from the
CS ioctl. This is not currently implemented.

Marek


>
> [0]
>
> https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142
> [1]
>
> https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83
>
> >
> > The only case where that won't be applicable is user queues where
> > drivers don't call into the kernel to submit work, but they do call into
> > the kernel to create a dma_fence. In that case, the call to create a
> > dma_fence can fail with an error.
> >
> > Marek
>
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-27 Thread André Almeida

Hi Marek,

Em 27/06/2023 15:57, Marek Olšák escreveu:
On Tue, Jun 27, 2023, 09:23 André Almeida > wrote:


+User Mode Driver
+
+
+The UMD should check before submitting new commands to the KMD if
the device has
+been reset, and this can be checked more often if the UMD requires
it. After
+detecting a reset, UMD will then proceed to report it to the
application using
+the appropriate API error code, as explained in the section below about
+robustness.


The UMD won't check the device status before every command submission 
due to ioctl overhead. Instead, the KMD should skip command submission 
and return an error that it was skipped.


I wrote like this because when reading the source code for 
vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under 
the impression that UMD checks the reset status before every 
submission/flush.


Is your comment about of how things are currently implemented, or how 
they would ideally work? Either way I can apply your suggestion, I just 
want to make it clear.


[0] 
https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142
[1] 
https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83




The only case where that won't be applicable is user queues where 
drivers don't call into the kernel to submit work, but they do call into 
the kernel to create a dma_fence. In that case, the call to create a 
dma_fence can fail with an error.


Marek




Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-27 Thread André Almeida

Em 27/06/2023 14:47, Christian König escreveu:

Am 27.06.23 um 15:23 schrieb André Almeida:

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Acked-by: Pekka Paalanen 
Signed-off-by: André Almeida 
---

v4: 
https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/


Changes:
  - Grammar fixes (Randy)

  Documentation/gpu/drm-uapi.rst | 68 ++
  1 file changed, 68 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst 
b/Documentation/gpu/drm-uapi.rst

index 65fb3036a580..3cbffa25ed93 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a 
third handler for

  mmapped regular files. Threads cause additional pain with signal
  handling as well.
+Device reset
+
+
+The GPU stack is really complex and is prone to errors, from hardware 
bugs,
+faulty applications and everything in between the many layers. Some 
errors
+require resetting the device in order to make the device usable 
again. This

+sections describes the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Kernel Mode Driver
+--
+
+The KMD is responsible for checking if the device needs a reset, and 
to perform
+it as needed. Usually a hang is detected when a job gets stuck 
executing. KMD
+should keep track of resets, because userspace can query any time 
about the

+reset stats for an specific context.


Maybe drop the part "for a specific context". Essentially the reset 
query could use global counters instead and we won't need the context 
any more here.




Right, I wrote like this to reflect how it's currently implemented.

If follow correctly what you meant, KMD could always notify the global 
count for UMD, and we would move to the UMD the responsibility to manage 
the reset counters, right? This would also simplify my 
DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc 
version.



Apart from that this sounds good to me, feel free to add my rb.

Regards,
Christian.




Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-27 Thread Marek Olšák
On Tue, Jun 27, 2023, 09:23 André Almeida  wrote:

> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen 
> Signed-off-by: André Almeida 
> ---
>
> v4:
> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
>
> Changes:
>  - Grammar fixes (Randy)
>
>  Documentation/gpu/drm-uapi.rst | 68 ++
>  1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst
> b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third
> handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +
> +
> +The GPU stack is really complex and is prone to errors, from hardware
> bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again.
> This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +--
> +
> +The KMD is responsible for checking if the device needs a reset, and to
> perform
> +it as needed. Usually a hang is detected when a job gets stuck executing.
> KMD
> +should keep track of resets, because userspace can query any time about
> the
> +reset stats for an specific context. This is needed to propagate to the
> rest of
> +the stack that a reset has happened. Currently, this is implemented by
> each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +
> +
> +The UMD should check before submitting new commands to the KMD if the
> device has
> +been reset, and this can be checked more often if the UMD requires it.
> After
> +detecting a reset, UMD will then proceed to report it to the application
> using
> +the appropriate API error code, as explained in the section below about
> +robustness.
>

The UMD won't check the device status before every command submission due
to ioctl overhead. Instead, the KMD should skip command submission and
return an error that it was skipped.

The only case where that won't be applicable is user queues where drivers
don't call into the kernel to submit work, but they do call into the kernel
to create a dma_fence. In that case, the call to create a dma_fence can
fail with an error.

Marek

+
> +Robustness
> +--
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is
> using.
> +
> +Graphical APIs provide ways to applications to deal with device resets.
> However,
> +there is no guarantee that the app will use such features correctly, and
> the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep
> blocking
> +the user interface from being correctly displayed. This should be done
> even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES).
> This
> +interface tells if a reset has happened, and if so, all the context state
> is
> +considered lost and the app proceeds by creating new ones. If it is
> possible to
> +determine that robustness is not in use, the UMD will terminate the app
> when a
> +reset is detected, giving that the contexts are lost and the app won't be
> able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for
> submissions.
> +This error code means, among other things, that a device reset has
> happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--
> +
> +Apart from propagating the reset through the stack so apps can recover,
> it's
> +really useful for driver developers to learn more about what caused the
> reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes
> --
> 2.41.0
>
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-27 Thread Christian König

Am 27.06.23 um 15:23 schrieb André Almeida:

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Acked-by: Pekka Paalanen 
Signed-off-by: André Almeida 
---

v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/

Changes:
  - Grammar fixes (Randy)

  Documentation/gpu/drm-uapi.rst | 68 ++
  1 file changed, 68 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..3cbffa25ed93 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third 
handler for
  mmapped regular files. Threads cause additional pain with signal
  handling as well.
  
+Device reset

+
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. Some errors
+require resetting the device in order to make the device usable again. This
+sections describes the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Kernel Mode Driver
+--
+
+The KMD is responsible for checking if the device needs a reset, and to perform
+it as needed. Usually a hang is detected when a job gets stuck executing. KMD
+should keep track of resets, because userspace can query any time about the
+reset stats for an specific context.


Maybe drop the part "for a specific context". Essentially the reset 
query could use global counters instead and we won't need the context 
any more here.


Apart from that this sounds good to me, feel free to add my rb.

Regards,
Christian.


  This is needed to propagate to the rest of
+the stack that a reset has happened. Currently, this is implemented by each
+driver separately, with no common DRM interface.
+
+User Mode Driver
+
+
+The UMD should check before submitting new commands to the KMD if the device 
has
+been reset, and this can be checked more often if the UMD requires it. After
+detecting a reset, UMD will then proceed to report it to the application using
+the appropriate API error code, as explained in the section below about
+robustness.
+
+Robustness
+--
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways to applications to deal with device resets. 
However,
+there is no guarantee that the app will use such features correctly, and the
+UMD can implement policies to close the app if it is a repeating offender,
+likely in a broken loop. This is done to ensure that it does not keep blocking
+the user interface from being correctly displayed. This should be done even if
+the app is correct but happens to trigger some bug in the hardware/driver.
+
+OpenGL
+~~
+
+Apps using OpenGL should use the available robust interfaces, like the
+extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
+interface tells if a reset has happened, and if so, all the context state is
+considered lost and the app proceeds by creating new ones. If it is possible to
+determine that robustness is not in use, the UMD will terminate the app when a
+reset is detected, giving that the contexts are lost and the app won't be able
+to figure this out and recreate the contexts.
+
+Vulkan
+~~
+
+Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
+This error code means, among other things, that a device reset has happened and
+it needs to recreate the contexts to keep going.
+
+Reporting causes of resets
+--
+
+Apart from propagating the reset through the stack so apps can recover, it's
+really useful for driver developers to learn more about what caused the reset 
in
+first place. DRM devices should make use of devcoredump to store relevant
+information about the reset, so this information can be added to user bug
+reports.
+
  .. _drm_driver_ioctl:
  
  IOCTL Support on Device Nodes




Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-27 Thread Randy Dunlap
Hi André,

I have just a few more below:

On 6/27/23 06:23, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
> 
> Acked-by: Pekka Paalanen 
> Signed-off-by: André Almeida 
> ---
> 
> v4: 
> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
> 
> Changes:
>  - Grammar fixes (Randy)
> 
>  Documentation/gpu/drm-uapi.rst | 68 ++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third 
> handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>  
> +Device reset
> +
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a

   section

> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +--
> +
> +The KMD is responsible for checking if the device needs a reset, and to 
> perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest 
> of

   for a specific

stats or status?

> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +
> +
> +The UMD should check before submitting new commands to the KMD if the device 
> has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application 
> using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +--
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. 
> However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep 
> blocking
> +the user interface from being correctly displayed. This should be done even 
> if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). 
> This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible 
> to
> +determine that robustness is not in use, the UMD will terminate the app when 
> a
> +reset is detected, giving that the contexts are lost and the app won't be 
> able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened 
> and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the 
> reset in
> +first place. DRM devices should make use of devcoredump to store relevant

   the first place.

> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>  
>  IOCTL Support on Device Nodes

and with those addressed:

Reviewed-by: Randy Dunlap 

Thanks for adding the documentation.

-- 
~Randy