Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-26 Thread Timur Kristóf
On Tue, 2023-07-25 at 19:00 +0200, Michel Dänzer wrote:
> On 7/25/23 17:05, Marek Olšák wrote:
> > On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
> >  wrote:
> > > On 7/25/23 04:55, André Almeida wrote:
> > > > Hi everyone,
> > > > 
> > > > It's not clear what we should do about non-robust OpenGL apps
> > > > after GPU resets, so I'll try to summarize the topic, show some
> > > > options, and give my proposal for moving forward.
> > > > 
> > > > Em 27/06/2023 10:23, André Almeida escreveu:
> > > > > +Robustness
> > > > > +--
> > > > > +
> > > > > +The only way to try to keep an application working after a reset is if it
> > > > > +complies with the robustness aspects of the graphical API that it is using.
> > > > > +
> > > > > +Graphical APIs provide ways for applications to deal with device resets. However,
> > > > > +there is no guarantee that the app will use such features correctly, and the
> > > > > +UMD can implement policies to close the app if it is a repeat offender,
> > > > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > > > +the user interface from being correctly displayed. This should be done even if
> > > > > +the app is correct but happens to trigger some bug in the hardware/driver.
> > > > > +
> > > > Depending on the OpenGL version, there are different robustness
> > > > APIs available:
> > > > 
> > > > - OpenGL ARB extension [0]
> > > > - OpenGL KHR extension [1]
> > > > - OpenGL ES extension  [2]
> > > > 
> > > > Apps written in OpenGL should use whatever version is available
> > > > to them to make the app robust against GPU resets. That usually
> > > > means calling GetGraphicsResetStatusARB() and checking the status:
> > > > if it returns something different from NO_ERROR, a reset has
> > > > happened, the context is considered lost and should be recreated.
> > > > If an app follows this, it will likely succeed in recovering from
> > > > a reset.
> > > > 
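For reference, the check described above boils down to roughly the
following. This is only an illustrative sketch, not code from any
particular app or driver: it assumes a context created with the
LOSE_CONTEXT_ON_RESET notification strategy (via the corresponding
create_context_robustness extension), that the extension entry point has
been resolved as usual, and recreate_gl_context() is a hypothetical
application helper.

#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>   /* GL_ARB_robustness enums and prototypes */

void recreate_gl_context(void);  /* hypothetical: app-specific context rebuild */

static int handle_possible_reset(void)
{
        GLenum status = glGetGraphicsResetStatusARB();

        if (status == GL_NO_ERROR)
                return 0;  /* no reset happened, keep rendering */

        /* GUILTY/INNOCENT/UNKNOWN_CONTEXT_RESET_ARB all mean the context
         * is lost: tear it down and recreate it together with all GL
         * objects (in a real app the entry point is usually loaded via
         * glXGetProcAddress/eglGetProcAddress rather than linked). */
        recreate_gl_context();
        return 1;
}
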
> > > > What should non-robust apps do then? They certainly will
> > > > not be notified if a reset happens, and thus can't recover if
> > > > their context is lost. The OpenGL specification does not explicitly
> > > > define what should be done in such situations [3], and I believe
> > > > that if the spec mandated closing the app, it would note that
> > > > explicitly.
> > > > 
> > > > However, in reality there are different types of device resets,
> > > > causing different results. A reset can be precise enough to
> > > > damage only the guilty context, and keep others alive.
> > > > 
> > > > Given that, I believe drivers have the following options:
> > > > 
> > > > a) Kill all non-robust apps after a reset. This may lead to
> > > > losing work from innocent applications.
> > > > 
> > > > b) Ignore all non-robust apps' OpenGL calls. That means that
> > > > applications would still be alive, but the user interface would
> > > > be frozen. The user would need to close it manually anyway, but
> > > > in some corner cases, the app could autosave some work or the
> > > > user might be able to interact with it using some alternative
> > > > method (command line?).
> > > > 
> > > > c) Kill just the affected non-robust applications. To do that,
> > > > the driver needs to be 100% sure of the impact of its resets.
> > > > 
> > > > RadeonSI currently implements a), as can be seen at [4], while
> > > > Iris implements what I think is c) [5].
> > > > 
> > > > From the user experience point of view, c) is clearly the best
> > > > option, but it's the hardest to achieve. There's not much gain
> > > > in having b) over a); perhaps it could be an optional env var
> > > > for such corner case applications.
> > > 
> > > I disagree on these conclusions.
> > > 
> > > c) is certainly better than a), but it's not "clearly the best"
> > > in all cases. The OpenGL UMD is not a privileged/special
> > > component and is in no position to decide whether the
> > > process as a whole (only some thread(s) of which may use OpenGL
> > > at all) gets to continue running or not.
> > 
> > That's not true.
> 
> Which part of what I wrote are you referring to?
> 
> 
> > I recommend that you enable b) with your driver and then hang the
> > GPU under different scenarios and see the result.
> 
> I've been doing GPU driver development for over two decades; I'm
> perfectly aware of what it means. It doesn't change what I wrote above.
> 

Michel, I understand that you disagree with the proposed solutions in
this email thread, but from your mails it is unclear to me what solution
you would actually recommend. Can you please clarify?

Thanks,
Timur



Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Timur Kristóf
Hi Felix,

On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
> That's the worst-case scenario where you're debugging HW or FW
> issues. 
> Those should be pretty rare post-bringup. But are there hangs caused
> by 
> user mode driver or application bugs that are easier to debug and 
> probably don't even require a GPU reset?

There are many GPU hangs that gamers experience while playing. We have
dozens of open bug reports against RADV about GPU hangs on various GPU
generations. These usually fall into two categories:

1. Hangs that always happen at the same point in a game. These are
painful to debug but manageable.
2. "Random" hangs that happen to users over the course of playing a
game for several hours. It is absolute hell to even reproduce these
issues, let alone diagnose them, and this is what we would like to
improve.

For these hard-to-diagnose problems, it is already a challenge to
determine whether the problem is in the kernel (e.g. setting wrong
voltages / frequencies) or in userspace (e.g. missing some
synchronization); it can even be a game bug that we need to work around.

> For example most VM faults can 
> be handled without hanging the GPU. Similarly, a shader in an endless
> loop should not require a full GPU reset.

This is actually not the case; AFAIK André's test case was an app that
had an infinite loop in a shader.

> 
> It's more complicated for graphics because of the more complex
> pipeline 
> and the lack of CWSR. But it should still be possible to do some 
> debugging without JTAG if the problem is in SW and not HW or FW. It's
> probably worth improving that debuggability without getting hung up on
> the worst case.

I agree, and we welcome any constructive suggestion to improve the
situation. It seems like our idea doesn't work if the kernel can't give
us the information we need.

How do we move forward?

Best regards,
Timur



Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Timur Kristóf
On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote:
> On Tue, May 2, 2023 at 9:35 AM Timur Kristóf
>  wrote:
> > 
> > Hi,
> > 
> > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> > > > 
> > > > Christian König  ezt írta (időpont:
> > > > 2023.
> > > > máj. 2., Ke 9:59):
> > > > 
> > > > > Am 02.05.23 um 03:26 schrieb André Almeida:
> > > > >  > Em 01/05/2023 16:24, Alex Deucher escreveu:
> > > > >  >> On Mon, May 1, 2023 at 2:58 PM André Almeida
> > > > > 
> > > > >  >> wrote:
> > > > >  >>>
> > > > >  >>> I know that devcoredump is also used for this kind of
> > > > > information,
> > > > >  >>> but I believe
> > > > >  >>> that using an IOCTL is better for interfacing Mesa +
> > > > > Linux
> > > > > rather
> > > > >  >>> than parsing
> > > > > a file whose contents are subject to change.
> > > > >  >>
> > > > >  >> Can you elaborate a bit on that?  Isn't the whole point
> > > > > of
> > > > > devcoredump
> > > > >  >> to store this sort of information?
> > > > >  >>
> > > > >  >
> > > > >  > I think that devcoredump is something that you could use
> > > > > to
> > > > > submit to
> > > > >  > a bug report as it is, and then people can read/parse as
> > > > > they
> > > > > want,
> > > > >  > not as an interface to be read by Mesa... I'm not sure
> > > > > that
> > > > > it's
> > > > >  > something that I would call an API. But I might be wrong,
> > > > > if
> > > > > you know
> > > > >  > something that uses that as an API please share.
> > > > >  >
> > > > >  > Anyway, relying on that for Mesa would mean that we would
> > > > > need
> > > > > to
> > > > >  > ensure stability for the file content and format, making
> > > > > it
> > > > > less
> > > > >  > flexible to modify in the future and prone to bugs, while
> > > > > the
> > > > > IOCTL is
> > > > >  > well defined and extensible. Maybe the dump from Mesa +
> > > > > devcoredump
> > > > >  > could be complementary information to a bug report.
> > > > > 
> > > > >  Neither using an IOCTL nor devcoredump is a good approach
> > > > > for
> > > > > this since
> > > > >  the values read from the hw register are completely
> > > > > unreliable.
> > > > > They
> > > > >  could not be available because of GFXOFF or they could be
> > > > > overwritten or
> > > > >  not even updated by the CP in the first place because of a
> > > > > hang
> > > > > etc
> > > > > 
> > > > >  If you want to track progress inside an IB what you do
> > > > > instead
> > > > > is to
> > > > >  insert intermediate fence write commands into the IB. E.g.
> > > > > something
> > > > >  like write value X to location Y when this executes.
> > > > > 
> > > > >  This way you can not only track how far the IB processed,
> > > > > but
> > > > > also in
> > > > >  which stages of processing we were when the hang occurred.
> > > > > E.g.
> > > > > End of
> > > > >  Pipe, End of Shaders, specific shader stages etc...
> > > > > 
> > > > > 
> > > > 
> > > > Currently our biggest challenge in the userspace driver is
> > > > debugging "random" GPU hangs. We have many dozens of bug
> > > > reports
> > > > from users which are like: "play the game for X hours and it
> > > > will
> > > > eventually hang the GPU". With the currently available tools,
> > > > it is
> > > > impossible for us to tackle these issues. André's proposal
> > > > would be
> > > > a step in improving this situation.
> > > > 
> > > > We already do something like

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Timur Kristóf
Hi,

On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> > 
> > Christian König  ezt írta (időpont: 2023.
> > máj. 2., Ke 9:59):
> >  
> > > Am 02.05.23 um 03:26 schrieb André Almeida:
> > >  > Em 01/05/2023 16:24, Alex Deucher escreveu:
> > >  >> On Mon, May 1, 2023 at 2:58 PM André Almeida
> > >  
> > >  >> wrote:
> > >  >>>
> > >  >>> I know that devcoredump is also used for this kind of
> > > information, 
> > >  >>> but I believe
> > >  >>> that using an IOCTL is better for interfacing Mesa + Linux
> > > rather 
> > >  >>> than parsing
> > >  >>> a file whose contents are subject to change.
> > >  >>
> > >  >> Can you elaborate a bit on that?  Isn't the whole point of
> > > devcoredump
> > >  >> to store this sort of information?
> > >  >>
> > >  >
> > >  > I think that devcoredump is something that you could use to
> > > submit to 
> > >  > a bug report as it is, and then people can read/parse as they
> > > want, 
> > >  > not as an interface to be read by Mesa... I'm not sure that
> > > it's 
> > >  > something that I would call an API. But I might be wrong, if
> > > you know 
> > >  > something that uses that as an API please share.
> > >  >
> > >  > Anyway, relying on that for Mesa would mean that we would need
> > > to 
> > >  > ensure stability for the file content and format, making it
> > > less 
> > >  > flexible to modify in the future and prone to bugs, while the
> > > IOCTL is 
> > >  > well defined and extensible. Maybe the dump from Mesa +
> > > devcoredump 
> > >  > could be complementary information to a bug report.
> > >  
> > >  Neither using an IOCTL nor devcoredump is a good approach for
> > > this since 
> > >  the values read from the hw register are completely unreliable.
> > > They 
> > >  could not be available because of GFXOFF or they could be
> > > overwritten or 
> > >  not even updated by the CP in the first place because of a hang
> > > etc
> > >  
> > >  If you want to track progress inside an IB what you do instead
> > > is to 
> > >  insert intermediate fence write commands into the IB. E.g.
> > > something 
> > >  like write value X to location Y when this executes.
> > >  
> > >  This way you can not only track how far the IB processed, but
> > > also in 
> > >  which stages of processing we were when the hang occurred. E.g.
> > > End of 
> > >  Pipe, End of Shaders, specific shader stages etc...
> > >  
> > > 
> >  
> > Currently our biggest challenge in the userspace driver is
> > debugging "random" GPU hangs. We have many dozens of bug reports
> > from users which are like: "play the game for X hours and it will
> > eventually hang the GPU". With the currently available tools, it is
> > impossible for us to tackle these issues. André's proposal would be
> > a step in improving this situation.
> > 
> > We already do something like what you suggest, but there are
> > multiple problems with that approach:
> >  
> > 1. we can only submit 1 command buffer at a time because we won't
> > know which IB hung
> > 2. we can't use chaining because we don't know where in the IB it
> > hung
> > 3. it needs userspace to insert (a lot of) extra commands such as
> > extra synchronization and memory writes
> > 4. It doesn't work when GPU recovery is enabled because the
> > information is already gone when we detect the hang
> > 
>  You can still submit multiple IBs and even chain them. All you need
> to do is to insert into each IB commands which write the IB being
> executed and the position inside the IB to an extra memory location.
> 
>  The write data command allows writing as many dwords as you want (up to
> multiple kb). The only potential problem is when you submit the same
> IB multiple times.
> 
>  And yes that is of course quite some extra overhead, but I think
> that should be manageable.

Thanks, this sounds doable and would solve the limitation of how many
IBs are submitted at a time. However, it doesn't address the problem
that enabling this sort of debugging will still have extra overhead.

I don't mean the overhead from writing a couple of dwords for the
trace, but rather, the overhead from needing to emit flushes or top of
pipe events or whatever else we need so that we can tell which command
hung the GPU.

>  
> > In my opinion, the correct solution to those problems would be if
> > the kernel could give userspace the necessary information about a
> > GPU hang before a GPU reset.
> >   
>  The fundamental problem here is that the kernel doesn't have that
> information either. We know which IB timed out and can potentially do
> a devcoredump when that happens, but that's it.


Is it really not possible to know such a fundamental thing as what the
GPU was doing when it hung? How are we supposed to do any kind of
debugging without knowing that?

I wonder what AMD's Windows driver team is doing about this problem;
surely they must have better tools to deal with GPU hangs?

Best regards,
Timur





Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Timur Kristóf
Hi Christian,

Christian König  ezt írta (időpont: 2023. máj.
2., Ke 9:59):

> Am 02.05.23 um 03:26 schrieb André Almeida:
> > Em 01/05/2023 16:24, Alex Deucher escreveu:
> >> On Mon, May 1, 2023 at 2:58 PM André Almeida 
> >> wrote:
> >>>
> >>> I know that devcoredump is also used for this kind of information,
> >>> but I believe
> >>> that using an IOCTL is better for interfacing Mesa + Linux rather
> >>> than parsing
> >>> a file whose contents are subject to change.
> >>
> >> Can you elaborate a bit on that?  Isn't the whole point of devcoredump
> >> to store this sort of information?
> >>
> >
> > I think that devcoredump is something that you could use to submit to
> > a bug report as it is, and then people can read/parse as they want,
> > not as an interface to be read by Mesa... I'm not sure that it's
> > something that I would call an API. But I might be wrong, if you know
> > something that uses that as an API please share.
> >
> > Anyway, relying on that for Mesa would mean that we would need to
> > ensure stability for the file content and format, making it less
> > flexible to modify in the future and prone to bugs, while the IOCTL is
> > well defined and extensible. Maybe the dump from Mesa + devcoredump
> > could be complementary information to a bug report.
>
> Neither using an IOCTL nor devcoredump is a good approach for this since
> the values read from the hw register are completely unreliable. They
> could not be available because of GFXOFF or they could be overwritten or
> not even updated by the CP in the first place because of a hang etc
>
> If you want to track progress inside an IB what you do instead is to
> insert intermediate fence write commands into the IB. E.g. something
> like write value X to location Y when this executes.
>
> This way you can not only track how far the IB processed, but also in
> which stages of processing we were when the hang occurred. E.g. End of
> Pipe, End of Shaders, specific shader stages etc...
>

Currently our biggest challenge in the userspace driver is debugging
"random" GPU hangs. We have many dozens of bug reports from users which are
like: "play the game for X hours and it will eventually hang the GPU". With
the currently available tools, it is impossible for us to tackle these
issues. André's proposal would be a step in improving this situation.

We already do something like what you suggest (a rough sketch of that
instrumentation follows below), but there are multiple problems with
that approach:

1. we can only submit 1 command buffer at a time because we won't know
which IB hung
2. we can't use chaining because we don't know where in the IB it hung
3. it needs userspace to insert (a lot of) extra commands such as extra
synchronization and memory writes
4. it doesn't work when GPU recovery is enabled because the information is
already gone when we detect the hang

Consequences:

A. It has a huge perf impact, so we can't always enable it
B. Thanks to the extra synchronization, some issues can't be reproduced
when this kind of debugging is enabled
C. We have to ask users to disable GPU recovery to collect logs for us
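
To make the above concrete, the instrumentation we insert today boils
down to roughly the following. This is a hedged sketch rather than
actual RADV code: cmd_buffer, emit_write_data() and the trace buffer
layout are hypothetical stand-ins for the real command-stream helpers
and the PM4 WRITE_DATA packet.

#include <stdint.h>

struct cmd_buffer;                    /* hypothetical command stream handle */

struct hang_trace {
        uint64_t marker_va;           /* GPU VA of a host-visible marker slot */
        uint32_t next_marker;         /* value written by the next command    */
};

/* hypothetical helper: emits a packet that writes 'value' to 'va'
 * (i.e. what a PM4 WRITE_DATA does on AMD hardware) */
void emit_write_data(struct cmd_buffer *cs, uint64_t va, uint32_t value);

/* Called after every draw/dispatch we record. After a hang, the last
 * marker that actually landed in memory tells us how far the IB got;
 * knowing that in pipeline order additionally needs the extra
 * synchronization from consequence B above, which is where most of the
 * overhead comes from. */
static void trace_breadcrumb(struct cmd_buffer *cs, struct hang_trace *t)
{
        emit_write_data(cs, t->marker_va, t->next_marker++);
}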

In my opinion, the correct solution to those problems would be if the
kernel could give userspace the necessary information about a GPU hang
before a GPU reset. To avoid the massive performance cost, it would be best
if we could know which IB hung and which commands were being executed
when it hung (perhaps pointers to the VA of the commands), along with which
shaders were in flight (perhaps pointers to the VA of the shader binaries).

If such an interface could be created, that would mean we could easily
query this information and create useful logs of GPU hangs without much
userspace overhead and without requiring the user to disable GPU resets etc.

If it's not possible to do this, we'd appreciate some suggestions on how to
properly solve this without the massive performance cost and without
requiring the user to disable GPU recovery.

Side note, it is also extremely difficult to even determine whether the
problem is in userspace or in the kernel. While kernel developers usually
dismiss all GPU hangs as userspace problems, we've seen many issues where
the problem was in the kernel (e.g. bugs where wrong voltages were set,
etc.); any ideas for tackling those kinds of issues are also welcome.

Thanks & best regards,
Timur

>


Re: [PATCH v2 3/3] drm/amdgpu: Add support for querying the max ibs in a submission. (v2)

2023-04-11 Thread Timur Kristóf
Bas Nieuwenhuizen  ezt írta (időpont: 2023. ápr.
11., Ke 10:25):

> On Tue, Apr 11, 2023 at 10:10 AM Christian König
>  wrote:
> >
> > Am 09.04.23 um 20:59 schrieb Bas Nieuwenhuizen:
> > > We need to introduce a new version of the info struct because
> > > libdrm_amdgpu forgot any versioning info in amdgpu_query_hw_ip_info
> > > so the mesa<->libdrm_amdgpu interface can't handle increasing the
> > > size.
> > >
> > > This info would be used by radv to figure out when we need to
> > > split a submission into multiple submissions. radv currently has
> > > a limit of 192 which seems to work for most gfx submissions, but
> > > is way too high for e.g. compute or sdma.
> >
> > Why do you need so many IBs in the first place?
> >
> > You can use sub-IBs since R600 and newer hw even supports a JUMP command
> > to chain IBs together IIRC.
>
> Couple of reasons:
>
> 1) We can't reliably use sub-IBs (both because on GFX6-7 there are
> indirect draw commands that cannot be used in sub-IBs and because the
> Vulkan API exposes sub-IBs, so we can have the primary IBs be sub-IBs
> already)
>

Furthermore, only the GFX queue supports the "IB2" packet that is used to
implement sub-IBs.

(The same packet hangs the compute queue and is documented as not working
in the PAL source code. Other queues don't seem to have a packet for this
purpose.)

> 2) We believed GFX6 may not support chaining. (Then again, it turns
> out the GFX7+ packet may just work on GFX6?
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22406)
>

I was actually quite surprised when I found this out. Mesa developers seem
to have believed that this is not possible on GFX6. I'd love to get some
more context on this: did it always work, or was it added in a FW update?

> 3) Chaining doesn't work if the IB may be in flight multiple times in
> a different sequence.
>

Additionally, chaining also doesn't work on any queue other than GFX and
compute. As far as we know, SDMA and the various encoder/decoder queues
don't have a packet for chaining.

Christian, please let us know if we are wrong about this.


Best regards,
Timur




> We try to chain when we can but (2) and (3) are cases where we
> fallback to submitting multiple IBs.
>
> >
> > For the kernel UAPI you only need separate IBs if you have separate
> > properties on them, E.g. preamble or submitting to a different engine.
> >
> > Everything else just needs additional CPU overhead in the IOCTL.
> >
> > Regards,
> > Christian.
> >
> > >
> > > Because the kernel handling is just fine we can just use the v2
> > > everywhere and have the return_size do the versioning for us,
> > > with userspace interpreting 0 as unknown.
> > >
> > > Userspace implementation:
> > > https://gitlab.freedesktop.org/bnieuwenhuizen/drm/-/tree/ib-rejection
> > > https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/tree/ib-rejection
> > >
> > > v2: Use base member in the new struct.
> > >
> > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2498
> > > Signed-off-by: Bas Nieuwenhuizen 
> > > Cc: David Airlie 
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 31
> ++---
> > >   include/uapi/drm/amdgpu_drm.h   | 14 +++
> > >   2 files changed, 31 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > index 89689b940493..5b575e1aef1a 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > @@ -360,7 +360,7 @@ static int amdgpu_firmware_info(struct
> drm_amdgpu_info_firmware *fw_info,
> > >
> > >   static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
> > >struct drm_amdgpu_info *info,
> > > -  struct drm_amdgpu_info_hw_ip *result)
> > > +  struct drm_amdgpu_info_hw_ip2 *result)
> > >   {
> > >   uint32_t ib_start_alignment = 0;
> > >   uint32_t ib_size_alignment = 0;
> > > @@ -431,6 +431,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device
> *adev,
> > >   return -EINVAL;
> > >   }
> > >
> > > + result->max_ibs = UINT_MAX;
> > >   for (i = 0; i < adev->num_rings; ++i) {
> > >   /* Note that this uses that ring types alias the
> equivalent
> > >* HW IP exposes to userspace.
> > > @@ -438,6 +439,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device
> *adev,
> > >   if (adev->rings[i]->funcs->type ==
> info->query_hw_ip.type &&
> > >   adev->rings[i]->sched.ready) {
> > >   ++num_rings;
> > > + result->max_ibs = min(result->max_ibs,
> > > +   adev->rings[i]->max_ibs);
> > >   }
> > >   }
> > >
> > > @@ -452,36 +455,36 @@ static int amdgpu_hw_ip_info(struct
> amdgpu_device *adev,
> > >   num_rings = min(amdgpu_ctx_num_entities[info->query_hw_ip.type],
> > >  

Re: [PATCH v2 3/3] drm/amdgpu: Add support for querying the max ibs in a submission. (v2)

2023-04-11 Thread Timur Kristóf
Christian König  ezt írta (időpont: 2023. ápr.
11., Ke 11:23):

> Am 11.04.23 um 11:06 schrieb Timur Kristóf:
>
>
>
> Bas Nieuwenhuizen  ezt írta (időpont: 2023. ápr.
> 11., Ke 10:25):
>
>> On Tue, Apr 11, 2023 at 10:10 AM Christian König
>>  wrote:
>> >
>> > Am 09.04.23 um 20:59 schrieb Bas Nieuwenhuizen:
>> > > We need to introduce a new version of the info struct because
>> > > libdrm_amdgpu forgot any versioning info in amdgpu_query_hw_ip_info
>> > > so the mesa<->libdrm_amdgpu interface can't handle increasing the
>> > > size.
>> > >
>> > > This info would be used by radv to figure out when we need to
>> > > split a submission into multiple submissions. radv currently has
>> > > a limit of 192 which seems to work for most gfx submissions, but
>> > > is way too high for e.g. compute or sdma.
>> >
>> > Why do you need so many IBs in the first place?
>> >
>> > You can use sub-IBs since R600 and newer hw even supports a JUMP command
>> > to chain IBs together IIRC.
>>
>> Couple of reasons:
>>
>> 1) We can't reliably use sub-IBs (both because on GFX6-7 there are
>> indirect draw commands that cannot be used in sub-IBs and because the
>> Vulkan API exposes sub-IBs, so we can have the primary IBs be sub-IBs
>> already)
>>
>
> Furthermore, only the GFX queue supports the "IB2" packet that is used to
> implement sub-IBs.
>
> (The same packet hangs the compute queue and is documented as not working
> in the PAL source code. Other queues don't seem to have a packet for this
> purpose.)
>
> 2) We believed GFX6 may not support chaining. (Then again, it turns
>> out the GFX7+ packet may just work on GFX6?
>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22406)
>>
>
> I was actually quite surprised when I found this out. Mesa developers seem
> to have believed that this is not possible on GFX6. I'd love to get some
> more context on this, did it always work or was it added in a FW update?
>
> 3) Chaining doesn't work if the IB may be in flight multiple times in
>> a different sequence.
>>
>
> Additionally, chaining also doesn't work on any queue other than GFX and
> compute. As far as we know, SDMA and the various encoder/decoder queues
> don't have a packet for chaining.
>
> Christian, please let us know if we are wrong about this.
>
>
> I'm not an expert for this either. Marek and Pierre-eric know more about
> that stuff than me. On the other hand I could ping the firmware people as
> well if our UMD guys can't answer that.
>

Thanks, I'm looking forward to hearing more about it.


> It's just that last time we discussed this internally somebody from the
> PAL team commented that this isn't an issue any more because we don't need
> that many IBs any more.
>

It is true that the typical game would only submit a few IBs: either 2 (1
preamble + 1 IB that has all the command buffers chained) or 8 for gang
submit (4 preambles, 1 ACE IB, 1 GFX IB, 2 postambles), but we would like
to ensure that the less common use cases that can't use chaining work well
too.

We also have an increased interest in using the other HW queues in RADV
these days: video encode/decode queues for the Vulkan Video API, and of
course SDMA which we are considering for transfer queues in the future.


> libdrm defined that you shouldn't use more than 4 IBs in a CS; on the
> other hand, we dropped checking that long ago, and exposing the number of
> IBs the UMD can use is just good design.
>

Yes, I agree.

Can we remove that old (and confusing) define from libdrm? Or does anyone
still depend on that?


> Bas what do you think of adding an AMDGPU_INFO_MAX_IBS query instead of
> extending the drm_amdgpu_info_hw_ip structure?
>
> Background is that the drm_amdgpu_info_hw_ip structure is actually not
> meant to be used for sw parameters (which the maximum number of IBs is) and
> we wouldn't need to dance around issues with query size parameters because
> that function takes the size as parameter.
>

Sounds good to me but I'll defer to Bas's judgement on this.
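
For what it's worth, consuming such a query from our side would be
simple; roughly something like this (a hedged sketch: AMDGPU_INFO_MAX_IBS
is the proposed query id and does not exist in the headers yet, so the
define below is just a placeholder, and 192 is the current hardcoded
RADV limit mentioned earlier):

#include <stdint.h>
#include <amdgpu.h>   /* libdrm_amdgpu: amdgpu_device_handle, amdgpu_query_info() */

#ifndef AMDGPU_INFO_MAX_IBS
#define AMDGPU_INFO_MAX_IBS 0x40   /* placeholder until the real id is assigned */
#endif

static uint32_t query_max_ibs(amdgpu_device_handle dev)
{
        uint32_t max_ibs = 0;

        /* If the kernel doesn't know the query (or reports 0), fall back
         * to the limit we hardcode today. */
        if (amdgpu_query_info(dev, AMDGPU_INFO_MAX_IBS,
                              sizeof(max_ibs), &max_ibs) < 0 || !max_ibs)
                max_ibs = 192;

        return max_ibs;
}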




> Regards,
> Christian.
>
>
>
> Best regards,
> Timur
>
>
>
>
>> We try to chain when we can but (2) and (3) are cases where we
>> fallback to submitting multiple IBs.
>>
>> >
>> > For the kernel UAPI you only need separate IBs if you have separate
>> > properties on them, E.g. preamble or submitting to a different engine.
>> >
>> > Everything else just needs additional

Re: [PATCH 3/3] drm/amdgpu: Add support for querying the max ibs in a submission.

2023-04-10 Thread Timur Kristóf
Christian König  ezt írta (időpont: 2023.
ápr. 9., Vas 17:38):

> Am 09.04.23 um 17:32 schrieb Bas Nieuwenhuizen:
> > On Sun, Apr 9, 2023 at 5:30 PM Christian König
> >  wrote:
> >> Am 09.04.23 um 16:44 schrieb Bas Nieuwenhuizen:
> >>> We need to introduce a new version of the info struct because
> >>> libdrm_amdgpu forgot any versioning info in amdgpu_query_hw_ip_info
> >>> so the mesa<->libdrm_amdgpu interface can't handle increasing the
> >>> size.
> >> Those are not forgotten, but simply unnecessary.
> >>
> >> The size of the input/output structures is given to the IOCTL in bytes
> >> and additional bytes should be filled with zeros.
> > At the ioctl side, yes, but it is libdrm_amdgpu filling in the size,
> > while passing in the struct directly from the client (mesa or
> > whatever). So if we have new libdrm_amdgpu and old mesa, then mesa
> > allocates  N bytes on the stack and libdrm_amdgpu happily tells the
> > kernel in the ioctl "this struct is N+8 bytes long" which would cause
> > corruption?
>
> WTF? This has a wrapper in libdrm? Well then that's indeed horribly broken.
>
> In this case please define the new structure as extension of the old
> one. E.g. something like:
>
> struct drm_amdgpu_info_hw_ip2 {
>  struct drm_amdgpu_info_hw_ip base;
>  
> };
>
> This way we can make it clear that this is an extension.
>


Can we solve this in userspace by letting mesa pass the struct size to
libdrm as well? Or would that create other compatibility issues?
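
Concretely, I am thinking of something like the following: a hedged
sketch of a hypothetical new libdrm_amdgpu entry point (the name
amdgpu_query_hw_ip_info2 and its signature are made up here; only the
drm_amdgpu_info UAPI structure and drmCommandWrite() are existing
interfaces). The caller passes the size of the struct it actually
allocated, so an old Mesa linked against a newer libdrm_amdgpu could
never be overflowed:

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>      /* drmCommandWrite() */
#include <amdgpu.h>       /* amdgpu_device_handle, amdgpu_device_get_fd() */
#include <amdgpu_drm.h>   /* struct drm_amdgpu_info, DRM_AMDGPU_INFO */

int amdgpu_query_hw_ip_info2(amdgpu_device_handle dev, unsigned type,
                             unsigned ip_instance, unsigned size,
                             struct drm_amdgpu_info_hw_ip *info)
{
        struct drm_amdgpu_info request;

        memset(&request, 0, sizeof(request));
        request.return_pointer = (uintptr_t)info;
        request.return_size = size;   /* the caller's size, not sizeof(*info) */
        request.query = AMDGPU_INFO_HW_IP_INFO;
        request.query_hw_ip.type = type;
        request.query_hw_ip.ip_instance = ip_instance;

        return drmCommandWrite(amdgpu_device_get_fd(dev), DRM_AMDGPU_INFO,
                               &request, sizeof(request));
}

The existing entry point could then simply call this with
sizeof(struct drm_amdgpu_info_hw_ip) to keep old users working.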




> Thanks,
> Christian.
>
> >
> > - Bas
> >
> >> So you should be able to extend the structures at the end without
> >> breaking anything.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> This info would be used by radv to figure out when we need to
> >>> split a submission into multiple submissions. radv currently has
> >>> a limit of 192 which seems to work for most gfx submissions, but
> >>> is way too high for e.g. compute or sdma.
> >>>
> >>> Because the kernel handling is just fine we can just use the v2
> >>> everywhere and have the return_size do the versioning for us,
> >>> with userspace interpreting 0 as unknown.
> >>>
> >>> Userspace implementation:
> >>> https://gitlab.freedesktop.org/bnieuwenhuizen/drm/-/tree/ib-rejection
> >>> https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/tree/ib-rejection
> >>>
> >>> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2498
> >>> Signed-off-by: Bas Nieuwenhuizen 
> >>> Cc: David Airlie 
> >>> ---
> >>>drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c |  7 --
> >>>include/uapi/drm/amdgpu_drm.h   | 29
> +
> >>>2 files changed, 34 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >>> index 89689b940493..c7e815c2733e 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >>> @@ -360,7 +360,7 @@ static int amdgpu_firmware_info(struct
> drm_amdgpu_info_firmware *fw_info,
> >>>
> >>>static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
> >>> struct drm_amdgpu_info *info,
> >>> -  struct drm_amdgpu_info_hw_ip *result)
> >>> +  struct drm_amdgpu_info_hw_ip2 *result)
> >>>{
> >>>uint32_t ib_start_alignment = 0;
> >>>uint32_t ib_size_alignment = 0;
> >>> @@ -431,6 +431,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device
> *adev,
> >>>return -EINVAL;
> >>>}
> >>>
> >>> + result->max_ibs = UINT_MAX;
> >>>for (i = 0; i < adev->num_rings; ++i) {
> >>>/* Note that this uses that ring types alias the
> equivalent
> >>> * HW IP exposes to userspace.
> >>> @@ -438,6 +439,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device
> *adev,
> >>>if (adev->rings[i]->funcs->type ==
> info->query_hw_ip.type &&
> >>>adev->rings[i]->sched.ready) {
> >>>++num_rings;
> >>> + result->max_ibs = min(result->max_ibs,
> >>> +   adev->rings[i]->max_ibs);
> >>>}
> >>>}
> >>>
> >>> @@ -536,7 +539,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> *data, struct drm_file *filp)
> >>>}
> >>>return copy_to_user(out, &ui32, min(size, 4u)) ?
> -EFAULT : 0;
> >>>case AMDGPU_INFO_HW_IP_INFO: {
> >>> - struct drm_amdgpu_info_hw_ip ip = {};
> >>> + struct drm_amdgpu_info_hw_ip2 ip = {};
> >>>int ret;
> >>>
> >>>ret = amdgpu_hw_ip_info(adev, info, &ip);
> >>> diff --git a/include/uapi/drm/amdgpu_drm.h
> b/include/uapi/drm/amdgpu_drm.h
> >>> index b6eb90df5d05..45e5ae546d19 100644
> >>> --- a/include/uapi/drm/amdgpu_drm.h
> >>> +++ b/include/uapi/drm/amdgpu_drm.h
> >>> @@ -1128,6 +1128,9 @@ struct drm_amdgpu_info_device {
> >>>

Re: [PATCH] drm/amdgpu: handle gang submit before VMID

2022-11-21 Thread Timur Kristóf
The patch did not make it into rc6 after all.
Can you please make sure it goes into rc7?

Thanks,
Timur

On Fri, 2022-11-18 at 17:53 +0100, Christian König wrote:
> Pushed to drm-misc-fixes, should be picked up for the next rc.
> 
> Let me know if you run into any more problems with that.
> 
> Thanks,
> Christian.
> 
> Am 18.11.22 um 16:36 schrieb Timur Kristóf:
> > Can you guys please push this into the next 6.1 RC? This solves a
> > significant issue with gang submit.
> > 
> > On Fri, 2022-11-18 at 16:30 +0100, Christian König wrote:
> > > Otherwise it can happen that not all gang members can get a VMID
> > > assigned and we deadlock.
> > > 
> > > Signed-off-by: Christian König 
> > > Tested-by: Timur Kristóf 
> > > Acked-by: Timur Kristóf 
> > > Fixes: 68ce8b242242 ("drm/amdgpu: add gang submit backend v2")
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 +++---
> > >   1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > index cd968e781077..abb99cff8b4b 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > @@ -254,6 +254,9 @@ static struct dma_fence
> > > *amdgpu_job_dependency(struct drm_sched_job *sched_job,
> > >  DRM_ERROR("Error adding fence (%d)\n",
> > > r);
> > >  }
> > >   
> > > +   if (!fence && job->gang_submit)
> > > +   fence = amdgpu_device_switch_gang(ring->adev,
> > > job-
> > > > gang_submit);
> > > +
> > >  while (fence == NULL && vm && !job->vmid) {
> > >  r = amdgpu_vmid_grab(vm, ring, &job->sync,
> > >   &job->base.s_fence-
> > > >finished,
> > > @@ -264,9 +267,6 @@ static struct dma_fence
> > > *amdgpu_job_dependency(struct drm_sched_job *sched_job,
> > >  fence = amdgpu_sync_get_fence(&job->sync);
> > >  }
> > >   
> > > -   if (!fence && job->gang_submit)
> > > -   fence = amdgpu_device_switch_gang(ring->adev,
> > > job-
> > > > gang_submit);
> > > -
> > >  return fence;
> > >   }
> > >   
> 



Re: [PATCH] drm/amdgpu: handle gang submit before VMID

2022-11-18 Thread Timur Kristóf
Can you guys please push this into the next 6.1 RC? This solves a
significant issue with gang submit.

On Fri, 2022-11-18 at 16:30 +0100, Christian König wrote:
> Otherwise it can happen that not all gang members can get a VMID
> assigned and we deadlock.
> 
> Signed-off-by: Christian König 
> Tested-by: Timur Kristóf 
> Acked-by: Timur Kristóf 
> Fixes: 68ce8b242242 ("drm/amdgpu: add gang submit backend v2")
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index cd968e781077..abb99cff8b4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -254,6 +254,9 @@ static struct dma_fence
> *amdgpu_job_dependency(struct drm_sched_job *sched_job,
> DRM_ERROR("Error adding fence (%d)\n", r);
> }
>  
> +   if (!fence && job->gang_submit)
> +   fence = amdgpu_device_switch_gang(ring->adev, job-
> >gang_submit);
> +
> while (fence == NULL && vm && !job->vmid) {
> r = amdgpu_vmid_grab(vm, ring, &job->sync,
>  &job->base.s_fence->finished,
> @@ -264,9 +267,6 @@ static struct dma_fence
> *amdgpu_job_dependency(struct drm_sched_job *sched_job,
> fence = amdgpu_sync_get_fence(&job->sync);
> }
>  
> -   if (!fence && job->gang_submit)
> -   fence = amdgpu_device_switch_gang(ring->adev, job-
> >gang_submit);
> -
> return fence;
>  }
>  



Re: [PATCH] drm/amdgpu: use the last IB as gang leader v2

2022-11-15 Thread Timur Kristóf
I can confirm this patch solves an issue with gang submit.
It removes the necessity to sort IBs by IP type in user space.
Now only the IP type of the last IB matters, as was intended.

Tested-by: Timur Kristóf 
Acked-by: Timur Kristóf 

On Tue, 2022-11-15 at 10:43 +0100, Christian König wrote:
> Am 15.11.22 um 10:42 schrieb Christian König:
> > It turned out that the gang leader is not the last IB specified,
> > but instead the last job allocated.
> > 
> > This is a bit unfortunate and not very intuitive for the CS
> > interface, so try to fix this.
> 
> Alex could you take a look at this? I would really like to get this
> into 
> the next -rc.
> 
> Thanks,
> Christian.
> 
> > 
> > Signed-off-by: Christian König 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 23 -
> > --
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h |  1 +
> >   2 files changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index 1bbd39b3b0fc..fbdf139cf497 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -109,6 +109,7 @@ static int amdgpu_cs_p1_ib(struct
> > amdgpu_cs_parser *p,
> > return r;
> >   
> > ++(num_ibs[r]);
> > +   p->gang_leader_idx = r;
> > return 0;
> >   }
> >   
> > @@ -300,7 +301,7 @@ static int amdgpu_cs_pass1(struct
> > amdgpu_cs_parser *p,
> > if (ret)
> > goto free_all_kdata;
> > }
> > -   p->gang_leader = p->jobs[p->gang_size - 1];
> > +   p->gang_leader = p->jobs[p->gang_leader_idx];
> >   
> > if (p->ctx->vram_lost_counter != p->gang_leader-
> > >vram_lost_counter) {
> > ret = -ECANCELED;
> > @@ -1194,16 +1195,18 @@ static int amdgpu_cs_sync_rings(struct
> > amdgpu_cs_parser *p)
> > return r;
> > }
> >   
> > -   for (i = 0; i < p->gang_size - 1; ++i) {
> > +   for (i = 0; i < p->gang_size; ++i) {
> > +   if (p->jobs[i] == leader)
> > +   continue;
> > +
> > r = amdgpu_sync_clone(&leader->sync, &p->jobs[i]-
> > >sync);
> > if (r)
> > return r;
> > }
> >   
> > -   r = amdgpu_ctx_wait_prev_fence(p->ctx, p->entities[p-
> > >gang_size - 1]);
> > +   r = amdgpu_ctx_wait_prev_fence(p->ctx, p->entities[p-
> > >gang_leader_idx]);
> > if (r && r != -ERESTARTSYS)
> > DRM_ERROR("amdgpu_ctx_wait_prev_fence failed.\n");
> > -
> > return r;
> >   }
> >   
> > @@ -1237,9 +1240,12 @@ static int amdgpu_cs_submit(struct
> > amdgpu_cs_parser *p,
> > for (i = 0; i < p->gang_size; ++i)
> > drm_sched_job_arm(&p->jobs[i]->base);
> >   
> > -   for (i = 0; i < (p->gang_size - 1); ++i) {
> > +   for (i = 0; i < p->gang_size; ++i) {
> > struct dma_fence *fence;
> >   
> > +   if (p->jobs[i] == leader)
> > +   continue;
> > +
> > fence = &p->jobs[i]->base.s_fence->scheduled;
> > r = amdgpu_sync_fence(&leader->sync, fence);
> > if (r)
> > @@ -1275,7 +1281,10 @@ static int amdgpu_cs_submit(struct
> > amdgpu_cs_parser *p,
> > list_for_each_entry(e, &p->validated, tv.head) {
> >   
> > /* Everybody except for the gang leader uses READ
> > */
> > -   for (i = 0; i < (p->gang_size - 1); ++i) {
> > +   for (i = 0; i < p->gang_size; ++i) {
> > +   if (p->jobs[i] == leader)
> > +   continue;
> > +
> > dma_resv_add_fence(e->tv.bo->base.resv,
> >    &p->jobs[i]-
> > >base.s_fence->finished,
> >    DMA_RESV_USAGE_READ);
> > @@ -1285,7 +1294,7 @@ static int amdgpu_cs_submit(struct
> > amdgpu_cs_parser *p,
> > e->tv.num_shared = 0;
> > }
> >   
> > -   seq = amdgpu_ctx_add_fence(p->ctx, p->entities[p->gang_size
> > - 1],
> > +   seq = amdgpu_ctx_add_fence(p->ctx, p->entities[p-
> > >gang_leader_idx],
> >    p->fence);
> > amdgpu_cs_post_dependencies(p);
> >   
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> > index cbaa19b2b8a3..f80adf9069ec 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h
> > @@ -54,6 +54,7 @@ struct amdgpu_cs_parser {
> >   
> > /* scheduler job objects */
> > unsigned intgang_size;
> > +   unsigned intgang_leader_idx;
> > struct drm_sched_entity *entities[AMDGPU_CS_GANG_SIZE];
> > struct amdgpu_job   *jobs[AMDGPU_CS_GANG_SIZE];
> > struct amdgpu_job   *gang_leader;
> 



Re: [Mesa-dev] [Intel-gfx] gitlab.fd.o financial situation and impact on services

2020-03-02 Thread Timur Kristóf
On Sat, 2020-02-29 at 14:46 -0500, Nicolas Dufresne wrote:
> > 
> > 1. I think we should completely disable running the CI on MRs which
> > are
> > marked WIP. Speaking from personal experience, I usually make a lot
> > of
> > changes to my MRs before they are merged, so it is a waste of CI
> > resources.
> 
> In the mean time, you can help by taking the habit to use:
> 
>   git push -o ci.skip

Thanks for the advice; I wasn't aware such an option exists. Does this
also work on the Mesa GitLab, or is this a GStreamer-only thing?

How hard would it be to make this the default?

> That's a much more difficult goal than it looks like. Let each project
> manage its CI graph and content, as each case is unique. Running more
> tests, or building more code, isn't the main issue as the CPU time is
> mostly sponsored. The data transfers between the GitLab cloud and the
> runners (which are external), along with sending OS images to LAVA labs,
> are likely the most expensive part.
> 
> As it was already mentioned in the thread, what we are missing now, and
> being worked on, is per group/project statistics that give us the
> hotspot so we can better target the optimization work.

Yes, it would be nice to know what the hotspot is, indeed.

As far as I understand, the problem is not CI itself, but the bandwidth
needed by the build artifacts, right? Would it be possible to not host
the build artifacts on GitLab, but rather only at the place where the
build actually happened? Or at least, to only transfer the build artifacts
on demand?

I'm not exactly familiar with how the system works, so sorry if this is
a silly question.



Re: [Mesa-dev] [Intel-gfx] gitlab.fd.o financial situation and impact on services

2020-03-02 Thread Timur Kristóf
On Fri, 2020-02-28 at 10:43 +, Daniel Stone wrote:
> On Fri, 28 Feb 2020 at 10:06, Erik Faye-Lund
>  wrote:
> > On Fri, 2020-02-28 at 11:40 +0200, Lionel Landwerlin wrote:
> > > Yeah, changes on vulkan drivers or backend compilers should be
> > > fairly
> > > sandboxed.
> > > 
> > > We also have tools that only work for intel stuff, that should
> > > never
> > > trigger anything on other people's HW.
> > > 
> > > Could something be worked out using the tags?
> > 
> > I think so! We have the pre-defined environment variable
> > CI_MERGE_REQUEST_LABELS, and we can do variable conditions:
> > 
> > https://docs.gitlab.com/ee/ci/yaml/#onlyvariablesexceptvariables
> > 
> > That sounds like a pretty neat middle-ground to me. I just hope
> > that
> > new pipelines are triggered if new labels are added, because not
> > everyone is allowed to set labels, and sometimes people forget...
> 
> There's also this which is somewhat more robust:
> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2569

My 20 cents:

1. I think we should completely disable running the CI on MRs which are
marked WIP. Speaking from personal experience, I usually make a lot of
changes to my MRs before they are merged, so it is a waste of CI
resources.

2. Maybe we could take this one step further and only allow the CI to
be triggered manually instead of automatically on every push.

3. I completely agree with Pierre-Eric on MR 2569: let's not run the
full CI pipeline on every change, only those parts which are affected
by the change. It not only costs money, but is also frustrating when
you submit a change and get unrelated failures from a completely
unrelated driver.

Best regards,
Timur
