Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

Marek Olšák Wed, 03 May 2023 10:15:57 -0700

WRITE_DATA with ENGINE=PFP will execute the packet on the frontend engine,
while ENGINE=ME will execute the packet on the backend engine.


Marek

On Wed, May 3, 2023 at 1:08 PM Marek Olšák <mar...@gmail.com> wrote:

> GPU hangs are pretty common post-bringup. They are not common per user,
> but if we gather all hangs from all users, we can have lots and lots of
> them.
>
> GPU hangs are indeed not very debuggable. There are however some things we
> can do:
> - Identify the hanging IB by its VA (the kernel should know it)
> - Read and parse the IB to detect memory corruption.
> - Print active waves with shader disassembly if SQ isn't hung (often it's
> not).
>
> Determining which packet the CP is stuck on is tricky. The CP has 2
> engines (one frontend and one backend) that work on the same command
> buffer. The frontend engine runs ahead, executes some packets and forwards
> others to the backend engine. Only the frontend engine has the command
> buffer VA somewhere. The backend engine only receives packets from the
> frontend engine via a FIFO, so it might not be possible to tell where it's
> stuck if it's stuck.
>
> When the gfx pipeline hangs outside of shaders, making a scandump seems to
> be the only way to have a chance at finding out what's going wrong, and
> only AMD-internal versions of hw can be scanned.
>
> Marek
>
> On Wed, May 3, 2023 at 11:23 AM Christian König <
> ckoenig.leichtzumer...@gmail.com> wrote:
>
>> On 03.05.23 at 17:08, Felix Kuehling wrote:
>> > On 2023-05-03 at 03:59, Christian König wrote:
>> >> On 02.05.23 at 20:41, Alex Deucher wrote:
>> >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
>> >>> <timur.kris...@gmail.com> wrote:
>> >>>> [SNIP]
>> >>>>>>>> In my opinion, the correct solution to those problems would be if
>> >>>>>>>> the kernel could give userspace the necessary information about a
>> >>>>>>>> GPU hang before a GPU reset.
>> >>>>>>>>
>> >>>>>>> The fundamental problem here is that the kernel doesn't have that
>> >>>>>>> information either. We know which IB timed out and can potentially do
>> >>>>>>> a devcoredump when that happens, but that's it.
>> >>>>>>
>> >>>>>> Is it really not possible to know such a fundamental thing as what the
>> >>>>>> GPU was doing when it hung? How are we supposed to do any kind of
>> >>>>>> debugging without knowing that?
>> >>
>> >> Yes, that's indeed something I have been trying to figure out for years
>> >> as well.
>> >>
>> >> Basically there are two major problems:
>> >> 1. When the ASIC is hung you can't talk to the firmware engines any
>> >> more and most state is not exposed directly, but just through some
>> >> fw/hw interface.
>> >>     Just take a look at how umr reads the shader state from the SQ.
>> >> When that block is hung you can't do that any more and basically have
>> >> no chance at all to figure out why it's hung.
>> >>
>> >>     Same for other engines, I remember once spending a week figuring
>> >> out why the UVD block is hung during suspend. Turned out to be a
>> >> debugging nightmare because any time you touch any register of that
>> >> block the whole system would hang.
>> >>
>> >> 2. There are tons of things going on in a pipeline fashion or even
>> >> completely in parallel. For example the CP is just the beginning of a
>> >> rather long pipeline which at the end produces a bunch of pixels.
>> >>     In almost all cases I've seen, you run into a problem somewhere
>> >> deep in the pipeline and only very rarely at the beginning.
>> >>
>> >>>>>>
>> >>>>>> I wonder what AMD's Windows driver team is doing with this problem,
>> >>>>>> surely they must have better tools to deal with GPU hangs?
>> >>>>> For better or worse, most teams internally rely on scan dumps via JTAG
>> >>>>> which sort of limits the usefulness outside of AMD, but also gives you
>> >>>>> the exact state of the hardware when it's hung so the hardware teams
>> >>>>> prefer it.
>> >>>>>
>> >>>> How does this approach scale? It's not something we can ask users to
>> >>>> do, and even if all of us in the radv team had a JTAG device, we
>> >>>> wouldn't be able to play every game that users experience random
>> >>>> hangs with.
>> >>> It doesn't scale or lend itself particularly well to external
>> >>> development, but that's the current state of affairs.
>> >>
>> >> The usual approach seems to be to reproduce a problem in a lab and
>> >> have a JTAG attached to give the hw guys a scan dump and they can
>> >> then tell you why something didn't work as expected.
>> >
>> > That's the worst-case scenario where you're debugging HW or FW issues.
>> > Those should be pretty rare post-bringup. But are there hangs caused
>> > by user mode driver or application bugs that are easier to debug and
>> > probably don't even require a GPU reset? For example most VM faults
>> > can be handled without hanging the GPU. Similarly, a shader in an
>> > endless loop should not require a full GPU reset. In the KFD compute
>> > case, that's still preemptible and the offending process can be killed
>> > with Ctrl-C or debugged with rocm-gdb.
>>
>> We also have an abort for infinite loops in shaders on gfx, and page faults
>> are pretty rare with OpenGL (a bit more common with Vulkan) and can be
>> handled gracefully on modern hw (they just spam the logs).
>>
>> Unfortunately, the majority of the problems are real hard hangs caused by
>> some hw issue. These can be triggered by unlucky timing, power management,
>> or doing things in an order the hw doesn't expect.
>>
>> Regards,
>> Christian.
>>
>> >
>> > It's more complicated for graphics because of the more complex
>> > pipeline and the lack of CWSR. But it should still be possible to do
>> > some debugging without JTAG if the problem is in SW and not HW or FW.
>> > It's probably worth improving that debuggability without getting
>> > hung up on the worst case.
>> >
>> > Maybe user mode graphics queues will offer a better way of recovering
>> > from these kinds of bugs, if the graphics pipeline can be unstuck
>> > without a GPU reset, just by killing the offending user mode queue.
>> >
>> > Regards,
>> >   Felix
>> >
>> >
>> >>
>> >> And yes that absolutely doesn't scale.
>> >>
>> >> Christian.
>> >>
>> >>>
>> >>> Alex
>> >>
>>
>>
