WRITE_DATA with ENGINE=PFP executes the packet on the frontend engine, while ENGINE=ME executes it on the backend engine.
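As a concrete illustration (a minimal sketch only: the opcode, field positions, and helper below are my own approximation of the PM4 WRITE_DATA layout as it appears in Mesa's sid.h, not authoritative definitions), emitting such a packet from a userspace driver could look roughly like this:

```c
#include <stdint.h>
#include <stdio.h>

/* Approximate PM4 WRITE_DATA encoding (type-3 packet, opcode 0x37).
 * Bit positions follow my reading of Mesa's sid.h and may not be exact
 * for every gfx generation. */
#define PKT3(op, count)   ((3u << 30) | ((uint32_t)(count) << 16) | ((uint32_t)(op) << 8))
#define PKT3_WRITE_DATA   0x37
#define WD_DST_SEL_MEM    (5u << 8)    /* destination: memory */
#define WD_WR_CONFIRM     (1u << 20)   /* wait for the write to complete */
#define WD_ENGINE_ME      (0u << 30)   /* execute on the backend engine */
#define WD_ENGINE_PFP     (1u << 30)   /* execute on the frontend engine */

/* Append "write one dword 'value' to GPU VA 'va'" to an IB, executed on
 * the engine selected by 'engine_sel' (WD_ENGINE_PFP or WD_ENGINE_ME). */
static unsigned emit_write_data(uint32_t *ib, unsigned ndw,
                                uint32_t engine_sel, uint64_t va, uint32_t value)
{
    ib[ndw++] = PKT3(PKT3_WRITE_DATA, 3);  /* 5 dwords total, count = 3 */
    ib[ndw++] = WD_DST_SEL_MEM | WD_WR_CONFIRM | engine_sel;
    ib[ndw++] = (uint32_t)va;              /* address low */
    ib[ndw++] = (uint32_t)(va >> 32);      /* address high */
    ib[ndw++] = value;                     /* payload */
    return ndw;
}

int main(void)
{
    uint32_t ib[16];
    unsigned ndw = emit_write_data(ib, 0, WD_ENGINE_PFP,
                                   0x100000000ull, 0xdeadbeef);

    for (unsigned i = 0; i < ndw; i++)
        printf("0x%08x\n", ib[i]);
    return 0;
}
```

The only difference between the two cases is the ENGINE_SEL field in the control dword; because the PFP runs ahead of the ME, the same write can land at very different points relative to the rest of the command stream.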
Marek

On Wed, May 3, 2023 at 1:08 PM Marek Olšák <mar...@gmail.com> wrote:
> GPU hangs are pretty common post-bringup. They are not common per user, but if we gather all hangs from all users, we can have lots and lots of them.
>
> GPU hangs are indeed not very debuggable. There are however some things we can do:
> - Identify the hanging IB by its VA (the kernel should know it)
> - Read and parse the IB to detect memory corruption.
> - Print active waves with shader disassembly if SQ isn't hung (often it's not).
>
> Determining which packet the CP is stuck on is tricky. The CP has 2 engines (one frontend and one backend) that work on the same command buffer. The frontend engine runs ahead, executes some packets and forwards others to the backend engine. Only the frontend engine has the command buffer VA somewhere. The backend engine only receives packets from the frontend engine via a FIFO, so it might not be possible to tell where it's stuck if it's stuck.
>
> When the gfx pipeline hangs outside of shaders, making a scandump seems to be the only way to have a chance at finding out what's going wrong, and only AMD-internal versions of hw can be scanned.
>
> Marek
>
> On Wed, May 3, 2023 at 11:23 AM Christian König <ckoenig.leichtzumer...@gmail.com> wrote:
>> On 03.05.23 at 17:08, Felix Kuehling wrote:
>> > On 2023-05-03 at 03:59, Christian König wrote:
>> >> On 02.05.23 at 20:41, Alex Deucher wrote:
>> >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf <timur.kris...@gmail.com> wrote:
>> >>>> [SNIP]
>> >>>>>>>> In my opinion, the correct solution to those problems would be if the kernel could give userspace the necessary information about a GPU hang before a GPU reset.
>> >>>>>>>>
>> >>>>>>> The fundamental problem here is that the kernel doesn't have that information either. We know which IB timed out and can potentially do a devcoredump when that happens, but that's it.
>> >>>>>>
>> >>>>>> Is it really not possible to know such a fundamental thing as what the GPU was doing when it hung? How are we supposed to do any kind of debugging without knowing that?
>> >>
>> >> Yes, that's indeed something at least I have been trying to figure out for years as well.
>> >>
>> >> Basically there are two major problems:
>> >> 1. When the ASIC is hung you can't talk to the firmware engines any more, and most state is not exposed directly, but only through some fw/hw interface. Just take a look at how umr reads the shader state from the SQ. When that block is hung you can't do that any more and basically have no chance at all to figure out why it's hung.
>> >>
>> >> The same goes for other engines. I remember once spending a week figuring out why the UVD block was hung during suspend. It turned out to be a debugging nightmare because any time you touched any register of that block the whole system would hang.
>> >>
>> >> 2. There are tons of things going on in a pipeline fashion or even completely in parallel. For example, the CP is just the beginning of a rather long pipeline which at the end produces a bunch of pixels. In almost all cases I've seen, you run into a problem somewhere deep in the pipeline and only very rarely at the beginning.
>> >>>>>>
>> >>>>>> I wonder what AMD's Windows driver team is doing with this problem, surely they must have better tools to deal with GPU hangs?
>> >>>>>
>> >>>>> For better or worse, most teams internally rely on scan dumps via JTAG, which sort of limits the usefulness outside of AMD, but also gives you the exact state of the hardware when it's hung, so the hardware teams prefer it.
>> >>>>
>> >>>> How does this approach scale? It's not something we can ask users to do, and even if all of us in the radv team had a JTAG device, we wouldn't be able to play every game that users experience random hangs with.
>> >>>
>> >>> It doesn't scale or lend itself particularly well to external development, but that's the current state of affairs.
>> >>
>> >> The usual approach seems to be to reproduce a problem in a lab and have a JTAG attached to give the hw guys a scan dump, and they can then tell you why something didn't work as expected.
>> >
>> > That's the worst-case scenario where you're debugging HW or FW issues. Those should be pretty rare post-bringup. But are there hangs caused by user mode driver or application bugs that are easier to debug and probably don't even require a GPU reset? For example, most VM faults can be handled without hanging the GPU. Similarly, a shader in an endless loop should not require a full GPU reset. In the KFD compute case, that's still preemptible and the offending process can be killed with Ctrl-C or debugged with rocm-gdb.
>>
>> We also have an abort for infinite loops in shaders on gfx, and page faults are pretty rare with OpenGL (a bit more often with Vulkan) and can be handled gracefully on modern hw (they just spam the logs).
>>
>> The majority of the problems are unfortunately that we really get hard hangs because of some hw issue. That can be caused by unlucky timing, power management, or doing things in an order the hw doesn't expect.
>>
>> Regards,
>> Christian.
>>
>> >
>> > It's more complicated for graphics because of the more complex pipeline and the lack of CWSR. But it should still be possible to do some debugging without JTAG if the problem is in SW and not HW or FW. It's probably worth improving that debuggability without getting hung up on the worst case.
>> >
>> > Maybe user mode graphics queues will offer a better way of recovering from these kinds of bugs, if the graphics pipeline can be unstuck without a GPU reset, just by killing the offending user mode queue.
>> >
>> > Regards,
>> > Felix
>> >
>> >>
>> >> And yes, that absolutely doesn't scale.
>> >>
>> >> Christian.
>> >>
>> >>>
>> >>> Alex
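As a rough sketch of the "read and parse the IB to detect memory corruption" idea above, a minimal PM4 framing check could look like the following. It relies only on the generic type-0/2/3 packet header layout (type in bits 31:30, count in bits 29:16) and is not code from umr or the kernel, so the helper name and details are assumptions:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Walk a PM4 indirect buffer and flag obviously corrupt packet headers.
 * This only validates packet framing (type and length fields); it does
 * not interpret individual opcodes. */
static bool validate_ib(const uint32_t *ib, unsigned num_dw)
{
    unsigned pos = 0;

    while (pos < num_dw) {
        uint32_t header = ib[pos];
        unsigned type = header >> 30;
        unsigned count = (header >> 16) & 0x3fff; /* body dwords - 1 */
        unsigned len;

        switch (type) {
        case 0: /* type-0: register writes starting at header[15:0] */
        case 3: /* type-3: opcode in header[15:8] */
            len = count + 2; /* header + body */
            break;
        case 2: /* type-2: 1-dword filler/NOP */
            len = 1;
            break;
        default: /* type-1 is unused on modern ASICs -> likely corruption */
            fprintf(stderr, "bad packet type %u at dw %u (0x%08x)\n",
                    type, pos, header);
            return false;
        }

        if (pos + len > num_dw) {
            fprintf(stderr, "packet at dw %u overruns IB (len %u)\n",
                    pos, len);
            return false;
        }
        pos += len;
    }
    return true;
}

int main(void)
{
    /* A tiny example IB: one type-2 filler dword followed by garbage,
     * which the framing check reports as an overrun. */
    uint32_t ib[] = { 0x80000000u, 0x12345678u };
    return validate_ib(ib, 2) ? 0 : 1;
}
```

Something like umr's ring/IB decoder goes much further and decodes individual opcodes; the point here is only that even a framing check this small can already catch a stomped or truncated IB before anyone has to reach for JTAG.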