Re: AMDGPU: Floating Point traps in Display Core code

Jeff Frasca Sat, 25 Feb 2023 23:33:24 -0800

I'm replying on my phone.  My apologies if it does dumb things to the
formatting.


On Sat, Feb 25, 2023, 06:18 Taylor R Campbell <
campbell+netbsd-tech-k...@mumble.net> wrote:

> > Date: Fri, 24 Feb 2023 23:21:35 -0800
> > From: Jeff Frasca <that...@jeff-frasca.name>
> >
> > Ok, first off, the FP code I've run into is in the Display
> > Core code, specifically in:
> >   sys/external/bsd/drm2/dist/drm/amd/display/dc/calcs/amdgpu_dcn_calcs.c
> > It's all SIMD code operating on xmmN registers.  To get to
> > this codepath, I needed to have CONFIG_DRM_AMD_DC set during
> > compilation.  I've attached a diff that adds this to files.amdgpu.
> >
> > A typical backtrace printed out by ddb is:
> > breakpoint()
> > vpanic()
> > panic()
> > fputrap()
> > Xtrap16()
> > dcn10_create_resource_pool()
> > [...]
> >
> > (I had to type it manually from a picture snapped on my
> > phone, so, no offsets, if any of those are of interest,
> > let me know.)
>
> Probably not, except perhaps the one in dcn10_create_resource_pool to
> confirm that it is where you think it is, in dcn_bw_update_from_pplib.
>
> > There's a missing call from the backtrace that (I think)
> > gets eaten by the trap jump: dcn_bw_update_from_pplib().
> > (It's in amdgpu_dcn_calcs.c)
>
> That's via dcn10_resource_construct, I assume?  (which is a single-use
> static that presumably gets compiled away)
>

The missing call is dcn_bw_update_from_pplib(), and it's definitely where
the exception is thrown.  I confirmed this with debug printfs in my local
build.

> [...]
> The logic looks like this:
> [...]
> So the return address of fputrap will always live in the Xtrap16
> symbol, not the Xtrap19 one, even if it gets there by trap 19.
>

Neat, and thanks.

> [...]
>
> This looks like a mistake on my part.  It's possible that we never
> noticed with the crypto code because it largely doesn't deal in
> floating-point exceptions, and that's all that we've use the FP/SIMD
> unit for in the kernel so far.
>
> But we should have set the MXCSR (and FPSW/FPCW, if that matters) to a
> reliable state.  And we need to do that anyway for crypto on CPUs with
> the MCDT bug (https://gnats.netbsd.org/57230).
>
> Since we're definitely not in a position to handle floating-point
> exception traps in the kernel, I just committed a change to set MXCSR
> to 0x1fbf (all exception status bits set, denormals-are-zero disabled,
> all exception trap mask bits set, round-to-nearest/ties-to-even,
> flush-to-zero disabled).
>

Cool.  Because this looked like the right way to do it.

Does that change anything?
>
> > Unless I add another function call to DC_FP_START() that
> > masks all the non-fatal FP traps in MXCSR.  I tried
> > setting it to 0x00001d00 and 0x00009d40.  The former just
> > masks the non-fatal traps and the latter tries to set "do
> > sane things with edge cases" flags.  (If I try to mask
> > MXCSR in fpu_kern_enter(), then some of the crypto code
> > breaks.)
>
> Can you be more specific about the crypto code breaking?  Does it
> still break with the change I just committed?  (I verified all the
> self-tests at boot run under qemu before committing, of course, but
> it's possible I broke something on real hardware.)
>
> https://mail-index.netbsd.org/source-changes/2023/02/25/msg143547.html


I will try this as soon as I can.  Probably tomorrow evening.  I think I
may have gotten the mask wrong when modifying fpu_kern_enter() the first
time, so there's a good chance this will work.  (I switched to the separate
function call just to isolate my changes to the AMDGPU driver and figured
we'd have this discussion at this time :-) Hopefully I don't reproduce the
crash I saw in the crypto code.

Re: AMDGPU: Floating Point traps in Display Core code

Reply via email to