I'm replying on my phone. My apologies if it does dumb things to the formatting.
On Sat, Feb 25, 2023, 06:18 Taylor R Campbell < campbell+netbsd-tech-k...@mumble.net> wrote: > > Date: Fri, 24 Feb 2023 23:21:35 -0800 > > From: Jeff Frasca <that...@jeff-frasca.name> > > > > Ok, first off, the FP code I've run into is in the Display > > Core code, specifically in: > > sys/external/bsd/drm2/dist/drm/amd/display/dc/calcs/amdgpu_dcn_calcs.c > > It's all SIMD code operating on xmmN registers. To get to > > this codepath, I needed to have CONFIG_DRM_AMD_DC set during > > compilation. I've attached a diff that adds this to files.amdgpu. > > > > A typical backtrace printed out by ddb is: > > breakpoint() > > vpanic() > > panic() > > fputrap() > > Xtrap16() > > dcn10_create_resource_pool() > > [...] > > > > (I had to type it manually from a picture snapped on my > > phone, so, no offsets, if any of those are of interest, > > let me know.) > > Probably not, except perhaps the one in dcn10_create_resource_pool to > confirm that it is where you think it is, in dcn_bw_update_from_pplib. > > > There's a missing call from the backtrace that (I think) > > gets eaten by the trap jump: dcn_bw_update_from_pplib(). > > (It's in amdgpu_dcn_calcs.c) > > That's via dcn10_resource_construct, I assume? (which is a single-use > static that presumably gets compiled away) > The missing call is dcn_bw_update_from_pplib(), and it's definitely where the exception is thrown. I confirmed this with debug printfs in my local build. > [...] > The logic looks like this: > [...] > So the return address of fputrap will always live in the Xtrap16 > symbol, not the Xtrap19 one, even if it gets there by trap 19. > Neat, and thanks. > [...] > > This looks like a mistake on my part. It's possible that we never > noticed with the crypto code because it largely doesn't deal in > floating-point exceptions, and that's all that we've use the FP/SIMD > unit for in the kernel so far. > > But we should have set the MXCSR (and FPSW/FPCW, if that matters) to a > reliable state. And we need to do that anyway for crypto on CPUs with > the MCDT bug (https://gnats.netbsd.org/57230). > > Since we're definitely not in a position to handle floating-point > exception traps in the kernel, I just committed a change to set MXCSR > to 0x1fbf (all exception status bits set, denormals-are-zero disabled, > all exception trap mask bits set, round-to-nearest/ties-to-even, > flush-to-zero disabled). > Cool. Because this looked like the right way to do it. Does that change anything? > > > Unless I add another function call to DC_FP_START() that > > masks all the non-fatal FP traps in MXCSR. I tried > > setting it to 0x00001d00 and 0x00009d40. The former just > > masks the non-fatal traps and the latter tries to set "do > > sane things with edge cases" flags. (If I try to mask > > MXCSR in fpu_kern_enter(), then some of the crypto code > > breaks.) > > Can you be more specific about the crypto code breaking? Does it > still break with the change I just committed? (I verified all the > self-tests at boot run under qemu before committing, of course, but > it's possible I broke something on real hardware.) > > https://mail-index.netbsd.org/source-changes/2023/02/25/msg143547.html I will try this as soon as I can. Probably tomorrow evening. I think I may have gotten the mask wrong when modifying fpu_kern_enter() the first time, so there's a good chance this will work. (I switched to the separate function call just to isolate my changes to the AMDGPU driver and figured we'd have this discussion at this time :-) Hopefully I don't reproduce the crash I saw in the crypto code.