On 10/14/20 9:10 AM, Andy Lutomirski wrote:
>> Actually, I think the modified optimization would survive such a scheme:
>>
>>  * copy page array into percpu area
>>  * XRSTORS from percpu area, modified optimization tuple is saved
>>  * run userspace
>>  * XSAVES back to percpu area.  tuple matches, modified optimization
>>    is still in play
>>  * copy percpu area back to page array
>>
>> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
>> XSAVE tracking hardware never knows it isn't working on the "canonical"
>> buffer (the page array).
> I was suggesting something a little bit different.  We'd keep XMM,
> YMM, ZMM, etc state stored exactly the way we do now and, for
> AMX-using tasks, we would save the AMX state in an entirely separate
> buffer.  This way the pain of having a variable xstate layout is
> confined just to AMX tasks.

OK, got it.  So, we'd either need a second set of XSAVE/XRSTORs, or
"manual" copying of the registers out to memory.  We can preserve the
modified optimization if we're careful about ordering, but only for
*ONE* of the XSAVE buffers (if we use two).

> I'm okay with vmalloc() too, but I do think we need to deal with the
> various corner cases like allocation failing.

Yeah, agreed about handling the corner cases.  Also, if we preserve
plain old vmalloc() for now, we need good tracepoints or stats so we can
precisely figure out how many vmalloc()s (and IPIs) are due to AMX.
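
To make the tuple argument concrete, here is a toy userspace model in C.
It is only a sketch under loud assumptions: the "tuple" is reduced to the
buffer address used by the most recent restore, there is a single state
component with one modified flag, and every toy_* and demo_* name is
hypothetical.  It is not kernel code and does not claim to match the
hardware's real tracking rules.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STATE_BYTES 64

/* Toy stand-in for the CPU: live registers plus the tracking state. */
struct toy_cpu {
	unsigned char regs[STATE_BYTES];	/* live register state */
	const void *tracked_buf;		/* buffer of the last restore (assumed tuple) */
	bool modified;				/* regs written since that restore? */
};

/* Models XRSTORS: load registers and remember which buffer was used. */
static void toy_xrstors(struct toy_cpu *cpu, const unsigned char *buf)
{
	memcpy(cpu->regs, buf, STATE_BYTES);
	cpu->tracked_buf = buf;
	cpu->modified = false;
}

/* Models userspace writing a register. */
static void toy_write_reg(struct toy_cpu *cpu, size_t i, unsigned char v)
{
	cpu->regs[i] = v;
	cpu->modified = true;
}

/*
 * Models XSAVES.  Returns true when the write is skipped because the
 * buffer matches the tracked one and nothing was modified.
 */
static bool toy_xsaves(struct toy_cpu *cpu, unsigned char *buf)
{
	if (cpu->tracked_buf == buf && !cpu->modified)
		return true;
	memcpy(buf, cpu->regs, STATE_BYTES);
	return false;
}

/* Bounce through a per-CPU buffer; the "canonical" copy is never tracked. */
static bool demo_bounce(void)
{
	unsigned char canonical[STATE_BYTES] = {0};	/* stands in for the page array */
	unsigned char percpu[STATE_BYTES];
	struct toy_cpu cpu = {0};
	bool skipped;

	memcpy(percpu, canonical, STATE_BYTES);	/* page array -> percpu */
	toy_xrstors(&cpu, percpu);
	/* userspace would run here; it touches no state in this demo */
	skipped = toy_xsaves(&cpu, percpu);
	memcpy(canonical, percpu, STATE_BYTES);	/* percpu -> page array */
	return skipped;
}

/*
 * Two buffers restored back to back: only the last one stays tracked.
 * Returns bit 0 set if the save to buf_a was skipped, bit 1 for buf_b.
 * (Real split buffers would hold different components; the toy ignores that.)
 */
static int demo_two_buffers(void)
{
	unsigned char buf_a[STATE_BYTES] = {0};
	unsigned char buf_b[STATE_BYTES] = {0};
	struct toy_cpu cpu = {0};
	int result = 0;

	toy_xrstors(&cpu, buf_a);
	toy_xrstors(&cpu, buf_b);	/* replaces the tracked buffer */
	if (toy_xsaves(&cpu, buf_a))
		result |= 1;
	if (toy_xsaves(&cpu, buf_b))
		result |= 2;
	return result;
}
```

In this model demo_bounce() reports that the save was skipped, because
both the restore and the save target the percpu buffer.  In
demo_two_buffers() only the buffer restored last still matches, which is
the "only for *ONE* of the XSAVE buffers" point above.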