On Tue, Oct 13, 2020 at 11:03 PM Dave Hansen <dave.han...@intel.com> wrote:
>
> On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> > I have no problem with vmalloc(), but I do have a problem with
> > vfree() due to the IPIs that result.  We need a cache or something.
>
> This sounds like the kind of thing we should just build into vmalloc()
> instead of having a bunch of callers implement their own caches.  It
> shouldn't be too much of a challenge to have vmalloc() keep a cacheline
> or two of stats about common vmalloc() sizes and keep some pools
> around.
>
> It's not going to be hard to implement caches to reduce vfree()-induced
> churn, but I'm having a hard time imagining that it'll have anywhere
> near the benefits that it did for stacks.  Tasks fundamentally come and
> go a *lot*, and those paths are hot.
>
> Tasks that go to the trouble to populate 8k or 64k of register state
> fundamentally *can't* come and go frequently.  We also (probably) don't
> have to worry about AMX tasks doing fork()/exec() pairs and putting
> pressure on vmalloc().  Before an app can call out to library functions
> to do the fork, it has to save the state off itself and likely get it
> back to the init state.  The fork() code can tell AMX is in the init
> state and decline to allocate actual space for the child.
>
> > I have to say: this mechanism is awful.  Can we get away with
> > skipping the dynamic XSAVES mess entirely?  What if we instead
> > allocated however much space we need as an array of pages and kept
> > one percpu contiguous region?  To save, we XSAVE(S or C) just the
> > AMX state to the percpu area and then copy it.  To restore, we do
> > the inverse.  Or would this kill the modified optimization and thus
> > be horrible?
>
> Actually, I think the modified optimization would survive such a
> scheme:
>
>  * copy page array into percpu area
>  * XRSTORS from percpu area; the modified-optimization tuple is saved
>  * run userspace
>  * XSAVES back to percpu area; the tuple matches, so the modified
>    optimization is still in play
>  * copy percpu area back to page array
>
> Since both the XRSTORS and the XSAVES are done to the percpu area, the
> XSAVE tracking hardware never knows it isn't working on the "canonical"
> buffer (the page array).
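Concretely, I read that scheme as something like the sketch below.  This
is not real kernel code: struct amx_state, the amx_* helpers, and the
xsaves()/xrstors() wrappers are all made-up names, and AMX_STATE_SIZE /
AMX_FEATURE_MASK are placeholder constants.  The only point is the
ordering that keeps the modified-optimization tuple valid.

#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/types.h>

struct amx_state {
	u8 data[AMX_STATE_SIZE];	/* placeholder size */
} __aligned(64);			/* XSAVES needs 64-byte alignment */

/* The one contiguous percpu staging area proposed above. */
static DEFINE_PER_CPU(struct amx_state, amx_bounce);

static void amx_restore(struct task_struct *tsk)
{
	struct amx_state *bounce = this_cpu_ptr(&amx_bounce);

	/* 1. copy the task's page array into the percpu area */
	amx_copy_from_pages(bounce, tsk->thread.amx_pages);

	/* 2. XRSTORS from the percpu area; the hardware records
	 *    'bounce' in the modified-optimization tuple */
	xrstors(bounce, AMX_FEATURE_MASK);
}

static void amx_save(struct task_struct *tsk)
{
	struct amx_state *bounce = this_cpu_ptr(&amx_bounce);

	/* 3. XSAVES back to the *same* percpu buffer; the tuple
	 *    matches, so the modified optimization still applies */
	xsaves(bounce, AMX_FEATURE_MASK);

	/* 4. only then copy the percpu area back to the page array */
	amx_copy_to_pages(tsk->thread.amx_pages, bounce);
}

(This assumes the copies and the XSAVES/XRSTORS pair run with preemption
disabled, so nothing else can clobber the bounce buffer in between.)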
I was suggesting something a little bit different.  We'd keep XMM, YMM,
ZMM, etc. state stored exactly the way we do now and, for AMX-using
tasks, save the AMX state in an entirely separate buffer.  That way the
pain of having a variable xstate layout is confined to AMX tasks.

I'm okay with vmalloc() too, but I do think we need to deal with the
various corner cases, like allocation failure.
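For the allocation-failure case, I'd expect something along these lines
(again only a sketch: thread.amx_state and AMX_STATE_SIZE are made-up
names, and treating a zeroed buffer as equivalent to the AMX init state
is an assumption):

#include <linux/sched.h>
#include <linux/vmalloc.h>

/*
 * Hypothetical allocation path for the separate AMX buffer.  If the
 * allocation fails, we refuse to enable AMX for the task rather than
 * killing it; the fixed-layout XSAVE area for XMM/YMM/ZMM state is
 * untouched either way.
 */
static int amx_alloc_state(struct task_struct *tsk)
{
	void *buf;

	if (tsk->thread.amx_state)
		return 0;			/* already allocated */

	buf = vzalloc(AMX_STATE_SIZE);		/* zeroed ~= init state */
	if (!buf)
		return -ENOMEM;			/* caller must deny AMX */

	tsk->thread.amx_state = buf;
	return 0;
}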