> I have an obnoxious question: do we really want to use the XFD mechanism?
Obnoxious questions are often the most valuable! :-)

> Right now, glibc, and hence most user space code, blindly uses
> whatever random CPU features are present for no particularly good
> reason, which means that all these features get stuck in the XINUSE=1
> state, even if there is no code whatsoever in the process that
> benefits.  AVX512 is bad enough as we're seeing right now.  AMX will
> be much worse if this happens.
>
> We *could* instead use XCR0 and require an actual syscall to enable
> it.  We could even then play games like requiring whomever enables the
> feature to allocate memory for the state save area for signals, and
> signal delivery could save the state and disable the feature, thus
> preventing the signal frame from blowing up to 8 or 12 or who knows
> how many kB.

This approach would have some challenges.

Enumeration today has two parts:

1. CPUID tells you if the feature exists in the HW.
2. xgetbv/XCR0 tells you if the OS supports that feature.

Since #2 would be missing, you are right: there would need to be a new
API enabling the user to request that the OS enable support for that
task.  If that new API is not invoked before the user touches the
feature, they die with a #UD.  And so there would need to be some
assurance that the API is successfully called before any library might
use the feature.  Is there a practical way to guarantee that, given
that the feature may be used (or not) only by a dynamically linked
library?

If a library spawns threads and queries the XSAVE area size before the
API is called, it may be confused when that size changes after the API
is called.  So the seemingly simple question "who calls the API, and
when?" isn't so simple.

Finally, note that XCR0 updates cause a VMEXIT, while XFD updates do
not.  So context switching XCR0 is possible, but problematic.

The other combination is XFD + API, rather than XCR0 + API.  With XFD,
the context switching is faster, and the faulting (#NM, plus the new
MSR giving the #NM cause) is viable.  The bit is still set in XCR0, so
there is no state-size advantage.  And we would still have the API
logistics issues above.

So we didn't see the API adding any value, only pain, over transparent
first-use enabling with XFD and no API.

cheers,
Len Brown, Intel Open Source Technology Center

ps. I agree that unnecessary XINUSE=1 is possible.  Notwithstanding
the issues initially deploying AVX512, I am skeptical that it is
common today.

IMO, the problem with AVX512 state is that we guaranteed it will be
zero for XINUSE=0.  That means we have to write 0's on saves.  It
would be better to be able to skip the write -- even if we can't save
the space, we can save the data transfer.  (This is what we did for
AMX.)

pps. Your idea of requiring the user to allocate their own signal
stack is interesting.  It isn't really about allocating the stack,
though -- the stack of the task that uses the feature is generally
fine already.  The opportunity is to allow tasks that do *not* use the
new feature to get away with minimal data transfer and stack size.
Since we don't have the 0's guarantee for AMX, we have bought back the
important part of that.
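
Appendix: for reference, a minimal userspace sketch of the two-part
enumeration above, for AMX.  The bit positions are the documented
CPUID/XCR0 ones; the helper names (amx_usable, read_xcr0,
xsave_size_enabled) are illustrative, it assumes a GCC/Clang toolchain
for <cpuid.h>, and it deliberately stops short of any hypothetical
enable-API call, which appears only as a comment.

        #include <cpuid.h>
        #include <stdbool.h>
        #include <stdint.h>

        /* XCR0 bits for the two AMX state components */
        #define XFEATURE_XTILECFG_BIT   17
        #define XFEATURE_XTILEDATA_BIT  18

        static uint64_t read_xcr0(void)
        {
                uint32_t eax, edx;

                /* xgetbv(0) reads XCR0; only legal once OSXSAVE is set */
                asm volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
                return ((uint64_t)edx << 32) | eax;
        }

        /* Size in bytes of the XSAVE area for the features currently
         * enabled in XCR0 (CPUID.(0xD,0):EBX).  This is the number
         * that would change under a library's feet if an enable-API
         * flipped XCR0 at runtime. */
        static uint32_t xsave_size_enabled(void)
        {
                uint32_t eax, ebx, ecx, edx;

                if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
                        return 0;
                return ebx;
        }

        static bool amx_usable(void)
        {
                uint32_t eax, ebx, ecx, edx;
                uint64_t mask;

                /* Part 1: CPUID says the feature exists in the HW
                 * (CPUID.(7,0):EDX bit 24 = AMX-TILE) */
                if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                        return false;
                if (!(edx & (1u << 24)))
                        return false;

                /* xgetbv requires OSXSAVE (CPUID.1:ECX bit 27) */
                if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) ||
                    !(ecx & (1u << 27)))
                        return false;

                /* Part 2: XCR0 says the OS context-switches the state.
                 * Under the XCR0+API proposal, these bits would read
                 * as clear here until the (hypothetical) new syscall
                 * had been made for this task. */
                mask = (1ull << XFEATURE_XTILECFG_BIT) |
                       (1ull << XFEATURE_XTILEDATA_BIT);
                return (read_xcr0() & mask) == mask;
        }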