On Tue, Apr 13, 2021 at 3:47 PM Len Brown <l...@kernel.org> wrote:
>
> On Tue, Apr 13, 2021 at 4:16 PM Andy Lutomirski <l...@kernel.org> wrote:
> >
> > On Mon, Apr 12, 2021 at 4:46 PM Len Brown <l...@kernel.org> wrote:
> > >
> > > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski <l...@kernel.org> wrote:
> > >
> > > > AMX: Multiplying a 4x4 matrix probably looks *great* in a
> > > > microbenchmark. Do it once and you permanently allocate 8kB (is that
> > > > even a constant? can it grow in newer parts?), potentially hurts all
> > > > future context switches, and does who-knows-what to Turbo licenses and
> > > > such.
> > >
> > > Intel expects that AMX will be extremely valuable to key workloads.
> > > It is true that you may never run that kind of workload on the machine
> > > in front of you,
> > > and so you have every right to be doubtful about the value of AMX.
> >
> > I fully believe that AMX will be amazing when used for the right
> > workload. The problem is that a library may have no way to tell
> > whether a workload is the type of computationally intensive workload
> > for which it makes sense. Imagine you have a little function:
> >
> > int matrix_times_vector(int dim, float *out, const float *matrix,
> >                         const float *vector);
> >
> > A clever library might use AMX for this. If dim == 4 and the caller
> > is planning to call it in a long, tight loop, maybe this even makes
> > sense. If dim == 4 and it's being called once, AMX is probably a
> > losing proposition. With previous technologies, at least the impact
> > was limited to the function itself and maybe once per call to the
> > caller. But now, with AMX, the program that invoked this takes a
> > performance and memory hit *forever* if it uses AMX once.
>
> Again...
>
> As this is a "clever" library, built with a clever toolchain, and the
> result is that
> TILERELEASE was properly issued at the end of computation.
> Thus the hardware knows that the (volatile) AMX registers are no longer live.

My argument has *nothing* to do with TILERELEASE. Let me try again.

Suppose I write some user code and call into a library that uses AMX
because the library authors benchmarked it and determined that using
AMX is faster when called in a loop. But I don't call it in a loop.
Then I take the transition penalty into and out of AMX code (I'll
believe there is no penalty when I see it -- we've had a penalty with
VEX and with AVX-512) and my program runs *slower*. And, to top it
off, I've just permanently allocated 8kB of extra FPU state buffer,
*and* I'm taking either an XCR0 or an XFD write penalty on every
future context switch.

Someone or something needs to make a decision as to whether AMX should
actually be used for a given algorithm. The user library community
has swept this under the rug by declaring that libraries should use
the best-in-a-tight-loop code for the entire existence of extensions
beyond XMM, and the cost keeps getting higher.

> > Beyond that, we have the signal handling issue.
>
> I'm unaware of any unresolved feedback on the signal handling series
> other than a wistful "wouldn't a new SIGFAIL be more clear (for future apps)
> than the existing SIGSEGV?" I agree with this sentiment, but I don't
> think we should hold up a patch to prevent corrupting user data
> because a new signal number to describe the scenario doesn't exist.
> Particularly since the new code that knows about the new SIGFAIL
> will also be new code that has been compiled with the new glibc
> that for most cases will prevent this scenario in the first place...
>
> > One solution, going
> > off of what Willy mentioned, is:
> >
> > bool amx_begin(void *signal_save_buffer);
> > void amx_end();
> >
> > In the amx_begin() region, if you get a signal, the AMX state is saved
> > in the buffer. Outside the region, if you get a signal and AMX is in
> > use, the kernel will either unceremoniously kill the task or will
> > deliver SIGYOUBLEWIT.
> > [0]
>
> I think it is clear that if a new signal ABI is going to be invented,
> that it should be opt-in on state, so that it can run fast on machines
> far into the future by not choosing to opt-in on anything.
>
> It isn't clear that changing the signal save state around critical regions
> (in multiple threads) so that a single (per process definition) of a signal
> handler gets a different result at different times is going to make that
> (new) signal handler author especially happy. More likely they
> either always want the state, or they do not.

Perhaps some form of decision should be reached before AMX lands?
Landing AMX in its current form is a decision, and we should make a
credible effort to decide if it's the right one.

--Andy
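
For concreteness, here is a rough sketch of how a library could pair the
amx_begin()/amx_end() bracketing idea with an explicit decision by the
caller. Everything in it is an assumption for illustration only:
amx_begin() and amx_end() exist only as a proposal in this thread and
are stubbed out, the `hot` hint and the name matrix_times_vector_hinted()
are made up, the 8kB buffer is just the figure quoted earlier (which may
not be a constant), and the "tile" path reuses the scalar loop instead of
real tile instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL: neither function exists in any kernel or libc.
 * These stubs only model the contract proposed in this thread. */
static bool amx_region_active;            /* stand-in for per-thread state */
static unsigned char amx_save_area[8192]; /* 8kB as quoted; size is an assumption */

static bool amx_begin(void *signal_save_buffer)
{
    if (signal_save_buffer == NULL)
        return false;           /* no buffer: caller must not touch AMX */
    amx_region_active = true;   /* a signal here would save tile state to the buffer */
    return true;
}

static void amx_end(void)
{
    amx_region_active = false;  /* AMX use past this point would be fatal */
}

/* Portable path; also stands in for the tile body in this sketch. */
static void mtv_scalar(int dim, float *out, const float *matrix,
                       const float *vector)
{
    for (int r = 0; r < dim; r++) {
        float acc = 0.0f;
        for (int c = 0; c < dim; c++)
            acc += matrix[r * dim + c] * vector[c];
        out[r] = acc;
    }
}

/* `hot` is a hypothetical hint: the caller, not the library, says whether
 * this call sits in a long, tight loop. Returns 1 if the bracketed path
 * was taken, 0 if the plain scalar path was used. */
static int matrix_times_vector_hinted(int dim, float *out, const float *matrix,
                                      const float *vector, bool hot)
{
    assert(dim > 0);
    if (hot && amx_begin(amx_save_area)) {
        assert(amx_region_active);
        mtv_scalar(dim, out, matrix, vector); /* real code would use tiles here */
        amx_end();
        return 1;
    }
    mtv_scalar(dim, out, matrix, vector);
    return 0;
}
```

With something of this shape, a one-off caller passes hot = false and
never touches AMX state, while a caller in a long, tight loop passes
hot = true and pays the entry cost knowingly.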