On Tue Feb 23 02:36:41 PST 2016, kennylevin...@gmail.com wrote:
> Ah, no - it is not a system-wide adjustment, but adjustment of the plan9
> specific runtime.sighandler implementation and everything called by it
> directly. Notes that don't exit the process are queued and should run outside
> the actual note handler.
>
> I think the "magic" code will be isolated, and might fend off accidental
> future additions of floating point registers. The magic-ness also only
> revolves around avoiding duffzero and duffcopy in some way. I also think that
> removing conditionals in the compiler will be a positive thing.
>
> I still do not know the feasibility of my plan, whether it is possible to do
> cleanly, or possible at all. Maybe someone smarter than me with knowledge on
> the matter could chime in and call me an idiot?
>
> Avoiding duffcopy should be easy with a simple memmove implementation. If
> done right, we can also remove the plan9 specific runtime.memmove and only
> use the slow memmove in sighandler (The global runtime.memmove is
> implemented using MOVUPS just like duffcopy. Duffcopy is used for blockcopies
> by the compiler in some cases, although I must admit to not know all the
> cases yet).
>
> Avoiding duffzero without compiler assistance is a bit more tricky - global
> variables, stack on assembly functions, something like that.
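
To make the "simple memmove" idea quoted above concrete, here is a rough sketch of a byte-at-a-time move that picks its copy direction so overlapping ranges stay correct. It is ordinary Go, not runtime code: moveBytes and its offset-based signature are made up for this example, and nothing here guarantees which instructions the compiler emits, so a version meant for sighandler would presumably have to be checked at the assembly level or written in assembly.

```go
package main

import "fmt"

// moveBytes copies n bytes within buf from offset src to offset dst,
// one byte at a time. Hypothetical helper for illustration only; it is
// not part of the Go runtime.
func moveBytes(buf []byte, dst, src, n int) {
	if dst < src {
		// Destination is below the source: copy forward so that bytes
		// are read before they can be overwritten.
		for i := 0; i < n; i++ {
			buf[dst+i] = buf[src+i]
		}
		return
	}
	// Destination is at or above the source: copy backward.
	for i := n - 1; i >= 0; i-- {
		buf[dst+i] = buf[src+i]
	}
}

func main() {
	buf := []byte("abcdefgh")
	moveBytes(buf, 2, 0, 4) // overlapping move of "abcd" to offset 2
	fmt.Println(string(buf)) // prints "ababcdgh"
}
```

The point of the sketch is only the direction choice; whether such a loop is fast enough for a note handler is a separate question.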

fwiw, on modern amd64 machines, using the xmm and ymm registers has a
benefit only in a narrow range of sizes (384-511 bytes) and a subset of
(mis-)alignments that i've forgotten. at least for the exact test setup
i used on 3-4 different µarches. intel claims rep; movs is the
(architecturally) fastest way to go.

i am not sure any of this makes much difference, as it's hard to know
what a real-world memory access pattern looks like, and that seems to
dominate all but gigantic moves, for which rep; movs is actually no
slower than even the trickiest use of ymm registers.

- erik