On Thu, Mar 21, 2024 at 2:55 AM Nathan Bossart <nathandboss...@gmail.com> wrote:
>
> On Wed, Mar 20, 2024 at 09:31:16AM -0500, Nathan Bossart wrote:
> > I don't mind removing the 2-register stuff if that's what you think we
> > should do.  I'm cautiously optimistic that it'd help more than the extra
> > branch prediction might hurt, and it'd at least help avoid regressing the
> > lower end for the larger AVX2 registers, but I probably won't be able to
> > prove that without constructing another benchmark.  And TBH I'm not sure
> > it'll significantly impact any real-world workload, anyway.
>
> Here's a new version of the patch set with the 2-register stuff removed,

I'm much happier about v5-0001. With a small tweak it would match what I
had in mind:

+	if (nelem < nelem_per_iteration)
+		goto one_by_one;

If this were "<=", then for long arrays we could assume there is always
more than one block, and we wouldn't need to check whether any elements
remain -- first block, then a single loop, and it's done.

The loop could also then be a "do while", since it doesn't have to check
the exit condition up front.

> plus a fresh run of the benchmark.  The weird spike for AVX2 is what led me
> down the 2-register path earlier.

Yes, that spike is weird, because it seems super-linear. However, the more
interesting question for me is: AVX2 isn't really buying much for the
numbers covered in this test. Between 32 and 48 elements, and between 64
and 80, it's indistinguishable from SSE2. The jumps to the next shelf are
postponed, but the jumps are just as high. From earlier system benchmarks,
I recall it eventually wins out with hundreds of elements, right? Is that
still true?

Further, now that the algorithm is more SIMD-appropriate, I wonder what
doing 4 registers at a time is actually buying us for either SSE2 or AVX2.
It might just be a matter of scale, but that would be good to understand.
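
To make the "<=" plus do-while suggestion concrete, here is a rough sketch
of the control flow I have in mind, with plain scalar loops standing in for
the SIMD block compares (the function name lfind32_sketch and the final
overlapping-block alignment are mine, not from the patch; nelem_per_iteration
is the patch's name for the elements handled per vector iteration):

```c
#include <stdbool.h>
#include <stdint.h>

static bool
lfind32_sketch(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	const uint32_t nelem_per_iteration = 16;	/* e.g. 4 registers x 4 lanes */
	uint32_t	i = 0;

	/* "<=" rather than "<": past this test there are at least two blocks */
	if (nelem <= nelem_per_iteration)
		goto one_by_one;

	/* Process the first block unconditionally (SIMD compare in reality). */
	for (uint32_t j = 0; j < nelem_per_iteration; j++)
		if (base[j] == key)
			return true;

	/*
	 * Align subsequent blocks so the last one ends exactly at nelem; the
	 * first of them may overlap the block above, which is harmless for a
	 * search.  Since nelem > nelem_per_iteration, at least one more block
	 * exists, so the do-while needs no up-front exit check and no scalar
	 * tail afterwards.
	 */
	i = nelem % nelem_per_iteration;
	if (i == 0)
		i = nelem_per_iteration;
	do
	{
		for (uint32_t j = i; j < i + nelem_per_iteration; j++)
			if (base[j] == key)
				return true;
		i += nelem_per_iteration;
	} while (i < nelem);

	return false;

one_by_one:
	for (; i < nelem; i++)
		if (base[i] == key)
			return true;
	return false;
}
```

Whether the saved exit check matters next to the cost of the compares is of
course exactly the kind of thing the benchmark would have to show.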
> > I don't mind removing the 2-register stuff if that's what you think we > > should do. I'm cautiously optimistic that it'd help more than the extra > > branch prediction might hurt, and it'd at least help avoid regressing the > > lower end for the larger AVX2 registers, but I probably won't be able to > > prove that without constructing another benchmark. And TBH I'm not sure > > it'll significantly impact any real-world workload, anyway. > > Here's a new version of the patch set with the 2-register stuff removed, I'm much happier about v5-0001. With a small tweak it would match what I had in mind: + if (nelem < nelem_per_iteration) + goto one_by_one; If this were "<=" then the for long arrays we could assume there is always more than one block, and wouldn't need to check if any elements remain -- first block, then a single loop and it's done. The loop could also then be a "do while" since it doesn't have to check the exit condition up front. > plus a fresh run of the benchmark. The weird spike for AVX2 is what led me > down the 2-register path earlier. Yes, that spike is weird, because it seems super-linear. However, the more interesting question for me is: AVX2 isn't really buying much for the numbers covered in this test. Between 32 and 48 elements, and between 64 and 80, it's indistinguishable from SSE2. The jumps to the next shelf are postponed, but the jumps are just as high. From earlier system benchmarks, I recall it eventually wins out with hundreds of elements, right? Is that still true? Further, now that the algorithm is more SIMD-appropriate, I wonder what doing 4 registers at a time is actually buying us for either SSE2 or AVX2. It might just be a matter of scale, but that would be good to understand.