On Mon, Sep 16, 2019 at 2:18 AM Rasmus Villemoes <li...@rasmusvillemoes.dk> wrote:
>
> Eh, this benchmark doesn't seem to provide any hints on where to set the
> cut-off for a compile-time constant n, i.e. the 32 in

Yes, you'd need to use proper fixed-size memsets with __builtin_memset()
to test that case. Probably easy enough with some preprocessor macros to
expand to a lot of cases.

But even then it will not show some of the advantages of inlining the
memset (quite often you have a "memset structure to zero, then
initialize a couple of fields" pattern, and gcc does much better for
that when it just inlines the memset to stores - to the point of just
removing the memset entirely and just storing a couple of zeroes
between the fields you initialized).

So the "inline constant sizes" case has advantages over and beyond the
obvious ones. I suspect that a reasonable cut-off point is something
like "8*sizeof(long)". But look at things like "struct kstat" uses etc,
the limit might actually be even higher than that.

Also note that while "rep stosb" is _reasonably_ good with current CPUs
(ie roughly gen 8+), it's not so great a few generations ago (gen
6ish), and it can be absolutely horrid on older cores and/or atom.

The limit for when it is a win ends up depending on whether I$
footprint is an issue too, of course, but some of the bigger wins tend
to happen when you have sizes >= 128.

You can basically always beat "rep movs/stos" with hand-tuned AVX2/512
code for specific cases if you don't look at I$ footprint and the cost
of the AVX setup (and the cost of frequency changes, which often go
hand-in-hand with the AVX use).

So "rep movs/stos" is seldom _optimal_, but it tends to be "quite good"
for modern CPUs with variable sizes that are in the 100+ byte range.

                Linus
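
For illustration, the "memset structure to zero, then initialize a couple of fields" pattern combined with a constant-size cut-off could be sketched roughly as below. This is only a sketch under stated assumptions: MEMSET_INLINE_LIMIT, memset_sketch() and struct small_stat are hypothetical names made up for the example, not kernel code, and the limit simply reuses the "8*sizeof(long)" guess from above rather than any measured value. It relies on the gcc/clang builtins __builtin_constant_p() and __builtin_memset().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off, taken from the "8*sizeof(long)" guess above. */
#define MEMSET_INLINE_LIMIT (8 * sizeof(long))

/*
 * Hypothetical wrapper: small compile-time-constant sizes go through
 * __builtin_memset(), which the compiler may expand to plain stores;
 * everything else calls memset(). Whether that call really stays out of
 * line depends on compiler flags (for example -fno-builtin).
 */
#define memset_sketch(dst, c, n)                                  \
	((__builtin_constant_p(n) && (n) <= MEMSET_INLINE_LIMIT)  \
		? __builtin_memset((dst), (c), (n))               \
		: memset((dst), (c), (n)))

/* Hypothetical small structure standing in for something like struct kstat. */
struct small_stat {
	unsigned long ino;
	unsigned long size;
	unsigned int mode;
	unsigned int nlink;
};

/* Zero the structure, then initialize a couple of fields. */
void init_small_stat(struct small_stat *st, unsigned long ino,
		     unsigned int mode)
{
	memset_sketch(st, 0, sizeof(*st));
	st->ino = ino;
	st->mode = mode;
}
```

With the memset expanded to stores, the compiler is free to merge or drop the zero stores to the fields that get overwritten right afterwards, which is the advantage a pure memset benchmark does not show.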